Commit 43dfc84f authored by Xiangrui Meng

[SPARK-2830][MLLIB] doc update for 1.1

1. renamed mllib-basics to mllib-data-types
1. renamed mllib-stats to mllib-statistics
1. moved random data generation to the bottom of mllib-stats
1. updated toc accordingly

atalwalkar

Author: Xiangrui Meng <meng@databricks.com>

Closes #2151 from mengxr/mllib-doc-1.1 and squashes the following commits:

0bd79f3 [Xiangrui Meng] add mllib-data-types
b64a5d7 [Xiangrui Meng] update the content list of basis statistics in mllib-guide
f625cc2 [Xiangrui Meng] move mllib-basics to mllib-data-types
4d69250 [Xiangrui Meng] move random data generation to the bottom of statistics
e64f3ce [Xiangrui Meng] move mllib-stats.md to mllib-statistics.md
parent e1139dd6
---
layout: global
title: Basics - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Basics
title: Data Types - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Data Types
---
* Table of contents
......
......@@ -11,7 +11,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
of reducing the number of variables under consideration.
It can be used to extract latent features from raw and noisy features
or compress data while maintaining the structure.
MLlib provides support for dimensionality reduction on the <a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
MLlib provides support for dimensionality reduction on the <a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
## Singular value decomposition (SVD)
......@@ -58,7 +58,7 @@ passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.
### SVD Example
MLlib provides SVD functionality to row-oriented matrices, provided in the
<a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
<a href="mllib-data-types.html#rowmatrix">RowMatrix</a> class.
<div class="codetabs">
<div data-lang="scala" markdown="1">
......
......@@ -7,12 +7,13 @@ MLlib is Spark's scalable machine learning library consisting of common learning
including classification, regression, clustering, collaborative
filtering, dimensionality reduction, as well as underlying optimization primitives, as outlined below:
* [Data types](mllib-basics.html)
* [Basic statistics](mllib-stats.html)
* random data generation
* stratified sampling
* [Data types](mllib-data-types.html)
* [Basic statistics](mllib-statistics.html)
* summary statistics
* correlations
* stratified sampling
* hypothesis testing
* random data generation
* [Classification and regression](mllib-classification-regression.html)
* [linear models (SVMs, logistic regression, linear regression)](mllib-linear-methods.html)
* [decision trees](mllib-decision-tree.html)
......
---
layout: global
title: Statistics Functionality - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
title: Basic Statistics - MLlib
displayTitle: <a href="mllib-guide.html">MLlib</a> - Basic Statistics
---
* Table of contents
......@@ -25,7 +25,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Statistics Functionality
\newcommand{\zero}{\mathbf{0}}
\]`
## Summary Statistics
## Summary statistics
We provide column summary statistics for `RDD[Vector]` through the function `colStats`
available in `Statistics`.
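To make "column summary statistics" concrete, here is a minimal local sketch in plain Python. It does not use Spark: `column_summary` is a hypothetical helper, rows are plain lists standing in for an `RDD[Vector]`, and the variance is assumed to be the unbiased sample variance (dividing by `n - 1`).

```python
import statistics


def column_summary(rows):
    """Column-wise mean, sample variance, and nonzero count.

    Hypothetical stand-in for a column summary; not the MLlib API.
    Requires at least two rows because sample variance divides by n - 1.
    """
    # Transpose the rows so each tuple holds one column's values.
    columns = list(zip(*rows))
    return {
        "mean": [statistics.mean(col) for col in columns],
        "variance": [statistics.variance(col) for col in columns],
        "numNonzeros": [sum(1 for x in col if x != 0) for col in columns],
    }


rows = [
    [1.0, 10.0, 0.0],
    [2.0, 20.0, 0.0],
    [3.0, 30.0, 5.0],
]
summary = column_summary(rows)
print(summary["mean"])
print(summary["variance"])
print(summary["numNonzeros"])
```

Each statistic is computed per column, so the result holds one entry per feature rather than one per row.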
......@@ -104,81 +104,7 @@ print summary.numNonzeros()
</div>
## Random data generation
Random data generation is useful for randomized algorithms, prototyping, and performance testing.
MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
uniform, standard normal, or Poisson.
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._
val sc: SparkContext = ...
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
[`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
{% highlight java %}
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.DoubleFunction;
import static org.apache.spark.mllib.random.RandomRDDs.*;
JavaSparkContext jsc = ...
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
// Apply a transform to get a random double RDD following `N(1, 4)`.
// `mapToDouble` keeps the result a `JavaDoubleRDD`; `map` would return a `JavaRDD<Double>`.
JavaDoubleRDD v = u.mapToDouble(
  new DoubleFunction<Double>() {
    public double call(Double x) {
      return 1.0 + 2.0 * x;
    }
  });
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
{% highlight python %}
from pyspark.mllib.random import RandomRDDs
sc = ... # SparkContext
# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.normalRDD(sc, 1000000L, 10)
# Apply a transform to get a random double RDD following `N(1, 4)`.
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>
</div>
## Correlations calculation
## Correlations
Calculating the correlation between two series of data is a common operation in statistics. In MLlib
we provide the flexibility to calculate pairwise correlations among many series. The supported
......@@ -455,3 +381,77 @@ for (ChiSqTestResult result : featureTestResults) {
</div>
</div>
## Random data generation
Random data generation is useful for randomized algorithms, prototyping, and performance testing.
MLlib supports generating random RDDs with i.i.d. values drawn from a given distribution:
uniform, standard normal, or Poisson.
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
{% highlight scala %}
import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._
val sc: SparkContext = ...
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
val u = normalRDD(sc, 1000000L, 10)
// Apply a transform to get a random double RDD following `N(1, 4)`.
val v = u.map(x => 1.0 + 2.0 * x)
{% endhighlight %}
</div>
<div data-lang="java" markdown="1">
[`RandomRDDs`](api/java/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
{% highlight java %}
import org.apache.spark.api.java.JavaDoubleRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.DoubleFunction;
import static org.apache.spark.mllib.random.RandomRDDs.*;
JavaSparkContext jsc = ...
// Generate a random double RDD that contains 1 million i.i.d. values drawn from the
// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
JavaDoubleRDD u = normalJavaRDD(jsc, 1000000L, 10);
// Apply a transform to get a random double RDD following `N(1, 4)`.
// `mapToDouble` keeps the result a `JavaDoubleRDD`; `map` would return a `JavaRDD<Double>`.
JavaDoubleRDD v = u.mapToDouble(
  new DoubleFunction<Double>() {
    public double call(Double x) {
      return 1.0 + 2.0 * x;
    }
  });
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
[`RandomRDDs`](api/python/pyspark.mllib.random.RandomRDDs-class.html) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
{% highlight python %}
from pyspark.mllib.random import RandomRDDs
sc = ... # SparkContext
# Generate a random double RDD that contains 1 million i.i.d. values drawn from the
# standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.
u = RandomRDDs.normalRDD(sc, 1000000L, 10)
# Apply a transform to get a random double RDD following `N(1, 4)`.
v = u.map(lambda x: 1.0 + 2.0 * x)
{% endhighlight %}
</div>
</div>
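The examples above map `N(0, 1)` to `N(1, 4)` with `x => 1.0 + 2.0 * x`: if `X` is standard normal, then `1 + 2X` has mean 1 and variance `2^2 = 4`. The following is a minimal local sketch of that check in plain Python. It does not use Spark, and `standard_normal_sample` and `shift_and_scale` are hypothetical helpers introduced only for illustration.

```python
import random


def standard_normal_sample(size, seed=42):
    """Draw `size` i.i.d. values from N(0, 1) using a seeded local generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(size)]


def shift_and_scale(values, mean, std_dev):
    """Map N(0, 1) draws to N(mean, std_dev ** 2) via x -> mean + std_dev * x."""
    return [mean + std_dev * x for x in values]


def sample_mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    var = sum((x - m) ** 2 for x in values) / (n - 1)
    return m, var


u = standard_normal_sample(100_000)
v = shift_and_scale(u, mean=1.0, std_dev=2.0)
v_mean, v_var = sample_mean_and_variance(v)
print(v_mean, v_var)  # expected to be close to 1 and 4, up to sampling noise
```

Note that the second parameter of `N(1, 4)` is the variance, so the scale factor applied to each value is the standard deviation 2, not 4.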