Commit d52edfa7 authored 11 years ago by Ameet Talwalkar: updated content
Parent: a5478667
1 changed file: docs/mllib-guide.md (+147, −1)
---
layout: global
title: Machine Learning Library (MLlib)
---
MLlib is a Spark implementation of some common ML functionality, as well as
associated unit tests and data generators. MLlib currently supports four
common types of machine learning problem settings, namely binary
classification, regression, clustering and collaborative filtering, as well as
an underlying gradient descent optimization primitive. This guide outlines the
functionality supported in MLlib and provides an example of invoking it.
# Binary Classification

Binary classification is a supervised learning problem in which we want to
classify entities into one of two distinct categories or labels, e.g.,
predicting whether or not emails are spam. This problem involves executing a
learning *Algorithm* on a set of *labeled* examples, i.e., a set of entities
represented via (numerical) features along with underlying category labels.
The algorithm returns a trained *Model* that can predict the label for new
entities for which the underlying label is unknown.
MLlib currently supports two standard model families for binary
classification, namely [Linear Support Vector Machines
(SVMs)](http://en.wikipedia.org/wiki/Support_vector_machine) and [Logistic
Regression](http://en.wikipedia.org/wiki/Logistic_regression), along with [L1
and L2 regularized](http://en.wikipedia.org/wiki/Regularization_(mathematics))
variants of each model family. The training algorithms all leverage an
underlying gradient descent primitive (described
[below](#gradient-descent-primitive)), and take as input a regularization
parameter (*regParam*) along with various parameters associated with gradient
descent (*stepSize*, *numIterations*, *miniBatchFraction*).
The following code snippet illustrates how to load a sample dataset, execute a
training algorithm on this training data, and make predictions with the
resulting model to compute the training error.
```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Load and parse the data file
val data = sc.textFile("sample_wiki_ngrams.txt")
val parsedData = data.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, parts.tail.map(_.toDouble).toArray)
}

// Run training algorithm
val svmAlg = new SVMWithSGD()
svmAlg.optimizer.setNumIterations(200)
  .setStepSize(1.0)
  .setRegParam(0.1)
  .setMiniBatchFraction(1.0)
val model = svmAlg.run(parsedData)

// Evaluate model on training examples and compute training error
val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
println("trainError = " + trainErr)
```
The `SVMWithSGD` algorithm performs L2 regularization by default. If we want
an L1 regularized variant of SVMs instead, we can do the following:

```scala
import org.apache.spark.mllib.optimization.L1Updater

svmAlg.optimizer.setUpdater(new L1Updater)
val modelL1 = svmAlg.run(parsedData)
```
# Linear Regression

Linear regression is another classical supervised learning setting. In this
problem, each entity is associated with a real-valued label (as opposed to a
binary label as in binary classification), and we want to predict labels as
closely as possible given numerical features representing entities. MLlib
supports linear regression as well as L1
([lasso](http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method)) and L2
([ridge](http://en.wikipedia.org/wiki/Ridge_regression)) regularized variants.
The regression algorithms in MLlib also leverage the underlying gradient
descent primitive (described [below](#gradient-descent-primitive)), and have
the same parameters as the binary classification algorithms described above.
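
As a rough sketch (not from the original guide), and assuming a
`LassoWithSGD` class that mirrors the builder-style optimizer API of
`SVMWithSGD` above, an L1 regularized regression could be trained by reusing
`parsedData` from the earlier example, now treating the first field of each
line as a real-valued label:

```scala
import org.apache.spark.mllib.regression.LassoWithSGD

// Assumes parsedData: RDD[LabeledPoint] from the example above,
// with real-valued (rather than 0/1) labels.
val lassoAlg = new LassoWithSGD()
lassoAlg.optimizer.setNumIterations(200)
  .setStepSize(1.0)
  .setRegParam(0.1)
  .setMiniBatchFraction(1.0)
val regressionModel = lassoAlg.run(parsedData)

// Mean squared error on the training set
val valuesAndPreds = parsedData.map { point =>
  (point.label, regressionModel.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => (v - p) * (v - p) }.reduce(_ + _) / valuesAndPreds.count
```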
# Clustering

Clustering is an unsupervised learning problem whereby we aim to group subsets
of entities with one another based on some notion of similarity. Clustering is
often used for exploratory analysis and/or as a component of a hierarchical
supervised learning pipeline (in which distinct classifiers or regression
models are trained for each cluster). MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering,
arguably the most commonly used clustering approach, which groups the data
points into *k* clusters. The implementation in MLlib has the following
parameters (a usage sketch follows the list):
* *k* is the number of clusters.
* *maxIterations* is the maximum number of iterations to run.
* *initializationMode* specifies either random initialization or
initialization via a parallelized variant of the
[k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method.
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
* *initializationSteps* determines the number of steps in the k-means++ algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
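
A minimal sketch of invoking this implementation, assuming a `KMeans.train`
helper that accepts an RDD of `Array[Double]` feature vectors and a
hypothetical space-separated data file `kmeans_data.txt`:

```scala
import org.apache.spark.mllib.clustering.KMeans

// Load space-separated numeric feature vectors (hypothetical file name)
val points = sc.textFile("kmeans_data.txt").map(_.split(' ').map(_.toDouble))

// Cluster the data into two clusters, capping the iteration count
val numClusters = 2
val maxIterations = 20
val clusters = KMeans.train(points, numClusters, maxIterations)

// Within-set sum of squared errors: a common measure of clustering quality
val wssse = clusters.computeCost(points)
println("Within Set Sum of Squared Errors = " + wssse)
```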
# Collaborative Filtering

[Collaborative
filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
is commonly used for recommender systems. These techniques aim to fill in the
missing entries of a user-product association matrix. MLlib currently supports
model-based collaborative filtering, in which users and products are described
by a small set of latent factors that can be used to predict missing entries.
In particular, we implement the [alternating least squares
(ALS)](http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf)
algorithm to learn these latent factors. The implementation in MLlib has the
following parameters (see the sketch after the list):
* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* *rank* is the number of latent factors in our model.
* *iterations* is the number of iterations to run.
* *lambda* specifies the regularization parameter in ALS.
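
A minimal sketch of training and applying an ALS model, assuming an
`ALS.train(ratings, rank, iterations, lambda)` helper and a
`Rating(user, product, rating)` case class, with a hypothetical
comma-separated ratings file:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse "user,product,rating" triples (hypothetical file name)
val ratings = sc.textFile("als_ratings.csv").map { line =>
  val Array(user, product, rate) = line.split(',')
  Rating(user.toInt, product.toInt, rate.toDouble)
}

// Learn rank-10 latent factors for users and products
val model = ALS.train(ratings, 10, 20, 0.01)

// Predict one missing entry of the user-product association matrix
val predicted = model.predict(1, 42)
```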
# Gradient Descent Primitive

[Gradient descent](http://en.wikipedia.org/wiki/Gradient_descent) (along with
stochastic variants thereof) is a first-order optimization method that is
well-suited for large-scale and distributed computation. Gradient descent
methods aim to find a local minimum of a function by iteratively taking steps
in the direction of the negative gradient of the function at the current
point, i.e., the current parameter value. Gradient descent is included as a
low-level primitive in MLlib, upon which various ML algorithms are developed,
and has the following parameters:
* *gradient* is a class that computes the stochastic gradient of the function
being optimized, i.e., with respect to a single training example, at the
current parameter value. MLlib includes gradient classes for common loss
functions, e.g., hinge, logistic, least-squares. The gradient class takes as
input a training example, its label, and the current parameter value.
* *updater* is a class that updates weights in each iteration of gradient
descent. MLlib includes updaters for cases without regularization, as well as
L1 and L2 regularizers.
* *stepSize* is a scalar value denoting the initial step size for gradient
descent. All updaters in MLlib use a step size at the t-th step equal to
`stepSize / sqrt(t)` (illustrated in the sketch after this list).
* *numIterations* is the number of iterations to run.
* *regParam* is the regularization parameter when using L1 or L2 regularization.
* *miniBatchFraction* is the fraction of the data used to compute the gradient
at each iteration.
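
To make the decay schedule concrete, the effective step size shrinks as the
iteration count grows. A small illustration of `stepSize / sqrt(t)` in plain
Scala (no MLlib API assumed):

```scala
// Effective step size at iterations t = 1..5 for an initial stepSize of 1.0
val stepSize = 1.0
val schedule = (1 to 5).map(t => stepSize / math.sqrt(t))
// Vector(1.0, 0.707..., 0.577..., 0.5, 0.447...)
```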