Commit d52edfa7 authored 11 years ago by Ameet Talwalkar: updated content
Parent: a5478667
1 changed file: docs/mllib-guide.md (+147, −1)
---
layout: global
title: Machine Learning Library (MLlib)
---
MLlib is a Spark implementation of some common ML functionality, as well as
associated unit tests and data generators. MLlib currently supports four
common types of machine learning problem settings, namely binary
classification, regression, clustering and collaborative filtering, as well as
an underlying gradient descent optimization primitive. This guide outlines the
functionality supported in MLlib and provides an example of invoking it.
# Binary Classification

Binary classification is a supervised learning problem in which we want to
classify entities into one of two distinct categories or labels, e.g.,
predicting whether or not emails are spam. This problem involves executing a
learning *Algorithm* on a set of *labeled* examples, i.e., a set of entities
represented via (numerical) features along with underlying category labels.
The algorithm returns a trained *Model* that can predict the label for new
entities for which the underlying label is unknown.
MLlib currently supports two standard model families for binary
classification, namely [Linear Support Vector Machines
(SVMs)](http://en.wikipedia.org/wiki/Support_vector_machine) and [Logistic
Regression](http://en.wikipedia.org/wiki/Logistic_regression), along with [L1
and L2 regularized](http://en.wikipedia.org/wiki/Regularization_(mathematics))
variants of each model family. The training algorithms all leverage an
underlying gradient descent primitive (described
[below](#gradient-descent-primitive)), and take as input a regularization
parameter (*regParam*) along with various parameters associated with gradient
descent (*stepSize*, *numIterations*, *miniBatchFraction*).
The following code snippet illustrates how to load a sample dataset, execute a
training algorithm on this training data, and make predictions with the
resulting model to compute the training error.
```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint

// Load and parse the data file
val data = sc.textFile("sample_wiki_ngrams.txt")
val parsedData = data.map { line =>
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, parts.tail.map(_.toDouble).toArray)
}

// Run training algorithm
val svmAlg = new SVMWithSGD()
svmAlg.optimizer.setNumIterations(200)
  .setStepSize(1.0)
  .setRegParam(0.1)
  .setMiniBatchFraction(1.0)
val model = svmAlg.run(parsedData)

// Evaluate model on training examples and compute training error
val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedData.count
println("trainError = " + trainErr)
```
The `SVMWithSGD` algorithm performs L2 regularization by default. If we want
an L1 regularized variant of SVMs instead, we can do the following:

```scala
import org.apache.spark.mllib.optimization.L1Updater

svmAlg.optimizer.setUpdater(new L1Updater)
val modelL1 = svmAlg.run(parsedData)
```
# Linear Regression

Linear regression is another classical supervised learning setting. In this
problem, each entity is associated with a real-valued label (as opposed to a
binary label as in binary classification), and we want to predict labels as
closely as possible given numerical features representing entities. MLlib
supports linear regression as well as L1
([lasso](http://en.wikipedia.org/wiki/Lasso_(statistics)#Lasso_method)) and L2
([ridge](http://en.wikipedia.org/wiki/Ridge_regression)) regularized variants.
The regression algorithms in MLlib also leverage the underlying gradient
descent primitive (described [below](#gradient-descent-primitive)), and have
the same parameters as the binary classification algorithms described above.
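
As a rough sketch (not from the original guide), and assuming a
`LassoWithSGD` class that mirrors the builder-style optimizer API of
`SVMWithSGD` above, an L1 regularized regression could be trained by reusing
`parsedData` from the earlier example, now treating the first field of each
line as a real-valued label:

```scala
import org.apache.spark.mllib.regression.LassoWithSGD

// Assumes parsedData: RDD[LabeledPoint] from the example above,
// with real-valued (rather than 0/1) labels.
val lassoAlg = new LassoWithSGD()
lassoAlg.optimizer.setNumIterations(200)
  .setStepSize(1.0)
  .setRegParam(0.1)
  .setMiniBatchFraction(1.0)
val regressionModel = lassoAlg.run(parsedData)

// Mean squared error on the training set
val valuesAndPreds = parsedData.map { point =>
  (point.label, regressionModel.predict(point.features))
}
val MSE = valuesAndPreds.map { case (v, p) => (v - p) * (v - p) }.reduce(_ + _) / valuesAndPreds.count
```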
# Clustering

Clustering is an unsupervised learning problem whereby we aim to group subsets
of entities with one another based on some notion of similarity. Clustering is
often used for exploratory analysis and/or as a component of a hierarchical
supervised learning pipeline (in which distinct classifiers or regression
models are trained for each cluster). MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering,
arguably the most commonly used clustering approach, which groups the data
points into *k* clusters. The implementation in MLlib has the following
parameters (a usage sketch follows the list):
* *k* is the number of clusters.
* *maxIterations* is the maximum number of iterations to run.
* *initializationMode* specifies either random initialization or
initialization via a parallelized variant of the
[k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method.
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
* *initializationSteps* determines the number of steps in the k-means++ algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
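
A minimal sketch of invoking this implementation, assuming a `KMeans.train`
helper that accepts an RDD of `Array[Double]` feature vectors and a
hypothetical space-separated data file `kmeans_data.txt`:

```scala
import org.apache.spark.mllib.clustering.KMeans

// Load space-separated numeric feature vectors (hypothetical file name)
val points = sc.textFile("kmeans_data.txt").map(_.split(' ').map(_.toDouble))

// Cluster the data into two clusters, capping the iteration count
val numClusters = 2
val maxIterations = 20
val clusters = KMeans.train(points, numClusters, maxIterations)

// Within-set sum of squared errors: a common measure of clustering quality
val wssse = clusters.computeCost(points)
println("Within Set Sum of Squared Errors = " + wssse)
```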
# Collaborative Filtering

[Collaborative
filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
is commonly used for recommender systems. These techniques aim to fill in the
missing entries of a user-product association matrix. MLlib currently supports
model-based collaborative filtering, in which users and products are described
by a small set of latent factors that can be used to predict missing entries.
In particular, we implement the [alternating least squares
(ALS)](http://www2.research.att.com/~volinsky/papers/ieeecomputer.pdf)
algorithm to learn these latent factors. The implementation in MLlib has the
following parameters (see the sketch after the list):
* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* *rank* is the number of latent factors in our model.
* *iterations* is the number of iterations to run.
* *lambda* specifies the regularization parameter in ALS.
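
A minimal sketch of training and applying an ALS model, assuming an
`ALS.train(ratings, rank, iterations, lambda)` helper and a
`Rating(user, product, rating)` case class, with a hypothetical
comma-separated ratings file:

```scala
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// Parse "user,product,rating" triples (hypothetical file name)
val ratings = sc.textFile("als_ratings.csv").map { line =>
  val Array(user, product, rate) = line.split(',')
  Rating(user.toInt, product.toInt, rate.toDouble)
}

// Learn rank-10 latent factors for users and products
val model = ALS.train(ratings, 10, 20, 0.01)

// Predict one missing entry of the user-product association matrix
val predicted = model.predict(1, 42)
```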
# Gradient Descent Primitive

[Gradient descent](http://en.wikipedia.org/wiki/Gradient_descent) (along with
stochastic variants thereof) is a first-order optimization method that is
well-suited for large-scale and distributed computation. Gradient descent
methods aim to find a local minimum of a function by iteratively taking steps
in the direction of the negative gradient of the function at the current
point, i.e., the current parameter value. Gradient descent is included as a
low-level primitive in MLlib, upon which various ML algorithms are developed,
and has the following parameters:
* *gradient* is a class that computes the stochastic gradient of the function
being optimized, i.e., with respect to a single training example, at the
current parameter value. MLlib includes gradient classes for common loss
functions, e.g., hinge, logistic, least-squares. The gradient class takes as
input a training example, its label, and the current parameter value.
* *updater* is a class that updates weights in each iteration of gradient
descent. MLlib includes updaters for cases without regularization, as well as
L1 and L2 regularizers.
* *stepSize* is a scalar value denoting the initial step size for gradient
descent. All updaters in MLlib use a step size at the t-th step equal to
`stepSize / sqrt(t)` (illustrated in the sketch after this list).
* *numIterations* is the number of iterations to run.
* *regParam* is the regularization parameter when using L1 or L2 regularization.
* *miniBatchFraction* is the fraction of the data used to compute the gradient
at each iteration.
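
To make the decay schedule concrete, the effective step size shrinks as the
iteration count grows. A small illustration of `stepSize / sqrt(t)` in plain
Scala (no MLlib API assumed):

```scala
// Effective step size at iterations t = 1..5 for an initial stepSize of 1.0
val stepSize = 1.0
val schedule = (1 to 5).map(t => stepSize / math.sqrt(t))
// Vector(1.0, 0.707..., 0.577..., 0.5, 0.447...)
```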