cs525-sp18-g07 / spark / Commits
Commit 81a8bd46, authored 11 years ago by Ameet Talwalkar

    respose to PR comments

Parent: bf280c8b
Changes: 2 changed files, with 352 additions and 25 deletions

* docs/mllib-guide.md: +30 −25
* mllib/data/sample_svm_data.txt: +322 −0
docs/mllib-guide.md (+30 −25)
@@ -43,26 +43,20 @@ import org.apache.spark.mllib.classification.SVMWithSGD
 import org.apache.spark.mllib.regression.LabeledPoint
 // Load and parse the data file
-val data = sc.textFile("sample_wiki_ngrams.txt")
+val data = sc.textFile("mllib/data/sample_svm_data.txt")
 val parsedData = data.map(line => {
   val parts = line.split(' ')
   LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble).toArray)
 })
 // Run training algorithm
-val stepSizeVal = 1.0
-val regParamVal = 0.1
-val numIterationsVal = 200
-val miniBatchFractionVal = 1.0
+val numIterations = 20
 val model = SVMWithSGD.train(
   parsedData,
-  numIterationsVal,
-  stepSizeVal,
-  regParamVal,
-  miniBatchFractionVal)
+  numIterations)
 // Evaluate model on training examples and compute training error
-val labelAnPreds = parsedData.map(r => {
+val labelAndPreds = parsedData.map(r => {
   val prediction = model.predict(r.features)
   (r.label, prediction)
 })
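The parsing logic kept as context in the hunk above can be tried outside of Spark. The sketch below is illustrative only: it substitutes a hand-rolled `LabeledPoint` case class for the MLlib one so that no SparkContext is needed, and the input line is made up rather than taken from the data file.

```scala
// Stand-in for org.apache.spark.mllib.regression.LabeledPoint so this
// runs without a Spark cluster (illustrative, not the MLlib class).
case class LabeledPoint(label: Double, features: Array[Double])

// The same split-and-convert logic as the parsedData map in the diff:
// the first token is the label, the remaining tokens are feature values.
def parseLine(line: String): LabeledPoint = {
  val parts = line.split(' ')
  LabeledPoint(parts(0).toDouble, parts.tail.map(x => x.toDouble))
}

val point = parseLine("1 0.5 2.0 0.0")
// point.label is 1.0; point.features is Array(0.5, 2.0, 0.0)
```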
@@ -70,30 +64,31 @@ val trainErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / parsedDa
 println("trainError = " + trainErr)
 {% endhighlight %}
-The `SVMWithSGD` algorithm performs L2 regularization by default. If we want to
-configure this algorithm to generate an L1 regularized variant of SVMs, we can
-use the builder design pattern as follows:
+The `SVMWithSGD.train()` method by default performs L2 regularization with the
+regularization parameter set to 1.0. If we want to configure this algorithm, we
+can customize `SVMWithSGD` further by creating a new object directly and
+calling setter methods. All other MLlib algorithms support customization in
+this way as well. For example, the following code produces an L1 regularized
+variant of SVMs with regularization parameter set to 0.1, and runs the training
+algorithm for 200 iterations.
 {% highlight scala %}
 import org.apache.spark.mllib.optimization.L1Updater
 val svmAlg = new SVMWithSGD()
 svmAlg.optimizer.setNumIterations(200)
-  .setStepSize(1.0)
   .setRegParam(0.1)
-  .setMiniBatchFraction(1.0)
-svmAlg.optimizer.setUpdater(new L1Updater)
+  .setUpdater(new L1Updater)
 val modelL1 = svmAlg.run(parsedData)
 {% endhighlight %}
 Both of the code snippets above can be executed in `spark-shell` to generate a
-classifier for the provided dataset. Moreover, note that static methods and
-builder patterns, similar to the ones displayed above, are available for all
-algorithms in MLlib.
+classifier for the provided dataset.
-[SVMWithSGD](`api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD`)
 Available algorithms for binary classification:
-[LogisticRegressionWithSGD](`api/mllib/index.html#org.apache.spark.mllib.classification.LogistictRegressionWithSGD`)
+* [SVMWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD)
+* [LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
 # Linear Regression
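The `trainErr` line in the hunk header above follows a filter-count-divide pattern that can be sketched with a plain Scala collection in place of the RDD; the (label, prediction) pairs below are made-up values, not output of the trained model.

```scala
// Hypothetical (label, prediction) pairs standing in for labelAndPreds.
val labelAndPreds = Seq((1.0, 1.0), (0.0, 1.0), (1.0, 1.0), (0.0, 0.0))

// Same computation as the guide's trainErr, with Seq methods
// (count with a predicate, size) in place of the RDD filter/count.
val trainErr = labelAndPreds.count(r => r._1 != r._2).toDouble / labelAndPreds.size
// exactly one of the four pairs disagrees, so trainErr is 0.25
```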
@@ -108,7 +103,11 @@ The regression algorithms in MLlib also leverage the underlying gradient
 descent primitive (described [below](#gradient-descent-primitive)), and have
 the same parameters as the binary classification algorithms described above.
-[RidgeRegressionWithSGD](`api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD`)
+Available algorithms for linear regression:
+* [LinearRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD)
+* [RidgeRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
+* [LassoWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
 # Clustering
@@ -134,7 +133,9 @@ a given dataset, the algorithm returns the best clustering result).
 * *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
 * *epsilon* determines the distance threshold within which we consider k-means to have converged.
-[KMeans](`api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans`)
+Available algorithms for clustering:
+* [KMeans](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans)
 # Collaborative Filtering
@@ -154,7 +155,9 @@ following parameters:
 * *iterations* is the number of iterations to run.
 * *lambda* specifies the regularization parameter in ALS.
-[ALS](`api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS`)
+Available algorithms for collaborative filtering:
+* [ALS](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS)
 # Gradient Descent Primitive
@@ -183,4 +186,6 @@ stepSize / sqrt(t).
 * *miniBatchFraction* is the fraction of the data used to compute the gradient
 at each iteration.
-[GradientDescent](`api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent`)
+Available algorithms for gradient descent:
+* [GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
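The stepSize / sqrt(t) decay named in the hunk header above can be illustrated with a one-dimensional descent; the objective f(w) = (w - 3)^2 and the constants below are arbitrary choices for the sketch, not MLlib defaults.

```scala
// Gradient descent on f(w) = (w - 3)^2, shrinking the step size as
// stepSize / sqrt(t), the schedule described in the guide.
def minimize(stepSize: Double, numIterations: Int): Double = {
  var w = 0.0
  for (t <- 1 to numIterations) {
    val gradient = 2.0 * (w - 3.0) // f'(w)
    w -= (stepSize / math.sqrt(t)) * gradient
  }
  w
}

val w = minimize(0.1, 100)
// w ends up close to the minimizer 3.0
```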
mllib/data/sample_svm_data.txt (new file, mode 100644, +322 −0)

(file contents collapsed: 322 added data lines)