---
layout: global
title: Linear Methods - spark.mllib
displayTitle: Linear Methods - spark.mllib
---
* Table of contents
{:toc}
\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \]
## Mathematical formulation
Many standard machine learning methods can be formulated as a convex optimization problem, i.e.
the task of finding a minimizer of a convex function $f$
that depends on a variable vector
$\wv$
(called `weights` in the code), which has $d$
entries.
Formally, we can write this as the optimization problem $\min_{\wv \in\R^d} \; f(\wv)$
, where
the objective function is of the form
\begin{equation} f(\wv) := \lambda\, R(\wv) + \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i) \label{eq:regPrimal} \ . \end{equation}
Here the vectors $\x_i\in\R^d$
are the training data examples, for $1\le i\le n$
, and
$y_i\in\R$
are their corresponding labels, which we want to predict.
We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T x$ and $y$.
Several of `spark.mllib`'s classification and regression algorithms fall into this category,
and are discussed here.
The objective function $f$
has two parts:
the regularizer that controls the complexity of the model,
and the loss that measures the error of the model on the training data.
The loss function $L(\wv;.)$
is typically a convex function in $\wv$
. The
fixed regularization parameter $\lambda \ge 0$
(`regParam` in the code)
defines the trade-off between the two goals of minimizing the loss (i.e.,
training error) and minimizing model complexity (i.e., to avoid overfitting).
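To make the roles of the loss and the regularizer concrete, the following self-contained Scala sketch evaluates the objective $f(\wv)$ from `$\eqref{eq:regPrimal}$` for one illustrative instantiation (squared loss with an L2 regularizer). The helper names and the particular choice of loss are made up for this example and are not part of the `spark.mllib` API.

```scala
// A minimal sketch (not spark.mllib API): f(w) = lambda * R(w) + (1/n) * sum_i L(w; x_i, y_i),
// instantiated here with the squared loss and the L2 regularizer.
def dot(w: Array[Double], x: Array[Double]): Double =
  w.zip(x).map { case (wi, xi) => wi * xi }.sum

def objective(
    w: Array[Double],                    // weight vector (the "weights" in the code)
    data: Seq[(Array[Double], Double)],  // training examples (x_i, y_i)
    lambda: Double                       // regularization parameter (regParam)
): Double = {
  val n = data.size
  val avgLoss = data.map { case (x, y) =>
    val margin = dot(w, x)
    0.5 * (margin - y) * (margin - y)    // squared loss L(w; x, y)
  }.sum / n
  val reg = 0.5 * w.map(wi => wi * wi).sum  // L2 regularizer R(w) = (1/2) ||w||_2^2
  lambda * reg + avgLoss
}
```

Minimizing this function over $\wv$ is what the training algorithms described below do, with the loss and regularizer chosen per algorithm.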
### Loss functions
The following table summarizes the loss functions and their gradients or sub-gradients for the
methods `spark.mllib` supports:
| | loss function $L(\wv; \x, y)$ | gradient or sub-gradient |
|---|---|---|
| hinge loss | $\max \{0, 1-y \wv^T \x \}, \quad y \in \{-1, +1\}$ | $\begin{cases}-y \cdot \x & \text{if $y \wv^T \x <1$}, \\ 0 & \text{otherwise}.\end{cases}$ |
| logistic loss | $\log(1+\exp( -y \wv^T \x)), \quad y \in \{-1, +1\}$ | $-y \left(1-\frac1{1+\exp(-y \wv^T \x)} \right) \cdot \x$ |
| squared loss | $\frac{1}{2} (\wv^T \x - y)^2, \quad y \in \R$ | $(\wv^T \x - y) \cdot \x$ |
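As a reading aid for the table, the following standalone Scala sketch computes each loss and its (sub)gradient for a single example; `margin` stands for $\wv^T \x$, and all function names are made up for this illustration.

```scala
// Losses and (sub)gradients for one example (x, y); margin = w^T x.
// Labels y are in {-1, +1} for hinge and logistic loss, real-valued for squared loss.
def hingeLoss(margin: Double, y: Double): Double = math.max(0.0, 1.0 - y * margin)
def hingeGradient(x: Array[Double], margin: Double, y: Double): Array[Double] =
  if (y * margin < 1.0) x.map(-y * _) else Array.fill(x.length)(0.0)

def logisticLoss(margin: Double, y: Double): Double = math.log1p(math.exp(-y * margin))
def logisticGradient(x: Array[Double], margin: Double, y: Double): Array[Double] =
  x.map(_ * (-y * (1.0 - 1.0 / (1.0 + math.exp(-y * margin)))))

def squaredLoss(margin: Double, y: Double): Double = 0.5 * (margin - y) * (margin - y)
def squaredGradient(x: Array[Double], margin: Double, y: Double): Array[Double] =
  x.map(_ * (margin - y))
```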
### Regularizers
The purpose of the
regularizer is to
encourage simple models and avoid overfitting. We support the following
regularizers in `spark.mllib`:
| | regularizer $R(\wv)$ | gradient or sub-gradient |
|---|---|---|
| zero (unregularized) | 0 | $\0$ |
| L2 | $\frac{1}{2}\|\wv\|_2^2$ | $\wv$ |
| L1 | $\|\wv\|_1$ | $\mathrm{sign}(\wv)$ |
| elastic net | $\alpha \|\wv\|_1 + (1-\alpha)\frac{1}{2}\|\wv\|_2^2$ | $\alpha \mathrm{sign}(\wv) + (1-\alpha) \wv$ |
Here $\mathrm{sign}(\wv)$
is the vector consisting of the signs ($\pm1$
) of all the entries
of $\wv$
.
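In the same spirit as the loss sketch above, here is a standalone Scala sketch of the regularizers and their (sub)gradients; the function names and the `alpha` parameter name are made up for this illustration.

```scala
// Regularizers R(w) and their (sub)gradients, mirroring the table above.
def l2(w: Array[Double]): Double = 0.5 * w.map(wi => wi * wi).sum
def l2Gradient(w: Array[Double]): Array[Double] = w

def l1(w: Array[Double]): Double = w.map(math.abs).sum
def l1Gradient(w: Array[Double]): Array[Double] = w.map(wi => math.signum(wi))

// Elastic net mixes L1 and L2 with a weight alpha in [0, 1].
def elasticNet(w: Array[Double], alpha: Double): Double =
  alpha * l1(w) + (1.0 - alpha) * l2(w)
def elasticNetGradient(w: Array[Double], alpha: Double): Array[Double] =
  l1Gradient(w).zip(l2Gradient(w)).map { case (g1, g2) => alpha * g1 + (1.0 - alpha) * g2 }
```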
L2-regularized problems are generally easier to solve than L1-regularized due to smoothness. However, L1 regularization can help promote sparsity in weights leading to smaller and more interpretable models, the latter of which can be useful for feature selection. Elastic net is a combination of L1 and L2 regularization. It is not recommended to train models without any regularization, especially when the number of training examples is small.
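The following Scala sketch shows one way the regularizer can be selected in `spark.mllib`: the updater passed to an algorithm's optimizer determines $R(\wv)$. The algorithm (`SVMWithSGD`), the iteration count, and the `regParam` value here are illustrative choices, not recommendations.

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.optimization.{L1Updater, SquaredL2Updater}

// Choose the regularizer by setting the updater on the optimizer.
val l1Svm = new SVMWithSGD()
l1Svm.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)                  // lambda in the objective
  .setUpdater(new L1Updater)         // R(w) = ||w||_1, encourages sparse weights

val l2Svm = new SVMWithSGD()
l2Svm.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new SquaredL2Updater)  // R(w) = (1/2) ||w||_2^2
```

Training then proceeds with `l1Svm.run(training)` (or `l2Svm.run(training)`) on an `RDD[LabeledPoint]`.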
### Optimization
Under the hood, linear methods use convex optimization methods to optimize the objective functions.
`spark.mllib` uses two methods, SGD and L-BFGS, described in the optimization section.
Currently, most algorithm APIs support Stochastic Gradient Descent (SGD), and a few support L-BFGS.
Refer to this optimization section for guidelines on choosing between optimization methods.
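As a rough illustration of the two optimizer families, this sketch trains one SGD-based model and one L-BFGS-based model on the same data; `training` is assumed to be an existing `RDD[LabeledPoint]`, and the iteration count is arbitrary.

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, SVMWithSGD}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def trainBoth(training: RDD[LabeledPoint]) = {
  // SGD-based trainer: 100 passes of stochastic gradient descent.
  val svmModel = SVMWithSGD.train(training, 100)
  // L-BFGS-based trainer: quasi-Newton updates, typically fewer iterations to converge.
  val lrModel = new LogisticRegressionWithLBFGS()
    .setNumClasses(2)
    .run(training)
  (svmModel, lrModel)
}
```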
## Classification
Classification aims to divide items into
categories.
The most common classification type is
binary classification, where there are two
categories, usually named positive and negative.
If there are more than two categories, it is called
multiclass classification.
`spark.mllib` supports two linear methods for classification: linear Support Vector Machines (SVMs)
and logistic regression.
Linear SVMs support only binary classification, while logistic regression supports both binary and
multiclass classification problems.
For both methods, `spark.mllib` supports L1 and L2 regularized variants.
The training data set is represented by an RDD of `LabeledPoint` in MLlib,
where labels are class indices starting from zero: $0, 1, 2, \ldots$.
Note that, in the mathematical formulation in this guide, a binary label $y$ is denoted as either
$+1$ (positive) or $-1$ (negative), which is convenient for the formulation.
However, the negative label is represented by $0$ in `spark.mllib` instead of $-1$, to be consistent with
multiclass labeling.
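A small Scala sketch of this label convention; the feature values are made up for illustration.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Negative examples use label 0.0 (not -1.0); positive examples use 1.0.
val positive = LabeledPoint(1.0, Vectors.dense(0.5, 1.2, -0.3))
val negative = LabeledPoint(0.0, Vectors.dense(-0.1, 0.8, 2.0))
```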
### Linear Support Vector Machines (SVMs)
The linear SVM
is a standard method for large-scale classification tasks. It is a linear method as described above in equation $\eqref{eq:regPrimal}$
, with the loss function in the formulation given by the hinge loss: