---
layout: global
title: Linear Methods - spark.mllib
displayTitle: Linear Methods - spark.mllib
---

* Table of contents
{:toc}

\[ \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\mathbb{E}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\wv}{\mathbf{w}} \newcommand{\av}{\mathbf{\alpha}} \newcommand{\bv}{\mathbf{b}} \newcommand{\N}{\mathbb{N}} \newcommand{\id}{\mathbf{I}} \newcommand{\ind}{\mathbf{1}} \newcommand{\0}{\mathbf{0}} \newcommand{\unit}{\mathbf{e}} \newcommand{\one}{\mathbf{1}} \newcommand{\zero}{\mathbf{0}} \]

## Mathematical formulation

Many standard machine learning methods can be formulated as a convex optimization problem, i.e. the task of finding a minimizer of a convex function $f$ that depends on a variable vector $\wv$ (called `weights` in the code), which has $d$ entries. Formally, we can write this as the optimization problem $\min_{\wv \in\R^d} \; f(\wv)$, where the objective function is of the form
\begin{equation}
    f(\wv) := \lambda\, R(\wv) + \frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)
    \label{eq:regPrimal}
    \ .
\end{equation}
Here the vectors $\x_i\in\R^d$ are the training data examples, for $1\le i\le n$, and $y_i\in\R$ are their corresponding labels, which we want to predict. We call the method *linear* if $L(\wv; \x, y)$ can be expressed as a function of $\wv^T \x$ and $y$. Several of `spark.mllib`'s classification and regression algorithms fall into this category, and are discussed here.

The objective function $f$ has two parts: the regularizer that controls the complexity of the model, and the loss that measures the error of the model on the training data. The loss function $L(\wv;.)$ is typically a convex function in $\wv$. The fixed regularization parameter $\lambda \ge 0$ (regParam in the code) defines the trade-off between the two goals of minimizing the loss (i.e., training error) and minimizing model complexity (i.e., to avoid overfitting).
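
As a concrete instance of equation $\eqref{eq:regPrimal}$, choosing the logistic loss and the L2 regularizer from the tables below yields the objective of L2-regularized logistic regression:

\[
f(\wv) = \lambda\, \frac{1}{2}\Vert\wv\Vert_2^2 + \frac1n \sum_{i=1}^n \log\left(1+\exp(-y_i \wv^T \x_i)\right).
\]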

### Loss functions

The following table summarizes the loss functions and their gradients or sub-gradients for the methods `spark.mllib` supports:

|  | loss function $L(\wv; \x, y)$ | gradient or sub-gradient |
| --- | --- | --- |
| hinge loss | $\max \{0, 1-y \wv^T \x \}, \quad y \in \{-1, +1\}$ | $-y \cdot \x$ if $y \wv^T \x < 1$; $\0$ otherwise |
| logistic loss | $\log(1+\exp( -y \wv^T \x)), \quad y \in \{-1, +1\}$ | $-y \left(1-\frac1{1+\exp(-y \wv^T \x)} \right) \cdot \x$ |
| squared loss | $\frac{1}{2} (\wv^T \x - y)^2, \quad y \in \R$ | $(\wv^T \x - y) \cdot \x$ |
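
The table translates directly into code. The following is a minimal, self-contained sketch (plain Scala, not part of the `spark.mllib` API) that evaluates each loss and its (sub-)gradient for a single example $(\x, y)$:

```scala
object LossExamples {
  private def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // hinge loss: max(0, 1 - y w^T x); sub-gradient is -y*x if y w^T x < 1, else 0
  def hinge(w: Array[Double], x: Array[Double], y: Double): (Double, Array[Double]) = {
    val margin = y * dot(w, x)
    if (margin < 1) (1 - margin, x.map(xi => -y * xi))
    else (0.0, Array.fill(x.length)(0.0))
  }

  // logistic loss: log(1 + exp(-y w^T x)); gradient is -y(1 - 1/(1+exp(-y w^T x))) * x
  // (log1p is used for numerical accuracy when the margin is large)
  def logistic(w: Array[Double], x: Array[Double], y: Double): (Double, Array[Double]) = {
    val margin = y * dot(w, x)
    val loss = math.log1p(math.exp(-margin))
    val coeff = -y * (1.0 - 1.0 / (1.0 + math.exp(-margin)))
    (loss, x.map(xi => coeff * xi))
  }

  // squared loss: (w^T x - y)^2 / 2; gradient is (w^T x - y) * x, with y real-valued
  def squared(w: Array[Double], x: Array[Double], y: Double): (Double, Array[Double]) = {
    val diff = dot(w, x) - y
    (diff * diff / 2.0, x.map(xi => diff * xi))
  }
}
```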

### Regularizers

The purpose of the regularizer is to encourage simple models and avoid overfitting. We support the following regularizers in `spark.mllib`:

|  | regularizer $R(\wv)$ | gradient or sub-gradient |
| --- | --- | --- |
| zero (unregularized) | $0$ | $\0$ |
| L2 | $\frac{1}{2}\Vert\wv\Vert_2^2$ | $\wv$ |
| L1 | $\Vert\wv\Vert_1$ | $\mathrm{sign}(\wv)$ |
| elastic net | $\alpha \Vert\wv\Vert_1 + (1-\alpha)\frac{1}{2}\Vert\wv\Vert_2^2$ | $\alpha\, \mathrm{sign}(\wv) + (1-\alpha)\, \wv$ |

Here $\mathrm{sign}(\wv)$ is the vector consisting of the signs ($\pm1$) of all the entries of $\wv$.

L2-regularized problems are generally easier to solve than L1-regularized ones because the L2 penalty is smooth. However, L1 regularization can promote sparsity in the weights, leading to smaller and more interpretable models, which can be useful for feature selection. Elastic net is a combination of L1 and L2 regularization. It is not recommended to train models without any regularization, especially when the number of training examples is small.
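
For illustration, here is a similar sketch (again plain Scala, not a `spark.mllib` API) of the regularizers above and their (sub-)gradients, with `alpha` the elastic-net mixing parameter:

```scala
object RegularizerExamples {
  // L2: ||w||_2^2 / 2; gradient is w itself
  def l2(w: Array[Double]): (Double, Array[Double]) =
    (w.map(wi => wi * wi).sum / 2.0, w.clone())

  // L1: ||w||_1; sub-gradient is the elementwise sign of w
  def l1(w: Array[Double]): (Double, Array[Double]) =
    (w.map(wi => math.abs(wi)).sum, w.map(wi => math.signum(wi)))

  // elastic net: alpha * L1 + (1 - alpha) * L2, and the same combination of gradients
  def elasticNet(w: Array[Double], alpha: Double): (Double, Array[Double]) = {
    val (l1Value, l1Grad) = l1(w)
    val (l2Value, l2Grad) = l2(w)
    val value = alpha * l1Value + (1 - alpha) * l2Value
    val grad = l1Grad.zip(l2Grad).map { case (g1, g2) => alpha * g1 + (1 - alpha) * g2 }
    (value, grad)
  }
}
```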

### Optimization

Under the hood, linear methods use convex optimization methods to optimize the objective functions. `spark.mllib` uses two methods, SGD and L-BFGS, described in the optimization section. Currently, most algorithm APIs support Stochastic Gradient Descent (SGD), and a few support L-BFGS. Refer to the optimization section for guidelines on choosing between these methods.
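
For example, the same binary classification data can be fit with an SGD-based and an L-BFGS-based solver. The sketch below assumes a `training: RDD[LabeledPoint]` (with 0.0/1.0 labels, as described in the next section) is already defined:

```scala
import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, SVMWithSGD}

val numIterations = 100

// Hinge loss optimized with stochastic gradient descent.
val svmModel = SVMWithSGD.train(training, numIterations)

// Logistic loss optimized with L-BFGS.
val lrModel = new LogisticRegressionWithLBFGS().run(training)
```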

## Classification

Classification aims to divide items into categories. The most common classification type is binary classification, where there are two categories, usually named positive and negative. If there are more than two categories, it is called multiclass classification. `spark.mllib` supports two linear methods for classification: linear Support Vector Machines (SVMs) and logistic regression. Linear SVMs support only binary classification, while logistic regression supports both binary and multiclass classification problems. For both methods, `spark.mllib` supports L1- and L2-regularized variants.

The training data set is represented by an RDD of `LabeledPoint` in `spark.mllib`, where labels are class indices starting from zero: $0, 1, 2, \ldots$. Note that, in the mathematical formulation in this guide, a binary label $y$ is denoted as either $+1$ (positive) or $-1$ (negative), which is convenient for the formulation. However, the negative label is represented by $0$ in `spark.mllib` instead of $-1$, to be consistent with multiclass labeling.
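
To make the labeling convention concrete, here is a small sketch of building such a training set (the feature values are made up for illustration; `sc` is an existing `SparkContext`):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Binary labels are 0.0 (negative) and 1.0 (positive), not -1/+1;
// multiclass labels would be 0.0, 1.0, 2.0, ...
val positive = LabeledPoint(1.0, Vectors.dense(1.0, 0.0, 3.0))
val negative = LabeledPoint(0.0, Vectors.dense(2.0, 1.0, 0.5))

val training = sc.parallelize(Seq(positive, negative))
```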

### Linear Support Vector Machines (SVMs)

The linear SVM is a standard method for large-scale classification tasks. It is a linear method as described above in equation $\eqref{eq:regPrimal}$, with the loss function in the formulation given by the hinge loss:
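
\[
L(\wv;\x,y) := \max \{0, 1-y \wv^T \x \}.
\]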