Commit 36827dda authored by Shuai Lin's avatar Shuai Lin Committed by Sean Owen

[SPARK-16822][DOC] Support latex in scaladoc.

## What changes were proposed in this pull request?

Support using LaTeX in scaladoc by adding the MathJax JavaScript library to the js template.

## How was this patch tested?

Generated the scaladoc. Previews:

- LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient)

- MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.
parent 511dede1
......@@ -41,3 +41,23 @@ function addBadges(allAnnotations, name, tag, html) {
.add(annotations.closest("div.fullcomment").prevAll("h4.signature"))
.prepend(html);
}
$(document).ready(function() {
var script = document.createElement('script');
script.type = 'text/javascript';
script.async = true;
script.onload = function(){
MathJax.Hub.Config({
displayAlign: "left",
tex2jax: {
inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre', 'a']
}
});
};
script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') +
'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
document.getElementsByTagName('head')[0].appendChild(script);
});
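With this configuration in place, scaladoc comments can embed TeX using the `$...$` (inline) and `$$...$$` (display) delimiters registered above. A minimal, hypothetical example (not part of this commit; `SigmoidExample` and its formula are purely illustrative):

```scala
/**
 * Hypothetical scaladoc showing the MathJax delimiters enabled by this change:
 * `$...$` for inline math and `$$...$$` inside a blockquote for display math.
 *
 * The logistic function $\sigma(z)$ used below is
 *
 * <p><blockquote>
 * $$
 * \sigma(z) = \frac{1}{1 + e^{-z}}
 * $$
 * </blockquote></p>
 */
object SigmoidExample {
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
}
```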
......@@ -76,11 +76,15 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
/**
* Rescale each feature individually to a common range [min, max] linearly using column summary
* statistics, which is also known as min-max normalization or Rescaling. The rescaled value for
* feature E is calculated as,
* feature E is calculated as:
*
* `Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min`
* <p><blockquote>
* $$
* Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
* $$
* </blockquote></p>
*
* For the case `E_{max} == E_{min}`, `Rescaled(e_i) = 0.5 * (max + min)`.
* For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.
 * Note that since zero values will probably be transformed to non-zero values, the output of the
 * transformer will be a DenseVector even for sparse input.
*/
......
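As a plain-Scala sketch of the rescaling formula documented above (illustrative only; `rescale` and its parameters are not part of Spark's `MinMaxScaler` API):

```scala
object MinMaxRescaleSketch {
  /** Rescaled(e_i) for a single feature value, given the column statistics E_min and E_max. */
  def rescale(ei: Double, eMin: Double, eMax: Double, min: Double, max: Double): Double =
    if (eMax == eMin) 0.5 * (max + min)                   // constant column: map everything to the midpoint
    else (ei - eMin) / (eMax - eMin) * (max - min) + min  // general case

  def main(args: Array[String]): Unit = {
    // A column with summary statistics E_min = 2.0 and E_max = 10.0, rescaled to [0.0, 1.0].
    println(rescale(6.0, 2.0, 10.0, 0.0, 1.0)) // prints 0.5
  }
}
```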
......@@ -412,50 +412,72 @@ object AFTSurvivalRegressionModel extends MLReadable[AFTSurvivalRegressionModel]
 * Two AFTAggregators can be merged together to provide a summary of the loss and gradient of
* the corresponding joint dataset.
*
* Given the values of the covariates x^{'}, for random lifetime t_{i} of subjects i = 1, ..., n,
* Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of subjects i = 1,..,n,
* with possible right-censoring, the likelihood function under the AFT model is given as
* {{{
* L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}
* (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}
* (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
* }}}
* Where \delta_{i} is the indicator of the event has occurred i.e. uncensored or not.
* Using \epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}, the log-likelihood function
*
* <p><blockquote>
* $$
* L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}
* (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}
* (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
* $$
* </blockquote></p>
*
 * Where $\delta_{i}$ indicates whether the event has occurred, i.e. whether it is uncensored.
* Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function
* assumes the form
* {{{
* \iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+
* \delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
* }}}
* Where S_{0}(\epsilon_{i}) is the baseline survivor function,
* and f_{0}(\epsilon_{i}) is corresponding density function.
*
* <p><blockquote>
* $$
* \iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+
* \delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
* $$
* </blockquote></p>
* Where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
 * and $f_{0}(\epsilon_{i})$ is the corresponding density function.
*
* The most commonly used log-linear survival regression method is based on the Weibull
 * distribution of the survival time. The Weibull distribution for the lifetime corresponds
 * to the extreme value distribution for the log of the lifetime,
* and the S_{0}(\epsilon) function is
* {{{
* S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
* }}}
* the f_{0}(\epsilon_{i}) function is
* {{{
* f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
* }}}
* and the $S_{0}(\epsilon)$ function is
*
* <p><blockquote>
* $$
* S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
* $$
* </blockquote></p>
*
* and the $f_{0}(\epsilon_{i})$ function is
*
* <p><blockquote>
* $$
* f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
* $$
* </blockquote></p>
*
 * The log-likelihood function for the Weibull distribution of lifetime is
* {{{
* \iota(\beta,\sigma)=
* -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
* }}}
*
* <p><blockquote>
* $$
* \iota(\beta,\sigma)=
* -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
* $$
* </blockquote></p>
*
 * Since minimizing the negative log-likelihood is equivalent to maximum a posteriori estimation,
* the loss function we use to optimize is -\iota(\beta,\sigma).
* The gradient functions for \beta and \log\sigma respectively are
* {{{
* \frac{\partial (-\iota)}{\partial \beta}=
* \sum_{1=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma}
* }}}
* {{{
* \frac{\partial (-\iota)}{\partial (\log\sigma)}=
* \sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
* }}}
* the loss function we use to optimize is $-\iota(\beta,\sigma)$.
* The gradient functions for $\beta$ and $\log\sigma$ respectively are
*
* <p><blockquote>
* $$
* \frac{\partial (-\iota)}{\partial \beta}=
 *   \sum_{i=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma} \\
*
* \frac{\partial (-\iota)}{\partial (\log\sigma)}=
* \sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
* $$
* </blockquote></p>
*
 * @param parameters including three parts: the log of the scale parameter, the intercept and
* regression coefficients corresponding to the features.
* @param fitIntercept Whether to fit an intercept term.
......
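The Weibull AFT loss and gradients spelled out above can be sketched for a single observation in plain Scala as follows (a hedged illustration only; the names and signature are hypothetical and do not match Spark's `AFTAggregator`):

```scala
object AFTWeibullSketch {
  /**
   * One observation's contribution to the negative log-likelihood $-\iota(\beta,\sigma)$ and to
   * the gradients with respect to $\beta$ and $\log\sigma$, following the formulas above.
   */
  def lossAndGradient(
      x: Array[Double],      // covariates x'
      logT: Double,          // log of the observed lifetime t_i
      delta: Double,         // 1.0 if the event occurred (uncensored), 0.0 if right-censored
      beta: Array[Double],   // regression coefficients
      logSigma: Double): (Double, Array[Double], Double) = {
    val sigma = math.exp(logSigma)
    val xBeta = x.zip(beta).map { case (xi, bi) => xi * bi }.sum
    val eps = (logT - xBeta) / sigma                           // \epsilon_i
    val expEps = math.exp(eps)
    val loss = delta * logSigma - delta * eps + expEps         // one term of -\iota(\beta,\sigma)
    val gradBeta = x.map(xi => (delta - expEps) * xi / sigma)  // \partial(-\iota)/\partial\beta
    val gradLogSigma = delta + (delta - expEps) * eps          // \partial(-\iota)/\partial(\log\sigma)
    (loss, gradBeta, gradLogSigma)
  }
}
```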
......@@ -58,7 +58,12 @@ private[regression] trait LinearRegressionParams extends PredictorParams
*
* The learning objective is to minimize the squared error, with regularization.
* The specific squared error loss function used is:
* L = 1/2n ||A coefficients - y||^2^
*
* <p><blockquote>
* $$
 *   L = 1/2n ||A coefficients - y||^2
* $$
* </blockquote></p>
*
* This supports multiple types of regularization:
* - none (a.k.a. ordinary least squares)
......@@ -759,66 +764,103 @@ class LinearRegressionSummary private[regression] (
*
 * When training with the intercept enabled,
 * the objective function in the scaled space is given by
* {{{
* L = 1/2n ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2,
* }}}
* where \bar{x_i} is the mean of x_i, \hat{x_i} is the standard deviation of x_i,
* \bar{y} is the mean of label, and \hat{y} is the standard deviation of label.
*
* <p><blockquote>
* $$
* L = 1/2n ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2,
* $$
* </blockquote></p>
*
* where $\bar{x_i}$ is the mean of $x_i$, $\hat{x_i}$ is the standard deviation of $x_i$,
* $\bar{y}$ is the mean of label, and $\hat{y}$ is the standard deviation of label.
*
 * If we fit with the intercept disabled (that is, forced through 0.0),
* we can use the same equation except we set \bar{y} and \bar{x_i} to 0 instead
* we can use the same equation except we set $\bar{y}$ and $\bar{x_i}$ to 0 instead
* of the respective means.
*
* This can be rewritten as
* {{{
* L = 1/2n ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y}
* + \bar{y} / \hat{y}||^2
* = 1/2n ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2n diff^2
* }}}
* where w_i^\prime^ is the effective coefficients defined by w_i/\hat{x_i}, offset is
* {{{
* - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y}.
* }}}, and diff is
* {{{
* \sum_i w_i^\prime x_i - y / \hat{y} + offset
* }}}
*
* <p><blockquote>
* $$
* \begin{align}
* L &= 1/2n ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y}
* + \bar{y} / \hat{y}||^2 \\
* &= 1/2n ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2n diff^2
* \end{align}
* $$
* </blockquote></p>
*
 * where $w_i^\prime$ are the effective coefficients defined by $w_i/\hat{x_i}$, and the offset is
*
* <p><blockquote>
* $$
* - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y}.
* $$
* </blockquote></p>
*
* and diff is
*
* <p><blockquote>
* $$
* \sum_i w_i^\prime x_i - y / \hat{y} + offset
* $$
* </blockquote></p>
*
 * Note that the effective coefficients and offset don't depend on the training dataset,
* so they can be precomputed.
*
* Now, the first derivative of the objective function in scaled space is
* {{{
* \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i}
* }}}
* However, ($x_i - \bar{x_i}$) will densify the computation, so it's not
*
* <p><blockquote>
* $$
* \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i}
* $$
* </blockquote></p>
*
* However, $(x_i - \bar{x_i})$ will densify the computation, so it's not
 * an ideal formula when the training dataset is in sparse format.
*
* This can be addressed by adding the dense \bar{x_i} / \hat{x_i} terms
* This can be addressed by adding the dense $\bar{x_i} / \hat{x_i}$ terms
* in the end by keeping the sum of diff. The first derivative of total
* objective function from all the samples is
* {{{
* \frac{\partial L}{\partial w_i} =
* 1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i}
* = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i})
* = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i)
* }}},
* where correction_i = - diffSum \bar{x_i} / \hat{x_i}
*
*
* <p><blockquote>
* $$
* \begin{align}
* \frac{\partial L}{\partial w_i} &=
* 1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i} \\
* &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i}) \\
* &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i)
* \end{align}
* $$
* </blockquote></p>
*
* where $correction_i = - diffSum \bar{x_i} / \hat{x_i}$
*
 * A simple calculation shows that diffSum is actually zero, so we don't even
* need to add the correction terms in the end. From the definition of diff,
* {{{
* diffSum = \sum_j (\sum_i w_i(x_{ij} - \bar{x_i}) / \hat{x_i} - (y_j - \bar{y}) / \hat{y})
* = N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y})
* = 0
* }}}
*
* <p><blockquote>
* $$
* \begin{align}
* diffSum &= \sum_j (\sum_i w_i(x_{ij} - \bar{x_i})
* / \hat{x_i} - (y_j - \bar{y}) / \hat{y}) \\
* &= N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y}) \\
* &= 0
* \end{align}
* $$
* </blockquote></p>
*
* As a result, the first derivative of the total objective function only depends on
 * the training dataset, which can be easily computed in a distributed fashion, and is
 * sparse-format friendly.
* {{{
* \frac{\partial L}{\partial w_i} = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i})
* }}},
*
* <p><blockquote>
* $$
 *    \frac{\partial L}{\partial w_i} = 1/N \sum_j diff_j x_{ij} / \hat{x_i}
* $$
* </blockquote></p>
*
* @param coefficients The coefficients corresponding to the features.
* @param labelStd The standard deviation value of the label.
......
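A plain-Scala sketch of the precomputation and the sparse-friendly gradient derived above (illustrative only; the names and signatures are hypothetical and do not match Spark's aggregator classes):

```scala
object ScaledLeastSquaresSketch {
  /** w_i' = w_i / \hat{x_i} and offset = -\sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y}/\hat{y}. */
  def effectiveCoefAndOffset(
      coefficients: Array[Double],
      featuresMean: Array[Double],  // \bar{x_i}
      featuresStd: Array[Double],   // \hat{x_i}
      labelMean: Double,            // \bar{y}
      labelStd: Double              // \hat{y}
  ): (Array[Double], Double) = {
    val effective = coefficients.zip(featuresStd).map { case (w, std) => w / std }
    val offset = labelMean / labelStd -
      effective.zip(featuresMean).map { case (wp, mean) => wp * mean }.sum
    (effective, offset)
  }

  /** diff for one instance: \sum_i w_i' x_i - y / \hat{y} + offset. */
  def diff(effective: Array[Double], offset: Double,
           x: Array[Double], y: Double, labelStd: Double): Double =
    effective.zip(x).map { case (wp, xi) => wp * xi }.sum - y / labelStd + offset

  /** Sparse-friendly gradient: \partial L/\partial w_i = 1/N \sum_j diff_j x_{ij} / \hat{x_i}. */
  def gradient(data: Seq[(Array[Double], Double)], effective: Array[Double], offset: Double,
               featuresStd: Array[Double], labelStd: Double): Array[Double] = {
    val grad = Array.fill(effective.length)(0.0)
    data.foreach { case (x, y) =>
      val d = diff(effective, offset, x, y, labelStd)
      var i = 0
      while (i < x.length) { grad(i) += d * x(i) / featuresStd(i); i += 1 }
    }
    grad.map(_ / data.size)  // the diffSum correction term is zero, as shown above
  }
}
```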
......@@ -25,7 +25,7 @@ import breeze.numerics._
private[clustering] object LDAUtils {
/**
* Log Sum Exp with overflow protection using the identity:
* For any a: \log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\}
* For any a: $\log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\}$
*/
private[clustering] def logSumExp(x: BDV[Double]): Double = {
val a = max(x)
......
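For reference, the same identity can be sketched without Breeze (illustrative; the `logSumExp` above operates on a Breeze `BDV[Double]`):

```scala
object LogSumExpSketch {
  /** log(\sum_n exp(x_n)) computed as a + log(\sum_n exp(x_n - a)) with a = max(x). */
  def logSumExp(x: Array[Double]): Double = {
    val a = x.max
    a + math.log(x.map(v => math.exp(v - a)).sum)
  }
}
// e.g. LogSumExpSketch.logSumExp(Array(1000.0, 1000.0)) == 1000.0 + math.log(2.0), with no overflow
```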
......@@ -73,7 +73,7 @@ class RegressionMetrics @Since("2.0.0") (
/**
* Returns the variance explained by regression.
* explainedVariance = \sum_i (\hat{y_i} - \bar{y})^2 / n
* explainedVariance = $\sum_i (\hat{y_i} - \bar{y})^2 / n$
* @see [[https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained]]
*/
@Since("1.2.0")
......
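A local, non-distributed sketch of the quantity documented above (illustrative only; `RegressionMetrics` itself computes it from an RDD of prediction/observation pairs):

```scala
object ExplainedVarianceSketch {
  /** explainedVariance = \sum_i (\hat{y_i} - \bar{y})^2 / n for (prediction, label) pairs. */
  def explainedVariance(predictionAndLabels: Seq[(Double, Double)]): Double = {
    val n = predictionAndLabels.size.toDouble
    val labelMean = predictionAndLabels.map(_._2).sum / n
    predictionAndLabels.map { case (pred, _) => (pred - labelMean) * (pred - labelMean) }.sum / n
  }
}
```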
......@@ -67,43 +67,53 @@ abstract class Gradient extends Serializable {
* http://statweb.stanford.edu/~tibs/ElemStatLearn/ , Eq. (4.17) on page 119 gives the formula of
 * the multinomial logistic regression model. A simple calculation shows that
*
* {{{
* P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))
* P(y=1|x, w) = exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))
* ...
* P(y=K-1|x, w) = exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))
* }}}
* <p><blockquote>
* $$
* P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))\\
* P(y=1|x, w) = exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))\\
* ...\\
* P(y=K-1|x, w) = exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))\\
* $$
* </blockquote></p>
*
 * for the multiclass classification problem with K classes.
*
* The model weights w = (w_1, w_2, ..., w_{K-1})^T becomes a matrix which has dimension of
 * The model weights $w = (w_1, w_2, ..., w_{K-1})^T$ become a matrix with dimension
* (K-1) * (N+1) if the intercepts are added. If the intercepts are not added, the dimension
* will be (K-1) * N.
*
 * As a result, the loss of the objective function for a single instance of data can be written as
* {{{
* l(w, x) = -log P(y|x, w) = -\alpha(y) log P(y=0|x, w) - (1-\alpha(y)) log P(y|x, w)
* = log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1}
* = log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1}
* }}}
* <p><blockquote>
* $$
* \begin{align}
* l(w, x) &= -log P(y|x, w) = -\alpha(y) log P(y=0|x, w) - (1-\alpha(y)) log P(y|x, w) \\
* &= log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1} \\
* &= log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1}
* \end{align}
* $$
* </blockquote></p>
*
* where \alpha(i) = 1 if i != 0, and
* \alpha(i) = 0 if i == 0,
* margins_i = x w_i.
* where $\alpha(i) = 1$ if $i \ne 0$, and
* $\alpha(i) = 0$ if $i == 0$,
* $margins_i = x w_i$.
*
* For optimization, we have to calculate the first derivative of the loss function, and
* a simple calculation shows that
*
* {{{
* \frac{\partial l(w, x)}{\partial w_{ij}}
* = (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y)\delta_{y, i+1})) * x_j
* = multiplier_i * x_j
* }}}
* <p><blockquote>
* $$
* \begin{align}
* \frac{\partial l(w, x)}{\partial w_{ij}} &=
* (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y)\delta_{y, i+1})) * x_j \\
* &= multiplier_i * x_j
* \end{align}
* $$
* </blockquote></p>
*
* where \delta_{i, j} = 1 if i == j,
* \delta_{i, j} = 0 if i != j, and
* where $\delta_{i, j} = 1$ if $i == j$,
* $\delta_{i, j} = 0$ if $i != j$, and
* multiplier =
* \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1})
 * $\exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y)\delta_{y, i+1})$
*
 * If any margin is larger than 709.78, the numerical computation of the multiplier and loss
 * function will suffer from arithmetic overflow. This issue occurs when there are outliers
......@@ -113,26 +123,36 @@ abstract class Gradient extends Serializable {
* Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can be
* easily rewritten into the following equivalent numerically stable formula.
*
* {{{
* l(w, x) = log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1}
* = log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin
* - (1-\alpha(y)) margins_{y-1}
* = log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1}
* }}}
*
* where sum = \exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1.
* <p><blockquote>
* $$
* \begin{align}
* l(w, x) &= log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \\
* &= log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin
* - (1-\alpha(y)) margins_{y-1} \\
* &= log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1}
* \end{align}
* $$
* </blockquote></p>
* where sum = $\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1$.
*
* Note that each term, (margins_i - maxMargin) in \exp is smaller than zero; as a result,
 * Note that each term $(margins_i - maxMargin)$ in the $\exp$ is smaller than zero; as a result,
* overflow will not happen with this formula.
*
 * For the multiplier, a similar trick can be applied as follows:
*
* {{{
* multiplier = \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1})
* = \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y)\delta_{y, i+1})
* }}}
* <p><blockquote>
* $$
* \begin{align}
* multiplier
* &= \exp(margins_i) /
 *     (1 + \sum_k^{K-1} \exp(margins_k)) - (1-\alpha(y)\delta_{y, i+1}) \\
* &= \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y)\delta_{y, i+1})
* \end{align}
* $$
* </blockquote></p>
*
* where each term in \exp is also smaller than zero, so overflow is not a concern.
* where each term in $\exp$ is also smaller than zero, so overflow is not a concern.
*
* For the detailed mathematical derivation, see the reference at
* http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
......
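The numerically stable formulas above can be sketched for a single instance as follows (a hedged illustration, not Spark's `LogisticGradient`; `margins(i)` is assumed to hold `x w_{i+1}` with class 0 as the pivot, and the second multiplier term is written as the indicator that the label equals class i+1):

```scala
object StableMultinomialLogisticSketch {
  /** Numerically stable loss and per-class multipliers for one instance, via the maxMargin trick. */
  def lossAndMultipliers(margins: Array[Double], label: Int): (Double, Array[Double]) = {
    val maxMargin = math.max(0.0, margins.max)
    // sum = \exp(-maxMargin) + \sum_i \exp(margins_i - maxMargin) - 1
    val sum = math.exp(-maxMargin) + margins.map(m => math.exp(m - maxMargin)).sum - 1.0
    // loss = log(1 + sum) + maxMargin - margins_{y-1} (the last term only for non-pivot labels)
    val loss = math.log1p(sum) + maxMargin - (if (label > 0) margins(label - 1) else 0.0)
    val multipliers = margins.zipWithIndex.map { case (m, i) =>
      val indicator = if (label == i + 1) 1.0 else 0.0  // 1 when this margin belongs to the label's class
      math.exp(m - maxMargin) / (1.0 + sum) - indicator
    }
    (loss, multipliers)
  }
}
```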