Commit 42d65681, authored by vijaykiran, committed by Xiangrui Meng

[SPARK-12630][PYSPARK] [DOC] PySpark classification parameter desc to consistent format

Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.

Author: vijaykiran <mail@vijaykiran.com>
Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
parent 90de6b2f
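
The layout this commit standardizes on can be sketched on a small function: each `:param name:` sits on its own line, the description follows on the next line, and the default value gets a separate `(default: ...)` line. The function `train_sketch` and its body are hypothetical and exist only to carry the docstring; the two-space indentation of the description lines is an assumption, since the page rendering above does not preserve whitespace.

```python
def train_sketch(data, iterations=100, step=1.0):
    """
    Hypothetical trainer, present only to show the docstring layout.

    :param data:
      The training data.
    :param iterations:
      The number of iterations.
      (default: 100)
    :param step:
      The step parameter used in SGD.
      (default: 1.0)
    """
    # No training happens here; the body just echoes the arguments.
    return {"rows": len(data), "iterations": iterations, "step": step}


if __name__ == "__main__":
    # Print the docstring to inspect the layout.
    print(train_sketch.__doc__)
```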
@@ -94,16 +94,19 @@ class LogisticRegressionModel(LinearClassificationModel):
  Classification model trained using Multinomial/Binary Logistic
  Regression.
- :param weights: Weights computed for every feature.
- :param intercept: Intercept computed for this model. (Only used
-     in Binary Logistic Regression. In Multinomial Logistic
-     Regression, the intercepts will not be a single value,
-     so the intercepts will be part of the weights.)
- :param numFeatures: the dimension of the features.
- :param numClasses: the number of possible outcomes for k classes
-     classification problem in Multinomial Logistic Regression.
-     By default, it is binary logistic regression so numClasses
-     will be set to 2.
+ :param weights:
+   Weights computed for every feature.
+ :param intercept:
+   Intercept computed for this model. (Only used in Binary Logistic
+   Regression. In Multinomial Logistic Regression, the intercepts will
+   not be a single value, so the intercepts will be part of the
+   weights.)
+ :param numFeatures:
+   The dimension of the features.
+ :param numClasses:
+   The number of possible outcomes for k classes classification problem
+   in Multinomial Logistic Regression. By default, it is binary
+   logistic regression so numClasses will be set to 2.
  >>> data = [
  ...     LabeledPoint(0.0, [0.0, 1.0]),
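
To make the `weights` and `intercept` parameters documented above concrete, here is a minimal pure-Python sketch of how a binary logistic model could turn them into a 0/1 prediction. It is not the library's implementation: the function name `predict_binary` and the 0.5 threshold are assumptions for illustration, and the multinomial case (where the intercepts are folded into the weights) is not covered.

```python
import math


def predict_binary(weights, intercept, features, threshold=0.5):
    """Return 1 if the logistic score exceeds the threshold, else 0."""
    # Linear margin: dot(weights, features) + intercept.
    margin = sum(w * x for w, x in zip(weights, features)) + intercept
    # Numerically stable logistic function for either sign of the margin.
    if margin >= 0:
        score = 1.0 / (1.0 + math.exp(-margin))
    else:
        e = math.exp(margin)
        score = e / (1.0 + e)
    return 1 if score > threshold else 0


if __name__ == "__main__":
    print(predict_binary([2.0, 0.0], 0.0, [1.0, 5.0]))
    print(predict_binary([2.0, 0.0], -3.0, [1.0, 5.0]))
```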
@@ -189,8 +192,8 @@ class LogisticRegressionModel(LinearClassificationModel):
  @since('1.4.0')
  def numClasses(self):
      """
-     Number of possible outcomes for k classes classification problem in Multinomial
-     Logistic Regression.
+     Number of possible outcomes for k classes classification problem
+     in Multinomial Logistic Regression.
      """
      return self._numClasses
@@ -272,37 +275,42 @@ class LogisticRegressionWithSGD(object):
  """
  Train a logistic regression model on the given data.
- :param data: The training data, an RDD of
-     LabeledPoint.
- :param iterations: The number of iterations
-     (default: 100).
- :param step: The step parameter used in SGD
-     (default: 1.0).
- :param miniBatchFraction: Fraction of data to be used for each
-     SGD iteration (default: 1.0).
- :param initialWeights: The initial weights (default: None).
- :param regParam: The regularizer parameter
-     (default: 0.01).
- :param regType: The type of regularizer used for
-     training our model.
-     :Allowed values:
-     - "l1" for using L1 regularization
-     - "l2" for using L2 regularization
-     - None for no regularization
-     (default: "l2")
- :param intercept: Boolean parameter which indicates the
-     use or not of the augmented representation
-     for training data (i.e. whether bias
-     features are activated or not,
-     default: False).
- :param validateData: Boolean parameter which indicates if
-     the algorithm should validate data
-     before training. (default: True)
- :param convergenceTol: A condition which decides iteration termination.
-     (default: 0.001)
+ :param data:
+   The training data, an RDD of LabeledPoint.
+ :param iterations:
+   The number of iterations.
+   (default: 100)
+ :param step:
+   The step parameter used in SGD.
+   (default: 1.0)
+ :param miniBatchFraction:
+   Fraction of data to be used for each SGD iteration.
+   (default: 1.0)
+ :param initialWeights:
+   The initial weights.
+   (default: None)
+ :param regParam:
+   The regularizer parameter.
+   (default: 0.01)
+ :param regType:
+   The type of regularizer used for training our model.
+   Allowed values:
+   - "l1" for using L1 regularization
+   - "l2" for using L2 regularization (default)
+   - None for no regularization
+ :param intercept:
+   Boolean parameter which indicates the use or not of the
+   augmented representation for training data (i.e., whether bias
+   features are activated or not).
+   (default: False)
+ :param validateData:
+   Boolean parameter which indicates if the algorithm should
+   validate data before training.
+   (default: True)
+ :param convergenceTol:
+   A condition which decides iteration termination.
+   (default: 0.001)
  """
  def train(rdd, i):
      return callMLlibFunc("trainLogisticRegressionModelWithSGD", rdd, int(iterations),
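
The three allowed `regType` values in the docstring above differ only in the penalty term added to the training objective. The sketch below shows the textbook form of each penalty, scaled by `regParam`. The helper `regularization_penalty` is hypothetical, and the factor of 1/2 on the L2 term is a common convention assumed here; the diff itself does not state how the library computes the penalty.

```python
def regularization_penalty(weights, reg_param=0.01, reg_type="l2"):
    """Return the penalty added to the loss for a given regularizer type."""
    if reg_type is None:
        # No regularization: the objective is the plain loss.
        return 0.0
    if reg_type == "l1":
        # L1: reg_param times the sum of absolute weights.
        return reg_param * sum(abs(w) for w in weights)
    if reg_type == "l2":
        # L2: half of reg_param times the squared Euclidean norm (assumed scaling).
        return 0.5 * reg_param * sum(w * w for w in weights)
    raise ValueError("reg_type must be 'l1', 'l2', or None")


if __name__ == "__main__":
    w = [3.0, -4.0]
    for kind in ("l1", "l2", None):
        print(kind, regularization_penalty(w, reg_param=0.1, reg_type=kind))
```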
@@ -323,38 +331,43 @@ class LogisticRegressionWithLBFGS(object):
  """
  Train a logistic regression model on the given data.
- :param data: The training data, an RDD of
-     LabeledPoint.
- :param iterations: The number of iterations
-     (default: 100).
- :param initialWeights: The initial weights (default: None).
- :param regParam: The regularizer parameter
-     (default: 0.01).
- :param regType: The type of regularizer used for
-     training our model.
-     :Allowed values:
-     - "l1" for using L1 regularization
-     - "l2" for using L2 regularization
-     - None for no regularization
-     (default: "l2")
- :param intercept: Boolean parameter which indicates the
-     use or not of the augmented representation
-     for training data (i.e. whether bias
-     features are activated or not,
-     default: False).
- :param corrections: The number of corrections used in the
-     LBFGS update (default: 10).
- :param tolerance: The convergence tolerance of iterations
-     for L-BFGS (default: 1e-4).
- :param validateData: Boolean parameter which indicates if the
-     algorithm should validate data before
-     training. (default: True)
- :param numClasses: The number of classes (i.e., outcomes) a
-     label can take in Multinomial Logistic
-     Regression (default: 2).
+ :param data:
+   The training data, an RDD of LabeledPoint.
+ :param iterations:
+   The number of iterations.
+   (default: 100)
+ :param initialWeights:
+   The initial weights.
+   (default: None)
+ :param regParam:
+   The regularizer parameter.
+   (default: 0.01)
+ :param regType:
+   The type of regularizer used for training our model.
+   Allowed values:
+   - "l1" for using L1 regularization
+   - "l2" for using L2 regularization (default)
+   - None for no regularization
+ :param intercept:
+   Boolean parameter which indicates the use or not of the
+   augmented representation for training data (i.e., whether bias
+   features are activated or not).
+   (default: False)
+ :param corrections:
+   The number of corrections used in the LBFGS update.
+   (default: 10)
+ :param tolerance:
+   The convergence tolerance of iterations for L-BFGS.
+   (default: 1e-4)
+ :param validateData:
+   Boolean parameter which indicates if the algorithm should
+   validate data before training.
+   (default: True)
+ :param numClasses:
+   The number of classes (i.e., outcomes) a label can take in
+   Multinomial Logistic Regression.
+   (default: 2)
  >>> data = [
  ...     LabeledPoint(0.0, [0.0, 1.0]),
@@ -387,8 +400,10 @@ class SVMModel(LinearClassificationModel):
  """
  Model for Support Vector Machines (SVMs).
- :param weights: Weights computed for every feature.
- :param intercept: Intercept computed for this model.
+ :param weights:
+   Weights computed for every feature.
+ :param intercept:
+   Intercept computed for this model.
  >>> data = [
  ...     LabeledPoint(0.0, [0.0]),
@@ -490,37 +505,42 @@ class SVMWithSGD(object):
  """
  Train a support vector machine on the given data.
- :param data: The training data, an RDD of
-     LabeledPoint.
- :param iterations: The number of iterations
-     (default: 100).
- :param step: The step parameter used in SGD
-     (default: 1.0).
- :param regParam: The regularizer parameter
-     (default: 0.01).
- :param miniBatchFraction: Fraction of data to be used for each
-     SGD iteration (default: 1.0).
- :param initialWeights: The initial weights (default: None).
- :param regType: The type of regularizer used for
-     training our model.
-     :Allowed values:
-     - "l1" for using L1 regularization
-     - "l2" for using L2 regularization
-     - None for no regularization
-     (default: "l2")
- :param intercept: Boolean parameter which indicates the
-     use or not of the augmented representation
-     for training data (i.e. whether bias
-     features are activated or not,
-     default: False).
- :param validateData: Boolean parameter which indicates if
-     the algorithm should validate data
-     before training. (default: True)
- :param convergenceTol: A condition which decides iteration termination.
-     (default: 0.001)
+ :param data:
+   The training data, an RDD of LabeledPoint.
+ :param iterations:
+   The number of iterations.
+   (default: 100)
+ :param step:
+   The step parameter used in SGD.
+   (default: 1.0)
+ :param regParam:
+   The regularizer parameter.
+   (default: 0.01)
+ :param miniBatchFraction:
+   Fraction of data to be used for each SGD iteration.
+   (default: 1.0)
+ :param initialWeights:
+   The initial weights.
+   (default: None)
+ :param regType:
+   The type of regularizer used for training our model.
+   Allowed values:
+   - "l1" for using L1 regularization
+   - "l2" for using L2 regularization (default)
+   - None for no regularization
+ :param intercept:
+   Boolean parameter which indicates the use or not of the
+   augmented representation for training data (i.e. whether bias
+   features are activated or not).
+   (default: False)
+ :param validateData:
+   Boolean parameter which indicates if the algorithm should
+   validate data before training.
+   (default: True)
+ :param convergenceTol:
+   A condition which decides iteration termination.
+   (default: 0.001)
  """
  def train(rdd, i):
      return callMLlibFunc("trainSVMModelWithSGD", rdd, int(iterations), float(step),
@@ -536,11 +556,13 @@ class NaiveBayesModel(Saveable, Loader):
  """
  Model for Naive Bayes classifiers.
- :param labels: list of labels.
- :param pi: log of class priors, whose dimension is C,
-     number of labels.
- :param theta: log of class conditional probabilities, whose
-     dimension is C-by-D, where D is number of features.
+ :param labels:
+   List of labels.
+ :param pi:
+   Log of class priors, whose dimension is C, number of labels.
+ :param theta:
+   Log of class conditional probabilities, whose dimension is C-by-D,
+   where D is number of features.
  >>> data = [
  ...     LabeledPoint(0.0, [0.0, 0.0]),
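
The shapes given for `labels`, `pi` (length C) and `theta` (C-by-D) suggest how a prediction can be computed: for each class, add the log prior to the dot product of that class's row of `theta` with the feature vector, then pick the class with the highest score. The sketch below assumes the multinomial reading of the model and uses plain Python lists; `predict_nb` is a hypothetical helper, not a function from the library.

```python
import math


def predict_nb(labels, pi, theta, features):
    """Return the label whose log-posterior score is highest."""
    # score[c] = pi[c] + sum_j theta[c][j] * features[j]
    scores = [
        prior + sum(t * x for t, x in zip(row, features))
        for prior, row in zip(pi, theta)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return labels[best]


if __name__ == "__main__":
    labels = [0.0, 1.0]
    pi = [math.log(0.5), math.log(0.5)]
    theta = [
        [math.log(0.9), math.log(0.1)],
        [math.log(0.1), math.log(0.9)],
    ]
    print(predict_nb(labels, pi, theta, [3.0, 0.0]))
    print(predict_nb(labels, pi, theta, [0.0, 2.0]))
```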
@@ -639,8 +661,11 @@ class NaiveBayes(object):
  it can also be used as Bernoulli NB (U{http://tinyurl.com/p7c96j6}).
  The input feature values must be nonnegative.
- :param data: RDD of LabeledPoint.
- :param lambda_: The smoothing parameter (default: 1.0).
+ :param data:
+   RDD of LabeledPoint.
+ :param lambda_:
+   The smoothing parameter.
+   (default: 1.0)
  """
  first = data.first()
  if not isinstance(first, LabeledPoint):
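
The role of the smoothing parameter `lambda_` can be illustrated with the usual additive-smoothing estimates for multinomial Naive Bayes: `lambda_` is added to every count so that unseen features do not produce a log of zero. This is a minimal sketch over a plain list of `(label, features)` pairs rather than an RDD; the function `fit_nb` is hypothetical, it assumes a non-empty input with nonnegative features, and whether the library computes exactly these estimates is not confirmed by the diff.

```python
import math


def fit_nb(points, lambda_=1.0):
    """Estimate (labels, pi, theta) with additive smoothing."""
    counts = {}  # label -> number of training points
    sums = {}    # label -> per-feature totals
    for label, feats in points:
        counts[label] = counts.get(label, 0) + 1
        totals = sums.setdefault(label, [0.0] * len(feats))
        for j, x in enumerate(feats):
            totals[j] += x
    labels = sorted(counts)
    n, c = len(points), len(labels)
    d = len(sums[labels[0]])
    # Log class priors, smoothed over the C classes.
    pi = [math.log((counts[l] + lambda_) / (n + c * lambda_)) for l in labels]
    # Log conditional probabilities, smoothed over the D features.
    theta = [
        [math.log((sums[l][j] + lambda_) / (sum(sums[l]) + d * lambda_))
         for j in range(d)]
        for l in labels
    ]
    return labels, pi, theta


if __name__ == "__main__":
    data = [(0.0, [1.0, 0.0]), (0.0, [2.0, 0.0]), (1.0, [0.0, 1.0])]
    print(fit_nb(data, lambda_=1.0))
```

With `lambda_=0` and a feature that never occurs in some class, the estimate would require the log of zero, which is why a positive smoothing value is the safer choice in this sketch.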
@@ -652,9 +677,9 @@ class NaiveBayes(object):
  @inherit_doc
  class StreamingLogisticRegressionWithSGD(StreamingLinearAlgorithm):
      """
-     Train or predict a logistic regression model on streaming data. Training uses
-     Stochastic Gradient Descent to update the model based on each new batch of
-     incoming data from a DStream.
+     Train or predict a logistic regression model on streaming data.
+     Training uses Stochastic Gradient Descent to update the model based on
+     each new batch of incoming data from a DStream.
      Each batch of data is assumed to be an RDD of LabeledPoints.
      The number of data points per batch can vary, but the number
......