[SPARK-4439] [MLlib] add python api for random forest
    Davies Liu authored
    ```
        class RandomForestModel
         |  A model trained by RandomForest
         |
         |  numTrees(self)
         |      Get number of trees in forest.
         |
         |  predict(self, x)
         |      Predict values for a single data point or an RDD of points using the model trained.
         |
     |  toDebugString(self)
     |      Full string description of the model.
         |
         |  totalNumNodes(self)
         |      Get total number of nodes, summed over all trees in the forest.
         |
    
        class RandomForest
         |  trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
     |      Method to train a random forest model for binary or multiclass classification.
         |
         |      :param data: Training dataset: RDD of LabeledPoint.
         |                   Labels should take values {0, 1, ..., numClasses-1}.
         |      :param numClassesForClassification: number of classes for classification.
         |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
         |                                  E.g., an entry (n -> k) indicates that feature n is categorical
         |                                  with k categories indexed from 0: {0, 1, ..., k-1}.
         |      :param numTrees: Number of trees in the random forest.
         |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
         |                                Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
         |                                If "auto" is set, this parameter is set based on numTrees:
         |                                  if numTrees == 1, set to "all";
         |                                  if numTrees > 1 (forest) set to "sqrt".
         |      :param impurity: Criterion used for information gain calculation.
         |                   Supported values: "gini" (recommended) or "entropy".
     |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
     |                       1 internal node + 2 leaf nodes. (default: 4)
     |      :param maxBins: Maximum number of bins used for splitting features. (default: 32)
     |      :param seed: Random seed for bootstrapping and choosing feature subsets.
         |      :return: RandomForestModel that can be used for prediction
         |
         |   trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
     |      Method to train a random forest model for regression.
         |
         |      :param data: Training dataset: RDD of LabeledPoint.
         |                   Labels are real numbers.
         |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
         |                                   E.g., an entry (n -> k) indicates that feature n is categorical
         |                                   with k categories indexed from 0: {0, 1, ..., k-1}.
         |      :param numTrees: Number of trees in the random forest.
         |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
         |                                 Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
         |                                 If "auto" is set, this parameter is set based on numTrees:
         |                                 if numTrees == 1, set to "all";
         |                                 if numTrees > 1 (forest) set to "onethird".
         |      :param impurity: Criterion used for information gain calculation.
         |                       Supported values: "variance".
     |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
     |                       1 internal node + 2 leaf nodes. (default: 4)
     |      :param maxBins: Maximum number of bins used for splitting features. (default: 32)
     |      :param seed: Random seed for bootstrapping and choosing feature subsets.
         |      :return: RandomForestModel that can be used for prediction
         |
    ```
    
    Author: Davies Liu <davies@databricks.com>
    
    Closes #3320 from davies/forest and squashes the following commits:
    
    8003dfc [Davies Liu] reorder
    53cf510 [Davies Liu] fix docs
    4ca593d [Davies Liu] fix docs
    e0df852 [Davies Liu] fix docs
    0431746 [Davies Liu] rebased
    2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
    885abee [Davies Liu] address comments
    dae7fc0 [Davies Liu] address comments
    89a000f [Davies Liu] fix docs
    565d476 [Davies Liu] add python api for random forest
epytext.py
import re

# Regex rules, applied in order, that translate epytext markup into
# reStructuredText roles for Sphinx.
RULES = (
    # Strip angle-bracketed names, but keep doctest's <BLANKLINE> marker
    # (negative lookahead, not a plain group).
    (r"<(?!BLANKLINE)[\w.]+>", r""),
    (r"L{([\w.()]+)}", r":class:`\1`"),
    (r"[LC]{(\w+\.\w+)\(\)}", r":func:`\1`"),
    (r"C{([\w.()]+)}", r":class:`\1`"),
    (r"[IBCM]{([^}]+)}", r"`\1`"),
    ('pyspark.rdd.RDD', 'RDD'),
)

def _convert_epytext(line):
    """
    >>> _convert_epytext("L{A}")
    :class:`A`
    """
    line = line.replace('@', ':')
    for p, sub in RULES:
        line = re.sub(p, sub, line)
    return line

def _process_docstring(app, what, name, obj, options, lines):
    # Rewrite each docstring line in place before Sphinx renders it.
    for i in range(len(lines)):
        lines[i] = _convert_epytext(lines[i])

def setup(app):
    # Sphinx extension entry point: hook into autodoc's docstring pass.
    app.connect("autodoc-process-docstring", _process_docstring)
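
To see what these rules actually do, here is a standalone sketch that reapplies the same substitutions to a representative epytext line; the sample docstring line is invented for illustration, and the rules are copied from the file above (with the `(?!BLANKLINE)` lookahead intact):

```python
import re

# Same conversion rules as in epytext.py above.
RULES = (
    (r"<(?!BLANKLINE)[\w.]+>", r""),
    (r"L{([\w.()]+)}", r":class:`\1`"),
    (r"[LC]{(\w+\.\w+)\(\)}", r":func:`\1`"),
    (r"C{([\w.()]+)}", r":class:`\1`"),
    (r"[IBCM]{([^}]+)}", r"`\1`"),
    ('pyspark.rdd.RDD', 'RDD'),
)

def convert(line):
    # @param/@return etc. become :param:/:return:, then markup is rewritten.
    line = line.replace('@', ':')
    for pattern, repl in RULES:
        line = re.sub(pattern, repl, line)
    return line

print(convert("@param data: an L{pyspark.rdd.RDD} of C{LabeledPoint}"))
# -> :param data: an :class:`RDD` of :class:`LabeledPoint`
```

Note the rule ordering matters: `L{pyspark.rdd.RDD}` is consumed by the `:class:` rule before the `:func:` rule can see it, and the final plain-string rule shortens `pyspark.rdd.RDD` to `RDD` only after the role has been emitted.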