    [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix · 657a8883
    Joseph K. Bradley authored
    Major changes:
    * Added programming guide sections for tree ensembles
    * Added examples for tree ensembles
    * Updated DecisionTree programming guide with more info on parameters
    * **API change**: Standardized the tree parameter for the number of classes (for classification)
    
    Minor changes:
    * Updated decision tree documentation
    * Updated existing tree and tree ensemble examples
     * Use train/test split, and compute test error instead of training error.
     * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
    
    Note: This is a large diff, but most of the changes fall into three groups:
    * Programming guide sections for gradient boosting and random forests. (These changes are probably best reviewed by generating the docs locally.)
    * New examples (which were copied from the programming guide)
    * The "numClasses" renaming
    
    I have run all examples and relevant unit tests.
    
    CC: mengxr manishamde codedeft
    
    Author: Joseph K. Bradley <joseph@databricks.com>
    Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
    
    Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:
    
    70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
    d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
    8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
    6fab846 [Joseph K. Bradley] small fixes based on review
    b9f8576 [Joseph K. Bradley] updated decision tree doc
    375204c [Joseph K. Bradley] fixed python style
    2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
    706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
    c76c823 [Joseph K. Bradley] added migration guide for mllib
    abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
    07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
    cdfdfbc [Joseph K. Bradley] added examples for GBT
    6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
    ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples