  1. Apr 01, 2015
    • [SPARK-6657] [Python] [Docs] fixed python doc build warnings · fb25e8c7
      Joseph K. Bradley authored
      fixed python doc build warnings
      
      CC whoever wants to review: rxin mengxr davies
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5317 from jkbradley/python-doc-warnings and squashes the following commits:
      
      4cd43c2 [Joseph K. Bradley] fixed python doc build warnings
      fb25e8c7
  2. Mar 03, 2015
    • [SPARK-6097][MLLIB] Support tree model save/load in PySpark/MLlib · 7e53a79c
      Xiangrui Meng authored
      Similar to `MatrixFactorizationModel`, we only need wrappers to support save/load for tree models in Python.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4854 from mengxr/SPARK-6097 and squashes the following commits:
      
      4586a4d [Xiangrui Meng] fix more typos
      8ebcac2 [Xiangrui Meng] fix python style
      91172d8 [Xiangrui Meng] fix typos
      201b3b9 [Xiangrui Meng] update user guide
      b5158e2 [Xiangrui Meng] support tree model save/load in PySpark/MLlib
      7e53a79c
  3. Feb 25, 2015
    • [SPARK-5974] [SPARK-5980] [mllib] [python] [docs] Update ML guide with save/load, Python GBT · d20559b1
      Joseph K. Bradley authored
      * Add GradientBoostedTrees Python examples to ML guide
        * I ran these in the pyspark shell, and they worked.
      * Add save/load to examples in ML guide
      * Added note to Python docs about predict and transform not working within RDD actions and transformations in some cases (see SPARK-5981)
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4750 from jkbradley/SPARK-5974 and squashes the following commits:
      
      c410e38 [Joseph K. Bradley] Added note to LabeledPoint about attributes
      bcae18b [Joseph K. Bradley] Added import of models for save/load examples in ml guide.  Fixed line length for tree.py, feature.py (but not other ML Pyspark files yet).
      6d81c3e [Joseph K. Bradley] completed python GBT examples
      9903309 [Joseph K. Bradley] Added note to python docs about predict,transform not working within RDD actions,transformations in some cases
      c7dfad8 [Joseph K. Bradley] Added model save/load to ML guide.  Added GBT examples to ML guide
      d20559b1
  4. Feb 20, 2015
    • [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release · 4a17eedb
      Joseph K. Bradley authored
      For SPARK-5867:
      * The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API.
      * It should also include Python examples now.
      
      For SPARK-5892:
      * Fix Python docs
      * Various other cleanups
      
      BTW, I accidentally merged this with master.  If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check]
      
      CC: mengxr  (ML),  davies  (Python docs)
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits:
      
      f191bb0 [Joseph K. Bradley] small cleanups
      e786efa [Joseph K. Bradley] small doc corrections
      6b1ab4a [Joseph K. Bradley] fixed python lint test
      946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example.  Changed spark.ml Java examples to use DataFrames API instead of sql()
      da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3
      629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python
      b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation
      34b067f [Joseph K. Bradley] small doc correction
      da16aef [Joseph K. Bradley] Fixed python mllib docs
      8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc
      695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs
      a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs
      b05a80d [Joseph K. Bradley] organize imports. doc cleanups
      e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
      4a17eedb
  5. Jan 30, 2015
    • [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees · bc1fc9b6
      Kazuki Taniguchi authored
      This PR implements the Python API for Gradient Boosted Trees.
      
      Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>
      
      Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:
      
      620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
      bc1fc9b6
  6. Dec 03, 2014
    • [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix · 657a8883
      Joseph K. Bradley authored
      Major changes:
      * Added programming guide sections for tree ensembles
      * Added examples for tree ensembles
      * Updated DecisionTree programming guide with more info on parameters
      * **API change**: Standardized the tree parameter for the number of classes (for classification)
      
      Minor changes:
      * Updated decision tree documentation
      * Updated existing tree and tree ensemble examples
       * Use train/test split, and compute test error instead of training error.
       * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)
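
      The train/test evaluation mentioned above can be sketched in plain Python. This is only an illustration, not the code from the examples: `random_split` and `error_rate` are hypothetical helpers, rows are assumed to be `(label, features)` pairs, and `predict` stands in for a trained model.

```python
import random


def random_split(rows, test_fraction, seed):
    """Randomly assign each row to the training set or the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (test if rng.random() < test_fraction else train).append(row)
    return train, test


def error_rate(predict, rows):
    """Fraction of rows whose predicted label differs from the true label."""
    if not rows:
        return 0.0
    wrong = sum(1 for label, features in rows if predict(features) != label)
    return wrong / len(rows)


# Toy data (an assumption for illustration): label is 1.0 when the feature is positive.
data = [(1.0 if x > 0 else 0.0, [x]) for x in range(-50, 50)]
train, test = random_split(data, test_fraction=0.3, seed=42)


def predict(features):
    # Stand-in for a model trained on `train` only.
    return 1.0 if features[0] > 0 else 0.0


# Report the error on held-out rows, not on the rows used for training.
test_error = error_rate(predict, test)
```

      Measuring error on the held-out rows avoids the optimistic bias of reporting training error.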
      
      Note: I know this is a lot of lines, but most is covered by:
      * Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
      * New examples (which were copied from the programming guide)
      * The "numClasses" renaming
      
      I have run all examples and relevant unit tests.
      
      CC: mengxr manishamde codedeft
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:
      
      70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
      d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
      8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
      6fab846 [Joseph K. Bradley] small fixes based on review
      b9f8576 [Joseph K. Bradley] updated decision tree doc
      375204c [Joseph K. Bradley] fixed python style
      2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
      706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
      c76c823 [Joseph K. Bradley] added migration guide for mllib
      abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
      07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
      cdfdfbc [Joseph K. Bradley] added examples for GBT
      6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
      ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
      657a8883
  7. Nov 20, 2014
    • [SPARK-4439] [MLlib] add python api for random forest · 1c53a5db
      Davies Liu authored
      ```
          class RandomForestModel
           |  A model trained by RandomForest
           |
           |  numTrees(self)
           |      Get number of trees in forest.
           |
           |  predict(self, x)
           |      Predict values for a single data point or an RDD of points using the model trained.
           |
           |  toDebugString(self)
           |      Full model
           |
           |  totalNumNodes(self)
           |      Get total number of nodes, summed over all trees in the forest.
           |
      
          class RandomForest
           |  trainClassifier(cls, data, numClassesForClassification, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='gini', maxDepth=4, maxBins=32, seed=None):
           |      Method to train a decision tree model for binary or multiclass classification.
           |
           |      :param data: Training dataset: RDD of LabeledPoint.
           |                   Labels should take values {0, 1, ..., numClasses-1}.
           |      :param numClassesForClassification: number of classes for classification.
           |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
           |                                  E.g., an entry (n -> k) indicates that feature n is categorical
           |                                  with k categories indexed from 0: {0, 1, ..., k-1}.
           |      :param numTrees: Number of trees in the random forest.
           |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
           |                                Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
           |                                If "auto" is set, this parameter is set based on numTrees:
           |                                  if numTrees == 1, set to "all";
           |                                  if numTrees > 1 (forest) set to "sqrt".
           |      :param impurity: Criterion used for information gain calculation.
           |                   Supported values: "gini" (recommended) or "entropy".
           |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
           |                       1 internal node + 2 leaf nodes. (default: 4)
           |      :param maxBins: maximum number of bins used for splitting features (default: 32)
           |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
           |      :return: RandomForestModel that can be used for prediction
           |
           |   trainRegressor(cls, data, categoricalFeaturesInfo, numTrees, featureSubsetStrategy='auto', impurity='variance', maxDepth=4, maxBins=32, seed=None):
           |      Method to train a decision tree model for regression.
           |
           |      :param data: Training dataset: RDD of LabeledPoint.
           |                   Labels are real numbers.
           |      :param categoricalFeaturesInfo: Map storing arity of categorical features.
           |                                   E.g., an entry (n -> k) indicates that feature n is categorical
           |                                   with k categories indexed from 0: {0, 1, ..., k-1}.
           |      :param numTrees: Number of trees in the random forest.
           |      :param featureSubsetStrategy: Number of features to consider for splits at each node.
           |                                 Supported: "auto" (default), "all", "sqrt", "log2", "onethird".
           |                                 If "auto" is set, this parameter is set based on numTrees:
           |                                 if numTrees == 1, set to "all";
           |                                 if numTrees > 1 (forest) set to "onethird".
           |      :param impurity: Criterion used for information gain calculation.
           |                       Supported values: "variance".
           |      :param maxDepth: Maximum depth of the tree. E.g., depth 0 means 1 leaf node; depth 1 means
           |                       1 internal node + 2 leaf nodes.(default: 4)
           |      :param maxBins: maximum number of bins used for splitting features (default: 32)
           |      :param seed:  Random seed for bootstrapping and choosing feature subsets.
           |      :return: RandomForestModel that can be used for prediction
           |
      ```
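
      The "auto" rule for featureSubsetStrategy and the maxDepth convention in the docstrings above can be written out as a small sketch. Both function names are hypothetical and are not part of the PySpark API; the logic only restates what the docstrings say.

```python
SUPPORTED_STRATEGIES = ("auto", "all", "sqrt", "log2", "onethird")


def resolve_feature_subset_strategy(strategy, num_trees, is_classification):
    """Resolve "auto" to a concrete strategy, following the docstring rules."""
    if strategy not in SUPPORTED_STRATEGIES:
        raise ValueError("unsupported featureSubsetStrategy: %r" % (strategy,))
    if strategy != "auto":
        return strategy
    if num_trees == 1:
        # A single tree considers every feature at each node.
        return "all"
    # A forest uses "sqrt" for classification and "onethird" for regression.
    return "sqrt" if is_classification else "onethird"


def max_nodes_for_depth(max_depth):
    """Upper bound on nodes in a binary tree of the given depth.

    Depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
    """
    return 2 ** (max_depth + 1) - 1
```

      With the default maxDepth of 4, a single tree therefore has at most 31 nodes.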
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3320 from davies/forest and squashes the following commits:
      
      8003dfc [Davies Liu] reorder
      53cf510 [Davies Liu] fix docs
      4ca593d [Davies Liu] fix docs
      e0df852 [Davies Liu] fix docs
      0431746 [Davies Liu] rebased
      2b6f239 [Davies Liu] Merge branch 'master' of github.com:apache/spark into forest
      885abee [Davies Liu] address comments
      dae7fc0 [Davies Liu] address comments
      89a000f [Davies Liu] fix docs
      565d476 [Davies Liu] add python api for random forest
      1c53a5db
  8. Nov 12, 2014
    • [SPARK-4369] [MLLib] fix TreeModel.predict() with RDD · bd86118c
      Davies Liu authored
      Fix TreeModel.predict() with an RDD, and add tests for it.
      
      (Also checked that other models don't have this issue)
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3230 from davies/predict and squashes the following commits:
      
      81172aa [Davies Liu] fix predict
      bd86118c
  9. Oct 31, 2014
    • [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API · 872fc669
      Davies Liu authored
      Create several helper functions that call the MLlib Java API, converting the arguments to Java types and the return value to a Python object automatically. This greatly simplifies serialization in the MLlib Python API.
      
      After this, the MLlib Python API no longer needs to deal with serialization details, so it is easier to add new APIs.
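
      The shape of such a helper can be sketched without a JVM. Everything below is a hypothetical stand-in: `to_java`, `to_python`, `call_java_func`, and `fake_java_scale` are invented for illustration, and a tagged tuple plays the role of a Java collection. The real helpers in PySpark go through Py4J and are not reproduced here.

```python
def to_java(obj):
    """Pretend conversion of a Python argument to a 'Java' value."""
    if isinstance(obj, (list, tuple)):
        return ("JavaList", [to_java(item) for item in obj])
    return obj


def to_python(obj):
    """Pretend conversion of a 'Java' return value back to Python."""
    if isinstance(obj, tuple) and len(obj) == 2 and obj[0] == "JavaList":
        return [to_python(item) for item in obj[1]]
    return obj


def call_java_func(func, *args):
    """Convert the arguments, call the function, convert the result."""
    return to_python(func(*[to_java(arg) for arg in args]))


def fake_java_scale(java_list, factor):
    # Stand-in for a JVM-side method that only understands 'Java' values.
    tag, items = java_list
    return (tag, [item * factor for item in items])


scaled = call_java_func(fake_java_scale, [1, 2, 3], 2)
```

      Because conversion lives in one place, a new API wrapper only has to name the function it calls and pass its arguments through.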
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2995 from davies/cleanup and squashes the following commits:
      
      8fa6ec6 [Davies Liu] address comments
      16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
      43743e5 [Davies Liu] bugfix
      731331f [Davies Liu] simplify serialization in MLlib Python API
      872fc669
  10. Oct 20, 2014
    • [SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree more adaptively · eadc4c59
      Qiping Li authored
      DecisionTree splits on continuous features by choosing an array of values from a subsample of the data.
      Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. In this PR, we choose splits for a continuous feature in 3 steps:
      
      1. Sort the sample values for this feature.
      2. Count the occurrences of each distinct value.
      3. Iterate over the value-count array computed in step 2 to choose splits.
      
      After the splits are found, `numSplits` and `numBins` in the metadata are updated.
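
      The three steps can be sketched in plain Python. This is a minimal illustration, not the Scala code from this PR: `find_splits` is a hypothetical name, and the stride-based rule used in step 3 to decide where a split goes is an assumption that the commit message does not spell out.

```python
from collections import Counter


def find_splits(samples, num_splits):
    """Choose distinct split values for one continuous feature."""
    # Steps 1 and 2: sorted (value, count) pairs for each distinct value.
    value_counts = sorted(Counter(samples).items())

    # With few distinct values, every distinct value becomes a split.
    if len(value_counts) <= num_splits:
        return [value for value, _ in value_counts]

    # Step 3 (assumed rule): walk the counts and place a split each time the
    # running count passes the next multiple of the stride.
    stride = len(samples) / (num_splits + 1)
    splits = []
    target = stride
    current = value_counts[0][1]
    for i in range(1, len(value_counts)):
        previous = current
        current += value_counts[i][1]
        # Split at the previous value if it is closer to the target count.
        if abs(previous - target) < abs(current - target):
            splits.append(value_counts[i - 1][0])
            target += stride
    return splits


splits = find_splits([1, 1, 1, 2, 3, 3, 4, 5, 5, 5], num_splits=2)
```

      Because the loop walks distinct values rather than raw samples, repeated values in the subsample cannot produce duplicate splits.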
      
      CC: mengxr manishamde jkbradley, please help me review this, thanks.
      
      Author: Qiping Li <liqiping1991@gmail.com>
      Author: chouqin <liqiping1991@gmail.com>
      Author: liqi <liqiping1991@gmail.com>
      Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
      
      Closes #2780 from chouqin/dt-findsplits and squashes the following commits:
      
      18d0301 [Qiping Li] check explicitly findsplits return distinct splits
      8dc28ab [chouqin] remove blank lines
      ffc920f [chouqin] adjust code based on comments and add more test cases
      9857039 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
      d353596 [qiping.lqp] fix pyspark doc test
      9e64699 [Qiping Li] fix random forest unit test
      3c72913 [Qiping Li] fix random forest unit test
      092efcb [Qiping Li] fix bug
      f69f47f [Qiping Li] fix bug
      ab303a4 [Qiping Li] fix bug
      af6dc97 [Qiping Li] fix bug
      2a8267a [Qiping Li] fix bug
      c339a61 [Qiping Li] fix bug
      369f812 [Qiping Li] fix style
      8f46af6 [Qiping Li] add comments and unit test
      9e7138e [Qiping Li] Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into dt-findsplits
      1b25a35 [Qiping Li] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
      0cd744a [liqi] fix bug
      3652823 [Qiping Li] fix bug
      af7cb79 [Qiping Li] Choose splits for continuous features in DecisionTree more adaptively
      eadc4c59
  11. Oct 16, 2014
    • [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode · 091d32c5
      Davies Liu authored
      Customized picklers must be registered before unpickling, but on an executor there is no way to register the picklers before the tasks run.
      
      So we need to register the picklers inside the tasks themselves: duplicate javaToPython() and pythonToJava() in MLlib, and call SerDe.initialize() before pickling or unpickling.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2830 from davies/fix_pickle and squashes the following commits:
      
      0c85fb9 [Davies Liu] revert the privacy change
      6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
      0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
      091d32c5
  12. Oct 06, 2014
    • [SPARK-3773][PySpark][Doc] Sphinx build warning · 2300eb58
      cocoatomo authored
      Building the Sphinx documents for PySpark produces 12 warnings.
      Most of them are caused by docstrings in broken reST format.
      
      To reproduce this issue, run the following commands on commit 6e27cb63.
      
      ```bash
      $ cd ./python/docs
      $ make clean html
      ...
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: missing attribute mentioned in :members: or __all__: module pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation.
      ...
      checking consistency... /Users/<user>/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document isn't included in any toctree
      ...
      copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
      ...
      build succeeded, 12 warnings.
      ```
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2653 from cocoatomo/issues/3773-sphinx-build-warnings and squashes the following commits:
      
      6f65661 [cocoatomo] [SPARK-3773][PySpark][Doc] Sphinx build warning
      2300eb58
  13. Oct 01, 2014
    • [SPARK-3751] [mllib] DecisionTree: example update + print options · 7bf6cc97
      Joseph K. Bradley authored
      DecisionTreeRunner functionality additions:
      * Allow user to pass in a test dataset
      * Do not print full model if the model is too large.
      
      As part of this, modify DecisionTreeModel and RandomForestModel to allow printing less info.  Proposed updates:
      * toString: prints model summary
      * toDebugString: prints full model (named after RDD.toDebugString)
      
      Similar update to Python API:
      * __repr__() now prints a model summary
      * toDebugString() now prints the full model
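
      The split between a short summary and a full dump can be sketched as follows. `TinyTreeModel` is a hypothetical class written for illustration; it is not the PySpark DecisionTreeModel, and the summary wording is an assumption.

```python
class TinyTreeModel:
    """Hypothetical model holding a depth, a node count, and a text dump."""

    def __init__(self, depth, num_nodes, description_lines):
        self._depth = depth
        self._num_nodes = num_nodes
        self._lines = list(description_lines)

    def __repr__(self):
        # One-line summary: cheap to print even when the model is huge.
        return "TinyTreeModel of depth %d with %d nodes" % (
            self._depth, self._num_nodes)

    def toDebugString(self):
        # Full model: the summary followed by one line per description line.
        return "\n".join([repr(self)] + ["  " + line for line in self._lines])


model = TinyTreeModel(
    depth=1,
    num_nodes=3,
    description_lines=[
        "If (feature 0 <= 0.5)",
        " Predict: 0.0",
        "Else (feature 0 > 0.5)",
        " Predict: 1.0",
    ],
)
summary = repr(model)
full = model.toDebugString()
```

      Keeping `__repr__` short means an interactive shell never floods the terminal when a large model is echoed by accident.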
      
      CC: mengxr  chouqin manishamde codedeft  Small update (whoever can take a look).  Thanks!
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2604 from jkbradley/dtrunner-update and squashes the following commits:
      
      b2b3c60 [Joseph K. Bradley] re-added python sql doc test, temporarily removed before
      07b1fae [Joseph K. Bradley] repr() now prints a model summary toDebugString() now prints the full model
      1d0d93d [Joseph K. Bradley] Updated DT and RF to print less when toString is called. Added toDebugString for verbose printing.
      22eac8c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
      e007a95 [Joseph K. Bradley] Updated DecisionTreeRunner to accept a test dataset.
      7bf6cc97
  14. Sep 19, 2014
    • [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib · fce5e251
      Davies Liu authored
      Currently, we serialize the data between the JVM and Python manually, case by case. This does not scale to the many APIs in MLlib.
      
      This patch tries to address the problem by serializing the data with the pickle protocol, using the Pyrolite library to serialize and deserialize in the JVM. The pickle protocol can easily be extended to support customized classes.
      
      All the modules are refactored to use this protocol.
      
      Known issues: there will be some performance regression, in both CPU and memory, because the serialized data is larger.
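
      Both points, extending pickle to a custom class and the larger serialized size, can be seen in a small Python sketch. `Rating` below is a hypothetical stand-in class, and the fixed-width `struct` layout is an assumption used only as a size baseline; it is not the manual format that MLlib used before this patch.

```python
import pickle
import struct


class Rating:
    """Hypothetical stand-in for a (user, product, rating) record."""

    def __init__(self, user, product, rating):
        self.user = user
        self.product = product
        self.rating = rating

    def __reduce__(self):
        # Extend pickle: rebuild the object by calling Rating(user, product, rating).
        return (Rating, (self.user, self.product, self.rating))


original = Rating(1, 2, 4.5)
pickled = pickle.dumps(original, protocol=2)
restored = pickle.loads(pickled)

# Hand-written fixed layout for comparison: two 32-bit ints and one double.
packed = struct.pack("<iid", original.user, original.product, original.rating)
```

      The pickled form carries the module and class name in addition to the field values, which is why it is larger than the 16-byte packed layout.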
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2378 from davies/pickle_mllib and squashes the following commits:
      
      dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
      810f97f [Davies Liu] fix equal of matrix
      032cd62 [Davies Liu] add more type check and conversion for user_product
      bd738ab [Davies Liu] address comments
      e431377 [Davies Liu] fix cache of rdd, refactor
      19d0967 [Davies Liu] refactor Picklers
      2511e76 [Davies Liu] cleanup
      1fccf1a [Davies Liu] address comments
      a2cc855 [Davies Liu] fix tests
      9ceff73 [Davies Liu] test size of serialized Rating
      44e0551 [Davies Liu] fix cache
      a379a81 [Davies Liu] fix pickle array in python2.7
      df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
      154d141 [Davies Liu] fix autobatchedpickler
      44736d7 [Davies Liu] speed up pickling array in Python 2.7
      e1d1bfc [Davies Liu] refactor
      708dc02 [Davies Liu] fix tests
      9dcfb63 [Davies Liu] fix style
      88034f0 [Davies Liu] rafactor, address comments
      46a501e [Davies Liu] choose batch size automatically
      df19464 [Davies Liu] memorize the module and class name during pickleing
      f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
      722dd96 [Davies Liu] cleanup _common.py
      0ee1525 [Davies Liu] remove outdated tests
      b02e34f [Davies Liu] remove _common.py
      84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
      4d7963e [Davies Liu] remove muanlly serialization
      6d26b03 [Davies Liu] fix tests
      c383544 [Davies Liu] classification
      f2a0856 [Davies Liu] mllib/regression
      d9f691f [Davies Liu] mllib/util
      cccb8b1 [Davies Liu] mllib/tree
      8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
      aa2287e [Davies Liu] random
      f1544c4 [Davies Liu] refactor clustering
      52d1350 [Davies Liu] use new protocol in mllib/stat
      b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      fce5e251
  15. Sep 15, 2014
    • [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params... · fdb302f4
      qiping.lqp authored
      [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params to example and Python API
      
      Added minInstancesPerNode, minInfoGain params to:
      * DecisionTreeRunner.scala example
      * Python API (tree.py)
      
      Also:
      * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
      * small style fixes
      
      CC: mengxr
      
      Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      Author: chouqin <liqiping1991@gmail.com>
      
      Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:
      
      61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
      a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
      95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
      e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
      19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
      f1d11d1 [chouqin] fix typo
      c7ebaf1 [chouqin] fix typo
      39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
      c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
      0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
      d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
      efcc736 [qiping.lqp] fix bug
      10b8012 [qiping.lqp] fix style
      6728fad [qiping.lqp] minor fix: remove empty lines
      bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
      cadd569 [qiping.lqp] add api docs
      46b891f [qiping.lqp] fix bug
      e72c7e4 [qiping.lqp] add comments
      845c6fa [qiping.lqp] fix style
      f195e83 [qiping.lqp] fix style
      987cbf4 [qiping.lqp] fix bug
      ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
      ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
      fdb302f4
  16. Sep 08, 2014
    • [SPARK-3443][MLLIB] update default values of tree: · 50a4fa77
      Xiangrui Meng authored
      Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :
      
      1. maxMemoryInMB: 128 -> 256
      2. maxBins: 100 -> 32
      3. maxDepth: 4 -> 5 (in some example code)
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #2322 from mengxr/tree-defaults and squashes the following commits:
      
      cda453a [Xiangrui Meng] fix tests
      5900445 [Xiangrui Meng] update comments
      8c81831 [Xiangrui Meng] update default values of tree:
      50a4fa77
  17. Sep 03, 2014
    • [SPARK-3309] [PySpark] Put all public API in __all__ · 6481d274
      Davies Liu authored
      Put all public APIs in __all__, and also expose them in pyspark.__init__.py, so that `pydoc pyspark` shows the documentation for the whole public API. __all__ can also be used by other programs (such as Sphinx or Epydoc) to generate documentation for public APIs only.
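
      The effect of `__all__` on what a module exports can be checked with a small self-contained sketch. The module names and contents below are invented for illustration, and `star_import_names` is a hypothetical helper that builds a throwaway module in memory and reports what a star import brings in.

```python
import sys
import types


def star_import_names(module_name, source):
    """Create a module from source and list what `from m import *` exports."""
    module = types.ModuleType(module_name)
    exec(source, module.__dict__)
    sys.modules[module_name] = module  # make it importable by name
    namespace = {}
    exec("from %s import *" % module_name, namespace)
    return sorted(name for name in namespace if not name.startswith("__"))


BODY = (
    "class DecisionTree: pass\n"
    "class _Helper: pass\n"
    "def internal_util(): pass\n"
)

# Without __all__, every name that lacks a leading underscore is exported.
without_all = star_import_names("demo_tree_plain", BODY)

# With __all__, only the listed names are exported.
with_all = star_import_names(
    "demo_tree_public", '__all__ = ["DecisionTree"]\n' + BODY)
```

      Listing names explicitly keeps helpers such as `internal_util` out of the public surface even though they have no leading underscore.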
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2205 from davies/public and squashes the following commits:
      
      c6c5567 [Davies Liu] fix message
      f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
      7e3016a [Davies Liu] add __all__ in mllib
      6281b48 [Davies Liu] fix doc for SchemaRDD
      6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
      6481d274
  18. Aug 18, 2014
    • [mllib] DecisionTree: treeAggregate + Python example bug fix · 115eeb30
      Joseph K. Bradley authored
      Small DecisionTree updates:
      * Changed main DecisionTree aggregate to treeAggregate.
      * Fixed bug in python example decision_tree_runner.py with missing argument (since categoricalFeaturesInfo is no longer an optional argument for trainClassifier).
      * Fixed same bug in python doc tests, and added tree.py to doc tests.
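
      The idea behind switching from aggregate to treeAggregate can be sketched in plain Python: partial results are combined in several small rounds instead of all at once in one place. `tree_aggregate` below is a hypothetical local function over lists of lists; it is not Spark's implementation, and `fan_in` is an invented parameter.

```python
from functools import reduce


def tree_aggregate(partitions, zero, seq_op, comb_op, fan_in=2):
    """Aggregate each partition, then merge partial results level by level.

    `zero` should be immutable (or never mutated by seq_op), because the same
    value is used as the starting point for every partition.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    # Per-partition aggregation.
    partials = [reduce(seq_op, partition, zero) for partition in partitions]
    if not partials:
        return zero
    # Merge groups of `fan_in` partial results until one value is left.
    while len(partials) > 1:
        partials = [
            reduce(comb_op, partials[i:i + fan_in])
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]


total = tree_aggregate(
    partitions=[[1, 2], [3], [4, 5, 6], []],
    zero=0,
    seq_op=lambda acc, x: acc + x,
    comb_op=lambda a, b: a + b,
)
```

      With many partitions and large partial results, merging in rounds spreads the combine work instead of sending every partial result to a single place.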
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #2015 from jkbradley/dt-opt2 and squashes the following commits:
      
      b5114fa [Joseph K. Bradley] Fixed python tree.py doc test (extra newline)
      8e4665d [Joseph K. Bradley] Added tree.py to python doc tests.  Fixed bug from missing categoricalFeaturesInfo argument.
      b7b2922 [Joseph K. Bradley] Fixed bug in python example decision_tree_runner.py with missing argument.  Changed main DecisionTree aggregate to treeAggregate.
      85bbc1f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      66d076f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      a0ed0da [Joseph K. Bradley] Renamed DTMetadata to DecisionTreeMetadata.  Small doc updates.
      3726d20 [Joseph K. Bradley] Small code improvements based on code review.
      ac0b9f8 [Joseph K. Bradley] Small updates based on code review. Main change: Now using << instead of math.pow.
      db0d773 [Joseph K. Bradley] scala style fix
      6a38f48 [Joseph K. Bradley] Added DTMetadata class for cleaner code
      931a3a7 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt2
      797f68a [Joseph K. Bradley] Fixed DecisionTreeSuite bug for training second level.  Needed to update treePointToNodeIndex with groupShift.
      f40381c [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      5f2dec2 [Joseph K. Bradley] Fixed scalastyle issue in TreePoint
      6b5651e [Joseph K. Bradley] Updates based on code review.  1 major change: persisting to memory + disk, not just memory.
      2d2aaaf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      26d10dd [Joseph K. Bradley] Removed tree/model/Filter.scala since no longer used.  Removed debugging println calls in DecisionTree.scala.
      356daba [Joseph K. Bradley] Merge branch 'dt-opt1' into dt-opt2
      430d782 [Joseph K. Bradley] Added more debug info on binning error.  Added some docs.
      d036089 [Joseph K. Bradley] Print timing info to logDebug.
      e66f1b1 [Joseph K. Bradley] TreePoint * Updated doc * Made some methods private
      8464a6e [Joseph K. Bradley] Moved TimeTracker to tree/impl/ in its own file, and cleaned it up.  Removed debugging println calls from DecisionTree.  Made TreePoint extend Serializable
      a87e08f [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt1
      c1565a5 [Joseph K. Bradley] Small DecisionTree updates: * Simplification: Updated calculateGainForSplit to take aggregates for a single (feature, split) pair. * Internal doc: findAggForOrderedFeatureClassification
      b914f3b [Joseph K. Bradley] DecisionTree optimization: eliminated filters + small changes
      b2ed1f3 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-opt
      0f676e2 [Joseph K. Bradley] Optimizations + Bug fix for DecisionTree
      3211f02 [Joseph K. Bradley] Optimizing DecisionTree * Added TreePoint representation to avoid calling findBin multiple times. * (not working yet, but debugging)
      f61e9d2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      bcf874a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      511ec85 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-timing
      a95bc22 [Joseph K. Bradley] timing for DecisionTree internals
      115eeb30
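
Commit ac0b9f8 above mentions replacing `math.pow` with `<<` for powers of two. The change itself was made in the Scala sources; the sketch below is only a Python illustration of the general idea, and the function names `pow2_float` and `pow2_shift` are hypothetical. It shows that both forms agree for small exponents while the shift stays in integer arithmetic.

```python
import math


def pow2_float(n):
    # math.pow always returns a float, so an explicit int() cast is needed
    # before the result can be used as a count or an index.
    return int(math.pow(2, n))


def pow2_shift(n):
    # A left shift computes 2**n without leaving integer arithmetic.
    return 1 << n


# Both forms agree for the small exponents checked here.
for n in range(0, 11):
    assert pow2_float(n) == pow2_shift(n)

print(type(math.pow(2, 3)).__name__)  # float
print(type(1 << 3).__name__)          # int
```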
  19. Aug 07, 2014
    • [SPARK-2851] [mllib] DecisionTree Python consistency update · 47ccd5e7
      Joseph K. Bradley authored
      Added 6 static train methods to match the Python API. They take no default arguments; the Python default args are noted in the docs.
      
      Added factory classes for Algo and Impurity, but made private[mllib].
      
      CC: mengxr dorx  Please let me know if there are other changes which would help with API consistency---thanks!
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1798 from jkbradley/dt-python-consistency and squashes the following commits:
      
      6f7edf8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
      a0d7dbe [Joseph K. Bradley] DecisionTree: In Java-friendly train* methods, changed to use JavaRDD instead of RDD.
      ee1d236 [Joseph K. Bradley] DecisionTree API updates: * Removed train() function in Python API (tree.py) ** Removed corresponding function in Scala/Java API (the ones taking basic types)
      00f820e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-python-consistency
      fe6dbfa [Joseph K. Bradley] removed unnecessary imports
      e358661 [Joseph K. Bradley] DecisionTree API change: * Added 6 static train methods to match Python API, but without default arguments (but with Python default args noted in docs).
      c699850 [Joseph K. Bradley] a few doc comments
      eaf84c0 [Joseph K. Bradley] Added DecisionTree static train() methods API to match Python, but without default parameters
      47ccd5e7
  20. Aug 06, 2014
    • [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
      d614967b
  21. Aug 02, 2014
    • [SPARK-2478] [mllib] DecisionTree Python API · 3f67382e
      Joseph K. Bradley authored
      Added experimental Python API for Decision Trees.
      
      API:
      * class DecisionTreeModel
      ** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
      ** numNodes()
      ** depth()
      ** __str__()
      * class DecisionTree
      ** trainClassifier()
      ** trainRegressor()
      ** train()
      
      Examples and testing:
      * Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
      * Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
      
      Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
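
The difference between the two checks can be sketched as follows. The classes `RDD` and `CustomRDD` are hypothetical stand-ins defined only for this illustration (they are not the PySpark classes), and the helper names are likewise assumptions; the actual code in `_linear_predictor_typecheck` may look different.

```python
class RDD:
    """Hypothetical placeholder for a distributed dataset type."""


class CustomRDD(RDD):
    """Hypothetical subclass, standing in for any RDD subclass."""


def is_rdd_exact(x):
    # Exact type comparison: an instance of a subclass is rejected.
    return type(x) is RDD


def is_rdd(x):
    # isinstance() follows the inheritance chain, so subclasses pass too.
    return isinstance(x, RDD)


sub = CustomRDD()
print(is_rdd_exact(sub))  # False
print(is_rdd(sub))        # True
```

With the exact-type check, an input of a subclass type would not be recognized as an RDD and would fall through to whatever branch handles other inputs.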
      
      CC mengxr manishamde
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:
      
      3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py. Small updates based on GitHub review.
      6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
      67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
      aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
      fa10ea7 [Joseph K. Bradley] Small style update
      7968692 [Joseph K. Bradley] small braces typo fix
      e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
      db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
      6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      93953f1 [Joseph K. Bradley] Likely done with Python API.
      6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
      188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
      2b20c61 [Joseph K. Bradley] Small doc and style updates
      1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
      8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
      376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
      e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
      52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
      8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
      cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      2283df8 [Joseph K. Bradley] 2 bug fixes.
      73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
      f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
      3f67382e
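
Commit 225822f above contrasts two upper bounds for a categorical feature: `math.pow(2, featureCategories.toInt - 1) - 1`, which it attributes to unordered features, and the arity, which it says applies to ordered ones. As a rough Python sketch (the function names are hypothetical and this is not the MLlib implementation), the first quantity is the number of ways to divide the categories into two non-empty groups, and it grows much faster than the arity:

```python
from itertools import combinations


def num_unordered_splits(arity):
    # Ways to divide `arity` categories into two non-empty groups,
    # ignoring which group is called "left": 2^(arity - 1) - 1.
    return (1 << (arity - 1)) - 1


def enumerate_unordered_splits(arity):
    # Brute-force check of the formula. Category 0 is pinned to the left
    # group so that mirror-image splits are not counted twice.
    cats = list(range(arity))
    rest = cats[1:]
    splits = []
    # k < len(rest) keeps at least one category in the right group.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = {0, *extra}
            right = set(cats) - left
            splits.append((left, right))
    return splits


for arity in range(2, 7):
    assert len(enumerate_unordered_splits(arity)) == num_unordered_splits(arity)
    # Print the arity next to the unordered bound to show how they diverge.
    print(arity, num_unordered_splits(arity))
```

For small arities the two bounds are close, which may be one reason a mix-up like this can go unnoticed in tests with few categories.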