  1. Mar 29, 2016
    • [SPARK-14071][PYSPARK][ML] Change MLWritable.write to be a property · 63b200e8
      wm624@hotmail.com authored
      Make MLWritable.write a property, so we can use .write instead of .write().
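
As a rough illustration of the change described above, here is a minimal pure-Python sketch of a `write` attribute exposed through `property`. `SketchWriter` and `SketchWritable` are hypothetical stand-ins, not the PySpark classes, and the path is made up.

```python
class SketchWriter(object):
    """Hypothetical stand-in for a writer; it only records the target path."""

    def __init__(self, instance):
        self.instance = instance
        self.saved_path = None

    def save(self, path):
        # A real writer would persist the instance; this sketch just remembers the path.
        self.saved_path = path


class SketchWritable(object):
    """Hypothetical mixin whose `write` is a property rather than a method."""

    @property
    def write(self):
        # A fresh writer is returned on every attribute access.
        return SketchWriter(self)


model = SketchWritable()
writer = model.write              # attribute access, no parentheses
writer.save("/tmp/sketch-model")
```

With a property, `model.write.save(path)` reads the same as a method chain but without the call parentheses on `write`.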
      
      Add a new test to ml/tests.py to check whether write is a property.
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (11s)
      Finished test(python2.7): pyspark.ml.clustering (16s)
      Finished test(python2.7): pyspark.ml.classification (24s)
      Finished test(python2.7): pyspark.ml.recommendation (24s)
      Finished test(python2.7): pyspark.ml.feature (39s)
      Finished test(python2.7): pyspark.ml.regression (26s)
      Finished test(python2.7): pyspark.ml.tuning (15s)
      Finished test(python2.7): pyspark.ml.tests (30s)
      Tests passed in 55 seconds
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #11945 from wangmiao1981/fix_property.
  2. Mar 24, 2016
  3. Mar 23, 2016
    • [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params · 30bdb5cb
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds type conversion functionality for parameters in PySpark. A `typeConverter` field is added to the constructor of the `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial because params can fail at set time if they are given inappropriate values, and even more so because coherent error messages are now provided when Py4J cannot cast the Python type to the appropriate Java type.
      
      This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
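
The idea of a per-param converter that runs at set time can be sketched in plain Python as follows. This is only an illustration under assumptions: `SketchParam`, `SketchParams` and `to_float` are hypothetical names, and the constructor signature here is not the real PySpark one.

```python
class SketchParam(object):
    """Hypothetical stand-in for a Param that carries a type converter."""

    def __init__(self, name, doc, typeConverter=None):
        self.name = name
        self.doc = doc
        # Fall back to the identity function when no converter is supplied.
        self.typeConverter = typeConverter if typeConverter is not None else (lambda value: value)


def to_float(value):
    """Hypothetical converter: accept ints and floats, reject everything else."""
    # bool is a subclass of int, so it is excluded explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("Could not convert %r to float" % (value,))
    return float(value)


class SketchParams(object):
    """Hypothetical holder that converts values as they are set."""

    def __init__(self):
        self._paramMap = {}

    def set(self, param, value):
        # Conversion runs at set time, so an unsuitable value fails here
        # rather than later when it is handed to the JVM.
        self._paramMap[param.name] = param.typeConverter(value)
        return self


p = SketchParam("p", "norm order", typeConverter=to_float)
holder = SketchParams().set(p, 2)   # the int 2 is stored as a float

try:
    holder.set(p, "two")
    rejected = False
except TypeError:
    rejected = True
```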
      
      ## How was this patch tested?
      
      Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11663 from sethah/SPARK-13068-tc.
  4. Mar 22, 2016
    • [SPARK-13951][ML][PYTHON] Nested Pipeline persistence · 7e3423b9
      Joseph K. Bradley authored
      Adds support for saving and loading nested ML Pipelines from Python.  Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.
      
      Also:
      * Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
      * Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java
      
      Added new unit test for nested Pipelines.  Abstracted validity check into a helper method for the 2 unit tests.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11866 from jkbradley/nested-pipeline-io.
      Closes #11835
  5. Mar 16, 2016
    • [SPARK-13034] PySpark ml.classification support export/import · 27e1f388
      GayathriMurali authored
      ## What changes were proposed in this pull request?
      
      Add export/import for all estimators and transformers (which have a Scala implementation) under pyspark/ml/classification.py.
      
      ## How was this patch tested?
      
      ./python/run-tests
      ./dev/lint-python
      Unit tests added to check persistence in Logistic Regression
      
      Author: GayathriMurali <gayathri.m.softie@gmail.com>
      
      Closes #11707 from GayathriMurali/SPARK-13034.
    • [SPARK-13038][PYSPARK] Add load/save to pipeline · ae6c677c
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038
      
      1. Add load/save to PySpark Pipeline and PipelineModel
      
      2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.
      
      ## How was this patch tested?
      
      Test with doctest.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #11683 from yinxusen/SPARK-13038-only.
  6. Mar 08, 2016
    • [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a property when getting param list · d8813fa0
      Bryan Cutler authored
      
      ## What changes were proposed in this pull request?
      
      Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance.  This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.
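
A minimal sketch of this kind of check, in plain Python: the attribute is looked up on the class first, so a `property` can be recognised and skipped without running its getter. `SketchParam` and `SketchHolder` are hypothetical names used only for illustration.

```python
class SketchParam(object):
    """Hypothetical stand-in for a Param."""

    def __init__(self, name):
        self.name = name


class SketchHolder(object):
    good = SketchParam("good")

    @property
    def explosive(self):
        raise RuntimeError("this property must not be evaluated")

    def params(self):
        found = []
        for attr in dir(self):
            # Look the name up on the class: a property object lives there and
            # can be detected without invoking its getter.
            if isinstance(getattr(type(self), attr, None), property):
                continue
            value = getattr(self, attr)
            if isinstance(value, SketchParam):
                found.append(value)
        return found


names = [p.name for p in SketchHolder().params()]
```

Without the `isinstance(..., property)` guard, `getattr(self, "explosive")` would run the getter and raise.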
      
      ## How was this patch tested?
      
      Added a test case with a class that has a property that raises an error when invoked, and then called `Param.params` to verify that the property is not invoked while another property in the class can still be found.  Also ran the pyspark-ml tests before the fix, which triggered an error, and again after the fix to verify that the error was resolved and the method was working properly.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
  7. Mar 03, 2016
    • [SPARK-12877][ML] Add train-validation-split to pyspark · 511d4929
      JeremyNixon authored
      ## What changes were proposed in this pull request?
      Adds train-validation-split to pyspark.ml.tuning.
      
      ## How was this patch tested?
      This patch was tested through unit tests located in pyspark/ml/tests.py.
      
      This is my original work and I license it to Spark.
      
      Author: JeremyNixon <jnixon2@gmail.com>
      
      Closes #11335 from JeremyNixon/tvs_pyspark.
  8. Feb 11, 2016
    • [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw an error · b3546738
      sethah authored
      Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
      
      In Python:
      ```python
      from pyspark.ml.classification import NaiveBayes
      nb = NaiveBayes()
      print(nb.hasParam("smoothing"))
      print(nb.hasParam("notAParam"))
      ```
      produces:
      > True
      > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
      
      However, in Scala:
      ```scala
      import org.apache.spark.ml.classification.NaiveBayes
      val nb  = new NaiveBayes()
      nb.hasParam("smoothing")
      nb.hasParam("notAParam")
      ```
      produces:
      > true
      > false
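
One way to get the Boolean behaviour shown for Scala is to look the name up with a default instead of letting the lookup raise. The following is a plain-Python sketch under that assumption; `SketchParam` and `SketchNaiveBayes` are hypothetical and this is not necessarily how the patch implements it.

```python
class SketchParam(object):
    """Hypothetical stand-in for a Param."""

    def __init__(self, name):
        self.name = name


class SketchNaiveBayes(object):
    smoothing = SketchParam("smoothing")

    def hasParam(self, paramName):
        # getattr with a default avoids the AttributeError for unknown names;
        # the isinstance check rejects attributes that are not params.
        candidate = getattr(self, paramName, None)
        return isinstance(candidate, SketchParam)


nb = SketchNaiveBayes()
known = nb.hasParam("smoothing")
unknown = nb.hasParam("notAParam")
```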
      
      cc holdenk
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #10962 from sethah/SPARK-13047.
    • [MINOR][ML][PYSPARK] Cleanup test cases of clustering.py · 2426eb3e
      Yanbo Liang authored
      Test cases should be removed from the docstrings of the ```setXXX``` functions; otherwise they become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).
      cc mengxr jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #10975 from yanboliang/clustering-cleanup.
  9. Jan 29, 2016
    • [SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example · e51b6eaa
      Yanbo Liang authored
      * Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
      * Make ```LinearRegression``` support ```save/load``` as an example. Once this is merged, the work for other transformers/estimators will be easy, and we can then list the tasks and distribute them to the community.
      
      cc mengxr jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10469 from yanboliang/spark-11939.
  10. Jan 26, 2016
    • [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code · eb917291
      Holden Karau authored
      The current Python ML params require cut-and-pasting the param setup and description between the class and ```__init__``` methods. Remove this possible source of errors and simplify the use of custom params by adding a ```_copy_new_parent``` method to param, so as to avoid cut-and-pasting (and cut-and-pasting at different indentation levels, urgh).
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
  11. Jan 20, 2016
  12. Jan 19, 2016
  13. Jan 06, 2016
    • [SPARK-7675][ML][PYSPARK] sparkml params type conversion · 3b29004d
      Holden Karau authored
      From JIRA:
      Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method.
      
      A possible fix will be to include a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available.
      
      This fix instead checks the types at set time, since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float; other conversions (like scipy matrix to array) are left for the future.
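
A set-time check of this kind might look roughly like the sketch below, which handles only the int-to-float case mentioned above. The function name `convert_if_needed` and its signature are hypothetical and are not taken from the patch.

```python
def convert_if_needed(value, expected_type):
    """Hypothetical set-time check modelled on the int-to-float case."""
    # No declared type, or the value already has exactly that type: keep it.
    if expected_type is None or type(value) is expected_type:
        return value
    # The only conversion sketched here is int -> float.
    if expected_type is float and type(value) is int:
        return float(value)
    raise TypeError("expected %s, got %r" % (expected_type.__name__, value))


p_value = convert_if_needed(2, float)   # an int given for a float-typed param
```

Failing inside the setter keeps the error close to the caller's mistake, instead of surfacing later when the value is copied to the Java side.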
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
  14. Oct 22, 2015
  15. Oct 20, 2015
  16. Sep 18, 2015
  17. Sep 01, 2015
  18. Aug 28, 2015
    • [SPARK-10188] [PYSPARK] Pyspark CrossValidator with RMSE selects incorrect model · 7583681e
      noelsmith authored
      * Added isLargerBetter() method to Pyspark Evaluator to match the Scala version.
      * JavaEvaluator delegates isLargerBetter() to underlying Scala object.
      * Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax.
      * Added test cases for where smaller is better (RMSE) and larger is better (R-Squared).
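
The selection logic in the list above can be sketched without Spark: pick the index of the largest metric when larger is better, and the smallest otherwise. `select_best` and the metric values are made up for illustration.

```python
def select_best(metrics, is_larger_better):
    """Return the index of the best metric value (argmax or argmin)."""
    indices = range(len(metrics))
    if is_larger_better:
        return max(indices, key=lambda i: metrics[i])
    return min(indices, key=lambda i: metrics[i])


rmse = [0.9, 0.4, 0.7]   # smaller is better
r2 = [0.1, 0.8, 0.5]     # larger is better

best_by_rmse = select_best(rmse, is_larger_better=False)
best_by_r2 = select_best(r2, is_larger_better=True)

# Always taking the argmax would pick the worst RMSE model.
wrong_by_rmse = select_best(rmse, is_larger_better=True)
```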
      
      (This contribution is my original work, and I license the work to the project under Spark's open source license.)
      
      Author: noelsmith <mail@noelsmith.com>
      
      Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
  19. Jun 29, 2015
    • [SPARK-8456] [ML] Ngram featurizer python · 620605a4
      Feynman Liang authored
      Python API for N-gram feature transformer
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #6960 from feynmanliang/ngram-featurizer-python and squashes the following commits:
      
      f9e37c9 [Feynman Liang] Remove debugging code
      4dd81f4 [Feynman Liang] Fix typo and doctest
      06c79ac [Feynman Liang] Style guide
      26c1175 [Feynman Liang] Add python NGram API
  20. May 20, 2015
    • [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random · 191ee474
      Holden Karau authored
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6139 from holdenk/SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random and squashes the following commits:
      
      591f8e5 [Holden Karau] specify old seed for doc tests
      2470004 [Holden Karau] Fix a bunch of seeds with default values to have None as the default which will then result in using the hash of the class name
      cbad96d [Holden Karau] Add the setParams function that is used in the real code
      423b8d7 [Holden Karau] Switch the test code to behave slightly more like production code. also don't check the param map value only check for key existence
      140d25d [Holden Karau] remove extra space
      926165a [Holden Karau] Add some missing newlines for pep8 style
      8616751 [Holden Karau] merge in master
      58532e6 [Holden Karau] its the __name__ method, also treat None values as not set
      56ef24a [Holden Karau] fix test and regenerate base
      afdaa5c [Holden Karau] make sure different classes have different results
      68eb528 [Holden Karau] switch default seed to hash of type of self
      89c4611 [Holden Karau] Merge branch 'master' into SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random
      31cd96f [Holden Karau] specify the seed to randomforestregressor test
      e1b947f [Holden Karau] Style fixes
      ce90ec8 [Holden Karau] merge in master
      bcdf3c9 [Holden Karau] update docstring seeds to none and some other default seeds from 42
      65eba21 [Holden Karau] pep8 fixes
      0e3797e [Holden Karau] Make seed default to random in more places
      213a543 [Holden Karau] Simplify the generated code to only include set default if there is a default rather than having None is not None in the generated code
      1ff17c2 [Holden Karau] Make the seed random for HasSeed in python
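
Several of the squashed commits above describe defaulting the seed to `None` and then falling back to a hash of the class name. A minimal sketch of that idea, with hypothetical class names, could look like this:

```python
class HasSeedSketch(object):
    """Hypothetical mixin: the seed defaults to a hash of the class name."""

    def __init__(self, seed=None):
        # None means "not set": derive the default from the concrete class name.
        self.seed = seed if seed is not None else hash(type(self).__name__)


class ForestSketch(HasSeedSketch):
    pass


a = ForestSketch()
b = ForestSketch()
c = ForestSketch(seed=42)   # an explicit seed always wins
```

Note that on Python 3, string hashing is randomized per process unless `PYTHONHASHSEED` is set, so a default derived this way is stable within one run but not necessarily across runs.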
  21. May 18, 2015
    • [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python · 9c7e802a
      Xiangrui Meng authored
      This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:
      
      1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
      2. Accept a list of param maps in `fit`.
      3. Use parent uid and name to identify param.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:
      
      413c463 [Xiangrui Meng] remove unnecessary doc
      4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      611c719 [Xiangrui Meng] fix python style
      68862b8 [Xiangrui Meng] update _java_obj initialization
      927ad19 [Xiangrui Meng] fix ml/tests.py
      0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
      9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
      c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
      7e0d27f [Xiangrui Meng] merge master
      46840fb [Xiangrui Meng] update wrappers
      b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
      46cb6ed [Xiangrui Meng] merge master
      a163413 [Xiangrui Meng] fix style
      1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      9630eae [Xiangrui Meng] fix Identifiable._randomUID
      13bd70a [Xiangrui Meng] update ml/tests.py
      64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
      02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
      66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
      7431272 [Joseph K. Bradley] Rebased with master
  22. May 10, 2015
    • [SPARK-7427] [PYSPARK] Make sharedParams match in Scala, Python · c5aca0c2
      Glenn Weidner authored
      Modified 2 files:
      python/pyspark/ml/param/_shared_params_code_gen.py
      python/pyspark/ml/param/shared.py
      
      Generated shared.py on Linux using Python 2.6.6 on Redhat Enterprise Linux Server 6.6.
      python _shared_params_code_gen.py > shared.py
      
      Only changed maxIter, regParam, rawPredictionCol based on strings from SharedParamsCodeGen.scala.  Note warning was displayed when committing shared.py:
      warning: LF will be replaced by CRLF in python/pyspark/ml/param/shared.py.
      
      Author: Glenn Weidner <gweidner@us.ibm.com>
      
      Closes #6023 from gweidner/br-7427 and squashes the following commits:
      
      db72e32 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      825e4a9 [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      e6a865e [Glenn Weidner] [SPARK-7427] [PySpark] Make sharedParams match in Scala, Python
      1eee702 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      1ac10e5 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      cafd104 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      9bea1eb [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      4a35c20 [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      9790cbe [Glenn Weidner] Merge remote-tracking branch 'upstream/master'
      d9c30f4 [Glenn Weidner] [SPARK-7275] [SQL] [WIP] Make LogicalRelation public
    • [SPARK-7431] [ML] [PYTHON] Made CrossValidatorModel call parent init in PySpark · 3038443e
      Joseph K. Bradley authored
      Fixes bug with PySpark cvModel not having UID
      Also made small PySpark fixes: Evaluator should inherit from Params.  MockModel should inherit from Model.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5968 from jkbradley/pyspark-cv-uid and squashes the following commits:
      
      57f13cd [Joseph K. Bradley] Made CrossValidatorModel call parent init in PySpark
  23. Apr 16, 2015
    • [SPARK-6893][ML] default pipeline parameter handling in python · 57cd1e86
      Xiangrui Meng authored
      Same as #5431 but for Python. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5534 from mengxr/SPARK-6893 and squashes the following commits:
      
      d3b519b [Xiangrui Meng] address comments
      ebaccc6 [Xiangrui Meng] style update
      fce244e [Xiangrui Meng] update explainParams with test
      4d6b07a [Xiangrui Meng] add tests
      5294500 [Xiangrui Meng] update default param handling in python
  24. Jan 28, 2015
    • [SPARK-4586][MLLIB] Python API for ML pipeline and parameters · e80dc1c5
      Xiangrui Meng authored
      This PR adds Python API for ML pipeline and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code.
      
      TODO:
      - [x] handle parameters in LRModel
      - [x] unit tests
      - [x] missing some docs
      
      CC: davies jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4151 from mengxr/SPARK-4586 and squashes the following commits:
      
      415268e [Xiangrui Meng] remove inherit_doc from __init__
      edbd6fe [Xiangrui Meng] move Identifiable to ml.util
      44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml
      dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      14ae7e2 [Davies Liu] fix docs
      54ca7df [Davies Liu] fix tests
      78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml
      fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      1dca16a [Davies Liu] refactor
      090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml
      0882513 [Xiangrui Meng] update doc style
      a4f4dbf [Xiangrui Meng] add unit test for LR
      7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer
      ba0ba1e [Xiangrui Meng] add unit tests for pipeline
      0586c7b [Xiangrui Meng] add more comments to the example
      5153cff [Xiangrui Meng] simplify java models
      036ca04 [Xiangrui Meng] gen numFeatures
      46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly
      1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc
      f66ba0c [Xiangrui Meng] make params a property
      d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute
      f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example
      05e3e40 [Xiangrui Meng] update example
      d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl
      56de571 [Xiangrui Meng] fix style
      d0c5bb8 [Xiangrui Meng] a working copy
      bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
      17ecfb9 [Xiangrui Meng] code gen for shared params
      d9ea77c [Xiangrui Meng] update doc
      c18dca1 [Xiangrui Meng] make the example working
      dadd84e [Xiangrui Meng] add base classes and docs
      a3015cf [Xiangrui Meng] add Estimator and Transformer
      46eea43 [Xiangrui Meng] a pipeline in python
      33b68e0 [Xiangrui Meng] a working LR