  1. Apr 25, 2016
    • [SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixture · b50e2eca
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Add Python API in ML for GaussianMixture
      
      ## How was this patch tested?
      
      Added doctests; the test cases are the same as the mllib Python tests.
      ./dev/lint-python
      PEP8 checks passed.
      rm -rf _build/*
      pydoc checks passed.
      
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (18s)
      Finished test(python2.7): pyspark.ml.clustering (40s)
      Finished test(python2.7): pyspark.ml.classification (49s)
      Finished test(python2.7): pyspark.ml.recommendation (44s)
      Finished test(python2.7): pyspark.ml.feature (64s)
      Finished test(python2.7): pyspark.ml.regression (45s)
      Finished test(python2.7): pyspark.ml.tuning (30s)
      Finished test(python2.7): pyspark.ml.tests (56s)
      Tests passed in 106 seconds
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12402 from wangmiao1981/gmm.
  2. Apr 20, 2016
    • [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This patch provides a first cut of Python APIs for structured streaming. This PR provides the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow up.
      
      This PR also contains some very minor doc fixes on the Scala side.
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
  3. Apr 18, 2016
    • [SPARK-14714][ML][PYTHON] Fixed issues with non-kwarg typeConverter arg for Param constructor · d29e429e
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      PySpark Param constructors need to pass the TypeConverter argument by name, partly to make sure it is not mistaken for the expectedType arg and partly because we will remove the expectedType arg in 2.1. In several places, this is not being done correctly.
      
      This PR changes all usages in pyspark/ml/ to keyword args.
      
      ## How was this patch tested?
      
      Existing unit tests.  I will not test type conversion for every Param unless we really think it necessary.
      
      Also, if you start the PySpark shell and import classes (e.g., pyspark.ml.feature.StandardScaler), then you no longer get this warning:
      ```
      /Users/josephkb/spark/python/pyspark/ml/param/__init__.py:58: UserWarning: expectedType is deprecated and will be removed in 2.1. Use typeConverter instead, as a keyword argument.
        "Use typeConverter instead, as a keyword argument.")
      ```
      That warning came from the typeConverter argument being passed as the expectedType arg by mistake.
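
To make the positional mix-up concrete, here is a minimal sketch. The `Param` class below is a simplified stand-in written for this illustration, not the PySpark source; it only assumes that `expectedType` precedes `typeConverter` in the constructor, as the description above implies. The `to_float` helper is hypothetical.

```python
import warnings


class Param(object):
    """Simplified stand-in for pyspark.ml.param.Param (not the real class)."""

    def __init__(self, parent, name, doc, expectedType=None, typeConverter=None):
        if expectedType is not None:
            warnings.warn("expectedType is deprecated; use typeConverter "
                          "as a keyword argument.")
        self.parent = parent
        self.name = name
        self.doc = doc
        # Fall back to the identity function when no converter is given.
        self.typeConverter = typeConverter if typeConverter is not None else (lambda v: v)


def to_float(value):
    """Hypothetical converter used only for this illustration."""
    return float(value)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Positional: to_float lands in the expectedType slot and is never applied.
    wrong = Param("parent", "tol", "convergence tolerance", to_float)
    # Keyword: to_float is bound to typeConverter as intended.
    right = Param("parent", "tol", "convergence tolerance", typeConverter=to_float)

positional_warnings = len(caught)
```

In this sketch only the positional call triggers the deprecation warning, and the param built that way silently keeps the identity converter.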
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12480 from jkbradley/typeconverter-fix.
  4. Apr 15, 2016
    • [SPARK-14104][PYSPARK][ML] All Python param setters should use the `_set` method · 129f2f45
      sethah authored
      ## What changes were proposed in this pull request?
      
      Param setters in Python previously accessed `_paramMap` directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.
      
      Additional changes:
      * [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
      * An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here.
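
The difference between writing to `_paramMap` directly and going through `_set` can be sketched as follows. This is a simplified model under stated assumptions, not the PySpark implementation: the `Param` and `Params` classes and the `to_int` converter are stand-ins written for this illustration.

```python
class Param(object):
    """Simplified stand-in for a PySpark Param (illustration only)."""

    def __init__(self, name, typeConverter):
        self.name = name
        self.typeConverter = typeConverter


def to_int(value):
    """Hypothetical converter: accept ints and integral floats, reject the rest."""
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise TypeError("Could not convert %r to int" % (value,))


class Params(object):
    """Holds parameter values; `_set` is meant to be the only writer of `_paramMap`."""

    def __init__(self):
        self._paramMap = {}
        self.k = Param("k", to_int)

    def _set(self, **kwargs):
        # Look up each param by name and convert the value before storing it.
        for name, value in kwargs.items():
            param = getattr(self, name)
            self._paramMap[param] = param.typeConverter(value)
        return self

    def setK(self, value):
        # Route through _set rather than writing self._paramMap[self.k] directly.
        return self._set(k=value)


checked = Params().setK(3.0)
stored = checked._paramMap[checked.k]

unchecked = Params()
unchecked._paramMap[unchecked.k] = "three"  # direct write: nothing validates it
```

Calling `Params().setK("three")` in this sketch raises `TypeError` at set time, while the direct write stores the bad value without complaint.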
      
      ## How was this patch tested?
      
      Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11939 from sethah/SPARK-14104.
  5. Apr 01, 2016
  6. Mar 23, 2016
    • [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params · 30bdb5cb
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds type conversion functionality for parameters in PySpark. A `typeConverter` field is added to the constructor of the `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the Python type to the appropriate Java type.
      
      This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
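
As a rough sketch of the idea, a converter class with factory-style static methods might look like the following. This is a simplified stand-in written for illustration, not the PySpark source; the exact conversion rules and error messages here are assumptions.

```python
class TypeConverters(object):
    """Simplified stand-in for the class described above (illustration only)."""

    @staticmethod
    def toInt(value):
        # Accept ints as-is and floats that hold an integral value.
        if isinstance(value, int):
            return int(value)
        if isinstance(value, float) and value.is_integer():
            return int(value)
        raise TypeError("Could not convert %s to int" % value)

    @staticmethod
    def toFloat(value):
        # Accept any int or float and normalise it to float.
        if isinstance(value, (int, float)):
            return float(value)
        raise TypeError("Could not convert %s to float" % value)


# A convertible value is normalised to the expected type.
max_iter = TypeConverters.toInt(20.0)

# An inconvertible value fails immediately, with a readable message.
try:
    TypeConverters.toInt("twenty")
    message = None
except TypeError as err:
    message = str(err)
```

The point of the design is that the failure happens in Python, at the moment the value is set, rather than later when the value crosses into the JVM.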
      
      ## How was this patch tested?
      
      Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11663 from sethah/SPARK-13068-tc.
  7. Mar 22, 2016
    • [SPARK-13951][ML][PYTHON] Nested Pipeline persistence · 7e3423b9
      Joseph K. Bradley authored
      Adds support for saving and loading nested ML Pipelines from Python.  Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.
      
      Also:
      * Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
      * Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java
      
      Added new unit test for nested Pipelines.  Abstracted validity check into a helper method for the 2 unit tests.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11866 from jkbradley/nested-pipeline-io.
      Closes #11835
  8. Mar 01, 2016
    • [SPARK-13008][ML][PYTHON] Put one alg per line in pyspark.ml all lists · 9495c40f
      Joseph K. Bradley authored
      This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top.  Since we keep it alphabetized, it often creates a lot more changes than needed.  It is also easy to add the Estimator and forget the Model.  I'm going to switch it to have one algorithm per line.
      
      This also alphabetizes a few out-of-place classes in pyspark.ml.feature.  No changes have been made to the moved classes.
      
      CC: thunterdb
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10927 from jkbradley/ml-python-all-list.
  9. Feb 20, 2016
  10. Feb 12, 2016
  11. Feb 11, 2016
  12. Jan 26, 2016
    • [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code · eb917291
      Holden Karau authored
      The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).
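
A minimal sketch of the pattern follows, assuming params are declared once at class level with a dummy parent and rebound to each instance at construction time. The classes and the `"undefined"` placeholder are stand-ins written for this illustration, not the PySpark source; only the method name `_copy_new_parent` comes from the description above.

```python
import copy


class Param(object):
    """Simplified stand-in for a PySpark Param (illustration only)."""

    def __init__(self, parent, name, doc):
        self.parent = parent
        self.name = name
        self.doc = doc

    def _copy_new_parent(self, parent):
        """Return a copy of this param bound to a new parent."""
        if self.parent != "undefined":
            raise ValueError("Cannot copy from non-dummy parent %s." % self.parent)
        param = copy.copy(self)
        param.parent = parent
        return param


class Params(object):
    def __init__(self):
        # Rebind every class-level Param to this instance, so subclasses
        # declare each param once instead of repeating it in __init__.
        for name in dir(type(self)):
            attr = getattr(type(self), name)
            if isinstance(attr, Param):
                setattr(self, name, attr._copy_new_parent(self))


class HypotheticalScaler(Params):
    # Declared once, at class level, with a dummy parent.
    withMean = Param("undefined", "withMean", "center data before scaling")


a = HypotheticalScaler()
b = HypotheticalScaler()
```

Each instance ends up with its own copy of `withMean`, while the name and doc string are written in exactly one place.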
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
  13. Jan 06, 2016
  14. Sep 17, 2015
  15. Aug 13, 2015
    • [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol · 68f99571
      Xiangrui Meng authored
      This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.
      
      This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.
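
To show what a `tol` parameter typically controls, here is a small single-run sketch of Lloyd's algorithm on 1-D data in plain Python. It assumes `tol` keeps the meaning `epsilon` had, namely a distance threshold below which the centers are considered converged; the function `kmeans_1d` and its arguments are hypothetical and unrelated to the Spark implementation.

```python
def kmeans_1d(points, centers, max_iter=20, tol=1e-4):
    """Single run of Lloyd's algorithm; stops once no center moves more than tol."""
    centers = list(centers)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [sum(c) / float(len(c)) if c else centers[i]
                       for i, c in enumerate(clusters)]
        shift = max(abs(n - o) for n, o in zip(new_centers, centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, iteration


centers, iterations = kmeans_1d([0.0, 1.0, 9.0, 10.0], centers=[0.0, 10.0])
```

Here the centers settle at 0.5 and 9.5, and the loop stops on the second iteration because no center moved.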
      
      jkbradley yu-iskw
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8148 from mengxr/SPARK-9918 and squashes the following commits:
      
      149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
      3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
      a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
  16. Aug 12, 2015
  17. Jul 17, 2015
    • [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines · 34a889db
      Yu ISHIKAWA authored
      I implemented the KMeans API for spark.ml Pipelines. It does not include the clustering abstractions for spark.ml (SPARK-7610); those fit better in a separate issue, which I'll try later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.
      
      [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:
      
      be752de [Yu ISHIKAWA] Add assertions
      a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
      4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
      fb2417c [Yu ISHIKAWA] Use getInt, instead of get
      f397be4 [Yu ISHIKAWA] Switch the comparisons.
      ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
      effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
      c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
      19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
      1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
      f8338bc [Yu ISHIKAWA] Add the placeholders in Python
      4a03003 [Yu ISHIKAWA] Test for contains in Python
      6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
      288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
      5a7d574 [Yu ISHIKAWA] Rename `validateInitializationMode` to `validateInitMode` and remove throwing exception
      97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
      e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
      978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
      2ec80bc [Yu ISHIKAWA] Fit on 1 line
      e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
      b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
      f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
      3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
      4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
      2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
      19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
      4d2ad1e [Yu ISHIKAWA] Modify the indentations
      0ae422f [Yu ISHIKAWA] Add a test for `setParams`
      4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
      11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
      220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
      92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
      c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
      6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
      687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
      a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
      5bedc51 [Yu ISHIKAWA] Remove an extra new line
      444c289 [Yu ISHIKAWA] Add the validation for `runs`
      e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
      7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
      7991e15 [Yu ISHIKAWA] Add a validation for `k`
      c2df35d [Yu ISHIKAWA] Make `predict` private
      93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
      d3a79f7 [Yu ISHIKAWA] Remove the inherited docs
      e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
      8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
      6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
      99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
      79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
      6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      11c2a12 [Yu ISHIKAWA] Limit the imports
      badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
      f80319a [Yu ISHIKAWA] Rebase master branch and add copy methods
      85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
      aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
      c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
      598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
      63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala