  1. Jun 13, 2016
    • [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python · baa3e633
      Liang-Chi Hsieh authored
      
      ## What changes were proposed in this pull request?
      
      We now have separate PySpark picklers for the new and old vector/matrix types. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers for the new vector/matrix types under `spark.ml.python` instead.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13219 from viirya/pyspark-pickler-ml.
      baa3e633
  2. May 31, 2016
    • [MINOR][DOC][ML] ml.clustering scala & python api doc sync · 594484cd
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Since we did the Scala API audit for ml.clustering in #13148, we should also fix and update the corresponding Python API docs to keep them in sync.
      
      ## How was this patch tested?
      Docs change, no tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #13291 from yanboliang/spark-15361-followup.
      594484cd
  3. May 23, 2016
    • [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code · a15ca553
      WeichenXu authored
      
      ## What changes were proposed in this pull request?
      
      Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
      a15ca553
  4. May 17, 2016
    • [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms · e2efe052
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      Once SPARK-14487 and SPARK-14549 are merged, we will migrate to the new vector and matrix types in the new ml pipeline based APIs.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12627 from dbtsai/SPARK-14615-NewML.
      e2efe052
  5. May 13, 2016
    • [MINOR][PYSPARK] update _shared_params_code_gen.py · 87d69a01
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      1. Add argument checks for `tol` and `stepSize` to keep in line with `SharedParamsCodeGen.scala`.
      2. Fix one typo.
      
      ## How was this patch tested?
      local build
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #12996 from zhengruifeng/py_args_checking.
      87d69a01
  6. May 03, 2016
    • [SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean up · d26f7cb0
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      PySpark ML Params setter code clean up.
      For example,
      `setInputCol` can be simplified from
      ```
      self._set(inputCol=value)
      return self
      ```
      to:
      ```
      return self._set(inputCol=value)
      ```
      This is a pretty big sweep, and we cleaned up wherever possible.
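
The simplification above only works if `_set` returns the object it was called on. A minimal sketch of that idea follows; `ParamsSketch` and `StageSketch` are hypothetical stand-ins written for illustration and are not the actual `pyspark.ml` classes.

```python
class ParamsSketch:
    """Hypothetical stand-in for a Params-like base class (illustration only)."""

    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        # Store every keyword argument, then return self so that
        # a setter can simply return the result of this call.
        for name, value in kwargs.items():
            self._paramMap[name] = value
        return self


class StageSketch(ParamsSketch):
    def setInputCol(self, value):
        # One-line form: relies on _set returning self.
        return self._set(inputCol=value)

    def setOutputCol(self, value):
        return self._set(outputCol=value)


# Because each setter returns the stage itself, calls can be chained.
stage = StageSketch().setInputCol("features").setOutputCol("scaled")
```

Under these assumptions, both setters end up recorded on the same `stage` object.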
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12749 from yanboliang/spark-14971.
      d26f7cb0
  7. Apr 29, 2016
  8. Apr 26, 2016
    • [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local · bd2c9a6d
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.
      
      This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
      * Renamed fields to match numpy, scipy: mu => mean, sigma => cov
      
      This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
      * Modifying the constructor
      * Adding a computeProbabilities method
      
      Also:
      * Added EPSILON to mllib-local for use in MultivariateGaussian
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12593 from jkbradley/sparkml-gmm-fix.
      bd2c9a6d
    • [SPARK-11559][MLLIB] Make `runs` no effect in mllib.KMeans · 302a1868
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      We deprecated `runs` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it have no effect (with warning messages). We did not remove `setRuns`/`getRuns`, for better binary compatibility.
      This PR changes the uses of `runs` that appear in the public API. Usage inside `KMeans.runAlgorithm()` will be resolved in #10806.
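
The "keep the setter, ignore the value, warn the caller" pattern described above could look roughly like this in Python. This is a sketch only: `KMeansSketch`, the warning text, and the fixed return value of `getRuns` are assumptions made for illustration, not the code from this PR.

```python
import warnings


class KMeansSketch:
    """Hypothetical class showing a parameter that is kept but has no effect."""

    def setRuns(self, runs):
        # The method stays for compatibility, but the value is discarded.
        warnings.warn("Setting number of runs has no effect.", DeprecationWarning)
        return self

    def getRuns(self):
        # Assumption for this sketch: always report a single run.
        return 1


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    model = KMeansSketch().setRuns(5)
```

Callers that still pass `runs` keep working and are told that the value is ignored.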
      
      ## How was this patch tested?
      Existing unit tests.
      
      cc jkbradley
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #12608 from yanboliang/spark-11559.
      302a1868
  9. Apr 25, 2016
    • [SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixture · b50e2eca
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Add Python API in ML for GaussianMixture
      
      ## How was this patch tested?
      
      
      Added a doctest; the test cases are the same as the mllib Python tests.
      ./dev/lint-python
      PEP8 checks passed.
      rm -rf _build/*
      pydoc checks passed.
      
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (18s)
      Finished test(python2.7): pyspark.ml.clustering (40s)
      Finished test(python2.7): pyspark.ml.classification (49s)
      Finished test(python2.7): pyspark.ml.recommendation (44s)
      Finished test(python2.7): pyspark.ml.feature (64s)
      Finished test(python2.7): pyspark.ml.regression (45s)
      Finished test(python2.7): pyspark.ml.tuning (30s)
      Finished test(python2.7): pyspark.ml.tests (56s)
      Tests passed in 106 seconds
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #12402 from wangmiao1981/gmm.
      b50e2eca
  10. Apr 20, 2016
    • [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow up.
      
      This PR also contains some very minor doc fixes on the Scala side.
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
      80bf48f4
  11. Apr 18, 2016
    • [SPARK-14714][ML][PYTHON] Fixed issues with non-kwarg typeConverter arg for Param constructor · d29e429e
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      PySpark Param constructors need to pass the TypeConverter argument by name, partly to make sure it is not mistaken for the expectedType arg and partly because we will remove the expectedType arg in 2.1. In several places, this is not being done correctly.
      
      This PR changes all usages in pyspark/ml/ to keyword args.
      
      ## How was this patch tested?
      
      Existing unit tests.  I will not test type conversion for every Param unless we really think it necessary.
      
      Also, if you start the PySpark shell and import classes (e.g., pyspark.ml.feature.StandardScaler), then you no longer get this warning:
      ```
      /Users/josephkb/spark/python/pyspark/ml/param/__init__.py:58: UserWarning: expectedType is deprecated and will be removed in 2.1. Use typeConverter instead, as a keyword argument.
        "Use typeConverter instead, as a keyword argument.")
      ```
      That warning came from the typeConverter argument being passed as the expectedType arg by mistake.
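
To see why a positional converter triggers that warning, here is a simplified sketch. `ParamSketch` is a hypothetical stand-in; its argument order (`expectedType` before `typeConverter`) is an assumption based on the description above, and `to_float` is an illustrative helper.

```python
import warnings


class ParamSketch:
    """Hypothetical stand-in for a Param constructor (illustration only)."""

    def __init__(self, parent, name, doc, expectedType=None, typeConverter=None):
        if expectedType is not None:
            warnings.warn("expectedType is deprecated and will be removed in 2.1. "
                          "Use typeConverter instead, as a keyword argument.")
        self.parent = parent
        self.name = name
        self.doc = doc
        # Fall back to an identity function when no converter is given.
        self.typeConverter = typeConverter if typeConverter is not None else (lambda v: v)


def to_float(value):
    return float(value)


# Positional: the converter silently lands in the expectedType slot.
with warnings.catch_warnings(record=True) as positional_warnings:
    warnings.simplefilter("always")
    wrong = ParamSketch("parent", "tol", "convergence tolerance", to_float)

# Keyword: the converter is bound to the intended argument.
right = ParamSketch("parent", "tol", "convergence tolerance", typeConverter=to_float)
```

In this sketch the positional call both warns and leaves the param without a real converter, which is the failure mode the keyword form avoids.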
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12480 from jkbradley/typeconverter-fix.
      d29e429e
  12. Apr 15, 2016
    • [SPARK-14104][PYSPARK][ML] All Python param setters should use the `_set` method · 129f2f45
      sethah authored
      ## What changes were proposed in this pull request?
      
      Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.
      
      Additional changes:
      * [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
      * An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here.
      
      ## How was this patch tested?
      
      Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11939 from sethah/SPARK-14104.
      129f2f45
  13. Apr 01, 2016
  14. Mar 23, 2016
    • [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params · 30bdb5cb
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.
      
      This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
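
As a rough illustration of failing at set time, the sketch below converts a value when the setter is called and raises a readable error if it cannot. `to_int` and `KMeansParamsSketch` are hypothetical names invented for this example; they are not the `TypeConverters` implementation from this patch.

```python
def to_int(value):
    """Convert value to int if that is lossless, otherwise raise TypeError."""
    if isinstance(value, bool):
        raise TypeError("Could not convert %r to int" % (value,))
    if isinstance(value, int):
        return value
    if isinstance(value, float) and value.is_integer():
        return int(value)
    raise TypeError("Could not convert %r to int" % (value,))


class KMeansParamsSketch:
    """Hypothetical params holder that converts values when they are set."""

    def __init__(self):
        self._paramMap = {}

    def setK(self, value):
        # Conversion happens here, so a bad value fails immediately
        # instead of later, when it would be handed to the JVM.
        self._paramMap["k"] = to_int(value)
        return self


params = KMeansParamsSketch().setK(3.0)  # 3.0 is stored as the int 3
```

Passing something like `"three"` to `setK` in this sketch raises a `TypeError` that names the offending value.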
      
      ## How was this patch tested?
      
      Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11663 from sethah/SPARK-13068-tc.
      30bdb5cb
  15. Mar 22, 2016
    • [SPARK-13951][ML][PYTHON] Nested Pipeline persistence · 7e3423b9
      Joseph K. Bradley authored
      Adds support for saving and loading nested ML Pipelines from Python.  Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.
      
      Also:
      * Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
      * Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java
      
      Added new unit test for nested Pipelines.  Abstracted validity check into a helper method for the 2 unit tests.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11866 from jkbradley/nested-pipeline-io.
      Closes #11835
      7e3423b9
  16. Mar 01, 2016
    • [SPARK-13008][ML][PYTHON] Put one alg per line in pyspark.ml all lists · 9495c40f
      Joseph K. Bradley authored
      This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top.  Since we keep it alphabetized, it often creates a lot more changes than needed.  It is also easy to add the Estimator and forget the Model.  I'm going to switch it to have one algorithm per line.
      
      This also alphabetizes a few out-of-place classes in pyspark.ml.feature.  No changes have been made to the moved classes.
      
      CC: thunterdb
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10927 from jkbradley/ml-python-all-list.
      9495c40f
  17. Feb 20, 2016
  18. Feb 12, 2016
  19. Feb 11, 2016
  20. Jan 26, 2016
    • [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code · eb917291
      Holden Karau authored
      The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
      eb917291
  21. Jan 06, 2016
  22. Sep 17, 2015
  23. Aug 13, 2015
    • [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol · 68f99571
      Xiangrui Meng authored
      This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.
      
      This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.
      
      jkbradley yu-iskw
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #8148 from mengxr/SPARK-9918 and squashes the following commits:
      
      149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
      3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
      a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
      68f99571
  24. Aug 12, 2015
  25. Jul 17, 2015
    • [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines · 34a889db
      Yu ISHIKAWA authored
      I implemented the KMeans API for spark.ml Pipelines. It doesn't include the clustering abstractions for spark.ml (SPARK-7610), which would fit better in a separate issue. I'll try that later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.
      
      [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:
      
      be752de [Yu ISHIKAWA] Add assertions
      a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
      4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
      fb2417c [Yu ISHIKAWA] Use getInt, instead of get
      f397be4 [Yu ISHIKAWA] Switch the comparisons.
      ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
      effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
      c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
      19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
      1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
      f8338bc [Yu ISHIKAWA] Add the placeholders in Python
      4a03003 [Yu ISHIKAWA] Test for contains in Python
      6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
      288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
      5a7d574 [Yu ISHIKAWA] Rename `validateInitializationMode` to `validateInitMode` and remove throwing exception
      97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
      e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
      978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
      2ec80bc [Yu ISHIKAWA] Fit on 1 line
      e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
      b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
      f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
      3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
      4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
      2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
      19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
      4d2ad1e [Yu ISHIKAWA] Modify the indentations
      0ae422f [Yu ISHIKAWA] Add a test for `setParams`
      4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
      11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
      220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
      92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
      c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
      6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
      687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
      a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
      5bedc51 [Yu ISHIKAWA] Remove an extra new line
      444c289 [Yu ISHIKAWA] Add the validation for `runs`
      e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
      7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
      7991e15 [Yu ISHIKAWA] Add a validation for `k`
      c2df35d [Yu ISHIKAWA] Make `predict` private
      93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
      d3a79f7 [Yu ISHIKAWA] Remove the inherited docs
      e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
      8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
      6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
      99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
      79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
      6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      11c2a12 [Yu ISHIKAWA] Limit the imports
      badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
      f80319a [Yu ISHIKAWA] Rebase master branch and add copy methods
      85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
      aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
      c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
      598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
      63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
      34a889db