  1. Feb 25, 2017
      [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 20a43295
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation. The main issue was that it did not account for the `_defaultParamMap`; the fix merges it into the explicitly created param map.
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.
      20a43295
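A minimal sketch (plain Python, not the actual PySpark source) of the merge order the fix establishes: defaults are applied first, then explicitly-set params, then any caller-supplied extra map, so explicit values always win:

```python
def copy_param_map(default_param_map, param_map, extra=None):
    """Merge order: defaults first, then explicit params, then extras."""
    merged = dict(default_param_map)   # start from the defaults
    merged.update(param_map)           # explicitly-set params override defaults
    if extra is not None:
        merged.update(extra)           # caller-supplied extras override both
    return merged
```

For example, `copy_param_map({'maxIter': 10, 'tol': 1e-6}, {'maxIter': 20})` keeps the default `tol` while the explicit `maxIter` wins.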
  2. Dec 01, 2016
      [SPARK-18274][ML][PYSPARK] Memory leak in PySpark JavaWrapper · 4c673c65
      Sandeep Singh authored
      
      ## What changes were proposed in this pull request?
      In `JavaWrapper`'s destructor, have the Java gateway dereference the wrapped Java object using `SparkContext._active_spark_context._gateway.detach`.
      Also fixes the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.
      
      ## How was this patch tested?
      ```python
      import random, string
      from pyspark.ml.feature import StringIndexer
      
      l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]  # 700000 random strings of 10 characters
      df = spark.createDataFrame(l, ['string'])
      
      for i in range(50):
          indexer = StringIndexer(inputCol='string', outputCol='index')
          indexer.fit(df)
      ```
      * Before: a strong reference to each StringIndexer was kept, causing GC issues, and the loop halted midway
      * After: garbage collection works as the object is dereferenced, and the computation completes
      * Mem footprint tested using profiler
      * Added a parameter copy related test which was failing before.
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      Author: jkbradley <joseph.kurata.bradley@gmail.com>
      
      Closes #15843 from techaddict/SPARK-18274.
      
      (cherry picked from commit 78bb7f80)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      4c673c65
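The destructor pattern described above can be sketched with a stub gateway; `StubGateway` and its `detach` method are stand-ins for the py4j gateway, not its real API surface:

```python
class StubGateway:
    """Stand-in for the py4j gateway; records which objects were detached."""
    def __init__(self):
        self.detached = []

    def detach(self, obj):
        self.detached.append(obj)


class JavaWrapperSketch:
    def __init__(self, gateway, java_obj):
        self._gateway = gateway
        self._java_obj = java_obj

    def __del__(self):
        # Drop the gateway's strong reference so the JVM-side object
        # can be garbage-collected along with the Python wrapper.
        if self._java_obj is not None:
            self._gateway.detach(self._java_obj)
```

In CPython, deleting the last reference to the wrapper runs `__del__` immediately, which is what lets the fit loop in the reproduction above release each fitted model.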
  3. Nov 29, 2016
  4. Nov 21, 2016
  5. Oct 13, 2016
  6. Oct 03, 2016
      [SPARK-17587][PYTHON][MLLIB] SparseVector __getitem__ should follow __getitem__ contract · d8399b60
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces `ValueError` with `IndexError` when the index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior.
      
      Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparseMatrix` in `ml` / `mllib`.
      
      ## How was this patch tested?
      
      PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #15144 from zero323/SPARK-17587.
      d8399b60
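The contract matters because Python's old-style iteration protocol probes `__getitem__` with increasing indices and stops only on `IndexError`; raising `ValueError` instead breaks plain iteration. A toy vector (hypothetical, not the real PySpark class) showing the fixed behavior:

```python
class ToySparseVector:
    def __init__(self, size, indices, values):
        self.size = size
        self._data = dict(zip(indices, values))

    def __getitem__(self, i):
        if i < 0:
            i += self.size          # support negative indexing
        if not 0 <= i < self.size:
            raise IndexError("index out of range")  # was ValueError before the fix
        return self._data.get(i, 0.0)
```

With this, `list(ToySparseVector(3, [1], [7.0]))` terminates cleanly at index 3 and yields `[0.0, 7.0, 0.0]`.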
  7. Aug 20, 2016
      [SPARK-15018][PYSPARK][ML] Improve handling of PySpark Pipeline when used without stages · 39f328ba
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      When fitting a PySpark Pipeline without the `stages` param set, a confusing NoneType error is raised as it attempts to iterate over the pipeline stages. A pipeline with no stages should act as an identity transform; however, the `stages` param still needs to be set to an empty list. This change improves the error output when the `stages` param is not set and adds a better description of what the API expects as input. Also includes minor cleanup of related code.
      
      ## How was this patch tested?
      Added new unit tests to verify an empty Pipeline acts as an identity transformer
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #12790 from BryanCutler/pipeline-identity-SPARK-15018.
      39f328ba
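A sketch, under the assumption of a much-simplified pipeline, of the two behaviors the change pins down: an unset `stages` param fails with a clear error, while an explicitly empty list acts as an identity transform:

```python
class ToyPipeline:
    def __init__(self, stages=None):
        if stages is None:
            # Clearer failure than the NoneType error described above.
            raise ValueError("stages must be set; pass stages=[] for an identity pipeline")
        self.stages = stages

    def fit_transform(self, df):
        for stage in self.stages:   # empty list -> df passes through unchanged
            df = stage(df)
        return df
```

`ToyPipeline(stages=[]).fit_transform(df)` returns `df` unchanged, which is the identity behavior the new unit tests verify.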
  8. Jul 15, 2016
      [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide · 5ffd5d38
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Made DataFrame-based API primary
      * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
      * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
      * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
        * **Reviewers: please check this carefully**
      * (minor) Titles for DF API no longer include "- spark.ml" suffix.  Titles for RDD API have "- RDD-based API" suffix
      * Moved migration guide to ml-guide from mllib-guide
        * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
        * **Reviewers**: I did not change any of the content of the migration guides.
      
      Reorganized DataFrame-based guide:
      * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
      * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
        * **Reviewers**: I did not change the content of these guides, except some intro text.
      * Sidebar remains the same, but with pipeline and tuning sections added
      
      Other:
      * ml-classification-regression.html: Moved text about linear methods to new section in page
      
      ## How was this patch tested?
      
      Generated docs locally
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14213 from jkbradley/ml-guide-2.0.
      5ffd5d38
  9. Jul 05, 2016
      [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls · fdde7d0a
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark.
      
      This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X
      
      ## How was this patch tested?
      
      Existing unit tests.  Manual testing in an environment where this was an issue.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #14023 from jkbradley/SPARK-16348.
      fdde7d0a
  10. Jun 13, 2016
      [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and... · baa3e633
      Liang-Chi Hsieh authored
      [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python
      
      ## What changes were proposed in this pull request?
      
      Now we have separate PySpark picklers for the new and old vector/matrix types. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers for the new vector/matrix under `spark.ml.python` instead.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      
      Closes #13219 from viirya/pyspark-pickler-ml.
      baa3e633
  11. May 27, 2016
      [SPARK-15008][ML][PYSPARK] Add integration test for OneVsRest · 130b8d07
      yinxusen authored
      ## What changes were proposed in this pull request?
      
      1. Add `_transfer_param_map_to/from_java` for OneVsRest;
      
      2. Add `_compare_params` in ml/tests.py to help compare params.
      
      3. Add `test_onevsrest` as the integration test for OneVsRest.
      
      ## How was this patch tested?
      
      Python unit test.
      
      Author: yinxusen <yinxusen@gmail.com>
      
      Closes #12875 from yinxusen/SPARK-15008.
      130b8d07
  12. May 18, 2016
  13. May 17, 2016
      [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms · e2efe052
      DB Tsai authored
      ## What changes were proposed in this pull request?
      
      Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: DB Tsai <dbt@netflix.com>
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12627 from dbtsai/SPARK-14615-NewML.
      e2efe052
      [SPARK-14906][ML] Copy linalg in PySpark to new ML package · 8ad9f08c
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      Copy the linalg (Vector/Matrix and VectorUDT/MatrixUDT) in PySpark to new ML package.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #13099 from viirya/move-pyspark-vector-matrix-udt4.
      8ad9f08c
  14. May 13, 2016
      [SPARK-15181][ML][PYSPARK] Python API for GLR summaries. · 5b849766
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.
      
      ## How was this patch tested?
      
      Added a unit test to `pyspark.ml.tests`
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #12961 from sethah/GLR_summary.
      5b849766
  15. May 11, 2016
  16. May 06, 2016
      [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover · e20cd9f4
      Burak Köse authored
      ## What changes were proposed in this pull request?
      
      This PR continues the work from #11871 with the following changes:
      * load English stopwords as default
      * convert stopwords to a list in Python
      * update some tests and doc
      
      ## How was this patch tested?
      
      Unit tests.
      
      Closes #11871
      
      cc: burakkose srowen
      
      Author: Burak Köse <burakks41@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Burak KOSE <burakks41@gmail.com>
      
      Closes #12843 from mengxr/SPARK-14050.
      e20cd9f4
  17. May 01, 2016
      [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update · a6428292
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      This PR is an update for [https://github.com/apache/spark/pull/12738] which:
      * Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
      * Various fixes for bugs found
        * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.
      
      Defaults changed:
      * Scala
       * LogisticRegression: weightCol defaults to not set (instead of empty string)
       * StringIndexer: labels default to not set (instead of empty array)
       * GeneralizedLinearRegression:
         * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
         * weightCol defaults to not set (instead of empty string)
       * LinearRegression: weightCol defaults to not set (instead of empty string)
      * Python
       * MultilayerPerceptron: layers default to not set (instead of [1,1])
       * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)
      
      ## How was this patch tested?
      
      Generic unit test.  Manually tested that unit test by changing defaults and verifying that broke the test.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: yinxusen <yinxusen@gmail.com>
      
      Closes #12816 from jkbradley/yinxusen-SPARK-14931.
      a6428292
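The generic test's core idea can be sketched as a plain dict comparison; the function name and the dict stand-ins for the Python-side and Scala-side param maps are hypothetical:

```python
def compare_defaults(py_defaults, scala_defaults):
    """Return params whose Python-side default differs from the Scala side."""
    keys = set(py_defaults) | set(scala_defaults)
    return {k: (py_defaults.get(k), scala_defaults.get(k))
            for k in keys
            if py_defaults.get(k) != scala_defaults.get(k)}
```

An empty result means the two sides agree; any entry is a mismatch of the kind this PR fixed (e.g. `maxIter`, `weightCol`).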
  18. Apr 30, 2016
      [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS · 7fbe1bb2
      Xiangrui Meng authored
      ## What changes were proposed in this pull request?
      
      As discussed in #12660, this PR renames
      * intermediateRDDStorageLevel -> intermediateStorageLevel
      * finalRDDStorageLevel -> finalStorageLevel
      
      The argument name in `ALS.train` will be addressed in SPARK-15027.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #12803 from mengxr/SPARK-14412.
      7fbe1bb2
      [SPARK-14412][ML][PYSPARK] Add StorageLevel params to ALS · 90fa2c6e
      Nick Pentreath authored
      `mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.
      
      ## How was this patch tested?
      
      New test cases in `ALSSuite` and `tests.py`.
      
      cc yanboliang jkbradley sethah rishabhbhardwaj
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #12660 from MLnick/SPARK-14412-als-storage-params.
      90fa2c6e
  19. Apr 29, 2016
      [SPARK-13786][ML][PYTHON] Removed save/load for python tuning · 09da43d5
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Per discussion on [https://github.com/apache/spark/pull/12604], this removes ML persistence for Python tuning (TrainValidationSplit, CrossValidator, and their Models) since they do not handle nesting easily.  This support should be re-designed and added in the next release.
      
      ## How was this patch tested?
      
      Removed unit test elements saving and loading the tuning algorithms, but kept tests to save and load their bestModel fields.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12782 from jkbradley/remove-python-tuning-saveload.
      09da43d5
      [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2 · 775772de
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      pyspark.ml API for LDA
      * LDA, LDAModel, LocalLDAModel, DistributedLDAModel
      * includes persistence
      
      This replaces [https://github.com/apache/spark/pull/10242]
      
      ## How was this patch tested?
      
      * doc test for LDA, including Param setters
      * unit test for persistence
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #12723 from jkbradley/zjffdu-SPARK-11940.
      775772de
  20. Apr 28, 2016
  21. Apr 27, 2016
  22. Apr 26, 2016
      [SPARK-14903][SPARK-14071][ML][PYTHON] Revert : MLWritable.write property · 89f082de
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      SPARK-14071 changed MLWritable.write to be a property.  This reverts that change since there was not a good way to make MLReadable.read appear to be a property.
      
      ## How was this patch tested?
      
      existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12671 from jkbradley/revert-MLWritable-write-py.
      89f082de
  23. Apr 25, 2016
      [SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3 · 425f6916
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      As discussed at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it the default hash algorithm. We should also expose a set/get API for ```hashAlgorithm```, so users can choose the hash method.
      
      Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work.
      
      ## How was this patch tested?
      unit tests.
      
      cc jkbradley MLnick
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12498 from yanboliang/spark-10574.
      425f6916
  24. Apr 20, 2016
      [SPARK-14555] First cut of Python API for Structured Streaming · 80bf48f4
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
       - ContinuousQuery
       - Trigger
       - ProcessingTime
      in pyspark under `pyspark.sql.streaming`.
      
      In addition, it contains the new methods added under:
       -  `DataFrameWriter`
           a) `startStream`
           b) `trigger`
           c) `queryName`
      
       -  `DataFrameReader`
           a) `stream`
      
       - `DataFrame`
          a) `isStreaming`
      
      This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
       - `exception`
       - `sourceStatuses`
       - `sinkStatus`
      
      They may be added in a follow up.
      
      This PR also contains some very minor doc fixes in the Scala side.
      
      ## How was this patch tested?
      
      Python doc tests
      
      TODO:
       - [ ] verify Python docs look good
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Burak Yavuz <burak@databricks.com>
      
      Closes #12320 from brkyvz/stream-python.
      80bf48f4
  25. Apr 18, 2016
  26. Apr 16, 2016
      [SPARK-14605][ML][PYTHON] Changed Python to use unicode UIDs for spark.ml Identifiable · 36da5e32
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.
      
      This PR: Use unicode everywhere in Python.
      
      ## How was this patch tested?
      
      Updated persistence unit test to check uid type
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #12368 from jkbradley/python-uid-unicode.
      36da5e32
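A hedged sketch of the uid convention (class name plus a random suffix; the helper name is illustrative, not the real internal): the `u""` prefix is what forces unicode on Python 2, where a bare string literal is bytes, and is a no-op on Python 3:

```python
import uuid


def random_uid(cls_name):
    # u"" forces a unicode uid on Python 2 to match the JVM's string type;
    # on Python 3, str is already unicode, so this is a no-op.
    return u"%s_%s" % (cls_name, uuid.uuid4().hex[:12])
```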
  27. Apr 15, 2016
  28. Apr 14, 2016
      [SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark HashingTF in ML & MLlib · bc748b7b
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
      This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1.
      
      Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done.
      
      ## How was this patch tested?
      
      This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #12079 from yongtang/SPARK-14238.
      bc748b7b
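A toy term-frequency hasher (using a simplistic deterministic hash, not Spark's actual hash function) illustrating what the binary toggle does: any non-zero count collapses to 1.0, turning counts into presence/absence features:

```python
def hash_tf(tokens, num_features, binary=False):
    """Toy HashingTF: an ord-sum hash stands in for the real hash function."""
    vec = [0.0] * num_features
    for t in tokens:
        vec[sum(ord(c) for c in t) % num_features] += 1.0
    if binary:
        # The toggle: all non-zero counts become 1.0 (presence/absence).
        vec = [1.0 if v else 0.0 for v in vec]
    return vec
```

With `binary=False`, a repeated token contributes its raw count to its bucket; with `binary=True` the same bucket holds only 1.0.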
      [SPARK-13967][PYSPARK][ML] Added binary Param to Python CountVectorizer · c5172f82
      Bryan Cutler authored
      Added binary toggle param to CountVectorizer feature transformer in PySpark.
      
      Created a unit test for using CountVectorizer with the binary toggle on.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
      c5172f82
  29. Apr 13, 2016
      [SPARK-14472][PYSPARK][ML] Cleanup ML JavaWrapper and related class hierarchy · fc3cd2f5
      Bryan Cutler authored
      Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params, and JavaCallable is a separate mixin that provides methods to make Java calls. This change simplifies the class structure by defining the Java wrapper in a plain base class along with methods to make Java calls. It also renames the Java wrapper classes to better reflect their purpose.
      
      Ran existing Python ml tests and generated documentation to test this change.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
      fc3cd2f5
  30. Apr 06, 2016
  31. Mar 29, 2016
      [SPARK-14071][PYSPARK][ML] Change MLWritable.write to be a property · 63b200e8
      wm624@hotmail.com authored
      Make MLWritable.write a property, so we can use .write instead of .write()
      
      Added a new test to ml/tests.py to check whether write is a property.
      ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
      
      Will test against the following Python executables: ['python2.7']
      Will test the following Python modules: ['pyspark-ml']
      Finished test(python2.7): pyspark.ml.evaluation (11s)
      Finished test(python2.7): pyspark.ml.clustering (16s)
      Finished test(python2.7): pyspark.ml.classification (24s)
      Finished test(python2.7): pyspark.ml.recommendation (24s)
      Finished test(python2.7): pyspark.ml.feature (39s)
      Finished test(python2.7): pyspark.ml.regression (26s)
      Finished test(python2.7): pyspark.ml.tuning (15s)
      Finished test(python2.7): pyspark.ml.tests (30s)
      Tests passed in 55 seconds
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #11945 from wangmiao1981/fix_property.
      63b200e8
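A minimal sketch of the pattern this commit introduced (and which SPARK-14903 above later reverted, because `MLReadable.read` could not be made to match): exposing `write` as a property so callers use `model.write.save(path)` instead of `model.write().save(path)`. The class names are illustrative:

```python
class ToyWriter:
    def save(self, path):
        return "saved to %s" % path


class ToyMLWritable:
    @property
    def write(self):
        # A property returns the writer without the call parentheses,
        # so model.write.save(path) works in place of model.write().save(path).
        return ToyWriter()
```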
  32. Mar 24, 2016
  33. Mar 23, 2016
      [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params · 30bdb5cb
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.
      
      This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
      
      ## How was this patch tested?
      
      Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #11663 from sethah/SPARK-13068-tc.
      30bdb5cb
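A sketch of the `typeConverter` mechanism with hypothetical toy classes: the converter runs when a param is set, so invalid values fail immediately with a coherent Python-side error rather than deep inside Py4J:

```python
def toFloat(value):
    """Converter in the spirit of the commit's TypeConverters (toy version)."""
    try:
        return float(value)
    except (TypeError, ValueError):
        raise TypeError("could not convert %r to a float" % (value,))


class ToyParam:
    def __init__(self, name, typeConverter=lambda v: v):
        self.name = name
        self.typeConverter = typeConverter


class ToyParams:
    def __init__(self):
        self._paramMap = {}

    def _set(self, param, value):
        # Conversion happens at set time, so bad values fail fast and clearly.
        self._paramMap[param.name] = param.typeConverter(value)
```

Setting an int on a float param converts it transparently, while an unconvertible value raises a clear `TypeError` at the call site.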