- Feb 25, 2017
-
-
Bryan Cutler authored
## What changes were proposed in this pull request? Fixed the PySpark Params.copy method to behave like the Scala implementation. The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map. ## How was this patch tested? Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params. Author: Bryan Cutler <cutlerb@gmail.com> Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.
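The distinction this fix relies on can be illustrated with a minimal sketch in plain Python. It assumes a simplified, hypothetical `SketchParams` class, not the real PySpark `Params`. Only `_defaultParamMap` is named in the commit message; `_paramMap`, `is_set`, and `get_or_default` are illustrative names chosen here.

```python
import copy


class SketchParams:
    """Hypothetical stand-in for a Params-like object; not the PySpark class."""

    def __init__(self, uid):
        self.uid = uid
        self._defaultParamMap = {}  # defaults supplied by the library
        self._paramMap = {}         # values set explicitly by the user

    def copy(self, extra=None):
        that = copy.copy(self)  # shallow copy keeps the same uid
        # Copy the two maps separately so defaults never leak into explicit params.
        that._defaultParamMap = dict(self._defaultParamMap)
        that._paramMap = dict(self._paramMap)
        that._paramMap.update(extra or {})
        return that

    def is_set(self, name):
        return name in self._paramMap

    def get_or_default(self, name):
        if name in self._paramMap:
            return self._paramMap[name]
        return self._defaultParamMap[name]


original = SketchParams(uid="sketch_0001")
original._defaultParamMap["maxIter"] = 100
original._paramMap["regParam"] = 0.1
clone = original.copy({"regParam": 0.5})
```

In this sketch the clone keeps `maxIter` as a default rather than reporting it as explicitly set, which is the behavior the commit describes aligning with the Scala side.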
-
- Dec 01, 2016
-
-
Sandeep Singh authored
## What changes were proposed in this pull request?

In `JavaWrapper`'s destructor, make the Java Gateway dereference the object, using `SparkContext._active_spark_context._gateway.detach`.

Fix the parameter-copying bug by moving the `copy` method from `JavaModel` to `JavaParams`.

## How was this patch tested?

```python
import random, string
from pyspark.ml.feature import StringIndexer

# 700000 random strings of 10 characters
l = [(''.join(random.choice(string.ascii_uppercase) for _ in range(10)), ) for _ in range(int(7e5))]
df = spark.createDataFrame(l, ['string'])

for i in range(50):
    indexer = StringIndexer(inputCol='string', outputCol='index')
    indexer.fit(df)
```

* Before: a strong reference to each StringIndexer was kept, causing GC issues, and the computation halted midway.
* After: garbage collection works because the object is dereferenced, and the computation completes.
* Memory footprint tested using a profiler.
* Added a parameter-copy test, which was failing before.

Author: Sandeep Singh <sandeep@techaddict.me> Author: jkbradley <joseph.kurata.bradley@gmail.com> Closes #15843 from techaddict/SPARK-18274. (cherry picked from commit 78bb7f80) Signed-off-by:
Joseph K. Bradley <joseph@databricks.com>
-
- Nov 29, 2016
-
-
Jeff Zhang authored
## What changes were proposed in this pull request? Add python api for KMeansSummary ## How was this patch tested? unit test added Author: Jeff Zhang <zjffdu@apache.org> Closes #13557 from zjffdu/SPARK-15819. (cherry picked from commit 4c82ca86) Signed-off-by:
Yanbo Liang <ybliang8@gmail.com>
-
- Nov 21, 2016
-
-
sethah authored
## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <seth.hendrickson16@gmail.com> Closes #15777 from sethah/pyspark_cluster_summaries. (cherry picked from commit e811fbf9) Signed-off-by:
Yanbo Liang <ybliang8@gmail.com>
-
- Oct 13, 2016
-
-
Yanbo Liang authored
## What changes were proposed in this pull request? Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15430 from yanboliang/spark-15957-python.
-
- Oct 03, 2016
-
-
zero323 authored
## What changes were proposed in this pull request? Replaces `ValueError` with `IndexError` when the index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior. Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparseMatrix` in `ml` / `mllib`. ## How was this patch tested? PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved. Author: zero323 <zero323@users.noreply.github.com> Closes #15144 from zero323/SPARK-17587.
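Why the exception type matters for iteration can be shown with a small sketch in plain Python. `SketchSparseVector` below is a hypothetical class written for illustration, not the PySpark `SparseVector`. When a class defines `__getitem__` but not `__iter__`, Python iterates by calling `__getitem__` with 0, 1, 2, ... and stops only when `IndexError` is raised; any other exception, such as `ValueError`, propagates to the caller instead of ending the loop.

```python
class SketchSparseVector:
    """Hypothetical sparse vector used only to illustrate the iteration protocol."""

    def __init__(self, size, indices, values):
        self.size = size
        self._entries = dict(zip(indices, values))

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("index must be an integer")
        if index < -self.size or index >= self.size:
            # IndexError is what tells the iteration protocol to stop.
            raise IndexError("index out of bounds")
        if index < 0:
            index += self.size
        return self._entries.get(index, 0.0)


vec = SketchSparseVector(4, [1, 3], [2.0, 5.0])
as_list = list(vec)  # iterates via __getitem__ until IndexError
```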
-
- Aug 20, 2016
-
-
Bryan Cutler authored
## What changes were proposed in this pull request? When fitting a PySpark Pipeline without the `stages` param set, a confusing NoneType error is raised as it attempts to iterate over the pipeline stages. A pipeline with no stages should act as an identity transform; however, the `stages` param still needs to be set to an empty list. This change improves the error output when the `stages` param is not set and adds a better description of what the API expects as input. Also minor cleanup of related code. ## How was this patch tested? Added new unit tests to verify that an empty Pipeline acts as an identity transformer. Author: Bryan Cutler <cutlerb@gmail.com> Closes #12790 from BryanCutler/pipeline-identity-SPARK-15018.
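The difference between an unset `stages` param and an empty list can be sketched in plain Python. `SketchPipeline` is a hypothetical class whose stages are ordinary callables; the method name `run` and the error messages are assumptions made for illustration, not the PySpark API.

```python
class SketchPipeline:
    """Hypothetical pipeline whose stages are plain callables; not the PySpark class."""

    def __init__(self, stages=None):
        self.stages = stages  # None means the param was never set

    def run(self, dataset):
        if self.stages is None:
            # Fail early with a clear message instead of a NoneType iteration error.
            raise ValueError("Pipeline stages must be set; use stages=[] for an identity pipeline")
        if not isinstance(self.stages, list):
            raise TypeError("Pipeline stages must be a list, got %s" % type(self.stages).__name__)
        for stage in self.stages:
            dataset = stage(dataset)
        return dataset


rows = [1, 2, 3]
identity_result = SketchPipeline(stages=[]).run(rows)
doubled = SketchPipeline(stages=[lambda d: [x * 2 for x in d]]).run(rows)
```

With an empty list the loop body never runs, so the input comes back unchanged, which is the identity behavior the commit message describes.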
-
- Jul 15, 2016
-
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Made DataFrame-based API primary * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API * **Reviewers: please check this carefully** * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix * Moved migration guide to ml-guide from mllib-guide * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides * **Reviewers**: I did not change any of the content of the migration guides. Reorganized DataFrame-based guide: * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc. * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html * **Reviewers**: I did not change the content of these guides, except some intro text. * Sidebar remains the same, but with pipeline and tuning sections added Other: * ml-classification-regression.html: Moved text about linear methods to new section in page ## How was this patch tested? Generated docs locally Author: Joseph K. Bradley <joseph@databricks.com> Closes #14213 from jkbradley/ml-guide-2.0.
-
- Jul 05, 2016
-
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark. This PR: Changed all uses of jvm.X in pyspark.ml and pyspark.mllib to use full classpath for X ## How was this patch tested? Existing unit tests. Manual testing in an environment where this was an issue. Author: Joseph K. Bradley <joseph@databricks.com> Closes #14023 from jkbradley/SPARK-16348.
-
- Jun 13, 2016
-
-
Liang-Chi Hsieh authored
[SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python ## What changes were proposed in this pull request? Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13219 from viirya/pyspark-pickler-ml.
-
- May 27, 2016
-
-
yinxusen authored
## What changes were proposed in this pull request? 1. Add `_transfer_param_map_to/from_java` for OneVsRest; 2. Add `_compare_params` in ml/tests.py to help compare params. 3. Add `test_onevsrest` as the integration test for OneVsRest. ## How was this patch tested? Python unit test. Author: yinxusen <yinxusen@gmail.com> Closes #12875 from yinxusen/SPARK-15008.
-
- May 18, 2016
-
-
Takuya Kuwahara authored
## What changes were proposed in this pull request? This pull request includes supporting validationMetrics for TrainValidationSplitModel with Python and test for it. ## How was this patch tested? test in `python/pyspark/ml/tests.py` Author: Takuya Kuwahara <taakuu19@gmail.com> Closes #12767 from taku-k/spark-14978.
-
- May 17, 2016
-
-
DB Tsai authored
## What changes were proposed in this pull request? Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis. ## How was this patch tested? Unit tests Author: DB Tsai <dbt@netflix.com> Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Xiangrui Meng <meng@databricks.com> Closes #12627 from dbtsai/SPARK-14615-NewML.
-
Xiangrui Meng authored
## What changes were proposed in this pull request? Copy the linalg (Vector/Matrix and VectorUDT/MatrixUDT) in PySpark to new ML package. ## How was this patch tested? Existing tests. Author: Xiangrui Meng <meng@databricks.com> Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #13099 from viirya/move-pyspark-vector-matrix-udt4.
-
- May 13, 2016
-
-
sethah authored
## What changes were proposed in this pull request? This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs. ## How was this patch tested? Added a unit test to `pyspark.ml.tests` Author: sethah <seth.hendrickson16@gmail.com> Closes #12961 from sethah/GLR_summary.
-
- May 11, 2016
-
-
Sandeep Singh authored
## What changes were proposed in this pull request? Use SparkSession instead of SQLContext in Python TestSuites ## How was this patch tested? Existing tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13044 from techaddict/SPARK-15037-python.
-
- May 06, 2016
-
-
Burak Köse authored
## What changes were proposed in this pull request? This PR continues the work from #11871 with the following changes: * load English stopwords as default * convert stopwords to list in Python * update some tests and doc ## How was this patch tested? Unit tests. Closes #11871 cc: burakkose srowen Author: Burak Köse <burakks41@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Author: Burak KOSE <burakks41@gmail.com> Closes #12843 from mengxr/SPARK-14050.
-
- May 01, 2016
-
-
Xusen Yin authored
## What changes were proposed in this pull request? This PR is an update for [https://github.com/apache/spark/pull/12738] which: * Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side * Various fixes for bugs found * This includes changing classes taking weightCol to treat unset and empty String Param values the same way. Defaults changed: * Scala * LogisticRegression: weightCol defaults to not set (instead of empty string) * StringIndexer: labels default to not set (instead of empty array) * GeneralizedLinearRegression: * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver) * weightCol defaults to not set (instead of empty string) * LinearRegression: weightCol defaults to not set (instead of empty string) * Python * MultilayerPerceptron: layers default to not set (instead of [1,1]) * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set) ## How was this patch tested? Generic unit test. Manually tested that unit test by changing defaults and verifying that broke the test. Author: Joseph K. Bradley <joseph@databricks.com> Author: yinxusen <yinxusen@gmail.com> Closes #12816 from jkbradley/yinxusen-SPARK-14931.
-
- Apr 30, 2016
-
-
Xiangrui Meng authored
## What changes were proposed in this pull request? As discussed in #12660, this PR renames * intermediateRDDStorageLevel -> intermediateStorageLevel * finalRDDStorageLevel -> finalStorageLevel The argument name in `ALS.train` will be addressed in SPARK-15027. ## How was this patch tested? Existing unit tests. Author: Xiangrui Meng <meng@databricks.com> Closes #12803 from mengxr/SPARK-14412.
-
Nick Pentreath authored
`mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them. ## How was this patch tested? New test cases in `ALSSuite` and `tests.py`. cc yanboliang jkbradley sethah rishabhbhardwaj Author: Nick Pentreath <nickp@za.ibm.com> Closes #12660 from MLnick/SPARK-14412-als-storage-params.
-
- Apr 29, 2016
-
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Per discussion on [https://github.com/apache/spark/pull/12604], this removes ML persistence for Python tuning (TrainValidationSplit, CrossValidator, and their Models) since they do not handle nesting easily. This support should be re-designed and added in the next release. ## How was this patch tested? Removed unit test elements saving and loading the tuning algorithms, but kept tests to save and load their bestModel fields. Author: Joseph K. Bradley <joseph@databricks.com> Closes #12782 from jkbradley/remove-python-tuning-saveload.
-
Jeff Zhang authored
## What changes were proposed in this pull request? pyspark.ml API for LDA * LDA, LDAModel, LocalLDAModel, DistributedLDAModel * includes persistence This replaces [https://github.com/apache/spark/pull/10242] ## How was this patch tested? * doc test for LDA, including Param setters * unit test for persistence Author: Joseph K. Bradley <joseph@databricks.com> Author: Jeff Zhang <zjffdu@apache.org> Closes #12723 from jkbradley/zjffdu-SPARK-11940.
-
- Apr 28, 2016
-
-
Kai Jiang authored
## What changes were proposed in this pull request? support avgMetrics in CrossValidatorModel with Python ## How was this patch tested? Doctest and `test_save_load` in `pyspark/ml/test.py` [JIRA](https://issues.apache.org/jira/browse/SPARK-12810) Author: Kai Jiang <jiangkai@gmail.com> Closes #12464 from vectorijk/spark-12810.
-
- Apr 27, 2016
-
-
Yanbo Liang authored
## What changes were proposed in this pull request? Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone. ## How was this patch tested? Unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12702 from yanboliang/spark-14899.
-
- Apr 26, 2016
-
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? SPARK-14071 changed MLWritable.write to be a property. This reverts that change since there was not a good way to make MLReadable.read appear to be a property. ## How was this patch tested? existing unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #12671 from jkbradley/revert-MLWritable-write-py.
-
- Apr 25, 2016
-
-
Yanbo Liang authored
## What changes were proposed in this pull request? As discussed in [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and use it as the default hash algorithm. We should also expose a set/get API for ```hashAlgorithm```, so that users can choose the hash method. Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in follow-up work. ## How was this patch tested? unit tests. cc jkbradley MLnick Author: Yanbo Liang <ybliang8@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #12498 from yanboliang/spark-10574.
-
- Apr 20, 2016
-
-
Burak Yavuz authored
## What changes were proposed in this pull request? This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes: - ContinuousQuery - Trigger - ProcessingTime in pyspark under `pyspark.sql.streaming`. In addition, it contains the new methods added under: - `DataFrameWriter` a) `startStream` b) `trigger` c) `queryName` - `DataFrameReader` a) `stream` - `DataFrame` a) `isStreaming` This PR doesn't contain all methods exposed for `ContinuousQuery`, for example: - `exception` - `sourceStatuses` - `sinkStatus` They may be added in a follow up. This PR also contains some very minor doc fixes in the Scala side. ## How was this patch tested? Python doc tests TODO: - [ ] verify Python docs look good Author: Burak Yavuz <brkyvz@gmail.com> Author: Burak Yavuz <burak@databricks.com> Closes #12320 from brkyvz/stream-python.
-
- Apr 18, 2016
-
-
Jason Lee authored
## What changes were proposed in this pull request? Added windowSize getter/setter to ML/MLlib ## How was this patch tested? Added test cases in tests.py under both ML and MLlib Author: Jason Lee <cjlee@us.ibm.com> Closes #12428 from jasoncl/SPARK-14564.
-
Xusen Yin authored
## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14306 Add PySpark OneVsRest save/load supports. ## How was this patch tested? Test with Python unit test. Author: Xusen Yin <yinxusen@gmail.com> Closes #12439 from yinxusen/SPARK-14306-0415.
-
- Apr 16, 2016
-
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python. This PR: Use unicode everywhere in Python. ## How was this patch tested? Updated persistence unit test to check uid type Author: Joseph K. Bradley <joseph@databricks.com> Closes #12368 from jkbradley/python-uid-unicode.
-
- Apr 15, 2016
-
-
Xusen Yin authored
## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-7861 Add PySpark OneVsRest. I implement it with Python since it's a meta-pipeline. ## How was this patch tested? Test with doctest. Author: Xusen Yin <yinxusen@gmail.com> Closes #12124 from yinxusen/SPARK-14306-7861.
-
Joseph K. Bradley authored
## What changes were proposed in this pull request? The default stopwords were a Java object. They are no longer. ## How was this patch tested? Unit test which failed before the fix Author: Joseph K. Bradley <joseph@databricks.com> Closes #12422 from jkbradley/pyspark-stopwords.
-
- Apr 14, 2016
-
-
Yong Tang authored
## What changes were proposed in this pull request? This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1. Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done. ## How was this patch tested? This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib. Author: Yong Tang <yong.tang.github@outlook.com> Closes #12079 from yongtang/SPARK-14238.
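The effect of the binary toggle can be sketched in plain Python. The function below is a hypothetical illustration, not the PySpark `HashingTF`: it uses `zlib.crc32` as a stand-in hash (an assumption made for a dependency-free example), and the names `sketch_hashing_tf` and `num_features` are chosen here.

```python
import zlib
from collections import Counter


def sketch_hashing_tf(terms, num_features=16, binary=False):
    """Map terms to hashed bucket counts; with binary=True every non-zero count becomes 1.0."""
    # crc32 is a placeholder hash used only for this sketch.
    buckets = Counter(zlib.crc32(t.encode("utf-8")) % num_features for t in terms)
    if binary:
        return {index: 1.0 for index in buckets}
    return {index: float(count) for index, count in buckets.items()}


doc = ["a", "b", "a", "a"]
counts = sketch_hashing_tf(doc)
presence = sketch_hashing_tf(doc, binary=True)
```

Both results have the same bucket indices; only the values differ, since the binary form records presence rather than frequency.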
-
Bryan Cutler authored
Added binary toggle param to CountVectorizer feature transformer in PySpark. Created a unit test for using CountVectorizer with the binary toggle on. Author: Bryan Cutler <cutlerb@gmail.com> Closes #12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
-
- Apr 13, 2016
-
-
Bryan Cutler authored
Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params, and JavaCallable is a separate mixin that provides methods to make Java calls. This change simplifies the class structure by defining the Java wrapper in a plain base class along with methods to make Java calls. It also renames Java wrapper classes to better reflect their purpose. Ran existing Python ml tests and generated documentation to test this change. Author: Bryan Cutler <cutlerb@gmail.com> Closes #12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
-
- Apr 06, 2016
-
-
Bryan Cutler authored
## What changes were proposed in this pull request? Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML. ## How was this patch tested? Added unit tests to exercise the api calls for the summary classes. Also, manually verified values are expected and match those from Scala directly. Author: Bryan Cutler <cutlerb@gmail.com> Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
-
Xusen Yin authored
## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13786 Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model. ## How was this patch tested? Test with Python doctest. Author: Xusen Yin <yinxusen@gmail.com> Closes #12020 from yinxusen/SPARK-13786.
-
- Mar 29, 2016
-
-
wm624@hotmail.com authored
Add property to MLWritable.write method, so we can use .write instead of .write() Add a new test to ml/test.py to check whether the write is a property. ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-ml'] Finished test(python2.7): pyspark.ml.evaluation (11s) Finished test(python2.7): pyspark.ml.clustering (16s) Finished test(python2.7): pyspark.ml.classification (24s) Finished test(python2.7): pyspark.ml.recommendation (24s) Finished test(python2.7): pyspark.ml.feature (39s) Finished test(python2.7): pyspark.ml.regression (26s) Finished test(python2.7): pyspark.ml.tuning (15s) Finished test(python2.7): pyspark.ml.tests (30s) Tests passed in 55 seconds Author: wm624@hotmail.com <wm624@hotmail.com> Closes #11945 from wangmiao1981/fix_property.
-
- Mar 24, 2016
-
-
GayathriMurali authored
## What changes were proposed in this pull request? Added MLReadable and MLWritable to Decision Tree Classifier and Regressor. Added doctests. ## How was this patch tested? Python Unit tests. Tests added to check persistence in DecisionTreeClassifier and DecisionTreeRegressor. Author: GayathriMurali <gayathri.m.softie@gmail.com> Closes #11892 from GayathriMurali/SPARK-13949.
-
- Mar 23, 2016
-
-
sethah authored
## What changes were proposed in this pull request? This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type. This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira. ## How was this patch tested? Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided. Author: sethah <seth.hendrickson16@gmail.com> Closes #11663 from sethah/SPARK-13068-tc.
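The idea of converting and validating a value at set time can be sketched in plain Python. The classes below are hypothetical stand-ins, not the PySpark `Param` or `TypeConverters`; the names `SketchParam`, `SketchTypeConverters.to_int`, and `set_param` are assumptions made for illustration.

```python
class SketchTypeConverters:
    """Hypothetical converter functions; each returns the converted value or raises TypeError."""

    @staticmethod
    def to_int(value):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            raise TypeError("Could not convert %r to int" % (value,))
        if isinstance(value, int):
            return value
        if isinstance(value, float) and value.is_integer():
            return int(value)
        raise TypeError("Could not convert %r to int" % (value,))


class SketchParam:
    """Hypothetical param carrying a converter that is applied when a value is set."""

    def __init__(self, name, doc, type_converter=None):
        self.name = name
        self.doc = doc
        self.type_converter = type_converter or (lambda value: value)


def set_param(param_map, param, value):
    # Conversion happens here, so bad values fail at set time with a readable message.
    param_map[param.name] = param.type_converter(value)
    return param_map


max_iter = SketchParam("maxIter", "max number of iterations (>= 0)", SketchTypeConverters.to_int)
params = set_param({}, max_iter, 10.0)
```

In this sketch a value such as `10.0` is stored as the integer `10`, while a value such as `"ten"` raises a `TypeError` immediately rather than failing later during a cross-language call.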
-