Commits · 40a10d7675578f8370d07e23810d9fc5d58e0550 · cs525-sp18-g07 / spark

Oct 21, 2015

[SPARK-11205][PYSPARK] Delegate to scala DataFrame API rather than p… · 5cdea7d1

Jeff Zhang authored 9 years ago

…rint in python

No test needed. Verify it manually in pyspark shell

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9177 from zjffdu/SPARK-11205.

5cdea7d1

Oct 20, 2015

[MINOR][ML] fix doc warnings · 135ade90

Xiangrui Meng authored 9 years ago

Without an empty line, sphinx will treat doctest as docstring. holdenk

~~~
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
~~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #9188 from mengxr/py-count-vec-doc-fix.

135ade90

[SPARK-10767][PYSPARK] Make pyspark shared params codegen more consistent · aea7142c

Holden Karau authored 9 years ago

Namely "." shows up in some places in the template when using the param docstring and not in others

Author: Holden Karau <holden@pigscanfly.ca>

Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.

aea7142c

[SPARK-10269][PYSPARK][MLLIB] Add @since annotation to pyspark.mllib.classification · 04521ea0

noelsmith authored 9 years ago

Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to methods + "versionadded::" to classes derived from the file history.

Note - some methods are inherited from the regression module (i.e. LinearModel.intercept) so these won't have version numbers in the API docs until that model is updated.
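A minimal sketch of a `since`-style decorator that tolerates functions without docstrings. The indentation handling and the `intercept` example function are illustrative assumptions, not the code from this commit.

```python
import re


def since(version):
    """Return a decorator that appends a Sphinx ``versionadded`` note to a docstring."""
    indent_pattern = re.compile(r"\n( +)")

    def decorator(func):
        doc = func.__doc__ or ""  # tolerate functions that have no docstring
        indents = indent_pattern.findall(doc)
        # Reuse the smallest indentation found so the directive lines up with the text.
        indent = " " * (min(len(m) for m in indents) if indents else 0)
        func.__doc__ = doc.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return func

    return decorator


@since("1.4.0")
def intercept():  # hypothetical function without a docstring
    return 0.0
```

The `or ""` fallback is the part that matters here: without it, a function with no docstring would fail when the decorator tries to scan `None`.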

Author: noelsmith <mail@noelsmith.com>

Closes #8626 from noel-smith/SPARK-10269-since-mlib-classification.

04521ea0

[SPARK-10272][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.evaluation · 82e9d9c8

noelsmith authored 9 years ago

Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to public methods + "versionadded::" to classes (derived from the git file history in pyspark).

Note - I also added the tags to MultilabelMetrics even though it isn't declared as public in the `__all__` statement. If that's incorrect, I'll remove them.

Author: noelsmith <mail@noelsmith.com>

Closes #8628 from noel-smith/SPARK-10272-since-mllib-evalutation.

82e9d9c8

[SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9 · e18b571c

Holden Karau authored 9 years ago

Upgrade to Py4j0.9

Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>

Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.

e18b571c

Oct 19, 2015

[SPARK-11114][PYSPARK] add getOrCreate for SparkContext/SQLContext in Python · 232d7f8d

Davies Liu authored 9 years ago

Also added SQLContext.newSession()

Author: Davies Liu <davies@databricks.com>

Closes #9122 from davies/py_create.

232d7f8d

[SPARK-7018][BUILD] Refactor dev/run-tests-jenkins into Python · d3180c25

Brennon York authored 9 years ago

This commit refactors the `run-tests-jenkins` script into Python. This refactoring was done by brennonyork in #7401; this PR contains a few minor edits from joshrosen in order to bring it up to date with other recent changes.

From the original PR description (by brennonyork):

Currently a few things are left out that could, and I think should, be smaller JIRAs after this.

1. There are still a few areas where we use environment variables where we don't need to (like `CURRENT_BLOCK`). I might get around to fixing this one in lieu of everything else, but wanted to point that out.
2. The PR tests are still written in bash. I opted to not change those and just rewrite the runner into Python. This is a great follow-on JIRA IMO.
3. All of the linting scripts are still in bash as well; porting those would likely be best handled as follow-on JIRAs too.

Closes #7401.

Author: Brennon York <brennon.york@capitalone.com>

Closes #9161 from JoshRosen/run-tests-jenkins-refactoring.

d3180c25

Oct 18, 2015

[SPARK-11158][SQL] Modified _verify_type() to be more informative on Errors by... · a337c235

Mahmoud Lababidi authored 9 years ago

[SPARK-11158][SQL] Modified _verify_type() to be more informative on Errors by presenting the Object

The _verify_type() function had Errors that were raised when there were Type conversion issues but left out the Object in question. The Object is now added in the Error to reduce the strain on the user to debug through to figure out the Object that failed the Type conversion.
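As a rough illustration of the idea, the sketch below raises a `TypeError` that names the offending object as well as the types involved. `verify_type` is a hypothetical stand-in, not the real `_verify_type()`, and the message wording is an assumption.

```python
def verify_type(obj, expected_type):
    """Raise TypeError naming the expected type, the object, and its actual type."""
    if not isinstance(obj, expected_type):
        raise TypeError(
            "%s can not accept object %r in type %s"
            % (expected_type.__name__, obj, type(obj).__name__)
        )


# A float NaN where a string is expected, similar to the Pandas use case described above.
try:
    verify_type(float("nan"), str)
except TypeError as exc:
    message = str(exc)
```

With the object included in the message, the failing value can be spotted without stepping through the conversion.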

The use case for me was a Pandas DataFrame that contained 'nan' as values for columns of Strings.

Author: Mahmoud Lababidi <mahmoud@thehumangeo.com>
Author: Mahmoud Lababidi <lababidi@gmail.com>

Closes #9149 from lababidi/master.

a337c235

Oct 17, 2015

[SPARK-10185] [SQL] Feat sql comma separated paths · 57f83e36

Koert Kuipers authored 9 years ago

Make sure comma-separated paths get processed correctly in ResolvedDataSource for a HadoopFsRelationProvider

Author: Koert Kuipers <koert@tresata.com>

Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.

57f83e36

Oct 16, 2015

[SPARK-11084] [ML] [PYTHON] Check if index can contain non-zero value before binary search · 8ac71d62

zero323 authored 9 years ago

At this moment `SparseVector.__getitem__` executes `np.searchsorted` first and checks if result is in an expected range after that. It is possible to check if index can contain non-zero value before executing `np.searchsorted`.
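The ordering described above can be sketched in plain Python. This is an assumption-laden illustration: `sparse_get` is a hypothetical helper, and `bisect_left` from the standard library stands in for `np.searchsorted`.

```python
from bisect import bisect_left


def sparse_get(size, indices, values, index):
    """Look up one entry of a sparse vector stored as sorted indices plus values."""
    if not -size <= index < size:
        raise IndexError("index %d out of bounds for size %d" % (index, size))
    if index < 0:
        index += size
    # Cheap range check first: outside [indices[0], indices[-1]] the entry must be zero.
    if not indices or index < indices[0] or index > indices[-1]:
        return 0.0
    pos = bisect_left(indices, index)  # binary search, standing in for np.searchsorted
    return values[pos] if indices[pos] == index else 0.0
```

Because the range check runs first, the binary search is only reached when its result is guaranteed to be a valid position in `indices`.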

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9098 from zero323/sparse_vector_getitem_improved.

8ac71d62

[SPARK-11050] [MLLIB] PySpark SparseVector can return wrong index in e… · 1ec0a0dc

Bhargav Mangipudi authored 9 years ago

…rror message

For negative indices in the SparseVector, we update the index value. If we have an incorrect index
at this point, the error message has the incorrect *updated* index instead of the original one. This
change contains the fix for the same.
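One way to keep the original index in the error message is to remember it before shifting. The helper below is a hypothetical sketch of that pattern, not the code from this change.

```python
def normalize_index(index, size):
    """Map a possibly negative index into [0, size), reporting the original on error."""
    original = index
    if index < 0:
        index += size
    if not 0 <= index < size:
        # Report the value the caller passed, not the shifted one.
        raise IndexError("index %d out of bounds for size %d" % (original, size))
    return index
```

For example, `normalize_index(-7, 5)` fails with a message that mentions `-7` rather than the shifted value `-2`.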

Author: Bhargav Mangipudi <bhargav.mangipudi@gmail.com>

Closes #9069 from bhargav/spark-10759.

1ec0a0dc

Oct 13, 2015

[PYTHON] [MINOR] List modules in PySpark tests when given bad name · c75f058b

Joseph K. Bradley authored 9 years ago

Output list of supported modules for python tests in error message when given bad module name.
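A small sketch of the behaviour, assuming a hypothetical `select_modules` helper and an illustrative set of module names (the real test runner's names and message may differ):

```python
# Illustrative module names only; treat these as assumptions.
SUPPORTED_MODULES = {"pyspark-core", "pyspark-sql", "pyspark-mllib"}


def select_modules(names):
    """Return the requested modules, or fail with the list of supported ones."""
    unknown = [name for name in names if name not in SUPPORTED_MODULES]
    if unknown:
        raise ValueError(
            "Could not recognize module name(s): %s. Supported modules: [%s]"
            % (", ".join(unknown), ", ".join(sorted(SUPPORTED_MODULES)))
        )
    return list(names)
```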

CC: davies

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9088 from jkbradley/python-tests-modules.

c75f058b

Oct 12, 2015

[SPARK-8170] [PYTHON] Add signal handler to trap Ctrl-C in pyspark and cancel all running jobs · 2e572c41

Ashwin Shankar authored 9 years ago

This patch adds a signal handler to trap Ctrl-C and cancels running job.
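The general pattern can be sketched with the standard `signal` module. The `cancel_all_jobs` callback is a hypothetical placeholder for whatever cancels running jobs, and raising `KeyboardInterrupt` afterwards is an assumption about the desired shell behaviour, not something this commit message confirms.

```python
import signal


def make_sigint_handler(cancel_all_jobs):
    """Build a SIGINT handler that cancels running jobs, then interrupts as usual."""

    def handler(signum, frame):
        cancel_all_jobs()
        raise KeyboardInterrupt()

    return handler


def install(cancel_all_jobs):
    # signal.signal may only be called from the main thread of the main interpreter.
    return signal.signal(signal.SIGINT, make_sigint_handler(cancel_all_jobs))
```

Calling `install(callback)` from the main thread registers the handler and returns the previous one, which makes it possible to restore the old behaviour later.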

Author: Ashwin Shankar <ashankar@netflix.com>

Closes #9033 from ashwinshankar77/master.

2e572c41

Oct 09, 2015

[SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark · c1b4ce43

Vladimir Vladimirov authored 9 years ago

Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark

Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>

Closes #8700 from smartkiwi/SPARK-10535_.

c1b4ce43

[SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with... · 5410747a

Bryan Cutler authored 9 years ago

[SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters

These params were being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training.  Same with StreamingLinearRegressionWithSGD.  I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as regularization value.
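The class of bug described here is easy to reproduce with positional arguments. The sketch below uses a hypothetical `train` function and `StreamingTrainer` class (neither is the PySpark API) to show how named arguments keep each value bound to the parameter it was meant for.

```python
def train(data, iterations=100, step=1.0, reg_param=0.0,
          intercept=False, convergence_tol=0.001):
    """Hypothetical trainer that simply echoes the settings it received."""
    return {"iterations": iterations, "step": step, "reg_param": reg_param,
            "intercept": intercept, "convergence_tol": convergence_tol}


class StreamingTrainer(object):
    """Hypothetical streaming wrapper that stores settings at construction time."""

    def __init__(self, step=0.1, iterations=50, reg_param=0.01, convergence_tol=0.001):
        self.step = step
        self.iterations = iterations
        self.reg_param = reg_param
        self.convergence_tol = convergence_tol

    def update(self, data):
        # Named arguments: every stored setting reaches the matching parameter.
        return train(data, iterations=self.iterations, step=self.step,
                     reg_param=self.reg_param,
                     convergence_tol=self.convergence_tol)
```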

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.

5410747a

Oct 08, 2015

[SPARK-10973] [ML] [PYTHON] __getitem__ method throws IndexError exception when we… · 8e67882b

zero323 authored 9 years ago

__getitem__ method throws IndexError exception when we try to access index after the last non-zero entry

    from pyspark.mllib.linalg import Vectors
    sv = Vectors.sparse(5, {1: 3})
    sv[0]
    ## 0.0
    sv[1]
    ## 3.0
    sv[2]
    ## Traceback (most recent call last):
    ##   File "<stdin>", line 1, in <module>
    ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
    ##     row_ind = inds[insert_index]
    ## IndexError: index out of bounds

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9009 from zero323/sparse_vector_index_error.

8e67882b

Oct 07, 2015

[SPARK-9774] [ML] [PYSPARK] Add python api for ml regression isotonicregression · 3aff0866

Holden Karau authored 9 years ago

Add the Python API for isotonicregression.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.

3aff0866

[SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib) · da936fbb

Evan Chen authored 9 years ago

Provide initialModel param for pyspark.mllib.clustering.KMeans

Author: Evan Chen <chene@us.ibm.com>

Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.

da936fbb

Oct 06, 2015

[SPARK-10957] [ML] setParams changes quantileProbabilities unexpectedly in... · 5e035403

Xiangrui Meng authored 9 years ago

[SPARK-10957] [ML] setParams changes quantileProbabilities unexpectedly in PySpark's AFTSurvivalRegression

If the user doesn't specify `quantileProbs` in `setParams`, it will get reset to the default value. We don't need special handling here. vectorijk yanboliang
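The intended behaviour, where `setParams` touches only what the caller passed, can be sketched as follows. `Estimator`, its default values, and the `_params` dictionary are hypothetical and only illustrate the pattern.

```python
class Estimator(object):
    """Hypothetical estimator illustrating 'only set what the caller passed'."""

    DEFAULTS = {"quantileProbabilities": [0.01, 0.5, 0.99], "maxIter": 100}

    def __init__(self):
        self._params = dict(self.DEFAULTS)

    def setParams(self, **kwargs):
        # Update only explicitly supplied params; everything else keeps its current value.
        for name, value in kwargs.items():
            if name not in self.DEFAULTS:
                raise ValueError("unknown param: %s" % name)
            self._params[name] = value
        return self
```

A later call such as `setParams(maxIter=10)` then leaves a previously set `quantileProbabilities` untouched.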

Author: Xiangrui Meng <meng@databricks.com>

Closes #9001 from mengxr/SPARK-10957.

5e035403

[SPARK-10688] [ML] [PYSPARK] Python API for AFTSurvivalRegression · 5952bdb7

vectorijk authored 9 years ago

Implement Python API for AFTSurvivalRegression

Author: vectorijk <jiangkai@gmail.com>

Closes #8926 from vectorijk/spark-10688.

5952bdb7

Sep 29, 2015

[SPARK-10782] [PYTHON] Update dropDuplicates documentation · c1ad373f

asokadiggs authored 9 years ago

Documentation for dropDuplicates() and drop_duplicates() is one and the same. Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases.

Author: asokadiggs <asoka.diggs@intel.com>

Closes #8930 from asokadiggs/jira-10782.

c1ad373f

[SPARK-6919] [PYSPARK] Add asDict method to StatCounter · 7d399c9d

Erik Shilts authored 9 years ago

Add method to easily convert a StatCounter instance into a Python dict
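To illustrate what such a conversion might look like, here is a simplified sketch. `MiniStatCounter` is a hypothetical stand-in that keeps all values in memory, and the dictionary keys are assumptions rather than the method's documented output.

```python
import math


class MiniStatCounter(object):
    """Hypothetical, simplified stand-in for StatCounter (non-empty input only)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]
        if not self.values:
            raise ValueError("at least one value is required")

    def asDict(self):
        n = len(self.values)
        total = sum(self.values)
        mean = total / n
        variance = sum((v - mean) ** 2 for v in self.values) / n  # population variance
        return {
            "count": n,
            "sum": total,
            "mean": mean,
            "min": min(self.values),
            "max": max(self.values),
            "variance": variance,
            "stdev": math.sqrt(variance),
        }
```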

https://issues.apache.org/jira/browse/SPARK-6919

Note: This is my original work and the existing Spark license applies.

Author: Erik Shilts <erik.shilts@opower.com>

Closes #5516 from eshilts/statcounter-asdict.

7d399c9d

[SPARK-10415] [PYSPARK] [MLLIB] [DOCS] Enhance Navigation Sidebar in PySpark API · ab41864f

noelsmith authored 9 years ago

These are CSS/JavaScript changes to make navigation in the PySpark API a bit simpler by adding the following to the sidebar:

* Classes
* Functions
* Tags to highlight experimental features

![screen shot 2015-09-02 at 08 50 12](https://cloud.githubusercontent.com/assets/11915197/9634781/301f853a-518b-11e5-8d5c-fda202f6202f.png)

Online example here: https://dl.dropboxusercontent.com/u/20821334/pyspark-api-nav-enhance/pyspark.mllib.html

(The contribution is my original work and that I license the work to the project under the project's open source license)

Author: noelsmith <mail@noelsmith.com>

Closes #8571 from noel-smith/pyspark-api-nav-enhance.

ab41864f

Sep 25, 2015

[SPARK-9681] [ML] Support R feature interactions in RFormula · 92233881

Eric Liang authored 9 years ago

This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).

To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8830 from ericl/interaction-2.

92233881

Sep 23, 2015

[SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame. · 99522177

Reynold Xin authored 9 years ago

Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take).

This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.

Author: Reynold Xin <rxin@databricks.com>

Closes #8876 from rxin/SPARK-10731.

99522177

Sep 22, 2015

[SPARK-10446][SQL] Support to specify join type when calling join with usingColumns · 1fcefef0

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-10446

Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8600 from viirya/usingcolumns_df.

1fcefef0

[SPARK-10577] [PYSPARK] DataFrame hint for broadcast join · 0180b849

Jian Feng authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-10577

Author: Jian Feng <jzhang.chs@gmail.com>

Closes #8801 from Jianfeng-chs/master.

0180b849

[SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on... · bf20d6c9

Sean Owen authored 9 years ago

[SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file

Remove ._SUCCESS.crc hidden file that may cause problems in distribution tar archive, and is not used

Author: Sean Owen <sowen@cloudera.com>

Closes #8846 from srowen/SPARK-10716.

bf20d6c9

[SPARK-9821] [PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner · 1cd67415

Holden Karau authored 9 years ago

from the issue:

In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
`weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(),_+_)`
But I can't figure out how to do the same in Python. The closest I can get is to call partitionBy before reduceByKey like so:
`weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: v1+v2).collect()`
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.

1cd67415

Sep 21, 2015

[DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formatting · 7c4f852b

noelsmith authored 9 years ago

Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs. I.e:

![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)

.. looks like this once newline is added:

![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)
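For reference, a docstring laid out this way has a blank line between the description and the first field. The `scale` function below is a made-up example, not code from the PySpark sources.

```python
def scale(vector, factor):
    """
    Multiply every element of a vector by a constant.

    :param vector: Sequence of numbers to scale.
    :param factor: Multiplier applied to each element.
    :return: A new list with the scaled values.
    """
    return [factor * x for x in vector]
```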

Author: noelsmith <mail@noelsmith.com>

Closes #8851 from noel-smith/docstring-missing-newline-fix.

7c4f852b

[SPARK-9769] [ML] [PY] add python api for countvectorizermodel · ba882db6

Holden Karau authored 9 years ago

From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.

ba882db6

[SPARK-10631] [DOCUMENTATION, MLLIB, PYSPARK] Added documentation for few APIs · 01440395

vinodkc authored 9 years ago

There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts.

Author: vinodkc <vinod.kc.in@gmail.com>

Closes #8834 from vinodkc/fix_SPARK-10631.

01440395

Sep 19, 2015

[SPARK-10710] Remove ability to disable spilling in core and SQL · 2117eea7

Josh Rosen authored 9 years ago

It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.

This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.

2117eea7

Sep 18, 2015

[SPARK-10615] [PYSPARK] change assertEquals to assertEqual · 35e8ab93

Yanbo Liang authored 9 years ago

Since `assertEquals` is deprecated, we need to change `assertEquals` to `assertEqual` in the existing Python unit tests.
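A minimal `unittest` case using the non-deprecated spelling might look like this; the `WordCountTest` class and its data are invented for illustration.

```python
import unittest


class WordCountTest(unittest.TestCase):
    def test_count(self):
        counts = {}
        for word in "a b a".split():
            counts[word] = counts.get(word, 0) + 1
        # assertEqual is the supported spelling; assertEquals was a deprecated alias.
        self.assertEqual(counts, {"a": 2, "b": 1})
```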

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8814 from yanboliang/spark-10615.

35e8ab93

Sep 17, 2015

[SPARK-10642] [PYSPARK] Fix crash when calling rdd.lookup() on tuple keys · 136c77d8

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-10642

When calling `rdd.lookup()` on an RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8796 from viirya/fix-pyrdd-lookup.

136c77d8

[SPARK-10282] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.recommendation · 268088b8

Yu ISHIKAWA authored 9 years ago

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8692 from yu-iskw/SPARK-10282.

268088b8

[SPARK-10274] [MLLIB] Add @since annotation to pyspark.mllib.fpm · c74d38fd

Yu ISHIKAWA authored 9 years ago

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8665 from yu-iskw/SPARK-10274.

c74d38fd

[SPARK-10279] [MLLIB] [PYSPARK] [DOCS] Add @since annotation to pyspark.mllib.util · 4a0b56e8

Yu ISHIKAWA authored 9 years ago

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8689 from yu-iskw/SPARK-10279.

4a0b56e8

[SPARK-10278] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.tree · 39b44cb5

Yu ISHIKAWA authored 9 years ago

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8685 from yu-iskw/SPARK-10278.

39b44cb5