  1. Mar 04, 2016
    • [SPARK-13676] Fix mismatched default values for regParam in LogisticRegression · c8f25459
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      The default value of the regularization parameter for the `LogisticRegression` algorithm differs between Scala and Python. We should provide the same value.
      
      **Scala**
      ```
      scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
      res0: Double = 0.0
      ```
      
      **Python**
      ```
      >>> from pyspark.ml.classification import LogisticRegression
      >>> LogisticRegression().getRegParam()
      0.1
      ```
      
      ## How was this patch tested?
      Manual. Check the following in `pyspark`.
      ```
      >>> from pyspark.ml.classification import LogisticRegression
      >>> LogisticRegression().getRegParam()
      0.0
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11519 from dongjoon-hyun/SPARK-13676.
  2. Mar 03, 2016
    • [SPARK-13647] [SQL] also check if numeric value is within allowed range in _verify_type · 15d57f9c
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR makes `_verify_type` in `types.py` stricter: it now also checks whether numeric values are within the allowed range for their declared type.
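      
      For illustration, a minimal sketch of the kind of range check this adds (hypothetical code, not the actual `types.py` implementation; the ranges assume signed fixed-width integer types):
      ```
      # Illustrative sketch only -- not the actual Spark types.py code.
      # A numeric value must fit the declared type's width, e.g. ByteType
      # in a signed byte and ShortType in a signed short.
      _NUMERIC_RANGES = {
          "byte": (-128, 127),
          "short": (-32768, 32767),
          "int": (-2147483648, 2147483647),
      }

      def _check_numeric_range(value, type_name):
          lo, hi = _NUMERIC_RANGES[type_name]
          if not lo <= value <= hi:
              raise ValueError("%r is out of range for %s" % (value, type_name))

      _check_numeric_range(100, "byte")     # passes
      # _check_numeric_range(1000, "byte")  # would raise ValueError
      ```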
      
      ## How was this patch tested?
      
      Newly added doctest.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11492 from cloud-fan/py-verify.
    • [MINOR] Fix typos in comments and testcase name of code · 941b270b
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes typos in comments and testcase name of code.
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
    • [SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option() · cf95d728
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for specifying compression codecs for both ORC and Parquet.
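      
      For example, usage along these lines (the DataFrame `df` and the paths are illustrative, and the available codec names depend on the format):
      ```
      # Write Parquet with snappy and ORC with zlib compression.
      df.write.format("parquet").option("compression", "snappy").save("/tmp/out_parquet")
      df.write.format("orc").option("compression", "zlib").save("/tmp/out_orc")
      ```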
      
      ## How was this patch tested?
      
      Unit tests within the IDE and code style tests with `dev/run_tests`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11464 from HyukjinKwon/SPARK-13543.
    • [SPARK-12877][ML] Add train-validation-split to pyspark · 511d4929
      JeremyNixon authored
      ## What changes were proposed in this pull request?
      The changes proposed were to add train-validation-split to pyspark.ml.tuning.
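      
      A sketch of the expected usage (the training DataFrame `train_df` here is hypothetical):
      ```
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.evaluation import BinaryClassificationEvaluator
      from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

      lr = LogisticRegression()
      grid = ParamGridBuilder().addGrid(lr.regParam, [0.0, 0.1]).build()
      tvs = TrainValidationSplit(estimator=lr,
                                 estimatorParamMaps=grid,
                                 evaluator=BinaryClassificationEvaluator(),
                                 trainRatio=0.75)
      model = tvs.fit(train_df)  # train_df: a hypothetical labeled DataFrame
      ```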
      
      ## How was this patch tested?
      This patch was tested through unit tests located in pyspark/ml/tests.py.
      
      This is my original work and I license it to Spark.
      
      Author: JeremyNixon <jnixon2@gmail.com>
      
      Closes #11335 from JeremyNixon/tvs_pyspark.
  3. Mar 01, 2016
    • [SPARK-13008][ML][PYTHON] Put one alg per line in pyspark.ml all lists · 9495c40f
      Joseph K. Bradley authored
      This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top.  Since we keep it alphabetized, it often creates a lot more changes than needed.  It is also easy to add the Estimator and forget the Model.  I'm going to switch it to have one algorithm per line.
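      
      Illustratively, the ```__all__``` list changes to one name per line, keeping each Estimator and its Model adjacent (class names here are just examples):
      ```
      __all__ = [
          "LogisticRegression",
          "LogisticRegressionModel",
          "NaiveBayes",
          "NaiveBayesModel",
      ]
      ```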
      
      This also alphabetizes a few out-of-place classes in pyspark.ml.feature.  No changes have been made to the moved classes.
      
      CC: thunterdb
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #10927 from jkbradley/ml-python-all-list.
  4. Feb 29, 2016
    • [SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call · 02aa499d
      hyukjinkwon authored
      https://issues.apache.org/jira/browse/SPARK-13507
      https://issues.apache.org/jira/browse/SPARK-13509
      
      ## What changes were proposed in this pull request?
      This PR adds support for writing CSV data directly to a given path with a single function call.
      
      Several unit tests were added for each functionality.
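      
      For example, writing along these lines (the DataFrame `df` and the path are illustrative):
      ```
      df.write.csv("/tmp/out")
      # equivalent to: df.write.format("csv").save("/tmp/out")
      ```
      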
      ## How was this patch tested?
      
      This was tested with unit tests and with `dev/run_tests` for coding style.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      
      Closes #11389 from HyukjinKwon/SPARK-13507-13509.
    • [SPARK-12633][PYSPARK] [DOC] PySpark regression parameter desc to consistent format · 236e3c8f
      vijaykiran authored
      Part of the task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the regression module. Also updated two params in classification to read as `Supported values:` for consistency.
      
      closes #10600
      
      Author: vijaykiran <mail@vijaykiran.com>
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #11404 from BryanCutler/param-desc-consistent-regression-SPARK-12633.
    • [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default parameters consistent in Scala and Python · d81a7135
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      * The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala, which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.)
      * BTW, if we use a known updater (L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarify that ```numCorrections``` will have no effect if we fall into that route.
      * Made a pass over all parameters of ```LogisticRegressionWithLBFGS```; the others are set properly.
      
      cc mengxr dbtsai
      ## How was this patch tested?
      No new tests; it should pass all current tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11424 from yanboliang/spark-13545.
  5. Feb 25, 2016
    • [SPARK-13033] [ML] [PYSPARK] Add import/export for ml.regression · f3be369e
      Tommy YU authored
      Add export/import for all estimators and transformers (which have a Scala implementation) under pyspark/ml/regression.py.
      
      yanboliang Please help to review.
      For doctests, I thought it was enough to add one since it covers the common usage, but I can add them to all if we want.
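      
      A sketch of the resulting usage, assuming the save/load pattern used elsewhere in pyspark.ml (the paths are hypothetical):
      ```
      from pyspark.ml.regression import LinearRegression

      lr = LinearRegression(maxIter=5)
      lr.save("/tmp/lr_estimator")                      # export
      lr2 = LinearRegression.load("/tmp/lr_estimator")  # import
      ```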
      
      Author: Tommy YU <tummyyu@163.com>
      
      Closes #11000 from Wenpei/spark-13033-ml.regression-exprot-import and squashes the following commits:
      
      3646b36 [Tommy YU] address review comments
      9cddc98 [Tommy YU] change base on review and pr 11197
      cc61d9d [Tommy YU] remove default parameter set
      19535d4 [Tommy YU] add export/import to regression
      44a9dc2 [Tommy YU] add import/export for ml.regression
    • [SPARK-13292] [ML] [PYTHON] QuantileDiscretizer should take random seed in PySpark · 35316cb0
      Yu ISHIKAWA authored
      ## What changes were proposed in this pull request?
      QuantileDiscretizer in Python should also accept a random seed.
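      
      For example (input/output column names are illustrative):
      ```
      from pyspark.ml.feature import QuantileDiscretizer

      qd = QuantileDiscretizer(numBuckets=2, inputCol="x",
                               outputCol="bucket", seed=42)
      ```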
      
      ## How was this patch tested?
      unit tests
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #11362 from yu-iskw/SPARK-13292 and squashes the following commits:
      
      02ffa76 [Yu ISHIKAWA] [SPARK-13292][ML][PYTHON] QuantileDiscretizer should take random seed in PySpark
    • [SPARK-7106][MLLIB][PYSPARK] Support model save/load in Python's FPGrowth · 4d2864b2
      Kai Jiang authored
      ## What changes were proposed in this pull request?
      
      The Python API supports model save/load in FPGrowth.
      JIRA: [https://issues.apache.org/jira/browse/SPARK-7106](https://issues.apache.org/jira/browse/SPARK-7106)
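      
      A sketch of the expected usage, following the usual MLlib save/load pattern (`sc` and the `transactions` RDD are assumed to exist):
      ```
      from pyspark.mllib.fpm import FPGrowth, FPGrowthModel

      model = FPGrowth.train(transactions, minSupport=0.3)
      model.save(sc, "/tmp/fpm_model")
      same_model = FPGrowthModel.load(sc, "/tmp/fpm_model")
      ```
      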
      ## How was this patch tested?
      
      The patch is tested with Python doctest.
      
      Author: Kai Jiang <jiangkai@gmail.com>
      
      Closes #11321 from vectorijk/spark-7106.
    • [SPARK-13479][SQL][PYTHON] Added Python API for approxQuantile · 13ce10e9
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      * Scala DataFrameStatFunctions: Added a version of approxQuantile taking a List instead of an Array, for Python compatibility
      * Python DataFrame and DataFrameStatFunctions: Added approxQuantile
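      
      For example, the Python API can be called along these lines (the DataFrame `df` and the column name are illustrative):
      ```
      # approximate 25th/50th/75th percentiles of column "value",
      # with a relative error of 0.05
      quantiles = df.approxQuantile("value", [0.25, 0.5, 0.75], 0.05)
      ```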
      
      ## How was this patch tested?
      
      * unit test in sql/tests.py
      
      Documentation was copied from the existing approxQuantile exactly.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #11356 from jkbradley/approx-quantile-python.
  6. Feb 24, 2016
    • [SPARK-13250] [SQL] Update PhysicalRDD to convert to UnsafeRow if using the vectorized scanner. · 5a7af9e7
      Nong Li authored
      Some parts of the engine rely on UnsafeRow, which the vectorized Parquet scanner does not want
      to produce. This adds a conversion in PhysicalRDD. In the case where codegen is used (and the
      scan is the start of the pipeline), there is no requirement to use UnsafeRow. This patch updates
      PhysicalRDD to support codegen, which eliminates the need for the UnsafeRow conversion
      in all cases.
      
      These changes reduce the query time for TPC-DS Q19 at the 10 GB scale factor from 9.5 seconds
      to 6.5 seconds.
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11141 from nongli/spark-13250.
    • [SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code · a60f9128
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we pass a Python function to the JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters in many places. This PR abstracts the Python function along with its context, to simplify some PySpark code and make the logic clearer.
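      
      A rough sketch of the idea in Python (illustrative only; the actual abstraction lives on the JVM side, and any field beyond those mentioned above is an assumption):
      ```
      from collections import namedtuple

      # Bundle the function with its context instead of threading each
      # piece through every call site.
      PythonFunction = namedtuple("PythonFunction", [
          "command",         # the serialized function (assumed field)
          "envVars",
          "pythonIncludes",
          "pythonExec",
      ])
      ```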
      
      ## How was this patch tested?
      
      By existing unit tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11342 from cloud-fan/python-clean.
  7. Feb 23, 2016
    • [SPARK-13329] [SQL] considering output for statistics of logical plan · c481bdf5
      Davies Liu authored
      The current implementation of statistics for UnaryNode does not consider the output (for example, Project may produce far fewer columns than its child); we should consider it to get a better estimate.
      
      We usually join with only a few columns from a Parquet table, so the size of the projected plan can be much smaller than the original Parquet files. Having a better size estimate helps us choose between a broadcast join and a sort merge join.
      
      After this PR, I saw a few queries choose broadcast join rather than sort merge join without tuning spark.sql.autoBroadcastJoinThreshold for every query, ending up with about 6-8X improvements in end-to-end time.
      
      We use the `defaultSize` of a DataType to estimate the size of a column. Currently, for DecimalType/StringType/BinaryType and UDTs, we over-estimate by far too much (4096 bytes), so this PR changes them to more reasonable values. Here are the new defaultSize values:
      
      DecimalType: 8 or 16 bytes, based on the precision
      StringType: 20 bytes
      BinaryType: 100 bytes
      UDT: default size of the underlying SQL type
      
      These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096.
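      
      As a rough illustration of the impact (simple arithmetic based on the numbers above, not Spark code):
      ```
      # Estimated width of a (DecimalType, StringType, BinaryType) row:
      old_estimate = 4096 * 3       # 12288 bytes with the old defaults
      new_estimate = 16 + 20 + 100  # 136 bytes with the new defaults
      ```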
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11210 from davies/statics.
    • [SPARK-13429][MLLIB] Unify Logistic Regression convergence tolerance of ML & MLlib · 72427c3e
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      In order to provide better and more consistent results, let's change the default value of MLlib ```LogisticRegressionWithLBFGS```'s ```convergenceTol``` from ```1E-4``` to ```1E-6```, which will be equal to ML ```LogisticRegression```.
      cc dbtsai
      ## How was this patch tested?
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11299 from yanboliang/spark-13429.
  8. Feb 21, 2016
    • [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns. · 0f90f4e6
      Franklyn D'souza authored
      ## What changes were proposed in this pull request?
      
      This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.
      
      This was previously causing ``AnalysisException: u"unresolved operator 'Union;"`` when trying to unionAll two DataFrames with UDT columns, as below.
      
      ```
      from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
      from pyspark.sql import types
      
      schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])
      
      a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
      b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)
      
      c = a.unionAll(b)
      ```
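      
      A minimal sketch of the equality idea (illustrative, not the actual patch): two UDT instances describe the same data type when they are instances of the same class.
      ```
      class MyUDT(object):  # hypothetical UDT
          def __eq__(self, other):
              return isinstance(other, self.__class__)

          def __ne__(self, other):
              return not self.__eq__(other)
      ```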
      
      ## How was this patch tested?
      
      Tested using two unit tests in sql/tests.py and the DataFrameSuite.
      
      Additional information here : https://issues.apache.org/jira/browse/SPARK-13410
      
      Author: Franklyn D'souza <franklynd@gmail.com>
      
      Closes #11279 from damnMeddlingKid/udt-union-all.
    • [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
      2. Using SQL-like representations as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
      
         Before, we were using `prettyString` for column names when possible, and sometimes the resulting column names could be weird.  Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      3. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
  9. Feb 16, 2016
    • Correct SparseVector.parse documentation · 827ed1c0
      Miles Yucht authored
      There's a small typo in the SparseVector.parse docstring: it says that the method returns a DenseVector rather than a SparseVector, which is incorrect.
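      
      For reference, a usage sketch (the string format shown is illustrative):
      ```
      from pyspark.mllib.linalg import SparseVector

      sv = SparseVector.parse("(4, [0, 1], [2.0, 3.0])")
      # sv is a SparseVector, not a DenseVector
      ```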
      
      Author: Miles Yucht <miles@databricks.com>
      
      Closes #11213 from mgyucht/fix-sparsevector-docs.
  10. Feb 13, 2016
    • [SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions. · 354d4c24
      Reynold Xin authored
      This pull request has the following changes:
      
      1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
      
      2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
      
      3. Moved everything in execution/python.scala into the newly created execution.python package.
      
      Most of the diffs are just straight copy-paste.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11181 from rxin/SPARK-13296.
    • [SPARK-12363][MLLIB] Remove setRuns and fix PowerIterationClustering failed test · e3441e3f
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-12363
      
      This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #10539 from viirya/fix-poweriter.