- Dec 04, 2015
Burak Yavuz authored
Python tests require access to the `KinesisTestUtils` file. When this file lives under src/test, Python can't access it, since it is not available in the assembly jar. However, if we move KinesisTestUtils to src/main, we need to add the KinesisProducerLibrary (KPL) as a dependency. To avoid that dependency, I moved the KPL-free KinesisTestUtils to src/main and extended it with ExtendedKinesisTestUtils, which lives under src/test and adds support for the KPL. cc zsxwing tdas Author: Burak Yavuz <brkyvz@gmail.com> Closes #10050 from brkyvz/kinesis-py.
-
- Dec 03, 2015
Yanbo Liang authored
Use ```coefficients``` to replace ```weights```; I hope these are the last two. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #10065 from yanboliang/coefficients.
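For context, a minimal sketch of the renamed accessor in `pyspark.ml` (hedged: `training` is an assumed DataFrame of labels and features, not from this commit):

```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(maxIter=10)
model = lr.fit(training)  # `training`: assumed (label, features) DataFrame

# The fitted vector is exposed as `coefficients`; `weights` is the old
# name this change replaces.
print(model.coefficients)
print(model.intercept)
```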
-
- Dec 02, 2015
Davies Liu authored
Author: Davies Liu <davies@databricks.com> Closes #10090 from davies/fix_coalesce.
-
- Dec 01, 2015
jerryshao authored
Fixed a minor race condition in #10017. Closes #10017 Author: jerryshao <sshao@hortonworks.com> Author: Shixiong Zhu <shixiong@databricks.com> Closes #10074 from zsxwing/review-pr10017.
-
- Nov 30, 2015
Shixiong Zhu authored
KinesisStreamTests in test.py is broken because of #9403. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46896/testReport/(root)/KinesisStreamTests/test_kinesis_stream/ Because the streaming Python tests were not working when https://github.com/apache/spark/pull/9403 was merged, the PR build did not actually report the Python test failure. This PR just disables the test to unblock #10039. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10047 from zsxwing/disable-python-kinesis-test.
-
- Nov 26, 2015
Jeff Zhang authored
Author: Jeff Zhang <zjffdu@apache.org> Closes #9903 from zjffdu/SPARK-11917.
-
gatorsmile authored
Added Python test cases for the functions `isnan`, `isnull`, `nanvl` and `json_tuple`, and fixed a bug in `json_tuple`. rxin, could you help me review my changes? Please let me know if anything is missing. Thank you! Have a good Thanksgiving day! Author: gatorsmile <gatorsmile@gmail.com> Closes #9977 from gatorsmile/json_tuple.
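A hedged sketch of how these functions are exercised from Python (assuming an active `sqlContext`; the data is illustrative, not from the commit's tests):

```python
from pyspark.sql import Row
from pyspark.sql.functions import isnan, isnull, nanvl, json_tuple

df = sqlContext.createDataFrame([Row(a=1.0, b=float('nan'))])
# isnan/isnull test for NaN and null; nanvl returns the first column
# unless it is NaN, in which case it returns the second.
df.select(isnan("b"), isnull("a"), nanvl("b", "a")).show()

jdf = sqlContext.createDataFrame([Row(js='{"f1": "v1", "f2": "v2"}')])
# json_tuple extracts the named fields from a JSON string column.
jdf.select(json_tuple("js", "f1", "f2")).show()
```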
-
- Nov 25, 2015
Shixiong Zhu authored
[SPARK-11935][PYSPARK] Send the Python exceptions in TransformFunction and TransformFunctionSerializer to Java. The Python exception traceback in TransformFunction and TransformFunctionSerializer is not sent back to Java; Py4J just throws a very general exception, which is hard to debug. This PR adds a `getFailure` method to get the failure message on the Java side. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9922 from zsxwing/SPARK-11935.
-
Davies Liu authored
Currently we do not have visualization for SQL queries run from Python; this PR fixes that. cc zsxwing Author: Davies Liu <davies@databricks.com> Closes #9949 from davies/pyspark_sql_ui.
-
felixcheung authored
Author: felixcheung <felixcheung_m@hotmail.com> Closes #9967 from felixcheung/pypivotdoc.
-
Jeff Zhang authored
…for registerFunction [Python] A straightforward change to the Python doc. Author: Jeff Zhang <zjffdu@apache.org> Closes #9901 from zjffdu/SPARK-11860.
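For reference, a small hedged example of the `registerFunction` API the doc change covers (assuming an active `sqlContext`; the UDF name is made up):

```python
from pyspark.sql.types import IntegerType

# Registers a Python lambda as a SQL UDF; the return type defaults to
# StringType unless given explicitly.
sqlContext.registerFunction("slen", lambda s: len(s), IntegerType())
sqlContext.sql("SELECT slen('test')").collect()
```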
-
- Nov 24, 2015
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #9948 from rxin/SPARK-10621.
-
Reynold Xin authored
This patch makes all DataFrameReader methods use varargs consistently, including Parquet, JSON, text, and the generic load function. Also added a few more API tests for the Java API. Author: Reynold Xin <rxin@databricks.com> Closes #9945 from rxin/SPARK-11967.
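On the Python side the reader methods accept multiple paths as well; a minimal sketch, assuming an active `sqlContext` and that the files exist (paths are hypothetical):

```python
# Multiple paths can be passed positionally to the reader methods.
df = sqlContext.read.parquet("data/part1.parquet", "data/part2.parquet")
logs = sqlContext.read.text("logs/day1.txt")  # single-path form still works
```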
-
Reynold Xin authored
Currently pivot's signature looks like

```scala
@scala.annotation.varargs
def pivot(pivotColumn: Column, values: Column*): GroupedData

@scala.annotation.varargs
def pivot(pivotColumn: String, values: Any*): GroupedData
```

I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It'd also be more clear if the values are not varargs, but rather Seq or java.util.List. I also made similar changes for Python. Author: Reynold Xin <rxin@databricks.com> Closes #9929 from rxin/SPARK-11946.
-
- Nov 23, 2015
Bryan Cutler authored
[SPARK-10560][PYSPARK][MLLIB][DOCS] Make StreamingLogisticRegressionWithSGD Python API equal to Scala one. This brings the API documentation of StreamingLogisticRegressionWithSGD and StreamingLinearRegressionWithSGD in line with the Scala versions:
- Fixed the algorithm descriptions
- Added default values to parameter descriptions
- Changed StreamingLogisticRegressionWithSGD regParam to default to 0, as in the Scala version

Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9141 from BryanCutler/StreamingLogisticRegressionWithSGD-python-api-sync.
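A hedged sketch of the synced Python API; `training_stream` is an assumed DStream of LabeledPoint, and the keyword values shown are the documented defaults:

```python
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD

# regParam defaults to 0.0 after this change, matching Scala.
model = StreamingLogisticRegressionWithSGD(stepSize=0.1, numIterations=50,
                                           miniBatchFraction=1.0, regParam=0.0)
model.setInitialWeights([0.0, 0.0])
model.trainOn(training_stream)  # `training_stream`: DStream of LabeledPoint
```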
-
Davies Liu authored
They should use the existing SQLContext. Author: Davies Liu <davies@databricks.com> Closes #9914 from davies/create_udf.
-
- Nov 20, 2015
Shixiong Zhu authored
[SPARK-11870][STREAMING][PYSPARK] Rethrow the exceptions in TransformFunction and TransformFunctionSerializer. TransformFunction and TransformFunctionSerializer don't rethrow exceptions, so when any exception happens they just return None. This causes some weird NPEs and confuses people. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9847 from zsxwing/pyspark-streaming-exception.
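An illustrative sketch of the rethrow-instead-of-swallow pattern described above (not the actual TransformFunction code; the class name is hypothetical):

```python
import traceback

class CallbackWrapper(object):
    """Illustrative only: record the failure and rethrow, rather than
    swallowing the exception and returning None."""
    def __init__(self, func):
        self.func = func
        self.failure = None

    def call(self, *args):
        self.failure = None
        try:
            return self.func(*args)
        except Exception:
            self.failure = traceback.format_exc()
            raise  # rethrow so the real error surfaces instead of a later NPE
```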
-
Yanbo Liang authored
* Update doc for PySpark ```HasCheckpointInterval``` so that users can understand how to disable checkpointing.
* Update doc for PySpark ```cacheNodeIds``` of ```DecisionTreeParams``` to note the relationship between ```cacheNodeIds``` and ```checkpointInterval```.

Author: Yanbo Liang <ybliang8@gmail.com> Closes #9856 from yanboliang/spark-11875.
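A hedged example of the two parameters the doc change covers (a checkpoint directory must also be set on the SparkContext for checkpointing to take effect):

```python
from pyspark.ml.classification import DecisionTreeClassifier

# cacheNodeIds=True enables node-ID caching; checkpointInterval controls
# how often the cached IDs are checkpointed, and -1 disables checkpointing.
dt = DecisionTreeClassifier(cacheNodeIds=True, checkpointInterval=10)
```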
-
- Nov 19, 2015
David Tolpin authored
invFunc is optional and can be None. Instead of invFunc (the parameter), invReduceFunc (a local function) was checked for truthiness; a local function object is never None, so the case invFunc=None (a common one when inverse reduction is not defined) was treated incorrectly, resulting in loss of data. In addition, the docstring used wrong parameter names; that is also fixed. Author: David Tolpin <david.tolpin@gmail.com> Closes #9775 from dtolpin/master.
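A hedged usage sketch of the case the fix addresses (`sc` is an assumed SparkContext; host and port are made up):

```python
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 1)
words = ssc.socketTextStream("localhost", 9999).flatMap(lambda l: l.split(" "))
pairs = words.map(lambda w: (w, 1))

# invFunc=None (no inverse function defined) is the case that was mishandled:
# the parameter itself must be compared against None, because the local
# function object the old code tested is always truthy.
counts = pairs.reduceByKeyAndWindow(lambda a, b: a + b, None,
                                    windowDuration=30, slideDuration=10)
counts.pprint()
```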
-
- Nov 18, 2015
Yanbo Liang authored
[SPARK-7685](https://issues.apache.org/jira/browse/SPARK-7685) and [SPARK-9642](https://issues.apache.org/jira/browse/SPARK-9642) have already supported setting the weight column for ```LogisticRegression``` and ```LinearRegression```. It's a very important feature that PySpark should also support. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9811 from yanboliang/spark-11820.
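A minimal sketch of the new parameter, assuming `training` is an existing DataFrame with label, features, and a per-instance weight column (names assumed):

```python
from pyspark.ml.classification import LogisticRegression

# `weightCol` names the column holding instance weights.
lr = LogisticRegression(maxIter=10, weightCol="weight")
model = lr.fit(training)
```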
-
JihongMa authored
Return Double.NaN for mean/average when count == 0 for all numeric types that are converted to Double; the Decimal type continues to return null. Author: JihongMa <linlin200605@gmail.com> Closes #9705 from JihongMA/SPARK-11720.
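A hedged illustration of the behavior (assuming an active `sqlContext` and SparkContext `sc`):

```python
from pyspark.sql.functions import mean
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([StructField("x", DoubleType())])
empty = sqlContext.createDataFrame(sc.parallelize([]), schema)

# With zero rows, mean/average now yields NaN instead of null
# (Decimal columns still return null).
empty.agg(mean("x")).show()
```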
-
Jeff Zhang authored
…ion in PySpark Author: Jeff Zhang <zjffdu@apache.org> Closes #9791 from zjffdu/SPARK-11804.
-
- Nov 17, 2015
jerryshao authored
Fixed the merge conflicts in #7410. Closes #7410 Author: Shixiong Zhu <shixiong@databricks.com> Author: jerryshao <saisai.shao@intel.com> Author: jerryshao <sshao@hortonworks.com> Closes #9742 from zsxwing/pr7410.
-
Shixiong Zhu authored
We will do checkpoint when generating a batch and completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, the checkpoint of an old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9707 from zsxwing/fix-checkpoint.
-
- Nov 16, 2015
Daniel Jalova authored
Author: Daniel Jalova <djalova@us.ibm.com> Closes #9186 from djalova/SPARK-6328.
-
Reynold Xin authored
This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
* `allowComments` (default `false`): ignores Java/C++ style comments in JSON records
* `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
* `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)

To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options. Also updated documentation to explain these options. Author: Reynold Xin <rxin@databricks.com> Closes #9724 from rxin/SPARK-11745.
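A hedged sketch of setting these options from the Python reader (assuming an active `sqlContext`; the file name is made up):

```python
df = (sqlContext.read
      .option("allowComments", "true")
      .option("allowUnquotedFieldNames", "true")
      .option("allowNumericLeadingZeros", "true")
      .json("records.json"))
```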
-
- Nov 13, 2015
Andrew Ray authored
This PR adds pivot to the Python API of GroupedData, with the same syntax as Scala/Java. Author: Andrew Ray <ray.andrew@gmail.com> Closes #9653 from aray/sql-pivot-python.
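A minimal sketch of the Python API, assuming `df` is an existing DataFrame with year, course, and earnings columns:

```python
# Pivot course values into columns, summing earnings per year. Listing the
# values explicitly avoids an extra pass to compute the distinct values.
df.groupBy("year").pivot("course", ["dotNET", "Java"]).sum("earnings").show()
```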
-
Shixiong Zhu authored
This PR just checks the test results and returns 1 if a test fails, so that `run-tests.py` can mark it as failed. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9669 from zsxwing/streaming-python-tests.
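The essence of the change is the standard exit-code pattern; a hedged sketch (`suite` is an assumed unittest.TestSuite, not the actual test harness code):

```python
import sys
import unittest

result = unittest.TextTestRunner().run(suite)  # `suite`: assumed TestSuite
if not result.wasSuccessful():
    sys.exit(1)  # non-zero exit lets run-tests.py mark the run as failed
```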
-
- Nov 12, 2015
Chris Snow authored
Author: Chris Snow <chsnow123@gmail.com> Closes #9640 from snowch/patch-3.
-
Chris Snow authored
The example for sqlContext.createDataFrame from a pandas.DataFrame has a typo. Author: Chris Snow <chsnow123@gmail.com> Closes #9639 from snowch/patch-2.
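For reference, the pattern the example documents (a hedged sketch, assuming pandas is installed and `sqlContext` is active; the data is made up):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0]})
# Column names and types are inferred from the pandas DataFrame.
df = sqlContext.createDataFrame(pdf)
df.show()
```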
-
JihongMa authored
switched stddev support from DeclarativeAggregate to ImperativeAggregate. Author: JihongMa <linlin200605@gmail.com> Closes #9380 from JihongMA/SPARK-11420.
-
- Nov 11, 2015
Davies Liu authored
Only install the signal handler in the main thread; otherwise it fails when creating a context in a non-main thread. Author: Davies Liu <davies@databricks.com> Closes #9574 from davies/python_signal.
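A sketch of the guard described above (illustrative; the exact check in the patch may differ, and the function name is hypothetical):

```python
import signal
import threading

def maybe_install_sigint_handler(handler):
    # Installing a signal handler from a non-main thread raises ValueError,
    # so only install it when running on the main thread.
    if isinstance(threading.current_thread(), threading._MainThread):
        signal.signal(signal.SIGINT, handler)
```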
-
- Nov 10, 2015
Yu ISHIKAWA authored
cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9534 from yu-iskw/SPARK-11566.
-
felixcheung authored
like `df.agg(corr("col1", "col2"))` davies Author: felixcheung <felixcheung_m@hotmail.com> Closes #9536 from felixcheung/pyfunc.
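A minimal sketch of the new function (assuming `df` is an existing DataFrame with numeric col1/col2 columns):

```python
from pyspark.sql.functions import corr

# Pearson correlation as an aggregate expression.
df.agg(corr("col1", "col2").alias("c")).collect()
```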
-
Yin Huai authored
[SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s https://issues.apache.org/jira/browse/SPARK-9830 This PR contains the following main changes:
* Removing `AggregateExpression1`.
* Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
* Removing planner rule used to plan `Aggregate`.
* Linking `MultipleDistinctRewriter` to analyzer.
* Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
* Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
* Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).

Author: Yin Huai <yhuai@databricks.com> Closes #9556 from yhuai/removeAgg1.
-
- Nov 09, 2015
Yu ISHIKAWA authored
cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9577 from yu-iskw/SPARK-11610.
-
Nick Buroojy authored
For now they are thin wrappers around the corresponding Hive UDAFs. One limitation with these in Hive 0.13.0 is that they only support aggregating primitive types. I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word functions. Do we also want to add these to `functions.py`? This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089 marmbrus rxin Author: Nick Buroojy <nick.buroojy@civitaslearning.com> Closes #9526 from nburoojy/nick/udaf-alias. (cherry picked from commit a6ee4f98) Signed-off-by: Michael Armbrust <michael@databricks.com>
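The message doesn't name the functions; assuming this is the PR that added the `collect_list`/`collect_set` wrappers, a hedged sketch of the Python side (`df` is an assumed DataFrame with key/value columns):

```python
from pyspark.sql.functions import collect_list, collect_set

# Thin wrappers over the Hive UDAFs: gather grouped values into a list/set.
df.groupBy("key").agg(collect_list("value"), collect_set("value")).show()
```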
-
Yu ISHIKAWA authored
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8690 from yu-iskw/SPARK-10280.
-
- Nov 07, 2015
Yu ISHIKAWA authored
Could jkbradley and davies review it?
- Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from pyspark directly; `Array[(Array[Int], Array[Double])]` is too complicated to convert.
- Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.

[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8643 from yu-iskw/SPARK-8467-2.
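A hedged sketch of the Python API this enables (`corpus` is an assumed RDD of [documentId, termCountsVector] pairs):

```python
from pyspark.mllib.clustering import LDA

model = LDA.train(corpus, k=3)
# describeTopics returns, per topic, a pair of (term indices, term weights).
for term_ids, weights in model.describeTopics(maxTermsPerTopic=10):
    print(term_ids, weights)
```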
-
- Nov 06, 2015
Michael Armbrust authored
#9527 missed updating the python tests. Author: Michael Armbrust <michael@databricks.com> Closes #9533 from marmbrus/hotfixTextValue.
-