  1. May 24, 2017
  2. May 10, 2017
    • [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. · 92a71a66
      Josh Rosen authored
      
      ## What changes were proposed in this pull request?
      
      There's a latent corner-case bug in PySpark UDF evaluation where executing a `BatchPythonEvaluation` with a single multi-argument UDF where _at least one argument value is repeated_ will crash at execution with a confusing error.
      
      This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python).
      
      The fix here is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch just so happens to work for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple.
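
      A minimal repro sketch of the failing shape (illustrative only; assumes a local SparkSession and a toy column name):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import col, udf
      from pyspark.sql.types import IntegerType

      spark = SparkSession.builder.master("local[1]").getOrCreate()
      df = spark.createDataFrame([(1,), (2,)], ["x"])

      # A single UDF whose two arguments are the same column: the JVM de-duplicates the
      # repeated input, so the row sent to Python has fewer fields than the UDF expects.
      add = udf(lambda a, b: a + b, IntegerType())
      df.select(add(col("x"), col("x"))).collect()  # crashed before this fix
      ```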
      
      ## How was this patch tested?
      
      New regression test in the `pyspark.sql.tests` module (tested and confirmed that it fails before my fix).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17927 from JoshRosen/SPARK-20685.
      
      (cherry picked from commit 8ddbc431)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      92a71a66
    • [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should... · 69786ea3
      zero323 authored
      [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency should use values not Params
      
      ## What changes were proposed in this pull request?
      
      - Replace `getParam` calls with `getOrDefault` calls (see the sketch after this list).
      - Fix the exception message to avoid an unintended `TypeError`.
      - Add unit tests.
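
      A hedged sketch of the distinction driving the first item (estimator and values are illustrative):

      ```python
      from pyspark.ml.classification import LogisticRegression

      lr = LogisticRegression(threshold=0.5)

      lr.getParam("threshold")      # returns the Param descriptor itself, not a number
      lr.getOrDefault("threshold")  # returns the value 0.5 (or the default if unset)
      # The consistency check needs the values, not the Param descriptors.
      ```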
      
      ## How was this patch tested?
      
      New unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17891 from zero323/SPARK-20631.
      
      (cherry picked from commit 804949c6)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      69786ea3
  3. Apr 25, 2017
  4. Apr 14, 2017
  5. Apr 10, 2017
    • [SPARK-20285][TESTS] Increase the pyspark streaming test timeout to 30 seconds · 489c1f35
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Saw the following failure locally:
      
      ```
      Traceback (most recent call last):
        File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 351, in test_cogroup
          self._test_func(input, func, expected, sort=True, input2=input2)
        File "/home/jenkins/workspace/python/pyspark/streaming/tests.py", line 162, in _test_func
          self.assertEqual(expected, result)
      AssertionError: Lists differ: [[(1, ([1], [2])), (2, ([1], [... != []
      
      First list contains 3 additional elements.
      First extra element 0:
      [(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))]
      
      + []
      - [[(1, ([1], [2])), (2, ([1], [])), (3, ([1], []))],
      -  [(1, ([1, 1, 1], [])), (2, ([1], [])), (4, ([], [1]))],
      -  [('', ([1, 1], [1, 2])), ('a', ([1, 1], [1, 1])), ('b', ([1], [1]))]]
      ```
      
      It also happened on Jenkins: http://spark-tests.appspot.com/builds/spark-branch-2.1-test-sbt-hadoop-2.7/120
      
      
      
      This happens because, when the machine is overloaded, the timeout is not long enough. This PR just increases the timeout to 30 seconds.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17597 from zsxwing/SPARK-20285.
      
      (cherry picked from commit f9a50ba2)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      489c1f35
  6. Apr 05, 2017
    • [SPARK-20214][ML] Make sure converted csc matrix has sorted indices · fb81a412
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      `_convert_to_vector` converts a scipy sparse matrix to a csc matrix for initializing `SparseVector`. However, it doesn't guarantee the converted csc matrix has sorted indices, so a failure happens when you do something like this:
      
          from scipy.sparse import lil_matrix
          lil = lil_matrix((4, 1))
          lil[1, 0] = 1
          lil[3, 0] = 2
          _convert_to_vector(lil.todok())
      
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 78, in _convert_to_vector
            return SparseVector(l.shape[0], csc.indices, csc.data)
          File "/home/jenkins/workspace/python/pyspark/mllib/linalg/__init__.py", line 556, in __init__
            % (self.indices[i], self.indices[i + 1]))
          TypeError: Indices 3 and 1 are not strictly increasing
      
      A simple test can confirm that `dok_matrix.tocsc()` won't guarantee sorted indices:
      
          >>> from scipy.sparse import lil_matrix
          >>> lil = lil_matrix((4, 1))
          >>> lil[1, 0] = 1
          >>> lil[3, 0] = 2
          >>> dok = lil.todok()
          >>> csc = dok.tocsc()
          >>> csc.has_sorted_indices
          0
          >>> csc.indices
          array([3, 1], dtype=int32)
      
      I checked the scipy source code. The only way to guarantee sorted indices is to convert via `csc_matrix.tocsr()` and `csr_matrix.tocsc()`.
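
      For illustration, a hedged scipy-only sketch of that round-trip (no Spark required):

      ```python
      from scipy.sparse import lil_matrix

      lil = lil_matrix((4, 1))
      lil[1, 0] = 1
      lil[3, 0] = 2

      csc = lil.todok().tocsc()
      csc.has_sorted_indices            # may be 0, as shown above

      csc_sorted = csc.tocsr().tocsc()  # the round-trip described in this message
      csc_sorted.has_sorted_indices     # 1
      csc_sorted.indices                # array([1, 3], dtype=int32)
      ```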
      
      ## How was this patch tested?
      
      Existing tests.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #17532 from viirya/make-sure-sorted-indices.
      
      (cherry picked from commit 12206058)
      Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
      fb81a412
  7. Mar 28, 2017
  8. Mar 27, 2017
    • [SPARK-20102] Fix nightly packaging and RC packaging scripts w/ two minor build fixes · 4056191d
      Josh Rosen authored
      
      ## What changes were proposed in this pull request?
      
      The master snapshot publisher builds are currently broken due to two minor build issues:
      
      1. For unknown reasons, the LFTP `mkdir -p` command began throwing errors when the remote directory already exists. This change of behavior might have been caused by configuration changes in the ASF's SFTP server, but I'm not entirely sure of that. To work around this problem, this patch updates the script to ignore errors from the `lftp mkdir -p` commands.
      2. The PySpark `setup.py` file references a non-existent `pyspark.ml.stat` module, causing Python packaging to fail by complaining about a missing directory. The fix is to simply drop that line from the setup script.
      
      ## How was this patch tested?
      
      The LFTP fix was tested by manually running the failing commands on AMPLab Jenkins against the ASF SFTP server. The PySpark fix was tested locally.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17437 from JoshRosen/spark-20102.
      
      (cherry picked from commit 314cf51d)
      Signed-off-by: Josh Rosen <joshrosen@databricks.com>
      4056191d
  9. Mar 21, 2017
  10. Mar 17, 2017
    • [SPARK-19986][TESTS] Make pyspark.streaming.tests.CheckpointTests more stable · 5fb70831
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Sometimes, CheckpointTests will hang on a busy machine because the streaming jobs are too slow and cannot catch up. I observed locally that the scheduling delay kept increasing for dozens of seconds.
      
      This PR increases the batch interval from 0.5 seconds to 2 seconds to generate fewer Spark jobs. It should make `pyspark.streaming.tests.CheckpointTests` more stable. I also replaced `sleep` with `awaitTerminationOrTimeout` so that if the streaming job fails, it will also fail the test.
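
      A hedged sketch of that swap (`sc` is an existing SparkContext; the 30-second figure is illustrative):

      ```python
      from pyspark.streaming import StreamingContext

      ssc = StreamingContext(sc, batchDuration=2)  # 2-second batches instead of 0.5
      # ... build the DStreams under test ...
      ssc.start()
      # Unlike time.sleep(30), this returns as soon as the streaming context terminates,
      # so a failed job surfaces its error instead of being slept over.
      ssc.awaitTerminationOrTimeout(30)
      ```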
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #17323 from zsxwing/SPARK-19986.
      
      (cherry picked from commit 376d7821)
      Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      5fb70831
  11. Mar 15, 2017
    • [SPARK-19872] [PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition · 06225463
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to use the correct deserializer, `BatchedSerializer`, for RDD construction for coalesce/repartition when the shuffle is enabled. Currently, the `UTF8Deserializer` from the copied RDD is passed as-is instead of a `BatchedSerializer`.
      
      With the file `text.txt` below:
      
      ```
      a
      b
      
      d
      e
      f
      g
      h
      i
      j
      k
      l
      
      ```
      
      - Before
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      UTF8Deserializer(True)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/rdd.py", line 811, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
          yield self.loads(stream)
        File ".../spark/python/pyspark/serializers.py", line 544, in loads
          return s.decode("utf-8") if self.use_unicode else s
        File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
          return codecs.utf_8_decode(input, errors, True)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
      ```
      
      - After
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      [u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
      ```
      
      ## How was this patch tested?
      
      Unit test in `python/pyspark/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17282 from HyukjinKwon/SPARK-19872.
      
      (cherry picked from commit 7387126f)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      06225463
  12. Mar 09, 2017
    • [SPARK-19561][SQL] add int case handling for TimestampType · 2a76e242
      Jason White authored
      ## What changes were proposed in this pull request?
      
      Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes ints smaller than MIN_INT or larger than MAX_INT to Long, which are handled correctly already, but values between MIN_INT and MAX_INT are serialized to Int.
      
      These range limits correspond to roughly half an hour on either side of the epoch. As a result, PySpark doesn't allow TimestampType values to be created in this range.
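
      A hedged sketch of a value in that window (assuming a UTC machine timezone; `spark` is an active SparkSession):

      ```python
      import datetime

      # ~10 minutes after the epoch: the internal value (600 * 10**6 microseconds)
      # fits in a 32-bit int, so Py4J sends it to the JVM as an Int rather than a Long.
      ts = datetime.datetime(1970, 1, 1, 0, 10, 0)
      df = spark.createDataFrame([(ts,)], ["t"])
      df.collect()  # failed before this change; works once the Int case is handled
      ```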
      
      Alternatives attempted: patching the `TimestampType.toInternal` function to cast return values to `long`, so Py4J would always serialize them to Scala Long. Python3 does not have a `long` type, so this approach failed on Python3.
      
      ## How was this patch tested?
      
      Added a new PySpark-side test that fails without the change.
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Resubmission of https://github.com/apache/spark/pull/16896. The original PR didn't go through Jenkins and broke the build.

      davies dongjoon-hyun cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #17200 from JasonMWhite/SPARK-19561.
      
      (cherry picked from commit 206030bd)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      2a76e242
  13. Mar 07, 2017
    • [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe · 0ba9ecbe
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
      
      This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
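
      A hedged sketch of the pattern (simplified; not the exact Spark source):

      ```python
      import functools

      def keyword_only(func):
          """Force keyword arguments and stash them on the instance, not the class."""
          @functools.wraps(func)
          def wrapper(self, *args, **kwargs):
              if args:
                  raise TypeError("Method %s only takes keyword arguments." % func.__name__)
              # Per-instance storage: constructing different objects from different
              # threads no longer races on a shared class-level variable.
              self._input_kwargs = kwargs
              return func(self, **kwargs)
          return wrapper
      ```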
      
      ## How was this patch tested?
      Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17193 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348-2_1.
      0ba9ecbe
    • Wenchen Fan · cbc37007
    • [SPARK-19561] [PYTHON] cast TimestampType.toInternal output to long · 711addd4
      Jason White authored
      
      ## What changes were proposed in this pull request?
      
      Cast the output of `TimestampType.toInternal` to long to allow for proper Timestamp creation in DataFrames near the epoch.
      
      ## How was this patch tested?
      
      Added a new test that fails without the change.
      
      dongjoon-hyun davies Mind taking a look?
      
      The contribution is my original work and I license the work to the project under the project’s open source license.
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #16896 from JasonMWhite/SPARK-19561.
      
      (cherry picked from commit 6f468462)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      711addd4
  14. Feb 25, 2017
    • [SPARK-14772][PYTHON][ML] Fixed Params.copy method to match Scala implementation · 20a43295
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Fixed the PySpark Params.copy method to behave like the Scala implementation.  The main issue was that it did not account for the _defaultParamMap and merged it into the explicitly created param map.
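
      A hedged usage sketch of the expected behavior (estimator and values are illustrative):

      ```python
      from pyspark.ml.classification import LogisticRegression

      lr = LogisticRegression(maxIter=5)        # maxIter is explicitly set
      lr2 = lr.copy({lr.threshold: 0.8})        # extra params apply only to the copy

      assert lr2.uid == lr.uid                  # uid is preserved
      assert lr2.getOrDefault("maxIter") == 5   # explicitly set params are copied
      assert not lr2.isSet(lr2.regParam)        # defaults are not folded into the explicit param map
      ```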
      
      ## How was this patch tested?
      Added new unit test to verify the copy method behaves correctly for copying uid, explicitly created params, and default params.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17048 from BryanCutler/pyspark-ml-param_copy-Scala_sync-SPARK-14772-2_1.
      20a43295
  15. Feb 15, 2017
  16. Feb 13, 2017
  17. Jan 25, 2017
  18. Jan 20, 2017
    • Davies Liu's avatar
      [SPARK-18589][SQL] Fix Python UDF accessing attributes from both side of join · 4d286c90
      Davies Liu authored
      PythonUDF is unevaluable, so it cannot be used inside a join condition. Currently the optimizer will push a PythonUDF that accesses both sides of a join into the join condition, and the query will then fail to plan.
      
      This PR fixes the issue by checking whether the expression is evaluable before pushing it into the join condition.
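
      A hedged repro sketch of the failing shape (toy tables; `spark` is an active SparkSession):

      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import BooleanType

      left = spark.createDataFrame([(1, 2)], ["id", "a"])
      right = spark.createDataFrame([(1, 3)], ["id", "b"])

      # A Python UDF whose predicate needs columns from both sides of the join.
      both_sides = udf(lambda a, b: a < b, BooleanType())
      left.join(right, "id").where(both_sides(left.a, right.b)).collect()  # failed to plan before this fix
      ```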
      
      Add a regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #16581 from davies/pyudf_join.
      4d286c90
  19. Jan 17, 2017
    • [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port... · 2ff36691
      hyukjinkwon authored
      [SPARK-19019] [PYTHON] Fix hijacked `collections.namedtuple` and port cloudpickle changes for PySpark to work with Python 3.6.0
      
      ## What changes were proposed in this pull request?
      
      Currently, PySpark does not work with Python 3.6.0.
      
      Running `./bin/pyspark` simply throws the error as below and PySpark does not work at all:
      
      ```
      Traceback (most recent call last):
        File ".../spark/python/pyspark/shell.py", line 30, in <module>
          import pyspark
        File ".../spark/python/pyspark/__init__.py", line 46, in <module>
          from pyspark.context import SparkContext
        File ".../spark/python/pyspark/context.py", line 36, in <module>
          from pyspark.java_gateway import launch_gateway
        File ".../spark/python/pyspark/java_gateway.py", line 31, in <module>
          from py4j.java_gateway import java_import, JavaGateway, GatewayClient
        File "<frozen importlib._bootstrap>", line 961, in _find_and_load
        File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
        File "<frozen importlib._bootstrap>", line 646, in _load_unlocked
        File "<frozen importlib._bootstrap>", line 616, in _load_backward_compatible
        File ".../spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 18, in <module>
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pydoc.py", line 62, in <module>
          import pkgutil
        File "/usr/local/Cellar/python3/3.6.0/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pkgutil.py", line 22, in <module>
          ModuleInfo = namedtuple('ModuleInfo', 'module_finder name ispkg')
        File ".../spark/python/pyspark/serializers.py", line 394, in namedtuple
          cls = _old_namedtuple(*args, **kwargs)
      TypeError: namedtuple() missing 3 required keyword-only arguments: 'verbose', 'rename', and 'module'
      ```
      
      The root cause seems to be that some arguments of `namedtuple` became keyword-only arguments as of Python 3.6.0 (see https://bugs.python.org/issue25628).

      We currently copy this function via `types.FunctionType`, which does not set the default values of keyword-only arguments (meaning `namedtuple.__kwdefaults__`), and this seems to cause internally missing values in the function (non-bound arguments).
      
      This PR proposes to work around this by manually setting the defaults via `kwargs`, as `types.FunctionType` does not seem to support setting them.
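
      A minimal, hedged illustration of that root cause (a toy function standing in for `namedtuple`):

      ```python
      import types

      def f(a, *, flag=True):
          return (a, flag)

      g = types.FunctionType(f.__code__, f.__globals__, f.__name__, f.__defaults__, f.__closure__)
      f.__kwdefaults__    # {'flag': True}
      g.__kwdefaults__    # None -- so g(1) raises TypeError: missing keyword-only argument 'flag'

      g.__kwdefaults__ = dict(f.__kwdefaults__)  # restore the keyword-only defaults manually
      g(1)                # (1, True)
      ```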
      
      Also, this PR ports the changes in cloudpickle for compatibility for Python 3.6.0.
      
      ## How was this patch tested?
      
      Manually tested with Python 2.7.6 and Python 3.6.0.
      
      ```
      ./bin/pyspark
      ```
      
      , manual creation of `namedtuple` both locally and in an RDD with Python 3.6.0,
      
      and Jenkins tests for other Python versions.
      
      Also,
      
      ```
      ./run-tests --python-executables=python3.6
      ```
      
      ```
      Will test against the following Python executables: ['python3.6']
      Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
      Finished test(python3.6): pyspark.sql.tests (192s)
      Finished test(python3.6): pyspark.accumulators (3s)
      Finished test(python3.6): pyspark.mllib.tests (198s)
      Finished test(python3.6): pyspark.broadcast (3s)
      Finished test(python3.6): pyspark.conf (2s)
      Finished test(python3.6): pyspark.context (14s)
      Finished test(python3.6): pyspark.ml.classification (21s)
      Finished test(python3.6): pyspark.ml.evaluation (11s)
      Finished test(python3.6): pyspark.ml.clustering (20s)
      Finished test(python3.6): pyspark.ml.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.streaming.tests (240s)
      Finished test(python3.6): pyspark.tests (240s)
      Finished test(python3.6): pyspark.ml.recommendation (19s)
      Finished test(python3.6): pyspark.ml.feature (36s)
      Finished test(python3.6): pyspark.ml.regression (37s)
      Finished test(python3.6): pyspark.ml.tuning (28s)
      Finished test(python3.6): pyspark.mllib.classification (26s)
      Finished test(python3.6): pyspark.mllib.evaluation (18s)
      Finished test(python3.6): pyspark.mllib.clustering (44s)
      Finished test(python3.6): pyspark.mllib.linalg.__init__ (0s)
      Finished test(python3.6): pyspark.mllib.feature (26s)
      Finished test(python3.6): pyspark.mllib.fpm (23s)
      Finished test(python3.6): pyspark.mllib.random (8s)
      Finished test(python3.6): pyspark.ml.tests (92s)
      Finished test(python3.6): pyspark.mllib.stat.KernelDensity (0s)
      Finished test(python3.6): pyspark.mllib.linalg.distributed (25s)
      Finished test(python3.6): pyspark.mllib.stat._statistics (15s)
      Finished test(python3.6): pyspark.mllib.recommendation (24s)
      Finished test(python3.6): pyspark.mllib.regression (26s)
      Finished test(python3.6): pyspark.profiler (9s)
      Finished test(python3.6): pyspark.mllib.tree (16s)
      Finished test(python3.6): pyspark.shuffle (1s)
      Finished test(python3.6): pyspark.mllib.util (18s)
      Finished test(python3.6): pyspark.serializers (11s)
      Finished test(python3.6): pyspark.rdd (20s)
      Finished test(python3.6): pyspark.sql.conf (8s)
      Finished test(python3.6): pyspark.sql.catalog (17s)
      Finished test(python3.6): pyspark.sql.column (18s)
      Finished test(python3.6): pyspark.sql.context (18s)
      Finished test(python3.6): pyspark.sql.group (27s)
      Finished test(python3.6): pyspark.sql.dataframe (33s)
      Finished test(python3.6): pyspark.sql.functions (35s)
      Finished test(python3.6): pyspark.sql.types (6s)
      Finished test(python3.6): pyspark.sql.streaming (13s)
      Finished test(python3.6): pyspark.streaming.util (0s)
      Finished test(python3.6): pyspark.sql.session (16s)
      Finished test(python3.6): pyspark.sql.window (4s)
      Finished test(python3.6): pyspark.sql.readwriter (35s)
      Tests passed in 433 seconds
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16429 from HyukjinKwon/SPARK-19019.
      
      (cherry picked from commit 20e62806)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      2ff36691
  20. Jan 13, 2017
    • [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a... · b2c9a2c8
      Vinayak authored
      [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error
      
      The change is for SQLContext to reuse the active SparkSession during construction if the SparkContext supplied is the same as the currently active one. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame using a new SQLContext object even though the SparkContext supplied to the new SQLContext is the same as the currently active one. Refer to https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro.
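
      A hedged sketch of the repro shape (toy data; see the JIRA above for the full reproduction):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SQLContext

      sc = SparkContext.getOrCreate()
      sqlc1 = SQLContext(sc)
      sqlc2 = SQLContext(sc)  # with this change, reuses the already-active SparkSession

      # Previously this instantiated a second SparkSession and could hit the Derby error.
      sqlc2.createDataFrame([(1, "a")], ["id", "v"]).show()
      ```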
      
      Existing unit tests and a new unit test added to pyspark-sql:
      
      /python/run-tests --python-executables=python --modules=pyspark-sql
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Vinayak <vijoshi5@in.ibm.com>
      Author: Vinayak Joshi <vijoshi@users.noreply.github.com>
      
      Closes #16119 from vijoshi/SPARK-18687_master.
      
      (cherry picked from commit 285a7798)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      b2c9a2c8
  21. Jan 12, 2017
    • [SPARK-19055][SQL][PYSPARK] Fix SparkSession initialization when SparkContext is stopped · 042e32d1
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      In SparkSession initialization, we store the created SparkSession instance in a class variable, `_instantiatedContext`. Next time, we can use `SparkSession.builder.getOrCreate()` to retrieve the existing SparkSession instance.
      
      However, when the active SparkContext is stopped and we create another new SparkContext to use, the existing SparkSession is still associated with the stopped SparkContext. So operations with this existing SparkSession will fail.
      
      We need to detect such a case in SparkSession and renew the class variable `_instantiatedContext` when needed.
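
      A hedged sketch of the failing sequence (local master; illustrative):

      ```python
      from pyspark import SparkContext
      from pyspark.sql import SparkSession

      spark1 = SparkSession.builder.master("local[1]").getOrCreate()
      spark1.sparkContext.stop()

      sc = SparkContext("local[1]", "new-context")  # a fresh SparkContext
      spark2 = SparkSession.builder.getOrCreate()
      # Before the fix, spark2 could be the stale session bound to the stopped
      # SparkContext, so even simple operations failed:
      spark2.range(3).collect()
      ```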
      
      ## How was this patch tested?
      
      New test added in PySpark.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16454 from viirya/fix-pyspark-sparksession.
      
      (cherry picked from commit c6c37b8a)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      042e32d1
  22. Jan 10, 2017
  23. Jan 08, 2017
  24. Dec 21, 2016
    • [SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.
      
      ----
      
      Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and updating the catalog: `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, it is very hard for me to remember `MSCK`, and I have no clue what it means.)
      
      After the new "Scalable Partition Handling", table repair becomes much more important for making the data in a newly created data-source partitioned table visible.
      
      Thus, this PR adds it to the Catalog interface. After this PR, users can repair the table with:
      ```Scala
      spark.catalog.recoverPartitions("testTable")
      ```
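
      The hedged PySpark equivalent, assuming the Python `Catalog` mirrors the Scala API here:

      ```python
      spark.catalog.recoverPartitions("testTable")
      ```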
      
      ### How was this patch tested?
      Modified the existing test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16372 from gatorsmile/repairTable2.1.1.
      0e51bb08
  25. Dec 20, 2016
    • [SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator · cd297c39
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      There is a timeout failure when using `rdd.toLocalIterator()` or `df.toLocalIterator()` for a PySpark RDD and DataFrame:
      
          df = spark.createDataFrame([[1],[2],[3]])
          it = df.toLocalIterator()
          row = next(it)
      
          df2 = df.repartition(1000)  # create many empty partitions which increase materialization time so causing timeout
          it2 = df2.toLocalIterator()
          row = next(it2)
      
      The cause of this issue is that we open a socket to serve the data from the JVM side and set a timeout for connecting and reading through the socket on the Python side. In Python we use a generator to read the data, so we only connect to the socket once we start asking for data from it. If we don't consume the iterator immediately, there is a connection timeout.
      
      On the other side, the materialization time for RDD partitions is unpredictable, so we can't set a timeout for reading data through the socket; otherwise it is very likely to fail.
      
      ## How was this patch tested?
      
      Added tests into PySpark.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16263 from viirya/fix-pyspark-localiterator.
      
      (cherry picked from commit 95c95b71)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      cd297c39
  26. Dec 15, 2016