Commits · 48aa6775d6b54ccecdbe2287ae75d99c00b02d18 · cs525-sp18-g07 / spark

Dec 08, 2016
- Preparing development version 2.1.1-SNAPSHOT · 48aa6775
  Patrick Wendell authored 8 years ago
  
  48aa6775
- Preparing Spark release v2.1.0-rc2 · 08071749
  Patrick Wendell authored 8 years ago
  
  08071749
Dec 07, 2016

[SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1 · 1c3f1da8

Yanbo Liang authored 8 years ago

## What changes were proposed in this pull request?
Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues:
* Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. Since it was used when making prediction and should be an argument of ```predict```, and we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618

) in the next release cycle.
* Fix ```spark.als``` params to make it consistent with MLlib.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16169 from yanboliang/spark-18326.

(cherry picked from commit 97255497)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

1c3f1da8

[SPARK-18705][ML][DOC] Update user guide to reflect one pass solver for L1 and elastic-net · ab865cfd

sethah authored 8 years ago


## What changes were proposed in this pull request?

WeightedLeastSquares now supports L1 and elastic net penalties and has an additional solver option: QuasiNewton. The docs are updated to reflect this change.

## How was this patch tested?

Docs only. Generated documentation to make sure Latex looks ok.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #16139 from sethah/SPARK-18705.

(cherry picked from commit 82253617)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

ab865cfd

[SPARK-18758][SS] StreamingQueryListener events from a StreamingQuery should... · 617ce3ba

Tathagata Das authored 8 years ago

[SPARK-18758][SS] StreamingQueryListener events from a StreamingQuery should be sent only to the listeners in the same session as the query

## What changes were proposed in this pull request?

Listeners added with `sparkSession.streams.addListener(l)` are added to a SparkSession. So events only from queries in the same session as a listener should be posted to the listener. Currently, all the events gets rerouted through the Spark's main listener bus, that is,
- StreamingQuery posts event to StreamingQueryListenerBus. Only the queries associated with the same session as the bus posts events to it.
- StreamingQueryListenerBus posts event to Spark's main LiveListenerBus as a SparkEvent.
- StreamingQueryListenerBus also subscribes to LiveListenerBus events thus getting back the posted event in a different thread.
- The received is posted to the registered listeners.

The problem is that *all StreamingQueryListenerBuses in all sessions* gets the events and posts them to their listeners. This is wrong.

In this PR, I solve it by making StreamingQueryListenerBus track active queries (by their runIds) when a query posts the QueryStarted event to the bus. This allows the rerouted events to be filtered using the tracked queries.

Note that this list needs to be maintained separately
from the `StreamingQueryManager.activeQueries` because a terminated query is cleared from
`StreamingQueryManager.activeQueries` as soon as it is stopped, but the this ListenerBus must
clear a query only after the termination event of that query has been posted lazily, much after the query has been terminated.

Credit goes to zsxwing for coming up with the initial idea.

## How was this patch tested?
Updated test harness code to use the correct session, and added new unit test.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16186 from tdas/SPARK-18758.

(cherry picked from commit 9ab725ea)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

617ce3ba

[SPARK-18633][ML][EXAMPLE] Add multiclass logistic regression summary python example and document · 839c2eb9

wm624@hotmail.com authored 8 years ago


## What changes were proposed in this pull request?
Logistic Regression summary is added in Python API. We need to add example and document for summary.

The newly added example is consistent with Scala and Java examples.

## How was this patch tested?

Manually tests: Run the example with spark-submit; copy & paste code into pyspark; build document and check the document.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16064 from wangmiao1981/py.

(cherry picked from commit aad11209)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

839c2eb9

[SPARK-18754][SS] Rename recentProgresses to recentProgress · 1c641971

Michael Armbrust authored 8 years ago


Based on an informal survey, users find this option easier to understand / remember.

Author: Michael Armbrust <michael@databricks.com>

Closes #16182 from marmbrus/renameRecentProgress.

(cherry picked from commit 70b2bf71)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

1c641971

[SPARK-18588][TESTS] Fix flaky test: KafkaSourceStressForDontFailOnDataLossSuite · e9b3afac

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

Fixed the following failures:

```
org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to eventually never returned normally. Attempted 3745 times over 1.0000790851666665 minutes. Last failure message: assertion failed: failOnDataLoss-0 not deleted after timeout.
```

```
sbt.ForkMain$ForkError: org.apache.spark.sql.streaming.StreamingQueryException: Query query-66 terminated with exception: null
	at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:252)
	at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:146)
Caused by: sbt.ForkMain$ForkError: java.lang.NullPointerException: null
	at java.util.ArrayList.addAll(ArrayList.java:577)
	at org.apache.kafka.clients.Metadata.getClusterForCurrentTopics(Metadata.java:257)
	at org.apache.kafka.clients.Metadata.update(Metadata.java:177)
	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.handleResponse(NetworkClient.java:605)
	at org.apache.kafka.clients.NetworkClient$DefaultMetadataUpdater.maybeHandleCompletedReceive(NetworkClient.java:582)
	at org.apache.kafka.clients.NetworkClient.handleCompletedReceives(NetworkClient.java:450)
	at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:269)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.clientPoll(ConsumerNetworkClient.java:360)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:224)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:192)
	at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.awaitPendingRequests(ConsumerNetworkClient.java:260)
	at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:222)
	at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.ensurePartitionAssignment(ConsumerCoordinator.java:366)
	at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:978)
	at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:938)
	at
...
```

## How was this patch tested?

Tested in #16048 by running many times.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16109 from zsxwing/fix-kafka-flaky-test.

(cherry picked from commit edc87e18)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

e9b3afac

[SPARK-18762][WEBUI] Web UI should be http:4040 instead of https:4040 · 76e1f165

sarutak authored 8 years ago

## What changes were proposed in this pull request?

When SSL is enabled, the Spark shell shows:
```
Spark context Web UI available at https://192.168.99.1:4040


```
This is wrong because 4040 is http, not https. It redirects to the https port.
More importantly, this introduces several broken links in the UI. For example, in the master UI, the worker link is https:8081 instead of http:8081 or https:8481.

CC: mengxr liancheng

I manually tested accessing by accessing MasterPage, WorkerPage and HistoryServer with SSL enabled.

Author: sarutak <sarutak@oss.nttdata.co.jp>

Closes #16190 from sarutak/SPARK-18761.

(cherry picked from commit bb94f61a)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

76e1f165

[SPARK-18764][CORE] Add a warning log when skipping a corrupted file · acb6ac5d

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

It's better to add a warning log when skipping a corrupted file. It will be helpful when we want to finish the job first, then find them in the log and fix these files.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16192 from zsxwing/SPARK-18764.

(cherry picked from commit dbf3e298)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

acb6ac5d

[SPARK-17760][SQL] AnalysisException with dataframe pivot when groupBy column is not attribute · 5dbcd4fc

Andrew Ray authored 8 years ago


## What changes were proposed in this pull request?

Fixes AnalysisException for pivot queries that have group by columns that are expressions and not attributes by substituting the expressions output attribute in the second aggregation and final projection.

## How was this patch tested?

existing and additional unit tests

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #16177 from aray/SPARK-17760.

(cherry picked from commit f1fca81b)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

5dbcd4fc

[SPARK-18208][SHUFFLE] Executor OOM due to a growing LongArray in BytesToBytesMap · 4432a2a8

Jie Xiong authored 8 years ago


## What changes were proposed in this pull request?

BytesToBytesMap currently does not release the in-memory storage (the longArray variable) after it spills to disk. This is typically not a problem during aggregation because the longArray should be much smaller than the pages, and because we grow the longArray at a conservative rate.

However this can lead to an OOM when an already running task is allocated more than its fair share, this can happen because of a scheduling delay. In this case the longArray can grow beyond the fair share of memory for the task. This becomes problematic when the task spills and the long array is not freed, that causes subsequent memory allocation requests to be denied by the memory manager resulting in an OOM.

This PR fixes this issuing by freeing the longArray when the BytesToBytesMap spills.

## How was this patch tested?

Existing tests and tested on realworld workloads.

Author: Jie Xiong <jiexiong@fb.com>
Author: jiexiong <jiexiong@gmail.com>

Closes #15722 from jiexiong/jie_oom_fix.

(cherry picked from commit c496d03b)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

4432a2a8

[SPARK-18678][ML] Skewed reservoir sampling in SamplingUtils · 51754d6d

Sean Owen authored 8 years ago


## What changes were proposed in this pull request?

Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k.

## How was this patch tested?

Existing test plus new test case.

Author: Sean Owen <sowen@cloudera.com>

Closes #16129 from srowen/SPARK-18678.

(cherry picked from commit 79f5f281)
Signed-off-by: Sean Owen <sowen@cloudera.com>

51754d6d

[SPARK-18701][ML] Fix Poisson GLM failure due to wrong initialization · 99c293ee

actuaryzhang authored 8 years ago


Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. Fix and test are included in the commits.

## What changes were proposed in this pull request?
Update initialization in Poisson GLM

## How was this patch tested?
Add test in GeneralizedLinearRegressionSuite

srowen sethah yanboliang HyukjinKwon mengxr

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16131 from actuaryzhang/master.

(cherry picked from commit b8280271)
Signed-off-by: Sean Owen <sowen@cloudera.com>

99c293ee

[SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit. · 340e9aea

Yanbo Liang authored 8 years ago


## What changes were proposed in this pull request?
Several cleanup and improvements for ```spark.logit```:
* ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users.

## How was this patch tested?
Unit tests.

The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16117 from yanboliang/spark-18686.

(cherry picked from commit 90b59d1b)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

340e9aea

Dec 06, 2016

[SPARK-18671][SS][TEST-MAVEN] Follow up PR to fix test for Maven · 3750c6e9

Tathagata Das authored 8 years ago


## What changes were proposed in this pull request?

Maven compilation seem to not allow resource is sql/test to be easily referred to in kafka-0-10-sql tests. So moved the kafka-source-offset-version-2.1.0 from sql test resources to kafka-0-10-sql test resources.

## How was this patch tested?

Manually ran maven test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16183 from tdas/SPARK-18671-1.

(cherry picked from commit 5c6bcdbd)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

3750c6e9

[SPARK-18734][SS] Represent timestamp in StreamingQueryProgress as formatted... · 9b5bc2a6

Tathagata Das authored 8 years ago

[SPARK-18734][SS] Represent timestamp in StreamingQueryProgress as formatted string instead of millis

## What changes were proposed in this pull request?

Easier to read while debugging as a formatted string (in ISO8601 format) than in millis

## How was this patch tested?
Updated unit tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16166 from tdas/SPARK-18734.

(cherry picked from commit 539bb3cf)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

9b5bc2a6

[SPARK-18652][PYTHON] Include the example data and third-party licenses in pyspark package. · 65f5331a

Shuai Lin authored 8 years ago


## What changes were proposed in this pull request?

Since we already include the python examples in the pyspark package, we should include the example data with it as well.

We should also include the third-party licences since we distribute their jars with the pyspark package.

## How was this patch tested?

Manually tested with python2.7 and python3.4
```sh
$ ./build/mvn -DskipTests -Phive -Phive-thriftserver -Pyarn -Pmesos clean package
$ cd python
$ python setup.py sdist
$ pip install  dist/pyspark-2.1.0.dev0.tar.gz

$ ls -1 /usr/local/lib/python2.7/dist-packages/pyspark/data/
graphx
mllib
streaming

$ du -sh /usr/local/lib/python2.7/dist-packages/pyspark/data/
600K    /usr/local/lib/python2.7/dist-packages/pyspark/data/

$ ls -1  /usr/local/lib/python2.7/dist-packages/pyspark/licenses/|head -5
LICENSE-AnchorJS.txt
LICENSE-DPark.txt
LICENSE-Mockito.txt
LICENSE-SnapTree.txt
LICENSE-antlr.txt
```

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #16082 from lins05/include-data-in-pyspark-dist.

(cherry picked from commit bd9a4a5a)
Signed-off-by: Sean Owen <sowen@cloudera.com>

65f5331a

[SPARK-18671][SS][TEST] Added tests to ensure stability of that all Structured... · d20e0d6b

Tathagata Das authored 8 years ago

[SPARK-18671][SS][TEST] Added tests to ensure stability of that all Structured Streaming log formats

## What changes were proposed in this pull request?

To be able to restart StreamingQueries across Spark version, we have already made the logs (offset log, file source log, file sink log) use json. We should added tests with actual json files in the Spark such that any incompatible changes in reading the logs is immediately caught. This PR add tests for FileStreamSourceLog, FileStreamSinkLog, and OffsetSeqLog.

## How was this patch tested?
new unit tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16128 from tdas/SPARK-18671.

(cherry picked from commit 1ef6b296)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

d20e0d6b

[SPARK-18714][SQL] Add a simple time function to SparkSession · ace4079c

Reynold Xin authored 8 years ago


## What changes were proposed in this pull request?
Many Spark developers often want to test the runtime of some function in interactive debugging and testing. This patch adds a simple time function to SparkSession:

```
scala> spark.time { spark.range(1000).count() }
Time taken: 77 ms
res1: Long = 1000
```

## How was this patch tested?
I tested this interactively in spark-shell.

Author: Reynold Xin <rxin@databricks.com>

Closes #16140 from rxin/SPARK-18714.

(cherry picked from commit cb1f10b4)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

ace4079c

[SPARK-18634][SQL][TRIVIAL] Touch-up Generate · e362d998

Herman van Hovell authored 8 years ago

## What changes were proposed in this pull request?
I jumped the gun on merging https://github.com/apache/spark/pull/16120

, and missed a tiny potential problem. This PR fixes that by changing a val into a def; this should prevent potential serialization/initialization weirdness from happening.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #16170 from hvanhovell/SPARK-18634.

(cherry picked from commit 381ef4ea)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

e362d998

Dec 05, 2016

[SPARK-18721][SS] Fix ForeachSink with watermark + append · 655297b3

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

Right now ForeachSink creates a new physical plan, so StreamExecution cannot retrieval metrics and watermark.

This PR changes ForeachSink to manually convert InternalRows to objects without creating a new plan.

## How was this patch tested?

`test("foreach with watermark: append")`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16160 from zsxwing/SPARK-18721.

(cherry picked from commit 7863c623)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

655297b3

[SPARK-18572][SQL] Add a method `listPartitionNames` to `ExternalCatalog` · 8ca6a82c

Michael Allman authored 8 years ago

(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-18572

)

## What changes were proposed in this pull request?

Currently Spark answers the `SHOW PARTITIONS` command by fetching all of the table's partition metadata from the external catalog and constructing partition names therefrom. The Hive client has a `getPartitionNames` method which is many times faster for this purpose, with the performance improvement scaling with the number of partitions in a table.

To test the performance impact of this PR, I ran the `SHOW PARTITIONS` command on two Hive tables with large numbers of partitions. One table has ~17,800 partitions, and the other has ~95,000 partitions. For the purposes of this PR, I'll call the former table `table1` and the latter table `table2`. I ran 5 trials for each table with before-and-after versions of this PR. The results are as follows:

Spark at bdc8153e, `SHOW PARTITIONS table1`, times in seconds:
7.901
3.983
4.018
4.331
4.261

Spark at bdc8153e, `SHOW PARTITIONS table2`
(Timed out after 10 minutes with a `SocketTimeoutException`.)

Spark at this PR, `SHOW PARTITIONS table1`, times in seconds:
3.801
0.449
0.395
0.348
0.336

Spark at this PR, `SHOW PARTITIONS table2`, times in seconds:
5.184
1.63
1.474
1.519
1.41

Taking the best times from each trial, we get a 12x performance improvement for a table with ~17,800 partitions and at least a 426x improvement for a table with ~95,000 partitions. More significantly, the latter command doesn't even complete with the current code in master.

This is actually a patch we've been using in-house at VideoAmp since Spark 1.1. It's made all the difference in the practical usability of our largest tables. Even with tables with about 1,000 partitions there's a performance improvement of about 2-3x.

## How was this patch tested?

I added a unit test to `VersionsSuite` which tests that the Hive client's `getPartitionNames` method returns the correct number of partitions.

Author: Michael Allman <michael@videoamp.com>

Closes #15998 from mallman/spark-18572-list_partition_names.

(cherry picked from commit 772ddbea)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

8ca6a82c

[SPARK-18722][SS] Move no data rate limit from StreamExecution to ProgressReporter · d4588165

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

Move no data rate limit from StreamExecution to ProgressReporter to make `recentProgresses` and listener events consistent.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16155 from zsxwing/SPARK-18722.

(cherry picked from commit 4af142f5)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

d4588165

[SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and... · 1946854a

Tathagata Das authored 8 years ago

[SPARK-18657][SPARK-18668] Make StreamingQuery.id persists across restart and not auto-generate StreamingQuery.name

Here are the major changes in this PR.
- Added the ability to recover `StreamingQuery.id` from checkpoint location, by writing the id to `checkpointLoc/metadata`.
- Added `StreamingQuery.runId` which is unique for every query started and does not persist across restarts. This is to identify each restart of a query separately (same as earlier behavior of `id`).
- Removed auto-generation of `StreamingQuery.name`. The purpose of name was to have the ability to define an identifier across restarts, but since id is precisely that, there is no need for a auto-generated name. This means name becomes purely cosmetic, and is null by default.
- Added `runId` to `StreamingQueryListener` events and `StreamingQueryProgress`.

Implementation details
- Renamed existing `StreamExecutionMetadata` to `OffsetSeqMetadata`, and moved it to the file `OffsetSeq.scala`, because that is what this metadata is tied to. Also did some refactoring to make the code cleaner (got rid of a lot of `.json` and `.getOrElse("{}")`).
- Added the `id` as the new `StreamMetadata`.
- When a StreamingQuery is created it gets or writes the `StreamMetadata` from `checkpointLoc/metadata`.
- All internal logging in `StreamExecution` uses `(name, id, runId)` instead of just `name`

TODO
- [x] Test handling of name=null in json generation of StreamingQueryProgress
- [x] Test handling of name=null in json generation of StreamingQueryListener events
- [x] Test python API of runId

Updated unit tests and new unit tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16113 from tdas/SPARK-18657.

(cherry picked from commit bb57bfe9)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

1946854a

[SPARK-18729][SS] Move DataFrame.collect out of synchronized block in MemorySink · 6c4c3368

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

Move DataFrame.collect out of synchronized block so that we can query content in MemorySink when `DataFrame.collect` is running.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16162 from zsxwing/SPARK-18729.

(cherry picked from commit 1b2785c3)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

6c4c3368

[SPARK-18634][PYSPARK][SQL] Corruption and Correctness issues with exploding Python UDFs · fecd23d2

Liang-Chi Hsieh authored 8 years ago

## What changes were proposed in this pull request?

As reported in the Jira, there are some weird issues with exploding Python UDFs in SparkSQL.

The following test code can reproduce it. Notice: the following test code is reported to return wrong results in the Jira. However, as I tested on master branch, it causes exception and so can't return any result.

    >>> from pyspark.sql.functions import *
    >>> from pyspark.sql.types import *
    >>>
    >>> df = spark.range(10)
    >>>
    >>> def return_range(value):
    ...   return [(i, str(i)) for i in range(value - 1, value + 1)]
    ...
    >>> range_udf = udf(return_range, ArrayType(StructType([StructField("integer_val", IntegerType()),
    ...                                                     StructField("string_val", StringType())])))
    >>>
    >>> df.select("id", explode(range_udf(df.id))).show()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/spark/python/pyspark/sql/dataframe.py", line 318, in show
        print(self._jdf.showString(n, 20))
      File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
      File "/spark/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o126.showString.: java.lang.AssertionError: assertion failed
        at scala.Predef$.assert(Predef.scala:156)
        at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:120)
        at org.apache.spark.sql.execution.GenerateExec.consume(GenerateExec.scala:57)

The cause of this issue is, in `ExtractPythonUDFs` we insert `BatchEvalPythonExec` to run PythonUDFs in batch. `BatchEvalPythonExec` will add extra outputs (e.g., `pythonUDF0`) to original plan. In above case, the original `Range` only has one output `id`. After `ExtractPythonUDFs`, the added `BatchEvalPythonExec` has two outputs `id` and `pythonUDF0`.

Because the output of `GenerateExec` is given after analysis phase, in above case, it is the combination of `id`, i.e., the output of `Range`, and `col`. But in planning phase, we change `GenerateExec`'s child plan to `BatchEvalPythonExec` with additional output attributes.

It will cause no problem in non wholestage codegen. Because when evaluating the additional attributes are projected out the final output of `GenerateExec`.

However, as `GenerateExec` now supports wholestage codegen, the framework will input all the outputs of the child plan to `GenerateExec`. Then when consuming `GenerateExec`'s output data (i.e., calling `consume`), the number of output attributes is different to the output variables in wholestage codegen.

To solve this issue, this patch only gives the generator's output to `GenerateExec` after analysis phase. `GenerateExec`'s output is the combination of its child plan's output and the generator's output. So when we change `GenerateExec`'s child, its output is still correct.

## How was this patch tested?

Added test cases to PySpark.

Please review http://spark.apache.org/contributing.html

 before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16120 from viirya/fix-py-udf-with-generator.

(cherry picked from commit 3ba69b64)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

fecd23d2

[SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix... · c6a4e3d9

Shixiong Zhu authored 8 years ago

[SPARK-18694][SS] Add StreamingQuery.explain and exception to Python and fix StreamingQueryException (branch 2.1)

## What changes were proposed in this pull request?

Backport #16125 to branch 2.1.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16153 from zsxwing/SPARK-18694-2.1.

c6a4e3d9

[DOCS][MINOR] Update location of Spark YARN shuffle jar · 39759ff0

Nicholas Chammas authored 8 years ago


Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`.

This change is so minor I'm not sure it needs a JIRA. But let me know if so and I'll create one.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #16130 from nchammas/yarn-doc-fix.

(cherry picked from commit 5a92dc76)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

39759ff0

[SPARK-18711][SQL] should disable subexpression elimination for LambdaVariable · e23c8cfc

Wenchen Fan authored 8 years ago

## What changes were proposed in this pull request?

This is kind of a long-standing bug, it's hidden until https://github.com/apache/spark/pull/15780

 , which may add `AssertNotNull` on top of `LambdaVariable` and thus enables subexpression elimination.

However, subexpression elimination will evaluate the common expressions at the beginning, which is invalid for `LambdaVariable`. `LambdaVariable` usually represents loop variable, which can't be evaluated ahead of the loop.

This PR skips expressions containing `LambdaVariable` when doing subexpression elimination.

## How was this patch tested?

updated test in `DatasetAggregatorSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16143 from cloud-fan/aggregator.

(cherry picked from commit 01a7d33d)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

e23c8cfc

Revert "[SPARK-18284][SQL] Make ExpressionEncoder.serializer.nullable precise" · 30c07430
Reynold Xin authored 8 years ago
```
This reverts commit fce1be6c from branch-2.1.
```
30c07430

[MINOR][DOC] Use SparkR `TRUE` value and add default values for `StructField` in SQL Guide. · afd2321b

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

In `SQL Programming Guide`, this PR uses `TRUE` instead of `True` in SparkR and adds default values of `nullable` for `StructField` in Scala/Python/R (i.e., "Note: The default value of nullable is true."). In Java API, `nullable` is not optional.

**BEFORE**
* SPARK 2.1.0 RC1
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc1-docs/sql-programming-guide.html#data-types

**AFTER**

* R
<img width="916" alt="screen shot 2016-12-04 at 11 58 19 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877443/abba19a6-ba7d-11e6-8984-afbe00333fb0.png">

* Scala
<img width="914" alt="screen shot 2016-12-04 at 11 57 37 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877433/99ce734a-ba7d-11e6-8bb5-e8619041b09b.png">

* Python
<img width="914" alt="screen shot 2016-12-04 at 11 58 04 pm" src="https://cloud.githubusercontent.com/assets/9700541/20877440/a5c89338-ba7d-11e6-8f92-6c0ae9388d7e.png

">

## How was this patch tested?

Manual.

```
cd docs
SKIP_API=1 jekyll build
open _site/index.html
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16141 from dongjoon-hyun/SPARK-SQL-GUIDE.

(cherry picked from commit 410b7898)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

afd2321b

[SPARK-18279][DOC][ML][SPARKR] Add R examples to ML programming guide. · 1821cbea

Yanbo Liang authored 8 years ago

## What changes were proposed in this pull request?
Add R examples to ML programming guide for the following algorithms as POC:
* spark.glm
* spark.survreg
* spark.naiveBayes
* spark.kmeans

The four algorithms were added to SparkR since 2.0.0, more docs for algorithms added during 2.1 release cycle will be addressed in a separate follow-up PR.

## How was this patch tested?
This is the screenshots of generated ML programming guide for ```GeneralizedLinearRegression```:
![image](https://cloud.githubusercontent.com/assets/1962026/20866403/babad856-b9e1-11e6-9984-62747801e8c4.png

)

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16136 from yanboliang/spark-18279.

(cherry picked from commit eb8dd681)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

1821cbea

[SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and setPredictionCol · 88e07efe

Zheng RuiFeng authored 8 years ago


## What changes were proposed in this pull request?
add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel`

## How was this patch tested?
added tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16059 from zhengruifeng/ovrm_setCol.

(cherry picked from commit bdfe7f67)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

88e07efe

Dec 04, 2016

[SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark · c13c2939

Felix Cheung authored 8 years ago


## What changes were proposed in this pull request?

If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session.

This seems to be a regression on the earlier behavior.

Fix is to always try to install or check for the cached Spark if running in an interactive session.
As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc)

## How was this patch tested?

Manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16077 from felixcheung/rsessioninteractive.

(cherry picked from commit b019b3a8)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

c13c2939

[SPARK-18661][SQL] Creating a partitioned datasource table should not scan all files for table · 41d698ec

Eric Liang authored 8 years ago


## What changes were proposed in this pull request?

Even though in 2.1 creating a partitioned datasource table will not populate the partition data by default (until the user issues MSCK REPAIR TABLE), it seems we still scan the filesystem for no good reason.

We should avoid doing this when the user specifies a schema.

## How was this patch tested?

Perf stat tests.

Author: Eric Liang <ekl@databricks.com>

Closes #16090 from ericl/spark-18661.

(cherry picked from commit d9eb4c72)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

41d698ec

[SPARK-18091][SQL] Deep if expressions cause Generated... · 8145c82b

Kapil Singh authored 8 years ago

[SPARK-18091][SQL] Deep if expressions cause Generated SpecificUnsafeProjection code to exceed JVM code size limit

## What changes were proposed in this pull request?

Fix for SPARK-18091 which is a bug related to large if expressions causing generated SpecificUnsafeProjection code to exceed JVM code size limit.

This PR changes if expression's code generation to place its predicate, true value and false value expressions' generated code in separate methods in context so as to never generate too long combined code.
## How was this patch tested?

Added a unit test and also tested manually with the application (having transformations similar to the unit test) which caused the issue to be identified in the first place.

Author: Kapil Singh <kapsingh@adobe.com>

Closes #15620 from kapilsingh5050/SPARK-18091-IfCodegenFix.

(cherry picked from commit e463678b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

8145c82b

Dec 03, 2016

[SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) · 28f698b4

Yunni authored 8 years ago


## What changes were proposed in this pull request?
The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.

## How was this patch tested?
Doc has been generated through Jekyll, and checked through manual inspection.

Author: Yunni <Euler57721@gmail.com>
Author: Yun Ni <yunn@uber.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Yun Ni <Euler57721@gmail.com>

Closes #15795 from Yunni/SPARK-18081-lsh-guide.

(cherry picked from commit 34777184)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

28f698b4

[SPARK-18582][SQL] Whitelist LogicalPlan operators allowed in correlated subqueries · b098b484

Nattavut Sutyanyong authored 8 years ago


## What changes were proposed in this pull request?

This fix puts an explicit list of operators that Spark supports for correlated subqueries.

## How was this patch tested?

Run sql/test, catalyst/test and add a new test case on Generate.

Author: Nattavut Sutyanyong <nsy.can@gmail.com>

Closes #16046 from nsyca/spark18455.0.

(cherry picked from commit 4a3c0960)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

b098b484

[SPARK-18685][TESTS] Fix URI and release resources after opening in tests at... · 28ea432a

hyukjinkwon authored 8 years ago

[SPARK-18685][TESTS] Fix URI and release resources after opening in tests at ExecutorClassLoaderSuite

## What changes were proposed in this pull request?

This PR fixes two problems as below:

- Close `BufferedSource` after `Source.fromInputStream(...)` to release resource and make the tests pass on Windows in `ExecutorClassLoaderSuite`

  ```
  [info] Exception encountered when attempting to run a suite with class name: org.apache.spark.repl.ExecutorClassLoaderSuite *** ABORTED *** (7 seconds, 333 milliseconds)
  [info]   java.io.IOException: Failed to delete: C:\projects\spark\target\tmp\spark-77b2f37b-6405-47c4-af1c-4a6a206511f2
  [info]   at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
  [info]   at org.apache.spark.repl.ExecutorClassLoaderSuite.afterAll(ExecutorClassLoaderSuite.scala:76)
  [info]   at org.scalatest.BeforeAndAfterAll$class.afterAll(BeforeAndAfterAll.scala:213)
  ...
  ```

- Fix URI correctly so that related tests can be passed on Windows.

  ```
  [info] - child first *** FAILED *** (78 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - parent first *** FAILED *** (15 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - child first can fall back *** FAILED *** (0 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - child first can fail *** FAILED *** (0 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - resource from parent *** FAILED *** (0 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ...
  [info] - resources from parent *** FAILED *** (0 milliseconds)
  [info]   java.net.URISyntaxException: Illegal character in authority at index 7: file://C:\projects\spark\target\tmp\spark-00b66070-0548-463c-b6f3-8965d173da9b
  [info]   at java.net.URI$Parser.fail(URI.java:2848)
  [info]   at java.net.URI$Parser.parseAuthority(URI.java:3186)
  ```

## How was this patch tested?

Manually tested via AppVeyor.

**Before**
https://ci.appveyor.com/project/spark-test/spark/build/102-rpel-ExecutorClassLoaderSuite

**After**
https://ci.appveyor.com/project/spark-test/spark/build/108-rpel-ExecutorClassLoaderSuite



Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16116 from HyukjinKwon/close-after-open.

(cherry picked from commit d1312fb7)
Signed-off-by: Sean Owen <sowen@cloudera.com>

28ea432a