  1. Feb 14, 2017
    • sureshthalamati's avatar
      [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the... · f48c5a57
      sureshthalamati authored
      [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner.
      
      ## What changes were proposed in this pull request?
      The reason for the test failure is that the property `oracle.jdbc.mapDateToTimestamp` set by the test was getting converted to all lower case. The Oracle database expects this property in a case-sensitive manner.
      
      This test was passing in previous releases because connection properties were sent exactly as the user specified them for the test case scenario. The fix that handles all options uniformly in a case-insensitive manner also converted the JDBC connection properties to lower case.
      
      This PR enhances CaseInsensitiveMap to keep track of the original case-sensitive input keys, and uses those keys when creating the connection properties that are passed to the JDBC connection.
      
      An alternative approach, PR https://github.com/apache/spark/pull/16847, is to pass the original input keys to the JDBC data source by adding a check in the data source class and handling case-insensitivity in the JDBC source code.
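The key-tracking idea can be sketched as a small case-insensitive map (a hypothetical Python illustration, not Spark's actual `CaseInsensitiveMap`):

```python
class CaseInsensitiveMap:
    """Case-insensitive lookups that remember the original key casing."""

    def __init__(self, options):
        # lower-cased key -> (original key, value)
        self._data = {k.lower(): (k, v) for k, v in options.items()}

    def __getitem__(self, key):
        return self._data[key.lower()][1]

    def original_key_map(self):
        # Rebuild the options with the user's original casing, e.g. when
        # building the properties handed to the JDBC driver.
        return {orig: v for orig, v in self._data.values()}


opts = CaseInsensitiveMap({"oracle.jdbc.mapDateToTimestamp": "false"})
assert opts["ORACLE.JDBC.MAPDATETOTIMESTAMP"] == "false"
assert "oracle.jdbc.mapDateToTimestamp" in opts.original_key_map()
```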
      
      ## How was this patch tested?
      Added new test cases to JdbcSuite and OracleIntegrationSuite. Ran the Docker integration tests on my laptop; all tests passed successfully.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
      f48c5a57
    • Reynold Xin's avatar
      [SPARK-16475][SQL] Broadcast hint for SQL Queries · da7aef7a
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure.
      
      The hint syntax looks like the following:
      ```
      SELECT /*+ BROADCAST(t) */ * FROM t
      ```
      
      For the broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that matches the specified name.
      
      The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go beyond any existing broadcast hints or subquery aliases. This rule runs before common table expressions are resolved.
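The traversal described above can be sketched on a toy plan tree (hypothetical node and function names; the real rule operates on Catalyst logical plans):

```python
class Node:
    def __init__(self, kind, name=None, children=()):
        self.kind, self.name, self.children = kind, name, list(children)

def apply_broadcast_hint(node, aliases):
    # Do not traverse past an existing broadcast hint.
    if node.kind == "BroadcastHint":
        return node
    # Wrap a matching relation in a broadcast hint node.
    if node.kind == "Relation" and node.name in aliases:
        return Node("BroadcastHint", children=[node])
    node.children = [apply_broadcast_hint(c, aliases) for c in node.children]
    return node

plan = Node("Join", children=[Node("Relation", "t"), Node("Relation", "u")])
plan = apply_broadcast_hint(plan, {"t"})
assert plan.children[0].kind == "BroadcastHint"
assert plan.children[1].kind == "Relation"
```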
      
      Note that there was an earlier patch in https://github.com/apache/spark/pull/14426. This is a rewrite of that patch, with different semantics and simpler test cases.
      
      ## How was this patch tested?
      Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16925 from rxin/SPARK-16475-broadcast-hint.
      da7aef7a
    • Felix Cheung's avatar
      [SPARK-19387][SPARKR] Tests do not run with SparkR source package in CRAN check · a3626ca3
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      - this is caused by changes in SPARK-18444 and SPARK-18643 under which we no longer install Spark when `master = ""` (the default), but it is also related to SPARK-18449, since the real `master` value is not known at the time the R code in `sparkR.session` is run (`master` cannot default to "local" since it could be overridden by the spark-submit command line or Spark config)
      - as a result, while running SparkR as a package in an IDE works fine, the CRAN check does not, since it launches SparkR via a non-interactive script
      - the fix is to add a check to the beginning of each test and vignette; the same would also work by changing `sparkR.session()` to `sparkR.session(master = "local")` in tests, but I think being more explicit is better
      
      ## How was this patch tested?
      
      Tested this by reverting version to 2.1, since it needs to download the release jar with matching version. But since there are changes in 2.2 (specifically around SparkR ML) that are incompatible with 2.1, some tests are failing in this config. Will need to port this to branch-2.1 and retest with 2.1 release jar.
      
      manually as:
      ```
      # modify DESCRIPTION to revert version to 2.1.0
      SPARK_HOME=/usr/spark R CMD build pkg
      # run cran check without SPARK_HOME
      R CMD check --as-cran SparkR_2.1.0.tar.gz
      ```
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16720 from felixcheung/rcranchecktest.
      a3626ca3
    • Jong Wook Kim's avatar
      [SPARK-19501][YARN] Reduce the number of HDFS RPCs during YARN deployment · ab9872db
      Jong Wook Kim authored
      ## What changes were proposed in this pull request?
      
      As discussed in [JIRA](https://issues.apache.org/jira/browse/SPARK-19501), this patch addresses the problem where too many HDFS RPCs are made when there are many URIs specified in `spark.yarn.jars`, potentially adding hundreds of RTTs to YARN before the application launches. This becomes significant when submitting the application to a non-local YARN cluster (where the RTT may be on the order of 100 ms, for example). For each URI specified, the current implementation makes at least two HDFS RPCs, for:
      
      - [Calling `getFileStatus()` before uploading each file to the distributed cache in `ClientDistributedCacheManager.addResource()`](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71).
      - [Resolving any symbolic links in each file URI](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L377-L379), which repeatedly makes HDFS RPCs until all symlinks are resolved. (see [`FileContext.resolve(Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java#L2189-L2195), [`FSLinkResolver.resolve(FileContext, Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSLinkResolver.java#L79-L112), and [`AbstractFileSystem.resolvePath()`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/AbstractFileSystem.java#L464-L468).)
      
      The first `getFileStatus` RPC can be removed, using `statCache` populated with the file statuses retrieved with [the previous `globStatus` call](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L531).
      
      The second one can be largely reduced by caching the symlink resolution results in a mutable.HashMap. This patch adds a local variable in `yarn.Client.prepareLocalResources()` and passes it as an additional parameter to `yarn.Client.copyFileToRemote`. [The symlink resolution code was added in 2013](https://github.com/apache/spark/commit/a35472e1dd2ea1b5a0b1fb6b382f5a98f5aeba5a#diff-b050df3f55b82065803d6e83453b9706R187) and has not changed since. I am assuming that this is still required, but otherwise we can remove the `symlinkCache` and symlink resolution altogether.
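The caching idea can be sketched in a few lines (hypothetical helper names; the actual patch threads a `mutable.HashMap` through `prepareLocalResources` and `copyFileToRemote`):

```python
def resolve_with_cache(uri, resolve, cache):
    """Resolve symlinks for `uri`, remembering results to avoid repeat RPCs."""
    if uri not in cache:
        cache[uri] = resolve(uri)  # the expensive RPC happens once per URI
    return cache[uri]

calls = []
def fake_resolve(uri):
    calls.append(uri)
    return uri + "#resolved"

cache = {}
for uri in ["hdfs:///jars/a.jar", "hdfs:///jars/a.jar", "hdfs:///jars/b.jar"]:
    resolve_with_cache(uri, fake_resolve, cache)
assert len(calls) == 2  # two distinct URIs -> two RPCs instead of three
```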
      
      ## How was this patch tested?
      
      This patch is based off 8e8afb3a, currently the latest YARN patch on master. All tests except a few in spark-hive passed with `./dev/run-tests` on my machine, using JDK 1.8.0_112 on macOS 10.12.3; also tested myself with this modified version of SPARK 2.2.0-SNAPSHOT which performed a normal deployment and execution on a YARN cluster without errors.
      
      Author: Jong Wook Kim <jongwook@nyu.edu>
      
      Closes #16916 from jongwook/SPARK-19501.
      ab9872db
    • hyukjinkwon's avatar
      [SPARK-19571][R] Fix SparkR test break on Windows via AppVeyor · f776e3b4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems winutils for Hadoop 2.6.5 does not exist for now in https://github.com/steveloughran/winutils.
      
      This breaks the SparkR tests on Windows, so this PR proposes to use winutils built for Hadoop 2.6.4 for now.
      
      ## How was this patch tested?
      
      Manually via AppVeyor
      
      **Before**
      
      https://ci.appveyor.com/project/spark-test/spark/build/627-r-test-break
      
      **After**
      
      https://ci.appveyor.com/project/spark-test/spark/build/629-r-test-break
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16927 from HyukjinKwon/spark-r-windows-break.
      f776e3b4
    • Sheamus K. Parkes's avatar
      [SPARK-18541][PYTHON] Add metadata parameter to pyspark.sql.Column.alias() · 7b64f7aa
      Sheamus K. Parkes authored
      ## What changes were proposed in this pull request?
      
      Add a `metadata` keyword parameter to `pyspark.sql.Column.alias()` to allow users to mix in metadata while manipulating `DataFrame`s in `pyspark`.  Without this, I believe it was necessary to pass back through `SparkSession.createDataFrame` each time a user wanted to manipulate `StructField.metadata` in `pyspark`.
      
      This pull request also improves consistency between the Scala and Python APIs (i.e. I did not add any functionality that was not already in the Scala API).
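The intent can be sketched with a toy column object (an illustration only, not the pyspark implementation):

```python
class Column:
    """Toy stand-in for a column expression that can carry metadata."""

    def __init__(self, name, metadata=None):
        self.name = name
        self.metadata = metadata or {}

    def alias(self, new_name, metadata=None):
        # The metadata rides along with the rename instead of requiring a
        # round trip through SparkSession.createDataFrame.
        return Column(new_name, metadata)


c = Column("age").alias("years", metadata={"max": 120})
assert c.name == "years" and c.metadata["max"] == 120
```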
      
      Discussed ahead of time on JIRA with marmbrus.
      
      ## How was this patch tested?
      
      Added unit tests (and doc tests).  Ran the pertinent tests manually.
      
      Author: Sheamus K. Parkes <shea.parkes@milliman.com>
      
      Closes #16094 from shea-parkes/pyspark-column-alias-metadata.
      7b64f7aa
    • zero323's avatar
      [SPARK-19162][PYTHON][SQL] UserDefinedFunction should validate that func is callable · e0eeb0f8
      zero323 authored
      ## What changes were proposed in this pull request?
      
      UDF constructor checks if `func` argument is callable and if it is not, fails fast instead of waiting for an action.
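The fail-fast check amounts to something like the following sketch (not the exact pyspark code):

```python
class UserDefinedFunction:
    def __init__(self, func, return_type):
        if not callable(func):
            # Fail fast at construction time instead of at the first action.
            raise TypeError(f"func should be callable, got {type(func)}")
        self.func, self.return_type = func, return_type

udf_ok = UserDefinedFunction(lambda x: x + 1, "integer")
try:
    UserDefinedFunction("not a function", "integer")
    raised = False
except TypeError:
    raised = True
assert raised
```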
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16535 from zero323/SPARK-19162.
      e0eeb0f8
    • zero323's avatar
      [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataFrame.replace docstring · 9c4405e8
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Provides correct description of the semantics of a `dict` argument passed as `to_replace`.
      - Describes type requirements for collection arguments.
      - Describes behavior with `to_replace: List[T]` and `value: T`
      
      ## How was this patch tested?
      
      Manual testing, documentation build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16792 from zero323/SPARK-19453.
      9c4405e8
    • Xiao Li's avatar
      [SPARK-19589][SQL] Removal of SQLGEN files · 457850e6
      Xiao Li authored
      ### What changes were proposed in this pull request?
      SQLGen is removed. Thus, the generated files should be removed too.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16921 from gatorsmile/removeSQLGenFiles.
      457850e6
    • Sunitha Kambhampati's avatar
      [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTable api call in the doc · 9b5e460a
      Sunitha Kambhampati authored
      ## What changes were proposed in this pull request?
      
      https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
      In the doc, the calls `spark.cacheTable("tableName")` and `spark.uncacheTable("tableName")` actually need to be `spark.catalog.cacheTable` and `spark.catalog.uncacheTable`.
      
      ## How was this patch tested?
      Built the docs and verified the change shows up fine.
      
      Author: Sunitha Kambhampati <skambha@us.ibm.com>
      
      Closes #16919 from skambha/docChange.
      9b5e460a
  2. Feb 13, 2017
    • Xin Wu's avatar
      [SPARK-19539][SQL] Block duplicate temp table during creation · 1ab97310
      Xin Wu authored
      ## What changes were proposed in this pull request?
      The current `CREATE TEMPORARY TABLE ...` is deprecated; users are recommended to use `CREATE TEMPORARY VIEW ...`, and it does not support an `IF NOT EXISTS` clause. However, if there is an existing temporary view defined, it is possible to unintentionally replace this existing view by issuing `CREATE TEMPORARY TABLE ...` with the same table/view name.
      
      This PR is to disallow `CREATE TEMPORARY TABLE ...` with an existing view name.
      Under the cover, `CREATE TEMPORARY TABLE ...` will be changed to create a temporary view, but passing in the flag `replace=false` instead of the current `true`. So when creating the temporary view under the cover, if there is an existing view with the same name, the operation will be blocked.
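The `replace=false` behavior can be sketched with a toy view registry (hypothetical names; the real change goes through the session catalog):

```python
class ViewRegistry:
    def __init__(self):
        self._views = {}

    def create_temp_view(self, name, plan, replace):
        # With replace=False (the CREATE TEMPORARY TABLE path), an existing
        # view with the same name blocks the operation.
        if not replace and name in self._views:
            raise ValueError(f"Temporary view '{name}' already exists")
        self._views[name] = plan

reg = ViewRegistry()
reg.create_temp_view("t", "plan1", replace=False)
try:
    reg.create_temp_view("t", "plan2", replace=False)
    blocked = False
except ValueError:
    blocked = True
assert blocked
reg.create_temp_view("t", "plan3", replace=True)  # CREATE TEMPORARY VIEW path
```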
      
      ## How was this patch tested?
      A new unit test case is added, and some existing test cases are updated to adapt to the new behavior.
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #16878 from xwu0226/block_duplicate_temp_table.
      1ab97310
    • ouyangxiaochen's avatar
      [SPARK-19115][SQL] Supporting Create Table Like Location · 6e45b547
      ouyangxiaochen authored
      ## What changes were proposed in this pull request?
      
      Support `CREATE [EXTERNAL] TABLE LIKE ... LOCATION ...` syntax for Hive serde and datasource tables.
      In this PR, we follow the SparkSQL design rules:
      
      - supporting create table like a view, physical table, or temporary view, with a location;
      - when creating a table with a location, the table will be an external table rather than a managed table.
      
      ## How was this patch tested?
      
      Added new test cases and updated existing test cases.
      
      Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
      
      Closes #16868 from ouyangxiaochen/spark19115.
      6e45b547
    • zero323's avatar
      [SPARK-19429][PYTHON][SQL] Support slice arguments in Column.__getitem__ · e02ac303
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add support for `slice` arguments in `Column.__getitem__`.
      - Remove obsolete `__getslice__` bindings.
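Slice handling in `__getitem__` can be sketched with a toy column (an illustration; in Spark the slice maps onto a substring expression):

```python
class Column:
    def __init__(self, value):
        self.value = value

    def __getitem__(self, key):
        if isinstance(key, slice):
            if key.step is not None:
                raise ValueError("slice with step is not supported")
            # Map c[start:stop] onto a substring-style operation.
            return self.value[key.start:key.stop]
        return self.value[key]

c = Column("abcdef")
assert c[1:3] == "bc"
assert c[0] == "a"
```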
      
      ## How was this patch tested?
      
      Existing unit tests, additional tests covering `[]` with `slice`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16771 from zero323/SPARK-19429.
      e02ac303
    • Marcelo Vanzin's avatar
      [SPARK-19520][STREAMING] Do not encrypt data written to the WAL. · 0169360e
      Marcelo Vanzin authored
      Spark's I/O encryption uses an ephemeral key for each driver instance.
      So driver B cannot decrypt data written by driver A since it doesn't
      have the correct key.
      
      The write ahead log is used for recovery, thus needs to be readable by
      a different driver. So it cannot be encrypted by Spark's I/O encryption
      code.
      
      The BlockManager APIs used by the WAL code to write the data automatically
      encrypt data, so changes are needed so that callers can opt out of
      encryption.
      
      Aside from that, the "putBytes" API in the BlockManager does not do
      encryption, so a separate situation arose where the WAL would write
      unencrypted data to the BM and, when those blocks were read, decryption
      would fail. So the WAL code needs to ask the BM to encrypt that data
      when encryption is enabled; this code is not optimal since it results
      in a (temporary) second copy of the data block in memory, but should be
      OK for now until a more performant solution is added. The non-encryption
      case should not be affected.
      
      Tested with new unit tests, and by running streaming apps that do
      recovery using the WAL data with I/O encryption turned on.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16862 from vanzin/SPARK-19520.
      0169360e
    • hyukjinkwon's avatar
      [SPARK-19435][SQL] Type coercion between ArrayTypes · 9af8f743
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support type coercion between `ArrayType`s where the element types are compatible.
      
      **Before**
      
      ```
      Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got GREATEST(array<int>, array<double>).; line 1 pos 0;
      
      Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve 'least(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got LEAST(array<int>, array<double>).; line 1 pos 0;
      
      sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
      org.apache.spark.sql.AnalysisException: incompatible types found in column a for inline table; line 1 pos 14
      
      Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
      org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(DoubleType,false) <> ArrayType(IntegerType,false) at the first column of the second table;;
      
      sql("SELECT IF(1=1, array(1), array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve '(IF((1 = 1), array(1), array(1.0D)))' due to data type mismatch: differing types in '(IF((1 = 1), array(1), array(1.0D)))' (array<int> and array<double>).; line 1 pos 7;
      ```
      
      **After**
      
      ```scala
      Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
      res5: org.apache.spark.sql.DataFrame = [greatest(a, array(1.0)): array<double>]
      
      Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
      res6: org.apache.spark.sql.DataFrame = [least(a, array(1.0)): array<double>]
      
      sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
      res8: org.apache.spark.sql.DataFrame = [a: array<double>]
      
      Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
      res10: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: array<double>]
      
      sql("SELECT IF(1=1, array(1), array(1D))")
      res15: org.apache.spark.sql.DataFrame = [(IF((1 = 1), array(1), array(1.0))): array<double>]
      ```
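The coercion can be sketched as finding the widest common element type (a toy Python sketch with a hypothetical two-type widening table, not the actual `TypeCoercion` rule):

```python
# A tiny numeric widening lattice: int widens to double.
WIDEN = {("int", "int"): "int", ("int", "double"): "double",
         ("double", "int"): "double", ("double", "double"): "double"}

def common_array_type(left, right):
    """Return array<T> where T is the widest common element type, else None."""
    if left.startswith("array<") and right.startswith("array<"):
        l, r = left[6:-1], right[6:-1]
        elem = WIDEN.get((l, r))
        return f"array<{elem}>" if elem else None
    return None

assert common_array_type("array<int>", "array<double>") == "array<double>"
assert common_array_type("array<int>", "array<int>") == "array<int>"
```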
      
      ## How was this patch tested?
      
      Unit tests in `TypeCoercion`, Jenkins tests, and building with Scala 2.10:
      
      ```
      ./dev/change-scala-version.sh 2.10
      ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16777 from HyukjinKwon/SPARK-19435.
      9af8f743
    • Shixiong Zhu's avatar
      [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using... · 905fdf0c
      Shixiong Zhu authored
      [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader to load Netty generated classes
      
      ## What changes were proposed in this pull request?
      
      Netty's `MessageToMessageEncoder` uses [Javassist](https://github.com/netty/netty/blob/91a0bdc17a8298437d6de08a8958d753799bd4a6/common/src/main/java/io/netty/util/internal/JavassistTypeParameterMatcherGenerator.java#L62) to generate a matcher class and the implementation calls `Class.forName` to check if this class is already generated. If `MessageEncoder` or `MessageDecoder` is created in `ExecutorClassLoader.findClass`, it will cause `ClassCircularityError`. This is because loading this Netty generated class will call `ExecutorClassLoader.findClass` to search this class, and `ExecutorClassLoader` will try to use RPC to load it and cause to load the non-exist matcher class again. JVM will report `ClassCircularityError` to prevent such infinite recursion.
      
      ##### Why it only happens in Maven builds
      
      It's because Maven and SBT have different class loader tree. The Maven build will set a URLClassLoader as the current context class loader to run the tests and expose this issue. The class loader tree is as following:
      
      ```
      bootstrap class loader ------ ... ----- REPL class loader ---- ExecutorClassLoader
      |
      |
      URLClassLoader
      ```
      
      The SBT build uses the bootstrap class loader directly and `ReplSuite.test("propagation of local properties")` is the first test in ReplSuite, which happens to load `io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher` into the bootstrap class loader (Note: in the Maven build, it's loaded into URLClassLoader so it cannot be found in ExecutorClassLoader). This issue can be reproduced in SBT as well. Here are the reproduce steps:
      - Enable `hadoop.caller.context.enabled`.
      - Replace `Class.forName` with `Utils.classForName` in `object CallerContext`.
      - Ignore `ReplSuite.test("propagation of local properties")`.
      - Run `ReplSuite` using SBT.
      
      This PR just creates a singleton MessageEncoder and MessageDecoder and makes sure they are created before switching to ExecutorClassLoader. TransportContext will be created when creating RpcEnv and that happens before creating ExecutorClassLoader.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16859 from zsxwing/SPARK-17714.
      905fdf0c
    • Shixiong Zhu's avatar
      [SPARK-19542][SS] Delete the temp checkpoint if a query is stopped without errors · 3dbff9be
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When a query uses a temp checkpoint dir, it's better to delete it if it's stopped without errors.
      
      ## How was this patch tested?
      
      New unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16880 from zsxwing/delete-temp-checkpoint.
      3dbff9be
    • Ala Luszczak's avatar
      [SPARK-19514] Enhancing the test for Range interruption. · 0417ce87
      Ala Luszczak authored
      Improve the test for SPARK-19514, so that it's clear which stage is being cancelled.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16914 from ala/fix-range-test.
      0417ce87
    • Josh Rosen's avatar
      [SPARK-19529] TransportClientFactory.createClient() shouldn't call awaitUninterruptibly() · 1c4d10b1
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch replaces a single `awaitUninterruptibly()` call with a plain `await()` call in Spark's `network-common` library in order to fix a bug which may cause tasks to be uncancellable.
      
      In Spark's Netty RPC layer, `TransportClientFactory.createClient()` calls `awaitUninterruptibly()` on a Netty future while waiting for a connection to be established. This creates problem when a Spark task is interrupted while blocking in this call (which can happen in the event of a slow connection which will eventually time out). This has bad impacts on task cancellation when `interruptOnCancel = true`.
      
      As an example of the impact of this problem, I experienced significant numbers of uncancellable "zombie tasks" on a production cluster where several tasks were blocked trying to connect to a dead shuffle server and then continued running as zombies after I cancelled the associated Spark stage. The zombie tasks ran for several minutes with the following stack:
      
      ```
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:460)
      io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607)
      io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
      org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
      org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) => holding Monitor(java.lang.Object1849476028})
      org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
      org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
      org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
      org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
      350)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
      org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
      org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      [...]
      ```
      
      As far as I can tell, `awaitUninterruptibly()` might have been used in order to avoid having to declare that methods throw `InterruptedException` (this code is written in Java, hence the need to use checked exceptions). This patch simply replaces this with a regular, interruptible `await()` call.
      
      This required several interface changes to declare a new checked exception (these are internal interfaces, though, and this change doesn't significantly impact binary compatibility).
      
      An alternative approach would be to wrap `InterruptedException` into `IOException` in order to avoid having to change interfaces. The problem with this approach is that the `network-shuffle` project's `RetryingBlockFetcher` code treats `IOExceptions` as transient failures when deciding whether to retry fetches, so throwing a wrapped `IOException` might cause an interrupted shuffle fetch to be retried, further prolonging the lifetime of a cancelled zombie task.
      
      Note that there are three other `awaitUninterruptibly()` calls in the codebase, but those calls have a hard 10 second timeout and are waiting on a `close()` operation which is expected to complete near instantaneously, so the impact of uninterruptibility there is much smaller.
      
      ## How was this patch tested?
      
      Manually.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16866 from JoshRosen/SPARK-19529.
      1c4d10b1
    • zero323's avatar
      [SPARK-19427][PYTHON][SQL] Support data type string as a returnType argument of UDF · ab88b241
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add support for data type string as a return type argument of `UserDefinedFunction`:
      
      ```python
      f = udf(lambda x: x, "integer")
      f.returnType
      ## IntegerType
      ```
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests covering new feature.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16769 from zero323/SPARK-19427.
      ab88b241
    • zero323's avatar
      [SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml.util · 5e7cd332
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add missing `warnings` import.
      
      ## How was this patch tested?
      
      Manual tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16846 from zero323/SPARK-19506.
      5e7cd332
    • hyukjinkwon's avatar
      [SPARK-19544][SQL] Improve error message when some column types are compatible... · 4321ff9e
      hyukjinkwon authored
      [SPARK-19544][SQL] Improve error message when some column types are compatible and others are not in set operations
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the error message when some data types are compatible and others are not in set/union operation.
      
      Currently, the code below:
      
      ```scala
      Seq((1,("a", 1))).toDF.union(Seq((1L,("a", "b"))).toDF)
      ```
      
      throws an exception saying `LongType` and `IntegerType` are incompatible types. It should say something about the `StructType`s in a more readable format, as below:
      
      **Before**
      
      ```
      Union can only be performed on tables with the compatible column types.
      LongType <> IntegerType at the first column of the second table;;
      ```
      
      **After**
      
      ```
      Union can only be performed on tables with the compatible column types.
      struct<_1:string,_2:string> <> struct<_1:string,_2:int> at the second column of the second table;;
      ```
      
      *I manually inserted a newline in the messages above for readability only in this PR description.
      
      ## How was this patch tested?
      
      Unit tests in `AnalysisErrorSuite`, manual tests, and a build with Scala 2.10.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16882 from HyukjinKwon/SPARK-19544.
      4321ff9e
    • windpiger's avatar
      [SPARK-19496][SQL] to_date udf to return null when input date is invalid · 04ad8225
      windpiger authored
      ## What changes were proposed in this pull request?
      
      Currently the udf `to_date` returns inconsistent values for invalid date inputs:
      
      ```
      SELECT to_date('2015-07-22', 'yyyy-dd-MM') ->  return `2016-10-07`
      SELECT to_date('2014-31-12')    -> return null
      ```
      
      As discussed in JIRA [SPARK-19496](https://issues.apache.org/jira/browse/SPARK-19496), we should return null in both situations when the input date is invalid.
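The intended semantics resemble a parse that yields null on any failure; here is a Python sketch using `datetime.strptime` (Spark's date parsing differs in its details):

```python
from datetime import datetime

def to_date(s, fmt="%Y-%m-%d"):
    """Return the parsed date, or None when the input is invalid."""
    try:
        return datetime.strptime(s, fmt).date()
    except ValueError:
        return None

assert to_date("2015-07-22") is not None
assert to_date("2014-31-12") is None              # month 31 is invalid
assert to_date("2015-07-22", "%Y-%d-%m") is None  # month 22 is invalid
```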
      
      ## How was this patch tested?
      unit test added
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16870 from windpiger/to_date.
      04ad8225
    • Armin Braun's avatar
      [SPARK-19562][BUILD] Added exclude for dev/pr-deps to gitignore · 8f03ad54
      Armin Braun authored
      ## What changes were proposed in this pull request?
      
      Just adding a missing .gitignore entry.
      
      ## How was this patch tested?
      
      Entry added; the repo is no longer dirty after running the build.
      
      Author: Armin Braun <me@obrown.io>
      
      Closes #16904 from original-brownbear/SPARK-19562.
      8f03ad54
    • Xiao Li's avatar
      [SPARK-19574][ML][DOCUMENTATION] Fix Liquid Exception: Start indices amount is... · 855a1b75
      Xiao Li authored
      [SPARK-19574][ML][DOCUMENTATION] Fix Liquid Exception: Start indices amount is not equal to end indices amount
      
      ### What changes were proposed in this pull request?
      ```
      Liquid Exception: Start indices amount is not equal to end indices amount, see /Users/xiao/IdeaProjects/sparkDelivery/docs/../examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java. in ml-features.md
      ```
      
      So far, the build is broken after merging https://github.com/apache/spark/pull/16789
      
      This PR is to fix it.
      
      ## How was this patch tested?
      Manual
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16908 from gatorsmile/docMLFix.
      855a1b75
    • Liwei Lin's avatar
      [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group · 2bdbc870
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
In `KafkaOffsetReader`, when an error occurs, we abort the existing consumer and create a new one. In our current implementation, the first consumer and the second consumer end up in the same group (which leads to SPARK-19559), **_violating our intention that the two consumers not be in the same group._**
      
      The cause is that, in our current implementation, the first consumer is created before `groupId` and `nextId` are initialized in the constructor. Then even if `groupId` and `nextId` are increased during the creation of that first consumer, `groupId` and `nextId` would still be initialized to default values in the constructor for the second consumer.
      
      We should make sure that `groupId` and `nextId` are initialized before any consumer is created.
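The fix pattern, initializing the id state before any consumer is created, can be sketched in Python (an illustrative stand-in, not the actual `KafkaOffsetReader` code):

```python
class KafkaOffsetReader:
    """Sketch: initialize group-id state *before* creating any consumer,
    so a recreated consumer never falls back to a default group id."""

    def __init__(self, group_id_prefix):
        # Id state must be initialized first; _create_consumer relies on it.
        self.group_id_prefix = group_id_prefix
        self.next_id = 0
        self.consumer = self._create_consumer()

    def _next_group_id(self):
        gid = f"{self.group_id_prefix}-{self.next_id}"
        self.next_id += 1
        return gid

    def _create_consumer(self):
        # Stand-in for building a real Kafka consumer with a fresh group id.
        return {"group.id": self._next_group_id()}

    def reset_consumer(self):
        # On error: abort the old consumer and create a new one in a new group.
        self.consumer = self._create_consumer()

reader = KafkaOffsetReader("spark-kafka")
first = reader.consumer["group.id"]
reader.reset_consumer()
second = reader.consumer["group.id"]
print(first, second)  # spark-kafka-0 spark-kafka-1 -- distinct groups
```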
      
      ## How was this patch tested?
      
Ran `KafkaSourceSuite` 100 times; all passed
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16902 from lw-lin/SPARK-19564-.
      2bdbc870
  3. Feb 12, 2017
    • titicaca's avatar
      [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column · bc0a0e63
      titicaca authored
      ## What changes were proposed in this pull request?
      
Fix a bug in the `collect` method for collecting a timestamp column; the bug can be reproduced with the following code and output:
      
      ```
      library(SparkR)
      sparkR.session(master = "local")
      df <- data.frame(col1 = c(0, 1, 2),
                       col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
      
      sdf1 <- createDataFrame(df)
      print(dtypes(sdf1))
      df1 <- collect(sdf1)
      print(lapply(df1, class))
      
      sdf2 <- filter(sdf1, "col1 > 0")
      print(dtypes(sdf2))
      df2 <- collect(sdf2)
      print(lapply(df2, class))
      ```
      
As we can see from the printed output, the column type of col2 in df2 is unexpectedly converted to numeric when NA is at the top of the column.
      
This is caused by `do.call(c, list)`: if we combine a list, e.g. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.
      
      Therefore, we need to cast the data type of the vector explicitly.
      
      ## How was this patch tested?
      
      The patch can be tested manually with the same code above.
      
      Author: titicaca <fangzhou.yang@hotmail.com>
      
      Closes #16689 from titicaca/sparkr-dev.
      bc0a0e63
    • windpiger's avatar
      [SPARK-19448][SQL] optimize some duplication functions between HiveClientImpl and HiveUtils · 3881f342
      windpiger authored
      ## What changes were proposed in this pull request?
      
There are some duplicate functions between `HiveClientImpl` and `HiveUtils`; we can merge them into one place, such as `toHiveTable`, `toHivePartition`, and `fromHivePartition`.
      
Additionally, `MetastoreRelation.attributes` is renamed to `MetastoreRelation.dataColKeys`:
      https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234
      
      ## How was this patch tested?
      N/A
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16787 from windpiger/todoInMetaStoreRelation.
      3881f342
  4. Feb 11, 2017
    • Kay Ousterhout's avatar
      [SPARK-19537] Move pendingPartitions to ShuffleMapStage. · 0fbecc73
      Kay Ousterhout authored
      The pendingPartitions instance variable should be moved to ShuffleMapStage,
      because it is only used by ShuffleMapStages. This change is purely refactoring
      and does not change functionality.
      
      I fixed this in an attempt to clarify some of the discussion around #16620, which I was having trouble reasoning about.  I stole the helpful comment Imran wrote for pendingPartitions and used it here.
      
      cc squito markhamstra jinxing64
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #16876 from kayousterhout/SPARK-19537.
      0fbecc73
  5. Feb 10, 2017
    • Herman van Hovell's avatar
      [SPARK-19548][SQL] Support Hive UDFs which return typed Lists/Maps · 226d3884
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
This PR adds support for Hive UDFs that return fully typed Java Lists or Maps, for example `List<String>` or `Map<String, Integer>`. It is also allowed to nest these structures, for example `Map<String, List<Integer>>`. Raw collections or collections using wildcards are still not supported, and cannot be supported due to the lack of type information.
      
      ## How was this patch tested?
Modified existing tests in `HiveUDFSuite`, and added test cases for raw collections and collections using wildcards.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16886 from hvanhovell/SPARK-19548.
      226d3884
    • Ala Luszczak's avatar
      [SPARK-19549] Allow providing reason for stage/job cancelling · d785217b
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
This change adds an optional argument to the `SparkContext.cancelStage()` and `SparkContext.cancelJob()` functions, which allows the caller to provide the exact reason for the cancellation.
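The optional-reason pattern can be sketched in Python (hypothetical names; not the actual `SparkContext` API):

```python
class Context:
    """Sketch of cancellation that records an optional, caller-supplied reason."""

    def __init__(self):
        self.cancelled = {}  # stage_id -> reason string

    def cancel_stage(self, stage_id, reason=None):
        # Keep the caller's reason if given; fall back to a generic message.
        self.cancelled[stage_id] = reason or f"Stage {stage_id} cancelled"

ctx = Context()
ctx.cancel_stage(1)
ctx.cancel_stage(2, reason="killed via web UI")
print(ctx.cancelled[1])  # Stage 1 cancelled
print(ctx.cancelled[2])  # killed via web UI
```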
      
      ## How was this patch tested?
      
Adds a unit test.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16887 from ala/cancel.
      d785217b
    • sueann's avatar
      [SPARK-18613][ML] make spark.mllib LDA dependencies in spark.ml LDA private · 3a43ae7c
      sueann authored
      ## What changes were proposed in this pull request?
      spark.ml.*LDAModel classes were exposing spark.mllib LDA models via protected methods. Made them package (clustering) private.
      
      ## How was this patch tested?
      ```
build/sbt doc  # "mllib.clustering" no longer appears in the docs for *LDA* classes
      build/sbt compile  # compiles
      build/sbt
      > mllib/testOnly   # tests pass
      ```
      
      Author: sueann <sueann@databricks.com>
      
      Closes #16860 from sueann/SPARK-18613.
      3a43ae7c
    • Herman van Hovell's avatar
      [SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata · de8a03e6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
Reading from an existing ORC table which contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that Spark internally replaces `char` and `varchar` columns with a `string` column.
      
      This PR fixes this by adding the hive type to the `StructField's` metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader, see https://github.com/apache/spark/pull/16060 for more details on how the metadata is used.
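The idea of stashing the original Hive type in field metadata can be sketched in Python; the dict shapes and helper name are illustrative, only the `HIVE_TYPE_STRING` key comes from the PR:

```python
# The engine-facing type is widened to string, but the source Hive type
# is preserved under a metadata key so readers can recover it.
HIVE_TYPE_STRING = "HIVE_TYPE_STRING"

def to_struct_field(name, hive_type):
    """Map a Hive column to an engine field, remembering the source type."""
    if hive_type.startswith(("char", "varchar")):
        return {"name": name, "type": "string",
                "metadata": {HIVE_TYPE_STRING: hive_type}}
    return {"name": name, "type": hive_type, "metadata": {}}

field = to_struct_field("code", "varchar(10)")
print(field["type"])                        # string
print(field["metadata"][HIVE_TYPE_STRING])  # varchar(10)
```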
      
      ## How was this patch tested?
      Added a regression test to `OrcSourceSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16804 from hvanhovell/SPARK-19459.
      de8a03e6
    • Eren Avsarogullari's avatar
      [SPARK-19466][CORE][SCHEDULER] Improve Fair Scheduler Logging · dadff5f0
      Eren Avsarogullari authored
Fair Scheduler logging for the following cases can be useful to the user.
      
1. If a **valid** `spark.scheduler.allocation.file` property is set, the user can be informed which scheduler file is processed when `SparkContext` initializes.
      
2. If an **invalid** `spark.scheduler.allocation.file` property is set, currently the following stacktrace is shown to the user. In addition, a more meaningful message can be shown to the user by emphasizing the problem at the build level of the fair scheduler. Other potential issues can also be covered at this level, as **Fair Scheduler cannot be built. + exception stacktrace**
      ```
      Exception in thread "main" java.io.FileNotFoundException: INVALID_FILE (No such file or directory)
      	at java.io.FileInputStream.open0(Native Method)
      	at java.io.FileInputStream.open(FileInputStream.java:195)
      	at java.io.FileInputStream.<init>(FileInputStream.java:138)
      	at java.io.FileInputStream.<init>(FileInputStream.java:93)
      	at org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:76)
      	at org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:75)
      ```
3. If the `spark.scheduler.allocation.file` property is not set and the **default** fair scheduler file (**fairscheduler.xml**) is found in the classpath, it will be loaded, but currently the user is not informed that the default file is being used, so logging can be useful, as in **Fair Scheduler file: fairscheduler.xml is found successfully and will be parsed.**
      
4. If the **spark.scheduler.allocation.file** property is not set and the **default** fair scheduler file does not exist in the classpath, currently the user is not informed, so logging can be useful, as in **No Fair Scheduler file found.**
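The four logging cases above can be sketched in Python (illustrative only, not the actual `FairSchedulableBuilder` code):

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("FairSchedulableBuilder")

DEFAULT_SCHEDULER_FILE = "fairscheduler.xml"

def build_pools(scheduler_file=None):
    """Log informatively for each of the four configuration cases."""
    if scheduler_file is not None:
        if os.path.isfile(scheduler_file):
            # Case 1: a valid file is configured -- say which one is used.
            log.info("Creating Fair Scheduler pools from %s", scheduler_file)
        else:
            # Case 2: an invalid file is configured -- fail with a clear message.
            raise FileNotFoundError(
                f"Fair Scheduler cannot be built: {scheduler_file} not found")
    elif os.path.isfile(DEFAULT_SCHEDULER_FILE):
        # Case 3: the default file exists -- tell the user it is being used.
        log.info("Fair Scheduler file %s found and will be parsed",
                 DEFAULT_SCHEDULER_FILE)
    else:
        # Case 4: no file at all -- make the fallback explicit.
        log.warning("No Fair Scheduler file found; using default pool")
```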
      
This PR is also related to https://github.com/apache/spark/pull/15237, emphasizing the fileName in warning logs when a fair scheduler file has invalid minShare, weight or schedulingMode values.
      
      ## How was this patch tested?
      Added new Unit Tests.
      
      Author: erenavsarogullari <erenavsarogullari@gmail.com>
      
      Closes #16813 from erenavsarogullari/SPARK-19466.
      dadff5f0
    • Hervé's avatar
      Encryption of shuffle files · c5a66356
      Hervé authored
      Hello
      
      According to my understanding of commits 4b4e329e & 8b325b17, one may now encrypt shuffle files regardless of the cluster manager in use.
      
However, I have limited understanding of the code, so I'm not able to find out whether these changes also cover all "temporary local storage, such as shuffle files, cached data, and other application files".
      
      Please feel free to amend or reject my PR if I'm wrong.
      
      dud
      
      Author: Hervé <dud225@users.noreply.github.com>
      
      Closes #16885 from dud225/patch-1.
      c5a66356
    • Devaraj K's avatar
      [SPARK-10748][MESOS] Log error instead of crashing Spark Mesos dispatcher when... · 8640dc08
      Devaraj K authored
      [SPARK-10748][MESOS] Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured
      
      ## What changes were proposed in this pull request?
      
Now handling the Spark exception which gets thrown for an invalid job configuration, marking that job as failed, and continuing to launch the other drivers instead of throwing the exception.
      ## How was this patch tested?
      
      I verified manually, now the misconfigured jobs move to Finished Drivers section in UI and continue to launch the other jobs.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #13077 from devaraj-kavali/SPARK-10748.
      8640dc08
    • jerryshao's avatar
      [SPARK-19545][YARN] Fix compile issue for Spark on Yarn when building against Hadoop 2.6.0~2.6.3 · 8e8afb3a
      jerryshao authored
      ## What changes were proposed in this pull request?
      
Due to a newly added API in Hadoop 2.6.4+, Spark builds against Hadoop 2.6.0~2.6.3 will hit a compile error. So here we still revert back to using reflection to handle this issue.
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16884 from jerryshao/SPARK-19545.
      8e8afb3a
    • Burak Yavuz's avatar
      [SPARK-19543] from_json fails when the input row is empty · d5593f7f
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
Using `from_json` on a column with an empty string results in `java.util.NoSuchElementException: head of empty list`.
      
This is because `parser.parse(input)` may return `Nil` when `input.trim.isEmpty`.
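The fix, treating an empty parse result as null rather than taking the head of an empty list, can be sketched in Python (an illustrative analogue using the stdlib `json` module):

```python
import json

def from_json(raw):
    """Return None for empty or invalid input instead of failing
    when the parser produces no records (illustrative sketch)."""
    if raw is None or not raw.strip():
        return None  # an empty row parses to no records, not an error
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return None

print(from_json('{"a": 1}'))  # {'a': 1}
print(from_json(""))          # None -- previously a "head of empty list" crash
```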
      
      ## How was this patch tested?
      
      Regression test in `JsonExpressionsSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16881 from brkyvz/json-fix.
      d5593f7f
  6. Feb 09, 2017
    • jinxing's avatar
      [SPARK-19263] Fix race in SchedulerIntegrationSuite. · fd6c3a0b
      jinxing authored
      ## What changes were proposed in this pull request?
      
The whole process of offering resources and generating `TaskDescription`s should be guarded by `taskScheduler.synchronized` in `reviveOffers`; otherwise there is a race condition.
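The guarded offer-and-describe sequence can be sketched in Python with a single lock (illustrative only, not the actual scheduler code):

```python
import threading

class TaskScheduler:
    """Sketch: guard the whole offer-and-describe sequence with one lock so
    concurrent revive_offers() calls cannot interleave."""

    def __init__(self):
        self._lock = threading.Lock()
        self.free_cores = 4
        self.task_descriptions = []

    def revive_offers(self, task_id):
        with self._lock:
            # Resource offering and TaskDescription generation must happen
            # atomically; otherwise two threads can claim the same core.
            if self.free_cores > 0:
                self.free_cores -= 1
                self.task_descriptions.append(f"task-{task_id}")

sched = TaskScheduler()
threads = [threading.Thread(target=sched.revive_offers, args=(i,))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(sched.task_descriptions))  # 4 -- never more tasks than cores
```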
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: jinxing <jinxing@meituan.com>
      
      Closes #16831 from jinxing64/SPARK-19263-FixRaceInTest.
      fd6c3a0b
    • Shixiong Zhu's avatar
      [SPARK-19481] [REPL] [MAVEN] Avoid to leak SparkContext in Signaling.cancelOnInterrupt · 303f00a4
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
`Signaling.cancelOnInterrupt` leaks a `SparkContext` per call, which makes `ReplSuite` unstable.
      
      This PR adds `SparkContext.getActive` to allow `Signaling.cancelOnInterrupt` to get the active `SparkContext` to avoid the leak.
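The `getActive` pattern can be sketched in Python (an illustrative class, not the real `SparkContext`):

```python
import threading

class SparkContextSketch:
    """Sketch of the getActive pattern: expose the single active context
    instead of having callers construct (and leak) new ones."""

    _active = None
    _lock = threading.Lock()

    def __init__(self):
        with SparkContextSketch._lock:
            SparkContextSketch._active = self

    @classmethod
    def get_active(cls):
        with cls._lock:
            return cls._active  # may be None if no context is running

    def stop(self):
        with SparkContextSketch._lock:
            if SparkContextSketch._active is self:
                SparkContextSketch._active = None

ctx = SparkContextSketch()
print(SparkContextSketch.get_active() is ctx)  # True
ctx.stop()
print(SparkContextSketch.get_active())         # None
```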
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16825 from zsxwing/SPARK-19481.
      303f00a4