- Jan 05, 2016
-
-
Reynold Xin authored
I looked at each case individually and it looks like they can all be removed. The only one I had to think twice about was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray return java.util.List). Author: Reynold Xin <rxin@databricks.com> Closes #10569 from rxin/SPARK-12615.
-
Wenchen Fan authored
Address comments in #10435. This makes the API easier to use when users programmatically generate the call to hash, and they will get an AnalysisException if the argument list of hash is empty. Author: Wenchen Fan <wenchen@databricks.com> Closes #10588 from cloud-fan/hash.
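For reference, a minimal usage sketch of the function form (not code from the PR; `df` is an assumed existing DataFrame with columns a and b):

```scala
import org.apache.spark.sql.functions.{col, hash}

// df: assumed existing DataFrame with columns "a" and "b".
// hash() over explicit columns; with this change an empty argument list
// fails analysis instead of being silently accepted.
val withBucket = df.withColumn("bucket", hash(col("a"), col("b")))
// hash() with no arguments would now throw an AnalysisException
```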
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-12643 Without setting the lib directory for antlr, updates to imported grammar files cannot be detected, so SparkSqlParser.g will not be rebuilt automatically. Since it is a minor update, no JIRA ticket is opened. Let me know if one is needed. Thanks. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10571 from viirya/antlr-build.
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-12438 ScalaReflection lacks support for SQLUserDefinedType. We should add it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #10390 from viirya/encoder-udt.
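A hedged sketch of the kind of class this change lets ScalaReflection encode; Point and PointUDT are hypothetical names, with UDT signatures as of Spark 1.6:

```scala
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// a user class tagged with its UDT, which ScalaReflection can now handle
@SQLUserDefinedType(udt = classOf[PointUDT])
case class Point(x: Double, y: Double)

class PointUDT extends UserDefinedType[Point] {
  // stored internally as a two-element double array
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)
  override def serialize(obj: Any): Any = obj match {
    case Point(x, y) => new GenericArrayData(Array[Any](x, y))
  }
  override def deserialize(datum: Any): Point = datum match {
    case a: ArrayData => Point(a.getDouble(0), a.getDouble(1))
  }
  override def userClass: Class[Point] = classOf[Point]
}
```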
-
Imran Younus authored
Modified the definition of R^2 for regression through the origin and added a modified test for regression metrics. Author: Imran Younus <iyounus@us.ibm.com> Author: Imran Younus <imranyounus@gmail.com> Closes #10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.
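For context, the commonly used definition for regression through the origin replaces the centered total sum of squares with an uncentered one (stated here for reference; see the PR for the exact change):

$$
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i y_i^2}
\qquad\text{rather than}\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}.
$$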
-
Kousuke Saruta authored
Currently we don't support Hadoop 0.23, but there is some code related to it, so let's clean it up. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #10590 from sarutak/SPARK-12641.
-
Michael Armbrust authored
Author: Michael Armbrust <michael@databricks.com> Closes #10516 from marmbrus/datasetCleanup.
-
Marcelo Vanzin authored
Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10575 from vanzin/SPARK-3873-examples.
-
felixcheung authored
rxin davies shivaram Took the save mode from my PR #10480 and moved everything to writer methods. This is related to PR #10559. - [x] jsonRDD() seemed broken and needed investigation - this is not a public API, though (fixed). Author: felixcheung <felixcheung_m@hotmail.com> Closes #10584 from felixcheung/rremovedeprecated.
-
- Jan 04, 2016
-
-
Reynold Xin authored
This addresses davies' code review feedback in https://github.com/apache/spark/pull/10559 Author: Reynold Xin <rxin@databricks.com> Closes #10586 from rxin/remove-deprecated-sql-followup.
-
felixcheung authored
checked that the change is in Spark 1.6.0. shivaram Author: felixcheung <felixcheung_m@hotmail.com> Closes #10574 from felixcheung/rwritemodedoc.
-
Wenchen Fan authored
Just write the arguments into an unsafe row and use murmur3 to calculate the hash code. Author: Wenchen Fan <wenchen@databricks.com> Closes #10435 from cloud-fan/hash-expr.
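A conceptual sketch of the idea only, not Spark's implementation (Spark hashes the packed row bytes with its own internal Murmur3 variant; the Scala stdlib murmur3 stands in here):

```scala
import scala.util.hashing.MurmurHash3

// pack the arguments into contiguous row bytes, then murmur3-hash them
def hashRowBytes(rowBytes: Array[Byte], seed: Int = 42): Int =
  MurmurHash3.bytesHash(rowBytes, seed)
```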
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #10559 from rxin/remove-deprecated-sql.
-
Narine Kokhlikyan authored
Currently, when we call corr or cov on a DataFrame with invalid input, we see these error messages for both corr and cov: - "Currently cov supports calculating the covariance between two columns" - "Covariance calculation for columns with dataType "[DataType Name]" not supported." I've fixed this issue by passing the function name as an argument. We could also do the input checks separately for each function; I avoided that because of code duplication. Thanks! Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com> Closes #10458 from NarineK/sparksqlstatsmessages.
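For reference, the two entry points that now share the validation path, in a minimal self-contained sketch (Spark 1.6-style SQLContext; data is made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("stats").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
val df = sqlContext.createDataFrame(Seq((1.0, 2.0), (2.0, 3.9), (3.0, 6.1))).toDF("a", "b")

val r = df.stat.corr("a", "b") // Pearson correlation
val c = df.stat.cov("a", "b")  // sample covariance
```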
-
Nong Li authored
The reader was previously not setting the row length, meaning it was wrong if there were variable-length columns. This problem does not usually manifest, since the value in the column is correct and projecting the row fixes the issue. Author: Nong Li <nong@databricks.com> Closes #10576 from nongli/spark-12589.
-
Davies Liu authored
This PR enables cube/rollup as functions, so they can be used like this:
```
select a, b, sum(c) from t group by rollup(a, b)
```
Author: Davies Liu <davies@databricks.com> Closes #10522 from davies/rollup.
-
Yanbo Liang authored
DecisionTreeRegressor will provide the variance of the prediction as a Double column. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8866 from yanboliang/spark-9622.
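A hedged usage sketch; the parameter name is assumed from how I read the change, and the label/features column names are placeholders:

```scala
import org.apache.spark.ml.regression.DecisionTreeRegressor

// ask transform() to emit the per-prediction variance as an extra column
val dt = new DecisionTreeRegressor()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setVarianceCol("variance") // Double column in the transformed output
```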
-
Yanbo Liang authored
See JIRA: https://issues.apache.org/jira/browse/SPARK-11259 Author: Yanbo Liang <ybliang8@gmail.com> Closes #9224 from yanboliang/spark-11259.
-
Herman van Hovell authored
It is currently possible to change the values of the supposedly immutable `GenericRow` and `GenericInternalRow` classes. This is caused by the fact that Scala's ArrayOps `toArray` (returned by calling `toSeq`) will return the backing array instead of a copy. This PR fixes this problem. This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1. cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation). Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #10553 from hvanhovell/SPARK-12421.
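The underlying Scala behavior, as a minimal demonstration (not code from the PR):

```scala
// on Scala 2.x, Array#toSeq wraps the backing array rather than copying it,
// so the supposedly immutable Seq observes later mutation of the array
val backing = Array[Any](1, "a")
val seq: Seq[Any] = backing.toSeq // a WrappedArray sharing `backing`
backing(0) = 99
assert(seq(0) == 99) // the "immutable" view changed underneath us
```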
-
tedyu authored
This is the related thread: http://search-hadoop.com/m/q3RTtO3ReeJ1iF02&subj=Re+partitioning+json+data+in+spark Michael suggested fixing the doc. Please review. Author: tedyu <yuzhihong@gmail.com> Closes #10499 from ted-yu/master.
-
Xiu Guo authored
Author: Xiu Guo <xguo27@gmail.com> Closes #10500 from xguo27/SPARK-12512.
-
Shixiong Zhu authored
[SPARK-12608][STREAMING] Remove submitJobThreadPool since submitJob doesn't create a separate thread to wait for the job result. Before #9264, submitJob would create a separate thread to wait for the job result. `submitJobThreadPool` was a workaround in `ReceiverTracker` to run these waiting-for-job-result threads. Now that #9264 has been merged to master and has resolved this blocking issue, `submitJobThreadPool` can be removed. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10560 from zsxwing/remove-submitJobThreadPool.
-
Pete Robbins authored
Also, only allocate the required buffer size. Author: Pete Robbins <robbinspg@gmail.com> Closes #10421 from robbinspg/master.
-
Josh Rosen authored
Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection. In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection. This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly). If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different). This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons). Author: Josh Rosen <joshrosen@databricks.com> Closes #10519 from JoshRosen/jdbc-driver-precedence.
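A minimal sketch of the lookup order described above; `findDriver` is a hypothetical helper, not the patch's actual code:

```scala
import java.sql.{Driver, DriverManager}
import scala.collection.JavaConverters._

def findDriver(url: String, userSpecifiedClass: Option[String]): Driver =
  userSpecifiedClass match {
    case Some(cls) =>
      // iterate the registered drivers so the user's choice wins over
      // whichever driver happens to claim the subprotocol first
      DriverManager.getDrivers.asScala
        .find(_.getClass.getCanonicalName == cls)
        .getOrElse(throw new IllegalStateException(s"Driver $cls not registered"))
    case None =>
      DriverManager.getDriver(url) // fall back to normal resolution
  }
```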
-
Nong Li authored
This patch updates the ExecutorRunner's termination path to use the new Java 8 API to terminate processes more forcefully if possible. If the executor is unhealthy, it would previously ignore the destroy() call. Presumably, the new Java API was added to handle cases like this. We could update the termination path in the future to use OS-specific commands for older Java versions. Author: Nong Li <nong@databricks.com> Closes #10438 from nongli/spark-12486-executors.
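A hedged sketch of the Java 8 termination path (the grace period and helper are assumptions, not the patch's actual code):

```scala
import java.util.concurrent.TimeUnit

// try a polite destroy() first, then escalate if the process lingers
def terminate(process: Process, graceSeconds: Long = 10): Int = {
  process.destroy()
  if (!process.waitFor(graceSeconds, TimeUnit.SECONDS)) {
    process.destroyForcibly() // Java 8+: forceful kill where the OS allows it
  }
  process.waitFor() // exit code
}
```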
-
guoxu1231 authored
Explicitly close the client-side socket connection before restarting the socket receiver. Author: guoxu1231 <guoxu1231@gmail.com> Author: Shawn Guo <guoxu1231@gmail.com> Closes #10464 from guoxu1231/SPARK-12513.
-
Josh Rosen authored
[SPARK-10359][PROJECT-INFRA] Use more random number in dev/test-dependencies.sh; fix version switching. This patch aims to fix another potential source of flakiness in the `dev/test-dependencies.sh` script. pwendell's original patch and my version used `$(date +%s | tail -c6)` to generate a suffix to use when installing temporary Spark versions into the local Maven cache, but this value only changes once per second and thus is highly collision-prone when concurrent builds launch on AMPLab Jenkins. In order to reduce the potential for conflicts, this patch updates the script to call Python's random number generator instead. I also fixed a bug in how we captured the original project version; the bug was causing the exit handler code to fail. Author: Josh Rosen <joshrosen@databricks.com> Closes #10558 from JoshRosen/build-dep-tests-round-3.
-
Josh Rosen authored
There are a couple of places in the `dev/run-tests-*.py` scripts which deal with Hadoop profiles, but the set of profiles that they handle does not include all Hadoop profiles defined in our POM. Similarly, the `hadoop-2.2` and `hadoop-2.6` profiles were missing from `dev/deps`. This patch updates these scripts to include all four Hadoop profiles defined in our POM. Author: Josh Rosen <joshrosen@databricks.com> Closes #10565 from JoshRosen/add-missing-hadoop-profiles-in-test-scripts.
-
- Jan 03, 2016
-
-
Xiu Guo authored
Author: Xiu Guo <xguo27@gmail.com> Closes #10515 from xguo27/SPARK-12562.
-
Holden Karau authored
Previously (when the PR was first created) not specifying b= explicitly was fine (it was treated as the default null); instead, be explicit about b being None in the test. Author: Holden Karau <holden@us.ibm.com> Closes #10564 from holdenk/SPARK-12611-fix-test-infer-schema-local.
-
Cazen authored
This provides an option so that the JSON parser can be configured to accept backslash quoting of any character or not. Author: Cazen <Cazen@korea.com> Author: Cazen Lee <cazen.lee@samsung.com> Author: Cazen Lee <Cazen@korea.com> Author: cazen.lee <cazen.lee@samsung.com> Closes #10497 from Cazen/master.
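A hedged usage sketch; the option name is assumed from the Jackson feature it toggles (ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER), and sqlContext is an assumed existing SQLContext:

```scala
// sqlContext: assumed existing SQLContext; path is a placeholder
val df = sqlContext.read
  .option("allowBackslashEscapingAnyCharacter", "true")
  .json("people.json")
```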
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #10561 from rxin/update-mima.
-
thomastechs authored
Avoid the 'No such table' exception and throw an AnalysisException instead, as per the bug SPARK-12533. Author: thomastechs <thomas.sebastian@tcs.com> Closes #10529 from thomastechs/topic-branch.
-
felixcheung authored
shivaram Author: felixcheung <felixcheung_m@hotmail.com> Closes #10408 from felixcheung/rcodecomment.
-
Reynold Xin authored
This reverts commit 44ee920f.
-
Reynold Xin authored
callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduces a new API for that and replaces the invocation of the deprecated callUDF with it. Author: Reynold Xin <rxin@databricks.com> Closes #10547 from rxin/SPARK-12599.
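A sketch of the new API shape as I read the change (df and the column name are assumptions):

```scala
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.DoubleType

// define a UDF by supplying the return DataType explicitly,
// instead of relying on TypeTags as the typed udf() overloads do
val strLen = udf((s: String) => s.length.toDouble, DoubleType)
// usage (df assumed): df.select(strLen(col("name")))
```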
-
- Jan 02, 2016
-
-
Sean Owen authored
[SPARK-12481][CORE][STREAMING][SQL] Remove usage of Hadoop deprecated APIs and reflection that supported 1.x. Remove use of deprecated Hadoop APIs now that 2.2+ is required. Author: Sean Owen <sowen@cloudera.com> Closes #10446 from srowen/SPARK-12481.
-
hyukjinkwon authored
This PR follows https://github.com/apache/spark/pull/8391. The previous PR fixed JDBCRDD to support null-safe equality comparison for the JDBC data source. This PR fixes the problem that the comparison could actually return null, resulting in an error when the value of that comparison was used. Author: hyukjinkwon <gurwls223@gmail.com> Author: HyukjinKwon <gurwls223@gmail.com> Closes #8743 from HyukjinKwon/SPARK-10180.
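For reference, the null-safe equality semantics the pushed-down filter must preserve: `<=>` is two-valued and never returns null. A minimal sketch (`people` is an assumed DataFrame with a nullable column "name"):

```scala
import org.apache.spark.sql.functions.{col, lit}

// rows where name IS NULL; a plain === null comparison would yield null
// for every row, filtering out everything
val nullNames = people.filter(col("name") <=> lit(null))
```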
-
Herman van Hovell authored
This PR inlines the Hive SQL parser in Spark SQL. The previous (merged) incarnation of this PR passed all tests, but had and still has problems with the build. These problems are caused by the fact that - for some reason - in some cases the ANTLR-generated code is not included in the compilation phase. This PR is a WIP and should not be merged until we have sorted out the build issues. Author: Herman van Hovell <hvanhovell@questtec.nl> Author: Nong Li <nong@databricks.com> Author: Nong Li <nongli@gmail.com> Closes #10525 from hvanhovell/SPARK-12362.
-
- Jan 01, 2016
-
-
Reynold Xin authored
This reverts commit 0da7bd50.
-