Commits · b3ba1be3b77e42120145252b2730a56f1d55fd21 · cs525-sp18-g07 / spark

Jan 05, 2016

[SPARK-3873][TESTS] Import ordering fixes. · b3ba1be3
Marcelo Vanzin authored 9 years ago
```
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10582 from vanzin/SPARK-3873-tests.
```
b3ba1be3
[SPARK-3873][CORE] Import ordering fixes. · 7a375bb8
Marcelo Vanzin authored 9 years ago
```
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10578 from vanzin/SPARK-3873-core.
```
7a375bb8

[SPARK-12659] fix NPE in UnsafeExternalSorter (used by cartesian product) · 70fe6ce5

Davies Liu authored 9 years ago

Cartesian product uses UnsafeExternalSorter without a comparator to do spilling, so it throws an NPE if spilling happens.

This bug was also hit by #10605.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #10606 from davies/fix_spilling.

70fe6ce5

[SPARK-12504][SQL] Masking credentials in the sql plan explain output for JDBC data sources. · 0d42292f

sureshthalamati authored 9 years ago

This fix masks JDBC credentials in the explain output. The URL patterns used to specify credentials seem to vary between databases. Added a new method to the dialect to mask the credentials according to the database-specific URL pattern.

While adding tests, I noticed that the explain output includes the array variable for partitions ([Lorg.apache.spark.Partition;3ff74546,). Modified the code to include the first and last partition information.
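
As a rough illustration of the two changes described above, the sketch below masks credential values in a JDBC-style URL and summarizes a partition list by its first and last entries. It is not the code from this commit: the function names, the `***` placeholder, and the single regular expression are assumptions made for the example, and a real per-dialect hook would need a pattern for each database's URL layout.

```python
import re

# Hypothetical pattern: matches `user=...` and `password=...` parameters.
# Real databases lay out credentials differently, which is why the commit
# describes a per-dialect method; this one regex is only an illustration.
_CREDENTIAL_PARAM = re.compile(r"\b(user|password)=[^;&]*", re.IGNORECASE)


def mask_credentials(url: str) -> str:
    """Replace credential values in a JDBC-style URL with a fixed placeholder."""
    return _CREDENTIAL_PARAM.sub(lambda m: m.group(1) + "=***", url)


def describe_partitions(partitions: list) -> str:
    """Summarize a partition list by its first and last entries."""
    if not partitions:
        return "[]"
    if len(partitions) == 1:
        return f"[{partitions[0]}]"
    return f"[{partitions[0]}, ..., {partitions[-1]}]"
```

With these definitions, `mask_credentials("jdbc:postgresql://host/db?user=alice&password=secret")` keeps the parameter names but hides their values.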

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #10452 from sureshthalamati/mask_jdbc_credentials_spark-12504.

0d42292f

[SPARK-3873][SQL] Import ordering fixes. · df8bd975
Marcelo Vanzin authored 9 years ago
```
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10573 from vanzin/SPARK-3873-sql.
```
df8bd975

[SPARK-12041][ML][PYSPARK] Add columnSimilarities to IndexedRowMatrix · 1537e556

Kai Jiang authored 9 years ago

Add `columnSimilarities` to IndexedRowMatrix for PySpark spark.mllib.linalg.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #10158 from vectorijk/spark-12041.

1537e556

[SPARK-12453][STREAMING] Remove explicit dependency on aws-java-sdk · ff899755

BrianLondon authored 9 years ago

Successfully ran the Kinesis demo on a live, AWS-hosted Kinesis stream against the master and 1.6 branches. For reasons I don't entirely understand, it required a manual merge to 1.5, which I did as shown here: https://github.com/BrianLondon/spark/commit/075c22e89bc99d5e99be21f40e0d72154a1e23a2

The demo ran successfully on the 1.5 branch as well.

According to `mvn dependency:tree` it is still pulling a fairly old version of the aws-java-sdk (1.9.37), but this appears to have fixed the kinesis regression in 1.5.2.

Author: BrianLondon <brian@seatgeek.com>

Closes #10492 from BrianLondon/remove-only.

ff899755

[SPARK-12450][MLLIB] Un-persist broadcasted variables in KMeans · 78015a8b

RJ Nowling authored 9 years ago

SPARK-12450. Un-persist broadcasted variables in KMeans.

Author: RJ Nowling <rnowling@gmail.com>

Closes #10415 from rnowling/spark-12450.

78015a8b

[SPARK-12570][ML][DOC] DecisionTreeRegressor: provide variance of prediction: user guide update · 1c6cf1a5

Yanbo Liang authored 9 years ago

Update user guide doc for `DecisionTreeRegressor` providing variance of prediction.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10594 from yanboliang/spark-12570.

1c6cf1a5

[SPARK-12511] [PYSPARK] [STREAMING] Make sure PythonDStream.registerSerializer is called only once · 6cfe341e

Shixiong Zhu authored 9 years ago

There is an issue that Py4J's PythonProxyHandler.finalize blocks forever. (https://github.com/bartdag/py4j/pull/184)

Py4J will create a PythonProxyHandler in Java for "transformer_serializer" when calling "registerSerializer". If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one; the first one will then be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call "registerSerializer" more than once, so that the "PythonProxyHandler" on the Java side won't be GCed.
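
The "call it only once" rule above can be sketched as a small guard. This is a minimal illustration and not the code from this commit: `SerializerRegistry` and `register_once` are hypothetical names, and the callback stands in for the real registration call.

```python
import threading


class SerializerRegistry:
    """Hypothetical guard that lets a registration callback run at most once."""

    def __init__(self):
        self._lock = threading.Lock()
        self._registered = False

    def register_once(self, register_fn) -> bool:
        """Run register_fn on the first call only; return True if it ran."""
        with self._lock:
            if self._registered:
                # A second registration would replace the first proxy object,
                # which is the situation the commit message wants to avoid.
                return False
            register_fn()
            self._registered = True
            return True
```

The lock keeps the check and the registration atomic, so two threads racing to register still produce a single call.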

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10514 from zsxwing/SPARK-12511.

6cfe341e

[SPARK-12636] [SQL] Update UnsafeRowParquetRecordReader to support reading files directly. · c26d1742

Nong authored 9 years ago

As noted in the code, this change is to make this component easier to test in isolation.

Author: Nong <nongli@gmail.com>

Closes #10581 from nongli/spark-12636.

c26d1742

[SPARK-6724][MLLIB] Support model save/load for FPGrowthModel · 13a3b636

Yanbo Liang authored 9 years ago

Support model save/load for FPGrowthModel

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9267 from yanboliang/spark-6724.

13a3b636

[SPARK-12617] [PYSPARK] Clean up the leak sockets of Py4J · 047a31bb

Shixiong Zhu authored 9 years ago

This patch adds Py4jCallbackConnectionCleaner to clean up the leaked sockets of Py4J every 30 seconds. This is a workaround until Py4J fixes the leak issue: https://github.com/bartdag/py4j/issues/187

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10579 from zsxwing/SPARK-12617.

047a31bb

[SPARK-12439][SQL] Fix toCatalystArray and MapObjects · d202ad2f

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-12439

In toCatalystArray, we should look at the data type returned by dataTypeFor instead of silentSchemaFor to determine whether the element is a native type. An obvious problem arises when the element is of class Option[Int]: silentSchemaFor will return Int, and we will then wrongly treat the element as a native type.

There is another problem when using Option as an array element. When we encode data like Seq(Some(1), Some(2), None) with an encoder, we later use MapObjects to construct an array for it. But MapObjects doesn't check whether the return value of lambdaFunction is null. As a result, the decoded data for Seq(Some(1), Some(2), None) would be Seq(1, 2, -1) instead of Seq(1, 2, null).
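
The missing null check can be modeled in a few lines. The sketch below is an analogy in Python, not the Catalyst code: it assumes each mapped element comes back as a primitive value plus an is-null flag, with -1 as the placeholder stored in the primitive slot, and the helper names are hypothetical.

```python
from typing import List, Optional, Tuple

DEFAULT_INT = -1  # assumed placeholder held by a primitive slot when the value is null


def unwrap(opt: Optional[int]) -> Tuple[int, bool]:
    """Mimic a mapped element: return (primitive value, is_null flag)."""
    return (DEFAULT_INT, True) if opt is None else (opt, False)


def map_ignoring_null(items: List[Optional[int]]) -> List[int]:
    """Buggy variant: drops the is_null flag, so the placeholder leaks out."""
    return [unwrap(x)[0] for x in items]


def map_checking_null(items: List[Optional[int]]) -> List[Optional[int]]:
    """Fixed variant: consults the is_null flag before using the value."""
    result = []
    for x in items:
        value, is_null = unwrap(x)
        result.append(None if is_null else value)
    return result
```

For the input `[1, 2, None]`, the first variant yields `[1, 2, -1]` and the second yields `[1, 2, None]`, mirroring the two results quoted in the commit message.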

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10391 from viirya/fix-catalystarray.

d202ad2f

[SPARK-12615] Remove some deprecated APIs in RDD/SparkContext · 8ce645d4

Reynold Xin authored 9 years ago

I looked at each case individually, and it looks like they can all be removed. The only one that I had to think twice about was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray returning java.util.List).

Author: Reynold Xin <rxin@databricks.com>

Closes #10569 from rxin/SPARK-12615.

8ce645d4

[SPARK-12480][FOLLOW-UP] use a single column vararg for hash · 76768337

Wenchen Fan authored 9 years ago

address comments in #10435

This makes the API easier to use if users programmatically generate the call to hash, and they will get an analysis exception if the arguments of hash are empty.
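
To show why a single vararg helps programmatic callers, here is a minimal Python analogy. It is not Spark's API: `hash_columns` is a hypothetical name, SHA-256 is a stand-in chosen only to keep the example self-contained, and `ValueError` plays the role of the analysis exception.

```python
import hashlib


def hash_columns(*cols) -> int:
    """Hash a group of values passed as one vararg; reject an empty call."""
    if not cols:
        # Stand-in for the analysis exception mentioned in the commit message.
        raise ValueError("hash requires at least one argument")
    # SHA-256 is used here purely for illustration, not as the real algorithm.
    digest = hashlib.sha256(repr(cols).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big", signed=True)
```

A caller holding a list of names can write `hash_columns(*names)` without splitting off a first element, which is the convenience the single vararg buys.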

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10588 from cloud-fan/hash.

76768337

[SPARK-12643][BUILD] Set lib directory for antlr · 9a6ba7e2

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-12643

Without setting the lib directory for antlr, updates to imported grammar files cannot be detected, so SparkSqlParser.g will not be rebuilt automatically.

Since it is a minor update, no JIRA ticket is opened. Let me know if it is needed. Thanks.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10571 from viirya/antlr-build.

9a6ba7e2

[SPARK-12438][SQL] Add SQLUserDefinedType support for encoder · b3c48e39

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-12438

ScalaReflection lacks support for SQLUserDefinedType. We should add it.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10390 from viirya/encoder-udt.

b3c48e39

[SPARK-12331][ML] R^2 for regression through the origin. · 1cdc42d2

Imran Younus authored 9 years ago

Modified the definition of R^2 for regression through origin. Added modified test for regression metrics.
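
The commit message does not spell out the formula. The sketch below uses the usual textbook convention, which is an assumption here: for a model with no intercept, the total sum of squares is taken around zero instead of around the mean of the labels. The function name and the `through_origin` flag are hypothetical.

```python
def r_squared(labels, predictions, through_origin=False):
    """Compute R^2; with through_origin=True the total sum of squares is uncentered."""
    ss_err = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    if through_origin:
        # No intercept: compare against the zero model, so use sum(y^2).
        ss_tot = sum(y ** 2 for y in labels)
    else:
        # With intercept: compare against the mean model.
        mean = sum(labels) / len(labels)
        ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1.0 - ss_err / ss_tot
```

For labels `[1.0, 2.0, 3.0]` and constant predictions of `2.0`, the centered definition gives 0, while the through-origin definition gives 1 - 2/14, because the baseline it compares against is different.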

Author: Imran Younus <iyounus@us.ibm.com>
Author: Imran Younus <imranyounus@gmail.com>

Closes #10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.

1cdc42d2

[SPARK-12641] Remove unused code related to Hadoop 0.23 · 8eb2dc71

Kousuke Saruta authored 9 years ago

Currently we don't support Hadoop 0.23, but some code related to it remains, so let's clean it up.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10590 from sarutak/SPARK-12641.

8eb2dc71

[SPARK-12568][SQL] Add BINARY to Encoders · 53beddc5
Michael Armbrust authored 9 years ago
```
Author: Michael Armbrust <michael@databricks.com>

Closes #10516 from marmbrus/datasetCleanup.
```
53beddc5
[SPARK-3873][EXAMPLES] Import ordering fixes. · 7058dc11
Marcelo Vanzin authored 9 years ago
```
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10575 from vanzin/SPARK-3873-examples.
```
7058dc11

[SPARK-12625][SPARKR][SQL] replace R usage of Spark SQL deprecated API · cc4d5229

felixcheung authored 9 years ago

rxin davies shivaram
Took the save mode from my PR #10480 and moved everything to writer methods. This is related to PR #10559.

- [x] it seems jsonRDD() is broken, need to investigate - this is not a public API though; will look into some more tonight. (fixed)

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10584 from felixcheung/rremovedeprecated.

cc4d5229

Jan 04, 2016

[SPARK-12600][SQL] follow up: add range check for DecimalType · b634901b

Reynold Xin authored 9 years ago

This addresses davies' code review feedback in https://github.com/apache/spark/pull/10559

Author: Reynold Xin <rxin@databricks.com>

Closes #10586 from rxin/remove-deprecated-sql-followup.

b634901b

[SPARKR][DOC] minor doc update for version in migration guide · 8896ec9f

felixcheung authored 9 years ago

checked that the change is in Spark 1.6.0.
shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10574 from felixcheung/rwritemodedoc.

8896ec9f

[SPARK-12480][SQL] add Hash expression that can calculate hash value for a group of expressions · b1a77123

Wenchen Fan authored 9 years ago

just write the arguments into unsafe row and use murmur3 to calculate hash code

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10435 from cloud-fan/hash-expr.

b1a77123

[SPARK-12600][SQL] Remove deprecated methods in Spark SQL · 77ab49b8
Reynold Xin authored 9 years ago
```
Author: Reynold Xin <rxin@databricks.com>

Closes #10559 from rxin/remove-deprecated-sql.
```
77ab49b8

[SPARK-12509][SQL] Fixed error messages for DataFrame correlation and covariance · fdfac22d

Narine Kokhlikyan authored 9 years ago

Currently, when we call corr or cov on a DataFrame with invalid input, we see these error messages for both corr and cov:
   -  "Currently cov supports calculating the covariance between two columns"
   -  "Covariance calculation for columns with dataType "[DataType Name]" not supported."

I've fixed this issue by passing the function name as an argument. We could also do the input checks separately for each function. I avoided doing that because of code duplication.
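
A minimal sketch of sharing one input check between both functions while passing the function name in. The helper name, the set of numeric types, and the exact message wording are assumptions for illustration, not the text used in the commit.

```python
NUMERIC_TYPES = {"int", "long", "float", "double"}  # assumed type names for the example


def validate_stat_input(function_name: str, columns: list, column_types: dict) -> None:
    """Shared input check; function_name is interpolated into every message."""
    if len(columns) != 2:
        raise ValueError(
            f"Currently {function_name} calculation is supported between two columns."
        )
    for name in columns:
        dtype = column_types[name]
        if dtype not in NUMERIC_TYPES:
            raise ValueError(
                f"{function_name} calculation for columns with dataType {dtype} is not supported."
            )
```

Calling `validate_stat_input("corr", ...)` and `validate_stat_input("cov", ...)` reuses the same checks, so each error names the function the user actually called without duplicating the validation code.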

Thanks!

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #10458 from NarineK/sparksqlstatsmessages.

fdfac22d

[SPARK-12589][SQL] Fix UnsafeRowParquetRecordReader to properly set the row length. · 34de24ab

Nong Li authored 9 years ago

The reader previously did not set the row length, so the length was wrong if there were
variable-length columns. This problem does not usually manifest, since the value in the column is
correct and projecting the row fixes the issue.

Author: Nong Li <nong@databricks.com>

Closes #10576 from nongli/spark-12589.

34de24ab

[SPARK-12541] [SQL] support cube/rollup as function · d084a2de

Davies Liu authored 9 years ago

This PR enables cube/rollup as functions, so they can be used like this:
```
select a, b, sum(c) from t group by rollup(a, b)
```
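
For readers unfamiliar with rollup, the query above aggregates over the grouping sets (a, b), (a), and (). The sketch below reproduces that behavior over plain Python dictionaries. It is an illustration of the semantics and not Spark code: the function name is hypothetical, and `None` marks a rolled-up key, which would be ambiguous if the data itself contained nulls.

```python
from collections import defaultdict


def rollup_sum(rows, keys, value):
    """Sum `value` over every prefix of `keys`: (a, b), then (a), then ()."""
    totals = defaultdict(int)
    for row in rows:
        for n in range(len(keys), -1, -1):
            # Keep the first n keys and mark the remaining ones as rolled up.
            group = tuple(row[k] for k in keys[:n]) + (None,) * (len(keys) - n)
            totals[group] += row[value]
    return dict(totals)
```

Each input row contributes to one group per prefix length, so with two keys every row is counted in three groups, the last being the grand total.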

Author: Davies Liu <davies@databricks.com>

Closes #10522 from davies/rollup.

d084a2de

[SPARK-9622][ML] DecisionTreeRegressor: provide variance of prediction · 93ef9b6a

Yanbo Liang authored 9 years ago

DecisionTreeRegressor will provide variance of prediction as a Double column.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8866 from yanboliang/spark-9622.

93ef9b6a

[SPARK-11259][ML] Params.validateParams() should be called automatically · ba5f8185

Yanbo Liang authored 9 years ago

See JIRA: https://issues.apache.org/jira/browse/SPARK-11259

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9224 from yanboliang/spark-11259.

ba5f8185

[SPARK-12421][SQL] Prevent Internal/External row from exposing state. · 0171b71e

Herman van Hovell authored 9 years ago

It is currently possible to change the values of the supposedly immutable `GenericRow` and `GenericInternalRow` classes. This is caused by the fact that Scala's ArrayOps `toArray` (returned by calling `toSeq`) will return the backing array instead of a copy. This PR fixes this problem.
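
The same kind of leak is easy to reproduce in any language with mutable arrays. The Python analogy below uses hypothetical class names and is not the Spark code: one row class hands out its backing list, and the other hands out a copy.

```python
class LeakyRow:
    """Exposes its backing list, so callers can mutate the 'immutable' row."""

    def __init__(self, values):
        self._values = list(values)

    def get(self, i):
        return self._values[i]

    def to_seq(self):
        return self._values  # same object as the internal state


class SafeRow(LeakyRow):
    """Returns a defensive copy, so outside changes never reach the row."""

    def to_seq(self):
        return list(self._values)
```

Writing `row.to_seq()[0] = 99` changes a `LeakyRow` but leaves a `SafeRow` untouched, at the cost of one copy per call.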

This PR was inspired by https://github.com/apache/spark/pull/10374 by apo1.

cc apo1 sarutak marmbrus cloud-fan nongli (everyone in the previous conversation).

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10553 from hvanhovell/SPARK-12421.

0171b71e

[DOC] Adjust coverage for partitionBy() · 40d03960

tedyu authored 9 years ago

This is the related thread: http://search-hadoop.com/m/q3RTtO3ReeJ1iF02&subj=Re+partitioning+json+data+in+spark

Michael suggested fixing the doc.

Please review.

Author: tedyu <yuzhihong@gmail.com>

Closes #10499 from ted-yu/master.

40d03960

[SPARK-12512][SQL] support column name with dot in withColumn() · 573ac55d
Xiu Guo authored 9 years ago
```
Author: Xiu Guo <xguo27@gmail.com>

Closes #10500 from xguo27/SPARK-12512.
```
573ac55d

[SPARK-12608][STREAMING] Remove submitJobThreadPool since submitJob doesn't... · 43706bf8

Shixiong Zhu authored 9 years ago

[SPARK-12608][STREAMING] Remove submitJobThreadPool since submitJob doesn't create a separate thread to wait for the job result

Before #9264, submitJob would create a separate thread to wait for the job result. `submitJobThreadPool` was a workaround in `ReceiverTracker` to run these waiting-job-result threads. Now that #9264 has been merged to master and has resolved this blocking issue, `submitJobThreadPool` can be removed.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10560 from zsxwing/remove-submitJobThreadPool.

43706bf8

[SPARK-12470] [SQL] Fix size reduction calculation · b504b6a9

Pete Robbins authored 9 years ago

also only allocate required buffer size

Author: Pete Robbins <robbinspg@gmail.com>

Closes #10421 from robbinspg/master.

b504b6a9

[SPARK-12579][SQL] Force user-specified JDBC driver to take precedence · 6c83d938

Josh Rosen authored 9 years ago

Spark SQL's JDBC data source allows users to specify an explicit JDBC driver to load (using the `driver` argument), but in the current code it's possible that the user-specified driver will not be used when it comes time to actually create a JDBC connection.

In a nutshell, the problem is that you might have multiple JDBC drivers on the classpath that claim to be able to handle the same subprotocol, so simply registering the user-provided driver class with our `DriverRegistry` and JDBC's `DriverManager` is not sufficient to ensure that it's actually used when creating the JDBC connection.

This patch addresses this issue by first registering the user-specified driver with the DriverManager, then iterating over the driver manager's loaded drivers in order to obtain the correct driver and use it to create a connection (previously, we just called `DriverManager.getConnection()` directly).

If a user did not specify a JDBC driver to use, then we call `DriverManager.getDriver` to figure out the class of the driver to use, then pass that class's name to executors; this guards against corner-case bugs in situations where the driver and executor JVMs might have different sets of JDBC drivers on their classpaths (previously, there was the (rare) potential for `DriverManager.getConnection()` to use different drivers on the driver and executors if the user had not explicitly specified a JDBC driver class and the classpaths were different).
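
The selection logic described in the last two paragraphs can be sketched without any JDBC machinery. Everything below is hypothetical: the two driver classes, the `accepts_url` method, and `select_driver` are stand-ins that model a list of registered drivers, and the code is an analogy in Python, not the patch itself.

```python
class VendorADriver:
    """Hypothetical driver that claims every jdbc:demo: URL."""

    def accepts_url(self, url: str) -> bool:
        return url.startswith("jdbc:demo:")


class VendorBDriver(VendorADriver):
    """A second hypothetical driver that claims the same subprotocol."""


def select_driver(registered, url, requested_class=None):
    """Pick the driver by class name if one was requested, else the first that accepts the URL."""
    if requested_class is not None:
        for driver in registered:
            if type(driver).__name__ == requested_class:
                return driver
        raise LookupError(f"requested driver {requested_class} is not registered")
    for driver in registered:
        if driver.accepts_url(url):
            return driver
    raise LookupError(f"no registered driver accepts {url}")
```

With both drivers registered, an unqualified lookup returns whichever comes first, while naming `VendorBDriver` explicitly always returns that one. Resolving the class name once and passing it along is what keeps different JVMs from picking different drivers.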

This patch is inspired by a similar patch that I made to the `spark-redshift` library (https://github.com/databricks/spark-redshift/pull/143), which contains its own modified fork of some of Spark's JDBC data source code (for cross-Spark-version compatibility reasons).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10519 from JoshRosen/jdbc-driver-precedence.

6c83d938

[SPARK-12486] Worker should kill the executors more forcefully if possible. · 8f659393

Nong Li authored 9 years ago

This patch updates the ExecutorRunner's terminate path to use the new Java 8 API
to terminate processes more forcefully if possible. If the executor is unhealthy,
it would previously ignore the destroy() call. Presumably, the new Java API was
added to handle cases like this.

We could update the termination path in the future to use OS-specific commands
for older Java versions.
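
The "ask politely, then force" pattern described above looks like this with Python's `subprocess` module. This is an analogy and not the ExecutorRunner code: the function name and the grace period are assumptions made for the example.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Request termination, then kill the process if it has not exited in time."""
    proc.terminate()  # polite request; an unhealthy process may ignore it
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()  # forceful termination that the process cannot ignore
        return proc.wait()
```

A caller might start a child with `subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])` and pass it to `stop_process`; the grace period bounds how long a stuck child can delay shutdown.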

Author: Nong Li <nong@databricks.com>

Closes #10438 from nongli/spark-12486-executors.

8f659393

[SPARK-12513][STREAMING] SocketReceiver hang in Netcat example · 962aac4d

guoxu1231 authored 9 years ago

Explicitly close the client-side socket connection before restarting the socket receiver.

Author: guoxu1231 <guoxu1231@gmail.com>
Author: Shawn Guo <guoxu1231@gmail.com>

Closes #10464 from guoxu1231/SPARK-12513.

962aac4d