- Dec 22, 2015
-
-
Josh Rosen authored
We should update to the latest version of Zinc in order to match our SBT version. Author: Josh Rosen <joshrosen@databricks.com> Closes #10426 from JoshRosen/update-zinc.
-
hyukjinkwon authored
[SPARK-11677][SQL][FOLLOW-UP] Add tests for checking the ORC filter creation against pushed down filters. https://issues.apache.org/jira/browse/SPARK-11677 Although the filters are checked indirectly by the number of results when ORC filter push-down is enabled, the filters themselves are not being tested. So, this PR adds tests similar to `ParquetFilterSuite`. One difference from `ParquetFilterSuite` is that this `OrcFilterSuite` only checks that the appropriate filters are created; it does not check the results, because those are already covered by `OrcQuerySuite`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #10341 from HyukjinKwon/SPARK-11677-followup.
-
Cheng Lian authored
This PR adds a new expression `AssertNotNull` to ensure non-nullable fields of products and case classes don't receive null values at runtime. Author: Cheng Lian <lian@databricks.com> Closes #10331 from liancheng/dataset-nullability-check.
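A minimal sketch of what this check guards against, assuming a Spark 1.6-style `sc`/`sqlContext` in scope; the `Point` case class and column names are hypothetical:
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
import sqlContext.implicits._

// Hypothetical case class: its Int fields map to non-nullable Dataset fields.
case class Point(x: Int, y: Int)

// A DataFrame whose second row has a null in a column bound to a non-nullable field.
val schema = StructType(Seq(
  StructField("x", IntegerType, nullable = true),
  StructField("y", IntegerType, nullable = true)))
val df = sqlContext.createDataFrame(
  sc.parallelize(Seq(Row(1, 2), Row(3, null))), schema)

// With AssertNotNull in place, materializing the Dataset should fail at runtime
// with a descriptive error instead of silently turning the null into a default value.
val ds = df.as[Point]
ds.collect()
```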
-
Takeshi YAMAMURO authored
There were no tests for JDBCRDD#compileFilter, so this adds them. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10409 from maropu/AddTestsInJdbcRdd.
-
Holden Karau authored
Some methods are missing from the Python API, such as ways to access the std, mean, etc. This PR adds them for feature parity in pyspark.mllib.feature.StandardScaler & StandardScalerModel. Author: Holden Karau <holden@us.ibm.com> Closes #10298 from holdenk/SPARK-12296-feature-parity-pyspark-mllib-StandardScalerModel.
-
Josh Rosen authored
This patch fixes a flaky "test jdbc cancel" test in HiveThriftBinaryServerSuite. This test is prone to a race condition which causes it to block indefinitely while waiting for an extremely slow query to complete, which caused many Jenkins builds to time out. For more background, see my comments on #6207 (the PR which introduced this test). Author: Josh Rosen <joshrosen@databricks.com> Closes #10425 from JoshRosen/SPARK-11823.
-
Shixiong Zhu authored
Author: Shixiong Zhu <shixiong@databricks.com> Closes #10424 from zsxwing/typo.
-
Reynold Xin authored
i.e. Hadoop 1 and Hadoop 2.0 Author: Reynold Xin <rxin@databricks.com> Closes #10404 from rxin/SPARK-11807.
-
- Dec 21, 2015
-
-
Davies Liu authored
According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy. After changing the compressor to LZ4, I saw a 20% improvement in end-to-end time for a TPCDS query (Q4). [1] https://github.com/ning/jvm-compressor-benchmark/wiki cc rxin Author: Davies Liu <davies@databricks.com> Closes #10342 from davies/lz4.
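As a hedged illustration (not part of the patch itself), the codec in question is controlled by the `spark.io.compression.codec` setting, so a job can opt back into the previous codec explicitly:
```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.io.compression.codec controls the codec used for internal data such as
// shuffle outputs and broadcast blocks. With this change the default becomes
// "lz4"; setting it explicitly keeps the old behaviour if that is preferred.
val conf = new SparkConf()
  .setAppName("codec-example")
  .set("spark.io.compression.codec", "snappy")
val sc = new SparkContext(conf)
```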
-
Andrew Or authored
```
[info] ReplayListenerSuite:
[info] - Simple replay (58 milliseconds)
java.lang.NullPointerException
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but doesn't actually fail the tests). Tested locally to verify that the NPE is gone. Author: Andrew Or <andrew@databricks.com> Closes #10417 from andrewor14/fix-harmless-npe.
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #10394 from rxin/SPARK-2331.
-
Alex Bozarth authored
Updates made in SPARK-11206 missed an edge case which causes a NullPointerException when a task is killed. In some cases when a task ends in failure, taskMetrics is initialized as null (see JobProgressListener.onTaskEnd()). To address this, a null check was added. Before the changes in SPARK-11206 this null check was called at the start of the updateTaskAccumulatorValues() function. Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #10405 from ajbozarth/spark12339.
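For illustration only, a minimal sketch of the shape of the guard described above; the signature here is simplified and is not the actual listener code:
```scala
import org.apache.spark.executor.TaskMetrics

// Simplified sketch: bail out early when a failed/killed task reports no metrics,
// mirroring the null check described in the commit message.
def updateTaskAccumulatorValues(taskMetrics: TaskMetrics): Unit = {
  if (taskMetrics == null) {
    return
  }
  // ... update the accumulator values from taskMetrics ...
}
```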
-
pshearer authored
Author: pshearer <pshearer@massmutual.com> Closes #10414 from pshearer/patch-1.
-
Takeshi YAMAMURO authored
Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #4674 from maropu/AddGraphLoaderSuite.
-
Takeshi YAMAMURO authored
[SPARK-12392][CORE] Optimize the location order of broadcast blocks by considering preferred local hosts When multiple workers exist on a host, we can bypass unnecessary remote access for broadcasts; block managers fetch broadcast blocks from the same host instead of remote hosts. Author: Takeshi YAMAMURO <linguin.m.s@gmail.com> Closes #10346 from maropu/OptimizeBlockLocationOrder.
-
gatorsmile authored
Based on the suggestions from marmbrus, this adds logical/physical operators for Range to improve performance. It also adds another API to resolve SPARK-12150. Could you take a look at my implementation, marmbrus? If not good, I can rework it. : ) Thank you very much! Author: gatorsmile <gatorsmile@gmail.com> Closes #10335 from gatorsmile/rangeOperators.
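As a hedged usage sketch (assuming a `sqlContext` is in scope), the public entry point that benefits from the new operators is `SQLContext.range`:
```scala
// range produces a DataFrame with a single "id" column; with this change it is
// planned with dedicated Range operators instead of going through a plain RDD.
val ids = sqlContext.range(0, 1000)
ids.select(ids("id") * 2).show()
```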
-
Wenchen Fan authored
An alternative solution for https://github.com/apache/spark/pull/10295. Instead of implementing a JSON format for all logical/physical plans and expressions, use reflection to implement it in `TreeNode`. Here I use a pre-order traversal to flatten the plan tree into a plan list, and add an extra field `num-children` to each plan node, so that we can reconstruct the tree from the list. Example JSON for a logical plan tree:
```
[ { "class" : "org.apache.spark.sql.catalyst.plans.logical.Sort",
    "num-children" : 1,
    "order" : [ [
      { "class" : "org.apache.spark.sql.catalyst.expressions.SortOrder",
        "num-children" : 1, "child" : 0, "direction" : "Ascending" },
      { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
        "num-children" : 0, "name" : "i", "dataType" : "integer", "nullable" : true,
        "metadata" : { },
        "exprId" : { "id" : 10, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
        "qualifiers" : [ ] } ] ],
    "global" : false,
    "child" : 0 },
  { "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
    "num-children" : 1,
    "projectList" : [ [
      { "class" : "org.apache.spark.sql.catalyst.expressions.Alias",
        "num-children" : 1, "child" : 0, "name" : "i",
        "exprId" : { "id" : 10, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
        "qualifiers" : [ ] },
      { "class" : "org.apache.spark.sql.catalyst.expressions.Add",
        "num-children" : 2, "left" : 0, "right" : 1 },
      { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
        "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true,
        "metadata" : { },
        "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
        "qualifiers" : [ ] },
      { "class" : "org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children" : 0, "value" : "1", "dataType" : "integer" } ], [
      { "class" : "org.apache.spark.sql.catalyst.expressions.Alias",
        "num-children" : 1, "child" : 0, "name" : "j",
        "exprId" : { "id" : 11, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
        "qualifiers" : [ ] },
      { "class" : "org.apache.spark.sql.catalyst.expressions.Multiply",
        "num-children" : 2, "left" : 0, "right" : 1 },
      { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
        "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true,
        "metadata" : { },
        "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
        "qualifiers" : [ ] },
      { "class" : "org.apache.spark.sql.catalyst.expressions.Literal",
        "num-children" : 0, "value" : "2", "dataType" : "integer" } ] ],
    "child" : 0 },
  { "class" : "org.apache.spark.sql.catalyst.plans.logical.LocalRelation",
    "num-children" : 0,
    "output" : [ [
      { "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
        "num-children" : 0, "name" : "a", "dataType" : "integer", "nullable" : true,
        "metadata" : { },
        "exprId" : { "id" : 0, "jvmId" : "cd1313c7-3f66-4ed7-a320-7d91e4633ac6" },
        "qualifiers" : [ ] } ] ],
    "data" : [ ] } ]
```
Author: Wenchen Fan <wenchen@databricks.com> Closes #10311 from cloud-fan/toJson-reflection.
-
Dilip Biswal authored
When a DataFrame or Dataset has a long schema, we should intelligently truncate it to avoid flooding the screen with unreadable information.

// Standard output
[a: int, b: int]

// Truncate many top level fields
[a: int, b: string ... 10 more fields]

// Truncate long inner structs
[a: struct<a: Int ... 10 more fields>]

Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #10373 from dilipbiswal/spark-12398.
-
Jeff Zhang authored
No JIRA was created since this is a trivial change. davies Please help review it. Author: Jeff Zhang <zjffdu@apache.org> Closes #10143 from zjffdu/pyspark_typo.
-
Sean Owen authored
Only load explainedVariance in PCAModel if it was written with Spark > 1.6.x. jkbradley is this kind of what you had in mind? Author: Sean Owen <sowen@cloudera.com> Closes #10327 from srowen/SPARK-12349.
-
- Dec 20, 2015
-
-
Bryan Cutler authored
Added a catch for the Long-to-Int cast exception that occurs when PySpark ALS Ratings are serialized. It is easy to accidentally use Long IDs for user/product, and before, it would fail with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now if this is done, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647." Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #10395 from rxin/SPARK-11808.
-
- Dec 19, 2015
-
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #10387 from rxin/version-bump.
-
Yanbo Liang authored
Fix mistaken documentation of the join type for `dataframe.join`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10378 from yanboliang/leftsemi.
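For context, a hedged sketch of the join type this documentation covers; `people` and `orders` are hypothetical DataFrames:
```scala
// A left semi join keeps only the rows of the left DataFrame that have a match
// on the right; no columns from the right side appear in the result.
val matched = people.join(orders, people("id") === orders("person_id"), "leftsemi")
```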
-
- Dec 18, 2015
-
-
gatorsmile authored
The current default storage level of the Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official documentation and the RDD APIs. davies Is this inconsistency intentional? Thanks! Updates: Since the data is always serialized on the Python side, the Java-specific deserialized storage levels, such as MEMORY_ONLY, are not removed. Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`. Author: gatorsmile <gatorsmile@gmail.com> Closes #10092 from gatorsmile/persistStorageLevel.
-
Luc Bourlier authored
The SPARK_HOME set on the submitting machine is usually an invalid location on the remote machine executing the job. It is picked up by the Mesos support in cluster mode, and most of the time causes the job to fail. Fixes SPARK-12345 Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #10329 from skyluc/issue/SPARK_HOME.
-
Shixiong Zhu authored
Added `channelActive` to `RpcHandler` so that `NettyRpcHandler` doesn't need `clients` any more. Author: Shixiong Zhu <shixiong@databricks.com> Closes #10301 from zsxwing/network-events.
-
Nong Li authored
Previously, the rpc timeout was the default network timeout, which is the same value the driver uses to determine dead executors. This means if there is a network issue, the executor is determined dead after one heartbeat attempt. There is a separate config for the heartbeat interval which is a better value to use for the heartbeat RPC. With this change, the executor will make multiple heartbeat attempts even with RPC issues. Author: Nong Li <nong@databricks.com> Closes #10365 from nongli/spark-12411.
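As a hedged configuration sketch (the values shown are illustrative), the two settings involved are real Spark configuration keys:
```scala
import org.apache.spark.SparkConf

// After this change the heartbeat RPC uses the heartbeat interval as its timeout,
// rather than the much larger network timeout, so one slow or lost heartbeat RPC
// no longer makes an executor look dead; the driver still uses the network timeout
// to decide when an executor is actually gone.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s")
  .set("spark.network.timeout", "120s")
```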
-
Grace authored
In discussion (SPARK-9552), we proposed a force kill in `killExecutors`. But if there is nothing to kill, it returns true (acknowledgement), and the executor(s) that are not eligible to be killed are then added to the pendingToRemove list for further actions. In this patch, we'd like to change the return semantics: if there is nothing to kill, we return "false", and therefore those non-eligible executors won't be added to the pendingToRemove list. vanzin andrewor14 As the follow up of PR#7888, please let me know your comments. Author: Grace <jie.huang@intel.com> Author: Jie Huang <hjie@fosun.com> Author: Andrew Or <andrew@databricks.com> Closes #9796 from GraceH/emptyPendingToRemove.
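A hedged caller-side sketch of the new semantics; the executor IDs are placeholders:
```scala
// killExecutors returns a Boolean acknowledgement. With this change, a request
// in which nothing is actually eligible to be killed returns false instead of
// true, so callers do not record bogus pending removals.
val acknowledged: Boolean = sc.killExecutors(Seq("1", "2"))
if (!acknowledged) {
  println("No executors were scheduled for removal")
}
```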
-
Burak Yavuz authored
- Provide example on `message handler`
- Provide bit on KPL record de-aggregation
- Fix typos

Author: Burak Yavuz <brkyvz@gmail.com> Closes #9970 from brkyvz/kinesis-docs.
-
Kousuke Saruta authored
Now `StaticInvoke` receives `Any` as an object, and `StaticInvoke` itself can be serialized, but sometimes the object passed is not serializable. For example, the following code raises an exception because `RowEncoder#extractorsFor`, invoked indirectly, creates a `StaticInvoke`.
```
case class TimestampContainer(timestamp: java.sql.Timestamp)

val rdd = sc.parallelize(1 to 2).map(_ => TimestampContainer(new java.sql.Timestamp(System.currentTimeMillis)))
val df = rdd.toDF
val ds = df.as[TimestampContainer]
val rdd2 = ds.rdd    <----------------- invokes extractorsFor indirectly
```
I'll add test cases. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Author: Michael Armbrust <michael@databricks.com> Closes #10357 from sarutak/SPARK-12404.
-
Yin Huai authored
JIRA: https://issues.apache.org/jira/browse/SPARK-12218 When creating filters for Parquet/ORC, we should not push nested AND expressions partially. Author: Yin Huai <yhuai@databricks.com> Closes #10362 from yhuai/SPARK-12218.
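A small illustration, using the public data-source filter API, of why partially converting a nested AND is unsafe; the attribute names and values here are made up:
```scala
import org.apache.spark.sql.sources.{And, EqualTo, Filter, Not}

// Suppose only the first leg of the inner AND can be converted into a
// Parquet/ORC predicate. Pushing just `a = 2` in place of `a = 2 AND b = '3'`
// is fine when the AND is at the top level, but inside a NOT it is not:
// NOT(a = 2) is stricter than NOT(a = 2 AND b = '3') and would wrongly drop
// rows where a = 2 but b <> '3'.
val original: Filter = Not(And(EqualTo("a", 2), EqualTo("b", "3")))
val unsafePartialPushdown: Filter = Not(EqualTo("a", 2))
```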
-
Davies Liu authored
This could simplify the generated code for expressions that are not nullable. This PR fixes lots of bugs related to nullability. Author: Davies Liu <davies@databricks.com> Closes #10333 from davies/skip_nullable.
-
Dilip Biswal authored
Description of the problem from cloud-fan: Actually this line: https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689 When we use `selectExpr`, we pass an `UnresolvedFunction` to `DataFrame.select` and fall into the last case. A workaround is to do special handling for UDTFs like we did for `explode` (and `json_tuple` in 1.6) and wrap them with `MultiAlias`. Another workaround is using `expr`, for example, `df.select(expr("explode(a)").as(Nil))`. I think `selectExpr` is no longer needed after we have the `expr` function.... Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #9981 from dilipbiswal/spark-11619.
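A self-contained sketch of the workaround quoted above, assuming a `sqlContext` is in scope; the sample data is hypothetical:
```scala
import org.apache.spark.sql.functions.expr

// Hypothetical DataFrame with an array column "a".
val df = sqlContext.createDataFrame(Seq((1, Seq(1, 2, 3)))).toDF("id", "a")

// df.selectExpr("explode(a)") is the call that fails before this fix.
// Wrapping the generator with expr and an empty alias list works around it.
val exploded = df.select(expr("explode(a)").as(Nil))
exploded.show()
```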
-
Marcelo Vanzin authored
If a client requests a non-existent stream, just send a failure message back, without logging any error on the server side (since it's not a server error). On the executor side, avoid error logs by translating any errors during transfer to a `ClassNotFoundException`, so that loading the class is retried on the parent class loader. This can mask IO errors during transmission, but the most common cause is that the class is not served by the remote end. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10337 from vanzin/SPARK-12350.
-