Commits · 284e29a870bbb62f59988a5d88cd12f1b0b6f9d3 · cs525-sp18-g07 / spark

Dec 20, 2015

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #10395 from rxin/SPARK-11808.

284e29a8

Dec 19, 2015
- HOTFIX for the previous hot fix. · 0c4d6ad8
  Reynold Xin authored 9 years ago
  
  0c4d6ad8
- HOTFIX: Disable Java style test. · 6ad31e79
  Reynold Xin authored 9 years ago
  
  6ad31e79
- Bump master version to 2.0.0-SNAPSHOT. · f496031b
  Reynold Xin authored 9 years ago
  
  Author: Reynold Xin <rxin@databricks.com> Closes #10387 from rxin/version-bump.
  f496031b
- [SQL] Fix mistake doc of join type for dataframe.join · a073a73a
  Yanbo Liang authored 9 years ago
  
  Fix mistake doc of join type for ```dataframe.join```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10378 from yanboliang/leftsemi.
  a073a73a
Dec 18, 2015

[SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels · 499ac3e6

gatorsmile authored 9 years ago

The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.

davies Is this inconsistency intentional? Thanks!

Updates: Since the data is always serialized on the Python side, the storage levels of JAVA-specific deserialization are not removed, such as MEMORY_ONLY.

Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
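A minimal sketch of why the serialized/deserialized distinction does not matter on the Python side, using only the standard `pickle` module. It does not call Spark; the `records` data is made up for illustration.

```python
import pickle

# Hypothetical partition contents; in PySpark these would be RDD elements.
records = [("a", 1), ("b", 2), ("c", 3)]

# Python objects are pickled before they are stored, so the stored form
# is a byte string regardless of the storage level that was requested.
payload = pickle.dumps(records, protocol=2)

# Reading the data back unpickles it into equivalent Python objects.
restored = pickle.loads(payload)

print(type(payload).__name__, restored == records)
```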

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10092 from gatorsmile/persistStorageLevel.

499ac3e6

Revert "[SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode." · a78a91f4

Andrew Or authored 9 years ago

This reverts commit ad8c1f0b.

a78a91f4

Revert "[SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server" · 8a9417bc

Andrew Or authored 9 years ago

This reverts commit 81845688.

8a9417bc

Revert "[SPARK-12413] Fix Mesos ZK persistence" · 14be5dec

Andrew Or authored 9 years ago

This reverts commit 2bebaa39.

14be5dec

[SPARK-12345][CORE] Do not send SPARK_HOME through Spark submit REST interface · ba9332ed

Luc Bourlier authored 9 years ago

It is usually an invalid location on the remote machine executing the job.
It is picked up by the Mesos support in cluster mode, and most of the time causes
the job to fail.

Fixes SPARK-12345

Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #10329 from skyluc/issue/SPARK_HOME.

ba9332ed

[SPARK-11097][CORE] Add channelActive callback to RpcHandler to monitor the new connections · 007a32f9

Shixiong Zhu authored 9 years ago

Added `channelActive` to `RpcHandler` so that `NettyRpcHandler` doesn't need `clients` any more.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10301 from zsxwing/network-events.

007a32f9

[SPARK-12411][CORE] Decrease executor heartbeat timeout to match heartbeat interval · 0514e8d4

Nong Li authored 9 years ago

Previously, the rpc timeout was the default network timeout, which is the same value
the driver uses to determine dead executors. This means if there is a network issue,
the executor is determined dead after one heartbeat attempt. There is a separate config
for the heartbeat interval which is a better value to use for the heartbeat RPC. With
this change, the executor will make multiple heartbeat attempts even with RPC issues.
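A back-of-the-envelope sketch of the effect, in plain Python. The 120 s and 10 s values are assumptions chosen for illustration, and the model ignores the wait between heartbeats and any retry logic.

```python
def heartbeat_attempts(dead_executor_timeout_s, heartbeat_rpc_timeout_s):
    """Number of back-to-back heartbeat RPCs that can time out before
    the driver's dead-executor timeout elapses (simplified model)."""
    return int(dead_executor_timeout_s // heartbeat_rpc_timeout_s)

# Assumed values: 120 s network timeout, 10 s heartbeat interval.
before = heartbeat_attempts(120, 120)  # RPC timeout equals the network timeout
after = heartbeat_attempts(120, 10)    # RPC timeout equals the heartbeat interval

print(before, after)
```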

Author: Nong Li <nong@databricks.com>

Closes #10365 from nongli/spark-12411.

0514e8d4

[SPARK-9552] Return "false" while nothing to kill in killExecutors · 60da0e11

Grace authored 9 years ago

In the discussion on SPARK-9552, we proposed a force kill in `killExecutors`. But if there is nothing to kill, it still returns true (an acknowledgement). This causes executors that are not eligible to be killed to be added to the pendingToRemove list for further action.

In this patch, we change the return semantics: if there is nothing to kill, we return "false", so those non-eligible executors are not added to the pendingToRemove list.
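The changed return semantics can be sketched as a toy model in plain Python. The function names, the `eligible` set and the `pending_to_remove` set are hypothetical stand-ins, not Spark's actual classes.

```python
def kill_executors(requested, eligible):
    """Toy backend: acknowledge only when at least one requested
    executor can actually be killed."""
    to_kill = [e for e in requested if e in eligible]
    return len(to_kill) > 0

def remove_executor(executor_id, eligible, pending_to_remove):
    """Toy caller: track the executor as pending removal only after
    a positive acknowledgement."""
    if kill_executors([executor_id], eligible):
        pending_to_remove.add(executor_id)
        return True
    return False

pending = set()
busy_result = remove_executor("busy-1", {"idle-1"}, pending)  # not eligible
idle_result = remove_executor("idle-1", {"idle-1"}, pending)  # eligible
print(busy_result, idle_result, pending)
```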

vanzin andrewor14 As the follow up of PR#7888, please let me know your comments.

Author: Grace <jie.huang@intel.com>
Author: Jie Huang <hjie@fosun.com>
Author: Andrew Or <andrew@databricks.com>

Closes #9796 from GraceH/emptyPendingToRemove.

60da0e11

[SPARK-11985][STREAMING][KINESIS][DOCS] Update Kinesis docs · 2377b707

Burak Yavuz authored 9 years ago

 - Provide example on `message handler`
 - Provide bit on KPL record de-aggregation
 - Fix typos

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #9970 from brkyvz/kinesis-docs.

2377b707

[SPARK-12404][SQL] Ensure objects passed to StaticInvoke is Serializable · 6eba6552

Kousuke Saruta authored 9 years ago

Now `StaticInvoke` receives `Any` as an object, and `StaticInvoke` can be serialized, but sometimes the object passed is not serializable.

For example, the following code raises an exception because `RowEncoder#extractorsFor`, which is invoked indirectly, creates a `StaticInvoke`.

```
case class TimestampContainer(timestamp: java.sql.Timestamp)
// Wrap the epoch millis in a Timestamp so the value matches the field type.
val rdd = sc.parallelize(1 to 2).map(_ => TimestampContainer(new java.sql.Timestamp(System.currentTimeMillis)))
val df = rdd.toDF
val ds = df.as[TimestampContainer]
val rdd2 = ds.rdd  // invokes extractorsFor indirectly
```

I'll add test cases.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Michael Armbrust <michael@databricks.com>

Closes #10357 from sarutak/SPARK-12404.

6eba6552

[SPARK-12218][SQL] Invalid splitting of nested AND expressions in Data Source filter API · 41ee7c57

Yin Huai authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-12218

When creating filters for Parquet/ORC, we should not push nested AND expressions partially.
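One way partial pushdown can go wrong is when the AND sits under a NOT. The sketch below is plain Python over made-up rows, not the Data Source filter API; it assumes the `b = 1` side cannot be converted and is dropped.

```python
rows = [{"a": 1, "b": 1}, {"a": 1, "b": 2}, {"a": 2, "b": 1}]

def full_predicate(r):
    # NOT (a = 1 AND b = 1)
    return not (r["a"] == 1 and r["b"] == 1)

def partially_pushed(r):
    # Unsafe: only the `a = 1` side was converted, leaving NOT (a = 1).
    return not (r["a"] == 1)

expected = [r for r in rows if full_predicate(r)]

# The source applies the pushed filter first. Even if the engine then
# re-applies the full predicate, rows already dropped cannot come back.
after_pushdown = [r for r in rows if partially_pushed(r)]
result = [r for r in after_pushdown if full_predicate(r)]

print(len(expected), len(result))
```

The row `{"a": 1, "b": 2}` satisfies the full predicate but is lost once only half of the AND is pushed down.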

Author: Yin Huai <yhuai@databricks.com>

Closes #10362 from yhuai/SPARK-12218.

41ee7c57

[SPARK-12054] [SQL] Consider nullability of expression in codegen · 4af647c7

Davies Liu authored 9 years ago

This could simplify the generated code for expressions that are not nullable.

This PR also fixes many bugs related to nullability.

Author: Davies Liu <davies@databricks.com>

Closes #10333 from davies/skip_nullable.

4af647c7

[SPARK-11619][SQL] cannot use UDTF in DataFrame.selectExpr · ee444fe4

Dilip Biswal authored 9 years ago

Description of the problem from cloud-fan

Actually this line: https://github.com/apache/spark/blob/branch-1.5/sql/core/src/main/scala/org/apache/spark/sql/DataFrame.scala#L689
When we use `selectExpr`, we pass in `UnresolvedFunction` to `DataFrame.select` and fall into the last case. A workaround is to do special handling for UDTF like we did for `explode` (and `json_tuple` in 1.6): wrap it with `MultiAlias`.
Another workaround is to use `expr`, for example `df.select(expr("explode(a)").as(Nil))`. I think `selectExpr` is no longer needed now that we have the `expr` function....

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #9981 from dilipbiswal/spark-11619.

ee444fe4

[SPARK-12350][CORE] Don't log errors when requested stream is not found. · 27828182

Marcelo Vanzin authored 9 years ago

If a client requests a non-existent stream, just send a failure message
back, without logging any error on the server side (since it's not a
server error).

On the executor side, avoid error logs by translating any errors during
transfer to a `ClassNotFoundException`, so that loading the class is
retried on the parent class loader. This can mask IO errors during
transmission, but the most common cause is that the class is not
served by the remote end.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10337 from vanzin/SPARK-12350.

27828182

[SPARK-9057][STREAMING] Twitter example joining to static RDD of word sentiment values · ea59b0f3

Jeff L authored 9 years ago

Example of joining a static RDD of word sentiments to a streaming RDD of Tweets in order to demo the usage of the transform() method.

Author: Jeff L <sha0lin@alumni.carnegiemellon.edu>

Closes #8431 from Agent007/SPARK-9057.

ea59b0f3

[SPARK-12413] Fix Mesos ZK persistence · 2bebaa39

Michael Gummelt authored 9 years ago

I believe this fixes SPARK-12413.  I'm currently running an integration test to verify.

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #10366 from mgummelt/fix-zk-mesos.

2bebaa39

[CORE][TESTS] minor fix of JavaSerializerSuite · 40e52a27

Jeff Zhang authored 9 years ago

No JIRA was created.
The original test passed because the class cast is lazy (it only happens when the object's method is invoked).

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10371 from zjffdu/minor_fix.

40e52a27

Dec 17, 2015

[MINOR] Hide the error logs for 'SQLListenerMemoryLeakSuite' · 0370abdf

Shixiong Zhu authored 9 years ago

Hide the error logs for 'SQLListenerMemoryLeakSuite' to avoid noise. Most of the changes are whitespace changes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10363 from zsxwing/hide-log.

0370abdf

[SPARK-11749][STREAMING] Duplicate creating the RDD in file stream when... · f4346f61

jhu-chang authored 9 years ago

[SPARK-11749][STREAMING] Duplicate creating the RDD in file stream when recovering from checkpoint data

Add a transient flag `DStream.restoredFromCheckpointData` to control the restore processing in DStream and avoid duplicate work: `DStream.restoreCheckpointData` checks this flag first and runs the restore process only when it is `false`.
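The guard-flag idea can be sketched in plain Python. The `Stream` class and its attributes are hypothetical stand-ins for a DStream node, and the shared-parent graph is an assumed scenario in which the same node is reached more than once.

```python
class Stream:
    """Toy stand-in for a DStream node; not Spark's API."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self.restored_from_checkpoint = False  # the guard flag
        self.restore_count = 0                 # counts real restore work

    def restore_checkpoint_data(self):
        # Check the flag first; skip if this node was already restored.
        if self.restored_from_checkpoint:
            return
        self.restore_count += 1
        for parent in self.parents:
            parent.restore_checkpoint_data()
        self.restored_from_checkpoint = True

source = Stream("source")
left = Stream("left", [source])
right = Stream("right", [source])

# Both children reach the shared parent, but it is restored only once.
for stream in (left, right):
    stream.restore_checkpoint_data()

print(source.restore_count)
```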

Author: jhu-chang <gt.hu.chang@gmail.com>

Closes #9765 from jhu-chang/SPARK-11749.

f4346f61

[SPARK-8641][SQL] Native Spark Window functions · 658f66e6

Herman van Hovell authored 9 years ago

This PR removes Hive window functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features.

This has the following advantages:
* Better memory management.
* The ability to use Spark UDAFs in window functions.

cc rxin / yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #9819 from hvanhovell/SPARK-8641-2.

658f66e6

[SPARK-12376][TESTS] Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method · ed6ebda5

Evan Chen authored 9 years ago

org.apache.spark.streaming.Java8APISuite.java is failing because it tries to sort an immutable list in the assertOrderInvariantEquals method.

Author: Evan Chen <chene@us.ibm.com>

Closes #10336 from evanyc15/SPARK-12376-StreamingJavaAPISuite.

ed6ebda5

[SPARK-12397][SQL] Improve error messages for data sources when they are not found · e096a652

Reynold Xin authored 9 years ago

Point users to spark-packages.org to find them.

Author: Reynold Xin <rxin@databricks.com>

Closes #10351 from rxin/SPARK-12397.

e096a652

[SPARK-12410][STREAMING] Fix places that use '.' and '|' directly in split · 540b5aea

Shixiong Zhu authored 9 years ago

String.split accepts a regular expression, so we should escape "." and "|".
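The same pitfall exists in other regex-based split functions. The sketch below uses Python's `re.split` as an analogue of Java's `String.split`; the exact result of an unescaped split differs between the two, but the cause is the same.

```python
import re

dotted = "a.b.c"
piped = "x|y|z"

# Unescaped "." is the regex wildcard, so every character is treated
# as a separator and only empty strings remain.
naive = re.split(".", dotted)

# Escaping makes the pattern match the literal character.
by_dot = re.split(re.escape("."), dotted)
by_pipe = re.split(re.escape("|"), piped)

print(naive)
print(by_dot, by_pipe)
```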

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10361 from zsxwing/reg-bug.

540b5aea

[SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server · 81845688

Iulian Dragos authored 9 years ago

Fix the problem with #10332; this one should fix cluster mode on Mesos.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10359 from dragos/issue/fix-spark-12345-one-more-time.

81845688

[SPARK-12220][CORE] Make Utils.fetchFile support files that contain special characters · 86e405f3

Shixiong Zhu authored 9 years ago

This PR encodes and decodes the file name to fix the issue.
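A minimal sketch of the encode/decode round trip, using Python's `urllib.parse` rather than the JVM utilities Spark uses. The file name and URL are made up for illustration.

```python
from urllib.parse import quote, unquote

# Hypothetical file name containing URI-special characters.
file_name = "my report #1?.txt"

# Encode the name before embedding it in a URL, so that "#" and "?"
# are not read as a fragment or a query string.
encoded = quote(file_name, safe="")
url = "http://host:1234/files/" + encoded

# Decode the last path segment to recover the original name.
decoded = unquote(url.rsplit("/", 1)[-1])

print(encoded, decoded == file_name)
```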

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10208 from zsxwing/uri.

86e405f3

[SQL] Update SQLContext.read.text doc · 6e077166

Yanbo Liang authored 9 years ago

Since we renamed the column from ```text``` to ```value``` for DataFrames loaded by ```SQLContext.read.text```, we need to update the doc.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10349 from yanboliang/text-value.

6e077166

[SPARK-12395] [SQL] fix resulting columns of outer join · a170d34a

Davies Liu authored 9 years ago

For the API DataFrame.join(right, usingColumns, joinType), if the joinType is right_outer or full_outer, the resulting join columns could be wrong (they will be null).

The order of columns has been changed to match that of MySQL and PostgreSQL [1].

This PR also fixes the nullability of the output for outer joins.

[1] http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html
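To illustrate the intended result for a join on shared columns, here is a toy full outer join in plain Python. It is a sketch that assumes keys are unique on each side and that the output key is taken from whichever side has the row, as SQL `USING` semantics do; the function and column names are hypothetical.

```python
def full_outer_join_using(left, right, key, left_cols, right_cols):
    """Toy full outer join of two lists of dicts on one shared column."""
    left_by_key = {r[key]: r for r in left}
    right_by_key = {r[key]: r for r in right}
    out = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        l = left_by_key.get(k, {})
        r = right_by_key.get(k, {})
        # The join column comes first and is never null here, because
        # it is taken from whichever side actually has the row.
        row = {key: k}
        for c in left_cols:
            row[c] = l.get(c)
        for c in right_cols:
            row[c] = r.get(c)
        out.append(row)
    return out

joined = full_outer_join_using(
    [{"id": 1, "x": "a"}], [{"id": 2, "y": "b"}], "id", ["x"], ["y"]
)
print(joined)
```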

Author: Davies Liu <davies@databricks.com>

Closes #10353 from davies/fix_join.

a170d34a

Revert "Once driver register successfully, stop it to connect to master." · cd3d937b

Davies Liu authored 9 years ago

This reverts commit 5a514b61.

cd3d937b

Once driver register successfully, stop it to connect to master. · 5a514b61

echo2mei authored 9 years ago

This commit is to resolve SPARK-12396.

Author: echo2mei <534384876@qq.com>

Closes #10354 from echoTomei/master.

5a514b61

[SPARK-12057][SQL] Prevent failure on corrupt JSON records · 9d66c421

Yin Huai authored 9 years ago

This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
originally, we would get a DF without any columns.
After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.
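The behaviour after the change can be approximated in plain Python with the standard `json` module. This is a sketch of the idea, not Spark's parser; `parse_lines` is a hypothetical helper.

```python
import json

def parse_lines(lines, corrupt_col="_corrupt_record"):
    """Parse one JSON value per line. Anything that is not a JSON
    object is kept verbatim in the corrupt-record column."""
    rows = []
    for line in lines:
        try:
            value = json.loads(line)
        except ValueError:
            value = None
        if isinstance(value, dict):
            rows.append(value)
        else:
            rows.append({corrupt_col: line})
    # Infer the schema as the union of all columns that were seen.
    columns = sorted({c for r in rows for c in r})
    return [{c: r.get(c) for c in columns} for r in rows], columns

rows, columns = parse_lines(['{"f1":1}', '[1,2,3]'])
print(columns)
print(rows)
```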

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao <me@ian.pw>
Author: Yin Huai <yhuai@databricks.com>

Closes #10288 from yhuai/handleCorruptJson.

9d66c421

[SPARK-11904][PYSPARK] reduceByKeyAndWindow does not require checkpointing when invFunc is None · 437583f6

David Tolpin authored 9 years ago

When invFunc is None, `reduceByKeyAndWindow(func, None, winsize, slidesize)` is equivalent to

reduceByKey(func).window(winsize, slidesize).reduceByKey(func)

and no checkpoint is necessary. The corresponding Scala code does exactly that, but the Python code always creates a windowed stream with obligatory checkpointing. The patch fixes this.
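The reduce, window, reduce composition can be modelled in plain Python over a list of batches. This is a sketch, not the PySpark API: window and slide are measured in batches, and all helper names are hypothetical.

```python
from operator import add

def reduce_by_key(pairs, func):
    """Combine the values of each key with func."""
    acc = {}
    for k, v in pairs:
        acc[k] = func(acc[k], v) if k in acc else v
    return sorted(acc.items())

def windowed(batches, win, slide):
    """Every `slide` batches, yield the concatenation of the last
    `win` batches."""
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - win)
        yield [p for b in batches[start:end] for p in b]

def reduce_by_key_and_window(batches, func, win, slide):
    # Reduce within each batch, window, then reduce across the window.
    pre = [reduce_by_key(b, func) for b in batches]
    return [reduce_by_key(w, func) for w in windowed(pre, win, slide)]

batches = [[("a", 1), ("a", 2), ("b", 1)], [("a", 3)], [("b", 5), ("b", 1)]]
result = reduce_by_key_and_window(batches, add, win=2, slide=1)
print(result)
```

Each window is recomputed from its pre-reduced batches, so no state has to be carried from one window to the next.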

I do not know how to unit-test this.

Author: David Tolpin <david.tolpin@gmail.com>

Closes #9888 from dtolpin/master.

437583f6

Dec 16, 2015

[SPARK-12390] Clean up unused serializer parameter in BlockManager · 97678ede

Andrew Or authored 9 years ago

No change in functionality is intended. This only changes internal API.

Author: Andrew Or <andrew@databricks.com>

Closes #10343 from andrewor14/clean-bm-serializer.

97678ede

[SPARK-12386][CORE] Fix NPE when spark.executor.port is set. · d1508dd9

Marcelo Vanzin authored 9 years ago

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10339 from vanzin/SPARK-12386.

d1508dd9

[SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting. · fdb38227

Rohit Agarwal authored 9 years ago

Author: Rohit Agarwal <rohita@qubole.com>

Closes #10180 from mindprince/SPARK-12186.

fdb38227

[SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called · f590178d

tedyu authored 9 years ago

SPARK-9886 fixed ExternalBlockStore.scala

This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook()

Author: tedyu <yuzhihong@gmail.com>

Closes #10325 from ted-yu/master.

f590178d