Commits · 31921e0f0bd559d042148d1ea32f865fb3068f38 · cs525-sp18-g07 / spark

Nov 18, 2015

[SPARK-4557][STREAMING] Spark Streaming foreachRDD Java API method should... · 31921e0f

Bryan Cutler authored 9 years ago

[SPARK-4557][STREAMING] Spark Streaming foreachRDD Java API method should accept a VoidFunction<...>

Currently, the streaming foreachRDD Java API uses a function prototype that requires a return value of null. This PR deprecates the old method and uses VoidFunction to allow a more concise declaration. It also adds VoidFunction2 to the Java API for use in Streaming methods. A unit test is added for using foreachRDD with VoidFunction, and the changes have been tested with Java 7 and Java 8 using lambdas.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9488 from BryanCutler/foreachRDD-VoidFunction-SPARK-4557.

31921e0f

[SPARK-11739][SQL] clear the instantiated SQLContext · 94624eac

Davies Liu authored 9 years ago

Currently, if the first SQLContext is not removed after stopping SparkContext, a SQLContext could sit there forever. This patch makes this more robust.

Author: Davies Liu <davies@databricks.com>

Closes #9706 from davies/clear_context.

94624eac

[SPARK-11792] [SQL] [FOLLOW-UP] Change SizeEstimation to KnownSizeEstimation... · 6f99522d

Yin Huai authored 9 years ago

[SPARK-11792] [SQL] [FOLLOW-UP] Change SizeEstimation to KnownSizeEstimation and make estimatedSize return Long instead of Option[Long]

https://issues.apache.org/jira/browse/SPARK-11792

The main changes include:
* Renaming `SizeEstimation` to `KnownSizeEstimation`. Hopefully the new name is more informative.
* Making `estimatedSize` return `Long` instead of `Option[Long]`.
* In `UnsafeHashedRelation`, `estimatedSize` delegates the work to `SizeEstimator` if we have not created a `BytesToBytesMap`.

Since we will put `UnsafeHashedRelation` into `BlockManager`, it is generally good to let it provide a more accurate size estimation. Also, if we do not put `BytesToBytesMap` directly into `BlockManager`, I feel it is not really necessary to make `BytesToBytesMap` extend `KnownSizeEstimation`.
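The delegation described above can be sketched in Python. This is only an illustration, not Spark's Scala code: the class `HashedRelation`, the helpers `generic_estimate` and `estimate_size`, and the `binary_map_bytes` field are all hypothetical stand-ins.

```python
import sys


class KnownSizeEstimation:
    """Hypothetical base class: subclasses report their own size in bytes."""

    def estimated_size(self) -> int:
        raise NotImplementedError


def generic_estimate(obj) -> int:
    """Crude generic estimator, standing in for a reflective size estimator."""
    return sys.getsizeof(obj)


class HashedRelation(KnownSizeEstimation):
    """Hypothetical relation that may or may not be backed by a binary map."""

    def __init__(self, rows, binary_map_bytes=None):
        self.rows = rows
        # None means that no binary map has been created yet.
        self.binary_map_bytes = binary_map_bytes

    def estimated_size(self) -> int:
        if self.binary_map_bytes is not None:
            return self.binary_map_bytes
        # No map yet: delegate to the generic estimator.
        return generic_estimate(self.rows)


def estimate_size(obj) -> int:
    """Always returns a plain int, never an optional value."""
    if isinstance(obj, KnownSizeEstimation):
        return obj.estimated_size()
    return generic_estimate(obj)
```

Returning a plain integer keeps callers free of "is the size known?" checks; the object itself decides when to fall back to the generic estimate.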

Author: Yin Huai <yhuai@databricks.com>

Closes #9813 from yhuai/SPARK-11792-followup.

6f99522d

[MINOR][BUILD] Ignore ensime cache · 90a7519d

Jakob Odersky authored 9 years ago

Using ENSIME, I often have `.ensime_cache` polluting my source tree. This PR simply adds the cache directory to `.gitignore`

Author: Jakob Odersky <jodersky@gmail.com>

Closes #9708 from jodersky/master.

90a7519d

[SPARK-11795][SQL] combine grouping attributes into a single NamedExpression · dbf428c8

Wenchen Fan authored 9 years ago

We use `ExpressionEncoder.tuple` to build the result encoder, which assumes that a non-flat input encoder points to a struct type field.
However, our keyEncoder always points to a flat field or fields (`groupingAttributes`), so we should combine them into a single `NamedExpression`.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9792 from cloud-fan/agg.

dbf428c8

[SPARK-11725][SQL] correctly handle null inputs for UDF · 33b83733

Wenchen Fan authored 9 years ago

If a user uses primitive parameters in a UDF, there is no way to null-check those primitive inputs, so in this case we assume the primitive input is null-propagating and return null if the input is null.
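A minimal Python sketch of null propagation, assuming a decorator-based wrapper. The names `null_propagating` and `add_one` are hypothetical, and Python's `None` stands in for SQL null; this is not how Spark implements the check internally.

```python
def null_propagating(func):
    """Wrap func so that any None argument short-circuits to None."""

    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)

    return wrapper


@null_propagating
def add_one(x: int) -> int:
    # The body never sees None, mirroring a UDF that takes primitive parameters.
    return x + 1
```

With this wrapper, the function body can assume a real value, and callers get a null result for a null input instead of an error.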

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9770 from cloud-fan/udf.

33b83733

[SPARK-11803][SQL] fix Dataset self-join · cffb899c

Wenchen Fan authored 9 years ago

When we resolve the join operator, we may change the output of right side if self-join is detected. So in `Dataset.joinWith`, we should resolve the join operator first, and then get the left output and right output from it, instead of using `left.output` and `right.output` directly.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9806 from cloud-fan/self-join.

cffb899c

[SPARK-11195][CORE] Use correct classloader for TaskResultGetter · 3cca5ffb

Hurshal Patel authored 9 years ago

Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.

The issue is that `enqueueFailedTask` was using the incorrect classloader which results in `ClassNotFoundException`.

Adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark handles the TaskResult deserialization instead of returning `UnknownReason`.

See #9367 for previous comments
See SPARK-11195 for a full repro

Author: Hurshal Patel <hpatel516@gmail.com>

Closes #9779 from choochootrain/spark-11195-master.

3cca5ffb

[SPARK-11773][SPARKR] Implement collection functions in SparkR. · 224723e6

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #9764 from sun-rui/SPARK-11773.

224723e6

[SPARK-11281][SPARKR] Add tests covering the issue. · a97d6f3a

zero323 authored 9 years ago

The goal of this PR is to add tests covering the issue to ensure that it was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086).

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9743 from zero323/SPARK-11281-tests.

a97d6f3a

[SPARK-11804] [PYSPARK] Exception raise when using Jdbc predicates opt… · 3a6807fd

Jeff Zhang authored 9 years ago

[SPARK-11804] [PYSPARK] Exception raise when using Jdbc predicates option in PySpark

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9791 from zjffdu/SPARK-11804.

3a6807fd

rmse was wrongly calculated · 1429e0a2

Viveka Kulharia authored 9 years ago

It was multiplying by U instead of dividing by U.
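To illustrate the kind of mistake described, here is a small Python sketch of an RMSE over an M x U matrix. The function and its nested-list input are assumptions made for this example, not the patched code; the comment shows one way such a bug can arise through operator precedence.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error over an M x U matrix given as nested lists."""
    m = len(actual)
    u = len(actual[0])
    sum_sq = sum(
        (actual[i][j] - predicted[i][j]) ** 2
        for i in range(m)
        for j in range(u)
    )
    # Divide by the total number of entries, m * u. Writing `sum_sq / m * u`
    # would divide by m and then multiply by u because of operator precedence.
    return math.sqrt(sum_sq / (m * u))
```

For a 1 x 4 matrix where every prediction is off by exactly 1, the correct result is 1.0, whereas the unparenthesized form would give 4.0.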

Author: Viveka Kulharia <vivkul@iitk.ac.in>

Closes #9771 from vivkul/patch-1.

1429e0a2

[SPARK-11652][CORE] Remote code execution with InvokerTransformer · 9631ca35

Sean Owen authored 9 years ago

Update to Commons Collections 3.2.2 to avoid any potential remote code execution vulnerability

Author: Sean Owen <sowen@cloudera.com>

Closes #9731 from srowen/SPARK-11652.

9631ca35

[SPARK-6541] Sort executors by ID (numeric) · e62820c8

Jean-Baptiste Onofré authored 9 years ago

"Force" the executor ID sort with Int.

Author: Jean-Baptiste Onofré <jbonofre@apache.org>

Closes #9165 from jbonofre/SPARK-6541.

e62820c8

[SPARK-10946][SQL] JDBC - Use Statement.executeUpdate instead of... · b8f4379b

somideshmukh authored 9 years ago

[SPARK-10946][SQL] JDBC - Use Statement.executeUpdate instead of PreparedStatement.executeUpdate for DDLs

New changes with JDBCRDD

Author: somideshmukh <somilde@us.ibm.com>

Closes #9733 from somideshmukh/SomilBranch-1.1.

b8f4379b

[SPARK-11792][SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations · 1714350b

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11792

Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs.

Author: Yin Huai <yhuai@databricks.com>

Closes #9788 from yhuai/SPARK-11792.

1714350b

[SPARK-11802][SQL] Kryo-based encoder for opaque types in Datasets · 5e2b4447

Reynold Xin authored 9 years ago

I also found a bug with self-joins returning incorrect results in the Dataset API. Two test cases attached and filed SPARK-11803.

Author: Reynold Xin <rxin@databricks.com>

Closes #9789 from rxin/SPARK-11802.

5e2b4447

[SPARK-10186][SQL][FOLLOW-UP] simplify test · 8019f66d

Wenchen Fan authored 9 years ago

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9783 from cloud-fan/postgre.

8019f66d

[SPARK-11728] Replace example code in ml-ensembles.md using include_example · 9154f89b

Xusen Yin authored 9 years ago

JIRA issue https://issues.apache.org/jira/browse/SPARK-11728.

The ml-ensembles.md file contains `OneVsRestExample`. Instead of writing new code files for the two `OneVsRestExample`s, I use two existing files in the examples directory: `OneVsRestExample.scala` and `JavaOneVsRestExample.java`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9716 from yinxusen/SPARK-11728.

9154f89b

[SPARK-11643] [SQL] parse year with leading zero · 2f191c66

Davies Liu authored 9 years ago

Support years in the range 0 <= year < 1000.

Author: Davies Liu <davies@databricks.com>

Closes #9701 from davies/leading_zero.

2f191c66

[SPARK-7013][ML][TEST] Add unit test for spark.ml StandardScaler · 67a5132c

RoyGaoVLIS authored 9 years ago

I have added a unit test for ML's StandardScaler that compares its results with R's output. Please review.
Thanks.

Author: RoyGaoVLIS <roygao@zju.edu.cn>

Closes #6665 from RoyGao/7013.

67a5132c

[SPARK-11761] Prevent the call to StreamingContext#stop() in the listener bus's thread · 446738e5

tedyu authored 9 years ago

See discussion toward the tail of https://github.com/apache/spark/pull/9723
From zsxwing :
```
The user should not call stop or other long-time work in a listener since it will block the listener thread, and prevent from stopping SparkContext/StreamingContext.

I cannot see an approach since we need to stop the listener bus's thread before stopping SparkContext/StreamingContext totally.
```
Proposed solution is to prevent the call to StreamingContext#stop() in the listener bus's thread.
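The proposed guard can be sketched in Python with the standard `threading` module. This is only an illustration of the idea: `ListenerBus`, `Context`, and `run_on_bus_thread` are hypothetical names, and the real change lives in Spark's Scala code.

```python
import threading


class ListenerBus:
    """Hypothetical bus that runs listener callbacks on one dedicated thread."""

    def __init__(self):
        self.thread = None

    def run_on_bus_thread(self, callback):
        """Run callback on the bus thread and return any exception it raised."""
        result = {}

        def target():
            try:
                callback()
            except Exception as exc:
                result["error"] = exc

        self.thread = threading.Thread(target=target, name="listener-bus")
        self.thread.start()
        self.thread.join()
        return result.get("error")


class Context:
    """Hypothetical context whose stop() must not run on the bus thread."""

    def __init__(self, bus):
        self.bus = bus
        self.stopped = False

    def stop(self):
        # Refuse to stop from the listener thread: stopping would have to wait
        # for that same thread to finish, so the call could never complete.
        if threading.current_thread() is self.bus.thread:
            raise RuntimeError("Cannot stop the context from the listener bus thread")
        self.stopped = True
```

Failing fast with an exception turns a silent hang into an error the listener author can see immediately.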

Author: tedyu <yuzhihong@gmail.com>

Closes #9741 from tedyu/master.

446738e5

[SPARK-11755][R] SparkR should export "predict" · 8fb775ba

Yanbo Liang authored 9 years ago

This fixes the bug described at [SPARK-11755](https://issues.apache.org/jira/browse/SPARK-11755). After exporting `predict`, help information is available from both the SparkR and base R packages, as follows:
```R
> help(predict)
Help on topic ‘predict’ was found in the following packages:

  Package               Library
  SparkR                /Users/yanboliang/data/trunk2/spark/R/lib
  stats                 /Library/Frameworks/R.framework/Versions/3.2/Resources/library

Choose one

1: Make predictions from a model {SparkR}
2: Model Predictions {stats}
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9732 from yanboliang/spark-11755.

8fb775ba

Nov 17, 2015

[SPARK-11797][SQL] collect, first, and take should use encoders for serialization · 91f4b6f2

Reynold Xin authored 9 years ago

They were previously using Spark's default serializer for serialization.

Author: Reynold Xin <rxin@databricks.com>

Closes #9787 from rxin/SPARK-11797.

91f4b6f2

[SPARK-11737] [SQL] Fix serialization of UTF8String with Kryo · 98be8169

Davies Liu authored 9 years ago

The default implementation of UTF8String serialization with Kryo may be incorrect (BYTE_ARRAY_OFFSET could differ across JVMs).

Author: Davies Liu <davies@databricks.com>

Closes #9704 from davies/kyro_string.

98be8169

[SPARK-11583] [CORE] MapStatus Using RoaringBitmap More Properly · e33053ee

Kent Yao authored 9 years ago

This PR upgrades RoaringBitmap to 0.5.10 to optimize the memory layout; the bitmap will be much smaller when most of the blocks are empty.

This PR is based on #9661 (fix conflicts), see all of the comments at https://github.com/apache/spark/pull/9661 .

Author: Kent Yao <yaooqinn@hotmail.com>
Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>

Closes #9746 from davies/roaring_mapstatus.

e33053ee

[SPARK-11016] Move RoaringBitmap to explicit Kryo serializer · bf25f9bd

Davies Liu authored 9 years ago

Fix the serialization of RoaringBitmap with the Kryo serializer.

This PR came from https://github.com/metamx/spark/pull/1, thanks to drcrallen

Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>

Closes #9748 from davies/SPARK-11016.

bf25f9bd

[SPARK-11793][SQL] Dataset should set the resolved encoders internally for maps. · ed8d1531

Reynold Xin authored 9 years ago

I also wrote a test case -- but unfortunately the test case is not working due to SPARK-11795.

Author: Reynold Xin <rxin@databricks.com>

Closes #9784 from rxin/SPARK-11503.

ed8d1531

[SPARK-9065][STREAMING][PYSPARK] Add MessageHandler for Kafka Python API · 75a29229

jerryshao authored 9 years ago

Fixed the merge conflicts in #7410

Closes #7410

Author: Shixiong Zhu <shixiong@databricks.com>
Author: jerryshao <saisai.shao@intel.com>
Author: jerryshao <sshao@hortonworks.com>

Closes #9742 from zsxwing/pr7410.

75a29229

[SPARK-11726] Throw exception on timeout when waiting for REST server response · b362d50f

Jacek Lewandowski authored 9 years ago

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #9692 from jacek-lewandowski/SPARK-11726.

b362d50f

[SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two... · 52c734b5

Holden Karau authored 9 years ago

[SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two params have both in error msg

When we exceed the max memory, tell users to increase both params instead of just one.

Author: Holden Karau <holden@us.ibm.com>

Closes #9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.

52c734b5

[SPARK-11790][STREAMING][TESTS] Increase the connection timeout · 3720b148

Shixiong Zhu authored 9 years ago

Sometimes, EmbeddedZookeeper may need more than 6 seconds to set up on a slow Jenkins worker. So just increase the timeout; this won't increase the test time if the test passes.
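The reason a larger timeout does not slow down a passing test is that a timeout is only an upper bound on waiting. A small Python sketch of a polling wait shows this; `wait_until` and its parameters are hypothetical and are not part of the Spark test suite.

```python
import time


def wait_until(condition, timeout_s, poll_s=0.01):
    """Poll condition until it is true or timeout_s seconds have elapsed."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if condition():
            # Success returns immediately, regardless of how large timeout_s is.
            return True
        time.sleep(poll_s)
    # One last check at the deadline.
    return bool(condition())
```

Only the failing case pays the full timeout, so raising the limit costs time only when the service never comes up.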

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9778 from zsxwing/SPARK-11790.

3720b148

[MINOR] Correct comments in JavaDirectKafkaWordCount · e29656f8

Rohan Bhanderi authored 9 years ago

Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>

Closes #9781 from RohanBhanderi/patch-3.

e29656f8

[SPARK-9552] Add force control for killExecutors to avoid false killing for those busy executors · 965245d0

Grace authored 9 years ago

With dynamic allocation, busy executors are sometimes killed by mistake. Executors that have been assigned tasks can be killed because they appear to have been idle for long enough (say, 60 seconds). The root cause is that the task-launch listener event is asynchronous.

For example, tasks are being assigned to some executors, but the listener notification has not been sent yet. Meanwhile, the dynamic allocation idle timeout for those executors expires (e.g., after 60 seconds), which triggers a killExecutor event at the same time.
1. The timer expires before the listener event arrives.
2. The task then runs on the killed or dying executor, which eventually leads to task failure.

The proposed fix is to add a force control to killExecutor. If force is not set (i.e., false), we check whether the executor to be killed is idle or busy. If the executor has tasks assigned, we do not kill it and return false (to indicate that the kill failed). Dynamic allocation should turn off force killing (i.e., force = false), so that an attempt to kill a busy executor fails and the executor's idle timer is not invalidated. When the task assignment event arrives later, the idle timer is removed accordingly. This avoids killing busy executors by mistake under dynamic allocation.

For all other usages, end users can decide for themselves whether to use force killing. If that option is turned on, killExecutor kills the executor without any status checking.
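The force control can be sketched in Python as follows. `ExecutorManager`, its task-count bookkeeping, and the method names are hypothetical simplifications for illustration, not Spark's scheduler backend.

```python
class ExecutorManager:
    """Hypothetical tracker of executors and their running task counts."""

    def __init__(self):
        self.running_tasks = {}

    def add_executor(self, executor_id, running=0):
        self.running_tasks[executor_id] = running

    def kill_executor(self, executor_id, force=False):
        """Return True if the executor was removed, False if the kill was refused."""
        if executor_id not in self.running_tasks:
            return False
        # Without force, refuse to kill an executor that still has tasks.
        if not force and self.running_tasks[executor_id] > 0:
            return False
        del self.running_tasks[executor_id]
        return True
```

A caller such as dynamic allocation would pass `force=False` and treat a `False` result as "keep the executor", while an explicit user request could pass `force=True` to skip the busy check.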

Author: Grace <jie.huang@intel.com>
Author: Andrew Or <andrew@databricks.com>
Author: Jie Huang <jie.huang@intel.com>

Closes #7888 from GraceH/forcekill.

965245d0

[SPARK-11740][STREAMING] Fix the race condition of two checkpoints in a batch · 928d6316

Shixiong Zhu authored 9 years ago

We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, the checkpoint for completing an old batch may run after the checkpoint for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we always recover from the latest checkpoint file.
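A minimal Python sketch of the naming rule, assuming an in-memory store instead of real files. The class `CheckpointWriter` and the `checkpoint-<time>` naming scheme are hypothetical; only the idea of naming by the latest checkpoint time comes from the description above.

```python
class CheckpointWriter:
    """Hypothetical writer that names each file by the latest checkpoint time seen."""

    def __init__(self):
        self.latest_time = 0
        self.files = {}

    def write(self, checkpoint_time, state):
        # Never use an older time than one already written, so a late
        # checkpoint for an old batch overwrites the newest file.
        self.latest_time = max(self.latest_time, checkpoint_time)
        name = f"checkpoint-{self.latest_time}"
        self.files[name] = state
        return name

    def recover(self):
        """Return the state stored under the highest checkpoint time."""
        newest = max(self.files, key=lambda n: int(n.split("-")[1]))
        return self.files[newest]
```

If the checkpoint for generating batch 2000 is written first and the checkpoint for completing batch 1000 arrives afterwards, the later write still lands in `checkpoint-2000`, so recovery picks up the most recent state.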

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9707 from zsxwing/fix-checkpoint.

928d6316

[SPARK-11786][CORE] Tone down messages from akka error monitor. · 936bc0bc

Marcelo Vanzin authored 9 years ago

These events happen normally during the app's lifecycle, so printing
out ERROR logs all the time is misleading, and can actually affect the usability
of interactive shells.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9772 from vanzin/SPARK-11786.

936bc0bc

[SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector · 3e9e6380

Xiangrui Meng authored 9 years ago

This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9776 from mengxr/SPARK-11764.

3e9e6380

[SPARK-11763][ML] Add save,load to LogisticRegression Estimator · 6eb7008b

Joseph K. Bradley authored 9 years ago

Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs.

Moved LogisticRegressionReader/Writer to within LogisticRegressionModel

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9749 from jkbradley/lr-io-2.

6eb7008b

[SPARK-11729] Replace example code in ml-linear-methods.md using include_example · 328eb49e

Xusen Yin authored 9 years ago

JIRA link: https://issues.apache.org/jira/browse/SPARK-11729

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9713 from yinxusen/SPARK-11729.

328eb49e

[SPARK-11732] Removes some MiMa false positives · fa603e08

Timothy Hunter authored 9 years ago

This adds an extra filter for private or protected classes. We only filter for package private right now.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #9697 from thunterdb/spark-11732.

fa603e08