Commits · ea59b0f3a6600f8046e5f3f55e89257614fb1f10 · cs525-sp18-g07 / spark

Dec 18, 2015

[SPARK-9057][STREAMING] Twitter example joining to static RDD of word sentiment values · ea59b0f3

Jeff L authored 9 years ago

Example of joining a static RDD of word sentiments to a streaming RDD of Tweets in order to demo the usage of the transform() method.

Author: Jeff L <sha0lin@alumni.carnegiemellon.edu>

Closes #8431 from Agent007/SPARK-9057.

ea59b0f3

[SPARK-12413] Fix Mesos ZK persistence · 2bebaa39

Michael Gummelt authored 9 years ago

I believe this fixes SPARK-12413.  I'm currently running an integration test to verify.

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #10366 from mgummelt/fix-zk-mesos.

2bebaa39

[CORE][TESTS] minor fix of JavaSerializerSuite · 40e52a27

Jeff Zhang authored 9 years ago

No JIRA was created.
The original test passed because the class cast is lazy (it only happens when a method of the object is invoked).

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10371 from zjffdu/minor_fix.

40e52a27

Dec 17, 2015

[MINOR] Hide the error logs for 'SQLListenerMemoryLeakSuite' · 0370abdf

Shixiong Zhu authored 9 years ago

Hide the error logs for 'SQLListenerMemoryLeakSuite' to avoid noise. Most of the changes are whitespace changes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10363 from zsxwing/hide-log.

0370abdf

[SPARK-11749][STREAMING] Duplicate creating the RDD in file stream when... · f4346f61

jhu-chang authored 9 years ago

[SPARK-11749][STREAMING] Duplicate creating the RDD in file stream when recovering from checkpoint data

Add a transient flag `DStream.restoredFromCheckpointData` to control restore processing in DStream and avoid duplicate work: `DStream.restoreCheckpointData` checks this flag first and runs the restore process only when it is `false`.
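The guard can be sketched roughly as follows. This is a plain-Scala model, not Spark's actual `DStream` code: the `Node` class, its `parents` field, and the `restoreCount` counter are hypothetical names introduced only for illustration.

```scala
object RestoreOnceSketch {
  // Hypothetical stand-in for a DStream-like node in a graph with shared parents.
  class Node(val parents: Seq[Node]) {
    // Marked transient so that, under Java serialization, a deserialized
    // node would start out with the flag reset to false.
    @transient private var restoredFromCheckpointData = false

    // Counts how often the (expensive) restore work actually ran.
    var restoreCount = 0

    def restoreCheckpointData(): Unit = {
      if (!restoredFromCheckpointData) {
        restoreCount += 1 // stands in for re-creating RDDs from checkpoint data
        parents.foreach(_.restoreCheckpointData())
        restoredFromCheckpointData = true
      }
    }
  }
}
```

In a diamond-shaped graph, where two children share one parent, the shared parent is restored once instead of once per path that reaches it.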

Author: jhu-chang <gt.hu.chang@gmail.com>

Closes #9765 from jhu-chang/SPARK-11749.

f4346f61

[SPARK-8641][SQL] Native Spark Window functions · 658f66e6

Herman van Hovell authored 9 years ago

This PR removes Hive window functions from Spark and replaces them with (native) Spark ones. The PR is on par with Hive in terms of features.

This has the following advantages:
* Better memory management.
* The ability to use Spark UDAFs in Window functions.

cc rxin / yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #9819 from hvanhovell/SPARK-8641-2.

658f66e6

[SPARK-12376][TESTS] Spark Streaming Java8APISuite fails in assertOrderInvariantEquals method · ed6ebda5

Evan Chen authored 9 years ago

org.apache.spark.streaming.Java8APISuite.java fails because the assertOrderInvariantEquals method tries to sort an immutable list.
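A minimal sketch of this failure mode and one common remedy, sorting a mutable copy. The commit message does not say how the suite was actually fixed, so `SortCopySketch` and its helpers are hypothetical and only illustrate the general pattern.

```scala
import java.util.{ArrayList, Collections, List => JList}

object SortCopySketch {
  // Sorts a defensive copy, so the caller's (possibly immutable) list is never mutated.
  def sortedCopy(in: JList[String]): JList[String] = {
    val copy = new ArrayList[String](in)
    Collections.sort(copy)
    copy
  }

  // Returns true when the list itself rejects in-place sorting.
  def sortInPlaceFails(in: JList[String]): Boolean =
    try { Collections.sort(in); false }
    catch { case _: UnsupportedOperationException => true }
}
```

For a list wrapped with `Collections.unmodifiableList`, `sortInPlaceFails` reports the rejection, while `sortedCopy` still yields a sorted result.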

Author: Evan Chen <chene@us.ibm.com>

Closes #10336 from evanyc15/SPARK-12376-StreamingJavaAPISuite.

ed6ebda5

[SPARK-12397][SQL] Improve error messages for data sources when they are not found · e096a652

Reynold Xin authored 9 years ago

Point users to spark-packages.org to find them.

Author: Reynold Xin <rxin@databricks.com>

Closes #10351 from rxin/SPARK-12397.

e096a652

[SPARK-12410][STREAMING] Fix places that use '.' and '|' directly in split · 540b5aea

Shixiong Zhu authored 9 years ago

String.split accepts a regular expression, so we should escape "." and "|".
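The pitfall is easy to reproduce with the standard `String.split`. The snippet below is a small illustration, not the code touched by the PR; the object and value names are made up for the example.

```scala
import java.util.regex.Pattern

object SplitSketch {
  val dotted = "a.b.c"
  val piped = "x|y|z"

  // Unescaped "." is the regex "any character": every character acts as a
  // delimiter, so only empty strings remain and split drops them all.
  val wrong: Array[String] = dotted.split(".")

  // Escaping (or quoting) makes the delimiter literal.
  val byDot: Array[String] = dotted.split("\\.")
  val byPipe: Array[String] = piped.split(Pattern.quote("|"))
}
```

`Pattern.quote` is convenient when the delimiter comes from a variable, since it escapes every metacharacter at once.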

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10361 from zsxwing/reg-bug.

540b5aea

[SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server · 81845688

Iulian Dragos authored 9 years ago

Fixes a problem with #10332; this one should fix cluster mode on Mesos.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10359 from dragos/issue/fix-spark-12345-one-more-time.

81845688

[SPARK-12220][CORE] Make Utils.fetchFile support files that contain special characters · 86e405f3

Shixiong Zhu authored 9 years ago

This PR encodes and decodes the file name to fix the issue.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10208 from zsxwing/uri.

86e405f3

[SQL] Update SQLContext.read.text doc · 6e077166

Yanbo Liang authored 9 years ago

Since we renamed the column from ```text``` to ```value``` for DataFrames loaded by ```SQLContext.read.text```, we need to update the doc.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10349 from yanboliang/text-value.

6e077166

[SPARK-12395] [SQL] fix resulting columns of outer join · a170d34a

Davies Liu authored 9 years ago

For API DataFrame.join(right, usingColumns, joinType), if the joinType is right_outer or full_outer, the resulting join columns could be wrong (will be null).

The order of columns has been changed to match that of MySQL and PostgreSQL [1].

This PR also fixes the nullability of the output for outer joins.

[1] http://www.postgresql.org/docs/9.2/static/queries-table-expressions.html
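One way to picture the null problem is a plain-collections model of a full outer join. This is a sketch under the assumption that the wrong values came from exposing only one side's key column; `Joined`, `usingKey`, and `fullOuter` are hypothetical names, not Spark APIs.

```scala
object UsingJoinSketch {
  final case class Joined(
      leftKey: Option[Int],
      rightKey: Option[Int],
      leftVal: Option[String],
      rightVal: Option[String]) {
    // The single key column a USING-style join should expose:
    // take the left key, falling back to the right key when it is missing.
    def usingKey: Option[Int] = leftKey.orElse(rightKey)
  }

  def fullOuter(left: Seq[(Int, String)], right: Seq[(Int, String)]): Seq[Joined] = {
    // Rows that have a left side: either matched or padded with None on the right.
    val matchedOrLeftOnly = left.flatMap { case (k, v) =>
      val hits = right.filter(_._1 == k)
      if (hits.isEmpty) Seq(Joined(Some(k), None, Some(v), None))
      else hits.map { case (_, rv) => Joined(Some(k), Some(k), Some(v), Some(rv)) }
    }
    // Rows that exist only on the right: the left key is missing here.
    val rightOnly = right.collect {
      case (k, rv) if !left.exists(_._1 == k) => Joined(None, Some(k), None, Some(rv))
    }
    matchedOrLeftOnly ++ rightOnly
  }
}
```

In this model, reading only `leftKey` yields `None` for right-only rows, whereas `usingKey` is defined for every row.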

Author: Davies Liu <davies@databricks.com>

Closes #10353 from davies/fix_join.

a170d34a

Revert "Once driver register successfully, stop it to connect to master." · cd3d937b

Davies Liu authored 9 years ago

This reverts commit 5a514b61.

cd3d937b

Once driver register successfully, stop it to connect to master. · 5a514b61

echo2mei authored 9 years ago

This commit is to resolve SPARK-12396.

Author: echo2mei <534384876@qq.com>

Closes #10354 from echoTomei/master.

5a514b61

[SPARK-12057][SQL] Prevent failure on corrupt JSON records · 9d66c421

Yin Huai authored 9 years ago

This PR makes the JSON parser and schema inference handle more cases where we have unparsed records. It is based on #10043. The last commit fixes the failed test and updates the logic of schema inference.

Regarding the schema inference change, if we have something like
```
{"f1":1}
[1,2,3]
```
Originally, we would get a DF without any columns.
After this change, we will get a DF with columns `f1` and `_corrupt_record`. Basically, for the second row, `[1,2,3]` will be the value of `_corrupt_record`.

When merging this PR, please make sure that the author is simplyianm.

JIRA: https://issues.apache.org/jira/browse/SPARK-12057

Closes #10043

Author: Ian Macalinao <me@ian.pw>
Author: Yin Huai <yhuai@databricks.com>

Closes #10288 from yhuai/handleCorruptJson.

9d66c421

[SPARK-11904][PYSPARK] reduceByKeyAndWindow does not require checkpointing when invFunc is None · 437583f6

David Tolpin authored 9 years ago

When invFunc is None, `reduceByKeyAndWindow(func, None, winsize, slidesize)` is equivalent to

reduceByKey(func).window(winsize, slidesize).reduceByKey(func)

and no checkpoint is necessary. The corresponding Scala code does exactly that, but the Python code always creates a windowed stream with obligatory checkpointing. The patch fixes this.
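The equivalence can be modelled with plain Scala collections. This is a sketch only, not the Spark Streaming API: a stream is treated as a sequence of batches, windows are counted in batches rather than durations, and all names are hypothetical.

```scala
object WindowedReduceSketch {
  // One batch of (key, value) pairs; a stream is modelled as a sequence of batches.
  type Batch = Seq[(String, Int)]

  def reduceByKey(batch: Batch)(func: (Int, Int) => Int): Map[String, Int] =
    batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(func) }

  // Without an inverse function every window is recomputed from scratch:
  // pre-reduce each batch, union the batches in the window, reduce again.
  def windowedReduce(batches: Seq[Batch], windowSize: Int, slide: Int)(
      func: (Int, Int) => Int): List[Map[String, Int]] = {
    val preReduced: Seq[Batch] = batches.map(b => reduceByKey(b)(func).toSeq)
    preReduced
      .sliding(windowSize, slide)
      .map(window => reduceByKey(window.flatten)(func))
      .toList
  }
}
```

Because each window depends only on the batches it covers, no state is carried from one window to the next, which is why this form needs no checkpoint.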

I do not know how to unit-test this.

Author: David Tolpin <david.tolpin@gmail.com>

Closes #9888 from dtolpin/master.

437583f6

Dec 16, 2015

[SPARK-12390] Clean up unused serializer parameter in BlockManager · 97678ede

Andrew Or authored 9 years ago

No change in functionality is intended. This only changes internal API.

Author: Andrew Or <andrew@databricks.com>

Closes #10343 from andrewor14/clean-bm-serializer.

97678ede

[SPARK-12386][CORE] Fix NPE when spark.executor.port is set. · d1508dd9

Marcelo Vanzin authored 9 years ago

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10339 from vanzin/SPARK-12386.

d1508dd9

[SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting. · fdb38227

Rohit Agarwal authored 9 years ago

Author: Rohit Agarwal <rohita@qubole.com>

Closes #10180 from mindprince/SPARK-12186.

fdb38227

[SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called · f590178d

tedyu authored 9 years ago

SPARK-9886 fixed ExternalBlockStore.scala

This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook()

Author: tedyu <yuzhihong@gmail.com>

Closes #10325 from ted-yu/master.

f590178d

[SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests · 38d9795a

Imran Rashid authored 9 years ago

`DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs).  However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.

This was suggested by mateiz on https://github.com/apache/spark/pull/7699.  It may have already turned up an issue in "zero split job".

Author: Imran Rashid <irashid@cloudera.com>

Closes #8466 from squito/SPARK-10248.

38d9795a

MAINTENANCE: Automated closing of pull requests. · ce5fd400

Andrew Or authored 9 years ago

This commit exists to close the following pull requests on Github:

Closes #1217 (requested by ankurdave, srowen)
Closes #4650 (requested by andrewor14)
Closes #5307 (requested by vanzin)
Closes #5664 (requested by andrewor14)
Closes #5713 (requested by marmbrus)
Closes #5722 (requested by andrewor14)
Closes #6685 (requested by srowen)
Closes #7074 (requested by srowen)
Closes #7119 (requested by andrewor14)
Closes #7997 (requested by jkbradley)
Closes #8292 (requested by srowen)
Closes #8975 (requested by andrewor14, vanzin)
Closes #8980 (requested by andrewor14, davies)

ce5fd400

[MINOR] Add missing interpolation in NettyRPCEnv · 861549ac

Andrew Or authored 9 years ago

```
Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException:
Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
```
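The literal `${timeout.duration}` in the message above is what a Scala string literal produces when the `s` interpolator prefix is missing. A minimal illustration, with a made-up `timeout` value rather than the actual `NettyRPCEnv` code:

```scala
object InterpolationSketch {
  val timeout = "120 seconds"

  // Plain string literal: the placeholder is kept verbatim.
  val broken = "Cannot receive any reply in ${timeout}."

  // The s prefix turns on interpolation, so the value is substituted.
  val fixed = s"Cannot receive any reply in ${timeout}."
}
```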

Author: Andrew Or <andrew@databricks.com>

Closes #10334 from andrewor14/rpc-typo.

861549ac

[SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib · 27b98e99

Davies Liu authored 9 years ago

MLlib should use SQLContext.getOrCreate() instead of creating a new SQLContext.

Author: Davies Liu <davies@databricks.com>

Closes #10338 from davies/create_context.

27b98e99

[SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed · 3a44aebd

Martin Menestret authored 9 years ago

Extend CrossValidator with HasSeed in PySpark.

This PR replaces [https://github.com/apache/spark/pull/7997]

CC: yanboliang thunterdb mmenestret  Would one of you mind taking a look?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Martin MENESTRET <mmenestret@ippon.fr>

Closes #10268 from jkbradley/pyspark-cv-seed.

3a44aebd

[SPARK-11677][SQL] ORC filter tests all pass if filters are actually not pushed down. · 9657ee87

hyukjinkwon authored 9 years ago

Currently ORC filters are not tested properly. All the tests pass even if the filters are not pushed down or disabled. In this PR, I add some logic for this.
Since ORC does not fully filter record by record, this checks the count of the result and whether it contains the expected values.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9687 from HyukjinKwon/SPARK-11677.

9657ee87

[SPARK-12164][SQL] Decode the encoded values and then display · edf65cd9

gatorsmile authored 9 years ago

Based on the suggestions from marmbrus and cloud-fan in https://github.com/apache/spark/pull/10165, this PR prints the decoded values (user objects) in `Dataset.show`:
```scala
    implicit val kryoEncoder = Encoders.kryo[KryoClassData]
    val ds = Seq(KryoClassData("a", 1), KryoClassData("b", 2), KryoClassData("c", 3)).toDS()
    ds.show(20, false);
```
The current output is like
```
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                                                                                 |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 97, 2]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 98, 4]|
|[1, 0, 111, 114, 103, 46, 97, 112, 97, 99, 104, 101, 46, 115, 112, 97, 114, 107, 46, 115, 113, 108, 46, 75, 114, 121, 111, 67, 108, 97, 115, 115, 68, 97, 116, -31, 1, 1, -126, 99, 6]|
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
```
After the fix, it will be like the below if and only if the users override the `toString` function in the class `KryoClassData`
```scala
override def toString: String = s"KryoClassData($a, $b)"
```
```
+-------------------+
|value              |
+-------------------+
|KryoClassData(a, 1)|
|KryoClassData(b, 2)|
|KryoClassData(c, 3)|
+-------------------+
```

If users do not override the `toString` function, the results will be like
```
+---------------------------------------+
|value                                  |
+---------------------------------------+
|org.apache.spark.sql.KryoClassData@68ef|
|org.apache.spark.sql.KryoClassData@6915|
|org.apache.spark.sql.KryoClassData@693b|
+---------------------------------------+
```

Question: Should we add another optional parameter to the function `show` that decides whether it displays the hex values or the object values?

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10215 from gatorsmile/showDecodedValue.

edf65cd9

[SPARK-12320][SQL] throw exception if the number of fields does not line up for Tuple encoder · a783a8ed

Wenchen Fan authored 9 years ago

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10293 from cloud-fan/err-msg.

a783a8ed

[SPARK-12364][ML][SPARKR] Add ML example for SparkR · 1a8b2a17

Yanbo Liang authored 9 years ago

We have a DataFrame example for SparkR; we also need to add an ML example under ```examples/src/main/r```.

cc mengxr jkbradley shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10324 from yanboliang/spark-12364.

1a8b2a17

[SPARK-11608][MLLIB][DOC] Added migration guide for MLlib 1.6 · 8148cc7a

Joseph K. Bradley authored 9 years ago

No known breaking changes, but some deprecations and changes of behavior.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10235 from jkbradley/mllib-guide-update-1.6.

8148cc7a

[SPARK-12361][PYSPARK][TESTS] Should set PYSPARK_DRIVER_PYTHON before Python tests · 6a880afa

Jeff Zhang authored 9 years ago

Although this patch still doesn't explain why the return code is 0 (see the JIRA description), it resolves the Python version mismatch.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10322 from zjffdu/SPARK-12361.

6a880afa

[SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml test suites · d252b2d5

Yanbo Liang authored 9 years ago

Use ```sqlContext``` from ```MLlibTestSparkContext``` rather than creating a new one for spark.ml test suites. I have checked thoroughly and found four test cases that need to be updated.

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10279 from yanboliang/spark-12309.

d252b2d5

[SPARK-9694][ML] Add random seed Param to Scala CrossValidator · 860dc7f2

Yanbo Liang authored 9 years ago

Add random seed Param to Scala CrossValidator

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9108 from yanboliang/spark-9694.

860dc7f2

[SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means · 7b6dc29d

Yu ISHIKAWA authored 9 years ago

This PR includes only example code, in order to finish it quickly.
I'll send another PR for the docs soon.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9952 from yu-iskw/SPARK-6518.

7b6dc29d

[SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode. · ad8c1f0b

Timothy Chen authored 9 years ago

SPARK_HOME now causes problems with Mesos cluster mode, since the spark-submit script was recently changed so that SPARK_HOME, if defined, takes precedence when locating the spark-class scripts.

We should skip passing SPARK_HOME from the Spark client in cluster mode with Mesos, since Mesos shouldn't use this configuration but should use spark.executor.home instead.

Author: Timothy Chen <tnachen@gmail.com>

Closes #10332 from tnachen/scheduler_ui.

ad8c1f0b

[SPARK-12215][ML][DOC] User guide section for KMeans in spark.ml · 26d70bd2

Yu ISHIKAWA authored 9 years ago

cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #10244 from yu-iskw/SPARK-12215.

26d70bd2

[SPARK-12310][SPARKR] Add write.json and write.parquet for SparkR · 22f6cd86

Yanbo Liang authored 9 years ago

Add ```write.json``` and ```write.parquet``` for SparkR, and deprecate ```saveAsParquetFile```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10281 from yanboliang/spark-12310.

22f6cd86

[SPARK-12318][SPARKR] Save mode in SparkR should be error by default · 2eb5af5f

Jeff Zhang authored 9 years ago

shivaram Please help review.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10290 from zjffdu/SPARK-12318.

2eb5af5f

[SPARK-8745] [SQL] remove GenerateProjection · 54c512ba

Davies Liu authored 9 years ago

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #10316 from davies/remove_generate_projection.

54c512ba