Commits · 75a292291062783129d02607302f91c85655975e · cs525-sp18-g07 / spark

Nov 17, 2015

[SPARK-9065][STREAMING][PYSPARK] Add MessageHandler for Kafka Python API · 75a29229

jerryshao authored 9 years ago

Fixed the merge conflicts in #7410

Closes #7410

Author: Shixiong Zhu <shixiong@databricks.com>
Author: jerryshao <saisai.shao@intel.com>
Author: jerryshao <sshao@hortonworks.com>

Closes #9742 from zsxwing/pr7410.

75a29229

[SPARK-11726] Throw exception on timeout when waiting for REST server response · b362d50f

Jacek Lewandowski authored 9 years ago

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #9692 from jacek-lewandowski/SPARK-11726.

b362d50f

[SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two... · 52c734b5

Holden Karau authored 9 years ago

[SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two params have both in error msg

When we exceed the max memory, tell users to increase both params instead of just one.

Author: Holden Karau <holden@us.ibm.com>

Closes #9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.

52c734b5

[SPARK-11790][STREAMING][TESTS] Increase the connection timeout · 3720b148

Shixiong Zhu authored 9 years ago

Sometimes, EmbeddedZookeeper may need more than 6 seconds to set up on a slow Jenkins worker. So just increase the timeout; this won't increase the test time if the test passes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9778 from zsxwing/SPARK-11790.

3720b148

[MINOR] Correct comments in JavaDirectKafkaWordCount · e29656f8

Rohan Bhanderi authored 9 years ago

Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>

Closes #9781 from RohanBhanderi/patch-3.

e29656f8

[SPARK-9552] Add force control for killExecutors to avoid false killing for those busy executors · 965245d0

Grace authored 9 years ago

With dynamic allocation, busy executors are sometimes killed by mistake. An executor that has tasks assigned can be killed because it appears to have been idle for long enough (say, 60 seconds). The root cause is that the task-launch listener event is delivered asynchronously.

For example, an executor may be in the middle of being assigned tasks without the listener notification having been sent yet. Meanwhile, dynamic allocation's executor idle timeout (e.g., 60 seconds) expires and triggers a killExecutor event at the same time.
1. The timer expires before the listener event arrives.
2. The task then runs on the killed (or being-killed) executor, which eventually leads to task failure.

The proposed fix is to add a force control to killExecutor. If force is not set (i.e., false), we check whether the executor to be killed is idle or busy. If the executor has tasks assigned, we do not kill it and return false (to indicate that the kill failed). Dynamic allocation should turn off force killing (i.e., force = false), so an attempt to kill a busy executor fails and the executor's idle timer remains valid. When the task assignment event arrives later, the idle timer can be removed accordingly. This avoids killing busy executors by mistake under dynamic allocation.

For all other usages, end users can decide for themselves whether to use force killing. If that option is turned on, killExecutor acts without any status checking.
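The force control described above can be sketched roughly as follows. This is a simplified model, not Spark's implementation: `ExecutorTracker` and all of its members are hypothetical names chosen for illustration, and task bookkeeping is reduced to a per-executor counter.

```scala
import scala.collection.mutable

// Hypothetical, simplified model of the "force" flag; not Spark's actual API.
object ForceKillSketch {
  final class ExecutorTracker {
    // executor id -> number of tasks currently assigned to it
    private val runningTasks = mutable.Map.empty[String, Int]

    def register(id: String): Unit =
      if (!runningTasks.contains(id)) runningTasks.update(id, 0)

    def assignTask(id: String): Unit =
      runningTasks.update(id, runningTasks.getOrElse(id, 0) + 1)

    def isRegistered(id: String): Boolean = runningTasks.contains(id)

    def isBusy(id: String): Boolean = runningTasks.getOrElse(id, 0) > 0

    /** Returns true if the executor was removed, false if the request was refused. */
    def killExecutor(id: String, force: Boolean): Boolean = {
      if (!runningTasks.contains(id)) {
        false // unknown executor: nothing to kill
      } else if (!force && isBusy(id)) {
        false // busy and not forced: refuse, so the caller can keep its idle timer
      } else {
        runningTasks.remove(id)
        true
      }
    }
  }
}
```

In this model a dynamic-allocation caller would pass `force = false` and treat a `false` result as "try again later", while other callers could pass `force = true` to skip the busy check.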

Author: Grace <jie.huang@intel.com>
Author: Andrew Or <andrew@databricks.com>
Author: Jie Huang <jie.huang@intel.com>

Closes #7888 from GraceH/forcekill.

965245d0

[SPARK-11740][STREAMING] Fix the race condition of two checkpoints in a batch · 928d6316

Shixiong Zhu authored 9 years ago

We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, the checkpoint for completing an old batch may run after the checkpoint for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we always recover from the latest checkpoint file.
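A minimal sketch of the naming idea, under stated assumptions: `CheckpointNamer`, `latestFile`, and the `checkpoint-` prefix are illustrative names rather than the code in this PR. The namer remembers the largest batch time it has seen, so a late checkpoint for an old batch is still written under the newest time.

```scala
// Hypothetical sketch: name every checkpoint after the latest batch time seen so far.
object CheckpointNamingSketch {
  val Prefix = "checkpoint-" // assumed file-name prefix, for illustration only

  final class CheckpointNamer {
    private var latestTimeMs: Long = Long.MinValue

    /** File name to use for a checkpoint request issued for `batchTimeMs`. */
    def fileNameFor(batchTimeMs: Long): String = synchronized {
      // A request for an older batch never moves the name backwards.
      latestTimeMs = math.max(latestTimeMs, batchTimeMs)
      Prefix + latestTimeMs
    }
  }

  /** Picks the file with the largest time suffix; assumes well-formed numeric suffixes. */
  def latestFile(names: Seq[String]): Option[String] =
    names.filter(_.startsWith(Prefix)).sortBy(_.stripPrefix(Prefix).toLong).lastOption
}
```

With this scheme, if the checkpoint for completing batch 1000 runs after the checkpoint for generating batch 2000, both are written under the time 2000, and recovery picks that file.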

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9707 from zsxwing/fix-checkpoint.

928d6316

[SPARK-11786][CORE] Tone down messages from akka error monitor. · 936bc0bc

Marcelo Vanzin authored 9 years ago

These events happen normally during the app's lifecycle, so printing
out ERROR logs all the time is misleading, and can actually affect usability
of interactive shells.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9772 from vanzin/SPARK-11786.

936bc0bc

[SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector · 3e9e6380

Xiangrui Meng authored 9 years ago

This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9776 from mengxr/SPARK-11764.

3e9e6380

[SPARK-11763][ML] Add save,load to LogisticRegression Estimator · 6eb7008b

Joseph K. Bradley authored 9 years ago

Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs.

Moved LogisticRegressionReader/Writer to within LogisticRegressionModel

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9749 from jkbradley/lr-io-2.

6eb7008b

[SPARK-11729] Replace example code in ml-linear-methods.md using include_example · 328eb49e

Xusen Yin authored 9 years ago

JIRA link: https://issues.apache.org/jira/browse/SPARK-11729

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9713 from yinxusen/SPARK-11729.

328eb49e

[SPARK-11732] Removes some MiMa false positives · fa603e08

Timothy Hunter authored 9 years ago

This adds an extra filter for private or protected classes. We only filter for package private right now.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #9697 from thunterdb/spark-11732.

fa603e08

[SPARK-11767] [SQL] limit the size of cached batch · 5aca6ad0

Davies Liu authored 9 years ago

Currently the size of a cached batch is only controlled by `batchSize` (default value is 10000), which does not work well with the size of serialized columns (for example, complex types). The memory used to build the batch is not accounted for, so it's easy to OOM (especially after unified memory management).

This PR introduces a hard limit of 4M for total columns (up to 50 columns of uncompressed primitive columns).

This also changes the way the buffer grows: double it each time, then trim it once finished.
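The grow-by-doubling, trim-at-the-end strategy can be illustrated with a small byte buffer. This is a generic sketch, not the column builder code touched by this PR; `GrowableBytes` is a hypothetical name, and integer overflow for very large capacities is deliberately ignored.

```scala
// Hypothetical sketch of "double the buffer while building, trim once finished".
object GrowThenTrimSketch {
  final class GrowableBytes(initialCapacity: Int) {
    require(initialCapacity > 0, "initialCapacity must be positive")

    private var buf = new Array[Byte](initialCapacity)
    private var size = 0

    def capacity: Int = buf.length

    def append(b: Byte): Unit = {
      if (size == buf.length) {
        // Double the capacity so the total copying cost stays linear in the final size.
        buf = java.util.Arrays.copyOf(buf, buf.length * 2)
      }
      buf(size) = b
      size += 1
    }

    /** Returns a copy trimmed to exactly the bytes written. */
    def result(): Array[Byte] = java.util.Arrays.copyOf(buf, size)
  }
}
```

Doubling keeps the number of reallocations logarithmic in the final size, and the final trim releases the unused tail so the cached result holds only what was written.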

cc liancheng

Author: Davies Liu <davies@databricks.com>

Closes #9760 from davies/cache_limit.

5aca6ad0

[SPARK-11769][ML] Add save, load to all basic Transformers · d98d1cb0

Joseph K. Bradley authored 9 years ago

This excludes Estimators and ones which include Vector and other non-basic types for Params or data.  This adds:
* Bucketizer
* DCT
* HashingTF
* Interaction
* NGram
* Normalizer
* OneHotEncoder
* PolynomialExpansion
* QuantileDiscretizer
* RFormula
* SQLTransformer
* StopWordsRemover
* StringIndexer
* Tokenizer
* VectorAssembler
* VectorSlicer

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9755 from jkbradley/transformer-io.

d98d1cb0

[SPARK-10186][SQL] support postgre array type in JDBCRDD · d9251496

Wenchen Fan authored 9 years ago

Add ARRAY support to `PostgresDialect`.

Nested ARRAY is not allowed for now because it's hard to get the array dimension info. See http://stackoverflow.com/questions/16619113/how-to-get-array-base-type-in-postgres-via-jdbc

Thanks to mariusvniekerk for the initial work!

Close https://github.com/apache/spark/pull/9137

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9662 from cloud-fan/postgre.

d9251496

[SPARK-8658][SQL][FOLLOW-UP] AttributeReference's equals method compares all the members · 0158ff77

gatorsmile authored 9 years ago

Based on the comment of cloud-fan in https://github.com/apache/spark/pull/9216, update the AttributeReference's hashCode function by including the hashCode of the other attributes including name, nullable and qualifiers.

Here, I am not 100% sure if we should include name in the hashCode calculation, since the original hashCode calculation does not include it.
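For context, the contract at stake is that objects that are equal must have equal hash codes, so a `hashCode` built from the same fields that `equals` compares is always consistent. The sketch below uses a hypothetical `Attr` class with an assumed field set, not Spark's `AttributeReference`, and the constants 17 and 37 are arbitrary.

```scala
// Hypothetical stand-in for an attribute; field names are assumptions for illustration.
object AttributeHashSketch {
  final class Attr(
      val name: String,
      val exprId: Long,
      val nullable: Boolean,
      val qualifiers: Seq[String]) {

    // equals compares every member.
    override def equals(other: Any): Boolean = other match {
      case a: Attr =>
        name == a.name && exprId == a.exprId &&
          nullable == a.nullable && qualifiers == a.qualifiers
      case _ => false
    }

    // hashCode combines the same members, so equal objects always hash alike.
    override def hashCode: Int = {
      var h = 17
      h = h * 37 + exprId.hashCode
      h = h * 37 + name.hashCode
      h = h * 37 + nullable.hashCode
      h = h * 37 + qualifiers.hashCode
      h
    }
  }
}
```

Leaving a field such as `name` out of `hashCode` would still satisfy the contract (it only produces more collisions); including a field that `equals` ignores would break it.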

marmbrus cloud-fan Please review if the changes are good.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9761 from gatorsmile/hashCodeNamedExpression.

0158ff77

[SPARK-11089][SQL] Adds option for disabling multi-session in Thrift server · 7b1407c7

Cheng Lian authored 9 years ago

This PR adds a new option `spark.sql.hive.thriftServer.singleSession` for disabling multi-session support in the Thrift server.

Note that this option is added as a Spark configuration (retrieved from `SparkConf`) rather than Spark SQL configuration (retrieved from `SQLConf`). This is because all SQL configurations are session-ized. Since multi-session support is by default on, no JDBC connection can modify global configurations like the newly added one.

Author: Cheng Lian <lian@databricks.com>

Closes #9740 from liancheng/spark-11089.single-session-option.

7b1407c7

[SPARK-11679][SQL] Invoking method " apply(fields:... · e8833dd1

mayuanwen authored 9 years ago

[SPARK-11679][SQL] Invoking method " apply(fields: java.util.List[StructField])" in "StructType" gets ClassCastException

Previously, fields.toArray converted the java.util.List[StructField] into an Array[Object], which cannot be cast to Array[StructField], so invoking this method threw "java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.sql.types.StructField;"
This patch converts java.util.List[StructField] directly into Array[StructField].
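The failure mode can be reproduced without Spark. In the sketch below, `Field` is a hypothetical stand-in for `StructField`, and `safeConvert` shows one common way to obtain a correctly typed array; it is not necessarily the exact change made in this patch.

```scala
import java.util.{List => JList}

// Hypothetical reproduction; Field stands in for StructField.
object ToArraySketch {
  final case class Field(name: String)

  // The no-argument toArray() always allocates an Object[], so the downcast
  // to Field[] fails at runtime with a ClassCastException.
  def unsafeConvert(fields: JList[Field]): Array[Field] =
    fields.toArray().asInstanceOf[Array[Field]]

  // Passing a typed array makes the list allocate a Field[] instead.
  def safeConvert(fields: JList[Field]): Array[Field] =
    fields.toArray(new Array[Field](0))
}
```

JVM arrays carry their element type at runtime, which is why the cast from `Object[]` to `Field[]` is checked and rejected even though every element is a `Field`.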

Author: mayuanwen <mayuanwen@qiyi.com>

Closes #9649 from jackieMaKing/Spark-11679.

e8833dd1

[SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors · 21fac543

Xiangrui Meng authored 9 years ago

This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #9751 from mengxr/SPARK-11766.

21fac543

[SPARK-11695][CORE] Set s3a credentials · cc567b66

Chris Bannister authored 9 years ago

Set s3a credentials when creating a new default hadoop configuration.

Author: Chris Bannister <chris.bannister@swiftkey.com>

Closes #9663 from Zariel/set-s3a-creds.

cc567b66

[SPARK-11744][LAUNCHER] Fix print version throw exception when using pyspark shell · 6fc2740e

jerryshao authored 9 years ago

Exception details can be seen here (https://issues.apache.org/jira/browse/SPARK-11744).

Author: jerryshao <sshao@hortonworks.com>

Closes #9721 from jerryshao/SPARK-11744.

6fc2740e

[SPARK-11779][DOCS] Fix reference to deprecated MESOS_NATIVE_LIBRARY · 15cc36b7

Philipp Hoffmann authored 9 years ago

MESOS_NATIVE_LIBRARY was renamed in favor of MESOS_NATIVE_JAVA_LIBRARY. This commit fixes the reference in the documentation.

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #9768 from philipphoffmann/patch-2.

15cc36b7

[SPARK-11751] Doc describe error in the "Spark Streaming Programming Guide" page · 7276fa9a

yangping.wu authored 9 years ago

In the **[Task Launching Overheads](http://spark.apache.org/docs/latest/streaming-programming-guide.html#task-launching-overheads)** section,
>Task Serialization: Using Kryo serialization for serializing tasks can reduce the task sizes, and therefore reduce the time taken to send them to the slaves.

As we know, **Task Serialization** is configured by the **spark.closure.serializer** parameter, but currently only the Java serializer is supported. If we set **spark.closure.serializer** to **org.apache.spark.serializer.KryoSerializer**, an exception is thrown.

Author: yangping.wu <wyphao.2007@163.com>

Closes #9734 from 397090770/397090770-patch-1.

7276fa9a

[SPARK-11191][SQL][FOLLOW-UP] Cleans up unnecessary anonymous HiveFunctionRegistry · fa13301a

Cheng Lian authored 9 years ago

According to discussion in PR #9664, the anonymous `HiveFunctionRegistry` in `HiveContext` can be removed now.

Author: Cheng Lian <lian@databricks.com>

Closes #9737 from liancheng/spark-11191.follow-up.

fa13301a

[MINOR] [SQL] Fix randomly generated ArrayData in RowEncoderSuite · d79d8b08

Liang-Chi Hsieh authored 9 years ago

The randomly generated ArrayData used for the UDT `ExamplePoint` in `RowEncoderSuite` sometimes doesn't have enough elements. In this case, this test will fail. This patch is to fix it.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9757 from viirya/fix-randomgenerated-udt.

d79d8b08

[SPARK-11447][SQL] change NullType to StringType during binaryComparison... · e01865af

Kevin Yu authored 9 years ago

[SPARK-11447][SQL] change NullType to StringType during binaryComparison between NullType and StringType

While executing the PromoteStrings rule, if one side of a binaryComparison is StringType and the other side is not, the current code promotes (casts) the StringType side to DoubleType; if the string does not contain a number, the result is null. So when the comparison is <=> (NULL-safe equal) against Null, nothing is filtered out, which causes the problem reported in this JIRA.

I propose the changes in this PR; could you review my code changes?

This problem only happens for <=>; other operators work fine.

scala> val filteredDF = df.filter(df("column") > (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> val filteredDF = df.filter(df("column") === (new Column(Literal(null))))
filteredDF: org.apache.spark.sql.DataFrame = [column: string]

scala> filteredDF.show
+------+
|column|
+------+
+------+

scala> df.registerTempTable("DF")

scala> sqlContext.sql("select * from DF where 'column' = NULL")
res27: org.apache.spark.sql.DataFrame = [column: string]

scala> res27.show
+------+
|column|
+------+
+------+
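The interaction between the string-to-double cast and `<=>` described above can be modelled in plain Scala, with SQL NULL represented as `None`. This is a sketch of the semantics only; `sqlEq`, `nullSafeEq`, and `castToDouble` are hypothetical helpers, not Catalyst code.

```scala
// Hypothetical model of SQL comparison semantics; NULL is represented as None.
object NullSafeEqualSketch {
  /** `=`: any comparison involving NULL yields NULL. */
  def sqlEq[A](l: Option[A], r: Option[A]): Option[Boolean] =
    for (a <- l; b <- r) yield a == b

  /** `<=>`: NULL <=> NULL is true, NULL <=> value is false. */
  def nullSafeEq[A](l: Option[A], r: Option[A]): Boolean = l == r

  /** Casting a string to double: non-numeric strings become NULL. */
  def castToDouble(s: Option[String]): Option[Double] =
    s.flatMap(v => scala.util.Try(v.trim.toDouble).toOption)
}
```

For a column holding "abc", casting first turns the value into NULL, so `nullSafeEq(castToDouble(Some("abc")), None)` is true and the row wrongly matches; comparing as strings, `nullSafeEq(Some("abc"), None)` is false. With plain `=`, both variants yield NULL, which is why only `<=>` exposes the problem.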

Author: Kevin Yu <qyu@us.ibm.com>

Closes #9720 from kevinyu98/working_on_spark-11447.

e01865af

[SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata... · 75d20207

hyukjinkwon authored 9 years ago

[SPARK-11694][FOLLOW-UP] Clean up imports, use a common function for metadata and add a test for FIXED_LEN_BYTE_ARRAY

As discussed in https://github.com/apache/spark/pull/9660 and https://github.com/apache/spark/pull/9060, I cleaned up unused imports, added a test for fixed-length byte array and used a common function for writing metadata for Parquet.

For the test for fixed-length byte array, I have tested and checked the encoding types with [parquet-tools](https://github.com/Parquet/parquet-mr/tree/master/parquet-tools).

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #9754 from HyukjinKwon/SPARK-11694-followup.

75d20207

Nov 16, 2015

[SPARK-11768][SPARK-9196][SQL] Support now function in SQL (alias for current_timestamp). · fbad920d

Reynold Xin authored 9 years ago

This patch adds an alias for current_timestamp (now function).

Also fixes SPARK-9196 to re-enable the test case for current_timestamp.

Author: Reynold Xin <rxin@databricks.com>

Closes #9753 from rxin/SPARK-11768.

fbad920d

[SPARK-11617][NETWORK] Fix leak in TransportFrameDecoder. · 540bf58f

Marcelo Vanzin authored 9 years ago

The code was using the wrong API to add data to the internal composite
buffer, causing buffers to leak in certain situations. Use the right
API and enhance the tests to catch memory leaks.

Also, avoid reusing the composite buffers when downstream handlers keep
references to them; this seems to cause a few different issues even though
the ref counting code seems to be correct, so instead pay the cost of copying
a few bytes when that situation happens.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9619 from vanzin/SPARK-11617.

540bf58f

[SPARK-11612][ML] Pipeline and PipelineModel persistence · 1c5475f1

Joseph K. Bradley authored 9 years ago

Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable.
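The "all stages must be Writable" rule can be sketched as a check performed before anything is written. The traits and classes below are hypothetical and unrelated to Spark ML's real `Writable` API; each stage is "written" to a string purely for illustration.

```scala
// Hypothetical sketch: a pipeline can be saved only if every stage is writable.
object PipelineWritableSketch {
  trait Stage { def uid: String }
  trait Writable { def write(): String }

  final case class WritableStage(uid: String) extends Stage with Writable {
    def write(): String = s"stage:$uid"
  }
  final case class OpaqueStage(uid: String) extends Stage

  /** Serializes all stages, or reports the first stage that cannot be written. */
  def save(stages: Seq[Stage]): Either[String, Seq[String]] =
    stages.collectFirst { case s if !s.isInstanceOf[Writable] => s.uid } match {
      case Some(uid) => Left(s"Stage $uid is not Writable")
      case None      => Right(stages.collect { case w: Writable => w.write() })
    }
}
```

Checking every stage up front means a failure is reported before any partial output is produced.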

Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9674 from jkbradley/pipeline-io.

1c5475f1

[EXAMPLE][MINOR] Add missing awaitTermination in click stream example · bd10eb81

jerryshao authored 9 years ago

Author: jerryshao <sshao@hortonworks.com>

Closes #9730 from jerryshao/clickstream-fix.

bd10eb81

[SPARK-11710] Document new memory management model · 33a0ec93

Andrew Or authored 9 years ago

Author: Andrew Or <andrew@databricks.com>

Closes #9676 from andrewor14/memory-management-docs.

33a0ec93

[SPARK-11480][CORE][WEBUI] Wrong callsite is displayed when using AsyncRDDActions#takeAsync · 30f3cfda

Kousuke Saruta authored 9 years ago

When we call AsyncRDDActions#takeAsync, DAGScheduler#runJob is actually called from another thread, so we cannot get the proper callsite information.

The following screenshots show the UI before and after this patch is applied.

Before:
<img width="1268" alt="2015-11-04 1 26 40" src="https://cloud.githubusercontent.com/assets/4736016/10914069/0ffc1306-8294-11e5-8e89-c4fadf58dd12.png">
<img width="1258" alt="2015-11-04 1 26 52" src="https://cloud.githubusercontent.com/assets/4736016/10914070/0ffe84ce-8294-11e5-8b2a-69d36276bedb.png">

After:
<img width="1268" alt="2015-11-04 0 48 07" src="https://cloud.githubusercontent.com/assets/4736016/10914080/1d8cfb7a-8294-11e5-9e09-ede25c2563e8.png">
<img width="1269" alt="2015-11-04 0 48 26" src="https://cloud.githubusercontent.com/assets/4736016/10914081/1d934e3a-8294-11e5-8b5e-e3dc37aaced3.png">

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #9437 from sarutak/SPARK-11480.

30f3cfda

[SPARKR][HOTFIX] Disable flaky SparkR package build test · ea6f53e4

Shivaram Venkataraman authored 9 years ago

See https://github.com/apache/spark/pull/9390#issuecomment-157160063 and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #9744 from shivaram/sparkr-package-test-disable.

ea6f53e4

[SPARK-11625][SQL] add java test for typed aggregate · fd14936b

Wenchen Fan authored 9 years ago

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9591 from cloud-fan/agg-test.

fd14936b

[SPARK-8658][SQL] AttributeReference's equals method compares all the members · 75ee12f0

gatorsmile authored 9 years ago

This fix is to change the equals method to check all of the specified fields for equality of AttributeReference.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9216 from gatorsmile/namedExpressEqual.

75ee12f0

[SPARK-11553][SQL] Primitive Row accessors should not convert null to default value · 31296628

Bartlomiej Alberski authored 9 years ago

Getters for types extending AnyVal return the default value when the field value is null, instead of throwing an NPE. Please check the comments on the SPARK-11553 issue for more details.
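The underlying Scala behaviour is that unboxing a null reference with `asInstanceOf[Int]` silently yields 0. The sketch below contrasts that with an accessor that checks for null first; `SimpleRow`, `getIntLenient`, and `getInt` are hypothetical names, not Spark's `Row` implementation.

```scala
// Hypothetical row with untyped storage, illustrating null handling in primitive getters.
object RowAccessorSketch {
  final class SimpleRow(values: Array[Any]) {
    def isNullAt(i: Int): Boolean = values(i) == null

    // Unboxing a null reference yields the primitive default, hiding the null.
    def getIntLenient(i: Int): Int = values(i).asInstanceOf[Int]

    // Strict variant: fail loudly instead of converting null to a default.
    def getInt(i: Int): Int = {
      if (isNullAt(i)) throw new NullPointerException(s"Value at index $i is null")
      values(i).asInstanceOf[Int]
    }
  }
}
```

Callers that expect nulls can test `isNullAt` first, while the strict getter guarantees that a returned 0 really was stored as 0.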

Author: Bartlomiej Alberski <bartlomiej.alberski@allegrogroup.com>

Closes #9642 from alberskib/bugfix/SPARK-11553.

31296628

[SPARK-11742][STREAMING] Add the failure info to the batch lists · bcea0bfd

Shixiong Zhu authored 9 years ago

<img width="1365" alt="screen shot 2015-11-13 at 9 57 43 pm" src="https://cloud.githubusercontent.com/assets/1000778/11162322/9b88e204-8a51-11e5-8c57-a44889cab713.png">

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9711 from zsxwing/failure-info.

bcea0bfd

Revert "[SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of... · 3c025087

Davies Liu authored 9 years ago

Revert "[SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of RoaringBitmap to reduce memory usage"

This reverts commit e209fa27.

3c025087

[SPARK-11390][SQL] Query plan with/without filterPushdown indistinguishable · 985b38dd

Zee Chen authored 9 years ago

Propagate pushed filters to PhysicalRDD in DataSourceStrategy.apply

Author: Zee Chen <zeechen@us.ibm.com>

Closes #9679 from zeocio/spark-11390.

985b38dd