Commits · 87f82a5fb9c4350a97c761411069245f07aad46f · cs525-sp18-g07 / spark

Oct 25, 2015

[SPARK-11127][STREAMING] upgrade AWS SDK and Kinesis Client Library (KCL) · 87f82a5f

Xiangrui Meng authored 9 years ago

AWS SDK 1.9.40 is the latest 1.9.x release. KCL 1.5.1 is the latest release that using AWS SDK 1.9.x. The main goal is to have Kinesis consumer be able to read messages generated from Kinesis Producer Library (KPL). The API should be compatible with old versions.

tdas brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #9153 from mengxr/SPARK-11127.

87f82a5f

[SPARK-10984] Simplify *MemoryManager class structure · 85e654c5

Josh Rosen authored 9 years ago

This patch refactors the MemoryManager class structure. After #9000, Spark had the following classes:

- MemoryManager
- StaticMemoryManager
- ExecutorMemoryManager
- TaskMemoryManager
- ShuffleMemoryManager

This is fairly confusing. To simplify things, this patch consolidates several of these classes:

- ShuffleMemoryManager and ExecutorMemoryManager were merged into MemoryManager.
- TaskMemoryManager is moved into Spark Core.

**Key changes and tasks**:

- [x] Merge ExecutorMemoryManager into MemoryManager.
  - [x] Move pooling logic into Allocator.
- [x] Move TaskMemoryManager from `spark-unsafe` to `spark-core`.
- [x] Refactor the existing Tungsten TaskMemoryManager interactions so Tungsten code use only this and not both this and ShuffleMemoryManager.
- [x] Refactor non-Tungsten code to use the TaskMemoryManager instead of ShuffleMemoryManager.
- [x] Merge ShuffleMemoryManager into MemoryManager.
  - [x] Move code
  - [x] ~~Simplify 1/n calculation.~~ **Will defer to followup, since this needs more work.**
- [x] Port ShuffleMemoryManagerSuite tests.
- [x] Move classes from `unsafe` package to `memory` package.
- [ ] Figure out how to handle the hacky use of the memory managers in HashedRelation's broadcast variable construction.
- [x] Test porting and cleanup: several tests relied on mock functionality (such as `TestShuffleMemoryManager.markAsOutOfMemory`) which has been changed or broken during the memory manager consolidation
  - [x] AbstractBytesToBytesMapSuite
  - [x] UnsafeExternalSorterSuite
  - [x] UnsafeFixedWidthAggregationMapSuite
  - [x] UnsafeKVExternalSorterSuite

**Compatiblity notes**:

- This patch introduces breaking changes in `ExternalAppendOnlyMap`, which is marked as `DevloperAPI` (likely for legacy reasons): this class now cannot be used outside of a task.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9127 from JoshRosen/SPARK-10984.

85e654c5

[SPARK-10891][STREAMING][KINESIS] Add MessageHandler to... · 63accc79

Burak Yavuz authored 9 years ago

[SPARK-10891][STREAMING][KINESIS] Add MessageHandler to KinesisUtils.createStream similar to Direct Kafka

This PR allows users to map a Kinesis `Record` to a generic `T` when creating a Kinesis stream. This is particularly useful, if you would like to do extra work with Kinesis metadata such as sequence number, and partition key.

TODO:
 - [x] add tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #8954 from brkyvz/kinesis-handler.

63accc79

[SPARK-11287] Fixed class name to properly start TestExecutor from deploy.client.TestClient · 80279ac1

Bryan Cutler authored 9 years ago

Executing deploy.client.TestClient fails due to bad class name for TestExecutor in ApplicationDescription.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9255 from BryanCutler/fix-TestClient-classname-SPARK-11287.

80279ac1

[SPARK-6428][SQL] Removed unnecessary typecasts in MutableInt, MutableDouble etc. · 92b9c5ed

Alexander Slesarenko authored 9 years ago

marmbrus rxin I believe these typecasts are not required in the presence of explicit return types.

Author: Alexander Slesarenko <avslesarenko@gmail.com>

Closes #9262 from aslesarenko/remove-typecasts.

92b9c5ed

[SPARK-11299][DOC] Fix link to Scala DataFrame Functions reference · b67dc6a4

Josh Rosen authored 9 years ago

The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9269 from JoshRosen/SPARK-11299.

b67dc6a4

Oct 24, 2015

Fix typos · 146da0d8

Jacek Laskowski authored 9 years ago

Two typos squashed.

BTW Let me know how to proceed with other typos if I ran across any. I don't feel well to leave them aside as much as sending pull requests with such tiny changes. Guide me.

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9250 from jaceklaskowski/typos-hunting.

146da0d8

[SPARK-11264] bin/spark-class can't find assembly jars with certain GREP_OPTIONS set · 28132ceb

Jeffrey Naisbitt authored 9 years ago

Temporarily remove GREP_OPTIONS if set in bin/spark-class.

Some GREP_OPTIONS will modify the output of the grep commands that are looking for the assembly jars.
For example, if the -n option is specified, the grep output will look like:
5:spark-assembly-1.5.1-hadoop2.4.0.jar

This will not match the regular expressions, and so the jar files will not be found. We could improve the regular expression to handle this case and trim off extra characters, but it is difficult to know which options may or may not be set. Unsetting GREP_OPTIONS within the script handles all the cases and gives the desired output.

Author: Jeffrey Naisbitt <jnaisbitt@familysearch.org>

Closes #9231 from naisbitt/unset-GREP_OPTIONS.

28132ceb

[SPARK-11245] update twitter4j to 4.0.4 version · e5bc8c27

dima authored 9 years ago

update twitter4j to 4.0.4 version
https://issues.apache.org/jira/browse/SPARK-11245

Author: dima <pronix.service@gmail.com>

Closes #9221 from pronix/twitter4j_update.

e5bc8c27

[SPARK-11125] [SQL] Uninformative exception when running spark-sql witho… · ffed0049

Jeff Zhang authored 9 years ago

…ut building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

This is the exception after this patch. Please help review.
```
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	... 21 more
Failed to load hive class.
You need to build Spark with -Phive and -Phive-thriftserver.
```

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9134 from zjffdu/SPARK-11125.

ffed0049

Oct 23, 2015

[SPARK-11294][SPARKR] Improve R doc for read.df, write.df, saveAsTable · 5e458125

felixcheung authored 9 years ago

Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable.

Several text issues:
![image](https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png)
- text collapsed into a single paragraph
- text truncated at 2 places, eg. "overwrite: Existing data is expected to be overwritten by the contents of error:"

shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9261 from felixcheung/rdocreadwritedf.

5e458125

[SPARK-10971][SPARKR] RRunner should allow setting path to Rscript. · 2462dbcc

Sun Rui authored 9 years ago

Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes.

The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).

BTW, [envrionment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate R shell on the local host.

For your information, PYSPARK has two environment variables serving simliar purpose:
PYSPARK_PYTHON Python binary executable to use for PySpark in both driver and workers (default is `python`).
PYSPARK_DRIVER_PYTHON Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).
pySpark use the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the python executable for a python script.

Author: Sun Rui <rui.sun@intel.com>

Closes #9179 from sun-rui/SPARK-10971.

2462dbcc

[SPARK-11194] [SQL] Use MutableURLClassLoader for the classLoader in IsolatedClientLoader. · 4725cb98
Yin Huai authored 9 years ago
```
https://issues.apache.org/jira/browse/SPARK-11194

Author: Yin Huai <yhuai@databricks.com>

Closes #9170 from yhuai/SPARK-11194.
```
4725cb98

[SPARK-11274] [SQL] Text data source support for Spark SQL. · e1a897b6

Reynold Xin authored 9 years ago

This adds API for reading and writing text files, similar to SparkContext.textFile and RDD.saveAsTextFile.
```
SQLContext.read.text("/path/to/something.txt")
DataFrame.write.text("/path/to/write.txt")
```

Using the new Dataset API, this also supports
```
val ds: Dataset[String] = SQLContext.read.text("/path/to/something.txt").as[String]
```

Author: Reynold Xin <rxin@databricks.com>

Closes #9240 from rxin/SPARK-11274.

e1a897b6

[SPARK-6723] [MLLIB] Model import/export for ChiSqSelector · 4e38defa

Jayant Shekar authored 9 years ago

This is a PR for Parquet-based model import/export.

* Added save/load for ChiSqSelectorModel
* Updated the test suite ChiSqSelectorSuite

Author: Jayant Shekar <jayant@user-MBPMBA-3.local>

Closes #6785 from jayantshekhar/SPARK-6723.

4e38defa

[SPARK-10277] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.regression · 282a15f7
Yu ISHIKAWA authored 9 years ago
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8684 from yu-iskw/SPARK-10277.
```
282a15f7

[SPARK-10382] Make example code in user guide testable · 03ccb220

Xusen Yin authored 9 years ago

A POC code for making example code in user guide testable.

mengxr We still need to talk about the labels in code.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9109 from yinxusen/SPARK-10382.

03ccb220

[SPARK-11243][SQL] zero out padding bytes in UnsafeRow · 487d409e

Davies Liu authored 9 years ago

For nested StructType, the underline buffer could be used for others before, we should zero out the padding bytes for those primitive types that have less than 8 bytes.

cc cloud-fan

Author: Davies Liu <davies@databricks.com>

Closes #9217 from davies/zero_out.

487d409e

Fix typo "Received" to "Receiver" in streaming-kafka-integration.md · 16dc9f34

Rohan Bhanderi authored 9 years ago

Removed typo on line 8 in markdown : "Received" -> "Receiver"

Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>

Closes #9242 from RohanBhanderi/patch-1.

16dc9f34

[SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package · cdea0174
Reynold Xin authored 9 years ago
```
Author: Reynold Xin <rxin@databricks.com>

Closes #9239 from rxin/types-private.
```
cdea0174

Fix a (very tiny) typo · b1c1597e

Jacek Laskowski authored 9 years ago

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9230 from jaceklaskowski/utils-seconds-typo.

b1c1597e

[SPARK-11134][CORE] Increase LauncherBackendSuite timeout. · fa6a4fbf

Marcelo Vanzin authored 9 years ago

This test can take a little while to finish on slow / loaded machines.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9235 from vanzin/SPARK-11134.

fa6a4fbf

Oct 22, 2015

[SPARK-11098][CORE] Add Outbox to cache the sending messages to resolve the message disorder issue · a88c66ca

zsxwing authored 9 years ago

The current NettyRpc has a message order issue because it uses a thread pool to send messages. E.g., running the following two lines in the same thread,

```
ref.send("A")
ref.send("B")
```

The remote endpoint may see "B" before "A" because sending "A" and "B" are in parallel.
To resolve this issue, this PR added an outbox for each connection, and if we are connecting to the remote node when sending messages, just cache the sending messages in the outbox and send them one by one when the connection is established.

Author: zsxwing <zsxwing@gmail.com>

Closes #9197 from zsxwing/rpc-outbox.

a88c66ca

[SPARK-11251] Fix page size calculation in local mode · 34e71c6d

Andrew Or authored 9 years ago

```
// My machine only has 8 cores
$ bin/spark-shell --master local[32]
scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()

Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
```

Author: Andrew Or <andrew@databricks.com>

Closes #9209 from andrewor14/fix-local-page-size.

34e71c6d

[SPARK-7021] Add JUnit output for Python unit tests · 163d53e8
Gábor Lipták authored 9 years ago
```
WIP

Author: Gábor Lipták <gliptak@gmail.com>

Closes #8323 from gliptak/SPARK-7021.
```
163d53e8

[SPARK-11116][SQL] First Draft of Dataset API · 53e83a3a

Michael Armbrust authored 9 years ago

*This PR adds a new experimental API to Spark, tentitively named Datasets.*

A `Dataset` is a strongly-typed collection of objects that can be transformed in parallel using functional or relational operations.  Example usage is as follows:

### Functional
```scala
> val ds: Dataset[Int] = Seq(1, 2, 3).toDS()
> ds.filter(_ % 1 == 0).collect()
res1: Array[Int] = Array(1, 2, 3)
```

### Relational
```scala
scala> ds.toDF().show()
+-----+
|value|
+-----+
|    1|
|    2|
|    3|
+-----+

> ds.select(expr("value + 1").as[Int]).collect()
res11: Array[Int] = Array(2, 3, 4)
```

## Comparison to RDDs
 A `Dataset` differs from an `RDD` in the following ways:
  - The creation of a `Dataset` requires the presence of an explicit `Encoder` that can be
    used to serialize the object into a binary format.  Encoders are also capable of mapping the
    schema of a given object to the Spark SQL type system.  In contrast, RDDs rely on runtime
    reflection based serialization.
  - Internally, a `Dataset` is represented by a Catalyst logical plan and the data is stored
    in the encoded form.  This representation allows for additional logical operations and
    enables many operations (sorting, shuffling, etc.) to be performed without deserializing to
    an object.

A `Dataset` can be converted to an `RDD` by calling the `.rdd` method.

## Comparison to DataFrames

A `Dataset` can be thought of as a specialized DataFrame, where the elements map to a specific
JVM object type, instead of to a generic `Row` container. A DataFrame can be transformed into
specific Dataset by calling `df.as[ElementType]`.  Similarly you can transform a strongly-typed
`Dataset` to a generic DataFrame by calling `ds.toDF()`.

## Implementation Status and TODOs

This is a rough cut at the least controversial parts of the API.  The primary purpose here is to get something committed so that we can better parallelize further work and get early feedback on the API.  The following is being deferred to future PRs:
 - Joins and Aggregations (prototype here https://github.com/apache/spark/commit/f11f91e6f08c8cf389b8388b626cd29eec32d937)
 - Support for Java

Additionally, the responsibility for binding an encoder to a given schema is currently done in a fairly ad-hoc fashion.  This is an internal detail, and what we are doing today works for the cases we care about.  However, as we add more APIs we'll probably need to do this in a more principled way (i.e. separate resolution from binding as we do in DataFrames).

## COMPATIBILITY NOTE
Long term we plan to make `DataFrame` extend `Dataset[Row]`.  However,
making this change to che class hierarchy would break the function signatures for the existing
function operations (map, flatMap, etc).  As such, this class should be considered a preview
of the final API.  Changes will be made to the interface after Spark 1.6.

Author: Michael Armbrust <michael@databricks.com>

Closes #9190 from marmbrus/dataset-infra.

53e83a3a

[SPARK-11242][SQL] In conf/spark-env.sh.template SPARK_DRIVER_MEMORY is documented incorrectly · 188ea348
guoxi authored 9 years ago
```
Minor fix on the comment

Author: guoxi <guoxi@us.ibm.com>

Closes #9201 from xguo27/SPARK-11242.
```
188ea348

[SPARK-9735][SQL] Respect the user specified schema than the infer partition... · d4950e6b

Cheng Hao authored 9 years ago

[SPARK-9735][SQL] Respect the user specified schema than the infer partition schema for HadoopFsRelation

To enable the unit test of `hadoopFsRelationSuite.Partition column type casting`. It previously threw exception like below, as we treat the auto infer partition schema with higher priority than the user specified one.

```
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
07:44:01.344 ERROR org.apache.spark.executor.Executor: Exception in task 14.0 in stage 3.0 (TID 206)
java.lang.ClassCastException: java.lang.Integer cannot be cast to org.apache.spark.unsafe.types.UTF8String
at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getUTF8String(rows.scala:45)
at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getUTF8String(rows.scala:220)
at org.apache.spark.sql.catalyst.expressions.JoinedRow.getUTF8String(JoinedRow.scala:102)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(generated.java:62)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$17$$anonfun$apply$9.apply(DataSourceStrategy.scala:212)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:903)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1846)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
```

Author: Cheng Hao <hao.cheng@intel.com>

Closes #8026 from chenghao-intel/partition_discovery.

d4950e6b

[SPARK-11163] Remove unnecessary addPendingTask calls. · 3535b91d

Kay Ousterhout authored 9 years ago

This commit removes unnecessary calls to addPendingTask in
TaskSetManager.executorLost. These calls are unnecessary: for
tasks that are still pending and haven't been launched, they're
still in all of the correct pending lists, so calling addPendingTask
has no effect. For tasks that are currently running (which may still be
in the pending lists, depending on how they were scheduled), we call
addPendingTask in handleFailedTask, so the calls at the beginning
of executorLost are redundant.

I think these calls are left over from when we re-computed the locality
levels in addPendingTask; now that we call recomputeLocality separately,
I don't think these are necessary.

Now that those calls are removed, the readding parameter in addPendingTask
is no longer necessary, so this commit also removes that parameter.

markhamstra can you take a look at this?

cc vanzin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #9154 from kayousterhout/SPARK-11163.

3535b91d

[SPARK-11232][CORE] Use 'offer' instead of 'put' to make sure calling send won't be interrupted · 7bb6d31c

zsxwing authored 9 years ago

The current `NettyRpcEndpointRef.send` can be interrupted because it uses `LinkedBlockingQueue.put`, which may hang the application.

Image the following execution order:

  | thread 1: TaskRunner.kill | thread 2: TaskRunner.run
------------- | ------------- | -------------
1 | killed = true |
2 |  | if (killed) {
3 |  | throw new TaskKilledException
4 |  | case _: TaskKilledException  _: InterruptedException if task.killed =>
5 | task.kill(interruptThread): interruptThread is true |
6 | | execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled))
7 | | localEndpoint.send(StatusUpdate(taskId, state, serializedData)): in LocalBackend

Then `localEndpoint.send(StatusUpdate(taskId, state, serializedData))` will throw `InterruptedException`. This will prevent the executor from updating the task status and hang the application.

An failure caused by the above issue here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44062/consoleFull

Since `receivers` is an unbounded `LinkedBlockingQueue`, we can just use `LinkedBlockingQueue.offer` to resolve this issue.

Author: zsxwing <zsxwing@gmail.com>

Closes #9198 from zsxwing/dont-interrupt-send.

7bb6d31c

[SPARK-11216][SQL][FOLLOW-UP] add encoder/decoder for external row · 42d225f4

Wenchen Fan authored 9 years ago

address comments in https://github.com/apache/spark/pull/9184

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9212 from cloud-fan/encoder.

42d225f4

[SPARK-10708] Consolidate sort shuffle implementations · f6d06adf

Josh Rosen authored 9 years ago

There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.

f6d06adf

[SPARK-11244][SPARKR] sparkR.stop() should remove SQLContext · 94e2064f

Forest Fang authored 9 years ago

SparkR should remove `.sparkRSQLsc` and `.sparkRHivesc` when `sparkR.stop()` is called. Otherwise even when SparkContext is reinitialized, `sparkRSQL.init` returns the stale copy of the object and complains:

```r
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sparkR.stop()
sc <- sparkR.init("local")
sqlContext <- sparkRSQL.init(sc)
sqlContext
```
producing
```r
Error in callJMethod(x, "getClass") :
  Invalid jobj 1. If SparkR was restarted, Spark operations need to be re-executed.
```

I have added the check and removal only when SparkContext itself is initialized. I have also added corresponding test for this fix. Let me know if you want me to move the test to SQL test suite instead.

p.s. I tried lint-r but ended up a lots of errors on existing code.

Author: Forest Fang <forest.fang@outlook.com>

Closes #9205 from saurfang/sparkR.stop.

94e2064f

[SPARK-11121][CORE] Correct the TaskLocation type · c03b6d11

zhichao.li authored 9 years ago

Correct the logic to return `HDFSCacheTaskLocation` instance when the input `str` is a in memory location.

Author: zhichao.li <zhichao.li@intel.com>

Closes #9096 from zhichao-li/uselessBranch.

c03b6d11

Oct 21, 2015

[SPARK-11243][SQL] output UnsafeRow from columnar cache · 1d973327

Davies Liu authored 9 years ago

This PR change InMemoryTableScan to output UnsafeRow, and optimize the unrolling and scanning by coping the bytes for var-length types between UnsafeRow and ByteBuffer directly without creating the wrapper objects. When scanning the decimals in TPC-DS store_sales table, it's 80% faster (copy it as long without create Decimal objects).

Author: Davies Liu <davies@databricks.com>

Closes #9203 from davies/unsafe_cache.

1d973327

[SPARK-9392][SQL] Dataframe drop should work on unresolved columns · 40a10d76

Yanbo Liang authored 9 years ago

Dataframe drop should work on unresolved columns

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8821 from yanboliang/spark-9392.

40a10d76

Minor cleanup of ShuffleMapStage.outputLocs code. · 555b2086

Reynold Xin authored 9 years ago

I was looking at this code and found the documentation to be insufficient. I added more documentation, and refactored some relevant code path slightly to improve encapsulation. There are more that I want to do, but I want to get these changes in before doing more work.

My goal is to reduce exposing internal fields directly in ShuffleMapStage to improve encapsulation. After this change, DAGScheduler no longer directly writes outputLocs. There are still 3 places that reads outputLocs directly, but we can change those later.

Author: Reynold Xin <rxin@databricks.com>

Closes #9175 from rxin/stage-cleanup.

555b2086

[SPARK-10151][SQL] Support invocation of hive macro · f481090a

navis.ryu authored 9 years ago

Macro in hive (which is GenericUDFMacro) contains real function inside of it but it's not conveyed to tasks, resulting null-pointer exception.

Author: navis.ryu <navis@apache.org>

Closes #8354 from navis/SPARK-10151.

f481090a

[SPARK-8654][SQL] Analysis exception when using NULL IN (...) : invalid cast · dce2f8c9

Dilip Biswal authored 9 years ago

In the analysis phase , while processing the rules for IN predicate, we
compare the in-list types to the lhs expression type and generate
cast operation if necessary. In the case of NULL [NOT] IN expr1 , we end up
generating cast between in list types to NULL like cast (1 as NULL) which
is not a valid cast.

The fix is to find a common type between LHS and RHS expressions and cast
all the expression to the common type.

Author: Dilip Biswal <dbiswal@us.ibm.com>

This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>

Closes #9036 from dilipbiswal/spark_8654_new.

dce2f8c9

[SPARK-11233][SQL] register cosh in function registry · 19ad1863
Shagun Sodhani authored 9 years ago
```
Author: Shagun Sodhani <sshagunsodhani@gmail.com>

Closes #9199 from shagunsodhani/proposed-fix-#11233.
```
19ad1863