Commits · 17f499920776e0e995434cfa300ff2ff38658fa8 · cs525-sp18-g07 / spark

Oct 27, 2015

[SPARK-5569][STREAMING] fix ObjectInputStreamWithLoader for supporting load array classes. · 17f49992

maxwell authored 9 years ago

When the Kafka DirectStream API is used to create a checkpoint and the saved checkpoint is restored on restart, a ClassNotFoundException occurs.

The reason for this error is that ObjectInputStreamWithLoader extends the ObjectInputStream class and overrides its resolveClass method. But instead of using Class.forName(desc, false, loader), Spark uses loader.loadClass(desc) to load the class, which does not work with array classes.

For example:
Class.forName("[Lorg.apache.spark.streaming.kafka.OffsetRange;", false, loader) works well, while loader.loadClass("[Lorg.apache.spark.streaming.kafka.OffsetRange;") throws a ClassNotFoundException.

Details of the difference between Class.forName and loader.loadClass can be found here:
http://bugs.java.com/view_bug.do?bug_id=6446627

Author: maxwell <maxwellzdm@gmail.com>
Author: DEMING ZHU <deming.zhu@linecorp.com>

Closes #8955 from maxwellzdm/master.

[SPARK-11270][STREAMING] Add improved equality testing for TopicAndPartition... · 8f888eea

Nick Evans authored 9 years ago

[SPARK-11270][STREAMING] Add improved equality testing for TopicAndPartition from the Kafka Streaming API

jerryshao tdas

I know this is kind of minor, and I know you all are busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise.

Instead of doing something like:
```
assert topic_and_partition_instance._topic == "foo"
assert topic_and_partition_instance._partition == 0
```

You can do something like:
```
assert topic_and_partition_instance == TopicAndPartition("foo", 0)
```

Before:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
False
```

After:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
True
```
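The before/after behavior above comes down to value-based equality. The following is a minimal sketch of how such equality could be written, not the actual patch: the class is a stand-in for `pyspark.streaming.kafka.TopicAndPartition`, and only the `_topic` and `_partition` attribute names are taken from the examples above.

```python
class TopicAndPartition(object):
    """Illustrative stand-in, not the real PySpark class."""

    def __init__(self, topic, partition):
        self._topic = topic
        self._partition = partition

    def __eq__(self, other):
        # Compare by value; let Python handle unrelated types.
        if isinstance(other, self.__class__):
            return (self._topic == other._topic
                    and self._partition == other._partition)
        return NotImplemented

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        # Keep hashing consistent with equality.
        return hash((self._topic, self._partition))
```

Without `__eq__`, Python falls back to identity comparison, which is why two separately constructed instances compared as unequal before the change.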

I couldn't find any tests - am I missing something?

Author: Nick Evans <me@nicolasevans.org>

Closes #9236 from manygrams/topic_and_partition_equality.

[SPARK-11276][CORE] SizeEstimator prevents class unloading · feb8d6a4

Sem Mulder authored 9 years ago

The SizeEstimator keeps a cache of ClassInfos, but this cache uses Class objects as keys, which results in strong references to the Class objects. If these classes are dynamically created, this prevents the corresponding ClassLoader from being GCed, leading to PermGen exhaustion.

We use a Map with WeakKeys to prevent this issue.
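The same idea can be sketched in Python as an analogy only; the actual fix is JVM code, and the names below (`class_info_cache`, `estimate`, `demo`) are hypothetical. A cache keyed weakly by class objects lets an entry disappear once the class itself is no longer reachable.

```python
import gc
import weakref

# Cache keyed by class objects; an entry goes away when its key class is collected.
class_info_cache = weakref.WeakKeyDictionary()


def estimate(obj):
    """Return cached per-class info, computing a toy value on first use."""
    cls = type(obj)
    if cls not in class_info_cache:
        # The value must not reference the class, or the key would stay alive.
        class_info_cache[cls] = {"name": cls.__name__}
    return class_info_cache[cls]


def demo():
    # Simulate a dynamically created class.
    Dynamic = type("Dynamic", (object,), {})
    estimate(Dynamic())
    size_while_alive = len(class_info_cache)
    del Dynamic
    gc.collect()  # classes sit in reference cycles, so force a collection
    return size_while_alive, len(class_info_cache)
```

With an ordinary dictionary, the entry (and therefore the class) would stay alive for as long as the cache does.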

Author: Sem Mulder <sem.mulder@site2mobile.com>

Closes #9244 from SemMulder/fix-sizeestimator-classunloading.

[SPARK-11297] Add new code tags · d77d198f

Xusen Yin authored 9 years ago

mengxr https://issues.apache.org/jira/browse/SPARK-11297

Add new code tags to keep the same look and feel as the previous documents.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9265 from yinxusen/SPARK-11297.

[SPARK-10654][MLLIB] Add columnSimilarities to IndexedRowMatrix · 8b292b19

Reza Zadeh authored 9 years ago

Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix.

With a test.

Author: Reza Zadeh <reza@databricks.com>

Closes #8792 from rezazadeh/colsims.

Oct 26, 2015

[SPARK-11184][MLLIB] Declare most of .mllib code not-Experimental · 3cac6614

Sean Owen authored 9 years ago

Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier

Author: Sean Owen <sowen@cloudera.com>

Closes #9169 from srowen/SPARK-11184.

[SPARK-10271][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.clustering · 5d4f6abe

noelsmith authored 9 years ago

Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
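A decorator of this kind can be sketched as follows. This is a simplified illustration rather than the code from the patch: it ignores docstring indentation, and the decorated function names and the version string are made up.

```python
def since(version):
    """Decorator factory that appends a ``versionadded`` note to a docstring."""
    def decorator(func):
        doc = func.__doc__ or ""  # tolerate functions without a docstring
        note = ".. versionadded:: %s" % version
        func.__doc__ = (doc.rstrip() + "\n\n" + note) if doc else note
        return func
    return decorator


@since("1.5.0")  # illustrative version number
def documented():
    """Train a model."""


@since("1.5.0")
def undocumented():
    pass
```

The `or ""` fallback is what keeps the decorator from failing on a function whose `__doc__` is `None`.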

Author: noelsmith <mail@noelsmith.com>

Closes #8627 from noel-smith/SPARK-10271-since-mllib-clustering.

[SPARK-11289][DOC] Substitute code examples in ML features extractors with include_example · 943d4fa2

Xusen Yin authored 9 years ago

mengxr https://issues.apache.org/jira/browse/SPARK-11289

I made some changes to the ML feature extractors, i.e. TF-IDF, Word2Vec, and CountVectorizer. I added new example code in spark/examples; I hope that is the right place for those examples.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9266 from yinxusen/SPARK-11289.

[SPARK-10562] [SQL] support mixed case partitionBy column names for tables stored in metastore · a150e6c1

Wenchen Fan authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-10562

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9226 from cloud-fan/par.

[SPARK-11209][SPARKR] Add window functions into SparkR [step 1]. · dc3220ce

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #9193 from sun-rui/SPARK-11209.

[SPARK-10947] [SQL] With schema inference from JSON into a Dataframe, add... · 82464fb2

Stephen De Gennaro authored 9 years ago

[SPARK-10947] [SQL] With schema inference from JSON into a Dataframe, add option to infer all primitive object types as strings

Currently, when a schema is inferred from a JSON file using sqlContext.read.json, the primitive object types are inferred as string, long, boolean, etc.

However, if the inferred type is too specific (JSON obviously does not enforce types itself), this can cause issues with merging dataframe schemas.

This pull request adds the option "primitivesAsString" to the JSON DataFrameReader. When it is set to true (the default is false), all primitives are inferred as strings.

Below is an example usage of this new functionality.
```
val jsonDf = sqlContext.read.option("primitivesAsString", "true").json(sampleJsonFile)

scala> jsonDf.printSchema()
root
|-- bigInteger: string (nullable = true)
|-- boolean: string (nullable = true)
|-- double: string (nullable = true)
|-- integer: string (nullable = true)
|-- long: string (nullable = true)
|-- null: string (nullable = true)
|-- string: string (nullable = true)
```
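To see why overly specific inference can get in the way, here is a toy pure-Python sketch. It is not Spark's inference code: `infer_field_types` is a hypothetical helper that looks at a single flat JSON record and only mimics the effect of the option.

```python
import json


def infer_field_types(record, primitives_as_string=False):
    """Map each top-level field of one JSON record to a type name (toy version)."""
    schema = {}
    for field, value in json.loads(record).items():
        if primitives_as_string or isinstance(value, str):
            schema[field] = "string"
        elif isinstance(value, bool):  # bool must be checked before int
            schema[field] = "boolean"
        elif isinstance(value, int):
            schema[field] = "long"
        elif isinstance(value, float):
            schema[field] = "double"
        else:
            schema[field] = "string"  # null and everything else, simplified here
    return schema


file_a = '{"id": 1, "ok": true}'
file_b = '{"id": "a1", "ok": true}'
```

With the default setting the two records yield different types for `id` (`long` versus `string`); with `primitives_as_string=True` both yield the same all-string schema, which is the situation the option is meant to make easy.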

Author: Stephen De Gennaro <stepheng@realitymine.com>

Closes #9249 from stephend-realitymine/stephend-primitives.

[SPARK-11325] [SQL] Alias 'alias' in Scala's DataFrame API · d4c397a6

Nong Li authored 9 years ago

Author: Nong Li <nongli@gmail.com>

Closes #9286 from nongli/spark-11325.

[SQL][DOC] Minor document fixes in interfaces.scala · 4bb2b369

Alexander Slesarenko authored 9 years ago

rxin just noticed this while reading the code.

Author: Alexander Slesarenko <avslesarenko@gmail.com>

Closes #9284 from aslesarenko/doc-typos.

[SPARK-11258] Converting a Spark DataFrame into an R data.frame is slow / requires a lot of memory · b60aab8a

Frank Rosner authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11258

I was not able to locate an existing unit test for this function so I wrote one.

Author: Frank Rosner <frank@fam-rosner.de>

Closes #9222 from FRosner/master.

[SPARK-10979][SPARKR] Sparkrmerge: Add merge to DataFrame with R signature · 3689beb9

Narine Kokhlikyan authored 9 years ago

Add merge function to DataFrame, which supports R signature.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html

Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>

Closes #9012 from NarineK/sparkrmerge.

[SPARK-5966][WIP] Spark-submit deploy-mode cluster is not compatible with master local> · 616be29c

Kevin Yu authored 9 years ago

… master local>

Author: Kevin Yu <qyu@us.ibm.com>

Closes #9220 from kevinyu98/working_on_spark-5966.

[SPARK-11279][PYSPARK] Add DataFrame#toDF in PySpark · 05c4bdb5

Jeff Zhang authored 9 years ago

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9248 from zjffdu/SPARK-11279.

[SPARK-11253] [SQL] reset all accumulators in physical operators before execute an action · 07ced434

Wenchen Fan authored 9 years ago

With this change, our query execution listener can get the metrics correctly.

The UI still looks good after this change.
<img width="257" alt="screen shot 2015-10-23 at 11 25 14 am" src="https://cloud.githubusercontent.com/assets/3182036/10683834/d516f37e-7978-11e5-8118-343ed40eb824.png">
<img width="494" alt="screen shot 2015-10-23 at 11 25 01 am" src="https://cloud.githubusercontent.com/assets/3182036/10683837/e1fa60da-7978-11e5-8ec8-178b88f27764.png">

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9215 from cloud-fan/metric.

Oct 25, 2015

[SPARK-11127][STREAMING] upgrade AWS SDK and Kinesis Client Library (KCL) · 87f82a5f

Xiangrui Meng authored 9 years ago

AWS SDK 1.9.40 is the latest 1.9.x release. KCL 1.5.1 is the latest release that uses AWS SDK 1.9.x. The main goal is to enable the Kinesis consumer to read messages generated by the Kinesis Producer Library (KPL). The API should be compatible with old versions.

tdas brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #9153 from mengxr/SPARK-11127.

[SPARK-10984] Simplify *MemoryManager class structure · 85e654c5

Josh Rosen authored 9 years ago

This patch refactors the MemoryManager class structure. After #9000, Spark had the following classes:

- MemoryManager
- StaticMemoryManager
- ExecutorMemoryManager
- TaskMemoryManager
- ShuffleMemoryManager

This is fairly confusing. To simplify things, this patch consolidates several of these classes:

- ShuffleMemoryManager and ExecutorMemoryManager were merged into MemoryManager.
- TaskMemoryManager is moved into Spark Core.

**Key changes and tasks**:

- [x] Merge ExecutorMemoryManager into MemoryManager.
  - [x] Move pooling logic into Allocator.
- [x] Move TaskMemoryManager from `spark-unsafe` to `spark-core`.
- [x] Refactor the existing Tungsten TaskMemoryManager interactions so Tungsten code uses only this and not both this and ShuffleMemoryManager.
- [x] Refactor non-Tungsten code to use the TaskMemoryManager instead of ShuffleMemoryManager.
- [x] Merge ShuffleMemoryManager into MemoryManager.
  - [x] Move code
  - [x] ~~Simplify 1/n calculation.~~ **Will defer to followup, since this needs more work.**
- [x] Port ShuffleMemoryManagerSuite tests.
- [x] Move classes from `unsafe` package to `memory` package.
- [ ] Figure out how to handle the hacky use of the memory managers in HashedRelation's broadcast variable construction.
- [x] Test porting and cleanup: several tests relied on mock functionality (such as `TestShuffleMemoryManager.markAsOutOfMemory`) which has been changed or broken during the memory manager consolidation
  - [x] AbstractBytesToBytesMapSuite
  - [x] UnsafeExternalSorterSuite
  - [x] UnsafeFixedWidthAggregationMapSuite
  - [x] UnsafeKVExternalSorterSuite

**Compatibility notes**:

- This patch introduces breaking changes in `ExternalAppendOnlyMap`, which is marked as `DeveloperApi` (likely for legacy reasons): this class now cannot be used outside of a task.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9127 from JoshRosen/SPARK-10984.

[SPARK-10891][STREAMING][KINESIS] Add MessageHandler to... · 63accc79

Burak Yavuz authored 9 years ago

[SPARK-10891][STREAMING][KINESIS] Add MessageHandler to KinesisUtils.createStream similar to Direct Kafka

This PR allows users to map a Kinesis `Record` to a generic `T` when creating a Kinesis stream. This is particularly useful if you would like to do extra work with Kinesis metadata such as the sequence number and partition key.
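The handler idea can be illustrated with a small Python sketch. Everything here is hypothetical: `Record`, its field names, `consume`, and the two handlers are stand-ins chosen for illustration and are not the `KinesisUtils.createStream` API.

```python
from collections import namedtuple

# Hypothetical stand-in for a Kinesis record; field names are illustrative.
Record = namedtuple("Record", ["sequence_number", "partition_key", "data"])


def default_handler(record):
    """Keep only the payload bytes."""
    return record.data


def with_metadata(record):
    """Keep the payload together with the metadata that came with it."""
    return (record.partition_key, record.sequence_number,
            record.data.decode("utf-8"))


def consume(records, message_handler=default_handler):
    # Apply the user-supplied handler to every record as it is received.
    return [message_handler(r) for r in records]


records = [Record("1", "pk", b"hi")]
```

Passing `with_metadata` instead of the default shows how a caller could carry the sequence number and partition key alongside the data.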

TODO:
 - [x] add tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #8954 from brkyvz/kinesis-handler.

[SPARK-11287] Fixed class name to properly start TestExecutor from deploy.client.TestClient · 80279ac1

Bryan Cutler authored 9 years ago

Executing deploy.client.TestClient fails due to bad class name for TestExecutor in ApplicationDescription.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9255 from BryanCutler/fix-TestClient-classname-SPARK-11287.

[SPARK-6428][SQL] Removed unnecessary typecasts in MutableInt, MutableDouble etc. · 92b9c5ed

Alexander Slesarenko authored 9 years ago

marmbrus rxin I believe these typecasts are not required in the presence of explicit return types.

Author: Alexander Slesarenko <avslesarenko@gmail.com>

Closes #9262 from aslesarenko/remove-typecasts.

[SPARK-11299][DOC] Fix link to Scala DataFrame Functions reference · b67dc6a4

Josh Rosen authored 9 years ago

The SQL programming guide's link to the DataFrame functions reference points to the wrong location; this patch fixes that.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9269 from JoshRosen/SPARK-11299.

Oct 24, 2015

Fix typos · 146da0d8

Jacek Laskowski authored 9 years ago

Two typos squashed.

BTW, let me know how to proceed with other typos if I run across any. I am as uncomfortable leaving them unfixed as I am sending pull requests with such tiny changes. Guide me.

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9250 from jaceklaskowski/typos-hunting.

[SPARK-11264] bin/spark-class can't find assembly jars with certain GREP_OPTIONS set · 28132ceb

Jeffrey Naisbitt authored 9 years ago

Temporarily remove GREP_OPTIONS if set in bin/spark-class.

Some GREP_OPTIONS will modify the output of the grep commands that are looking for the assembly jars.
For example, if the -n option is specified, the grep output will look like:
5:spark-assembly-1.5.1-hadoop2.4.0.jar

This will not match the regular expressions, and so the jar files will not be found. We could improve the regular expression to handle this case and trim off extra characters, but it is difficult to know which options may or may not be set. Unsetting GREP_OPTIONS within the script handles all the cases and gives the desired output.
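The mismatch can be reproduced with a short Python sketch. The pattern below is an assumption, written to resemble an anchored jar-name match; it is not copied from bin/spark-class, and both helper functions are hypothetical.

```python
import re

# Assumed pattern: the file name must start with "spark-assembly".
ASSEMBLY_JAR = re.compile(r"^spark-assembly.*hadoop.*\.jar$")


def find_assembly_jars(grep_output_lines):
    """Keep only the lines that look like assembly jar file names."""
    return [line for line in grep_output_lines if ASSEMBLY_JAR.match(line)]


def env_without_grep_options(env):
    """Return a copy of the environment with GREP_OPTIONS removed."""
    return {k: v for k, v in env.items() if k != "GREP_OPTIONS"}


plain = ["spark-assembly-1.5.1-hadoop2.4.0.jar"]
numbered = ["5:spark-assembly-1.5.1-hadoop2.4.0.jar"]  # as printed with -n
```

The `5:` prefix defeats the leading anchor, so the numbered line is dropped. Removing the variable from the environment before calling grep avoids having to anticipate every such option.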

Author: Jeffrey Naisbitt <jnaisbitt@familysearch.org>

Closes #9231 from naisbitt/unset-GREP_OPTIONS.

[SPARK-11245] update twitter4j to 4.0.4 version · e5bc8c27

dima authored 9 years ago

update twitter4j to 4.0.4 version
https://issues.apache.org/jira/browse/SPARK-11245

Author: dima <pronix.service@gmail.com>

Closes #9221 from pronix/twitter4j_update.

[SPARK-11125] [SQL] Uninformative exception when running spark-sql witho… · ffed0049

Jeff Zhang authored 9 years ago

…ut building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

This is the exception after this patch. Please help review.
```
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	... 21 more
Failed to load hive class.
You need to build Spark with -Phive and -Phive-thriftserver.
```

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9134 from zjffdu/SPARK-11125.

Oct 23, 2015

[SPARK-11294][SPARKR] Improve R doc for read.df, write.df, saveAsTable · 5e458125

felixcheung authored 9 years ago

Add examples for read.df, write.df; fix grouping for read.df, loadDF; fix formatting and text truncation for write.df, saveAsTable.

Several text issues:
![image](https://cloud.githubusercontent.com/assets/8969467/10708590/1303a44e-79c3-11e5-854f-3a2e16854cd7.png)
- text collapsed into a single paragraph
- text truncated at 2 places, eg. "overwrite: Existing data is expected to be overwritten by the contents of error:"

shivaram

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9261 from felixcheung/rdocreadwritedf.

[SPARK-10971][SPARKR] RRunner should allow setting path to Rscript. · 2462dbcc

Sun Rui authored 9 years ago

Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes.

The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).

BTW, [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate the R shell on the local host.

For your information, PySpark has two environment variables serving a similar purpose:
- PYSPARK_PYTHON: Python binary executable to use for PySpark in both driver and workers (default is `python`).
- PYSPARK_DRIVER_PYTHON: Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).

PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the Python executable for a Python script.
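The fallback order described for the two PySpark variables can be written out as a small sketch. `resolve_python` is a hypothetical helper that only restates the defaults listed above; it is not the linked PythonRunner code.

```python
def resolve_python(env, for_driver):
    """Pick the Python executable following the defaults described above."""
    # Workers (and the driver, unless overridden) use PYSPARK_PYTHON or "python".
    worker_python = env.get("PYSPARK_PYTHON", "python")
    if for_driver:
        # The driver-only variable wins when it is set.
        return env.get("PYSPARK_DRIVER_PYTHON", worker_python)
    return worker_python
```

For example, with only `PYSPARK_PYTHON=python3` set, both driver and workers resolve to `python3`. Adding `PYSPARK_DRIVER_PYTHON=ipython` changes the driver alone.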

Author: Sun Rui <rui.sun@intel.com>

Closes #9179 from sun-rui/SPARK-10971.

[SPARK-11194] [SQL] Use MutableURLClassLoader for the classLoader in IsolatedClientLoader. · 4725cb98

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11194

Author: Yin Huai <yhuai@databricks.com>

Closes #9170 from yhuai/SPARK-11194.

[SPARK-11274] [SQL] Text data source support for Spark SQL. · e1a897b6

Reynold Xin authored 9 years ago

This adds API for reading and writing text files, similar to SparkContext.textFile and RDD.saveAsTextFile.
```
SQLContext.read.text("/path/to/something.txt")
DataFrame.write.text("/path/to/write.txt")
```

Using the new Dataset API, this also supports
```
val ds: Dataset[String] = SQLContext.read.text("/path/to/something.txt").as[String]
```

Author: Reynold Xin <rxin@databricks.com>

Closes #9240 from rxin/SPARK-11274.

[SPARK-6723] [MLLIB] Model import/export for ChiSqSelector · 4e38defa

Jayant Shekar authored 9 years ago

This is a PR for Parquet-based model import/export.

* Added save/load for ChiSqSelectorModel
* Updated the test suite ChiSqSelectorSuite

Author: Jayant Shekar <jayant@user-MBPMBA-3.local>

Closes #6785 from jayantshekhar/SPARK-6723.

[SPARK-10277] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.regression · 282a15f7

Yu ISHIKAWA authored 9 years ago

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8684 from yu-iskw/SPARK-10277.

[SPARK-10382] Make example code in user guide testable · 03ccb220

Xusen Yin authored 9 years ago

Proof-of-concept code for making the example code in the user guide testable.

mengxr We still need to talk about the labels in code.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9109 from yinxusen/SPARK-10382.

[SPARK-11243][SQL] zero out padding bytes in UnsafeRow · 487d409e

Davies Liu authored 9 years ago

For a nested StructType, the underlying buffer may have been used for other data before, so we should zero out the padding bytes for primitive types that occupy fewer than 8 bytes.
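A toy Python sketch of the problem, assuming a layout in which every field occupies one 8-byte slot. The helper names and the layout are illustrative and not taken from the UnsafeRow code.

```python
import struct

WORD = 8  # each field occupies one 8-byte slot in this toy layout


def write_int(buf, offset, value, zero_padding):
    """Store a 4-byte int in an 8-byte slot, optionally clearing the slot first."""
    if zero_padding:
        buf[offset:offset + WORD] = bytes(WORD)
    struct.pack_into("<i", buf, offset, value)


def rows_equal_bytewise(zero_padding):
    fresh = bytearray(WORD)
    reused = bytearray(b"\xff" * WORD)  # buffer that held other data before
    write_int(fresh, 0, 42, zero_padding)
    write_int(reused, 0, 42, zero_padding)
    return bytes(fresh) == bytes(reused)
```

Both buffers hold the same logical value, yet without zeroing they differ in the four padding bytes, so any byte-wise comparison or hash over the slot would treat them as different.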

cc cloud-fan

Author: Davies Liu <davies@databricks.com>

Closes #9217 from davies/zero_out.

Fix typo "Received" to "Receiver" in streaming-kafka-integration.md · 16dc9f34

Rohan Bhanderi authored 9 years ago

Fixed a typo on line 8 of the markdown: "Received" -> "Receiver"

Author: Rohan Bhanderi <rohan.bhanderi@sjsu.edu>

Closes #9242 from RohanBhanderi/patch-1.

[SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package · cdea0174

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #9239 from rxin/types-private.

Fix a (very tiny) typo · b1c1597e

Jacek Laskowski authored 9 years ago

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9230 from jaceklaskowski/utils-seconds-typo.

[SPARK-11134][CORE] Increase LauncherBackendSuite timeout. · fa6a4fbf

Marcelo Vanzin authored 9 years ago

This test can take a little while to finish on slow / loaded machines.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9235 from vanzin/SPARK-11134.
