Commits · 708129187a460aca30790281e9221c0cd5e271df · cs525-sp18-g07 / spark

Dec 08, 2015

[SPARK-12166][TEST] Unset hadoop related environment in testing · 70812918

Jeff Zhang authored 9 years ago

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10172 from zjffdu/SPARK-12166.

70812918

[SPARK-12103][STREAMING][KAFKA][DOC] document that K means Key and V … · 48a9804b

cody koeninger authored 9 years ago

…means Value

Author: cody koeninger <cody@koeninger.org>

Closes #10132 from koeninger/SPARK-12103.

48a9804b

[SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code · 4a39b5a1

Yanbo Liang authored 9 years ago

Add `SQLTransformer` user guide and example code, and make the Scala API doc clearer.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10006 from yanboliang/spark-11958.

4a39b5a1

[SPARK-10259][ML] Add @since annotation to ml.classification · 7d05a624

Takahashi Hiroshi authored 9 years ago

Add since annotation to ml.classification

Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>

Closes #8534 from taishi-oss/issue10259.

7d05a624

Closes #10098 · 73896588
Xiangrui Meng authored 9 years ago

73896588

[SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using include_example · 78209b0c

somideshmukh authored 9 years ago

Made a new patch containing only the markdown examples moved to the example folder.
Only three Java examples were not moved, since they contained compilation errors. These classes are:
1) StandardScale 2) NormalizerExample 3) VectorIndexer

Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #10002 from somideshmukh/SomilBranch1.33.

78209b0c

Dec 07, 2015

[SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlib · 3e7e05f5

Joseph K. Bradley authored 9 years ago

Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods.

This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml.
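The difference between constructing a context and reusing one can be pictured with a minimal Python sketch. `Context` and `get_or_create` are hypothetical stand-ins, not Spark's `SQLContext` implementation; the sketch only shows that repeated calls hand back one shared instance instead of building a new one each time.

```python
import threading


class Context:
    """Hypothetical stand-in for a heavyweight context object."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self, name):
        self.name = name

    @classmethod
    def get_or_create(cls, name):
        # Reuse the existing instance if there is one; otherwise build it once.
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls(name)
            return cls._instance


first = Context.get_or_create("app")
second = Context.get_or_create("other")
print(first is second)  # both calls return the same object
```

In this sketch the second call ignores its argument, because an instance already exists.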

CC: mengxr yhuai

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10161 from jkbradley/mllib-sqlcontext-fix.

3e7e05f5

[SPARK-12184][PYTHON] Make python api doc for pivot consistent with scala doc · 36282f78

Andrew Ray authored 9 years ago

In SPARK-11946 the API for pivot was changed a bit and the docs were updated, but the corresponding doc changes were not made for the Python API. This PR updates the Python doc to be consistent.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #10176 from aray/sql-pivot-python-doc.

36282f78

[SPARK-11884] Drop multiple columns in the DataFrame API · 84b80944

tedyu authored 9 years ago

See the thread Ben started:
http://search-hadoop.com/m/q3RTtveEuhjsr7g/

This PR adds a drop() method to DataFrame that accepts multiple column names.

Author: tedyu <yuzhihong@gmail.com>

Closes #9862 from ted-yu/master.

84b80944

[SPARK-11963][DOC] Add docs for QuantileDiscretizer · 871e85d9

Xusen Yin authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11963

Author: Xusen Yin <yinxusen@gmail.com>

Closes #9962 from yinxusen/SPARK-11963.

871e85d9

[SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize · 3f4efb5c

Shixiong Zhu authored 9 years ago

Merged #10051 again since #10083 is resolved.

This reverts commit 328b757d.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10167 from zsxwing/merge-SPARK-12060.

3f4efb5c

[SPARK-11932][STREAMING] Partition previous TrackStateRDD if partitioner not present · 5d80d8c6

Tathagata Das authored 9 years ago

The reason is that TrackStateRDDs generated by trackStateByKey expect the previous batch's TrackStateRDDs to have a partitioner. However, when recovering from DStream checkpoints, the RDDs recovered from RDD checkpoints do not have a partitioner attached to them. This is because RDD checkpoints do not preserve the partitioner (SPARK-12004).

While #9983 solves SPARK-12004 by preserving the partitioner through RDD checkpoints, there is still a non-zero chance that the saving and recovery fails. To be resilient, this PR repartitions the previous state RDD if the partitioner is not detected.
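The fallback described above can be sketched in plain Python. `StateData`, `hash_partition` and `ensure_partitioned` are hypothetical names chosen for illustration, not Spark classes; the sketch only shows the "repartition when no partitioner is attached" decision.

```python
def hash_partition(pairs, num_partitions):
    """Group (key, value) pairs into buckets by the hash of the key."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash(key) % num_partitions].append((key, value))
    return buckets


class StateData:
    """Hypothetical stand-in for a state RDD: partitions plus an optional partitioner tag."""

    def __init__(self, partitions, partitioner=None):
        self.partitions = partitions
        self.partitioner = partitioner


def ensure_partitioned(prev, num_partitions):
    # If the recovered data still carries a partitioner, use it as-is.
    if prev.partitioner is not None:
        return prev
    # Otherwise repartition so that later steps can rely on the layout.
    pairs = [kv for part in prev.partitions for kv in part]
    return StateData(hash_partition(pairs, num_partitions), ("hash", num_partitions))


recovered = StateData([[(1, "a"), (2, "b")], [(3, "c")]])  # partitioner was lost
fixed = ensure_partitioned(recovered, 2)
print(fixed.partitioner, fixed.partitions)
```

Calling `ensure_partitioned` again on `fixed` returns it unchanged, since the partitioner is now present.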

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9988 from tdas/SPARK-11932.

5d80d8c6

[SPARK-12132] [PYSPARK] raise KeyboardInterrupt inside SIGINT handler · ef3f047c

Davies Liu authored 9 years ago

Currently, the current line is not cleared by Ctrl-C.

After this patch:
```
>>> asdfasdf^C
Traceback (most recent call last):
  File "~/spark/python/pyspark/context.py", line 225, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
```

It's still worse than 1.5 (and before).
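A minimal sketch of the idea, assuming a POSIX system and the main thread: a SIGINT handler that raises `KeyboardInterrupt`, so the interrupted code unwinds. The handler name mirrors the traceback above, but this is not the code from `pyspark/context.py`; the self-sent signal only simulates pressing Ctrl-C.

```python
import os
import signal
import time


def signal_handler(signum, frame):
    # Raise instead of silently returning, so whatever was running is unwound.
    raise KeyboardInterrupt()


previous = signal.signal(signal.SIGINT, signal_handler)
caught = False
try:
    os.kill(os.getpid(), signal.SIGINT)  # simulate Ctrl-C (POSIX only)
    time.sleep(1)  # give the handler a chance to run if it has not yet
except KeyboardInterrupt:
    caught = True
finally:
    # Restore whatever handler was installed before this sketch ran.
    signal.signal(signal.SIGINT, previous if previous is not None else signal.SIG_DFL)

print("interrupted:", caught)
```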

Author: Davies Liu <davies@databricks.com>

Closes #10134 from davies/fix_cltrc.

ef3f047c

[SPARK-12034][SPARKR] Eliminate warnings in SparkR test cases. · 39d677c8

Sun Rui authored 9 years ago

This PR:
1. Suppress all known warnings.
2. Clean up test cases and fix some errors in test cases.
3. Fix errors in HiveContext related test cases. These test cases were not actually run previously due to a bug in creating TestHiveContext.
4. Support 'testthat' package version 0.11.0, which prefers that test cases be under 'tests/testthat'.
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn warnings into errors.

Author: Sun Rui <rui.sun@intel.com>

Closes #10030 from sun-rui/SPARK-12034.

39d677c8

[SPARK-12032] [SQL] Re-order inner joins to do join with conditions first · 9cde7d5f

Davies Liu authored 9 years ago

Currently, the order of joins is exactly the same as in the SQL query. Some conditions may not be pushed down to the correct join, so those joins become cross products and are extremely slow.

This patch tries to re-order the inner joins (which are common in SQL queries): it picks the joins that have self-contained conditions first and delays those that do not have conditions.

After this patch, the TPCDS queries Q64/65 can run hundreds of times faster.
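The re-ordering idea can be sketched as a greedy loop in Python. This is an illustration under simplifying assumptions (relations are plain names and each condition links exactly two of them), not the optimizer rule added by the patch; `reorder_joins` is a hypothetical helper.

```python
def reorder_joins(relations, conditions):
    """Greedy ordering: prefer the next relation that shares a join
    condition with something already joined; otherwise keep query order."""
    remaining = list(relations)
    ordered = [remaining.pop(0)]
    while remaining:
        pick = next(
            (r for r in remaining
             if any(frozenset((r, j)) in conditions for j in ordered)),
            remaining[0],  # nothing is connected: a cross product is unavoidable
        )
        remaining.remove(pick)
        ordered.append(pick)
    return ordered


# Query order is a, b, c, but the only conditions link (a, c) and (c, b).
conds = {frozenset(("a", "c")), frozenset(("c", "b"))}
order = reorder_joins(["a", "b", "c"], conds)
print(order)
```

Joining in query order would start with `a` and `b`, which share no condition; the greedy order joins `c` first so that every step has a condition to apply.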

cc marmbrus nongli

Author: Davies Liu <davies@databricks.com>

Closes #10073 from davies/reorder_joins.

9cde7d5f

[SPARK-12106][STREAMING][FLAKY-TEST] BatchedWAL test transiently flaky when Jenkins load is high · 6fd9e70e

Burak Yavuz authored 9 years ago

We need to make sure that the last entry is indeed the last entry in the queue.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #10110 from brkyvz/batch-wal-test-fix.

6fd9e70e

Dec 06, 2015

[SPARK-12152][PROJECT-INFRA] Speed up Scalastyle checks by only invoking SBT once · 80a824d3

Josh Rosen authored 9 years ago

Currently, `dev/scalastyle` invokes SBT four times, but these invocations can be replaced with a single invocation, saving about one minute of build time.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10151 from JoshRosen/speed-up-scalastyle.

80a824d3

[SPARK-12138][SQL] Escape \u in the generated comments of codegen · 49efd03b

gatorsmile authored 9 years ago

When \u appears in a comment block (i.e. in /**/), code gen will break. So, in Expression and CodegenFallback, we escape \u to \\u.
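Java translates `\u` escapes before it strips comments, so a stray `\u` inside a generated comment can fail compilation. The escaping step itself is a one-line replacement; the Python sketch below illustrates it with a hypothetical helper name and is not the code from `Expression` or `CodegenFallback`.

```python
def escape_unicode_marker(comment):
    # Replace each backslash-u with a doubled backslash followed by u.
    return comment.replace("\\u", "\\\\u")


raw = "/* pattern: \\u0041 */"      # the text contains a single backslash before u
escaped = escape_unicode_marker(raw)
print(raw)
print(escaped)
```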

yhuai Please review it. I did reproduce it and it works after the fix. Thanks!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10155 from gatorsmile/escapeU.

49efd03b

[SPARK-12048][SQL] Prevent closing JDBC resources twice · 04b67999

gcc authored 9 years ago

Author: gcc <spark-src@condor.rhaag.ip>

Closes #10101 from rh99/master.

04b67999

[SPARK-12044][SPARKR] Fix usage of isnan, isNaN · b6e8e63a

Yanbo Liang authored 9 years ago

1. Add `isNaN` to `Column` for SparkR. `Column` should have three related variable functions: `isNaN`, `isNull`, `isNotNull`.
2. Replace `DataFrame.isNaN` with `DataFrame.isnan` on the SparkR side, because `DataFrame.isNaN` has been deprecated and will be removed in Spark 2.0.
<del>3. Add `isnull` to `DataFrame` for SparkR. `DataFrame` should have two related functions: `isnan`, `isnull`.</del>

cc shivaram sun-rui felixcheung

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10037 from yanboliang/spark-12044.

b6e8e63a

Dec 05, 2015

[SPARK-12115][SPARKR] Change numPartitions() to getNumPartitions() to be... · 6979edf4

Yanbo Liang authored 9 years ago

[SPARK-12115][SPARKR] Change numPartitions() to getNumPartitions() to be consistent with Scala/Python

Change `numPartitions()` to `getNumPartitions()` to be consistent with Scala/Python.
<del>Note: If we cannot catch up with the 1.6 release, it will be a breaking change for 1.7 that we also need to explain in the release notes.</del>

cc sun-rui felixcheung shivaram

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10123 from yanboliang/spark-12115.

6979edf4

[SPARK-11715][SPARKR] Add R support corr for Column Aggregation · 895b6c47

felixcheung authored 9 years ago

Need to match existing method signature

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9680 from felixcheung/rcorr.

895b6c47

[SPARK-11774][SPARKR] Implement struct(), encode(), decode() functions in SparkR. · c8d0e160

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #9804 from sun-rui/SPARK-11774.

c8d0e160

[SPARK-11988][ML][MLLIB] Update JPMML to 1.2.7 · 7da67485

Sean Owen authored 9 years ago

Update JPMML pmml-model to 1.2.7

Author: Sean Owen <sowen@cloudera.com>

Closes #9972 from srowen/SPARK-11988.

7da67485

[SPARK-11994][MLLIB] Word2VecModel load and save cause SparkException when... · e9c9ae22

Antonio Murgia authored 9 years ago

[SPARK-11994][MLLIB] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max

Author: Antonio Murgia <antonio.murgia2@studio.unibo.it>

Closes #9989 from tmnd1991/SPARK-11932.

e9c9ae22

[SPARK-12096][MLLIB] remove the old constraint in word2vec · ee94b70c

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-12096

word2vec can now handle a much bigger vocabulary.
The old constraint vocabSize.toLong * vectorSize < Int.max / 8 should be removed.

The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue).

I tested with vocabSize over 18M and vectorSize = 100.
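The two constraints can be checked with a little arithmetic. The sketch below assumes exactly 18,000,000 words (the message only says "over 18M"), reads the old limit as `Int.MaxValue / 8`, and uses `Int.MaxValue` as an upper bound on the maximum array length.

```python
INT_MAX = 2**31 - 1  # value of Scala's Int.MaxValue

vocab_size = 18_000_000  # assumed; the commit message says "over 18M"
vector_size = 100
cells = vocab_size * vector_size

old_limit = INT_MAX // 8
passes_old = cells < old_limit
passes_new = cells < INT_MAX  # upper bound; the real max array length is slightly lower

print(cells, old_limit, passes_old, passes_new)
```

Under these assumptions the model needs 1.8 billion cells, which is far above the old limit of about 268 million but still below `Int.MaxValue`.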

srowen jkbradley Sorry to miss this in last PR. I was reminded today.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10103 from hhbyyh/w2vCapacity.

ee94b70c

Dec 04, 2015

[SPARK-12084][CORE] Fix code that uses ByteBuffer.array incorrectly · 3af53e61

Shixiong Zhu authored 9 years ago

`ByteBuffer` doesn't guarantee that all contents of `ByteBuffer.array` are valid, e.g. for a `ByteBuffer` returned by `ByteBuffer.slice`. We should not use the whole content of `ByteBuffer.array` unless we know that's correct.

This patch fixed all places that use `ByteBuffer.array` incorrectly.
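The pitfall has a close analogue in Python, shown here only as an illustration and not as Java's `ByteBuffer` API: a sliced `memoryview` still points at the whole backing buffer, so reading the backing object returns more bytes than the slice covers.

```python
backing = bytearray(b"0123456789")
view = memoryview(backing)[2:5]  # a slice that covers only bytes 2..4

whole = bytes(view.obj)  # the entire backing buffer, not just the slice
valid = bytes(view)      # only the bytes the view actually covers

print(whole, valid)
```

Reading `view.obj` here plays the role of using the whole backing array of a sliced buffer; copying from the view itself respects the slice bounds.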

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10083 from zsxwing/bytebuffer-array.

3af53e61

[SPARK-12080][CORE] Kryo - Support multiple user registrators · f30373f5

rotems authored 9 years ago

Author: rotems <roter>

Closes #10078 from Botnaim/KryoMultipleCustomRegistrators.

f30373f5

[SPARK-12142][CORE] Reply false when container allocator is not ready and reset target · bbfc16ec

meiyoula authored 9 years ago

When using dynamic allocation, if a new AM is starting and ExecutorAllocationManager sends a RequestExecutor message to the AM while the container allocator is not ready, the whole app will hang.

Author: meiyoula <1039320815@qq.com>

Closes #10138 from XuTingjun/patch-1.

bbfc16ec

[SPARK-12112][BUILD] Upgrade to SBT 0.13.9 · b7204e1d

Josh Rosen authored 9 years ago

We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).

I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.

b7204e1d

[SPARK-11314][BUILD][HOTFIX] Add exclusion for moved YARN classes. · d64806b3

Marcelo Vanzin authored 9 years ago

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10147 from vanzin/SPARK-11314.

d64806b3

[SPARK-12058][STREAMING][KINESIS][TESTS] fix Kinesis python tests · 302d68de

Burak Yavuz authored 9 years ago

Python tests require access to the `KinesisTestUtils` file. When this file exists under src/test, python can't access it, since it is not available in the assembly jar.

However, if we move KinesisTestUtils to src/main, we need to add the KinesisProducerLibrary as a dependency. In order to avoid this, I moved KinesisTestUtils to src/main, and extended it with ExtendedKinesisTestUtils which is under src/test that adds support for the KPL.

cc zsxwing tdas

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #10050 from brkyvz/kinesis-py.

302d68de

[SPARK-6990][BUILD] Add Java linting script; fix minor warnings · d0d82227

Dmitry Erastov authored 9 years ago

This replaces https://github.com/apache/spark/pull/9696

Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.

Suggest fixing those TODOs in a separate PR(s).

More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).

Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I ran the build twice with different profiles):

> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1

Also fix some of the minor violations that didn't require sweeping changes.

Apologies for the previous botched PRs - I finally figured out the issue.

cr: JoshRosen, pwendell

> I state that the contribution is my original work, and I license the work to the project under the project's open source license.

Author: Dmitry Erastov <derastov@gmail.com>

Closes #9867 from dskrvk/master.

d0d82227

[SPARK-12089] [SQL] Fix memory corruption due to freeing a page being referenced · 95296d9b

Nong authored 9 years ago

When the spillable sort iterator was spilled, it was mistakenly keeping
the last page in memory rather than the current page. This causes the
current record to get corrupted.

Author: Nong <nong@cloudera.com>

Closes #10142 from nongli/spark-12089.

95296d9b

Add links on how to set up IDEs for developing Spark · 17e4e021

kaklakariada authored 9 years ago

These links make it easier for new developers to work with Spark in their IDE.

Author: kaklakariada <kaklakariada@users.noreply.github.com>

Closes #10104 from kaklakariada/readme-developing-ide-gettting-started.

17e4e021

[SPARK-12122][STREAMING] Prevent batches from being submitted twice after... · 4106d80f

Tathagata Das authored 9 years ago

[SPARK-12122][STREAMING] Prevent batches from being submitted twice after recovering StreamingContext from checkpoint

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #10127 from tdas/SPARK-12122.

4106d80f

Dec 03, 2015

[SPARK-12104][SPARKR] collect() does not handle multiple columns with same name. · 5011f264

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #10118 from sun-rui/SPARK-12104.

5011f264

[SPARK-11206] Support SQL UI on the history server (resubmit) · b6e9963e

Carson Wang authored 9 years ago

Resubmit #9297 and #9991
On the live web UI, there is a SQL tab which provides valuable information for the SQL query. But once the workload is finished, we won't see the SQL tab on the history server. It will be helpful if we support SQL UI on the history server so we can analyze it even after its execution.

To support SQL UI on the history server:
1. I added an onOtherEvent method to the SparkListener trait and post all SQL related events to the same event bus.
2. Two SQL events SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd are defined in the sql module.
3. The new SQL events are written to event log using Jackson.
4. A new trait SparkHistoryListenerFactory is added to allow the history server to feed events to the SQL history listener. The SQL implementation is loaded at runtime using java.util.ServiceLoader.

Author: Carson Wang <carson.wang@intel.com>

Closes #10061 from carsonwang/SqlHistoryUI.

b6e9963e

[SPARK-12056][CORE] Create a TaskAttemptContext only after calling setConf. · f434f36d

Anderson de Andrade authored 9 years ago

TaskAttemptContext's constructor clones the configuration instead of referencing it. Calling setConf after creating the TaskAttemptContext means that any configuration changes made inside setConf are not seen by RecordReader instances.
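Why the order matters can be shown with a small Python sketch. `TaskContext` and `InputFormat` are hypothetical stand-ins for the Hadoop classes, and the `keyspace` value is made up; the only behavior modeled is that the context copies the configuration when it is constructed.

```python
class TaskContext:
    """Hypothetical context that copies the configuration it is given."""

    def __init__(self, conf):
        self.conf = dict(conf)  # snapshot: later changes to `conf` are not seen


class InputFormat:
    """Hypothetical input format whose set_conf mutates the configuration."""

    def set_conf(self, conf):
        conf["keyspace"] = "graph"  # made-up value for illustration


# Wrong order: the context snapshots the conf before set_conf mutates it.
conf_a = {}
ctx_a = TaskContext(conf_a)
InputFormat().set_conf(conf_a)

# Right order: call set_conf first, then create the context.
conf_b = {}
InputFormat().set_conf(conf_b)
ctx_b = TaskContext(conf_b)

print("keyspace" in ctx_a.conf, "keyspace" in ctx_b.conf)
```

In the first ordering the context never sees the `keyspace` entry, which mirrors the missing-field error quoted in this commit message.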

As an example, Titan's InputFormat will change conf when calling setConf. They wrap their InputFormat around Cassandra's ColumnFamilyInputFormat, and append Cassandra's configuration. This change fixes the following error when using Titan's CassandraInputFormat with Spark:

*java.lang.RuntimeException: org.apache.thrift.protocol.TProtocolException: Required field 'keyspace' was not present! Struct: set_keyspace_args(keyspace:null)*

There's a discussion of this error here: https://groups.google.com/forum/#!topic/aureliusgraphs/4zpwyrYbGAE

Author: Anderson de Andrade <adeandrade@verticalscope.com>

Closes #10046 from adeandrade/newhadooprdd-fix.

f434f36d

[SPARK-12019][SPARKR] Support character vector for sparkR.init(), check param and fix doc · 2213441e

felixcheung authored 9 years ago

Also add tests.
Spark submit expects a comma-separated list.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10034 from felixcheung/sparkrinitdoc.

2213441e