Commits · 94fc57afdf8ac6be35f13956232b6cf58857d047 · cs525-sp18-g07 / spark

Oct 07, 2015

[SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. · 94fc57af
Marcelo Vanzin authored 9 years ago

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8775 from vanzin/SPARK-10300.

94fc57af

[SPARK-10941] [SQL] Refactor AggregateFunction2 and AlgebraicAggregate... · a9ecd061

Josh Rosen authored 9 years ago

[SPARK-10941] [SQL] Refactor AggregateFunction2 and AlgebraicAggregate interfaces to improve code clarity

This patch refactors several of the Aggregate2 interfaces in order to improve code clarity.

The biggest change is a refactoring of the `AggregateFunction2` class hierarchy. In the old code, we had a class named `AlgebraicAggregate` that inherited from `AggregateFunction2`, added a new set of methods, then banned the use of the inherited methods. I found this to be fairly confusing.

If you look carefully at the existing code, you'll see that subclasses of `AggregateFunction2` fall into two disjoint categories: imperative aggregation functions which directly extended `AggregateFunction2` and declarative, expression-based aggregate functions which extended `AlgebraicAggregate`. In order to make this more explicit, this patch refactors things so that `AggregateFunction2` is a sealed abstract class with two subclasses, `ImperativeAggregateFunction` and `ExpressionAggregateFunction`. The superclass, `AggregateFunction2`, now only contains methods and fields that are common to both subclasses.
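The two-category split described above can be sketched in Python as an analogy. This is not Spark's Scala code: the three class names come from the commit message, but every method name here (`name`, `initialize`, `update`, `evaluate`, `initial_value`, `update_expression`), the `ImperativeSum` and `ExpressionSum` examples, and the `aggregate` driver are hypothetical. Python also has no `sealed` keyword, so the closed set of subclasses is only a convention in this sketch.

```python
from abc import ABC, abstractmethod


class AggregateFunction2(ABC):
    """Base class: holds only what both kinds of aggregate share."""

    @abstractmethod
    def name(self) -> str:
        """Name used when describing the aggregate."""


class ImperativeAggregateFunction(AggregateFunction2):
    """Aggregates that update a mutable buffer directly (hypothetical methods)."""

    @abstractmethod
    def initialize(self) -> list: ...

    @abstractmethod
    def update(self, buffer: list, value) -> None: ...

    @abstractmethod
    def evaluate(self, buffer: list): ...


class ExpressionAggregateFunction(AggregateFunction2):
    """Aggregates described declaratively; callables stand in for expressions."""

    @abstractmethod
    def initial_value(self): ...

    @abstractmethod
    def update_expression(self): ...


class ImperativeSum(ImperativeAggregateFunction):
    def name(self): return "imperative_sum"
    def initialize(self): return [0]
    def update(self, buffer, value): buffer[0] += value
    def evaluate(self, buffer): return buffer[0]


class ExpressionSum(ExpressionAggregateFunction):
    def name(self): return "expression_sum"
    def initial_value(self): return 0
    def update_expression(self): return lambda acc, value: acc + value


def aggregate(func: AggregateFunction2, values):
    """Dispatch on the two disjoint categories; nothing else is expected."""
    if isinstance(func, ImperativeAggregateFunction):
        buffer = func.initialize()
        for value in values:
            func.update(buffer, value)
        return func.evaluate(buffer)
    if isinstance(func, ExpressionAggregateFunction):
        acc = func.initial_value()
        step = func.update_expression()
        for value in values:
            acc = step(acc, value)
        return acc
    raise TypeError(f"unknown aggregate kind: {type(func).__name__}")
```

Because each subclass exposes only the methods that make sense for its category, no inherited method has to be "banned", which is the confusion the refactoring set out to remove.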

After making this change, I updated the various AggregationIterator classes to comply with this new naming scheme. I also performed several small renamings in the aggregate interfaces themselves in order to improve clarity and rewrote or expanded a number of comments.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8973 from JoshRosen/tungsten-agg-comments.

a9ecd061

[SPARK-9841] [ML] Make clear public · 5be5d247

Holden Karau authored 9 years ago

It is currently impossible to clear Param values once set. It would be helpful to be able to do so.
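As a rough illustration of what a public `clear` buys, here is a minimal Python sketch of a parameter holder with defaults. The `Params` class below and its `set`, `get`, `is_set`, and `clear` methods are hypothetical stand-ins, not the Spark ML API.

```python
class Params:
    """Toy parameter holder: explicit values shadow defaults."""

    def __init__(self, **defaults):
        self._defaults = dict(defaults)
        self._values = {}

    def set(self, name, value):
        if name not in self._defaults:
            raise KeyError(name)
        self._values[name] = value
        return self

    def is_set(self, name):
        return name in self._values

    def get(self, name):
        # Fall back to the default when no explicit value is present.
        return self._values.get(name, self._defaults[name])

    def clear(self, name):
        # Drop the explicit value so the default applies again.
        self._values.pop(name, None)
        return self
```

Without `clear`, the only way back to the default in this sketch would be to set the default value explicitly, which still leaves the parameter marked as user-set.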

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.

5be5d247

[SPARK-10964] [YARN] Correctly register the AM with the driver. · 6ca27f85

Marcelo Vanzin authored 9 years ago

The `self` method returns null when called from the constructor;
instead, registration should happen in the `onStart` method, at
which point the `self` reference has already been initialized.
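The ordering problem can be shown with a small Python sketch. All names here (`RpcEnv`, `Endpoint`, `self_ref`, `on_start`) are hypothetical and only mimic the lifecycle described in the message: the endpoint's own reference does not exist until the environment has registered it.

```python
class RpcEnv:
    """Toy environment: an endpoint has a reference only after registration."""

    def __init__(self):
        self._refs = {}

    def register(self, name, endpoint):
        self._refs[id(endpoint)] = f"ref:{name}"
        endpoint.on_start()  # lifecycle hook runs after the reference exists
        return self._refs[id(endpoint)]

    def ref_of(self, endpoint):
        return self._refs.get(id(endpoint))


class Endpoint:
    def __init__(self, env):
        self.env = env
        # Not registered yet, so the self reference is still None here.
        self.ref_seen_in_constructor = self.self_ref
        self.registered_with = None

    @property
    def self_ref(self):
        return self.env.ref_of(self)

    def on_start(self):
        # Safe place to hand our own reference to another party.
        self.registered_with = self.self_ref
```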

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9005 from vanzin/SPARK-10964.

6ca27f85

[SPARK-10812] [YARN] Fix shutdown of token renewer. · 4b747551

Marcelo Vanzin authored 9 years ago

A recent change to fix the referenced bug caused this exception in
the `SparkContext.stop()` path:

org.apache.spark.SparkException: YarnSparkHadoopUtil is not available in non-YARN mode!
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:167)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:182)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:440)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1579)
at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1730)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1729)

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8996 from vanzin/SPARK-10812.

4b747551

[SPARK-10966] [SQL] Codegen framework cleanup · f5d154bc

Michael Armbrust authored 9 years ago

This PR is mostly cosmetic and cleans up some warts in codegen (nearly all of which were inherited from the original quasiquote version).
 - Add line numbers to errors (in stacktraces when debug logging is on, and always for compile failures)
 - Use a variable for the input row instead of hardcoding "i" everywhere
 - Rename `primitive` -> `value` (since it's often actually an object)

Author: Michael Armbrust <michael@databricks.com>

Closes #9006 from marmbrus/codegen-cleanup.

f5d154bc

[SPARK-10952] Only add hive to classpath if HIVE_HOME is set. · 9672602c

Kevin Cox authored 9 years ago

Currently, if it isn't set, the script scans `/lib/*` and adds every directory to the
classpath, which makes the environment too large, so every command called
afterwards fails.
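The failure mode and the guard can be sketched in Python. This is an illustration rather than the actual launch script: the function names are hypothetical, and the `$HIVE_HOME/lib/*` layout is assumed from the `/lib/*` path mentioned in the message.

```python
def naive_hive_entry(env):
    """Unguarded version: an unset HIVE_HOME collapses the path to /lib/*."""
    return f"{env.get('HIVE_HOME', '')}/lib/*"


def hive_classpath_entries(env):
    """Guarded version: contribute nothing unless HIVE_HOME is set."""
    hive_home = env.get("HIVE_HOME")
    if not hive_home:
        return []
    return [f"{hive_home}/lib/*"]
```

With an empty environment the naive version yields `/lib/*`, which is the system library directory rather than anything related to Hive.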

Author: Kevin Cox <kevincox@kevincox.ca>

Closes #8994 from kevincox/kevincox-only-add-hive-to-classpath-if-var-is-set.

9672602c

[SPARK-10752] [SPARKR] Implement corr() and cov in DataFrameStatFunctions. · f57c63d4

Sun Rui authored 9 years ago

Author: Sun Rui <rui.sun@intel.com>

Closes #8869 from sun-rui/SPARK-10752.

f57c63d4

[SPARK-10669] [DOCS] Link to each language's API in codetabs in ML docs: spark.mllib · 27cdde2f

Xin Ren authored 9 years ago

In the Markdown docs for the spark.mllib Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "ChiSqSelector" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/mllib-feature-extraction.md
This JIRA is just for spark.mllib, not spark.ml.

Please let me know if more work is needed, thanks a lot.

Author: Xin Ren <iamshrek@126.com>

Closes #8977 from keypointt/SPARK-10669.

27cdde2f

Oct 06, 2015

[SPARK-10885] [STREAMING] Display the failed output op in Streaming UI · ffe6831e

zsxwing authored 9 years ago

This PR implements the following features for both `master` and `branch-1.5`.
1. Display the failed output op count in the batch list
2. Display the failure reason of output op in the batch detail page

Screenshots:
<img width="1356" alt="1" src="https://cloud.githubusercontent.com/assets/1000778/10198387/5b2b97ec-67ce-11e5-81c2-f818b9d2f3ad.png">
<img width="1356" alt="2" src="https://cloud.githubusercontent.com/assets/1000778/10198388/5b76ac14-67ce-11e5-8c8b-de2683c5b485.png">

There are still two remaining problems in the UI.
1. If an output operation doesn't run any Spark job, we cannot get its duration, since the duration is currently the sum of all jobs' durations.
2. If an output operation doesn't run any Spark job, we cannot get the description, since it's the latest job's call site.

We need to add new `StreamingListenerEvent`s for output operations to fix them, so I'd like to fix them only for `master` in another PR.

Author: zsxwing <zsxwing@gmail.com>

Closes #8950 from zsxwing/batch-failure.

ffe6831e

[SPARK-10957] [ML] setParams changes quantileProbabilities unexpectedly in... · 5e035403

Xiangrui Meng authored 9 years ago

[SPARK-10957] [ML] setParams changes quantileProbabilities unexpectedly in PySpark's AFTSurvivalRegression

If the user doesn't specify `quantileProbs` in `setParams`, it will get reset to the default value. We don't need special handling here. vectorijk yanboliang
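A small Python sketch of the symptom follows. It is a guess at the general shape of the problem, not the actual PySpark code: the class, the default values, and both method names are hypothetical. The buggy variant re-applies the default whenever the argument is omitted, while the fixed variant only touches parameters that were actually passed.

```python
class SurvivalRegressionParams:
    """Toy stand-in for an estimator's parameter map (hypothetical defaults)."""

    def __init__(self):
        self._defaults = {"maxIter": 100, "quantileProbabilities": [0.1, 0.5, 0.9]}
        self._values = {}

    def get(self, name):
        return self._values.get(name, self._defaults[name])

    def set_params_buggy(self, maxIter=None, quantileProbabilities=None):
        if maxIter is not None:
            self._values["maxIter"] = maxIter
        # Special handling: an omitted argument overwrites the user's value.
        if quantileProbabilities is None:
            quantileProbabilities = self._defaults["quantileProbabilities"]
        self._values["quantileProbabilities"] = list(quantileProbabilities)
        return self

    def set_params(self, **kwargs):
        # No special handling: only explicitly passed parameters change.
        for name, value in kwargs.items():
            if name not in self._defaults:
                raise KeyError(name)
            self._values[name] = value
        return self
```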

Author: Xiangrui Meng <meng@databricks.com>

Closes #9001 from mengxr/SPARK-10957.

5e035403

[SPARK-10688] [ML] [PYSPARK] Python API for AFTSurvivalRegression · 5952bdb7

vectorijk authored 9 years ago

Implement Python API for AFTSurvivalRegression

Author: vectorijk <jiangkai@gmail.com>

Closes #8926 from vectorijk/spark-10688.

5952bdb7

[SPARK-10901] [YARN] spark.yarn.user.classpath.first doesn't work · e9783601

Thomas Graves authored 9 years ago

This should go into 1.5.2 also.

The issue is that we were no longer adding the `__app__.jar` to the system classpath.

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #8959 from tgravescs/SPARK-10901.

e9783601

[SPARK-10916] [YARN] Set perm gen size when launching containers on YARN. · 744f03e7

Marcelo Vanzin authored 9 years ago

This makes YARN containers behave like all other processes launched by
Spark, which launch with a default perm gen size of 256m unless
overridden by the user (or not needed by the vm).
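The rule described above can be sketched as a small Python helper. The helper name and its `vm_needs_perm_gen` parameter are hypothetical; `-XX:MaxPermSize` is the HotSpot option for the permanent generation, and the `256m` default comes from the message.

```python
PERM_GEN_FLAG = "-XX:MaxPermSize"


def add_default_perm_gen(java_opts, default_size="256m", vm_needs_perm_gen=True):
    """Append a default perm gen size unless the user set one or the VM has none."""
    opts = list(java_opts)
    if not vm_needs_perm_gen:
        return opts
    if any(opt.startswith(PERM_GEN_FLAG + "=") for opt in opts):
        return opts  # user override wins
    return opts + [f"{PERM_GEN_FLAG}={default_size}"]
```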

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8970 from vanzin/SPARK-10916.

744f03e7

[SPARK-10938] [SQL] remove typeId in columnar cache · 27ecfe61

Davies Liu authored 9 years ago

This PR removes the typeId in the columnar cache, since it's not needed anymore. It also removes DATE and TIMESTAMP (INT/LONG are used instead).
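To see why dedicated DATE and TIMESTAMP column types can be dropped in favor of INT and LONG, consider the following Python sketch. It assumes dates are stored as days since the Unix epoch and timestamps as microseconds since the epoch; the message does not state the encoding, so treat both as assumptions, and the function names as hypothetical.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_TS = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_int(d: date) -> int:
    """Days since the epoch; fits in a 32-bit INT column."""
    return (d - EPOCH_DATE).days


def int_to_date(days: int) -> date:
    return EPOCH_DATE + timedelta(days=days)


def timestamp_to_long(ts: datetime) -> int:
    """Microseconds since the epoch; fits in a 64-bit LONG column."""
    delta = ts - EPOCH_TS
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


def long_to_timestamp(micros: int) -> datetime:
    return EPOCH_TS + timedelta(microseconds=micros)
```

Since both values round-trip through plain integers, the cache only needs the integer column implementations.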

Author: Davies Liu <davies@databricks.com>

Closes #8989 from davies/refactor_cache.

27ecfe61

[SPARK-10585] [SQL] [FOLLOW-UP] remove no-longer-necessary code for unsafe generation · 4e0027fe

Wenchen Fan authored 9 years ago

This code was left in place to produce a clear diff for https://github.com/apache/spark/pull/8747

Author: Wenchen Fan <cloud0fan@163.com>

Closes #8991 from cloud-fan/clean.

4e0027fe

Oct 05, 2015

[SPARK-10900] [STREAMING] Add output operation events to StreamingListener · be7c5ff1

zsxwing authored 9 years ago

Add output operation events to StreamingListener so as to implement the following UI features:

1. Progress bar of a batch in the batch list.
2. Be able to display output operation `description` and `duration` when there is no spark job in a Streaming job.

Author: zsxwing <zsxwing@gmail.com>

Closes #8958 from zsxwing/output-operation-events.

be7c5ff1

[SPARK-10934] [SQL] handle hashCode of unsafe array correctly · a609eb20

Wenchen Fan authored 9 years ago

`Murmur3_x86_32.hashUnsafeWords` only accepts word-aligned bytes, but an unsafe array is not word-aligned.
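The constraint can be illustrated with a toy Python hash. This is not Murmur3 and not Spark's fix: the mixing step is a simple placeholder, the function names are hypothetical, and the 8-byte word size is an assumption. The point is only that a word-based routine must reject input whose length is not a multiple of the word size, so arbitrary-length data needs a variant that also handles the trailing bytes.

```python
WORD_SIZE = 8  # assumed word size in bytes
MASK32 = 0xFFFFFFFF


def hash_words(data: bytes, seed: int = 42) -> int:
    """Hash whole words only; refuses input that is not word-aligned."""
    if len(data) % WORD_SIZE != 0:
        raise ValueError("length must be a multiple of 8 bytes")
    h = seed
    for i in range(0, len(data), WORD_SIZE):
        word = int.from_bytes(data[i:i + WORD_SIZE], "little")
        h = (h * 31 + word) & MASK32  # placeholder mixing, not Murmur3
    return h


def hash_bytes(data: bytes, seed: int = 42) -> int:
    """Hash the aligned prefix by words, then fold in the remaining bytes."""
    aligned = len(data) - len(data) % WORD_SIZE
    h = hash_words(data[:aligned], seed)
    for b in data[aligned:]:
        h = (h * 31 + b) & MASK32
    return h
```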

Author: Wenchen Fan <cloud0fan@163.com>

Closes #8987 from cloud-fan/hash.

a609eb20

[SPARK-10585] [SQL] only copy data once when generate unsafe projection · c4871369

Wenchen Fan authored 9 years ago

This PR is a complete rewrite of GenerateUnsafeProjection, to accomplish the goal of copying data only once. The old code of GenerateUnsafeProjection is still there to reduce review difficulty.

Instead of creating unsafe conversion code for struct, array and map, we generate code that writes the content to the global row buffer.

Author: Wenchen Fan <cloud0fan@163.com>
Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8747 from cloud-fan/copy-once.

c4871369

Oct 04, 2015

[SPARK-10889] [STREAMING] Bump KCL to add MillisBehindLatest metric · 883bd8fc

Avrohom Katz authored 9 years ago

I don't believe the API changed at all.

Author: Avrohom Katz <iambpentameter@gmail.com>

Closes #8957 from akatz/kcl-upgrade.

883bd8fc

[SPARK-9570] [DOCS] Consistent recommendation for submitting spark apps to... · 82bbc2a5

Sean Owen authored 9 years ago

[SPARK-9570] [DOCS] Consistent recommendation for submitting spark apps to YARN, `--master yarn --deploy-mode x` vs `--master yarn-x`.

Recommend `--master yarn --deploy-mode {cluster,client}` consistently in docs.
Follow-on to https://github.com/apache/spark/pull/8385
CC nssalian

Author: Sean Owen <sowen@cloudera.com>

Closes #8968 from srowen/SPARK-9570.

82bbc2a5

[SPARK-10904] [SPARKR] Fix to support `select(df, c("col1", "col2"))` · 721e8b5f

felixcheung authored 9 years ago

The fix is to coerce `c("a", "b")` into a list so that it can be serialized for the call into the JVM.

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #8961 from felixcheung/rselect.

721e8b5f

Oct 03, 2015

Remove TODO in ShuffleMemoryManager. · ae6570ec
Reynold Xin authored 9 years ago

ae6570ec

FIX: rememberDuration reassignment error message · be0dcd6e

Guillaume Poulin authored 9 years ago

I was reading through the scheduler and found this small mistake.

Author: Guillaume Poulin <guillaume@hopper.com>

Closes #8966 from gpoulin/remember_duration_typo.

be0dcd6e

[SPARK-6028] [CORE] Remerge #6457: new RPC implementation and also pick #8905 · 107320c9

zsxwing authored 9 years ago

This PR just reverted https://github.com/apache/spark/commit/02144d6745ec0a6d8877d969feb82139bd22437f to remerge #6457 and also included the commits in #8905.

Author: zsxwing <zsxwing@gmail.com>

Closes #8944 from zsxwing/SPARK-6028.

107320c9

[SPARK-7275] [SQL] Make LogicalRelation public · 314bc684

gweidner authored 9 years ago

Given that LogicalRelation (and other classes) were moved from the sources package to the execution.sources package, this removes private[sql] to make LogicalRelation public and facilitate access for data sources.

Author: gweidner <gweidner@us.ibm.com>

Closes #8965 from gweidner/SPARK-7275.

314bc684

Oct 02, 2015

[SPARK-10317] [CORE] Compatibility between history server script and functionality · f85aa064

Joshi authored 9 years ago

Compatibility between history server script and functionality

The history server has its argument parsing class in `HistoryServerArguments`. However, this class is not involved in the `start-history-server.sh` code path, where the $0 argument is assigned to `spark.history.fs.logDirectory` and all other arguments (e.g. `--property-file`) are discarded.
This prevents the other options from being used with this script.

Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #8758 from rekhajoshm/SPARK-10317.

f85aa064

[HOT-FIX] Fix style. · b0baa11d

Yin Huai authored 9 years ago

https://github.com/apache/spark/pull/8882 broke our build.

Author: Yin Huai <yhuai@databricks.com>

Closes #8964 from yhuai/fixStyle.

b0baa11d

[SPARK-6530] [ML] Add chi-square selector for ml package · 633aaae0

Xusen Yin authored 9 years ago

See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530).

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5742 from yinxusen/SPARK-6530.

633aaae0

[SPARK-5890] [ML] Add feature discretizer · 23a9448c

Xusen Yin authored 9 years ago

JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890).

I borrowed the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #5779 from yinxusen/SPARK-5890.

23a9448c

[SPARK-9798] [ML] CrossValidatorModel Documentation Improvements · 2a717821

Rerngvit Yanggratoke authored 9 years ago

Document CrossValidatorModel members: bestModel and avgMetrics

Author: Rerngvit Yanggratoke <rerngvit@kth.se>

Closes #8882 from rerngvit/Spark-9798.

2a717821

Oct 01, 2015

[SPARK-9867] [SQL] Move utilities for binary data into ByteArray · 2272962e

Takeshi YAMAMURO authored 9 years ago

Utilities for binary data, such as Substring#substringBinarySQL and BinaryPrefixComparator#computePrefix, are gathered in ByteArray for readability.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #8122 from maropu/CleanUpForBinaryType.

2272962e

[SPARK-10400] [SQL] Renames SQLConf.PARQUET_FOLLOW_PARQUET_FORMAT_SPEC · 01cd688f

Cheng Lian authored 9 years ago

We introduced SQL option `spark.sql.parquet.followParquetFormatSpec` while working on implementing Parquet backwards-compatibility rules in SPARK-6777. It indicates whether we should use legacy Parquet format adopted by Spark 1.4 and prior versions or the standard format defined in parquet-format spec to write Parquet files.

This option defaults to `false` and is marked as a non-public option (`isPublic = false`) because we haven't finished refactoring the Parquet write path. The problem is that the name of this option is somewhat confusing: it's not intuitive why we shouldn't follow the spec. It would be nice to rename it to `spark.sql.parquet.writeLegacyFormat` and invert its default value (the two option names have opposite meanings).

Although this option is private in 1.5, we'll make it public in 1.6 after refactoring the Parquet write path, so that users can decide whether to write Parquet files in standard format or legacy format.
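The relationship between the two option names can be sketched in Python. The two keys come from the message, and the default of `True` for the new key follows from inverting the old default of `false`. Everything else is hypothetical: in particular, the fallback to the old key is an illustrative migration helper, not something this commit is stated to implement.

```python
OLD_KEY = "spark.sql.parquet.followParquetFormatSpec"
NEW_KEY = "spark.sql.parquet.writeLegacyFormat"


def _parse_bool(text: str) -> bool:
    return str(text).strip().lower() == "true"


def write_legacy_format(conf: dict) -> bool:
    """Resolve the new option; the old one means the opposite."""
    if NEW_KEY in conf:
        return _parse_bool(conf[NEW_KEY])
    if OLD_KEY in conf:
        # Hypothetical fallback: following the spec means not writing legacy.
        return not _parse_bool(conf[OLD_KEY])
    return True  # inverted default: old default was false
```

An empty configuration behaves the same under both names: the old default (`false`, do not follow the spec) and the new default (`true`, write the legacy format) describe identical behavior.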

Author: Cheng Lian <lian@databricks.com>

Closes #8566 from liancheng/spark-10400/deprecate-follow-parquet-format-spec.

01cd688f

[SPARK-10671] [SQL] Throws an analysis exception if we cannot find Hive UDFs · 02026a81

Wenchen Fan authored 9 years ago

Takes over https://github.com/apache/spark/pull/8800

Author: Wenchen Fan <cloud0fan@163.com>

Closes #8941 from cloud-fan/hive-udf.

02026a81

[SPARK-10865] [SPARK-10866] [SQL] Fix bug of ceil/floor, which should return... · 4d8c7c6d

Cheng Hao authored 9 years ago

[SPARK-10865] [SPARK-10866] [SQL] Fix bug of ceil/floor, which should return long instead of the Double type

The floor and ceiling functions should return the Long type rather than Double.

Verified with MySQL & Hive.
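As a point of comparison, Python's standard library follows the same convention of returning an integer rather than a floating-point value. The `sql_floor` and `sql_ceil` wrappers below are hypothetical names used only to make the expected result type explicit.

```python
import math


def sql_floor(x: float) -> int:
    """Largest integer not greater than x, returned as an integer type."""
    return math.floor(x)


def sql_ceil(x: float) -> int:
    """Smallest integer not less than x, returned as an integer type."""
    return math.ceil(x)
```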

Author: Cheng Hao <hao.cheng@intel.com>

Closes #8933 from chenghao-intel/ceiling.

4d8c7c6d

[SPARK-10058] [CORE] [TESTS] Fix the flaky tests in HeartbeatReceiverSuite · 9b3e7768

zsxwing authored 9 years ago

Fixed the test failure here: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/

This failure is because `HeartbeatReceiverSuite.heartbeatReceiver` may receive `SparkListenerExecutorAdded("driver")` sent from [LocalBackend](https://github.com/apache/spark/blob/8fb3a65cbb714120d612e58ef9d12b0521a83260/core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala#L121).

There are other race conditions in `HeartbeatReceiverSuite` because `HeartbeatReceiver.onExecutorAdded` and `HeartbeatReceiver.onExecutorRemoved` are asynchronous. This PR also fixed them.

Author: zsxwing <zsxwing@gmail.com>

Closes #8946 from zsxwing/SPARK-10058.

9b3e7768

Sep 30, 2015

[SPARK-10807] [SPARKR] Added as.data.frame as a synonym for collect · f21e2da0

Oscar D. Lara Yejas authored 9 years ago

Created method as.data.frame as a synonym for collect().

Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: olarayej <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>

Closes #8908 from olarayej/SPARK-10807.

f21e2da0

[SPARK-9617] [SQL] Implement json_tuple · 89ea0041

Nathan Howell authored 9 years ago

This is an implementation of Hive's `json_tuple` function using Jackson Streaming.
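For readers unfamiliar with the function, here is a rough Python sketch of the behavior: extract several top-level fields from a JSON string in one call. It uses `json.loads` instead of a streaming parser, so it is a different technique from the one in this commit, and the choices to return `None` for missing fields or invalid input and to render non-string values as JSON text are assumptions about the semantics.

```python
import json


def _as_text(value):
    """Render a JSON value as a string; None stays None."""
    if value is None:
        return None
    if isinstance(value, str):
        return value
    return json.dumps(value)


def json_tuple(json_str, *fields):
    """Return one entry per requested top-level field."""
    try:
        obj = json.loads(json_str)
    except (TypeError, ValueError):
        return tuple(None for _ in fields)
    if not isinstance(obj, dict):
        return tuple(None for _ in fields)
    return tuple(_as_text(obj.get(field)) for field in fields)
```

Parsing once and pulling out all requested fields avoids re-parsing the same string for each field.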

Author: Nathan Howell <nhowell@godaddy.com>

Closes #7946 from NathanHowell/SPARK-9617.

89ea0041

[SPARK-10770] [SQL] SparkPlan.executeCollect/executeTake should return... · 03cca5dc

Reynold Xin authored 9 years ago

[SPARK-10770] [SQL] SparkPlan.executeCollect/executeTake should return InternalRow rather than external Row.

Author: Reynold Xin <rxin@databricks.com>

Closes #8900 from rxin/SPARK-10770-1.

03cca5dc

[SPARK-10851] [SPARKR] Exception not failing R applications (in yarn cluster mode) · c7b29ae6

Sun Rui authored 9 years ago

The YARN backend doesn't like it when user code calls `System.exit`, since it cannot know the exit status and thus cannot set an appropriate final status for the application.

This PR removes the use of `System.exit` to exit the RRunner. Instead, when the R process running a SparkR script returns an exit code other than 0, it throws `SparkUserAppException`, which is caught by ApplicationMaster so that ApplicationMaster knows the application failed. For other failures, it throws `SparkException`.
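The pattern can be sketched in Python: run the user script as a child process and turn a non-zero exit code into an exception that the caller can catch, rather than exiting the whole process. The exception names mirror those in the message, but these Python classes and the `run_user_script` helper are hypothetical.

```python
import subprocess
import sys


class SparkException(Exception):
    """Stand-in for failures other than a non-zero exit of the user script."""


class SparkUserAppException(Exception):
    """Stand-in carrying the user application's exit code."""

    def __init__(self, exit_code):
        super().__init__(f"User application exited with {exit_code}")
        self.exit_code = exit_code


def run_user_script(argv):
    """Run the script; raise instead of exiting so the caller sets the status."""
    try:
        completed = subprocess.run(argv)
    except OSError as exc:
        raise SparkException(f"could not launch {argv[0]}") from exc
    if completed.returncode != 0:
        raise SparkUserAppException(completed.returncode)


def final_status(argv):
    """What a caller such as an application master might record."""
    try:
        run_user_script(argv)
        return "SUCCEEDED"
    except SparkUserAppException as exc:
        return f"FAILED (exit code {exc.exit_code})"
```

Here `final_status([sys.executable, "-c", "import sys; sys.exit(3)"])` reports the failure together with the exit code, which is information a direct process exit would have hidden from the caller.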

Author: Sun Rui <rui.sun@intel.com>

Closes #8938 from sun-rui/SPARK-10851.

c7b29ae6