  1. Nov 28, 2015
    • [SPARK-9319][SPARKR] Add support for setting column names, types · c793d2d9
      felixcheung authored
      Add support for colnames, colnames<-, and coltypes<-.
      Also added tests for names and names<-, which previously had no tests.
      
      I merged with PR 8984 (coltypes), clicked the wrong thing, and screwed up the PR. Recreated it here. Was #9218
      
      shivaram sun-rui
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9654 from felixcheung/colnamescoltypes.
    • [SPARK-12029][SPARKR] Improve column functions signature, param check, tests, fix doc and add examples · 28e46ab4
      felixcheung authored
      
      shivaram sun-rui
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #10019 from felixcheung/rfunctionsdoc.
    • [SPARK-12028] [SQL] get_json_object returns an incorrect result when the value is null literals · 149cd692
      gatorsmile authored
      When calling `get_json_object` for the following two cases, both results are `"null"`:
      
      ```scala
          // assumes: import sqlContext.implicits._ and org.apache.spark.sql.{DataFrame, functions}
          // Case 1: f1 is a JSON null literal.
          val tuple: Seq[(String, String)] = ("5", """{"f1": null}""") :: Nil
          val df: DataFrame = tuple.toDF("key", "jstring")
          val res = df.select(functions.get_json_object($"jstring", "$.f1")).collect()
      ```
      ```scala
          // Case 2: f1 is the string "null", so "null" is the expected result here.
          val tuple2: Seq[(String, String)] = ("5", """{"f1": "null"}""") :: Nil
          val df2: DataFrame = tuple2.toDF("key", "jstring")
          val res3 = df2.select(functions.get_json_object($"jstring", "$.f1")).collect()
      ```
      
      Fixed the problem and also added a test case.
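      A sketch of the expected contrast after the fix (the exact rendering of the collected rows is an assumption here):
      ```scala
      // Expected contrast after the fix (sketch; df and df2 as defined above):
      df.select(functions.get_json_object($"jstring", "$.f1")).head
      // => Row(null)     JSON null literal yields SQL NULL
      df2.select(functions.get_json_object($"jstring", "$.f1")).head
      // => Row("null")   the string value "null" is preserved
      ```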
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #10018 from gatorsmile/get_json_object.
  2. Nov 26, 2015
    • [SPARK-11997] [SQL] NPE when save a DataFrame as parquet and partitioned by long column · a374e20b
      Dilip Biswal authored
      Check the partition column's nullability while building the partition spec.
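      A minimal repro sketch, assuming it is a null value in the long partition column that trips the spec building (schema and output path are illustrative):
      ```scala
      import sqlContext.implicits._
      import org.apache.spark.sql.functions.when

      // Hypothetical repro: "part" is a nullable long column used for partitioning.
      val df = sqlContext.range(0, 2)
        .withColumn("part", when($"id" === 0L, $"id"))  // null when id != 0
      // Before this patch, building the partition spec could hit an NPE on the null.
      df.write.partitionBy("part").parquet("/tmp/spark-11997-demo")
      ```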
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #10001 from dilipbiswal/spark-11997.
    • Fix style violation for b63938a8 · 10e315c2
      Reynold Xin authored
    • [SPARK-11991] fixes · 5eaed4e4
      Jeremy Derr authored
      If `--private-ips` is required but not provided, spark_ec2.py may misbehave, for example by attempting to ssh to localhost when verifying ssh connectivity to the cluster.
      
      This fixes that behavior by raising a `UsageError` exception if `get_dns_name` is unable to determine a hostname.
      
      Author: Jeremy Derr <jcderr@radius.com>
      
      Closes #9975 from jcderr/SPARK-11991/ec_spark.py_hostname_check.
    • [SPARK-11778][SQL] add regression test · 4d4cbc03
      Huaxin Gao authored
      Fix regression test for SPARK-11778.
      marmbrus Could you please take a look? Thank you very much!
      
      Author: Huaxin Gao <huaxing@oc0558782468.ibm.com>
      
      Closes #9890 from huaxingao/spark-11778-regression-test.
    • [SPARK-11917][PYSPARK] Add SQLContext#dropTempTable to PySpark · d8220885
      Jeff Zhang authored
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #9903 from zjffdu/SPARK-11917.
    • [SPARK-11881][SQL] Fix for postgresql fetchsize > 0 · b63938a8
      mariusvniekerk authored
      Reference: https://jdbc.postgresql.org/documentation/head/query.html#query-with-cursor
      For PostgreSQL to honor a non-zero fetchSize setting, the Connection's autoCommit needs to be set to false; otherwise it quietly ignores the fetchSize setting.
      
      This adds a new side-effecting, dialect-specific beforeFetch method that fires before a select query is run.
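      For reference, a plain-JDBC sketch of the behavior this hook accounts for (connection string and table name are illustrative):
      ```scala
      import java.sql.DriverManager

      val conn = DriverManager.getConnection("jdbc:postgresql://localhost/testdb")
      conn.setAutoCommit(false)  // required, or PostgreSQL silently ignores fetchSize
      val stmt = conn.createStatement()
      stmt.setFetchSize(50)      // rows now stream through a cursor, 50 at a time
      val rs = stmt.executeQuery("SELECT * FROM big_table")
      while (rs.next()) { /* process one row at a time */ }
      conn.close()
      ```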
      
      Author: mariusvniekerk <marius.v.niekerk@gmail.com>
      
      Closes #9861 from mariusvniekerk/SPARK-11881.
    • [SPARK-12011][SQL] Stddev/Variance etc should support columnName as arguments · 6f6bb0e8
      Yanbo Liang authored
      The following Spark SQL aggregate functions:
      ```
      stddev
      stddev_pop
      stddev_samp
      variance
      var_pop
      var_samp
      skewness
      kurtosis
      collect_list
      collect_set
      ```
      should support `columnName` as arguments, like the other aggregate functions (max/min/count/sum).
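      A usage sketch (`df` and its numeric column `value` are hypothetical):
      ```scala
      import org.apache.spark.sql.functions._

      df.agg(stddev("value"), variance("value"))            // columnName arguments (this change)
      df.agg(stddev(col("value")), variance(col("value")))  // Column arguments, as before
      ```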
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9994 from yanboliang/SPARK-12011.
    • [SPARK-11996][CORE] Make the executor thread dump work again · 0c1e72e7
      Shixiong Zhu authored
      In the previous implementation, the driver needed to know the executor's listening address in order to send the thread dump request. However, with Netty RPC the executor doesn't listen on any port, so the executor thread dump feature was broken.
      
      This patch fixes that by making the driver use the endpointRef stored in BlockManagerMasterEndpoint to send the thread dump request.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #9976 from zsxwing/executor-thread-dump.
    • doc typo: "classificaion" -> "classification" · 4376b5be
      muxator authored
      Author: muxator <muxator@users.noreply.github.com>
      
      Closes #10008 from muxator/patch-1.
    • [SPARK-11973][SQL] Improve optimizer code readability. · de28e4d4
      Reynold Xin authored
      This is a followup for https://github.com/apache/spark/pull/9959.
      
      I added more documentation and rewrote some monadic code into simpler ifs.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #9995 from rxin/SPARK-11973.
    • [SPARK-11998][SQL][TEST-HADOOP2.0] When downloading Hadoop artifacts from maven, we need to try to download the version that is used by Spark · ad765623
      Yin Huai authored
      
      If we need to download Hive/Hadoop artifacts, try to download a Hadoop that matches the Hadoop used by Spark. If the Hadoop artifact cannot be resolved (e.g. the Hadoop version is a vendor-specific version like 2.0.0-cdh4.1.1), we fall back to Hadoop 2.4.0 (we used to hard-code this as the version downloaded from maven) and we do not share Hadoop classes.
      
      I tested this matching logic on my laptop with the following confs (the same confs used by our builds). All tests passed.
      ```
      build/sbt -Phadoop-1 -Dhadoop.version=1.2.1 -Pkinesis-asl -Phive-thriftserver -Phive
      build/sbt -Phadoop-1 -Dhadoop.version=2.0.0-mr1-cdh4.1.1 -Pkinesis-asl -Phive-thriftserver -Phive
      build/sbt -Pyarn -Phadoop-2.2 -Pkinesis-asl -Phive-thriftserver -Phive
      build/sbt -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Pkinesis-asl -Phive-thriftserver -Phive
      ```
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9979 from yhuai/versionsSuite.
    • [SPARK-11863][SQL] Unable to resolve order by if it contains mixture of aliases and real columns · bc16a675
      Dilip Biswal authored
      This is based on https://github.com/apache/spark/pull/9844, with some bug fixes and cleanup.
      
      The problem is that a normal operator should be resolved based on its child, but the `Sort` operator can also be resolved based on its grandchild. So we have three rules that can resolve `Sort`: `ResolveReferences`, `ResolveSortReferences` (if the grandchild is a `Project`) and `ResolveAggregateFunctions` (if the grandchild is an `Aggregate`).
      For example, in `select c1 as a, c2 as b from tab group by c1, c2 order by a, c2`, we need to resolve `a` and `c2` for `Sort`. First, `a` is resolved by `ResolveReferences` based on the child; then, when we reach `ResolveAggregateFunctions`, we try to resolve both `a` and `c2` based on the grandchild, which fails because `a` is not a legal aggregate expression.
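      A sketch reproducing the failing query from the description (table and columns are hypothetical):
      ```scala
      // Before this fix, ORDER BY failed to resolve here because the alias `a`
      // and the real column `c2` had to be handled by different resolution rules.
      sqlContext.sql(
        """SELECT c1 AS a, c2 AS b
          |FROM tab
          |GROUP BY c1, c2
          |ORDER BY a, c2""".stripMargin)
      ```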
      
      Whoever merges this PR, please give the credit to dilipbiswal.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #9961 from cloud-fan/sort.
    • [SPARK-12005][SQL] Work around VerifyError in HyperLogLogPlusPlus. · 001f0528
      Marcelo Vanzin authored
      Just move the code around a bit; that seems to make the JVM happy.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9985 from vanzin/SPARK-12005.
    • [SPARK-11973] [SQL] push filter through aggregation with alias and literals · 27d69a05
      Davies Liu authored
      Currently, a filter can't be pushed through an aggregation with aliases or literals; this patch fixes that.
      
      After this patch, TPC-DS query 4 goes from 141 seconds down to 13 seconds (a 10x improvement).
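      An illustrative sketch (`df`, `c1`, `c2` are hypothetical) of the kind of plan this now improves:
      ```scala
      import sqlContext.implicits._
      import org.apache.spark.sql.functions.sum

      val agg = df.groupBy($"c1")
        .agg(sum($"c2").as("total"))
        .select($"c1".as("key"), $"total")
      // The filter references the alias "key"; the optimizer can now rewrite it
      // as a filter on c1 and push it below the aggregation, shrinking its input.
      agg.filter($"key" === 1)
      ```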
      
      cc nongli  yhuai
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9959 from davies/push_filter2.