Commits · 20d8ef858af6e13db59df118b562ea33cba5464d · cs525-sp18-g07 / spark

Jan 13, 2016

[SPARK-12703][MLLIB][DOC][PYTHON] Fixed pyspark.mllib.clustering.KMeans user guide example · 20d8ef85

Joseph K. Bradley authored 9 years ago

Fixed WSSSE computeCost in Python mllib KMeans user guide example by using new computeCost method API in Python.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10707 from jkbradley/kmeans-doc-fix.

20d8ef85

[SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large · 021dafc6

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-12026

The issue is valid as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.

I tested locally: the change improves performance, and the running time stays stable.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10146 from hhbyyh/chiSq.

021dafc6

[SPARK-12400][SHUFFLE] Avoid generating temp shuffle files for empty partitions · cd81fc9e

jerryshao authored 9 years ago

The problem lies in `BypassMergeSortShuffleWriter`: an empty partition also generates a temp shuffle file of several bytes. This change creates the file only when the partition is not empty.

The problem is specific to this writer; `HashShuffleWriter` has no such issue.
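The idea of creating a file only for non-empty partitions can be sketched as follows. This is a minimal illustration in plain Python, not the `BypassMergeSortShuffleWriter` code; the function name `write_partitions` and the file naming are hypothetical.

```python
import os
import tempfile


def write_partitions(partitions, out_dir):
    """Write one temp file per non-empty partition; return {partition_id: path}."""
    written = {}
    for pid, records in enumerate(partitions):
        if not records:
            # Empty partition: skip file creation entirely.
            continue
        # Hypothetical file name, chosen only for this sketch.
        path = os.path.join(out_dir, f"temp_shuffle_{pid}")
        with open(path, "w") as f:
            f.writelines(f"{r}\n" for r in records)
        written[pid] = path
    return written


with tempfile.TemporaryDirectory() as d:
    files = write_partitions([["a", "b"], [], ["c"], []], d)
    print(sorted(files))       # ids of partitions that produced a file
    print(len(os.listdir(d)))  # number of files actually on disk
```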

Please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #10376 from jerryshao/SPARK-12400.

cd81fc9e

[SPARK-12690][CORE] Fix NPE in UnsafeInMemorySorter.free() · eabc7b8e

Carson Wang authored 9 years ago

I hit the exception below. The `UnsafeKVExternalSorter` does pass `null` as the consumer when creating an `UnsafeInMemorySorter`. Normally the NPE doesn't occur because the `inMemSorter` is set to null later and the `free()` method is not called. It happens when there is another exception like OOM thrown before setting `inMemSorter` to null. Anyway, we can add the null check to avoid it.

```
ERROR spark.TaskContextImpl: Error in TaskCompletionListener
java.lang.NullPointerException
        at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:110)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:288)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:141)
        at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
        at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
        at org.apache.spark.scheduler.Task.run(Task.scala:91)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
```

Author: Carson Wang <carson.wang@intel.com>

Closes #10637 from carsonwang/FixNPE.

eabc7b8e

[SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values" · cbbcd8e4

Reynold Xin authored 9 years ago

This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.

Prior to this pull request, each even position in "branches" represented the condition for a branch, and each odd position represented the corresponding value. Using them was confusing and required a lot of sliding windows or grouped(2) calls.
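To illustrate the difference between the two layouts, here is a small sketch in plain Python (not the Catalyst code). It assumes that in the flat layout a trailing element after the last condition/value pair is the else value; the commit message does not state this, and the function name `split_branches` is hypothetical.

```python
def split_branches(branches):
    """Split a flat [cond0, val0, cond1, val1, ..., (else)] list.

    Returns (pairs, else_value); else_value is None when absent.
    """
    # Assumption for this sketch: an odd length means the last
    # element is the else value.
    has_else = len(branches) % 2 == 1
    body = branches[:-1] if has_else else branches
    # Even positions are conditions, odd positions are values.
    pairs = list(zip(body[0::2], body[1::2]))
    else_value = branches[-1] if has_else else None
    return pairs, else_value


old = ["a > 1", "'big'", "a > 0", "'small'", "'none'"]
pairs, else_value = split_branches(old)
print(pairs)       # [('a > 1', "'big'"), ('a > 0', "'small'")]
print(else_value)  # 'none'
```

With explicit pairs and a separate else value, callers no longer need index arithmetic to tell conditions from values.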

Author: Reynold Xin <rxin@databricks.com>

Closes #10734 from rxin/simplify-case.

cbbcd8e4

[SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row · c2ea79f9

Wenchen Fan authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-12642

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10694 from cloud-fan/hash-expr.

c2ea79f9

[SPARK-12268][PYSPARK] Make pyspark shell pythonstartup work under python3 · e4e0b3f7

Erik Selin authored 9 years ago

This replaces the `execfile` used for running custom python shell scripts
with explicit open, compile and exec (as recommended by 2to3). The reason
for this change is to make the pythonstartup option compatible with python3.
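The open, compile and exec pattern mentioned above looks roughly like this. It is a sketch of the general Python 3 replacement for `execfile`, not the exact lines from the pyspark shell; the helper name `run_startup_file` and the temporary script are made up for the example.

```python
import os
import tempfile


def run_startup_file(path, namespace):
    """Python 3 compatible stand-in for execfile(path, namespace)."""
    with open(path) as f:
        # Passing the path to compile() keeps the file name in tracebacks.
        code = compile(f.read(), path, "exec")
    exec(code, namespace)


# Usage with a throwaway script standing in for a PYTHONSTARTUP file.
with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as tmp:
    tmp.write("greeting = 'hello from startup'\n")
ns = {}
run_startup_file(tmp.name, ns)
os.remove(tmp.name)
print(ns["greeting"])  # hello from startup
```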

Author: Erik Selin <erik.selin@gmail.com>

Closes #10255 from tyro89/pythonstartup-python3.

e4e0b3f7

[SPARK-9383][PROJECT-INFRA] PR merge script should reset back to previous branch when possible · 97e0c7c5

Josh Rosen authored 9 years ago

This patch modifies our PR merge script to reset back to a named branch when restoring the original checkout upon exit. When the committer is originally checked out to a detached head, then they will be restored back to that same ref (the same as today's behavior).

This is a slightly updated version of #7569, with an extra fix to handle the detached head corner-case.
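The restore decision described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the merge script itself: the function name `ref_to_restore` is hypothetical, and the two arguments stand for the outputs of `git rev-parse --abbrev-ref HEAD` and `git rev-parse HEAD`.

```python
def ref_to_restore(abbrev_ref, head_sha):
    """Pick what to check out again when the merge script exits.

    abbrev_ref: output of `git rev-parse --abbrev-ref HEAD`
    head_sha:   output of `git rev-parse HEAD`
    """
    # In detached-HEAD state git prints the literal string "HEAD",
    # so fall back to the commit hash.
    if abbrev_ref.strip() == "HEAD":
        return head_sha.strip()
    # Otherwise restore the named branch rather than a bare commit.
    return abbrev_ref.strip()


print(ref_to_restore("master\n", "97e0c7c5\n"))  # master
print(ref_to_restore("HEAD\n", "97e0c7c5\n"))    # 97e0c7c5
```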

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10709 from JoshRosen/SPARK-9383.

97e0c7c5

[SPARK-12761][CORE] Remove duplicated code · 38148f73

Jakob Odersky authored 9 years ago

Removes some duplicated code that was reintroduced during a merge.

Author: Jakob Odersky <jodersky@gmail.com>

Closes #10711 from jodersky/repl-2.11-duplicate.

38148f73

[SPARK-12805][MESOS] Fixes documentation on Mesos run modes · cc91e218

Luc Bourlier authored 9 years ago

The default run mode has changed, but the documentation didn't fully reflect the change.

Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #10740 from skyluc/issue/mesos-modes-doc.

cc91e218

[SPARK-9297] [SQL] Add covar_pop and covar_samp · 63eee86c

Liang-Chi Hsieh authored 9 years ago

JIRA: https://issues.apache.org/jira/browse/SPARK-9297

Add two aggregation functions: covar_pop and covar_samp.
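For reference, the two statistics differ only in the divisor. The sketch below uses the standard textbook definitions in plain Python; it is not the Spark implementation, and it does not model how the SQL functions treat nulls, empty input, or a single row.

```python
def covar_pop(xs, ys):
    """Population covariance: sum of centered products divided by n."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n


def covar_samp(xs, ys):
    """Sample covariance: same sum divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


xs, ys = [1, 2, 3, 4], [2, 4, 6, 8]
print(covar_pop(xs, ys))   # 2.5
print(covar_samp(xs, ys))  # 10 / 3
```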

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #10029 from viirya/covar-funcs.

63eee86c

[SPARK-12692][BUILD][HOT-FIX] Fix the scala style of KinesisBackedBlockRDDSuite.scala. · d6fd9b37

Yin Huai authored 9 years ago

https://github.com/apache/spark/pull/10736 was merged yesterday and caused master to start failing because of a style issue.

Author: Yin Huai <yhuai@databricks.com>

Closes #10742 from yhuai/fixStyle.

d6fd9b37

[SPARK-12692][BUILD] Enforce style checking about white space before comma · 3d81d63f

Kousuke Saruta authored 9 years ago

This is the final PR for SPARK-12692.
We have removed all white space before commas from the code, so let's enforce the style check.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10736 from sarutak/SPARK-12692-followup-enforce-checking.

3d81d63f

[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before ",") · cb7b864a

Kousuke Saruta authored 9 years ago

Fix the style violation (space before , and :).
This PR is a follow-up to #10643 and a rework of #10685.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10732 from sarutak/SPARK-12692-followup-sql.

cb7b864a

Jan 12, 2016

[SPARK-12558][SQL] AnalysisException when multiple functions applied in GROUP BY clause · dc7b3870

Dilip Biswal authored 9 years ago

cloud-fan, can you please take a look?

In this case, we are failing during check analysis while validating the aggregation expression. I have added a semanticEquals for HiveGenericUDF to fix this. Please let me know if this is the right way to address this issue.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #10520 from dilipbiswal/spark-12558.

dc7b3870

[SPARK-12692][BUILD][CORE] Scala style: Fix the style violation (Space before ",") · f14922cf

Kousuke Saruta authored 9 years ago

Fix the style violation (space before , and :).
This PR is a followup for #10643

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10719 from sarutak/SPARK-12692-followup-core.

f14922cf

[SPARK-12788][SQL] Simplify BooleanEquality by using casts. · b3b9ad23

Reynold Xin authored 9 years ago

Author: Reynold Xin <rxin@databricks.com>

Closes #10730 from rxin/SPARK-12788.

b3b9ad23

[SPARK-12785][SQL] Add ColumnarBatch, an in memory columnar format for execution. · 92470849

Nong Li authored 9 years ago

There are many potential benefits of having an efficient in memory columnar format as an alternative
to UnsafeRow. This patch introduces ColumnarBatch/ColumnarVector which starts this effort. The
remaining implementation can be done as follow up patches.

As stated in the JIRA, there are useful external components that operate on memory in a
simple columnar format. ColumnarBatch would serve that purpose and could serve as a
zero-serialization/zero-copy exchange for this use case.

This patch supports running the underlying data either on heap or off heap. On heap runs a bit
faster but we would need off heap for zero-copy exchanges. Currently, this mode is hidden behind one
interface (ColumnVector).

This differs from Parquet or the existing columnar cache because this is *not* intended to be used
as a storage format. The focus is entirely on CPU efficiency as we expect to only have 1 of these
batches in memory per task. The layout of the values is just dense arrays of the value type.

Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>

Closes #10628 from nongli/spark-12635.

92470849

[SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1 · 4f60651c

Shixiong Zhu authored 9 years ago

- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
  - Still keep the change that reads the checkpoint only once. This is a manual change and is worth a careful look. https://github.com/zsxwing/spark/commit/bfd4b5c040eb29394c3132af3c670b1a7272457c
- [x] Verify that there is no leak any more after reverting our workarounds

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10692 from zsxwing/py4j-0.9.1.

4f60651c

[SPARK-12724] SQL generation support for persisted data source tables · 8ed5f12d

Cheng Lian authored 9 years ago

This PR implements SQL generation support for persisted data source tables. A new field `metastoreTableIdentifier: Option[TableIdentifier]` is added to `LogicalRelation`. When a `LogicalRelation` representing a persisted data source relation is created, this field holds the database name and table name of the relation.

Author: Cheng Lian <lian@databricks.com>

Closes #10712 from liancheng/spark-12724-datasources-sql-gen.

8ed5f12d

Revert "[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":")" · 0d543b98

Reynold Xin authored 9 years ago

This reverts commit 8cfa218f.

0d543b98

[SPARK-12768][SQL] Remove CaseKeyWhen expression · 0ed430e3

Reynold Xin authored 9 years ago

This patch removes CaseKeyWhen expression and replaces it with a factory method that generates the equivalent CaseWhen. This reduces the amount of code we'd need to maintain in the future for both code generation and optimizer.

Note that we introduced CaseKeyWhen to avoid duplicate evaluations of the key. This is no longer a problem because we now have common subexpression elimination.
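The rewrite from the keyed form to the plain form can be pictured as follows. This is a plain-Python sketch, not the Catalyst factory method; the name `case_key_when`, its signature, and the string representation of expressions are all assumptions made for illustration.

```python
def case_key_when(key, branches, else_value=None):
    """Rewrite CASE key WHEN v THEN r ... [ELSE e] into CASE WHEN form.

    branches is a list of (match_value, result) pairs.
    Returns (list of (condition, result) pairs, else_value).
    """
    # Each keyed branch becomes an explicit equality test on the key.
    # The key now appears once per branch, which is the duplicate
    # evaluation that common subexpression elimination takes care of.
    conditions = [(f"{key} = {value}", result) for value, result in branches]
    return conditions, else_value


print(case_key_when("a", [("1", "'x'"), ("2", "'y'")], "'z'"))
# ([('a = 1', "'x'"), ('a = 2', "'y'")], "'z'")
```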

Author: Reynold Xin <rxin@databricks.com>

Closes #10722 from rxin/SPARK-12768.

0ed430e3

[SPARK-9843][SQL] Make catalyst optimizer pass pluggable at runtime · 508592b1

Robert Kruszewski authored 9 years ago

Let me know whether you'd like to see it in another place.

Author: Robert Kruszewski <robertk@palantir.com>

Closes #10210 from robert3005/feature/pluggable-optimizer.

508592b1

[SPARK-12762][SQL] Add unit test for SimplifyConditionals optimization rule · 1d888795

Reynold Xin authored 9 years ago

This pull request does a few small things:

1. Separated if simplification from BooleanSimplification and created a new rule SimplifyConditionals. In the future we can also simplify other conditional expressions here.

2. Added unit test for SimplifyConditionals.

3. Renamed SimplifyCaseConversionExpressionsSuite to SimplifyStringCaseConversionSuite
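As a rough picture of what an "if simplification" rule does, the sketch below folds `if` expressions whose condition is a constant. It is plain Python over tuples, not the Catalyst rule; the exact set of cases the real rule handles is not listed in the commit message, so the constant-condition folding shown here is an assumption.

```python
def simplify_conditionals(expr):
    """Fold ('if', cond, then, else) tuples whose condition is constant."""
    if not (isinstance(expr, tuple) and len(expr) == 4 and expr[0] == "if"):
        # Anything that is not an if-expression is left unchanged.
        return expr
    _, cond, then_value, else_value = expr
    # Simplify children first so nested constant ifs collapse too.
    cond = simplify_conditionals(cond)
    then_value = simplify_conditionals(then_value)
    else_value = simplify_conditionals(else_value)
    if cond is True:
        return then_value
    if cond is False:
        return else_value
    return ("if", cond, then_value, else_value)


print(simplify_conditionals(("if", True, "a", "b")))       # a
print(simplify_conditionals(("if", "x > 1", "a", "b")))    # unchanged
```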

Author: Reynold Xin <rxin@databricks.com>

Closes #10716 from rxin/SPARK-12762.

1d888795

[SPARK-12582][TEST] IndexShuffleBlockResolverSuite fails in windows · 7e15044d

Yucai Yu authored 9 years ago

* IndexShuffleBlockResolverSuite fails on Windows because a file is not closed.
* Move IndexShuffleBlockResolverSuite.scala from "test/java" to "test/scala".

https://issues.apache.org/jira/browse/SPARK-12582

Author: Yucai Yu <yucai.yu@intel.com>

Closes #10526 from yucai/master.

7e15044d

[SPARK-12638][API DOC] Parameter explanation not very accurate for rdd function "aggregate" · 9f0995bb

Tommy YU authored 9 years ago

Currently, the parameter documentation for the RDD function aggregate is not very clear, especially for "zeroValue".
It helps junior Scala users to know that "zeroValue" takes part in both the "seqOp" and "combOp" phases.
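The effect is easy to see with a non-neutral zero value. The sketch below simulates the two phases in plain Python over a list of partitions; it is not the RDD implementation, and the function name `aggregate` with these parameters is only a stand-in for the real API.

```python
from functools import reduce


def aggregate(partitions, zero_value, seq_op, comb_op):
    """Simulate a two-phase aggregate over a list of partitions."""
    # seqOp phase: every partition is folded starting from zero_value.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # combOp phase: the partial results are folded starting from
    # zero_value once more.
    return reduce(comb_op, partials, zero_value)


def add(a, b):
    return a + b


parts = [[1, 2], [3, 4]]
print(aggregate(parts, 0, add, add))  # 10
print(aggregate(parts, 1, add, add))  # 13: the 1 is added once per partition and once when combining
```

With two partitions, a zero value of 1 is counted three times, which is why the zero value should be neutral for both operations.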

Author: Tommy YU <tummyyu@163.com>

Closes #10587 from Wenpei/rdd_aggregate_doc.

9f0995bb

[SPARK-5273][MLLIB][DOCS] Improve documentation examples for LinearRegression · 9c7f34af

Sean Owen authored 9 years ago

Use a much smaller step size in LinearRegressionWithSGD MLlib examples to achieve a reasonable RMSE.

Our training folks hit this exact same issue when concocting an example and had the same solution.

Author: Sean Owen <sowen@cloudera.com>

Closes #10675 from srowen/SPARK-5273.

9c7f34af

[SPARK-7615][MLLIB] MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero · c48f2a3a

Sean Owen authored 9 years ago

Cosine similarity with 0 vector should be 0

Related to https://github.com/apache/spark/pull/10152
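The guard in question amounts to checking the norms before dividing. The following is a minimal plain-Python sketch of cosine similarity with that guard, not the Word2Vec code; the function name `cosine_similarity` is chosen for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity that returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Avoid dividing by zero: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([0.0, 0.0], [1.0, 2.0]))  # 0.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (orthogonal)
```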

Author: Sean Owen <sowen@cloudera.com>

Closes #10696 from srowen/SPARK-7615.

c48f2a3a

[SPARK-12692][BUILD][SQL] Scala style: Fix the style violation (Space before "," or ":") · 8cfa218f

Kousuke Saruta authored 9 years ago

Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10718 from sarutak/SPARK-12692-followup-sql.

8cfa218f

Jan 11, 2016

[SPARK-12692][BUILD][YARN] Scala style: Fix the style violation (Space before "," or ":") · 112abf91

Kousuke Saruta authored 9 years ago

Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10686 from sarutak/SPARK-12692-followup-yarn.

112abf91

[SPARK-12692][BUILD][STREAMING] Scala style: Fix the style violation (Space before "," or ":") · 39ae04e6

Kousuke Saruta authored 9 years ago

Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10685 from sarutak/SPARK-12692-followup-streaming.

39ae04e6

[SPARK-11823] Ignores HiveThriftBinaryServerSuite's test jdbc cancel · aaa2c3b6

Yin Huai authored 9 years ago

https://issues.apache.org/jira/browse/SPARK-11823

This test often hangs and times out, leaving hanging processes. Let's ignore it for now and improve the test.

Author: Yin Huai <yhuai@databricks.com>

Closes #10715 from yhuai/SPARK-11823-ignore.

aaa2c3b6

[SPARK-12498][SQL][MINOR] BooleanSimplication simplification · 36d49350

Cheng Lian authored 9 years ago

Scala syntax allows binary case classes to be used as infix operator in pattern matching. This PR makes use of this syntax sugar to make `BooleanSimplification` more readable.

Author: Cheng Lian <lian@databricks.com>

Closes #10445 from liancheng/boolean-simplification-simplification.

36d49350

[SPARK-12742][SQL] org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due... · 473907ad

wangfei authored 9 years ago

[SPARK-12742][SQL] org.apache.spark.sql.hive.LogicalPlanToSQLSuite failure due to Table already exists exception

```
[info] Exception encountered when attempting to run a suite with class name:
org.apache.spark.sql.hive.LogicalPlanToSQLSuite *** ABORTED *** (325 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Table `t1` already exists.;
[info]   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:296)
[info]   at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:285)
[info]   at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:33)
[info]   at org.scalatest.BeforeAndAfterAll$class.beforeAll(BeforeAndAfterAll.scala:187)
[info]   at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.beforeAll(LogicalPlanToSQLSuite.scala:23)
[info]   at org.scalatest.BeforeAndAfterAll$class.run(BeforeAndAfterAll.scala:253)
[info]   at org.apache.spark.sql.hive.LogicalPlanToSQLSuite.run(LogicalPlanToSQLSuite.scala:23)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:462)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:671)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:296)
[info]   at sbt.ForkMain$Run$2.call(ForkMain.java:286)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[info]   at java.lang.Thread.run(Thread.java:745)
```

/cc liancheng

Author: wangfei <wangfei_hello@126.com>

Closes #10682 from scwf/fix-test.

473907ad

[SPARK-12576][SQL] Enable expression parsing in CatalystQl · fe9eb0b0

Herman van Hovell authored 9 years ago

The PR allows us to use the new SQL parser to parse SQL expressions such as: `1 + sin(x*x)`

We enable this functionality in this PR, but we will not start using this actively yet. This will be done as soon as we have reached grammar parity with the existing parser stack.

cc rxin

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10649 from hvanhovell/SPARK-12576.

fe9eb0b0

[SPARK-10809][MLLIB] Single-document topicDistributions method for LocalLDAModel · bbea8885

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-10809

We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries which avoid RDD operations. Currently, the user must use an RDD of documents.

Also adds some missing asserts.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9484 from hhbyyh/ldaTopicPre.

bbea8885

[SPARK-12685][MLLIB] word2vec trainWordsCount gets overflow · 4f8eefa3

Yuhao Yang authored 9 years ago

jira: https://issues.apache.org/jira/browse/SPARK-12685
The `word2vec` log reports `trainWordsCount = -785727483` during computation over a large dataset.

The priority was updated because the overflow affects the computation:
`alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
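To see why a negative count matters, the sketch below wraps a large count the way a 32-bit signed integer would and plugs it into the formula above. It assumes the counter was a 32-bit signed integer, which the message does not state; the true count, learning rate, partition count and word count used here are hypothetical values picked so that the wrapped result matches the logged number.

```python
def to_int32(n):
    """Wrap an arbitrary Python int the way a 32-bit signed int would."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n


def alpha(learning_rate, num_partitions, word_count, train_words_count):
    """The decay formula quoted in the commit message."""
    return learning_rate * (
        1 - num_partitions * float(word_count) / (train_words_count + 1)
    )


true_count = 3_509_239_813      # hypothetical count that exceeds 2**31 - 1
wrapped = to_int32(true_count)
print(wrapped)                  # -785727483

# Hypothetical parameters, for illustration only.
lr, parts, seen = 0.025, 1, 1_000_000
print(alpha(lr, parts, seen, true_count) < lr)  # True: rate decays as intended
print(alpha(lr, parts, seen, wrapped) > lr)     # True: rate grows instead
```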

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10627 from hhbyyh/w2voverflow.

4f8eefa3

[SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single... · ee4ee02b

Yanbo Liang authored 9 years ago

[SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft

PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala does.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10552 from yanboliang/spark-12603.

ee4ee02b

[SPARK-12758][SQL] add note to Spark SQL Migration guide about TimestampType casting · a767ee8a

Brandon Bradley authored 9 years ago

Warning users about casting changes.

Author: Brandon Bradley <bradleytastic@gmail.com>

Closes #10708 from blbradley/spark-12758.

a767ee8a

[SPARK-12734][HOTFIX] Build changes must trigger all tests; clean after install in dep tests · a4499145

Josh Rosen authored 9 years ago

This patch fixes a build/test issue caused by the combination of #10672 and a latent issue in the original `dev/test-dependencies` script.

First, changes which _only_ touched build files were not triggering full Jenkins runs, making it possible for a build change to be merged even though it could cause failures in other tests. The `root` build module now depends on `build`, so all tests will now be run whenever a build-related file is changed.

I also added a `clean` step to the Maven install step in `dev/test-dependencies` in order to address an issue where the dummy JARs stuck around and caused "multiple assembly JARs found" errors in tests.

/cc zsxwing

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10704 from JoshRosen/fix-build-test-problems.

a4499145