  1. Sep 15, 2015
    • Josh Rosen's avatar
      [SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator · 38700ea4
      Josh Rosen authored
      When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop.
      
      This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Another contributing factor is that we have many similarly named identifiers with different semantics but the same data types (e.g. attemptNumber and taskAttemptId), and inconsistent variable naming makes them difficult to distinguish.
      
      This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code.
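
      The failure mode can be illustrated with a small Python sketch (hypothetical names, not Spark's actual code): the coordinator must release the commit lock by comparing the *same* identifier it stored. If the release path compared a different unit (a global attempt id instead of the per-task attempt number), the comparison would never match and the lock would never be freed.

      ```python
      # Hypothetical sketch (not Spark's actual code) of why mixing two
      # identifiers leaves the commit lock held forever.

      class OutputCommitCoordinator:
          def __init__(self):
              # (stage, partition) -> authorized task attempt *number*
              self.authorized = {}

          def can_commit(self, stage, partition, attempt_number):
              key = (stage, partition)
              if key not in self.authorized:
                  self.authorized[key] = attempt_number  # grant the lock
              return self.authorized[key] == attempt_number

          def task_failed(self, stage, partition, attempt_number):
              key = (stage, partition)
              # The fix: compare the same unit (attempt number) that was stored.
              # Comparing a different identifier here would never match, so the
              # lock would never be released and retries would loop forever.
              if self.authorized.get(key) == attempt_number:
                  del self.authorized[key]

      coord = OutputCommitCoordinator()
      assert coord.can_commit(stage=1, partition=0, attempt_number=0)   # granted
      coord.task_failed(stage=1, partition=0, attempt_number=0)         # released
      assert coord.can_commit(stage=1, partition=0, attempt_number=1)   # retry can proceed
      ```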
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8544 from JoshRosen/SPARK-10381.
      38700ea4
    • vinodkc's avatar
      [SPARK-10575] [SPARK CORE] Wrapped RDD.takeSample with Scope · 99ecfa59
      vinodkc authored
      Remove return statements in RDD.takeSample and wrap it withScope
      
      Author: vinodkc <vinod.kc.in@gmail.com>
      Author: vinodkc <vinodkc@users.noreply.github.com>
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #8730 from vinodkc/fix_takesample_return.
      99ecfa59
    • Reynold Xin's avatar
      [SPARK-10612] [SQL] Add prepare to LocalNode. · a63cdc76
      Reynold Xin authored
      The idea is that we should separate the function call that does memory reservation (i.e. prepare()) from the function call that consumes the input (e.g. open()), so that all operators have a chance to reserve memory before any input is consumed.
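      
      A minimal Python sketch of the pattern (assumed names, not Spark's actual LocalNode API): prepare() recurses over the whole operator tree to reserve memory before open() consumes any input.
      
      ```python
      # Sketch: separate memory reservation (prepare) from input
      # consumption (open) across an operator tree.

      class LocalNode:
          def __init__(self, children=()):
              self.children = list(children)
              self.reserved = False

          def prepare(self):
              # Reserve memory for this operator, then recurse, so every
              # node holds its reservation before any input is consumed.
              self.reserved = True
              for child in self.children:
                  child.prepare()

          def open(self):
              assert self.reserved, "prepare() must run before open()"
              for child in self.children:
                  child.open()

      root = LocalNode(children=[LocalNode(), LocalNode()])
      root.prepare()  # all operators reserve memory first
      root.open()     # only then is input consumed
      ```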
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8761 from rxin/SPARK-10612.
      a63cdc76
    • Andrew Or's avatar
      [SPARK-10548] [SPARK-10563] [SQL] Fix concurrent SQL executions · b6e99863
      Andrew Or authored
      *Note: this is for master branch only.* The fix for branch-1.5 is at #8721.
      
      The query execution ID is currently passed from a thread to its children, which is not the intended behavior. This led to `IllegalArgumentException: spark.sql.execution.id is already set` when running queries in parallel, e.g.:
      ```
      (1 to 100).par.foreach { _ =>
        sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count()
      }
      ```
      The cause is that `SparkContext`'s local properties are inherited by default. This patch adds a way to exclude keys we don't want to be inherited, and makes SQL go through that code path.
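      
      The idea can be sketched in a few lines of Python (hypothetical helper, not Spark's actual API): child threads get a copy of the parent's local properties minus any keys marked non-inheritable, such as the SQL execution ID.
      
      ```python
      # Sketch: inherit a copy of the parent's local properties, excluding
      # keys that must stay thread-local (e.g. the SQL execution ID), so
      # parallel queries don't collide on the same execution ID.

      NON_INHERITABLE = {"spark.sql.execution.id"}

      def inherit_properties(parent_props):
          return {k: v for k, v in parent_props.items() if k not in NON_INHERITABLE}

      parent = {"spark.sql.execution.id": "42", "spark.job.description": "my job"}
      child = inherit_properties(parent)
      assert "spark.sql.execution.id" not in child
      assert child["spark.job.description"] == "my job"
      ```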
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8710 from andrewor14/concurrent-sql-executions.
      b6e99863
    • DB Tsai's avatar
      [SPARK-7685] [ML] Apply weights to different samples in Logistic Regression · be52faa7
      DB Tsai authored
      In a fraud detection dataset, almost all the samples are negative while only a couple of them are positive. This kind of highly imbalanced data biases the model toward the negative class, resulting in poor performance. scikit-learn provides a correction allowing users to over-/undersample the samples of each class according to given weights; in auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over-/undersampling the training dataset, which is very expensive.
      http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
      On the other hand, some of the training data may be more important, like the training samples from tenured users, while the training samples from new users may be less important. We should be able to provide an additional "weight: Double" field in the LabeledPoint to weight samples differently in the learning algorithm.
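      
      Folding weights into the objective can be sketched in plain Python (not Spark's implementation): each sample's contribution to the log-loss and gradient is simply scaled by its weight, so doubling a weight is equivalent to duplicating the sample.
      
      ```python
      # Sketch: per-sample weights multiplied into logistic loss and gradient.
      import math

      def sigmoid(z):
          return 1.0 / (1.0 + math.exp(-z))

      def weighted_logloss_grad(X, y, w, weights):
          """Weighted loss and gradient for coefficients w over rows X, labels y."""
          n_features = len(w)
          loss, grad = 0.0, [0.0] * n_features
          for xi, yi, wi in zip(X, y, weights):
              p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
              loss -= wi * (yi * math.log(p) + (1 - yi) * math.log(1 - p))
              for j in range(n_features):
                  grad[j] += wi * (p - yi) * xi[j]  # weight scales each contribution
          return loss, grad

      # Doubling a sample's weight has the same effect as duplicating the sample.
      l_dup, _ = weighted_logloss_grad([[1.0], [1.0]], [1, 1], [0.0], [1.0, 1.0])
      l_wt, _ = weighted_logloss_grad([[1.0]], [1], [0.0], [2.0])
      assert abs(l_dup - l_wt) < 1e-12
      ```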
      
      Author: DB Tsai <dbt@netflix.com>
      Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>
      
      Closes #7884 from dbtsai/SPARK-7685.
      be52faa7
    • Wenchen Fan's avatar
      [SPARK-10475] [SQL] improve column pruning for Project on Sort · 31a229aa
      Wenchen Fan authored
      Sometimes we can't push down the whole `Project` through `Sort`, but we still have a chance to push down part of it.
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #8644 from cloud-fan/column-prune.
      31a229aa
    • Liang-Chi Hsieh's avatar
      [SPARK-10437] [SQL] Support aggregation expressions in Order By · 841972e2
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-10437
      
      If an expression in `SortOrder` is a resolved one, such as `count(1)`, the corresponding `Analyzer` rule that makes it work in ORDER BY will not be applied.
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      
      Closes #8599 from viirya/orderby-agg.
      841972e2
    • Jacek Laskowski's avatar
      [DOCS] Small fixes to Spark on Yarn doc · 416003b2
      Jacek Laskowski authored
      * a follow-up to 16b6d186 as the `--num-executors` flag is not supported.
      * links + formatting
      
      Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
      
      Closes #8762 from jaceklaskowski/docs-spark-on-yarn.
      416003b2
    • Xiangrui Meng's avatar
      Closes #8738 · 0d9ab016
      Xiangrui Meng authored
      Closes #8767
      Closes #2491
      Closes #6795
      Closes #2096
      Closes #7722
      0d9ab016
    • noelsmith's avatar
      [PYSPARK] [MLLIB] [DOCS] Replaced addversion with versionadded in mllib.random · 7ca30b50
      noelsmith authored
      Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
      
      Author: noelsmith <mail@noelsmith.com>
      
      Closes #8773 from noel-smith/mllib-random-versionadded-fix.
      7ca30b50
    • Marcelo Vanzin's avatar
      [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. · 8abef21d
      Marcelo Vanzin authored
      This change does two things:
      
      - tag a few tests and adds the mechanism in the build to be able to disable those tags,
        both in maven and sbt, for both junit and scalatest suites.
      - add some logic to run-tests.py to disable some tags depending on what files have
        changed; that's used to disable expensive tests when a module hasn't explicitly
        been changed, to speed up testing for changes that don't directly affect those
        modules.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8437 from vanzin/test-tags.
      8abef21d
    • Yuhao Yang's avatar
      [SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS · c35fdcb7
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-10491
      
      We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.
      
      Let me know if new UT needed.
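      
      For reference, the semantics of dspr can be sketched in a few lines of Python: it performs the symmetric rank-1 update A := alpha * x * xᵀ + A on a symmetric matrix stored in packed upper-triangular, column-major form (the dense-vector case; the Spark version also supports sparse vectors).
      
      ```python
      # Sketch of BLAS dspr semantics: A := alpha * x * x^T + A, with A
      # stored as the packed upper triangle in column-major order.
      def dspr(n, alpha, x, ap):
          k = 0
          for j in range(n):            # column j
              for i in range(j + 1):    # rows 0..j of the upper triangle
                  ap[k] += alpha * x[i] * x[j]
                  k += 1

      A = [0.0] * 6                     # packed 3x3 upper triangle: 3*(3+1)/2 entries
      dspr(3, 1.0, [1.0, 2.0, 3.0], A)
      assert A == [1.0, 2.0, 4.0, 3.0, 6.0, 9.0]
      ```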
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #8663 from hhbyyh/movedspr.
      c35fdcb7
    • Reynold Xin's avatar
      Update version to 1.6.0-SNAPSHOT. · 09b7e7c1
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8350 from rxin/1.6.
      09b7e7c1
    • Robin East's avatar
      [SPARK-10598] [DOCS] · 6503c4b5
      Robin East authored
      Comments preceding the toMessage method state: "The edge partition is encoded in the lower 30 bytes of the Int, and the position is encoded in the upper 2 bytes of the Int." References to bytes should be changed to bits.
      
      This contribution is my original work and I license the work to the Spark project under its open source license.
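      
      The packing that the comment describes is easy to verify in Python (hypothetical helpers mirroring the documented layout): 2 bits leave room for positions 0-3, and 30 bits for partitions up to 2^30 - 1.
      
      ```python
      # Sketch of the layout: position in the upper 2 bits of a 32-bit Int,
      # edge partition in the lower 30 bits.
      def to_message(position, partition):
          assert 0 <= position < 4 and 0 <= partition < (1 << 30)
          return (position << 30) | partition

      def from_message(msg):
          return (msg >> 30) & 0x3, msg & ((1 << 30) - 1)

      msg = to_message(position=3, partition=12345)
      assert from_message(msg) == (3, 12345)
      ```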
      
      Author: Robin East <robin.east@xense.co.uk>
      
      Closes #8756 from insidedctm/master.
      6503c4b5
    • Jacek Laskowski's avatar
      Small fixes to docs · 833be733
      Jacek Laskowski authored
      Links now work properly + consistent use of *Spark standalone cluster* (Spark uppercase + lowercase for the rest -- seems agreed in the other places in the docs).
      
      Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
      
      Closes #8759 from jaceklaskowski/docs-submitting-apps.
      833be733
  2. Sep 14, 2015
  3. Sep 13, 2015
  4. Sep 12, 2015
    • Josh Rosen's avatar
      [SPARK-10330] Add Scalastyle rule to require use of SparkHadoopUtil JobContext methods · b3a7480a
      Josh Rosen authored
      This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8521 from JoshRosen/SPARK-10330-part2.
      b3a7480a
    • JihongMa's avatar
      [SPARK-6548] Adding stddev to DataFrame functions · f4a22808
      JihongMa authored
      Adding STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
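      
      The one-pass/parallel approach can be sketched in Python (generic Welford update plus Chan's merge step, not the patch's actual code): each partition accumulates (count, mean, M2) in a single pass, and partition states are merged without revisiting the data.
      
      ```python
      # Sketch: one-pass (Welford) variance update plus the merge step
      # used to combine per-partition states in parallel.
      import math

      def update(state, x):
          count, mean, m2 = state
          count += 1
          delta = x - mean
          mean += delta / count
          m2 += delta * (x - mean)        # running sum of squared deviations
          return count, mean, m2

      def merge(a, b):
          (na, ma, m2a), (nb, mb, m2b) = a, b
          n = na + nb
          delta = mb - ma
          mean = ma + delta * nb / n
          m2 = m2a + m2b + delta * delta * na * nb / n
          return n, mean, m2

      data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
      left = right = (0, 0.0, 0.0)
      for x in data[:4]:
          left = update(left, x)
      for x in data[4:]:
          right = update(right, x)
      n, mean, m2 = merge(left, right)
      assert abs(math.sqrt(m2 / n) - 2.0) < 1e-12  # population stddev of this set is 2
      ```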
      
      Author: JihongMa <linlin200605@gmail.com>
      Author: Jihong MA <linlin200605@gmail.com>
      Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
      Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
      
      Closes #6297 from JihongMA/SPARK-SQL.
      f4a22808
    • Sean Owen's avatar
      [SPARK-10547] [TEST] Streamline / improve style of Java API tests · 22730ad5
      Sean Owen authored
      Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #8706 from srowen/SPARK-10547.
      22730ad5
    • Nithin Asokan's avatar
      [SPARK-10554] [CORE] Fix NPE with ShutdownHook · 8285e3b0
      Nithin Asokan authored
      https://issues.apache.org/jira/browse/SPARK-10554
      
      Fixes NPE when ShutdownHook tries to cleanup temporary folders
      
      Author: Nithin Asokan <Nithin.Asokan@Cerner.com>
      
      Closes #8720 from nasokan/SPARK-10554.
      8285e3b0
    • Daniel Imfeld's avatar
      [SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks... · 6d836780
      Daniel Imfeld authored
      [SPARK-10566] [CORE] SnappyCompressionCodec init exception handling masks important error information
      
      When throwing an IllegalArgumentException in SnappyCompressionCodec.init, chain the existing exception. This allows potentially important debugging info to be passed to the user.
      
      Manual testing shows the exception chained properly, and the test suite still looks fine as well.
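      
      The same idea in Python, for illustration (the actual fix is in Scala): chain the root cause onto the wrapping exception so its message and stack trace survive instead of being masked.
      
      ```python
      # Python analogue of the fix: chain the original exception so its
      # details survive instead of being masked by the wrapper.
      def init_codec(load):
          try:
              load()
          except Exception as e:
              # Without "from e" the root cause would be lost.
              raise ValueError("snappy does not work on this platform") from e

      def broken_load():
          # Hypothetical failure standing in for a native-library load error.
          raise OSError("native library not found")

      try:
          init_codec(broken_load)
      except ValueError as e:
          assert isinstance(e.__cause__, OSError)  # original error preserved
      ```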
      
      This contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: Daniel Imfeld <daniel@danielimfeld.com>
      
      Closes #8725 from dimfeld/dimfeld-patch-1.
      6d836780