- Dec 04, 2015
-
-
Josh Rosen authored
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
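For reference, a minimal sketch of where these two version bumps typically live in an SBT build (the file names are standard SBT conventions, not a copy of the actual diff):

```scala
// project/build.properties pins the build's sbt version with the line:
//   sbt.version=0.13.9
// project/plugins.sbt would then pull in the newer Scalastyle plugin:
addSbtPlugin("org.scalastyle" %% "scalastyle-sbt-plugin" % "0.8.0")
```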
-
Marcelo Vanzin authored
Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #10147 from vanzin/SPARK-11314.
-
- Dec 02, 2015
-
-
Jeroen Schot authored
I have tried to address all the comments in pull request https://github.com/apache/spark/pull/2447. Note that the second commit (using the new method in all internal code of all components) is quite intrusive and could be omitted. Author: Jeroen Schot <jeroen.schot@surfsara.nl> Closes #9767 from schot/master.
-
- Dec 01, 2015
-
-
Cheng Lian authored
This PR backports PR #10039 to master Author: Cheng Lian <lian@databricks.com> Closes #10063 from liancheng/spark-12046.doc-fix.master.
-
- Nov 30, 2015
-
-
Josh Rosen authored
This pull request fixes multiple issues with API doc generation.
- Modify the Jekyll plugin so that the entire doc build fails if API docs cannot be generated. This will make it easy to detect when the doc build breaks, since this will now trigger Jenkins failures.
- Change how we handle the `-target` compiler option flag in order to fix `javadoc` generation.
- Incorporate doc changes from thunterdb (in #10048).
Closes #10048. Author: Josh Rosen <joshrosen@databricks.com> Author: Timothy Hunter <timhunter@databricks.com> Closes #10049 from JoshRosen/fix-doc-build.
-
Prashant Sharma authored
[MINOR][BUILD] Changed the comment to reflect that the plugin project exists only to support the SBT POM reader. Author: Prashant Sharma <scrapcodes@gmail.com> Closes #10012 from ScrapCodes/minor-build-comment.
-
- Nov 26, 2015
-
-
Shixiong Zhu authored
In the previous implementation, the driver needs to know the executor listening address to send the thread dump request. However, in Netty RPC, the executor doesn't listen to any port, so the executor thread dump feature is broken. This patch makes the driver use the endpointRef stored in BlockManagerMasterEndpoint to send the thread dump request to fix it. Author: Shixiong Zhu <shixiong@databricks.com> Closes #9976 from zsxwing/executor-thread-dump.
-
- Nov 24, 2015
-
-
Reynold Xin authored
Also fixed some documentation issues I noticed along the way. Author: Reynold Xin <rxin@databricks.com> Closes #9930 from rxin/SPARK-11947.
-
- Nov 23, 2015
-
-
Josh Rosen authored
This patch removes `spark.driver.allowMultipleContexts=true` from our test configuration. The multiple SparkContexts check was originally disabled because certain tests suites in SQL needed to create multiple contexts. As far as I know, this configuration change is no longer necessary, so we should remove it in order to make it easier to find test cleanup bugs. Author: Josh Rosen <joshrosen@databricks.com> Closes #9865 from JoshRosen/SPARK-4424.
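As a reminder of what the re-enabled check guards against, here is a small illustrative snippet (not from the patch); with the check active, the second constructor call below throws unless the escape-hatch property is set:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// With spark.driver.allowMultipleContexts left at its default (false),
// constructing a second SparkContext in the same JVM fails fast.
val sc1 = new SparkContext(new SparkConf().setMaster("local").setAppName("first"))
val sc2 = new SparkContext(new SparkConf().setMaster("local").setAppName("second")) // throws SparkException
```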
-
- Nov 18, 2015
-
-
Bryan Cutler authored
[SPARK-4557][STREAMING] Spark Streaming foreachRDD Java API method should accept a VoidFunction<...> Currently the streaming foreachRDD Java API uses a function prototype that requires returning null. This PR deprecates the old method and uses VoidFunction to allow a more concise declaration. Also added VoidFunction2 to the Java API for use in Streaming methods. A unit test is added for using foreachRDD with VoidFunction, and the changes have been tested with Java 7 and Java 8 using lambdas. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9488 from BryanCutler/foreachRDD-VoidFunction-SPARK-4557.
-
- Nov 17, 2015
-
-
jerryshao authored
Fixed the merge conflicts in #7410 Closes #7410 Author: Shixiong Zhu <shixiong@databricks.com> Author: jerryshao <saisai.shao@intel.com> Author: jerryshao <sshao@hortonworks.com> Closes #9742 from zsxwing/pr7410.
-
Timothy Hunter authored
This adds an extra filter for private or protected classes. We only filter for package private right now. Author: Timothy Hunter <timhunter@databricks.com> Closes #9697 from thunterdb/spark-11732.
-
Xiangrui Meng authored
This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9751 from mengxr/SPARK-11766.
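A hedged sketch of the intended round trip, assuming the serialization helpers land as `toJson`/`fromJson` on `Vector`/`Vectors` (method names from memory, so treat them as an assumption):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical usage: serialize a vector to JSON and read it back.
val v = Vectors.sparse(4, Array(1, 3), Array(1.0, 5.5))
val json = v.toJson                 // roughly {"type":0,"size":4,"indices":[1,3],"values":[1.0,5.5]}
assert(Vectors.fromJson(json) == v) // round trip preserves the vector
```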
-
- Nov 12, 2015
-
-
jerryshao authored
Remove some old YARN-related build code; please review, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #9625 from jerryshao/remove-old-module.
-
- Nov 11, 2015
-
-
Josh Rosen authored
This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8. In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml. http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes. I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code compiled targeting Java 8. Author: Josh Rosen <joshrosen@databricks.com> Closes #9512 from JoshRosen/SPARK-6152.
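A hedged sketch of what the dependency swap looks like in SBT terms (the real change is spread across the Maven POMs and `SparkBuild.scala`; the coordinate below is taken from the PR description, and the import paths are illustrative):

```scala
// Depend on the Apache XBean shading of ASM 5 instead of the ASM 4 bundled in reflectasm.
libraryDependencies += "org.apache.xbean" % "xbean-asm5-shaded" % "4.4"

// The shaded package prefix changes accordingly, so imports move roughly from
//   com.esotericsoftware.reflectasm.shaded.org.objectweb.asm.ClassVisitor
// to
//   org.apache.xbean.asm5.ClassVisitor
```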
-
- Nov 10, 2015
-
-
Josh Rosen authored
This patch re-enables tests for the Docker JDBC data source. These tests were reverted in #4872 due to transitive dependency conflicts introduced by the `docker-client` library. This patch should avoid those problems by using a version of `docker-client` which shades its transitive dependencies and by performing some build-magic to work around problems with that shaded JAR. In addition, I significantly refactored the tests to simplify the setup and teardown code and to fix several Docker networking issues which caused problems when running in `boot2docker`. Closes #8101. Author: Josh Rosen <joshrosen@databricks.com> Author: Yijie Shen <henry.yijieshen@gmail.com> Closes #9503 from JoshRosen/docker-jdbc-tests.
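A hedged SBT sketch of the shading trick mentioned above (the version number is illustrative, not necessarily the one the patch uses):

```scala
// Use docker-client's "shaded" classifier so its transitive dependencies
// (Jersey, Guava, ...) don't leak onto Spark's test classpath.
libraryDependencies +=
  ("com.spotify" % "docker-client" % "3.2.1" % "test").classifier("shaded")
```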
-
Josh Rosen authored
This patch modifies Spark's SBT build so that it no longer uses `retrieveManaged` / `lib_managed` to store its dependencies. The motivations for this change are nicely described on the JIRA ticket ([SPARK-7841](https://issues.apache.org/jira/browse/SPARK-7841)); my personal interest in doing this stems from the fact that `lib_managed` has caused me some pain while debugging dependency issues in another PR of mine. Removing our use of `lib_managed` would be trivial except for one snag: the Datanucleus JARs, required by Spark SQL's Hive integration, cannot be included in assembly JARs due to problems with merging OSGI `plugin.xml` files. As a result, several places in the packaging and deployment pipeline assume that these Datanucleus JARs are copied to `lib_managed/jars`. In the interest of maintaining compatibility, I have chosen to retain the `lib_managed/jars` directory _only_ for these Datanucleus JARs and have added custom code to `SparkBuild.scala` to automatically copy those JARs to that folder as part of the `assembly` task. `dev/mima` also depended on `lib_managed` in a hacky way in order to set classpaths when generating MiMa excludes; I've updated this to obtain the classpaths directly from SBT instead. /cc dragos marmbrus pwendell srowen Author: Josh Rosen <joshrosen@databricks.com> Closes #9575 from JoshRosen/SPARK-7841.
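Purely as an illustration of the compromise described above, here is a hypothetical SBT task (names invented, not the actual `SparkBuild.scala` code) that copies only the Datanucleus JARs into `lib_managed/jars`:

```scala
// Hypothetical sketch: keep lib_managed/jars around only for the Datanucleus JARs.
lazy val copyDatanucleusJars = taskKey[Unit]("Copy Datanucleus jars to lib_managed/jars")

copyDatanucleusJars := {
  val dest = baseDirectory.value / "lib_managed" / "jars"
  IO.createDirectory(dest)
  (managedClasspath in Compile).value
    .map(_.data)                                   // Attributed[File] -> File
    .filter(_.getName.startsWith("datanucleus-"))
    .foreach(jar => IO.copyFile(jar, dest / jar.getName))
}
```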
-
- Nov 09, 2015
-
-
Charles Yeh authored
I looked at the other endpoints, and they don't seem to be missing any fields. Added fields:  Author: Charles Yeh <charlesyeh@dropbox.com> Closes #9472 from CharlesYeh/api_vars.
-
- Nov 06, 2015
-
-
Reynold Xin authored
[SPARK-11541][SQL] Break JdbcDialects.scala into multiple files and mark various dialects as private. Author: Reynold Xin <rxin@databricks.com> Closes #9511 from rxin/SPARK-11541.
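For context, a small sketch of the public extension point that remains after the split (the built-in dialects themselves are what get marked private); this is illustrative usage, not code from the patch:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// A custom dialect can still be registered through the public API.
object MyDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:mydb")
}

JdbcDialects.registerDialect(MyDialect)
```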
-
- Nov 05, 2015
-
-
Marcelo Vanzin authored
sbt's version resolution code always picks the most recent version, and we don't want that for guava. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9508 from vanzin/SPARK-11538.
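A hedged sketch of the usual SBT remedy for this (the Guava version shown is the one Spark built against at the time, but treat it as illustrative):

```scala
// Pin Guava so Ivy's "latest wins" conflict resolution doesn't pull in a newer version.
dependencyOverrides += "com.google.guava" % "guava" % "14.0.1"
```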
-
Reynold Xin authored
This reverts commit 9cf56c96.
-
- Nov 04, 2015
-
-
Josh Rosen authored
Spark should build against Scala 2.10.5, since that includes a fix for Scaladoc that will fix doc snapshot publishing: https://issues.scala-lang.org/browse/SI-8479 Author: Josh Rosen <joshrosen@databricks.com> Closes #9450 from JoshRosen/upgrade-to-scala-2.10.5.
-
Reynold Xin authored
These two classes should be public, since they are used in public code. Author: Reynold Xin <rxin@databricks.com> Closes #9445 from rxin/SPARK-11485.
-
Yanbo Liang authored
Like ml `LinearRegression`, `LogisticRegression` should provide a training summary including feature names and their coefficients. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9303 from yanboliang/spark-9492.
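A hedged sketch of the kind of usage this enables (assuming `training` is a DataFrame with `label` and `features` columns; the exact summary fields may differ from what the patch exposes):

```scala
import org.apache.spark.ml.classification.LogisticRegression

val model = new LogisticRegression().setMaxIter(20).fit(training)
println(model.coefficients)                        // fitted coefficients
val summary = model.summary                        // training summary
println(summary.objectiveHistory.mkString(", "))   // loss per iteration
```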
-
- Nov 02, 2015
-
-
Yin Huai authored
This is the first task (https://issues.apache.org/jira/browse/SPARK-11469) of https://issues.apache.org/jira/browse/SPARK-11438 Author: Yin Huai <yhuai@databricks.com> Closes #9393 from yhuai/udfNondeterministic.
-
- Oct 30, 2015
-
-
Davies Liu authored
Since we do not need to preserve a page before calling compute(), MapPartitionsWithPreparationRDD is not needed anymore. This PR basically reverts #8543, #8511, #8038, #8011. Author: Davies Liu <davies@databricks.com> Closes #9381 from davies/remove_prepare2.
-
- Oct 22, 2015
-
-
Josh Rosen authored
There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together. Author: Josh Rosen <joshrosen@databricks.com> Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
-
- Oct 19, 2015
-
-
Jacek Laskowski authored
…redNodeLocationData Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #8976 from jaceklaskowski/SPARK-10921.
-
- Oct 16, 2015
-
-
Jakob Odersky authored
Shows that an error is actually due to a fatal warning. Author: Jakob Odersky <jodersky@gmail.com> Closes #9128 from jodersky/fatalwarnings.
-
Jakob Odersky authored
Modify the SBT build script to include GitHub source links for generated Scaladocs, on releases only (no snapshots). Author: Jakob Odersky <jodersky@gmail.com> Closes #9110 from jodersky/unidoc.
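Roughly, the Scaladoc wiring involved looks like the sketch below (the option names are standard scaladoc flags; the "releases only" conditional logic in `SparkBuild.scala` is omitted):

```scala
// Emit "view source" links in the generated Scaladoc that point at the tagged
// release on GitHub; €{FILE_PATH} is scaladoc's placeholder for the source file path.
scalacOptions in (Compile, doc) ++= Seq(
  "-sourcepath", baseDirectory.value.getAbsolutePath,
  "-doc-source-url", s"https://github.com/apache/spark/tree/v${version.value}€{FILE_PATH}.scala"
)
```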
-
- Oct 08, 2015
-
-
Davies Liu authored
This PR improves session management by replacing the thread-local-based approach with one SQLContext per session, introducing separate temporary tables and UDFs/UDAFs for each session. A new session of SQLContext can be created by: 1) creating a new SQLContext, or 2) calling newSession() on an existing SQLContext. For HiveContext, in order to reduce the cost of each session, the classloader and Hive client are shared across multiple sessions (created by newSession). CacheManager is also shared by multiple sessions, so caching a table multiple times in different sessions will not create multiple copies of the in-memory cache. Added jars are still shared by all sessions, because SparkContext does not support sessions. cc marmbrus yhuai rxin Author: Davies Liu <davies@databricks.com> Closes #8909 from davies/sessions.
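A hedged sketch of the session model described above (assumes an existing SparkContext `sc`; the isolation behavior is as the message describes it):

```scala
import org.apache.spark.sql.SQLContext

val rootCtx  = new SQLContext(sc)
val session1 = rootCtx.newSession()   // shares SparkContext, cache, and added jars
val session2 = rootCtx.newSession()

session1.range(10).registerTempTable("t")   // temp table visible only in session1
// session2.table("t")                      // would fail: temp tables and UDFs are per-session
```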
-
- Oct 07, 2015
-
-
Marcelo Vanzin authored
Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8775 from vanzin/SPARK-10300.
-
- Oct 06, 2015
-
-
Davies Liu authored
This PR removes the typeId from the columnar cache, since it's not needed anymore; it also removes DATE and TIMESTAMP (INT/LONG are used instead). Author: Davies Liu <davies@databricks.com> Closes #8989 from davies/refactor_cache.
-
- Sep 21, 2015
-
-
Meihua Wu authored
In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support instance weighting to account for over- or under-sampling. Work in progress. Author: Meihua Wu <meihuawu@umich.edu> Closes #8631 from rotationsymmetry/SPARK-9642.
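A hedged usage sketch, assuming the weight-column setter this work introduces; `dataset` is assumed to be a DataFrame with `label`, `features`, and a per-row `weight` column:

```scala
import org.apache.spark.ml.regression.LinearRegression

val lr = new LinearRegression()
  .setWeightCol("weight")   // per-instance sampling weights
  .setMaxIter(50)
val model = lr.fit(dataset)
```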
-
- Sep 18, 2015
-
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #8812 from rxin/SPARK-9808-1.
-
- Sep 15, 2015
-
-
Josh Rosen authored
When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop. This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Other factors contributing to this bug are the fact that we have many similarly-named identifiers that have different semantics but the same data types (e.g. attemptNumber and taskAttemptId, with inconsistent variable naming which makes them difficult to distinguish). This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code. Author: Josh Rosen <joshrosen@databricks.com> Closes #8544 from JoshRosen/SPARK-10381.
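For readers unfamiliar with the two identifiers being confused, a small illustrative snippet (not from the patch):

```scala
import org.apache.spark.TaskContext

// attemptNumber() counts retries of the same task (0, 1, 2, ...), while
// taskAttemptId() is an ID unique across all task attempts in the SparkContext;
// mixing the two up is the unit mismatch described above.
val ctx = TaskContext.get()
val attemptNumber: Int  = ctx.attemptNumber()
val taskAttemptId: Long = ctx.taskAttemptId()
```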
-
DB Tsai authored
In fraud detection datasets, almost all the samples are negative while only a couple of them are positive. This type of highly imbalanced data will bias the models toward the negative class, resulting in poor performance. In scikit-learn, a correction is provided that allows users to over-/undersample the samples of each class according to the given weights; in auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of doing actual over-/undersampling of the training dataset, which is very expensive. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html On the other hand, some of the training data may be more important, like the training samples from long-tenure users, while the training samples from new users may be less important. We should be able to provide an additional "weight: Double" field in the LabeledPoint to weight them differently in the learning algorithm. Author: DB Tsai <dbt@netflix.com> Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com> Closes #7884 from dbtsai/SPARK-7685.
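A hedged sketch of the inverse-class-frequency weighting described above, using the DataFrame API and assuming a weight-column setter on `LogisticRegression` (column names are illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.functions._

// Weight each class inversely proportional to its frequency ("auto"/balanced mode).
val total    = df.count().toDouble
val positive = df.filter(col("label") === 1.0).count().toDouble
val weighted = df.withColumn("classWeight",
  when(col("label") === 1.0, total / (2.0 * positive))
    .otherwise(total / (2.0 * (total - positive))))

val model = new LogisticRegression().setWeightCol("classWeight").fit(weighted)
```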
-
Marcelo Vanzin authored
This reverts commit 8abef21d.
-
Marcelo Vanzin authored
This change does two things:
- tags a few tests and adds a mechanism to the build to disable those tags, both in maven and sbt, for both junit and scalatest suites (see the sbt sketch below).
- adds some logic to run-tests.py to disable some tags depending on what files have changed; that's used to disable expensive tests when a module hasn't explicitly been changed, to speed up testing for changes that don't directly affect those modules.
Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8437 from vanzin/test-tags.
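On the sbt side, such a mechanism typically boils down to something like this sketch (the tag name and exact wiring are assumptions, not the literal change):

```scala
// Exclude ScalaTest suites carrying a given tag, e.g. when Docker isn't available.
testOptions in Test += Tests.Argument(
  TestFrameworks.ScalaTest, "-l", "org.apache.spark.tags.DockerTest")
```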
-
Reynold Xin authored
Author: Reynold Xin <rxin@databricks.com> Closes #8350 from rxin/1.6.
-