  1. Nov 23, 2015
    • [SPARK-4424] Remove spark.driver.allowMultipleContexts override in tests · 1b6e938b
      Josh Rosen authored
      This patch removes `spark.driver.allowMultipleContexts=true` from our test configuration. The multiple SparkContexts check was originally disabled because certain tests suites in SQL needed to create multiple contexts. As far as I know, this configuration change is no longer necessary, so we should remove it in order to make it easier to find test cleanup bugs.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9865 from JoshRosen/SPARK-4424.
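As a sketch of the kind of guard this flag bypasses (illustrative Scala only, not Spark's actual internals): with the override removed, constructing a second context fails fast, which is exactly what surfaces leaked-context cleanup bugs in tests.

```scala
// Hypothetical sketch of the multiple-contexts guard that
// spark.driver.allowMultipleContexts=true bypasses; names and structure
// are illustrative, not Spark's internals.
object ContextGuard {
  private var active: Option[String] = None

  def create(name: String, allowMultiple: Boolean): String = synchronized {
    if (active.isDefined && !allowMultiple)
      throw new IllegalStateException(
        s"Only one SparkContext may be running; found: ${active.get}")
    active = Some(name)
    name
  }

  def stop(): Unit = synchronized { active = None }
}
```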
  2. Nov 18, 2015
    • [SPARK-4557][STREAMING] Spark Streaming foreachRDD Java API method should accept a VoidFunction<...> · 31921e0f
      Bryan Cutler authored
      
      Currently the streaming foreachRDD Java API uses a function prototype that requires returning null.  This PR deprecates the old method and uses VoidFunction to allow a more concise declaration.  VoidFunction2 was also added to the Java API for use in streaming methods.  A unit test was added for using foreachRDD with VoidFunction, and the changes have been tested with Java 7 and with Java 8 lambdas.
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #9488 from BryanCutler/foreachRDD-VoidFunction-SPARK-4557.
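The API change above can be sketched in Scala with stand-in types (the real interfaces live in org.apache.spark.api.java.function; JFunction, JVoidFunction, and FakeDStream here are illustrative stand-ins, not Spark's classes):

```scala
// Old overload: a Function[T, Void] whose body must end in `return null`.
// New overload: a VoidFunction[T] with no artificial return value.
trait JFunction[T, R] { def call(t: T): R }
trait JVoidFunction[T] { def call(t: T): Unit }

class FakeDStream[T](batches: Seq[Seq[T]]) {
  @deprecated("use foreachRDD(JVoidFunction)", "1.6.0")
  def foreachRDD(f: JFunction[Seq[T], Void]): Unit =
    batches.foreach(b => f.call(b))

  def foreachRDD(f: JVoidFunction[Seq[T]]): Unit =
    batches.foreach(b => f.call(b))
}
```

The new overload lets Java 8 callers pass a lambda directly, without the `return null` boilerplate the old signature forced.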
  3. Nov 17, 2015
  4. Nov 12, 2015
  5. Nov 11, 2015
  6. Nov 10, 2015
    • [SPARK-9818] Re-enable Docker tests for JDBC data source · 1dde39d7
      Josh Rosen authored
      This patch re-enables tests for the Docker JDBC data source. These tests were reverted in #4872 due to transitive dependency conflicts introduced by the `docker-client` library. This patch should avoid those problems by using a version of `docker-client` which shades its transitive dependencies and by performing some build-magic to work around problems with that shaded JAR.
      
      In addition, I significantly refactored the tests to simplify the setup and teardown code and to fix several Docker networking issues which caused problems when running in `boot2docker`.
      
      Closes #8101.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Yijie Shen <henry.yijieshen@gmail.com>
      
      Closes #9503 from JoshRosen/docker-jdbc-tests.
    • [SPARK-7841][BUILD] Stop using retrieveManaged to retrieve dependencies in SBT · 689386b1
      Josh Rosen authored
      This patch modifies Spark's SBT build so that it no longer uses `retrieveManaged` / `lib_managed` to store its dependencies. The motivations for this change are nicely described on the JIRA ticket ([SPARK-7841](https://issues.apache.org/jira/browse/SPARK-7841)); my personal interest in doing this stems from the fact that `lib_managed` has caused me some pain while debugging dependency issues in another PR of mine.
      
      Removing our use of `lib_managed` would be trivial except for one snag: the Datanucleus JARs, required by Spark SQL's Hive integration, cannot be included in assembly JARs due to problems with merging OSGI `plugin.xml` files. As a result, several places in the packaging and deployment pipeline assume that these Datanucleus JARs are copied to `lib_managed/jars`. In the interest of maintaining compatibility, I have chosen to retain the `lib_managed/jars` directory _only_ for these Datanucleus JARs and have added custom code to `SparkBuild.scala` to automatically copy those JARs to that folder as part of the `assembly` task.
      
      `dev/mima` also depended on `lib_managed` in a hacky way in order to set classpaths when generating MiMa excludes; I've updated this to obtain the classpaths directly from SBT instead.
      
      /cc dragos marmbrus pwendell srowen
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #9575 from JoshRosen/SPARK-7841.
  7. Nov 09, 2015
  8. Nov 06, 2015
  9. Nov 05, 2015
  10. Nov 04, 2015
  11. Nov 02, 2015
  12. Oct 30, 2015
    • [SPARK-11423] remove MapPartitionsWithPreparationRDD · 45029bfd
      Davies Liu authored
      Since we do not need to preserve a page before calling compute(), MapPartitionsWithPreparationRDD is not needed anymore.
      
      This PR basically reverts #8543, #8511, #8038, #8011.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #9381 from davies/remove_prepare2.
  13. Oct 22, 2015
    • [SPARK-10708] Consolidate sort shuffle implementations · f6d06adf
      Josh Rosen authored
      There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Now that UnsafeShuffleManager supports large records, the two provide the same functionality, so I think we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and merge the two managers.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
  14. Oct 19, 2015
  15. Oct 16, 2015
  16. Oct 08, 2015
    • [SPARK-10810] [SPARK-10902] [SQL] Improve session management in SQL · 3390b400
      Davies Liu authored
      This PR improves session management by replacing the thread-local approach with one SQLContext per session, introducing separate temporary tables and UDFs/UDAFs for each session.
      
      A new SQLContext session can be created by:

      1) creating a new SQLContext
      2) calling newSession() on an existing SQLContext
      
      For HiveContext, in order to reduce the cost for each session, the classloader and Hive client are shared across multiple sessions (created by newSession).
      
      CacheManager is also shared by multiple sessions, so caching a table in different sessions will not create multiple copies of the in-memory cache.
      
      Added jars are still shared by all the sessions, because SparkContext does not support sessions.
      
      cc marmbrus yhuai rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8909 from davies/sessions.
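The per-session design described above can be sketched as follows (class and method names are hypothetical, not Spark's): temporary tables are session-local, while the cache is shared across sessions created by newSession().

```scala
import scala.collection.mutable

// Shared across all sessions (standing in for CacheManager and, in
// HiveContext, the shared classloader/Hive client).
class SharedState {
  val cache: mutable.Map[String, Seq[Int]] = mutable.Map.empty
}

class Session(val shared: SharedState) {
  // Session-local: not visible from sibling sessions.
  private val tempTables = mutable.Map.empty[String, Seq[Int]]

  def registerTemp(name: String, data: Seq[Int]): Unit = tempTables(name) = data
  def lookupTemp(name: String): Option[Seq[Int]] = tempTables.get(name)

  // Caching the same table from another session is a no-op, not a second copy.
  def cacheTable(name: String, data: Seq[Int]): Unit = {
    shared.cache.getOrElseUpdate(name, data); ()
  }

  // New sessions share the cache but get fresh temporary tables.
  def newSession(): Session = new Session(shared)
}
```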
  17. Oct 07, 2015
  18. Oct 06, 2015
    • [SPARK-10938] [SQL] remove typeId in columnar cache · 27ecfe61
      Davies Liu authored
      This PR removes the typeId from the columnar cache, as it is no longer needed; it also removes the DATE and TIMESTAMP column types (INT/LONG are used instead).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #8989 from davies/refactor_cache.
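A minimal sketch of the representation this implies, assuming DATE is stored as an INT of days since the epoch and TIMESTAMP as a LONG of microseconds since the epoch (the helper names are hypothetical):

```scala
import java.time.LocalDate

object ColumnarDates {
  // DATE as INT: days since 1970-01-01.
  def dateToInt(d: LocalDate): Int = d.toEpochDay.toInt
  def intToDate(days: Int): LocalDate = LocalDate.ofEpochDay(days.toLong)

  // TIMESTAMP as LONG: microseconds since the epoch.
  def timestampToLong(epochSecond: Long, nanoOfSecond: Int): Long =
    epochSecond * 1000000L + nanoOfSecond / 1000
}
```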
  19. Sep 21, 2015
    • [SPARK-9642] [ML] LinearRegression should support weighted data · 331f0b10
      Meihua Wu authored
      In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weights that account for over- or under-sampling.
      
      work in progress.
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #8631 from rotationsymmetry/SPARK-9642.
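The weighting idea can be sketched in one dimension: each point contributes w_i times its usual loss and gradient term. This is an illustration of the concept, not Spark ML's implementation.

```scala
object WeightedLS {
  // loss(beta) = 0.5 * sum_i w_i * (x_i * beta - y_i)^2
  def loss(xs: Seq[Double], ys: Seq[Double], ws: Seq[Double], beta: Double): Double =
    xs.indices.map(i => 0.5 * ws(i) * math.pow(xs(i) * beta - ys(i), 2)).sum

  // d loss / d beta = sum_i w_i * (x_i * beta - y_i) * x_i
  def grad(xs: Seq[Double], ys: Seq[Double], ws: Seq[Double], beta: Double): Double =
    xs.indices.map(i => ws(i) * (xs(i) * beta - ys(i)) * xs(i)).sum
}
```

A point with weight 0 contributes nothing, and a point with weight 2 counts as if it were sampled twice, which is the desired correction for unequal sampling probabilities.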
  20. Sep 18, 2015
  21. Sep 15, 2015
    • [SPARK-10381] Fix mixup of taskAttemptNumber & attemptId in OutputCommitCoordinator · 38700ea4
      Josh Rosen authored
      When speculative execution is enabled, consider a scenario where the authorized committer of a particular output partition fails during the OutputCommitter.commitTask() call. In this case, the OutputCommitCoordinator is supposed to release that committer's exclusive lock on committing once that task fails. However, due to a unit mismatch (we used task attempt number in one place and task attempt id in another) the lock will not be released, causing Spark to go into an infinite retry loop.
      
      This bug was masked by the fact that the OutputCommitCoordinator does not have enough end-to-end tests (the current tests use many mocks). Another contributing factor is that we have many similarly named identifiers with different semantics but the same data types (e.g. attemptNumber and taskAttemptId), and the inconsistent variable naming makes them difficult to distinguish.
      
      This patch adds a regression test and fixes this bug by always using task attempt numbers throughout this code.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8544 from JoshRosen/SPARK-10381.
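The bug class can be sketched with a toy coordinator (illustrative, not Spark's actual OutputCommitCoordinator): the committer lock must be released under the same identifier it was granted under. If canCommit records the attempt number but taskFailed compares against a different identifier, a failed committer never frees the partition and retries loop forever.

```scala
import scala.collection.mutable

object CommitCoordinator {
  // partition -> authorized attempt number
  private val authorized = mutable.Map.empty[Int, Int]

  def canCommit(partition: Int, attemptNumber: Int): Boolean = synchronized {
    authorized.get(partition) match {
      case None    => authorized(partition) = attemptNumber; true
      case Some(a) => a == attemptNumber
    }
  }

  def taskFailed(partition: Int, attemptNumber: Int): Unit = synchronized {
    // The fix: compare in the same unit canCommit used, so the lock is freed.
    if (authorized.get(partition).contains(attemptNumber)) authorized -= partition
  }
}
```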
    • [SPARK-7685] [ML] Apply weights to different samples in Logistic Regression · be52faa7
      DB Tsai authored
      In a fraud detection dataset, almost all samples are negative and only a couple are positive. This kind of highly imbalanced data biases models toward the negative class, resulting in poor performance. scikit-learn provides a correction that allows users to over-/undersample the samples of each class according to given weights; in auto mode, it selects weights inversely proportional to class frequencies in the training set. The same effect can be achieved more efficiently by multiplying the weights into the loss and gradient instead of actually over-/undersampling the training dataset, which is very expensive.
      http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
      On the other hand, some training data may be more important than others, e.g. samples from tenured users versus samples from new users. We should be able to provide an additional `weight: Double` field in the LabeledPoint to weight samples differently in the learning algorithm.
      
      Author: DB Tsai <dbt@netflix.com>
      Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>
      
      Closes #7884 from dbtsai/SPARK-7685.
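The "inversely proportional to class frequencies" rule mentioned above can be sketched as w_c = n / (k * n_c) for k classes, so every class contributes the same total weight. This convention mirrors scikit-learn's "balanced" mode and is illustrative only.

```scala
object BalancedWeights {
  // labels -> per-class weight w_c = n / (k * n_c)
  def perClass(labels: Seq[Int]): Map[Int, Double] = {
    val n = labels.size.toDouble
    val counts = labels.groupBy(identity).map { case (c, xs) => c -> xs.size.toDouble }
    val k = counts.size.toDouble
    counts.map { case (c, nc) => c -> n / (k * nc) }
  }
}
```

With labels (0, 0, 0, 1) the minority class gets weight 2.0 and the majority class 2/3, so both classes contribute a total weight of 2.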
    • [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. · 8abef21d
      Marcelo Vanzin authored
      This change does two things:
      
      - tag a few tests and adds the mechanism in the build to be able to disable those tags,
        both in maven and sbt, for both junit and scalatest suites.
      - add some logic to run-tests.py to disable some tags depending on what files have
        changed; that's used to disable expensive tests when a module hasn't explicitly
        been changed, to speed up testing for changes that don't directly affect those
        modules.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8437 from vanzin/test-tags.
    • Update version to 1.6.0-SNAPSHOT. · 09b7e7c1
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8350 from rxin/1.6.
  22. Sep 11, 2015
    • [SPARK-10556] Remove explicit Scala version for sbt project build files · 9bbe33f3
      Ahir Reddy authored
      Previously, project/plugins.sbt explicitly set scalaVersion to 2.10.4. This can cause issues when using a version of sbt that is compiled against a different version of Scala (for example sbt 0.13.9 uses 2.10.5). Removing this explicit setting will cause build files to be compiled and run against the same version of Scala that sbt is compiled against.
      
      Note that this only applies to the project build files (items in project/); it is distinct from the version of Scala we target for the actual Spark compilation.
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      
      Closes #8709 from ahirreddy/sbt-scala-version-fix.
  23. Sep 07, 2015
    • [SPARK-9767] Remove ConnectionManager. · 5ffe752b
      Reynold Xin authored
      We introduced the Netty network module for shuffle in Spark 1.2, and it has been on by default for three releases. The old ConnectionManager is difficult to maintain. If we merge this patch now, by the time it is released ConnectionManager will have been off by default for a year. It's time to remove it.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #8161 from rxin/SPARK-9767.
  24. Sep 02, 2015
    • [SPARK-10004] [SHUFFLE] Perform auth checks when clients read shuffle data. · 2da3a9e9
      Marcelo Vanzin authored
      To correctly isolate applications, when requests to read shuffle data
      arrive at the shuffle service, proper authorization checks need to
      be performed. This change makes sure that only the application that
      created the shuffle data can read from it.
      
      Such checks are only enabled when "spark.authenticate" is enabled,
      otherwise there's no secure way to make sure that the client is really
      who it says it is.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8218 from vanzin/SPARK-10004.
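The authorization rule described above reduces to a small predicate (a sketch with illustrative names, not the shuffle service's actual code): with "spark.authenticate" on, only the application that registered the shuffle data may read it; with it off, no trustworthy check is possible.

```scala
object ShuffleAuth {
  // True iff the client may read the block: either auth is disabled
  // (no secure identity to check) or the client owns the data.
  def mayRead(authenticate: Boolean, clientAppId: String, ownerAppId: String): Boolean =
    !authenticate || clientAppId == ownerAppId
}
```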
  25. Aug 28, 2015
    • [SPARK-9284] [TESTS] Allow all tests to run without an assembly. · c53c902f
      Marcelo Vanzin authored
      This change aims at speeding up the dev cycle a little bit, by making
      sure that all tests behave the same w.r.t. where the code to be tested
      is loaded from. Namely, that means that tests don't rely on the assembly
      anymore, rather loading all needed classes from the build directories.
      
      The main change is to make sure all build directories (classes and test-classes)
      are added to the classpath of child processes when running tests.
      
      YarnClusterSuite required some custom code since the executors are run
      differently (i.e. not through the launcher library, like standalone and
      Mesos do).
      
      I also found a couple of tests that could leak a SparkContext on failure,
      and added code to handle those.
      
      With this patch, it's possible to run the following command from a clean
      source directory and have all tests pass:
      
        mvn -Pyarn -Phadoop-2.4 -Phive-thriftserver install
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7629 from vanzin/SPARK-9284.
  26. Aug 25, 2015
  27. Aug 13, 2015
    • [SPARK-9580] [SQL] Replace singletons in SQL tests · 8187b3ae
      Andrew Or authored
      A fundamental limitation of the existing SQL tests is that *there is simply no way to create your own `SparkContext`*. This is a serious limitation because the user may wish to use a different master or config. As a case in point, `BroadcastJoinSuite` is entirely commented out because there is no way to make it pass with the existing infrastructure.
      
      This patch removes the singletons `TestSQLContext` and `TestData`, and instead introduces a `SharedSQLContext` that starts a context per suite. Unfortunately the singletons were so ingrained in the SQL tests that this patch necessarily needed to touch *all* the SQL test files.
      
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8111 from andrewor14/sql-tests-refactor.
  28. Aug 12, 2015
    • [SPARK-9704] [ML] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs · d2d5e7fe
      Joseph K. Bradley authored
      Made ProbabilisticClassifier, Identifiable, VectorUDT public.  All are annotated as DeveloperApi.
      
      CC: mengxr EronWright
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #8004 from jkbradley/ml-api-public-items and squashes the following commits:
      
      7ebefda [Joseph K. Bradley] update per code review
      7ff0768 [Joseph K. Bradley] attepting to add mima fix
      756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent
      ae7767d [Joseph K. Bradley] added another warning
      94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
  29. Aug 11, 2015
    • [SPARK-9649] Fix flaky test MasterSuite again - disable REST · ca8f70e9
      Andrew Or authored
      The REST server is not actually used in most tests and so we can disable it. It is a source of flakiness because it tries to bind to a specific port in vain. There was also some code that avoided the shuffle service in tests. This is actually not necessary because the shuffle service is already off by default.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #8084 from andrewor14/fix-master-suite-again.