  1. Jan 12, 2015
    • SPARK-5172 [BUILD] spark-examples-***.jar shades a wrong Hadoop distribution · aff49a3e
      Sean Owen authored
      In addition to the `hadoop-2.x` profiles in the parent POM, there is actually another set of profiles in `examples` that has to be activated differently to get the right Hadoop 1 vs 2 flavor of HBase. This wasn't actually used in making Hadoop 2 distributions, hence the problem.
      
      To reduce complexity, I suggest merging them with the parent POM profiles, which is possible now.
      
      You'll see this change appears to update the HBase version, but in fact the default 0.94 version was not being used. HBase is only used in examples, and the examples POM always chose one profile or the other, which updated the version to 0.98.x anyway.
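
      A quick illustration of the consolidated behavior (a hedged sketch; the exact profile name and Hadoop version are assumptions based on the POMs of this era):

      ```
      # After merging the examples profiles into the parent POM, selecting a
      # hadoop-2.x profile should also pick the Hadoop 2 flavor of HBase 0.98.x:
      mvn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
      ```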
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3992 from srowen/SPARK-5172 and squashes the following commits:
      
      17830d9 [Sean Owen] Control hbase hadoop1/2 flavor in the parent POM with existing hadoop-2.x profiles
      aff49a3e
    • SPARK-4159 [BUILD] Addendum: improve running of single test after enabling Java tests · 13e610b8
      Sean Owen authored
      https://issues.apache.org/jira/browse/SPARK-4159 was resolved, but as Sandy points out, the guidance in https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools under "Running Individual Tests" no longer quite works, or at least not optimally.
      
      This minor change is not really the important part; the important change is an update to the wiki text. The correct way to run one Scala test suite in Maven is now:
      
      ```
      mvn test -DwildcardSuites=org.apache.spark.io.CompressionCodecSuite -Dtests=none
      ```
      
      The correct way to run one Java test is
      
      ```
      mvn test -DwildcardSuites=none -Dtests=org.apache.spark.streaming.JavaAPISuite
      ```
      
      Basically, you have to set two properties in order to suppress all of one type of test (with a non-existent test name like 'none') and all but one test of the other type.
      
      The change in the PR just prevents Surefire from barfing when it finds no "none" test.
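
      For reference, `failIfNoTests` is also exposed as a standard Surefire command-line property, so an equivalent ad-hoc override (illustrative only, not part of this PR) would be:

      ```
      # Suppress all Scala suites and run one Java test, telling Surefire not to
      # fail when the 'none' placeholder matches nothing:
      mvn test -DfailIfNoTests=false -DwildcardSuites=none -Dtests=org.apache.spark.streaming.JavaAPISuite
      ```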
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3993 from srowen/SPARK-4159 and squashes the following commits:
      
      83106d7 [Sean Owen] Default failIfNoTests to false to enable the -DwildcardSuites=... -Dtests=... syntax for running one test to work
      13e610b8
  2. Jan 09, 2015
    • [SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688 · 454fe129
      Jongyoul Lee authored
      - update version from 0.18.1 to 0.21.0
      - I'm running some tests to verify that Spark jobs work fine in a Mesos 0.21.0 environment.
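
      A hypothetical smoke test of the kind described (the master URL, class, and jar path are placeholders, not from the commit):

      ```
      # Run an example job against a Mesos 0.21.0 master to verify basic operation
      ./bin/spark-submit --master mesos://master-host:5050 \
        --class org.apache.spark.examples.SparkPi lib/spark-examples-*.jar 100
      ```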
      
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #3934 from jongyoul/SPARK-3619 and squashes the following commits:
      
      ab994fa [Jongyoul Lee] [SPARK-3619] Upgrade to Mesos 0.21 to work around MESOS-1688 - update version from 0.18.1 to 0.21.0
      454fe129
  3. Jan 08, 2015
    • [SPARK-4048] Enhance and extend hadoop-provided profile. · 48cecf67
      Marcelo Vanzin authored
      This change does a few things to make the hadoop-provided profile more useful:
      
      - Create new profiles for other libraries / services that might be provided by the infrastructure
      - Simplify and fix the poms so that the profiles are only activated while building assemblies.
      - Fix tests so that they're able to run when the profiles are activated
      - Add a new env variable to be used by distributions that use these profiles to provide the runtime
        classpath for Spark jobs and daemons.
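
      As a sketch of how a distribution might set the new variable (the `hadoop classpath` invocation is an assumption here, matching how Spark's later "Hadoop free" build docs describe it):

      ```
      # Point Spark at the Hadoop jars provided by the cluster rather than
      # bundling them in the assembly:
      export SPARK_DIST_CLASSPATH=$(hadoop classpath)
      ```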
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:
      
      82eb688 [Marcelo Vanzin] Add a comment.
      eb228c0 [Marcelo Vanzin] Fix borked merge.
      4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
      371ebee [Marcelo Vanzin] Review feedback.
      52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      322f882 [Marcelo Vanzin] Fix merge fail.
      f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      9640503 [Marcelo Vanzin] Cleanup child process log message.
      115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
      e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
      7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
      1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
      d1399ed [Marcelo Vanzin] Restore jetty dependency.
      82a54b9 [Marcelo Vanzin] Remove unused profile.
      5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
      1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
      f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
      9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
      d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
      4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
      417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
      2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
      1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
      284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.
      48cecf67
  4. Jan 06, 2015
    • SPARK-4159 [CORE] Maven build doesn't run JUnit test suites · 4cba6eb4
      Sean Owen authored
      This PR:
      
      - Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
      - Tells `surefire` to test only Java tests
      - Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.
      
      For me this causes the Scala and Java tests each to be run once, as desired. It doesn't affect the SBT build; it only applies to Maven. I still need to verify that all of the Scala and Java tests are actually being run.
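
      A hedged illustration of the resulting behavior:

      ```
      # With surefire re-enabled alongside scalatest, a plain test run executes
      # the Scala suites (scalatest) and the Java suites (surefire), once each:
      mvn test
      ```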
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3651 from srowen/SPARK-4159 and squashes the following commits:
      
      2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
      12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
      e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
      4cba6eb4
  5. Dec 27, 2014
    • [SPARK-3955] Different versions between jackson-mapper-asl and jackson-core-asl · 2483c1ef
      Jongyoul Lee authored
      
      - set the same version for jackson-mapper-asl and jackson-core-asl
      - It's related to #2818
      - recoded the same patch against the latest master
      
      Author: Jongyoul Lee <jongyoul@gmail.com>
      
      Closes #3716 from jongyoul/SPARK-3955 and squashes the following commits:
      
      efa29aa [Jongyoul Lee] [SPARK-3955] Different versions between jackson-mapper-asl and jackson-core-asl - set the same version to jackson-mapper-asl and jackson-core-asl
      2483c1ef
  6. Dec 23, 2014
    • [SPARK-4914][Build] Cleans lib_managed before compiling with Hive 0.13.1 · 395b771f
      Cheng Lian authored
      This PR tries to fix the Hive test failures encountered in PR #3157 by cleaning `lib_managed` before building the assembly jar against Hive 0.13.1 in `dev/run-tests`. Otherwise two sets of datanucleus jars would be left in `lib_managed` and could mess up classpaths while executing Hive test suites. Please refer to [this thread] [1] for details. A clean build would be even safer, but we only clean `lib_managed` here to save build time.
      
      This PR also takes the chance to clean up some minor typos and formatting issues in the comments.
      
      [1]: https://github.com/apache/spark/pull/3157#issuecomment-67656488
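
      A minimal sketch of the idea (not the exact `dev/run-tests` diff; the exact sbt invocation is an assumption based on the commands elsewhere in this log):

      ```
      # Remove datanucleus jars fetched under the previous Hive profile before
      # rebuilding the assembly against Hive 0.13.1:
      rm -rf lib_managed
      sbt/sbt -Phive -Phive-0.13.1 assembly
      ```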
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3756 from liancheng/clean-lib-managed and squashes the following commits:
      
      e2bd21d [Cheng Lian] Adds lib_managed to clean set
      c9f2f3e [Cheng Lian] Cleans lib_managed before compiling with Hive 0.13.1
      395b771f
  7. Dec 19, 2014
    • [Build] Remove spark-staging-1038 · 8e253ebb
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3743 from scwf/abc and squashes the following commits:
      
      7d98bc8 [scwf] removing spark-staging-1038
      8e253ebb
  8. Dec 15, 2014
    • SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from... · 81112e4b
      Sean Owen authored
      SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
      
      This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions.
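
      Enabling JVM assertions boils down to passing `-ea` to the forked test JVMs; a hedged command-line equivalent (illustrative only — the actual change edits the shared test `argLine` in the POMs):

      ```
      # argLine is a standard Surefire property; -ea turns Java assertions on
      mvn test -DargLine=-ea
      ```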
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3692 from srowen/SPARK-4814 and squashes the following commits:
      
      caca704 [Sean Owen] Disable assertions just for Hive
      f71e783 [Sean Owen] Enable assertions for SBT and Maven build
      81112e4b
    • [SPARK-4668] Fix some documentation typos. · 8176b7a0
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #3523 from ryan-williams/tweaks and squashes the following commits:
      
      d2eddaa [Ryan Williams] code review feedback
      ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
      c6cfad9 [Ryan Williams] remove unnecessary if statement
      b74ea35 [Ryan Williams] comment fix
      b0221f0 [Ryan Williams] fix a gendered pronoun
      c71ffed [Ryan Williams] use names on a few boolean parameters
      89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
      e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
      83e8358 [Ryan Williams] fix pom.xml typo
      dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
      8176b7a0
  9. Dec 09, 2014
    • SPARK-4338. [YARN] Ditch yarn-alpha. · 912563aa
      Sandy Ryza authored
      Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:
      
      1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
      9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
      912563aa
  10. Nov 28, 2014
    • [SPARK-4193][BUILD] Disable doclint in Java 8 to prevent from build error. · e464f0ac
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3058 from ueshin/issues/SPARK-4193 and squashes the following commits:
      
      e096bb1 [Takuya UESHIN] Add a plugin declaration to pluginManagement.
      6762ec2 [Takuya UESHIN] Fix usage of -Xdoclint javadoc option.
      fdb280a [Takuya UESHIN] Fix Javadoc errors.
      4745f3c [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4193
      923e2f0 [Takuya UESHIN] Use doclint option `-missing` instead of `none`.
      30d6718 [Takuya UESHIN] Fix Javadoc errors.
      b548017 [Takuya UESHIN] Disable doclint in Java 8 to prevent from build error.
      e464f0ac
    • [SPARK-4643] [Build] Remove unneeded staging repositories from build · 53ed7f1c
      Daoyuan Wang authored
      The old location will return a 404.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3504 from adrian-wang/repo and squashes the following commits:
      
      f604e05 [Daoyuan Wang] already in maven central, remove at all
      f494fac [Daoyuan Wang] spark staging repo outdated
      53ed7f1c
  11. Nov 18, 2014
    • Bumping version to 1.3.0-SNAPSHOT. · 397d3aae
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3277 from vanzin/version-1.3 and squashes the following commits:
      
      7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
      5f404ff [Marcelo Vanzin] Add another exclusion.
      19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
      3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
      e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
      397d3aae
    • [SPARK-4017] show progress bar in console · e34f38ff
      Davies Liu authored
      The progress bar will look like this:
      
      ![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png)
      
      In the right corner, the numbers are: finished tasks, running tasks, total tasks.
      
      After the stage has finished, it will disappear.
      
      The progress bar is only shown if the logging level is WARN or higher (progress in the title is still shown regardless); it can be turned off via spark.driver.showConsoleProgress.
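
      A hedged usage example (the application class and jar path are placeholders):

      ```
      # Disable the console progress bar explicitly when submitting a job
      ./bin/spark-submit --conf spark.driver.showConsoleProgress=false \
        --class org.apache.spark.examples.SparkPi lib/spark-examples-*.jar
      ```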
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3029 from davies/progress and squashes the following commits:
      
      95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      fc49ac8 [Davies Liu] address commentse
      2e90f75 [Davies Liu] show multiple stages in same time
      0081bcc [Davies Liu] address comments
      38c42f1 [Davies Liu] fix tests
      ab87958 [Davies Liu] disable progress bar during tests
      30ac852 [Davies Liu] re-implement progress bar
      b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
      e4e7344 [Davies Liu] refactor
      e1f524d [Davies Liu] revert unnecessary change
      a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      5cae3f2 [Davies Liu] fix style
      ea49fe0 [Davies Liu] address comments
      bc53d99 [Davies Liu] refactor
      e6bb189 [Davies Liu] fix logging in sparkshell
      7e7d4e7 [Davies Liu] address commments
      5df26bb [Davies Liu] fix style
      9e42208 [Davies Liu] show progress bar in console and title
      e34f38ff
  12. Nov 17, 2014
    • [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts · 0f3ceb56
      Josh Rosen authored
      This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).
      
      **The solution implemented here is only a partial fix.**  A complete fix would have the following properties:
      
      1. Only one SparkContext may ever be under construction at any given time.
      2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
      3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
      4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.
      
      This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.
      
      ### The correct solution:
      
      I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object.  Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.).  Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.  For example:
      
      ```scala
      class SparkContext private (deps: SparkContextDependencies) {
        def this(conf: SparkConf) {
          this(SparkContext.getDeps(conf))
        }
      }
      
      object SparkContext {
        private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
          if (anotherSparkContextIsActive) { throw new SparkException(...) }
          var dagScheduler: DAGScheduler = null
          try {
            dagScheduler = new DAGScheduler(...)
            [...]
          } catch {
            case e: Exception =>
              Option(dagScheduler).foreach(_.stop())
              [...]
          }
          SparkContextDependencies(dagScheduler, ...)
        }
      }
      ```
      
      This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.
      
      This indirection is necessary to maintain binary compatibility.  In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.
      
      ### Alternative solutions:
      
      As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block.  Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block.  If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.
      
      The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.
      
      ### This PR's solution:
      
      - At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
      - If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
      - At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context.  If so, throw an exception.
      
      This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor).  If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.
      
      This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`.  The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts.  I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.
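
      A hedged example of opting into the escape hatch (the application class and jar are placeholders):

      ```
      # Allow a second active SparkContext in the same JVM, downgrading the
      # new exception to a warning:
      ./bin/spark-submit --conf spark.driver.allowMultipleContexts=true \
        --class com.example.MyApp my-app.jar
      ```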
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:
      
      23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      d38251b [Josh Rosen] Address latest round of feedback.
      c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
      85a424a [Josh Rosen] Incorporate more review feedback.
      372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      f5bb78c [Josh Rosen] Update mvn build, too.
      d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
      79a7e6f [Josh Rosen] Fix commented out test
      a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
      4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
      ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
      1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
      d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
      c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
      918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
      afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
      0f3ceb56
  13. Nov 16, 2014
    • [SPARK-4419] Upgrade snappy-java to 1.1.1.6 · 7d8e152e
      Josh Rosen authored
      This upgrades snappy-java to 1.1.1.6, which includes a patch that improves error messages when attempting to deserialize empty inputs using SnappyInputStream (see xerial/snappy-java#89).
      
      We previously tried to upgrade to 1.1.1.5 in #2911 but reverted that patch after discovering a memory leak in snappy-java. That leak should have been fixed in 1.1.1.6, though (see xerial/snappy-java#92).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3287 from JoshRosen/SPARK-4419 and squashes the following commits:
      
      5d6f4cc [Josh Rosen] [SPARK-4419] Upgrade snappy-java to 1.1.1.6.
      7d8e152e
  14. Nov 14, 2014
    • SPARK-4375. no longer require -Pscala-2.10 · f5f757e4
      Sandy Ryza authored
      It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like.
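
      An illustrative pair of builds under this change (hedged; per the building-spark docs of this era, switching to 2.11 also involved running `dev/change-version-to-2.11.sh` first):

      ```
      # Scala 2.10 build, no -Pscala-2.10 needed any more:
      mvn -DskipTests clean package
      # Scala 2.11 build, selected via property:
      mvn -Dscala-2.11 -DskipTests clean package
      ```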
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits:
      
      0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
      cd42d94 [Sandy Ryza] Update doc
      f6644c3 [Sandy Ryza] SPARK-4375 take 2
      f5f757e4
  15. Nov 11, 2014
    • Support cross building for Scala 2.11 · daaca14c
      Prashant Sharma authored
      Let's give this another go using a version of Hive that shades its JLine dependency.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:
      
      e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
      f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
      a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
      7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
      583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
      3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
      935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
      925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
      2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
      8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
      5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
      2121071 [Patrick Wendell] Migrating version detection to PySpark
      b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
      1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
      f5cad4e [Patrick Wendell] Add Scala 2.11 docs
      210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
      48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
      e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
      67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
      8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
      e22b104 [Patrick Wendell] Small fix in pom file
      ec402ab [Patrick Wendell] Various fixes
      0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
      4eaec65 [Prashant Sharma] Changed scripts to ignore target.
      5167bea [Prashant Sharma] small correction
      a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
      80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
      034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
      d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
      6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
      e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
      937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
      cb059b0 [Prashant Sharma] Code review
      0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
      daaca14c
    • SPARK-4305 [BUILD] yarn-alpha profile won't build due to network/yarn module · f820b563
      Sean Owen authored
      SPARK-3797 introduced the `network/yarn` module, but its YARN code depends on YARN APIs not present in older versions covered by the `yarn-alpha` profile. As a result builds like `mvn -Pyarn-alpha -Phadoop-0.23 -Dhadoop.version=0.23.7 -DskipTests clean package` fail.
      
      The solution is just to not build `network/yarn` with profile `yarn-alpha`.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3167 from srowen/SPARK-4305 and squashes the following commits:
      
      88938cb [Sean Owen] Don't build network/yarn in yarn-alpha profile as it won't compile
      f820b563
  16. Nov 05, 2014
    • [SPARK-3797] Run external shuffle service in Yarn NM · 61a5cced
      Andrew Or authored
      This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.
      
      This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.
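
      A hedged sketch of the target use case (the config keys match the Spark docs of this era; the application specifics are placeholders):

      ```
      # Dynamic allocation requires the external shuffle service introduced here
      ./bin/spark-submit --master yarn \
        --conf spark.dynamicAllocation.enabled=true \
        --conf spark.shuffle.service.enabled=true \
        --class org.apache.spark.examples.SparkPi lib/spark-examples-*.jar
      ```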
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:
      
      ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      0ee67a2 [Andrew Or] Minor wording suggestions
      1c66046 [Andrew Or] Remove unused provided dependencies
      0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      6489db5 [Andrew Or] Try catch at the right places
      7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
      d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
      5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      9b6e058 [Andrew Or] Address various feedback
      f48b20c [Andrew Or] Fix tests again
      f39daa6 [Andrew Or] Do not make network-yarn an assembly module
      761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      15a5b37 [Andrew Or] Fix build for Hadoop 1.x
      baff916 [Andrew Or] Fix tests
      5bf9b7e [Andrew Or] Address a few minor comments
      5b419b8 [Andrew Or] Add missing license header
      804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
      cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
      ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
      1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
      b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
      43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
      b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
      61a5cced
  17. Nov 03, 2014
    • [SPARK-4211][Build] Fixes hive.version in Maven profile hive-0.13.1 · df607da0
      fi authored
      The profile should reference the Spark-compatible `hive.version=0.13.1a` instead of `hive.version=0.13.1`,
      e.g. mvn -Phive -Phive-0.13.1
      
      Note: `hive.version=0.13.1a` is the default property value. However, when the `hive-0.13.1` maven profile was specified explicitly, the wrong version would be selected.
      References: PR #2685, which resolved a package incompatibility issue with Hive 0.13.1 by introducing the special version 0.13.1a
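
      An illustrative build selecting the fixed profile (hedged sketch):

      ```
      # With the fix, the hive-0.13.1 profile resolves the Spark-compatible
      # 0.13.1a artifacts instead of stock 0.13.1:
      mvn -Phive -Phive-0.13.1 -DskipTests clean package
      ```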
      
      Author: fi <coderfi@gmail.com>
      
      Closes #3072 from coderfi/master and squashes the following commits:
      
      7ca4b1e [fi] Fixes the `hive-0.13.1` maven profile referencing `hive.version=0.13.1` instead of the Spark compatible `hive.version=0.13.1a` Note: `hive.version=0.13.1a` is the default version. However, when explicitly specifying the `hive-0.13.1` maven profile, the wrong one would be selected. e.g. mvn -Phive -Phive=0.13.1 See PR #2685
      df607da0
  18. Nov 01, 2014
    • [SPARK-4121] Set commons-math3 version based on hadoop profiles, instead of shading · d8176b1c
      Xiangrui Meng authored
      In #2928, we shaded commons-math3 to prevent future conflicts with hadoop. It caused problems with our Jenkins master build with maven. Some tests used local-cluster mode, where the assembly jar contains relocated math3 classes, while the mllib test code still compiles against core and the untouched math3 classes.
      
      This PR sets commons-math3 version based on hadoop profiles.
      
      pwendell JoshRosen srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3023 from mengxr/SPARK-4121-alt and squashes the following commits:
      
      580f6d9 [Xiangrui Meng] replace tab by spaces
      7f71f08 [Xiangrui Meng] revert changes to PoissonSampler to avoid conflicts
      d3353d9 [Xiangrui Meng] do not shade commons-math3
      b4180dc [Xiangrui Meng] temp work
      d8176b1c
    • [SPARK-3796] Create external service which can serve shuffle files · f55218ae
      Aaron Davidson authored
      This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (setup in BlockManager).
      
      This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).
      
      There are several outstanding tasks which must be complete before this PR can be merged:
      - [x] Complete unit testing of network/shuffle package.
      - [x] Performance and correctness testing on a real cluster.
      - [x] Remove example service instantiation from Worker.scala.
      
      There are even more shortcomings of this PR which should be addressed in followup patches:
      - Don't use Java serializer for RPC layer! It is not cross-version compatible.
      - Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
      - Documentation of the feature in the Spark docs.
      - Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
      - SSL and SASL integration
      - Nice to have: Handle shuffle file consolidation (this would require changes to Spark's implementation).
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3001 from aarondav/shuffle-service and squashes the following commits:
      
      4d1f8c1 [Aaron Davidson] Remove changes to Worker
      705748f [Aaron Davidson] Rename Standalone* to External*
      fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
      9883918 [Aaron Davidson] Make suggested build changes
      3d62679 [Aaron Davidson] Add Spark integration test
      7fe51d5 [Aaron Davidson] Fix SBT integration
      56caa50 [Aaron Davidson] Address comments
      c8d1ac3 [Aaron Davidson] Add unit tests
      2f70c0c [Aaron Davidson] Fix unit tests
      5483e96 [Aaron Davidson] Fix unit tests
      46a70bf [Aaron Davidson] Whoops, bracket
      5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
      f55218ae
    • Upgrading to roaring 0.4.5 (bug fix release) · 680fd87c
      Daniel Lemire authored
      I recommend upgrading roaring to 0.4.5 as it fixes a rarely occurring bug in iterators (that would otherwise throw an unwarranted exception). The upgrade should have no other consequence.
      
      Author: Daniel Lemire <lemire@gmail.com>
      
      Closes #3044 from lemire/master and squashes the following commits:
      
      54018c5 [Daniel Lemire] Recommended update to roaring 0.4.5 (bug fix release)
      048933e [Daniel Lemire] Merge remote-tracking branch 'upstream/master'
      431f3a0 [Daniel Lemire] Recommended bug fix release
      680fd87c
  19. Oct 31, 2014
    • [SPARK-3826][SQL]enable hive-thriftserver to support hive-0.13.1 · 7c41d135
      wangfei authored
      In #2241 hive-thriftserver is not enabled. This patch enables hive-thriftserver to support hive-0.13.1 by using a shim layer, following the approach of #2241.

      1 A light shim layer (code in sql/hive-thriftserver/hive-version) for each hive version to handle API compatibility

      2 New pom profiles "hive-default" and "hive-versions" (copied from #2241) to activate different hive versions

      3 SBT cmd for different versions as follows:
        hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
        hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly

      4 Since hive-thriftserver depends on the hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver
      
      Author: wangfei <wangfei1@huawei.com>
      Author: scwf <wangfei1@huawei.com>
      
      Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:
      
      f26f3be [wangfei] remove clean to save time
      f5cac74 [wangfei] remove local hivecontext test
      578234d [wangfei] use new shaded hive
      18fb1ff [wangfei] exclude kryo in hive pom
      fa21d09 [wangfei] clean package assembly/assembly
      8a4daf2 [wangfei] minor fix
      0d7f6cf [wangfei] address comments
      f7c93ae [wangfei] adding build with hive 0.13 before running tests
      bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
      c359822 [wangfei] reuse getCommandProcessor in hiveshim
      52674a4 [scwf] sql/hive included since examples depend on it
      3529e98 [scwf] move hive module to hive profile
      f51ff4e [wangfei] update and fix conflicts
      f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
      41f727b [scwf] revert pom changes
      13afde0 [scwf] fix small bug
      4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
      0bc53aa [scwf] fixed when result filed is null
      dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
      c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
      7c66b8e [scwf] update pom according spark-2706
      ae47489 [scwf] update and fix conflicts
      7c41d135
  20. Oct 30, 2014
    • [SPARK-3968][SQL] Use parquet-mr filter2 api · 2e35e242
      Yash Datta authored
      The parquet-mr project has introduced a new filter api (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire RowGroups based on statistics such as min/max.
      We can leverage that to further improve the performance of queries with filters.
      The filter2 api also introduces the ability to create custom filters. We can create a custom filter for the optimized In clause (InSet), so that elimination happens in the ParquetRecordReader itself.
      
      Author: Yash Datta <Yash.Datta@guavus.com>
      
      Closes #2841 from saucam/master and squashes the following commits:
      
      8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns
      515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column
      5f4530e [Yash Datta] SPARK-3968: Fix scala code style
      f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering
      ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter
      48163c3 [Yash Datta] SPARK-3968: Code cleanup
      cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working             2. Use the serialization/deserialization from Parquet library for filter pushdown
      caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api
      49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns
      9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr
      2e35e242
  21. Oct 29, 2014
    • [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core · dff01553
      Reynold Xin authored
      This PR encapsulates #2330, which is itself a continuation of #2240. The first goal of this PR is to provide an alternate, simpler implementation of the ConnectionManager which is based on Netty.
      
      In addition to this goal, however, we want to resolve [SPARK-3796](https://issues.apache.org/jira/browse/SPARK-3796), which calls for a standalone shuffle service which can be integrated into the YARN NodeManager, Standalone Worker, or on its own. This PR makes the first step in this direction by ensuring that the actual Netty service is as small as possible and extracted from Spark core. Given this, we should be able to construct this standalone jar which can be included in other JVMs without incurring significant dependency or runtime issues. The actual work to ensure that such a standalone shuffle service would work in Spark will be left for a future PR, however.
      
      In order to minimize dependencies and allow for the service to be long-running (possibly much longer-running than Spark, and possibly having to support multiple versions of Spark simultaneously), the entire service has been ported to Java, where we have full control over the binary compatibility of the components and do not depend on the Scala runtime or version.
      
      These issues have been addressed by folding in #2330:
      
      SPARK-3453: Refactor Netty module to use BlockTransferService interface
      SPARK-3018: Release all buffers upon task completion/failure
      SPARK-3002: Create a connection pool and reuse clients across different threads
      SPARK-3017: Integration tests and unit tests for connection failures
      SPARK-3049: Make sure client doesn't block when server/connection has error(s)
      SPARK-3502: SO_RCVBUF and SO_SNDBUF should be bootstrap childOption, not option
      SPARK-3503: Disable thread local cache in PooledByteBufAllocator
      
      TODO before mergeable:
      - [x] Implement uploadBlock()
      - [x] Unit tests for RPC side of code
      - [x] Performance testing (see comments [here](https://github.com/apache/spark/pull/2753#issuecomment-59475022))
      - [x] Turn OFF by default (currently on for unit testing)
      
      Author: Reynold Xin <rxin@apache.org>
      Author: Aaron Davidson <aaron@databricks.com>
      Author: cocoatomo <cocoatomo77@gmail.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Anand Avati <avati@redhat.com>
      
      Closes #2753 from aarondav/netty and squashes the following commits:
      
      cadfd28 [Aaron Davidson] Turn netty off by default
      d7be11b [Aaron Davidson] Turn netty on by default
      4a204b8 [Aaron Davidson] Fail block fetches if client connection fails
      2b0d1c0 [Aaron Davidson] 100ch
      0c5bca2 [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
      14e37f7 [Aaron Davidson] Address Reynold's comments
      8dfcceb [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
      322dfc1 [Aaron Davidson] Address Reynold's comments, including major rename
      e5675a4 [Aaron Davidson] Fail outstanding RPCs as well
      ccd4959 [Aaron Davidson] Don't throw exception if client immediately fails
      9da0bc1 [Aaron Davidson] Add RPC unit tests
      d236dfd [Aaron Davidson] Remove no-op serializer :)
      7b7a26c [Aaron Davidson] Fix Nio compile issue
      dd420fd [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty-test
      939f276 [Aaron Davidson] Attempt to make comm. bidirectional
      aa58f67 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
      8dc1ded [cocoatomo] [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
      5b5dbe6 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
      2c5d9dc [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
      020691e [Davies Liu] [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
      ae4083a [Anand Avati] [SPARK-2805] Upgrade Akka to 2.3.4
      29c6dcf [Aaron Davidson] [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
      f7e7568 [Reynold Xin] Fixed spark.shuffle.io.receiveBuffer setting.
      5d98ce3 [Reynold Xin] Flip buffer.
      f6c220d [Reynold Xin] Merge with latest master.
      407e59a [Reynold Xin] Fix style violation.
      a0518c7 [Reynold Xin] Implemented block uploads.
      4b18db2 [Reynold Xin] Copy the buffer in fetchBlockSync.
      bec4ea2 [Reynold Xin] Removed OIO and added num threads settings.
      1bdd7ee [Reynold Xin] Fixed tests.
      d68f328 [Reynold Xin] Logging close() in case close() fails.
      f63fb4c [Reynold Xin] Add more debug message.
      6afc435 [Reynold Xin] Added logging.
      c066309 [Reynold Xin] Implement java.io.Closeable interface.
      519d64d [Reynold Xin] Mark private package visibility and MimaExcludes.
      f0a16e9 [Reynold Xin] Fixed test hanging.
      14323a5 [Reynold Xin] Removed BlockManager.getLocalShuffleFromDisk.
      b2f3281 [Reynold Xin] Added connection pooling.
      d23ed7b [Reynold Xin] Incorporated feedback from Norman: - use same pool for boss and worker - remove ioratio - disable caching of byte buf allocator - childoption sendbuf/receivebuf - fire exception through pipeline
      9e0cb87 [Reynold Xin] Fixed BlockClientHandlerSuite
      5cd33d7 [Reynold Xin] Fixed style violation.
      cb589ec [Reynold Xin] Added more test cases covering cleanup when fault happens in ShuffleBlockFetcherIteratorSuite
      1be4e8e [Reynold Xin] Shorten NioManagedBuffer and NettyManagedBuffer class names.
      108c9ed [Reynold Xin] Forgot to add TestSerializer to the commit list.
      b5c8d1f [Reynold Xin] Fixed ShuffleBlockFetcherIteratorSuite.
      064747b [Reynold Xin] Reference count buffers and clean them up properly.
      2b44cf1 [Reynold Xin] Added more documentation.
      1760d32 [Reynold Xin] Use Epoll.isAvailable in BlockServer as well.
      165eab1 [Reynold Xin] [SPARK-3453] Refactor Netty module to use BlockTransferService.
      dff01553
  22. Oct 27, 2014
    • SPARK-4022 [CORE] [MLLIB] Replace colt dependency (LGPL) with commons-math · bfa614b1
      Sean Owen authored
      This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2928 from srowen/SPARK-4022 and squashes the following commits:
      
      61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample()
      16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator
      a1a78e0 [Sean Owen] Use Well19937c
      31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems
      5c9c67f [Sean Owen] Additional test fixes from review
      d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
      bfa614b1
  23. Oct 26, 2014
    • [SPARK-3616] Add basic Selenium tests to WebUISuite · bf589fc7
      Josh Rosen authored
      This patch adds Selenium tests for Spark's web UI.  To avoid adding extra
      dependencies to the test environment, the tests use Selenium's HtmlUnitDriver,
      which is pure-Java, instead of, say, ChromeDriver.
      
      I added new tests to try to reproduce a few UI bugs reported on JIRA, namely
      SPARK-3021, SPARK-2105, and SPARK-2527.  I wasn't able to reproduce these bugs;
      I suspect that the older ones might have been fixed by other patches.
      
      In order to use HtmlUnitDriver, I added an explicit dependency on the
      org.apache.httpcomponents version of httpclient in order to prevent jets3t's
      older version from taking precedence on the classpath.
      
      I also upgraded ScalaTest to 2.2.1.
      
      Author: Josh Rosen <joshrosen@apache.org>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2474 from JoshRosen/webui-selenium-tests and squashes the following commits:
      
      fcc9e83 [Josh Rosen] scalautils -> scalactic package rename
      510e54a [Josh Rosen] [SPARK-3616] Add basic Selenium tests to WebUISuite.
      bf589fc7
    • Update RoaringBitmap to 0.4.3 · b7595401
      Daniel Lemire authored
      Roaring has been updated to version 0.4.3. We fixed a rarely occurring bug with serialization. No API or format changes were made.
      
      Author: Daniel Lemire <lemire@gmail.com>
      
      Closes #2938 from lemire/master and squashes the following commits:
      
      431f3a0 [Daniel Lemire] Recommended bug fix release
      b7595401
  24. Oct 24, 2014
    • [SPARK-4056] Upgrade snappy-java to 1.1.1.5 · 898b22ab
      Josh Rosen authored
      This upgrades snappy-java to 1.1.1.5, which improves error messages when attempting to deserialize empty inputs using SnappyInputStream (see https://github.com/xerial/snappy-java/issues/89).
      
      Author: Josh Rosen <rosenville@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2911 from JoshRosen/upgrade-snappy-java and squashes the following commits:
      
      adec96c [Josh Rosen] Use snappy-java 1.1.1.5
      cc953d6 [Josh Rosen] [SPARK-4056] Upgrade snappy-java to 1.1.1.4
      898b22ab
    • [SPARK-2706][SQL] Enable Spark to support Hive 0.13 · 7c89a8f0
      Zhan Zhang authored
      Given that a lot of users are trying to use Hive 0.13 in Spark, and given the API-level incompatibility between hive-0.12 and hive-0.13, I want to propose the following approach, which has no or minimal impact on existing hive-0.12 support but can jumpstart the development of hive-0.13 and future version support.
      
      Approach: Introduce a “hive-version” property, and manipulate the pom.xml files to support different hive versions at compile time through a shim layer, e.g., hive-0.12.0 and hive-0.13.1. More specifically,
      
      1. For each different hive version, there is a very light layer of shim code to handle API differences, sitting in sql/hive/hive-version, e.g., sql/hive/v0.12.0 or sql/hive/v0.13.1
      
      2. Add a new profile, hive-default, active by default, which picks up all existing configuration and the hive-0.12.0 shim (v0.12.0) if no hive.version is specified.
      
      3. If the user specifies a different version (currently only 0.13.1, via -Dhive.version=0.13.1), the hive-versions profile will be activated, picking up the hive-version-specific shim layer and configuration, mainly the hive jars and hive-version shim, e.g., v0.13.1.
      
      4. With this approach, nothing is changed with current hive-0.12 support.
      
      No change by default: sbt/sbt -Phive
      For example: sbt/sbt -Phive -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
      
      To enable hive-0.13: sbt/sbt -Dhive.version=0.13.1
      For example: sbt/sbt -Dhive.version=0.13.1 -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 assembly
      
      Note that in hive-0.13, hive-thriftserver is not enabled; that should be fixed in a separate JIRA. We don’t need -Phive together with -Dhive.version when building (probably we should use -Phive -Dhive.version=xxx instead once the thrift server is also supported with hive-0.13.1).
      
      Author: Zhan Zhang <zhazhan@gmail.com>
      Author: zhzhan <zhazhan@gmail.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2241 from zhzhan/spark-2706 and squashes the following commits:
      
      3ece905 [Zhan Zhang] minor fix
      410b668 [Zhan Zhang] solve review comments
      cbb4691 [Zhan Zhang] change run-test for new options
      0d4d2ed [Zhan Zhang] rebase
      497b0f4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      8fad1cf [Zhan Zhang] change the pom file and make hive-0.13.1 as the default
      ab028d1 [Zhan Zhang] rebase
      4a2e36d [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      4cb1b93 [zhzhan] Merge pull request #1 from pwendell/pr-2241
      b0478c0 [Patrick Wendell] Changes to simplify the build of SPARK-2706
      2b50502 [Zhan Zhang] rebase
      a72c0d4 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      cb22863 [Zhan Zhang] correct the typo
      20f6cf7 [Zhan Zhang] solve compatability issue
      f7912a9 [Zhan Zhang] rebase and solve review feedback
      301eb4a [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      10c3565 [Zhan Zhang] address review comments
      6bc9204 [Zhan Zhang] rebase and remove temparory repo
      d3aa3f2 [Zhan Zhang] Merge branch 'master' into spark-2706
      cedcc6f [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      3ced0d7 [Zhan Zhang] rebase
      d9b981d [Zhan Zhang] rebase and fix error due to rollback
      adf4924 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      3dd50e8 [Zhan Zhang] solve conflicts and remove unnecessary implicts
      d10bf00 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      dc7bdb3 [Zhan Zhang] solve conflicts
      7e0cc36 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      d7c3e1e [Zhan Zhang] Merge branch 'master' into spark-2706
      68deb11 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      d48bd18 [Zhan Zhang] address review comments
      3ee3b2b [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      57ea52e [Zhan Zhang] Merge branch 'master' into spark-2706
      2b0d513 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      9412d24 [Zhan Zhang] address review comments
      f4af934 [Zhan Zhang] rebase
      1ccd7cc [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      128b60b [Zhan Zhang] ignore 0.12.0 test cases for the time being
      af9feb9 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      5f5619f [Zhan Zhang] restructure the directory and different hive version support
      05d3683 [Zhan Zhang] solve conflicts
      e4c1982 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      94b4fdc [Zhan Zhang] Spark-2706: hive-0.13.1 support on spark
      87ebf3b [Zhan Zhang] Merge branch 'master' into spark-2706
      921e914 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      f896b2a [Zhan Zhang] Merge branch 'master' into spark-2706
      789ea21 [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      cb53a2c [Zhan Zhang] Merge branch 'master' of https://github.com/apache/spark
      f6a8a40 [Zhan Zhang] revert
      ba14f28 [Zhan Zhang] test
      dbedff3 [Zhan Zhang] Merge remote-tracking branch 'upstream/master'
      70964fe [Zhan Zhang] revert
      fe0f379 [Zhan Zhang] Merge branch 'master' of https://github.com/zhzhan/spark
      70ffd93 [Zhan Zhang] revert
      42585ec [Zhan Zhang] test
      7d5fce2 [Zhan Zhang] test
      7c89a8f0
    • SPARK-3812 Build changes to publish effective pom. · 0aea2289
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2921 from ScrapCodes/build-changes-effective-pom and squashes the following commits:
      
      8841491 [Prashant Sharma] Fixed broken maven build.
      aa7b91d [Prashant Sharma] used an unused dep.
      0300dac [Prashant Sharma] improved comment messages..
      28f891e [Prashant Sharma] Added a useless dependency, so that we can shade it. And realized fake shading works for us.
      553d96b [Prashant Sharma] Shaded some unused class of an unused dep, to generate effective pom(s)
      0aea2289
  25. Oct 23, 2014
    • [SPARK-4019] [SPARK-3740] Fix MapStatus compression bug that could lead to... · 83b7a1c6
      Josh Rosen authored
      [SPARK-4019] [SPARK-3740] Fix MapStatus compression bug that could lead to empty results or Snappy errors
      
      This commit fixes a bug in MapStatus that could cause jobs to wrongly return
      empty results if those jobs contained stages with more than 2000 partitions
      where most of those partitions were empty.
      
      For jobs with > 2000 partitions, MapStatus uses HighlyCompressedMapStatus,
      which only stores the average size of blocks.  If the average block size is
      zero, then this will cause all blocks to be reported as empty, causing
      BlockFetcherIterator to mistakenly skip them.
      
      For example, this would return an empty result:
      
          sc.makeRDD(0 until 10, 1000).repartition(2001).collect()
      
      This can also lead to deserialization errors (e.g. Snappy decoding errors)
      for jobs with > 2000 partitions where the average block size is non-zero but
      there is at least one empty block.  In this case, the BlockFetcher attempts to
      fetch empty blocks and fails when trying to deserialize them.
      
      The root problem here is that MapStatus has a (previously undocumented)
      correctness property that was violated by HighlyCompressedMapStatus:
      
          If a block is non-empty, then getSizeForBlock must be non-zero.
      
      I fixed this by modifying HighlyCompressedMapStatus to store the average size
      of _non-empty_ blocks and to use a compressed bitmap to track which blocks are
      empty.
      
      I also removed a test which was broken as originally written: it attempted
      to check that HighlyCompressedMapStatus's size estimation error was < 10%,
      but this was broken because HighlyCompressedMapStatus is only used for map
      statuses with > 2000 partitions, but the test only created 50.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2866 from JoshRosen/spark-4019 and squashes the following commits:
      
      fc8b490 [Josh Rosen] Roll back hashset change, which didn't improve performance.
      5faa0a4 [Josh Rosen] Incorporate review feedback
      c8b8cae [Josh Rosen] Two performance fixes:
      3b892dd [Josh Rosen] Address Reynold's review comments
      ba2e71c [Josh Rosen] Add missing newline
      609407d [Josh Rosen] Use Roaring Bitmap to track non-empty blocks.
      c23897a [Josh Rosen] Use sets when comparing collect() results
      91276a3 [Josh Rosen] [SPARK-4019] Fix MapStatus compression bug that could lead to empty results.
      83b7a1c6
    • Revert "[SPARK-3812] [BUILD] Adapt maven build to publish effective pom." · 222fa47f
      Patrick Wendell authored
      This reverts commit c5882c66.
      
      I am reverting this because it appears to cause the maven tests to hang.
      222fa47f