  1. Dec 09, 2014
• SPARK-4338. [YARN] Ditch yarn-alpha. · 912563aa
      Sandy Ryza authored
      Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:
      
      1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
      9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
      912563aa
  2. Dec 04, 2014
• [SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's MLlib group · 20bfea4a
      lewuathe authored
This is #3554 from Lewuathe, except that I put both `spark.ml` and `spark.mllib` in the group `MLlib`.
      
      Closes #3554
      
      jkbradley
      
      Author: lewuathe <lewuathe@me.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits:
      
      184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc
      f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
      20bfea4a
  3. Nov 28, 2014
• [SPARK-4193][BUILD] Disable doclint in Java 8 to prevent from build error. · e464f0ac
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3058 from ueshin/issues/SPARK-4193 and squashes the following commits:
      
      e096bb1 [Takuya UESHIN] Add a plugin declaration to pluginManagement.
      6762ec2 [Takuya UESHIN] Fix usage of -Xdoclint javadoc option.
      fdb280a [Takuya UESHIN] Fix Javadoc errors.
      4745f3c [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4193
      923e2f0 [Takuya UESHIN] Use doclint option `-missing` instead of `none`.
      30d6718 [Takuya UESHIN] Fix Javadoc errors.
      b548017 [Takuya UESHIN] Disable doclint in Java 8 to prevent from build error.
      e464f0ac
  4. Nov 26, 2014
• [SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices · 561d31d2
      Xiangrui Meng authored
Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another change we want to make is to `Matrix.randn` and `Matrix.rand`, both of which should take a `Random` implementation; otherwise, it is very likely to produce inconsistent RDDs. I also added some unit tests for matrix factory methods. All APIs are new in 1.2, so there are no incompatible changes.
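
A minimal sketch of the intended usage, assuming `Matrices.rand`/`Matrices.randn` factory methods that take the RNG as their last argument:

```scala
import java.util.Random
import org.apache.spark.mllib.linalg.{Matrices, Matrix}

// Passing an explicit RNG keeps the generated matrices reproducible.
val rng = new Random(42L)
val uniform: Matrix = Matrices.rand(3, 4, rng)   // entries drawn from U(0, 1)
val gaussian: Matrix = Matrices.randn(3, 4, rng) // entries drawn from N(0, 1)
```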
      
      brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3468 from mengxr/SPARK-4614 and squashes the following commits:
      
      3b0e4e2 [Xiangrui Meng] add mima excludes
      6bfd8a4 [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests
      561d31d2
  5. Nov 19, 2014
• Updating GraphX programming guide and documentation · 377b0682
      Joseph E. Gonzalez authored
      This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
      
      4421964 [Joseph E. Gonzalez] updating documentation for graphx
      377b0682
• [SPARK-4429][BUILD] Build for Scala 2.11 using sbt fails. · f9adda9a
      Takuya UESHIN authored
      I tried to build for Scala 2.11 using sbt with the following command:
      
      ```
      $ sbt/sbt -Dscala-2.11 assembly
      ```
      
      but it ends with the following error messages:
      
      ```
      [error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
      [error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found
      ```
      
The reason is: if the system property `-Dscala-2.11` is set without a value, `SparkBuild.scala` adds the `scala-2.11` profile, but `sbt-pom-reader` activates the `scala-2.10` profile instead of the `scala-2.11` profile, because the activator `PropertyProfileActivator` used internally by `sbt-pom-reader` checks whether the property value is empty.

If the value is set to a non-empty value, there is no need to add the profile in `SparkBuild.scala`, because `sbt-pom-reader` can then handle it as expected.
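
A minimal sketch of the idea (not the actual `SparkBuild.scala` code): if the flag-style property is present but empty, give it a non-empty value so that `PropertyProfileActivator` activates the right profile.

```scala
object ScalaVersionFlag {
  // "scala-2.11" is the property name used by the build; the value itself is arbitrary,
  // it only needs to be non-empty for profile activation to work.
  def normalize(): Unit = {
    if (sys.props.get("scala-2.11").exists(_.isEmpty)) {
      sys.props("scala-2.11") = "true"
    }
  }
}
```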
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3342 from ueshin/issues/SPARK-4429 and squashes the following commits:
      
      14d86e8 [Takuya UESHIN] Add a comment.
      4eef52b [Takuya UESHIN] Remove unneeded condition.
      ce98d0f [Takuya UESHIN] Set non-empty value to system property "scala-2.11" if the property exists instead of adding profile.
      f9adda9a
• [HOT FIX] MiMa tests are broken · 0df02ca4
      Andrew Or authored
      This is blocking #3353 and other patches.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3371 from andrewor14/mima-hot-fix and squashes the following commits:
      
      842d059 [Andrew Or] Move excludes to the right section
      c4d4f4e [Andrew Or] MIMA hot fix
      0df02ca4
  6. Nov 18, 2014
• Bumping version to 1.3.0-SNAPSHOT. · 397d3aae
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3277 from vanzin/version-1.3 and squashes the following commits:
      
      7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
      5f404ff [Marcelo Vanzin] Add another exclusion.
      19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
      3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
      e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
      397d3aae
• [SPARK-4017] show progress bar in console · e34f38ff
      Davies Liu authored
      The progress bar will look like this:
      
      ![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png)
      
      In the right corner, the numbers are: finished tasks, running tasks, total tasks.
      
      After the stage has finished, it will disappear.
      
The progress bar is only shown if the logging level is WARN or higher (progress in the title is still shown); it can be turned off by spark.driver.showConsoleProgress.
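
For example, a hypothetical driver setup that turns the console progress bar off explicitly, using the key named above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Disable the console progress bar for this application.
val conf = new SparkConf()
  .setAppName("progress-bar-demo")
  .setMaster("local[*]")
  .set("spark.driver.showConsoleProgress", "false")
val sc = new SparkContext(conf)
```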
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3029 from davies/progress and squashes the following commits:
      
      95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      fc49ac8 [Davies Liu] address commentse
      2e90f75 [Davies Liu] show multiple stages in same time
      0081bcc [Davies Liu] address comments
      38c42f1 [Davies Liu] fix tests
      ab87958 [Davies Liu] disable progress bar during tests
      30ac852 [Davies Liu] re-implement progress bar
      b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
      e4e7344 [Davies Liu] refactor
      e1f524d [Davies Liu] revert unnecessary change
      a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      5cae3f2 [Davies Liu] fix style
      ea49fe0 [Davies Liu] address comments
      bc53d99 [Davies Liu] refactor
      e6bb189 [Davies Liu] fix logging in sparkshell
      7e7d4e7 [Davies Liu] address commments
      5df26bb [Davies Liu] fix style
      9e42208 [Davies Liu] show progress bar in console and title
      e34f38ff
  7. Nov 17, 2014
• [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts · 0f3ceb56
      Josh Rosen authored
      This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).
      
      **The solution implemented here is only a partial fix.**  A complete fix would have the following properties:
      
      1. Only one SparkContext may ever be under construction at any given time.
      2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
      3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
      4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.
      
      This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.
      
      ### The correct solution:
      
      I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object.  Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.).  Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.  For example:
      
      ```scala
      class SparkContext private (deps: SparkContextDependencies) {
        def this(conf: SparkConf) {
          this(SparkContext.getDeps(conf))
        }
      }
      
object SparkContext {
  private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
    if (anotherSparkContextIsActive) { throw new Exception(...) }
          var dagScheduler: DAGScheduler = null
          try {
              dagScheduler = new DAGScheduler(...)
              [...]
          } catch {
            case e: Exception =>
               Option(dagScheduler).foreach(_.stop())
                [...]
          }
          SparkContextDependencies(dagScheduler, ....)
        }
      }
      ```
      
      This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.
      
      This indirection is necessary to maintain binary compatibility.  In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.
      
      ### Alternative solutions:
      
      As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block.  Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block.  If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.
      
      The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.
      
      ### This PR's solution:
      
      - At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
      - If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
      - At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context.  If so, throw an exception.
      
      This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor).  If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.
      
      This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`.  The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts.  I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.
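
A small sketch of the escape hatch described above, e.g. for a legacy test suite that cannot yet be refactored:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Downgrade the "multiple active SparkContexts" error to a warning.
// This is an escape hatch for existing suites, not a supported mode.
val conf = new SparkConf()
  .setAppName("legacy-suite")
  .setMaster("local[*]")
  .set("spark.driver.allowMultipleContexts", "true")
val sc = new SparkContext(conf)
```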
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:
      
      23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      d38251b [Josh Rosen] Address latest round of feedback.
      c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
      85a424a [Josh Rosen] Incorporate more review feedback.
      372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      f5bb78c [Josh Rosen] Update mvn build, too.
      d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
      79a7e6f [Josh Rosen] Fix commented out test
      a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
      4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
      ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
      1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
      d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
      c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
      918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
      afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
      0f3ceb56
  8. Nov 14, 2014
• [SPARK-4062][Streaming]Add ReliableKafkaReceiver in Spark Streaming Kafka connector · 5930f64b
      jerryshao authored
Add ReliableKafkaReceiver in the Kafka connector to prevent data loss when the WAL in Spark Streaming is enabled. Details and the design doc can be seen in [SPARK-4062](https://issues.apache.org/jira/browse/SPARK-4062).
      
      Author: jerryshao <saisai.shao@intel.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Saisai Shao <saisai.shao@intel.com>
      
      Closes #2991 from jerryshao/kafka-refactor and squashes the following commits:
      
      5461f1c [Saisai Shao] Merge pull request #8 from tdas/kafka-refactor3
      eae4ad6 [Tathagata Das] Refectored KafkaStreamSuiteBased to eliminate KafkaTestUtils and made Java more robust.
      fab14c7 [Tathagata Das] minor update.
      149948b [Tathagata Das] Fixed mistake
      14630aa [Tathagata Das] Minor updates.
      d9a452c [Tathagata Das] Minor updates.
      ec2e95e [Tathagata Das] Removed the receiver's locks and essentially reverted to Saisai's original design.
      2a20a01 [jerryshao] Address some comments
      9f636b3 [Saisai Shao] Merge pull request #5 from tdas/kafka-refactor
      b2b2f84 [Tathagata Das] Refactored Kafka receiver logic and Kafka testsuites
      e501b3c [jerryshao] Add Mima excludes
      b798535 [jerryshao] Fix the missed issue
      e5e21c1 [jerryshao] Change to while loop
      ea873e4 [jerryshao] Further address the comments
      98f3d07 [jerryshao] Fix comment style
      4854ee9 [jerryshao] Address all the comments
      96c7a1d [jerryshao] Update the ReliableKafkaReceiver unit test
      8135d31 [jerryshao] Fix flaky test
      a949741 [jerryshao] Address the comments
      16bfe78 [jerryshao] Change the ordering of imports
      0894aef [jerryshao] Add some comments
      77c3e50 [jerryshao] Code refactor and add some unit tests
      dd9aeeb [jerryshao] Initial commit for reliable Kafka receiver
      5930f64b
• SPARK-4375. no longer require -Pscala-2.10 · f5f757e4
      Sandy Ryza authored
It seems like the winds might have moved away from this approach, but I wanted to post the PR anyway because I got it working and to show what it would look like.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits:
      
      0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
      cd42d94 [Sandy Ryza] Update doc
      f6644c3 [Sandy Ryza] SPARK-4375 take 2
      f5f757e4
  9. Nov 12, 2014
• [SPARK-4281][Build] Package Yarn shuffle service into its own jar · aa43a8da
      Andrew Or authored
      This is another addendum to #3082, which added the Yarn shuffle service to run inside the NM. This PR makes the feature much more usable by packaging enough dependencies into the jar to run the service inside an NM. After these changes, the user can run `./make-distribution.sh` and find a `spark-network-yarn*.jar` in their `lib` directory. The equivalent change is done in SBT by making the `network-yarn` module an assembly project.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3147 from andrewor14/yarn-shuffle-build and squashes the following commits:
      
      bda58d0 [Andrew Or] Fix line too long
      81e9705 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-build
      fb7f398 [Andrew Or] Rename jar to spark-{VERSION}-yarn-shuffle.jar
      65db822 [Andrew Or] Actually mark slf4j as provided
      abcefd1 [Andrew Or] Do the same for SBT
      c653028 [Andrew Or] Package network-yarn and its dependencies
      aa43a8da
  10. Nov 11, 2014
• Support cross building for Scala 2.11 · daaca14c
      Prashant Sharma authored
      Let's give this another go using a version of Hive that shades its JLine dependency.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:
      
      e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
      f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
      a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
      7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
      583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
      3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
      935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
      925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
      2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
      8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
      5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
      2121071 [Patrick Wendell] Migrating version detection to PySpark
      b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
      1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
      f5cad4e [Patrick Wendell] Add Scala 2.11 docs
      210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
      48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
      e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
      67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
      8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
      e22b104 [Patrick Wendell] Small fix in pom file
      ec402ab [Patrick Wendell] Various fixes
      0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
      4eaec65 [Prashant Sharma] Changed scripts to ignore target.
      5167bea [Prashant Sharma] small correction
      a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
      80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
      034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
      d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
      6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
      e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
      937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
      cb059b0 [Prashant Sharma] Code review
      0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
      daaca14c
  11. Nov 10, 2014
• Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without... · 6e7a309b
      Patrick Wendell authored
      Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally."
      
      This reverts commit bd86cb17.
      6e7a309b
• [SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a... · bd86cb17
      RongGu authored
[SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally.
      
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #3030 from RongGu/SPARK-2703 and squashes the following commits:
      
      ad08827 [RongGu] Make Tachyon related unit tests execute without deploying a Tachyon system locally
      bd86cb17
• SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use... · f8e57323
      Sean Owen authored
      SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
      
      andrewor14 Another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619
      
I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test`. I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version=1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage.
      
      So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away.
      
This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but it does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. This would also only affect Hadoop 1.0.x, it seems.
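
A generic sketch of the reflection trick being described (illustrative only, not the exact Spark code):

```scala
// Look up a constructor that is not public and make it callable at run time.
// A strict SecurityManager could reject the setAccessible call.
def newInstanceViaReflection[T](clazz: Class[T], args: AnyRef*): T = {
  val ctor = clazz.getDeclaredConstructors.head
  ctor.setAccessible(true)
  ctor.newInstance(args: _*).asInstanceOf[T]
}
```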
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3048 from srowen/SPARK-1209 and squashes the following commits:
      
      0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions
      466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
      eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
      f8e57323
  12. Nov 05, 2014
• [SPARK-3797] Run external shuffle service in Yarn NM · 61a5cced
      Andrew Or authored
      This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.
      
      This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.
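
A sketch of how the two features are paired from the application side, assuming the standard configuration keys:

```scala
import org.apache.spark.SparkConf

// Dynamic allocation relies on the external shuffle service running in each NodeManager,
// so the two settings are enabled together.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
```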
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:
      
      ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      0ee67a2 [Andrew Or] Minor wording suggestions
      1c66046 [Andrew Or] Remove unused provided dependencies
      0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      6489db5 [Andrew Or] Try catch at the right places
      7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
      d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
      5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      9b6e058 [Andrew Or] Address various feedback
      f48b20c [Andrew Or] Fix tests again
      f39daa6 [Andrew Or] Do not make network-yarn an assembly module
      761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      15a5b37 [Andrew Or] Fix build for Hadoop 1.x
      baff916 [Andrew Or] Fix tests
      5bf9b7e [Andrew Or] Address a few minor comments
      5b419b8 [Andrew Or] Add missing license header
      804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
      cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
      ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
      1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
      b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
      43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
      b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
      61a5cced
  13. Nov 01, 2014
• [SPARK-3796] Create external service which can serve shuffle files · f55218ae
      Aaron Davidson authored
      This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (setup in BlockManager).
      
      This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).
      
      There are several outstanding tasks which must be complete before this PR can be merged:
      - [x] Complete unit testing of network/shuffle package.
      - [x] Performance and correctness testing on a real cluster.
      - [x] Remove example service instantiation from Worker.scala.
      
      There are even more shortcomings of this PR which should be addressed in followup patches:
      - Don't use Java serializer for RPC layer! It is not cross-version compatible.
      - Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
      - Documentation of the feature in the Spark docs.
      - Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
      - SSL and SASL integration
- Nice to have: Handle shuffle file consolidation (this would require changes to Spark's implementation).
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3001 from aarondav/shuffle-service and squashes the following commits:
      
      4d1f8c1 [Aaron Davidson] Remove changes to Worker
      705748f [Aaron Davidson] Rename Standalone* to External*
      fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
      9883918 [Aaron Davidson] Make suggested build changes
      3d62679 [Aaron Davidson] Add Spark integration test
      7fe51d5 [Aaron Davidson] Fix SBT integration
      56caa50 [Aaron Davidson] Address comments
      c8d1ac3 [Aaron Davidson] Add unit tests
      2f70c0c [Aaron Davidson] Fix unit tests
      5483e96 [Aaron Davidson] Fix unit tests
      46a70bf [Aaron Davidson] Whoops, bracket
      5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
      f55218ae
  14. Oct 30, 2014
• HOTFIX: Clean up build in network module. · 0734d093
      Patrick Wendell authored
      This is currently breaking the package build for some people (including me).
      
      This patch does some general clean-up which also fixes the current issue.
      - Uses consistent artifact naming
      - Adds sbt support for this module
      - Changes tests to use scalatest (fixes the original issue[1])
      
One thing to note: it turns out that scalatest, when invoked in the Maven build, doesn't successfully detect JUnit Java tests. This is a long-standing issue; I noticed it applies to all of our current test suites as well. I've created SPARK-4159 to fix this.
      
[1] The original issue is that we need to allocate extra memory for the tests; this happens by default in our scalatest configuration.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3025 from pwendell/hotfix and squashes the following commits:
      
      faa9053 [Patrick Wendell] HOTFIX: Clean up build in network module.
      0734d093
• Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use... · 26d31d15
      Andrew Or authored
      Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop"
      
      This reverts commit 68cb69da.
      26d31d15
• SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop · 68cb69da
      Sean Owen authored
(This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although it's private?)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2814 from srowen/SPARK-1209 and squashes the following commits:
      
      ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
      2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
      68cb69da
  15. Oct 29, 2014
• [SPARK-3822] Executor scaling mechanism for Yarn · 1df05a40
      Andrew Or authored
This is part of a broader effort to enable dynamic scaling of executors ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797, but is functionally independent of these other issues.
      
The logic is built on top of PraveenSeluka's changes at #2798. This is different from the changes there in a few major ways: (1) the mechanism is implemented within the existing scheduler backend framework rather than in new `Actor` classes. This also introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common logic to communicate with the Yarn `ApplicationMaster`. (2) The interface for requesting executors exposed to the `SparkContext` is the same, but the communication between the scheduler backend and the AM uses the total number of executors desired instead of an incremental number. This is discussed in #2746 and explained in the comments in the code.
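
A sketch of the developer-facing calls (method names assumed from the commit list below): callers ask for more executors or kill specific ones, and the backend translates this into a single total-desired count sent to the AM.

```scala
import org.apache.spark.SparkContext

// Ask the cluster manager for two more executors (signature assumed).
def scaleUp(sc: SparkContext): Boolean = sc.requestExecutors(2)

// Kill a specific executor; a shorthand over the Seq-based variant.
def scaleDown(sc: SparkContext, executorId: String): Boolean = sc.killExecutor(executorId)
```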
      
      I have tested this significantly on a stable Yarn cluster.
      
      ------------
      A remaining task for this issue is to tone down the error messages emitted when an executor is removed.
      Currently, `SparkContext` and its components react as if the executor has failed, resulting in many scary error messages and eventual timeouts. While it's not strictly necessary to fix this as of the first-cut implementation of this mechanism, it would be good to add logic to distinguish this case. I prefer to address this in a separate PR. I have filed a separate JIRA for this task at SPARK-4134.
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2840 from andrewor14/yarn-scaling-mechanism and squashes the following commits:
      
      485863e [Andrew Or] Minor log message changes
      4920be8 [Andrew Or] Clarify that public API is only for Yarn mode for now
      1c57804 [Andrew Or] Reword a few comments + other review comments
      6321140 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      02836c0 [Andrew Or] Limit scope of synchronization
      4e2ed7f [Andrew Or] Fix bug: keep track of removed executors properly
      73ade46 [Andrew Or] Wording changes (minor)
      2a7a6da [Andrew Or] Add `sc.killExecutor` as a shorthand (minor)
      665f229 [Andrew Or] Mima excludes
      79aa2df [Andrew Or] Simplify the request interface by asking for a total
      04f625b [Andrew Or] Fix race condition that causes over-allocation of executors
      f4783f8 [Andrew Or] Change the semantics of requesting executors
      005a124 [Andrew Or] Fix tests
      4628b16 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      db4a679 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      572f5c5 [Andrew Or] Unused import (minor)
      f30261c [Andrew Or] Kill multiple executors rather than one at a time
      de260d9 [Andrew Or] Simplify by skipping useless null check
      9c52542 [Andrew Or] Simplify by skipping the TaskSchedulerImpl
      97dd1a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      d987b3e [Andrew Or] Move addWebUIFilters to Yarn scheduler backend
      7b76d0a [Andrew Or] Expose mechanism in SparkContext as developer API
      47466cd [Andrew Or] Refactor common Yarn scheduler backend logic
      c4dfaac [Andrew Or] Avoid thrashing when removing executors
      53e8145 [Andrew Or] Start yarn actor early to listen for AM registration message
      bbee669 [Andrew Or] Add mechanism in yarn client mode
      1df05a40
• [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core · dff01553
      Reynold Xin authored
      This PR encapsulates #2330, which is itself a continuation of #2240. The first goal of this PR is to provide an alternate, simpler implementation of the ConnectionManager which is based on Netty.
      
In addition to this goal, however, we want to resolve [SPARK-3796](https://issues.apache.org/jira/browse/SPARK-3796), which calls for a standalone shuffle service which can be integrated into the YARN NodeManager, Standalone Worker, or on its own. This PR takes the first step in this direction by ensuring that the actual Netty service is as small as possible and extracted from Spark core. Given this, we should be able to construct a standalone jar which can be included in other JVMs without incurring significant dependency or runtime issues. The actual work to ensure that such a standalone shuffle service would work in Spark will be left for a future PR, however.
      
In order to minimize dependencies and allow for the service to be long-running (possibly much longer-running than Spark, and possibly having to support multiple versions of Spark simultaneously), the entire service has been ported to Java, where we have full control over the binary compatibility of the components and do not depend on the Scala runtime or version.
      
The following issues have been addressed by folding in #2330:
      
      SPARK-3453: Refactor Netty module to use BlockTransferService interface
      SPARK-3018: Release all buffers upon task completion/failure
      SPARK-3002: Create a connection pool and reuse clients across different threads
      SPARK-3017: Integration tests and unit tests for connection failures
      SPARK-3049: Make sure client doesn't block when server/connection has error(s)
      SPARK-3502: SO_RCVBUF and SO_SNDBUF should be bootstrap childOption, not option
      SPARK-3503: Disable thread local cache in PooledByteBufAllocator
      
      TODO before mergeable:
      - [x] Implement uploadBlock()
      - [x] Unit tests for RPC side of code
      - [x] Performance testing (see comments [here](https://github.com/apache/spark/pull/2753#issuecomment-59475022))
      - [x] Turn OFF by default (currently on for unit testing)
      
      Author: Reynold Xin <rxin@apache.org>
      Author: Aaron Davidson <aaron@databricks.com>
      Author: cocoatomo <cocoatomo77@gmail.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Anand Avati <avati@redhat.com>
      
      Closes #2753 from aarondav/netty and squashes the following commits:
      
      cadfd28 [Aaron Davidson] Turn netty off by default
      d7be11b [Aaron Davidson] Turn netty on by default
      4a204b8 [Aaron Davidson] Fail block fetches if client connection fails
      2b0d1c0 [Aaron Davidson] 100ch
      0c5bca2 [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
      14e37f7 [Aaron Davidson] Address Reynold's comments
      8dfcceb [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
      322dfc1 [Aaron Davidson] Address Reynold's comments, including major rename
      e5675a4 [Aaron Davidson] Fail outstanding RPCs as well
      ccd4959 [Aaron Davidson] Don't throw exception if client immediately fails
      9da0bc1 [Aaron Davidson] Add RPC unit tests
      d236dfd [Aaron Davidson] Remove no-op serializer :)
      7b7a26c [Aaron Davidson] Fix Nio compile issue
      dd420fd [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty-test
      939f276 [Aaron Davidson] Attempt to make comm. bidirectional
      aa58f67 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
      8dc1ded [cocoatomo] [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
      5b5dbe6 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
      2c5d9dc [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
      020691e [Davies Liu] [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
      ae4083a [Anand Avati] [SPARK-2805] Upgrade Akka to 2.3.4
      29c6dcf [Aaron Davidson] [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
      f7e7568 [Reynold Xin] Fixed spark.shuffle.io.receiveBuffer setting.
      5d98ce3 [Reynold Xin] Flip buffer.
      f6c220d [Reynold Xin] Merge with latest master.
      407e59a [Reynold Xin] Fix style violation.
      a0518c7 [Reynold Xin] Implemented block uploads.
      4b18db2 [Reynold Xin] Copy the buffer in fetchBlockSync.
      bec4ea2 [Reynold Xin] Removed OIO and added num threads settings.
      1bdd7ee [Reynold Xin] Fixed tests.
      d68f328 [Reynold Xin] Logging close() in case close() fails.
      f63fb4c [Reynold Xin] Add more debug message.
      6afc435 [Reynold Xin] Added logging.
      c066309 [Reynold Xin] Implement java.io.Closeable interface.
      519d64d [Reynold Xin] Mark private package visibility and MimaExcludes.
      f0a16e9 [Reynold Xin] Fixed test hanging.
      14323a5 [Reynold Xin] Removed BlockManager.getLocalShuffleFromDisk.
      b2f3281 [Reynold Xin] Added connection pooling.
      d23ed7b [Reynold Xin] Incorporated feedback from Norman: - use same pool for boss and worker - remove ioratio - disable caching of byte buf allocator - childoption sendbuf/receivebuf - fire exception through pipeline
      9e0cb87 [Reynold Xin] Fixed BlockClientHandlerSuite
      5cd33d7 [Reynold Xin] Fixed style violation.
      cb589ec [Reynold Xin] Added more test cases covering cleanup when fault happens in ShuffleBlockFetcherIteratorSuite
      1be4e8e [Reynold Xin] Shorten NioManagedBuffer and NettyManagedBuffer class names.
      108c9ed [Reynold Xin] Forgot to add TestSerializer to the commit list.
      b5c8d1f [Reynold Xin] Fixed ShuffleBlockFetcherIteratorSuite.
      064747b [Reynold Xin] Reference count buffers and clean them up properly.
      2b44cf1 [Reynold Xin] Added more documentation.
      1760d32 [Reynold Xin] Use Epoll.isAvailable in BlockServer as well.
      165eab1 [Reynold Xin] [SPARK-3453] Refactor Netty module to use BlockTransferService.
      dff01553
  16. Oct 28, 2014
• [SPARK-4084] Reuse sort key in Sorter · 84e5da87
      Xiangrui Meng authored
Sorter uses a generic-typed key for sorting. When the data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from rxin.

The `Sorter` class was written in Java and marked package private, so it is only available to `org.apache.spark.util.collection`. I renamed it to `TimSort` and added a simple wrapper of it, still called `Sorter`, in Scala, which is `private[spark]`.

The benchmark code is updated and now resets the array before each run. Here is the result of sorting primitive Int arrays of size 25 million using Sorter:
      
      ~~~
      [info] - Sorter benchmark for key-value pairs !!! IGNORED !!!
      Java Arrays.sort() on non-primitive int array: Took 13237 ms
      Java Arrays.sort() on non-primitive int array: Took 13320 ms
      Java Arrays.sort() on non-primitive int array: Took 15718 ms
      Java Arrays.sort() on non-primitive int array: Took 13283 ms
      Java Arrays.sort() on non-primitive int array: Took 13267 ms
      Java Arrays.sort() on non-primitive int array: Took 15122 ms
      Java Arrays.sort() on non-primitive int array: Took 15495 ms
      Java Arrays.sort() on non-primitive int array: Took 14877 ms
      Java Arrays.sort() on non-primitive int array: Took 16429 ms
      Java Arrays.sort() on non-primitive int array: Took 14250 ms
      Java Arrays.sort() on non-primitive int array: (13878 ms first try, 14499 ms average)
      Java Arrays.sort() on primitive int array: Took 2683 ms
      Java Arrays.sort() on primitive int array: Took 2683 ms
      Java Arrays.sort() on primitive int array: Took 2701 ms
      Java Arrays.sort() on primitive int array: Took 2746 ms
      Java Arrays.sort() on primitive int array: Took 2685 ms
      Java Arrays.sort() on primitive int array: Took 2735 ms
      Java Arrays.sort() on primitive int array: Took 2669 ms
      Java Arrays.sort() on primitive int array: Took 2693 ms
      Java Arrays.sort() on primitive int array: Took 2680 ms
      Java Arrays.sort() on primitive int array: Took 2642 ms
      Java Arrays.sort() on primitive int array: (2948 ms first try, 2691 ms average)
      Sorter without key reuse on primitive int array: Took 10732 ms
      Sorter without key reuse on primitive int array: Took 12482 ms
      Sorter without key reuse on primitive int array: Took 10718 ms
      Sorter without key reuse on primitive int array: Took 12650 ms
      Sorter without key reuse on primitive int array: Took 10747 ms
      Sorter without key reuse on primitive int array: Took 10783 ms
      Sorter without key reuse on primitive int array: Took 12721 ms
      Sorter without key reuse on primitive int array: Took 10604 ms
      Sorter without key reuse on primitive int array: Took 10622 ms
      Sorter without key reuse on primitive int array: Took 11843 ms
      Sorter without key reuse on primitive int array: (11089 ms first try, 11390 ms average)
      Sorter with key reuse on primitive int array: Took 5141 ms
      Sorter with key reuse on primitive int array: Took 5298 ms
      Sorter with key reuse on primitive int array: Took 5066 ms
      Sorter with key reuse on primitive int array: Took 5164 ms
      Sorter with key reuse on primitive int array: Took 5203 ms
      Sorter with key reuse on primitive int array: Took 5274 ms
      Sorter with key reuse on primitive int array: Took 5186 ms
      Sorter with key reuse on primitive int array: Took 5159 ms
      Sorter with key reuse on primitive int array: Took 5164 ms
      Sorter with key reuse on primitive int array: Took 5078 ms
      Sorter with key reuse on primitive int array: (5311 ms first try, 5173 ms average)
      ~~~
      
      So with key reuse, it is faster and less likely to trigger GC.
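
A generic illustration of the key-reuse idea (not the exact `SortDataFormat` API): the caller supplies a mutable key buffer that the extractor overwrites, so sorting allocates O(1) key objects instead of one per access.

```scala
// Mutable key buffer, reused across calls instead of boxing a new object each time.
final class MutableIntKey { var value: Int = 0 }

def getKey(data: Array[Int], pos: Int, reuse: MutableIntKey): MutableIntKey = {
  reuse.value = data(pos)
  reuse
}
```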
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2937 from mengxr/SPARK-4084 and squashes the following commits:
      
      d73c3d0 [Xiangrui Meng] address comments
      0b7b682 [Xiangrui Meng] fix mima
      a72f53c [Xiangrui Meng] update timeIt
      38ba50c [Xiangrui Meng] update timeIt
      720f731 [Xiangrui Meng] add doc about JIT specialization
      78f2879 [Xiangrui Meng] update tests
      7de2efd [Xiangrui Meng] update the Sorter benchmark code to be correct
      8626356 [Xiangrui Meng] add prepare to timeIt and update testsin SorterSuite
      5f0d530 [Xiangrui Meng] update method modifiers of SortDataFormat
      6ffbe66 [Xiangrui Meng] rename Sorter to TimSort and add a Scala wrapper that is private[spark]
      b00db4d [Xiangrui Meng] doc and tests
      cf94e8a [Xiangrui Meng] renaming
      464ddce [Reynold Xin] cherry-pick rxin's commit
      84e5da87
  17. Oct 26, 2014
  18. Oct 24, 2014
• [SQL] Update Hive test harness for Hive 12 and 13 · 3a845d3c
      Michael Armbrust authored
As part of the upgrade I also copied the newest version of the query tests and whitelisted a bunch of new ones that are now passing.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2936 from marmbrus/fix13tests and squashes the following commits:
      
      d9cbdab [Michael Armbrust] Remove user specific tests
      65801cd [Michael Armbrust] style and rat
      8f6b09a [Michael Armbrust] Update test harness to work with both Hive 12 and 13.
      f044843 [Michael Armbrust] Update Hive query tests and golden files to 0.13
      3a845d3c
  19. Oct 23, 2014
• specify unidocGenjavadocVersion of 0.8 · 293672c4
      Holden Karau authored
Fixes an issue where javadoc generation was too strict, causing errors.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #2893 from holdenk/SPARK-3359-sbtunidoc-java8 and squashes the following commits:
      
      9379a70 [Holden Karau] specify unidocGenjavadocVersion of 0.8
      293672c4
• [BUILD] Fixed resolver for scalastyle plugin and upgrade sbt version. · d6a30253
      Prashant Sharma authored
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2877 from ScrapCodes/scalastyle-fix and squashes the following commits:
      
      a17b9fe [Prashant Sharma] [BUILD] Fixed resolver for scalastyle plugin.
      d6a30253
  20. Oct 19, 2014
• [SPARK-3902] [SPARK-3590] Stabilize AsynRDDActions and add Java API · d1966f3a
      Josh Rosen authored
      This PR adds a Java API for AsyncRDDActions and promotes the API from `Experimental` to stable.
      
      Author: Josh Rosen <joshrosen@apache.org>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2760 from JoshRosen/async-rdd-actions-in-java and squashes the following commits:
      
      0d45fbc [Josh Rosen] Whitespace fix.
      ad3ae53 [Josh Rosen] Merge remote-tracking branch 'origin/master' into async-rdd-actions-in-java
      c0153a5 [Josh Rosen] Remove unused variable.
      e8e2867 [Josh Rosen] Updates based on Marcelo's review feedback
      7a1417f [Josh Rosen] Removed unnecessary java.util import.
      6f8f6ac [Josh Rosen] Fix import ordering.
      ff28e49 [Josh Rosen] Add MiMa excludes and fix a scalastyle error.
      346e46e [Josh Rosen] [SPARK-3902] Stabilize AsyncRDDActions; add Java API.
      d1966f3a
  21. Oct 16, 2014
• SPARK-3874: Provide stable TaskContext API · 2fe0ba95
      Prashant Sharma authored
      This is a small number of clean-up changes on top of #2782. Closes #2782.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #2803 from pwendell/pr-2782 and squashes the following commits:
      
      56d5b7a [Patrick Wendell] Minor clean-up
      44089ec [Patrick Wendell] Clean-up the TaskContext API.
      ed551ce [Prashant Sharma] Fixed a typo
      df261d0 [Prashant Sharma] Josh's suggestion
      facf3b1 [Prashant Sharma] Fixed the mima issue.
      7ecc2fe [Prashant Sharma] CR, Moved implementations to TaskContextImpl
      bbd9e05 [Prashant Sharma] adding missed out files to git.
      ef633f5 [Prashant Sharma] SPARK-3874, Provide stable TaskContext API
      2fe0ba95
• [Core] Upgrading ScalaStyle version to 0.5 and removing SparkSpaceAfterCommentStartChecker. · 044583a2
      prudhvi authored
      Author: prudhvi <prudhvi953@gmail.com>
      
      Closes #2799 from prudhvije/ScalaStyle/space-after-comment-start and squashes the following commits:
      
      fc263a1 [prudhvi] [Core] Using scalastyle to check the space after comment start
      044583a2
  22. Oct 02, 2014
• SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks · 6e27cb63
      Colin Patrick Mccabe authored
      This change reorders the replicas returned by
      HadoopRDD#getPreferredLocations so that replicas cached by HDFS are at
      the start of the list.  This requires Hadoop 2.5 or higher; previous
      versions of Hadoop do not expose the information needed to determine
      whether a replica is cached.
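
A small sketch of the reordering (helper names are illustrative, not the actual `HadoopRDD` code): hosts holding an HDFS-cached replica of the split are moved to the front, and the rest keep their original order.

```scala
// Stable partition: cached replicas first, remaining replicas after, order preserved.
def orderLocations(hosts: Seq[String], isCached: String => Boolean): Seq[String] = {
  val (cached, uncached) = hosts.partition(isCached)
  cached ++ uncached
}
```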
      
      Author: Colin Patrick Mccabe <cmccabe@cloudera.com>
      
      Closes #1486 from cmccabe/SPARK-1767 and squashes the following commits:
      
      338d4f8 [Colin Patrick Mccabe] SPARK-1767: Prefer HDFS-cached replicas when scheduling data-local tasks
      6e27cb63
  23. Sep 30, 2014
• [SPARK-3613] Record only average block size in MapStatus for large stages · 6b79bfb4
      Reynold Xin authored
This changes the way we send MapStatus from executors back to the driver for large stages (>2000 tasks). For large stages, we no longer send one byte per block. Instead, we just send the average block size.

This makes large jobs (tens of thousands of tasks) much more reliable, since the driver no longer sends a huge amount of data.
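
An illustrative sketch of the idea (not the actual `MapStatus` classes): above the task-count threshold, only a single average size is kept instead of one size per block.

```scala
sealed trait BlockSizes
case class ExactSizes(bytes: Array[Long]) extends BlockSizes
case class AveragedSize(avgBytes: Long) extends BlockSizes

// For stages with more than 2000 tasks, collapse per-block sizes into one average.
def compress(bytes: Array[Long], numTasks: Int): BlockSizes =
  if (numTasks > 2000) AveragedSize(if (bytes.isEmpty) 0L else bytes.sum / bytes.length)
  else ExactSizes(bytes)
```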
      
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2470 from rxin/mapstatus and squashes the following commits:
      
      822ff54 [Reynold Xin] Code review feedback.
      3b86f56 [Reynold Xin] Added MimaExclude.
      f89d182 [Reynold Xin] Fixed a bug in MapStatus
      6a0401c [Reynold Xin] [SPARK-3613] Record only average block size in MapStatus for large stages.
      6b79bfb4
  24. Sep 29, 2014
• [MLlib] [SPARK-2885] DIMSUM: All-pairs similarity · 587a0cd7
      Reza Zadeh authored
      # All-pairs similarity via DIMSUM
      Compute all pairs of similar vectors using brute force approach, and also DIMSUM sampling approach.
      
      Laying down some notation: we are looking for all pairs of similar columns in an m x n RowMatrix whose entries are denoted a_ij, with the i’th row denoted r_i and the j’th column denoted c_j. There is an oversampling parameter labeled ɣ that should be set to 4 log(n)/s to get provably correct results (with high probability), where s is the similarity threshold.
      
The algorithm is stated with a Map and Reduce, with proofs of correctness and efficiency in published papers [1] [2]. The reducer is simply the summation reducer. The mapper is more interesting, and is also the heart of the scheme. As an exercise, you should try to see why, in expectation, the map-reduce below outputs cosine similarities.
      
      ![dimsumv2](https://cloud.githubusercontent.com/assets/3220351/3807272/d1d9514e-1c62-11e4-9f12-3cfdb1d78b3a.png)
      
      [1] Bosagh-Zadeh, Reza and Carlsson, Gunnar (2013), Dimension Independent Matrix Square using MapReduce, arXiv:1304.1467 http://arxiv.org/abs/1304.1467
      
      [2] Bosagh-Zadeh, Reza and Goel, Ashish (2012), Dimension Independent Similarity Computation, arXiv:1206.2082 http://arxiv.org/abs/1206.2082
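
For reference, the oversampling parameter above as a small helper (a direct transcription of the formula, with n the number of columns and s the similarity threshold):

```scala
// gamma = 4 * log(n) / s, per the description above.
def oversampling(numCols: Int, threshold: Double): Double =
  4.0 * math.log(numCols.toDouble) / threshold
```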
      
      # Testing
      
      Tests for all invocations included.
      
      Added L1 and L2 norm computation to MultivariateStatisticalSummary since it was needed. Added tests for both of them.
      
      Author: Reza Zadeh <rizlar@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1778 from rezazadeh/dimsumv2 and squashes the following commits:
      
      404c64c [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
      4eb71c6 [Reza Zadeh] Add excludes for normL1 and normL2
      ee8bd65 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
      976ddd4 [Reza Zadeh] Broadcast colMags. Avoid div by zero.
      3467cff [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
      aea0247 [Reza Zadeh] Allow large thresholds to promote sparsity
      9fe17c0 [Xiangrui Meng] organize imports
      2196ba5 [Xiangrui Meng] Merge branch 'rezazadeh-dimsumv2' into dimsumv2
      254ca08 [Reza Zadeh] Merge remote-tracking branch 'upstream/master' into dimsumv2
      f2947e4 [Xiangrui Meng] some optimization
      3c4cf41 [Xiangrui Meng] Merge branch 'master' into rezazadeh-dimsumv2
      0e4eda4 [Reza Zadeh] Use partition index for RNG
      251bb9c [Reza Zadeh] Documentation
      25e9d0d [Reza Zadeh] Line length for style
      fb296f6 [Reza Zadeh] renamed to normL1 and normL2
      3764983 [Reza Zadeh] Documentation
      e9c6791 [Reza Zadeh] New interface and documentation
      613f261 [Reza Zadeh] Column magnitude summary
      75a0b51 [Reza Zadeh] Use Ints instead of Longs in the shuffle
      0f12ade [Reza Zadeh] Style changes
      eb1dc20 [Reza Zadeh] Use Double.PositiveInfinity instead of Double.Max
      f56a882 [Reza Zadeh] Remove changes to MultivariateOnlineSummarizer
      dbc55ba [Reza Zadeh] Make colMagnitudes a method in RowMatrix
      41e8ece [Reza Zadeh] style changes
      139c8e1 [Reza Zadeh] Syntax changes
      029aa9c [Reza Zadeh] javadoc and new test
      75edb25 [Reza Zadeh] All tests passing!
      05e59b8 [Reza Zadeh] Add test
      502ce52 [Reza Zadeh] new interface
      654c4fb [Reza Zadeh] default methods
      3726ca9 [Reza Zadeh] Remove MatrixAlgebra
      6bebabb [Reza Zadeh] remove changes to MatrixSuite
      5b8cd7d [Reza Zadeh] Initial files
      587a0cd7
  25. Sep 28, 2014
• SPARK-3699: SQL and Hive console tasks now clean up appropriately · 6918012d
      William Benton authored
      The sbt tasks sql/console and hive/console will now `stop()`
      the `SparkContext` upon exit.  Previously, they left an ugly stack
      trace when quitting.
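
A sketch of one way to express this in an sbt build (the actual SparkBuild change may differ); `sparkContext` is assumed to be the value bound in the console's initial commands:

```scala
// Run when the sql/console or hive/console REPL session exits,
// so the SparkContext shuts down cleanly instead of dumping a stack trace.
cleanupCommands in console := "sparkContext.stop()"
```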
      
      Author: William Benton <willb@redhat.com>
      
      Closes #2547 from willb/consoleCleanup and squashes the following commits:
      
      d5e431f [William Benton] SQL and Hive console tasks now clean up.
      6918012d
  26. Sep 19, 2014
• [SPARK-3418] Sparse Matrix support (CCS) and additional native BLAS operations added · e76ef5cb
      Burak authored
Local `SparseMatrix` support has been added in Compressed Column Storage (CCS) format, along with Level-2 and Level-3 BLAS operations such as dgemv and dgemm, respectively.

Native BLAS doesn't support sparse matrix operations, so `SparseMatrix`-`DenseMatrix` multiplication and `SparseMatrix`-`DenseVector` implementations have been added. I will post performance comparisons in the comments momentarily.
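
A small sketch of the CCS layout, assuming a `Matrices.sparse(numRows, numCols, colPtrs, rowIndices, values)` factory: a 3x3 matrix with nonzeros at (0,0)=1.0, (2,1)=2.0 and (1,2)=3.0.

```scala
import org.apache.spark.mllib.linalg.Matrices

// colPtrs(j) .. colPtrs(j + 1) is the range of entries belonging to column j.
val sm = Matrices.sparse(
  3, 3,
  Array(0, 1, 2, 3),    // colPtrs
  Array(0, 2, 1),       // rowIndices
  Array(1.0, 2.0, 3.0)  // values
)
```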
      
      Author: Burak <brkyvz@gmail.com>
      
      Closes #2294 from brkyvz/SPARK-3418 and squashes the following commits:
      
      88814ed [Burak] Hopefully fixed MiMa this time
      47e49d5 [Burak] really fixed MiMa issue
      f0bae57 [Burak] [SPARK-3418] Fixed MiMa compatibility issues (excluded from check)
      4b7dbec [Burak] 9/17 comments addressed
      7af2f83 [Burak] sealed traits Vector and Matrix
      d3a8a16 [Burak] [SPARK-3418] Squashed missing alpha bug.
      421045f [Burak] [SPARK-3418] New code review comments addressed
      f35a161 [Burak] [SPARK-3418] Code review comments addressed and multiplication further optimized
      2508577 [Burak] [SPARK-3418] Fixed one more style issue
      d16e8a0 [Burak] [SPARK-3418] Fixed style issues and added documentation for methods
      204a3f7 [Burak] [SPARK-3418] Fixed failing Matrix unit test
      6025297 [Burak] [SPARK-3418] Fixed Scala-style errors
      dc7be71 [Burak] [SPARK-3418][MLlib] Matrix unit tests expanded with indexing and updating
      d2d5851 [Burak] [SPARK-3418][MLlib] Sparse Matrix support and additional native BLAS operations added
      e76ef5cb
  27. Sep 17, 2014
  28. Sep 16, 2014
  29. Sep 15, 2014
• [SPARK-3433][BUILD] Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations. · ecf0c029
      Prashant Sharma authored
Actually, the false positive reported was due to the MiMa generator not picking up the new jars in the presence of old jars (theoretically this should not have happened). So, as a workaround, we run them both separately and just append them together.
      
      Author: Prashant Sharma <prashant@apache.org>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #2285 from ScrapCodes/mima-fix and squashes the following commits:
      
      093c76f [Prashant Sharma] Update mima
      59012a8 [Prashant Sharma] Update mima
      35b6c71 [Prashant Sharma] SPARK-3433 Fix for Mima false-positives with @DeveloperAPI and @Experimental annotations.
      ecf0c029