  1. Jan 23, 2015
    • jerryshao's avatar
      [SPARK-5315][Streaming] Fix reduceByWindow Java API not work bug · e0f7fb7f
      jerryshao authored
      `reduceByWindow` in the Java API is not actually Java compatible; change it to make it Java compatible.
      
      The current solution is to deprecate the old API and add a new one, but since the old API is actually incorrect, is keeping it around meaningful just to preserve binary compatibility? Also, even adding a new API still requires a Mima exclusion. I'm not sure which is the best solution: changing the API in place, or deprecating the old API and adding a new one.
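      
      For context, a rough sketch (illustrative Scala only, not the committed code) of the shape a Java-callable overload needs: it has to accept Spark's Java `Function2` rather than a Scala function and return a `JavaDStream`, so that Java callers can actually use it.
      
      ```scala
      import org.apache.spark.api.java.function.{Function2 => JFunction2}
      import org.apache.spark.streaming.Duration
      import org.apache.spark.streaming.api.java.JavaDStream
      
      // Hypothetical, simplified trait for illustration -- not Spark's actual
      // JavaDStreamLike definition.
      trait JavaWindowOps[T] {
        // Java-friendly shape: Java function type in, Java DStream out.
        def reduceByWindow(
            reduceFunc: JFunction2[T, T, T],
            windowDuration: Duration,
            slideDuration: Duration): JavaDStream[T]
      }
      ```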
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #4104 from jerryshao/SPARK-5315 and squashes the following commits:
      
      5bc8987 [jerryshao] Address the comment
      c7aa1b4 [jerryshao] Deprecate the old one to keep binary compatible
      8e9dc67 [jerryshao] Fix JavaDStream reduceByWindow signature error
      e0f7fb7f
  2. Jan 21, 2015
  3. Jan 20, 2015
    • Sean Owen's avatar
      SPARK-5270 [CORE] Provide isEmpty() function in RDD API · 306ff187
      Sean Owen authored
      Pretty minor, but submitted for consideration -- this would at least help people make this check in the most efficient way I know.
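      
      A minimal usage sketch of the new method (assuming a local SparkContext); the point is that `isEmpty()` only needs partition metadata plus `take(1)`, not a full `count()`:
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      
      object IsEmptyExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("isEmpty-example"))
          // An RDD is empty iff it has no partitions or take(1) returns nothing,
          // so the check does not materialize the whole RDD.
          println(sc.parallelize(Seq.empty[Int]).isEmpty())  // true
          println(sc.parallelize(Seq(1, 2, 3)).isEmpty())    // false
          sc.stop()
        }
      }
      ```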
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4074 from srowen/SPARK-5270 and squashes the following commits:
      
      66885b8 [Sean Owen] Add note that JavaRDDLike should not be implemented by user code
      2e9b490 [Sean Owen] More tests, and Mima-exclude the new isEmpty method in JavaRDDLike
      28395ff [Sean Owen] Add isEmpty to Java, Python
      7dd04b7 [Sean Owen] Add efficient RDD.isEmpty()
      306ff187
  4. Jan 17, 2015
    • Michael Armbrust's avatar
      [SPARK-5096] Use sbt tasks instead of vals to get hadoop version · 6999910b
      Michael Armbrust authored
      This makes it possible to compile Spark as an external `ProjectRef`, whereas now we throw a `FileNotFoundException`.
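      
      A sketch of the general sbt idea, not the exact `SparkBuild.scala` code: a `val` is computed once when the build definition loads, which breaks when Spark is referenced as an external `ProjectRef`, whereas a task is re-evaluated on demand.
      
      ```scala
      // Hypothetical setting name for illustration; 1.0.4 is the default Hadoop
      // version of this era.
      val hadoopVersion = taskKey[String]("Hadoop version to build against")
      
      hadoopVersion := sys.props.getOrElse("hadoop.version", "1.0.4")
      ```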
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3905 from marmbrus/effectivePom and squashes the following commits:
      
      fd63aae [Michael Armbrust] Use sbt tasks instead of vals to get hadoop version.
      6999910b
  5. Jan 16, 2015
  6. Jan 14, 2015
    • Josh Rosen's avatar
      [SPARK-4014] Add TaskContext.attemptNumber and deprecate TaskContext.attemptId · 259936be
      Josh Rosen authored
      `TaskContext.attemptId` is misleadingly-named, since it currently returns a taskId, which uniquely identifies a particular task attempt within a particular SparkContext, instead of an attempt number, which conveys how many times a task has been attempted.
      
      This patch deprecates `TaskContext.attemptId` and adds `TaskContext.taskId` and `TaskContext.attemptNumber` fields.  Prior to this change, it was impossible to determine whether a task was being re-attempted (or was a speculative copy), which made it difficult to write unit tests for tasks that fail on early attempts or speculative tasks that complete faster than original tasks.
      
      Earlier versions of the TaskContext docs suggest that `attemptId` behaves like `attemptNumber`, so there's an argument to be made in favor of changing this method's implementation.  Since we've decided against making that change in maintenance branches, I think it's simpler to add better-named methods and retain the old behavior for `attemptId`; if `attemptId` behaved differently in different branches, then this would cause confusing build-breaks when backporting regression tests that rely on the new `attemptId` behavior.
      
      Most of this patch is fairly straightforward, but there is a bit of trickiness related to Mesos tasks: since there's no field in MesosTaskInfo to encode the attemptId, I packed it into the `data` field alongside the task binary.
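      
      A small usage sketch of `attemptNumber`, the field this description focuses on, assuming a local SparkContext:
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext, TaskContext}
      
      object AttemptNumberExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("attempt-number"))
          val info = sc.parallelize(1 to 10, 2).mapPartitions { iter =>
            val ctx = TaskContext.get()
            // attemptNumber is 0 on the first try of a task and increases on each
            // retry, so a test can make a task fail only on its first attempt.
            Iterator.single((ctx.partitionId, ctx.attemptNumber))
          }.collect()
          info.foreach(println)
          sc.stop()
        }
      }
      ```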
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3849 from JoshRosen/SPARK-4014 and squashes the following commits:
      
      89d03e0 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
      5cfff05 [Josh Rosen] Introduce wrapper for serializing Mesos task launch data.
      38574d4 [Josh Rosen] attemptId -> taskAttemptId in PairRDDFunctions
      a180b88 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
      1d43aa6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
      eee6a45 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4014
      0b10526 [Josh Rosen] Use putInt instead of putLong (silly mistake)
      8c387ce [Josh Rosen] Use local with maxRetries instead of local-cluster.
      cbe4d76 [Josh Rosen] Preserve attemptId behavior and deprecate it:
      b2dffa3 [Josh Rosen] Address some of Reynold's minor comments
      9d8d4d1 [Josh Rosen] Doc typo
      1e7a933 [Josh Rosen] [SPARK-4014] Change TaskContext.attemptId to return attempt number instead of task ID.
      fd515a5 [Josh Rosen] Add failing test for SPARK-4014
      259936be
  7. Jan 13, 2015
    • Reynold Xin's avatar
      [SPARK-5123][SQL] Reconcile Java/Scala API for data types. · f9969098
      Reynold Xin authored
      Having two versions of the data type APIs (one for Java, one for Scala) requires downstream libraries to also have two versions of the APIs if the library wants to support both Java and Scala. I took a look at the Scala version of the data type APIs - it can actually work out pretty well for Java out of the box.
      
      As part of the PR, I created a sql.types package and moved all type definitions there. I then removed the Java specific data type API along with a lot of the conversion code.
      
      This subsumes https://github.com/apache/spark/pull/3925
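      
      A short sketch of what downstream code looks like after the move, assuming the usual schema-building types now live under the new package:
      
      ```scala
      // One set of type definitions under org.apache.spark.sql.types, shared by
      // Scala and Java users; no separate Java-only DataType API.
      import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
      
      val schema = StructType(Seq(
        StructField("id", IntegerType, nullable = false),
        StructField("name", StringType, nullable = true)))
      ```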
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #3958 from rxin/SPARK-5123-datatype-2 and squashes the following commits:
      
      66505cc [Reynold Xin] [SPARK-5123] Expose only one version of the data type APIs (i.e. remove the Java-specific API).
      f9969098
  8. Jan 10, 2015
    • Joseph K. Bradley's avatar
      [SPARK-5032] [graphx] Remove GraphX MIMA exclude for 1.3 · 33132609
      Joseph K. Bradley authored
      Since GraphX is no longer alpha as of 1.2, MimaExcludes should not exclude GraphX for 1.3.
      
      Here are the individual excludes I had to add + the associated commits:
      
      ```
                  // SPARK-4444
                  ProblemFilters.exclude[IncompatibleResultTypeProblem](
                    "org.apache.spark.graphx.EdgeRDD.fromEdges"),
                  ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.graphx.EdgeRDD.filter"),
                  ProblemFilters.exclude[IncompatibleResultTypeProblem](
                    "org.apache.spark.graphx.impl.EdgeRDDImpl.filter"),
      ```
      [https://github.com/apache/spark/commit/9ac2bb18ede2e9f73c255fa33445af89aaf8a000]
      
      ```
                  // SPARK-3623
                  ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.graphx.Graph.checkpoint")
      ```
      [https://github.com/apache/spark/commit/e895e0cbecbbec1b412ff21321e57826d2d0a982]
      
      ```
                  // SPARK-4620
                  ProblemFilters.exclude[MissingMethodProblem]("org.apache.spark.graphx.Graph.unpersist"),
      ```
      [https://github.com/apache/spark/commit/8817fc7fe8785d7b11138ca744f22f7e70f1f0a0]
      
      CC: rxin
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #3856 from jkbradley/graphx-mima and squashes the following commits:
      
      1eea2f6 [Joseph K. Bradley] moved cleanup to run-tests
      527ccd9 [Joseph K. Bradley] fixed jenkins script to remove ivy2 cache
      802e252 [Joseph K. Bradley] Removed GraphX MIMA excludes and added line to clear spark from .m2 dir before Jenkins tests.  This may not work yet...
      30f8bb4 [Joseph K. Bradley] added individual mima excludes for graphx
      a3fea42 [Joseph K. Bradley] removed graphx mima exclude for 1.3
      33132609
  9. Jan 02, 2015
    • Yadong Qi's avatar
      [SPARK-3325][Streaming] Add a parameter to the method print in class DStream · bd88b718
      Yadong Qi authored
      This PR is a fixed version of the original PR #3237 by watermen and scwf.
      This adds the ability to specify how many elements to print in `DStream.print`.
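      
      A usage sketch of the new parameter (the socket source, host, and port are placeholders):
      
      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      
      val conf = new SparkConf().setMaster("local[2]").setAppName("print-num")
      val ssc = new StreamingContext(conf, Seconds(1))
      val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
      lines.print()    // still prints the first 10 elements of each batch
      lines.print(20)  // new: prints the first 20 elements of each batch
      ssc.start()
      ssc.awaitTermination()
      ```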
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      Author: q00251598 <qiyadong@huawei.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3865 from tdas/print-num and squashes the following commits:
      
      cd34e9e [Tathagata Das] Fix bug
      7c09f16 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into HEAD
      bb35d1a [Yadong Qi] Update MimaExcludes.scala
      f8098ca [Yadong Qi] Update MimaExcludes.scala
      f6ac3cb [Yadong Qi] Update MimaExcludes.scala
      e4ed897 [Yadong Qi] Update MimaExcludes.scala
      3b9d5cf [wangfei] fix conflicts
      ec8a3af [q00251598] move to  Spark 1.3
      26a70c0 [q00251598] extend the Python DStream's print
      b589a4b [q00251598] add another print function
      bd88b718
  10. Dec 31, 2014
    • Sean Owen's avatar
      SPARK-2757 [BUILD] [STREAMING] Add Mima test for Spark Sink after 1.10 is released · 4bb12488
      Sean Owen authored
      Re-enable MiMa for Streaming Flume Sink module, now that 1.1.0 is released, per the JIRA TO-DO. That's pretty much all there is to this.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3842 from srowen/SPARK-2757 and squashes the following commits:
      
      50ff80e [Sean Owen] Exclude apparent false positive turned up by re-enabling MiMa checks for Streaming Flume Sink
      0e5ba5c [Sean Owen] Re-enable MiMa for Streaming Flume Sink module
      4bb12488
  11. Dec 27, 2014
    • Patrick Wendell's avatar
      HOTFIX: Slight tweak on previous commit. · 82bf4bee
      Patrick Wendell authored
      Meant to merge this in when committing SPARK-3787.
      82bf4bee
    • Kousuke Saruta's avatar
      [SPARK-3787][BUILD] Assembly jar name is wrong when we build with sbt omitting -Dhadoop.version · de95c57a
      Kousuke Saruta authored
      This PR is another solution to the following problem: when we build with sbt using a Hadoop profile but without a property for the Hadoop version, like:
      
          sbt/sbt -Phadoop-2.2 assembly
      
      the jar name always uses the default version (1.0.4).
      
      When we build with Maven under the same conditions, the default version for each profile is used. For instance, if we build like:
      
          mvn -Phadoop-2.2 package
      
      the jar name uses 2.2.0, the default version for hadoop-2.2.
      
      Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
      
      Closes #3046 from sarutak/fix-assembly-jarname-2 and squashes the following commits:
      
      41ef90e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname-2
      50c8676 [Kousuke Saruta] Merge branch 'fix-assembly-jarname-2' of github.com:sarutak/spark into fix-assembly-jarname-2
      52a1cd2 [Kousuke Saruta] Fixed comflicts
      dd30768 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname2
      f1c90bb [Kousuke Saruta] Fixed SparkBuild.scala in order to read `hadoop.version` property from pom.xml
      af6b100 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      c81806b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      ad1f96e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      b2318eb [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into fix-assembly-jarname
      5fc1259 [Kousuke Saruta] Fixed typo.
      eebbb7d [Kousuke Saruta] Fixed wrong jar name
      de95c57a
  12. Dec 19, 2014
    • scwf's avatar
      [Build] Remove spark-staging-1038 · 8e253ebb
      scwf authored
      Author: scwf <wangfei1@huawei.com>
      
      Closes #3743 from scwf/abc and squashes the following commits:
      
      7d98bc8 [scwf] removing spark-staging-1038
      8e253ebb
  13. Dec 15, 2014
    • Sean Owen's avatar
      SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from... · 81112e4b
      Sean Owen authored
      SPARK-4814 [CORE] Enable assertions in SBT, Maven tests / AssertionError from Hive's LazyBinaryInteger
      
      This enables assertions for the Maven and SBT build, but overrides the Hive module to not enable assertions.
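      
      A minimal sbt-style sketch of the idea, illustrative rather than the exact `SparkBuild.scala` change: pass `-ea` to forked test JVMs, then strip it back out for the Hive module, whose `LazyBinaryInteger` trips an internal `AssertionError` when assertions are on.
      
      ```scala
      // Enable JVM assertions for forked test JVMs across the build.
      javaOptions in Test += "-ea"
      
      // Hive module only (illustrative): remove the flag so Hive's
      // LazyBinaryInteger does not throw an AssertionError during tests.
      javaOptions in Test ~= (_.filterNot(_ == "-ea"))
      ```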
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3692 from srowen/SPARK-4814 and squashes the following commits:
      
      caca704 [Sean Owen] Disable assertions just for Hive
      f71e783 [Sean Owen] Enable assertions for SBT and Maven build
      81112e4b
  14. Dec 09, 2014
    • Sandy Ryza's avatar
      SPARK-4338. [YARN] Ditch yarn-alpha. · 912563aa
      Sandy Ryza authored
      Sorry if this is a little premature with 1.2 still not out the door, but it will make other work like SPARK-4136 and SPARK-2089 a lot easier.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3215 from sryza/sandy-spark-4338 and squashes the following commits:
      
      1c5ac08 [Sandy Ryza] Update building Spark docs and remove unnecessary newline
      9c1421c [Sandy Ryza] SPARK-4338. Ditch yarn-alpha.
      912563aa
  15. Dec 04, 2014
    • lewuathe's avatar
      [SPARK-4685] Include all spark.ml and spark.mllib packages in JavaDoc's MLlib group · 20bfea4a
      lewuathe authored
      This is #3554 from Lewuathe, except that I put both `spark.ml` and `spark.mllib` in the `MLlib` group.
      
      Closes #3554
      
      jkbradley
      
      Author: lewuathe <lewuathe@me.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3598 from mengxr/Lewuathe-modify-javadoc-setting and squashes the following commits:
      
      184609a [Xiangrui Meng] merge spark.ml and spark.mllib into the same group in javadoc
      f7535e6 [lewuathe] [SPARK-4685] Update JavaDoc settings to include spark.ml and all spark.mllib subpackages in the right sections
      20bfea4a
  16. Nov 28, 2014
    • Takuya UESHIN's avatar
      [SPARK-4193][BUILD] Disable doclint in Java 8 to prevent from build error. · e464f0ac
      Takuya UESHIN authored
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3058 from ueshin/issues/SPARK-4193 and squashes the following commits:
      
      e096bb1 [Takuya UESHIN] Add a plugin declaration to pluginManagement.
      6762ec2 [Takuya UESHIN] Fix usage of -Xdoclint javadoc option.
      fdb280a [Takuya UESHIN] Fix Javadoc errors.
      4745f3c [Takuya UESHIN] Merge branch 'master' into issues/SPARK-4193
      923e2f0 [Takuya UESHIN] Use doclint option `-missing` instead of `none`.
      30d6718 [Takuya UESHIN] Fix Javadoc errors.
      b548017 [Takuya UESHIN] Disable doclint in Java 8 to prevent from build error.
      e464f0ac
  17. Nov 26, 2014
    • Xiangrui Meng's avatar
      [SPARK-4614][MLLIB] Slight API changes in Matrix and Matrices · 561d31d2
      Xiangrui Meng authored
      Before we have a full picture of the operators we want to add, it might be safer to hide `Matrix.transposeMultiply` in 1.2.0. Another update we want to make is to `Matrix.randn` and `Matrix.rand`, both of which should take a `Random` implementation; otherwise, it is very likely to produce inconsistent RDDs. I also added some unit tests for the matrix factory methods. All APIs are new in 1.2, so there are no incompatible changes.
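      
      A usage sketch, assuming these factory methods live on `Matrices` with the shape described above (an explicit `java.util.Random` is passed in so repeated calls stay reproducible and consistent):
      
      ```scala
      import java.util.Random
      import org.apache.spark.mllib.linalg.Matrices
      
      val rng = new Random(42L)
      val uniform  = Matrices.rand(3, 4, rng)   // 3x4 matrix with U(0, 1) entries
      val gaussian = Matrices.randn(3, 4, rng)  // 3x4 matrix with N(0, 1) entries
      println(uniform)
      ```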
      
      brkyvz
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3468 from mengxr/SPARK-4614 and squashes the following commits:
      
      3b0e4e2 [Xiangrui Meng] add mima excludes
      6bfd8a4 [Xiangrui Meng] hide transposeMultiply; add rng to rand and randn; add unit tests
      561d31d2
  18. Nov 19, 2014
    • Joseph E. Gonzalez's avatar
      Updating GraphX programming guide and documentation · 377b0682
      Joseph E. Gonzalez authored
      This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      
      Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
      
      4421964 [Joseph E. Gonzalez] updating documentation for graphx
      377b0682
    • Takuya UESHIN's avatar
      [SPARK-4429][BUILD] Build for Scala 2.11 using sbt fails. · f9adda9a
      Takuya UESHIN authored
      I tried to build for Scala 2.11 using sbt with the following command:
      
      ```
      $ sbt/sbt -Dscala-2.11 assembly
      ```
      
      but it ends with the following error messages:
      
      ```
      [error] (streaming-kafka/*:update) sbt.ResolveException: unresolved dependency: org.apache.kafka#kafka_2.11;0.8.0: not found
      [error] (catalyst/*:update) sbt.ResolveException: unresolved dependency: org.scalamacros#quasiquotes_2.11;2.0.1: not found
      ```
      
      The reason is:
      If system property `-Dscala-2.11` (without value) was set, `SparkBuild.scala` adds `scala-2.11` profile, but also `sbt-pom-reader` activates `scala-2.10` profile instead of `scala-2.11` profile because the activator `PropertyProfileActivator` used by `sbt-pom-reader` internally checks if the property value is empty or not.
      
      The value is set to non-empty value, then no need to add profiles in `SparkBuild.scala` because `sbt-pom-reader` can handle as expected.
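      
      A sketch of that idea (not the exact `SparkBuild.scala` change): when the property was passed without a value, give it a non-empty value so `sbt-pom-reader`'s `PropertyProfileActivator` selects the `scala-2.11` profile.
      
      ```scala
      // If -Dscala-2.11 was passed with an empty value, make it non-empty so the
      // Maven-style profile activation in sbt-pom-reader behaves as expected.
      if (sys.props.get("scala-2.11").exists(_.isEmpty)) {
        sys.props("scala-2.11") = "true"
      }
      ```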
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #3342 from ueshin/issues/SPARK-4429 and squashes the following commits:
      
      14d86e8 [Takuya UESHIN] Add a comment.
      4eef52b [Takuya UESHIN] Remove unneeded condition.
      ce98d0f [Takuya UESHIN] Set non-empty value to system property "scala-2.11" if the property exists instead of adding profile.
      f9adda9a
    • Andrew Or's avatar
      [HOT FIX] MiMa tests are broken · 0df02ca4
      Andrew Or authored
      This is blocking #3353 and other patches.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3371 from andrewor14/mima-hot-fix and squashes the following commits:
      
      842d059 [Andrew Or] Move excludes to the right section
      c4d4f4e [Andrew Or] MIMA hot fix
      0df02ca4
  19. Nov 18, 2014
    • Marcelo Vanzin's avatar
      Bumping version to 1.3.0-SNAPSHOT. · 397d3aae
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3277 from vanzin/version-1.3 and squashes the following commits:
      
      7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
      5f404ff [Marcelo Vanzin] Add another exclusion.
      19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
      3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
      e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
      397d3aae
    • Davies Liu's avatar
      [SPARK-4017] show progress bar in console · e34f38ff
      Davies Liu authored
      The progress bar will look like this:
      
      ![1___spark_job__85_250_finished__4_are_running___java_](https://cloud.githubusercontent.com/assets/40902/4854813/a02f44ac-6099-11e4-9060-7c73a73151d6.png)
      
      In the right corner, the numbers are: finished tasks, running tasks, total tasks.
      
      After the stage has finished, it will disappear.
      
      The progress bar is only shown if the logging level is WARN or higher (progress in the title is still shown); it can be turned off via spark.driver.showConsoleProgress.
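      
      A small sketch of driving this behavior (the configuration key name is the one given in this description; the log level is raised so the bar is visible):
      
      ```scala
      import org.apache.log4j.{Level, Logger}
      import org.apache.spark.{SparkConf, SparkContext}
      
      object ProgressBarDemo {
        def main(args: Array[String]): Unit = {
          Logger.getRootLogger.setLevel(Level.WARN)  // the bar only shows at WARN or higher
          val conf = new SparkConf()
            .setMaster("local[2]")
            .setAppName("progress-bar-demo")
            .set("spark.driver.showConsoleProgress", "true")  // set to "false" to disable
          val sc = new SparkContext(conf)
          sc.parallelize(1 to 10000000, 200).map(_ * 2).count()  // long enough to see the bar
          sc.stop()
        }
      }
      ```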
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3029 from davies/progress and squashes the following commits:
      
      95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      fc49ac8 [Davies Liu] address commentse
      2e90f75 [Davies Liu] show multiple stages in same time
      0081bcc [Davies Liu] address comments
      38c42f1 [Davies Liu] fix tests
      ab87958 [Davies Liu] disable progress bar during tests
      30ac852 [Davies Liu] re-implement progress bar
      b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
      e4e7344 [Davies Liu] refactor
      e1f524d [Davies Liu] revert unnecessary change
      a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress
      5cae3f2 [Davies Liu] fix style
      ea49fe0 [Davies Liu] address comments
      bc53d99 [Davies Liu] refactor
      e6bb189 [Davies Liu] fix logging in sparkshell
      7e7d4e7 [Davies Liu] address commments
      5df26bb [Davies Liu] fix style
      9e42208 [Davies Liu] show progress bar in console and title
      e34f38ff
  20. Nov 17, 2014
    • Josh Rosen's avatar
      [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts · 0f3ceb56
      Josh Rosen authored
      This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).
      
      **The solution implemented here is only a partial fix.**  A complete fix would have the following properties:
      
      1. Only one SparkContext may ever be under construction at any given time.
      2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
      3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
      4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.
      
      This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.
      
      ### The correct solution:
      
      I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object.  Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.).  Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.  For example:
      
      ```scala
      class SparkContext private (deps: SparkContextDependencies) {
        def this(conf: SparkConf) {
          this(SparkContext.getDeps(conf))
        }
      }
      
      object SparkContext {
        private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
          if (anotherSparkContextIsActive) { throw new Exception(...) }
          var dagScheduler: DAGScheduler = null
          try {
              dagScheduler = new DAGScheduler(...)
              [...]
          } catch {
            case e: Exception =>
               Option(dagScheduler).foreach(_.stop())
                [...]
          }
          SparkContextDependencies(dagScheduler, ....)
        }
      }
      ```
      
      This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.
      
      This indirection is necessary to maintain binary compatibility.  In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.
      
      ### Alternative solutions:
      
      As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block.  Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block.  If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.
      
      The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.
      
      ### This PR's solution:
      
      - At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
      - If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
      - At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context.  If so, throw an exception.
      
      This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor).  If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.
      
      This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`.  The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts.  I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.
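      
      A sketch of the resulting behavior (local master for illustration):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      
      val sc1 = new SparkContext(new SparkConf().setMaster("local").setAppName("first"))
      
      // With the default settings this now throws, since sc1 is still active:
      // val sc2 = new SparkContext(new SparkConf().setMaster("local").setAppName("second"))
      
      // Escape hatch, mainly for existing test suites: downgrade the error to a warning.
      val sc2 = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName("second")
          .set("spark.driver.allowMultipleContexts", "true"))
      
      sc2.stop()
      sc1.stop()
      ```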
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:
      
      23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      d38251b [Josh Rosen] Address latest round of feedback.
      c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
      85a424a [Josh Rosen] Incorporate more review feedback.
      372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      f5bb78c [Josh Rosen] Update mvn build, too.
      d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
      79a7e6f [Josh Rosen] Fix commented out test
      a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
      4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
      ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
      1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
      d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
      c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
      918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
      afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
      0f3ceb56
  21. Nov 14, 2014
    • jerryshao's avatar
      [SPARK-4062][Streaming]Add ReliableKafkaReceiver in Spark Streaming Kafka connector · 5930f64b
      jerryshao authored
      Add ReliableKafkaReceiver in Kafka connector to prevent data loss if WAL in Spark Streaming is enabled. Details and design doc can be seen in [SPARK-4062](https://issues.apache.org/jira/browse/SPARK-4062).
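      
      A usage sketch: with the receiver write-ahead log enabled (the configuration key below is my recollection of the 1.2-era setting), `KafkaUtils.createStream` uses the new reliable receiver, which commits offsets only after received blocks are stored. The ZooKeeper address, group, topic, and checkpoint path are placeholders.
      
      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kafka.KafkaUtils
      
      val conf = new SparkConf()
        .setAppName("reliable-kafka")
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(2))
      ssc.checkpoint("/tmp/checkpoints")  // the WAL needs a checkpoint directory
      
      val stream = KafkaUtils.createStream(
        ssc, "zk-host:2181", "consumer-group", Map("topic" -> 1),
        StorageLevel.MEMORY_AND_DISK_SER)
      stream.map(_._2).print()
      ssc.start()
      ssc.awaitTermination()
      ```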
      
      Author: jerryshao <saisai.shao@intel.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Saisai Shao <saisai.shao@intel.com>
      
      Closes #2991 from jerryshao/kafka-refactor and squashes the following commits:
      
      5461f1c [Saisai Shao] Merge pull request #8 from tdas/kafka-refactor3
      eae4ad6 [Tathagata Das] Refectored KafkaStreamSuiteBased to eliminate KafkaTestUtils and made Java more robust.
      fab14c7 [Tathagata Das] minor update.
      149948b [Tathagata Das] Fixed mistake
      14630aa [Tathagata Das] Minor updates.
      d9a452c [Tathagata Das] Minor updates.
      ec2e95e [Tathagata Das] Removed the receiver's locks and essentially reverted to Saisai's original design.
      2a20a01 [jerryshao] Address some comments
      9f636b3 [Saisai Shao] Merge pull request #5 from tdas/kafka-refactor
      b2b2f84 [Tathagata Das] Refactored Kafka receiver logic and Kafka testsuites
      e501b3c [jerryshao] Add Mima excludes
      b798535 [jerryshao] Fix the missed issue
      e5e21c1 [jerryshao] Change to while loop
      ea873e4 [jerryshao] Further address the comments
      98f3d07 [jerryshao] Fix comment style
      4854ee9 [jerryshao] Address all the comments
      96c7a1d [jerryshao] Update the ReliableKafkaReceiver unit test
      8135d31 [jerryshao] Fix flaky test
      a949741 [jerryshao] Address the comments
      16bfe78 [jerryshao] Change the ordering of imports
      0894aef [jerryshao] Add some comments
      77c3e50 [jerryshao] Code refactor and add some unit tests
      dd9aeeb [jerryshao] Initial commit for reliable Kafka receiver
      5930f64b
    • Sandy Ryza's avatar
      SPARK-4375. no longer require -Pscala-2.10 · f5f757e4
      Sandy Ryza authored
      It seems like the winds might have moved away from this approach, but wanted to post the PR anyway because I got it working and to show what it would look like.
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #3239 from sryza/sandy-spark-4375 and squashes the following commits:
      
      0ffbe95 [Sandy Ryza] Enable -Dscala-2.11 in sbt
      cd42d94 [Sandy Ryza] Update doc
      f6644c3 [Sandy Ryza] SPARK-4375 take 2
      f5f757e4
  22. Nov 12, 2014
    • Andrew Or's avatar
      [SPARK-4281][Build] Package Yarn shuffle service into its own jar · aa43a8da
      Andrew Or authored
      This is another addendum to #3082, which added the Yarn shuffle service to run inside the NM. This PR makes the feature much more usable by packaging enough dependencies into the jar to run the service inside an NM. After these changes, the user can run `./make-distribution.sh` and find a `spark-network-yarn*.jar` in their `lib` directory. The equivalent change is done in SBT by making the `network-yarn` module an assembly project.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3147 from andrewor14/yarn-shuffle-build and squashes the following commits:
      
      bda58d0 [Andrew Or] Fix line too long
      81e9705 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-build
      fb7f398 [Andrew Or] Rename jar to spark-{VERSION}-yarn-shuffle.jar
      65db822 [Andrew Or] Actually mark slf4j as provided
      abcefd1 [Andrew Or] Do the same for SBT
      c653028 [Andrew Or] Package network-yarn and its dependencies
      aa43a8da
  23. Nov 11, 2014
    • Prashant Sharma's avatar
      Support cross building for Scala 2.11 · daaca14c
      Prashant Sharma authored
      Let's give this another go using a version of Hive that shades its JLine dependency.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3159 from pwendell/scala-2.11-prashant and squashes the following commits:
      
      e93aa3e [Patrick Wendell] Restoring -Phive-thriftserver profile and cleaning up build script.
      f65d17d [Patrick Wendell] Fixing build issue due to merge conflict
      a8c41eb [Patrick Wendell] Reverting dev/run-tests back to master state.
      7a6eb18 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into scala-2.11-prashant
      583aa07 [Prashant Sharma] REVERT ME: removed hive thirftserver
      3680e58 [Prashant Sharma] Revert "REVERT ME: Temporarily removing some Cli tests."
      935fb47 [Prashant Sharma] Revert "Fixed by disabling a few tests temporarily."
      925e90f [Prashant Sharma] Fixed by disabling a few tests temporarily.
      2fffed3 [Prashant Sharma] Exclude groovy from sbt build, and also provide a way for such instances in future.
      8bd4e40 [Prashant Sharma] Switched to gmaven plus, it fixes random failures observer with its predecessor gmaven.
      5272ce5 [Prashant Sharma] SPARK_SCALA_VERSION related bugs.
      2121071 [Patrick Wendell] Migrating version detection to PySpark
      b1ed44d [Patrick Wendell] REVERT ME: Temporarily removing some Cli tests.
      1743a73 [Patrick Wendell] Removing decimal test that doesn't work with Scala 2.11
      f5cad4e [Patrick Wendell] Add Scala 2.11 docs
      210d7e1 [Patrick Wendell] Revert "Testing new Hive version with shaded jline"
      48518ce [Patrick Wendell] Remove association of Hive and Thriftserver profiles.
      e9d0a06 [Patrick Wendell] Revert "Enable thritfserver for Scala 2.10 only"
      67ec364 [Patrick Wendell] Guard building of thriftserver around Scala 2.10 check
      8502c23 [Patrick Wendell] Enable thritfserver for Scala 2.10 only
      e22b104 [Patrick Wendell] Small fix in pom file
      ec402ab [Patrick Wendell] Various fixes
      0be5a9d [Patrick Wendell] Testing new Hive version with shaded jline
      4eaec65 [Prashant Sharma] Changed scripts to ignore target.
      5167bea [Prashant Sharma] small correction
      a4fcac6 [Prashant Sharma] Run against scala 2.11 on jenkins.
      80285f4 [Prashant Sharma] MAven equivalent of setting spark.executor.extraClasspath during tests.
      034b369 [Prashant Sharma] Setting test jars on executor classpath during tests from sbt.
      d4874cb [Prashant Sharma] Fixed Python Runner suite. null check should be first case in scala 2.11.
      6f50f13 [Prashant Sharma] Fixed build after rebasing with master. We should use ${scala.binary.version} instead of just 2.10
      e56ca9d [Prashant Sharma] Print an error if build for 2.10 and 2.11 is spotted.
      937c0b8 [Prashant Sharma] SCALA_VERSION -> SPARK_SCALA_VERSION
      cb059b0 [Prashant Sharma] Code review
      0476e5e [Prashant Sharma] Scala 2.11 support with repl and all build changes.
      daaca14c
  24. Nov 10, 2014
    • Patrick Wendell's avatar
      Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without... · 6e7a309b
      Patrick Wendell authored
      Revert "[SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally."
      
      This reverts commit bd86cb17.
      6e7a309b
    • RongGu's avatar
      [SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a... · bd86cb17
      RongGu authored
      [SPARK-2703][Core]Make Tachyon related unit tests execute without deploying a Tachyon system locally.
      
      Make Tachyon related unit tests execute without deploying a Tachyon system locally.
      
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #3030 from RongGu/SPARK-2703 and squashes the following commits:
      
      ad08827 [RongGu] Make Tachyon related unit tests execute without deploying a Tachyon system locally
      bd86cb17
    • Sean Owen's avatar
      SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use... · f8e57323
      Sean Owen authored
      SPARK-1209 [CORE] (Take 2) SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop
      
      andrewor14 Another try at SPARK-1209, to address https://github.com/apache/spark/pull/2814#issuecomment-61197619
      
      I successfully tested with `mvn -Dhadoop.version=1.0.4 -DskipTests clean package; mvn -Dhadoop.version=1.0.4 test`. I assume that is what failed Jenkins last time. I also tried `-Dhadoop.version=1.2.1` and `-Phadoop-2.4 -Pyarn -Phive` for more coverage.
      
      So this is why the class was put in `org.apache.hadoop` to begin with, I assume. One option is to leave this as-is for now and move it only when Hadoop 1.0.x support goes away.
      
      This is the other option, which adds a call to force the constructor to be public at run-time. It's probably less surprising than putting Spark code in `org.apache.hadoop`, but, does involve reflection. A `SecurityManager` might forbid this, but it would forbid a lot of stuff Spark does. This would also only affect Hadoop 1.0.x it seems.
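      
      A generic sketch of the reflection approach being described (the real change targets a specific Hadoop 1.0.x class whose constructor is not public; the helper below is purely illustrative):
      
      ```scala
      // Instantiate a class through a non-public constructor. The parameter classes
      // must match the declared constructor signature exactly.
      def newInstanceViaNonPublicCtor[T](
          clazz: Class[T], paramTypes: Seq[Class[_]], args: Seq[AnyRef]): T = {
        val ctor = clazz.getDeclaredConstructor(paramTypes: _*)
        ctor.setAccessible(true)  // this is the call a strict SecurityManager could forbid
        ctor.newInstance(args: _*)
      }
      ```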
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3048 from srowen/SPARK-1209 and squashes the following commits:
      
      0d48f4b [Sean Owen] For Hadoop 1.0.x, make certain constructors public, which were public in later versions
      466e179 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
      eb61820 [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
      f8e57323
  25. Nov 05, 2014
    • Andrew Or's avatar
      [SPARK-3797] Run external shuffle service in Yarn NM · 61a5cced
      Andrew Or authored
      This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.
      
      This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.
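      
      A sketch of the application-side configuration this implies (the NodeManager side is configured separately by registering the auxiliary service in yarn-site.xml):
      
      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      
      // Dynamic allocation requires the external shuffle service introduced here,
      // so the two settings are enabled together.
      val conf = new SparkConf()
        .setAppName("dynamic-allocation-demo")
        .set("spark.shuffle.service.enabled", "true")
        .set("spark.dynamicAllocation.enabled", "true")
      val sc = new SparkContext(conf)
      ```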
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:
      
      ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      0ee67a2 [Andrew Or] Minor wording suggestions
      1c66046 [Andrew Or] Remove unused provided dependencies
      0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      6489db5 [Andrew Or] Try catch at the right places
      7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
      d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
      5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      9b6e058 [Andrew Or] Address various feedback
      f48b20c [Andrew Or] Fix tests again
      f39daa6 [Andrew Or] Do not make network-yarn an assembly module
      761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
      15a5b37 [Andrew Or] Fix build for Hadoop 1.x
      baff916 [Andrew Or] Fix tests
      5bf9b7e [Andrew Or] Address a few minor comments
      5b419b8 [Andrew Or] Add missing license header
      804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
      cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
      ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
      1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
      b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
      43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
      b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
      61a5cced
  26. Nov 01, 2014
    • Aaron Davidson's avatar
      [SPARK-3796] Create external service which can serve shuffle files · f55218ae
      Aaron Davidson authored
      This patch introduces the tooling necessary to construct an external shuffle service which is independent of Spark executors, and then use this service inside Spark. An example (just for the sake of this PR) of the service creation can be found in Worker, and the service itself is used by plugging in the StandaloneShuffleClient as Spark's ShuffleClient (setup in BlockManager).
      
      This PR continues the work from #2753, which extracted out the transport layer of Spark's block transfer into an independent package within Spark. A new package was created which contains the Spark business logic necessary to retrieve the actual shuffle data, which is completely independent of the transport layer introduced in the previous patch. Similar to the transport layer, this package must not depend on Spark as we anticipate plugging this service as a lightweight process within, say, the YARN NodeManager, and do not wish to include Spark's dependencies (including Scala itself).
      
      There are several outstanding tasks which must be complete before this PR can be merged:
      - [x] Complete unit testing of network/shuffle package.
      - [x] Performance and correctness testing on a real cluster.
      - [x] Remove example service instantiation from Worker.scala.
      
      There are even more shortcomings of this PR which should be addressed in followup patches:
      - Don't use Java serializer for RPC layer! It is not cross-version compatible.
      - Handle shuffle file cleanup for dead executors once the application terminates or the ContextCleaner triggers.
      - Documentation of the feature in the Spark docs.
      - Improve behavior if the shuffle service itself goes down (right now we don't blacklist it, and new executors cannot spawn on that machine).
      - SSL and SASL integration
      - Nice to have: Handle shuffle file consolidation (this would require changes to Spark's implementation).
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3001 from aarondav/shuffle-service and squashes the following commits:
      
      4d1f8c1 [Aaron Davidson] Remove changes to Worker
      705748f [Aaron Davidson] Rename Standalone* to External*
      fd3928b [Aaron Davidson] Do not unregister executor outputs unduly
      9883918 [Aaron Davidson] Make suggested build changes
      3d62679 [Aaron Davidson] Add Spark integration test
      7fe51d5 [Aaron Davidson] Fix SBT integration
      56caa50 [Aaron Davidson] Address comments
      c8d1ac3 [Aaron Davidson] Add unit tests
      2f70c0c [Aaron Davidson] Fix unit tests
      5483e96 [Aaron Davidson] Fix unit tests
      46a70bf [Aaron Davidson] Whoops, bracket
      5ea4df6 [Aaron Davidson] [SPARK-3796] Create external service which can serve shuffle files
      f55218ae
  27. Oct 30, 2014
    • Patrick Wendell's avatar
      HOTFIX: Clean up build in network module. · 0734d093
      Patrick Wendell authored
      This is currently breaking the package build for some people (including me).
      
      This patch does some general clean-up which also fixes the current issue.
      - Uses consistent artifact naming
      - Adds sbt support for this module
      - Changes tests to use scalatest (fixes the original issue[1])
      
      One thing to note: it turns out that scalatest, when invoked in the
      Maven build, doesn't successfully detect JUnit Java tests. This is
      a long-standing issue; I noticed it applies to all of our current
      test suites as well. I've created SPARK-4159 to fix this.
      
      [1] The original issue is that we need to allocate extra memory
      for the tests, which happens by default in our scalatest configuration.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #3025 from pwendell/hotfix and squashes the following commits:
      
      faa9053 [Patrick Wendell] HOTFIX: Clean up build in network module.
      0734d093
    • Andrew Or's avatar
      Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use... · 26d31d15
      Andrew Or authored
      Revert "SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop"
      
      This reverts commit 68cb69da.
      26d31d15
    • Sean Owen's avatar
      SPARK-1209 [CORE] SparkHadoop{MapRed,MapReduce}Util should not use package org.apache.hadoop · 68cb69da
      Sean Owen authored
      (This is just a look at what completely moving the classes would look like. I know Patrick flagged that as maybe not OK, although, it's private?)
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2814 from srowen/SPARK-1209 and squashes the following commits:
      
      ead1115 [Sean Owen] Disable MIMA warnings resulting from moving the class -- this was also part of the PairRDDFunctions type hierarchy though?
      2d42c1d [Sean Owen] Move SparkHadoopMapRedUtil / SparkHadoopMapReduceUtil from org.apache.hadoop to org.apache.spark
      68cb69da
  28. Oct 29, 2014
    • Andrew Or's avatar
      [SPARK-3822] Executor scaling mechanism for Yarn · 1df05a40
      Andrew Or authored
      This is part of a broader effort to enable dynamic scaling of executors ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797, but is functionally independent of these other issues.
      
      The logic is built on top of PraveenSeluka's changes at #2798. This is different from the changes there in a few major ways: (1) the mechanism is implemented within the existing scheduler backend framework rather than in new `Actor` classes. This also introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common logic to communicate with the Yarn `ApplicationMaster`. (2) The interface for requesting executors exposed to the `SparkContext` is the same, but the communication between the scheduler backend and the AM uses the total number of executors desired instead of an incremental number. This is discussed in #2746 and explained in the comments in the code.
      
      I have tested this significantly on a stable Yarn cluster.
      
      ------------
      A remaining task for this issue is to tone down the error messages emitted when an executor is removed.
      Currently, `SparkContext` and its components react as if the executor has failed, resulting in many scary error messages and eventual timeouts. While it's not strictly necessary to fix this as of the first-cut implementation of this mechanism, it would be good to add logic to distinguish this case. I prefer to address this in a separate PR. I have filed a separate JIRA for this task at SPARK-4134.
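      
      A short sketch of the developer API this exposes on `SparkContext` (Yarn mode only at this point; `sc` is an existing context and the executor IDs are placeholders):
      
      ```scala
      // Ask the cluster manager for two additional executors.
      sc.requestExecutors(2)
      
      // Release specific executors, or a single one via the shorthand.
      sc.killExecutors(Seq("1", "2"))
      sc.killExecutor("3")
      ```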
      
      Author: Andrew Or <andrew@databricks.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #2840 from andrewor14/yarn-scaling-mechanism and squashes the following commits:
      
      485863e [Andrew Or] Minor log message changes
      4920be8 [Andrew Or] Clarify that public API is only for Yarn mode for now
      1c57804 [Andrew Or] Reword a few comments + other review comments
      6321140 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      02836c0 [Andrew Or] Limit scope of synchronization
      4e2ed7f [Andrew Or] Fix bug: keep track of removed executors properly
      73ade46 [Andrew Or] Wording changes (minor)
      2a7a6da [Andrew Or] Add `sc.killExecutor` as a shorthand (minor)
      665f229 [Andrew Or] Mima excludes
      79aa2df [Andrew Or] Simplify the request interface by asking for a total
      04f625b [Andrew Or] Fix race condition that causes over-allocation of executors
      f4783f8 [Andrew Or] Change the semantics of requesting executors
      005a124 [Andrew Or] Fix tests
      4628b16 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      db4a679 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      572f5c5 [Andrew Or] Unused import (minor)
      f30261c [Andrew Or] Kill multiple executors rather than one at a time
      de260d9 [Andrew Or] Simplify by skipping useless null check
      9c52542 [Andrew Or] Simplify by skipping the TaskSchedulerImpl
      97dd1a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
      d987b3e [Andrew Or] Move addWebUIFilters to Yarn scheduler backend
      7b76d0a [Andrew Or] Expose mechanism in SparkContext as developer API
      47466cd [Andrew Or] Refactor common Yarn scheduler backend logic
      c4dfaac [Andrew Or] Avoid thrashing when removing executors
      53e8145 [Andrew Or] Start yarn actor early to listen for AM registration message
      bbee669 [Andrew Or] Add mechanism in yarn client mode
      1df05a40
    • Reynold Xin's avatar
      [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core · dff01553
      Reynold Xin authored
      This PR encapsulates #2330, which is itself a continuation of #2240. The first goal of this PR is to provide an alternate, simpler implementation of the ConnectionManager which is based on Netty.
      
      In addition to this goal, however, we want to resolve [SPARK-3796](https://issues.apache.org/jira/browse/SPARK-3796), which calls for a standalone shuffle service which can be integrated into the YARN NodeManager, Standalone Worker, or on its own. This PR takes the first step in this direction by ensuring that the actual Netty service is as small as possible and extracted from Spark core. Given this, we should be able to construct this standalone jar which can be included in other JVMs without incurring significant dependency or runtime issues. The actual work to ensure that such a standalone shuffle service would work in Spark will be left for a future PR, however.
      
      In order to minimize dependencies and allow for the service to be long-running (possibly much longer-running than Spark, and possibly having to support multiple version of Spark simultaneously), the entire service has been ported to Java, where we have full control over the binary compatibility of the components and do not depend on the Scala runtime or version.
      
      These issues have been addressed by folding in #2330:
      
      SPARK-3453: Refactor Netty module to use BlockTransferService interface
      SPARK-3018: Release all buffers upon task completion/failure
      SPARK-3002: Create a connection pool and reuse clients across different threads
      SPARK-3017: Integration tests and unit tests for connection failures
      SPARK-3049: Make sure client doesn't block when server/connection has error(s)
      SPARK-3502: SO_RCVBUF and SO_SNDBUF should be bootstrap childOption, not option
      SPARK-3503: Disable thread local cache in PooledByteBufAllocator
      
      TODO before mergeable:
      - [x] Implement uploadBlock()
      - [x] Unit tests for RPC side of code
      - [x] Performance testing (see comments [here](https://github.com/apache/spark/pull/2753#issuecomment-59475022))
      - [x] Turn OFF by default (currently on for unit testing)
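      
      A sketch of how the new transport is selected at this point (off by default, per the commit list below; the configuration key is my recollection of the 1.2-era setting):
      
      ```scala
      import org.apache.spark.SparkConf
      
      // Switch the block transfer service from the NIO ConnectionManager to the
      // new Netty-based implementation.
      val conf = new SparkConf()
        .setAppName("netty-transfer-demo")
        .set("spark.shuffle.blockTransferService", "netty")
      ```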
      
      Author: Reynold Xin <rxin@apache.org>
      Author: Aaron Davidson <aaron@databricks.com>
      Author: cocoatomo <cocoatomo77@gmail.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Anand Avati <avati@redhat.com>
      
      Closes #2753 from aarondav/netty and squashes the following commits:
      
      cadfd28 [Aaron Davidson] Turn netty off by default
      d7be11b [Aaron Davidson] Turn netty on by default
      4a204b8 [Aaron Davidson] Fail block fetches if client connection fails
      2b0d1c0 [Aaron Davidson] 100ch
      0c5bca2 [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
      14e37f7 [Aaron Davidson] Address Reynold's comments
      8dfcceb [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty
      322dfc1 [Aaron Davidson] Address Reynold's comments, including major rename
      e5675a4 [Aaron Davidson] Fail outstanding RPCs as well
      ccd4959 [Aaron Davidson] Don't throw exception if client immediately fails
      9da0bc1 [Aaron Davidson] Add RPC unit tests
      d236dfd [Aaron Davidson] Remove no-op serializer :)
      7b7a26c [Aaron Davidson] Fix Nio compile issue
      dd420fd [Aaron Davidson] Merge branch 'master' of https://github.com/apache/spark into netty-test
      939f276 [Aaron Davidson] Attempt to make comm. bidirectional
      aa58f67 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
      8dc1ded [cocoatomo] [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
      5b5dbe6 [Prashant Sharma] [SPARK-2924] Required by scala 2.11, only one fun/ctor amongst overriden alternatives, can have default argument(s).
      2c5d9dc [Patrick Wendell] HOTFIX: Fix build issue with Akka 2.3.4 upgrade.
      020691e [Davies Liu] [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
      ae4083a [Anand Avati] [SPARK-2805] Upgrade Akka to 2.3.4
      29c6dcf [Aaron Davidson] [SPARK-3453] Netty-based BlockTransferService, extracted from Spark core
      f7e7568 [Reynold Xin] Fixed spark.shuffle.io.receiveBuffer setting.
      5d98ce3 [Reynold Xin] Flip buffer.
      f6c220d [Reynold Xin] Merge with latest master.
      407e59a [Reynold Xin] Fix style violation.
      a0518c7 [Reynold Xin] Implemented block uploads.
      4b18db2 [Reynold Xin] Copy the buffer in fetchBlockSync.
      bec4ea2 [Reynold Xin] Removed OIO and added num threads settings.
      1bdd7ee [Reynold Xin] Fixed tests.
      d68f328 [Reynold Xin] Logging close() in case close() fails.
      f63fb4c [Reynold Xin] Add more debug message.
      6afc435 [Reynold Xin] Added logging.
      c066309 [Reynold Xin] Implement java.io.Closeable interface.
      519d64d [Reynold Xin] Mark private package visibility and MimaExcludes.
      f0a16e9 [Reynold Xin] Fixed test hanging.
      14323a5 [Reynold Xin] Removed BlockManager.getLocalShuffleFromDisk.
      b2f3281 [Reynold Xin] Added connection pooling.
      d23ed7b [Reynold Xin] Incorporated feedback from Norman: - use same pool for boss and worker - remove ioratio - disable caching of byte buf allocator - childoption sendbuf/receivebuf - fire exception through pipeline
      9e0cb87 [Reynold Xin] Fixed BlockClientHandlerSuite
      5cd33d7 [Reynold Xin] Fixed style violation.
      cb589ec [Reynold Xin] Added more test cases covering cleanup when fault happens in ShuffleBlockFetcherIteratorSuite
      1be4e8e [Reynold Xin] Shorten NioManagedBuffer and NettyManagedBuffer class names.
      108c9ed [Reynold Xin] Forgot to add TestSerializer to the commit list.
      b5c8d1f [Reynold Xin] Fixed ShuffleBlockFetcherIteratorSuite.
      064747b [Reynold Xin] Reference count buffers and clean them up properly.
      2b44cf1 [Reynold Xin] Added more documentation.
      1760d32 [Reynold Xin] Use Epoll.isAvailable in BlockServer as well.
      165eab1 [Reynold Xin] [SPARK-3453] Refactor Netty module to use BlockTransferService.
      dff01553
  29. Oct 28, 2014
    • Xiangrui Meng's avatar
      [SPARK-4084] Reuse sort key in Sorter · 84e5da87
      Xiangrui Meng authored
      Sorter uses a generic-typed key for sorting. When data is large, it creates lots of key objects, which is not efficient. We should reuse the key in Sorter for memory efficiency. This change is part of the petabyte sort implementation from rxin.
      
      The `Sorter` class was written in Java and marked package private, so it is only available to `org.apache.spark.util.collection`. I renamed it to `TimSort` and added a simple Scala wrapper for it, still called `Sorter`, which is `private[spark]`.
      
      The benchmark code is updated, which now resets the array before each run. Here is the result on sorting primitive Int arrays of size 25 million using Sorter:
      
      ~~~
      [info] - Sorter benchmark for key-value pairs !!! IGNORED !!!
      Java Arrays.sort() on non-primitive int array: Took 13237 ms
      Java Arrays.sort() on non-primitive int array: Took 13320 ms
      Java Arrays.sort() on non-primitive int array: Took 15718 ms
      Java Arrays.sort() on non-primitive int array: Took 13283 ms
      Java Arrays.sort() on non-primitive int array: Took 13267 ms
      Java Arrays.sort() on non-primitive int array: Took 15122 ms
      Java Arrays.sort() on non-primitive int array: Took 15495 ms
      Java Arrays.sort() on non-primitive int array: Took 14877 ms
      Java Arrays.sort() on non-primitive int array: Took 16429 ms
      Java Arrays.sort() on non-primitive int array: Took 14250 ms
      Java Arrays.sort() on non-primitive int array: (13878 ms first try, 14499 ms average)
      Java Arrays.sort() on primitive int array: Took 2683 ms
      Java Arrays.sort() on primitive int array: Took 2683 ms
      Java Arrays.sort() on primitive int array: Took 2701 ms
      Java Arrays.sort() on primitive int array: Took 2746 ms
      Java Arrays.sort() on primitive int array: Took 2685 ms
      Java Arrays.sort() on primitive int array: Took 2735 ms
      Java Arrays.sort() on primitive int array: Took 2669 ms
      Java Arrays.sort() on primitive int array: Took 2693 ms
      Java Arrays.sort() on primitive int array: Took 2680 ms
      Java Arrays.sort() on primitive int array: Took 2642 ms
      Java Arrays.sort() on primitive int array: (2948 ms first try, 2691 ms average)
      Sorter without key reuse on primitive int array: Took 10732 ms
      Sorter without key reuse on primitive int array: Took 12482 ms
      Sorter without key reuse on primitive int array: Took 10718 ms
      Sorter without key reuse on primitive int array: Took 12650 ms
      Sorter without key reuse on primitive int array: Took 10747 ms
      Sorter without key reuse on primitive int array: Took 10783 ms
      Sorter without key reuse on primitive int array: Took 12721 ms
      Sorter without key reuse on primitive int array: Took 10604 ms
      Sorter without key reuse on primitive int array: Took 10622 ms
      Sorter without key reuse on primitive int array: Took 11843 ms
      Sorter without key reuse on primitive int array: (11089 ms first try, 11390 ms average)
      Sorter with key reuse on primitive int array: Took 5141 ms
      Sorter with key reuse on primitive int array: Took 5298 ms
      Sorter with key reuse on primitive int array: Took 5066 ms
      Sorter with key reuse on primitive int array: Took 5164 ms
      Sorter with key reuse on primitive int array: Took 5203 ms
      Sorter with key reuse on primitive int array: Took 5274 ms
      Sorter with key reuse on primitive int array: Took 5186 ms
      Sorter with key reuse on primitive int array: Took 5159 ms
      Sorter with key reuse on primitive int array: Took 5164 ms
      Sorter with key reuse on primitive int array: Took 5078 ms
      Sorter with key reuse on primitive int array: (5311 ms first try, 5173 ms average)
      ~~~
      
      So with key reuse, it is faster and less likely to trigger GC.
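      
      A sketch of the key-reuse hook, from memory of the internal `SortDataFormat` API (it is `private[spark]`, so this only compiles inside Spark's own source tree, and the method names are as I recall them):
      
      ```scala
      import org.apache.spark.util.collection.SortDataFormat
      
      // A mutable key so that getKey(data, pos, reuse) can refill the same object
      // instead of allocating a new key per comparison.
      class MutableIntKey(var value: Int)
      
      class IntArrayFormat extends SortDataFormat[MutableIntKey, Array[Int]] {
        override def newKey(): MutableIntKey = new MutableIntKey(0)
        override def getKey(data: Array[Int], pos: Int): MutableIntKey =
          new MutableIntKey(data(pos))
        override def getKey(data: Array[Int], pos: Int, reuse: MutableIntKey): MutableIntKey = {
          reuse.value = data(pos)  // no allocation on the hot path
          reuse
        }
        override def swap(data: Array[Int], pos0: Int, pos1: Int): Unit = {
          val tmp = data(pos0); data(pos0) = data(pos1); data(pos1) = tmp
        }
        override def copyElement(src: Array[Int], srcPos: Int, dst: Array[Int], dstPos: Int): Unit =
          dst(dstPos) = src(srcPos)
        override def copyRange(src: Array[Int], srcPos: Int, dst: Array[Int], dstPos: Int, length: Int): Unit =
          System.arraycopy(src, srcPos, dst, dstPos, length)
        override def allocate(length: Int): Array[Int] = new Array[Int](length)
      }
      ```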
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Reynold Xin <rxin@apache.org>
      
      Closes #2937 from mengxr/SPARK-4084 and squashes the following commits:
      
      d73c3d0 [Xiangrui Meng] address comments
      0b7b682 [Xiangrui Meng] fix mima
      a72f53c [Xiangrui Meng] update timeIt
      38ba50c [Xiangrui Meng] update timeIt
      720f731 [Xiangrui Meng] add doc about JIT specialization
      78f2879 [Xiangrui Meng] update tests
      7de2efd [Xiangrui Meng] update the Sorter benchmark code to be correct
      8626356 [Xiangrui Meng] add prepare to timeIt and update testsin SorterSuite
      5f0d530 [Xiangrui Meng] update method modifiers of SortDataFormat
      6ffbe66 [Xiangrui Meng] rename Sorter to TimSort and add a Scala wrapper that is private[spark]
      b00db4d [Xiangrui Meng] doc and tests
      cf94e8a [Xiangrui Meng] renaming
      464ddce [Reynold Xin] cherry-pick rxin's commit
      84e5da87
  30. Oct 26, 2014