  1. Oct 24, 2013
  2. Oct 23, 2013
  3. Oct 22, 2013
    • Merge pull request #100 from JoshRosen/spark-902 · 9dfcf53a
      Reynold Xin authored
      Remove redundant Java Function call() definitions
      
      This should fix [SPARK-902](https://spark-project.atlassian.net/browse/SPARK-902), an issue where some Java API Function classes could cause AbstractMethodErrors when user code is compiled using the Eclipse compiler.
      
      Thanks to @MartinWeindel for diagnosing this problem.
      
      (This PR subsumes #30).
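
      As a very rough, self-contained sketch of the resulting shape (illustrative types only, not Spark's actual Java API classes): user code overrides a single abstract call(), with no second, redundant concrete call() definition left for a different compiler to resolve oddly.

          // Illustrative only: one abstract call() and no redundant overloads.
          abstract class JFunctionSketch[T, R] extends Serializable {
            def call(t: T): R
          }

          object FunctionSketchDemo extends App {
            val toLength = new JFunctionSketch[String, Int] {
              override def call(s: String): Int = s.length
            }
            println(toLength.call("spark")) // 5
          }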
    • Remove redundant Java Function call() definitions · 768eb9c9
      Josh Rosen authored
      This should fix SPARK-902, an issue where some
      Java API Function classes could cause
      AbstractMethodErrors when user code is compiled
      using the Eclipse compiler.
      
      Thanks to @MartinWeindel for diagnosing this
      problem.
      
      (This PR subsumes / closes #30)
    • Merge pull request #99 from pwendell/master · 97184de1
      Patrick Wendell authored
      Use correct formatting for comments in StoragePerfTester
    • Formatting cleanup · ab5ece19
      Patrick Wendell authored
    • Merge pull request #90 from pwendell/master · c404adb9
      Patrick Wendell authored
      SPARK-940: Do not directly pass Stage objects to SparkListener.
      
      This patch updates the SparkListener interface to pass StageInfo objects rather than directly passing Spark Stage objects. The reason for this patch is explained in detail in SPARK-940.
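
      As a rough, self-contained sketch of the intent (illustrative types, not the actual Spark classes): a listener only ever sees an immutable summary object, never the scheduler-internal Stage.

          // Listeners receive a lightweight summary instead of scheduler state.
          case class StageInfo(stageId: Int, name: String, numTasks: Int)

          trait StageListener {
            def onStageCompleted(info: StageInfo): Unit
          }

          object ListenerSketch extends App {
            val listener = new StageListener {
              def onStageCompleted(info: StageInfo): Unit =
                println(s"stage ${info.stageId} (${info.name}) finished with ${info.numTasks} tasks")
            }
            listener.onStageCompleted(StageInfo(0, "count at <console>", 4))
          }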
    • Pass self to SparkContext._ensure_initialized. · 317a9eb1
      Ewen Cheslack-Postava authored
      The constructor for SparkContext should pass in self so that we track
      the current context and produce errors if another one is created. Add
      a doctest to make sure creating multiple contexts triggers the
      exception.
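
      A self-contained sketch of the guard this adds (written in Scala here purely for illustration; the actual change is in PySpark, and the names are hypothetical): remember the active context and fail fast if a different one tries to initialize.

          object ActiveContextGuard {
            @volatile private var active: Option[AnyRef] = None

            def ensureInitialized(instance: AnyRef): Unit = synchronized {
              active match {
                case Some(existing) if existing ne instance =>
                  throw new IllegalStateException("Cannot run multiple SparkContexts at once")
                case _ => active = Some(instance)
              }
            }
          }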
    • Minor clean-up in review · c22046b3
      Patrick Wendell authored
    • 7de0ea4d
      Patrick Wendell authored
    • Fix for Spark-870. · 2fa3c4c4
      Patrick Wendell authored
      This patch fixes a bug where the Spark UI didn't display the correct number of total
      tasks when the number of tasks in a Stage does not equal the number of RDD partitions.

      It also cleans up the listener API a bit by embedding this information in the
      StageInfo class rather than passing it separately.
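
      A small example of when those two numbers legitimately differ (a sketch; partition count and app name are arbitrary):

          import org.apache.spark.SparkContext

          object TaskCountDemo extends App {
            val sc = new SparkContext("local", "task-count-demo")
            val rdd = sc.parallelize(1 to 1000, 100) // RDD with 100 partitions
            rdd.take(1)                              // may launch a stage over just one partition
            // The UI's "total tasks" for that stage should reflect the tasks actually
            // submitted, not rdd.partitions.length (100 here).
            println(rdd.partitions.length)
            sc.stop()
          }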
    • Merge pull request #98 from aarondav/docs · aa9019fc
      Matei Zaharia authored
      Docs: Fix links to RDD API documentation
    • Merge pull request #82 from JoshRosen/map-output-tracker-refactoring · a0e08f0f
      Matei Zaharia authored
      Split MapOutputTracker into Master/Worker classes
      
      Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances.  This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers.
      
      I also renamed a few methods and made others protected/private.
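
      A self-contained sketch of the shape of the split (class names and signatures are illustrative, not the ones in this patch): workers get a read-only view, and the mutating, master-only methods exist only on the master-side subclass.

          import scala.collection.mutable

          class WorkerMapOutputTrackerSketch {
            protected val mapStatuses = mutable.Map[Int, Array[String]]()

            def getServerStatuses(shuffleId: Int): Array[String] =
              mapStatuses.getOrElse(shuffleId, Array.empty[String])
          }

          class MasterMapOutputTrackerSketch extends WorkerMapOutputTrackerSketch {
            // Registration is master-only, so it simply isn't callable on workers.
            def registerMapOutputs(shuffleId: Int, locations: Array[String]): Unit =
              mapStatuses(shuffleId) = locations
          }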
    • Docs: Fix links to RDD API documentation · 962bec97
      Aaron Davidson authored
    • Add classmethod to SparkContext to set system properties. · 56d230e6
      Ewen Cheslack-Postava authored
      Add a new classmethod to SparkContext to set system properties, as is
      possible in Scala/Java. Unlike the Java/Scala implementations, there's
      no access to System until the JVM bridge is created. Since
      SparkContext handles that, move the initialization of the JVM
      connection to a separate classmethod that can safely be called
      repeatedly as long as the same instance (or no instance) is provided.
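
      For reference, the Scala/Java pattern this mirrors is setting JVM system properties before the SparkContext is constructed; a minimal sketch (the property value here is just an example):

          import org.apache.spark.SparkContext

          object SystemPropsDemo extends App {
            // In this era of Spark, Scala/Java users configure settings via JVM
            // system properties set before the SparkContext is created.
            System.setProperty("spark.executor.memory", "1g")
            val sc = new SparkContext("local", "sysprops-demo")
            println(System.getProperty("spark.executor.memory"))
            sc.stop()
          }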
    • Merge pull request #92 from tgravescs/sparkYarnFixClasspath · b84193c5
      Matei Zaharia authored
      Fix the Worker to use CoarseGrainedExecutorBackend and modify classpath ...
      
      ...to be explicit about inclusion of spark.jar and app.jar. Being explicit means that if there are any conflicts in packaging between spark.jar and app.jar, we don't get random results due to the classpath containing /*, which can include things in a different order.
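
      A tiny sketch of the idea, listing jars explicitly instead of relying on a directory wildcard so that load order is deterministic (jar names are illustrative):

          object ClasspathSketch extends App {
            // Explicit entries keep ordering deterministic; a wildcard like "dir/*"
            // leaves the ordering of its contents up to the JVM/launcher.
            val entries = Seq("spark.jar", "app.jar")
            println(entries.mkString(java.io.File.pathSeparator))
          }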
    • Merge pull request #56 from jerryshao/kafka-0.8-dev · 731c94e9
      Matei Zaharia authored
      Upgrade Kafka 0.7.2 to Kafka 0.8.0-beta1 for Spark Streaming
      
      Conflicts:
      	streaming/pom.xml
    • Merge pull request #87 from aarondav/shuffle-base · 48952d67
      Reynold Xin authored
      Basic shuffle file consolidation
      
      The Spark shuffle phase can produce a large number of files, as one file is created
      per mapper per reducer. For large or repeated jobs, this often produces millions of
      shuffle files, which causes extremely degraded performance in the OS file system.
      This patch seeks to reduce that burden by combining multiple shuffle files into one.
      
      This PR draws upon the work of @jason-dai in https://github.com/mesos/spark/pull/669.
      However, it simplifies the design in order to get the majority of the gain with less
      overall intellectual and code burden. The vast majority of code in this pull request
      is a refactor to allow the insertion of a clean layer of indirection between logical
      block ids and physical files. This, I feel, provides some design clarity in addition
      to enabling shuffle file consolidation.
      
      The main goal is to produce one shuffle file per reducer per active mapper thread.
      This allows us to isolate the mappers (simplifying the failure modes), while still
      allowing us to reduce the number of files tremendously for large tasks. In order
      to accomplish this, we simply create a new set of shuffle files for every parallel
      task, and return the files to a pool, to be handed out to the next task that runs.
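
      A minimal, self-contained sketch of that pool-of-file-groups idea (class and file names are illustrative, not the patch's actual code): each concurrently running map task checks out a group of per-reducer files, appends its output, and returns the group for reuse.

          import java.util.concurrent.ConcurrentLinkedQueue

          final case class ShuffleFileGroup(files: Array[String])

          class ShuffleFilePool(numReducers: Int) {
            private val pool = new ConcurrentLinkedQueue[ShuffleFileGroup]()
            private var created = 0

            // Reuse a released group if one exists; otherwise create a new set of
            // per-reducer files for this concurrently running task.
            def checkout(): ShuffleFileGroup = synchronized {
              Option(pool.poll()).getOrElse {
                created += 1
                ShuffleFileGroup(Array.tabulate(numReducers)(r => s"merged_shuffle_${created}_$r"))
              }
            }

            // When the task finishes, its file group becomes available to the next task.
            def release(group: ShuffleFileGroup): Unit = pool.add(group)
          }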
      
      I have run some ad hoc query testing on 5 m1.xlarge EC2 nodes with 2g of executor memory and the following microbenchmark:
      
          scala> val nums = sc.parallelize(1 to 1000, 1000).flatMap(x => (1 to 1e6.toInt))
          scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
          scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, 2000, x)).reduceByKey(_ + _).count) / 1000.0)
      
      For this particular workload, with 1000 mappers and 2000 reducers, I saw the old method running at around 15 minutes, with the consolidated shuffle files running at around 4 minutes. There was a very sharp increase in running time for the non-consolidated version after around 1 million total shuffle files. Below this threshold, however, there wasn't a significant difference between the two.
      
      Better performance measurement of this patch is warranted, and I plan on doing so in the near future as part of a general investigation of our shuffle file bottlenecks and performance.
  4. Oct 21, 2013
  5. Oct 20, 2013
    • Merge pull request #89 from rxin/executor · 5b9380e0
      Reynold Xin authored
      Don't setup the uncaught exception handler in local mode.
      
      This avoids unit test failures for Spark streaming.
      
          java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14]
      	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
      	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
      	at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
      	at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54)
      	at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
      	at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
      	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
      	at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108)
      	at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41)
      	at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66)
      	at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)
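
      A self-contained sketch of the guard described above (method and flag names are illustrative): only install a process-killing uncaught exception handler when running as a dedicated executor JVM, never in local mode, where it would tear down the JVM that unit tests such as the streaming suites run in.

          object ExecutorExceptionHandlerSketch {
            def maybeInstall(isLocal: Boolean): Unit = {
              if (!isLocal) {
                Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
                  override def uncaughtException(t: Thread, e: Throwable): Unit = {
                    e.printStackTrace()
                    System.exit(1) // fine for a standalone executor, fatal for in-process tests
                  }
                })
              }
            }
          }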
    • b4d84784
      Reynold Xin authored
    • Merge pull request #80 from rxin/build · 261bcf27
      Matei Zaharia authored
      Exclusion rules for Maven build files.
    • 4b68ddf3
      Aaron Davidson authored
    • Merge pull request #75 from JoshRosen/block-manager-cleanup · edc5e3f8
      Matei Zaharia authored
      Code de-duplication in BlockManager
      
      The BlockManager has a few methods that duplicate most of their code.  This pull request extracts the duplicated code into private doPut(), doGetLocal(), and doGetRemote() methods that unify the storing/reading of bytes or objects.
      
      I believe that I preserved the logic of the original code, but I'd appreciate some help in reviewing this.
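
      A self-contained sketch of the de-duplication pattern described above (types and bodies are illustrative, not the BlockManager's actual code): the public byte/object entry points become thin wrappers over a single private helper that holds the shared bookkeeping exactly once.

          import scala.collection.mutable

          sealed trait BlockValue
          case class ByteValue(bytes: Array[Byte]) extends BlockValue
          case class ObjectValue(values: Iterator[Any]) extends BlockValue

          class BlockStoreSketch {
            private val store = mutable.Map[String, BlockValue]()

            def putBytes(id: String, bytes: Array[Byte]): Unit = doPut(id, ByteValue(bytes))
            def putValues(id: String, values: Iterator[Any]): Unit = doPut(id, ObjectValue(values))

            // Locking, storage-level decisions, and reporting would all live here once.
            private def doPut(id: String, value: BlockValue): Unit = store(id) = value
          }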