  1. Oct 17, 2013
  2. Oct 15, 2013
    • Merge branch 'master' of https://github.com/apache/incubator-spark · 6c6b146f
      KarthikTunga authored
      Updating local branch
      6c6b146f
    • SPARK-627 - reading --config argument · d2c86e71
      KarthikTunga authored
      d2c86e71
    • Merge pull request #29 from rxin/kill · e33b1839
      Patrick Wendell authored
      Job killing
      
      Moving https://github.com/mesos/spark/pull/935 here
      
      The high level idea is to have an "interrupted" field in TaskContext, and a task should check that flag to determine if its execution should continue. For convenience, I provide an InterruptibleIterator which wraps around a normal iterator but checks for the interrupted flag. I also provide an InterruptibleRDD that wraps around an existing RDD.
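
      For illustration, a minimal sketch of the flag-and-wrapper idea described above (the class and field names here are illustrative, not necessarily the ones in this patch):
      ```
      // A context object whose `interrupted` flag the scheduler flips when the job is killed.
      class KillableContext {
        @volatile var interrupted: Boolean = false
      }

      // Wraps a normal iterator and checks the flag between records, so a long-running
      // task notices the kill request promptly instead of finishing its whole partition.
      class InterruptibleIterator[T](context: KillableContext, delegate: Iterator[T])
          extends Iterator[T] {
        override def hasNext: Boolean = {
          if (context.interrupted) {
            throw new InterruptedException("task has been killed")
          }
          delegate.hasNext
        }
        override def next(): T = delegate.next()
      }
      ```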
      
      As part of this pull request, I added an AsyncRDDActions class that provides a number of RDD actions that return a FutureJob (extending scala.concurrent.Future). The FutureJob can be used to kill the job execution or to wait until the job finishes.
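
      A usage sketch in spark-shell (the method and handle names are illustrative of the FutureJob idea above, not necessarily the final API):
      ```
      scala> val nums = sc.parallelize(1 to 1000000)

      scala> val job = nums.countAsync()   // returns a future-like handle to the running job

      scala> job.cancel()                  // asks the scheduler to kill the job's running tasks
      ```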
      
      This is NOT ready for merging yet. Remaining TODOs:
      
      1. Add unit tests
      2. Add job killing functionality for the local scheduler (the current job killing functionality only works in the cluster scheduler)
      
      -------------
      
      Update on Oct 10, 2013:
      
      This is ready!
      
      Related future work:
      - Figure out how to handle the job triggered by RangePartitioner (this one is tough; might become future work)
      - Java API
      - Python API
      e33b1839
  3. Oct 14, 2013
    • Merge branch 'master' of github.com:apache/incubator-spark into kill · 9cd8786e
      Reynold Xin authored
      Conflicts:
      	core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
      9cd8786e
    • Merge pull request #57 from aarondav/bid · 3b11f43e
      Reynold Xin authored
      Refactor BlockId into an actual type
      
      Converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now:
      
      + Type safety
      + Code clarity - it's now obvious what the key of a shuffle or RDD block is, for instance. Additionally, having BlockId appear in tuple/map type signatures is a big readability bonus; a Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types.
      + Explicit usage - we can now tell exactly where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer, compiler-supported process.
        (I'm looking at you, shuffle file consolidation.)
      + It will only get harder to make this change as time goes on.
      
      Downside is, of course, that this is a very invasive change touching a lot of different files, which will inevitably lead to merge conflicts for many.
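
      For illustration, a simplified sketch of what "an actual BlockId type" can look like (the hierarchy in the patch has more members and a parser for the string form):
      ```
      sealed abstract class BlockId {
        def name: String                    // the old string form, still handy for file names and logs
        override def toString: String = name
      }

      case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
        def name: String = "rdd_" + rddId + "_" + splitIndex
      }

      case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
        def name: String = "shuffle_" + shuffleId + "_" + mapId + "_" + reduceId
      }
      ```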
      3b11f43e
    • Address Matei's comments · 4a45019f
      Aaron Davidson authored
      4a45019f
  4. Oct 13, 2013
    • d6035228
    • Refactor BlockId into an actual type · a3959111
      Aaron Davidson authored
      This is an unfortunately invasive change which converts all of our BlockId
      strings into actual BlockId types. Here are some advantages of doing this now:
      
      + Type safety
      
      + Code clarity - it's now obvious what the key of a shuffle or RDD block is,
        for instance. Additionally, having BlockId appear in tuple/map type signatures
        is a big readability bonus; a Seq[(String, BlockStatus)] is not very clear.
        Further, we can now use more Scala features, like matching on BlockId types.

      + Explicit usage - we can now tell exactly where various BlockIds are being used
        (without doing string searches); this makes updating current BlockIds a much
        clearer, compiler-supported process.
        (I'm looking at you, shuffle file consolidation.)
      
      + It will only get harder to make this change as time goes on.
      
      Since this touches a lot of files, it'd be best to either get this patch
      in quickly or throw it on the ground to avoid too many secondary merge conflicts.
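
      As a companion to the sketch after pull request #57 above (reusing its illustrative RDDBlockId and ShuffleBlockId case classes), matching on the sealed type replaces string parsing at call sites:
      ```
      def describe(id: BlockId): String = id match {
        case RDDBlockId(rddId, split)            => "partition " + split + " of RDD " + rddId
        case ShuffleBlockId(shuffle, map, red)   => "shuffle " + shuffle + ", map " + map + ", reduce " + red
      }

      // Likewise, a Map[BlockId, BlockStatus] says what its keys are in a way
      // that Seq[(String, BlockStatus)] never did.
      ```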
      a3959111
  5. Oct 12, 2013
  6. Oct 11, 2013
  7. Oct 10, 2013
    • Merge remote-tracking branch 'tgravescs/sparkYarnDistCache' · 8f11c36f
      Matei Zaharia authored
      Closes #11
      
      Conflicts:
      	docs/running-on-yarn.md
      	yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
      8f11c36f
    • Merge pull request #19 from aarondav/master-zk · c71499b7
      Matei Zaharia authored
      Standalone Scheduler fault tolerance using ZooKeeper
      
      This patch implements full distributed fault tolerance for standalone scheduler Masters.
      There is only one master Leader at a time, which is actively serving scheduling
      requests. If this Leader crashes, another master will eventually be elected, reconstruct
      the state from the first Master, and continue serving scheduling requests.
      
      Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
      the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
      retries and session monitoring on top of the ZooKeeper client.
      
      Master failover follows directly from the single-node Master recovery via the file
      system (patch d5a96fec), save that the Master state is stored in ZooKeeper instead.
      
      Configuration:
      By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
      By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
      to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
      By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
      to an appropriate directory accessible by the Master, we keep the behavior from d5a96fec.
      
      Additionally, places where a Master could be specified by a spark:// URL can now take
      comma-delimited lists to specify backup masters. Note that this is only used for registration
      of NEW Workers and application Clients. Once a Worker or Client has registered with the
      Master Leader, it is "in the system" and will never need to register again.
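
      A hedged sketch of what this looks like in practice (the properties below are read by the Master process; exactly how they are passed to its JVM is an assumption, and the host names are placeholders):
      ```
      // JVM options for each Master daemon:
      //   -Dspark.deploy.recoveryMode=ZOOKEEPER
      //   -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181

      import org.apache.spark.SparkContext

      // A new application registers against the comma-delimited list of masters;
      // once registered with the leader, it never needs to register again.
      val sc = new SparkContext("spark://master1:7077,master2:7077", "ha-demo")
      ```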
      c71499b7
    • Add an optional closure parameter to HadoopRDD instantiation to use when... · 5a99e678
      Harvey Feng authored
      Add an optional closure parameter to HadoopRDD instantiation to use when creating any local JobConfs.
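      A hedged illustration of the kind of closure meant here (the property used and the exact position of the parameter in HadoopRDD's constructor are assumptions):
      ```
      import org.apache.hadoop.mapred.JobConf

      // Applied when a task builds its local JobConf, so per-RDD settings can be made
      // without mutating a shared, broadcasted configuration.
      val initLocalJobConfFunc: JobConf => Unit = { conf =>
        conf.set("mapred.input.dir", "hdfs:///data/events")
      }
      ```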
      5a99e678
    • Merge pull request #44 from mateiz/fast-map · cd08f734
      Matei Zaharia authored
      A fast and low-memory append-only map for shuffle operations
      
      This is a continuation of the old repo's pull request https://github.com/mesos/spark/pull/823 to add a more efficient hashmap class for shuffles. I've optimized and tested this more thoroughly now so I think it's good to go. I've also addressed some of the comments that were outstanding there.
      
      The idea is to reduce the cost of shuffles by taking advantage of the properties their hashmaps need. In particular, the hashmaps there are append-only, and a common operation is updating a key's value based on the old value. The included AppendOnlyMap class uses open hashing to use less space than Java's (by not having a linked list per bucket), does not support deletes, and has a changeValue operation to update a key in place without following the hash chain twice. In micro-benchmarks against java.util.HashMap and scala.collection.mutable.HashMap, this is 20-30% smaller and 10-40% faster depending on the number and type of keys. It's also noticeably faster than fastutil's Object2ObjectOpenHashMap.
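
      For illustration, a simplified sketch of such a map (not the actual AppendOnlyMap in this pull request): open addressing with linear probing, no remove(), and a changeValue() that creates or updates a key's value in a single probe sequence:
      ```
      class SketchAppendOnlyMap[K <: AnyRef, V](initialCapacity: Int = 64) {
        private var capacity = initialCapacity              // always a power of two
        private var mask = capacity - 1
        private var data = new Array[AnyRef](2 * capacity)  // key at 2*i, value at 2*i+1
        private var curSize = 0

        /** Insert `create` if the key is absent, otherwise replace its value with `update(old)`. */
        def changeValue(key: K, create: => V)(update: V => V): V = {
          var pos = key.hashCode & mask
          while (true) {
            val curKey = data(2 * pos)
            if (curKey == null) {                           // empty slot: first time we see this key
              val v = create
              data(2 * pos) = key
              data(2 * pos + 1) = v.asInstanceOf[AnyRef]
              curSize += 1
              if (curSize > 0.7 * capacity) growTable()
              return v
            } else if (curKey == key) {                     // found the key: update the value in place
              val v = update(data(2 * pos + 1).asInstanceOf[V])
              data(2 * pos + 1) = v.asInstanceOf[AnyRef]
              return v
            } else {
              pos = (pos + 1) & mask                        // collision: probe the next slot
            }
          }
          throw new IllegalStateException("unreachable")
        }

        /** Double the table and re-insert every entry (simpler because nothing is ever deleted). */
        private def growTable(): Unit = {
          val old = data
          val oldCapacity = capacity
          capacity *= 2
          mask = capacity - 1
          data = new Array[AnyRef](2 * capacity)
          var i = 0
          while (i < oldCapacity) {
            val k = old(2 * i)
            if (k != null) {
              var pos = k.hashCode & mask
              while (data(2 * pos) != null) pos = (pos + 1) & mask
              data(2 * pos) = k
              data(2 * pos + 1) = old(2 * i + 1)
            }
            i += 1
          }
        }
      }
      ```
      With a structure like this, a reduce-side sum over a SketchAppendOnlyMap[String, Int] is just counts.changeValue(word, 1)(_ + 1), following the probe sequence only once per record.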
      
      I've also tested this in Spark apps now. While the speed gain is modest (partly due to other overheads, like serialization), there is some, and I think the lower memory usage is worth it. Here's one example where the speedup is most noticeable, in spark-shell on local mode:
      ```
      scala> val nums = sc.parallelize(1 to 8).flatMap(x => (1 to 5e6.toInt)).cache
      
      scala> nums.count
      
      scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
      
      scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _).count) / 1000.0)
      ```
      
      This prints the following times before and after this change:
      ```
      Before: Vector(4.368, 2.635, 2.549, 2.522, 2.233, 2.222, 2.214, 2.195)
      
      After: Vector(3.588, 1.741, 1.706, 1.648, 1.777, 1.81, 1.776, 1.731)
      ```
      
      I've also run the spark-perf suite, enhanced with some tests that use Ints (https://github.com/amplab/spark-perf/pull/9), and it shows some speedup on those, but less on the string ones (presumably due to existing overhead): https://gist.github.com/mateiz/6897121.
      cd08f734
    • Merge branch 'master' into fast-map · 001d13f7
      Matei Zaharia authored
      Conflicts:
      	core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
      001d13f7
    • ddf64f01