- Oct 17, 2013
KarthikTunga authored
KarthikTunga authored
KarthikTunga authored

- Oct 15, 2013
KarthikTunga authored
Patrick Wendell authored
Job killing

Moving https://github.com/mesos/spark/pull/935 here.

The high-level idea is to have an "interrupted" field in TaskContext, and a task should check that flag to determine if its execution should continue. For convenience, I provide an InterruptibleIterator which wraps around a normal iterator but checks for the interrupted flag. I also provide an InterruptibleRDD that wraps around an existing RDD.

As part of this pull request, I added an AsyncRDDActions class that provides a number of RDD actions that return a FutureJob (extending scala.concurrent.Future). The FutureJob can be used to kill the job execution, or to wait until the job finishes.

This is NOT ready for merging yet. Remaining TODOs:
1. Add unit tests
2. Add job killing functionality for the local scheduler (the current job killing functionality only works in the cluster scheduler)

Update on Oct 10, 2013: This is ready!

Related future work:
- Figure out how to handle the job triggered by RangePartitioner (this one is tough; might become future work)
- Java API
- Python API
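A minimal sketch of the interrupt-flag pattern described above, assuming a simplified TaskContext with a volatile `interrupted` field; the class names mirror the description, but the bodies are illustrative rather than Spark's actual implementation:

```
// A task-side interrupt flag: a simplified stand-in for Spark's TaskContext.
class TaskContext {
  @volatile var interrupted: Boolean = false
}

// Wraps a normal iterator and checks the interrupted flag before each
// element, so a running task can stop early when its job is killed.
class InterruptibleIterator[T](context: TaskContext, delegate: Iterator[T])
    extends Iterator[T] {
  override def hasNext: Boolean = {
    if (context.interrupted) {
      throw new InterruptedException("task killed")
    }
    delegate.hasNext
  }
  override def next(): T = delegate.next()
}

object InterruptExample {
  def main(args: Array[String]): Unit = {
    val ctx = new TaskContext
    val it = new InterruptibleIterator(ctx, Iterator.from(1))
    println(it.hasNext) // true: proceeds normally while not interrupted
    ctx.interrupted = true
    // it.hasNext would now throw InterruptedException, ending the task.
  }
}
```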

- Oct 14, 2013
Reynold Xin authored
Conflicts: core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala
Reynold Xin authored
Refactor BlockId into an actual type

Converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now:
+ Type safety
+ Code clarity - it's now obvious what the key of a shuffle or rdd block is, for instance. Additionally, appearing in tuple/map type signatures is a big readability bonus. A Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types.
+ Explicit usage - we can now formally tell where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer process, and compiler-supported. (I'm looking at you, shuffle file consolidation.)
+ It will only get harder to make this change as time goes on.

Downside is, of course, that this is a very invasive change touching a lot of different files, which will inevitably lead to merge conflicts for many.
Aaron Davidson authored

- Oct 13, 2013
Aaron Davidson authored
Aaron Davidson authored
Aaron Davidson authored
This is an unfortunately invasive change which converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now:
+ Type safety
+ Code clarity - it's now obvious what the key of a shuffle or rdd block is, for instance. Additionally, appearing in tuple/map type signatures is a big readability bonus. A Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types.
+ Explicit usage - we can now formally tell where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer process, and compiler-supported. (I'm looking at you, shuffle file consolidation.)
+ It will only get harder to make this change as time goes on.

Since this touches a lot of files, it'd be best to either get this patch in quickly or throw it on the ground to avoid too many secondary merge conflicts.
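A minimal sketch of the typed hierarchy this refactor describes: a sealed BlockId with case classes for the block kinds mentioned above. The field lists and string formats are assumptions for illustration, not the exact Spark definitions:

```
// Illustrative sketch: typed block ids instead of raw strings. The names
// echo the block kinds mentioned above; exact fields are assumptions.
sealed abstract class BlockId {
  def name: String // the old string form, for storage-layer compatibility
}

case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
  def name: String = s"rdd_${rddId}_$splitIndex"
}

case class ShuffleBlockId(shuffleId: Int, mapId: Int, reduceId: Int) extends BlockId {
  def name: String = s"shuffle_${shuffleId}_${mapId}_$reduceId"
}

object BlockIdExample {
  // Pattern matching on block kinds, which raw strings could not support.
  def describe(id: BlockId): String = id match {
    case RDDBlockId(rdd, split)  => s"partition $split of RDD $rdd"
    case ShuffleBlockId(s, m, r) => s"shuffle $s output from map $m for reduce $r"
  }

  def main(args: Array[String]): Unit = {
    // Seq[(BlockId, String)] is clearer than Seq[(String, String)].
    val statuses: Seq[(BlockId, String)] =
      Seq(RDDBlockId(1, 0) -> "cached", ShuffleBlockId(2, 3, 4) -> "on disk")
    statuses.foreach { case (id, st) => println(s"${id.name}: ${describe(id)} ($st)") }
  }
}
```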

- Oct 12, 2013
Reynold Xin authored
Add an optional closure parameter to HadoopRDD instantiation to use when creating local JobConfs. Having HadoopRDD accept this optional closure eliminates the need for the HadoopFileRDD added earlier. It makes the HadoopRDD more general, in that the caller can specify any JobConf initialization flow.
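A sketch of the pattern this commit describes: the caller supplies a closure that initializes each locally created JobConf, so any initialization flow can be plugged in. The class name and constructor signature here are illustrative, not the exact ones added to HadoopRDD:

```
import org.apache.hadoop.mapred.JobConf

// Illustrative RDD-like class: the caller passes an optional closure that
// runs against each locally created JobConf, replacing the need for a
// separate file-specific subclass.
class MyHadoopRDD(initLocalJobConfFuncOpt: Option[JobConf => Unit]) {
  def getJobConf(): JobConf = {
    val conf = new JobConf() // in Spark this starts from a broadcast conf
    initLocalJobConfFuncOpt.foreach(f => f(conf)) // caller-specified init flow
    conf
  }
}

object HadoopRDDClosureExample {
  def main(args: Array[String]): Unit = {
    // e.g. a closure that sets the input path on each local JobConf
    val rdd = new MyHadoopRDD(
      Some((conf: JobConf) => conf.set("mapred.input.dir", "/data/input")))
    println(rdd.getJobConf().get("mapred.input.dir"))
  }
}
```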
Harvey Feng authored
Reynold Xin authored
Reynold Xin authored

- Oct 11, 2013
Reynold Xin authored
Reynold Xin authored
Reynold Xin authored
Remove unnecessary mutable imports

It appears that the imports aren't necessary here.
Reynold Xin authored
Matei Zaharia authored
Add a zookeeper compile dependency to fix build in maven
Matei Zaharia authored
Address review comments, move to incubator spark

Also includes a small fix to speculative execution. Continued from https://github.com/mesos/spark/pull/914
Reynold Xin authored
Reynold Xin authored
Neal Wiggins authored
LiGuoqiang authored
Reynold Xin authored
Conflicts: core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
Reynold Xin authored
Reynold Xin authored
Added comprehensive tests for job cancellation in a variety of environments (local vs cluster, fifo vs fair).
Reynold Xin authored

- Oct 10, 2013
Matei Zaharia authored
Closes #11

Conflicts:
docs/running-on-yarn.md
yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
Reynold Xin authored
Reynold Xin authored
Matei Zaharia authored
Standalone Scheduler fault tolerance using ZooKeeper

This patch implements full distributed fault tolerance for standalone scheduler Masters. There is only one master Leader at a time, which is actively serving scheduling requests. If this Leader crashes, another master will eventually be elected, reconstruct the state from the first Master, and continue serving scheduling requests.

Leader election is performed using the ZooKeeper leader election pattern. We try to minimize the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of retries and session monitoring on top of the ZooKeeper client. Master failover follows directly from the single-node Master recovery via the file system (patch d5a96fec), save that the Master state is stored in ZooKeeper instead.

Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE). By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled. By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory to an appropriate directory accessible by the Master, we will keep the behavior from d5a96fec.

Additionally, places where a Master could be specified by a spark:// url can now take comma-delimited lists to specify backup masters. Note that this is only used for registration of NEW Workers and application Clients. Once a Worker or Client has registered with the Master Leader, it is "in the system" and will never need to register again.
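A hedged illustration of the configuration described above. The property keys and values come from the commit message itself; setting them through Java system properties, and the hostnames, are assumptions made for the example:

```
// Illustration only: keys/values are from the commit message; whether they
// are set via system properties (as here) or elsewhere is an assumption.
object RecoveryModeExample {
  def main(args: Array[String]): Unit = {
    // ZooKeeper-backed recovery: elect a Leader among standalone Masters.
    System.setProperty("spark.deploy.recoveryMode", "ZOOKEEPER")
    System.setProperty("spark.deploy.zookeeper.url", "zk1:2181,zk2:2181,zk3:2181")

    // Alternative: single-node file-system recovery, as in d5a96fec.
    // System.setProperty("spark.deploy.recoveryMode", "FILESYSTEM")
    // System.setProperty("spark.deploy.recoveryDirectory", "/var/spark/recovery")

    // New Workers and Clients can list backup masters, comma-delimited:
    val masterUrl = "spark://master1:7077,master2:7077"
    println(s"Registering against: $masterUrl")
  }
}
```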
Harvey Feng authored
Add an optional closure parameter to HadoopRDD instantiation to use when creating any local JobConfs.
Aaron Davidson authored
Matei Zaharia authored
A fast and low-memory append-only map for shuffle operations

This is a continuation of the old repo's pull request https://github.com/mesos/spark/pull/823 to add a more efficient hashmap class for shuffles. I've optimized and tested this more thoroughly now so I think it's good to go. I've also addressed some of the comments that were outstanding there.

The idea is to reduce the cost of shuffles by taking advantage of the properties their hashmaps need. In particular, the hashmaps there are append-only, and a common operation is updating a key's value based on the old value. The included AppendOnlyMap class uses open hashing to use less space than Java's (by not having a linked list per bucket), does not support deletes, and has a changeValue operation to update a key in place without following the hash chain twice. In micro-benchmarks against java.util.HashMap and scala.collection.mutable.HashMap, this is 20-30% smaller and 10-40% faster depending on the number and type of keys. It's also noticeably faster than fastutil's Object2ObjectOpenHashMap.

I've also tested this in Spark apps now. While the speed gain is modest (partly due to other overheads, like serialization), there is some, and I think the lower memory usage is worth it. Here's one example where the speedup is most noticeable, in spark-shell on local mode:

```
scala> val nums = sc.parallelize(1 to 8).flatMap(x => (1 to 5e6.toInt)).cache
scala> nums.count
scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _).count) / 1000.0)
```

This prints the following times before and after this change:

```
Before: Vector(4.368, 2.635, 2.549, 2.522, 2.233, 2.222, 2.214, 2.195)
After: Vector(3.588, 1.741, 1.706, 1.648, 1.777, 1.81, 1.776, 1.731)
```

I've also run the spark-perf suite, enhanced with some tests that use Ints (https://github.com/amplab/spark-perf/pull/9), and it shows some speedup on those, but less on the string ones (presumably due to existing overhead): https://gist.github.com/mateiz/6897121.
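A minimal sketch of an append-only open-addressing map with a changeValue operation, in the spirit of the AppendOnlyMap described above. Fixed capacity and linear probing keep it short; the real class also grows and rehashes, handles null keys, and is heavily optimized, so treat this as an illustration of the idea only:

```
// Sketch of an append-only open-addressing hash map. No per-bucket linked
// lists, no deletes, and changeValue updates a key in a single probe pass.
class SimpleAppendOnlyMap[K <: AnyRef, V](capacity: Int = 64) {
  private val keys = new Array[AnyRef](capacity)
  private val values = new Array[Any](capacity)

  // Linear probing: stop at the key's slot or the first empty slot.
  // The caller must not insert more than `capacity` distinct keys.
  private def slotFor(key: K): Int = {
    var i = (key.hashCode & 0x7fffffff) % capacity
    while (keys(i) != null && keys(i) != key) {
      i = (i + 1) % capacity
    }
    i
  }

  def update(key: K, value: V): Unit = {
    val i = slotFor(key)
    keys(i) = key
    values(i) = value
  }

  // Update a key's value from its old value in one probe sequence, instead
  // of a get followed by a put that would follow the chain twice.
  def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
    val i = slotFor(key)
    val hadValue = keys(i) != null
    val newValue = updateFunc(hadValue, values(i).asInstanceOf[V])
    keys(i) = key
    values(i) = newValue
    newValue
  }
}

// Usage: accumulate counts per key, as a shuffle-side combiner might.
object AppendOnlyMapExample {
  def main(args: Array[String]): Unit = {
    val counts = new SimpleAppendOnlyMap[String, Int]()
    for (word <- Seq("a", "b", "a")) {
      counts.changeValue(word, (had, old) => if (had) old + 1 else 1)
    }
    println(counts.changeValue("a", (had, old) => old)) // prints 2
  }
}
```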
Reynold Xin authored
Matei Zaharia authored
Conflicts: core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
Reynold Xin authored