- Nov 11, 2013
Ankur Dave authored
Matei Zaharia authored
add tachyon module
- Nov 10, 2013
Haoyuan Li authored
Matei Zaharia authored
Three Kryo-related changes:
1. Call Kryo setReferences before calling the user-specified Kryo registrator, so that the user's registrator can override the default setting.
2. Register more internal classes (MapStatus, BlockManagerId).
3. Slightly refactored the internal class registration to allocate less memory.
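A minimal sketch of point 1 above, assuming a hypothetical `newKryo()` factory (the `Kryo` and `KryoRegistrator` APIs are real; the factory and the default value shown are illustrative): Spark applies its defaults and internal registrations first, then invokes the user's registrator, so the user can override them.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// A user-supplied registrator can flip reference tracking back on because it
// runs *after* Spark's default call to setReferences.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.setReferences(true)                // overrides the default set below
    kryo.register(classOf[Array[Double]])   // app-specific classes
  }
}

// Hypothetical factory illustrating the call order described in the commit.
def newKryo(userRegistrator: Option[KryoRegistrator]): Kryo = {
  val kryo = new Kryo()
  kryo.setReferences(false)                 // Spark-side default set first (illustrative value)
  // ... Spark-internal classes (MapStatus, BlockManagerId, ...) registered here ...
  userRegistrator.foreach(_.registerClasses(kryo))  // user registrator runs last
  kryo
}
```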
Reynold Xin authored
Moved the Spark internal class registration for Kryo into an object, and added more classes (e.g. MapStatus, BlockManagerId) to the registration.
Haoyuan Li authored
Reynold Xin authored
- Nov 09, 2013
Matei Zaharia authored
Add spark-tools assembly to spark-class's classpath. This commit adds an assembly for `spark-tools` and adds it to `spark-class`'s classpath, allowing the JavaAPICompletenessChecker to be run against Spark 0.8+ with `./spark-class org.apache.spark.tools.JavaAPICompletenessChecker`. Previously, this tool was run through the `run` script. I chose to add this to `run-example` because I didn't want to duplicate code in a `run-tool` script.
Matei Zaharia authored
Replace the thread inside ClusterScheduler.start() with an Akka scheduler. Threads are precious resources, so we shouldn't abuse them.
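A rough sketch of the pattern, not the actual ClusterScheduler code (the `schedule` call shown is the Akka 2.x variant; exact signatures vary by Akka version): the periodic work is handed to the existing ActorSystem's scheduler instead of a dedicated, mostly-sleeping thread.

```scala
import akka.actor.ActorSystem
import scala.concurrent.duration._

// Fire `check()` every `intervalMs` milliseconds using the ActorSystem's
// scheduler, rather than spawning a thread that loops over Thread.sleep.
def startPeriodicCheck(system: ActorSystem, intervalMs: Long)(check: () => Unit) {
  import system.dispatcher  // execution context for the scheduled task
  system.scheduler.schedule(intervalMs.milliseconds, intervalMs.milliseconds) {
    check()
  }
}
```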
Reynold Xin authored
Don't reset job group when a new job description is set.
Reynold Xin authored
Matei Zaharia authored
Fix secure HDFS access for Spark on YARN. https://github.com/apache/incubator-spark/pull/23 broke secure HDFS access. Not sure if it works with secure HDFS in standalone mode; fixing it at least for Spark on YARN. The broadcasting-of-jobconf change also broke secure HDFS access, as it didn't take into account code paths that call getPartitions before the SparkContext is initialized. The DAGScheduler does this when it tries to getShuffleMapStage.
Josh Rosen authored
This allows the JavaAPICompletenessChecker to be run with Spark 0.8+.
Matei Zaharia authored
Propagate SparkContext local properties from spark-repl caller thread to the repl execution thread.
soulmachine authored
Reynold Xin authored
Propagate the SparkContext local property from the thread that calls the spark-repl to the actual execution thread.
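A small illustrative sketch of the local-properties propagation described above, not the repl's actual wiring, using the public `getLocalProperty`/`setLocalProperty` API. SparkContext local properties are per-thread, so they must be copied explicitly from the caller thread to the thread that executes the interpreted line; the class name here is hypothetical.

```scala
import org.apache.spark.SparkContext

// Snapshot selected local properties on the calling thread, then re-apply them
// on whatever thread ends up running the work.
class LocalPropertyPropagator(sc: SparkContext, keys: Seq[String]) {
  // captured on the caller thread at construction time
  private val snapshot = keys.map(k => k -> sc.getLocalProperty(k))

  def applyOnCurrentThread() {
    snapshot.foreach { case (k, v) => if (v != null) sc.setLocalProperty(k, v) }
  }
}

// usage sketch: construct on the caller thread, then call applyOnCurrentThread()
// at the start of the repl execution thread before running the interpreted line
```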
- Nov 08, 2013
Aaron Davidson authored
tgravescs authored
tgravescs authored
tgravescs authored
- Nov 07, 2013
Reynold Xin authored
Include appId in executor cmd line args. Add the appId back into the executor cmd line args. I also made a pretty lame regression test, just to make sure it doesn't get dropped in the future. Not sure it will run on the build server, though, because `ExecutorRunner.buildCommandSeq()` expects to be able to run the scripts in `bin`.
Imran Rashid authored
Imran Rashid authored
Reynold Xin authored
Add Spark multi-user support for standalone mode and Mesos. This PR adds multi-user support for Spark in both standalone mode and Mesos (coarse- and fine-grained) mode: the user can specify the user name that submits the app through the environment variable `SPARK_USER`, or use the default one. The executor will communicate with Hadoop using the specified user name. I also fixed one bug in JobLogger that occurred when a different user wrote the job log to a specified folder without the right file permissions. I separated the previous [PR750](https://github.com/mesos/spark/pull/750) into two PRs; in this PR I only solve the multi-user support problem. I will try to solve the security auth problem in a subsequent PR, because security auth is a complicated problem, especially for a long-running app like Shark Server (both the Kerberos TGT and the HDFS delegation token should be renewed or re-created throughout the app's run time).
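A hedged sketch of the mechanism described above, not Spark's exact code path: Hadoop-facing work is wrapped in a `UserGroupInformation.doAs` for the user named by `SPARK_USER`, falling back to the current OS user. The helper name is made up for illustration.

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Run `body` as the user given by SPARK_USER so that HDFS and other Hadoop
// services see that identity instead of the daemon's OS user.
def runAsSparkUser[T](body: () => T): T = {
  val user = Option(System.getenv("SPARK_USER")).getOrElse(System.getProperty("user.name"))
  val ugi = UserGroupInformation.createRemoteUser(user)
  ugi.doAs(new PrivilegedExceptionAction[T] {
    override def run(): T = body()
  })
}
```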
Imran Rashid authored
- Nov 06, 2013
jerryshao authored
Reynold Xin authored
Removed unused return value in SparkContext.runJob. The return type of this `runJob` version is `Unit`:

    def runJob[T, U: ClassManifest](
        rdd: RDD[T],
        func: (TaskContext, Iterator[T]) => U,
        partitions: Seq[Int],
        allowLocal: Boolean,
        resultHandler: (Int, U) => Unit) { ... }

It's obviously unnecessary to "return" `result`.
Reynold Xin authored
Attempt to fix SparkListenerSuite breakage. Could not reproduce locally, but this test could've been flaky if the build machine was too fast, due to a typo. (Index 0 is intentionally slowed down to ensure the total time is >= 1 ms.) This should be merged into branch-0.8 as well.
Aaron Davidson authored
Could not reproduce locally, but this test could've been flaky if the build machine was too fast.
Lian, Cheng authored
Reynold Xin authored
Ignore a task status update if the executor doesn't exist anymore. Otherwise, if the scheduler receives a task update message after the executor has been removed, the scheduler would hang. It is pretty hard to add unit tests for this right now because it is hard to mock the cluster scheduler. We should do that once @kayousterhout finishes merging the local scheduler and the cluster scheduler.
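A toy sketch of the guard described above (the names are illustrative, not the cluster scheduler's actual internals): a status update for an executor that has already been removed is dropped instead of being processed against bookkeeping that no longer exists.

```scala
// Drop stale updates early; processing them would touch per-executor state
// that was cleaned up when the executor was removed.
def statusUpdate(executorId: String, taskId: Long,
                 activeExecutors: collection.Set[String]) {
  if (!activeExecutors.contains(executorId)) {
    return  // executor gone: ignore the update rather than hang the scheduler
  }
  // ... normal handling of the task state update ...
}
```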
- Nov 05, 2013
Reynold Xin authored
Reynold Xin authored
Use case class deep matching to simplify code in DAGScheduler.processEvent. Since all `XxxEvent`s pushed into `DAGScheduler.eventQueue` are case classes, deep pattern matching is a more convenient way to extract the components of an event object.
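A toy illustration of the pattern, with event types modeled loosely after (not copied from) the DAGScheduler's: matching on the case class constructor binds its fields directly, instead of extracting them with accessors after a type test.

```scala
// Deep matching: each case both checks the event type and extracts its fields.
sealed trait DAGEvent
case class JobSubmitted(jobId: Int, partitions: Seq[Int]) extends DAGEvent
case class TaskCompleted(jobId: Int, taskId: Long) extends DAGEvent

def processEvent(event: DAGEvent): String = event match {
  case JobSubmitted(jobId, partitions) =>
    "job " + jobId + " over " + partitions.size + " partitions"
  case TaskCompleted(jobId, taskId) =>
    "task " + taskId + " of job " + jobId + " finished"
}
```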
Lian, Cheng authored
- Nov 04, 2013
Reynold Xin authored
Never store shuffle blocks in BlockManager. After the BlockId refactor (PR #114), it became very clear that ShuffleBlocks are of no use within BlockManager (they had a no-arg constructor!). This patch completely eliminates them, saving us around 100-150 bytes per shuffle block. The total, system-wide overhead per shuffle block is now a flat 8 bytes, excluding state saved by the MapOutputTracker. Note: this should *not* be merged directly into 0.8.0 -- see #138.
Aaron Davidson authored
After the BlockId refactor (PR #114), it became very clear that ShuffleBlocks are of no use within BlockManager (they had a no-arg constructor!). This patch completely eliminates them, saving us around 100-150 bytes per shuffle block. The total, system-wide overhead per shuffle block is now a flat 8 bytes, excluding state saved by the MapOutputTracker.
Reynold Xin authored
Add javadoc to JobLogger, and some small fixes against SPARK-941: add javadoc to JobLogger, output more info for RDD, and modify recordStageDepGraph to avoid outputting duplicate stage dependency information. (cherry picked from commit 518cf22e) Signed-off-by: Reynold Xin <rxin@apache.org>
Reynold Xin authored
Memory-optimized shuffle file consolidation. Reduces the overhead of each shuffle block for consolidation from >300 bytes to 8 bytes (1 primitive Long). Verified via profiler testing with 1 million shuffle blocks; the net overhead was ~8,400,000 bytes. Despite the memory-optimized implementation incurring extra CPU overhead, the runtime of the shuffle phase in this test was only around 2% slower, while the reduce phase was 40% faster, compared to not using any shuffle file consolidation.

This is accomplished by replacing the map from ShuffleBlockId to FileSegment (i.e., block id to where it's located), which had high overhead due to being a gigantic, timestamped, concurrent map, with a more space-efficient structure. Namely, the following are introduced (I have omitted the word "Shuffle" from some names for clarity):

**ShuffleFile** - there is one ShuffleFile per consolidated shuffle file on disk. We store an array of offsets into the physical shuffle file for each ShuffleMapTask that wrote into the file. This is sufficient to reconstruct FileSegments for mappers that are in the file.

**FileGroup** - contains a set of ShuffleFiles, one per reducer, that a MapTask can use to write its output. There is one FileGroup created per _concurrent_ MapTask. The FileGroup contains an array of the mapIds that have been written to all files in the group. The positions of elements in this array map directly onto the positions in each ShuffleFile's offsets array.

In order to locate the FileSegment associated with a BlockId, we have another structure which maps each reducer to the set of ShuffleFiles that were created for it. (There will be as many ShuffleFiles per reducer as there are FileGroups.) To look up a given ShuffleBlockId (shuffleId, reducerId, mapId), we thus search through all ShuffleFiles associated with that reducer.

As a time optimization, we ensure that FileGroups are only reused for MapTasks with monotonically increasing mapIds. This allows us to perform a binary search to locate a mapId inside a group, and also enables a potential future optimization (based on the usual monotonic access order).
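A simplified model of the lookup path described above; the class and field names paraphrase the description rather than Spark's actual source. Each group records the mapIds it holds (in increasing order) and, for each reducer, the byte offsets at which each map task's output begins, which is enough to rebuild a FileSegment.

```scala
import java.util.Arrays

case class FileSegment(path: String, offset: Long, length: Long)

// One consolidated file per reducer: offsets(i) is where the i-th map task's
// output begins inside this file.
class ConsolidatedFile(val path: String, val offsets: Array[Long], val fileLength: Long) {
  def segmentAt(i: Int): FileSegment = {
    val end = if (i + 1 < offsets.length) offsets(i + 1) else fileLength
    FileSegment(path, offsets(i), end - offsets(i))
  }
}

// One group per concurrent map task slot: mapIds are appended in increasing
// order, and their positions line up with each file's offsets array.
class FileGroup(val mapIds: Array[Int], val filePerReducer: Array[ConsolidatedFile]) {
  def segmentFor(mapId: Int, reducerId: Int): Option[FileSegment] = {
    val i = Arrays.binarySearch(mapIds, mapId)   // valid because mapIds are sorted
    if (i >= 0) Some(filePerReducer(reducerId).segmentAt(i)) else None
  }
}

// Lookup for a shuffle block (reducerId, mapId): scan the groups created for
// this shuffle until one of them contains the mapId.
def locate(groups: Seq[FileGroup], mapId: Int, reducerId: Int): Option[FileSegment] =
  groups.iterator.map(_.segmentFor(mapId, reducerId)).collectFirst { case Some(seg) => seg }
```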
Aaron Davidson authored
Aaron Davidson authored
- ShuffleBlocks has been removed and replaced by ShuffleWriterGroup.
- ShuffleWriterGroup no longer contains a reference to a ShuffleFileGroup.
- ShuffleFile has been removed and its contents are now within ShuffleFileGroup.
- ShuffleBlockManager.forShuffle has been replaced by a more stateful forMapTask.