- Nov 06, 2013
-
-
Dan Crankshaw authored
-
- Nov 05, 2013
-
-
Joey authored
Merge Spark master into graphx
-
- Nov 04, 2013
-
-
Reynold Xin authored
Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala
-
Reynold Xin authored
Memory-optimized shuffle file consolidation

Reduces overhead of each shuffle block for consolidation from >300 bytes to 8 bytes (1 primitive Long). Verified via profiler testing with 1 million shuffle blocks: net overhead was ~8,400,000 bytes. Despite the memory-optimized implementation incurring extra CPU overhead, the runtime of the shuffle phase in this test was only around 2% slower, while the reduce phase was 40% faster, compared to not using any shuffle file consolidation.

This is accomplished by replacing the map from ShuffleBlockId to FileSegment (i.e., block id to where it's located), which had high overhead due to being a gigantic, timestamped, concurrent map, with a more space-efficient structure. Namely, the following are introduced (the word "Shuffle" is omitted from some names for clarity):

**ShuffleFile** - there is one ShuffleFile per consolidated shuffle file on disk. We store an array of offsets into the physical shuffle file for each ShuffleMapTask that wrote into the file. This is sufficient to reconstruct FileSegments for mappers that are in the file.

**FileGroup** - contains a set of ShuffleFiles, one per reducer, that a MapTask can use to write its output. There is one FileGroup created per _concurrent_ MapTask. The FileGroup contains an array of the mapIds that have been written to all files in the group. The positions of elements in this array map directly onto the positions in each ShuffleFile's offsets array.

In order to locate the FileSegment associated with a BlockId, we have another structure which maps each reducer to the set of ShuffleFiles that were created for it. (There will be as many ShuffleFiles per reducer as there are FileGroups.) To look up a given ShuffleBlockId (shuffleId, reducerId, mapId), we thus search through all ShuffleFiles associated with that reducer.

As a time optimization, we ensure that FileGroups are only reused for MapTasks with monotonically increasing mapIds. This allows us to perform a binary search to locate a mapId inside a group, and also enables potential future optimization (based on the usual monotonic access order).
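The offsets-array scheme described above can be sketched as follows. This is a minimal illustration under assumed names (FileSegmentInfo, ConsolidatedFile, FileGroupIndex are hypothetical, not Spark's actual classes): each consolidated file keeps one Long offset per mapper position, and the group keeps its mapIds sorted so lookup is a binary search.

```scala
import java.util.Arrays

// Hypothetical stand-in for a FileSegment: where a block lives in a file.
case class FileSegmentInfo(offset: Long, length: Long)

// One consolidated shuffle file: stores only one Long offset per mapper
// position, which is where the ~8 bytes per shuffle block figure comes from.
class ConsolidatedFile(fileLength: Long) {
  private var offsets = Vector.empty[Long]
  def recordMapOutput(offset: Long): Unit = { offsets = offsets :+ offset }
  // Reconstruct the segment for the mapper at position `pos` in the group's
  // mapId array: its data spans [offsets(pos), next offset or end of file).
  def segment(pos: Int): FileSegmentInfo = {
    val start = offsets(pos)
    val end = if (pos + 1 < offsets.length) offsets(pos + 1) else fileLength
    FileSegmentInfo(start, end - start)
  }
}

// Because groups are only reused for tasks with larger mapIds, the group's
// mapId array is monotonically increasing, and locating a mapId is a
// binary search rather than a hash lookup per block.
class FileGroupIndex(mapIds: Array[Int]) {
  def positionOf(mapId: Int): Option[Int] = {
    val i = Arrays.binarySearch(mapIds, mapId)
    if (i >= 0) Some(i) else None
  }
}
```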
-
Aaron Davidson authored
-
Aaron Davidson authored
- ShuffleBlocks has been removed and replaced by ShuffleWriterGroup.
- ShuffleWriterGroup no longer contains a reference to a ShuffleFileGroup.
- ShuffleFile has been removed and its contents are now within ShuffleFileGroup.
- ShuffleBlockManager.forShuffle has been replaced by a more stateful forMapTask.
-
Aaron Davidson authored
-
- Nov 03, 2013
-
-
Aaron Davidson authored
For some reason, even calling java.nio.file.Files.createTempDirectory(...).toFile.deleteOnExit() does not delete the directory on exit. Guava's analogous function seems to work, however.
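A likely explanation: File.deleteOnExit() only removes a directory if it is empty at JVM shutdown, so a temp dir that still holds files survives. Deleting recursively at shutdown does work. The helper below is an illustrative sketch of that technique, not Spark's or Guava's actual code:

```scala
import java.io.File
import java.nio.file.Files

// Recursively delete a directory tree; File.delete() alone refuses to
// remove a non-empty directory, which is why deleteOnExit() fails here.
def deleteRecursively(f: File): Unit = {
  Option(f.listFiles).foreach(_.foreach(deleteRecursively))
  f.delete()
}

val tempDir = Files.createTempDirectory("spark-local").toFile
// Arrange cleanup at shutdown regardless of the directory's contents.
sys.addShutdownHook(deleteRecursively(tempDir))
```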
-
Aaron Davidson authored
-
Aaron Davidson authored
-
Aaron Davidson authored
-
Aaron Davidson authored
-
Aaron Davidson authored
-
Aaron Davidson authored
-
Aaron Davidson authored
Overhead of each shuffle block for consolidation has been reduced from >300 bytes to 8 bytes (1 primitive Long). Verified via profiler testing with 1 million shuffle blocks: net overhead was ~8,400,000 bytes. Despite the memory-optimized implementation incurring extra CPU overhead, the runtime of the shuffle phase in this test was only around 2% slower, while the reduce phase was 40% faster, compared to not using any shuffle file consolidation.
-
Reynold Xin authored
Fast, memory-efficient hash set and hash table implementations optimized for primitive data types.

This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (3X faster; note that the current AppendOnlyMap is already much better than the Java map) while using much less space (1/4 of the space).

Details: This PR first adds an open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as a building block for more advanced structures. It is currently used to build the following two hash tables, but can be used in the future to build multi-valued hash tables as well (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.

Building on top of OpenHashSet, this PR adds two different hash table implementations:
1. OpenHashMap: for nullable keys, with optional specialization for primitive values
2. PrimitiveKeyOpenHashMap: for primitive keys that are not nullable, with optional specialization for primitive values

I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:
- Int to Int: ~30%
- java.lang.Integer to java.lang.Integer: ~100%
- Int to java.lang.Integer: ~50%
- java.lang.Integer to Int: ~85%
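The core idea can be sketched with a minimal open-addressing set for primitive Int keys using linear probing. This is an illustration of the technique, not Spark's OpenHashSet: the real implementation adds @specialized type parameters, rehashing on growth, and bitset bookkeeping. Capacity is a power of two so `hash & mask` replaces a modulo; there is no resizing here, so it assumes the element count stays well below capacity.

```scala
// Minimal open-addressing Int set with linear probing (illustrative only).
class IntOpenHashSet(capacity: Int = 64) {
  require((capacity & (capacity - 1)) == 0, "capacity must be a power of 2")
  private val mask = capacity - 1
  private val keys = new Array[Int](capacity)      // primitive int[], no boxing
  private val used = new Array[Boolean](capacity)  // slot-occupied flags
  private var count = 0
  def size: Int = count

  def add(k: Int): Unit = {
    var pos = k & mask                             // probe from the hashed slot
    while (used(pos) && keys(pos) != k) pos = (pos + 1) & mask
    if (!used(pos)) { used(pos) = true; keys(pos) = k; count += 1 }
  }

  def contains(k: Int): Boolean = {
    var pos = k & mask
    while (used(pos)) {
      if (keys(pos) == k) return true
      pos = (pos + 1) & mask                       // keep probing past collisions
    }
    false
  }
}
```

Because keys live in a flat `Array[Int]` rather than boxed entry objects, both the space savings and the speedup over a generic Java map follow directly from the layout.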
-
Reynold Xin authored
-
Reynold Xin authored
Also addressed Matei's code review comment.
-
Reynold Xin authored
-
- Nov 02, 2013
-
-
Reynold Xin authored
update default github
-
Reynold Xin authored
Fixed a typo in Hadoop version in README.
-
Reynold Xin authored
-
- Nov 01, 2013
-
-
Fabrizio (Misto) Milo authored
-
Reynold Xin authored
fix persistent-hdfs
-
Fabrizio (Misto) Milo authored
-
Matei Zaharia authored
Document & finish support for local: URIs

Documents all the supported URI schemes for addJar / addFile on the Cluster Overview page. Adds support for the local: URI to addFile.
-
Dan Crankshaw authored
-
Evan Chan authored
-
Evan Chan authored
-
- Oct 31, 2013
-
-
Reynold Xin authored
Switched VertexSetRDD and GraphImpl to use OpenHashSet
-
Joseph E. Gonzalez authored
-
Joseph E. Gonzalez authored
After some testing I realized that the IndexedSeq is still instantiating the array (not maintaining a view), so I have replaced all IndexedSeq[V] with (Int => V).
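The difference can be seen in a small sketch (illustrative names, assuming the commit's intent): mapping an IndexedSeq materializes every transformed element up front, while representing the values as a function `Int => V` defers the work until an index is actually accessed.

```scala
var evaluations = 0
val backing = Array(1, 2, 3, 4)

// Eager: the whole transformed collection is instantiated immediately,
// touching every element even if none is ever read.
val eager: IndexedSeq[Int] =
  backing.toIndexedSeq.map { v => evaluations += 1; v * 10 }

// Function-backed view: nothing is computed until an index is requested,
// and transformations compose without allocating an intermediate array.
val byIndex: Int => Int = i => { evaluations += 1; backing(i) * 10 }
```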
-
Joseph E. Gonzalez authored
-
Dan Crankshaw authored
-
Dan Crankshaw authored
-
Reynold Xin authored
Switched VertexSetRDD to use @rxin's BitSet and OpenHash
-
Joseph E. Gonzalez authored
Large parts of the VertexSetRDD were restructured to take advantage of:
1) the OpenHashSet as an index map
2) view-based lazy mapValues and mapValuesWithVertices
3) the cogroup code is currently disabled (since it is not used in any of the tests)

The GraphImpl was updated to also use the OpenHashSet and PrimitiveOpenHashMap wherever possible:
1) the LocalVidMaps (used to track replicated vertices) are now implemented using the OpenHashSet
2) an OpenHashMap is temporarily constructed to combine the local OpenHashSet with the local (replicated) vertex attribute arrays
3) because the OpenHashSet constructor grabs a class manifest, all operations that construct OpenHashSets have been moved to the GraphImpl singleton to prevent implicit variable capture within closures
-
Joseph E. Gonzalez authored
1) _keySet renamed to keySet
2) keySet and _values are made externally accessible
3) added an update function which merges duplicate values
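An update-that-merges-duplicates function in the spirit of item 3 can be sketched as follows. MergingMap is a hypothetical illustrative class backed by a plain mutable map, not the actual specialized hash map from the commit:

```scala
import scala.collection.mutable

// Sketch of an `update` that merges duplicate values: if the key already
// exists, combine the old and new values with mergeF; otherwise insert.
class MergingMap[V] {
  private val data = mutable.Map.empty[Int, V]
  def update(k: Int, v: V, mergeF: (V, V) => V): Unit =
    data(k) = data.get(k) match {
      case Some(old) => mergeF(old, v)
      case None      => v
    }
  def apply(k: Int): V = data(k)
}
```

This is the shape an Aggregator-style consumer wants: one call per incoming (key, value) pair, with the merge function deciding how duplicates combine.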
-
Dan Crankshaw authored
-