Commits · 384befb208837cf6c22a6d3959d0ecaff06f2b78 · cs525-sp18-g07 / spark

Nov 06, 2013
- Merge branch 'master' of github.com:amplab/graphx · 384befb2
  Dan Crankshaw authored 11 years ago
  
  384befb2
Nov 05, 2013
- Merge pull request #50 from amplab/mergemerge · ca44b513
  Joey authored 11 years ago
  
  Merge Spark master into graphx
  ca44b513
Nov 04, 2013

Merge branch 'master' of github.com:apache/incubator-spark into mergemerge · 551a43fd

Reynold Xin authored 11 years ago

Conflicts:
	README.md
	core/src/main/scala/org/apache/spark/util/collection/OpenHashMap.scala
	core/src/main/scala/org/apache/spark/util/collection/OpenHashSet.scala
	core/src/main/scala/org/apache/spark/util/collection/PrimitiveKeyOpenHashMap.scala

551a43fd

Merge pull request #130 from aarondav/shuffle · 7a26104a

Reynold Xin authored 11 years ago

Memory-optimized shuffle file consolidation

Reduces overhead of each shuffle block for consolidation from >300 bytes to 8 bytes (1 primitive Long). Verified via profiler testing with 1 million shuffle blocks: net overhead was ~8,400,000 bytes.

Despite the memory-optimized implementation incurring extra CPU overhead, the runtime of the shuffle phase in this test was only around 2% slower, while the reduce phase was 40% faster, when compared to not using any shuffle file consolidation.

This is accomplished by replacing the map from ShuffleBlockId to FileSegment (i.e., block id to where it's located), which had high overhead due to being a gigantic, timestamped, concurrent map, with a more space-efficient structure. Namely, the following are introduced (I have omitted the word "Shuffle" from some names for clarity):
**ShuffleFile** - there is one ShuffleFile per consolidated shuffle file on disk. We store an array of offsets into the physical shuffle file for each ShuffleMapTask that wrote into the file. This is sufficient to reconstruct FileSegments for mappers that are in the file.
**FileGroup** - contains a set of ShuffleFiles, one per reducer, that a MapTask can use to write its output. There is one FileGroup created per _concurrent_ MapTask. The FileGroup contains an array of the mapIds that have been written to all files in the group. The positions of elements in this array map directly onto the positions in each ShuffleFile's offsets array.

To locate the FileSegment associated with a BlockId, we have another structure which maps each reducer to the set of ShuffleFiles that were created for it. (There will be as many ShuffleFiles per reducer as there are FileGroups.) To look up a given ShuffleBlockId (shuffleId, reducerId, mapId), we thus search through all ShuffleFiles associated with that reducer.

As a time optimization, we ensure that FileGroups are only reused for MapTasks with monotonically increasing mapIds. This allows us to perform a binary search to locate a mapId inside a group, and also enables potential future optimization (based on the usual monotonic access order).
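
The lookup described above can be illustrated with a minimal sketch. It is not the code from the pull request: `FileGroupSketch`, `recordMapOutput`, and `getSegment` are hypothetical names, and the sketch assumes each map task appends its output to every reducer's file in the group. It shows how one `Long` offset per block, plus a sorted array of mapIds, is enough to reconstruct a segment by binary search.

```scala
import java.util.Arrays

// Hypothetical, simplified stand-in for the FileGroup described above.
// There is one offsets array per reducer file; mapIds(i) lines up with offsets(r)(i).
final class FileGroupSketch(numReducers: Int) {
  private var mapIds = new Array[Int](0)
  // offsets(r)(i) = byte offset in reducer r's file where mapIds(i) started writing
  private val offsets = Array.fill(numReducers)(new Array[Long](0))
  // current length in bytes of each reducer's file
  private val lengths = new Array[Long](numReducers)

  /** Record one finished map task: bytesPerReducer(r) bytes were appended to file r. */
  def recordMapOutput(mapId: Int, bytesPerReducer: Array[Long]): Unit = {
    require(bytesPerReducer.length == numReducers, "one size per reducer expected")
    require(mapIds.isEmpty || mapId > mapIds.last, "mapIds must be increasing")
    // Appending with :+ copies the array; a real implementation would use a growable vector.
    mapIds = mapIds :+ mapId
    for (r <- 0 until numReducers) {
      offsets(r) = offsets(r) :+ lengths(r)
      lengths(r) += bytesPerReducer(r)
    }
  }

  /** Returns (offset, length) of the segment for (mapId, reducerId), if this group holds it. */
  def getSegment(mapId: Int, reducerId: Int): Option[(Long, Long)] = {
    val i = Arrays.binarySearch(mapIds, mapId) // valid because mapIds is sorted
    if (i < 0) None
    else {
      val start = offsets(reducerId)(i)
      // A segment ends where the next map task's output begins, or at end of file.
      val end =
        if (i + 1 < mapIds.length) offsets(reducerId)(i + 1) else lengths(reducerId)
      Some((start, end - start))
    }
  }
}
```

In this sketch the length of a segment is never stored; it is derived from the next offset, which is why a single 8-byte value per block suffices.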

7a26104a

Minor cleanup in ShuffleBlockManager · 1ba11b1c
Aaron Davidson authored 11 years ago

1ba11b1c

Refactor ShuffleBlockManager to reduce public interface · 6201e5e2

Aaron Davidson authored 11 years ago

- ShuffleBlocks has been removed and replaced by ShuffleWriterGroup.
- ShuffleWriterGroup no longer contains a reference to a ShuffleFileGroup.
- ShuffleFile has been removed and its contents are now within ShuffleFileGroup.
- ShuffleBlockManager.forShuffle has been replaced by a more stateful forMapTask.

6201e5e2

Add javadoc and remove unused code · b0cf19fe
Aaron Davidson authored 11 years ago

b0cf19fe

Nov 03, 2013

Clean up test files properly · 39d93ed4

Aaron Davidson authored 11 years ago

For some reason, even calling
java.nio.Files.createTempDirectory().getFile.deleteOnExit()
does not delete the directory on exit. Guava's analogous function
seems to work, however.
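
The commit does not say why the directory survived. One possible explanation, stated here as an assumption, is that `java.io.File.delete` only removes a directory when it is empty, so a temp directory that still contains test files is left behind. The sketch below is not what the commit did (it switched to Guava); it shows an alternative that deletes the tree recursively from a shutdown hook. `TempDirSketch` and its methods are hypothetical names.

```scala
import java.io.IOException
import java.nio.file.{FileVisitResult, Files, Path, SimpleFileVisitor}
import java.nio.file.attribute.BasicFileAttributes

object TempDirSketch {
  /** Delete `root` and everything under it, children before parents. */
  def deleteRecursively(root: Path): Unit = {
    if (Files.exists(root)) {
      Files.walkFileTree(root, new SimpleFileVisitor[Path] {
        override def visitFile(file: Path, attrs: BasicFileAttributes): FileVisitResult = {
          Files.delete(file)
          FileVisitResult.CONTINUE
        }
        override def postVisitDirectory(dir: Path, exc: IOException): FileVisitResult = {
          if (exc != null) throw exc
          Files.delete(dir) // the directory is empty by the time we get here
          FileVisitResult.CONTINUE
        }
      })
    }
  }

  /** Create a temp directory that is removed, with its contents, at JVM shutdown. */
  def createTempDir(prefix: String): Path = {
    val dir = Files.createTempDirectory(prefix)
    Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
      def run(): Unit = deleteRecursively(dir)
    }))
    dir
  }
}
```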

39d93ed4

use OpenHashMap, remove monotonicity requirement, fix failure bug · a0bb569a
Aaron Davidson authored 11 years ago

a0bb569a
Address Reynold's comments · 8703898d
Aaron Davidson authored 11 years ago

8703898d
Fix test breakage · 3ca52309
Aaron Davidson authored 11 years ago

3ca52309
Add documentation and address other comments · 1592adfa
Aaron Davidson authored 11 years ago

1592adfa
Fix weird bug with specialized PrimitiveVector · 7d44dec9
Aaron Davidson authored 11 years ago

7d44dec9
Address minor comments · 7453f311
Aaron Davidson authored 11 years ago

7453f311

Memory-optimized shuffle file consolidation · 84991a1b

Aaron Davidson authored 11 years ago

Overhead of each shuffle block for consolidation has been reduced from >300 bytes
to 8 bytes (1 primitive Long). Verified via profiler testing with 1 million shuffle blocks:
net overhead was ~8,400,000 bytes.

Despite the memory-optimized implementation incurring extra CPU overhead, the runtime
of the shuffle phase in this test was only around 2% slower, while the reduce phase
was 40% faster, when compared to not using any shuffle file consolidation.

84991a1b

Merge pull request #70 from rxin/hash1 · b5dc3393

Reynold Xin authored 11 years ago

Fast, memory-efficient hash set, hash table implementations optimized for primitive data types.

This pull request adds two hash table implementations optimized for primitive data types. For primitive types, the new hash tables are much faster than the current Spark AppendOnlyMap (3X faster; note that the current AppendOnlyMap is already much better than the Java map) while using much less space (1/4 of the space).

Details:

This PR first adds an open hash set implementation (OpenHashSet) optimized for primitive types (using Scala's specialization feature). This OpenHashSet is designed to serve as a building block for more advanced structures. It is currently used to build the following two hash tables, but can be used in the future to build multi-valued hash tables as well (GraphX has this use case). Note that there are some peculiarities in the code for working around some Scala compiler bugs.

Building on top of OpenHashSet, this PR adds two different hash table implementations:
1. OpenHashMap: for nullable keys, optional specialization for primitive values
2. PrimitiveKeyOpenHashMap: for primitive keys that are not nullable, and optional specialization for primitive values

I tested the update speed of these two implementations using the changeValue function (which is what Aggregator and cogroup would use). Runtime relative to AppendOnlyMap for inserting 10 million items:

Int to Int: ~30%
java.lang.Integer to java.lang.Integer: ~100%
Int to java.lang.Integer: ~50%
java.lang.Integer to Int: ~85%
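
To make the `changeValue` operation concrete, here is a minimal sketch of an open-addressing hash map. It is not the code from this pull request: `IntIntOpenHashMapSketch` is a hypothetical class fixed to `Int` keys and `Int` values (no specialization), it uses linear probing with a `Boolean` array to mark occupied slots, and the hash mixing and 0.7 load factor are arbitrary choices for illustration.

```scala
// Hypothetical sketch: Int keys and Int values only, linear probing,
// power-of-two capacity, no removal.
final class IntIntOpenHashMapSketch(initialCapacity: Int = 16) {
  require(initialCapacity > 0 && (initialCapacity & (initialCapacity - 1)) == 0,
    "capacity must be a power of two")

  private var keys = new Array[Int](initialCapacity)
  private var values = new Array[Int](initialCapacity)
  private var used = new Array[Boolean](initialCapacity) // marks occupied slots
  private var count = 0

  def size: Int = count

  // Cheap bit mixing so that consecutive keys do not cluster.
  private def scramble(k: Int): Int = { val h = k * 0x9E3779B9; h ^ (h >>> 16) }

  // Slot holding `k`, or the first free slot on its probe sequence.
  private def slotFor(k: Int): Int = {
    val mask = keys.length - 1
    var pos = scramble(k) & mask
    while (used(pos) && keys(pos) != k) pos = (pos + 1) & mask
    pos
  }

  def get(k: Int): Option[Int] = {
    val pos = slotFor(k)
    if (used(pos)) Some(values(pos)) else None
  }

  /** Insert `default` if `k` is absent, otherwise replace its value with merge(old). */
  def changeValue(k: Int, default: => Int, merge: Int => Int): Int = {
    val pos = slotFor(k)
    if (used(pos)) {
      values(pos) = merge(values(pos))
    } else {
      used(pos) = true; keys(pos) = k; values(pos) = default
      count += 1
    }
    val result = values(pos) // read before a possible rehash moves the slot
    if (count * 10 > keys.length * 7) grow()
    result
  }

  // Double the capacity and re-insert every occupied slot.
  private def grow(): Unit = {
    val (oldKeys, oldValues, oldUsed) = (keys, values, used)
    keys = new Array[Int](oldKeys.length * 2)
    values = new Array[Int](oldKeys.length * 2)
    used = new Array[Boolean](oldKeys.length * 2)
    var i = 0
    while (i < oldKeys.length) {
      if (oldUsed(i)) {
        val pos = slotFor(oldKeys(i))
        used(pos) = true; keys(pos) = oldKeys(i); values(pos) = oldValues(i)
      }
      i += 1
    }
  }
}
```

A counting aggregation would call `m.changeValue(key, 1, _ + 1)` once per record. Keys and values live in flat primitive arrays, so nothing is boxed and no per-entry object is allocated, which is the kind of layout the space and speed figures above depend on.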

b5dc3393

Code review feedback. · eb5f8a3f
Reynold Xin authored 11 years ago

eb5f8a3f
Fixed a bug that used twice the amount of memory for the primitive arrays due to a Scala compiler bug. · 1e9543b5
Reynold Xin authored 11 years ago

Also addressed Matei's code review comment.

1e9543b5
Merge branch 'master' into hash1 · da6bb0ae
Reynold Xin authored 11 years ago

da6bb0ae

Nov 02, 2013
- Merge pull request #133 from Mistobaan/link_fix · 41ead7a7
  Reynold Xin authored 11 years ago
  
  update default github
  41ead7a7
- Merge pull request #134 from rxin/readme · d407c073
  Reynold Xin authored 11 years ago
  
  Fixed a typo in Hadoop version in README.
  d407c073
- Fixed a typo in Hadoop version in README. · 895747bb
  Reynold Xin authored 11 years ago
  
  895747bb
Nov 01, 2013
- update default github · 4b5d61f3
  Fabrizio (Misto) Milo authored 11 years ago
  
  4b5d61f3
- Merge pull request #132 from Mistobaan/doc_fix · e7c7b804
  Reynold Xin authored 11 years ago
  
  fix persistent-hdfs
  e7c7b804
- fix persistent-hdfs · 3f89354c
  Fabrizio (Misto) Milo authored 11 years ago
  
  3f89354c
- Merge pull request #129 from velvia/2013-11/document-local-uris · d6d11c2e
  Matei Zaharia authored 11 years ago
  
  Document & finish support for local: URIs
  
  Review all the supported URI schemes for addJar / addFile to the Cluster Overview page. Add support for local: URI to addFile.
  d6d11c2e
- Merge branch 'master' of github.com:amplab/graphx · d87d112b
  Dan Crankshaw authored 11 years ago
  
  d87d112b
- Add local: URI support to addFile as well · f3679fd4
  Evan Chan authored 11 years ago
  
  f3679fd4
- Document all the URIs for addJar/addFile · e54a37fe
  Evan Chan authored 11 years ago
  
  e54a37fe
Oct 31, 2013

Merge pull request #46 from jegonzal/VertexSetWithHashSet · 99bfcc91
Reynold Xin authored 11 years ago

Switched VertexSetRDD and GraphImpl to use OpenHashSet

99bfcc91
Changing var to val for keySet in OpenHashMaps · db89ac4b
Joseph E. Gonzalez authored 11 years ago

db89ac4b

After some testing I realized that the IndexedSeq is still instantiating the... · e7d37472

Joseph E. Gonzalez authored 11 years ago

After some testing I realized that the IndexedSeq is still instantiating the array (not maintaining a view) so I have replaced all IndexedSeq[V] with (Int => V)
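
The difference between an eagerly built sequence and an `Int => V` accessor can be shown with a small sketch. `LazyValues` is a hypothetical class, not the GraphX code; it assumes values are addressed by slot index and shows that composing functions allocates no array until a value is actually read.

```scala
// Hypothetical stand-in for a value store addressed by slot index.
final class LazyValues[V](val size: Int, val at: Int => V) {
  /** No array is allocated here; `f` runs only when a slot is read. */
  def mapValues[U](f: V => U): LazyValues[U] =
    new LazyValues[U](size, i => f(at(i)))

  /** Materialize explicitly, once, when a concrete collection is needed. */
  def toVector: Vector[V] = Vector.tabulate(size)(at)
}
```

The trade-off in this sketch is that nothing is cached: every read re-applies the whole chain of functions, so a heavily read result should be materialized with `toVector`.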

e7d37472

renamed update to setMerge · 63311d9c
Joseph E. Gonzalez authored 11 years ago

63311d9c
Merge branch 'master' of github.com:amplab/graphx · e218e30b
Dan Crankshaw authored 11 years ago

e218e30b
Added logging to Graph, GraphLab, and Pregel. · 0a61cafb
Dan Crankshaw authored 11 years ago

0a61cafb
Merge branch 'master' of https://github.com/amplab/graphx into VertexSetWithHashSet · 7f584403
Joseph E. Gonzalez authored 11 years ago

7f584403
Merge pull request #44 from jegonzal/rxinBitSet · fcaaf868
Reynold Xin authored 11 years ago

Switching VertexSetRDD to use @rxin BitSet and OpenHash

fcaaf868

This commit introduces the OpenHashSet and OpenHashMap as indexing primitives. · 8381aeff

Joseph E. Gonzalez authored 11 years ago

Large parts of the VertexSetRDD were restructured to take advantage of:

1) the OpenHashSet as an index map
2) view based lazy mapValues and mapValuesWithVertices
3) the cogroup code is currently disabled (since it is not used in any of the tests)

The GraphImpl was updated to also use the OpenHashSet and PrimitiveKeyOpenHashMap
wherever possible:

1) the LocalVidMaps (used to track replicated vertices) are now implemented
using the OpenHashSet
2) an OpenHashMap is temporarily constructed to combine the local OpenHashSet
with the local (replicated) vertex attribute arrays
3) because the OpenHashSet constructor grabs a class manifest, all operations
that construct OpenHashSets have been moved to the GraphImpl singleton to prevent
implicit variable capture within closures.

8381aeff

This commit makes three changes to the (PrimitiveKey)OpenHashMap · 4ad58e2b

Joseph E. Gonzalez authored 11 years ago

  1) _keySet  --renamed--> keySet
  2) keySet and _values are made externally accessible
  3) added an update function which merges duplicate values
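
As an illustration of change 3, the sketch below shows what "an update function which merges duplicate values" could look like. The signature is an assumption, and a standard `mutable.Map` stands in for the (PrimitiveKey)OpenHashMap; `SetMergeSketch` is a hypothetical name.

```scala
import scala.collection.mutable

object SetMergeSketch {
  /** Store `v` under `k`, or combine it with the value already stored there. */
  def setMerge[K, V](m: mutable.Map[K, V], k: K, v: V)(merge: (V, V) => V): Unit =
    m.get(k) match {
      case Some(old) => m(k) = merge(old, v) // duplicate key: merge old and new
      case None      => m(k) = v             // first occurrence: plain insert
    }
}
```

For example, calling `SetMergeSketch.setMerge(m, vid, msg)(_ + _)` for every (vertex id, message) pair leaves one summed value per vertex id.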

4ad58e2b

Merge branch 'master' of github.com:amplab/graphx · b3bcfc09
Dan Crankshaw authored 11 years ago

b3bcfc09