-
- Downloads
[SPARK-1855] Local checkpointing
Certain use cases of Spark involve RDDs with long lineages that must be truncated periodically (e.g. GraphX). The existing way of doing it is through `rdd.checkpoint()`, which is expensive because it writes to HDFS. This patch provides an alternative to truncate lineages cheaply *without providing the same level of fault tolerance*. **Local checkpointing** writes checkpointed data to the local file system through the block manager. It is much faster than replicating to a reliable storage and provides the same semantics as long as executors do not fail. It is accessible through a new operator `rdd.localCheckpoint()` and leaves the old one unchanged. Users may even decide to combine the two and call the reliable one less frequently. The bulk of this patch involves refactoring the checkpointing interface to accept custom implementations of checkpointing. [Design doc](https://issues.apache.org/jira/secure/attachment/12741708/SPARK-7292-design.pdf). Author: Andrew Or <andrew@databricks.com> Closes #7279 from andrewor14/local-checkpoint and squashes the following commits: 729600f [Andrew Or] Oops, fix tests 34bc059 [Andrew Or] Avoid computing all partitions in local checkpoint e43bbb6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 3be5aea [Andrew Or] Address comments bf846a6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint ab003a3 [Andrew Or] Fix compile c2e111b [Andrew Or] Address comments 33f167a [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint e908a42 [Andrew Or] Fix tests f5be0f3 [Andrew Or] Use MEMORY_AND_DISK as the default local checkpoint level a92657d [Andrew Or] Update a few comments e58e3e3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 4eb6eb1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint 1bbe154 [Andrew Or] Simplify LocalCheckpointRDD 48a9996 [Andrew Or] Avoid traversing dependency tree + rewrite tests 62aba3f [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint db70dc2 [Andrew Or] Express local checkpointing through caching the original RDD 87d43c6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into local-checkpoint c449b38 [Andrew Or] Fix style 4a182f3 [Andrew Or] Add fine-grained tests for local checkpointing 53b363b [Andrew Or] Rename a few more awkwardly named methods (minor) e4cf071 [Andrew Or] Simplify LocalCheckpointRDD + docs + clean ups 4880deb [Andrew Or] Fix style d096c67 [Andrew Or] Fix mima 172cb66 [Andrew Or] Fix mima? e53d964 [Andrew Or] Fix style 56831c5 [Andrew Or] Add a few warnings and clear exception messages 2e59646 [Andrew Or] Add local checkpoint clean up tests 4dbbab1 [Andrew Or] Refactor CheckpointSuite to test local checkpointing 4514dc9 [Andrew Or] Clean local checkpoint files through RDD cleanups 0477eec [Andrew Or] Rename a few methods with awkward names (minor) 2e902e5 [Andrew Or] First implementation of local checkpointing 8447454 [Andrew Or] Fix tests 4ac1896 [Andrew Or] Refactor checkpoint interface for modularity
Showing
- core/src/main/scala/org/apache/spark/ContextCleaner.scala 6 additions, 3 deletionscore/src/main/scala/org/apache/spark/ContextCleaner.scala
- core/src/main/scala/org/apache/spark/SparkContext.scala 1 addition, 1 deletioncore/src/main/scala/org/apache/spark/SparkContext.scala
- core/src/main/scala/org/apache/spark/TaskContext.scala 8 additions, 0 deletionscore/src/main/scala/org/apache/spark/TaskContext.scala
- core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala 14 additions, 139 deletionscore/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
- core/src/main/scala/org/apache/spark/rdd/LocalCheckpointRDD.scala 67 additions, 0 deletions.../main/scala/org/apache/spark/rdd/LocalCheckpointRDD.scala
- core/src/main/scala/org/apache/spark/rdd/LocalRDDCheckpointData.scala 83 additions, 0 deletions...n/scala/org/apache/spark/rdd/LocalRDDCheckpointData.scala
- core/src/main/scala/org/apache/spark/rdd/RDD.scala 107 additions, 21 deletionscore/src/main/scala/org/apache/spark/rdd/RDD.scala
- core/src/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala 36 additions, 70 deletions...c/main/scala/org/apache/spark/rdd/RDDCheckpointData.scala
- core/src/main/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala 172 additions, 0 deletions...in/scala/org/apache/spark/rdd/ReliableCheckpointRDD.scala
- core/src/main/scala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala 108 additions, 0 deletions...cala/org/apache/spark/rdd/ReliableRDDCheckpointData.scala
- core/src/test/scala/org/apache/spark/CheckpointSuite.scala 101 additions, 63 deletionscore/src/test/scala/org/apache/spark/CheckpointSuite.scala
- core/src/test/scala/org/apache/spark/ContextCleanerSuite.scala 45 additions, 16 deletions...src/test/scala/org/apache/spark/ContextCleanerSuite.scala
- core/src/test/scala/org/apache/spark/rdd/LocalCheckpointSuite.scala 330 additions, 0 deletions...est/scala/org/apache/spark/rdd/LocalCheckpointSuite.scala
- project/MimaExcludes.scala 7 additions, 2 deletionsproject/MimaExcludes.scala
Loading
Please register or sign in to comment