[SPARK-5561] [MLLIB] Generalized PeriodicCheckpointer for RDDs and Graphs
PeriodicGraphCheckpointer was introduced for Latent Dirichlet Allocation (LDA), but it was meant to be generalized to work with Graphs, RDDs, and other data structures based on RDDs. This PR generalizes it.

For those who are not familiar with the periodic checkpointer, it tries to automatically handle persisting/unpersisting and checkpointing/removing checkpoint files in a lineage of RDD-based objects.

I need it generalized to use with GradientBoostedTrees [https://issues.apache.org/jira/browse/SPARK-6684]. It should be useful for other iterative algorithms as well.

Changes I made:
* Copied PeriodicGraphCheckpointer to PeriodicCheckpointer.
* Within PeriodicCheckpointer, I created abstract methods for the basic operations (checkpoint, persist, etc.).
* The subclasses for Graphs and RDDs implement those abstract methods.
* I copied the test suite for the graph checkpointer and made tiny modifications to make it work for RDDs.

To review this PR, I recommend doing 2 diffs:
(1) diff between the old PeriodicGraphCheckpointer.scala and the new PeriodicCheckpointer.scala
(2) diff between the 2 test suites

CCing andrewor14 in case there are relevant changes to checkpointing.
CCing feynmanliang in case you're interested in learning about checkpointing.
CCing mengxr for final OK.

Thanks all!

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7728 from jkbradley/gbt-checkpoint and squashes the following commits:

d41902c [Joseph K. Bradley] Oops, forgot to update an extra time in the checkpointer tests, after the last commit. I'll fix that. I'll also make some of the checkpointer methods protected, which I should have done before.
32b23b8 [Joseph K. Bradley] fixed usage of checkpointer in lda
0b3dbc0 [Joseph K. Bradley] Changed checkpointer constructor not to take initial data.
568918c [Joseph K. Bradley] Generalized PeriodicGraphCheckpointer to PeriodicCheckpointer, with subclasses for RDDs and Graphs.
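
For readers who want a feel for the structure described in this message (a base class with abstract methods for the basic operations, and one subclass per data type), here is a minimal, Spark-free sketch. It is not the code in this PR: the names `CheckpointerSketch`, `FakeData`, `FakeDataCheckpointer`, and `removeCheckpoint` are hypothetical, and the policy of keeping the two most recent items persisted and only the newest checkpoint is an assumption made for illustration.

```scala
import scala.collection.mutable

// Hypothetical, Spark-free illustration of the pattern; NOT the code in this PR.
abstract class CheckpointerSketch[T](val checkpointInterval: Int) {
  private val persistedQueue = mutable.Queue.empty[T]
  private val checkpointQueue = mutable.Queue.empty[T]
  private var updateCount = 0

  // Basic operations that each subclass supplies for its own data type.
  protected def persist(data: T): Unit
  protected def unpersist(data: T): Unit
  protected def checkpoint(data: T): Unit
  protected def removeCheckpoint(data: T): Unit

  /** Call once per iteration with the newly created data. */
  def update(newData: T): Unit = {
    persist(newData)
    persistedQueue.enqueue(newData)
    // Assumption: keep only the two most recent items persisted.
    while (persistedQueue.size > 2) {
      unpersist(persistedQueue.dequeue())
    }
    updateCount += 1
    if (updateCount % checkpointInterval == 0) {
      // Add the new checkpoint before removing older ones.
      checkpoint(newData)
      checkpointQueue.enqueue(newData)
      while (checkpointQueue.size > 1) {
        removeCheckpoint(checkpointQueue.dequeue())
      }
    }
  }
}

// Stand-in for an RDD or Graph: it only records which operations were applied.
final class FakeData(val id: Int) {
  var persisted = false
  var checkpointed = false
}

// A subclass implements the abstract operations for one concrete data type.
class FakeDataCheckpointer(interval: Int)
    extends CheckpointerSketch[FakeData](interval) {
  protected def persist(data: FakeData): Unit = { data.persisted = true }
  protected def unpersist(data: FakeData): Unit = { data.persisted = false }
  protected def checkpoint(data: FakeData): Unit = { data.checkpointed = true }
  protected def removeCheckpoint(data: FakeData): Unit = { data.checkpointed = false }
}

object CheckpointerSketchDemo {
  def main(args: Array[String]): Unit = {
    val checkpointer = new FakeDataCheckpointer(2)
    val items = (1 to 5).map(i => new FakeData(i))
    items.foreach(item => checkpointer.update(item))
    println("persisted: " + items.filter(_.persisted).map(_.id).mkString(","))
    println("checkpointed: " + items.filter(_.checkpointed).map(_.id).mkString(","))
  }
}
```

With an interval of 2 and five updates, this sketch leaves items 4 and 5 persisted and only item 4 checkpointed. In the real classes the operations would delegate to the corresponding RDD or Graph calls rather than flipping flags.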
Changed files:
- mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala (3 additions, 3 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicCheckpointer.scala (154 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointer.scala (16 additions, 89 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointer.scala (97 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicGraphCheckpointerSuite.scala (9 additions, 7 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/impl/PeriodicRDDCheckpointerSuite.scala (173 additions, 0 deletions)