[SPARK-10064] [ML] Parallelize decision tree bin split calculations
Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.

With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.

Author: Nathan Howell <nhowell@godaddy.com>

Closes #8246 from NathanHowell/SPARK-10064.
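The approach described in the commit message can be illustrated with a minimal sketch. This is not the code from the patch. It assumes rows are plain `Array[Double]` feature vectors, and the names `ParallelSplitsSketch`, `candidateThresholds`, and `findContinuousSplits` are hypothetical, as is the simple equal-stride threshold rule. The idea shown is that only the continuous feature columns are emitted as `(featureIndex, value)` pairs, grouped by feature, and reduced to candidate thresholds on the executors, so the driver collects only a small map.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ParallelSplitsSketch {

  // Hypothetical split rule: pick up to `numSplits` thresholds at an even
  // stride over the sorted distinct values of one feature.
  def candidateThresholds(values: Iterable[Double], numSplits: Int): Array[Double] = {
    val sortedDistinct = values.toArray.distinct.sorted
    if (sortedDistinct.length <= numSplits) {
      sortedDistinct
    } else {
      val stride = sortedDistinct.length.toDouble / (numSplits + 1)
      (1 to numSplits).map(i => sortedDistinct((i * stride).toInt)).toArray
    }
  }

  // Emit (featureIndex, value) pairs for continuous features only, group them
  // by feature, and compute thresholds for each feature in parallel.
  def findContinuousSplits(
      rows: RDD[Array[Double]],
      continuousFeatures: Seq[Int],
      numSplits: Int): scala.collection.Map[Int, Array[Double]] = {
    // No point in using more partitions than there are features to process.
    val numPartitions =
      math.max(1, math.min(continuousFeatures.length, rows.partitions.length))
    rows
      .flatMap(row => continuousFeatures.map(idx => (idx, row(idx))))
      .groupByKey(numPartitions)
      .mapValues(values => candidateThresholds(values, numSplits))
      .collectAsMap()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("splits-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Feature 0 is treated as categorical here and is never shuffled.
      val rows = sc.parallelize(Seq(
        Array(0.0, 1.0, 5.0),
        Array(1.0, 2.0, 5.0),
        Array(0.0, 3.0, 6.0),
        Array(1.0, 4.0, 7.0)))
      val splits = findContinuousSplits(rows, continuousFeatures = Seq(1, 2), numSplits = 2)
      splits.toSeq.sortBy(_._1).foreach { case (idx, thresholds) =>
        println(s"feature $idx: ${thresholds.mkString(", ")}")
      }
    } finally {
      sc.stop()
    }
  }
}
```

Grouping by feature index means each feature's values end up on a single executor, which is acceptable when there are many features but can become a bottleneck if one feature has a very large number of values per group.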
Changed files:
- mllib/src/main/scala/org/apache/spark/mllib/tree/DecisionTree.scala (86 additions, 78 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/tree/impl/NodeIdCache.scala (9 additions, 9 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/tree/DecisionTreeSuite.scala (0 additions, 6 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/tree/EnsembleTestHelper.scala (2 additions, 2 deletions)