Commit 1bc435ae authored 9 years ago by Nathan Howell Committed by Joseph K. Bradley 9 years ago

[SPARK-10064] [ML] Parallelize decision tree bin split calculations

Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.

With large feature spaces, the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features and performs the split calculations in parallel. On a real multi-terabyte dataset, it completes in less than a minute instead of multiple hours.
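The idea can be illustrated with a minimal, plain-Python sketch. This is not the Spark code from this commit: the names `find_splits_for_feature` and `find_splits`, the midpoint-based threshold rule, and the thread pool standing in for `RDD` parallelism are all assumptions made for illustration. It only shows the two points described above: values are gathered for continuous features only, and each feature's splits are computed independently of the others.

```python
from concurrent.futures import ThreadPoolExecutor


def find_splits_for_feature(values, num_splits):
    """Return at most num_splits thresholds for one continuous feature.

    Hypothetical rule for illustration: thresholds are midpoints between
    consecutive distinct values, thinned out if there are too many.
    """
    distinct = sorted(set(values))
    candidates = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    if len(candidates) <= num_splits:
        return candidates
    # More candidates than requested: pick them at a regular stride.
    stride = len(candidates) / (num_splits + 1)
    return [candidates[int((i + 1) * stride)] for i in range(num_splits)]


def find_splits(rows, continuous_features, num_splits, max_workers=4):
    """Compute splits per continuous feature, one independent task each."""
    # Gather values only for continuous features; categorical
    # features are never collected here.
    columns = {f: [row[f] for row in rows] for f in continuous_features}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            f: pool.submit(find_splits_for_feature, vals, num_splits)
            for f, vals in columns.items()
        }
        return {f: fut.result() for f, fut in futures.items()}


if __name__ == "__main__":
    # Column 1 is treated as categorical and is skipped.
    rows = [[1.0, 0, 10.0], [2.0, 1, 20.0], [3.0, 0, 30.0], [4.0, 1, 40.0]]
    print(find_splits(rows, continuous_features=[0, 2], num_splits=2))
```

In this sketch each feature is an independent unit of work, so the cost of the split calculation grows with the number of continuous features rather than with the full feature space.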

Author: Nathan Howell <nhowell@godaddy.com>

Closes #8246 from NathanHowell/SPARK-10064.

parent 075a0b65


Showing with 97 additions and 95 deletions
