[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames
## What changes were proposed in this pull request?

Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This fixes the division on line 114, which currently truncates to zero whenever `requiredSamples < dataset.count`.

## How was this patch tested?

Manual tests. I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected.

Author: Oliver Pierson <ocp@gatech.edu>
Author: Oliver Pierson <opierson@umd.edu>

Closes #11319 from oliverpierson/SPARK-13444.
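The truncation described above can be reproduced without Spark. The following is a minimal sketch, not the code from QuantileDiscretizer.scala: the object name, the method names, and the sample values for `numBins` and `totalRows` (standing in for `dataset.count`) are assumptions made for illustration. Only the two forms of the `requiredSamples` expression come from the description.

```scala
// Standalone sketch of the arithmetic only; no Spark dependency.
// Names and sample values are hypothetical, chosen for illustration.
object SampleFractionSketch {

  // Before the change: requiredSamples is an Int, so dividing it by a
  // Long row count is integer division and truncates toward zero.
  def fractionBefore(numBins: Int, totalRows: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    (requiredSamples / totalRows).toDouble                   // Long division, then widened
  }

  // After the change: the 10000.0 literal makes requiredSamples a Double,
  // so the division is floating-point and keeps the fractional part.
  def fractionAfter(numBins: Int, totalRows: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    requiredSamples / totalRows
  }

  def main(args: Array[String]): Unit = {
    val numBins = 10
    val totalRows = 1000000L // more rows than requiredSamples
    println(s"before: ${fractionBefore(numBins, totalRows)}")
    println(s"after:  ${fractionAfter(numBins, totalRows)}")
  }
}
```

With these sample values the "before" variant yields a sampling fraction of zero, while the "after" variant yields 0.01. A zero fraction would leave nothing to sample from, which is consistent with the bad splits reported on large DataFrames.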
- mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala (9 additions, 2 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala (20 additions, 0 deletions)