[SPARK-17219][ML] enhanced NaN value handling in Bucketizer
## What changes were proposed in this pull request?

This PR is an enhancement of the PR with commit ID 57dc326b.
NaN is a special value that is commonly treated as invalid, but we find that there are cases where NaN values are also meaningful and therefore need special handling. We provide users with three options for dealing with NaN values: reserve an extra bucket for them, remove them, or report an error, by setting handleNaN to "keep", "skip", or "error" (default), respectively.

'''Before:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
'''After:
val bucketizer: Bucketizer = new Bucketizer()
  .setInputCol("feature")
  .setOutputCol("result")
  .setSplits(splits)
  .setHandleNaN("keep")

## How was this patch tested?

Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>
Author: Vincent Xie <vincent.xie@intel.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15428 from VinceShieh/spark-17219_followup.
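The three handleNaN options described above can be illustrated with a small sketch in plain Scala. This is not the code from the patch and does not use Spark: the names `NaNBucketSketch`, `NaNPolicy`, `bucketOf`, `bucketWithPolicy` and `bucketizeAll` are hypothetical, and the bucket boundaries (lower bound inclusive, last bucket also including its upper bound) are an assumption made for the illustration.

```scala
// Illustrative sketch only: plain Scala, no Spark dependency.
// All names here are hypothetical and not taken from the patch.
object NaNBucketSketch {
  sealed trait NaNPolicy
  case object KeepNaN extends NaNPolicy    // corresponds to "keep"
  case object SkipNaN extends NaNPolicy    // corresponds to "skip"
  case object ErrorOnNaN extends NaNPolicy // corresponds to "error"

  // Index of the bucket containing x. Assumes strictly increasing splits,
  // buckets of the form [lo, hi), and a last bucket that includes its upper bound.
  def bucketOf(x: Double, splits: Array[Double]): Int = {
    require(x >= splits.head && x <= splits.last, s"$x is outside the splits range")
    if (x == splits.last) splits.length - 2
    else splits.lastIndexWhere(_ <= x)
  }

  // Applies the chosen policy to a single value; None means the value is dropped.
  def bucketWithPolicy(x: Double, splits: Array[Double], policy: NaNPolicy): Option[Int] = {
    if (!x.isNaN) {
      Some(bucketOf(x, splits))
    } else {
      policy match {
        case KeepNaN    => Some(splits.length - 1) // extra bucket after the regular ones
        case SkipNaN    => None                    // value is removed from the output
        case ErrorOnNaN => throw new IllegalArgumentException("NaN value encountered")
      }
    }
  }

  def bucketizeAll(xs: Seq[Double], splits: Array[Double], policy: NaNPolicy): Seq[Int] =
    xs.flatMap(x => bucketWithPolicy(x, splits, policy))
}
```

With splits `Array(0.0, 1.0, 2.0)` there are two regular buckets (indices 0 and 1), so under the "keep" policy a NaN lands in the extra bucket with index 2, under "skip" it is dropped from the result, and under "error" the call throws.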
Showing
- docs/ml-features.md: 10 additions, 5 deletions
- mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala: 64 additions, 7 deletions
- mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala: 40 additions, 7 deletions
- mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala: 19 additions, 7 deletions
- mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala: 24 additions, 11 deletions
- python/pyspark/ml/feature.py: 0 additions, 5 deletions
- sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala: 4 additions, 0 deletions