[SPARK-17219][ML] Add NaN value handling in Bucketizer
## What changes were proposed in this pull request?

This PR fixes an issue that occurs when Bucketizer is called on a dataset containing NaN values. Such values may still be meaningful to users, so in these cases Bucketizer should reserve one extra bucket for NaN values instead of throwing an exception.

Before:

```
Bucketizer.transform on a NaN value threw an exception.
```

After:

```
NaN values will be grouped in an extra bucket.
```

## How was this patch tested?

New test cases added in `BucketizerSuite`.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #14858 from VinceShieh/spark-17219.
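The "After" behavior can be illustrated with a small standalone sketch. This is plain Python, not Spark code: the helper name `bucket_index` is hypothetical, and the half-open bucket convention (each bucket covers `[x, y)`, with the last bucket also including its upper bound) is an assumption about how the splits are interpreted. The only point it tries to show is that NaN maps to one extra index past the regular buckets instead of raising an error.

```python
import math
from bisect import bisect_right


def bucket_index(splits, value):
    """Hypothetical helper: map a value to a bucket index.

    Assumes `splits` is sorted and strictly increasing, and that bucket i
    covers [splits[i], splits[i + 1]), with the last bucket also including
    its upper bound. This is an illustration, not Spark's implementation.
    """
    num_buckets = len(splits) - 1
    if math.isnan(value):
        # NaN goes to one extra bucket after the regular ones.
        return num_buckets
    if value == splits[-1]:
        # The upper bound belongs to the last regular bucket.
        return num_buckets - 1
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"value {value} is outside the split bounds")
    # Index of the rightmost split that is <= value.
    return bisect_right(splits, value) - 1


splits = [float("-inf"), 0.0, 10.0, float("inf")]
values = [-5.0, 0.0, 3.2, 10.0, float("nan")]
indices = [bucket_index(splits, v) for v in values]
print(indices)
```

With three regular buckets, the NaN input lands in index 3, while values outside finite split bounds would still be rejected.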
Showing 7 changed files:
- docs/ml-features.md (5 additions, 1 deletion)
- mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala (9 additions, 4 deletions)
- mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala (7 additions, 2 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala (31 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala (25 additions, 4 deletions)
- python/pyspark/ml/feature.py (5 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala (3 additions, 1 deletion)