[SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket.
`Bucketizer` currently gives +inf special treatment. This could be simplified by always including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 into bad, neutral, and good with splits 0, 4, 6, 10. It would read oddly if users had to write 0, 4, 6, 10.1 (or 11) instead.

This also updates the implementation to use `Arrays.binarySearch`, and the tests to use `withClue`.

yinxusen jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6075 from mengxr/SPARK-7559 and squashes the following commits:

e28f910 [Xiangrui Meng] update bucketizer impl
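The bucket rule described in the message can be sketched as follows. This is only an illustration under stated assumptions, not the code from this commit: the object name `BucketSketch` and the method `bucketIndex` are hypothetical, and the only library call used is `java.util.Arrays.binarySearch`. Every bucket is closed on the left and open on the right, except the last one, which also includes the largest split.

```scala
import java.util.Arrays

// Hypothetical helper for illustration; not the actual Bucketizer code.
object BucketSketch {

  /** Returns the bucket index of `feature` for sorted `splits`.
    * Buckets are [s(0), s(1)), ..., [s(n-2), s(n-1)]; only the last is closed on the right.
    */
  def bucketIndex(splits: Array[Double], feature: Double): Int = {
    require(splits.length >= 2, "need at least two splits to define a bucket")
    if (feature == splits.last) {
      // The right-most boundary belongs to the last bucket.
      splits.length - 2
    } else {
      val idx = Arrays.binarySearch(splits, feature)
      if (idx >= 0) {
        // Exact hit on a split that is not the last one: it opens bucket `idx`.
        idx
      } else {
        // binarySearch returns -(insertionPoint) - 1 when the key is absent.
        val insertPos = -idx - 1
        if (insertPos == 0 || insertPos == splits.length) {
          throw new IllegalArgumentException(s"Feature value $feature is out of bounds")
        }
        insertPos - 1
      }
    }
  }
}
```

With the rating splits 0, 4, 6, 10 from the message, a rating of 10 falls into bucket 2 ("good"), a rating of 4 into bucket 1 ("neutral"), and a value such as 11 is rejected as out of bounds.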
Showing
- mllib/src/main/scala/org/apache/spark/ml/feature/Bucketizer.scala (28 additions, 27 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/BucketizerSuite.scala (13 additions, 12 deletions)