[SPARK-13568][ML] Create feature transformer to impute missing values
## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-13568

It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.

Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches. Where possible, existing DataFrame code can be used (e.g. for approximate quantiles etc).

Currently this PR supports imputation for Double and Vector (null and NaN in Vector).

## How was this patch tested?

New unit tests and manual test.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao <yuhao.yang@intel.com>

Closes #11601 from hhbyyh/imputer.
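
For readers unfamiliar with the idea, here is a minimal sketch of mean and median imputation in plain Scala. It is not the code in `Imputer.scala`: the object `ImputeSketch` and its methods are hypothetical names chosen for illustration, it works on a local `Seq[Double]` rather than a DataFrame, it treats only NaN as missing, and it computes an exact median where the description above suggests approximate quantiles.

```scala
// Hypothetical illustration only; not the Transformer added by this PR.
object ImputeSketch {
  // Assumption: NaN is the only missing-value marker in this sketch.
  private def observed(xs: Seq[Double]): Seq[Double] = xs.filterNot(_.isNaN)

  // Mean of the observed (non-NaN) values.
  def mean(xs: Seq[Double]): Double = {
    val obs = observed(xs)
    require(obs.nonEmpty, "no observed values to compute a surrogate from")
    obs.sum / obs.size
  }

  // Exact median of the observed values (a real implementation on large
  // data would more likely use an approximate quantile).
  def median(xs: Seq[Double]): Double = {
    val sorted = observed(xs).sorted
    require(sorted.nonEmpty, "no observed values to compute a surrogate from")
    val mid = sorted.size / 2
    if (sorted.size % 2 == 1) sorted(mid)
    else (sorted(mid - 1) + sorted(mid)) / 2.0
  }

  // Replace every NaN with the surrogate chosen by `strategy`.
  def impute(xs: Seq[Double], strategy: String): Seq[Double] = {
    val surrogate = strategy match {
      case "mean"   => mean(xs)
      case "median" => median(xs)
      case other    => throw new IllegalArgumentException(s"unsupported strategy: $other")
    }
    xs.map(x => if (x.isNaN) surrogate else x)
  }
}
```

Calling `ImputeSketch.impute(Seq(1.0, Double.NaN, 3.0, 8.0), "mean")` fills the gap with the mean of the three observed values. The surrogate is computed once from the observed values and then reused for every missing entry, which is the same fit-then-transform split an Estimator/Model pair would follow.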
Showing
- mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala (259 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala (185 additions, 0 deletions)