[SPARK-5888] [MLLIB] Add OneHotEncoder as a Transformer
This patch adds a one hot encoder for categorical features. Planning to add documentation and another test after getting feedback on the approach.

A couple choices made here:
* There's an `includeFirst` option which, if false, creates numCategories - 1 columns and, if true, creates numCategories columns. The default is true, which is the behavior in scikit-learn.
* The user is expected to pass a `Seq` of category names when instantiating a `OneHotEncoder`. These can be easily gotten from a `StringIndexer`. The names are used for the output column names, which take the form colName_categoryName.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5500 from sryza/sandy-spark-5888 and squashes the following commits:

f383250 [Sandy Ryza] Infer label names automatically
6e257b9 [Sandy Ryza] Review comments
7c539cf [Sandy Ryza] Vector transformers
1c182dd [Sandy Ryza] SPARK-5888. [MLLIB]. Add OneHotEncoder as a Transformer
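
To make the two choices above concrete, here is a minimal plain-Scala sketch of the encoding and of the colName_categoryName naming. It is not the code from this patch and does not use Spark: the `OneHotSketch` object and its `encode` and `outputColumnNames` helpers are hypothetical names for illustration. It also assumes that `includeFirst = false` drops the column for the first category (index 0), which the option name suggests but the message does not state.

```scala
// Illustrative sketch only; not the OneHotEncoder implementation from this patch.
object OneHotSketch {

  /** Encode a category index as a vector of 0.0/1.0 values.
    *
    * includeFirst = true  gives numCategories slots.
    * includeFirst = false gives numCategories - 1 slots. ASSUMPTION: the
    * dropped slot is the one for category 0, so category 0 encodes as all zeros.
    */
  def encode(index: Int, numCategories: Int, includeFirst: Boolean): Vector[Double] = {
    require(index >= 0 && index < numCategories, s"index $index out of range")
    val offset = if (includeFirst) 0 else 1
    Vector.tabulate(numCategories - offset) { i =>
      if (i + offset == index) 1.0 else 0.0
    }
  }

  /** Build output column names of the form colName_categoryName. */
  def outputColumnNames(
      colName: String,
      categories: Seq[String],
      includeFirst: Boolean): Seq[String] = {
    // Same assumption as above: the first category is the one left out.
    val kept = if (includeFirst) categories else categories.drop(1)
    kept.map(category => s"${colName}_$category")
  }
}
```

For a hypothetical column `color` with categories `red`, `green`, `blue`, calling `OneHotSketch.encode(1, 3, includeFirst = true)` marks only the `green` slot out of three, while `includeFirst = false` yields a two-slot vector in which `red` is represented by all zeros.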
Showing
- mllib/src/main/scala/org/apache/spark/ml/feature/OneHotEncoder.scala (107 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/OneHotEncoderSuite.scala (80 additions, 0 deletions)