[SPARK-13969][ML] Add FeatureHasher transformer
This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).

The transformer operates on multiple input columns in one pass. Current behavior is:
* For numerical columns, the values are assumed to be real values; the feature index is `hash(column_name)` and the feature value is `feature_value`.
* For string columns, the values are assumed to be categorical; the feature index is `hash(column_name=feature_value)` and the feature value is `1.0`.
* For hash collisions, feature values are summed.
* `null` (missing) values are ignored.

The following dataframe illustrates the basic semantics:
```
+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
```

## How was this patch tested?

New unit tests and manual experiments.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18513 from MLnick/FeatureHasher.
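
The hashing rules described in the message can be sketched in a few lines of plain Python. This is only an illustration of the stated semantics, not the code in `FeatureHasher.scala`: the names `stand_in_hash` and `hash_features` are hypothetical, and CRC32 is used as a stand-in hash function, so the indices it produces will not match the `features` column in the dataframe above.

```python
import zlib
from typing import Dict, Mapping, Union

Value = Union[int, float, str, None]


def stand_in_hash(term: str) -> int:
    """Deterministic stand-in hash (CRC32). Assumption: chosen for illustration only."""
    return zlib.crc32(term.encode("utf-8"))


def hash_features(row: Mapping[str, Value], num_features: int = 16) -> Dict[int, float]:
    """Map one row to a sparse {index: value} vector using the rules in the message."""
    features: Dict[int, float] = {}
    for column, value in row.items():
        if value is None:
            # null (missing) values are ignored
            continue
        if isinstance(value, str):
            # string column: categorical, hash "column=value", feature value 1.0
            term, weight = f"{column}={value}", 1.0
        else:
            # numerical column: hash the column name, feature value is the number
            term, weight = column, float(value)
        index = stand_in_hash(term) % num_features
        # hash collisions: feature values are summed
        features[index] = features.get(index, 0.0) + weight
    return features


# First row of the example dataframe, with one extra null column added for illustration.
row = {"int": 3, "double": 4.0, "float": 5.0, "stringNum": "1", "string": "foo", "missing": None}
vector = hash_features(row)
```

Because colliding values are summed, the values in `vector` always add up to 3 + 4 + 5 + 1 + 1 = 14 for this row, whichever hash function is used and however many buckets are chosen.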
Showing
- mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala 196 additions, 0 deletions
- mllib/src/test/scala/org/apache/spark/ml/feature/FeatureHasherSuite.scala 193 additions, 0 deletions
- mllib/src/test/scala/org/apache/spark/ml/feature/HashingTFSuite.scala 7 additions, 1 deletion