diff --git a/docs/ml-features.md b/docs/ml-features.md index e19fba249fb2df981df2dd7553eb4ca4329799c1..86a0e09997b8e64cefaa3101f3f6355f53bec04d 100644 --- a/docs/ml-features.md +++ b/docs/ml-features.md @@ -53,9 +53,9 @@ are calculated based on the mapped indices. This approach avoids the need to com term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets -of the hash table. Since a simple modulo is used to transform the hash function to a column index, -it is advisable to use a power of two as the feature dimension, otherwise the features will -not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`. +of the hash table. Since a simple modulo on the hashed value is used to determine the vector index, +it is advisable to use a power of two as the feature dimension, otherwise the features will not +be mapped evenly to the vector indices. The default feature dimension is `$2^{18} = 262,144$`. An optional binary toggle parameter controls term frequency counts. When set to true all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts. @@ -65,7 +65,7 @@ model binary, rather than integer, counts. **IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and -scales each column. Intuitively, it down-weights columns which appear frequently in a corpus. +scales each feature. Intuitively, it down-weights features which appear frequently in a corpus. **Note:** `spark.ml` doesn't provide tools for text segmentation. 
We refer users to the [Stanford NLP Group](http://nlp.stanford.edu/) and @@ -211,6 +211,89 @@ for more details on the API. </div> </div> +## FeatureHasher + +Feature hashing projects a set of categorical or numerical features into a feature vector of +specified dimension (typically substantially smaller than that of the original feature +space). This is done using the [hashing trick](https://en.wikipedia.org/wiki/Feature_hashing) +to map features to indices in the feature vector. + +The `FeatureHasher` transformer operates on multiple columns. Each column may contain either +numeric or categorical features. Behavior and handling of column data types is as follows: + +- Numeric columns: For numeric features, the hash value of the column name is used to map the +feature value to its index in the feature vector. Numeric features are never treated as +categorical, even when they are integers. You must explicitly convert numeric columns containing +categorical features to strings first. +- String columns: For categorical features, the hash value of the string "column_name=value" +is used to map to the vector index, with an indicator value of `1.0`. Thus, categorical features +are "one-hot" encoded (similarly to using [OneHotEncoder](ml-features.html#onehotencoder) with +`dropLast=false`). +- Boolean columns: Boolean values are treated in the same way as string columns. That is, +boolean features are represented as "column_name=true" or "column_name=false", with an indicator +value of `1.0`. + +Null (missing) values are ignored (implicitly zero in the resulting feature vector). + +The hash function used here is also the [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash) +used in [HashingTF](ml-features.html#tf-idf). Since a simple modulo on the hashed value is used to +determine the vector index, it is advisable to use a power of two as the numFeatures parameter; +otherwise the features will not be mapped evenly to the vector indices. 
+ +**Examples** + +Assume that we have a DataFrame with 4 input columns `real`, `bool`, `stringNum`, and `string`. +These different input data types will illustrate the behavior of the transform to produce a +column of feature vectors. + +~~~~ +real| bool|stringNum|string +----|-----|---------|------ + 2.2| true| 1| foo + 3.3|false| 2| bar + 4.4|false| 3| baz + 5.5|false| 4| foo +~~~~ + +Then the output of `FeatureHasher.transform` on this DataFrame is: + +~~~~ +real|bool |stringNum|string|features +----|-----|---------|------|------------------------------------------------------- +2.2 |true |1 |foo |(262144,[51871, 63643,174475,253195],[1.0,1.0,2.2,1.0]) +3.3 |false|2 |bar |(262144,[6031, 80619,140467,174475],[1.0,1.0,1.0,3.3]) +4.4 |false|3 |baz |(262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0]) +5.5 |false|4 |foo |(262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5]) +~~~~ + +The resulting feature vectors could then be passed to a learning algorithm. + +<div class="codetabs"> +<div data-lang="scala" markdown="1"> + +Refer to the [FeatureHasher Scala docs](api/scala/index.html#org.apache.spark.ml.feature.FeatureHasher) +for more details on the API. + +{% include_example scala/org/apache/spark/examples/ml/FeatureHasherExample.scala %} +</div> + +<div data-lang="java" markdown="1"> + +Refer to the [FeatureHasher Java docs](api/java/org/apache/spark/ml/feature/FeatureHasher.html) +for more details on the API. + +{% include_example java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java %} +</div> + +<div data-lang="python" markdown="1"> + +Refer to the [FeatureHasher Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.FeatureHasher) +for more details on the API. 
+ +{% include_example python/ml/feature_hasher_example.py %} +</div> +</div> + # Feature Transformers ## Tokenizer diff --git a/examples/src/main/java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java b/examples/src/main/java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java new file mode 100644 index 0000000000000000000000000000000000000000..9730d42e6db8d9d400c8b765ffbbdf1c85ad7b0c --- /dev/null +++ b/examples/src/main/java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java @@ -0,0 +1,69 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.spark.examples.ml; + +import org.apache.spark.sql.Dataset; +import org.apache.spark.sql.SparkSession; + +// $example on$ +import java.util.Arrays; +import java.util.List; + +import org.apache.spark.ml.feature.FeatureHasher; +import org.apache.spark.sql.Row; +import org.apache.spark.sql.RowFactory; +import org.apache.spark.sql.types.DataTypes; +import org.apache.spark.sql.types.Metadata; +import org.apache.spark.sql.types.StructField; +import org.apache.spark.sql.types.StructType; +// $example off$ + +public class JavaFeatureHasherExample { + public static void main(String[] args) { + SparkSession spark = SparkSession + .builder() + .appName("JavaFeatureHasherExample") + .getOrCreate(); + + // $example on$ + List<Row> data = Arrays.asList( + RowFactory.create(2.2, true, "1", "foo"), + RowFactory.create(3.3, false, "2", "bar"), + RowFactory.create(4.4, false, "3", "baz"), + RowFactory.create(5.5, false, "4", "foo") + ); + StructType schema = new StructType(new StructField[]{ + new StructField("real", DataTypes.DoubleType, false, Metadata.empty()), + new StructField("bool", DataTypes.BooleanType, false, Metadata.empty()), + new StructField("stringNum", DataTypes.StringType, false, Metadata.empty()), + new StructField("string", DataTypes.StringType, false, Metadata.empty()) + }); + Dataset<Row> dataset = spark.createDataFrame(data, schema); + + FeatureHasher hasher = new FeatureHasher() + .setInputCols(new String[]{"real", "bool", "stringNum", "string"}) + .setOutputCol("features"); + + Dataset<Row> featurized = hasher.transform(dataset); + + featurized.show(false); + // $example off$ + + spark.stop(); + } +} diff --git a/examples/src/main/python/ml/feature_hasher_example.py b/examples/src/main/python/ml/feature_hasher_example.py new file mode 100644 index 0000000000000000000000000000000000000000..6cf9ecc39640019f134ca957594d4f8eee5eb5dc --- /dev/null +++ b/examples/src/main/python/ml/feature_hasher_example.py @@ -0,0 +1,46 @@ +# +# Licensed 
to the Apache Software Foundation (ASF) under one or more +# contributor license agreements. See the NOTICE file distributed with +# this work for additional information regarding copyright ownership. +# The ASF licenses this file to You under the Apache License, Version 2.0 +# (the "License"); you may not use this file except in compliance with +# the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# + +from __future__ import print_function + +from pyspark.sql import SparkSession +# $example on$ +from pyspark.ml.feature import FeatureHasher +# $example off$ + +if __name__ == "__main__": + spark = SparkSession\ + .builder\ + .appName("FeatureHasherExample")\ + .getOrCreate() + + # $example on$ + dataset = spark.createDataFrame([ + (2.2, True, "1", "foo"), + (3.3, False, "2", "bar"), + (4.4, False, "3", "baz"), + (5.5, False, "4", "foo") + ], ["real", "bool", "stringNum", "string"]) + + hasher = FeatureHasher(inputCols=["real", "bool", "stringNum", "string"], + outputCol="features") + + featurized = hasher.transform(dataset) + featurized.show(truncate=False) + # $example off$ + + spark.stop() diff --git a/examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala b/examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala new file mode 100644 index 0000000000000000000000000000000000000000..1aed10bfb2d384e30c2f123d66169fabc35df405 --- /dev/null +++ b/examples/src/main/scala/org/apache/spark/examples/ml/FeatureHasherExample.scala @@ -0,0 +1,50 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. 
See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.examples.ml + +// $example on$ +import org.apache.spark.ml.feature.FeatureHasher +// $example off$ +import org.apache.spark.sql.SparkSession + +object FeatureHasherExample { + def main(args: Array[String]): Unit = { + val spark = SparkSession + .builder + .appName("FeatureHasherExample") + .getOrCreate() + + // $example on$ + val dataset = spark.createDataFrame(Seq( + (2.2, true, "1", "foo"), + (3.3, false, "2", "bar"), + (4.4, false, "3", "baz"), + (5.5, false, "4", "foo") + )).toDF("real", "bool", "stringNum", "string") + + val hasher = new FeatureHasher() + .setInputCols("real", "bool", "stringNum", "string") + .setOutputCol("features") + + val featurized = hasher.transform(dataset) + featurized.show(false) + // $example off$ + + spark.stop() + } +} diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala index 4b91fa933ed9fc64daeec287d727399612609a8a..4615daed20fb1bf5e885d90142512f591b5b5c8a 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/FeatureHasher.scala @@ -53,9 +53,10 @@ import org.apache.spark.util.collection.OpenHashMap * * Null (missing) values are ignored 
(implicitly zero in the resulting feature vector). * - * Since a simple modulo is used to transform the hash function to a vector index, - * it is advisable to use a power of two as the numFeatures parameter; - * otherwise the features will not be mapped evenly to the vector indices. + * The hash function used here is also the MurmurHash 3 used in [[HashingTF]]. Since a simple modulo + * on the hashed value is used to determine the vector index, it is advisable to use a power of two + * as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector + * indices. * * {{{ * val df = Seq(
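
As a supplementary illustration (not part of the patch above): the column-handling rules the new docs describe can be sketched in a few lines of plain Python. Note that Spark's `FeatureHasher` uses MurmurHash 3; the sketch below substitutes `zlib.crc32` as a stand-in hash so it runs with only the standard library, so the resulting indices will not match Spark's output. Only the mapping behavior is reproduced: numeric values land at `hash(column_name)`, strings and booleans become one-hot `"column_name=value"` entries with indicator `1.0`, and nulls are ignored.

```python
import zlib


def _bucket(key, num_features):
    # Stand-in hash: Spark uses MurmurHash 3; zlib.crc32 is used here only
    # so the sketch is runnable with the standard library alone.
    return zlib.crc32(key.encode("utf-8")) % num_features


def feature_hash(row, num_features=262144):
    """Sketch of FeatureHasher semantics for a single row (dict of column -> value)."""
    vec = {}
    for col, value in row.items():
        if value is None:
            continue  # null (missing) values are implicitly zero
        if isinstance(value, bool):
            # Booleans behave like strings: "column_name=true" / "column_name=false".
            idx = _bucket("%s=%s" % (col, str(value).lower()), num_features)
            vec[idx] = vec.get(idx, 0.0) + 1.0
        elif isinstance(value, (int, float)):
            # Numeric: the value is placed at the hash of the column name and is
            # never treated as categorical, even for integers.
            idx = _bucket(col, num_features)
            vec[idx] = vec.get(idx, 0.0) + float(value)
        else:
            # Categorical (string): one-hot "column_name=value" with indicator 1.0.
            idx = _bucket("%s=%s" % (col, value), num_features)
            vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


print(feature_hash({"real": 2.2, "bool": True, "stringNum": "1", "string": "foo"}))
```

Because the vector index is just `hash % num_features`, this sketch also makes the power-of-two advice concrete: with a non-power-of-two `num_features`, the modulo does not spread hash values evenly over the buckets.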