Commit ed9d8038 authored by Yuhao Yang's avatar Yuhao Yang Committed by Sean Owen

[SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF

## What changes were proposed in this pull request?

Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.

## How was this patch tested?

unit tests and doc generation

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #12454 from hhbyyh/tfdoc.
parent 17db4bfe
@@ -22,10 +22,19 @@ This section covers algorithms for working with features, roughly divided into t
[Term Frequency-Inverse Document Frequency (TF-IDF)](http://en.wikipedia.org/wiki/Tf%E2%80%93idf) is a common text pre-processing step. In Spark ML, TF-IDF is separated into two parts: TF (+hashing) and IDF.
**TF**: Both `HashingTF` and `CountVectorizer` can be used to generate the term frequency vectors.

`HashingTF` is a `Transformer` which takes sets of terms and converts those sets into
fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
The algorithm combines Term Frequency (TF) counts with the
[hashing trick](http://en.wikipedia.org/wiki/Feature_hashing) for dimensionality reduction.
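To illustrate the hashing trick described above, here is a minimal plain-Python sketch (not the Spark implementation; Python's built-in `hash` and the bucket count stand in for Spark's hash function and `numFeatures`):

```python
def hashing_tf(terms, num_features=20):
    """Map each term to a bucket by hashing, then count occurrences per bucket.
    Mirrors the idea behind HashingTF, not its exact hash function."""
    vec = [0] * num_features
    for term in terms:
        idx = hash(term) % num_features  # hashing trick: term -> fixed index
        vec[idx] += 1
    return vec

tf = hashing_tf(["spark", "ml", "spark"], num_features=8)
assert sum(tf) == 3  # three term occurrences counted in total
```

Because the vector length is fixed in advance, no vocabulary needs to be built, at the cost of possible hash collisions between distinct terms.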
`CountVectorizer` converts text documents to vectors of term counts. Refer to [CountVectorizer](ml-features.html#countvectorizer) for more details.
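By contrast with the hashing approach, a count vectorizer builds an explicit vocabulary. A plain-Python sketch of that behavior (not Spark's API; `count_vectorize` is a hypothetical helper):

```python
def count_vectorize(docs):
    """Build a vocabulary from the corpus, then one count vector per document.
    A sketch of CountVectorizer-style behavior, not Spark's implementation."""
    vocab = sorted({t for doc in docs for t in doc})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for t in doc:
            vec[index[t]] += 1  # one column per vocabulary term
        vectors.append(vec)
    return vocab, vectors

vocab, vecs = count_vectorize([["a", "b", "a"], ["b", "c"]])
# vocab == ["a", "b", "c"]; vecs == [[2, 1, 0], [0, 1, 1]]
```

Unlike hashing, this keeps an interpretable mapping from columns back to terms, but requires a pass over the corpus to fit the vocabulary.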
**IDF**: `IDF` is an `Estimator` which is fit on a dataset and produces an `IDFModel`. The
`IDFModel` takes feature vectors (generally created from `HashingTF` or `CountVectorizer`) and scales each column.
Intuitively, it down-weights columns which appear frequently in a corpus.
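A minimal sketch of that down-weighting in plain Python, using the smoothed formula `log((m + 1) / (df + 1))` that Spark's MLlib documents for IDF (`m` = number of documents, `df` = number of documents containing the term); the helper names are illustrative, not Spark's API:

```python
import math

def idf_weights(count_vectors):
    """One IDF weight per column: log((m + 1) / (df + 1))."""
    m = len(count_vectors)
    n = len(count_vectors[0])
    df = [sum(1 for v in count_vectors if v[j] > 0) for j in range(n)]
    return [math.log((m + 1) / (df[j] + 1)) for j in range(n)]

def rescale(count_vectors, idf):
    # down-weight columns that appear in many documents
    return [[tf * w for tf, w in zip(v, idf)] for v in count_vectors]

idf = idf_weights([[2, 1, 0], [0, 1, 1]])
# the middle term appears in every document, so its weight is log(3/3) = 0
```

A term present in every document gets weight zero, while rarer terms keep a positive weight, which is exactly the intuition stated above.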
Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
......
@@ -63,6 +63,8 @@ public class JavaTfIdfExample {
      .setOutputCol("rawFeatures")
      .setNumFeatures(numFeatures);
    Dataset<Row> featurizedData = hashingTF.transform(wordsData);
    // alternatively, CountVectorizer can also be used to get term frequency vectors

    IDF idf = new IDF().setInputCol("rawFeatures").setOutputCol("features");
    IDFModel idfModel = idf.fit(featurizedData);
    Dataset<Row> rescaledData = idfModel.transform(featurizedData);
......
@@ -37,6 +37,8 @@ if __name__ == "__main__":
    wordsData = tokenizer.transform(sentenceData)
    hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=20)
    featurizedData = hashingTF.transform(wordsData)
    # alternatively, CountVectorizer can also be used to get term frequency vectors

    idf = IDF(inputCol="rawFeatures", outputCol="features")
    idfModel = idf.fit(featurizedData)
    rescaledData = idfModel.transform(featurizedData)
......
@@ -43,6 +43,8 @@ object TfIdfExample {
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
    val featurizedData = hashingTF.transform(wordsData)
    // alternatively, CountVectorizer can also be used to get term frequency vectors

    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    val idfModel = idf.fit(featurizedData)
    val rescaledData = idfModel.transform(featurizedData)
......