Yuhao Yang authored
## What changes were proposed in this pull request? Add docs and examples for spark.ml.feature.Imputer. Currently scala and Java examples are included. Python example will be added after https://github.com/apache/spark/pull/17316 ## How was this patch tested? local doc generation and example execution Author: Yuhao Yang <yuhao.yang@intel.com> Closes #17324 from hhbyyh/imputerdoc.
layout: global
title: Extracting, transforming and selecting features
displayTitle: Extracting, transforming and selecting features
This section covers algorithms for working with features, roughly divided into these groups:
- Extraction: Extracting features from "raw" data
- Transformation: Scaling, converting, or modifying features
- Selection: Selecting a subset from a larger set of features
- Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.
Table of Contents
- This will become a table of contents (this text will be scraped). {:toc}
Feature Extractors
TF-IDF
Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used
in text mining to reflect the importance of a term to a document in the corpus. Denote a term by
$t$, a document by $d$, and the corpus by $D$.
Term frequency $TF(t, d)$ is the number of times that term $t$ appears in document $d$, while
document frequency $DF(t, D)$ is the number of documents that contain term $t$. If we only use
term frequency to measure importance, it is very easy to over-emphasize terms that appear very
often but carry little information about the document, e.g. "a", "the", and "of". If a term appears
very often across the corpus, it doesn't carry special information about a particular document.
Inverse document frequency is a numerical measure of how much information a term provides:
\[ IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1}, \]
where $|D|$ is the total number of documents in the corpus. Since a logarithm is used, if a term
appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid
dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:
\[ TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D). \]
There are several variants on the definition of term frequency and document frequency.
In MLlib, we separate TF and IDF to make them flexible.
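The two formulas above can be checked by hand with a short plain-Python sketch. This is not the MLlib implementation: the function names and the toy corpus are made up for illustration, and the natural logarithm is assumed for $\log$.

```python
import math
from collections import Counter

def term_frequency(term, document):
    """TF(t, d): number of times `term` appears in `document` (a list of terms)."""
    return Counter(document)[term]

def document_frequency(term, corpus):
    """DF(t, D): number of documents in `corpus` that contain `term`."""
    return sum(1 for document in corpus if term in document)

def inverse_document_frequency(term, corpus):
    """IDF(t, D) = log((|D| + 1) / (DF(t, D) + 1)); natural log is an assumption."""
    return math.log((len(corpus) + 1) / (document_frequency(term, corpus) + 1))

def tf_idf(term, document, corpus):
    """TFIDF(t, d, D) = TF(t, d) * IDF(t, D)."""
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)

# Hypothetical toy corpus: each document is a list of terms.
corpus = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "the"],
    ["the", "bird", "flew"],
]

# "the" appears in every document, so its IDF is log(4 / 4) = 0.
print(inverse_document_frequency("the", corpus))  # 0.0
# "cat" appears once, in one of three documents: 1 * log(4 / 2).
print(tf_idf("cat", corpus[0], corpus))
```

Even though "the" occurs twice in the second document, its TF-IDF there is still 0, which is the down-weighting effect the text describes.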
TF: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.
HashingTF is a Transformer which takes sets of terms and converts those sets into
fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words.
HashingTF utilizes the hashing trick.
A raw feature is mapped into an index (term) by applying a hash function. The hash function
used here is MurmurHash 3. Then term frequencies
are calculated based on the mapped indices. This approach avoids the need to compute a global
term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash
collisions, where different raw features may become the same term after hashing. To reduce the
chance of collision, we can increase the target feature dimension, i.e. the number of buckets
of the hash table. Since a simple modulo is used to transform the hash function to a column index,
it is advisable to use a power of two as the feature dimension, otherwise the features will
not be mapped evenly to the columns. The default feature dimension is $2^{18} = 262,144$.
An optional binary toggle parameter controls term frequency counts. When set to true, all nonzero
frequency counts are set to 1. This is especially useful for discrete probabilistic models that
model binary, rather than integer, counts.
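As a rough illustration of the hashing trick and the binary toggle, here is a minimal plain-Python sketch. It is not how HashingTF is implemented: `zlib.crc32` stands in for MurmurHash 3 only because it ships with the standard library, and the function name and parameter names are hypothetical.

```python
import zlib
from collections import Counter

def hashing_tf(terms, num_features=2 ** 18, binary=False):
    """Map terms to a sparse {column index: count} dict using the hashing trick."""
    counts = Counter()
    for term in terms:
        # Stand-in hash (assumption): CRC32 instead of MurmurHash 3.
        # A simple modulo turns the hash value into a column index.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    if binary:
        # Binary toggle: every nonzero count becomes 1.
        return {index: 1.0 for index in counts}
    return {index: float(count) for index, count in counts.items()}

# No global term-to-index map is needed; each term is hashed independently.
print(hashing_tf(["spark", "ml", "spark"], num_features=16))

# With too few buckets, distinct terms collide and their counts are merged.
print(hashing_tf(["spark", "ml"], num_features=1))  # {0: 2.0}
```

The second call shows the cost of a small feature dimension: both terms land in the single available column, so they can no longer be told apart.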
CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer
for more details.
IDF: IDF is an Estimator which is fit on a dataset and produces an IDFModel. The
IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and
scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
Note: spark.ml doesn't provide tools for text segmentation.
We refer users to the Stanford NLP Group and scalanlp/chalk.
Examples
In the following code segment, we start with a set of sentences. We split each sentence into words
using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into
a feature vector. We use IDF to rescale the feature vectors; this generally improves performance
when using text as features. Our feature vectors could then be passed to a learning algorithm.
Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/TfIdfExample.scala %}
Refer to the HashingTF Java docs and the IDF Java docs for more details on the API.
{% include_example java/org/apache/spark/examples/ml/JavaTfIdfExample.java %}
Refer to the HashingTF Python docs and the IDF Python docs for more details on the API.
{% include_example python/ml/tf_idf_example.py %}
Word2Vec
Word2Vec is an Estimator which takes sequences of words representing documents and trains a
Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel
transforms each document into a vector using the average of the vectors of all words in the
document; this vector can then be used as features for prediction, document similarity
calculations, etc.
Please refer to the MLlib user guide on Word2Vec for more details.
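The document-averaging step can be sketched in plain Python as follows. This only illustrates the averaging idea, not how a Word2VecModel is trained or how it transforms data internally: the word vectors are a hypothetical hand-written table, and skipping words that have no vector is an assumption of this sketch.

```python
def average_word_vectors(words, word_vectors, vector_size):
    """Average the vectors of the words in a document into one fixed-size vector."""
    total = [0.0] * vector_size
    known = 0
    for word in words:
        vector = word_vectors.get(word)
        if vector is None:
            # Assumption: words without a vector are skipped.
            continue
        known += 1
        for i, value in enumerate(vector):
            total[i] += value
    if known == 0:
        # No known words: return the zero vector.
        return total
    return [value / known for value in total]

# Hypothetical 2-dimensional word vectors; a trained model would supply these.
toy_vectors = {
    "spark": [1.0, 0.0],
    "is": [0.0, 1.0],
    "fast": [1.0, 1.0],
}

print(average_word_vectors(["spark", "is"], toy_vectors, 2))  # [0.5, 0.5]
```

Every document, whatever its length, ends up as a vector of the same size, which is what allows it to be passed to a downstream learning algorithm.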
Examples
In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
Refer to the Word2Vec Scala docs for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/Word2VecExample.scala %}
Refer to the Word2Vec Java docs for more details on the API.
{% include_example java/org/apache/spark/examples/ml/JavaWord2VecExample.java %}
Refer to the Word2Vec Python docs for more details on the API.
{% include_example python/ml/word2vec_example.py %}
CountVectorizer
CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents
to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can
be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The
model produces sparse representations for the documents over the vocabulary, which can then be
passed to other algorithms like LDA.
During the fitting process, CountVectorizer will select the top vocabSize words ordered by
term frequency across the corpus. An optional parameter minDF also affects the fitting process
by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be
included in the vocabulary. Another optional binary toggle parameter controls the output vector.
If set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic
models that model binary, rather than integer, counts.
Examples
Assume that we have the following DataFrame with columns id and texts:
id | texts
----|----------
0 | Array("a", "b", "c")
1 | Array("a", "b", "b", "c", "a")
Each row in texts is a document of type Array[String].
Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c).
Then the output column "vector" after transformation contains:
id | texts | vector
----|---------------------------------|---------------
0 | Array("a", "b", "c") | (3,[0,1,2],[1.0,1.0,1.0])
1 | Array("a", "b", "b", "c", "a") | (3,[0,1,2],[2.0,2.0,1.0])
Each vector represents the token counts of the document over the vocabulary.
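The table above can be reproduced with a small plain-Python sketch of the fit and transform steps. It is an illustration under stated assumptions, not the CountVectorizer implementation: the function names are hypothetical, and breaking ties between equally frequent terms alphabetically is a choice made here so that the result is deterministic.

```python
from collections import Counter

def fit_vocabulary(documents, vocab_size=None, min_df=1.0):
    """Pick the top `vocab_size` terms by corpus-wide term frequency."""
    term_counts = Counter()
    doc_counts = Counter()
    for document in documents:
        term_counts.update(document)
        doc_counts.update(set(document))
    # minDF: a document count if >= 1.0, otherwise a fraction of all documents.
    min_docs = min_df if min_df >= 1.0 else min_df * len(documents)
    candidates = [t for t in term_counts if doc_counts[t] >= min_docs]
    # Most frequent first; alphabetical tie-break is an assumption of this sketch.
    candidates.sort(key=lambda t: (-term_counts[t], t))
    return candidates[:vocab_size]

def transform(document, vocabulary, binary=False):
    """Return a sparse vector as (size, indices, values) over the vocabulary."""
    index_of = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index_of[t] for t in document if t in index_of)
    indices = sorted(counts)
    values = [1.0 if binary else float(counts[i]) for i in indices]
    return (len(vocabulary), indices, values)

docs = [["a", "b", "c"], ["a", "b", "b", "c", "a"]]
vocabulary = fit_vocabulary(docs, vocab_size=3)
print(vocabulary)                      # ['a', 'b', 'c']
print(transform(docs[0], vocabulary))  # (3, [0, 1, 2], [1.0, 1.0, 1.0])
print(transform(docs[1], vocabulary))  # (3, [0, 1, 2], [2.0, 2.0, 1.0])
```

Passing `binary=True` to `transform` turns the second vector into `(3, [0, 1, 2], [1.0, 1.0, 1.0])`, which mirrors the binary toggle described earlier.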
Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/CountVectorizerExample.scala %}
Refer to the CountVectorizer Java docs and the CountVectorizerModel Java docs for more details on the API.
{% include_example java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java %}
Refer to the CountVectorizer Python docs and the CountVectorizerModel Python docs for more details on the API.
{% include_example python/ml/count_vectorizer_example.py %}
Feature Transformers
Tokenizer
Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.
RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching.
By default, the parameter "pattern" (regex, default: "\\s+") is used as the delimiter to split the input text.
Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes
"tokens" rather than splitting gaps; all matching occurrences are then returned as the tokenization result.
Examples
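Before the Spark examples, the two meanings of "pattern" can be illustrated with Python's `re` module. This is a sketch only: the function name is hypothetical, it assumes patterns without capture groups, and it leaves out any other options RegexTokenizer may apply.

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """Tokenize `text`, treating `pattern` as gaps (delimiters) or as tokens."""
    if gaps:
        # The pattern describes the gaps between tokens: split on it.
        tokens = re.split(pattern, text)
    else:
        # The pattern describes the tokens themselves: find all matches.
        tokens = re.findall(pattern, text)
    # Drop empty strings produced by leading or trailing delimiters.
    return [token for token in tokens if token]

sentence = "Logistic,regression,models are neat"

# Default: split on whitespace, so the commas stay inside the first token.
print(regex_tokenize(sentence))
# ['Logistic,regression,models', 'are', 'neat']

# Split on any non-word character instead.
print(regex_tokenize(sentence, pattern=r"\W"))
# ['Logistic', 'regression', 'models', 'are', 'neat']

# gaps=False: the pattern matches the tokens rather than the delimiters.
print(regex_tokenize(sentence, pattern=r"\w+", gaps=False))
# ['Logistic', 'regression', 'models', 'are', 'neat']
```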
Refer to the Tokenizer Scala docs and the RegexTokenizer Scala docs for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/TokenizerExample.scala %}
Refer to the Tokenizer Java docs and the RegexTokenizer Java docs for more details on the API.
{% include_example java/org/apache/spark/examples/ml/JavaTokenizerExample.java %}
Refer to the Tokenizer Python docs and the RegexTokenizer Python docs for more details on the API.
{% include_example python/ml/tokenizer_example.py %}
StopWordsRemover
Stop words are words which should be excluded from the input, typically because the words appear frequently and don't carry as much meaning.
StopWordsRemover takes as input a sequence of strings (e.g. the output
of a Tokenizer) and drops all the stop
words from the input sequences. The list of stopwords is specified by
the stopWords parameter. Default stop words for some languages are accessible
by calling StopWordsRemover.loadDefaultStopWords(language), for which available
options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
"italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish".
A boolean parameter caseSensitive indicates if the matches should be case sensitive
(false by default).
Examples
Assume that we have the following DataFrame with columns id and raw:
id | raw
----|----------
0 | [I, saw, the, red, balloon]
1 | [Mary, had, a, little, lamb]
Applying StopWordsRemover with raw as the input column and filtered as the output
column, we should get the following:
id | raw | filtered
----|-----------------------------|--------------------
0 | [I, saw, the, red, balloon] | [saw, red, balloon]
1 | [Mary, had, a, little, lamb]|[Mary, little, lamb]
In filtered, the stop words "I", "the", "had", and "a" have been
filtered out.
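The filtering and the effect of caseSensitive can be sketched in plain Python. The function name and the four-word stop list are hypothetical and chosen only to match the example; they are not the defaults returned by loadDefaultStopWords.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Drop every token that matches an entry in `stop_words`."""
    if case_sensitive:
        stop_set = set(stop_words)
        return [token for token in tokens if token not in stop_set]
    # Case-insensitive matching: compare lowercased forms.
    stop_set = {word.lower() for word in stop_words}
    return [token for token in tokens if token.lower() not in stop_set]

# Hypothetical stop list covering only the words in the example.
toy_stop_words = ["i", "the", "had", "a"]

print(remove_stop_words(["I", "saw", "the", "red", "balloon"], toy_stop_words))
# ['saw', 'red', 'balloon']
print(remove_stop_words(["Mary", "had", "a", "little", "lamb"], toy_stop_words))
# ['Mary', 'little', 'lamb']

# With case-sensitive matching, "I" no longer matches the lowercase entry "i".
print(remove_stop_words(["I", "saw", "the", "red", "balloon"], toy_stop_words,
                        case_sensitive=True))
# ['I', 'saw', 'red', 'balloon']
```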
Refer to the StopWordsRemover Scala docs for more details on the API.
{% include_example scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala %}
Refer to the StopWordsRemover Java docs for more details on the API.
{% include_example java/org/apache/spark/examples/ml/JavaStopWordsRemoverExample.java %}
Refer to the StopWordsRemover Python docs for more details on the API.
{% include_example python/ml/stopwords_remover_example.py %}
n-gram
An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.
NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.
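The behavior described above fits in a few lines of plain Python. This is a sketch of the idea with a hypothetical function name, not the NGram class itself.

```python
def ngrams(tokens, n=2):
    """Return all runs of n consecutive tokens as space-delimited strings."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # When len(tokens) < n the range is empty, so no output is produced.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = ["Hi", "I", "heard", "about", "Spark"]

print(ngrams(words, n=2))
# ['Hi I', 'I heard', 'heard about', 'about Spark']
print(ngrams(words, n=3))
# ['Hi I heard', 'I heard about', 'heard about Spark']
print(ngrams(["Spark"], n=2))
# []
```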