---
layout: global
title: Extracting, transforming and selecting features
displayTitle: Extracting, transforming and selecting features
---

This section covers algorithms for working with features, roughly divided into these groups:

* Extraction: Extracting features from "raw" data
* Transformation: Scaling, converting, or modifying features
* Selection: Selecting a subset from a larger set of features
* Locality Sensitive Hashing (LSH): This class of algorithms combines aspects of feature transformation with other algorithms.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

# Feature Extractors

## TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a feature vectorization method widely used in text mining to reflect the importance of a term to a document in the corpus. Denote a term by $t$, a document by $d$, and the corpus by $D$. Term frequency $TF(t, d)$ is the number of times that term $t$ appears in document $d$, while document frequency $DF(t, D)$ is the number of documents that contain term $t$. If we only use term frequency to measure the importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g. "a", "the", and "of". If a term appears very often across the corpus, it doesn't carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:

\[
IDF(t, D) = \log \frac{|D| + 1}{DF(t, D) + 1},
\]

where $|D|$ is the total number of documents in the corpus. Since a logarithm is used, if a term appears in all documents, its IDF value becomes 0. Note that a smoothing term is applied to avoid dividing by zero for terms outside the corpus. The TF-IDF measure is simply the product of TF and IDF:

\[
TFIDF(t, d, D) = TF(t, d) \cdot IDF(t, D).
\]

There are several variants on the definition of term frequency and document frequency. In MLlib, we separate TF and IDF to make them flexible.

**TF**: Both HashingTF and CountVectorizer can be used to generate the term frequency vectors.

HashingTF is a Transformer which takes sets of terms and converts those sets into fixed-length feature vectors. In text processing, a "set of terms" might be a bag of words. HashingTF utilizes the hashing trick: a raw feature is mapped into an index (term) by applying a hash function (MurmurHash 3 here), and term frequencies are then calculated based on the mapped indices. This approach avoids the need to compute a global term-to-index map, which can be expensive for a large corpus, but it suffers from potential hash collisions, where different raw features may become the same term after hashing. To reduce the chance of collision, we can increase the target feature dimension, i.e. the number of buckets of the hash table. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the feature dimension; otherwise the features will not be mapped evenly to the vector indices. The default feature dimension is $2^{18} = 262,144$.

An optional binary toggle parameter controls term frequency counts. When set to true, all nonzero frequency counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

CountVectorizer converts text documents to vectors of term counts. Refer to CountVectorizer for more details.

**IDF**: IDF is an Estimator which is fit on a dataset and produces an IDFModel. The IDFModel takes feature vectors (generally created from HashingTF or CountVectorizer) and scales each feature. Intuitively, it down-weights features which appear frequently in a corpus.

**Note:** spark.ml doesn't provide tools for text segmentation. We refer users to the Stanford NLP Group and scalanlp/chalk.

**Examples**

In the following code segment, we start with a set of sentences. We split each sentence into words using Tokenizer. For each sentence (bag of words), we use HashingTF to hash the sentence into a feature vector. We use IDF to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
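
As a quick sketch of that pipeline in PySpark (the toy sentences, column names, and the numFeatures value are illustrative choices, not requirements of the API):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.getOrCreate()

sentences = spark.createDataFrame([
    (0.0, "Hi I heard about Spark"),
    (0.0, "I wish Java could use case classes"),
    (1.0, "Logistic regression models are neat")
], ["label", "sentence"])

# Split each sentence into words.
words = Tokenizer(inputCol="sentence", outputCol="words").transform(sentences)

# Hash each bag of words into a fixed-length raw term-frequency vector;
# a power of two for numFeatures keeps the mapping to indices even.
tf = HashingTF(inputCol="words", outputCol="rawFeatures",
               numFeatures=1024).transform(words)

# Fit IDF on the corpus, then rescale the raw term frequencies.
idfModel = IDF(inputCol="rawFeatures", outputCol="features").fit(tf)
idfModel.transform(tf).select("label", "features").show(truncate=False)
```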

Refer to the HashingTF Scala docs and the IDF Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/TfIdfExample.scala %}

Refer to the HashingTF Java docs and the IDF Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaTfIdfExample.java %}

Refer to the HashingTF Python docs and the IDF Python docs for more details on the API.

{% include_example python/ml/tf_idf_example.py %}

## Word2Vec

Word2Vec is an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document; this vector can then be used as features for prediction, document similarity calculations, etc. Please refer to the MLlib user guide on Word2Vec for more details.

**Examples**

In the following code segment, we start with a set of documents, each of which is represented as a sequence of words. For each document, we transform it into a feature vector. This feature vector could then be passed to a learning algorithm.
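
A minimal PySpark sketch (the toy documents and the small vectorSize are illustrative; minCount=0 is set only so the tiny corpus keeps every word):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.getOrCreate()

doc = spark.createDataFrame([
    ("Hi I heard about Spark".split(" "),),
    ("I wish Java could use case classes".split(" "),),
    ("Logistic regression models are neat".split(" "),)
], ["text"])

# Learn 3-dimensional word vectors; each document is transformed into
# the average of the vectors of its words.
word2Vec = Word2Vec(vectorSize=3, minCount=0, inputCol="text", outputCol="result")
model = word2Vec.fit(doc)
model.transform(doc).show(truncate=False)
```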

Refer to the Word2Vec Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/Word2VecExample.scala %}

Refer to the Word2Vec Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaWord2VecExample.java %}

Refer to the Word2Vec Python docs for more details on the API.

{% include_example python/ml/word2vec_example.py %}

## CountVectorizer

CountVectorizer and CountVectorizerModel aim to help convert a collection of text documents to vectors of token counts. When an a-priori dictionary is not available, CountVectorizer can be used as an Estimator to extract the vocabulary and generate a CountVectorizerModel. The model produces sparse representations for the documents over the vocabulary, which can then be passed to other algorithms like LDA.

During the fitting process, CountVectorizer will select the top vocabSize words ordered by term frequency across the corpus. An optional parameter minDF also affects the fitting process by specifying the minimum number (or fraction if < 1.0) of documents a term must appear in to be included in the vocabulary. Another optional binary toggle parameter controls the output vector: if set to true, all nonzero counts are set to 1. This is especially useful for discrete probabilistic models that model binary, rather than integer, counts.

**Examples**

Assume that we have the following DataFrame with columns id and texts:

 id | texts
----|----------
 0  | Array("a", "b", "c")
 1  | Array("a", "b", "b", "c", "a")

Each row in texts is a document of type Array[String]. Invoking fit of CountVectorizer produces a CountVectorizerModel with vocabulary (a, b, c). Then the output column "vector" after transformation contains:

 id | texts                           | vector
----|---------------------------------|---------------
 0  | Array("a", "b", "c")            | (3,[0,1,2],[1.0,1.0,1.0])
 1  | Array("a", "b", "b", "c", "a")  | (3,[0,1,2],[2.0,2.0,1.0])

Each vector represents the token counts of the document over the vocabulary.
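
A minimal PySpark sketch that reproduces the table above (vocabSize and minDF are spelled out only to illustrate the parameters; the defaults would yield the same vocabulary on this toy data):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import CountVectorizer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0, ["a", "b", "c"]),
    (1, ["a", "b", "b", "c", "a"])
], ["id", "texts"])

# fit() extracts the vocabulary (a, b, c); minDF=2.0 keeps only terms
# that appear in at least two documents.
cv = CountVectorizer(inputCol="texts", outputCol="vector",
                     vocabSize=3, minDF=2.0)
model = cv.fit(df)
model.transform(df).show(truncate=False)
```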

Refer to the CountVectorizer Scala docs and the CountVectorizerModel Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/CountVectorizerExample.scala %}

Refer to the CountVectorizer Java docs and the CountVectorizerModel Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaCountVectorizerExample.java %}

Refer to the CountVectorizer Python docs and the CountVectorizerModel Python docs for more details on the API.

{% include_example python/ml/count_vectorizer_example.py %}

## FeatureHasher

Feature hashing projects a set of categorical or numerical features into a feature vector of specified dimension (typically substantially smaller than that of the original feature space). This is done using the hashing trick to map features to indices in the feature vector.

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:

* Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. Numeric features are never treated as categorical, even when they are integers. You must explicitly convert numeric columns containing categorical features to strings first.
* String columns: For categorical features, the hash value of the string "column_name=value" is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are "one-hot" encoded (similarly to using OneHotEncoder with dropLast=false).
* Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as "column_name=true" or "column_name=false", with an indicator value of 1.0.

Null (missing) values are ignored (implicitly zero in the resulting feature vector).

The hash function used here is also the MurmurHash 3 used in HashingTF. Since a simple modulo on the hashed value is used to determine the vector index, it is advisable to use a power of two as the numFeatures parameter; otherwise the features will not be mapped evenly to the vector indices.

**Examples**

Assume that we have a DataFrame with 4 input columns real, bool, stringNum, and string. These different input data types illustrate the behavior of the transform, which produces a column of feature vectors.

real| bool|stringNum|string
----|-----|---------|------
 2.2| true|        1|   foo
 3.3|false|        2|   bar
 4.4|false|        3|   baz
 5.5|false|        4|   foo

Then the output of FeatureHasher.transform on this DataFrame is:

real|bool |stringNum|string|features
----|-----|---------|------|-------------------------------------------------------
2.2 |true |1        |foo   |(262144,[51871, 63643,174475,253195],[1.0,1.0,2.2,1.0])
3.3 |false|2        |bar   |(262144,[6031,  80619,140467,174475],[1.0,1.0,1.0,3.3])
4.4 |false|3        |baz   |(262144,[24279,140467,174475,196810],[1.0,1.0,4.4,1.0])
5.5 |false|4        |foo   |(262144,[63643,140467,168512,174475],[1.0,1.0,1.0,5.5])

The resulting feature vectors could then be passed to a learning algorithm.
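
A minimal PySpark sketch of this example; note that stringNum is created as a string column so it is hashed as a categorical feature:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import FeatureHasher

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (2.2, True, "1", "foo"),
    (3.3, False, "2", "bar"),
    (4.4, False, "3", "baz"),
    (5.5, False, "4", "foo")
], ["real", "bool", "stringNum", "string"])

# Hash all four columns into a single sparse vector of the default
# dimension 2^18 = 262144.
hasher = FeatureHasher(inputCols=["real", "bool", "stringNum", "string"],
                       outputCol="features")
hasher.transform(df).show(truncate=False)
```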

Refer to the FeatureHasher Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/FeatureHasherExample.scala %}

Refer to the FeatureHasher Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaFeatureHasherExample.java %}

Refer to the FeatureHasher Python docs for more details on the API.

{% include_example python/ml/feature_hasher_example.py %}

# Feature Transformers

## Tokenizer

Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality. The example below shows how to split sentences into sequences of words.

RegexTokenizer allows more advanced tokenization based on regular expression (regex) matching. By default, the parameter "pattern" (regex, default: "\\s+") is used as a delimiter to split the input text. Alternatively, users can set the parameter "gaps" to false, indicating that the regex "pattern" denotes "tokens" rather than splitting gaps; all matching occurrences then become the tokenization result.
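
A minimal PySpark sketch contrasting the two tokenizers (the sentence and the \w+ pattern are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, RegexTokenizer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(0, "Hi I heard about Spark")], ["id", "sentence"])

# Simple tokenization: lowercase and split on whitespace.
Tokenizer(inputCol="sentence", outputCol="words").transform(df) \
    .show(truncate=False)

# gaps=False: the pattern matches the tokens themselves (runs of word
# characters) rather than the gaps between them.
regexTokenizer = RegexTokenizer(inputCol="sentence", outputCol="words",
                                pattern="\\w+", gaps=False)
regexTokenizer.transform(df).show(truncate=False)
```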

**Examples**

Refer to the Tokenizer Scala docs and the RegexTokenizer Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/TokenizerExample.scala %}

Refer to the Tokenizer Java docs and the RegexTokenizer Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaTokenizerExample.java %}

Refer to the Tokenizer Python docs and the RegexTokenizer Python docs for more details on the API.

{% include_example python/ml/tokenizer_example.py %}

## StopWordsRemover

Stop words are words which should be excluded from the input, typically because the words appear frequently and don't carry as much meaning.

StopWordsRemover takes as input a sequence of strings (e.g. the output of a Tokenizer) and drops all the stop words from the input sequences. The list of stop words is specified by the stopWords parameter. Default stop words for some languages are accessible by calling StopWordsRemover.loadDefaultStopWords(language), for which available options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian", "italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish". A boolean parameter caseSensitive indicates whether the matches should be case sensitive (false by default).

**Examples**

Assume that we have the following DataFrame with columns id and raw:

 id | raw
----|----------
 0  | [I, saw, the, red, balloon]
 1  | [Mary, had, a, little, lamb]

Applying StopWordsRemover with raw as the input column and filtered as the output column, we should get the following:

 id | raw                         | filtered
----|-----------------------------|--------------------
 0  | [I, saw, the, red, balloon] | [saw, red, balloon]
 1  | [Mary, had, a, little, lamb]|[Mary, little, lamb]

In filtered, the stop words "I", "the", "had", and "a" have been filtered out.
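
A minimal PySpark sketch of this example, relying on the default English stop word list:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StopWordsRemover

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0, ["I", "saw", "the", "red", "balloon"]),
    (1, ["Mary", "had", "a", "little", "lamb"])
], ["id", "raw"])

# caseSensitive is False by default, so "I" matches the stop word "i".
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(df).show(truncate=False)
```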

Refer to the StopWordsRemover Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/StopWordsRemoverExample.scala %}

Refer to the StopWordsRemover Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaStopWordsRemoverExample.java %}

Refer to the StopWordsRemover Python docs for more details on the API.

{% include_example python/ml/stopwords_remover_example.py %}

## n-gram

An n-gram is a sequence of n tokens (typically words) for some integer n. The NGram class can be used to transform input features into n-grams.

NGram takes as input a sequence of strings (e.g. the output of a Tokenizer). The parameter n is used to determine the number of terms in each n-gram. The output will consist of a sequence of n-grams where each n-gram is represented by a space-delimited string of n consecutive words. If the input sequence contains fewer than n strings, no output is produced.
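
A minimal PySpark sketch (n=2 produces bigrams from a toy word sequence):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import NGram

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0, ["Hi", "I", "heard", "about", "Spark"])
], ["id", "words"])

# Each output element is a space-delimited string of 2 consecutive words,
# e.g. "Hi I", "I heard", ...
ngram = NGram(n=2, inputCol="words", outputCol="ngrams")
ngram.transform(df).show(truncate=False)
```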

**Examples**

Refer to the NGram Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/NGramExample.scala %}

Refer to the NGram Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaNGramExample.java %}

Refer to the NGram Python docs for more details on the API.

{% include_example python/ml/n_gram_example.py %}

## Binarizer

Binarization is the process of thresholding numerical features to binary (0/1) features.

Binarizer takes the common parameters inputCol and outputCol, as well as the threshold for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported for inputCol.
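
A minimal PySpark sketch (the threshold of 0.5 is an illustrative choice):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Binarizer

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(0, 0.1), (1, 0.8), (2, 0.5)], ["id", "feature"])

# Values > 0.5 become 1.0; values <= 0.5 (including 0.5 itself) become 0.0.
binarizer = Binarizer(threshold=0.5, inputCol="feature", outputCol="binarized")
binarizer.transform(df).show()
```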

**Examples**

Refer to the Binarizer Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/BinarizerExample.scala %}

Refer to the Binarizer Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaBinarizerExample.java %}

Refer to the Binarizer Python docs for more details on the API.

{% include_example python/ml/binarizer_example.py %}

## PCA

PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A PCA class trains a model to project vectors to a low-dimensional space using PCA. The example below shows how to project 5-dimensional feature vectors into 3-dimensional principal components.
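
A minimal PySpark sketch of that projection (the three 5-dimensional toy vectors are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (Vectors.sparse(5, [(1, 1.0), (3, 7.0)]),),
    (Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),),
    (Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0),)
], ["features"])

# k=3 keeps the first three principal components.
pca = PCA(k=3, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(df)
model.transform(df).select("pcaFeatures").show(truncate=False)
```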

**Examples**

Refer to the PCA Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/PCAExample.scala %}

Refer to the PCA Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaPCAExample.java %}

Refer to the PCA Python docs for more details on the API.

{% include_example python/ml/pca_example.py %}

## PolynomialExpansion

Polynomial expansion is the process of expanding your features into a polynomial space, which is formulated by an n-degree combination of original dimensions. A PolynomialExpansion class provides this functionality. The example below shows how to expand your features into a 3-degree polynomial space.
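
A minimal PySpark sketch; with 2-dimensional input and degree=3, each vector expands into the 9 monomials of degree at most 3 (the constant term is excluded):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import PolynomialExpansion
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (Vectors.dense(2.0, 1.0),),
    (Vectors.dense(3.0, -1.0),)
], ["features"])

# Expand each 2-dimensional vector into the 3-degree polynomial space.
polyExpansion = PolynomialExpansion(degree=3, inputCol="features",
                                    outputCol="polyFeatures")
polyExpansion.transform(df).show(truncate=False)
```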

**Examples**

Refer to the PolynomialExpansion Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/PolynomialExpansionExample.scala %}

Refer to the PolynomialExpansion Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaPolynomialExpansionExample.java %}

Refer to the PolynomialExpansion Python docs for more details on the API.

{% include_example python/ml/polynomial_expansion_example.py %}

## Discrete Cosine Transform (DCT)

The Discrete Cosine Transform transforms a length $N$ real-valued sequence in the time domain into another length $N$ real-valued sequence in the frequency domain. A DCT class provides this functionality, implementing the DCT-II and scaling the result by $1/\sqrt{2}$ such that the representing matrix for the transform is unitary. No shift is applied to the transformed sequence (e.g. the $0$th element of the transformed sequence is the $0$th DCT coefficient and not the $N/2$th).
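
A minimal PySpark sketch (the input vectors are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import DCT
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (Vectors.dense(0.0, 1.0, -2.0, 3.0),),
    (Vectors.dense(-1.0, 2.0, 4.0, -7.0),)
], ["features"])

# inverse=False applies the forward DCT-II described above.
dct = DCT(inverse=False, inputCol="features", outputCol="featuresDCT")
dct.transform(df).select("featuresDCT").show(truncate=False)
```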

**Examples**

Refer to the DCT Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/DCTExample.scala %}

Refer to the DCT Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaDCTExample.java %}

Refer to the DCT Python docs for more details on the API.

{% include_example python/ml/dct_example.py %}

## StringIndexer

StringIndexer encodes a string column of labels to a column of label indices. The indices are in [0, numLabels), ordered by label frequencies, so the most frequent label gets index 0. Unseen labels will be put at index numLabels if the user chooses to keep them. If the input column is numeric, we cast it to string and index the string values. When downstream pipeline components such as Estimator or Transformer make use of this string-indexed label, you must set the input column of the component to this string-indexed column name. In many cases, you can set the input column with setInputCol.

**Examples**

Assume that we have the following DataFrame with columns id and category:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | a
 4  | a
 5  | c

category is a string column with three labels: "a", "b", and "c". Applying StringIndexer with category as the input column and categoryIndex as the output column, we should get the following:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | a        | 0.0
 4  | a        | 0.0
 5  | c        | 1.0

"a" gets index 0 because it is the most frequent, followed by "c" with index 1 and "b" with index 2.

Additionally, there are three strategies regarding how StringIndexer will handle unseen labels when you have fit a StringIndexer on one dataset and then use it to transform another:

* throw an exception (which is the default)
* skip the row containing the unseen label entirely
* put unseen labels in a special additional bucket, at index numLabels

**Examples**

Let's go back to our previous example but this time reuse our previously defined StringIndexer on the following dataset:

 id | category
----|----------
 0  | a
 1  | b
 2  | c
 3  | d
 4  | e

If you have not set how StringIndexer handles unseen labels, or have set it to "error", an exception will be thrown. However, if you call setHandleInvalid("skip"), the following dataset is generated:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0

Notice that the rows containing "d" or "e" do not appear.

If you call setHandleInvalid("keep"), the following dataset is generated:

 id | category | categoryIndex
----|----------|---------------
 0  | a        | 0.0
 1  | b        | 2.0
 2  | c        | 1.0
 3  | d        | 3.0
 4  | e        | 3.0

Notice that the rows containing "d" or "e" are mapped to index 3.0, the special bucket for unseen labels.
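
A minimal PySpark sketch of the "keep" strategy on the two datasets above:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

train = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])
test = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")],
    ["id", "category"])

# handleInvalid="keep" sends unseen labels ("d", "e") to the extra
# bucket at index numLabels = 3.
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="keep")
indexer.fit(train).transform(test).show()
```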

Refer to the StringIndexer Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/StringIndexerExample.scala %}

Refer to the StringIndexer Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaStringIndexerExample.java %}

Refer to the StringIndexer Python docs for more details on the API.

{% include_example python/ml/string_indexer_example.py %}

## IndexToString

Symmetrically to StringIndexer, IndexToString maps a column of label indices back to a column containing the original labels as strings. A common use case is to produce indices from labels with StringIndexer, train a model with those indices, and retrieve the original labels from the column of predicted indices with IndexToString. However, you are free to supply your own labels.

**Examples**

Building on the StringIndexer example, let's assume we have the following DataFrame with columns id and categoryIndex:

 id | categoryIndex
----|---------------
 0  | 0.0
 1  | 2.0
 2  | 1.0
 3  | 0.0
 4  | 0.0
 5  | 1.0

Applying IndexToString with categoryIndex as the input column and originalCategory as the output column, we are able to retrieve our original labels (they will be inferred from the column's metadata):

 id | categoryIndex | originalCategory
----|---------------|-----------------
 0  | 0.0           | a
 1  | 2.0           | b
 2  | 1.0           | c
 3  | 0.0           | a
 4  | 0.0           | a
 5  | 1.0           | c
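
A minimal PySpark sketch of the round trip; the labels are recovered from the metadata that the StringIndexerModel attaches to categoryIndex:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, IndexToString

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

# Index the labels; the model writes the label list into column metadata.
indexed = StringIndexer(inputCol="category", outputCol="categoryIndex") \
    .fit(df).transform(df)

# IndexToString reads the labels back from that metadata.
converter = IndexToString(inputCol="categoryIndex", outputCol="originalCategory")
converter.transform(indexed).select("id", "categoryIndex", "originalCategory").show()
```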

Refer to the IndexToString Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/IndexToStringExample.scala %}

Refer to the IndexToString Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaIndexToStringExample.java %}

Refer to the IndexToString Python docs for more details on the API.

{% include_example python/ml/index_to_string_example.py %}

## OneHotEncoder

One-hot encoding maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features.
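
A minimal PySpark sketch; the string labels are first indexed with StringIndexer and then encoded. This assumes the Spark 2.x API, where OneHotEncoder is a Transformer applied directly (dropLast is true by default, so the last category maps to the all-zeros vector):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

indexed = StringIndexer(inputCol="category", outputCol="categoryIndex") \
    .fit(df).transform(df)

# Map each label index to a sparse binary vector.
encoder = OneHotEncoder(inputCol="categoryIndex", outputCol="categoryVec")
encoder.transform(indexed).show()
```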

**Examples**

Refer to the OneHotEncoder Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala %}

Refer to the OneHotEncoder Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java %}

Refer to the OneHotEncoder Python docs for more details on the API.

{% include_example python/ml/onehot_encoder_example.py %}

## VectorIndexer

VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically decide which features are categorical and convert original values to category indices. Specifically, it does the following:

  1. Take an input column of type Vector and a parameter maxCategories.
  2. Decide which features should be categorical based on the number of distinct values, where features with at most maxCategories distinct values are declared categorical.
  3. Compute 0-based category indices for each categorical feature.
  4. Index categorical features and transform original feature values to indices.

Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.
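
A minimal PySpark sketch (the libsvm path refers to the sample data shipped with Spark, and maxCategories=10 is an illustrative threshold):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorIndexer

spark = SparkSession.builder.getOrCreate()

data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# Features with at most 10 distinct values are treated as categorical
# and re-encoded as 0-based category indices; the rest pass through.
indexer = VectorIndexer(inputCol="features", outputCol="indexed",
                        maxCategories=10)
model = indexer.fit(data)
model.transform(data).show(truncate=False)
```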

**Examples**

In the example below, we read in a dataset of labeled points and then use VectorIndexer to decide which features should be treated as categorical. We transform the categorical feature values to their indices. This transformed data could then be passed to algorithms such as DecisionTreeRegressor that handle categorical features.

Refer to the VectorIndexer Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/VectorIndexerExample.scala %}

Refer to the VectorIndexer Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaVectorIndexerExample.java %}

Refer to the VectorIndexer Python docs for more details on the API.

{% include_example python/ml/vector_indexer_example.py %}

## Interaction

Interaction is a Transformer which takes vector or double-valued columns, and generates a single vector column that contains the product of all combinations of one value from each input column.

For example, if you have 2 vector type columns each of which has 3 dimensions as input columns, then you'll get a 9-dimensional vector as the output column.

**Examples**

Assume that we have the following DataFrame with the columns "id1", "vec1", and "vec2":

  id1|vec1          |vec2          
  ---|--------------|--------------
  1  |[1.0,2.0,3.0] |[8.0,4.0,5.0] 
  2  |[4.0,3.0,8.0] |[7.0,9.0,8.0] 
  3  |[6.0,1.0,9.0] |[2.0,3.0,6.0] 
  4  |[10.0,8.0,6.0]|[9.0,4.0,5.0] 
  5  |[9.0,2.0,7.0] |[10.0,7.0,3.0]
  6  |[1.0,1.0,4.0] |[2.0,8.0,4.0]     

Applying Interaction with those input columns, then interactedCol as the output column contains:

  id1|vec1          |vec2          |interactedCol                                         
  ---|--------------|--------------|------------------------------------------------------
  1  |[1.0,2.0,3.0] |[8.0,4.0,5.0] |[8.0,4.0,5.0,16.0,8.0,10.0,24.0,12.0,15.0]            
  2  |[4.0,3.0,8.0] |[7.0,9.0,8.0] |[56.0,72.0,64.0,42.0,54.0,48.0,112.0,144.0,128.0]     
  3  |[6.0,1.0,9.0] |[2.0,3.0,6.0] |[36.0,54.0,108.0,6.0,9.0,18.0,54.0,81.0,162.0]        
  4  |[10.0,8.0,6.0]|[9.0,4.0,5.0] |[360.0,160.0,200.0,288.0,128.0,160.0,216.0,96.0,120.0]
  5  |[9.0,2.0,7.0] |[10.0,7.0,3.0]|[450.0,315.0,135.0,100.0,70.0,30.0,350.0,245.0,105.0] 
  6  |[1.0,1.0,4.0] |[2.0,8.0,4.0] |[12.0,48.0,24.0,12.0,48.0,24.0,48.0,192.0,96.0]       

Refer to the Interaction Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/InteractionExample.scala %}

Refer to the Interaction Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaInteractionExample.java %}

## Normalizer

Normalizer is a Transformer which transforms a dataset of Vector rows, normalizing each Vector to have unit norm. It takes a parameter p, which specifies the p-norm used for normalization ($p = 2$ by default). This normalization can help standardize your input data and improve the behavior of learning algorithms.

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then normalize each row to have unit $L^1$ norm and unit $L^\infty$ norm.
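
A minimal PySpark sketch of both normalizations (the vectors are illustrative; p can also be overridden at transform time):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Normalizer
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0, Vectors.dense(1.0, 0.5, -1.0)),
    (1, Vectors.dense(2.0, 1.0, 1.0))
], ["id", "features"])

# p=1.0: divide each vector by its L^1 norm.
normalizer = Normalizer(inputCol="features", outputCol="normFeatures", p=1.0)
normalizer.transform(df).show(truncate=False)

# Override p at transform time to normalize by the L^inf norm instead.
normalizer.transform(df, {normalizer.p: float("inf")}).show(truncate=False)
```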

Refer to the Normalizer Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/NormalizerExample.scala %}

Refer to the Normalizer Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaNormalizerExample.java %}

Refer to the Normalizer Python docs for more details on the API.

{% include_example python/ml/normalizer_example.py %}

## StandardScaler

StandardScaler transforms a dataset of Vector rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:

* withStd: True by default. Scales the data to unit standard deviation.
* withMean: False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.

StandardScaler is an Estimator which can be fit on a dataset to produce a StandardScalerModel; this amounts to computing summary statistics. The model can then transform a Vector column in a dataset to have unit standard deviation and/or zero mean features.

Note that if the standard deviation of a feature is zero, it will return a default value of 0.0 in the Vector for that feature.
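
A minimal PySpark sketch (the libsvm path refers to the sample data shipped with Spark; withMean is left false so sparse input stays sparse):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StandardScaler

spark = SparkSession.builder.getOrCreate()

data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

# fit() computes the per-feature summary statistics; transform() scales
# each feature to unit standard deviation.
scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures",
                        withStd=True, withMean=False)
model = scaler.fit(data)
model.transform(data).show(truncate=False)
```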

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.

Refer to the StandardScaler Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/StandardScalerExample.scala %}

Refer to the StandardScaler Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaStandardScalerExample.java %}

Refer to the StandardScaler Python docs for more details on the API.

{% include_example python/ml/standard_scaler_example.py %}

## MinMaxScaler

MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:

* min: 0.0 by default. Lower bound after transformation, shared by all features.
* max: 1.0 by default. Upper bound after transformation, shared by all features.

MinMaxScaler computes summary statistics on a data set and produces a MinMaxScalerModel. The model can then transform each feature individually such that it is in the given range.

The rescaled value for a feature E is calculated as
\begin{equation}
  Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
\end{equation}
For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.

Note that since zero values will probably be transformed to non-zero values, output of the transformer will be DenseVector even for sparse input.
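
A minimal PySpark sketch using the default range [0, 1] (the toy vectors are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import MinMaxScaler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0, Vectors.dense(1.0, 0.1, -1.0)),
    (1, Vectors.dense(2.0, 1.1, 1.0)),
    (2, Vectors.dense(3.0, 10.1, 3.0))
], ["id", "features"])

# fit() finds the per-feature min/max; transform() rescales to [0, 1].
scaler = MinMaxScaler(inputCol="features", outputCol="scaledFeatures")
model = scaler.fit(df)
model.transform(df).show(truncate=False)
```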

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [0, 1].

Refer to the MinMaxScaler Scala docs and the MinMaxScalerModel Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/MinMaxScalerExample.scala %}

Refer to the MinMaxScaler Java docs and the MinMaxScalerModel Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaMinMaxScalerExample.java %}

Refer to the MinMaxScaler Python docs and the MinMaxScalerModel Python docs for more details on the API.

{% include_example python/ml/min_max_scaler_example.py %}

## MaxAbsScaler

MaxAbsScaler transforms a dataset of Vector rows, rescaling each feature to the range [-1, 1] by dividing through by the maximum absolute value of each feature. It does not shift/center the data, and thus does not destroy any sparsity.

MaxAbsScaler computes summary statistics on a data set and produces a MaxAbsScalerModel. The model can then transform each feature individually to range [-1, 1].
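
A minimal PySpark sketch (the toy vectors are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import MaxAbsScaler
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (0, Vectors.dense(1.0, 0.1, -8.0)),
    (1, Vectors.dense(2.0, 1.0, -4.0)),
    (2, Vectors.dense(4.0, 10.0, 8.0))
], ["id", "features"])

# Each feature is divided by its maximum absolute value (here 4.0, 10.0,
# and 8.0), so all outputs land in [-1, 1] without destroying sparsity.
scaler = MaxAbsScaler(inputCol="features", outputCol="scaledFeatures")
model = scaler.fit(df)
model.transform(df).show()
```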

**Examples**

The following example demonstrates how to load a dataset in libsvm format and then rescale each feature to [-1, 1].

Refer to the MaxAbsScaler Scala docs and the MaxAbsScalerModel Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/MaxAbsScalerExample.scala %}

Refer to the MaxAbsScaler Java docs and the MaxAbsScalerModel Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaMaxAbsScalerExample.java %}

Refer to the MaxAbsScaler Python docs and the MaxAbsScalerModel Python docs for more details on the API.

{% include_example python/ml/max_abs_scaler_example.py %}

## Bucketizer

Bucketizer transforms a column of continuous features to a column of feature buckets, where the buckets are specified by users. It takes a parameter:

* splits: Parameter for mapping continuous features into buckets. With n+1 splits, there are n buckets. A bucket defined by splits x,y holds values in the range [x,y) except the last bucket, which also includes y. Splits should be strictly increasing. Values at -inf, inf must be explicitly provided to cover all Double values; otherwise, values outside the splits specified will be treated as errors. Two examples of splits are Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity) and Array(0.0, 1.0, 2.0). A sketch follows after these notes.

Note that if you have no idea of the upper and lower bounds of the targeted column, you should add Double.NegativeInfinity and Double.PositiveInfinity as the bounds of your splits to prevent a potential out of Bucketizer bounds exception.

Note also that the splits that you provided have to be in strictly increasing order, i.e. s0 < s1 < s2 < ... < sn.
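
A minimal PySpark sketch (the splits and values are illustrative; the infinite outer bounds cover all Doubles):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Bucketizer

spark = SparkSession.builder.getOrCreate()

# 5 split points define 4 buckets:
# [-inf, -0.5), [-0.5, 0.0), [0.0, 0.5), [0.5, inf].
splits = [-float("inf"), -0.5, 0.0, 0.5, float("inf")]

df = spark.createDataFrame(
    [(-999.9,), (-0.5,), (-0.3,), (0.0,), (0.2,), (999.9,)], ["features"])

bucketizer = Bucketizer(splits=splits, inputCol="features",
                        outputCol="bucketedFeatures")
bucketizer.transform(df).show()
```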

More details can be found in the API docs for Bucketizer.

**Examples**

The following example demonstrates how to bucketize a column of Doubles into a column of bucket indices.

Refer to the Bucketizer Scala docs for more details on the API.

{% include_example scala/org/apache/spark/examples/ml/BucketizerExample.scala %}

Refer to the Bucketizer Java docs for more details on the API.

{% include_example java/org/apache/spark/examples/ml/JavaBucketizerExample.java %}

Refer to the Bucketizer Python docs for more details on the API.

{% include_example python/ml/bucketizer_example.py %}

## ElementwiseProduct

ElementwiseProduct multiplies each input vector by a provided "weight" vector, using element-wise multiplication. In other words, it scales each column of the dataset by a scalar multiplier. This represents the Hadamard product between the input vector, v, and transforming vector, w, to yield a result vector.

\[ \begin{pmatrix} v_1 \\ \vdots \\ v_N \end{pmatrix} \circ \begin{pmatrix} w_1 \\ \vdots \\ w_N \end{pmatrix} = \begin{pmatrix} v_1 w_1 \\ \vdots \\ v_N w_N \end{pmatrix} \]
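
A minimal PySpark sketch (the weight vector w = (0.0, 1.0, 2.0) and the inputs are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import ElementwiseProduct
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (Vectors.dense(1.0, 2.0, 3.0),),
    (Vectors.dense(4.0, 5.0, 6.0),)
], ["vector"])

# Multiply each input vector element-wise by the fixed weight vector w.
transformer = ElementwiseProduct(scalingVec=Vectors.dense(0.0, 1.0, 2.0),
                                 inputCol="vector", outputCol="transformedVector")
transformer.transform(df).show()
```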