Skip to content
Snippets Groups Projects
Commit a58f4023 authored by Yuhao Yang's avatar Yuhao Yang Committed by Xiangrui Meng
Browse files

[SPARK-16045][ML][DOC] Spark 2.0 ML.feature: doc update for stopwords and binarizer

## What changes were proposed in this pull request?

jira: https://issues.apache.org/jira/browse/SPARK-16045
2.0 Audit: Update document for StopWordsRemover and Binarizer.

## How was this patch tested?

manual review for doc

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #13375 from hhbyyh/stopdoc.
parent 37494a18
No related branches found
No related tags found
No related merge requests found
......@@ -251,11 +251,12 @@ frequently and don't carry as much meaning.
`StopWordsRemover` takes as input a sequence of strings (e.g. the output
of a [Tokenizer](ml-features.html#tokenizer)) and drops all the stop
words from the input sequences. The list of stopwords is specified by
the `stopWords` parameter. We provide [a list of stop
words](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) by
default, accessible by calling `getStopWords` on a newly instantiated
`StopWordsRemover` instance. A boolean parameter `caseSensitive` indicates
if the matches should be case sensitive (false by default).
the `stopWords` parameter. Default stop words for some languages are accessible
by calling `StopWordsRemover.loadDefaultStopWords(language)`, for which available
options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
"italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish".
A boolean parameter `caseSensitive` indicates if the matches should be case sensitive
(false by default).
**Examples**
......@@ -346,7 +347,10 @@ for more details on the API.
Binarization is the process of thresholding numerical features to binary (0/1) features.
`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold` for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold`
for binarization. Feature values greater than the threshold are binarized to 1.0; values equal
to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported
for `inputCol`.
<div class="codetabs">
<div data-lang="scala" markdown="1">
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment