[SPARK-18374][ML] Incorrect words in StopWords/english.txt
## What changes were proposed in this pull request?

Currently, the English stop words list in MLlib contains only the fragments left after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". However, by default Tokenizer and RegexTokenizer do not split on apostrophes or quotes, so a token such as "wouldn't" matches neither fragment. This change adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. It also removes "won" from the list.

See more discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18374

## How was this patch tested?

Existing unit tests.

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #16103 from hhbyyh/addstopwords.
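The mismatch described above can be reproduced without Spark. The following is a minimal sketch, assuming whitespace-only tokenization and exact matching on lowercased tokens. The helpers `tokenize` and `remove_stop_words` are hypothetical stand-ins, not Spark's API, and the two word lists are short illustrative excerpts rather than the real contents of english.txt.

```python
# Hypothetical stand-ins that only mimic the behavior described in the
# commit message; they are not Spark's Tokenizer or StopWordsRemover.

def tokenize(text):
    # Lowercase and split on whitespace only, so apostrophes stay inside tokens.
    return text.lower().split()


def remove_stop_words(tokens, stop_words):
    # Drop every token that appears verbatim in the stop word list.
    stop = set(stop_words)
    return [t for t in tokens if t not in stop]


# Illustrative excerpts (assumptions), not the actual file contents.
# Old style: apostrophes stripped, leaving fragments; "won" is listed.
old_list = ["i", "wouldn", "t", "won"]
# New style: original form added, "won" removed. Whether the fragments
# stay in the real list is not stated in the commit message.
new_list = ["i", "wouldn't"]

tokens = tokenize("I wouldn't say I won")
old_result = remove_stop_words(tokens, old_list)
new_result = remove_stop_words(tokens, new_list)

print(old_result)  # "wouldn't" survives, the real word "won" is dropped
print(new_result)  # "wouldn't" is filtered, "won" is kept
```

With the old-style list, the contraction slips through while the ordinary word "won" is removed; the new-style list reverses both outcomes.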
Showing
- mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt (54 additions, 26 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala (1 addition, 1 deletion)