Commit fac5b75b authored by Yuhao, committed by Sean Owen

[SPARK-18374][ML] Incorrect words in StopWords/english.txt

## What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the fragments left after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes, so these fragments never match the tokens they actually produce.

Add the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. Also remove "won" from the list.
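
The mismatch described above can be illustrated with a minimal plain-Scala sketch. It does not use Spark: `StopWordsSketch`, `tokenize`, `removeStopWords` and the two small word sets are hypothetical stand-ins, assuming a tokenizer that lowercases and splits on whitespace only and a remover that does exact matching.

```scala
// Plain-Scala sketch with no Spark dependency. All names are hypothetical
// and only approximate the behavior described in this pull request.
object StopWordsSketch {
  // Lowercase and split on whitespace only, so apostrophes stay inside tokens.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("\\s+").filter(_.nonEmpty).toList

  // Drop every token that exactly matches an entry in the stop word set.
  def removeStopWords(tokens: Seq[String], stopWords: Set[String]): Seq[String] =
    tokens.filterNot(stopWords.contains)

  // Tiny excerpts standing in for the old list (fragments) and the new list.
  val fragmentsOnly: Set[String] = Set("i", "wouldn", "t", "won")
  val withContractions: Set[String] = Set("i", "wouldn't", "won't")

  def main(args: Array[String]): Unit = {
    val tokens = tokenize("I wouldn't say we won")
    // Old-style list: "wouldn't" survives, while the real word "won" is dropped.
    println(removeStopWords(tokens, fragmentsOnly))
    // New-style list: "wouldn't" is removed and "won" is kept.
    println(removeStopWords(tokens, withContractions))
  }
}
```

With the fragment-only set, the contraction passes through untouched and the ordinary word "won" is filtered out by accident, which is the pair of problems this change addresses.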

See more discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18374

## How was this patch tested?
Existing unit tests.

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #16103 from hhbyyh/addstopwords.
parent 1ef6b296
@@ -125,29 +125,57 @@ just
 don
 should
 now
-d
-ll
-m
-o
-re
-ve
-y
-ain
-aren
-couldn
-didn
-doesn
-hadn
-hasn
-haven
-isn
-ma
-mightn
-mustn
-needn
-shan
-shouldn
-wasn
-weren
-won
-wouldn
+i'll
+you'll
+he'll
+she'll
+we'll
+they'll
+i'd
+you'd
+he'd
+she'd
+we'd
+they'd
+i'm
+you're
+he's
+she's
+it's
+we're
+they're
+i've
+we've
+you've
+they've
+isn't
+aren't
+wasn't
+weren't
+haven't
+hasn't
+hadn't
+don't
+doesn't
+didn't
+won't
+wouldn't
+shan't
+shouldn't
+mustn't
+can't
+couldn't
+cannot
+could
+here's
+how's
+let's
+ought
+that's
+there's
+what's
+when's
+where's
+who's
+why's
+would
\ No newline at end of file
@@ -45,7 +45,7 @@ class StopWordsRemoverSuite
       .setOutputCol("filtered")
     val dataSet = Seq(
       (Seq("test", "test"), Seq("test", "test")),
-      (Seq("a", "b", "c", "d"), Seq("b", "c")),
+      (Seq("a", "b", "c", "d"), Seq("b", "c", "d")),
       (Seq("a", "the", "an"), Seq()),
       (Seq("A", "The", "AN"), Seq()),
       (Seq(null), Seq(null)),