[SPARK-20619][ML] StringIndexer supports multiple ways to order label
## What changes were proposed in this pull request?

StringIndexer maps labels to numbers according to the descending order of label frequency. Other types of ordering (e.g., alphabetical) may be needed in feature ETL. For example, the ordering will affect the result in one-hot encoding and RFormula.

This PR proposes to support other ordering methods and we add a parameter `stringOrderType` that supports the following four options:

- 'frequencyDesc': descending order by label frequency (most frequent label assigned 0)
- 'frequencyAsc': ascending order by label frequency (least frequent label assigned 0)
- 'alphabetDesc': descending alphabetical order
- 'alphabetAsc': ascending alphabetical order

The default is still descending order of label frequency, so there should be no impact to existing programs.

## How was this patch tested?

new test

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17879 from actuaryzhang/stringIndexer.
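The four `stringOrderType` options described above can be illustrated with a small, Spark-free sketch. This is not the code from this patch: the object `LabelOrderingSketch`, the helper `orderLabels`, and the alphabetical tie-breaking for equal frequencies are assumptions made purely for illustration. The position of a label in the returned array stands for the index it would be assigned.

```scala
// Illustrative sketch only; not the implementation in StringIndexer.scala.
object LabelOrderingSketch {

  // Hypothetical helper: returns the distinct labels in index order,
  // i.e. result(0) is the label that would be mapped to 0.
  def orderLabels(values: Seq[String], stringOrderType: String): Array[String] = {
    // Count how often each label occurs.
    val counts: Seq[(String, Long)] =
      values.groupBy(identity).map { case (label, vs) => label -> vs.size.toLong }.toSeq

    stringOrderType match {
      // Most frequent label first. Ties are broken alphabetically, which is
      // an assumption of this sketch rather than documented behavior.
      case "frequencyDesc" =>
        counts.sortBy { case (label, n) => (-n, label) }.map(_._1).toArray
      // Least frequent label first (same tie-breaking assumption).
      case "frequencyAsc" =>
        counts.sortBy { case (label, n) => (n, label) }.map(_._1).toArray
      // Descending alphabetical order.
      case "alphabetDesc" =>
        counts.map(_._1).sortWith(_ > _).toArray
      // Ascending alphabetical order.
      case "alphabetAsc" =>
        counts.map(_._1).sortWith(_ < _).toArray
      case other =>
        throw new IllegalArgumentException(s"Unsupported stringOrderType: $other")
    }
  }
}
```

For the labels `Seq("a", "b", "c", "a", "a", "c")`, where "a" occurs three times, "c" twice, and "b" once, this sketch yields `a, c, b` for 'frequencyDesc', `b, c, a` for 'frequencyAsc', `c, b, a` for 'alphabetDesc', and `a, b, c` for 'alphabetAsc'. The differing index assignments are what change the downstream one-hot encoding and RFormula results mentioned above.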
Showing 2 changed files:
- mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala (48 additions, 7 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/StringIndexerSuite.scala (23 additions, 0 deletions)