[SPARK-14659][ML] RFormula consistent with R when handling strings
## What changes were proposed in this pull request?

When handling strings, RFormula and R drop different categories:

- RFormula drops the least frequent level
- R drops the first level after ascending alphabetical ordering

This PR uses the string ordering types supported in StringIndexer (#17879) so that RFormula can drop the same level as R when handling strings, using `stringOrderType = "alphabetDesc"`.

## How was this patch tested?

New tests.

Author: Wayne Zhang <actuaryzhang@uber.com>

Closes #17967 from actuaryzhang/RFormula.
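To make the difference between the two rules concrete, here is a minimal plain-Scala sketch (no Spark dependency) of which level ends up dropped under each ordering. It assumes that the level placed last by the ordering is the one dropped, which matches the description above: the least frequent level is last under a descending-frequency ordering, and the alphabetically first level is last under `alphabetDesc`. The object and helper names (`DroppedLevelSketch`, `frequencyDesc`, `alphabetDesc`, `dropped`) are hypothetical and are not taken from the patch; the alphabetical tie-break in `frequencyDesc` is also an assumption made only to keep the sketch deterministic.

```scala
object DroppedLevelSketch {
  // Order distinct levels by descending frequency.
  // Ties are broken alphabetically (an assumption for determinism only).
  def frequencyDesc(values: Seq[String]): Seq[String] =
    values.groupBy(identity).toSeq
      .sortBy { case (level, occurrences) => (-occurrences.size, level) }
      .map(_._1)

  // Order distinct levels in descending alphabetical order.
  def alphabetDesc(values: Seq[String]): Seq[String] =
    values.distinct.sorted.reverse

  // Assumption: the last level in the ordering is the one that gets dropped.
  def dropped(orderedLevels: Seq[String]): String = orderedLevels.last

  def main(args: Array[String]): Unit = {
    // "b" occurs 3 times, "a" twice, "c" once.
    val column = Seq("b", "b", "b", "a", "a", "c")
    println(s"frequency-based ordering drops: ${dropped(frequencyDesc(column))}")
    println(s"alphabetDesc ordering drops:    ${dropped(alphabetDesc(column))}")
  }
}
```

On this toy column the frequency-based rule drops `c` (the least frequent level), while `alphabetDesc` drops `a` (the first level in ascending alphabetical order, as R does).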
Changed files:
- mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala (43 additions, 1 deletion)
- mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala (2 additions, 2 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/RFormulaSuite.scala (84 additions, 0 deletions)