[SPARK-5565][ML] LDA wrapper for Pipelines API
This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
* I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.

Note: This will conflict with [https://github.com/apache/spark/pull/9484], but I'll try to merge [https://github.com/apache/spark/pull/9484] first and then rebase this PR.

CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9513 from jkbradley/lda-pipelines.
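
For illustration only, here is a minimal sketch of how a caller might attach their own document ID column before fitting the spark.ml LDA estimator, which is the workflow the "eliminated doc IDs" change relies on. It assumes a later Spark release (2.x or newer) than this PR targets: `SparkSession`, `org.apache.spark.ml.linalg.Vectors`, and `monotonically_increasing_id` are the newer names, and the `docId` column name and toy term-count vectors are made up for the example.

```scala
import org.apache.spark.ml.clustering.LDA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

object LdaDocIdSketch {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation (assumption: Spark 2.x+ on the classpath).
    val spark = SparkSession.builder()
      .appName("lda-doc-id-sketch")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical term-count vectors over a 4-word vocabulary.
    val data = Seq(
      Vectors.dense(2.0, 1.0, 0.0, 0.0),
      Vectors.dense(0.0, 0.0, 3.0, 1.0),
      Vectors.dense(1.0, 2.0, 0.0, 1.0)
    )
    val docs = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

    // The estimator takes no doc IDs; the caller adds an ID column if one is needed.
    val withIds = docs.withColumn("docId", monotonically_increasing_id())

    val lda = new LDA()
      .setK(2)
      .setMaxIter(10)
      .setFeaturesCol("features")
    val model = lda.fit(withIds)

    // transform() keeps the extra docId column next to the inferred topic mixture.
    model.transform(withIds).select("docId", "topicDistribution").show(false)

    // Top 3 terms (by index) and their weights for each topic.
    model.describeTopics(3).show(false)

    spark.stop()
  }
}
```

Because the ID is an ordinary DataFrame column, it passes through `transform` untouched and can be joined back to any other per-document metadata.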
Showing 3 changed files
- mllib/src/main/scala/org/apache/spark/ml/clustering/LDA.scala (701 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAModel.scala (24 additions, 5 deletions)
- mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala (221 additions, 0 deletions)