[SPARK-4081] [mllib] VectorIndexer
**Ready for review!** Since the original PR, I moved the code to the spark.ml API and renamed this to VectorIndexer.

This introduces a VectorIndexer class which does the following:
* VectorIndexer.fit(): collect statistics about how many values each feature in a dataset (RDD[Vector]) can take (limited by maxCategories)
  * Features which exceed maxCategories are declared continuous, and the Model will treat them as such.
* VectorIndexerModel.transform(): Convert categorical feature values to corresponding 0-based indices

Design notes:
* This maintains sparsity in vectors by ensuring that categorical feature value 0.0 gets index 0.
* This does not yet support transforming data with new (unknown) categorical feature values. That can be added later.
* This is necessary for DecisionTree and tree ensembles.

Reviewers: Please check my use of metadata and my unit tests for it; I'm not sure if I covered everything in the tests.

Other notes:
* This also adds a public toMetadata method to AttributeGroup (for simpler construction of metadata).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3000 from jkbradley/indexer and squashes the following commits:

5956d91 [Joseph K. Bradley] minor cleanups
f5c57a8 [Joseph K. Bradley] added Java test suite
643b444 [Joseph K. Bradley] removed FeatureTests
02236c3 [Joseph K. Bradley] Updated VectorIndexer, ready for PR
286d221 [Joseph K. Bradley] Reworked DatasetIndexer for spark.ml API, and renamed it to VectorIndexer
12e6cf2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into indexer
6d8f3f1 [Joseph K. Bradley] Added partly done DatasetIndexer to spark.ml
6a2f553 [Joseph K. Bradley] Updated TODO for allowUnknownCategories
3f041f8 [Joseph K. Bradley] Final cleanups for DatasetIndexer
038b9e3 [Joseph K. Bradley] DatasetIndexer now maintains sparsity in SparseVector
3a4a0bd [Joseph K. Bradley] Added another test for DatasetIndexer
2006923 [Joseph K. Bradley] DatasetIndexer now passes tests
f409987 [Joseph K. Bradley] partly done with DatasetIndexerSuite
5e7c874 [Joseph K. Bradley] working on DatasetIndexer
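
The fit/transform behavior described in the commit message can be illustrated with a small, dependency-free sketch. This is not the Spark implementation: the function names `fit_category_maps` and `transform_row`, the use of plain Python lists instead of `RDD[Vector]`, and the choice to order the non-zero values by sorting are all assumptions made for illustration. Only the rules themselves (the `maxCategories` limit, 0-based indices, value 0.0 mapping to index 0, and no support for unknown values) come from the description above.

```python
from typing import Dict, List, Sequence


def fit_category_maps(
    rows: Sequence[Sequence[float]], max_categories: int
) -> Dict[int, Dict[float, int]]:
    """Return {feature index: {value: category index}} for categorical features.

    A feature with more than max_categories distinct values is treated as
    continuous and is left out of the result.
    """
    if not rows:
        return {}
    distinct: List[set] = [set() for _ in range(len(rows[0]))]
    for row in rows:
        for j, value in enumerate(row):
            # Stop collecting once a feature is already over the limit.
            if len(distinct[j]) <= max_categories:
                distinct[j].add(value)

    maps: Dict[int, Dict[float, int]] = {}
    for j, values in enumerate(distinct):
        if len(values) > max_categories:
            continue  # continuous feature
        ordered = sorted(values)  # assumption: sorted order for non-zero values
        if 0.0 in values:
            # Keep sparsity: value 0.0 must always receive index 0.
            ordered.remove(0.0)
            ordered.insert(0, 0.0)
        maps[j] = {value: index for index, value in enumerate(ordered)}
    return maps


def transform_row(
    row: Sequence[float], maps: Dict[int, Dict[float, int]]
) -> List[float]:
    """Replace categorical values by their indices; pass continuous values through.

    Raises KeyError for a categorical value not seen during fitting, since
    unknown categories are not supported in this sketch either.
    """
    return [float(maps[j][v]) if j in maps else v for j, v in enumerate(row)]


rows = [
    [-1.0, 0.5, 0.0],
    [0.0, 1.5, 2.0],
    [1.0, 2.5, 0.0],
    [0.0, 3.5, 2.0],
]
maps = fit_category_maps(rows, max_categories=3)
indexed = [transform_row(r, maps) for r in rows]
```

In this toy dataset, features 0 and 2 are categorical and feature 1 (four distinct values) is continuous. Because 0.0 always maps to index 0, a zero entry in the input stays zero in the output, which is what lets a sparse vector remain sparse after indexing.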
Showing 9 changed files:
- mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala (3 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/ml/attribute/AttributeGroup.scala (14 additions, 7 deletions)
- mllib/src/main/scala/org/apache/spark/ml/feature/VectorIndexer.scala (393 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/ml/param/params.scala (14 additions, 6 deletions)
- mllib/src/test/java/org/apache/spark/ml/feature/JavaVectorIndexerSuite.java (70 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/ml/attribute/AttributeGroupSuite.scala (4 additions, 4 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/NormalizerSuite.scala (5 additions, 2 deletions)
- mllib/src/test/scala/org/apache/spark/ml/feature/VectorIndexerSuite.scala (255 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/ml/util/TestingUtils.scala (60 additions, 0 deletions)