[SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features
## What changes were proposed in this pull request?

The following test fails on current master:

````scala
test("gmm fails on high dimensional data") {
  val ctx = spark.sqlContext
  import ctx.implicits._
  // Two sparse vectors with one more feature than the proposed limit.
  val df = Seq(
    Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
    Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
    .map(Tuple1.apply).toDF("features")
  val gm = new GaussianMixture()
  intercept[IllegalArgumentException] {
    gm.fit(df)
  }
}
````

Instead of the expected `IllegalArgumentException`, you get an `ArrayIndexOutOfBoundsException` (or something similar for MLlib). That's because the covariance matrix allocates an array of size `numFeatures * numFeatures`, and in this case the product overflows a 32-bit integer. There is currently a warning that the algorithm does not perform well with a large number of features, but we should perform an explicit check to communicate this limitation to users.

This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to the ML and MLlib algorithms. For the feature limit, we can choose a value that avoids integer overflow, something like `math.sqrt(Int.MaxValue).toInt` (about 46k), which eliminates the cryptic error. However, in WLS, for example, we need to collect an array on the order of `numFeatures * numFeatures` to the driver, and we therefore limit it to 4096 features. We may want to keep that convention here for consistency.

## How was this patch tested?

Unit tests in ML and MLlib.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #16661 from sethah/gmm_high_dim.
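The overflow described in the commit message can be reproduced without Spark. The following is a minimal sketch, not the code from the patch: the object `CovarianceSizeCheck` and its members `MaxNumFeatures`, `covarianceEntries`, and `checkNumFeatures` are hypothetical names introduced for illustration, and the error message is an assumption. It only shows why `numFeatures * numFeatures` wraps around just past `math.sqrt(Int.MaxValue).toInt` and how a `require` check turns that into a readable failure.

```scala
// Illustrative sketch only; names here are hypothetical, not Spark's API.
object CovarianceSizeCheck {
  // Largest d whose square still fits in a signed 32-bit Int.
  val MaxNumFeatures: Int = math.sqrt(Int.MaxValue).toInt

  // Entries in a dense d x d covariance matrix, computed in Long so it cannot wrap.
  def covarianceEntries(numFeatures: Int): Long =
    numFeatures.toLong * numFeatures

  // Same shape of check as the one described in the commit message:
  // fail fast with IllegalArgumentException instead of a cryptic array error.
  def checkNumFeatures(numFeatures: Int): Unit =
    require(numFeatures < MaxNumFeatures,
      s"Cannot handle $numFeatures features; the limit is $MaxNumFeatures " +
        "because the covariance matrix is quadratic in the number of features.")

  def main(args: Array[String]): Unit = {
    val d = MaxNumFeatures + 1
    println(s"limit            = $MaxNumFeatures")
    println(s"Int  d * d       = ${d * d}")                 // wraps to a negative value
    println(s"Long d * d       = ${covarianceEntries(d)}")  // true size, above Int.MaxValue
    checkNumFeatures(4096)                                   // passes
    try checkNumFeatures(d)
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```

A negative or wrapped array length is what produces the confusing exception; the `require` check surfaces the real cause before any allocation happens. Note that a strict `<` comparison also rejects an input of exactly `MaxNumFeatures` features, which matches the check quoted above.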
Showing 4 changed files:
- mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala (11 additions, 3 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala (12 additions, 3 deletions)
- mllib/src/test/scala/org/apache/spark/ml/clustering/GaussianMixtureSuite.scala (14 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/clustering/GaussianMixtureSuite.scala (14 additions, 0 deletions)