Commit 0e821ec6 authored 8 years ago by sethah Committed by Yanbo Liang 8 years ago

[SPARK-19313][ML][MLLIB] GaussianMixture should limit the number of features

## What changes were proposed in this pull request?

The following test will fail on current master

````scala
test("gmm fails on high dimensional data") {
    val ctx = spark.sqlContext
    import ctx.implicits._
    val df = Seq(
      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(0, 4), Array(3.0, 8.0)),
      Vectors.sparse(GaussianMixture.MAX_NUM_FEATURES + 1, Array(1, 5), Array(4.0, 9.0)))
      .map(Tuple1.apply).toDF("features")
    val gm = new GaussianMixture()
    intercept[IllegalArgumentException] {
      gm.fit(df)
    }
  }
````

Instead, you'll get an `ArrayIndexOutOfBoundsException` or something similar for MLlib. That's because the covariance matrix allocates an array of `numFeatures * numFeatures`, and in this case we get integer overflow. While there is currently a warning that the algorithm does not perform well for high number of features, we should perform an appropriate check to communicate this limitation to users.

This patch adds a `require(numFeatures < GaussianMixture.MAX_NUM_FEATURES)` check to ML and MLlib algorithms. For the feature limitation, we can limit it such that we do not get numerical overflow to something like `math.sqrt(Integer.MaxValue).toInt` (about 46k) which eliminates the cryptic error. However in, for example WLS, we need to collect an array on the order of `numFeatures * numFeatures` to the driver and we therefore limit to 4096 features. We may want to keep that convention here for consistency.

## How was this patch tested?
Unit tests in ML and MLlib.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #16661 from sethah/gmm_high_dim.

parent 76db394f

No related branches found

No related tags found

No related merge requests found

Hide whitespace changes

Inline Side-by-side

Showing with 51 additions and 6 deletions

Please register or to comment