Commit bbae20ad authored by Yanbo Liang

[SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance

## What changes were proposed in this pull request?
```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I saw a 20% performance improvement.
In addition, we should destroy the broadcast variable ```compute``` at the end of each iteration.
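
For readers unfamiliar with the two calls, here is a minimal sketch of the pattern, not the actual `GaussianMixture` code. The `Stats` accumulator, the `scale` broadcast, the object name, and the local-mode setup are all hypothetical stand-ins chosen for illustration. It assumes Spark is on the classpath. It runs the same `seqOp`/`combOp` pair through `aggregate` and `treeAggregate`, then destroys the broadcast once it is no longer needed. The sketch calls the no-argument `destroy()`; whether the `destroy(blocking = false)` form used in the patch is callable from user code may depend on the Spark version.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TreeAggregateSketch {
  // Hypothetical accumulator standing in for ExpectationSum from the patch.
  final case class Stats(count: Long, sum: Double) {
    def add(x: Double): Stats = Stats(count + 1, sum + x)
    def merge(other: Stats): Stats = Stats(count + other.count, sum + other.sum)
  }

  // Returns (result of aggregate, result of treeAggregate) over the same data.
  def run(): (Stats, Stats) = {
    // Local mode is an assumption made only so the sketch runs standalone.
    val conf = new SparkConf().setAppName("tree-aggregate-sketch").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000, numSlices = 16).map(_.toDouble)

      // Hypothetical broadcast value, playing the role of `compute` in the patch.
      val scale = sc.broadcast(2.0)
      val seqOp = (acc: Stats, x: Double) => acc.add(x * scale.value)
      val combOp = (a: Stats, b: Stats) => a.merge(b)

      // aggregate: every partition result is merged on the driver.
      val flat = data.aggregate(Stats(0L, 0.0))(seqOp, combOp)

      // treeAggregate: partition results are first combined on executors
      // in a multi-level tree, so the driver merges fewer partial results.
      val tree = data.treeAggregate(Stats(0L, 0.0))(seqOp, combOp, depth = 2)

      // Release the broadcast once no further job needs it.
      scale.destroy()
      (flat, tree)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (flat, tree) = run()
    println(s"aggregate:     $flat")
    println(s"treeAggregate: $tree")
  }
}
```

Both calls should return the same value as long as `combOp` is associative and commutative; the difference is only where the merging happens. Any gain from `treeAggregate` is likely to be most visible when the accumulator is large or there are many partitions.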

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14621 from yanboliang/spark-17033.
parent 79e2caa1
@@ -198,7 +198,7 @@ class GaussianMixture private (
 val compute = sc.broadcast(ExpectationSum.add(weights, gaussians)_)
 // aggregate the cluster contribution for all sample points
-val sums = breezeData.aggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
+val sums = breezeData.treeAggregate(ExpectationSum.zero(k, d))(compute.value, _ += _)
 // Create new distributions based on the partial assignments
 // (often referred to as the "M" step in literature)
@@ -227,6 +227,7 @@ class GaussianMixture private (
 llhp = llh // current becomes previous
 llh = sums.logLikelihood // this is the freshly computed log-likelihood
 iter += 1
+compute.destroy(blocking = false)
 }
 new GaussianMixtureModel(weights, gaussians)