[SPARK-4708][MLLib] Make k-means run two/three times faster with dense/sparse samples
Note that the usage of `breezeSquaredDistance` in `org.apache.spark.mllib.util.MLUtils.fastSquaredDistance` is in the critical path, and `breezeSquaredDistance` is slow. We should replace it with our own implementation.

Here is the benchmark against mnist8m dataset.

Before
DenseVector: 70.04secs
SparseVector: 59.05secs

With this PR
DenseVector: 30.58secs
SparseVector: 21.14secs

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3565 from dbtsai/kmean and squashes the following commits:

08bc068 [DB Tsai] restyle
de24662 [DB Tsai] address feedback
b185a77 [DB Tsai] cleanup
4554ddd [DB Tsai] first commit
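For readers without the diff at hand, here is a minimal sketch of what a hand-rolled squared distance can look like when written as plain `while` loops over primitive arrays. It is not the code from this PR: the object name `SquaredDistanceSketch`, the method names, and the representation of a sparse vector as sorted index and value arrays are all assumptions made for illustration, and the sketch does not depend on Spark or Breeze.

```scala
// Hypothetical illustration only; names and signatures are not from the PR.
object SquaredDistanceSketch {

  /** Squared Euclidean distance between two dense vectors of equal length. */
  def denseSqDist(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var sum = 0.0
    var i = 0
    // A plain while loop avoids boxing and iterator overhead on the hot path.
    while (i < a.length) {
      val d = a(i) - b(i)
      sum += d * d
      i += 1
    }
    sum
  }

  /**
   * Squared Euclidean distance between two sparse vectors, each given as
   * strictly increasing indices plus the matching non-zero values.
   * Only stored entries are visited, via a merge over both index arrays.
   */
  def sparseSqDist(
      ai: Array[Int], av: Array[Double],
      bi: Array[Int], bv: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    var j = 0
    while (i < ai.length || j < bi.length) {
      val d =
        if (j >= bi.length || (i < ai.length && ai(i) < bi(j))) {
          // Index present only in a: b is implicitly zero here.
          val v = av(i); i += 1; v
        } else if (i >= ai.length || bi(j) < ai(i)) {
          // Index present only in b: a is implicitly zero here.
          val v = -bv(j); j += 1; v
        } else {
          // Index present in both vectors.
          val v = av(i) - bv(j); i += 1; j += 1; v
        }
      sum += d * d
    }
    sum
  }
}
```

The sparse variant does work proportional to the number of stored entries rather than the full dimension, which is one plausible reason a specialized implementation can help on sparse samples.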
Showing 5 changed files:
- mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala (33 additions, 34 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeansModel.scala (5 additions, 5 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/clustering/LocalKMeans.scala (10 additions, 12 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala (15 additions, 11 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/util/MLUtilsSuite.scala (7 additions, 6 deletions)