[SPARK-4431][MLlib] Implement efficient foreachActive for dense and sparse vector
Previously, we used Breeze's activeIterator to access the active elements of dense and sparse vectors. Because of its overhead, we switched back to a native `while` loop in #SPARK-4129.

However, the #SPARK-4129 approach dereferences dv.values/sv.values on every access to a value, which is very expensive. In addition, MultivariateOnlineSummarizer stores its partial statistics in Breeze dense vectors, which is very expensive compared with using primitive Scala arrays.

This PR implements an efficient foreachActive that unifies the code path for dense and sparse vector operations, which makes the codebase easier to maintain. The Breeze dense vectors are replaced by primitive arrays to reduce the overhead further.

Benchmarked with the mnist8m dataset on a single JVM, with the first 200 samples loaded in memory and the run repeated 5000 times.

Before change:
Sparse Vector - 30.02
Dense Vector - 38.27

With this PR:
Sparse Vector - 6.29
Dense Vector - 11.72

Author: DB Tsai <dbtsai@alpinenow.com>

Closes #3288 from dbtsai/activeIterator and squashes the following commits:

844b0e6 [DB Tsai] formating
03dd693 [DB Tsai] futher performance tunning.
1907ae1 [DB Tsai] address feedback
98448bb [DB Tsai] Made the override final, and had a local copy of variables which made the accessing a single step operation.
c0cbd5a [DB Tsai] fix a bug
6441f92 [DB Tsai] Finished SPARK-4431
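For readers who want a picture of the approach, here is a minimal sketch of the idea described in the commit message. It is not the code from this PR: `SketchVector`, `DenseSketch`, `SparseSketch`, and `ForeachActiveDemo.columnSums` are hypothetical names made up for illustration. It assumes a `foreachActive` that takes an `(index, value)` callback, is marked final in each vector class, and copies the backing arrays into local variables before a `while` loop, with the accumulator held in a primitive array rather than a Breeze vector.

```scala
// Hypothetical stand-ins for the dense and sparse vector types; not the Spark source.
sealed trait SketchVector {
  def size: Int
  // Applies f to every active (index, value) pair.
  def foreachActive(f: (Int, Double) => Unit): Unit
}

final class DenseSketch(val values: Array[Double]) extends SketchVector {
  override def size: Int = values.length

  override def foreachActive(f: (Int, Double) => Unit): Unit = {
    // Local copy so the loop reads a local variable instead of a field each time.
    val localValues = values
    val n = localValues.length
    var i = 0
    while (i < n) {
      f(i, localValues(i))
      i += 1
    }
  }
}

final class SparseSketch(
    val size: Int,
    val indices: Array[Int],
    val values: Array[Double]) extends SketchVector {

  override def foreachActive(f: (Int, Double) => Unit): Unit = {
    // Only the explicitly stored entries are visited.
    val localIndices = indices
    val localValues = values
    val n = localValues.length
    var i = 0
    while (i < n) {
      f(localIndices(i), localValues(i))
      i += 1
    }
  }
}

object ForeachActiveDemo {
  // One code path for both vector kinds, accumulating into a primitive array.
  def columnSums(vectors: Seq[SketchVector], n: Int): Array[Double] = {
    val sums = new Array[Double](n)
    vectors.foreach { v =>
      v.foreachActive { (i, x) => sums(i) += x }
    }
    sums
  }
}
```

A caller would pass a mix of both kinds, for example `ForeachActiveDemo.columnSums(Seq(new DenseSketch(Array(1.0, 0.0, 2.0)), new SparseSketch(3, Array(0, 2), Array(3.0, 4.0))), 3)`. The caller never branches on the vector type, which is the maintainability point the commit message makes.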
Showing
- mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala (32 additions, 0 deletions)
- mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala (49 additions, 72 deletions)
- mllib/src/test/scala/org/apache/spark/mllib/linalg/VectorsSuite.scala (24 additions, 0 deletions)