[SPARK-19922][ML] small speedups to findSynonyms

Currently generating synonyms using a large model (I've tested with 3m words) is very slow. These efficiencies have sped things up for us by ~17% I wasn't sure if such small changes were worthy of a jira, but the guidelines seemed to suggest that that is the preferred approach ## What changes were proposed in this pull request? Address a few small issues in the findSynonyms logic: 1) remove usage of ``Array.fill`` to zero out the ``cosineVec`` array. The default float value in Scala and Java is 0.0f, so explicitly setting the values to zero is not needed 2) use Floats throughout. The conversion to Doubles before doing the ``priorityQueue`` is totally superfluous, since all the similarity computations are done using Floats anyway. Creating a second large array just serves to put extra strain on the GC 3) convert the slow ``for(i <- cosVec.indices)`` to an ugly, but faster, ``while`` loop These efficiencies are really only apparent when working with a large model ## How was this patch tested? Existing unit tests + some in-house tests to time the difference cc jkbradley MLNick srowen Author: Asher Krim <krim.asher@gmail.com> Author: Asher Krim <krim.asher@gmail> Closes #17263 from Krimit/fasterFindSynonyms.

[SPARK-19922][ML] small speedups to findSynonyms
5e96a57b · Asher Krim · Sean Owen · 1c7275ef · 5e96a57b
Commit 5e96a57b authored 8 years ago by Asher Krim Committed by Sean Owen 8 years ago
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -491,8 +491,8 @@ class Word2VecModel private[spark] (

  // wordVecNorms: Array of length numWords, each value being the Euclidean norm
  //               of the wordVector.
-  private val wordVecNorms: Array[Double] = {
-    val wordVecNorms = new Array[Double](numWords)
+  private val wordVecNorms: Array[Float] = {
+    val wordVecNorms = new Array[Float](numWords)
    var i = 0
    while (i < numWords) {
      val vec = wordVectors.slice(i * vectorSize, i * vectorSize + vectorSize)
@@ -570,7 +570,7 @@ class Word2VecModel private[spark] (
    require(num > 0, "Number of similar words should > 0")

    val fVector = vector.toArray.map(_.toFloat)
-    val cosineVec = Array.fill[Float](numWords)(0)
+    val cosineVec = new Array[Float](numWords)
    val alpha: Float = 1
    val beta: Float = 0
    // Normalize input vector before blas.sgemv to avoid Inf value
@@ -581,22 +581,23 @@ class Word2VecModel private[spark] (
    blas.sgemv(
      "T", vectorSize, numWords, alpha, wordVectors, vectorSize, fVector, 1, beta, cosineVec, 1)

-    val cosVec = cosineVec.map(_.toDouble)
-    var ind = 0
-    while (ind < numWords) {
-      val norm = wordVecNorms(ind)
-      if (norm == 0.0) {
-        cosVec(ind) = 0.0
+    var i = 0
+    while (i < numWords) {
+      val norm = wordVecNorms(i)
+      if (norm == 0.0f) {
+        cosineVec(i) = 0.0f
      } else {
-        cosVec(ind) /= norm
+        cosineVec(i) /= norm
      }
-      ind += 1
+      i += 1
    }

-    val pq = new BoundedPriorityQueue[(String, Double)](num + 1)(Ordering.by(_._2))
+    val pq = new BoundedPriorityQueue[(String, Float)](num + 1)(Ordering.by(_._2))

-    for(i <- cosVec.indices) {
-      pq += Tuple2(wordList(i), cosVec(i))
+    var j = 0
+    while (j < numWords) {
+      pq += Tuple2(wordList(j), cosineVec(j))
+      j += 1
    }

    val scored = pq.toSeq.sortBy(-_._2)
@@ -606,7 +607,10 @@ class Word2VecModel private[spark] (
      case None => scored
    }

-    filtered.take(num).toArray
+    filtered
+      .take(num)
+      .map { case (word, score) => (word, score.toDouble) }
+      .toArray
  }

  /**