Commit 54d95758 authored by Joseph J.C. Tang, committed by Xiangrui Meng

[MLLIB] SPARK-4846: throw a RuntimeException and give users hints to increase the minCount

When vocabSize\*vectorSize is larger than Int.MaxValue/8, throw a RuntimeException, because in that case allocating the memory needed to serialize the arrays syn0Global and syn1Global is certain to fail with an OOM. Both are float arrays, and serializing them needs a byte array of more than 8 times syn0Global's size.
Also, if an OOM is caught even though vocabSize\*vectorSize is less than Int.MaxValue/8, give users a hint to increase minCount or decrease vectorSize.
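
The size check described above can be tried in isolation. The following is a minimal sketch, not code from this commit: the object `Word2VecSizeCheck`, the helper `exceedsLimit`, and the example sizes are all hypothetical. It shows why the product is widened to `Long` before comparing against `Int.MaxValue`.

```scala
object Word2VecSizeCheck {
  // Hypothetical helper (not part of Spark): true when the guard
  // `vocabSize.toLong * vectorSize * 8 >= Int.MaxValue` would fire.
  def exceedsLimit(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize * 8 >= Int.MaxValue

  // The same product computed in Int arithmetic, which can silently wrap around.
  def intProduct(vocabSize: Int, vectorSize: Int): Int =
    vocabSize * vectorSize * 8

  def main(args: Array[String]): Unit = {
    // Assumed example sizes, chosen only for illustration.
    val vocabSize = 3000000
    val vectorSize = 100

    // The Long product is 2,400,000,000, which is above Int.MaxValue.
    println(exceedsLimit(vocabSize, vectorSize))

    // In Int arithmetic the same product overflows to a negative number,
    // so a check without .toLong would not fire here.
    println(intProduct(vocabSize, vectorSize) < 0)

    // A small vocabulary stays well under the limit.
    println(exceedsLimit(100000, 100))
  }
}
```

With these assumed sizes, a comparison done purely in `Int` would miss the problem, which is presumably why the commit calls `vocabSize.toLong` first.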

Author: Joseph J.C. Tang <jinntrance@gmail.com>

Closes #4247 from jinntrance/w2v-fix and squashes the following commits:

b5eb71f [Joseph J.C. Tang] throw a RuntimeException and give users hints regarding the vectorSize&minCount
parent 254eaa4d
@@ -290,6 +290,13 @@ class Word2Vec extends Serializable with Logging {
    val newSentences = sentences.repartition(numPartitions).cache()
    val initRandom = new XORShiftRandom(seed)
    if (vocabSize.toLong * vectorSize * 8 >= Int.MaxValue) {
      throw new RuntimeException("Please increase minCount or decrease vectorSize in Word2Vec" +
        " to avoid an OOM. You are highly recommended to make your vocabSize*vectorSize, " +
        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue/8`.")
    }
    val syn0Global =
      Array.fill[Float](vocabSize * vectorSize)((initRandom.nextFloat() - 0.5f) / vectorSize)
    val syn1Global = new Array[Float](vocabSize * vectorSize)
......
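
The commit message also mentions giving a hint when an OOM is caught even though the size check passed; that part is not visible in the hunk shown here. The following is only a sketch of how such a hint could be attached, assuming a hypothetical wrapper `withSizeHint` in a hypothetical object `OomHint`. It is not the code from this commit.

```scala
object OomHint {
  // Hypothetical wrapper (not Spark API): runs `body` and, if it fails with an
  // OutOfMemoryError, rethrows it as a RuntimeException carrying a tuning hint.
  def withSizeHint[T](vocabSize: Int, vectorSize: Int)(body: => T): T =
    try body
    catch {
      case oom: OutOfMemoryError =>
        throw new RuntimeException(
          "Please increase minCount or decrease vectorSize in Word2Vec to avoid an OOM " +
            s"(vocabSize*vectorSize is currently $vocabSize*$vectorSize).",
          oom) // keep the original error as the cause
    }

  def main(args: Array[String]): Unit = {
    // Normal results pass through unchanged.
    println(withSizeHint(10, 5)(42))

    // A simulated OOM is converted into an exception with the hint attached.
    try withSizeHint[Unit](10, 5)(throw new OutOfMemoryError("simulated"))
    catch { case e: RuntimeException => println(e.getMessage) }
  }
}
```

Keeping the original `OutOfMemoryError` as the cause preserves the stack trace, while the message tells the user which parameters to adjust.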