Commit e391abdf authored by Yuhao Yang, committed by Xiangrui Meng

[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec

jira: https://issues.apache.org/jira/browse/SPARK-11813

I ran into this problem while training on a large corpus. Avoiding serialization of the vocab in Word2Vec has 2 benefits:
1. Better performance, because less data is serialized.
2. A much larger capacity for Word2Vec.
Currently, in the fit of Word2Vec, the closure mainly consists of the serialized Word2Vec instance and 2 global tables.
The main part of Word2Vec is the vocab, of size: vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables take vocab * vectorSize * 8 bytes. If vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.MaxValue because of the size limit of ByteArrayOutputStream. In any case, avoiding serialization of the vocab reduces the size of the serialized closure, especially when vectorSize is small, and thus allows a larger vocabulary.
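The arithmetic above can be sketched as a small calculation. This is only an illustration built from the per-word estimates in this message (320 bytes for the vocab entry, vectorSize * 8 bytes for the 2 global tables); the object and method names are hypothetical and are not part of Spark.

```scala
// Hypothetical helper, not Spark code: reproduces the size estimates
// quoted in the commit message.
object ClosureSizeEstimate {
  // Per-word vocab estimate from the message: 40 * 2 * 4 = 320 bytes.
  val VocabBytesPerWord: Long = 40L * 2 * 4

  // Estimated bytes taken by the vocab for a given vocabulary size.
  def vocabBytes(vocabSize: Long): Long = vocabSize * VocabBytesPerWord

  // Estimated bytes taken by the 2 global tables.
  def globalTableBytes(vocabSize: Long, vectorSize: Int): Long =
    vocabSize * vectorSize * 8L

  // Largest vocabulary whose estimated closure still fits in Int.MaxValue bytes.
  def maxVocab(vectorSize: Int, includeVocab: Boolean): Long = {
    val perWord = vectorSize * 8L + (if (includeVocab) VocabBytesPerWord else 0L)
    Int.MaxValue / perWord
  }
}
```

Under these estimates and with vectorSize = 20, the bound rises from about 4.47 million words with the vocab in the closure to about 13.4 million without it, roughly a 3x increase. Real closures carry additional overhead, so these numbers are upper bounds at best.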

There is actually another possible fix: make local copies of the fields to avoid including Word2Vec in the closure. Let me know if that's preferred.
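The local-copy alternative can be illustrated without Spark, using plain Java serialization as a stand-in for closure serialization. This is a minimal sketch, not the code in this commit: the Model class, its fields, and the method names are hypothetical, and it assumes Scala 2.12 or later, where function literals are serializable.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for a model class with one large field.
object LocalCopySketch {
  // Size in bytes of an object under plain Java serialization.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  class Model extends Serializable {
    val bigTable: Array[Int] = new Array[Int](100000) // about 400 KB
    val window: Int = 5

    // Reads the field through `this`, so the closure captures the whole Model.
    def capturingThis: Int => Int = x => x + window

    // Copies the field to a local val first, so only that value is captured.
    def capturingLocal: Int => Int = {
      val localWindow = window
      x => x + localWindow
    }
  }
}
```

Comparing `serializedSize(m.capturingThis)` with `serializedSize(m.capturingLocal)` for some `m = new LocalCopySketch.Model` should show the first closure dragging the large array along while the second stays small.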

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9803 from hhbyyh/w2vVocab.
parent 2acdf10b
@@ -145,8 +145,8 @@ class Word2Vec extends Serializable with Logging {
   private var trainWordsCount = 0
   private var vocabSize = 0
-  private var vocab: Array[VocabWord] = null
-  private var vocabHash = mutable.HashMap.empty[String, Int]
+  @transient private var vocab: Array[VocabWord] = null
+  @transient private var vocabHash = mutable.HashMap.empty[String, Int]
   private def learnVocab(words: RDD[String]): Unit = {
     vocab = words.map(w => (w, 1))
......
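The effect of `@transient` in the diff above can be seen in isolation with plain Java serialization. This is a hedged sketch rather than Spark code: the two classes and the helper are hypothetical, and an Array[Int] stands in for the real vocab.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical classes, not Spark code: the same field with and without @transient.
object TransientSketch {
  class WithVocab extends Serializable {
    var vocab: Array[Int] = new Array[Int](100000) // about 400 KB
  }

  class WithTransientVocab extends Serializable {
    // Skipped by Java serialization, like vocab and vocabHash after this commit.
    @transient var vocab: Array[Int] = new Array[Int](100000)
  }

  // Size in bytes of an object under plain Java serialization.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }
}
```

A transient field comes back as null after deserialization, so this approach only works if the field is not read from code that runs on the deserialized copy.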