CONTENTS
=================
1 PURPOSE
  1.1 License
2 System Requirements
3 Programmatic Use
4 Download Contents
5 Contact Information
6 Developer Notes

==============================

1. PURPOSE

This software allows word embeddings to be used efficiently in Java projects. It does so with an improved version of Java's RandomAccessFile: instead of loading all embeddings into memory, a file offset is stored for each word embedding. Building these offsets adds a small start-up overhead of about 5 seconds per 100,000 embeddings, but RAM usage stays very low and vector extraction remains fast. Computing the cosine similarity of two 300-dimensional word vectors (including both lookups and the cosine computation) takes about 0.003 s. An uncached lookup takes about 0.001 s; with the caching option enabled, frequently used embeddings are kept in memory for additional speed at the expense of growing RAM usage.

Word embedding files to be read must be in a standard form, i.e. each line is one word embedding in the following format:

<word> <float> <float> .....

where the <float> values are the components of the vector. The delimiter between tokens in the file can be a space or a tab.

Vectors for two types of 300-dimensional Paragram embeddings, as well as pre-trained 300-dimensional GloVe and word2vec embeddings (trained on hundreds of billions of tokens), are available here:

/shared/shelley/wieting2/illinois-word-embeddings

All of these can be read by this software.

1.1 License

This software is available under a Research and Academic use license.

==============================

2. SYSTEM REQUIREMENTS

This software was developed on the following platform:

Java 1.7

==============================

3. PROGRAMMATIC USE

To use the efficient word map, initialize EfficientWordMap with three parameters: the path to the word embedding file, the dimension of the word embeddings, and a boolean indicating whether embeddings that have been looked up should be cached in RAM.

EfficientWordMap vectors = new EfficientWordMap(wordfile, 300, useCache);

Then use it to look up a vector:

double[] dog = vectors.lookup("dog");

or to compute the cosine similarity between two words:

double sim = vectors.cosineSim("dog", "puppy");

Some particulars about cosineSim:
1. If the word embedding file contains a vector for the unknown token, i.e. an embedding for UUUNNKKK, then the UUUNNKKK embedding is used in place of any unknown word.
2. If both words are unknown, the score is 0, unless the two words are equal.
3. If there is no UUUNNKKK embedding, then the score is 0 whenever one of the words is unknown.

A complete end-to-end sketch appears at the end of this README.

==============================

4. DOWNLOAD CONTENTS

The main directory has two sub-directories:

doc/ -- documentation (this README)
src/ -- source files

==============================

5. CONTACT INFORMATION

You can ask questions and report problems via the CCG's software newsgroup, which you can sign up for here:

http://lists.cs.uiuc.edu/mailman/listinfo/illinois-ml-nlp-users

==============================

6. DEVELOPER NOTES
Forked from CogComp/illinois-word-embedding; kept up to date with the upstream repository.
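For quick reference, here is a minimal end-to-end sketch of the API described in section 3. The class name EfficientWordMapDemo, the embedding file path, and the assumption that the constructor takes the path as a String and that errors surface as exceptions are illustrative only; adjust them to your setup.

// Minimal usage sketch of the API described in section 3.
// Assumptions (not taken from the documentation above): the embedding file
// path is a placeholder, the constructor accepts the path as a String, and
// I/O problems are reported via exceptions.
public class EfficientWordMapDemo {

    public static void main(String[] args) throws Exception {
        // Placeholder path to a 300-dimensional embedding file in the standard
        // "<word> <float> <float> ..." format described in section 1.
        String wordfile = "/path/to/embeddings.txt";
        boolean useCache = true;  // cache looked-up embeddings in RAM

        // Builds the per-word file offsets (roughly 5 s per 100,000 embeddings).
        EfficientWordMap vectors = new EfficientWordMap(wordfile, 300, useCache);

        // Look up a single 300-dimensional vector.
        double[] dog = vectors.lookup("dog");
        System.out.println("dimension of 'dog' vector: " + dog.length);

        // Cosine similarity between two words; unknown words fall back to the
        // UUUNNKKK embedding if one is present in the file (see section 3).
        double sim = vectors.cosineSim("dog", "puppy");
        System.out.println("cosineSim(dog, puppy) = " + sim);
    }
}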