CONTENTS
=================

1 PURPOSE
1.1  License
2 System requirements 
3 Programmatic Use
4 Download contents 
5 Contact Information
6 Developer Notes

==============================


1. PURPOSE

This software allows for efficient use of word embeddings in Java projects. It does so
by using an improved version of Java's RandomAccessFile. There is a small overhead in opening
the file, as a byte offset is stored for each word embedding. This takes about 5 seconds per
100,000 embeddings.
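The offset-index idea described above can be sketched as follows. This is an illustrative
sketch only (the class and method names are hypothetical, not part of this library): one
pass over the file records the byte offset of each line, keyed by its word, so that a later
lookup can seek() directly to the line instead of keeping every vector in RAM.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: build a word -> byte-offset index over an embedding
// file so vectors can be read on demand with seek() instead of held in RAM.
public class OffsetIndexSketch {
    public static Map<String, Long> buildIndex(RandomAccessFile file) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        long pos = file.getFilePointer();
        String line;
        while ((line = file.readLine()) != null) {
            // The word is the first space- or tab-delimited token on the line.
            int end = line.indexOf(' ');
            if (end < 0) end = line.indexOf('\t');
            if (end > 0) {
                offsets.put(line.substring(0, end), pos);
            }
            pos = file.getFilePointer();
        }
        return offsets;
    }
}
```

A lookup then seeks to the stored offset and parses that single line, which is why
opening the file costs one scan but per-word lookups stay cheap.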

However, RAM usage is very low and extraction of the vectors is still fast.
Computing the cosine similarity of two 300-dimensional word vectors (which includes both
lookups and calculating the cosine score) takes about 0.003s. A lookup takes about 0.001s
if uncached, but through the caching option, frequently used embeddings can be cached for
improved speed at the expense of growing RAM usage.

Note that word embedding files to be read must be in a standard form. That is, each line
contains one word embedding in the following format:

<word> <float> <float> .....

where the <float> tokens are the components of the vector. The delimiter between the tokens
in the file can be a space or a tab.
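Parsing one line of this format is straightforward; the sketch below shows the idea
(the class and method names here are illustrative, not part of this library):

```java
// Illustrative sketch: split one embedding line of the form
// "<word> <float> <float> ..." into its word and vector components.
public class EmbeddingLineParser {
    public static String parseWord(String line) {
        return line.trim().split("[ \t]+")[0];
    }

    public static double[] parseVector(String line) {
        String[] tokens = line.trim().split("[ \t]+"); // space- or tab-delimited
        double[] vec = new double[tokens.length - 1];
        for (int i = 1; i < tokens.length; i++) {
            vec[i - 1] = Double.parseDouble(tokens[i]);
        }
        return vec;
    }
}
```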

Vectors for two types of 300-dimensional Paragram embeddings, as well as pre-trained
300-dimensional GloVe and word2vec embeddings (trained on hundreds of billions of tokens),
are here:

/shared/shelley/wieting2/illinois-word-embeddings

These can be read by this software.

1.1 License

This software is available under a Research and Academic
use license.

==============================

2. SYSTEM REQUIREMENTS

This software was developed on the following platform:

Java 1.7

==============================

3. PROGRAMMATIC USE

To use the efficient word map, just initialize EfficientWordMap with 3 parameters.
The first is the path to the word embedding file, the second is the dimension of
the word embeddings, and the last is a boolean indicating whether or not to store
looked-up word embeddings in RAM.

EfficientWordMap vectors = new EfficientWordMap(wordfile, 300, useCache);

Then use it to lookup a vector:

double[] dog = vectors.lookup("dog");

or compute cosine similarity between two vectors:

double sim  = vectors.cosineSim("dog", "puppy");

Some particulars about cosineSim:
    1. If the word embedding file contains an unknown-word vector, i.e. a vector for
    the UUUNNKKK token, then any unknown word is assigned the UUUNNKKK embedding.

    2. If both words are unknown, the score is 0 unless the words are equal.

    3. If there is no UUUNNKKK embedding and one of the words is unknown, the score is 0.
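For known words, cosineSim computes the standard cosine similarity over the two looked-up
vectors. A minimal sketch of that computation (illustrative, not this library's internal
implementation) looks like:

```java
// Standard cosine similarity: dot(a, b) / (||a|| * ||b||).
public class CosineSketch {
    public static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        // Guard against zero vectors, which would divide by zero.
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

Identical vectors score 1.0 and orthogonal vectors score 0.0, which is why the
UUUNNKKK rules above exist: without them, two distinct unknown words would otherwise
both map to the same vector and spuriously score 1.0.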

==============================

4. DOWNLOAD CONTENTS

The main directory has two sub-directories:

doc/ -- documentation (this README)
src/ -- source files


==============================

5. CONTACT INFORMATION

You can ask questions/report problems via the CCG's software newsgroup, which you 
can sign up for here:

http://lists.cs.uiuc.edu/mailman/listinfo/illinois-ml-nlp-users


================================

6. DEVELOPER NOTES