598 Reproducibility Project
Steps for using this code repo:
- Obtain the original input dataset. One way to get it is to submit a data access request at https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/.
- Once you have the input dataset, put the test and training data under the folders `/data/test` and `/data/train`, respectively.
- Run the program `SentenceDumper.java` to generate the `sentences.txt` file.
- Run the program `VocabularyDumper.java` to generate the `vocab.txt` file.
- Install fastText on your working machine. You can follow this guide to install it: https://fasttext.cc/docs/en/supervised-tutorial.html
- Copy the `sentences.txt` file to the `scripts` folder, then run the script `train_embeddings.sh`, which will generate the `n2c2-fasttext` model.
- Copy the `vocab.txt` file to the `scripts` folder. Download `BioWordVec_PubMed_MIMICIII_d200.bin` from https://github.com/ncbi-nlp/BioSentVec. Run the script `print_pre_trained_vectors.sh` to generate the pre-trained embeddings. Run the script `print_self_trained_vectors.sh` to generate the self-trained embeddings. Then copy both embedding files to the same folder as the class files generated by Java.
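Before running the embedding scripts, it can help to confirm that the inputs from the steps above are where they are expected. The following is a minimal sketch, not part of the repo: the function name `missing_inputs` is hypothetical, and it assumes that `/data/test` and `/data/train` are relative to the repo root.

```shell
#!/bin/sh
# Print every expected input that is missing under the given repo root.
# Assumption: the data folders are relative to the repo root, not the
# filesystem root.
missing_inputs() {
    root=$1
    for dir in data/test data/train; do
        # A folder counts as present only if it exists and is not empty.
        if [ ! -d "$root/$dir" ] || [ -z "$(ls -A "$root/$dir")" ]; then
            printf '%s\n' "$dir"
        fi
    done
    for file in scripts/sentences.txt scripts/vocab.txt; do
        # -s is true only for an existing, non-empty file.
        [ -s "$root/$file" ] || printf '%s\n' "$file"
    done
}

# Usage: run from the repo root; no output means nothing is missing.
missing_inputs .
```

Any path that is printed points to a step that still needs to be completed.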
Code Dependencies
- JDK8+
- python3 (to run official evaluation scripts)
- make (to compile fastText)
- gcc/clang (to compile fastText)
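A quick way to see whether the tools listed above are available is to look them up on `PATH`. This is a minimal sketch with a hypothetical helper named `missing_tools`. It only checks that the commands exist; it does not verify versions, for example that the JDK is 8 or newer.

```shell
#!/bin/sh
# Print each tool from the argument list that is not found on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools named in the dependency list (java and javac stand in for the JDK).
missing=$(missing_tools java javac python3 make)

# Either compiler is enough to build fastText.
if [ -n "$(missing_tools gcc)" ] && [ -n "$(missing_tools clang)" ]; then
    missing="$missing gcc-or-clang"
fi

if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names print on one line.
    echo "Missing tools:" $missing
else
    echo "All listed tools were found."
fi
```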