
598 Reproducibility Project

Steps for using this code repository:

    1. Obtain the original input dataset. One way is to submit a data access request at https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/.
    1. Once you have the input dataset, place the test and training data in the folders /data/test and /data/train, respectively.
    1. Run SentenceDumper.java to generate the sentences.txt file.
    1. Run VocabularyDumper.java to generate the vocab.txt file.
    1. Install fastText on your working machine. Installation instructions are available at https://fasttext.cc/docs/en/supervised-tutorial.html
    1. Copy sentences.txt to the scripts folder, then run train_embeddings.sh, which generates the n2c2-fasttext model.
    1. Copy vocab.txt to the scripts folder. Download BioWordVec_PubMed_MIMICIII_d200.bin from https://github.com/ncbi-nlp/BioSentVec. Run print_pre_trained_vectors.sh to generate the pre-trained embeddings, and run print_self_trained_vectors.sh to generate the self-trained embeddings. Then copy both embedding files into the folder that contains the class files generated by the Java compiler.
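
The fastText steps above can be sketched as two small shell functions. This is only an illustration of the standard fastText commands (`skipgram` and `print-word-vectors`), not the actual contents of train_embeddings.sh or the print_*_vectors.sh scripts, which may use different modes or hyperparameters. The function names, the default binary path `./fastText/fasttext`, and the output file names in the usage note are assumptions made for this sketch.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the repository's own scripts may differ.
#
# Building fastText from source (requires make and gcc/clang):
#   git clone https://github.com/facebookresearch/fastText.git
#   cd fastText && make

# Path to the fastText binary (assumed location; override via the environment).
FASTTEXT="${FASTTEXT:-./fastText/fasttext}"

# Train skip-gram embeddings on a sentence file.
# $1 = input text file, $2 = output model prefix (fastText writes <prefix>.bin and <prefix>.vec).
# Hyperparameters are left at fastText defaults; the real script may set others.
train_embeddings() {
  "$FASTTEXT" skipgram -input "$1" -output "$2"
}

# Print the vector of every word listed in a vocabulary file (one word per line).
# $1 = .bin model, $2 = vocabulary file, $3 = output file for the vectors.
print_vectors() {
  "$FASTTEXT" print-word-vectors "$1" < "$2" > "$3"
}
```

With these definitions, `train_embeddings sentences.txt n2c2-fasttext` would correspond to the training step, and `print_vectors BioWordVec_PubMed_MIMICIII_d200.bin vocab.txt pre_trained.txt` followed by `print_vectors n2c2-fasttext.bin vocab.txt self_trained.txt` to the two export steps. The output names `pre_trained.txt` and `self_trained.txt` are placeholders; use whatever file names the Java code expects.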

Code Dependencies

  • JDK 8+
  • Python 3 (to run official evaluation scripts)
  • make (to compile fastText)
  • gcc/clang (to compile fastText)
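
A quick way to confirm that the tools listed above are on the PATH is a small preflight check. The function below is a hypothetical helper, not part of the repository, and it only tests that each command exists; it does not verify versions such as JDK 8+.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 if every command is found, 1 otherwise.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example invocation (uncomment to run); either gcc or clang is sufficient:
# check_tools java python3 make && { check_tools gcc 2>/dev/null || check_tools clang; }
```

Each missing tool is reported on standard error, so the check can be placed at the top of a setup script and used to stop early.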