CS598DL4H

Reproducing "Evaluating shallow and deep learning strategies for the 2018 n2c2 shared task on clinical text classification"

Original Citation

If you use the data or code in your work, please cite the original JAMIA paper:

@article{oleynik2019evaluating,
  title={Evaluating shallow and deep learning strategies for the 2018 n2c2 shared-task on clinical text classification},
  author={Michel Oleynik and Amila Kugic and Zdenko Kasáč and Markus Kreuzthaler},
  journal={Journal of the American Medical Informatics Association},
  publisher={Oxford University Press},
  year={2019}
}

Code Dependencies

  • JDK8+
  • python3 (to run official evaluation scripts)
  • make (to compile fastText)
  • gcc/clang (to compile fastText)

Add your files

cd existing_repo
git remote add origin https://gitlab.engr.illinois.edu/mcroos21/cs598dl4h.git
git branch -M main
git push -uf origin main

Usage

Dataset download

Dataset request portal:

https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/

Dataset: 2018 (Track 1) Clinical Trial Cohort Selection Challenge Downloads

Training Data: Gold standard training data

Test Data: Gold Standard test data

Data Locations Setup Application

For this project, all data was read from /home/mcroos/code/aih/aih_project/data. Replace this hard-coded location in the codebase with your own data directory.

mcroos@INDRAN-I6700K:~/code/aih/aih_project/data
$ ls
bio_embedding_extrinsic.bin  n2c2-fasttext.bin  n2c2-fasttext.vec  n2c2-t1_gold_standard_test_data.zip  sentences.txt  test  train  train.zip  vectors.tsv  vectors.vec  vocab.txt
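
Before pointing the codebase at a different data directory, it can help to confirm that the directory has the same layout as the listing above. The following is a minimal sketch, not part of the project: the entry names are copied from the `ls` output, the function name `missing_entries` is hypothetical, and which entries a given run actually needs is an assumption.

```python
from pathlib import Path

# Names copied from the `ls` listing above. Which of them a particular
# run really requires is an assumption; trim the list as needed.
EXPECTED_ENTRIES = [
    "train",
    "test",
    "sentences.txt",
    "vocab.txt",
    "n2c2-fasttext.bin",
    "n2c2-fasttext.vec",
    "bio_embedding_extrinsic.bin",
]


def missing_entries(data_dir, expected=EXPECTED_ENTRIES):
    """Return the expected file or directory names absent from data_dir."""
    root = Path(data_dir)
    return [name for name in expected if not (root / name).exists()]
```

Calling `missing_entries("/path/to/your/data")` returns an empty list when every listed entry is present.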

Installation - Prerequisites

Install fastText https://fasttext.cc/docs/en/support.html

Download embeddings https://github.com/ncbi-nlp/BioSentVec

Generate vectors

Dump sentences.txt, the corpus used to self-train the embeddings:

/usr/lib/jvm/java-11-openjdk-amd64/bin/java at.medunigraz.imi.bst.n2c2.runner.SentenceDumper

Generate the vocabulary for the pre-trained BioWordVec embeddings:

/usr/bin/env /usr/lib/jvm/java-11-openjdk-amd64/bin/java at.medunigraz.imi.bst.n2c2.nn.VocabularyDumper

Command to train fastText on our own texts and generate vectors for the training pipeline:
mcroos@INDRAN-I6700K:~/code/aih/aih_project/data
$ /home/mcroos/code/aih/fastText/fasttext skipgram -input sentences.txt -output n2c2-fasttext -dim 200 -t 0.001 -minCount 0 -neg 10 -wordNgrams 6 -ws 20
Read 0M words
Number of words:  23331
Number of labels: 0
Progress: 100.0% words/sec/thread:   10178 lr:  0.000000 avg.loss:  2.029469 ETA:   0h 0m 0s
mcroos@INDRAN-I6700K:~/code/aih/aih_project/data
$ ../n2c2/scripts/print_vectors.sh n2c2-fasttext.bin 
mcroos@INDRAN-I6700K:~/code/aih/aih_project/data
$ ls
bio_embedding_extrinsic.bin  n2c2-fasttext.bin  n2c2-fasttext.vec  n2c2-t1_gold_standard_test_data.zip  sentences.txt  test  train  train.zip  vectors.tsv  vectors.vec  vocab.txt
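
The text-format .vec file that fastText writes next to the .bin model starts with a header line holding the word count and the vector dimension, followed by one line per word. A quick sanity check of the self-trained vectors could look like the sketch below. The function names are hypothetical, and the sketch assumes space-separated values with no spaces inside a word.

```python
def read_vec_header(path):
    """Return (word_count, dimension) from the first line of a .vec file."""
    with open(path, encoding="utf-8") as handle:
        count, dim = handle.readline().split()
    return int(count), int(dim)


def first_vectors(path, limit=3):
    """Return up to `limit` (word, vector) pairs that follow the header."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        handle.readline()  # skip the "<count> <dim>" header line
        for line in handle:
            if len(pairs) >= limit:
                break
            parts = line.rstrip().split(" ")
            pairs.append((parts[0], [float(x) for x in parts[1:]]))
    return pairs
```

For the command above, `read_vec_header("n2c2-fasttext.vec")` should report a dimension of 200 (the `-dim` value), and the word count should agree with the "Number of words" line in the training log.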

Self-training command:

/home/mcroos/code/aih/fastText/fasttext supervised -input /home/mcroos/code/aih/aih_project/data/fasttext/train.txt -output /home/mcroos/code/aih/aih_project/data/fasttext/model -thread 1 -epoch 100 -lr 0.50 -dim 200 -pretrainedVectors /home/mcroos/code/aih/aih_project/n2c2/target/classes/self-trained-vectors.vec
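
Because the command above embeds machine-specific absolute paths, one option is to assemble it from a configurable base directory. This is only a sketch: the helper name `supervised_command` and its parameters are hypothetical, while the flags and values are copied unchanged from the command above.

```python
from pathlib import Path


def supervised_command(fasttext_bin, data_dir, vectors_path):
    """Build the argument list for the `fasttext supervised` call shown above.

    fasttext_bin: path to the compiled fastText executable.
    data_dir:     data directory that contains the fasttext/ subfolder.
    vectors_path: path to the self-trained .vec file.
    """
    fasttext_dir = Path(data_dir) / "fasttext"
    return [
        str(fasttext_bin), "supervised",
        "-input", str(fasttext_dir / "train.txt"),
        "-output", str(fasttext_dir / "model"),
        "-thread", "1",
        "-epoch", "100",
        "-lr", "0.50",
        "-dim", "200",
        "-pretrainedVectors", str(vectors_path),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch training with your own paths.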

Execution

This project was reproduced in Visual Studio Code.

For easy execution, click the Run command on the main method as highlighted below, or run it from the command line as shown below.

[Screenshot: execution.PNG]

mcroos@INDRAN-I6700K:~/code/aih/aih_project
$ /usr/lib/jvm/java-11-openjdk-amd64/bin/java  at.medunigraz.imi.bst.n2c2.ClassifierRunner 
22:00:32  INFO [DatasetUtil         ] Loading 202 files from /home/mcroos/code/aih/aih_project/data/train ...
22:00:32 DEBUG [DatasetUtil         ] Reading /home/mcroos/code/aih/aih_project/data/train/255.xml
22:00:32 DEBUG [DatasetUtil         ] Reading /home/mcroos/code/aih/aih_project/data/train/332.xml
22:00:32 DEBUG [DatasetUtil         ] Reading /home/mcroos/code/aih/aih_project/data/train/315.xml
22:00:32 DEBUG [DatasetUtil         ] Reading /home/mcroos/code/aih/aih_project/data/train/210.xml

Results

Raw results for all experiments are stored in CSV format in the stats directory.
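
To inspect the raw CSV files without opening them one by one, something like the following sketch could be used. It assumes each file begins with a header row. The column names are not documented here, so the sketch makes no assumption about them, and the function name `load_stats` is hypothetical.

```python
import csv
from pathlib import Path


def load_stats(stats_dir):
    """Map each CSV file name in stats_dir to its rows as a list of dicts.

    Assumes the first row of every file is a header row.
    """
    tables = {}
    for csv_path in sorted(Path(stats_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            tables[csv_path.name] = list(csv.DictReader(handle))
    return tables
```

Each row comes back as a dictionary keyed by the header names found in that file, with all values as strings.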

The image below summarizes the reproduced results, which match those reported by the authors.

[Screenshot: results.PNG]