CS598DL4H
Reproducing "Evaluating shallow and deep learning strategies for the 2018 n2c2 shared task on clinical text classification"
Original Citation
If you use data or code in your work, please cite the original JAMIA paper:
@article{oleynik2019evaluating,
title={Evaluating shallow and deep learning strategies for the 2018 n2c2 shared task on clinical text classification},
author={Michel Oleynik and Amila Kugic and Zdenko Kasáč and Markus Kreuzthaler},
journal={Journal of the American Medical Informatics Association},
publisher={Oxford University Press},
year={2019}
}
Code Dependencies
- JDK8+
- python3 (to run official evaluation scripts)
- make (to compile fastText)
- gcc/clang (to compile fastText)
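A quick sanity check that the toolchain is in place (a minimal sketch; exact version output varies by distribution):
java -version        # expect JDK 8 or newer
python3 --version    # needed for the official evaluation scripts
make --version       # needed to compile fastText
gcc --version        # or: clang --version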
Usage
Dataset download
Dataset request portal:
https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/
Dataset: 2018 (Track 1) Clinical Trial Cohort Selection Challenge downloads:
- Training data: gold standard training data
- Test data: gold standard test data
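Once downloaded, extract the archives into the data directory. A minimal sketch, assuming the archive names shown in the listing below (the internal layout of the zips may differ):
cd /home/mcroos/code/aih/aih_project/data           # replace with your own data directory
unzip train.zip -d train                            # gold standard training XMLs
unzip n2c2-t1_gold_standard_test_data.zip -d test   # gold standard test XMLs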
Data Location Setup
For this project, all data was kept under /home/mcroos/code/aih/aih_project/data.
Replace this hard-coded location with your own data directory wherever it appears in the codebase (see the grep sketch after the listing below).
mcroos@INDRAN-I6700K:~/code/aih/aih_project/data
$ ls
bio_embedding_extrinsic.bin n2c2-fasttext.bin n2c2-fasttext.vec n2c2-t1_gold_standard_test_data.zip sentences.txt test train train.zip vectors.tsv vectors.vec vocab.txt
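To find every occurrence of the hard-coded path, a recursive grep over the sources works; a sketch (the file patterns are assumptions):
# List all source lines referencing the hard-coded data location.
grep -rn "aih_project/data" --include="*.java" --include="*.sh" .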
Installation - Prerequisites
Install fastText: https://fasttext.cc/docs/en/support.html
Download the pre-trained BioWordVec embeddings: https://github.com/ncbi-nlp/BioSentVec
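Building fastText from source follows the upstream instructions; a minimal sketch:
git clone https://github.com/facebookresearch/fastText.git
cd fastText
make            # needs make and gcc/clang (see dependencies above)
./fasttext      # prints the usage summary if the build succeeded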
Generate vectors
Sentence dumper that writes sentences.txt for the self-trained vectors:
/usr/lib/jvm/java-11-openjdk-amd64/bin/java at.medunigraz.imi.bst.n2c2.runner.SentenceDumper
Vocabulary dumper for the pre-trained BioWordVec embeddings:
/usr/lib/jvm/java-11-openjdk-amd64/bin/java at.medunigraz.imi.bst.n2c2.nn.VocabularyDumper
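Both invocations assume the compiled classes are already on the classpath. A sketch of an explicit invocation, assuming the classes live under n2c2/target/classes (as in the supervised command further below); third-party dependencies may also need to be appended:
# Hypothetical explicit classpath; adjust to your build layout.
java -cp n2c2/target/classes at.medunigraz.imi.bst.n2c2.runner.SentenceDumper
java -cp n2c2/target/classes at.medunigraz.imi.bst.n2c2.nn.VocabularyDumper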
Command to train fastText on our own texts and generate vectors for the training pipeline:
mcroos@INDRAN-I6700K:~/code/aih/aih_project/data
$ /home/mcroos/code/aih/fastText/fasttext skipgram -input sentences.txt -output n2c2-fasttext -dim 200 -t 0.001 -minCount 0 -neg 10 -wordNgrams 6 -ws 20
Read 0M words
Number of words: 23331
Number of labels: 0
Progress: 100.0% words/sec/thread: 10178 lr: 0.000000 avg.loss: 2.029469 ETA: 0h 0m 0s
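One way to sanity-check the trained embeddings is fastText's interactive nearest-neighbor query (the query term is up to you):
# Query nearest neighbors in the trained embedding space.
/home/mcroos/code/aih/fastText/fasttext nn n2c2-fasttext.bin
# At the "Query word?" prompt, type a clinical term and inspect the neighbors.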
mcroos@INDRAN-I6700K:~/code/aih/aih_project/data
$ ../n2c2/scripts/print_vectors.sh n2c2-fasttext.bin
Supervised training command using the self-trained vectors:
/home/mcroos/code/aih/fastText/fasttext supervised -input /home/mcroos/code/aih/aih_project/data/fasttext/train.txt -output /home/mcroos/code/aih/aih_project/data/fasttext/model -thread 1 -epoch 100 -lr 0.50 -dim 200 -pretrainedVectors /home/mcroos/code/aih/aih_project/n2c2/target/classes/self-trained-vectors.vec
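fastText's supervised mode expects one example per line with labels prefixed by __label__; a sketch of the expected train.txt format and a hypothetical held-out evaluation (test.txt and the label name are illustrative):
# Example train.txt line (label name illustrative):
# __label__met patient has a history of diabetes mellitus ...
# Hypothetical evaluation with fastText's built-in tester:
/home/mcroos/code/aih/fastText/fasttext test /home/mcroos/code/aih/aih_project/data/fasttext/model.bin /home/mcroos/code/aih/aih_project/data/fasttext/test.txt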
Execution
This project was reproduced under Visual Studio Code.
For easy execution, use the Run command on the main method in Visual Studio Code, or run it from the command line as shown below:
mcroos@INDRAN-I6700K:~/code/aih/aih_project
$ /usr/lib/jvm/java-11-openjdk-amd64/bin/java at.medunigraz.imi.bst.n2c2.ClassifierRunner
22:00:32 INFO [DatasetUtil ] Loading 202 files from /home/mcroos/code/aih/aih_project/data/train ...
22:00:32 DEBUG [DatasetUtil ] Reading /home/mcroos/code/aih/aih_project/data/train/255.xml
22:00:32 DEBUG [DatasetUtil ] Reading /home/mcroos/code/aih/aih_project/data/train/332.xml
22:00:32 DEBUG [DatasetUtil ] Reading /home/mcroos/code/aih/aih_project/data/train/315.xml
22:00:32 DEBUG [DatasetUtil ] Reading /home/mcroos/code/aih/aih_project/data/train/210.xml
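The official evaluation scripts (the python3 dependency above) score predicted patient XMLs against the gold standard. A hypothetical invocation, since the script name and argument order depend on the release bundled with the n2c2 download:
# Hypothetical; check the script's --help for the actual interface.
python3 track1_eval.py <gold_dir> <predictions_dir>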
Results
Raw results for all experiments are stored in CSV format in the stats directory.
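A quick way to inspect the raw CSVs in the terminal (a sketch; the file names inside stats/ vary by experiment):
# Pretty-print one result CSV as an aligned table.
column -s, -t < stats/<experiment>.csv | less -S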
The image below summarizes the reproduced results, which match the results reported by the authors.