# National NLP Clinical Challenges (n2c2)
# 598 Reproducibility Project

[![Codacy Badge](https://api.codacy.com/project/badge/Grade/61f8f04b341a482f95b9a38073575860)](https://app.codacy.com/app/michelole/n2c2?utm_source=github.com&utm_medium=referral&utm_content=bst-mug/n2c2&utm_campaign=badger)
[![Build Status](https://travis-ci.org/bst-mug/n2c2.svg?branch=master)](https://travis-ci.org/bst-mug/n2c2)
[![Coverage Status](https://coveralls.io/repos/github/bst-mug/n2c2/badge.svg?branch=master)](https://coveralls.io/github/bst-mug/n2c2?branch=master)
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

A repository containing support code and resources developed at the [Institute for Medical Informatics, Statistics and Documentation at the Medical University of Graz (Austria)](https://www.medunigraz.at/imi/en/) for participation in the [2018 n2c2 Shared-Task Track 1](https://n2c2.dbmi.hms.harvard.edu/), organized by the Department of Biomedical Informatics at Harvard Medical School.

## Steps for Using This Code Repo

1. Obtain the original input dataset. One way is to submit a data access request at https://portal.dbmi.hms.harvard.edu/projects/n2c2-nlp/.
2. Once you have the input dataset, place the test and training data under the `/data/test` and `/data/train` folders, respectively.
3. Run the program `SentenceDumper.java` to generate the `sentences.txt` file.
4. Run the program `VocabularyDumper.java` to generate the `vocab.txt` file.
5. Install the fastText program on your working machine. You can follow https://fasttext.cc/docs/en/supervised-tutorial.html to install it.
6. Copy the `sentences.txt` file into the `scripts` folder, then run the script `train_embeddings.sh`, which will generate the `n2c2-fasttext` model.
7. Copy the `vocab.txt` file into the `scripts` folder. Download `BioWordVec_PubMed_MIMICIII_d200.bin` from https://github.com/ncbi-nlp/BioSentVec. Run the script `print_pre_trained_vectors.sh` to generate the pre-trained embeddings and the script `print_self_trained_vectors.sh` to generate the self-trained embeddings. Then copy both embedding files into the same folder as the class files generated by Java.
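The embedding steps above can be sketched as a shell session. This is a sketch only: it assumes the dataset is already in place, the Java dumpers have already produced `sentences.txt` and `vocab.txt`, and it is run from the repository root; how the Java programs are launched depends on your build setup.

```shell
# Sketch of steps 6-7, run from the repository root.
# Assumes SentenceDumper and VocabularyDumper have already been run.

cp sentences.txt vocab.txt scripts/
cd scripts

# Step 6: train the self-trained fastText embeddings on the dumped sentences.
./train_embeddings.sh    # produces the n2c2-fasttext model

# Step 7: print pre-trained and self-trained vectors for the vocabulary.
# BioWordVec_PubMed_MIMICIII_d200.bin must first be downloaded from
# https://github.com/ncbi-nlp/BioSentVec and placed in this folder.
./print_pre_trained_vectors.sh
./print_self_trained_vectors.sh
```

Afterwards, copy the two generated embedding files next to the compiled Java classes as described in step 7.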
## Citing

If you use data or code in your work, please cite our [JAMIA paper](https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocz149/5568257):
```
@article{oleynik2019evaluating,
title={Evaluating shallow and deep learning strategies for the 2018 n2c2 shared-task on clinical text classification},
author={Michel Oleynik and Amila Kugic and Zdenko Kasáč and Markus Kreuzthaler},
journal={Journal of the American Medical Informatics Association},
publisher={Oxford University Press},
year={2019}
}
```
Also of interest:
- Our [n2c2 presentation slides](https://www.medunigraz.at/imi/de/n2c2.Presentation_V6.pdf)
## Code Dependencies
- JDK8+