diff --git a/expemb/LICENSE b/LICENSE
similarity index 100%
rename from expemb/LICENSE
rename to LICENSE
diff --git a/README.md b/README.md
index 2fce8419d9b20996fd665f3bd953c187f1501d0b..acddba0c67e40a96ffc951bd05a1b953815926fd 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,10 @@ Setup the environment using `conda` as follows:
 conda env create -n expembtx -f environment.yml
 ```

+## Datasets
+The datasets are available [here](https://osf.io/9tdqg/?view_only=78c364b3c71f43b5b414deac81cf863b).
+
+
 ## Training and Evaluation
 ### Setup
 To run the training and evaluation pipeline in this repository, [eqnet](https://github.com/mast-group/eqnet/) is required. As it can not be installed as a dependency, clone this repository and add it to `PYTHONPATH`.
@@ -24,32 +28,32 @@ Example:
 python train_expembtx.py \
     --train_file <TRAIN_FILE> \
     --val_file <VAL_FILE> \
-    --n_epochs 100 \
+    --n_epochs <N_EPOCHS> \
     --norm_first True \
     --optim Adam \
     --weight_decay 0 \
     --lr 0.0001 \
-    --train_batch_size 128 \
+    --train_batch_size <TRAIN_BATCH_SIZE> \
     --run_name <RUN_NAME> \
-    --val_batch_size 256 \
+    --val_batch_size <EVAL_BATCH_SIZE> \
     --grad_clip_val 1 \
     --max_out_len 256 \
     --precision 16 \
     --save_dir <OUT_DIR> \
-    --early_stopping 5 \
-    --n_min_epochs 10 \
+    --early_stopping <EARLY_STOPPING> \
+    --n_min_epochs <N_MIN_EPOCHS> \
     --label_smoothing 0.1 \
     --seed 42
 ```

-Add `--semvec` option to the above-mentioned command for the SemVec datasets.
+Add `--semvec` option to the above-mentioned command for the SemVec datasets. For the SemVec datasets, `<TRAIN_FILE>` is not the original training file provided with the SemVec datasets but a version in the input-output format.

-For all supported options, use `python train_expembtx.py --help` or refer to [TrainingAgruments](expemb/args.py#TestingArguments).
+For all supported options, use `python train_expembtx.py --help` or refer to [TrainingAgruments](expemb/args.py#TrainingAgruments).

 ### Evaluation
-To evaluate a trained model, `test_expembtx.py` may be used.
+To evaluate a trained model, `test_expembtx.py` may be used. The options may vary depending if the model is trained on the Equivalent Expressions Dataset or the SemVec datasets.

-Example:
+For the Equivalent Expressions Dataset, the following command may be used to test the model accuracy. On completion, it will generate a file containing the results inside `<SAVED_MODEL_DIR>` with `<RESULT_FILE_PREFIX>` as the file name prefix.
 ```
 python test_expembtx.py \
     --test_file <TEST_FILE> \
@@ -60,6 +64,16 @@ python test_expembtx.py \
     --batch_size 32
 ```

+For the SemVec datasets, the following command may be used.
+```
+python test_expembtx.py \
+    --test_file <TEST_FILE> \
+    --full_file <SEMVEC_FULL_DATASET> \
+    --ckpt_name best_max \
+    --save_dir <SAVED_MODEL_DIR> \
+    --semvec
+```
+
 For all supported options, use `python test_expembtx.py --help` or refer to [TestingArguments](expemb/args.py#TestingArguments).

 ## Embedding Mathematics
@@ -91,5 +105,5 @@ For all supported options, use `python run_embmath.py --help` or refer to [Dista
 ## Embedding Plots
 For embedding plots, refer to [embedding_plots.ipynb](notebooks/embedding_plots.ipynb).

-## Wandb Integration
-This repository supports wandb integration. To start using it, login to wandb using `wandb login`. To disable wandb, set the environment variable `WANDB_MODE=offline`.
\ No newline at end of file
+## Weights & Biases (wandb) Integration
+This repository supports wandb integration. To start using it, login to wandb using `wandb login`. To disable wandb, set the environment variable `WANDB_MODE=offline`.
diff --git a/data.dvc b/data.dvc
index 9e6eb57d08ec837186f04357cce51a6b58f697f7..6525e677da476a2ffa7c7f9e933fbdb4b8263b3c 100644
--- a/data.dvc
+++ b/data.dvc
@@ -1,5 +1,5 @@
 outs:
-- md5: dd9adab06b0b971ca76b127229ca272e.dir
-  size: 1056242338
-  nfiles: 125
+- md5: 8f77cd8265892df56a3ffd2a7a785b2b.dir
+  size: 1056244911
+  nfiles: 127
   path: data