Skip to content
Snippets Groups Projects
Commit fe1bb1d7 authored by aBotel's avatar aBotel
Browse files

Add install script for Py deps. Add skeleton .ipynb that loads BERT.

parent 0e7f1962
No related branches found
No related tags found
No related merge requests found
......@@ -8,6 +8,8 @@ To make it easy for you to get started with GitLab, here's a list of recommended
Already a pro? Just edit this README.md and make it your own. Want to make it easy? [Use the template at the bottom](#editing-this-readme)!
Colab info: https://answers.illinois.edu/illinois/122558
## Add your files
- [ ] [Create](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#create-a-file) or [upload](https://docs.gitlab.com/ee/user/project/repository/web_editor.html#upload-a-file) files
......
#!/bin/bash
# Maybe uncomment this to activate your conda env.
# conda activate <your_conda_env>
# Make sure the BERT model is available in PyTorch
#conda config --add channels conda-forge
#conda install --yes -c conda-forge pytorch-pretrained-bert
pip install pytorch-transformers
pip install pyhealth
%% Cell type:code id:3225717c-458b-45cb-9e8b-89ecd03b9d07 tags:
``` python
```
%% Cell type:markdown id:7ccba6af-1867-491d-b9d0-d99965738c64 tags:
# Imports
%% Cell type:code id:5da368cd-3045-4ec0-86fb-2ce15b5d1b92 tags:
``` python
# General includes.
import os
# Numerical includes.
import numpy as np
import pandas as pd
import torch
from torch import nn
# Pyhealth includes.
from pyhealth.datasets import MIMIC3Dataset
```
%% Cell type:code id:95e714b1-ca5d-4d7b-9ace-6b4372adfd9f tags:
``` python
# Model imports
from pytorch_transformers import BertTokenizer, BertModel, BertForMaskedLM
```
%% Cell type:markdown id:cf83fa50-9dd0-467b-bd36-634795e4a09c tags:
# Globals
%% Cell type:code id:ebef3528-0396-459a-8112-b272089f5d67 tags:
``` python
USE_GPU_ = False
```
%% Cell type:markdown id:df0f8271-44b2-44a2-b7cc-cd7339a70e87 tags:
## Preprocessing
%% Cell type:code id:02624f53-6cee-4db0-8b15-e08b755a04f2 tags:
``` python
```
%% Cell type:markdown id:ebe2c2d9-c161-4b9d-8182-86384faea929 tags:
### Load BERT Model
%% Cell type:markdown id:39b0edec-1570-4640-9926-d908b456f957 tags:
See instructions here:
- https://pypi.org/project/pytorch-pretrained-bert/#examples
- https://neptune.ai/blog/how-to-code-bert-using-pytorch-tutorial
- Could also get it from pytorch transformers library: https://pytorch.org/hub/huggingface_pytorch-transformers/
%% Cell type:code id:f40665f4-db92-4567-b0f9-001bd35e9284 tags:
``` python
# Load pre-trained model tokenizer (vocabulary)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# TODO(botelho3)`bert-base-uncased` is big. Load `bert-tiny` instead from the filesystem?
# Model available at https://huggingface.co/prajjwal1/bert-tiny.
# model = BERT_CLASS.from_pretrained(PRE_TRAINED_MODEL_NAME_OR_PATH, cache_dir=None)
# Tokenized input
text = "[CLS] Who was Jim Henson ? [SEP] Jim Henson was a puppeteer [SEP]"
tokenized_text = tokenizer.tokenize(text)
# Mask a token that we will try to predict back with `BertForMaskedLM`
masked_index = 8
tokenized_text[masked_index] = '[MASK]'
assert tokenized_text == ['[CLS]', 'who', 'was', 'jim', 'henson', '?', '[SEP]', 'jim', '[MASK]', 'was', 'a', 'puppet', '##eer', '[SEP]']
# Convert token to vocabulary indices
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Define sentence A and B indices associated to 1st and 2nd sentences (see paper)
segments_ids = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
```
%% Output
100%|█████████████████████████████████████████████████████████| 231508/231508 [00:00<00:00, 877100.29B/s]
%% Cell type:code id:a7539b72-a871-4d52-947b-075b5c4be2ae tags:
``` python
```
%% Cell type:code id:de132943-2c3e-405b-b2f8-5d348472672c tags:
``` python
# Load pre-trained model (weights)
model = BertForMaskedLM.from_pretrained('bert-base-uncased')
model.eval()
# If you have a GPU, put everything on cuda
if USE_GPU_:
tokens_tensor = tokens_tensor.to('cuda')
segments_tensors = segments_tensors.to('cuda')
model.to('cuda')
# Predict all tokens
with torch.no_grad():
predictions = model(tokens_tensor, segments_tensors)
# confirm we were able to predict 'henson'
predicted_index = torch.argmax(predictions[0, masked_index]).item()
predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
assert predicted_token == 'henson'
```
%% Output
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
/var/folders/bw/pyw_1xcj0f302h0krt1_f5lm0000gn/T/ipykernel_23031/2203343201.py in <module>
14
15 # confirm we were able to predict 'henson'
---> 16 predicted_index = torch.argmax(predictions[0, masked_index]).item()
17 predicted_token = tokenizer.convert_ids_to_tokens([predicted_index])[0]
18 assert predicted_token == 'henson'
TypeError: tuple indices must be integers or slices, not tuple
%% Cell type:code id:02efb18a-626d-42d4-b19d-ba4b8f752a00 tags:
``` python
```
%% Cell type:code id:0e2d58de-bf9a-4b57-977f-6aa3ac75e84f tags:
``` python
# https://neptune.ai/blog/how-to-code-bert-using-pytorch-tutorial
class BERTClassification(nn.Module):
def __init__ (self):
super(BERTClassification, self).__init__()
self.bert = BertModel.from_pretrained('bert-base-cased')
self.bert_drop = nn.Dropout(0.4)
self.out = nn.Linear(768, 1)
def forward(self, ids, mask, token_type_ids):
_, pooledOut = self.bert(ids, attention_mask = mask,
token_type_ids=token_type_ids)
bertOut = self.bert_drop(pooledOut)
output = self.out(bertOut)
return output
```
%% Cell type:markdown id:8f0d5fa9-226e-4612-b28f-ed24de16399a tags:
### Load
%% Cell type:code id:427a42a6-441e-438c-96f4-b38ba82fd192 tags:
``` python
```
%% Cell type:code id:1056b60a-5861-4e47-b694-891053cc8470 tags:
``` python
```
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment