The aim of this mini-experiment is to see how well pre-trained BERT (pre-trained on English Wikipedia and BookCorpus with MLM and NSP as the pre-training tasks) performs at punctuation restoration on conversational data, which is why I will be using movie transcripts.
We will be using the Cornell Movie Corpus; download the dataset from its homepage.
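For reference, the corpus ships as plain-text files whose fields are separated by `" +++$+++ "`; here is a minimal sketch of reading the raw utterances (assuming the standard `movie_lines.txt` file and its usual ISO-8859-1 encoding):

```python
# movie_lines.txt fields: lineID, characterID, movieID, character
# name, and the utterance text, separated by " +++$+++ ".
with open("cornell movie-dialogs corpus/movie_lines.txt",
          encoding="iso-8859-1") as f:
    for line in f:
        utterance = line.rstrip("\n").split(" +++$+++ ")[-1]
        print(utterance)
        break  # just show the first utterance
```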
Clone the repo:
```bash
git clone https://github.com/matthiaslmz/BERT-Punctuation-Restoration.git
```
Next, create and activate a new conda environment:

```bash
conda env create -f enviroment.yml
conda activate whatisbert
```
Make sure to download the `bert-base-uncased` model weights, vocab, and config files (available, e.g., from the Hugging Face model hub).
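If you prefer to fetch these programmatically, here is a minimal sketch using the Hugging Face `transformers` library (an assumption on my part; the repo may expect the files to be downloaded manually). `from_pretrained` downloads and caches all three on first use:

```python
from transformers import BertModel, BertTokenizer

# First call downloads and caches the vocab, config, and
# pre-trained weights for bert-base-uncased.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sanity check: tokenize a lowercase sentence the way BERT expects.
encoded = tokenizer("hello how are you doing today", return_tensors="pt")
outputs = model(**encoded)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```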
- If an error occurs, first make sure that a GPU is actually available.
- If the error persists, the training batch size may be too large for your GPU; decrease it if necessary.
- By default in `pipeline.py`, if a GPU is available (`torch.cuda.is_available()` returns `True`), the device defaults to the first GPU, indexed 0. If you need to use the Xth GPU, edit the line as follows, substituting the GPU's zero-based index for `X-1` (see the sketch after this list): `DEVICE = "cuda:X-1" if torch.cuda.is_available() else "cpu"`
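For example, a minimal sketch of this device selection (the `DEVICE` name follows `pipeline.py`; the tiny linear model below is a hypothetical placeholder, just to show moving weights and inputs onto the chosen device):

```python
import torch

# Use the second physical GPU (zero-based index 1) when CUDA is
# available; otherwise fall back to the CPU.
# Assumes the machine actually has at least two GPUs.
DEVICE = "cuda:1" if torch.cuda.is_available() else "cpu"

# Hypothetical placeholder model and batch, moved to the same device.
model = torch.nn.Linear(768, 4).to(DEVICE)
inputs = torch.randn(8, 768, device=DEVICE)
logits = model(inputs)
print(logits.device)  # confirms which device ran the forward pass
```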
When choosing which punctuation marks to keep via the `kept_punctuation` parameter of `create_moveis_dataset()`, you MUST also make sure that the number of output classes in the model's final layer matches. This is specified with the `num_labels` parameter of `BERTPuncResto`; a sketch of keeping the two in sync follows.
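This is a hypothetical sketch only: the names `create_moveis_dataset`, `kept_punctuation`, `BERTPuncResto`, and `num_labels` come from the repo, but the import path, exact signatures, example label set, and the extra "no punctuation" class are all assumptions on my part:

```python
# Import path assumed; adjust to wherever the repo defines these.
from pipeline import create_moveis_dataset, BERTPuncResto

# Assumed example set of punctuation classes to restore.
KEPT_PUNCTUATION = [",", ".", "?"]

# One output class per kept mark, plus one "no punctuation" class
# (assumption about how the labels are encoded).
NUM_LABELS = len(KEPT_PUNCTUATION) + 1

dataset = create_moveis_dataset(kept_punctuation=KEPT_PUNCTUATION)
model = BERTPuncResto(num_labels=NUM_LABELS)  # head matches the label set
```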