The aim of this mini-experiment is to see how well pre-trained BERT (pre-trained on English Wikipedia and BookCorpus with MLM and NSP as the pre-training tasks) performs at punctuation restoration on conversational data, which is why I will be using movie transcripts.
We will be using the Cornell Movie Corpus; download the dataset from its homepage.
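For reference, the corpus ships as plain-text files whose fields are separated by `" +++$+++ "`; here is a minimal sketch of reading the raw utterances (assuming the standard `movie_lines.txt` file and its usual ISO-8859-1 encoding):

```python
# movie_lines.txt fields: lineID, characterID, movieID, character
# name, and the utterance text, separated by " +++$+++ ".
with open("cornell movie-dialogs corpus/movie_lines.txt",
          encoding="iso-8859-1") as f:
    for line in f:
        utterance = line.rstrip("\n").split(" +++$+++ ")[-1]
        print(utterance)
        break  # just show the first utterance
```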
Clone the repo:
```bash
git clone https://github.com/matthiaslmz/BERT-Punctuation-Restoration.git
```
Next, create and activate a new conda environment:

```bash
conda env create -f enviroment.yml
conda activate whatisbert
```
Make sure to download the `bert-base-uncased` model weights, vocab, and config files (available, e.g., from the Hugging Face model hub).
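If you prefer to fetch these programmatically, here is a minimal sketch using the Hugging Face `transformers` library (an assumption on my part; the repo may expect the files to be downloaded manually). `from_pretrained` downloads and caches all three on first use:

```python
from transformers import BertModel, BertTokenizer

# First call downloads and caches the vocab, config, and
# pre-trained weights for bert-base-uncased.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Sanity check: tokenize a lowercase sentence the way BERT expects.
encoded = tokenizer("hello how are you doing today", return_tensors="pt")
outputs = model(**encoded)
print(outputs.last_hidden_state.shape)  # (1, seq_len, 768)
```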
- If an error occurs, first make sure that a GPU is actually available.
- If the error persists, the training batch size may be too large for your GPU; decrease it if necessary.
- By default in `pipeline.py`, if a GPU is available (`torch.cuda.is_available()` returns `True`), the device defaults to the first GPU, indexed 0. If you need to use the Xth GPU, edit the line as follows, substituting the GPU's zero-based index for `X-1` (see the sketch after this list): `DEVICE = "cuda:X-1" if torch.cuda.is_available() else "cpu"`
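For example, a minimal sketch of this device selection (the `DEVICE` name follows `pipeline.py`; the tiny linear model below is a hypothetical placeholder, just to show moving weights and inputs onto the chosen device):

```python
import torch

# Use the second physical GPU (zero-based index 1) when CUDA is
# available; otherwise fall back to the CPU.
# Assumes the machine actually has at least two GPUs.
DEVICE = "cuda:1" if torch.cuda.is_available() else "cpu"

# Hypothetical placeholder model and batch, moved to the same device.
model = torch.nn.Linear(768, 4).to(DEVICE)
inputs = torch.randn(8, 768, device=DEVICE)
logits = model(inputs)
print(logits.device)  # confirms which device ran the forward pass
```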
When choosing which punctuation marks to keep via the `kept_punctuation` parameter of `create_moveis_dataset()`, you MUST also make sure that the number of output classes in the model's final layer matches. This is specified with the `num_labels` parameter of `BERTPuncResto`; a sketch of keeping the two in sync follows.
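This is a hypothetical sketch only: the names `create_moveis_dataset`, `kept_punctuation`, `BERTPuncResto`, and `num_labels` come from the repo, but the import path, exact signatures, example label set, and the extra "no punctuation" class are all assumptions on my part:

```python
# Import path assumed; adjust to wherever the repo defines these.
from pipeline import create_moveis_dataset, BERTPuncResto

# Assumed example set of punctuation classes to restore.
KEPT_PUNCTUATION = [",", ".", "?"]

# One output class per kept mark, plus one "no punctuation" class
# (assumption about how the labels are encoded).
NUM_LABELS = len(KEPT_PUNCTUATION) + 1

dataset = create_moveis_dataset(kept_punctuation=KEPT_PUNCTUATION)
model = BERTPuncResto(num_labels=NUM_LABELS)  # head matches the label set
```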