---
language: bn
license: mit
---
It has been a long journey. Here is our Bangla-Bert! It is now available on the Hugging Face model hub.
Bangla-Bert-Base is a pretrained language model for Bengali, trained with masked language modeling as described in BERT and its GitHub repository.
NB: If you use this model for any NLP task, please share your evaluation results with us; we will add them here.
Corpus was downloaded from two main sources:
- Bengali Common Crawl corpus downloaded from OSCAR
- Bengali Wikipedia Dump Dataset
After downloading these corpora, we preprocessed them into the BERT pretraining format: one sentence per line, with a blank line separating documents.
```
sentence 1
sentence 2

sentence 1
sentence 2
```
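The preprocessing step above can be sketched as follows; `to_bert_format` is a hypothetical helper name used only for illustration, not part of any released code:

```python
# Sketch: convert a list of documents (each a list of sentences) into the
# BERT pretraining format -- one sentence per line, with a blank line
# between documents.
def to_bert_format(documents):
    blocks = ["\n".join(doc) for doc in documents]
    return "\n\n".join(blocks) + "\n"

docs = [["sentence 1", "sentence 2"], ["sentence 1", "sentence 2"]]
print(to_bert_format(docs))
```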
We used the BNLP package to train a Bengali SentencePiece model with a vocabulary size of 102,025, and then converted the output vocab file into the BERT format. Our final vocab file is available at https://github.com/sagorbrur/bangla-bert and on the Hugging Face model hub.
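The vocab conversion can be sketched as below. This is an assumed reconstruction, not the repository's actual script: SentencePiece marks word-initial pieces with "▁", while BERT WordPiece marks continuation pieces with "##", so the prefixes are swapped, and BERT's special tokens are prepended.

```python
# Assumed conversion of a SentencePiece .vocab file ("piece<TAB>logprob" per
# line) into BERT's vocab.txt layout. The special-token order is the usual
# BERT convention, assumed here rather than taken from the released vocab.
SPECIALS = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]

def sp_vocab_to_bert(sp_vocab_lines):
    tokens = list(SPECIALS)
    for line in sp_vocab_lines:
        piece = line.split("\t")[0]
        if piece in ("<unk>", "<s>", "</s>"):  # SentencePiece control tokens
            continue
        if piece.startswith("\u2581"):          # word-initial piece: drop "▁"
            tokens.append(piece[1:] or piece)
        else:                                   # continuation piece: add "##"
            tokens.append("##" + piece)
    return tokens

print(sp_vocab_to_bert(["<unk>\t0", "\u2581গান\t-3.2", "য\t-4.0"]))
```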
- Bangla-Bert was trained with code provided in Google BERT's github repository (https://github.com/google-research/bert)
- Currently released model follows bert-base-uncased model architecture (12-layer, 768-hidden, 12-heads, 110M parameters)
- Total Training Steps: 1 Million
- The model was trained on a single Google Cloud TPU
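The two pretraining stages from Google BERT's repository can be sketched as below. The scripts and flag names come from https://github.com/google-research/bert, but the paths and hyperparameter values are illustrative assumptions, not the exact settings used for this model:

```shell
# Illustrative sketch only; actual hyperparameters may differ.

# 1. Convert the one-sentence-per-line corpus into masked-LM training examples.
python create_pretraining_data.py \
  --input_file=corpus/*.txt \
  --output_file=tf_examples.tfrecord \
  --vocab_file=vocab.txt \
  --do_lower_case=False \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5

# 2. Pretrain on a Cloud TPU for 1 million steps.
python run_pretraining.py \
  --input_file=tf_examples.tfrecord \
  --output_dir=pretraining_output \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=bert_config.json \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=1000000 \
  --learning_rate=1e-4 \
  --use_tpu=True \
  --tpu_name=$TPU_NAME
```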
After training for 1 million steps, here are the evaluation results:
```
global_step = 1000000
loss = 2.2406516
masked_lm_accuracy = 0.60641736
masked_lm_loss = 2.201459
next_sentence_accuracy = 0.98625
next_sentence_loss = 0.040997364
perplexity = numpy.exp(2.2406516) ≈ 9.3995
Loss for final step: 2.426227
```
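The perplexity figure above is just the exponential of the evaluation loss, which is easy to verify with the standard library:

```python
import math

# Evaluation loss reported above; perplexity = exp(loss).
loss = 2.2406516
perplexity = math.exp(loss)
print(f"perplexity = {perplexity:.4f}")
```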
## Bangla BERT Tokenizer
```python
from transformers import AutoTokenizer

bnbert_tokenizer = AutoTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
text = "আমি বাংলায় গান গাই।"
print(bnbert_tokenizer.tokenize(text))
# ['আমি', 'বাংলা', '##য', 'গান', 'গাই', '।']
```
## Mask Generation
You can use this model directly with a pipeline for masked language modeling:
```python
from transformers import BertForMaskedLM, BertTokenizer, pipeline

model = BertForMaskedLM.from_pretrained("sagorsarker/bangla-bert-base")
tokenizer = BertTokenizer.from_pretrained("sagorsarker/bangla-bert-base")
nlp = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in nlp(f"আমি বাংলায় {nlp.tokenizer.mask_token} গাই।"):
    print(pred)
# {'sequence': '[CLS] আমি বাংলায গান গাই । [SEP]', 'score': 0.13404667377471924, 'token': 2552, 'token_str': 'গান'}
```
- Thanks to the Google TensorFlow Research Cloud (TFRC) for providing free TPU credits.
- Thanks to everyone around us who keeps helping us build resources for Bengali.