
Pretrained Vietnamese BERT models

This is a repository of pretrained Vietnamese BERT models. The pretrained models are available along with the source code used for pretraining.

Features

  • All the models are trained on Vietnamese Wikipedia.

  • All the models are trained with the same configuration as the original BERT: 512 tokens per instance, 256 instances per batch, and 1M training steps.

  • We also distribute models trained with Whole Word Masking enabled; all of the subword tokens corresponding to a single word (as segmented by Underthesea) are masked at once (see the sketch after this list).

  • Along with the models, we provide tokenizers that are compatible with the ones defined in Transformers by Hugging Face.

  • The underthesea library is used for sentence processing.

  • v2: updated Vietnamese tokenizer normalization (`VietNameseTokenierNormalize`).
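
For illustration, here is a minimal sketch of how whole word masking interacts with Underthesea word segmentation: all subword pieces belonging to one segmented word are masked together. This is only a simplified illustration, not the exact logic of create_pretraining_data.py; the vocabulary path and masking probability are placeholders.

```python
# Simplified whole-word-masking sketch (illustrative only, placeholder vocab path).
import random
from underthesea import word_tokenize      # Vietnamese word segmentation
from transformers import BertTokenizer

tokenizer = BertTokenizer("vocab.txt")     # vocab.txt from a released model archive

def whole_word_mask(sentence, mask_prob=0.15):
    masked = []
    for word in word_tokenize(sentence):   # e.g. ['Hà Nội', 'là', 'thủ đô', ...]
        subwords = tokenizer.tokenize(word)
        if random.random() < mask_prob:
            # Mask every subword of this word at once (whole word masking).
            masked.extend([tokenizer.mask_token] * len(subwords))
        else:
            masked.extend(subwords)
    return masked

print(whole_word_mask("Hà Nội là thủ đô của Việt Nam."))
```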

Pretrained models

  • BERT-base models (12-layer, 768-hidden, 12-heads, 110M parameters)

All the model archives include the following files. pytorch_model.bin and tf_model.h5 are compatible with Transformers.

.
├── config.json
├── model.ckpt.data-00000-of-00001
├── model.ckpt.index
├── model.ckpt.meta
├── pytorch_model.bin
├── tf_model.h5
└── vocab.txt

At present, only BERT-base models are available. We are planning to release BERT-large models in the future.

Requirements

For just using the models:

If you wish to pretrain a model:

Usage

Please refer to masked_lm_example.ipynb.
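
For a quick look without opening the notebook, the following is a rough sketch of masked-token prediction with the pretrained weights. The directory name is a placeholder for wherever you extracted a model archive; see masked_lm_example.ipynb for the full, authoritative version.

```python
# Masked-LM prediction sketch (placeholder model directory).
import torch
from transformers import BertTokenizer, BertForMaskedLM

model_dir = "./bert-vietnamese-base"   # extracted archive: config.json, vocab.txt, pytorch_model.bin
tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir)
model.eval()

inputs = tokenizer("Hà Nội là [MASK] của Việt Nam.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 predictions for the [MASK] position.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].topk(5).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))
```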

Details of pretraining

Corpus generation and preprocessing

All the distributed models are pretrained on Vietnamese Wikipedia. To generate the corpus, WikiExtractor is used to extract plain text from a Wikipedia dump file.

$ wget http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/viwiki/20200520/viwiki-20200520-pages-articles-multistream.xml.bz2
$ python wikiextractor/WikiExtractor.py --output /corpus --bytes 512M --compress --json --links --namespaces 0 --no_templates --min_text_length 16 --processes 20 viwiki-20200520-pages-articles-multistream.xml.bz2

Install the required libraries:

$ sudo bash preprocessing.sh

Some preprocessing is applied to the extracted texts, including splitting them into sentences and removing noisy markup.

$ seq -f %02g 0 3|xargs -L 1 -I {} -P 9 python bert-vietnamese/make_corpus.py --input_file /corpus/AA/wiki_{}.bz2 --output_file /corpus/corpus.txt.{} --vina_dict_path /path/to/neologd/dict/dir/
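
As a rough illustration of the sentence-splitting step (not the exact logic of make_corpus.py), Underthesea can split a paragraph into one sentence per line; this assumes your installed underthesea version provides `sent_tokenize`.

```python
# Illustrative sentence splitting, roughly what the preprocessing step aims for.
from underthesea import sent_tokenize

paragraph = (
    "Hà Nội là thủ đô của Việt Nam. Thành phố nằm bên bờ sông Hồng. "
    "Đây là trung tâm chính trị và văn hóa của cả nước."
)

for sentence in sent_tokenize(paragraph):
    print(sentence)   # one sentence per line in the generated corpus files
```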

Building vocabulary

As in the original BERT, we use byte-pair encoding (BPE) to obtain subwords. We use the BPE implementation in SentencePiece.

# For vocab models
$ python bert-vietnamese/build_vocab.py --input_file "/corpus/corpus.txt.*" --output_file "/base/vocab.txt" --subword_type bpe --vocab_size 32000
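
Under the hood, this relies on the SentencePiece library. A minimal stand-alone sketch of BPE vocabulary training (with placeholder file names, not the exact options used by build_vocab.py) would look like:

```python
# BPE vocabulary training sketch with SentencePiece (placeholder paths and options).
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus.txt",           # plain text, one sentence per line
    model_prefix="viwiki_bpe",    # writes viwiki_bpe.model and viwiki_bpe.vocab
    vocab_size=32000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="viwiki_bpe.model")
print(sp.encode("Hà Nội là thủ đô của Việt Nam.", out_type=str))
```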

Creating data for pretraining

With the vocabulary and text files above, we create dataset files for pretraining. Note that this process is highly memory-intensive and takes many hours.

# For 32k w/ whole word masking
# Note: each process will consume about 32GB RAM
$ seq -f %02g 0 8|xargs -L 1 -I {} -P 1 python create_pretraining_data.py --input_file /path/to/corpus/dir/corpus.txt.{} --output_file /path/to/base/dir/pretraining-data.tf_record.{} --do_whole_word_mask True --vocab_file /path/to/base/dir/vocab.txt --subword_type bpe --max_seq_length 512 --max_predictions_per_seq 80 --masked_lm_prob 0.15

# Same command with the concrete paths used in this setup
# Note: each process will consume about 32GB RAM
$ seq -f %02g 0 8|xargs -L 1 -I {} -P 1 python bert-vietnamese/create_pretraining_data.py --input_file /corpus/corpus.txt.{} --output_file /base/pretraining-data.tf_record.{} --do_whole_word_mask True --vocab_file /base/vocab.txt --subword_type bpe --max_seq_length 512 --max_predictions_per_seq 80 --masked_lm_prob 0.15

Training

We used Cloud TPUs to run pre-training.

For BERT-base models, v3-8 TPUs were used.

# For BERT-base models
$ python3 run_pretraining.py \
--input_file="/path/to/pretraining-data.tf_record.*" \
--output_dir="/path/to/output_dir" \
--bert_config_file=bert_base_32k_config.json \
--max_seq_length=512 \
--max_predictions_per_seq=80 \
--do_train=True \
--train_batch_size=256 \
--num_train_steps=1000000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=100000 \
--keep_checkpoint_max=10 \
--use_tpu=True \
--tpu_name=<tpu name> \
--num_tpu_cores=8

Using with Transformers

  • The model can be used with Transformers (see the example sketch below):
    • Tokenizer:
    • Model:
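
The identifier below is a placeholder, not a published Hub name; substitute the actual model name or a local directory containing the archive files.

```python
# Placeholder identifier: replace with the published model name or a local path.
from transformers import BertModel, BertTokenizer

model_name = "path/or/hub-id-of-bert-vietnamese-base"   # hypothetical identifier
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

inputs = tokenizer("Hà Nội là thủ đô của Việt Nam.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)   # (1, sequence_length, 768)
```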

Licenses

The pretrained models are distributed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license.

The code in this repository is distributed under the MIT License.

Related Work

Acknowledgments

For training models, we used Cloud TPUs provided by the TensorFlow Research Cloud program. Thanks to Japanese BERT!
