
ALBERT for Vietnamese

Introduction

ALBERT is "A Lite" version of BERT, a popular unsupervised language representation learning algorithm. ALBERT uses parameter-reduction techniques that allow for large-scale configurations, overcome previous memory limitations, and achieve better behavior with respect to model degradation.

For a technical description of the algorithm, see the paper:

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations (Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut), and the official repository, Google ALBERT.

Google researchers introduced three standout innovations with ALBERT. [1]

  • Factorized embedding parameterization: Researchers decoupled the size of the hidden layers from the size of the vocabulary embeddings by projecting one-hot vectors into a lower-dimensional embedding space and then into the hidden space, which makes it possible to increase the hidden layer size without significantly increasing the parameter count of the vocabulary embeddings (see the sketch after this list).

  • Cross-layer parameter sharing: Researchers chose to share all parameters across layers to prevent the parameters from growing along with the depth of the network. As a result, the large ALBERT model has about 18x fewer parameters compared to BERT-large.

  • Inter-sentence coherence loss: In the BERT paper, Google proposed a next-sentence prediction technique to improve the model’s performance in downstream tasks, but subsequent studies found this to be unreliable. Researchers used a sentence-order prediction (SOP) loss to model inter-sentence coherence in ALBERT, which enabled the new model to perform more robustly in multi-sentence encoding tasks.
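
To make the first point concrete, here is a minimal sketch (not from this repo) that computes embedding parameter counts with this project's numbers: vocab_size V = 30000, embedding_size E = 128, and hidden_size H = 768 for the base config.

V, E, H = 30_000, 128, 768

# A BERT-style embedding table maps the one-hot vocab directly to the hidden size.
bert_params = V * H                      # 23,040,000

# ALBERT factorizes this into a V x E table followed by an E x H projection.
albert_params = V * E + E * H            # 3,938,304

print(f"BERT-style embedding params:   {bert_params:,}")
print(f"ALBERT-style embedding params: {albert_params:,}")
print(f"reduction: {bert_params / albert_params:.1f}x")   # ~5.9x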

We reproduced ALBERT on a Vietnamese dataset and provide the pre-trained models below.

Data preparation

The training data is the Vietnamese Wikipedia corpus, downloaded from Wikipedia.

The data is preprocessed and extracted using WikiExtractor.
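
For example, a typical invocation of the pip-installed package looks like this (the dump filename and output path are placeholders):

python -m wikiextractor.WikiExtractor viwiki-latest-pages-articles.xml.bz2 \
  -o {path to extracted wiki text}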

We trained a SentencePiece model on the Vietnamese Wikipedia corpus to produce the vocab file, using a vocabulary size of 30,000.

The SentencePiece model and vocab file are in the assets folder.
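
As a rough sketch, a vocab like this can be trained with the sentencepiece Python package (the input path is a placeholder; the official ALBERT setup additionally reserves control symbols such as [CLS], [SEP], and [MASK]):

import sentencepiece as spm

# Train a 30k-piece model on the extracted wiki text; this writes
# albertvi_30k-clean.model and albertvi_30k-clean.vocab.
spm.SentencePieceTrainer.train(
    input="{path to extracted wiki text}",
    model_prefix="albertvi_30k-clean",
    vocab_size=30_000,
)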

Pretraining

We trained ALBERT models with the version-2 base and large configs.

Base Config

{
  "attention_probs_dropout_prob": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0,
  "embedding_size": 128,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "max_position_embeddings": 512,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_hidden_groups": 1,
  "net_structure_type": 0,
  "gap_size": 0,
  "num_memory_blocks": 0,
  "inner_group_num": 1,
  "down_scale_factor": 1,
  "type_vocab_size": 2,
  "vocab_size": 30000
}

Large Config

{
  "attention_probs_dropout_prob": 0,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0,
  "embedding_size": 128,
  "hidden_size": 1024,
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "max_position_embeddings": 512,
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "num_hidden_groups": 1,
  "net_structure_type": 0,
  "gap_size": 0,
  "num_memory_blocks": 0,
  "inner_group_num": 1,
  "down_scale_factor": 1,
  "type_vocab_size": 2,
  "vocab_size": 30000
}
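
The two configs differ only in model width and depth. As a quick sketch (paths follow the assets layout used in the commands below), loading both configs and printing the keys where they differ:

import json

# Load the base and large configs from assets/ (paths match the commands below).
with open("assets/base/albert_config.json") as f:
    base = json.load(f)
with open("assets/large/albert_config.json") as f:
    large = json.load(f)

# Only the width/depth keys differ:
# hidden_size (768 vs 1024), intermediate_size (3072 vs 4096),
# num_attention_heads (12 vs 16), num_hidden_layers (12 vs 24).
for key in sorted(base):
    if base[key] != large[key]:
        print(f"{key}: base={base[key]}, large={large[key]}")

In both configs, num_hidden_groups=1 and inner_group_num=1, i.e. a single group of transformer parameters is shared across all layers (the cross-layer parameter sharing described above).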

Creating data for pretraining

Create tfrecords for the training data (dupe_factor=10 duplicates each input sequence 10 times with different masking patterns):

python create_pretraining_data.py \
  --input_file={path to wiki data} \
  --dupe_factor=10 \
  --output_file={path to save tfrecord} \
  --vocab_file=assets/albertvi_30k-clean.vocab \
  --spm_model_file=assets/albertvi_30k-clean.model

Pre-training base config

python run_pretraining.py \
  --albert_config_file=assets/base/albert_config.json \
  --input_file={tfrecord path} \
  --output_dir={output dir} \
  --export_dir={export dir} \
  --train_batch_size=4096 \
  --do_train=True \
  --do_eval=True \
  --use_tpu=True

Pre-training large config

python run_pretraining.py \
  --albert_config_file=assets/large/albert_config.json \
  --input_file={tfrecord path} \
  --output_dir={output dir} \
  --export_dir={export dir} \
  --train_batch_size=512 \
  --do_train=True \
  --do_eval=True \
  --use_tpu=True

Pretrained model

We ran ~1M steps for the base config and ~250k steps for the large config.

Eval results for the base config at step 1001000:

***** Eval results *****
global_step = 1001000
loss = 1.6706645
masked_lm_accuracy = 0.66281766
masked_lm_loss = 1.6631233
sentence_order_accuracy = 0.9998438
sentence_order_loss = 0.00065024174

You can download the pretrained model for the base config here.
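
As a quick sanity check after downloading, the checkpoint's variables can be inspected with TensorFlow (the checkpoint path is a placeholder):

import tensorflow as tf

# List the variables stored in the downloaded checkpoint and their shapes.
reader = tf.train.load_checkpoint("{path to downloaded checkpoint}")
for name, shape in sorted(reader.get_variable_to_shape_map().items()):
    print(name, shape)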

Experimental Results

Coming soon

Acknowledgement

Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).

Many thanks to @lampts and the @dal team for supporting me in finishing this project.

Conclusion

I hope to receive contributions and feedback from everyone. Email me or create an issue with any questions.
