aivivn-tone's Introduction

aivivn-tone

Submission for the AIviVN Vietnamese diacritics restoration contest: https://www.aivivn.com/contests/3

A more detailed summary of the approach can be found here (Vietnamese).

Requirements

Python > 3.6, PyTorch 1.0.1, torchtext, unidecode, dill, visdom, tqdm, kenlm.

visdom is mainly used for visualizing training loss and accuracy.

kenlm can be found here. I had some trouble with the version on the master branch, so the stable release may be a safer choice.

Overview

Character-level BiLSTM seq2seq model

The embedding layer and encoder are standard. The model consists of 3 decoders (each decoder has its own softmax prediction layer):

  • a left-to-right decoder
  • a right-to-left decoder
  • a combined decoder constructed by concatenating the output LSTM states of the two previous decoders

The final loss is a sum of 3 component losses: L = L_ltr + L_rtl + L_combined
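As a rough illustration of the summed loss, here is a minimal sketch in plain Python. It assumes each component loss is a mean negative log-likelihood and that the right-to-left outputs have already been re-aligned to left-to-right order; neither detail is stated above, and the function names are hypothetical.

```python
import math

def nll(probs_per_step, target_ids):
    """Mean negative log-likelihood of the targets under per-step distributions."""
    losses = [-math.log(p[t]) for p, t in zip(probs_per_step, target_ids)]
    return sum(losses) / len(losses)

def total_loss(ltr_probs, rtl_probs, combined_probs, target_ids):
    """L = L_ltr + L_rtl + L_combined, with all three heads scored on the same targets."""
    return (nll(ltr_probs, target_ids)
            + nll(rtl_probs, target_ids)
            + nll(combined_probs, target_ids))
```

Because the three terms are simply added, each decoder receives a gradient from its own prediction layer, while the shared encoder receives gradients from all three.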

Since only a certain set of characters requires diacritics restoration (a, d, e, i, o, u, y), we can apply teacher forcing at both training time and test time.
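The character set above can also be used to find which positions need a prediction at all. The sketch below is only an illustration: it uses the standard library's unicodedata instead of unidecode, and the helper names are hypothetical rather than taken from the repository.

```python
import unicodedata

# Base characters that can carry Vietnamese diacritics (from the list above).
RESTORABLE = set("adeiouy")

def strip_diacritics(text):
    """Remove Vietnamese diacritics, e.g. to build model input from accented text."""
    # đ/Đ have no Unicode decomposition, so they are mapped by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def positions_to_restore(stripped):
    """Indices of characters that need a prediction; all others are copied as-is."""
    return [i for i, ch in enumerate(stripped) if ch.lower() in RESTORABLE]
```

For instance, `strip_diacritics("Tiếng Việt")` gives `"Tieng Viet"`, and only the vowels in that string are candidates for restoration.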

In addition, since each character only has a fixed set of targets (e.g., for i it's i, í, ì, ỉ, ĩ, ị), masked softmax can also be applied.
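A minimal sketch of the masking idea in plain Python, not the repository's PyTorch code. The table of allowed targets is hypothetical: only the entry for i comes from the text above, and the entry for d is added as an assumption for illustration.

```python
import math

# Hypothetical target sets; only the "i" entry is taken from the text above.
ALLOWED_TARGETS = {
    "i": ["i", "í", "ì", "ỉ", "ĩ", "ị"],
    "d": ["d", "đ"],  # assumption added for illustration
}

def masked_softmax(logits, vocab, source_char):
    """Softmax over `vocab` restricted to the targets allowed for `source_char`.

    Characters without an entry in ALLOWED_TARGETS can only map to themselves.
    """
    allowed = set(ALLOWED_TARGETS.get(source_char, [source_char]))
    mask = [ch in allowed for ch in vocab]
    if not any(mask):
        raise ValueError(f"no allowed target for {source_char!r} in vocab")
    # Subtract the largest allowed logit for numerical stability.
    peak = max(l for l, keep in zip(logits, mask) if keep)
    exps = [math.exp(l - peak) if keep else 0.0 for l, keep in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]
```

Disallowed characters get probability exactly zero, so the model never spends probability mass on outputs that are impossible for the given input character.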

Beam search

We run a standard beam search in 2 directions, left-to-right and right-to-left, and combine the results. Where the two searches disagree, we repeat the procedure until no disagreements remain. To guard against infinite recursion, we fall back on exhaustive search after a number of recursive calls.

A 4-gram word-level language model is used to score candidates during beam search.

The beam search component is separate from the seq2seq model (the two are not trained jointly), so it can be used with any other model.
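To show how a search with a pluggable scorer might look, here is a minimal single-direction sketch. It is not the code in this repository: it covers only the left-to-right pass, without the right-to-left pass or the disagreement handling, and `beam_search`, `toy_score`, and the bigram table are hypothetical. In practice the scorer could wrap a kenlm model, for example by calling its `score` method on the space-joined words.

```python
def beam_search(candidates, score_fn, beam_size=3):
    """Left-to-right beam search over per-word candidate lists.

    candidates: list of lists; candidates[i] holds possible accented forms of word i.
    score_fn:   maps a tuple of words (a partial sentence) to a score; higher is better.
    """
    beam = [()]
    for options in candidates:
        # Extend every kept hypothesis with every candidate for the next word.
        expanded = [hyp + (word,) for hyp in beam for word in options]
        expanded.sort(key=score_fn, reverse=True)
        beam = expanded[:beam_size]
    return list(beam[0]) if beam else []

# Toy stand-in for a language model: reward known bigrams, penalize unknown ones.
TOY_BIGRAMS = {("tôi", "là"): 2.0, ("là", "sinh"): 1.0, ("sinh", "viên"): 3.0}

def toy_score(words):
    return sum(TOY_BIGRAMS.get(pair, -1.0) for pair in zip(words, words[1:]))

candidates = [["toi", "tôi", "tối"], ["la", "là", "lá"], ["sinh"], ["vien", "viên", "viện"]]
best = beam_search(candidates, toy_score, beam_size=3)
```

Since the search only needs candidate lists and a scoring function, the candidates can come from any model or from a dictionary lookup.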

Replicating submission results

I filtered out sentences longer than 300 characters, and divided the training data into smaller splits so they could fit in my computer's limited RAM. The data I used can be found here.
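The filtering and splitting step could look roughly like the sketch below. The function name and the `part_size` default are hypothetical; only the 300-character limit comes from the description above.

```python
def filter_and_split(pairs, max_chars=300, part_size=100000):
    """Drop sentence pairs longer than max_chars, then yield fixed-size parts."""
    part = []
    for src, tgt in pairs:
        if len(src) > max_chars or len(tgt) > max_chars:
            continue  # skip overly long sentences
        part.append((src, tgt))
        if len(part) == part_size:
            yield part
            part = []
    if part:
        yield part  # final, possibly smaller, part
```

Because it is a generator, only one part is held in memory at a time, which is the point of splitting when RAM is limited.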

I trained the seq2seq model until the accuracy on the validation set stopped increasing. The final model can be found here.

The n-gram language model can be found here.

The main functions in train.py and predict.py show how to train the model from scratch and run predictions on test data. I set the beam size to a very large number, so running predictions may take a long time.

Credits

Some of the code is taken from IBM's PyTorch seq2seq implementation.

The code to produce cleaned test data was written by Khoi Nguyen.

The data used to train the n-gram language model is taken from this repo by @binhvq.

The Vietnamese dictionary used during beam search is taken from this repo by @undertheseanlp.

Finally, I'd like to thank the AIviVN admins for organizing the contest, providing the data, and preparing a script to convert a predicted text file to a submission file.

aivivn-tone's People

Contributors

iotayo

aivivn-tone's Issues

vocab size using validation set

Could you explain why the vocab is built from the validation set instead of the training set?

def load_data_in_parts(train_src, train_tgt, val_src, val_tgt, batch_size=64, save_path=CHECKPOINT_PATH):
    # prepare dataset
    print("Reading data...")
    val = Seq2SeqDataset.from_file(val_src, val_tgt)
    print("Building vocab...")
    val.build_vocab(max_size=300)
    src_vocab = val.src_field.vocab
    tgt_vocab = val.tgt_field.vocab
......

Resume Checkpoint Error while saving model

Hi,

I tried to resume from your checkpoint with my own data. My code is:
trainer.resume_in_parts(train_parts, val, val_iterator, batch_size, save_path="./aivivn_tone.model.ep25")

Then I got an error like this:

Total parameters: 16995327
Training part [data/train.txt] with target [data/train.txt]...
epoch = 26 iter = 0 loss = 0.433 correct = 92.504 r_correct = 94.407 c_correct = 95.789
EPOCH = 26 AVG_LOSS = 0.259 AVG_CORRECT = 86.402

Traceback (most recent call last):
  File "train.py", line 347, in <module>
    trainer.resume_in_parts(train_parts, val, val_iterator, batch_size, save_path="./aivivn_tone.model.ep25")
  File "train.py", line 138, in resume_in_parts
    self.train_in_parts(train_parts, val, val_iterator, batch_size, start_epoch=start_epoch)
  File "train.py", line 114, in train_in_parts
    self.save(epoch)
  File "train.py", line 242, in save
    }, os.path.join(save_path, "aivivn_tone.model.ep{}".format(epoch)))
  File "//python3.7/site-packages/torch/serialization.py", line 372, in save
    _save(obj, opened_zipfile, pickle_module, pickle_protocol)
  File "/python3.7/site-packages/torch/serialization.py", line 476, in _save
    pickler.dump(obj)
  File "/python3.7/site-packages/torch/optim/optimizer.py", line 59, in __getstate__
    'defaults': self.defaults,
AttributeError: 'Adam' object has no attribute 'defaults'

Can you help me fix this issue?

Thank you.

Language model

Hi, I'm looking into kenLM, but your model on Drive can no longer be downloaded. Could you update the link for me, @iotayo?
Thank you.
