cdli-gh / semi-supervised-nmt-for-sumerian-english

Exploring the Limits of Low-Resource Neural Machine Translation

License: MIT License

Topics: low-resource-languages, nmt, unsupervised, semi-supervised, backtranslation, transformers, xlm, translation

Introduction

Figure-1: A cuneiform inscription, extracted from actual tablets.
Sumerian: pisan-dub-ba sza3-bi su-ga sag-nig2-gur11-ra u3 zi-ga lu2-kal-la i3-gal2 ...
English: Basket-of-tablets: therefroms, restitutions, debits, and credits, of Lukalla are here; ...

Sumerian-English Neural Machine Translation

As part of the MTAAC project at CDLI, we aim to build an end-to-end NMT pipeline that makes use of the extensive monolingual Sumerian data.

Previous models for English<-->Sumerian translation have made use of only the available parallel corpora. At present the parallel corpora contain only about 50K extracted sentence pairs, whereas the Sumerian monolingual corpus contains around 1.47M sentences.

This huge amount of monolingual data can be used to improve the NMT system by combining it with techniques like Back Translation, Transfer Learning, and Dual Learning, which have proved especially useful for low-resource languages like Sumerian that have a limited amount of parallel data. Moreover, we also look to implement models like XLM and MASS for the same purpose.
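As a rough illustration of the back-translation idea, the sketch below pairs monolingual Sumerian sentences with synthetic English produced by the current model and appends the result to the real parallel data. `model.translate` and the variable names are hypothetical stand-ins, not the actual OpenNMT-py or fairseq calls used in this repository:

    # A minimal sketch of one back-translation round. `model.translate` is a
    # hypothetical stand-in for the real OpenNMT-py/fairseq inference call.
    def backtranslation_round(model, mono_sum, parallel_sum, parallel_en):
        """Augment real parallel data with synthetic pairs built from
        monolingual Sumerian sentences."""
        synthetic_en = [model.translate(s) for s in mono_sum]  # synthetic targets
        aug_sum = parallel_sum + mono_sum    # source side: real + monolingual
        aug_en = parallel_en + synthetic_en  # target side: real + synthetic
        return aug_sum, aug_en               # retrain the model on these lists

The actual pipeline (see backtranslation-onmt/ below) does this shard by shard, retraining between iterations.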

Requirements

- Python 3.5.2 or higher
- NumPy
- Pandas
- PyTorch
- TorchText
- OpenNMT-py
- fairseq


Repository Structure

|__ translation/ --> all translation models used for Sumerian-English Translation 
        |__ transformer/ --> Supervised NMT using Vanilla Transformer
                |__ runTransformerSumEn.sh --> to perform training
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ backtranslation/ --> fairseq usage for Back Translation using Vanilla Transformers
        |__ backtranslation-onmt/ --> OpenNMT usage for Back Translation using Vanilla Transformers
                |__ backtranslateONMT.py --> to translate all Sumerian Text in a given shard using weights from the previous iteration
                |__ stack.py --> To stack the backtranslated sentences to the parallel corpora for training
                |__ runTransformerSumEn.sh --> To retrain the transformer model using the updated parallel data from the last step
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ XLM/ --> Unsupervised and Semi-Supervised NMT using Cross-Lingual Language Model Pretraining
                |__ XLM/ --> directory containing all model, data preparation and inference scripts
                |__ models.txt --> lists the possible commands and parameter combinations for XLM training and inference.
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ MASS-unmt/ --> Unsupervised NMT using Masked Sequence to Sequence Pretraining
                |__ data_prep.sh --> to prepare and process data for training 
                |__ pre_training.sh --> to carry out pre training using Unsupervised Objectives
                |__ fine_tuning.sh --> to carry out fine tuning using parallel data
                |__ translate.sh --> to carry out inference using the specified checkpoints
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ MASS-snmt/ --> Semi-Supervised NMT using Masked Sequence to Sequence Pretraining 
                |__ data_prep.sh --> to prepare and process data for training 
                |__ pre_training.sh --> to carry out pre training using Unsupervised Objectives
                |__ fine_tuning.sh --> to carry out fine tuning using parallel data
                |__ translate.sh --> to carry out inference using the specified checkpoints
                |__ README.md --> lists down all checkpoints and steps to run training and inference.

|__ dataset/ --> All Sumerian Language related textual dataset by CDLI
        |__ README.md --> Gives detailed description of the dataset and the different sub-folders.
        |__ dataToUse/ --> Contains all the parallel data divided into train, test, and dev sets, in 4 different categories (see the loading sketch after this tree)
                |__ UrIIICompSents/ --> UrIII Admin Data with complete sentence translations
                |__ AllCompSents/ --> All kinds of Sumerian Data with complete sentence translations
                |__ UrIIILineByLine/ --> UrIII Admin Data with line by line translations
                |__ AllLineByLIne/ --> All kinds of Sumerian Data with line by line translations
        |__ cleaned/ --> Contains data after cleaning using the helper scripts, including the monolingual data. Divided into the same 4 categories.
        |__ original/ --> Contains all of the data before cleaning
        |__ oldFormat/ --> Contains data from last year, for comparison
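To make the splits concrete, here is a minimal loading sketch for the parallel data under dataset/dataToUse/. The file names (train.su / train.en, etc.) are assumptions for illustration; check dataset/README.md for the actual layout:

    from pathlib import Path

    def load_split(root, category, split):
        """Read one parallel split, e.g.
        load_split("dataset/dataToUse", "UrIIICompSents", "train").
        The <split>.su / <split>.en file names are assumed, not confirmed."""
        base = Path(root) / category
        src = (base / (split + ".su")).read_text(encoding="utf-8").splitlines()
        tgt = (base / (split + ".en")).read_text(encoding="utf-8").splitlines()
        assert len(src) == len(tgt), "parallel files must align line by line"
        return list(zip(src, tgt))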
        

Refer to the README of each folder and sub-folder to understand them thoroughly and to reproduce the translation models.

Results

Table-1: Sumerian-English Machine Translation.
All numeric values other than those under Human Evaluation are BLEU scores.

Visualisations and Interpretations

Figure-2: Selected output tokens for the Sumerian input text "sze-ba geme2 usz-bar kiszib3 ur-dasznan ugula", which translates to "barley rations of the female weavers under seal of UrAnan the foreman".

Figure-3: Feature ablation and attention attributions, respectively, for a span of input and output text through the Data Augmented XLM.

Mentors:

  1. Niko Schenk
  2. Ravneet Punia

Tasks:

  • Preparing the parallel and monolingual texts for final usage, using methods like BPE and BBPE to tokenize the text (a tokenization sketch follows this list).
  • Implementing the Vanilla Transformer for Sumerian to English as well as English to Sumerian.
  • Back Translation using Sumerian monolingual data.
  • Transfer Learning from pre-trained models of other languages.
  • XLM for Unsupervised NMT.
  • XLM for Semi-Supervised NMT.
  • MASS for Unsupervised NMT.
  • MASS for Semi-Supervised NMT.
  • Pre-training using Augmented Data.
  • Interpretation of the NMT models.
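For the tokenization step in the first task above, one common approach is BPE via the sentencepiece library. This is a sketch under that assumption (sentencepiece is not in the requirements list, and the input path and vocabulary size are illustrative), not necessarily what this repository's scripts call:

    import sentencepiece as spm  # assumed extra dependency

    # Train a BPE model on Sumerian text (the input path is hypothetical).
    spm.SentencePieceTrainer.train(
        input="dataset/cleaned/sumerian_monolingual.txt",
        model_prefix="sum_bpe",
        vocab_size=8000,      # arbitrary choice for illustration
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="sum_bpe.model")
    print(sp.encode("sze-ba geme2 usz-bar kiszib3", out_type=str))  # subword pieces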

...

For an end-to-end translation pipeline making use of translation models from this repository, refer to the cdli-gh/Sumerian_Translation-Pipeline project, where you can give an ATF file containing Sumerian sentences as input and get an ATF file with corresponding English translations as the output.

Contributors

adityashar, dependabot[bot], epageperron, rachitbansal, ravneetdtu, shashankjain12, shfrz


Issues

How to use the model for training on a custom dataset

Thanks for making the repo public!

I want to use your repository to develop a machine translation model for both EN-to-DE and DE-to-EN.

But I am not sure how to use it, since the repository bundles many features (which is much appreciated).

Could you please let me know how to use the repository for preprocessing, training, and evaluation?

Cleaning the raw extracted data

The files "sumerian_translated.atf" and "sumerian_untranslated.atf" are completely raw.
Create clean .txt files similar to those in the dataset folder.

Create data extraction script for MT work and put in GitHub repo

To make the data more accessible to anyone wanting to experiment, we need a Python script to extract the appropriate data from our datasets, which are freely accessible online.
The data is here: https://github.com/cdli-gh/data
In the ATF data file, each text starts with &P and ends where the next text starts or at the end of the file. Check the texts one by one to see what their language is; if Sumerian, extract the text.
To select only texts with a translation, also check whether #tr.en: is present in the text.
To select only a genre, cross-reference with the catalogue file and look up the genre. The key between the files is the P number: in the text it is the number following &P, and in the catalogue it is in the field text_id.
The datasets to create should be:
All Sumerian (translated or not)
All Sumerian with parallel translation
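A minimal sketch of such a script, assuming the conventions described above ('&P' starts a text, a 'lang sux' line marks Sumerian, '#tr.en:' marks a translated line); the file name and the language heuristic should be verified against the actual files in https://github.com/cdli-gh/data:

    def extract_sumerian_texts(atf_path, require_translation=False):
        """Collect texts that start with &P; keep the Sumerian ones,
        optionally only those containing #tr.en: lines. The 'lang sux'
        check is a heuristic, not a confirmed ATF rule for this corpus."""
        texts, current = [], []
        with open(atf_path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("&P"):     # a new text begins here
                    if current:
                        texts.append("".join(current))
                    current = [line]
                elif current:
                    current.append(line)
        if current:
            texts.append("".join(current))
        keep = []
        for block in texts:
            if "lang sux" not in block:       # heuristic Sumerian check
                continue
            if require_translation and "#tr.en:" not in block:
                continue                      # keep only translated texts
            keep.append(block)
        return keep

    # Hypothetical file name; substitute the actual ATF dump:
    # all_sumerian = extract_sumerian_texts("cdli_atf.atf")        # translated or not
    # parallel     = extract_sumerian_texts("cdli_atf.atf", True)  # with #tr.en:

Cross-referencing the catalogue by P number (text_id) for genre filtering is left out here but would follow the same pattern.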

Consolidate evaluation

At the moment, every MTAAC/CDLI MT system is evaluated independently, so it is impossible to track progress.

e.g., Rachit's (2020) "mu usz-bar x 2(disz) tug2 usz-bar tur" seems to correspond to two independent (!) lines in Ravneet's (2019) system:

544,mu ucbar X
542, NUMB tug ucbar tur sumun

But it is likely that these are actually completely different texts (and that there is no overlap for the phrase "ucbar tur" / "usz-bar tur" in their data), because "sumun" is not in Rachit's text; in that case, the systems are simply incomparable.

Establish a consistent train/test set and replicate.
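One way to pin down such a shared split (a sketch; the seed and ratios are arbitrary choices) is to shuffle the P-number-keyed texts deterministically and publish the resulting ID lists:

    import random

    def fixed_split(text_ids, seed=42, dev_frac=0.1, test_frac=0.1):
        """Deterministically split text IDs (e.g. P numbers) so every
        system trains and evaluates on exactly the same texts."""
        ids = sorted(text_ids)               # sort first for reproducibility
        random.Random(seed).shuffle(ids)
        n_test = int(len(ids) * test_frac)
        n_dev = int(len(ids) * dev_frac)
        return (ids[n_test + n_dev:],        # train
                ids[n_test:n_test + n_dev],  # dev
                ids[:n_test])                # test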
