cdli-gh / semi-supervised-nmt-for-sumerian-english

Exploring the Limits of Low-Resource Neural Machine Translation

License: MIT License

Topics: low-resource-languages, nmt, unsupervised, semi-supervised, backtranslation, transformers, xlm, translation

Introduction

Figure-1: A cuneiform inscription, extracted from actual tablets.
Sumerian: pisan-dub-ba sza3-bi su-ga sag-nig2-gur11-ra u3 zi-ga lu2-kal-la i3-gal2 ...
English: Basket-of-tablets: therefroms, restitutions, debits, and credits, of Lukalla are here; ...

Sumerian-English Neural Machine Translation

As part of the MTAAC project at CDLI, we aim to build an end-to-end NMT pipeline that makes use of the extensive monolingual Sumerian data.

Previous models for English<-->Sumerian translation have made use of only the available parallel corpora. At present the parallel corpora contain only about 50K extracted sentence pairs, whereas the Sumerian monolingual corpus contains around 1.47M sentences.

This huge amount of monolingual data can be used to improve the NMT system by combining it with techniques like Back Translation, Transfer Learning, and Dual Learning, which have proved especially useful for low-resource languages like Sumerian that have a limited amount of parallel data. Moreover, we also look to implement models like XLM and MASS for the same purpose.
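As a rough illustration of the back-translation idea, the sketch below pairs monolingual Sumerian sentences with synthetic English produced by the current model and appends the result to the real parallel data. `model.translate` and the variable names are hypothetical stand-ins, not the actual OpenNMT-py or fairseq calls used in this repository:

    # A minimal sketch of one back-translation round. `model.translate` is a
    # hypothetical stand-in for the real OpenNMT-py/fairseq inference call.
    def backtranslation_round(model, mono_sum, parallel_sum, parallel_en):
        """Augment real parallel data with synthetic pairs built from
        monolingual Sumerian sentences."""
        synthetic_en = [model.translate(s) for s in mono_sum]  # synthetic targets
        aug_sum = parallel_sum + mono_sum    # source side: real + monolingual
        aug_en = parallel_en + synthetic_en  # target side: real + synthetic
        return aug_sum, aug_en               # retrain the model on these lists

The actual pipeline (see backtranslation-onmt/ below) does this shard by shard, retraining between iterations.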

Requirements

- Python 3.5.2 or higher
- NumPy
- Pandas
- PyTorch
- TorchText
- OpenNMT-py
- fairseq


Repository Structure

|__ translation/ --> all translation models used for Sumerian-English Translation 
        |__ transformer/ --> Supervised NMT using Vanilla Transformer
                |__ runTransformerSumEn.sh --> to perform training
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ backtranslation/ --> fairseq usage for Back Translation using Vanilla Transformers
        |__ backtranslation-onmt/ --> OpenNMT usage for Back Translation using Vanilla Transformers
                |__ backtranslateONMT.py --> to translate all Sumerian Text in a given shard using weights from the previous iteration
                |__ stack.py --> To stack the backtranslated sentences to the parallel corpora for training
                |__ runTransformerSumEn.sh --> To retrain the transformer model using the updated parallel data from the last step
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ XLM/ --> Unsupervised and Semi-Supervised NMT using Cross-Lingual Language Model Pretraining
                |__ XLM/ --> directory containing all model, data preparation and inference scripts
                |__ models.txt --> lists the possible commands and parameter combinations for XLM training and inference.
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ MASS-unmt/ --> Unsupervised NMT using Masked Sequence to Sequence Pretraining
                |__ data_prep.sh --> to prepare and process data for training 
                |__ pre_training.sh --> to carry out pre training using Unsupervised Objectives
                |__ fine_tuning.sh --> to carry out fine tuning using parallel data
                |__ translate.sh --> to carry out inference using the specified checkpoints
                |__ README.md --> lists down all checkpoints and steps to run training and inference.
        |__ MASS-snmt/ --> Semi-Supervised NMT using Masked Sequence to Sequence Pretraining 
                |__ data_prep.sh --> to prepare and process data for training 
                |__ pre_training.sh --> to carry out pre training using Unsupervised Objectives
                |__ fine_tuning.sh --> to carry out fine tuning using parallel data
                |__ translate.sh --> to carry out inference using the specified checkpoints
                |__ README.md --> lists down all checkpoints and steps to run training and inference.

|__ dataset/ --> All Sumerian Language related textual dataset by CDLI
        |__ README.md --> Gives detailed description of the dataset and the different sub-folders.
        |__ dataToUse/ --> Contains all the parallel data divided into train, test, and dev sets, in 4 different categories (see the loading sketch after this tree)
                |__ UrIIICompSents/ --> UrIII Admin Data with complete sentence translations
                |__ AllCompSents/ --> All kinds of Sumerian Data with complete sentence translations
                |__ UrIIILineByLine/ --> UrIII Admin Data with line by line translations
                |__ AllLineByLIne/ --> All kinds of Sumerian Data with line by line translations
        |__ cleaned/ --> Contains data after cleaning using the helper scripts, including the monolingual data. Divided into the same 4 categories.
        |__ original/ --> Contains all of the data before cleaning
        |__ oldFormat/ --> Contains data from last year, for comparison
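To make the splits concrete, here is a minimal loading sketch for the parallel data under dataset/dataToUse/. The file names (train.su / train.en, etc.) are assumptions for illustration; check dataset/README.md for the actual layout:

    from pathlib import Path

    def load_split(root, category, split):
        """Read one parallel split, e.g.
        load_split("dataset/dataToUse", "UrIIICompSents", "train").
        The <split>.su / <split>.en file names are assumed, not confirmed."""
        base = Path(root) / category
        src = (base / (split + ".su")).read_text(encoding="utf-8").splitlines()
        tgt = (base / (split + ".en")).read_text(encoding="utf-8").splitlines()
        assert len(src) == len(tgt), "parallel files must align line by line"
        return list(zip(src, tgt))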
        

Refer to the README of each folder and sub-folder to understand them thoroughly and to reproduce the translation models.

Results

Table-1: Sumerian-English Machine Translation.
All numeric values other than those under Human Evaluation are BLEU scores.

Visualisations and Interpretations

Figure-2: Selected output tokens for the Sumerian input text "sze-ba geme2 usz-bar kiszib3 ur-dasznan ugula", which translates to "barley rations of the female weavers under seal of UrAnan the foreman".

Figure-3: Feature ablation and attention attributions, respectively, for a span of input and output text through the Data Augmented XLM.

Mentors:

  1. Niko Schenk
  2. Ravneet Punia

Tasks:

  • Preparing the parallel and monolingual texts for final usage, using methods like BPE and BBPE to tokenize the text (a tokenization sketch follows this list).
  • Implementing the Vanilla Transformer for Sumerian to English as well as English to Sumerian.
  • Back Translation using Sumerian monolingual data.
  • Transfer Learning from pre-trained models of other languages.
  • XLM for Unsupervised NMT.
  • XLM for Semi-Supervised NMT.
  • MASS for Unsupervised NMT.
  • MASS for Semi-Supervised NMT.
  • Pre-training using Augmented Data.
  • Interpretation of the NMT models.
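For the tokenization step in the first task above, one common approach is BPE via the sentencepiece library. This is a sketch under that assumption (sentencepiece is not in the requirements list, and the input path and vocabulary size are illustrative), not necessarily what this repository's scripts call:

    import sentencepiece as spm  # assumed extra dependency

    # Train a BPE model on Sumerian text (the input path is hypothetical).
    spm.SentencePieceTrainer.train(
        input="dataset/cleaned/sumerian_monolingual.txt",
        model_prefix="sum_bpe",
        vocab_size=8000,      # arbitrary choice for illustration
        model_type="bpe",
    )

    sp = spm.SentencePieceProcessor(model_file="sum_bpe.model")
    print(sp.encode("sze-ba geme2 usz-bar kiszib3", out_type=str))  # subword pieces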

...

For an end-to-end translation pipeline making use of translation models from this repository, refer to the cdli-gh/Sumerian_Translation-Pipeline project, where you can give an ATF file containing Sumerian sentences as input and get an ATF file with corresponding English translations as the output.

Contributors

adityashar, dependabot[bot], epageperron, rachitbansal, ravneetdtu, shashankjain12, shfrz


Issues

How to use the model for training on a custom dataset

Thanks for making the repo public!

I want to use your repository to develop a machine translation model for both EN-to-DE and DE-to-EN.

But I am not sure how to use it, since the repository bundles many features (which is much appreciated).

Could you please let me know how to use the repository for preprocessing, training, and evaluation?

Cleaning the raw extracted data

The files "sumerian_translated.atf" and "sumerian_untranslated.atf" are completely raw.
Create clean .txt files similar to those in the dataset folder.

Create data extraction script for MT work and put in GitHub repo

To make the data more accessible to anyone wanting to experiment, we need a Python script to extract the appropriate data from our datasets, which are freely accessible online.
The data is here: https://github.com/cdli-gh/data
In the ATF data file, each text starts with &P and ends where the next text starts or at the end of the file. Check the texts one by one to see what their language is; if Sumerian, extract the text.
To select only texts with a translation, also check whether #tr.en: is present in the text.
To select only a genre, cross-reference with the catalogue file and look up the genre. The key between the files is the P number: in the text it is the number following &P, and in the catalogue it is in the field text_id.
The datasets to create should be:
All Sumerian (translated or not)
All Sumerian with parallel translation
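A minimal sketch of such a script, assuming the conventions described above ('&P' starts a text, a 'lang sux' line marks Sumerian, '#tr.en:' marks a translated line); the file name and the language heuristic should be verified against the actual files in https://github.com/cdli-gh/data:

    def extract_sumerian_texts(atf_path, require_translation=False):
        """Collect texts that start with &P; keep the Sumerian ones,
        optionally only those containing #tr.en: lines. The 'lang sux'
        check is a heuristic, not a confirmed ATF rule for this corpus."""
        texts, current = [], []
        with open(atf_path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("&P"):     # a new text begins here
                    if current:
                        texts.append("".join(current))
                    current = [line]
                elif current:
                    current.append(line)
        if current:
            texts.append("".join(current))
        keep = []
        for block in texts:
            if "lang sux" not in block:       # heuristic Sumerian check
                continue
            if require_translation and "#tr.en:" not in block:
                continue                      # keep only translated texts
            keep.append(block)
        return keep

    # Hypothetical file name; substitute the actual ATF dump:
    # all_sumerian = extract_sumerian_texts("cdli_atf.atf")        # translated or not
    # parallel     = extract_sumerian_texts("cdli_atf.atf", True)  # with #tr.en:

Cross-referencing the catalogue by P number (text_id) for genre filtering is left out here but would follow the same pattern.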

Consolidate evaluation

At the moment, every MTAAC/CDLI MT system is evaluated independently, so it is impossible to track progress.

e.g., Rachit's (2020) "mu usz-bar x 2(disz) tug2 usz-bar tur" seems to correspond to two independent (!) lines in Ravneet's (2019) system:

544,mu ucbar X
542, NUMB tug ucbar tur sumun

But it is likely that these are actually completely different texts (and that there is no overlap for the phrase "ucbar tur" / "usz-bar tur" in their data), because "sumun" is not in Rachit's text; in that case, the systems are simply incomparable.

Establish a consistent train/test set and replicate.
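One way to pin down such a shared split (a sketch; the seed and ratios are arbitrary choices) is to shuffle the P-number-keyed texts deterministically and publish the resulting ID lists:

    import random

    def fixed_split(text_ids, seed=42, dev_frac=0.1, test_frac=0.1):
        """Deterministically split text IDs (e.g. P numbers) so every
        system trains and evaluates on exactly the same texts."""
        ids = sorted(text_ids)               # sort first for reproducibility
        random.Random(seed).shuffle(ids)
        n_test = int(len(ids) * test_frac)
        n_dev = int(len(ids) * dev_frac)
        return (ids[n_test + n_dev:],        # train
                ids[n_test:n_test + n_dev],  # dev
                ids[:n_test])                # test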
