miniscale_translit's Introduction

Transliteration training and decoding with sample Urdu data

1. If using a Mac, update the SRILM call in train_helper (comment out line 72, uncomment line 73).
If using the CLSP or COE clusters, do nothing.

2. Set the JOSHUA, SRILM, and JAVA_HOME environment variables appropriately (or use my pointers at the top of train_helper and test_helper for COE machines).
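
Before training, it can help to confirm that the three variables named in step 2 are actually set. The following is a minimal sketch, not part of this repository; the function name missing_env_vars is hypothetical, and it only checks that each variable is non-empty, not that it points at a valid installation.

```python
import os

# The three variables named in step 2 of this README.
REQUIRED_VARS = ("JOSHUA", "SRILM", "JAVA_HOME")


def missing_env_vars(environ=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty.

    `environ` defaults to the current process environment; passing a plain
    dict makes the check easy to try out without touching the real one.
    """
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        print("JOSHUA, SRILM, and JAVA_HOME are all set.")
```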

3. To train a system, create a directory containing a "train" file and a "dev" file of transliteration pairs, then run the training script, pointing it at that directory:
./train_main.sh demo_data/urdu
An example directory is demo_data/urdu (note that these example train and dev files already have beginning-of-word and end-of-word symbols appended).

4. To decode a test set, run:
./translit_strings.sh demo_data/urdu/urdu_test demo_data/urdu
First arg: test file (list of single words to decode)
Second arg: pointer to same directory given in training step, above

Output is written to demo_data/urdu/urdu_test.answer
For this example, references are given in demo_data/urdu/urdu_test.ref for comparison

5. Evaluate your output:
python evaluate.py demo_data/urdu/urdu_test.answer demo_data/urdu/urdu_test.ref
This simple evaluation script reports the number and percentage of perfect transliterations, the average edit distance, and the average normalized edit distance (edit distance divided by the length of the reference).
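
The metrics described in step 5 can be sketched as follows. This is an illustration only, not the code in evaluate.py: it assumes character-level Levenshtein distance with unit costs and that answers and references are aligned one per line. The function names edit_distance and evaluate are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def evaluate(answers, references):
    """Summarize transliteration quality over aligned answer/reference lists."""
    if not answers or len(answers) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    distances = [edit_distance(a, r) for a, r in zip(answers, references)]
    n = len(answers)
    perfect = sum(1 for d in distances if d == 0)
    # Normalize by reference length; max(..., 1) guards against empty references.
    normalized = [d / max(len(r), 1) for d, r in zip(distances, references)]
    return {
        "perfect": perfect,
        "percent_perfect": 100.0 * perfect / n,
        "avg_edit_distance": sum(distances) / n,
        "avg_normalized_edit_distance": sum(normalized) / n,
    }
```

For example, comparing the answers ["smith", "jon"] against the references ["smith", "john"] gives one perfect transliteration out of two, an average edit distance of 0.5, and an average normalized edit distance of 0.125.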

-------------------------------------

Using your own data

train/dev data:
To use your own data, you just need to create a directory with a train file and a dev file. A script to append beginning-of-word and end-of-word symbols is provided: addSymTrainDev.py (run: python addSymTrainDev.py inputTrain)
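
To illustrate what appending boundary symbols means, here is a rough sketch. It is not the code in addSymTrainDev.py. The symbols "<w>" and "</w>" and the character-level tokenization are assumptions made for this example; check the demo_data/urdu files for the symbols and line format the system actually expects.

```python
BOW = "<w>"   # hypothetical beginning-of-word symbol (assumption)
EOW = "</w>"  # hypothetical end-of-word symbol (assumption)


def add_boundary_symbols(word, bow=BOW, eow=EOW):
    """Return the word's characters as tokens, wrapped in boundary symbols."""
    return [bow] + list(word.strip()) + [eow]
```

With these hypothetical symbols, add_boundary_symbols("ali") returns ["<w>", "a", "l", "i", "</w>"].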

language model data:
Two English language model files are included in demo_data/lm 
One is taken from Wikipedia person-page titles and the other from the NE-labeled Gigaword corpus; each includes relative frequency counts.

train your own LM:
You can use another LM; just change the pointer in train_main.sh (and see the script for adding beginning-of-word and end-of-word symbols in demo_data/lm).
To build one from a monolingual corpus:
cd demo_data/lm
python make_lm.py my-input-text
Then change the LM pointer in train_main.sh to my-input-text.wordModel
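
The included LM files are described above as carrying relative frequency counts. As a rough illustration of that idea, the sketch below computes each word's share of all tokens in a corpus. It is not the code in make_lm.py and says nothing about the .wordModel file format; whitespace tokenization and the function name relative_frequencies are assumptions.

```python
from collections import Counter


def relative_frequencies(lines):
    """Map each whitespace-separated word to its share of all tokens."""
    counts = Counter(word for line in lines for word in line.split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}
```

For the two lines "john smith" and "john", this gives "john" a relative frequency of 2/3 and "smith" 1/3.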

-------------------------------------

Using Wikipedia data:
See wikipedia_data/README

-------------------------------------

Substitute transliterations into Joshua MT output

To find OOVs in Joshua MT output, use find-oovs.pl.

After generating transliterations for all OOV words, create a dictionary file (OOV word, tab, transliterated word) and substitute the transliterations into the Joshua output:
python sub-oovs.py tab-separated-oov-dictionary translation-output-file
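
The substitution step can be sketched as below. This is an illustration under stated assumptions, not the code in sub-oovs.py: it assumes one tab-separated dictionary entry per line, whitespace-tokenized output, and exact token matches. The function names load_oov_dictionary and substitute_oovs are hypothetical.

```python
def load_oov_dictionary(lines):
    """Parse 'oov<TAB>transliteration' lines into a dict, skipping bad lines."""
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        oov, sep, translit = line.partition("\t")
        if sep and oov and translit:
            mapping[oov] = translit
    return mapping


def substitute_oovs(sentence, mapping):
    """Replace every token found in the dictionary with its transliteration."""
    return " ".join(mapping.get(token, token) for token in sentence.split())
```

For example, with a dictionary containing the single entry "OOVWORD<TAB>islamabad", the sentence "he went to OOVWORD today" becomes "he went to islamabad today". Tokens that are not in the dictionary are left unchanged.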

-------------------------------------

If you use this system, please cite the following:
http://www.clsp.jhu.edu/~anni/papers/irvine_amta_translit.pdf

-------------------------------------

Questions:
[email protected]
