miniscale_translit's Introduction
Transliteration training and decoding with sample Urdu data 1. If using a mac, update SRILM call in train_helper (comment out line 72, uncomment line 73) If using CLSP or COE clusters, do nothing. 2. Set JOSHUA, SRILM, and JAVA_HOME environment variables appropriately (or use my pointers at the top of train_helper and test_helper for coe machines) 3. To train a system, create directory with a file containing a "train" file and a "dev" file with transliteration pairs, then run (pointing to training directory): ./train_main.sh demo_data/urdu Example directory is demo_data/urdu (note that these example train and dev files already have beginning of word and end of word symbols appended) 4. To decode a test set, run: ./translit_strings.sh demo_data/urdu/urdu_test demo_data/urdu First arg: test file (list of single words to decode) Second arg: pointer to same directory given in training step, above Output in demo_data/urdu/urdu_test.answer For this example, references are given in demo_data/urdu/urdu_test.ref for comparison 5. Evaluate your output: python evaluate.py demo_data/urdu/urdu_test.answer demo_data/urdu/urdu_test.ref This simple evaluation script reports the number and percent of perfect transliterations, the average edit distance, and the average normalized (normalized by the length of the reference) edit distance ------------------------------------- Using your own data train/dev data: To use your own, just need to create directory with a train and dev file. Script to append beginning and word and end of word symbols: addSymTrainDev.py (run: python addSymTrainDev.py inputTrain) language model data: Two English language model files are included in demo_data/lm One is taken from Wikipedia person-page titles and the other from NE-labeled Gigaword corpus, and each includes relative frequency counts. train your own LM: Can use other LM, just change pointer in train_main.sh (and script for adding beginning of word and end of word symbols in demo_data/lm) to build from monolingual corpus: cd demo_data/lm python make_lm.py my-input-text Change LM pointer to my-input-text.wordModel ------------------------------------- Using Wikipedia data: See wikipedia_data/README ------------------------------------- Substitute transliterations into Joshua MT output To find OOVs in Joshua MT output: find-oovs.pl After generating transliterations for all OOV words, create a dictionary file (oov word - tab - transliterated word), and substitute them into the Joshua output: python sub-oovs.py tab-separated-oov-dictionary translation-output-file ------------------------------------- If you use this system, please cite the following: http://www.clsp.jhu.edu/~anni/papers/irvine_amta_translit.pdf ------------------------------------- Questions: [email protected]
miniscale_translit's People
Forkers
joshua-decoderRecommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. ๐๐๐
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google โค๏ธ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.