Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders

(c) Simon Šuster, 2016

This is the implementation of the embedding models described in:

Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders. Simon Šuster, Ivan Titov and Gertjan van Noord. NAACL, 2016.

The individual similarity scores, presented as averages in the paper, are reported in the appendix.

Bilingual training of the multi-sense model

See python3.4 examples/run_bimu.py --help for the full list of options, and set the Theano flags as THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32.
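Theano reads THEANO_FLAGS from the environment, so the flags can also be set from a small Python launcher rather than in the shell. The following is a minimal sketch under that assumption; the helpers `theano_env` and `run_script` are hypothetical and not part of this repository.

```python
import os
import subprocess

# Flags quoted in this README.
THEANO_FLAGS = "mode=FAST_RUN,device=gpu,floatX=float32"


def theano_env(flags=THEANO_FLAGS, base_env=None):
    """Return a copy of the environment with THEANO_FLAGS set (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["THEANO_FLAGS"] = flags
    return env


def run_script(args, flags=THEANO_FLAGS):
    """Run a repository script in a child process with the flags applied."""
    return subprocess.run(["python3.4"] + list(args), env=theano_env(flags), check=True)


# Example call (not executed here); prints the option list of the training script:
# run_script(["examples/run_bimu.py", "--help"])
```

Passing the flags through a copied environment keeps them local to the child process, so the calling shell is left unchanged.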

To train a multi-sense model bilingually on a toy parallel corpus of 10k sentences with default parameters:

python3.4 examples/run_bimu.py -model bimu -corpus_path_e data/toy_10k_en -corpus_path_f data/toy_10k_fr -corpus_path_a data/toy_10k_align -model_f_dir $OUTPUT_FR

This writes the embedding matrices, the vocabulary and the configuration file to the output/bimu3_toy_10k_en_... directory ($OUTPUT). Note that this presupposes that the second-language embeddings already exist in the folder $OUTPUT_FR. If they do not, train them first by running:

python3.4 examples/run_mono.py -corpus_path data/toy_10k_fr -model sg

To obtain the nearest neighbors for selected polysemous words:

python3 eval/neighbors.py -input_dir $OUTPUT
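Nearest-neighbor lookup of this kind is commonly done by ranking all vocabulary words by cosine similarity to the query vector. The sketch below illustrates the idea on a tiny hand-made vocabulary. It is not the code in eval/neighbors.py, and the function names and toy vectors are assumptions made for illustration. For a multi-sense model, the same ranking would presumably be run once per sense vector.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, embeddings, k=2):
    """Return the k words most similar to `query`, excluding the query itself."""
    q = embeddings[query]
    scored = [(word, cosine(q, vec)) for word, vec in embeddings.items() if word != query]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:k]]


# Toy vectors, invented for illustration only.
toy = {
    "bank": [0.9, 0.1, 0.0],
    "money": [0.8, 0.2, 0.1],
    "river": [0.1, 0.9, 0.2],
    "loan": [0.95, 0.05, 0.0],
}
print(nearest_neighbors("bank", toy, k=2))
```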

To evaluate the embeddings on the SCWS dataset:

python3 eval/scws/embed.py -input_dir $OUTPUT -model senses -sim avg_exp
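The definition of the avg_exp option is not given in this README. As background only, one measure widely used for multi-sense embeddings on SCWS is the average cosine similarity over all sense pairs of the two words (often called avgSim). The sketch below implements that measure alone, assuming each word is represented by a list of sense vectors. It should not be read as the definition of avg_exp, and all names in it are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def avg_sim(senses_a, senses_b):
    """Mean cosine similarity over every pair of sense vectors of two words."""
    total = sum(cosine(u, v) for u in senses_a for v in senses_b)
    return total / (len(senses_a) * len(senses_b))


# Toy sense vectors, invented for illustration: two senses vs. one sense.
word_a = [[1.0, 0.0], [0.0, 1.0]]
word_b = [[1.0, 0.0]]
print(avg_sim(word_a, word_b))
```

Scores of this kind are typically compared against the human ratings in SCWS with a rank correlation.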

To train and test the POS tagger:

python3 eval/nn/score.py -train_file wsjtrain -test_file wsjtest -tag_vocab_file data/tagvocab.json -embedding_file $OUTPUT/W_w.npy -vocab_file $OUTPUT/w_index.json -cembedding_file $OUTPUT/W_c.npy

Here, you will need the gold-standard WSJ data, available as wsjtrain and wsjtest. The index of POS tags is given as a JSON file; an example can be found in data/tagvocab.json.

Training the monolingual Skip-Gram and multi-sense models

See python3.4 examples/run_mono.py --help for the full list of options, and set the Theano flags as THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32.

To train the basic Skip-Gram model with default options:

python3.4 examples/run_mono.py -corpus_path data/toy_10k_en -model sg

To train a multi-sense embedding model with 3 senses per word:

python3.4 examples/run_mono.py -corpus_path data/toy_10k_en -model senses -n_senses 3
