
biling-survey's Introduction

Data and scripts for reproducing the results from the ACL 2016 paper. The vectors used for the experiments can be found here.

Running Monolingual Evaluation

Running QVec

Simply run

python qvec_cca2.py --in_vectors ~/mydir/en1.vectors

Running Word Similarity with Steiger's p-value

Run the following on the pair of models whose difference you want to test for significance:

python stat_signf.py ~/mydir/en1.vectors ~/mydir/vulic_vectors/en2.vectors data/word-sim/EN-SIMLEX-999.txt

This will output something like:

Vectors read from: ~/mydir/en1.vectors
Vectors read from: ~/mydir/en2.vectors
==================================================
      Num Pairs       Not found             Rho
==================================================
         999              90          0.1234
         999              41          0.5678
         999             129          0.8429
(1.376206182800332, 0.014533569095982752)

Here the second value in the final tuple is the p-value, approximately 0.015.
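For readers who want to see what such a test involves, here is a minimal sketch of Steiger's (1980) Z test for two dependent correlations that share one variable (the gold similarity ratings). It is not taken from stat_signf.py, which may use a different variant of the test; the function name `steiger_z` and its argument names are hypothetical. It also assumes that you have three correlations available: each model against the gold ratings, and the two models against each other.

```python
import math
from statistics import NormalDist


def steiger_z(r_jk, r_jh, r_kh, n):
    """Steiger's (1980) Z test for two dependent, overlapping correlations.

    r_jk: correlation between gold ratings and model 1 scores
    r_jh: correlation between gold ratings and model 2 scores
    r_kh: correlation between model 1 and model 2 scores
    n:    number of word pairs the correlations were computed on
    Returns (z statistic, two-tailed p-value).
    """
    # Fisher z-transform of the two correlations being compared.
    z_jk, z_jh = math.atanh(r_jk), math.atanh(r_jh)
    # Squared mean of the two correlations, used in the pooled covariance.
    r_bar_sq = ((r_jk + r_jh) / 2.0) ** 2
    # Covariance term between the two transformed correlations.
    psi = r_kh * (1 - 2 * r_bar_sq) - 0.5 * r_bar_sq * (1 - 2 * r_bar_sq - r_kh ** 2)
    c = psi / (1 - r_bar_sq) ** 2
    z = (z_jk - z_jh) * math.sqrt(n - 3) / math.sqrt(2 - 2 * c)
    # Two-tailed p-value under the standard normal distribution.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


if __name__ == "__main__":
    # Illustrative inputs only; these are not results from the paper.
    print(steiger_z(0.6, 0.4, 0.5, 50))
```

The test was derived for Pearson correlations; applying it to rank correlations, as is common in word-similarity evaluation, is an approximation.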

Running BLDict

The dictionaries used for evaluation are provided in en.*.dict.

The format is:

en1 en2 <tab> fr1 fr2 fr3

Here en1 and en2 are entries on the English side that all share fr1, fr2 and fr3 as possible translations. For example:

dangerous hazardous unsafe risky <tab> dangereux dangereuse

If you count the English-side tokens of each dictionary, you should get 1510 (fr), 1425 (de), 1610 (zh) and 1024 (sv).
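As a sanity check on the format, the following sketch parses a dictionary line and counts English-side words. The helper names `parse_dict_line` and `count_english_tokens` are hypothetical, and the sketch assumes that "tokens" means distinct English words and that the two sides are separated by a single tab character.

```python
def parse_dict_line(line):
    """Split one dictionary line into (english_words, foreign_words)."""
    en_side, foreign_side = line.rstrip("\n").split("\t")
    return en_side.split(), foreign_side.split()


def count_english_tokens(lines):
    """Count distinct English-side words over all dictionary lines."""
    vocab = set()
    for line in lines:
        if line.strip():  # ignore blank lines
            en_words, _ = parse_dict_line(line)
            vocab.update(en_words)
    return len(vocab)


if __name__ == "__main__":
    example = ["dangerous hazardous unsafe risky\tdangereux dangereuse\n"]
    print(parse_dict_line(example[0]))
    print(count_english_tokens(example))
```

To run it on a real dictionary, pass an open file handle for one of the en.*.dict files to `count_english_tokens`.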

The evaluation code is provided under my-evaluation. First set up the dependencies using mvn compile and mvn dependency:copy-dependencies. Then run:

sh run.sh evaluateBiDict de ~/mydir/bicvm_vectors/bicvm.en-de.en.200 ~/mydir/bicvm_vectors/bicvm.en-de.de.200

This should print the top-10 accuracy and MRR.
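For reference, the two metrics can be sketched as follows. This is not the Java code in my-evaluation, whose exact definitions (for example, how out-of-vocabulary words are handled) may differ; the function names and the input layout, a mapping from each English word to a ranked list of candidate translations, are assumptions made for illustration.

```python
def reciprocal_rank(ranked, gold):
    """Return 1/rank of the first correct candidate, or 0.0 if none is correct."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def evaluate(predictions, gold_dict, k=10):
    """Compute (top-k accuracy, MRR) over all words in gold_dict.

    predictions: word -> list of candidate translations, best first
    gold_dict:   word -> set of acceptable translations
    Words with no predictions count as misses (an assumption of this sketch).
    """
    if not gold_dict:
        return 0.0, 0.0
    hits, rr_sum = 0, 0.0
    for word, gold in gold_dict.items():
        ranked = predictions.get(word, [])
        if any(candidate in gold for candidate in ranked[:k]):
            hits += 1
        rr_sum += reciprocal_rank(ranked, gold)
    n = len(gold_dict)
    return hits / n, rr_sum / n


if __name__ == "__main__":
    gold = {"dangerous": {"dangereux", "dangereuse"}, "house": {"maison"}}
    preds = {"dangerous": ["risque", "dangereux"], "house": ["maison"]}
    print(evaluate(preds, gold))
```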

Code to extract dictionaries from word alignments for performing CCA is also provided in my-evaluation. Try running:

run.sh WriteDictForCCA parallelFile(uniq.en-es) alignFile(tr.*.intersect) outfile(*.dict) minCount(0-5) limit(-1)
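The idea behind extracting a dictionary from alignments can be sketched as below. This is an illustration, not a port of WriteDictForCCA: it assumes the parallel file has one "source ||| target" sentence pair per line and the alignment file has space-separated "i-j" index pairs per line, neither of which is confirmed by this README. The function name `extract_dictionary` is hypothetical, and its `min_count` parameter is only assumed to play the same role as the minCount argument above.

```python
from collections import Counter, defaultdict


def extract_dictionary(parallel_lines, align_lines, min_count=1):
    """Collect aligned word pairs seen at least min_count times.

    parallel_lines: iterable of "source sentence ||| target sentence" (assumed format)
    align_lines:    iterable of "i-j i-j ..." alignment links (assumed format)
    Returns a dict mapping each source word to a set of target words.
    """
    pair_counts = Counter()
    for sentence, alignment in zip(parallel_lines, align_lines):
        src, tgt = (side.split() for side in sentence.split("|||"))
        for link in alignment.split():
            i, j = map(int, link.split("-"))
            pair_counts[(src[i], tgt[j])] += 1

    dictionary = defaultdict(set)
    for (src_word, tgt_word), count in pair_counts.items():
        if count >= min_count:
            dictionary[src_word].add(tgt_word)
    return dict(dictionary)


if __name__ == "__main__":
    parallel = ["the house ||| la casa", "the house ||| la casa"]
    aligns = ["0-0 1-1", "0-0 1-1"]
    print(extract_dictionary(parallel, aligns, min_count=2))
```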

Running CLDC

Download the code for the Klementiev et al. paper from here.

You will also need to procure the RCV2 Multilingual Corpus.

For en --> L (train on English and test on language L)

The English train split is the same as the one provided by Klementiev et al., as is the test split for de. The test splits for fr, sv and zh are in the file test-en-l2.txt (the distribution should be 10 C, 300 E, 600 G and 900 M).

For L --> en (train on language L and test on English)

The train files for fr, sv and zh are in the file train-l2-en.txt. For de, the train split is the same as before. The English test files are also the same as before.

Running CLDEP

You will need to get the Universal Dependencies treebank v1.2 from here.

You will also need to install the parser released here.

An example training script is provided in example.sh. It trains on the English and German treebanks using the embeddings trained by a particular model, and uses the config file config/config.cfg. Note that you may need to change the embedding_size setting in that file to match the dimensionality of your embeddings.
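If you are unsure of the dimensionality of your embeddings, a quick check like the following can help. It assumes a plain-text vector file with one word per line followed by whitespace-separated values, optionally preceded by a word2vec-style "vocab_size dim" header; the function name `embedding_dim` is hypothetical and not part of this repository.

```python
def embedding_dim(lines):
    """Infer the vector dimensionality from text-format embedding lines."""
    for line in lines:
        fields = line.split()
        # Skip blank lines and an optional "vocab_size dim" header line.
        # (This also skips 1-dimensional vectors, which are not expected here.)
        if len(fields) <= 2:
            continue
        return len(fields) - 1  # the first field is the word itself
    raise ValueError("no vector lines found")


if __name__ == "__main__":
    sample = ["2 3", "the 0.1 0.2 0.3", "cat 0.4 0.5 0.6"]
    print(embedding_dim(sample))
```

To use it on a file, pass an open file handle, for example `embedding_dim(open(path))` with `path` pointing at your vectors.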

Reference

@inproceedings{bicompare:16,
author = {Upadhyay, Shyam and Faruqui, Manaal and Dyer, Chris and Roth, Dan},
title = {Cross-lingual Models of Word Embeddings: An Empirical Comparison},
booktitle = {Proc. of ACL},
year = {2016},
url = {http://arxiv.org/abs/1604.00425}
}
