Adapting Entities Across Languages and Cultures

EMNLP Findings 2021

We provide the following files for future experiments:

Predictions are evaluated as:

python code/evaluate.py --golds <gold_annotations> --predictions <predictions>

e.g., python code/evaluate.py --golds evaluation_data/gold_American_VealeNOC.txt --predictions embedding_predictions/predictions_3cosadd_American_VealeNOC.txt

Human Generated Adaptations

The final human generated data is available under evaluation_data as the four gold files: German VealeNOC, German Wiki, American VealeNOC, American Wiki.

Human Evaluations of All Adaptations

The questions and the five translator judgements for them are provided at evaluation_data/translator_evaluations.csv.

Embedding-Based Adaptations

The final generated data for our VealeNOC and Wikipedia sourced entities is available under embedding_predictions. There are 6 files: four created with 3cosadd (German Veale, German Wiki, American Veale, American Wiki) and two "learned" files produced by a model trained on Wikipedia and tested on VealeNOC.

WikiData Adaptations

The final generated data for our VealeNOC and Wikipedia sourced entities are available under wikidata_predictions. There are 4 files: German Veale, German Wiki, American Veale, American Wiki created with our WikiData method.

To create them yourself, you will need (combined 50GB):

German Matrix

American Matrix

Optionally, we provide the original WikiData dump from 10-26-2020 (processed to remove everything unnecessary to Properties and Values): https://obj.umiacs.umd.edu/adaptation/10-26-20-wikidata.jsonl

FAQ

1) What python environment do I need?

pip install -r requirements.txt

2) How do I produce embedding-based modulations?

We provide modulate.py which supports both the unsupervised 3cosadd and the supervised learned modulation modes. For detailed parameters run:

python code/modulate.py -h
  • Example American to German modulation with 3cosadd:
python code/modulate.py \
    --input input_American_Wiki.txt \
    --output predictions_3cosadd_American_Wiki.txt \
    --src_emb vectors-en.txt \
    --trg_emb vectors-de.txt \
    --method add \
    --src_pos Germany \
    --src_neg United_States \
    --trg_pos Deutschland \
    --trg_neg USA
  • Example German to American modulation with learned:
python code/modulate.py \
    --input input_German_VealeNOC.txt \
    --output predictions_learned_German_VealeNOC.txt \
    --src_emb vectors-de.txt \
    --trg_emb vectors-en.txt \
    --method ridge \
    --train_file train_German_Wiki.txt
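
For intuition, the two modes can be sketched with NumPy on toy vectors. This is a minimal sketch under stated assumptions, not the code in modulate.py: the function names three_cos_add and fit_ridge_map, the four-dimensional toy vectors, and the alpha value are all hypothetical, and it assumes the entity and the candidates live in one shared embedding space.

```python
import numpy as np

def normalize(m):
    """Scale each vector to unit length so dot products equal cosine similarity."""
    m = np.asarray(m, dtype=float)
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def three_cos_add(entity, pos, neg, candidates):
    """Return the index of the candidate most similar to entity - neg + pos."""
    query = normalize(entity) - normalize(neg) + normalize(pos)
    scores = normalize(candidates) @ normalize(query)
    return int(np.argmax(scores))

def fit_ridge_map(X, Y, alpha=1.0):
    """Closed-form ridge regression: W minimizing ||XW - Y||^2 + alpha * ||W||^2."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# Toy vectors (hypothetical). Dimensions: [American, German, carmaker, politician]
united_states = [1, 0, 0, 0]
germany = [0, 1, 0, 0]
ford = [1, 0, 1, 0]
names = ["Volkswagen", "Angela_Merkel"]
candidates = [[0, 1, 1, 0], [0, 1, 0, 1]]

# Unsupervised mode: move "Ford" away from the US and towards Germany.
best = three_cos_add(ford, pos=germany, neg=united_states, candidates=candidates)
print(names[best])

# Supervised mode: learn a linear map from paired source/target training vectors.
W = fit_ridge_map(np.eye(3), 2 * np.eye(3), alpha=1.0)
```

In the unsupervised mode the only supervision is the pair of anchor words (the --src_pos/--src_neg style arguments above). The learned mode instead fits a map from paired examples, which is why it needs a training file.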

3) How do I get my own WikiData dump?

  1. Download the dump for a specific date from https://dumps.wikimedia.org/wikidatawiki/entities/ and, under a recent date, look for a file named like wikidata-20210830-all.json.bz2.
  2. Process the data to get it into .jsonl format. WikiData is large, so removing unrelated attributes and converting it to JSON Lines, which can be loaded item by item, is a helpful preprocessing step.
    We use https://github.com/EntilZha/wikidata-rust to make this conversion.
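
Loading a .jsonl file item by item can be sketched as follows. This is a minimal illustration using only the standard library; the helper names parse_jsonl and iter_jsonl are hypothetical and not part of this repository, and no assumption is made about which fields each item contains.

```python
import json

def parse_jsonl(lines):
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)

def iter_jsonl(path):
    """Stream items from a .jsonl file without loading the whole file into memory."""
    with open(path, encoding="utf-8") as handle:
        yield from parse_jsonl(handle)

# Example usage (assumes the processed dump has been downloaded locally):
# total = sum(1 for _ in iter_jsonl("10-26-20-wikidata.jsonl"))
```

Because both helpers are generators, only one item is held in memory at a time unless the caller collects them into a list.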

4) What computing environment do I need?

Wikipedia and Wikidata are large. The Wikidata code was run on a high-memory CPU machine (100+ GB of RAM) for pre-processing the data, and on a GPU for computing Faiss distances. Since the data is provided in .jsonl format, the code could likely be reworked to require less CPU memory if needed. The Faiss distance calculation is also tractable on a CPU (~1 hour).
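
To illustrate the kind of computation the distance step performs, here is an exact nearest-neighbour search written in plain NumPy and processed in chunks to bound memory. It is a stand-in for illustration only: it does not use the Faiss API, the function name nearest_neighbors and the chunk_size default are assumptions, and it uses cosine similarity, which may differ from the metric used in this repository.

```python
import numpy as np

def nearest_neighbors(queries, database, k=1, chunk_size=1024):
    """Return indices of the top-k most cosine-similar database rows per query."""
    queries = np.asarray(queries, dtype=np.float32)
    database = np.asarray(database, dtype=np.float32)
    # Normalize once so that a dot product equals cosine similarity.
    database = database / np.linalg.norm(database, axis=1, keepdims=True)
    results = []
    for start in range(0, len(queries), chunk_size):
        chunk = queries[start:start + chunk_size]
        chunk = chunk / np.linalg.norm(chunk, axis=1, keepdims=True)
        scores = chunk @ database.T  # shape: (chunk rows, database rows)
        results.append(np.argsort(-scores, axis=1)[:, :k])
    return np.vstack(results)

# Toy data (hypothetical): three database vectors and three queries.
database = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[2.0, 0.1], [0.1, 3.0], [1.0, 1.1]]
neighbors = nearest_neighbors(queries, database, k=1, chunk_size=2)
```

Only one chunk of the score matrix exists at a time, so peak memory scales with chunk_size times the database size rather than with the full query set.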

Please contact [email protected] with any questions.
