Adapting Entities Across Languages and Cultures

EMNLP Findings 2021

We provide the following files for future experiments:

Predictions are evaluated as:

python code/evaluate.py --golds <gold_annotations> --predictions <predictions>

e.g., python code/evaluate.py --golds evaluation_data/gold_American_VealeNOC.txt --predictions embedding_predictions/predictions_3cosadd_American_VealeNOC.txt

Human Generated Adaptations: The final human-generated data is available under evaluation_data as four gold files: German VealeNOC, German Wiki, American VealeNOC, American Wiki.

Human Evaluations of All Adaptations: The questions and the five translator judgements for them are provided at evaluation_data/translator_evaluations.csv.

Embedding-Based Adaptations: The final generated data for our VealeNOC and Wikipedia sourced entities is available under embedding_predictions. There are 6 files: four created with 3cosadd (German Veale, German Wiki, American Veale, American Wiki) and two ("learned") that are trained on Wikipedia and tested on VealeNOC.

WikiData Adaptations: The final generated data for our VealeNOC and Wikipedia sourced entities is available under wikidata_predictions. There are 4 files, created with our WikiData method: German Veale, German Wiki, American Veale, American Wiki.

To create them yourself, you will need (combined 50GB):

German Matrix

American Matrix

Optionally, we provide the original WikiData dump from 10-26-2020 (processed to remove everything unnecessary to Properties and Values): https://obj.umiacs.umd.edu/adaptation/10-26-20-wikidata.jsonl

FAQ

1) What Python environment do I need?

pip install -r requirements.txt

2) How do I produce embedding-based modulations?

We provide modulate.py, which supports both the unsupervised 3cosadd and the supervised learned modulation modes. For detailed parameters, run:

python code/modulate.py -h

Example American to German modulation with 3cosadd:

python code/modulate.py \
    --input input_American_Wiki.txt \
    --output predictions_3cosadd_American_Wiki.txt \
    --src_emb vectors-en.txt \
    --trg_emb vectors-de.txt \
    --method add \
    --src_pos Germany \
    --src_neg United_States \
    --trg_pos Deutschland \
    --trg_neg USA
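As a rough illustration of the 3cosadd idea behind the command above, here is a minimal sketch. It is not the code in modulate.py: it assumes a single shared embedding space and uses made-up toy vectors, whereas the command above takes separate source and target embedding files. The function name three_cos_add and the toy vocabulary are hypothetical.

```python
import numpy as np

def normalize(m):
    """Scale vectors to unit length so dot products equal cosine similarities."""
    return m / np.linalg.norm(m, axis=-1, keepdims=True)

def three_cos_add(query, pos, neg, vocab, matrix, topn=1):
    """Rank vocabulary entries by cosine similarity to query + pos - neg."""
    index = {word: i for i, word in enumerate(vocab)}
    unit = normalize(matrix)
    target = unit[index[query]] + unit[index[pos]] - unit[index[neg]]
    scores = unit @ normalize(target)
    # Exclude the three input words, as is usual for analogy queries.
    for word in (query, pos, neg):
        scores[index[word]] = -np.inf
    best = np.argsort(-scores)[:topn]
    return [vocab[i] for i in best]

# Toy vectors (assumed for illustration): axis 0 = "American",
# axis 1 = "German", axis 2 = "capital city".
vocab = ["United_States", "Germany", "Washington", "Berlin", "Paris"]
matrix = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0],
])

# Adapt an American entity toward Germany: Washington + Germany - United_States.
adapted = three_cos_add("Washington", "Germany", "United_States", vocab, matrix)
```

With these toy vectors the top-ranked adaptation of Washington is Berlin. Real embeddings are noisier, so inspecting several candidates (topn greater than 1) is usually more informative.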

Example German to American modulation with learned:

python code/modulate.py \
    --input input_German_VealeNOC.txt \
    --output predictions_learned_German_VealeNOC.txt \
    --src_emb vectors-de.txt \
    --trg_emb vectors-en.txt \
    --method ridge \
    --train_file train_German_Wiki.txt
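The learned mode can be pictured as fitting a linear map from source-language vectors to target-language vectors on training pairs, then looking up the nearest target entity for each mapped input. The sketch below assumes that --method ridge refers to ordinary ridge regression; it is not taken from modulate.py, and the function names and toy data are hypothetical.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    dim = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(dim), X.T @ Y)

def nearest(vec, vocab, matrix):
    """Return the vocabulary entry whose vector has the highest cosine similarity."""
    unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = unit @ (vec / np.linalg.norm(vec))
    return vocab[int(np.argmax(scores))]

# Toy training pairs (assumed): target vectors are the source vectors
# rotated by 90 degrees, so the ideal map is the rotation matrix itself.
rotation = np.array([[0.0, 1.0], [-1.0, 0.0]])
X_train = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Y_train = X_train @ rotation

W = fit_ridge(X_train, Y_train, alpha=1e-3)

# Map an unseen source entity and find its closest target entity.
trg_vocab = ["target_a", "target_b", "target_c"]
trg_matrix = np.array([[-1.0, 2.0], [1.0, 0.0], [0.0, -1.0]])
prediction = nearest(np.array([2.0, 1.0]) @ W, trg_vocab, trg_matrix)
```

The regularization strength alpha trades off fit against the size of the learned map; a small value is used here only because the toy data is noise-free.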

3) How do I get my own WikiData dump?

Download the dump for a specific date from https://dumps.wikimedia.org/wikidatawiki/entities/. You're looking for a file named like wikidata-20210830-all.json.bz2 under a recent date.
Process the data to get it into .jsonl format. WikiData is unsurprisingly large, so removing unrelated attributes and converting it to JSON Lines, which can be loaded item by item, is a helpful preprocessing step.
We use https://github.com/EntilZha/wikidata-rust to make this conversion.
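Once the dump is in JSON Lines form, it can be read one item at a time instead of being loaded whole. The sketch below shows that pattern with the standard library only; the record fields ("id", "claims") and the helper name iter_jsonl are illustrative assumptions, not the actual schema of the processed dump.

```python
import json
import os
import tempfile

def iter_jsonl(path):
    """Yield one parsed JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)

# Write a tiny stand-in file; the field names here are hypothetical.
sample = [
    {"id": "Q1", "claims": {"P31": ["Q5"]}},
    {"id": "Q2", "claims": {}},
]
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    for record in sample:
        tmp.write(json.dumps(record) + "\n")
    sample_path = tmp.name

# Only one record is held in memory at a time while iterating.
ids_with_claims = [item["id"] for item in iter_jsonl(sample_path) if item["claims"]]
os.remove(sample_path)
```

Because iter_jsonl is a generator, filtering or aggregating over the full dump needs memory proportional to a single item plus whatever results are kept.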

4) What computing environment do I need?

Wikipedia and Wikidata are large. The Wikidata code used a high-memory CPU machine (100+ GB of RAM) for pre-processing the data and a GPU for computing Faiss distances. Since the data is provided in .jsonl format, the code could likely be reworked to require less CPU memory if needed. The Faiss distance calculation is also tractable (~1 hour) on a CPU.

Please contact [email protected] with any questions.
