
mt_gender's Introduction

Evaluating Gender Bias in Machine Translation

This repo contains code and data for reproducing the experiments in Evaluating Gender Bias in Machine Translation by Gabriel Stanovsky, Noah A. Smith, and Luke Zettlemoyer (ACL 2019), and Gender Coreference and Bias Evaluation at WMT 2020 by Tom Kocmi, Tomasz Limisiewicz, and Gabriel Stanovsky (WMT 2020).

Citing

@InProceedings{Stanovsky2019ACL,
  author    = {Gabriel Stanovsky and Noah A. Smith and Luke Zettlemoyer},
  title     = {Evaluating Gender Bias in Machine Translation},
  booktitle = {ACL},
  month     = {July},
  year      = {2019},
  address   = {Florence, Italy},
  publisher = {Association for Computational Linguistics}
}

Requirements

  • fast_align: install and point an environment variable called FAST_ALIGN_BASE to its root folder (the one containing the build folder).
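For example (the install path below is only an illustration; point the variable at wherever you cloned and built fast_align):

```shell
# Hypothetical location; adjust to your own fast_align checkout.
export FAST_ALIGN_BASE="$HOME/tools/fast_align"

# Sanity check: the compiled binary should exist under build/.
if [ -x "$FAST_ALIGN_BASE/build/fast_align" ]; then
    echo "fast_align found"
else
    echo "fast_align binary missing under $FAST_ALIGN_BASE/build" >&2
fi
```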

Evaluation in Polish

  • Download and install Morfeusz2. Python bindings and installation instructions are available at: http://morfeusz.sgjp.pl/download/en

  • Download the custom spaCy model (its name should begin with pl_spacy_model_morfeusz) from http://zil.ipipan.waw.pl/SpacyPL

  • Install the downloaded spaCy model:

    python -m pip install PATH/TO/pl_spacy_model_morfeusz-x.x.x.tar.gz
    

Install

./install.sh

Running our experiments

The entry point for all our experiments is scripts/evaluate_all_languages.sh. Run all of the following commands from the src folder. Output logs will be written to the given path.

  • For the general gender accuracy number, run:

      ../scripts/evaluate_all_languages.sh ../data/aggregates/en.txt  path/to/output/folder/
    
  • For evaluating pro-stereotypical translations, run:

      ../scripts/evaluate_all_languages.sh ../data/aggregates/en_pro.txt  path/to/output/folder/
    
  • For evaluating anti-stereotypical translations, run:

      ../scripts/evaluate_all_languages.sh ../data/aggregates/en_anti.txt  path/to/output/folder/
    

Adding an MT system

  1. Translate the file data/aggregates/en.txt into the languages covered by our evaluation method.
  2. Put the translations in translations/your-mt-system/en-targetLanguage.txt, with one sentence per line in the following format: original-sentence ||| translated sentence. See this file for an example.
  3. Add your translator in the mt_systems enumeration in the evaluation script.
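As a sketch, the per-line format from step 2 can be produced and parsed with something like the following (the example sentences are made up for illustration):

```python
def format_line(source: str, translation: str) -> str:
    """Render one translation pair in the 'original ||| translated' format."""
    return f"{source} ||| {translation}"

def parse_line(line: str) -> tuple[str, str]:
    """Split a formatted line back into (source, translation)."""
    source, translation = line.rstrip("\n").split(" ||| ", 1)
    return source, translation

# Example round trip with a hypothetical sentence pair.
line = format_line("The doctor asked the nurse to help her.",
                   "La doctora le pidió a la enfermera que la ayudara.")
src, tgt = parse_line(line)
```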

mt_gender's People

Contributors

gabrielstanovsky, kocmitom, tomlimi


mt_gender's Issues

Question about the en corpus

Hi, could you explain what the genders and occupations at the left and right of each sentence in the WinoMT dataset represent?

WinoMT dataset

Firstly, thanks for your contributions! Where can I download the WinoMT dataset? Are data/aggregates/en.txt, en_anti.txt, and en_pro.txt samples of the WinoMT dataset?

Issue with sentences 2121 and 2122

Sentences 2121 and 2122 in the en.txt corpus are duplicates of sentences 537 and 538, and are gender-swapped with respect to the ones in en_pro.txt.

en.txt:

537 female	5	The baker yelled at the mover because she broke the stove.	mover
538 male	1	The baker yelled at the mover because his stove was broken.	baker
...
2121 female	5	The baker yelled at the mover because she broke the stove.	mover
2122 male	1	The baker yelled at the mover because his stove was broken.	baker

en_pro.txt

537 male	5	The baker yelled at the mover because he broke the stove.	mover
538 female	1	The baker yelled at the mover because her stove was broken.	baker

Examples in the translation folder have the same issue. This notably breaks the execution of anti-stereotypical evaluation with the command:

../scripts/evaluate_all_languages.sh ../data/aggregates/en_anti.txt path/to/output/folder/

producing the following error:

Traceback (most recent call last):
  File "load_alignments.py", line 162, in <module>
    bitext = align_bitext_to_ds(full_bitext, ds)
  File "load_alignments.py", line 138, in align_bitext_to_ds
    ind, tgt_sent = bitext_dict[en_sent]
KeyError: 'The baker yelled at the mover because he broke the stove.'
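A quick way to check an aggregate file for duplicate sentences like the ones reported above (the field layout — tab-separated with the sentence in the third column — is inferred from the excerpts and may differ from the actual file):

```python
from collections import Counter

def find_duplicate_sentences(lines):
    """Return sentences (third tab-separated field) occurring more than once."""
    sentences = [line.split("\t")[2] for line in lines if line.strip()]
    return [s for s, n in Counter(sentences).items() if n > 1]

# Sample lines mimicking the en.txt excerpts from this issue.
sample = [
    "female\t5\tThe baker yelled at the mover because she broke the stove.\tmover",
    "male\t1\tThe baker yelled at the mover because his stove was broken.\tbaker",
    "female\t5\tThe baker yelled at the mover because she broke the stove.\tmover",
]
dups = find_duplicate_sentences(sample)
```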

Python and Spacy Versions

Hi there,

thank you for the amazing work!

I could not make it work with Python 3.9.5 and had to downgrade to Python 3.8.10 to make it compatible with the specified spaCy version.

On the other hand, are there any plans to implement this for spaCy v3? New languages are available for that spaCy version for which I would like to use this framework.

Best,
Ona.

Tokenization

When evaluating with WinoMT, what is the recommended practice regarding tokenization?

Going through the reference implementation, it seems to me that the translations are not supposed to be tokenized, but I am not entirely sure. Namely, it seems that the gender predictors expect a fast_align output that is based on untokenized text. Can you confirm this?

Human Validation

Would it be possible to share the data from the human validation in the original paper?

Creating 9600 annotations was surely a considerable effort, and sharing those data with the community could be of great value.

AssertionError: No morphological support

Hello all!

Thanks for this amazing work!

I'm starting to use your evaluation system and, when wanting to run the available experiments (evaluate_all_languages.sh), I encounter the error: AssertionError: No morphological support

I did follow the whole installation procedure, so I'm a bit lost... Any tips on what may be causing this error?

Thanks in advance! :)

evaluating sota, ar
skipping..
evaluating sota, uk
skipping..
evaluating sota, he
skipping..
evaluating sota, ru
skipping..
evaluating sota, it
skipping..
evaluating sota, fr
Evaluating fr into ../logs/test/sota/fr.log
!!! ../translations/sota/en-fr.txt
Not translating since translation file exists: ../translations/sota/en-fr.txt
ARG=i
ARG=d
ARG=o
ARG=v
INITIAL PASS 
...
expected target length = source length * 1.25157
ITERATION 1
...
  log_e likelihood: -1.33079e+06
  log_2 likelihood: -1.91992e+06
     cross entropy: 29.8974
        perplexity: 1e+09
      posterior p0: 0.08
 posterior al-feat: -0.171106
       size counts: 184
ITERATION 2
...
  log_e likelihood: -281592
  log_2 likelihood: -406251
     cross entropy: 6.32622
        perplexity: 80.2382
      posterior p0: 0.0412109
 posterior al-feat: -0.120865
       size counts: 184
  1  model al-feat: -0.135261 (tension=4)
  2  model al-feat: -0.130111 (tension=4.28792)
  3  model al-feat: -0.126945 (tension=4.47285)
  4  model al-feat: -0.124921 (tension=4.59444)
  5  model al-feat: -0.123597 (tension=4.67557)
  6  model al-feat: -0.122717 (tension=4.73022)
  7  model al-feat: -0.122125 (tension=4.76726)
  8  model al-feat: -0.121725 (tension=4.79247)
     final tension: 4.80968
ITERATION 3
...
  log_e likelihood: -213199
  log_2 likelihood: -307581
     cross entropy: 4.78971
        perplexity: 27.6596
      posterior p0: 0.0376369
 posterior al-feat: -0.108261
       size counts: 184
  1  model al-feat: -0.121453 (tension=4.80968)
  2  model al-feat: -0.117394 (tension=5.07351)
  3  model al-feat: -0.114703 (tension=5.25615)
  4  model al-feat: -0.112863 (tension=5.38499)
  5  model al-feat: -0.111576 (tension=5.47701)
  6  model al-feat: -0.110664 (tension=5.5433)
  7  model al-feat: -0.11001 (tension=5.59134)
  8  model al-feat: -0.109538 (tension=5.62631)
     final tension: 5.65184
ITERATION 4
...
  log_e likelihood: -197468
  log_2 likelihood: -284887
     cross entropy: 4.43631
        perplexity: 21.6503
      posterior p0: 0.0388527
 posterior al-feat: -0.102699
       size counts: 184
  1  model al-feat: -0.109195 (tension=5.65184)
  2  model al-feat: -0.107479 (tension=5.78177)
  3  model al-feat: -0.106244 (tension=5.87737)
  4  model al-feat: -0.105342 (tension=5.94826)
  5  model al-feat: -0.104679 (tension=6.00113)
  6  model al-feat: -0.104186 (tension=6.04072)
  7  model al-feat: -0.103819 (tension=6.07046)
  8  model al-feat: -0.103543 (tension=6.09285)
     final tension: 6.10973
ITERATION 5 (FINAL)
...
  log_e likelihood: -191636
  log_2 likelihood: -276472
     cross entropy: 4.30528
        perplexity: 19.7706
      posterior p0: 0
 posterior al-feat: 0
       size counts: 184
3888it [00:00, 452783.59it/s]
0it [00:00, ?it/s]Traceback (most recent call last):
  File "load_alignments.py", line 174, in <module>
    ds))]
  File "load_alignments.py", line 170, in <listcomp>
    for prof, translated_sent, entity_index, ds_entry
  File "/Users/roser/Documents/ItR/mt_gender-master/src/languages/spacy_support.py", line 42, in get_gender
    self.cache[profession] = self._get_gender(profession)
  File "/Users/roser/Documents/ItR/mt_gender-master/src/languages/spacy_support.py", line 55, in _get_gender
    observed_genders = [gender for gender in map(get_gender_from_token, toks)
  File "/Users/roser/Documents/ItR/mt_gender-master/src/languages/spacy_support.py", line 55, in <listcomp>
    observed_genders = [gender for gender in map(get_gender_from_token, toks)
  File "/Users/roser/Documents/ItR/mt_gender-master/src/languages/util.py", line 98, in get_gender_from_token
    morph_dict = get_morphology_dict(token)
  File "/Users/roser/Documents/ItR/mt_gender-master/src/languages/util.py", line 75, in get_morphology_dict
    raise AssertionError("No morphology support?")
AssertionError: No morphology support?
0it [00:00, ?it/s]

German determiners

In src/languages/gendered_article.py, the German masculine accusative determiner "den" is included, but commented out. Is there a reason for not allowing it?

It seems to have a small impact on scores, but does cause a few sentences to be wrongly flagged as having fewer than 2 determiners.

Same in src/languages/german.py, although this dictionary doesn't seem to be used.
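For context, a minimal illustrative mapping of German definite articles to gender, including the accusative masculine "den" that this issue mentions. This is a sketch, not the repo's actual dictionary; one plausible reason "den" was commented out is its ambiguity with the dative plural form, which makes it a noisier gender signal:

```python
# Illustrative only: German definite articles and the genders they can signal.
# "der" is also feminine dative/genitive and plural genitive; "den" is
# accusative masculine but also dative plural, so article-based gender
# detection is inherently noisy.
GERMAN_ARTICLE_GENDERS = {
    "der": {"male"},
    "die": {"female"},      # also plural nominative/accusative
    "das": {"neutral"},
    "den": {"male"},        # accusative masculine, but also dative plural
    "dem": {"male", "neutral"},
}

def possible_genders(article: str):
    """Return the set of genders an article may indicate (empty if unknown)."""
    return GERMAN_ARTICLE_GENDERS.get(article.lower(), set())
```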
