This project forked from dcsaunders/nmt-gender-rerank

Scripts and nbest lists for "First the worst: Finding better gender translations during beam search" (Findings of ACL 2022)

nmt-gender-rerank

Scripts and nbest lists for our paper First the worst: Finding better gender translations during beam search (Findings of ACL 2022)

Abstract

Neural machine translation inference procedures like beam search generate the most likely output under the model. This can exacerbate any demographic biases exhibited by the model. We focus on gender bias resulting from systematic errors in grammatical gender translation, which can lead to human referents being misrepresented or misgendered.

Most approaches to this problem adjust the training data or the model. By contrast, we experiment with simply adjusting the inference procedure. We experiment with reranking nbest lists using gender features obtained automatically from the source sentence, and applying gender constraints while decoding to improve nbest list gender diversity. We find that a combination of these techniques allows large gains in WinoMT accuracy without requiring additional bilingual data or an additional NMT model.

Requirements

  • fast_align (Dyer et al., 2013), https://github.com/clab/fast_align, must be built, and the environment variable FAST_ALIGN_BASE must point to its root directory, e.g.:
export FAST_ALIGN_BASE=/<...>/fast_align
  • Python package mosestokenizer.

  • spaCy and DeMorphy are required for Spanish and German evaluation. The experiments in the paper were conducted with spaCy version 3.1.3.
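The requirements above can be checked before running anything. The following is a minimal sketch, not part of the repository: the function name `check_requirements` is hypothetical, and it only verifies that FAST_ALIGN_BASE is set to an existing directory and that the listed Python packages can be found. It does not verify that fast_align was actually built, and DeMorphy is left out of the default package list because its import name is not stated here.

```python
import importlib.util
import os
from pathlib import Path


def check_requirements(env=None, packages=("mosestokenizer", "spacy")):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper: it checks only what the Requirements list states.
    """
    env = os.environ if env is None else env
    problems = []

    base = env.get("FAST_ALIGN_BASE")
    if not base:
        problems.append("FAST_ALIGN_BASE is not set")
    elif not Path(base).is_dir():
        problems.append(f"FAST_ALIGN_BASE does not point to a directory: {base}")

    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python package not found: {name}")
    return problems


if __name__ == "__main__":
    for problem in check_requirements():
        print("MISSING:", problem)
```

Running the file directly prints one line per missing requirement and nothing if all checks pass.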

Example use

./rerank_nbest.sh LANG WDIR NBEST-LIST TAGGED-SOURCE 1BEST-OUTPUT-FILE
  • LANG: two-letter abbreviation for target language. Current implementation supports de, he, es.

  • WDIR: a working directory where fast_align files will be output. The directory will be created if it does not exist.

  • NBEST-LIST: a file containing an nbest list in the target language, consisting of three columns separated by a triple-pipe ("|||") delimiter. The first column contains the sentence index, the second column contains the target-language hypothesis, and the third column contains the score (higher is better).

  • TAGGED-SOURCE: a file containing three tab-delimited columns conveying gender information about the source sentence. The first column contains the gender to search for. The second column contains the entity, i.e. the word to identify gender for, which may be provided as oracle information or identified by automatic coreference resolution. The entity may contain multiple words (e.g. "the doctor" or "all teachers"). There may also be multiple entities per sentence, which should be pipe-separated (e.g. "the doctor|who"). The third column contains the source sentence, which should be detokenized (it is tokenized internally).

  • 1BEST-OUTPUT-FILE: path to which the 1-best reranked selection (text only) will be written.
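To make the two input formats concrete, here is a minimal Python sketch of how they could be read. It is an illustration only, not the repository's code: the function names are hypothetical, the sentence index is assumed to be an integer, the score column is assumed to hold a single number, and the gender label used in the usage example below is made up. `best_by_score` simply picks the highest-scoring hypothesis, i.e. the unreranked choice, and does not implement the reranking described in the paper.

```python
from collections import defaultdict


def parse_nbest(lines):
    """Group 'index ||| hypothesis ||| score' lines by sentence index."""
    nbest = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        index, hypothesis, score = (field.strip() for field in line.split("|||"))
        # Assumption: integer sentence index and a single numeric score.
        nbest[int(index)].append((hypothesis, float(score)))
    return dict(nbest)


def parse_tagged_source(lines):
    """Split 'gender<TAB>entities<TAB>source' lines into records."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        gender, entities, source = line.split("\t", 2)
        records.append({
            "gender": gender,
            # Multiple entities per sentence are pipe-separated.
            "entities": entities.split("|"),
            "source": source,
        })
    return records


def best_by_score(hypotheses):
    """Return the hypothesis text with the highest score (higher is better)."""
    return max(hypotheses, key=lambda pair: pair[1])[0]
```

For example, `parse_tagged_source(["female\tthe doctor|who\tThe doctor who arrived left."])` yields one record whose `entities` field is `["the doctor", "who"]`.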

The following command reranks the provided English-Spanish gender-constrained 20-best list using oracle entities.

./rerank_nbest.sh es tmp nbest-lists/enes.constrainnbest.20 tagged-source/winomt-en-entity tmp/output.winomt.enes.20.constrain.rerank.oracle
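If the script needs to be driven from Python, for instance to rerank several nbest lists in a loop, the call above can be wrapped as follows. This is a sketch under assumptions: `build_rerank_command` and `run_rerank` are hypothetical helper names, and the script is assumed to be run from the repository root with its arguments in the order shown under "Example use".

```python
import shlex
import subprocess

SUPPORTED_LANGS = {"de", "he", "es"}  # target languages listed under LANG


def build_rerank_command(lang, wdir, nbest_list, tagged_source, output_file,
                         script="./rerank_nbest.sh"):
    """Return the argument list for one rerank_nbest.sh call."""
    if lang not in SUPPORTED_LANGS:
        raise ValueError(f"unsupported target language: {lang}")
    return [script, lang, wdir, nbest_list, tagged_source, output_file]


def run_rerank(*args, **kwargs):
    """Run the script and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_rerank_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Only prints the command; call run_rerank(...) to execute it.
    command = build_rerank_command(
        "es",
        "tmp",
        "nbest-lists/enes.constrainnbest.20",
        "tagged-source/winomt-en-entity",
        "tmp/output.winomt.enes.20.constrain.rerank.oracle",
    )
    print(shlex.join(command))
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.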

Citing

If you found the data or scripts here useful, please cite the paper:

@InProceedings{saunders-etal-2022-first,
  author    = {Saunders, Danielle and Sallis, Rosie and Byrne, Bill},
  title     = {First the worst: Finding better gender translations during beam search},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2022},
  month     = {May},
  year      = {2022},
  publisher = {Association for Computational Linguistics}
}
