
BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

Repository for "BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation", accepted at EAMT 2023.

TL;DR

This repository is an extension of the original COMET metric, providing different options for enhancing it with lexical features.

It includes code for word-level and sentence-level features. We also provide the data that was used in the experiments and checkpoints for the models presented in the paper: COMET+aug, COMET+SL-feat. and COMET+WL-tags.

We used COMET v1.0 as the basis for this extension.

Coming soon: similar checkpoints for the newer COMET v2.0.

Quick Installation

COMET requires Python 3.8 or above. In our experiments we used Python 3.8.

Detailed usage examples and instructions for the COMET metric can be found in the Full Documentation.

To develop locally, install Poetry (pip install poetry) and run the following commands:

git clone https://github.com/deep-spin/robust_MT_evaluation.git
cd robust_MT_evaluation
poetry install

Important commands

Training your own Metric:

  • To train a new model use:

    comet-train --cfg configs/models/{your_model_config}.yaml

Scoring MT outputs:

  • To score with your trained metric use:

    comet-score --model <path_to_trained_model> -s src.txt -t mt.txt -r ref.txt --to_json <path_where_to_save_the_scores>
  • If you used word-level tags during training, then add -wlt <path_to_wlt_for_mt>

    comet-score --model <path_to_trained_model> -s src.txt -t mt.txt -r ref.txt -wlt <path_to_wlt_for_mt> --to_json <path_where_to_save_the_scores>
  • If you used sentence-level features during training, then add -f <path_to_features_for_mt>

    comet-score --model <path_to_trained_model> -s src.txt -t mt.txt -r ref.txt -f <path_to_features_for_mt> --to_json <path_where_to_save_the_scores>
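If you prefer to score from Python rather than the command line, the sketch below follows the standard COMET v1.0 Python API (load_from_checkpoint plus predict) for models that do not need the extra word-level tags or sentence-level features; the checkpoint path and example sentences are placeholders:

    # Minimal sketch of scoring via the Python API (COMET v1.0-style interface).
    from comet import load_from_checkpoint

    model = load_from_checkpoint("path/to/checkpoint.ckpt")  # placeholder path

    # Each sample is a dict with source, hypothesis, and reference.
    data = [
        {"src": "Der Hund bellt.", "mt": "The dog barks.", "ref": "The dog is barking."},
    ]

    # predict returns segment-level scores and a corpus-level (system) score.
    seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)
    print(seg_scores, sys_score)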

Note: Please contact [email protected] if you wish to host your own metric among the available COMET metrics!

COMET configurations

To train a COMET model on your data you can use the following configuration files:

  • COMET: robust_MT_evaluation/configs/models/regression_metric_original.yaml

  • COMET+WL-tags: robust_MT_evaluation/configs/models/regression_metric_original_with_tags.yaml

  • COMET+SL-feat.: robust_MT_evaluation/configs/models/regression_metric_original_with_feats_bs64.yaml

  • COMET+aug: robust_MT_evaluation/configs/models/regression_metric_original_with_augmts.yaml

COMET Models

Here are the pretrained models that can be used to evaluate your translations:

  • comet-wl-tags: Regression model that incorporates word-level OK/BAD tags, corresponding to the subwords of the translation hypothesis, into the architecture. (COMET+WL-tags)

  • comet-sl-feats: Regression model enhanced with scores obtained from other metrics, BLEU and chrF, which are used as sentence-level (SL) features in a late-fusion manner; a feature-computation sketch follows this list. (COMET+SL-feat.)

  • comet-aug: Regression model trained on a mixture of original and augmented Direct Assessments from WMT17 to WMT20. We use the code provided by the authors of SMAUG and apply their choice of hyperparameters, including the optimal percentage of augmented data. (COMET+aug)
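The sentence-level features passed via -f are per-segment BLEU and chrF scores. A minimal sketch for computing them with sacrebleu is shown below; the exact file layout expected by comet-score is an assumption (here, one tab-separated score pair per line, in a hypothetical features_for_mt.txt), so check it against the provided configs and data:

    # Sketch: compute sentence-level BLEU and chrF for each hypothesis/reference pair
    # with sacrebleu; the output file format for the -f flag is assumed, not verified.
    from sacrebleu.metrics import BLEU, CHRF

    bleu = BLEU(effective_order=True)  # effective_order avoids zero scores on short segments
    chrf = CHRF()

    def sentence_features(hyps, refs):
        feats = []
        for hyp, ref in zip(hyps, refs):
            b = bleu.sentence_score(hyp, [ref]).score
            c = chrf.sentence_score(hyp, [ref]).score
            feats.append((b, c))
        return feats

    with open("mt.txt") as f_mt, open("ref.txt") as f_ref:
        hyps, refs = f_mt.read().splitlines(), f_ref.read().splitlines()

    with open("features_for_mt.txt", "w") as out:  # hypothetical file name
        for b, c in sentence_features(hyps, refs):
            out.write(f"{b}\t{c}\n")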

Note: The score ranges of different models can differ substantially. To better understand COMET scores, please take a look at these FAQs.

Note #2: The word-level tags can be generated in different ways. To generate tags for subwords instead of tokens, we use a modified version of the tagging from the WMT word-level quality estimation task.
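As an illustration of going from token-level to subword-level tags, the sketch below simply repeats each word's OK/BAD tag for every XLM-R subword of that word; this is only an assumed format for the tags consumed via -wlt, not the exact procedure used in the paper:

    # Sketch: project word-level OK/BAD tags to subword-level tags using the
    # XLM-R tokenizer; the exact -wlt tag format is an assumption, not verified.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

    def word_tags_to_subword_tags(words, tags):
        """Repeat each word's tag for every subword produced by the tokenizer."""
        subword_tags = []
        for word, tag in zip(words, tags):
            n_subwords = len(tokenizer.tokenize(word))
            subword_tags.extend([tag] * max(n_subwords, 1))
        return subword_tags

    words = ["The", "dog", "barks", "loudly"]
    tags = ["OK", "OK", "BAD", "OK"]
    print(word_tags_to_subword_tags(words, tags))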

Related Publications

Citation

If you found our work/code useful, please consider citing our paper:

@article{glushkova2023bleu,
  title={BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation},
  author={Glushkova, Taisiya and Zerva, Chrysoula and Martins, Andr{\'e} FT},
  journal={arXiv preprint arXiv:2305.19144},
  year={2023}
}

Acknowledgments

This code is largely based on the COMET repo by Ricardo Rei.


robust_mt_evaluation's Issues

Loading the models

Hi, thanks for the very interesting work and for open-sourcing the models.
I have a clean python3.8 virtualenv and I've installed this repository and dependencies with poetry, but I'm not able to run comet-score for any of the checkpoints:

comet-score -s ../out/news18_csen.beam20.trans -t ../news18_csen.en.snt -r ../news18_csen.en.snt --model models/comet-wl-tags/checkpoints/epoch=1-step=206468.ckpt

/home/jon/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.3)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Global seed set to 12
Created a temporary directory at /tmp/tmpmttoa89o
Writing /tmp/tmpmttoa89o/_remote_module_non_scriptable.py
Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Encoder model frozen.
Traceback (most recent call last):
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/bin/comet-score", line 6, in <module>
    sys.exit(score_command())
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/comet/cli/score.py", line 191, in score_command
    model = load_from_checkpoint(model_path)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/comet/models/__init__.py", line 72, in load_from_checkpoint
    model = model_class.load_from_checkpoint(checkpoint_path, **hparams)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 209, in _load_model_state
    keys = model.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RegressionMetric:
  Missing key(s) in state_dict: "reproject_embed_layer.weight", "reproject_embed_layer.bias".
  size mismatch for encoder.model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([250005, 1024]) from checkpoint, the shape in current model is torch.Size([250002, 1024]).
  size mismatch for estimator.ff.0.weight: copying a param with shape torch.Size([3072, 8192]) from checkpoint, the shape in current model is torch.Size([3072, 6144]).

comet-score -s ../out/news18_csen.beam20.trans -t ../news18_csen.en.snt -r ../news18_csen.en.snt --model models/comet-sl-feats/checkpoints/epoch=1-step=237518.ckpt

/home/jon/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.3)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Global seed set to 12
Created a temporary directory at /tmp/tmpucdssx8o
Writing /tmp/tmpucdssx8o/_remote_module_non_scriptable.py
Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Encoder model frozen.
Traceback (most recent call last):
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/bin/comet-score", line 6, in <module>
    sys.exit(score_command())
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/comet/cli/score.py", line 191, in score_command
    model = load_from_checkpoint(model_path)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/comet/models/__init__.py", line 72, in load_from_checkpoint
    model = model_class.load_from_checkpoint(checkpoint_path, **hparams)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 209, in _load_model_state
    keys = model.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RegressionMetric:
  Missing key(s) in state_dict: "reproject_embed_layer.weight", "reproject_embed_layer.bias".

Additionally, I think the comet-aug archive is incomplete; I'm getting EOF errors when trying to extract it. What can I do to solve these problems?
