
BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

Repository for "BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation", accepted at EAMT 2023.

TL;DR

This repository is an extension of the original COMET metric, providing different options for enhancing it with lexical features.

It includes code for word-level and sentence-level features. We also provide the data that was used in the experiments and checkpoints for the models presented in the paper: COMET+aug, COMET+SL-feat. and COMET+WL-tags.

We used COMET v1.0 as the basis for this extension.

Coming soon: similar checkpoints for the newer COMET v2.0.

Quick Installation

COMET requires Python 3.8 or above. In our experiments we used Python 3.8.

Detailed usage examples and instructions for the COMET metric can be found in the Full Documentation.

To develop locally, install Poetry (pip install poetry) and run the following commands:

git clone https://github.com/deep-spin/robust_MT_evaluation.git
cd robust_MT_evaluation
poetry install

Important commands

Training your own Metric:

  • To train a new model use:

    comet-train --cfg configs/models/{your_model_config}.yaml

Scoring MT outputs:

  • To score with your trained metric use:

    comet-score --model <path_to_trained_model> -s src.txt -t mt.txt -r ref.txt --to_json <path_where_to_save_the_scores>
  • If you used word-level tags during training, then add -wlt <path_to_wlt_for_mt>

    comet-score --model <path_to_trained_model> -s src.txt -t mt.txt -r ref.txt -wlt <path_to_wlt_for_mt> --to_json <path_where_to_save_the_scores>
  • If you used sentence-level features during training, then add -f <path_to_features_for_mt>

    comet-score --model <path_to_trained_model> -s src.txt -t mt.txt -r ref.txt -f <path_to_features_for_mt> --to_json <path_where_to_save_the_scores>
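If you prefer to score from Python rather than the command line, the sketch below follows the standard COMET v1.0 Python API (load_from_checkpoint plus predict) for models that do not need the extra word-level tags or sentence-level features; the checkpoint path and example sentences are placeholders:

    # Minimal sketch of scoring via the Python API (COMET v1.0-style interface).
    from comet import load_from_checkpoint

    model = load_from_checkpoint("path/to/checkpoint.ckpt")  # placeholder path

    # Each sample is a dict with source, hypothesis, and reference.
    data = [
        {"src": "Der Hund bellt.", "mt": "The dog barks.", "ref": "The dog is barking."},
    ]

    # predict returns segment-level scores and a corpus-level (system) score.
    seg_scores, sys_score = model.predict(data, batch_size=8, gpus=0)
    print(seg_scores, sys_score)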

Note: Please contact [email protected] if you wish to host your own metric among the available COMET metrics!

COMET configurations

To train a COMET model on your data you can use the following configuration files:

  • COMET: robust_MT_evaluation/configs/models/regression_metric_original.yaml

  • COMET+WL-tags: robust_MT_evaluation/configs/models/regression_metric_original_with_tags.yaml

  • COMET+SL-feat.: robust_MT_evaluation/configs/models/regression_metric_original_with_feats_bs64.yaml

  • COMET+aug: robust_MT_evaluation/configs/models/regression_metric_original_with_augmts.yaml

COMET Models

Here are the pretrained models that can be used to evaluate your translations:

  • comet-wl-tags: Regression model that incorporates word-level OK/BAD tags, corresponding to the subwords of the translation hypothesis, into the architecture. (COMET+WL-tags)

  • comet-sl-feats: Regression model enhanced with scores obtained from other metrics, BLEU and chrF, which are used as sentence-level (SL) features in a late-fusion manner; a feature-computation sketch follows this list. (COMET+SL-feat.)

  • comet-aug: Regression model trained on a mixture of original and augmented Direct Assessments from WMT17 to WMT20. We use the code provided by the authors of SMAUG and apply their choice of hyperparameters, including the optimal percentage of augmented data. (COMET+aug)
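The sentence-level features passed via -f are per-segment BLEU and chrF scores. A minimal sketch for computing them with sacrebleu is shown below; the exact file layout expected by comet-score is an assumption (here, one tab-separated score pair per line, in a hypothetical features_for_mt.txt), so check it against the provided configs and data:

    # Sketch: compute sentence-level BLEU and chrF for each hypothesis/reference pair
    # with sacrebleu; the output file format for the -f flag is assumed, not verified.
    from sacrebleu.metrics import BLEU, CHRF

    bleu = BLEU(effective_order=True)  # effective_order avoids zero scores on short segments
    chrf = CHRF()

    def sentence_features(hyps, refs):
        feats = []
        for hyp, ref in zip(hyps, refs):
            b = bleu.sentence_score(hyp, [ref]).score
            c = chrf.sentence_score(hyp, [ref]).score
            feats.append((b, c))
        return feats

    with open("mt.txt") as f_mt, open("ref.txt") as f_ref:
        hyps, refs = f_mt.read().splitlines(), f_ref.read().splitlines()

    with open("features_for_mt.txt", "w") as out:  # hypothetical file name
        for b, c in sentence_features(hyps, refs):
            out.write(f"{b}\t{c}\n")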

Note: The score ranges of different models can differ substantially. To better understand COMET scores, please take a look at these FAQs.

Note #2: The word-level tags can be generated in different ways. To generate tags for subwords instead of tokens, we use a modified version of the tagging from the WMT word-level quality estimation task.
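As an illustration of going from token-level to subword-level tags, the sketch below simply repeats each word's OK/BAD tag for every XLM-R subword of that word; this is only an assumed format for the tags consumed via -wlt, not the exact procedure used in the paper:

    # Sketch: project word-level OK/BAD tags to subword-level tags using the
    # XLM-R tokenizer; the exact -wlt tag format is an assumption, not verified.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")

    def word_tags_to_subword_tags(words, tags):
        """Repeat each word's tag for every subword produced by the tokenizer."""
        subword_tags = []
        for word, tag in zip(words, tags):
            n_subwords = len(tokenizer.tokenize(word))
            subword_tags.extend([tag] * max(n_subwords, 1))
        return subword_tags

    words = ["The", "dog", "barks", "loudly"]
    tags = ["OK", "OK", "BAD", "OK"]
    print(word_tags_to_subword_tags(words, tags))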

Related Publications

Citation

If you found our work/code useful, please consider citing our paper:

@article{glushkova2023bleu,
  title={BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation},
  author={Glushkova, Taisiya and Zerva, Chrysoula and Martins, Andr{\'e} FT},
  journal={arXiv preprint arXiv:2305.19144},
  year={2023}
}

Acknowledgments

This code is largely based on the COMET repo by Ricardo Rei.


robust_mt_evaluation's Issues

Loading the models

Hi, thanks for the very interesting work and for open-sourcing the models.
I have a clean python3.8 virtualenv and I've installed this repository and dependencies with poetry, but I'm not able to run comet-score for any of the checkpoints:

comet-score -s ../out/news18_csen.beam20.trans -t ../news18_csen.en.snt -r ../news18_csen.en.snt --model models/comet-wl-tags/checkpoints/epoch=1-step=206468.ckpt

/home/jon/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.3)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Global seed set to 12
Created a temporary directory at /tmp/tmpmttoa89o
Writing /tmp/tmpmttoa89o/_remote_module_non_scriptable.py
Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['roberta.pooler.dense.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias', 'lm_head.dense.bias', 'lm_head.bias', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Encoder model frozen.
Traceback (most recent call last):
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/bin/comet-score", line 6, in <module>
    sys.exit(score_command())
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/comet/cli/score.py", line 191, in score_command
    model = load_from_checkpoint(model_path)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/comet/models/__init__.py", line 72, in load_from_checkpoint
    model = model_class.load_from_checkpoint(checkpoint_path, **hparams)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 209, in _load_model_state
    keys = model.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RegressionMetric:
  Missing key(s) in state_dict: "reproject_embed_layer.weight", "reproject_embed_layer.bias".
  size mismatch for encoder.model.embeddings.word_embeddings.weight: copying a param with shape torch.Size([250005, 1024]) from checkpoint, the shape in current model is torch.Size([250002, 1024]).
  size mismatch for estimator.ff.0.weight: copying a param with shape torch.Size([3072, 8192]) from checkpoint, the shape in current model is torch.Size([3072, 6144]).

comet-score -s ../out/news18_csen.beam20.trans -t ../news18_csen.en.snt -r ../news18_csen.en.snt --model models/comet-sl-feats/checkpoints/epoch=1-step=237518.ckpt

/home/jon/.local/lib/python3.8/site-packages/scipy/__init__.py:146: UserWarning: A NumPy version >=1.16.5 and <1.23.0 is required for this version of SciPy (detected version 1.23.3)
  warnings.warn(f"A NumPy version >={np_minversion} and <{np_maxversion}"
Global seed set to 12
Created a temporary directory at /tmp/tmpucdssx8o
Writing /tmp/tmpucdssx8o/_remote_module_non_scriptable.py
Some weights of the model checkpoint at xlm-roberta-large were not used when initializing XLMRobertaModel: ['lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Encoder model frozen.
Traceback (most recent call last):
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/bin/comet-score", line 6, in <module>
    sys.exit(score_command())
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/comet/cli/score.py", line 191, in score_command
    model = load_from_checkpoint(model_path)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/comet/models/__init__.py", line 72, in load_from_checkpoint
    model = model_class.load_from_checkpoint(checkpoint_path, **hparams)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 161, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/pytorch_lightning/core/saving.py", line 209, in _load_model_state
    keys = model.load_state_dict(checkpoint["state_dict"], strict=strict)
  File "/lnet/work/people/jon/ga_clean/robust_MT_evaluation/env2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for RegressionMetric:
  Missing key(s) in state_dict: "reproject_embed_layer.weight", "reproject_embed_layer.bias".

Additionally, I think the comet-aug archive is incomplete; I'm getting EOF errors when trying to extract it. What can I do to solve these problems?
