
bert_score's Introduction

BERTScore


Automatic Evaluation Metric described in the paper BERTScore: Evaluating Text Generation with BERT (ICLR 2020). We now support about 130 models (see this spreadsheet for their correlations with human evaluation). Currently, the best model is microsoft/deberta-xlarge-mnli; please consider using it instead of the default roberta-large to get the best correlation with human evaluation.

News:

  • Updated to version 0.3.13

    • Fix bug with transformers version > 4.17.0 (#148)
  • Updated to version 0.3.12

    • Having get_idf_dict compatible with DDP (#140)
    • Fix setup bug (#138)
  • Updated to version 0.3.11

    • Support 6 DeBERTa v3 models
    • Support 3 ByT5 models
  • Updated to version 0.3.10

    • Support 8 SimCSE models
    • Fix the support of scibert (to be compatible with transformers >= 4.0.0)
    • Add scripts for reproducing some results in our paper (See this folder)
    • Support fast tokenizers in huggingface transformers with --use_fast_tokenizer. Notably, you will get different scores because of the difference in the tokenizer implementations (#106).
    • Fix non-zero recall problem for empty candidate strings (#107).
    • Add Turkish BERT support (#108).
  • Updated to version 0.3.9

    • Support 3 BigBird models
    • Fix bugs for mBART and T5
    • Support 4 mT5 models as requested (#93)
  • Updated to version 0.3.8

    • Support 53 new pretrained models including BART, mBART, BORT, DeBERTa, T5, BERTweet, MPNet, ConvBERT, SqueezeBERT, SpanBERT, PEGASUS, Longformer, LED, Blenderbot, etc. Among them, DeBERTa achieves higher correlation with human scores than RoBERTa (our default) on the WMT16 dataset. The correlations are presented in this Google sheet.
    • Please consider using --model_type microsoft/deberta-xlarge-mnli or --model_type microsoft/deberta-large-mnli (faster) if you want the scores to correlate better with human scores.
    • Add baseline files for DeBERTa models.
    • Add example code to generate baseline files (please see the details).
  • Updated to version 0.3.7

    • Being compatible with Huggingface's transformers version >=4.0.0. Thanks to public contributors (#84, #85, #86).
  • See #22 if you want to replicate our experiments on the COCO Captioning dataset.

  • For people in China, downloading pre-trained weights can be very slow. We provide copies of a few models on Baidu Pan.

  • Huggingface's datasets library includes BERTScore in their metric collection.

Previous updates

  • Updated to version 0.3.6
    • Support custom baseline files #74
    • The option --rescale-with-baseline is changed to --rescale_with_baseline so that it is consistent with other options.
  • Updated to version 0.3.5
    • Being compatible with Huggingface's transformers >=v3.0.0 and minor fixes (#58, #66, #68)
    • Several improvements related to efficiency (#67, #69)
  • Updated to version 0.3.4
    • Compatible with transformers v2.11.0 now (#58)
  • Updated to version 0.3.3
    • Fixing the bug with empty strings issue #47.
    • Supporting 6 ELECTRA models and 24 smaller BERT models.
    • A new Google sheet for keeping the performance (i.e., pearson correlation with human judgment) of different models on WMT16 to-English.
    • Including the script for tuning the best number of layers of an English pre-trained model on WMT16 to-English data (See the details).
  • Updated to version 0.3.2
    • Bug fixed: fixing the bug in v0.3.1 when having multiple reference sentences.
    • Supporting multiple reference sentences with our command line tool.
  • Updated to version 0.3.1
    • A new BERTScorer object that caches the model to avoid re-loading it multiple times. Please see our jupyter notebook example for the usage.
    • Supporting multiple reference sentences for each example. The score function now can take a list of lists of strings as the references and return the score between the candidate sentence and its closest reference sentence.

Please see release logs for older updates.

Authors:

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi

*: Equal Contribution

Overview

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measure, which can be useful for evaluating different language generation tasks.

For an illustration, BERTScore recall can be computed as

R_BERT = (1 / |x|) * Σ_{x_i ∈ x} max_{x̂_j ∈ x̂} x_i^T x̂_j

where the x_i are the contextual embeddings of the reference tokens, the x̂_j are the contextual embeddings of the candidate tokens, and the embeddings are pre-normalized so that the inner product equals the cosine similarity.
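As a rough illustration of this greedy matching (a simplified sketch, not the library's actual implementation, which lives in bert_score/utils.py and additionally handles idf weighting, special tokens, and masking):

import torch

def greedy_bertscore(ref_emb: torch.Tensor, cand_emb: torch.Tensor):
    """Toy BERTScore from pre-computed contextual token embeddings.
    ref_emb:  (num_ref_tokens, hidden_dim)
    cand_emb: (num_cand_tokens, hidden_dim)
    """
    # L2-normalize so that the dot product equals cosine similarity
    ref_emb = ref_emb / ref_emb.norm(dim=-1, keepdim=True)
    cand_emb = cand_emb / cand_emb.norm(dim=-1, keepdim=True)
    sim = cand_emb @ ref_emb.T                 # (num_cand_tokens, num_ref_tokens)
    precision = sim.max(dim=1).values.mean()   # best reference match per candidate token
    recall = sim.max(dim=0).values.mean()      # best candidate match per reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision.item(), recall.item(), f1.item()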

If you find this repo useful, please cite:

@inproceedings{bert-score,
  title={BERTScore: Evaluating Text Generation with BERT},
  author={Tianyi Zhang* and Varsha Kishore* and Felix Wu* and Kilian Q. Weinberger and Yoav Artzi},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=SkeHuCVFDr}
}

Installation

  • Python version >= 3.6
  • PyTorch version >= 1.0.0

Install from PyPI with pip:

pip install bert-score

Install the latest unstable version from the master branch on GitHub:

pip install git+https://github.com/Tiiiger/bert_score

Install from source:

git clone https://github.com/Tiiiger/bert_score
cd bert_score
pip install .

and you may test your installation by:

python -m unittest discover

Usage

Python Function

On a high level, we provide a python function bert_score.score and a python object bert_score.BERTScorer. The function provides all the supported features, while the scorer object caches the BERT model to facilitate multiple evaluations. Check our demo to see how to use these two interfaces, and please refer to bert_score/score.py for implementation details. A short sketch of both interfaces is shown below.
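For example (a minimal sketch with made-up candidate and reference strings; see the demo notebook for complete usage):

from bert_score import score, BERTScorer

cands = ["28-year-old chef found dead in San Francisco mall"]
refs = ["A 28-year-old chef who had recently moved to San Francisco was found dead this week."]

# One-off scoring: loads the model on every call
P, R, F1 = score(cands, refs, lang="en", verbose=True)
print(f"F1: {F1.mean().item():.4f}")

# Repeated scoring: the scorer object caches the model between calls
scorer = BERTScorer(lang="en", rescale_with_baseline=True)
P, R, F1 = scorer.score(cands, refs)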

Running BERTScore can be computationally intensive (because it uses BERT :p). Therefore, a GPU is usually necessary. If you don't have access to a GPU, you can try our demo on Google Colab.

Command Line Interface (CLI)

We provide a command line interface (CLI) for BERTScore as well as a python module. The CLI can be used as follows:

  1. To evaluate English text files:

We provide example inputs under ./example.

bert-score -r example/refs.txt -c example/hyps.txt --lang en

You will get the following output at the end:

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0) P: 0.957378 R: 0.961325 F1: 0.959333

where "roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)" is the hash code.

Starting from version 0.3.0, we support rescaling the scores with baseline scores

bert-score -r example/refs.txt -c example/hyps.txt --lang en --rescale_with_baseline

You will get:

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)-rescaled P: 0.747044 R: 0.770484 F1: 0.759045

This makes the range of the scores larger and more human-readable. Please see this post for details.

When you have multiple reference sentences, please use

bert-score -r example/refs.txt example/refs2.txt -c example/hyps.txt --lang en

where the -r argument supports an arbitrary number of reference files. Each reference file should have the same number of lines as your candidate/hypothesis file. The i-th line in each reference file corresponds to the i-th line in the candidate file.
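In the Python interface, multiple references are passed as a list of lists (one list of reference strings per candidate); a minimal sketch with made-up sentences:

from bert_score import score

cands = ["The cat sat on the mat."]
refs = [["A cat was sitting on the mat.", "There is a cat on the mat."]]

# The returned score for each candidate is taken against its closest reference.
P, R, F1 = score(cands, refs, lang="en")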

  2. To evaluate text files in other languages:

We currently support the 104 languages in multilingual BERT (full list).

Please specify the two-letter abbreviation of the language. For instance, use --lang zh for Chinese text.

See more options by bert-score -h.

  3. To load your own custom model: Please specify the path to the model and the number of layers to use by --model and --num_layers.
bert-score -r example/refs.txt -c example/hyps.txt --model path_to_my_bert --num_layers 9
  4. To visualize matching scores:
bert-score-show --lang en -r "There are two bananas on the table." -c "On the table are two apples." -f out.png

The figure will be saved to out.png.

  5. If you see the following message while using BERTScore, please ignore it. This is expected.
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.weight']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Practical Tips

  • Report the hash code (e.g., roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.3.0)-rescaled) in your paper so that people know what setting you use. This is inspired by sacreBLEU. Changes in huggingface's transformers version may also affect the score (See issue #46).
  • Unlike BERT, RoBERTa uses a GPT2-style tokenizer which creates additional " " tokens when multiple spaces appear together. It is recommended to remove extra spaces by sent = re.sub(r' +', ' ', sent) or sent = re.sub(r'\s+', ' ', sent).
  • Using inverse document frequency (idf) on the reference sentences to weigh word importance may correlate better with human judgment. However, when the set of reference sentences becomes too small, the idf scores become inaccurate/invalid, so we make idf weighting optional. To use idf, please set --idf when using the CLI tool or idf=True when calling the bert_score.score function.
  • When you are low on GPU memory, consider setting batch_size when calling the bert_score.score function.
  • To use a particular model, please set -m MODEL_TYPE when using the CLI tool or model_type=MODEL_TYPE when calling the bert_score.score function.
  • We tune the layer to use based on the WMT16 metric evaluation dataset. You may use a different layer by setting -l LAYER or num_layers=LAYER. To tune the best layer for your custom model, please follow the instructions in the tune_layers folder. A sketch combining these options in Python follows after this list.
  • Limitation: Because BERT, RoBERTa, and XLM with learned positional embeddings are pre-trained on sentences with a max length of 512, BERTScore is undefined between sentences longer than 510 tokens (512 after adding the [CLS] and [SEP] tokens). Sentences longer than this will be truncated. Please consider using XLNet, which can support much longer inputs.
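A rough sketch combining the options above in the Python interface (file paths are placeholders; the roberta-large/layer-17 pairing matches the default hash shown earlier):

from bert_score import score

cands = open("hyps.txt").read().splitlines()
refs = open("refs.txt").read().splitlines()

P, R, F1 = score(
    cands, refs,
    model_type="roberta-large",  # -m MODEL_TYPE on the CLI
    num_layers=17,               # -l LAYER on the CLI; 17 is the tuned layer for roberta-large
    idf=True,                    # --idf on the CLI; idf weights are computed over the references
    batch_size=16,               # lower this when GPU memory is tight
    rescale_with_baseline=True,  # --rescale_with_baseline on the CLI
    lang="en",                   # used to select the baseline file
)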

Default Behavior

Default Model

Language   Model
en         roberta-large
en-sci     allenai/scibert_scivocab_uncased
zh         bert-base-chinese
tr         dbmdz/bert-base-turkish-cased
others     bert-base-multilingual-cased

Default Layers

Please see this Google sheet for the supported models and their performance.

Acknowledgement

This repo wouldn't be possible without the awesome bert, fairseq, and transformers.

bert_score's People

Contributors

alistairewj, dougian, ethanjperez, felixgwu, fireindark707, inmoonlight, jinyongyoo, kirzharov, magic-lantern, medecau, nikitajz, praveenjune17, radhikadua123, shirley-wu, stancld, tiiiger, varshakishore

bert_score's Issues

If the candidate sentence string has nothing in it, I get an error.

If the candidate sentence string is "", it should give a score of 0, but instead it raises an error.

I run this statement:

from bert_score import score

sol = score([""], ["Hello World."], model_type=None, num_layers=None, verbose=True,
            idf=True, device=None, batch_size=64, nthreads=4, all_layers=False,
            lang="en", return_hash=False, rescale_with_baseline=True)

This is the output I got:
preparing IDF dict...
done in 0.64 seconds
calculating scores...
computing bert embedding.
0%
0/1 [00:00<?, ?it/s]
IndexError Traceback (most recent call last)
in ()
3 sol = score([""], ["Hello World."], model_type=None, num_layers=None, verbose=True,
4 idf=True, device=None, batch_size=64, nthreads=4, all_layers=False,
----> 5 lang="en", return_hash=False, rescale_with_baseline=True)

10 frames
/usr/local/lib/python3.6/dist-packages/bert_score/score.py in score(cands, refs, model_type, num_layers, verbose, idf, device, batch_size, nthreads, all_layers, lang, return_hash, rescale_with_baseline)
110 all_preds = bert_cos_score_idf(model, refs, cands, tokenizer, idf_dict,
111 verbose=verbose, device=device,
--> 112 batch_size=batch_size, all_layers=all_layers).cpu()
113
114 if ref_group_boundaries is not None:

/usr/local/lib/python3.6/dist-packages/bert_score/utils.py in bert_cos_score_idf(model, refs, hyps, tokenizer, idf_dict, verbose, batch_size, device, all_layers)
365 sen_batch = sentences[batch_start:batch_start+batch_size]
366 embs, masks, padded_idf = get_bert_embedding(sen_batch, model, tokenizer, idf_dict,
--> 367 device=device, all_layers=all_layers)
368 embs = embs.cpu()
369 masks = masks.cpu()

/usr/local/lib/python3.6/dist-packages/bert_score/utils.py in get_bert_embedding(all_sens, model, tokenizer, idf_dict, batch_size, device, all_layers)
235 tokenizer,
236 idf_dict,
--> 237 device=device)
238
239 if batch_size == -1: batch_size = len(all_sens)

/usr/local/lib/python3.6/dist-packages/bert_score/utils.py in collate_idf(arr, tokenizer, idf_dict, device)
202 - :param: device (str): device to use, e.g. 'cpu' or 'cuda'
203 """
--> 204 arr = [sent_encode(tokenizer, a) for a in arr]
205
206 idf_weights = [[idf_dict[i] for i in a] for a in arr]

/usr/local/lib/python3.6/dist-packages/bert_score/utils.py in (.0)
202 - :param: device (str): device to use, e.g. 'cpu' or 'cuda'
203 """
--> 204 arr = [sent_encode(tokenizer, a) for a in arr]
205
206 idf_weights = [[idf_dict[i] for i in a] for a in arr]

/usr/local/lib/python3.6/dist-packages/bert_score/utils.py in sent_encode(tokenizer, sent)
81 return tokenizer.encode(sent.strip(), add_special_tokens=True,
82 add_prefix_space=True,
---> 83 max_length=tokenizer.max_len)
84 else:
85 return tokenizer.encode(sent.strip(), add_special_tokens=True,

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, return_tensors, **kwargs)
1421 pad_to_max_length=pad_to_max_length,
1422 return_tensors=return_tensors,
-> 1423 **kwargs,
1424 )
1425

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in encode_plus(self, text, text_pair, add_special_tokens, max_length, stride, truncation_strategy, pad_to_max_length, is_pretokenized, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, **kwargs)
1563 )
1564
-> 1565 first_ids = get_input_ids(text)
1566 second_ids = get_input_ids(text_pair) if text_pair is not None else None
1567

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in get_input_ids(text)
1535 def get_input_ids(text):
1536 if isinstance(text, str):
-> 1537 tokens = self.tokenize(text, add_special_tokens=add_special_tokens, **kwargs)
1538 return self.convert_tokens_to_ids(tokens)
1539 elif isinstance(text, (list, tuple)) and len(text) > 0 and isinstance(text[0], str):

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_utils.py in tokenize(self, text, **kwargs)
1259 """
1260 all_special_tokens = self.all_special_tokens
-> 1261 text = self.prepare_for_tokenization(text, **kwargs)
1262
1263 # TODO: should this be in the base class?

/usr/local/lib/python3.6/dist-packages/transformers/tokenization_roberta.py in prepare_for_tokenization(self, text, add_special_tokens, **kwargs)
237 else:
238 add_prefix_space = add_special_tokens
--> 239 if add_prefix_space and not text[0].isspace():
240 text = " " + text
241 return text

IndexError: string index out of range

Can the source sentence be incorporated into contextual embeddings?

When evaluating conversation models using bert-score, a natural idea is that it is beneficial to prepend the source sentence to both the hypothesis and the reference to form better contextual embeddings. However, when calculating cosine similarity, the source sentence should not be involved.

Can the project do back propagation, i.e., loss.backward()?

I have a domain corpus with binary gold annotations: sentence pair A-B is similar, sentence pair C-D is not similar.

Can I use this corpus to fine-tune BERT and get more accurate sentence similarity scores on it?
How should the loss function be defined, and how do I call loss.backward()?
And what is the performance with that loss function?
Have you ever tried this?

Scores are nan when cands/refs lists have one element

Hello,
I am trying this simple example here:

cands = ['hello how are you?'] 
refs = ['hello how are you?'] 
P, R, F1 = score(cands, refs, bert="bert-base-uncased", verbose=True)
P: tensor([nan])
R: tensor([nan])
F1: tensor([nan])

I noticed that if the list of cands/refs has only one element, the resulting scores are nan.
This does not happen when len(cands) > 1.
I was curious on why this happens. Thanks a lot!

Rescale in scorer.score vs scorer.plot_example

Hello,
I hope you can help me understand some part of the rescaling logic.

In the score method, rescaling is done using:

if self.rescale_with_baseline:
    all_preds = (all_preds - self.baseline_vals) / (1 - self.baseline_vals)

This makes sense, since all_preds contains P, R, F scores per row, and this is the information in self.baseline_vals.

However, in the plot_example method, the rescaling is done using:

if self.rescale_with_baseline:
    sim = (sim - self.baseline_vals[2].item()) / (1 - self.baseline_vals[2].item())

In this case the rescaling is done over the similarities and using the F values in self.baseline_vals[2] (if I'm understanding correctly). Why is this done this way here? Why are the F scores good rescaling values for the "raw" similarity scores?

I understand that rescaling is merely performed to make the scores more interpretable, since P, R, F are calculated before rescaling. However, I was curious about this difference in the implementations. Thank you in advance for your help.

Different results, thank you.

Hello, I have modified the path in "model=get_model(path)" in your score.py and changed it to load a RoBERTa model I downloaded offline (because downloading RoBERTa in my terminal is very slow). My path folder contains "pytorch_model.bin (1.4 GB)", "config.json", "merges.txt" and "vocab.json". But when I ran your example demo.py with plot_example(cands[0], refs[0], lang="en"), the resulting plot was completely different from yours.

In my picture I found all sim>0.9:
tensor([[0.9865, 0.9735, 0.9606, 0.9709, 0.9444, 0.9624, 0.9701],
[0.9743, 0.9728, 0.9672, 0.9715, 0.9464, 0.9679, 0.9712],
[0.9752, 0.9774, 0.9704, 0.9777, 0.9518, 0.9686, 0.9747],
[0.9629, 0.9604, 0.9641, 0.9665, 0.9497, 0.9611, 0.9639],
[0.9707, 0.9666, 0.9641, 0.9817, 0.9467, 0.9659, 0.9685],
[0.9225, 0.9260, 0.9218, 0.9303, 0.9094, 0.9200, 0.9234],
[0.9518, 0.9455, 0.9472, 0.9544, 0.9376, 0.9527, 0.9514],
[0.9435, 0.9375, 0.9328, 0.9465, 0.9296, 0.9430, 0.9490],
[0.9829, 0.9759, 0.9686, 0.9825, 0.9513, 0.9718, 0.9787]])

I was confused about whether I loaded the wrong pre-trained model. I downloaded the RoBERTa model from "https:s3...huggingface.com/bert/roberta-large-model" like this.

Support for P-Means layer weighting

Hi, are you planning to or willing to share the code from Appendix C for using P-Means to select the best representation for the model when no validation data exists for layer selection?

Thanks for the open-sourced code & extensive ablation experiments in the paper.

Bert as an OCR metric

Hi, it would be useful if it could be used without a reference: just an OCR output and a similarity score between the output and the words in the embedding? Just thinking.

Semantic vs Syntactic

Hi, it looks like position matters more than semantic meaning?

I came across this example

Initialized with BERTScorer(lang="en", rescale_with_baseline=True)

Case1

cands = 'city in canada'
refs = ['city in china']
BERTScore = 0.6102

Case2

cands = 'city in canada'
refs = ['canadian city']
BERTScore = 0.0938

Should Case2 have a better score than Case1? Or did I miss some critical setting?

BERT Score for Elastic Search

Hello,
I am trying to index lots of documents in different languages in Elasticsearch;
a big difficulty for me is finding a stable method to pool all the BERT hidden outputs into one single vector in order to compute a meaningful similarity between documents.
My aim is to add this BERTScore inside the function that searches for similar documents, given a query document.
Currently I use this in Elasticsearch:

script_query = {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "cosineSimilarity(params.query_vector, doc['text_vector']) + 1.0",
                "params": {"query_vector": query_vector}
            }
        }
    }

and I used mean pooling to unify all the layer outputs into one vector, but it doesn't seem to work well.
Do you think BERTScore could be a good idea? Does the score also work on other languages?

Another point is that we already have an optimized system that calculates BERT embeddings, and we would only attach the BERTScore calculation to its output. Could that work?
Thanks

Comparing split wordpiece tokens

Hi there,

Thank you for releasing this repo and for the development of the metric.

I was wondering how you handle cases where an out-of-vocabulary word is split into multiple pieces by the word-piece tokenizer used in BERT models.

Specifically, when comparing the embedding vectors for the tokens in a candidate-reference sentence pair, do you do any preemptive pooling of the vectors for split words before running the cosine similarity, or do you leave them as is?

The download speed is very slow in China

After I execute bert_score to run the examples, it begins to download files. The first file downloads fast, but the second file of size 1.43 GB, which I suppose is the BERT model file, downloads very slowly, and I am based in China. Is there any way to work around this problem? For example, can I download the BERT models elsewhere first and then use bert_score to run them?

why L2 normalize each feature?

Thanks for the repo!

My question refers to the following code snippet from score.py:

ref_embedding.div_(torch.norm(ref_embedding, dim=-1).unsqueeze(-1))
hyp_embedding.div_(torch.norm(hyp_embedding, dim=-1).unsqueeze(-1))

From what I understand from the code, you L2-normalize each word embedding across all tokens in a sentence. After that, you calculate the similarity score matrix.

Would you mind explaining the rationale of doing the normalization?
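For context (an illustrative sketch, not the repository's code): after each token embedding is L2-normalized, a plain matrix multiplication of the two embedding matrices directly yields the pairwise cosine-similarity matrix, since the dot product of unit vectors equals their cosine similarity.

import torch

ref_embedding = torch.randn(7, 1024)   # (ref_tokens, hidden_dim), made-up values
hyp_embedding = torch.randn(5, 1024)   # (hyp_tokens, hidden_dim), made-up values

# Same normalization as in the snippet above
ref_embedding = ref_embedding / ref_embedding.norm(dim=-1, keepdim=True)
hyp_embedding = hyp_embedding / hyp_embedding.norm(dim=-1, keepdim=True)

# One batched matmul now gives the full token-to-token cosine similarity matrix
sim = hyp_embedding @ ref_embedding.T  # (hyp_tokens, ref_tokens), values in [-1, 1]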

Tokens that exceed max_len raise an error

Hi, authors.
I'm Jennifer and thanks for the implementation.


I found that tokens exceeding the model's max_len raise an error like the following:

/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim =
 -2, IndexIsMajor = true]: block: [22,0,0], thread: [27,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim =
 -2, IndexIsMajor = true]: block: [22,0,0], thread: [28,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim =
 -2, IndexIsMajor = true]: block: [22,0,0], thread: [29,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim =
 -2, IndexIsMajor = true]: block: [22,0,0], thread: [30,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim =
 -2, IndexIsMajor = true]: block: [22,0,0], thread: [31,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
Traceback (most recent call last):
  File "$PYENV_PATH/bin/mteval", line 11, in <module>
    load_entry_point('mteval===0.5.4-dev-bertscore', 'console_scripts', 'mteval')()
  File "$PYENV_PATH/lib/python3.6/site-packages/mteval/cli.py", line 90, in main
    args.func(args)
  File "$PYENV_PATH/lib/python3.6/site-packages/mteval/_rtt.py", line 39, in mteval_rtt
    mteval_auto(args)
  File "$PYENV_PATH/lib/python3.6/site-packages/mteval/utils.py", line 146, in wrapper
    return func(*args, **kwargs)
  File "$PYENV_PATH/lib/python3.6/site-packages/mteval/_auto.py", line 379, in mteval_auto
    reference_based_evaluation(args)
  File "$PYENV_PATH/lib/python3.6/site-packages/mteval/_auto.py", line 318, in reference_based_evaluation
    P_idf, R_idf, F1_idf = bert_score.score(output_lines, reference_lines, lang=target, idf=True)
  File "$PYENV_PATH/lib/python3.6/site-packages/bert_score/score.py", line 95, in score
    batch_size=batch_size, all_layers=all_layers)
  File "$PYENV_PATH/lib/python3.6/site-packages/bert_score/utils.py", line 276, in bert_cos_score_idf
    embs = embs.cpu()

This issue was easily solved by adding one parameter in utils.py.

If you don't mind, may I do a PR?

Unreasonably high cosine similarity between words

Hi! I tried to run the model with version 0.2.1, but found that the cosine similarity between different words is unreasonably high, as in the example generated from

bert-score-show --lang en -r "it is freezing today" -c "the weather is cold today" -f out.png

where the minimum similarity between any two words is greater than 0.8, which differs from the stats reported in the paper.

Also, the last figure in the Google Colab notebook https://colab.research.google.com/drive/1kpL8Y_AnUUiCxFjhxSrxCsc6-sDMNb_Q#scrollTo=UW1Nku_LMjzg has the same problem.
I am wondering what the problem might be? Thanks!

Pandas version

Hi, thanks for your wonderful works.

I notice that the rescaling feature needs a pandas version higher than 0.23, which is not stated in the README, so I am raising this issue for you.

Hope you have a good day.

--model and --rescale-with-baseline can not be used together

I use --model to load a local model and --lang en --rescale-with-baseline to rescale with the baseline. The output shows (hug_trans=3.0.2)-rescaled P: R: F:, but the numbers are the same as for the model without rescale-with-baseline. The command I used is as follows:
bert-score -r XXX -c XXX --model XXX --num_layers 17 --lang en --rescale-with-baseline

How to use bert_score with multiple references?

Hi, thanks for releasing bert_score! I'm trying to use bert_score with multiple references. As you mentioned above, the code now supports multiple references, so how do I use them?
An example is listed as follows:
bert-score -r example/refs.txt -c example/hyps.txt --lang en
How do I change refs.txt to multiple reference files?

Load once, run every time

Hi, thank you for your helpful repo.

While using your wonderful work, I found it a little unsatisfying to use the score function in a Python script.

In my work, I usually need to measure performance on lots of datasets, but the score function always loads the model weights first and then runs the evaluation, which is very time-consuming. So I think it would be much better and faster if you added a way to load the model weights once (for example, a .load_weight(path) function on a BERTScore object) and then reuse them for evaluation.

Hope to get your responses about this suggestion.

Get final weighted average

Am I right that you just get back the individual score for each pairwise token?

To get the final (weighted) average of each candidate against each reference, do I need to do this myself?
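For reference (an illustrative sketch, not an official recommendation from the authors): the score function returns one P/R/F1 entry per candidate-reference pair as tensors, so an unweighted corpus-level summary can be taken as a simple mean, e.g.:

from bert_score import score

cands = ["The cat sat on the mat.", "It is raining."]   # made-up examples
refs = ["A cat was sitting on the mat.", "It rains outside."]

P, R, F1 = score(cands, refs, lang="en")  # tensors with one entry per pair
print(F1.mean().item())                   # unweighted corpus-level average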

rescale and specify certain model

Hi
Thank you for making your code available.
I have used your score before the last update (before multi-refs were possible and before the scorer). I used to record the hash of the model to make sure I always get the same results.
With the new update, I'm struggling to find out how to set a specific model and also rescale.

For example, I would like to do something like this:
out, hash_code= score(preds, golds, model_type="roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0)", rescale_with_baseline= True, return_hash=True)

roberta-large_L17_no-idf_version=0.3.0(hug_trans=2.5.0) is the hash I got from my earlier runs a couple of months ago.

Appreciate your help
Areej
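For context (a sketch, not the maintainers' reply): the hash string identifies a configuration rather than being a model name itself, so the equivalent call would pass the underlying model and layer explicitly, roughly like:

from bert_score import score

preds = ["a generated summary"]   # placeholders for the question's preds/golds
golds = ["a reference summary"]

out, hash_code = score(
    preds, golds,
    model_type="roberta-large",   # the model named in the hash
    num_layers=17,                # the "L17" part of the hash
    rescale_with_baseline=True,   # adds the "-rescaled" suffix to the hash
    lang="en",
    return_hash=True,
)
print(hash_code)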

Negative score

Hi, thank you for your helpful repo.

While using the score function (with rescaling), I find that it sometimes produces negative scores, which confuses me (in the paper, the score ranges from 0 to 1). In my processing I just ignore the negative scores, but I still think this is not ideal.

From your explanations, the negative scores arise after rescaling the BERTScore. But I'm not sure what a negative score means; can you explain the meaning of negative scores?

Hope to get your response about this.

bert_score version in demo on Google Colab should be changed.

In the demo on Google Colab linked in the README, the command used to install bert_score is pip install bert_score==0.2.0; if I run that command it shows the error:

ERROR: Could not find a version that satisfies the requirement bert_score==0.2.0 (from versions: 0.1.0, 0.1.1, 0.1.2, 0.2.2, 0.2.3, 0.3.0, 0.3.1, 0.3.2, 0.3.3, 0.3.4)
ERROR: No matching distribution found for bert_score==0.2.0

It should be pip install bert_score==0.3.4 for the cell to work.

[QUESTION] Comparison to Sentence-BERT

I have been using BERT-Score for a while for the STS task, and so far it's the best choice for sentence comparison in my opinion and in tests on real-world text (that is, not necessarily on synthetic text).
I was wondering if there is any comparison to Sentence-BERT, which has SOTA results on STS with a Siamese network architecture. Code is here.

Thank you!

Does BERT-Score support larger Chinese BERT models

I find BERT-Score correlates much better with human evaluation when evaluating conversation.
Specifically, it seems that scores computed from higher layers are better than those from lower layers, so I would really like to see scores from layers higher than 12.

AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'

bert-score is not working, raising this error:

File "/home/rony/.conda/envs/exp/bin/bert-score", line 5, in <module>
    from bert_score_cli.score import main
  File "/home/rony/.conda/envs/exp/lib/python3.6/site-packages/bert_score_cli/score.py", line 6, in <module>
    import bert_score
  File "/home/rony/.conda/envs/exp/lib/python3.6/site-packages/bert_score/__init__.py", line 2, in <module>
    from .utils import *
  File "/home/rony/.conda/envs/exp/lib/python3.6/site-packages/bert_score/utils.py", line 30, in <module>
    list(XLMConfig.pretrained_config_archive_map.keys()) + \
AttributeError: type object 'BertConfig' has no attribute 'pretrained_config_archive_map'

small readme edit

Hello,
Thank you for the great resource!
A small edit to the readme where it says:

Starting from versino 0.3.0, we support rescaling the scores with baseline scores

bert-score -r example/refs.txt -c example/hyps.txt --lang en

-- I think it should be:
Starting from version 0.3.0, we support rescaling the scores with baseline scores

bert-score -r example/refs.txt -c example/hyps.txt --lang en --rescale-with-baseline

(?)
Andrew

Critical assumption leads to erroneous results

Hi,

I've been getting unexpectedly high scores for some settings which I then couldn't reproduce, only to find out from the code that, if -c <candidate> does not exist as a file, the script does not error out and assumes that <candidate> is not a path but a sentence. It also assumes that the references are no longer paths but sentences as well. If there's some overlap between the characters of the <candidate> path and the reference paths, you get a decent BERTScore of around 0.60.

I think this is quite a dangerous issue.

[Question] Cross-lingual Score

Assuming that the embeddings have learned joint language representations (so that cat is close to katze or chat, and hence a sentence like I love eating will be close to Ich esse gerne, as happens in the MUSE or LASER models), would it be possible to evaluate the BERTScore between sentences in two different languages?

Create my own Model for Sentence Similarity/Automated Scoring

I have a Wikipedia dump file as my corpus (which is in Indonesian; I've extracted it and converted it to .txt).
How can I fine-tune bert-base-multilingual-cased on my corpus and use it with BERTScore so that I can have my own model for specific tasks such as sentence similarity or automated short-answer scoring?

Or maybe I should do this with the original BERT?
Thank you so much in advance.

fine-tuning model according to my corpus

Hi,

I might be wrong, but this module calculates similarity based on pre-trained models.

I was wondering if there is a way to use your library and fine-tune it on my corpus. I have a huge text corpus for my use case, and if the model can be fine-tuned, I would like to get the similarity between words according to that model.

I could train a word2vec model on my corpus for this and then get the similarity between words, but I believe BERT is better for such tasks.

Thanks a lot for your help
Shikhar

Running slowly

Hi, thank you for your nice work and for releasing the code. I have tried to run the code, but it is very slow; maybe I should change some settings? Could you give me some advice? Thank you very much~

Suggest to add type=int for batch_size and num_layers args

When passing an argument such as

bert-score -r refs.txt -c hyps.txt --bert bert-base-uncased -b 4

will cause the following error, since the 4 is interpreted as a str:

line 147, in bert_cos_score_idf
    for batch_start in range(0, len(refs), batch_size):
TypeError: 'str' object cannot be interpreted as an integer

Adding type=int should resolve the issue.

Thanks!
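A minimal sketch of the suggested fix (illustrative argparse code with assumed option names, not the repository's actual CLI definition):

import argparse

parser = argparse.ArgumentParser()
# Without type=int, argparse passes the value through as the string "4",
# which later breaks range(0, len(refs), batch_size).
parser.add_argument("-b", "--batch_size", type=int, default=64)
parser.add_argument("--num_layers", type=int, default=None)

args = parser.parse_args(["-b", "4"])
print(type(args.batch_size), args.batch_size)  # <class 'int'> 4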

The best way using bert_score to evaluate text generation like VAE,GAN

Thanks for your implementation.
I am now using your method to evaluate the quality of text generated by VAE and GAN models, which sample from a latent space without references. What is your suggestion for evaluating this kind of task? I didn't see a detailed description in your paper.
Thanks!

Does executing demo.py only wait for downloads?

I use python demo.py to test your model. It shows some progress bars downloading .json files and so on.
I have a slow network connection, so I want to ask whether I can get the download URLs and download the files offline?

segment-level doesn't seem to work on CLI (and also even if it did, would likely be misleading)

Thanks for making this available! I wanted segment level scores, but -s on the command line doesn't really accomplish anything. Beyond that, looking at the code, unless I'm misunderstanding something (entirely possible!!!) the scoring first dedupes and sorts, which means that even if -s printed something, it wouldn't be in the same order as the input sentences, which would be pretty misleading. This would be a great feature. I ended up writing my own python script to do it by processing each sentence one at a time, but of course this is pretty slow.

Recall and F1 are nan

Hi,
Any idea why I get R and F1 equal to nan when I do this:

    cands = ['hi how is going?', 'hi my name is Stella']
    refs = ['hi how are you?', 'hi how are you?']

    no_idf = True if len(refs) == 1 else False
    P, R, F = score(cands, refs, bert="bert-base-multilingual-cased", no_idf=no_idf)
    
    P: [0.75812787 0.55651724]
    R: [nan nan]
    F1: [nan nan]

No response

When I input the command on my Linux machine
"bert-score -r example/refs.txt -c example/hyps.txt --lang en"
there is no response and I have to force exit.
Can you help me solve this?

About the fatal weakness of the Embedding-based metric

Hi, thank you for your wonderful repo.
In my view, BERTScore is a kind of embedding-based metric for measuring the quality of responses, similar to Embedding-Average and Greedy Matching.
After trying Embedding-Average, Greedy Matching, Vector Extrema, and BERTScore, I found that the average scores of these embedding-based metrics are very high (on average 0.817 on the DailyDialog and Cornell datasets). In this case, any response, even a very bad one, can achieve a "good" score, and the difference between "good" and "bad" is very small.
I attribute this issue to the "fuzzy" representation of word embeddings, so I think embedding-based metrics are not very appropriate for measuring the performance of generative models such as dialog systems and NMT.

What do you think about this issue? And how can it be alleviated?

Hope to get a response from you. Thanks.

Getting "nan" as the score

For a candidate and reference list of size 1, I get tensor([nan]) as the result. Is this expected behavior?

This is the code that results in nan:

from bert_score import score 

cands = ['28-year-old chef found dead in San Francisco mall']

refs = ['A 28-year-old chef who had recently moved to San Francisco was found dead in the stairwell of a local mall this week.']

_, _, F1 = score(cands, refs, bert="bert-base-uncased", verbose=True)
print (F1)

Support for multiple references per candidate

Hi,
Thank you for the good work. I was checking the code and, as far as I could understand, there is an underlying assumption that there is only one single reference per candidate. Am I correct, or am I missing something?

I'm working on a task (sentence simplification) for which we have valid and test sets with multiple references. I was thinking of trying out BERTScore there and making the appropriate modifications for the multi-reference scenario, but I wanted to check first in case there was something already implemented along that line of work that I hadn't noticed.

Thanks.

strangely high scores w/ BERTScore

I am getting some pretty high scores for trivial inputs; take the following example:

(bert_score_p, bert_score_r, bert_score_f) = bert_score.score(cands=["a"], refs=["some random string"], lang='en')
print((bert_score_p, bert_score_r, bert_score_f))

The output is:

(tensor([0.8874]), tensor([0.8125]), tensor([0.8483]))

which are pretty high scores. What am I missing here?
