embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

License: Apache License 2.0

Python 99.68% Makefile 0.08% Shell 0.24%
benchmark clustering information-retrieval sentence-transformers sts text-embedding retrieval neural-search semantic-search sbert

mteb's Introduction

Massive Text Embedding Benchmark


Installation

pip install mteb

Usage

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"
# or directly from huggingface:
# model_name = "sentence-transformers/all-MiniLM-L6-v2"

model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder=f"results/{model_name}")
  • Using CLI
mteb --available_tasks

mteb -m sentence-transformers/all-MiniLM-L6-v2 \
    -t Banking77Classification  \
    --verbosity 3

# If no output folder is specified, results are saved to results/{model_name} by default
  • Using multiple GPUs in parallel can be done by providing a custom encode function that distributes the inputs across the GPUs (see the example scripts linked in the original README); a minimal sketch follows below.
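
Below is a minimal sketch of such a wrapper, using the multi-process pool shipped with sentence-transformers; the device names, batch size, and task choice are placeholders, not a recommended setup.

from mteb import MTEB
from sentence_transformers import SentenceTransformer


class MultiGPUModel:
    """Sketch of a custom model whose encode() fans batches out to several GPUs."""

    def __init__(self, model_name: str, devices: list[str]):
        self.model = SentenceTransformer(model_name)
        # Starts one worker process per device listed in `devices`.
        self.pool = self.model.start_multi_process_pool(target_devices=devices)

    def encode(self, sentences, batch_size: int = 64, **kwargs):
        # encode_multi_process splits `sentences` into chunks and distributes
        # them across the worker processes started above.
        return self.model.encode_multi_process(sentences, self.pool, batch_size=batch_size)

    def close(self):
        self.model.stop_multi_process_pool(self.pool)


model = MultiGPUModel("sentence-transformers/all-MiniLM-L6-v2", ["cuda:0", "cuda:1"])
MTEB(tasks=["Banking77Classification"]).run(model, output_folder="results/all-MiniLM-L6-v2-multigpu")
model.close()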

Advanced Usage

Dataset selection

Datasets can be selected by providing a list of dataset names, but also:

  • by their task (e.g. "Clustering" or "Classification")
evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks
  • by their categories e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph)
evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence datasets
  • by their languages
evaluation = MTEB(task_langs=["en", "de"]) # Only select datasets which are "en", "de" or "en-de"

You can also specify which languages to load for multilingual/crosslingual tasks like below:

from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining

evaluation = MTEB(tasks=[
        AmazonReviewsClassification(langs=["en", "fr"]), # Only load "en" and "fr" subsets of Amazon Reviews
        BUCCBitextMining(langs=["de-en"]), # Only load "de-en" subset of BUCC
])

There are also presets available for certain task collections, e.g. to select the 56 English datasets that form the "Overall MTEB English leaderboard":

from mteb import MTEB_MAIN_EN
evaluation = MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])

Evaluation split

You can evaluate only on test splits of all tasks by doing the following:

evaluation.run(model, eval_splits=["test"])

Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used.
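
For example, to mirror the leaderboard setting for MSMARCO, the task can be run separately with the dev split; this is a small sketch (the output folder name is arbitrary), reusing the model defined above.

from mteb import MTEB

# MSMARCO is scored on its "dev" split on the public leaderboard.
evaluation = MTEB(tasks=["MSMARCO"])
evaluation.run(model, eval_splits=["dev"], output_folder="results/msmarco-dev")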

Using a custom model

Custom models should implement the following interface: an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.). For inspiration, you can look at the mteb/mtebscripts repo, which was used to run diverse models via SLURM scripts for the paper.

from typing import Any

import numpy as np
import torch


class MyModel:
    def encode(
        self, sentences: list[str], prompt: str, **kwargs: Any
    ) -> torch.Tensor | np.ndarray:
        """Encodes the given sentences using the encoder.

        Args:
            sentences: The sentences to encode.
            prompt: The prompt to use. Useful for prompt-based models.
            **kwargs: Additional arguments to pass to the encoder.

        Returns:
            The encoded sentences.
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)

If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for encode_queries and encode_corpus. If these methods exist, they will be automatically used for those tasks. You can refer to the DRESModel at mteb/evaluation/evaluators/RetrievalEvaluator.py for an example of these functions.

import numpy as np
import torch


class MyModel:
    def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
        """
        Returns a list of embeddings for the given sentences.
        Args:
            queries: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        pass

    def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
        """
        Returns a list of embeddings for the given sentences.
        Args:
            corpus: List of sentences to encode
                or list of dictionaries with keys "title" and "text"

        Returns:
            List of embeddings for the given sentences
        """
        pass
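
For illustration, here is a minimal sketch of this interface backed by a single SentenceTransformer; joining "title" and "text" with a space is an assumption about how you may want to handle dict-style corpus entries, not a requirement.

import numpy as np
from sentence_transformers import SentenceTransformer


class MySentenceTransformerRetriever:
    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def encode_queries(self, queries: list[str], batch_size: int = 32, **kwargs) -> np.ndarray:
        return self.model.encode(queries, batch_size=batch_size)

    def encode_corpus(self, corpus, batch_size: int = 32, **kwargs) -> np.ndarray:
        if isinstance(corpus[0], dict):
            # Concatenate title and text for dict-style corpus entries.
            sentences = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        else:
            sentences = corpus
        return self.model.encode(sentences, batch_size=batch_size)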

Evaluating on a custom dataset

To evaluate on a custom task, you can run the following code with your custom task. See how to add a new task for details on creating new tasks in MTEB.

from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MyCustomTask(AbsTaskReranking):
    ...

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MyCustomTask()])
evaluation.run(model)
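
A hedged sketch of what the body of MyCustomTask could look like: the field names mirror existing tasks in the repo, while the dataset name and reference below are placeholders, not real resources.

class MyCustomTask(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MyCustomTask",
            "hf_hub_name": "your-org/your-reranking-dataset",  # placeholder dataset
            "description": "A custom reranking task for illustration.",
            "reference": "https://example.com",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["test"],
            "eval_langs": ["en"],
            "main_score": "map",
        }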

Documentation

📋 Tasks: Overview of available tasks
📈 Leaderboard: The interactive leaderboard of the benchmark
🤖 Adding a model: Information related to how to submit a model to the leaderboard
👩‍💻 Adding a dataset: How to add a new task/dataset to MTEB
👩‍💻 Adding a leaderboard tab: How to add a new leaderboard tab to MTEB
🤝 Contributing: How to contribute to MTEB and set it up for development

Citing

MTEB was introduced in "MTEB: Massive Text Embedding Benchmark", feel free to cite:

@article{muennighoff2022mteb,
  doi = {10.48550/ARXIV.2210.07316},
  url = {https://arxiv.org/abs/2210.07316},
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2210.07316},  
  year = {2022}
}

You may also want to read and cite the work that has extended MTEB and integrated new datasets; see the repository README for the full list.

For works that have used MTEB for benchmarking, you can find them on the leaderboard.

mteb's People

Contributors

akash190104, amrmkayid, asparius, awinml, davidstap, dipam7, dokato, github-actions[bot], guenthermi, imenelydiaker, isaac-chung, izhx, jaygala24, kennethenevoldsen, loicmagne, martinbernstorff, muennighoff, nouamanetazi, nreimers, orionw, rasdani, rbroc, saitejautpala, sakshamrzt, shreeya-dhakal, slvnwhrl, taeminlee, x-tabdeveloping, xu3kev, zhimin-z


mteb's Issues

Multi-GPU for more tasks

It'd be great if we could figure out using multiple GPUs on tasks other than BEIR.
E.g. RedditClusteringP2P takes >20h for a 5.8B model with embeddings of 4096 dimensions.

Is SciDocs the same as USEB SciDocs?

It's inconvenient to have two datasets whose names differ only in case: SciDocs & SCIDOCS. E.g. macOS filesystems are case-insensitive by default.

Let's rename SciDocs? I'd propose SciDocsS2S or SciDocsRR instead, but maybe someone has a better idea.

Also, do I understand correctly that SciDocs is the same SciDocs as in useb? I.e. it includes all the tasks of Cite, Co-Cite, Co-Read, Co-View (+ recomm); see their paper for details.

cc @loicmagne as I think you added it?

Do not skip if running new split

We currently skip evaluation when running a new split if a result file of the same name already exists.
It'd be better to run and append the new split's results to the existing result file (a sketch follows).
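
A rough sketch of the proposed behaviour; the flat {split: results} file layout is an assumption for illustration, not the actual result-file schema.

import json
from pathlib import Path


def merge_split_results(results_path: Path, split: str, split_results: dict) -> None:
    # Load the existing result file if present, add/overwrite only this split, write back.
    existing = json.loads(results_path.read_text()) if results_path.exists() else {}
    existing[split] = split_results
    results_path.write_text(json.dumps(existing, indent=2))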

Sequence Length

Hello,

Thanks for this extensive work.

I have a question about the sequence lengths of the various models used for this benchmark. Different models support different sequence lengths: text-embedding-ada handles up to 8191 tokens, while instructor-xl was trained with a max length of only 512 tokens. Is this taken into account during evaluation?

Please forgive if I'm being ignorant.

validation.tsv not present in qrels folder

Hi
I'm trying to run the below code on Colab

from mteb import MTEB
from sentence_transformers import SentenceTransformer
from mteb.tasks import QuoraRetrieval


model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)

evaluation = MTEB(tasks=["QuoraRetrieval"]) # Only run the QuoraRetrieval task
results = evaluation.run(model, output_folder=f"results/{model_name}")

I get the following error

Retrieval
    - QuoraRetrieval, beir, s2s

ERROR:mteb.evaluation.MTEB:Error while evaluating QuoraRetrieval: File /root/.cache/huggingface/datasets/BeIR/quora/qrels/validation.tsv not present! Please provide accurate file.
ERROR:mteb.evaluation.MTEB:Please check all the error logs at: error_logs.txt

When I check the qrels folder, I only find dev and test tsvs. This issue occurs for other tasks as well, such as MSMARCO.
Any idea what I'm doing wrong?

Optimise classification evaluation

Currently we use the following for classification:

  • We sample 8 training examples per class, compute the embeddings, and fit the LogReg classifier. We then evaluate on the (unchanged) dev / test set.
  • We repeat the previous step 10 times and compute the average for accuracy / f1 etc.

As the test set embeddings will be the same every time, we can compute them once and just feed them to the LR classifier in each repetition. This will make the 10-times repeated evaluation much faster (sketched below).
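
A sketch of the optimisation, not the evaluator's actual code: sample_8_per_class is a hypothetical helper, and model, train_set, test_sentences and y_test are assumed to be in scope.

import numpy as np
from sklearn.linear_model import LogisticRegression

X_test = model.encode(test_sentences)  # test embeddings computed once
scores = []
for seed in range(10):
    # hypothetical helper: draws 8 training examples per class
    train_sentences, y_train = sample_8_per_class(train_set, seed=seed)
    X_train = model.encode(train_sentences)  # only the small training sample is re-encoded
    clf = LogisticRegression(max_iter=100).fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
print(np.mean(scores))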

task_langs don't work when task is specified

evaluation = MTEB(tasks=["AmazonCounterfactualClassification"], task_langs=["zh"])
evaluation.run(model)

Task: AmazonCounterfactualClassification, split: validation, language: en. Running...
^CTraceback (most recent call last):
File "", line 1, in

Add embeddings speed

An important factor in choosing embeddings is the speed of embedding.

I suggest adding a "tab" in the evaluation called "Speed", which would be expressed in sentences/sec for example (it can also be tokens/sec).

This is a very useful feature of the SBERT site for example:
https://www.sbert.net/docs/pretrained-models/msmarco-v3.html

and efficiency as a parameter is already mentioned in your paper.
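
A rough sketch of how sentences/sec could be measured for a single model; the sentence list and batch size are placeholders, and model is assumed to be any encoder following the MTEB interface.

import time

sentences = ["A short example sentence."] * 10_000
start = time.perf_counter()
model.encode(sentences, batch_size=64)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.1f} sentences/sec")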

Evaluating New Embeddings

I understand that standard word embeddings (like average_word_embeddings_glove.6B.300d) are downloaded from Hugging Face, but is there code to evaluate new embeddings? I have a .txt file with vectors trained with the GloVe model that I would like to evaluate.

I see in the documentation that we can write our own encoder model that can be evaluated. But is there a way to only input a .txt file of the word embeddings for evaluation?

If there is no code to support a .txt file input, then for the encoder, are the input sentences already tokenized?
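
A hedged sketch of wrapping a GloVe-style .txt file (one word followed by its values per line) in the custom-model interface; MTEB passes raw strings, so the whitespace tokenization below is an assumption you may want to improve, and the file path is a placeholder.

import numpy as np


class TxtWordEmbeddings:
    def __init__(self, path: str, dim: int = 300):
        self.dim = dim
        self.vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                self.vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    def encode(self, sentences, **kwargs):
        out = []
        for sentence in sentences:
            tokens = [t for t in sentence.lower().split() if t in self.vectors]
            if tokens:
                out.append(np.mean([self.vectors[t] for t in tokens], axis=0))
            else:
                out.append(np.zeros(self.dim, dtype=np.float32))
        return np.stack(out)


model = TxtWordEmbeddings("glove.6B.300d.txt")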

SentenceTransformer hangs

Packages:

!pip install -q git+https://github.com/UKPLab/sentence-transformers.git
!pip install -q git+https://github.com/embeddings-benchmark/mteb.git
!pip install -q git+https://github.com/NouamaneTazi/beir.git@fix_drpes_ids
!pip install -q evaluate

Doing

import time
from mteb import MTEB
from sentence_transformers import SentenceTransformer

class SentenceTransformerX(SentenceTransformer):
  pass

model_name = "sentence-transformers/average_word_embeddings_komninos"


model = SentenceTransformerX(model_name)
evaluation = MTEB(tasks=["SciFact"])
a = time.time()
results = evaluation.run(model, output_folder=f"results/{model_name}", overwrite_results=True)
b = time.time()

hangs at

 p = ctx.Process(
                target=SentenceTransformer._encode_multi_process_worker,
                args=(process_id, device_name, self.model, input_queue, output_queue),
                daemon=True,
            )

I think you're the expert here - any ideas? @NouamaneTazi

This only affects the latest BEIR, i.e. I think it has something to do with DPRES. Using the below is fine

!pip install -q git+https://github.com/UKPLab/sentence-transformers.git
!pip install -q git+https://github.com/embeddings-benchmark/mteb.git
!pip install beir==1.0.0

Add revision to datasets

I'd propose to add the commit hash of the revision to tasks:

from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
            "revision": "75937953179...",
        }

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)

This is then fed into load_dataset via revision= & added to the results json file.

This partly addresses #21

SICK-R / BIOSSES

If they're the same, one should be removed; if not, this should be fixed.

OSError: Expected to be able to read 12300328 bytes for message body, got 12300316

Hello,
I've been trying to evaluate several custom models on MTEB

I faced some errors:

2022-11-08 11:12:04.822852 >>> ClimateFEVER
Traceback (most recent call last):
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
    results = task.evaluate(model, split, **kwargs)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/mteb/abstasks/AbsTaskRetrieval.py", line 93, in evaluate
    results = retriever.retrieve(corpus, queries)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/beir/retrieval/evaluation.py", line 23, in retrieve
    return self.retriever.search(corpus, queries, self.top_k, self.score_function, **kwargs)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/beir/retrieval/search/dense/exact_search_multi_gpu.py", line 150, in search
    cos_scores_top_k_values, cos_scores_top_k_idx, chunk_ids = metric.compute()
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/evaluate/module.py", line 433, in compute
    self._finalize()
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/evaluate/module.py", line 390, in _finalize
    self.data = Dataset(**reader.read_files([{"filename": f} for f in file_paths]))
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 236, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 171, in _read_files
    pa_table: Table = self._get_table_from_filename(f_dict, in_memory=in_memory)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 306, in _get_table_from_filename
    table = ArrowReader.read_table(filename, in_memory=in_memory)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 325, in read_table
    return table_cls.from_file(filename)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/table.py", line 1036, in from_file
    table = _memory_mapped_arrow_table_from_file(filename)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/table.py", line 51, in _memory_mapped_arrow_table_from_file
    pa_table = opened_stream.read_all()
  File "pyarrow/ipc.pxi", line 691, in pyarrow.lib.RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Expected to be able to read 12300328 bytes for message body, got 12300316

(other retrieval tasks have the same issue)

Is there any workaround?

Clarify why there are multiple runs in the logs

^ It should be explained in the logs why the same steps appear to be repeated

Task: AmazonReviewsClassification, split: test, language: en. Running...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00,  1.60s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00,  1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:02<00:00,  1.39s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00,  1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00,  1.68s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00,  1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:04<00:00,  2.48s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00,  1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00,  1.71s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...

Bitext Mining low scores

I'm getting the results below for:

from mteb import MTEB
from mteb.abstasks.AbsTaskClustering import AbsTaskClustering
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=["BUCC"])
evaluation.run(model)
{
  "dataset_version": null,
  "mteb_version": "0.0.2",
  "test": {
    "de-en": {
      "accuracy": 0.0017745302713987473,
      "f1": 0.0017745302713987473,
      "precision": 0.0017745302713987473,
      "recall": 0.0017745302713987473
    },
    "evaluation_time": 456.59,
    "fr-en": {
      "accuracy": 0.0,
      "f1": 0.0,
      "precision": 0.0,
      "recall": 0.0
    },
    "ru-en": {
      "accuracy": 6.927606511950121e-05,
      "f1": 6.927606511950121e-05,
      "precision": 6.927606511950121e-05,
      "recall": 6.927606511950121e-05
    },
    "zh-en": {
      "accuracy": 0.0,
      "f1": 0.0,
      "precision": 0.0,
      "recall": 0.0
    }
  }
}

Seems too low - I think there's a bug

cc @NouamaneTazi @loicmagne

Error for loading ArxivClusteringP2P

Hello @Muennighoff, I encountered the following issue when loading the ArxivClusteringP2P dataset.

repro

from mteb import MTEB

def test_loading_data():
    eval = MTEB(tasks=["ArxivClusteringP2P"])
    eval.load_tasks_data()
    return 


if __name__ == "__main__":
    test_loading_data()

output:

Generating test split: 23 examples [00:04,  6.38 examples/s]Failed to read file '/root/.cache/huggingface/datasets/downloads/extracted/2368c5e45f666e09c88b163b1db73ad115ce53e3954755e8936da145b036ae4b' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a closing quotation mark in string. in row 0
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 153, in _generate_tables
    dataset = json.load(f)
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 25447588)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1817, in _prepare_split_single
    for _, table in generator:
  File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 156, in _generate_tables
    raise e
  File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 132, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Missing a closing quotation mark in string. in row 0

KeyError: 'validation' for RedditClustering and StackOverflowDupQuestions

Is the validation set used for RedditClustering and StackOverflowDupQuestions? Related to #83 and #84.


2022-12-16 05:26:27.662809 >>> RedditClustering
Traceback (most recent call last):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
results = task.evaluate(model, split, **kwargs)
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/abstasks/AbsTaskClustering.py", line 17, in evaluate
for cluster_set in tqdm.tqdm(self.dataset[split], desc="Clustering"):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/datasets/dataset_dict.py", line 57, in getitem
return super().getitem(k)
KeyError: 'validation'

2022-12-16 05:26:39.846393 >>> StackOverflowDupQuestions
Traceback (most recent call last):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
results = task.evaluate(model, split, **kwargs)
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/abstasks/AbsTaskReranking.py", line 21, in evaluate
data_split = self.dataset[split]
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/datasets/dataset_dict.py", line 57, in getitem
return super().getitem(k)
KeyError: 'validation'

S2S vs P2P

The BEIR tasks are currently all marked as S2S, but some of them are P2P or S2P / P2S. Retrieval is the only task where we have S2P / P2S. Does that make sense?

Options I see:

  • Add S2P & P2S to P2P assuming that the main use case for selecting S2S is to get short texts
  • Introduce S2P & P2S

Any thoughts? cc @NouamaneTazi

Bug while loading MTOPIntentClassification?

In fact, I'm not sure if this is a bug. Below is what I thought the problem was.


Before evaluating for MTOPIntentClassification, mteb will download a module in cache. In my case the module is located at /data2/.cache/huggingface/modules/datasets_modules/datasets/mteb--mtop_intent/7353fdf5b13e9bfd297fbf98bf66e7e0ee626def6321bd9293bbc6ee1d5fae7b and there is a script called mtop_intent.py:

import json
import datasets

_DESCRIPTION = "MTOP: Multilingual Task-Oriented Semantic Parsing"
_LANGUAGES = ["en", "de", "es", "fr", "hi", "th"]

URL = "" # https://huggingface.co/datasets/mteb/mtop/resolve/main/"

The URL is empty, so the module assumes the files are located in the current working directory, which causes an error.
I changed the URL to

URL = "https://huggingface.co/datasets/mteb/mtop_intent/resolve/main/"

and everything works fine.

Add hardware info to results file

It'd be nice to also have information about the hardware used in the results file in addition to the evaluation time if this is easy to get!
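
A sketch of the kind of metadata that could be collected; the field names are made up for illustration and not part of the actual results schema.

import platform

import torch

hardware_info = {
    "platform": platform.platform(),
    "processor": platform.processor(),
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    "gpu_count": torch.cuda.device_count(),
}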

CodeSearchNet task

Would the maintainers be interested in the addition of a code retrieval task (CodeSearchNet, which uses text queries to retrieve code documents), either as a new code retrieval type or added into the existing retrieval category?

Adding a model to automated evaluation

I would like to add Universal Sentence Encoder family of models to the automated evaluation.

It is relatively simple to evaluate it (thanks for making it straightforward), but it is not clear how to create a pull request to add the model to the automated evaluation on the website. Please advise.

# !pip install tensorflow_text 

import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer
import tensorflow as tf

embedder=hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

class USE():
    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch_sentences = sentences[i:i+batch_size]
            batch_embeddings = embedder(batch_sentences)
            embeddings.extend(batch_embeddings)
        return embeddings


model = USE()
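
With a wrapper like the one above (possibly converting the TensorFlow tensors to NumPy via .numpy()), the evaluation itself is the standard call; the task and output folder below are just examples.

from mteb import MTEB

evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/universal-sentence-encoder-multilingual-large")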

Propose chunked computation for the `RerankingEvaluator`

The MindSmallReranking dataset contains 2,362,514 queries, 107,968 positive docs, 2,550,123 negative docs.

Currently, RerankingEvaluator.compute_metrics_batched() just gathers all texts together and encodes them, which requires a lot of memory / GPU memory. (I got CUDA OOM on a 32GB V100.)

I made minor modifications to the code to implement chunked computation, reducing memory usage (a sketch follows).
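
A hedged sketch of the chunking idea, not the evaluator's actual code; the chunk and batch sizes are placeholders.

import numpy as np


def encode_in_chunks(model, texts, chunk_size: int = 50_000, batch_size: int = 256):
    # Encode `texts` chunk by chunk so that only one chunk of embeddings
    # needs to be produced (and held on the GPU) at a time.
    chunks = []
    for start in range(0, len(texts), chunk_size):
        emb = model.encode(texts[start:start + chunk_size], batch_size=batch_size)
        chunks.append(np.asarray(emb))
    return np.concatenate(chunks, axis=0)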

If this change is acceptable, I would be glad to make a PR.
Thanks.

Having inference or evaluation results

Hi all,
thank you for sharing this awesome repo!

I am having experiments on classification tasks.
I am wondering if the inference results (e.g., predicted class for each test sentence) and evaluation results (e.g., whether predicted class for each test sentence is correct) are available via some commands?

Best regards,
Jihyuk

remove `sentence-transformers` dependency

As far as I can tell, the sentence-transformers dependency is not necessary for this code to run; it is only a shorthand for model loading in the CLI. Because installing sentence-transformers also installs torch, sentencepiece, tokenizers and transformers itself, this is quite a big dependency to package. Maybe the installation of sentence-transformers could be split off into an optional dependency?

i.e., pip install mteb[sentencetransformers] could install mteb packaged with sentencetransformers. When running the functionality that requires sentencetransformers, the user could be prompted to install it.
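
A sketch of the lazy-import pattern this proposal implies; the extra name "sentencetransformers" mirrors the suggestion above and is not an existing mteb extra.

def _load_sentence_transformer(model_name: str):
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError as err:
        raise ImportError(
            "This functionality requires sentence-transformers. "
            "Install it with `pip install mteb[sentencetransformers]`."
        ) from err
    return SentenceTransformer(model_name)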

Versioning

I think we should have some form of versioning.

E.g. for each task have an additional field in the json results file called "version" or "revision". We can set it to 0 for all tasks for now or to e.g. the commit string of the dataset on the Hub.

Silent skipping

Currently, when a task name is wrong, nothing happens upon evaluation.run.
I think it'd be nice to raise a warning that a task wasn't found (a sketch follows).
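
A sketch of the proposed check; the function and variable names are illustrative, not mteb internals.

import logging

logger = logging.getLogger(__name__)


def warn_on_unknown_tasks(requested: list[str], available: list[str]) -> None:
    unknown = set(requested) - set(available)
    if unknown:
        logger.warning("The following tasks were not found and will be skipped: %s", sorted(unknown))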

Make classification deterministic

The following code should give the same results

import logging

from mteb import MTEB
from sentence_transformers import SentenceTransformer

logging.basicConfig(level=logging.INFO)

model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder=None)

It would be nice to write a test for that as well in the tests folder (sketched below).
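
A sketch of such a test; it assumes run() returns comparable result dictionaries when output_folder=None, mirroring the snippet above.

from mteb import MTEB
from sentence_transformers import SentenceTransformer


def test_classification_is_deterministic():
    model = SentenceTransformer("average_word_embeddings_komninos")
    first = MTEB(tasks=["Banking77Classification"]).run(model, output_folder=None)
    second = MTEB(tasks=["Banking77Classification"]).run(model, output_folder=None)
    assert first == second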

Set `n_inits` explicitly in clustering tasks

When running clustering tasks, I keep seeing this warning:

FutureWarning: The default value of `n_init` will change from 3 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning

To keep the behavior of MTEB stable across versions of sklearn, you should set n_init to 3 explicitly. If someone happens to run this with an sklearn version >= 1.4, they would start getting different results. If you want, I can make a PR.
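
The change amounts to pinning the parameter wherever the evaluator constructs its k-means model, e.g. (shown with plain KMeans for illustration; embeddings and n_clusters are assumed to be in scope):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=n_clusters, n_init=3)  # pin n_init=3 to keep current behaviour
labels = kmeans.fit_predict(embeddings)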

I'm on scikit-learn 1.2.2

How to customize parameters for AbsTaskClassification

Hi all,

I would like to compare varying configurations for AbsTaskClassification.
For example, I am wondering about evaluation results with method=kNN.
But I am not sure how I can change those parameters in Python scripts.
Could you help me with this?

  • BTW, I am wondering which method was used for the performance presented in the paper/leaderboard, between method=kNN (as described in the paper; 3.2 Tasks and evaluation - Classification) and method=logReg (which is the default value for the method param in the code).

Best regards,
Jihyuk

How do you ensure that the comparisons are fair?

Hello, this work is wonderful. However, I have one question: how do you ensure that the comparisons are fair? Data leakage may occur if some of the models use the train/test data for pretraining or finetuning, particularly for newly submitted models.
