embeddings-benchmark / mteb

MTEB: Massive Text Embedding Benchmark

Home Page: https://arxiv.org/abs/2210.07316

License: Apache License 2.0

Python 99.68% Makefile 0.08% Shell 0.24%
benchmark clustering information-retrieval sentence-transformers sts text-embedding retrieval neural-search semantic-search sbert

mteb's Introduction

Massive Text Embedding Benchmark


Installation

pip install mteb

Usage

from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"
# or directly from huggingface:
# model_name = "sentence-transformers/all-MiniLM-L6-v2"

model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder=f"results/{model_name}")
  • Using CLI
mteb --available_tasks

mteb -m sentence-transformers/all-MiniLM-L6-v2 \
    -t Banking77Classification  \
    --verbosity 3

# If no output folder is specified, results are saved to results/{model_name} by default
  • Using multiple GPUs in parallel can be done by providing a custom encode function that distributes the inputs across the GPUs (see the example scripts linked in the original README); a minimal sketch follows below.
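
Below is a minimal sketch of such a wrapper, using the multi-process pool shipped with sentence-transformers; the device names, batch size, and task choice are placeholders, not a recommended setup.

from mteb import MTEB
from sentence_transformers import SentenceTransformer


class MultiGPUModel:
    """Sketch of a custom model whose encode() fans batches out to several GPUs."""

    def __init__(self, model_name: str, devices: list[str]):
        self.model = SentenceTransformer(model_name)
        # Starts one worker process per device listed in `devices`.
        self.pool = self.model.start_multi_process_pool(target_devices=devices)

    def encode(self, sentences, batch_size: int = 64, **kwargs):
        # encode_multi_process splits `sentences` into chunks and distributes
        # them across the worker processes started above.
        return self.model.encode_multi_process(sentences, self.pool, batch_size=batch_size)

    def close(self):
        self.model.stop_multi_process_pool(self.pool)


model = MultiGPUModel("sentence-transformers/all-MiniLM-L6-v2", ["cuda:0", "cuda:1"])
MTEB(tasks=["Banking77Classification"]).run(model, output_folder="results/all-MiniLM-L6-v2-multigpu")
model.close()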

Advanced Usage

Dataset selection

Datasets can be selected by providing a list of dataset names, but also:

  • by their task (e.g. "Clustering" or "Classification")
evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks
  • by their categories e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph)
evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence datasets
  • by their languages
evaluation = MTEB(task_langs=["en", "de"]) # Only select datasets which are "en", "de" or "en-de"

You can also specify which languages to load for multilingual/crosslingual tasks like below:

from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining

evaluation = MTEB(tasks=[
        AmazonReviewsClassification(langs=["en", "fr"]), # Only load "en" and "fr" subsets of Amazon Reviews
        BUCCBitextMining(langs=["de-en"]), # Only load "de-en" subset of BUCC
])

There are also presets available for certain task collections, e.g. to select the 56 English datasets that form the "Overall MTEB English leaderboard":

from mteb import MTEB_MAIN_EN
evaluation = MTEB(tasks=MTEB_MAIN_EN, task_langs=["en"])

Evaluation split

You can evaluate only on test splits of all tasks by doing the following:

evaluation.run(model, eval_splits=["test"])

Note that the public leaderboard uses the test splits for all datasets except MSMARCO, where the "dev" split is used.
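
For example, to mirror the leaderboard setting for MSMARCO, the task can be run separately with the dev split; this is a small sketch (the output folder name is arbitrary), reusing the model defined above.

from mteb import MTEB

# MSMARCO is scored on its "dev" split on the public leaderboard.
evaluation = MTEB(tasks=["MSMARCO"])
evaluation.run(model, eval_splits=["dev"], output_folder="results/msmarco-dev")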

Using a custom model

Custom models should implement the following interface: an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.). For inspiration, you can look at the mteb/mtebscripts repo, which was used to run diverse models via SLURM scripts for the paper.

from typing import Any

import numpy as np
import torch


class MyModel:
    def encode(
        self, sentences: list[str], prompt: str, **kwargs: Any
    ) -> torch.Tensor | np.ndarray:
        """Encodes the given sentences using the encoder.

        Args:
            sentences: The sentences to encode.
            prompt: The prompt to use. Useful for prompt-based models.
            **kwargs: Additional arguments to pass to the encoder.

        Returns:
            The encoded sentences.
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)

If you'd like to use different encoding functions for query and corpus when evaluating on Retrieval or Reranking tasks, you can add separate methods for encode_queries and encode_corpus. If these methods exist, they will be automatically used for those tasks. You can refer to the DRESModel at mteb/evaluation/evaluators/RetrievalEvaluator.py for an example of these functions.

import numpy as np
import torch


class MyModel:
    def encode_queries(self, queries: list[str], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
        """
        Returns a list of embeddings for the given sentences.
        Args:
            queries: List of sentences to encode

        Returns:
            List of embeddings for the given sentences
        """
        pass

    def encode_corpus(self, corpus: list[str] | list[dict[str, str]], **kwargs) -> list[np.ndarray] | list[torch.Tensor]:
        """
        Returns a list of embeddings for the given sentences.
        Args:
            corpus: List of sentences to encode
                or list of dictionaries with keys "title" and "text"

        Returns:
            List of embeddings for the given sentences
        """
        pass
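
For illustration, here is a minimal sketch of this interface backed by a single SentenceTransformer; joining "title" and "text" with a space is an assumption about how you may want to handle dict-style corpus entries, not a requirement.

import numpy as np
from sentence_transformers import SentenceTransformer


class MySentenceTransformerRetriever:
    def __init__(self, model_name: str):
        self.model = SentenceTransformer(model_name)

    def encode_queries(self, queries: list[str], batch_size: int = 32, **kwargs) -> np.ndarray:
        return self.model.encode(queries, batch_size=batch_size)

    def encode_corpus(self, corpus, batch_size: int = 32, **kwargs) -> np.ndarray:
        if isinstance(corpus[0], dict):
            # Concatenate title and text for dict-style corpus entries.
            sentences = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        else:
            sentences = corpus
        return self.model.encode(sentences, batch_size=batch_size)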

Evaluating on a custom dataset

To evaluate on a custom task, you can run the following code with your custom task. See how to add a new task for details on creating new tasks in MTEB.

from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MyCustomTask(AbsTaskReranking):
    ...

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MyCustomTask()])
evaluation.run(model)
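
A hedged sketch of what the body of MyCustomTask could look like: the field names mirror existing tasks in the repo, while the dataset name and reference below are placeholders, not real resources.

class MyCustomTask(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MyCustomTask",
            "hf_hub_name": "your-org/your-reranking-dataset",  # placeholder dataset
            "description": "A custom reranking task for illustration.",
            "reference": "https://example.com",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["test"],
            "eval_langs": ["en"],
            "main_score": "map",
        }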

Documentation

📋 Tasks: Overview of available tasks
📈 Leaderboard: The interactive leaderboard of the benchmark
🤖 Adding a model: Information related to how to submit a model to the leaderboard
👩‍💻 Adding a dataset: How to add a new task/dataset to MTEB
👩‍💻 Adding a leaderboard tab: How to add a new leaderboard tab to MTEB
🤝 Contributing: How to contribute to MTEB and set it up for development

Citing

MTEB was introduced in "MTEB: Massive Text Embedding Benchmark", feel free to cite:

@article{muennighoff2022mteb,
  doi = {10.48550/ARXIV.2210.07316},
  url = {https://arxiv.org/abs/2210.07316},
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2210.07316},  
  year = {2022}
}

You may also want to read and cite the work that has extended MTEB and integrated new datasets; see the repository README for the full list.

For works that have used MTEB for benchmarking, you can find them on the leaderboard.

mteb's People

Contributors

akash190104, amrmkayid, asparius, awinml, davidstap, dipam7, dokato, github-actions[bot], guenthermi, imenelydiaker, isaac-chung, izhx, jaygala24, kennethenevoldsen, loicmagne, martinbernstorff, muennighoff, nouamanetazi, nreimers, orionw, rasdani, rbroc, saitejautpala, sakshamrzt, shreeya-dhakal, slvnwhrl, taeminlee, x-tabdeveloping, xu3kev, zhimin-z


mteb's Issues

Multi-GPU for more tasks

It'd be great if we could figure out using multiple GPUs on tasks other than BEIR.
E.g. RedditClusteringP2P takes >20h for a 5.8B model with embeddings of 4096 dimensions.

Is SciDocs the same as USEB SciDocs?

It's inconvenient to have two datasets whose names differ only in case: SciDocs & SCIDOCS. E.g. macOS filesystems are case-insensitive by default.

Let's rename SciDocs? I'd propose SciDocsS2S or SciDocsRR instead, but maybe someone has a better idea.

Also, do I understand correctly that SciDocs is the same SciDocs as in useb? I.e. it includes all the tasks of Cite, Co-Cite, Co-Read, Co-View (+ recomm); see their paper for details.

cc @loicmagne as I think you added it?

Do not skip if running new split

We currently skip evaluation when running a new split if a result file of the same name already exists.
It'd be better to run and append the new split's results to the existing result file (a sketch follows).
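
A rough sketch of the proposed behaviour; the flat {split: results} file layout is an assumption for illustration, not the actual result-file schema.

import json
from pathlib import Path


def merge_split_results(results_path: Path, split: str, split_results: dict) -> None:
    # Load the existing result file if present, add/overwrite only this split, write back.
    existing = json.loads(results_path.read_text()) if results_path.exists() else {}
    existing[split] = split_results
    results_path.write_text(json.dumps(existing, indent=2))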

Sequence Length

Hello,

Thanks for this extensive work.

I have a question about the sequence lengths of the various models used for this benchmark. Different models support different sequence lengths: text-embedding-ada handles up to 8191 tokens, while instructor-xl was trained with a max length of only 512 tokens. Is this taken into account during evaluation?

Please forgive if I'm being ignorant.

validation.tsv not present in qrels folder

Hi
I'm trying to run the below code on Colab

from mteb import MTEB
from sentence_transformers import SentenceTransformer
from mteb.tasks import QuoraRetrieval


model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)

evaluation = MTEB(tasks=["QuoraRetrieval"]) # Only run the QuoraRetrieval task
results = evaluation.run(model, output_folder=f"results/{model_name}")

I get the following error

Retrieval
    - QuoraRetrieval, beir, s2s

ERROR:mteb.evaluation.MTEB:Error while evaluating QuoraRetrieval: File /root/.cache/huggingface/datasets/BeIR/quora/qrels/validation.tsv not present! Please provide accurate file.
ERROR:mteb.evaluation.MTEB:Please check all the error logs at: error_logs.txt

When I check the qrels folder, I only find dev and test tsvs. This issue occurs for other tasks as well, such as MSMARCO.
Any idea what I'm doing wrong?

Optimise classification evaluation

Currently we use the following for classification:

  • We sample 8 training examples per class, compute the embeddings, and fit the LogReg classifier. We then evaluate on the (unchanged) dev / test set.
  • We repeat the previous step 10 times and compute the average for accuracy / f1 etc.

As the test set embeddings will be the same every time, we can compute them once and just feed them to the LR classifier in each repetition. This will make the 10-times repeated evaluation much faster (sketched below).
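
A sketch of the optimisation, not the evaluator's actual code: sample_8_per_class is a hypothetical helper, and model, train_set, test_sentences and y_test are assumed to be in scope.

import numpy as np
from sklearn.linear_model import LogisticRegression

X_test = model.encode(test_sentences)  # test embeddings computed once
scores = []
for seed in range(10):
    # hypothetical helper: draws 8 training examples per class
    train_sentences, y_train = sample_8_per_class(train_set, seed=seed)
    X_train = model.encode(train_sentences)  # only the small training sample is re-encoded
    clf = LogisticRegression(max_iter=100).fit(X_train, y_train)
    scores.append(clf.score(X_test, y_test))
print(np.mean(scores))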

task_langs don't work when task is specified

evaluation = MTEB(tasks=["AmazonCounterfactualClassification"], task_langs=["zh"])
evaluation.run(model)

Task: AmazonCounterfactualClassification, split: validation, language: en. Running...
^CTraceback (most recent call last):
File "", line 1, in

Add embeddings speed

An important factor in choosing embeddings is the speed of embedding.

I suggest adding a "tab" in the evaluation called "Speed", which would be expressed in sentences/sec for example (it can also be tokens/sec).

This is a very useful feature of the SBERT site for example:
https://www.sbert.net/docs/pretrained-models/msmarco-v3.html

and efficiency as a parameter is already mentioned in your paper.
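
A rough sketch of how sentences/sec could be measured for a single model; the sentence list and batch size are placeholders, and model is assumed to be any encoder following the MTEB interface.

import time

sentences = ["A short example sentence."] * 10_000
start = time.perf_counter()
model.encode(sentences, batch_size=64)
elapsed = time.perf_counter() - start
print(f"{len(sentences) / elapsed:.1f} sentences/sec")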

Evaluating New Embeddings

I understand that standard word embeddings (like average_word_embeddings_glove.6B.300d) are downloaded from Hugging Face, but is there code to evaluate new embeddings? I have a .txt file with vectors trained with the GloVe model that I would like to evaluate.

I see in the documentation that we can write our own encoder model that can be evaluated. But is there a way to only input a .txt file of the word embeddings for evaluation?

If there is no code to support a .txt file input, then for the encoder, are the input sentences already tokenized?
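
A hedged sketch of wrapping a GloVe-style .txt file (one word followed by its values per line) in the custom-model interface; MTEB passes raw strings, so the whitespace tokenization below is an assumption you may want to improve, and the file path is a placeholder.

import numpy as np


class TxtWordEmbeddings:
    def __init__(self, path: str, dim: int = 300):
        self.dim = dim
        self.vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                self.vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    def encode(self, sentences, **kwargs):
        out = []
        for sentence in sentences:
            tokens = [t for t in sentence.lower().split() if t in self.vectors]
            if tokens:
                out.append(np.mean([self.vectors[t] for t in tokens], axis=0))
            else:
                out.append(np.zeros(self.dim, dtype=np.float32))
        return np.stack(out)


model = TxtWordEmbeddings("glove.6B.300d.txt")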

SentenceTransformer hangs

Packages:

!pip install -q git+https://github.com/UKPLab/sentence-transformers.git
!pip install -q git+https://github.com/embeddings-benchmark/mteb.git
!pip install -q git+https://github.com/NouamaneTazi/beir.git@fix_drpes_ids
!pip install -q evaluate

Doing

import time
from mteb import MTEB
from sentence_transformers import SentenceTransformer

class SentenceTransformerX(SentenceTransformer):
  pass

model_name = "sentence-transformers/average_word_embeddings_komninos"


model = SentenceTransformerX(model_name)
evaluation = MTEB(tasks=["SciFact"])
a = time.time()
results = evaluation.run(model, output_folder=f"results/{model_name}", overwrite_results=True)
b = time.time()

hangs at

 p = ctx.Process(
                target=SentenceTransformer._encode_multi_process_worker,
                args=(process_id, device_name, self.model, input_queue, output_queue),
                daemon=True,
            )

I think you're the expert here - any ideas? @NouamaneTazi

This only affects the latest BEIR, i.e. I think it has something to do with DPRES. Using the below is fine

!pip install -q git+https://github.com/UKPLab/sentence-transformers.git
!pip install -q git+https://github.com/embeddings-benchmark/mteb.git
!pip install beir==1.0.0

Add revision to datasets

I'd propose to add the commit hash of the revision to tasks:

from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
            "revision": "75937953179...",
        }

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)

This is then fed into load_dataset via revision= & added to the results json file.

This partly addresses #21

SICK-R / BIOSSES

If they're the same, one should be removed; if not, this should be fixed.

OSError: Expected to be able to read 12300328 bytes for message body, got 12300316

Hello,
I've been trying to evaluate several custom models on MTEB

I faced some errors:

2022-11-08 11:12:04.822852 >>> ClimateFEVER
Traceback (most recent call last):
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
    results = task.evaluate(model, split, **kwargs)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/mteb/abstasks/AbsTaskRetrieval.py", line 93, in evaluate
    results = retriever.retrieve(corpus, queries)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/beir/retrieval/evaluation.py", line 23, in retrieve
    return self.retriever.search(corpus, queries, self.top_k, self.score_function, **kwargs)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/beir/retrieval/search/dense/exact_search_multi_gpu.py", line 150, in search
    cos_scores_top_k_values, cos_scores_top_k_idx, chunk_ids = metric.compute()
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/evaluate/module.py", line 433, in compute
    self._finalize()
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/evaluate/module.py", line 390, in _finalize
    self.data = Dataset(**reader.read_files([{"filename": f} for f in file_paths]))
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 236, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 171, in _read_files
    pa_table: Table = self._get_table_from_filename(f_dict, in_memory=in_memory)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 306, in _get_table_from_filename
    table = ArrowReader.read_table(filename, in_memory=in_memory)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/arrow_reader.py", line 325, in read_table
    return table_cls.from_file(filename)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/table.py", line 1036, in from_file
    table = _memory_mapped_arrow_table_from_file(filename)
  File "/home/qmin/anaconda3/envs/sgbert/lib/python3.7/site-packages/datasets/table.py", line 51, in _memory_mapped_arrow_table_from_file
    pa_table = opened_stream.read_all()
  File "pyarrow/ipc.pxi", line 691, in pyarrow.lib.RecordBatchReader.read_all
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Expected to be able to read 12300328 bytes for message body, got 12300316

(other retrieval tasks have the same issue)

Is there any workaround?

Clarify why there are multiple runs in the logs

^ It should be explained in the logs why the same steps appear to be repeated

Task: AmazonReviewsClassification, split: test, language: en. Running...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00,  1.60s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00,  1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:02<00:00,  1.39s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00,  1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00,  1.68s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00,  1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:04<00:00,  2.48s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...
Batches: 100%|██████████| 157/157 [03:04<00:00,  1.18s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Fitting logistic regression classifier...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Evaluating...
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 40 training sentences...
Batches: 100%|██████████| 2/2 [00:03<00:00,  1.71s/it]
INFO:mteb.evaluation.evaluators.ClassificationEvaluator:Encoding 5000 test sentences...

Bitext Mining low scores

I'm getting the results below for:

from mteb import MTEB
from mteb.abstasks.AbsTaskClustering import AbsTaskClustering
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=["BUCC"])
evaluation.run(model)
{
  "dataset_version": null,
  "mteb_version": "0.0.2",
  "test": {
    "de-en": {
      "accuracy": 0.0017745302713987473,
      "f1": 0.0017745302713987473,
      "precision": 0.0017745302713987473,
      "recall": 0.0017745302713987473
    },
    "evaluation_time": 456.59,
    "fr-en": {
      "accuracy": 0.0,
      "f1": 0.0,
      "precision": 0.0,
      "recall": 0.0
    },
    "ru-en": {
      "accuracy": 6.927606511950121e-05,
      "f1": 6.927606511950121e-05,
      "precision": 6.927606511950121e-05,
      "recall": 6.927606511950121e-05
    },
    "zh-en": {
      "accuracy": 0.0,
      "f1": 0.0,
      "precision": 0.0,
      "recall": 0.0
    }
  }
}

Seems too low - I think there's a bug

cc @NouamaneTazi @loicmagne

Error for loading ArxivClusteringP2P

Hello @Muennighoff, I encountered the following issue when loading the ArxivClusteringP2P dataset.

repro

from mteb import MTEB

def test_loading_data():
    eval = MTEB(tasks=["ArxivClusteringP2P"])
    eval.load_tasks_data()
    return 


if __name__ == "__main__":
    test_loading_data()

output:

Generating test split: 23 examples [00:04,  6.38 examples/s]Failed to read file '/root/.cache/huggingface/datasets/downloads/extracted/2368c5e45f666e09c88b163b1db73ad115ce53e3954755e8936da145b036ae4b' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a closing quotation mark in string. in row 0
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 153, in _generate_tables
    dataset = json.load(f)
  File "/usr/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.8/json/decoder.py", line 340, in decode
    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 25447588)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/datasets/builder.py", line 1817, in _prepare_split_single
    for _, table in generator:
  File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 156, in _generate_tables
    raise e
  File "/usr/local/lib/python3.8/dist-packages/datasets/packaged_modules/json/json.py", line 132, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Missing a closing quotation mark in string. in row 0

KeyError: 'validation' for RedditClustering and StackOverflowDupQuestions

Is the validation set used for RedditClustering and StackOverflowDupQuestions? Related to #83 and #84.


2022-12-16 05:26:27.662809 >>> RedditClustering
Traceback (most recent call last):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
results = task.evaluate(model, split, **kwargs)
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/abstasks/AbsTaskClustering.py", line 17, in evaluate
for cluster_set in tqdm.tqdm(self.dataset[split], desc="Clustering"):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/datasets/dataset_dict.py", line 57, in getitem
return super().getitem(k)
KeyError: 'validation'

2022-12-16 05:26:39.846393 >>> StackOverflowDupQuestions
Traceback (most recent call last):
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/evaluation/MTEB.py", line 235, in run
results = task.evaluate(model, split, **kwargs)
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/mteb/abstasks/AbsTaskReranking.py", line 21, in evaluate
data_split = self.dataset[split]
File "/home/aiops/shenxd/Dependency/anaconda3/envs/HuggingFace/lib/python3.7/site-packages/datasets/dataset_dict.py", line 57, in getitem
return super().getitem(k)
KeyError: 'validation'

S2S vs P2P

The BEIR tasks are currently all marked as S2S, but some of them are P2P or S2P / P2S. Retrieval is the only task where we have S2P / P2S. Does that make sense?

Options I see:

  • Add S2P & P2S to P2P assuming that the main use case for selecting S2S is to get short texts
  • Introduce S2P & P2S

Any thoughts? cc @NouamaneTazi

Bug while loading MTOPIntentClassification?

In fact, I'm not sure if this is a bug. Below is what I thought the problem was.


Before evaluating for MTOPIntentClassification, mteb will download a module in cache. In my case the module is located at /data2/.cache/huggingface/modules/datasets_modules/datasets/mteb--mtop_intent/7353fdf5b13e9bfd297fbf98bf66e7e0ee626def6321bd9293bbc6ee1d5fae7b and there is a script called mtop_intent.py:

import json
import datasets

_DESCRIPTION = "MTOP: Multilingual Task-Oriented Semantic Parsing"
_LANGUAGES = ["en", "de", "es", "fr", "hi", "th"]

URL = "" # https://huggingface.co/datasets/mteb/mtop/resolve/main/"

The URL is empty, so the module assumes the files are located in the current working directory, which causes an error.
I changed the URL to

URL = "https://huggingface.co/datasets/mteb/mtop_intent/resolve/main/"

and everything works fine.

Add hardware info to results file

It'd be nice to also have information about the hardware used in the results file in addition to the evaluation time if this is easy to get!
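
A sketch of the kind of metadata that could be collected; the field names are made up for illustration and not part of the actual results schema.

import platform

import torch

hardware_info = {
    "platform": platform.platform(),
    "processor": platform.processor(),
    "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    "gpu_count": torch.cuda.device_count(),
}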

CodeSearchNet task

Would the maintainers be interested in the addition of a code retrieval task (CodeSearchNet, which uses text queries to retrieve code documents), either as a new code retrieval type or added into the existing retrieval category?

Adding a model to automated evaluation

I would like to add Universal Sentence Encoder family of models to the automated evaluation.

It is relatively simple to evaluate it (thanks for making it straightforward), but it is not clear how to create a pull request to add the model to the automated evaluation on the website. Please advise.

# !pip install tensorflow_text 

import tensorflow_hub as hub
from tensorflow_text import SentencepieceTokenizer
import tensorflow as tf

embedder=hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")

class USE():
    def encode(self, sentences, batch_size=32, **kwargs):
        embeddings = []
        for i in range(0, len(sentences), batch_size):
            batch_sentences = sentences[i:i+batch_size]
            batch_embeddings = embedder(batch_sentences)
            embeddings.extend(batch_embeddings)
        return embeddings


model = USE()
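
With a wrapper like the one above (possibly converting the TensorFlow tensors to NumPy via .numpy()), the evaluation itself is the standard call; the task and output folder below are just examples.

from mteb import MTEB

evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/universal-sentence-encoder-multilingual-large")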

Propose chunked computation for the `RerankingEvaluator`

The MindSmallReranking dataset contains 2,362,514 queries, 107,968 positive docs, 2,550,123 negative docs.

Currently, RerankingEvaluator.compute_metrics_batched() just gathers all texts together and encodes them, which requires a lot of memory / GPU memory. (I got CUDA OOM on a 32GB V100.)

I made minor modifications to the code to implement chunked computation, reducing memory usage (a sketch follows).
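
A hedged sketch of the chunking idea, not the evaluator's actual code; the chunk and batch sizes are placeholders.

import numpy as np


def encode_in_chunks(model, texts, chunk_size: int = 50_000, batch_size: int = 256):
    # Encode `texts` chunk by chunk so that only one chunk of embeddings
    # needs to be produced (and held on the GPU) at a time.
    chunks = []
    for start in range(0, len(texts), chunk_size):
        emb = model.encode(texts[start:start + chunk_size], batch_size=batch_size)
        chunks.append(np.asarray(emb))
    return np.concatenate(chunks, axis=0)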

If this change is acceptable, I would be glad to make a PR.
Thanks.

Having inference or evaluation results

Hi all,
thank you for sharing this awesome repo!

I am having experiments on classification tasks.
I am wondering if the inference results (e.g., predicted class for each test sentence) and evaluation results (e.g., whether predicted class for each test sentence is correct) are available via some commands?

Best regards,
Jihyuk

remove `sentence-transformers` dependency

As far as I can tell, the sentence-transformers dependency is not necessary for this code to run; it is only a shorthand for model loading in the CLI. Because installing sentence-transformers also installs torch, sentencepiece, tokenizers and transformers itself, this is quite a big dependency to package. Maybe the installation of sentence-transformers could be split off into an optional dependency?

i.e., pip install mteb[sentencetransformers] could install mteb packaged with sentencetransformers. When running the functionality that requires sentencetransformers, the user could be prompted to install it.
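
A sketch of the lazy-import pattern this proposal implies; the extra name "sentencetransformers" mirrors the suggestion above and is not an existing mteb extra.

def _load_sentence_transformer(model_name: str):
    try:
        from sentence_transformers import SentenceTransformer
    except ImportError as err:
        raise ImportError(
            "This functionality requires sentence-transformers. "
            "Install it with `pip install mteb[sentencetransformers]`."
        ) from err
    return SentenceTransformer(model_name)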

Versioning

I think we should have some form of versioning.

E.g. for each task have an additional field in the json results file called "version" or "revision". We can set it to 0 for all tasks for now or to e.g. the commit string of the dataset on the Hub.

Silent skipping

Currently, when a task name is wrong, nothing happens upon evaluation.run.
I think it'd be nice to raise a warning that a task wasn't found (a sketch follows).
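
A sketch of the proposed check; the function and variable names are illustrative, not mteb internals.

import logging

logger = logging.getLogger(__name__)


def warn_on_unknown_tasks(requested: list[str], available: list[str]) -> None:
    unknown = set(requested) - set(available)
    if unknown:
        logger.warning("The following tasks were not found and will be skipped: %s", sorted(unknown))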

Make classification deterministic

The following code should give the same results

import logging

from mteb import MTEB
from sentence_transformers import SentenceTransformer

logging.basicConfig(level=logging.INFO)

model_name = "average_word_embeddings_komninos"
model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder=None)

It would be nice to write a test for that as well in the tests folder (sketched below).
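
A sketch of such a test; it assumes run() returns comparable result dictionaries when output_folder=None, mirroring the snippet above.

from mteb import MTEB
from sentence_transformers import SentenceTransformer


def test_classification_is_deterministic():
    model = SentenceTransformer("average_word_embeddings_komninos")
    first = MTEB(tasks=["Banking77Classification"]).run(model, output_folder=None)
    second = MTEB(tasks=["Banking77Classification"]).run(model, output_folder=None)
    assert first == second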

Set `n_inits` explicitly in clustering tasks

When running clustering tasks, I keep seeing this warning:

FutureWarning: The default value of `n_init` will change from 3 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning

To keep the behavior of MTEB stable across versions of sklearn, you should set n_init to 3 explicitly. If someone happens to run this with an sklearn version >= 1.4, they would start getting different results. If you want, I can make a PR.
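
The change amounts to pinning the parameter wherever the evaluator constructs its k-means model, e.g. (shown with plain KMeans for illustration; embeddings and n_clusters are assumed to be in scope):

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=n_clusters, n_init=3)  # pin n_init=3 to keep current behaviour
labels = kmeans.fit_predict(embeddings)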

I'm on scikit-learn 1.2.2

How to customize parameters for AbsTaskClassification

Hi all,

I would like to compare varying configurations for AbsTaskClassification.
For example, I am wondering about evaluation results with method=kNN.
But I am not sure how I can change those parameters in Python scripts.
Could you help me with this?

  • BTW, I am wondering which method was used for the performance presented in the paper/leaderboard, between method=kNN (as described in the paper; 3.2 Tasks and evaluation - Classification) and method=logReg (which is the default value for the method param in the code).

Best regards,
Jihyuk

How do you ensure that the comparisons are fair?

Hello, this work is wonderful. However, I have one question: how do you ensure that the comparisons are fair? Data leakage may occur if some of the models use the train/test data for pretraining or finetuning, particularly for newly submitted models.
