beir-cellar / beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

Home Page: http://beir.ai

License: Apache License 2.0

Python 100.00%
nlp information-retrieval bert benchmark sentence-transformers question-generation retrieval passage-retrieval elasticsearch dpr sbert retrieval-models dataset ance colbert zero-shot-retrieval use-qa deep-learning pytorch

beir's Introduction


🍻 What is it?

BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark.

For an overview, check out our new wiki page: https://github.com/beir-cellar/beir/wiki.

For models and datasets, check out our Hugging Face (HF) page: https://huggingface.co/BeIR.

For the leaderboard, check out our EvalAI page: https://eval.ai/web/challenges/challenge-page/1897.

For more information, check out our publications (see the Citing & Authors section below).

🍻 Installation

Install via pip:

pip install beir

If you want to build from source, use:

$ git clone https://github.com/beir-cellar/beir.git
$ cd beir
$ pip install -e .

Tested with Python versions 3.6 and 3.7.

🍻 Features

  • Preprocess your own IR dataset or use one of the 17 already-preprocessed benchmark datasets
  • Wide range of settings covered, spanning diverse benchmarks useful for both academia and industry
  • Includes well-known retrieval architectures (lexical, dense, sparse, and reranking-based)
  • Add and evaluate your own model in an easy framework using different state-of-the-art evaluation metrics

🍻 Quick Example

For more example code, please refer to our Examples and Tutorials wiki page.

from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

#### Download scifact.zip dataset and unzip the dataset
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(pathlib.Path(__file__).parent.absolute(), "datasets")
data_path = util.download_and_unzip(url, out_dir)

#### Provide the data_path where scifact has been downloaded and unzipped
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

#### Load the SBERT model and retrieve using dot-product (or cosine) similarity
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot") # or "cos_sim" for cosine similarity
results = retriever.retrieve(corpus, queries)

#### Evaluate your model with NDCG@k, MAP@K, Recall@K and Precision@K  where k = [1,3,5,10,100,1000] 
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

🍻 Available Datasets

To generate the md5 hash of a downloaded file from the terminal: md5sum filename.zip.
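If you would rather verify a download from Python, here is a minimal sketch using only the standard library (the file path is a placeholder; the expected hash is taken from the table below, using scifact as an example):

import hashlib

def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the md5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

# Compare against the md5 column of the table below (scifact shown as an example).
assert md5sum("datasets/scifact.zip") == "5f7d1de60b170fc8027bb7898e2efca1"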

You can view all datasets available here or on Hugging Face.

| Dataset | Website | BEIR-Name | Public? | Type | Queries | Corpus | Rel D/Q | Download | md5 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MSMARCO | Homepage | msmarco | ✅ | train, dev, test | 6,980 | 8.84M | 1.1 | Link | 444067daf65d982533ea17ebd59501e4 |
| TREC-COVID | Homepage | trec-covid | ✅ | test | 50 | 171K | 493.5 | Link | ce62140cb23feb9becf6270d0d1fe6d1 |
| NFCorpus | Homepage | nfcorpus | ✅ | train, dev, test | 323 | 3.6K | 38.2 | Link | a89dba18a62ef92f7d323ec890a0d38d |
| BioASQ | Homepage | bioasq | ❌ | train, test | 500 | 14.91M | 4.7 | No | How to Reproduce? |
| NQ | Homepage | nq | ✅ | train, test | 3,452 | 2.68M | 1.2 | Link | d4d3d2e48787a744b6f6e691ff534307 |
| HotpotQA | Homepage | hotpotqa | ✅ | train, dev, test | 7,405 | 5.23M | 2.0 | Link | f412724f78b0d91183a0e86805e16114 |
| FiQA-2018 | Homepage | fiqa | ✅ | train, dev, test | 648 | 57K | 2.6 | Link | 17918ed23cd04fb15047f73e6c3bd9d9 |
| Signal-1M(RT) | Homepage | signal1m | ❌ | test | 97 | 2.86M | 19.6 | No | How to Reproduce? |
| TREC-NEWS | Homepage | trec-news | ❌ | test | 57 | 595K | 19.6 | No | How to Reproduce? |
| Robust04 | Homepage | robust04 | ❌ | test | 249 | 528K | 69.9 | No | How to Reproduce? |
| ArguAna | Homepage | arguana | ✅ | test | 1,406 | 8.67K | 1.0 | Link | 8ad3e3c2a5867cdced806d6503f29b99 |
| Touche-2020 | Homepage | webis-touche2020 | ✅ | test | 49 | 382K | 19.0 | Link | 46f650ba5a527fc69e0a6521c5a23563 |
| CQADupstack | Homepage | cqadupstack | ✅ | test | 13,145 | 457K | 1.4 | Link | 4e41456d7df8ee7760a7f866133bda78 |
| Quora | Homepage | quora | ✅ | dev, test | 10,000 | 523K | 1.6 | Link | 18fb154900ba42a600f84b839c173167 |
| DBPedia | Homepage | dbpedia-entity | ✅ | dev, test | 400 | 4.63M | 38.2 | Link | c2a39eb420a3164af735795df012ac2c |
| SCIDOCS | Homepage | scidocs | ✅ | test | 1,000 | 25K | 4.9 | Link | 38121350fc3a4d2f48850f6aff52e4a9 |
| FEVER | Homepage | fever | ✅ | train, dev, test | 6,666 | 5.42M | 1.2 | Link | 5a818580227bfb4b35bb6fa46d9b6c03 |
| Climate-FEVER | Homepage | climate-fever | ✅ | test | 1,535 | 5.42M | 3.0 | Link | 8b66f0a9126c521bae2bde127b4dc99d |
| SciFact | Homepage | scifact | ✅ | train, test | 300 | 5K | 1.1 | Link | 5f7d1de60b170fc8027bb7898e2efca1 |

🍻 Additional Information

We also provide a variety of additional information on our Wiki page. Please refer to the following pages:

  • Quick Start
  • Datasets
  • Models
  • Metrics
  • Miscellaneous

🍻 Disclaimer

Similar to TensorFlow Datasets or Hugging Face's datasets library, we have only downloaded and prepared publicly available datasets. We distribute these datasets in a specific format, but we do not vouch for their quality or fairness, nor claim that you have a license to use them. It remains the user's responsibility to determine whether you have permission to use a dataset under its license and to cite the rightful owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, feel free to post an issue here or make a pull request!

If you're a dataset owner and wish to include your dataset or model in this library, feel free to post an issue here or make a pull request!

🍻 Citing & Authors

If you find this repository helpful, feel free to cite our publication BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models:

@inproceedings{
    thakur2021beir,
    title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
    author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
    year={2021},
    url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}

If you use any baseline score from the BEIR leaderboard, feel free to cite our publication Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard:

@misc{kamalloo2023resources,
      title={Resources for Brewing BEIR: Reproducible Reference Models and an Official Leaderboard}, 
      author={Ehsan Kamalloo and Nandan Thakur and Carlos Lassance and Xueguang Ma and Jheng-Hong Yang and Jimmy Lin},
      year={2023},
      eprint={2306.07471},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

The main contributors of this repository are listed under Contributors below.

Contact person: Nandan Thakur, [email protected]

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

🍻 Collaboration

The BEIR benchmark has been made possible by a collaborative effort of several universities and organizations.

🍻 Contributors

Thanks go to all these wonderful collaborators for their contributions to the BEIR benchmark:


Nandan Thakur

Nils Reimers

Iryna Gurevych

Jimmy Lin

Andreas RΓΌcklΓ©

Abhishek Srivastava

beir's People

Contributors

abhesrivas, benchmarkir, eltociear, jetrunner, jordane95, joshdevins, julian-risch, kwang2049, maximedb, nouamanetazi, nreimers, thakur-nandan, timsteuer, ugm2


beir's Issues

Another metric with the same name already exists.

Hi @NThakur20

There seems to be an issue when I try to run evaluate_sbert.py in Colab. It was working fine until yesterday and I have not made any changes: I just pip installed beir, git cloned the beir repo, and ran the Python file without modifying it. The error is something like this:

2021-11-10 15:30:08.003342: E tensorflow/core/lib/monitoring/collection_registry.cc:77] Cannot register 2 metrics with the same name: /tensorflow/api/keras/optimizers
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2150, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_utils.py", line 637, in <module>
    class TFPreTrainedModel(tf.keras.Model, TFModelUtilsMixin, TFGenerationMixin, PushToHubMixin):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/lazy_loader.py", line 62, in __getattr__
    module = self._load()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/lazy_loader.py", line 45, in _load
    module = importlib.import_module(self.__name__)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/dist-packages/keras/__init__.py", line 25, in <module>
    from keras import models
  File "/usr/local/lib/python3.7/dist-packages/keras/models.py", line 20, in <module>
    from keras import metrics as metrics_module
  File "/usr/local/lib/python3.7/dist-packages/keras/metrics.py", line 26, in <module>
    from keras import activations
  File "/usr/local/lib/python3.7/dist-packages/keras/activations.py", line 20, in <module>
    from keras.layers import advanced_activations
  File "/usr/local/lib/python3.7/dist-packages/keras/layers/__init__.py", line 23, in <module>
    from keras.engine.input_layer import Input
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/input_layer.py", line 21, in <module>
    from keras.engine import base_layer
  File "/usr/local/lib/python3.7/dist-packages/keras/engine/base_layer.py", line 43, in <module>
    from keras.mixed_precision import loss_scale_optimizer
  File "/usr/local/lib/python3.7/dist-packages/keras/mixed_precision/loss_scale_optimizer.py", line 18, in <module>
    from keras import optimizers
  File "/usr/local/lib/python3.7/dist-packages/keras/optimizers.py", line 26, in <module>
    from keras.optimizer_v2 import adadelta as adadelta_v2
  File "/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/adadelta.py", line 22, in <module>
    from keras.optimizer_v2 import optimizer_v2
  File "/usr/local/lib/python3.7/dist-packages/keras/optimizer_v2/optimizer_v2.py", line 37, in <module>
    "/tensorflow/api/keras/optimizers", "keras optimizer usage", "method")
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/monitoring.py", line 361, in __init__
    len(labels), name, description, *labels)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/eager/monitoring.py", line 135, in __init__
    self._metric = self._metric_methods[self._label_length].create(*args)
tensorflow.python.framework.errors_impl.AlreadyExistsError: Another metric with the same name already exists.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2150, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/usr/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/__init__.py", line 19, in <module>
    from . import (
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/layoutlm/__init__.py", line 22, in <module>
    from .configuration_layoutlm import LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP, LayoutLMConfig
  File "/usr/local/lib/python3.7/dist-packages/transformers/models/layoutlm/configuration_layoutlm.py", line 22, in <module>
    from ...onnx import OnnxConfig, PatchingSpec
  File "/usr/local/lib/python3.7/dist-packages/transformers/onnx/__init__.py", line 17, in <module>
    from .convert import export, validate_model_outputs
  File "/usr/local/lib/python3.7/dist-packages/transformers/onnx/convert.py", line 23, in <module>
    from .. import PreTrainedModel, PreTrainedTokenizer, TensorType, TFPreTrainedModel, is_torch_available
  File "<frozen importlib._bootstrap>", line 1032, in _handle_fromlist
  File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2140, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2154, in _get_module
    ) from e
RuntimeError: Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
Another metric with the same name already exists.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/content/beir/examples/retrieval/evaluation/dense/evaluate_sbert.py", line 2, in <module>
    from beir.retrieval import models
  File "/usr/local/lib/python3.7/dist-packages/beir/retrieval/models/__init__.py", line 1, in <module>
    from .sentence_bert import SentenceBERT
  File "/usr/local/lib/python3.7/dist-packages/beir/retrieval/models/sentence_bert.py", line 1, in <module>
    from sentence_transformers import SentenceTransformer
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/__init__.py", line 3, in <module>
    from .datasets import SentencesDataset, ParallelSentencesDataset
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/datasets/__init__.py", line 3, in <module>
    from .ParallelSentencesDataset import ParallelSentencesDataset
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/datasets/ParallelSentencesDataset.py", line 4, in <module>
    from .. import SentenceTransformer
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/SentenceTransformer.py", line 27, in <module>
    from .models import Transformer, Pooling, Dense
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/models/__init__.py", line 1, in <module>
    from .Transformer import Transformer
  File "/usr/local/lib/python3.7/dist-packages/sentence_transformers/models/Transformer.py", line 2, in <module>
    from transformers import AutoModel, AutoTokenizer, AutoConfig
  File "<frozen importlib._bootstrap>", line 1032, in _handle_fromlist
  File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2140, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/usr/local/lib/python3.7/dist-packages/transformers/file_utils.py", line 2154, in _get_module
    ) from e
RuntimeError: Failed to import transformers.models.auto because of the following error (look up to see its traceback):
Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
Another metric with the same name already exists.

About webis-touche2020 qrels

Thanks for your great work. I notice that the average number of relevant docs per query changed from 49.9 to 19.0 in the latest version of your paper. What do I need to do to keep the same setting as yours?

Thanks.

Datasets without Download Links

Hello,
Thanks for your great work!
I am just curious how we can reproduce some results and get the bioasq, signal1m, etc. datasets to work with beir.
(We are currently testing our own model on beir!)

init_weights error when loading DPR models from HF modelhub

Congrats on this very well structured, documented and helpful framework for figuring out what's going on in IR - especially on OOD data. Keep up the good work!

When loading DPR models from HF modelhub like:

model = DRES(models.SentenceBERT((
    "facebook/dpr-question_encoder-multiset-base",
    "facebook/dpr-ctx_encoder-multiset-base",
    " [SEP] "), batch_size=128))

I run into an NotImplementedError: Make sure _init_weigths is implemented for <class 'transformers.models.dpr.modeling_dpr.DPRQuestionEncoder'>

I know you have already converted the model to sentence-transformers and it can be loaded like this, but interoperability with the HF hub would be slick - also for other DPR models in other languages like French or German.

Thanks

Rerank scores lower than vanilla dense IR?

Hi,

I've got a dense IR pipeline with reranking running for a search engine application. However, my rerank scores are lower than for a plain dense IR run?

msmarco-distilbert-base-v3
ms-marco-electra-base cross encoder

Scores:

Dense IR                                        DenseIR + Re-Rank
2021-11-30 16:48:39 - NDCG@1: 0.3629		2021-11-30 16:56:16 - NDCG@1: 0.3538
2021-11-30 16:48:39 - NDCG@3: 0.5234		2021-11-30 16:56:16 - NDCG@3: 0.5170
2021-11-30 16:48:39 - NDCG@5: 0.5472		2021-11-30 16:56:16 - NDCG@5: 0.5401
2021-11-30 16:48:39 - NDCG@10: 0.5623		2021-11-30 16:56:16 - NDCG@10: 0.5540
2021-11-30 16:48:39 - NDCG@100: 0.5879		2021-11-30 16:56:16 - NDCG@100: 0.5812
2021-11-30 16:48:39 - NDCG@1000: 0.5965		2021-11-30 16:56:16 - NDCG@1000: 0.5812

2021-11-30 16:48:39 - MAP@1: 0.3629		2021-11-30 16:56:16 - MAP@1: 0.3538
2021-11-30 16:48:39 - MAP@3: 0.4844		2021-11-30 16:56:16 - MAP@3: 0.4774
2021-11-30 16:48:39 - MAP@5: 0.4977		2021-11-30 16:56:16 - MAP@5: 0.4903		
2021-11-30 16:48:39 - MAP@10: 0.5040		2021-11-30 16:56:16 - MAP@10: 0.4961
2021-11-30 16:48:39 - MAP@100: 0.5090		2021-11-30 16:56:16 - MAP@100: 0.5013
2021-11-30 16:48:39 - MAP@1000: 0.5093		2021-11-30 16:56:16 - MAP@1000: 0.5013

2021-11-30 16:48:39 - Recall@1: 0.3629		2021-11-30 16:56:16 - Recall@1: 0.3538
2021-11-30 16:48:39 - Recall@3: 0.6362		2021-11-30 16:56:16 - Recall@3: 0.6315
2021-11-30 16:48:39 - Recall@5: 0.6932		2021-11-30 16:56:16 - Recall@5: 0.6869
2021-11-30 16:48:39 - Recall@10: 0.7397		2021-11-30 16:56:16 - Recall@10: 0.7297
2021-11-30 16:48:39 - Recall@100: 0.8627	2021-11-30 16:56:16 - Recall@100: 0.8618
2021-11-30 16:48:39 - Recall@1000: 0.9310	2021-11-30 16:56:16 - Recall@1000: 0.8618

2021-11-30 16:48:39 - P@1: 0.3629		2021-11-30 16:56:16 - P@1: 0.3538
2021-11-30 16:48:39 - P@3: 0.2121		2021-11-30 16:56:16 - P@3: 0.2105
2021-11-30 16:48:39 - P@5: 0.1386		2021-11-30 16:56:16 - P@5: 0.1374
2021-11-30 16:48:39 - P@10: 0.0740		2021-11-30 16:56:16 - P@10: 0.0730
2021-11-30 16:48:39 - P@100: 0.0086		2021-11-30 16:56:16 - P@100: 0.0086
2021-11-30 16:48:39 - P@1000: 0.0009		2021-11-30 16:56:16 - P@1000: 0.0009

Any thoughts would be greatly appreciated.

Unsupported Elastic Search distribution on BEIR.ipynb

Hi,

Thanks for the great work. BEIR is extremely valuable!

I just tried to run BEIR.ipynb on Google Colab and I was unable to complete the "Lexical Retrieval using BM25 (Elasticsearch)" section due to an unsupported-distribution error from Elasticsearch, as shown below:

(Screenshot, 2022-01-19: Elasticsearch "unsupported distribution" error)

I tried different versions but I couldn't get it to work. Any advice?

Storing document embeddings index

Is there a way to cache/load embedded documents and queries? That would help to save time on embedding big datasets such as ms marco and nq
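A minimal workaround sketch, not an official BEIR feature: encode the corpus once with sentence-transformers, cache the embeddings with numpy, and reload them later instead of re-encoding. The model name and file path are placeholders, and corpus is assumed to come from GenericDataLoader as in the Quick Example.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("msmarco-distilbert-base-tas-b")  # placeholder model

# corpus: dict of doc_id -> {"title": ..., "text": ...}, as returned by GenericDataLoader
doc_ids = list(corpus.keys())
texts = [(corpus[did].get("title", "") + " " + corpus[did]["text"]).strip() for did in doc_ids]

embeddings = model.encode(texts, batch_size=64, convert_to_numpy=True, show_progress_bar=True)
np.save("corpus_embeddings.npy", embeddings)   # cache to disk

# Later: reload the cached embeddings instead of re-encoding the corpus
embeddings = np.load("corpus_embeddings.npy")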

Processing CQADupstack does not work out-of-the-box

I'm using evaluate_anserini_docT5query.py. That script uses util.download_and_unzip and then GenericDataLoader, and it fails with the following message when using the cqadupstack dataset:

ValueError: File /home/josh/source/beir/examples/retrieval/evaluation/sparse/datasets/cqadupstack/corpus.jsonl not present! Please provide accurate file.

The CQADupstack dataset is divided into sub-categories, which causes the above error since the download contains an extra sub-directory per category.

 ls -las ~/source/beir/examples/retrieval/evaluation/sparse/datasets/cqadupstack/
total 56
4 drwxr-xr-x 14 josh josh 4096 Jul  1 08:38 .
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:35 ..
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:38 android
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 english
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:35 gaming
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 gis
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 mathematica
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 physics
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 programmers
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 stats
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:36 tex
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 unix
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 webmasters
4 drwxr-xr-x  3 josh josh 4096 Jul  1 08:37 wordpress

How was CQADupstack used in the benchmarking for the paper and leaderboard? Was each category processed separately or was everything somehow combined into a single evaluation?
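One possible way to handle the sub-directories yourself (a sketch, not the official evaluation script): load and evaluate each sub-forum separately and average the metric, using a retriever built as in the Quick Example. Whether the published numbers use this mean of means or a pooled average over all queries is exactly what a later issue below asks.

import os
from beir.datasets.data_loader import GenericDataLoader

cqa_dir = "datasets/cqadupstack"   # folder containing android/, english/, gaming/, ...
ndcg10_scores = []

for forum in sorted(os.listdir(cqa_dir)):
    data_path = os.path.join(cqa_dir, forum)
    corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")
    results = retriever.retrieve(corpus, queries)   # retriever built as in the Quick Example
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    ndcg10_scores.append(ndcg["NDCG@10"])

print("CQADupstack mean NDCG@10 over sub-forums:", sum(ndcg10_scores) / len(ndcg10_scores))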

250 queries for Robust04?

The code below gives me 250 queries for robust04; however, you report having only 249 in the paper. How come?

import ir_datasets
dataset = ir_datasets.load("trec-robust04")
x=list(dataset.queries_iter())
len(x)

Unable to create Index in Elastic Search

I am getting an error when running the bm25 evaluation file.

Unable to create Index in Elastic Search. Reason: The client noticed that the server is not a supported distribution of Elasticsearch
The error was not there last week when I ran the file. I am using Colab to run the file. Any idea what may be the issue here?

Training a T5 model

Hi @NThakur20, I was wondering if we can train a T5 model, as there seems to be an error when I load a T5 model from HF.

The QRel file rank for BioASQ

How would you assign the scores in the QRel file for BioASQ?
Could you please provide an example? Let's say for a question Q with id ID, the ideal documents are [D1, D2, D3]. What would the QRel entries look like?

Thank you
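For illustration only (the IDs are the hypothetical ones from the question), a binary qrels file in BEIR's tab-separated format would simply list each ideal document with a score of 1; the equivalent in-memory dict is shown as well:

# qrels/test.tsv (tab-separated):
#
#   query-id    corpus-id    score
#   ID          D1           1
#   ID          D2           1
#   ID          D3           1
#
# The same information as the qrels dict consumed by EvaluateRetrieval.evaluate():
qrels = {"ID": {"D1": 1, "D2": 1, "D3": 1}}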

Does this framework support summarizing model performance across different datasets (or comparing different models on a single dataset), and does it support dynamic benchmarking?

I think this benchmark could support choosing the best model from a list of models by comparing their performance measurements on one dataset. This requires the datasets to share the same interface.

It could also support model combination, i.e., switching the model used depending on different semantic features (sometimes BM25, sometimes SBERT, switching by feature characteristics), to make the final conclusion more consistent.

This would make it not only a benchmark, but also a meta-ensemble framework that combines models to improve the final performance on a single dataset with different features.

CQAdupstack evaluation

First I would like to thank you for this incredible framework; it has been a great help to me.

I have a question concerning the cqadupstack results. As there are 12 different corpora for that dataset, which mean did you use to compute your results? The mean of means (i.e., each sub-forum has equal weight) or the mean over all queries (i.e., each query has equal weight)?

Thanks in advance.

Custom multilingual issues

Hi, thanks for your awesome work! Does this framework support Chinese? How can I use it on my own Chinese dataset (sparse, dense, ...)? I mean, can I use my own tokenizer?

thanks

Training data for NQ?

Thanks for the great contribution!

I found that the downloaded NQ data only contains the test files and the corpus; where can I get the training files?

Thank you!

Elastic Search Connection Error

I tried to replicate BM25 evaluation on trec-covid, but ran into the following problem:

2021-04-21 16:16:57 - Downloading trec-covid.zip ...
2021-04-21 16:16:57 - Unzipping trec-covid.zip ...
2021-04-21 16:16:58 - Loaded 171332 TEST Documents.
2021-04-21 16:16:58 - Doc Example: {'text': 'OBJECTIVE: This retrospective chart review describes the epidemiology and clinical features of 40 patients with culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia. METHODS: Patients with positive M. pneumoniae cultures from respiratory specimens from January 1997 through December 1998 were identified through the Microbiology records. Charts of patients were reviewed. RESULTS: 40 patients were identified, 33 (82.5%) of whom required admission. Most infections (92.5%) were community-acquired. The infection affected all age groups but was most common in infants (32.5%) and pre-school children (22.5%). It occurred year-round but was most common in the fall (35%) and spring (30%). More than three-quarters of patients (77.5%) had comorbidities. Twenty-four isolates (60%) were associated with pneumonia, 14 (35%) with upper respiratory tract infections, and 2 (5%) with bronchiolitis. Cough (82.5%), fever (75%), and malaise (58.8%) were the most common symptoms, and crepitations (60%), and wheezes (40%) were the most common signs. Most patients with pneumonia had crepitations (79.2%) but only 25% had bronchial breathing. Immunocompromised patients were more likely than non-immunocompromised patients to present with pneumonia (8/9 versus 16/31, P = 0.05). Of the 24 patients with pneumonia, 14 (58.3%) had uneventful recovery, 4 (16.7%) recovered following some complications, 3 (12.5%) died because of M pneumoniae infection, and 3 (12.5%) died due to underlying comorbidities. The 3 patients who died of M pneumoniae pneumonia had other comorbidities. CONCLUSION: our results were similar to published data except for the finding that infections were more common in infants and preschool children and that the mortality rate of pneumonia in patients with comorbidities was high.', 'title': 'Clinical features of culture-proven Mycoplasma pneumoniae infections at King Abdulaziz University Hospital, Jeddah, Saudi Arabia'}
2021-04-21 16:16:58 - Loaded 50 TEST Queries.
2021-04-21 16:16:58 - Query Example: what is the origin of COVID-19
2021-04-21 16:16:58 - Activating Elasticsearch....
2021-04-21 16:16:58 - Elastic Search Credentials: {'hostname': 'localhost', 'index_name': 'trec-covid', 'keys': {'title': 'title', 'body': 'txt'}, 'timeout': 100, 'retry_on_timeout': True, 'maxsize': 24}
2021-04-21 16:16:58 - Deleting previous Elasticsearch-Index named - trec-covid
2021-04-21 16:16:58 - Unable to create Index in Elastic Search. Reason: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa550364110>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa550364110>: Failed to establish a new connection: [Errno 111] Connection refused)
2021-04-21 16:16:58 - Creating fresh Elasticsearch-Index named - trec-covid
2021-04-21 16:16:58 - Unable to create Index in Elastic Search. Reason: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa550364c10>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa550364c10>: Failed to establish a new connection: [Errno 111] Connection refused)
Traceback (most recent call last):
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connection.py", line 170, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 252, in perform_request
    method, url, body, retries=Retry(False), headers=request_headers, **kw
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/util/retry.py", line 507, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/packages/six.py", line 735, in reraise
    raise value
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connection.py", line 234, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/home/USER/.conda/envs/beir/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connection.py", line 200, in connect
    conn = self._new_conn()
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/urllib3/connection.py", line 182, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fa54912e690>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "evaluate_bm25.py", line 33, in <module>
    results = retriever.retrieve(corpus, queries)
  File "/home/USER/projects/beir-repo/beir/retrieval/evaluation.py", line 22, in retrieve
    return self.retriever.search(corpus, queries, self.top_k, self.score_function)
  File "/home/USER/projects/beir-repo/beir/retrieval/search/lexical/bm25_search.py", line 33, in search
    self.index(corpus)
  File "/home/USER/projects/beir-repo/beir/retrieval/search/lexical/bm25_search.py", line 66, in index
    progress=progress
  File "/home/USER/projects/beir-repo/beir/retrieval/search/lexical/elastic_search.py", line 89, in bulk_add_to_index
    client=self.es, index=self.index_name, actions=generate_actions,
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 326, in streaming_bulk
    **kwargs
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 246, in _process_bulk_chunk
    for item in gen:
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 193, in _process_bulk_chunk_error
    raise error
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/helpers/actions.py", line 234, in _process_bulk_chunk
    resp = client.bulk("\n".join(bulk_actions) + "\n", *args, **kwargs)
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/client/utils.py", line 153, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/client/__init__.py", line 460, in bulk
    body=body,
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/transport.py", line 413, in perform_request
    raise e
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/transport.py", line 388, in perform_request
    timeout=timeout,
  File "/home/USER/.conda/envs/beir/lib/python3.7/site-packages/elasticsearch/connection/http_urllib3.py", line 264, in perform_request
    raise ConnectionError("N/A", str(e), e)
elasticsearch.exceptions.ConnectionError: ConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa54912e690>: Failed to establish a new connection: [Errno 111] Connection refused) caused by: NewConnectionError(<urllib3.connection.HTTPConnection object at 0x7fa54912e690>: Failed to establish a new connection: [Errno 111] Connection refused)

The only things I've changed in evaluate_bm25.py are

  1. dataset = "trec-covid" (line 17)
  2. hostname = "localhost" #localhost (line 26)
  3. index_name = dataset # scifact (line 27)

Elastic Search is new to me so I'm not sure whether I've missed anything. I used pip install -e . to install BEIR from source if that's relevant. Thanks!
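A quick sanity check, assuming the elasticsearch Python client is installed and a server is supposed to be listening on localhost:9200 (on Colab you typically have to download and start Elasticsearch yourself before indexing):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# False means nothing is listening on that host/port, which matches the
# "Connection refused" errors above: start an Elasticsearch server first.
print(es.ping())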

create qrel.tsv file

Hi, Nandan,

When we create the qrel.tsv file for a custom preprocessed dataset, can it include just the relevant documents (i.e., score=1), or should non-relevant documents also be included?

Cheers
Xiang
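For reference, a sketch of a custom qrels that lists only the relevant pairs; documents not listed for a query are treated as non-relevant during evaluation, so explicit 0 entries are optional (although some shipped datasets, e.g. trec-covid, do include graded judgements). The IDs below are placeholders.

# Minimal custom qrels with only relevant (score=1) pairs; IDs are placeholders.
qrels = {
    "q1": {"doc3": 1},
    "q2": {"doc7": 1, "doc9": 1},
}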

webis-touche2020 qrels

Thanks for providing this resource!

I noticed that the hash of webis-touche2020.zip recently changed, so I dug a bit into the differences. It looks like the only changes were to webis-touche2020/qrels/test.tsv:

$ wc -l {old,new}/webis-touche2020/qrels/test.tsv
  2963 old/webis-touche2020/qrels/test.tsv
  2299 new/webis-touche2020/qrels/test.tsv

$ head {old,new}/webis-touche2020/qrels/test.tsv -n5
==> old/webis-touche2020/qrels/test.tsv <==
query-id	corpus-id	score
1	197beaca-2019-04-18T11:28:59Z-00001-000	4
1	1a76ed9f-2019-04-18T16:07:27Z-00001-000	5
1	1a76ed9f-2019-04-18T16:07:27Z-00002-000	3
1	1a76ed9f-2019-04-18T16:07:27Z-00005-000	4

==> new/webis-touche2020/qrels/test.tsv <==
query-id	corpus-id	score
1	S197beaca-A971412e6	0
1	S1b03f390-A22aff8a0	0
1	S1b03f390-Aa73ba80f	1
1	S1b03f390-Ab387b162	0

I'm not super familiar with the task, but based on this page and the linked qrels, it looks like the old version corresponds to version 1 of the args.me corpus, and the new version corresponds to version 2. However, corpus.jsonl remains unchanged, so the corpus-id field in the qrels no longer corresponds to the document _id fields in the corpus.

Can you please clarify these discrepancies?

Question about ideal architecture for deep IR

Hi again @NThakur20! I've got an interesting search project which consists of a golden set of search queries and their results, for a financial services domain search application. One of the search types is for financial analysts, e.g. a partial analyst-name search query which the search engine answers with the full contact particulars for that analyst. Using a fine-tuned T5.1 large model I am achieving 97%+ classification accuracy for observed searches, but the issue here is generalization to new searches where the analyst exists only in the database, so the model needs to generate the response based on unsupervised contact data in the analyst contact database. The thought was to either train an MS MARCO T5 model in an unsupervised fashion on the contact database in hopes that it generalizes to unobserved search queries, or to populate a deep IR pipeline with those records and use that for the analyst contact retrieval. Is this a reasonable use case for BEIR?

Separate average score for zero-shot only

The two leaderboard tabs are laid out slightly differently. It would be nice to normalize these or put them on a single tab. If anything, just restricting the "average" score in the "re-ranking leaderboard" to zero-shot methods only would help compare across tabs. Basically, the same view as in the paper is preferred.

TypeError when running Quick Example

Thanks for the great work.

When I run the Quick Example, I got a TypeError:

Traceback (most recent call last):
  File "demo.py", line 34, in <module>
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
  File "/home/renruiyang/anaconda3/envs/beir/lib/python3.6/site-packages/beir/retrieval/evaluation.py", line 63, in evaluate
    evaluator = pytrec_eval.RelevanceEvaluator(qrels, {map_string, ndcg_string, recall_string, precision_string})
TypeError: Unable to resolve all measures.

Could you help me solve this error?

Thanks a lot!

ValueError when running `evaluate_bm25.py`

Hi, I was trying to run your evaluate_bm25.py baseline, but I got the following error. There may be some problem with elasticsearch. Could you please help me fix it?

2022-02-17 02:38:34 - Loading Queries...
2022-02-17 02:38:34 - Loaded 300 TEST Queries.
2022-02-17 02:38:34 - Query Example: 0-dimensional biomaterials show inductive properties.
2022-02-17 02:38:34 - Activating Elasticsearch....
2022-02-17 02:38:34 - Elastic Search Credentials: {'hostname': 'localhost', 'index_name': 'scifact', 'keys': {'title': 'title', 'body': 'txt'}, 'timeout': 100, 'retry_on_timeout': True, 'maxsize': 24, 'number_of_shards': 1, 'language': 'english'}
Traceback (most recent call last):
  File "evaluate_bm25.py", line 64, in <module>
    model = BM25(index_name=index_name, hostname=hostname, initialize=initialize, number_of_shards=number_of_shards)
  File "/anaconda/envs/beir/lib/python3.8/site-packages/beir/retrieval/search/lexical/bm25_search.py", line 22, in __init__
    self.es = ElasticSearch(self.config)
  File "/anaconda/envs/beir/lib/python3.8/site-packages/beir/retrieval/search/lexical/elastic_search.py", line 34, in __init__
    self.es = Elasticsearch(
  File "/anaconda/envs/beir/lib/python3.8/site-packages/elasticsearch/_sync/client/__init__.py", line 312, in __init__
    node_configs = client_node_configs(
  File "/anaconda/envs/beir/lib/python3.8/site-packages/elasticsearch/_sync/client/utils.py", line 101, in client_node_configs
    node_configs = hosts_to_node_configs(hosts)
  File "/anaconda/envs/beir/lib/python3.8/site-packages/elasticsearch/_sync/client/utils.py", line 141, in hosts_to_node_configs
    node_configs.append(url_to_node_config(host))
  File "/anaconda/envs/beir/lib/python3.8/site-packages/elastic_transport/client_utils.py", line 198, in url_to_node_config
    raise ValueError(
ValueError: URL must include a 'scheme', 'host', and 'port' component (ie 'https://localhost:9200')
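A sketch of a likely fix, assuming a recent elasticsearch client: pass a full URL (scheme, host and port) as the hostname, as the error message itself suggests. Pinning an older elasticsearch client (e.g. a 7.x release) is another commonly used workaround, but that depends on your environment.

from beir.retrieval.search.lexical import BM25Search as BM25

# Newer elasticsearch clients require a full URL instead of a bare hostname.
hostname = "http://localhost:9200"   # was: "localhost"
model = BM25(index_name="scifact", hostname=hostname, initialize=True)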

Low BM25 baselines?

Hi there, thanks for providing this nice resource!

Looking at your paper, I think your BM25 baselines are a bit low? You report 0.218 nDCG@10 on MS MARCO, if I'm not mistaken - from Table 2.

With Pyserini https://github.com/castorini/pyserini/blob/master/docs/experiments-msmarco-passage.md - we can get, and this has been widely reproduced:

$ tools/eval/trec_eval.9.0.4/trec_eval -c -mrecall.1000 -mmap -m ndcg_cut.10   collections/msmarco-passage/qrels.dev.small.trec runs/run.msmarco-passage.bm25tuned.trec
map                   	all	0.1957
recall_1000           	all	0.8573
ndcg_cut_10           	all	0.2340

So, 1.6 points higher?

I suspect all the BM25 results should be a bit higher, based on our experience: https://arxiv.org/abs/2104.05740

For many of the other datasets with dense labels, a competitive baseline - and widely acknowledged in the IR community - would be something like BM25+RM3.

We would be happy to work with you on building out Pyserini as the competitive baseline for this task... Please reach out!

Error loading model in `query_gen_and_train.py`

Hello,

I'm working on the GenQ setting and find that the model loading part seems incorrect. In query_gen_and_train.py, the model loading part is like:

#### Provide any sentence-transformers model path
model_path = "bert-base-uncased" # or "msmarco-distilbert-base-v3"
retriever = TrainRetriever(model_path=model_path, batch_size=64, max_seq_length=350)

However, the TrainRetriever class doesn't have the argument model_path. It seems that the error was introduced in this commit.

Shi Yu
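A sketch of a possible workaround, assuming the current TrainRetriever accepts a ready-made SentenceTransformer via model= rather than a model_path= string (the exact signature may differ across versions):

from sentence_transformers import SentenceTransformer, models
from beir.retrieval.train import TrainRetriever

model_name = "bert-base-uncased"   # or "msmarco-distilbert-base-v3"
word_embedding_model = models.Transformer(model_name, max_seq_length=350)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

retriever = TrainRetriever(model=model, batch_size=64)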

Unable to find pyserini docker

Hi, I'm trying to experiment with beir, BM25 and pyserini, but I'm unable to find the docker image (beir/beir-pyserini). Looking at dockerhub (https://hub.docker.com/u/beir) it seems that the only image available is pyserini-fastapi.

Is pyserini-fastapi the same as beir-pyserini, or is there something that I'm doing wrong?

Adding TREC ad-hoc collections

Hi,
Very useful work, thanks!
I was wondering why you did not include standard ad-hoc retrieval collections (like Robust04) in the benchmark? Is this intended?
For people working on neural IR, it would be interesting to see how models trained on MS MARCO systematically generalize to these collections too.

DPR on MSMARCO

Hi! Thanks for sharing BEIR - this is a wonderful project!

I wonder if you can include DPR model trained on MSMARCO on the leaderboard? The current DPR (Multi) and DPR (KILT) are not really comparable with other models which are trained on MARCO.

Thanks!

Error in accuracy function

In the custom_metrics.py file, in the top_accuracy function, you write:

top_hits[query_id] = sorted(doc_scores.keys(), key=lambda item: item[1], reverse=True)[0:k_max]

but I think you should instead write:

top_hits[query_id] = [elem[0] for elem in sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[0:k_max]]

Indeed, in your code you are sorting by the key and not by the score.

Include enriched sparse lexical retrieval methods

First, a thank you. The paper and repo have been fantastic resources to help conversations around out-of-domain retrieval!

Second, a feature request. I think it would be very interesting to see some of the document/index enrichment approaches added to the benchmark and paper discussion, as extensions to sparse lexical retrieval. You mention both doc2query and DeepCT/HDCT in the paper but don't provide benchmark data for them. Since they are trained on MS MARCO, it would be interesting to see if they perform well out-of-domain and in-comparison to both BM25+CE and ColBERT which perform very well out-of-domain.

a question about how to use BM25 in evaluate_custom_dataset.

Hi~ I am trying to use the BM25 model to evaluate a custom dataset. When I use the following code:

#### Sentence-Transformer ####
#### Provide any pretrained sentence-transformers model path
#### Complete list - https://www.sbert.net/docs/pretrained_models.html
# model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"))
model = BM25(index_name="your-index-name", hostname="127.0.0.1:9200", initialize=True )
# retriever = EvaluateRetrieval(model, score_function="cos_sim")
retriever = EvaluateRetrieval(model)
#### Retrieve dense results (format of results is identical to qrels)
results = retriever.retrieve(corpus, queries)

#### Evaluate your retrieval using NDCG@k, MAP@K ...
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

but it doesn't work

2021-08-04 17:57:32 - Activating Elasticsearch....
2021-08-04 17:57:32 - Elastic Search Credentials: {'hostname': '127.0.0.1:9200', 'index_name': 'your-index-name', 'keys': {'title': 'title', 'body': 'txt'}, 'timeout': 100, 'retry_on_timeout': True, 'maxsize': 24, 'number_of_shards': 'default', 'language': 'english'}
english
2021-08-04 17:57:32 - Deleting previous Elasticsearch-Index named - your-index-name
2021-08-04 17:57:32 - Creating fresh Elasticsearch-Index named - your-index-name
  0%|                                                                                                                   | 0/2 [00:00<?, ?docs/s]
que: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1/1 [00:00<00:00, 26.55it/s]
Traceback (most recent call last):
  File "/.../evaluate_custom_dataset.py", line 67, in <module>
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
  File "/.../beir/beir/retrieval/evaluation.py", line 74, in evaluate
    ndcg[f"NDCG@{k}"] = round(ndcg[f"NDCG@{k}"]/len(scores), 5)
ZeroDivisionError: float division by zero

Is there any other code I need to modify? Thank you~
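A debugging sketch, under the assumption (not confirmed) that the ZeroDivisionError means no query in qrels received any scored results, for example because the query IDs in qrels and in the retrieved results do not line up:

# results comes from retriever.retrieve(corpus, queries); qrels from your data loader.
qrel_ids = set(qrels.keys())
result_ids = set(results.keys())

print("queries in qrels:    ", len(qrel_ids))
print("queries with results:", len(result_ids))
print("overlap:             ", len(qrel_ids & result_ids))
# An overlap of 0 means the query IDs (or the doc IDs inside them) do not match.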

Robust04 preprocessing

Hi, I saw from https://docs.google.com/spreadsheets/d/1L8aACyPaXrL8iEelJLGqlMqXKPX2oSP_R10pZoy77Ns/edit#gid=0 that Robust04 has been added to the BEIR leaderboard. Thanks for providing the results! Meanwhile, I was wondering what preprocessing is made for the dataset. For example, which fields of the topics are used for constructing the queries (title, desc, or narr)? Which parts of the documents are included (Headline, Text, Date, etc)? I'm asking because I tried to evaluate ANCE (the publicly released checkpoint) on Robust04 with minor preprocessing, but the ndcg@10 score is around 0.33, which is much lower than 0.39 as reported in the leaderboard. Thanks a lot!

Providing BM25 top-100 for all datasets

First of all, thanks for this amazing benchmark. I'd like to evaluate a re-ranking model on several datasets. If I got it correctly, I will have to download and index all datasets independently to get the top-100 BM25 rank lists. Could you please provide those for each dataset for easier evaluation?

Thanks a lot!

Training ColBERT on a new test collection

Hi,

Thanks for the wonderful package!

I want to train a ColBERT model on a new test collection which I have. I couldn't find any example related to it.

Could you please point out how to train a ColBERT model with the package?

Thanks

Script to generate leaderboard metrics

Can I find the complete script used to generate this leaderboard somewhere? I saw snippets such as benchmark_bm25.py but not a full-scale script that includes elastic search config and all.

I am implementing a BEIR compatible Vespa version that I plan to submit as a PR soon. I am, however, finding different results between my BM25 metrics and the elastic BM25 results from the leaderboard.

Generating results side by side would be great to debug my implementation.

Leaderboard results of USE

Hi,
Thank you for publicizing this great benchmark!
I am interested in the retrieval performance of models that are not trained on retrieval datasets (e.g., MS MARCO). I think this benchmark already supports USE, but it is not listed on the leaderboard. Did you already test the performance of the model on this benchmark? I think it would be very helpful if the leaderboard also provides results of general sentence embedding models (e.g., USE, SimCSE).

How to evaluate on Trec-Covid

Hi, I have questions regarding the Trec-Covid datasets.
(1) I should retrieve documents from the entire corpus (171K), right?
(2) The qrels test.tsv has three labels (0, 1, 2); when I get predictions from the BM25 baseline, how should I assign values to them? All 1?
Thank you

BaseException with rerank - 'dict' object has no attribute 'strip'

Hi there,

I've been working on a dense IR pipeline with BEIR including a custom dataloader, which works fine for dense IR runs but throws an exception whenever I add a cross encoder for reranking.

Rerank:

cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-electra-base')
reranker = Rerank(cross_encoder_model, batch_size=128)

Dataloader:

corpus = {}

for index, item in corpusdf.iteritems():
    corpus.update({
        "doc"+(str(index)): {
            "title": "",
            "text": item,
            },
    })

queries = {}

for index, row in queriesdf.iterrows():
    queries.update({
        "q"+str(index): {
            "doc"+(str(index)): row[0],
            },
    })

qrels = {}

for i in range(len(df)):
    qrels.update({
        "q"+str(i): {
            "doc"+(str(i)): 1,
        },
    })

Exception:

Traceback (most recent call last):
  File "C:\Users\costco\venv\lib\site-packages\sentence_transformers\cross_encoder\CrossEncoder.py", line 273, in predict
    for features in iterator:
  File "C:\Users\costco\venv\lib\site-packages\tqdm\std.py", line 1180, in __iter__
    for obj in iterable:
  File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\dataloader.py", line 521, in __next__
    data = self._next_data()
  File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\dataloader.py", line 561, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "C:\Users\costco\venv\lib\site-packages\torch\utils\data\_utils\fetch.py", line 52, in fetch
    return self.collate_fn(data)
  File "C:\Users\costco\venv\lib\site-packages\sentence_transformers\cross_encoder\CrossEncoder.py", line 93, in smart_batching_collate_text_only
    texts[idx].append(text.strip())
AttributeError: 'dict' object has no attribute 'strip'

Seems like a simple fix, but I am trying to avoid modifying the BEIR sources; any ideas would be greatly appreciated!
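One assumption about the cause, with a corresponding sketch: in the format produced by GenericDataLoader, queries maps a query-id to a plain string, not to a nested dict, and the cross-encoder calls .strip() on that string. A dataloader that follows the same convention would look roughly like this (names taken from the snippet above):

# queries: query-id -> query text (a plain string)
queries = {}
for index, row in queriesdf.iterrows():
    queries["q" + str(index)] = row[0]

# qrels then links the query-id to its relevant doc-id(s)
qrels = {"q" + str(i): {"doc" + str(i): 1} for i in range(len(df))}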
