AIR-Bench

☁️ Motivation

Evaluation is crucial for the development of information retrieval models. In recent years, a series of milestone works have been introduced to the community, such as MSMARCO, Natural Questions (open-domain QA), MIRACL (multilingual retrieval), and BEIR and MTEB (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following respects.

  • Incapability of dealing with new domains. All of the existing benchmarks are static: they are established for pre-defined domains based on human-labeled data. Therefore, they cannot cover new domains that users are interested in.
  • Potential risk of over-fitting and data leakage. Existing retrievers are intensively fine-tuned to achieve strong performances on popular benchmarks like BEIR and MTEB. Although these benchmarks were initially designed for zero-shot out-of-domain evaluation, the corresponding in-domain training data is widely used during fine-tuning. Worse, given the public availability of the existing evaluation datasets, the testing data could be mixed into the retrievers' training sets by mistake.

☁️ Features

  • 🤖 Automated. The testing data is automatically generated by large language models without human intervention. Therefore, the benchmark can instantly support the evaluation of new domains at very small cost. Besides, the newly generated testing data is unlikely to be covered by the training sets of any existing retrievers.
  • 🔍 Retrieval and RAG-oriented. The new benchmark is dedicated to the evaluation of retrieval performance. In addition to typical evaluation scenarios, like open-domain question answering or paraphrase retrieval, the new benchmark also incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this setting, the model is expected to retrieve the relevant chunks of a very long document, which contain the critical information for answering the input question.
  • 🔄 Heterogeneous and Dynamic: The testing data is generated over a diverse and continually growing set of domains and languages (i.e. multi-domain, multi-lingual). As a result, it provides an increasingly comprehensive evaluation benchmark for the community.
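To make the inner-document retrieval setting above concrete, here is a toy sketch: one long document is split into overlapping chunks, and chunks are ranked against a question. The chunk size and the bag-of-words scorer are illustrative placeholders, not AIR-Bench's actual generation or evaluation pipeline:

```python
# Toy sketch of inner-document retrieval: split one long document into
# overlapping chunks, then rank the chunks against a question.
# The chunk size and the lexical-overlap scorer are hypothetical stand-ins
# for a real chunking strategy and a real (e.g. dense) retriever.

def chunk_document(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split a document into overlapping word-window chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

def score(question: str, chunk: str) -> float:
    """Naive lexical-overlap score (stand-in for a learned retriever)."""
    q, c = set(question.lower().split()), set(chunk.lower().split())
    return len(q & c) / (len(q) or 1)

def retrieve_chunks(question: str, document: str, top_k: int = 2) -> list[str]:
    """Return the top_k chunks of `document` most relevant to `question`."""
    chunks = chunk_document(document)
    return sorted(chunks, key=lambda ch: score(question, ch), reverse=True)[:top_k]
```

The model under evaluation replaces the toy scorer; the benchmark then checks whether the retrieved chunks actually contain the information needed to answer the question.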

☁️ Results

We plan to release new test datasets on a regular basis. The latest version is 24.04. You can check out the results at the AIR-Bench Leaderboard.

Detailed results are available here.

☁️ Usage

Installation

This repo maintains the codebase for running AIR-Bench evaluations. To run the evaluation, please install air-benchmark.

pip install air-benchmark

Evaluations

Follow the steps below to run evaluations and submit the results to the leaderboard (see here for more detailed information).

  1. Run evaluations

    • See the scripts for running AIR-Bench evaluations with your models.
  2. Submit search results

    • Package the output files

      • For results without a reranking model:

        cd scripts
        python zip_results.py \
          --results_dir search_results \
          --retriever_name [YOUR_RETRIEVAL_MODEL] \
          --save_dir search_results

      • For results with a reranking model:

        cd scripts
        python zip_results.py \
          --results_dir search_results \
          --retriever_name [YOUR_RETRIEVAL_MODEL] \
          --reranker_name [YOUR_RERANKING_MODEL] \
          --save_dir search_results
    • Upload the output .zip and fill in the model information at AIR-Bench Leaderboard
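Conceptually, the packaging step above just bundles the search-result files into a single .zip archive for upload. A minimal stand-alone sketch of that idea is shown below; the directory layout and archive name here are hypothetical, so use the provided zip_results.py for actual submissions:

```python
# Minimal sketch of bundling search-result files into a .zip for upload.
# The layout and naming are illustrative; the official zip_results.py
# script should be used for real leaderboard submissions.
import zipfile
from pathlib import Path

def package_results(results_dir: str, model_name: str, save_dir: str) -> Path:
    """Zip every file under results_dir into save_dir/<model_name>.zip."""
    out = Path(save_dir) / f"{model_name}.zip"
    out.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in Path(results_dir).rglob("*"):
            if path.is_file():
                # Store paths relative to results_dir inside the archive.
                zf.write(path, path.relative_to(results_dir))
    return out
```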

☁️ Documentation

  • 🏭 Pipeline: The data generation pipeline of AIR-Bench
  • 📋 Tasks: Overview of available tasks in AIR-Bench
  • 📈 Leaderboard: The interactive leaderboard of AIR-Bench
  • 🚀 Submit: Information on how to submit a model to AIR-Bench
  • 🤝 Contributing: How to contribute to AIR-Bench

☁️ Acknowledgement

This work is inspired by MTEB and BEIR. Many thanks for the early feedback from @tomaarsen, @Muennighoff, @takatost, @chtlp.

☁️ Citing

TBD

