
MileBench 🛣️

๐ŸŒ Homepage | ๐Ÿค— Dataset | ๐Ÿค— Paper | ๐Ÿ“– arXiv | GitHub

This repo contains evaluation code for the paper "MileBench: Benchmarking MLLMs in Long Context".

Python 3.10+ | PyTorch 2.1.1 | transformers | accelerate

🌈 Update

  • [2024.4.15] 🎉🎉🎉 MileBench is public! 🎉🎉🎉

Contents

  • Introduction
  • Preparation
  • How to Evaluate
  • License
  • Declaration
  • Contact
  • Citation

Introduction

We introduce MileBench, a pioneering benchmark designed to rigorously test the MultImodal Long-contExt capabilities of MLLMs. The benchmark comprises a mix of text and images, long contexts, multiple tasks, and tasks requiring both comprehension and generation. To systematically assess MLLMs' capabilities in multimodal long contexts, the benchmark consists of two distinct evaluation sets: diagnostic evaluation and realistic evaluation. The former probes the long-context recall abilities of MLLMs using needle-in-a-haystack and image-retrieval tasks, while the latter stress-tests the model under conditions akin to the real world, using both temporal multi-image tasks and semantic multi-image tasks.

After evaluating 20 models, the closed-source Gemini 1.5 excelled in the realistic evaluation with an impressive score of 54.7%, though it still falls far short of a perfect 100%, while GPT-4(Vision) reached a peak score of 99.4% in the diagnostic evaluation. In contrast, most open-source multimodal models struggled with long-context tasks: only VILA and Qwen-VL-7B managed average scores of 44.4% and 37.2% in the realistic and diagnostic evaluations, respectively. These results underscore that there are "miles to go" towards fully-realized long-context MLLMs, prompting a call for increased research focus on such tasks, especially those involving numerous images.

MileBench Examples

Preparation

🤗 Dataset Preparation

The MileBench dataset comprises 6,440 samples from 29 datasets, with each sample containing multiple images. The data has been archived on the cloud and can be downloaded via the HuggingFace link or the BaiduYun link. Save the dataset under the data/ folder.
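
If you prefer to fetch the data programmatically, below is a minimal, hypothetical sketch using huggingface_hub; the repository id is a placeholder, so take the exact id from the HuggingFace link above.

# Hypothetical download sketch -- the dataset repo id is a placeholder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<milebench-dataset-repo-id>",  # replace with the id behind the HuggingFace link
    repo_type="dataset",
    local_dir="data",                       # the evaluation code expects the data under ./data
)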

🤖 Environment Setup

Install required packages:

pip install -r requirements.txt

โ„น๏ธ How to Evaluate

Modify model configuration file


In configs/model_configs.yaml:

# Add a new model "my_model"
my_model:
    model_name: "my_model"
    model_dir: "path/to/full/model" # HuggingFace model weights
    cfg_path: "path/to/full/model_config"   # can be none
    gen_kwargs:
        max_new_tokens: 512
        min_new_tokens: 1
        do_sample: false
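
For orientation, the entry above is plain YAML and parses into a nested dictionary. A hypothetical sketch of reading it follows; the repository's actual loading code may differ.

# Hypothetical config-loading sketch (the real loader in this repo may differ).
import yaml

with open("configs/model_configs.yaml") as f:
    configs = yaml.safe_load(f)

my_cfg = configs["my_model"]
print(my_cfg["model_dir"], my_cfg["gen_kwargs"]["max_new_tokens"])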

Modify model worker


In workers/model_workers.py:

  1. Add a new model class (a fuller, hypothetical sketch follows this list):

class MyModel(BaseWorker):

    def init_components(self, config) -> None:
        # Initialize the model components (processor, tokenizer, weights, ...)
        ...

    def forward(self, questions: list[str], image_paths: list[list], device, gen_kwargs) -> list[str]:
        # Prepare the images and text for the generate function and return one answer per question
        ...

  2. For the GitHub packages of different VLM models, we recommend saving them to the ./packages directory; then you don't need to pip-install those packages in your environment.
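
For reference, here is a fuller, hypothetical version of the MyModel stub above, built on a generic Hugging Face vision-language model. The processor/model classes, the way the config is accessed, and the attribute names are assumptions for illustration; adapt them to your model and to what BaseWorker actually provides.

# Hypothetical worker sketch (lives in workers/model_workers.py next to BaseWorker).
# Assumptions: config exposes a "model_dir" entry, and the model loads through
# AutoProcessor / AutoModelForVision2Seq. How multiple images are combined with
# the prompt is model-specific.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

class MyModel(BaseWorker):

    def init_components(self, config) -> None:
        # Load the processor and weights from the path set in configs/model_configs.yaml
        self.processor = AutoProcessor.from_pretrained(config["model_dir"])
        self.model = AutoModelForVision2Seq.from_pretrained(config["model_dir"])

    def forward(self, questions: list[str], image_paths: list[list], device, gen_kwargs) -> list[str]:
        self.model.to(device)
        answers = []
        for question, paths in zip(questions, image_paths):
            # Load every image belonging to this sample
            images = [Image.open(p).convert("RGB") for p in paths]
            inputs = self.processor(text=question, images=images, return_tensors="pt").to(device)
            output_ids = self.model.generate(**inputs, **gen_kwargs)
            # Decode the generated tokens into an answer string
            answers.append(self.processor.batch_decode(output_ids, skip_special_tokens=True)[0])
        return answers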

Modify utils.py


In utils.py: import your model

from workers.model_workers import MyModel   # modify here

name2worker = {
    "my_model": MyModel,  # modify here
}

Generate response

Set the GPU number in `configs/accelerate_configs.yaml`:
num_processes: GPU_NUM    # modify here

Modify eval.sh:

gpu_num=GPU_NUM  # modify here

for model in my_model; do  # modify here
    for dataset_name in dataset_name; do  # modify here
...

and run:

source eval.sh

Run evaluation


run:

python score.py \
    --result-dir outputs \
    --models my_model  # models to eval
# Result saved to outputs/result.csv
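
To inspect the aggregated scores afterwards, a small snippet like the one below can be used; the exact column layout of result.csv depends on the models and datasets you evaluated.

# Minimal sketch: read the table written by score.py and print a quick overview
import pandas as pd

results = pd.read_csv("outputs/result.csv")
print(results.head())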

License


All software is licensed under the Apache License, Version 2.0 (Apache 2.0). All other materials are licensed under the Creative Commons Attribution 4.0 International License (CC-BY).

Declaration

The dataset we're using is an aggregation of publicly accessible datasets licensed under the Creative Commons license (CC-BY) or other open-source licenses. We have adhered to all required legal procedures to incorporate this data into our research, recognizing the importance of transparency in data licensing for proper attribution and suitable data utilization. Our dataset also encompasses images derived from publicly accessible datasets and language data created through the GPT-4V API. While measures have been put in place to secure suitable content, we acknowledge the potential existence of problematic content. Should you come across any such content, we urge you to inform us immediately so we can make the necessary adjustments to keep the dataset free from inappropriate content. We are committed to maintaining a high-quality, ethically responsible dataset and promise to uphold principles of privacy and transparency throughout our work.

Contact

Citation

If you find this repository helpful, please consider citing it:

@article{song2024milebench,
  title={MileBench: Benchmarking MLLMs in Long Context},
  author={Song, Dingjie and Chen, Shunian and Chen, Guiming Hardy and Yu, Fei and Wan, Xiang and Wang, Benyou},
  journal={arXiv preprint arXiv:2404.18532},
  year={2024}
}
