opengvlab / multi-modality-arena

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

HTML 0.07% · CSS 0.12% · JavaScript 0.69% · Python 99.07% · Shell 0.04%
Topics: chat, chatbot, chatgpt, gradio, large-language-models, llms, vqa, multi-modality, vision-language-model

multi-modality-arena's People

Contributors

bellxp, fanqingm, leimeng86, lqf-hfnju, siyuanhuang95, toggle1995, wqshao126


multi-modality-arena's Issues

Code for VCR evaluation

First, I really appreciate your great contributions to the LVLM field.

Do you have any plan to release the visual commonsense reasoning (VCR) evaluation code?
There's some elaboration about how to properly locate and download the dataset, but I couldn't find the corresponding code.

Thanks again for your work.

Some Problems with VPGTrans

I am the first author of VPGTrans. Thanks so much for using VPGTrans! I saw this excellent work via the WeChat article. However, there seem to be some problems with VPGTrans.

  1. I tried your example from the WeChat article. My demo (https://vpgtrans.github.io/) shows:
    Selection_412
    But the result in the WeChat article is:
    Selection_413

The results differ. I am not sure whether some default hyperparameters, such as the prompt format or the beam size, were modified. I will also check the code, and if I find anything I will report it here.

For debugging, you can compare against our demo (https://vpgtrans.github.io/). If the demo is down, just mail me ([email protected]).

  2. The main authors are from NUS, but the main institution listed in the WeChat article is Tsinghua University. If possible, please change it to NUS & THU. If that is inconvenient, please add a note at the bottom of the WeChat article, or at least correct it in this repo (model.jpg).

always getting (error_code: 1)

Hello and thank you for your amazing work!

However, I have a problem: the models load fine, but I keep getting

NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE. (error_code: 1)

The models (BLIP-2 and MiniGPT-4) are both on GPUs (I'm running with --keep-in-device, otherwise they would not even load), but on every attempt I get the same error.

Any advice?

mPLUG-Owl has been updated

Hi, just a quick note that mPLUG-Owl has been updated with two different checkpoints (LoRA and fine-tuned). Would you mind updating the online arena demo? Thanks!

MiniGPT-4 and LLaVA evaluation

Hi! I'm a fan of your work. Could you please provide more details about how to evaluate MiniGPT-4 and LLaVA on the various datasets? Thanks a lot!

Hardware requirements

Hi all,
Could anyone provide the hardware requirements to run and test these models? I am planning to run them on a local system.
It would be great if the hardware requirements for the open-source models were provided.

Thanks in advance.

LLaVA evaluation on Flickr30k

Hello, thanks for the great work! I was looking at this script for llava evaluation on Flickr30k, but am facing some issues, detailed here.

Could you please help me with the exact generation settings and model checkpoint used for this evaluation? Thanks!

Chatbot Arena conversation data

Hi,

thanks for the efforts in the great work!
I would like to ask whether you plan to open-source the Chatbot Arena conversation data.

Thanks in advance!

Best, Wei

How to reproduce the Tiny-eHub eval

Thanks for releasing this benchmark. We tried to compute the per-category score for each ability, but found low scores on several abilities, such as visual reasoning and visual perception. We compute text-matching accuracy. We also downloaded the inference results of existing works (e.g., BLIP2) and manually checked them; the text-matching accuracy can hardly reach ~50%. Below is an example of BLIP2's prediction results. How do these works achieve such high scores in the Tiny LVLM evaluation?

{
  "question": "When does the coupon expire?",
  "answer": "it expires on january 31st",
  "gt_answers": ["12/31/87"],
  "image_path": "updated_datasets/Visual_Reasoning/001.png",
  "model_name": "BLIP2",
  "task_type": "VQA"
},
{
  "question": "What is the \u201cunit of quantity\u201d of Pulp?",
  "answer": "Pulp is a term used to refer to the amount of pulp produced by a pulp mill, or the amount of pulp produced by a",
  "gt_answers": ["Tonne"],
  "image_path": "updated_datasets/Visual_Reasoning/002.png",
  "model_name": "BLIP2",
  "task_type": "VQA"
},
{
  "question": "what is the % of sugar in ro-neet?",
  "answer": "% of sugar in ro-neet",
  "gt_answers": ["17.1%", "17.1"],
  "image_path": "updated_datasets/Visual_Reasoning/003.png",
  "model_name": "BLIP2",
  "task_type": "VQA"
},
{
  "question": "What is the total consultant costs under column \"-04\" based on \"II. CONSULTANT COSTS\"?",
  "answer": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
  "gt_answers": ["$1,532"],
  "image_path": "updated_datasets/Visual_Reasoning/004.png",
  "model_name": "BLIP2",
  "task_type": "VQA"
},
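For reference, the text-matching accuracy described above could be computed along these lines (a minimal sketch, assuming a prediction counts as correct when any ground-truth answer appears, case-insensitively, as a substring of the prediction; the benchmark's actual matching rule may differ):

```python
def text_match_accuracy(records):
    """Fraction of records whose prediction contains a ground-truth answer.

    Each record is a dict with 'answer' (the model prediction) and
    'gt_answers' (a list of acceptable strings), as in the JSON above.
    """
    if not records:
        return 0.0
    hits = 0
    for rec in records:
        pred = rec["answer"].lower()
        # Count a hit if any acceptable answer appears in the prediction.
        if any(gt.lower() in pred for gt in rec["gt_answers"]):
            hits += 1
    return hits / len(records)
```

On the four BLIP2 records shown above, only the third one ("17.1%" appears in the prediction) would count as a hit under this rule.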

Which test set for Flickr30k?

Wondering if you used the Karpathy test split for Flickr30k, or a different test set, in your LVLM-eHub paper. Thanks!

details of the Elo rating algorithm

Nice work! I'm interested in the design of the 1-vs-1 battles between LVLMs. Could you share more details about the Elo rating algorithm, such as the choice of K-factor and the expected confidence intervals given the collected user ratings? I'd appreciate any further details you can share.
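For context, a standard Elo update after a single battle looks like the following (a minimal sketch; the K-factor of 32 is a common default and an assumption here, not the arena's confirmed setting):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One standard Elo rating update after a battle between A and B.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k (the K-factor) controls how fast ratings move; 32 is a common
    default, but the arena's actual value is not documented here.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

For example, when two models at 1000 meet and A wins, A gains 16 points and B loses 16; confidence intervals are typically estimated separately, e.g. by bootstrapping over the recorded battles.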

Can not load scienceQA dataset.

I ran the scripts on ScienceQA, but they raise an error:
'''
File "./Multi-Modality-Arena/LVLM_evaluation/task_datasets/vqa_datasets.py", line 140, in load_save_dataset
self.image_list.append(sample['image'].convert('RGB'))
^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'convert'
'''
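One likely cause, judging from the traceback: recent versions of the ScienceQA dataset on the Hugging Face Hub can return `sample['image']` as a raw dict (e.g. with `bytes` and `path` keys) instead of a decoded `PIL.Image`. A defensive workaround sketch (the `bytes` key layout is an assumption, not confirmed from the repo's code):

```python
import io

from PIL import Image


def to_pil(image_field):
    """Return an RGB PIL image whether the field is already decoded
    or still a raw {'bytes': ..., 'path': ...} dict."""
    if isinstance(image_field, dict):
        # Decode the raw image bytes into a PIL image first.
        image_field = Image.open(io.BytesIO(image_field["bytes"]))
    return image_field.convert("RGB")
```

In `vqa_datasets.py`, `self.image_list.append(sample['image'].convert('RGB'))` could then become `self.image_list.append(to_pil(sample['image']))`.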
