opengvlab / multi-modality-arena

Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!

HTML 0.07% · CSS 0.12% · JavaScript 0.69% · Python 99.07% · Shell 0.04%
Topics: chat, chatbot, chatgpt, gradio, large-language-models, llms, vqa, multi-modality, vision-language-model

multi-modality-arena's People

Contributors

bellxp, fanqingm, leimeng86, lqf-hfnju, siyuanhuang95, toggle1995, wqshao126


multi-modality-arena's Issues

Code for VCR evaluation

First, I really appreciate your great contributions to the LVLM field.

Do you have any plan to release the visual commonsense reasoning (VCR) evaluation code?
There's some elaboration about how to properly locate and download the dataset, but I couldn't find the corresponding code.

Thanks again for your work.

Some Problems with VPGTrans

I am the first author of VPGTrans. Thanks so much for using VPGTrans! I saw this excellent work via the WeChat article. However, there seem to be some problems with VPGTrans.

  1. I tried your example from the WeChat article. My demo (https://vpgtrans.github.io/) shows:
    Selection_412
    But the result in the WeChat article is:
    Selection_413

The results differ. I am not sure whether some default hyperparameters, such as the prompt format or the beam size, were modified. I will also check the code, and if I find anything I will report it here.

For debugging, you can compare against our demo (https://vpgtrans.github.io/). If the demo is down, just mail me ([email protected]).

  2. The main authors are from NUS, but the main institution listed in the WeChat article is Tsinghua University. If possible, please change it to NUS & THU. If that is inconvenient, please add a note at the bottom of the WeChat article, or at least correct it in this repo (model.jpg).

always getting (error_code: 1)

Hello and thank you for your amazing work!

However, I have a problem: the models load fine, but I keep getting

NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE. (error_code: 1)

The models (BLIP-2 and MiniGPT-4) are both on GPUs (I'm running with --keep-in-device, otherwise they would not even load), but on every attempt I get the same error.

Any advice?

mPLUG-Owl has been updated

Hi, just a quick note that mPLUG-Owl has been updated with two different checkpoints (LoRA and fine-tuned). Would you mind updating the online arena demo? Thanks!

MiniGPT-4 and LLaVA evaluation

Hi! I'm a fan of your work. Could you please provide more details about how to evaluate MiniGPT-4 and LLaVA on the various datasets? Thanks a lot!

Hardware requirements

Hi all,
Could anyone provide the hardware requirements to run and test these models? I am planning to run them on a local system.
It would be great if the hardware requirements for the open-source models were provided.

Thanks in advance.

LLaVA evaluation on Flickr30k

Hello, thanks for the great work! I was looking at this script for llava evaluation on Flickr30k, but am facing some issues, detailed here.

Could you please help me with the exact generation settings and model checkpoint used for this evaluation? Thanks!

Chatbot Arena conversation data

Hi,

thanks for the efforts in the great work!
I would like to ask whether you plan to open-source the Chatbot Arena conversation data.

Thanks in advance!

Best, Wei

How to reproduce the Tiny-eHub eval

Thanks for releasing this benchmark. We tried to compute the per-category score for each ability, but found low scores on several abilities, such as visual reasoning and visual perception. We compute text-matching accuracy. We also downloaded the inference results of existing works (e.g., BLIP2) and manually checked them; the text-matching accuracy can hardly reach ~50%. Below is an example of BLIP2's prediction results. How do these works achieve such high scores in the Tiny LVLM evaluation?

{
  "question": "When does the coupon expire?",
  "answer": "it expires on january 31st",
  "gt_answers": ["12/31/87"],
  "image_path": "updated_datasets/Visual_Reasoning/001.png",
  "model_name": "BLIP2",
  "task_type": "VQA"
},
{
  "question": "What is the \u201cunit of quantity\u201d of Pulp?",
  "answer": "Pulp is a term used to refer to the amount of pulp produced by a pulp mill, or the amount of pulp produced by a",
  "gt_answers": ["Tonne"],
  "image_path": "updated_datasets/Visual_Reasoning/002.png",
  "model_name": "BLIP2",
  "task_type": "VQA"
},
{
  "question": "what is the % of sugar in ro-neet?",
  "answer": "% of sugar in ro-neet",
  "gt_answers": ["17.1%", "17.1"],
  "image_path": "updated_datasets/Visual_Reasoning/003.png",
  "model_name": "BLIP2",
  "task_type": "VQA"
},
{
  "question": "What is the total consultant costs under column \"-04\" based on \"II. CONSULTANT COSTS\"?",
  "answer": "0 0 0 0 0 0 0 0 0 0 0 0 0 0 0",
  "gt_answers": ["$1,532"],
  "image_path": "updated_datasets/Visual_Reasoning/004.png",
  "model_name": "BLIP2",
  "task_type": "VQA"
},
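For reference, the text-matching accuracy described above could be computed along these lines (a minimal sketch, assuming a prediction counts as correct when any ground-truth answer appears, case-insensitively, as a substring of the prediction; the benchmark's actual matching rule may differ):

```python
def text_match_accuracy(records):
    """Fraction of records whose prediction contains a ground-truth answer.

    Each record is a dict with 'answer' (the model prediction) and
    'gt_answers' (a list of acceptable strings), as in the JSON above.
    """
    if not records:
        return 0.0
    hits = 0
    for rec in records:
        pred = rec["answer"].lower()
        # Count a hit if any acceptable answer appears in the prediction.
        if any(gt.lower() in pred for gt in rec["gt_answers"]):
            hits += 1
    return hits / len(records)
```

On the four BLIP2 records shown above, only the third one ("17.1%" appears in the prediction) would count as a hit under this rule.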

Which test set for Flickr30k?

Wondering if you used the Karpathy test split for Flickr30k, or a different test set, in your LVLM-eHub paper. Thanks!

details of the Elo rating algorithm

Nice work! I'm interested in the design of the 1-vs-1 battles between LVLMs. Could you share more details about the Elo rating algorithm, such as the choice of K-factor and the expected confidence intervals given the collected user ratings? I'd appreciate any further details you can share.
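For context, a standard Elo update after a single battle looks like the following (a minimal sketch; the K-factor of 32 is a common default and an assumption here, not the arena's confirmed setting):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One standard Elo rating update after a battle between A and B.

    score_a is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie.
    k (the K-factor) controls how fast ratings move; 32 is a common
    default, but the arena's actual value is not documented here.
    """
    # Expected score of A under the logistic Elo model.
    expected_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

For example, when two models at 1000 meet and A wins, A gains 16 points and B loses 16; confidence intervals are typically estimated separately, e.g. by bootstrapping over the recorded battles.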

Can not load scienceQA dataset.

I ran the scripts on ScienceQA, but they raise an error:
'''
File "./Multi-Modality-Arena/LVLM_evaluation/task_datasets/vqa_datasets.py", line 140, in load_save_dataset
self.image_list.append(sample['image'].convert('RGB'))
^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'convert'
'''
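One likely cause, judging from the traceback: recent versions of the ScienceQA dataset on the Hugging Face Hub can return `sample['image']` as a raw dict (e.g. with `bytes` and `path` keys) instead of a decoded `PIL.Image`. A defensive workaround sketch (the `bytes` key layout is an assumption, not confirmed from the repo's code):

```python
import io

from PIL import Image


def to_pil(image_field):
    """Return an RGB PIL image whether the field is already decoded
    or still a raw {'bytes': ..., 'path': ...} dict."""
    if isinstance(image_field, dict):
        # Decode the raw image bytes into a PIL image first.
        image_field = Image.open(io.BytesIO(image_field["bytes"]))
    return image_field.convert("RGB")
```

In `vqa_datasets.py`, `self.image_list.append(sample['image'].convert('RGB'))` could then become `self.image_list.append(to_pil(sample['image']))`.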
