
bradyfu / awesome-multimodal-large-language-models

10.7K stars · 247 watchers · 713 forks · 78.41 MB

:sparkles::sparkles:Latest Advances on Multimodal Large Language Models

instruction-tuning instruction-following large-vision-language-model visual-instruction-tuning multi-modality in-context-learning large-language-models large-vision-language-models multimodal-chain-of-thought multimodal-in-context-learning

awesome-multimodal-large-language-models's People

Contributors

bradyfu · donglixp · islinxu · keli-61 · mea-lab-421 · pengzhiliang · renshuhuai-andy · wangjiongw · xjtupanda · yuleiqin · ziyuguo99


awesome-multimodal-large-language-models's Issues

mPLUG-DocOwl

https://arxiv.org/abs/2307.02499

We propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. We open-source our code at this https URL and provide an interactive demo.
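As a loose illustration of the joint training recipe described above (a hypothetical sketch, not the authors' implementation; the dataset contents and mixing weights are placeholders):

```python
# Hypothetical sketch: jointly sampling language-only, general vision-and-language,
# and document instruction-tuning data in one training stream.
import random

language_only = [{"image": None, "instruction": "Summarize the paragraph.", "answer": "..."}]
vision_language = [{"image": "coco_0001.jpg", "instruction": "Describe the image.", "answer": "..."}]
document = [{"image": "invoice_0001.png", "instruction": "What is the total amount?", "answer": "..."}]

# Placeholder mixing weights; the idea is simply that every batch can draw from all
# three sources, so the three abilities are trained together.
SOURCES = [(language_only, 0.3), (vision_language, 0.4), (document, 0.3)]

def sample_batch(batch_size=4, seed=0):
    rng = random.Random(seed)
    datasets, weights = zip(*SOURCES)
    return [rng.choice(rng.choices(datasets, weights=weights, k=1)[0]) for _ in range(batch_size)]

print(sample_batch())
```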

Inconsistent Results for Mini-GPT4 with Vicuna-13b

I re-evaluated the performance of Mini-GPT4 with Vicuna-13B on MME, using the MMBench code base. I got scores of 580.5 for perception and 144.29 for cognition, which are very close to the results on the official leaderboard. A system prompt, as in the official Mini-GPT4 code, was set up.

However, I notice huge performance gaps compared with the original paper, with reference to Figure 2: the perception score there is 866.58 and the cognition score is 292.14, ranking in first place.

I wonder where this difference comes from and which one should be taken as the correct evaluation.

Update Qwen-VL results

We are honored to evaluate the Qwen-VL series on your excellent MME Benchmark.

Qwen-VL-Chat has achieved state-of-the-art results on MME to date. We provide all the code and steps HERE to reproduce the results.

We would appreciate it if you could update these changes on your home page and in the figures as soon as possible.

=========== Perception ===========
total score: 1487.576330532213 

         existence  score: 158.33333333333331
         count  score: 150.0
         position  score: 128.33333333333334
         color  score: 170.0
         posters  score: 178.57142857142856
         celebrity  score: 120.58823529411764
         scene  score: 152.25
         landmark  score: 164.0
         artwork  score: 125.5
         OCR  score: 140.0


=========== Cognition ===========
total score: 360.71428571428567 

         commonsense_reasoning  score: 130.7142857142857
         numerical_calculation  score: 40.0
         text_translation  score: 147.5
         code_reasoning  score: 42.5
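As a quick sanity check, the reported totals are simply the sums of the per-task scores; a short Python snippet over the numbers as listed above:

```python
# Verify that the reported totals equal the sums of the per-task scores.
perception = {
    "existence": 158.33333333333331, "count": 150.0, "position": 128.33333333333334,
    "color": 170.0, "posters": 178.57142857142856, "celebrity": 120.58823529411764,
    "scene": 152.25, "landmark": 164.0, "artwork": 125.5, "OCR": 140.0,
}
cognition = {
    "commonsense_reasoning": 130.7142857142857, "numerical_calculation": 40.0,
    "text_translation": 147.5, "code_reasoning": 42.5,
}
print(sum(perception.values()))  # ≈ 1487.58, matching the reported perception total
print(sum(cognition.values()))   # ≈ 360.71, matching the reported cognition total
```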

A Question about Evaluation with V100

Hi, thanks for the great work! BLIP-2's FlanT5-XXL uses bfloat16, while the V100 does not support bfloat16. As shown in your paper, all your experiments were done on V100 GPUs. I also use V100s in my lab. Are there any methods to run BLIP-2 FlanT5 on the V100 GPU?
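One common workaround, sketched below under the assumption that the Hugging Face transformers BLIP-2 checkpoint is used, is to load the model in float32 (or cast to float16, which FlanT5 handles less reliably) instead of bfloat16:

```python
# Sketch: loading BLIP-2 FlanT5-XXL without bfloat16 so it runs on V100 GPUs.
# Assumes the Hugging Face `transformers` implementation; float32 is the safe choice
# (FlanT5 is known to be prone to overflow in float16) at roughly twice the memory cost.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    torch_dtype=torch.float32,  # avoid bfloat16, which the V100 does not support
    device_map="auto",          # shard across several V100s if one card is too small (requires `accelerate`)
)
```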

Load images to evaluate our model.

May I know where I could obtain the images, such as tt0074749.jpg, to evaluate our model? I can see some of them are in COCO format, but some are not. What should I do if I wish to submit our own results?

Many of the MME landmark images are not available for download

Hello, thank you for the wonderful work and the dataset!

I was trying to download the landmark images by running MME_Benchmark_release_version/landmark/images/download_landmark.py, but only ~35 images were successfully downloaded, while the others failed with the error: Failed to download the URL file.

Are there any alternative sources for downloading these images? Or did I do something incorrectly?

Thanks for the help in advance :)

About foundation models

What is the classification basis for the category of "Foundation Models"? For example, why are Flamingo and mPLUG-Owl not foundation models?

Inference prompt for MME evaluation

Hello, I first would like to say thank you for this great repository.

I evaluated mPLUG-Owl using their official prompt as follows:

The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: {question}
AI: 

But I got a score of about 1100, which is worse than the number reported in the MME paper (about 1250).

What is the exact prompt you used for evaluation?
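For reference, this is how a question would be plugged into that template (a minimal sketch; the example question wording is only illustrative, not taken from the benchmark files):

```python
# Sketch: formatting an MME-style yes/no question with the mPLUG-Owl prompt above.
PROMPT_TEMPLATE = (
    "The following is a conversation between a curious human and AI assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    "Human: <image>\n"
    "Human: {question}\n"
    "AI: "
)

def build_prompt(question: str) -> str:
    return PROMPT_TEMPLATE.format(question=question)

# Illustrative question wording.
print(build_prompt("Is there a dog in this image? Please answer yes or no."))
```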

Github link to the multiInstruct paper

Thanks for organizing such a useful repo for multimodal papers. I am the author of the MultiInstruct paper. We recently open-sourced all the datasets and instructions used in our multimodal instruction tuning paper: MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. After submission, we expanded the number of tasks to 62.
Here is the GitHub link: https://github.com/VT-NLP/MultiInstruct

Could you please help me update the GitHub link in your repo?

BTW, we plan to release an additional 150 diverse vision-language tasks next month.

Would like to suggest a few works

Hi, thanks for compiling this list! I hope to bring the following works from my team to your attention:

  • VIMA: General Robot Manipulation with Multimodal Prompts. ICML 2023. https://vimalabs.github.io/ (paper, code, model). This work pre-dates and is related to PaLM-E.
  • Prismer: A Vision-Language Model with Multi-Modal Experts. An open-source multimodal LLM from NVIDIA that pre-dates GPT-4. https://github.com/NVlabs/prismer (paper, code, model, demo).
  • MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. NeurIPS 2022 Best Paper Award. Large-scale vision-language foundation model and datasets for Minecraft. https://github.com/MineDojo/MineDojo

Update the github link of ICL-D3IE in Readme

The GitHub link to ICL-D3IE is now accessible. If possible, could you add it to the readme?
Furthermore, we have a new paper about Multimodal Chain-of-Thought. Would it be possible to add this paper to your paper list? The paper's name is T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering. (https://arxiv.org/abs/2305.03453)

Thank you.

May be some minor suggestions

Hi, Awesome-Multimodal-Large-Language-Models is a nice repo with a great hierarchical structure. Below are some suggestions that might be helpful.

Some references are crucial but neglected in the research track of Awesome-Multimodal-Large-Language-Models, such as VL-T5, FrozenBiLM, VL-Adapter, and LST. Also, feel free to diff the current repo against my research trends and fill in anything that is missing.

For the survey, there is another concern that needs careful consideration: is it worthwhile for our research community to follow tool-oriented technical reports? Although it may be difficult to determine until the results of the next top conference, I believe you can handle this matter properly.

By the way, is there any way to participate in the ongoing MLLM survey? Thank you very much for your time.

Result of random guess

Hi, I would like to know: if a model randomly outputs yes or no for every question, what performance will it get?
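A rough back-of-the-envelope estimate, assuming MME's per-subtask score is accuracy plus accuracy+ (both in percentage points, with accuracy+ requiring both questions about an image to be answered correctly):

```python
# Expected per-subtask score of a model that answers yes/no uniformly at random.
p_correct = 0.5                       # each yes/no question is right with probability 0.5
accuracy = p_correct * 100            # 50
accuracy_plus = p_correct ** 2 * 100  # both questions about an image right: 25
subtask_score = accuracy + accuracy_plus
print(subtask_score)  # 75.0, i.e. the random-guess baseline per subtask
```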

Include LENS

https://arxiv.org/abs/2306.16410

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at this https URL and provide an interactive demo.
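Roughly, the pipeline described in the abstract can be sketched as below (a hypothetical illustration, not the authors' code; the module stubs and prompt layout are placeholders):

```python
# Hypothetical sketch of a LENS-style pipeline: frozen vision modules describe the
# image in text, and an off-the-shelf LLM reasons over those descriptions,
# with no multimodal training involved.

def tag_module(image):      # stand-in for an image tagger
    return ["dog", "frisbee", "grass"]

def caption_module(image):  # stand-in for an image captioner
    return ["A dog is catching a frisbee on a lawn."]

def build_prompt(image, question):
    tags = ", ".join(tag_module(image))
    captions = " ".join(caption_module(image))
    return f"Tags: {tags}\nCaptions: {captions}\nQuestion: {question}\nAnswer:"

def answer(image, question, llm):
    # `llm` is any text-completion function wrapping an off-the-shelf LLM.
    return llm(build_prompt(image, question))

print(build_prompt(None, "What is the dog playing with?"))
```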

Related Work

What an exciting project list of multimodal LLMs! We have a related work that we hope can be added to this awesome repository.

Paper Title: LMEye: An Interactive Perception Network for Large Language Models

Project: https://github.com/YunxinLi/LingCloud

Wechat can not be added

Very nice and impressive work! Your WeChat ID seems to have been added so frequently that it is now restricted. Is there any other way to join the WeChat group? Thanks!

Experiment Results

I want to know why so many models achieve scores lower than 75 in Fig. 2, given that the random accuracies of the two metrics are equal to 50% and 25%. Didn't they follow the instruction? Didn't they answer yes/no?

Does the performance of GIT2 come from the weights in 2022?

The performance of GIT2 on the leaderboard is quite impressive. It has only 5.1B parameters. The original paper was published in 2022, and the repository has not been updated since March 2023. The original GIT and GIT2 models did not use techniques like instruction fine-tuning. However, GIT2 still beats many state-of-the-art models as of August 2023.

Does the performance come from a newer closed-source variant from Microsoft, an open-source version, or the original GIT2 weights from 2022?
