
bradyfu / awesome-multimodal-large-language-models

10.7K stars · 247 watchers · 713 forks · 78.41 MB

:sparkles::sparkles:Latest Advances on Multimodal Large Language Models

instruction-tuning instruction-following large-vision-language-model visual-instruction-tuning multi-modality in-context-learning large-language-models large-vision-language-models multimodal-chain-of-thought multimodal-in-context-learning

awesome-multimodal-large-language-models's People

Contributors

bradyfu · donglixp · islinxu · keli-61 · mea-lab-421 · pengzhiliang · renshuhuai-andy · wangjiongw · xjtupanda · yuleiqin · ziyuguo99


awesome-multimodal-large-language-models's Issues

mPLUG-DocOwl

https://arxiv.org/abs/2307.02499

We propose mPLUG-DocOwl, based on mPLUG-Owl, for OCR-free document understanding. Specifically, we first construct an instruction tuning dataset featuring a wide range of visual-text understanding tasks. Then, we strengthen the OCR-free document understanding ability by jointly training the model on language-only, general vision-and-language, and document instruction tuning datasets with our unified instruction tuning strategy. We also build an OCR-free document instruction understanding evaluation set, LLMDoc, to better compare models' capabilities in instruction compliance and document understanding. Experimental results show that our model outperforms existing multi-modal models, demonstrating its strong document understanding ability. Besides, without specific fine-tuning, mPLUG-DocOwl generalizes well to various downstream tasks. We open-source our code at this https URL and provide an interactive demo.
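As a loose illustration of the joint training recipe described above (a hypothetical sketch, not the authors' implementation; the dataset contents and mixing weights are placeholders):

```python
# Hypothetical sketch: jointly sampling language-only, general vision-and-language,
# and document instruction-tuning data in one training stream.
import random

language_only = [{"image": None, "instruction": "Summarize the paragraph.", "answer": "..."}]
vision_language = [{"image": "coco_0001.jpg", "instruction": "Describe the image.", "answer": "..."}]
document = [{"image": "invoice_0001.png", "instruction": "What is the total amount?", "answer": "..."}]

# Placeholder mixing weights; the idea is simply that every batch can draw from all
# three sources, so the three abilities are trained together.
SOURCES = [(language_only, 0.3), (vision_language, 0.4), (document, 0.3)]

def sample_batch(batch_size=4, seed=0):
    rng = random.Random(seed)
    datasets, weights = zip(*SOURCES)
    return [rng.choice(rng.choices(datasets, weights=weights, k=1)[0]) for _ in range(batch_size)]

print(sample_batch())
```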

Inconsistent Results for Mini-GPT4 with Vicuna-13b

I re-evaluated the performance of Mini-GPT4 with Vicuna-13B on MME, using the MMBench code base. I got scores of 580.5 for perception and 144.29 for cognition, which are very close to the results on the official leaderboard. A system prompt, as in the official Mini-GPT4 code, was set up.

However, I notice huge performance gaps compared with the original paper, with reference to Figure 2: the perception score there is 866.58 and the cognition score is 292.14, ranking in first place.

I wonder where this difference comes from and which one should be taken as the correct evaluation.

Update Qwen-VL results

We are honored to evaluate the Qwen-VL series on your excellent MME Benchmark.

Qwen-VL-Chat has achieved state-of-the-art results on MME to date. We provide all the code and steps HERE to reproduce the results.

We would appreciate it if you could update these changes on your home page and in the figures as soon as possible.

=========== Perception ===========
total score: 1487.576330532213 

         existence  score: 158.33333333333331
         count  score: 150.0
         position  score: 128.33333333333334
         color  score: 170.0
         posters  score: 178.57142857142856
         celebrity  score: 120.58823529411764
         scene  score: 152.25
         landmark  score: 164.0
         artwork  score: 125.5
         OCR  score: 140.0


=========== Cognition ===========
total score: 360.71428571428567 

         commonsense_reasoning  score: 130.7142857142857
         numerical_calculation  score: 40.0
         text_translation  score: 147.5
         code_reasoning  score: 42.5
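As a quick sanity check, the reported totals are simply the sums of the per-task scores; a short Python snippet over the numbers as listed above:

```python
# Verify that the reported totals equal the sums of the per-task scores.
perception = {
    "existence": 158.33333333333331, "count": 150.0, "position": 128.33333333333334,
    "color": 170.0, "posters": 178.57142857142856, "celebrity": 120.58823529411764,
    "scene": 152.25, "landmark": 164.0, "artwork": 125.5, "OCR": 140.0,
}
cognition = {
    "commonsense_reasoning": 130.7142857142857, "numerical_calculation": 40.0,
    "text_translation": 147.5, "code_reasoning": 42.5,
}
print(sum(perception.values()))  # ≈ 1487.58, matching the reported perception total
print(sum(cognition.values()))   # ≈ 360.71, matching the reported cognition total
```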

A Question about Evaluation with V100

Hi, thanks for the great work! BLIP-2's FlanT5-XXL uses bfloat16, while the V100 does not support bfloat16. As shown in your paper, all your experiments were done on V100 GPUs. I also use V100s in my lab. Are there any methods to run BLIP-2 FlanT5 on the V100 GPU?
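One common workaround, sketched below under the assumption that the Hugging Face transformers BLIP-2 checkpoint is used, is to load the model in float32 (or cast to float16, which FlanT5 handles less reliably) instead of bfloat16:

```python
# Sketch: loading BLIP-2 FlanT5-XXL without bfloat16 so it runs on V100 GPUs.
# Assumes the Hugging Face `transformers` implementation; float32 is the safe choice
# (FlanT5 is known to be prone to overflow in float16) at roughly twice the memory cost.
import torch
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-flan-t5-xxl")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-flan-t5-xxl",
    torch_dtype=torch.float32,  # avoid bfloat16, which the V100 does not support
    device_map="auto",          # shard across several V100s if one card is too small (requires `accelerate`)
)
```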

Load images to evaluate our model.

May I know where I could obtain the images, such as tt0074749.jpg, to evaluate our model? I can see some of them are in COCO format, but some are not. What should I do if I wish to submit our own results?

Many of the MME landmark images are not available for download

Hello, thank you for the wonderful work and the dataset!

I was trying to download the landmark images by running MME_Benchmark_release_version/landmark/images/download_landmark.py, but only ~35 images were successfully downloaded, while the others failed with the error: Failed to download the URL file.

Are there any alternative sources for downloading these images? Or did I do something incorrectly?

Thanks for the help in advance :)

About foundation models

What is the classification basis for the category of "Foundation Models"? For example, why are Flamingo and mPLUG-Owl not foundation models?

Inference prompt for MME evaluation

Hello, I first would like to say thank you for this great repository.

I evaluated mPLUG-Owl using their official prompt as follows:

The following is a conversation between a curious human and AI assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
Human: <image>
Human: {question}
AI: 

But I got a score of about 1100, which is worse than the number reported in the MME paper (about 1250).

What is the exact prompt you used for evaluation?
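For reference, this is how a question would be plugged into that template (a minimal sketch; the example question wording is only illustrative, not taken from the benchmark files):

```python
# Sketch: formatting an MME-style yes/no question with the mPLUG-Owl prompt above.
PROMPT_TEMPLATE = (
    "The following is a conversation between a curious human and AI assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    "Human: <image>\n"
    "Human: {question}\n"
    "AI: "
)

def build_prompt(question: str) -> str:
    return PROMPT_TEMPLATE.format(question=question)

# Illustrative question wording.
print(build_prompt("Is there a dog in this image? Please answer yes or no."))
```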

Github link to the multiInstruct paper

Thanks for organizing such a useful repo for multimodal papers. I am the author of the MultiInstruct paper. We recently open-sourced all the datasets and instructions used in our multimodal instruction tuning paper: MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. After submission, we expanded the number of tasks to 62.
Here is the GitHub link: https://github.com/VT-NLP/MultiInstruct

Could you please help me update the GitHub link in your repo?

BTW, we plan to release an additional 150 diverse vision-language tasks next month.

Would like to suggest a few works

Hi, thanks for compiling this list! I hope to bring the following works from my team to your attention:

  • VIMA: General Robot Manipulation with Multimodal Prompts. ICML 2023. https://vimalabs.github.io/ (paper, code, model). This work pre-dates and is related to PaLM-E.
  • Prismer: A Vision-Language Model with Multi-Modal Experts. An open-source multimodal LLM from NVIDIA that pre-dates GPT-4. https://github.com/NVlabs/prismer (paper, code, model, demo).
  • MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge. NeurIPS 2022 Best Paper Award. Large-scale vision-language foundation model and datasets for Minecraft. https://github.com/MineDojo/MineDojo

Update the github link of ICL-D3IE in Readme

The GitHub link to ICL-D3IE is now accessible. If possible, could you add it to the readme?
Furthermore, we have a new paper about Multimodal Chain-of-Thought. Would it be possible to add this paper to your paper list? The paper's name is T-SciQ: Teaching Multimodal Chain-of-Thought Reasoning via Large Language Model Signals for Science Question Answering. (https://arxiv.org/abs/2305.03453)

Thank you.

May be some minor suggestions

Hi, Awesome-Multimodal-Large-Language-Models is a nice repo with a great hierarchical structure. Below are some suggestions that might be helpful.

Some references are crucial but neglected in the research track of Awesome-Multimodal-Large-Language-Models, such as VL-T5, FrozenBiLM, VL-Adapter, and LST. Also, feel free to diff the current repo against my research trends and fill in anything that is missing.

For the survey, there is another concern that needs careful consideration: is it worthwhile for our research community to follow tool-oriented technical reports? Although it may be difficult to determine until the results of the next top conference, I believe you can handle this matter properly.

By the way, is there any way to participate in the ongoing MLLM survey? Thank you very much for your time.

Result of random guess

Hi, I would like to know: if a model randomly outputs yes or no for every question, what performance will it get?
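A rough back-of-the-envelope estimate, assuming MME's per-subtask score is accuracy plus accuracy+ (both in percentage points, with accuracy+ requiring both questions about an image to be answered correctly):

```python
# Expected per-subtask score of a model that answers yes/no uniformly at random.
p_correct = 0.5                       # each yes/no question is right with probability 0.5
accuracy = p_correct * 100            # 50
accuracy_plus = p_correct ** 2 * 100  # both questions about an image right: 25
subtask_score = accuracy + accuracy_plus
print(subtask_score)  # 75.0, i.e. the random-guess baseline per subtask
```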

Include LENS

https://arxiv.org/abs/2306.16410

We propose LENS, a modular approach for tackling computer vision problems by leveraging the power of large language models (LLMs). Our system uses a language model to reason over outputs from a set of independent and highly descriptive vision modules that provide exhaustive information about an image. We evaluate the approach on pure computer vision settings such as zero- and few-shot object recognition, as well as on vision and language problems. LENS can be applied to any off-the-shelf LLM and we find that the LLMs with LENS perform highly competitively with much bigger and much more sophisticated systems, without any multimodal training whatsoever. We open-source our code at this https URL and provide an interactive demo.
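Roughly, the pipeline described in the abstract can be sketched as below (a hypothetical illustration, not the authors' code; the module stubs and prompt layout are placeholders):

```python
# Hypothetical sketch of a LENS-style pipeline: frozen vision modules describe the
# image in text, and an off-the-shelf LLM reasons over those descriptions,
# with no multimodal training involved.

def tag_module(image):      # stand-in for an image tagger
    return ["dog", "frisbee", "grass"]

def caption_module(image):  # stand-in for an image captioner
    return ["A dog is catching a frisbee on a lawn."]

def build_prompt(image, question):
    tags = ", ".join(tag_module(image))
    captions = " ".join(caption_module(image))
    return f"Tags: {tags}\nCaptions: {captions}\nQuestion: {question}\nAnswer:"

def answer(image, question, llm):
    # `llm` is any text-completion function wrapping an off-the-shelf LLM.
    return llm(build_prompt(image, question))

print(build_prompt(None, "What is the dog playing with?"))
```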

Related Work

What an exciting project list of multimodal LLMs! We have a related work that we hope can be added to this awesome repository.

Paper Title: LMEye: An Interactive Perception Network for Large Language Models

Project: https://github.com/YunxinLi/LingCloud

Wechat can not be added

Very nice and impressive work! Your WeChat ID seems to have been added so frequently that it is now restricted. Is there any other way to join the WeChat group? Thanks!

Experiment Results

I want to know why so many models achieve scores lower than 75 in Fig. 2, given that the random accuracies of the two metrics are equal to 50% and 25%. Didn't they follow the instruction? Didn't they answer yes/no?

Does the performance of GIT2 come from the weights in 2022?

The performance of GIT2 on the leaderboard is quite impressive. It has only 5.1B parameters. The original paper was published in 2022, and the repository has not been updated since March 2023. The original GIT and GIT2 models did not use techniques like instruction fine-tuning. However, GIT2 still beats many state-of-the-art models as of August 2023.

Does the performance come from a newer closed-source variant from Microsoft, an open-source version, or the original GIT2 weights from 2022?
