mbzuai-oryx / video-llava

PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

Home Page: https://mbzuai-oryx.github.io/Video-LLaVA

Python 99.32% Shell 0.68%
llm lmm video video-conversation grounding transcription video-grounding

video-llava's Introduction

PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models

Shehan Munasinghe*, Rusiru Thushara*, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Mubarak Shah, Fahad Shahbaz Khan.

*Equal Contribution

Mohamed bin Zayed University of Artificial Intelligence, UAE

Website | Paper


📢 Latest Updates

  • 📦 27-Dec-2023: Code, models released! 🚀

Overview

PG-Video-LLaVA is the first video-based Large Multimodal Model (LMM) with pixel-level grounding capabilities. 🔥🔥🔥

PG-Video-LLaVA Architectural Overview


๐Ÿ† Contributions

The key contributions of this work are:

  • We propose PG-Video-LLaVA, the first video-based LMM with pixel-level grounding capabilities, featuring a modular design for enhanced flexibility. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially ground objects in videos following user instructions.

  • We introduce a new benchmark specifically designed to measure prompt-based object grounding performance.

  • By incorporating audio context, PG-Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and better suited to scenarios where the audio signal is crucial for video understanding (e.g., dialogues, conversations, and news videos).

  • We introduce improved quantitative benchmarks for video-based conversational models. Our benchmarks utilize the open-source Vicuna LLM to ensure better reproducibility and transparency. We also propose benchmarks to evaluate the grounding capabilities of video-based conversational models.


PG-Video-LLaVA : Architecture

PG-Video-LLaVA Architectural Overview


Installation and CLI Demo

For installation and setting up the CLI demo, please refer to the instructions here.


Training

For training, please refer to the instructions here.


Qualitative Analysis 🔍

Video Grounding 🎯

Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to localize objects in videos following user instructions.
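
To make the pipeline concrete, below is a minimal, illustrative sketch of how a detector and an off-the-shelf tracker could be chained to ground phrases from the model's answer across frames. The function and parameter names are placeholders, not the actual API of this repository.

    from typing import Callable, Dict, List, Sequence, Tuple

    Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixel coordinates

    def ground_answer_in_video(
        frames: Sequence,                                                    # decoded video frames
        phrases: Sequence[str],                                              # noun phrases extracted from the LMM answer
        detect: Callable[[object, Sequence[str]], Dict[str, Box]],           # open-vocabulary detector
        track_update: Callable[[object, Dict[str, Box]], Dict[str, Box]],    # off-the-shelf tracker
    ) -> Dict[str, List[Box]]:
        """Collect one box per phrase per frame, temporally linked by the tracker."""
        tracks: Dict[str, List[Box]] = {p: [] for p in phrases}
        for frame in frames:
            detections = detect(frame, phrases)       # localize each phrase in this frame
            linked = track_update(frame, detections)  # associate detections with existing tracks
            for phrase, box in linked.items():
                tracks[phrase].append(box)
        return tracks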

Video-Grounding Qualitative Results


Including Audio Modality 🎧

By incorporating audio context, PG-Video-LLaVA significantly enhances its understanding of video content, making it more comprehensive and better suited to scenarios where the audio signal is crucial for video understanding (e.g., dialogues, conversations, and news videos).
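
As a rough illustration only: one way to fold audio into the prompt is to transcribe the video's audio track and prepend the transcript to the user question. The snippet below assumes the openai-whisper package purely for demonstration; the repository's own --use_asr path may use a different ASR backend and prompt format.

    import whisper  # assumption: openai-whisper is installed (pip install openai-whisper)

    def build_prompt_with_audio(video_path: str, user_question: str) -> str:
        """Prepend an ASR transcript so the LMM can use spoken content when answering."""
        asr_model = whisper.load_model("base")  # small model chosen for illustration
        transcript = asr_model.transcribe(video_path)["text"].strip()
        return f"Audio transcript: {transcript}\n\nQuestion: {user_question}"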

Qualitative Results: Audio modality


Video-ChatGPT vs PG-Video-LLaVA

PG-Video-LLaVA is built on a stronger image-LMM baseline, which gives it better conversational ability than its predecessor.

Video-ChatGPT vs PG-Video-LLaVA


Quantitative Evaluation 📊

We evaluate PG-Video-LLaVA using video-based generative and question-answering benchmarks. We also introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos.

Video Grounding 🎯

To quantitatively assess PG-Video-LLaVA's spatial grounding capability, we evaluate it on two benchmarks derived from the test sets of the VidSTG and HC-STVG datasets.

For detailed instructions on running the quantitative evaluation for video grounding, please refer to this.
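
For intuition, the grounding benchmarks boil down to comparing predicted boxes against ground-truth boxes over annotated frames; a simple mean-IoU formulation is sketched below. The exact aggregation used by the repository's evaluation scripts may differ.

    from typing import List, Tuple

    Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

    def iou(a: Box, b: Box) -> float:
        """Intersection-over-union of two axis-aligned boxes."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def mean_iou(pred: List[Box], gt: List[Box]) -> float:
        """Average IoU over the annotated frames of a clip (one box per frame)."""
        assert len(pred) == len(gt), "expected one predicted box per annotated frame"
        return sum(iou(p, g) for p, g in zip(pred, gt)) / max(len(gt), 1)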

Video-Grounding Quantitative Results


Video-based Generative Performance Benchmarking 🤖

We apply the benchmarking framework from Video-ChatGPT, which measures performance on several axes critical for video-based conversational agents: correctness of information, detail orientation, contextual understanding, temporal understanding, and consistency. To make the evaluation reliable and reproducible, we have updated the assessment pipeline by replacing GPT-3.5 with Vicuna-13b-v1.5.
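
Conceptually, the evaluation swaps a proprietary judge for a locally hosted Vicuna model that scores each prediction against a reference answer. The sketch below shows the general LLM-as-judge pattern with a simplified prompt; the actual templates and parsing live in the repository's evaluation scripts, and the generate callable stands in for whatever inference backend serves Vicuna-13b-v1.5.

    import re
    from typing import Callable

    def judge_correctness(question: str, reference: str, prediction: str,
                          generate: Callable[[str], str]) -> int:
        """Ask a local LLM (e.g. Vicuna-13b-v1.5) to rate a prediction from 1 to 5."""
        prompt = (
            "You are evaluating a video question-answering model.\n"
            f"Question: {question}\n"
            f"Reference answer: {reference}\n"
            f"Predicted answer: {prediction}\n"
            "Rate the factual correctness of the prediction from 1 (wrong) to 5 (fully correct). "
            "Reply with a single integer."
        )
        reply = generate(prompt)
        match = re.search(r"[1-5]", reply)
        return int(match.group()) if match else 1  # conservative default if parsing fails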

Video-based Generative Performance Benchmarking


Zero-Shot Question Answering 💬

Zero-shot question-answering (QA) capabilities were evaluated quantitatively using several established open-ended QA datasets: MSRVTT-QA, MSVD-QA, TGIF-QA, and ActivityNet-QA.

Zero-shot QA Quantitative Results

For detailed instructions on video-based generative performance benchmarking and the zero-shot question-answering benchmark, please refer to this.
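
As a rough sketch of how per-question judgments are typically rolled up into the reported numbers (accuracy and average score), assuming each question has already been judged with a yes/no verdict and a numeric score; the field layout here is illustrative, not the scripts' actual output format.

    from typing import Iterable, Tuple

    def aggregate_qa_results(judgments: Iterable[Tuple[str, int]]) -> Tuple[float, float]:
        """Turn per-question ("yes"/"no", score) pairs into (accuracy, average score)."""
        judgments = list(judgments)
        correct = sum(1 for verdict, _ in judgments if verdict.lower() == "yes")
        accuracy = correct / max(len(judgments), 1)
        avg_score = sum(score for _, score in judgments) / max(len(judgments), 1)
        return accuracy, avg_score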


Acknowledgements 🙏

  • LLaMA: a great attempt towards open and efficient LLMs!
  • Vicuna: amazing open-source language capabilities!
  • LLaVA: our architecture is inspired by LLaVA.
  • Video-ChatGPT: the predecessor to PG-Video-LLaVA.

Citation 📜

If you're using PG-Video-LLaVA in your research or applications, please cite using this BibTeX:

  @article{munasinghe2023PGVideoLLaVA,
        title={PG-Video-LLaVA: Pixel Grounding Large Video-Language Models}, 
        author={Shehan Munasinghe and Rusiru Thushara and Muhammad Maaz and Hanoona Abdul Rasheed and Salman Khan and Mubarak Shah and Fahad Khan},
        journal={arXiv preprint arXiv:2311.13435},
        year={2023}
  }

video-llava's People

Contributors

shehanmunasinghe

video-llava's Issues

License

Please add a license for this repo and the models.
Indeed, very nice work.

Flash Attention

Hi,

Thank you for the codebase and the models! I notice that flash-attention is one of the project's dependencies. Since I'm working on AMD GPUs, and installing flash-attention with ROCm support is currently rather challenging, I was wondering whether I could skip installing it. I want to use PG-Video-LLaVA mostly for inference, and the instructions suggest installing flash-attention only if training is required. Does this mean that if I don't install it, inference should run without issues? Thank you!
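
For reference, a common way projects make flash-attention optional is to guard the import and only enable the fast path when the package is present; whether this repository already gates it this way is an assumption, so inference without flash-attention may still require small code changes.

    # Guarded import: training can use the flash-attention path when available,
    # while inference falls back to standard PyTorch attention if it is not installed.
    try:
        import flash_attn  # noqa: F401
        HAS_FLASH_ATTN = True
    except ImportError:
        HAS_FLASH_ATTN = False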

CLI Demo can be made much simpler by adding more instructions in the README.md

I'd like to suggest several enhancements that could improve the project's usability and documentation:

  1. Incorporation of a setup.py File: It would be highly beneficial to include a setup.py file in the repository. This file could automate the installation process by declaring the dependencies for setuptools, streamlining setup for new users.

  2. Documentation of FlashAttention in Training.md: I recommend adding a section or note about the FlashAttention mechanism within the Training.md documentation. This addition would help users understand its role and implementation within the training process.

  3. Guidance on Downloading the LLaVA Model: Providing a command or step-by-step instructions for downloading the LLaVA model using the snapshot module from Hugging Face would greatly assist users in getting started with the model (a hedged example is sketched after this list). This clarity could prevent confusion and streamline the initial setup.

  4. Separate README for Grounding Functionality: Considering the complexity and importance of the grounding functionality, creating a separate README.md focused on this aspect could make the information more accessible and easier to digest for new users.
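
Regarding item 3, a minimal example of pulling the base LLaVA weights with the Hugging Face Hub snapshot API might look like the following; the repo id and target directory are assumptions chosen to match the CLI demo command, not official instructions.

    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="liuhaotian/llava-v1.5-7b",        # assumed base model repository
        local_dir="weights/llava/llava-v1.5-7b",   # matches the path used in the CLI demo
    )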

Exploring the model with grounding capabilities hosted on Hugging Face would be immensely helpful for gaining a deeper understanding.

@shehanmunasinghe, these suggestions are intended to enhance the project's accessibility, documentation, and user experience. I believe these improvements could make a significant difference for both new and existing users.

Time codes

This is really an excellent tool.
I want to ask whether it is possible to get the time codes along with each sentence of the explanation. If you have any tips on how to export them, it would be a great help.

Error while loading tokenizer

Hello,

Thanks for making the code and models available. I was following the guide to set up the repo and run a CLI demo.

The command-line arguments look like this:

python video_chatgpt/chat.py --model-name weights/llava/llava-v1.5-7b --projection_path weights/projection/mm_projector_7b_1.5_336px.bin --use_asr --conv_mode pg-video-llava

The --model-name argument is the path to the folder whose contents are shown here, and the --projection_path argument is the path to the folder containing the mm_projector_7b_1.5_336px.bin file.

I'm facing an error while loading the vocab_file; the resolved vocab_file is weights/llava/llava-v1.5-7b/tokenizer.model.
The error traceback is as follows:

Traceback (most recent call last):
  File "/media/vishal/2TB_storage/repos/Video-LLaVA/video_chatgpt/chat.py", line 362, in <module>
    chat = VideoChatGPTInterface(
  File "/media/vishal/2TB_storage/repos/Video-LLaVA/video_chatgpt/chat.py", line 29, in __init__
    model, vision_tower, tokenizer, image_processor, video_token_l...
  File "/media/vishal/2TB_storage/repos/Video-LLaVA/video_chatgpt/eval/model_utils.py", line 101, in initialize_model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
  File "/home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/transformers/models/auto/tokenization_auto.py", line 682, in from_pretrained
    return tokenizer_class.from_pretrained(pretrained_model_na...
  File "/home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1805, in from_pretrained
    return cls._from_pretrained(
  File "/home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1959, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/transformers/models/llama/tokenization_llama.py", line 71, in __init__
    self.sp_model.Load(vocab_file)
  File "/home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(sel...
RuntimeError: Internal: src/sentencepiece_processor.cc(1101)
[model_proto->ParseFromArray(serialized.data(), serialized.size())]

The versions of tokenizers and transformers are 0.13.3 and 4.28.0.dev0 respectively.

Could you help me solve this error?
Thanks,
Vishal

Segmentation Error

@shehanmunasinghe I was running the code locally, but I am stuck with an error that says:

While debugging: Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

While running through CLI: Segmentation fault (core dumped)

Also, can you please let me know the minimum requirements to run inference?

Thank you in Advance

Training Details

What is the difference in training between this work and VideoChatGPT?

Requirements conflict between whisper-at and torch 2.1.0 during installation

When running "pip install -r requirements.txt", the following error comes up:
ERROR: Cannot install -r requirements.txt (line 1) and -r requirements.txt (line 19) because these package versions have conflicting dependencies.

The conflict is caused by:
torch 2.1.0 depends on triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64"
whisper-at 0.5 depends on triton==2.0.0

To fix this you could try to:

  1. loosen the range of package versions you've specified
  2. remove package versions to allow pip attempt to solve the dependency conflict

ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts

Demo on Gradio

Will there be a Gradio demo for this model, similar to the one for Video-LLaVA? It would be highly beneficial.
