mbzuai-oryx / video-llava
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
Home Page: https://mbzuai-oryx.github.io/Video-LLaVA
This is really an excellent tool.
Is it possible to also get the timestamps along with each sentence of the explanation? If you have any tip on how to export them, it would be a great help.
When running `pip install -r requirements.txt`, I get this error:
`ERROR: Cannot install -r requirements.txt (line 1) and -r requirements.txt (line 19) because these package versions have conflicting dependencies.
The conflict is caused by:
torch 2.1.0 depends on triton==2.1.0; platform_system == "Linux" and platform_machine == "x86_64"
whisper-at 0.5 depends on triton==2.0.0
To fix this you could try to:
ERROR: ResolutionImpossible: for help visit https://pip.pypa.io/en/latest/topics/dependency-resolution/#dealing-with-dependency-conflicts`
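One possible workaround for this pin conflict (a hedged sketch, not an official fix: it assumes whisper-at 0.5 actually works against triton 2.1.0, which you would need to verify) is to install whisper-at separately with pip's dependency resolution disabled:

```shell
# Remove the whisper-at line from requirements.txt first, then:
pip install -r requirements.txt        # installs torch 2.1.0 + triton 2.1.0
pip install whisper-at==0.5 --no-deps  # skips its triton==2.0.0 pin
```

`--no-deps` tells pip not to resolve or install the package's declared dependencies, so the conflicting triton pin is simply ignored.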
@shehanmunasinghe I was running the code locally but got stuck with an error:
While debugging: Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)
While running through CLI: Segmentation fault (core dumped)
Also, could you please let me know the minimum hardware requirements to run inference?
Thank you in advance.
Please add a license for this repo and the model.
Indeed, very nice work.
Hello,
Thanks for making the code and models available. I was following the guide to set up the repo and run a CLI demo.
The command-line arguments look like this:
python video_chatgpt/chat.py --model-name weights/llava/llava-v1.5-7b --projection_path weights/projection/mm_projector_7b_1.5_336px.bin --use_asr --conv_mode pg-video-llava
The `--model-name` argument is the path to the folder whose contents are shown here, and the `--projection_path` argument is the path to the `mm_projector_7b_1.5_336px.bin` file.
I'm facing an error while loading the vocab_file; the resolved vocab_file is `weights/llava/llava-v1.5-7b/tokenizer.model`.
The error traceback is as follows:
╭───────────────────── Traceback (most recent call last) ──────────────────────╮
│ /media/vishal/2TB_storage/repos/Video-LLaVA/video_chatgpt/chat.py:362 in │
│ <module> │
│ │
│ 359 │ │ ) │
│ 360 │ │ chat.interact() │
│ 361 │ else: │
│ ❱ 362 │ │ chat = VideoChatGPTInterface( │
│ 363 │ │ │ args_model_name=args.model_name, │
│ 364 │ │ │ args_projection_path=args.projection_path, │
│ 365 │ │ │ use_asr=args.use_asr, │
│ │
│ /media/vishal/2TB_storage/repos/Video-LLaVA/video_chatgpt/chat.py:29 in │
│ __init__ │
│ │
│ 26 │ │ self.use_asr=use_asr │
│ 27 │ │ self.conv_mode = conv_mode │
│ 28 │ │ │
│ ❱ 29 │ │ model, vision_tower, tokenizer, image_processor, video_token_l │
│ 30 │ │ self.tokenizer = tokenizer │
│ 31 │ │ self.image_processor = image_processor │
│ 32 │ │ self.vision_tower = vision_tower │
│ │
│ /media/vishal/2TB_storage/repos/Video-LLaVA/video_chatgpt/eval/model_utils.p │
│ y:101 in initialize_model │
│ │
│ 98 │ model_name = os.path.expanduser(model_name) │
│ 99 │ │
│ 100 │ # Load tokenizer │
│ ❱ 101 │ tokenizer = AutoTokenizer.from_pretrained(model_name) │
│ 102 │ │
│ 103 │ # Load model │
│ 104 │ model = VideoChatGPTLlamaForCausalLM.from_pretrained(model_name, l │
│ │
│ /home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/tra │
│ nsformers/models/auto/tokenization_auto.py:682 in from_pretrained │
│ │
│ 679 │ │ │ │ raise ValueError( │
│ 680 │ │ │ │ │ f"Tokenizer class {tokenizer_class_candidate} does │
│ 681 │ │ │ │ ) │
│ ❱ 682 │ │ │ return tokenizer_class.from_pretrained(pretrained_model_na │
│ 683 │ │ │
│ 684 │ │ # Otherwise we have to be creative. │
│ 685 │ │ # if model is an encoder decoder, the encoder tokenizer class │
│ │
│ /home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/tra │
│ nsformers/tokenization_utils_base.py:1805 in from_pretrained │
│ │
│ 1802 │ │ │ else: │
│ 1803 │ │ │ │ logger.info(f"loading file {file_path} from cache at │
│ 1804 │ │ │
│ ❱ 1805 │ │ return cls._from_pretrained( │
│ 1806 │ │ │ resolved_vocab_files, │
│ 1807 │ │ │ pretrained_model_name_or_path, │
│ 1808 │ │ │ init_configuration, │
│ │
│ /home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/tra │
│ nsformers/tokenization_utils_base.py:1959 in _from_pretrained │
│ │
│ 1956 │ │ │
│ 1957 │ │ # Instantiate tokenizer. │
│ 1958 │ │ try: │
│ ❱ 1959 │ │ │ tokenizer = cls(*init_inputs, **init_kwargs) │
│ 1960 │ │ except OSError: │
│ 1961 │ │ │ raise OSError( │
│ 1962 │ │ │ │ "Unable to load vocabulary from file. " │
│ │
│ /home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/tra │
│ nsformers/models/llama/tokenization_llama.py:71 in __init__ │
│ │
│ 68 │ │ self.add_eos_token = add_eos_token │
│ 69 │ │ self.decode_with_prefix_space = decode_with_prefix_space │
│ 70 │ │ self.sp_model = spm.SentencePieceProcessor(**self.sp_model_kwa │
│ ❱ 71 │ │ self.sp_model.Load(vocab_file) │
│ 72 │ │ self._no_prefix_space_tokens = None │
│ 73 │ │ │
│ 74 │ │ """ Initialisation""" │
│ │
│ /home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/sen │
│ tencepiece/__init__.py:905 in Load │
│ │
│ 902 │ │ raise RuntimeError('model_file and model_proto must be exclus │
│ 903 │ if model_proto: │
│ 904 │ │ return self.LoadFromSerializedProto(model_proto) │
│ ❱ 905 │ return self.LoadFromFile(model_file) │
│ 906 │
│ 907 │
│ 908 # Register SentencePieceProcessor in _sentencepiece: │
│ │
│ /home/vishal/miniconda3/envs/pg_video_llava/lib/python3.10/site-packages/sen │
│ tencepiece/__init__.py:310 in LoadFromFile │
│ │
│ 307 │ │ return _sentencepiece.SentencePieceProcessor_serialized_model │
│ 308 │ │
│ 309 │ def LoadFromFile(self, arg): │
│ ❱ 310 │ │ return _sentencepiece.SentencePieceProcessor_LoadFromFile(sel │
│ 311 │ │
│ 312 │ def _EncodeAsIds(self, text, enable_sampling, nbest_size, alpha, │
│ 313 │ │ return _sentencepiece.SentencePieceProcessor__EncodeAsIds(sel │
╰──────────────────────────────────────────────────────────────────────────────╯
RuntimeError: Internal: src/sentencepiece_processor.cc(1101)
[model_proto->ParseFromArray(serialized.data(), serialized.size())]
The versions of tokenizers and transformers are 0.13.3 and 4.28.0.dev0, respectively.
Could you help me out to solve this error?
Thanks,
Vishal
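A `ParseFromArray` failure like this usually means the `tokenizer.model` on disk is not a real SentencePiece model, most often a Git LFS pointer stub left behind by a clone made without `git lfs pull`, or a truncated download. A quick hedged check (the path and size threshold are assumptions, not taken from the repo):

```python
import os

def looks_like_lfs_pointer(path):
    """Return True if the file is a Git LFS pointer stub rather than binary data."""
    with open(path, "rb") as f:
        head = f.read(64)
    return head.startswith(b"version https://git-lfs.github.com/spec/")

def check_tokenizer_model(path):
    """Rough diagnosis of a tokenizer.model file that SentencePiece refuses to load."""
    size = os.path.getsize(path)
    if looks_like_lfs_pointer(path):
        return "LFS pointer stub; run `git lfs pull` or re-download the file"
    if size < 100_000:  # a real LLaMA tokenizer.model is roughly 500 KB
        return f"suspiciously small ({size} bytes); likely a truncated download"
    return "file looks plausible; try loading it with sentencepiece directly"
```

Running `check_tokenizer_model("weights/llava/llava-v1.5-7b/tokenizer.model")` should tell you which of the two common causes applies.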
What is the difference in training between this work and VideoChatGPT?
Hi,
Thank you for the codebase and the models! I notice that flash attention is one of the project's dependencies. I'm working on AMD GPUs, and installing flash attention with ROCm support is currently rather challenging, so I was wondering whether I could skip it. I want to use Video-LLaVA mostly for inference, and the instructions suggest installing flash attention only if training is required. Does this mean inference will run without issues if I don't install it? Thank you!
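Whether inference works without flash-attn depends on how the codebase imports it. A common pattern (a sketch of the general idiom, not necessarily what this repo does) is to guard the import and fall back to standard attention:

```python
# Hypothetical guard: fall back to PyTorch's built-in attention when
# flash_attn is not installed (e.g. on ROCm, where building it is hard).
try:
    from flash_attn import flash_attn_func  # optional training speed-up
    HAS_FLASH_ATTN = True
except ImportError:
    flash_attn_func = None
    HAS_FLASH_ATTN = False

def attention_backend():
    """Report which attention implementation would be used."""
    return "flash-attn" if HAS_FLASH_ATTN else "torch-sdpa"
```

If the repo's imports are guarded like this, inference runs unmodified without flash-attn; if they are unconditional, you would hit an ImportError at startup and would need to stub or remove those imports.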
I'd like to suggest several enhancements that could improve the project's usability and documentation:
Incorporation of a setup.py File: It would be highly beneficial to include a setup.py file in the repository. It could automate installation via setuptools and streamline setup for new users.
Documentation of FlashAttention in Training.md: I recommend adding a section or note about the FlashAttention mechanism within the Training.md documentation. This addition would help users understand its role and implementation within the training process.
Guidance on Downloading the LLaVA Model: Providing a command or step-by-step instructions for downloading the LLaVA model using the snapshot module from Hugging Face would greatly assist users in getting started with the model. This clarity could prevent confusion and streamline the initial setup.
Separate README for Grounding Functionality: Considering the complexity and importance of the grounding functionality, creating a separate README.md focused on this aspect could make the information more accessible and easier to digest for new users.
Exploring the model with grounding capabilities hosted on Hugging Face would be immensely helpful for gaining a deeper understanding.
@shehanmunasinghe, these suggestions are intended to enhance the project's accessibility, documentation, and user experience. I believe these improvements could make a significant difference for both new and existing users.
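To make the first suggestion concrete, a minimal setup.py could look like the sketch below. This is a config fragment under stated assumptions: the package name, version, and dependency list are placeholders, not taken from the repo.

```python
# Hypothetical setup.py sketch; names and versions are placeholders.
from setuptools import setup, find_packages

setup(
    name="pg-video-llava",            # placeholder package name
    version="0.1.0",                  # placeholder version
    packages=find_packages(),
    python_requires=">=3.10",
    install_requires=[
        # mirror requirements.txt here, or read it at build time
        "torch",
        "transformers",
        "sentencepiece",
    ],
)
```

With this in place, `pip install -e .` would give new users an editable install in one step.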
Hello, thank you for your great work. I only have eight RTX 4090 GPUs (24 GB each); is this enough to train your model?
Hi, just checking: what is the implication of running the code with grounding versus without grounding? What changes will it make to the output?
Hi, the projector weight download seems to be broken.
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding
central directory
How can I fix this?
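A PytorchStreamReader "failed finding central directory" error usually means the downloaded .bin is incomplete, or is an HTML error page rather than a real checkpoint. Checkpoints written by torch.save (in the modern format) are zip archives, so a quick sanity check is possible with the standard library alone (the diagnosis strings here are my own, not from PyTorch):

```python
import os
import zipfile

def checkpoint_looks_valid(path):
    """Rough sanity check for a torch-saved checkpoint file.

    torch.save (new format) writes a zip archive, so a file that is not a
    valid zip is almost certainly truncated or not a checkpoint at all.
    """
    if not os.path.exists(path):
        return "file is missing"
    size = os.path.getsize(path)
    if size == 0:
        return "file is empty; the download likely failed"
    if not zipfile.is_zipfile(path):
        return f"not a zip archive ({size} bytes); re-download the weights"
    return "zip structure looks intact"
```

Running this against `mm_projector_7b_1.5_336px.bin` should tell you whether the file needs to be re-downloaded.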
Audio carries redundant information compared with the visual frames.
Will there be a Gradio demo for this model, similar to Video-LLaVA's? It would be highly beneficial.
Thanks for the awesome work!
Just wondering, when will the code be available?