damo-nlp-sg / video-llama
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
License: BSD 3-Clause "New" or "Revised" License
Hi, thanks for your brilliant work. I want to fine-tune the model on my own dataset. Can it be done with 4 × RTX 3090s? How much GPU memory is necessary?
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement pytorch==1.12.1 (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch==1.12.1
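A note for anyone hitting this: on PyPI the package is named torch, not pytorch (the only versions pip finds for "pytorch", 0.1.2 and 1.0.2, belong to an unrelated placeholder package). Assuming the pin lives in a pip requirements list or the pip section of environment.yml, renaming it should clear the error:

torch==1.12.1   # instead of: pytorch==1.12.1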
I just installed "Video-LLaMA finetune vicuna7B". When running app.py I get KeyError: 'API_TOKEN' when I reach the Loading Q-Former stage. I can see that video_llama.py looks for os.environ['API_TOKEN'], but where do I go to get an API token? Is this a Hugging Face token, or possibly a LLaMA token? And what is the best way to save the token so that os.environ['API_TOKEN'] can find it? Thank you!
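A minimal sketch of how to make that lookup succeed, assuming it is a Hugging Face access token (an assumption; the 401 traceback from huggingface.co further down this page points that way, and the token value here is a placeholder):

import os

# Either export it in the shell before launching:
#     export API_TOKEN=hf_xxxxxxxx
# or set it in Python before the model code reads it:
os.environ["API_TOKEN"] = "hf_xxxxxxxx"  # placeholder Hugging Face token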
is it necessary?
I want to load a Ziya-based Video-LLaMA model to run a video captioning task. Do you currently provide a similar inference script, or should I just modify demo.py directly?
Hi, I want to generate instruction data on my dataset with GPT-4, but I don't know how to write the code. I also noticed that OpenAI imposes a rate limit, so I'd appreciate some suggestions or help from you~~~
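A minimal sketch of such a generation script, using the openai Python client (v1 API) with exponential backoff for the rate limit; the prompt, model name, and retry settings are illustrative assumptions, not the authors' actual pipeline:

import time

import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_instruction(caption: str, retries: int = 5) -> str:
    # Turn one video caption into an instruction-answer pair via GPT-4,
    # backing off exponentially whenever the rate limit is hit.
    for attempt in range(retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{
                    "role": "user",
                    "content": f"Write one instruction-answer pair about this video: {caption}",
                }],
            )
            return resp.choices[0].message.content
        except openai.RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("rate limit: all retries exhausted")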
First of all - Amazing work on this one.
I'm getting a bit lost in the repo; may I request a simple few-line script that does something like the following:
model = CLIPViP("pretrain_clipvip_base_32.pt")
text_features = model.encode_text("This is a very cute cat")
video_features = model.encode_video("vid_file.mp4")
cosine(text_features, video_features)
[Extra] Preferably, I'd like to get the video features for a batch of mp4 files of different lengths.
Thank you
Since Video-LLaMA still uses fp16, I think it would be better to integrate a lossless quantization method such as SqueezeLLM, which already supports Vicuna 7B and 13B.
Thanks for the great work! One small question: what is the advantage of your proposed Video-LLaMA over VideoChat?
Same as the title.
Great work!
I noticed the image size is fixed to 224 × 224 in a few files despite being a parameter in the .yaml files. I also wonder whether any video size would work (if someone wants to use your code to fine-tune on their own data) if we just change the image-size parameter, considering some of the pre-trained models may work better with that fixed image size. I'd appreciate it if you could explain the best approach if I want to use your code (not your model weights) to fine-tune on a different video size; the snippet below shows the parameter I mean.
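For reference, the parameter in question as it appears in the eval configs quoted elsewhere on this page; whether the hard-coded 224s in the code override it is exactly what is being asked:

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224   # the .yaml parameter that several files reportedly fix at 224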
Video-LLaMA/video_llama/datasets/datasets/video_instruct_dataset.py
Lines 250 to 252 in 1728e14
I was reading _mask_targets(). I guess this function uses a mask to ignore the loss from the instruction text, but why do you manually keep [curr_idx: curr_idx+2], which is the "###" before the actual instruction text?
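For readers unfamiliar with the pattern, here is a generic sketch of this kind of target masking; it illustrates the technique, not the repo's exact code, and assumes IGNORE_INDEX = -100, the label value PyTorch's cross-entropy ignores by default:

import torch

IGNORE_INDEX = -100  # labels with this value contribute no loss

def mask_targets(targets: torch.Tensor, instruction_len: int) -> torch.Tensor:
    # Supervise only the response tokens: everything up to the end of the
    # instruction is masked so the model is not trained to reproduce it.
    targets = targets.clone()
    targets[:instruction_len] = IGNORE_INDEX
    return targets

The question above is about the boundary of that mask: the repo apparently leaves the two tokens of the "###" separator unmasked.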
Thanks for your great work. Please fix the IndentationError in train_configs/video_llama_stage2_finetune.yaml, lol.
I notice that in your work the required transformers version is 4.28.0, which, if I'm not mistaken, had a bug with the LLaMA tokenizer. Have you tested higher versions of transformers? If I use, for example, version 4.29, will this affect the converted model?
Same as the title.
Thanks for your great work.
I really wonder how it works.
Although there is a Paper button on the README, it leads to a Not Found error.
Could you provide the Video-LLaMA paper, please?
Thanks.
Kind regards.
Initializing Chat
Loading VIT
Loading VIT Done
Loading Q-Former
Traceback (most recent call last):
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
response.raise_for_status()
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/pretrain_vicuna7b-v2.pth/resolve/main/tokenizer.model
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/utils/hub.py", line 409, in cached_file
resolved_file = hf_hub_download(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1166, in hf_hub_download
metadata = get_hf_file_metadata(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)
Traceback (most recent call last):
File "demo_audiovideo.py", line 66, in
model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))
File "/home/akilliceviribilisim/Video-LLaMA/video_llama/models/video_llama.py", line 567, in from_config
model = cls(
File "/home/akilliceviribilisim/Video-LLaMA/video_llama/models/video_llama.py", line 120, in init
self.llama_tokenizer = LlamaTokenizer.from_pretrained(llama_model, use_fast=False)
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1770, in from_pretrained
resolved_vocab_files[file_id] = cached_file(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/utils/hub.py", line 424, in cached_file
raise EnvironmentError(
OSError: pretrain_vicuna7b-v2.pth is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True.
Can you help me figure out how to solve this?
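A hedged reading of the traceback: LlamaTokenizer.from_pretrained() is being handed pretrain_vicuna7b-v2.pth, which suggests llama_model in the eval config points at the Video-LLaMA checkpoint instead of a Hugging Face-format Vicuna directory. Judging by the configs quoted elsewhere on this page, the two settings are separate (paths below are placeholders):

model:
  llama_model: "path/to/vicuna-7b"           # HF-format LLM directory (this is where tokenizer.model lives)
  ckpt: "path/to/pretrain_vicuna7b-v2.pth"   # Video-LLaMA visual-branch checkpoint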
Based on my understanding, you use only 8 frames sampled uniformly across the video. I have the following questions:
I hope you can help me understand your work better so that I can use it on my own data.
Thanks for your great work! I have two questions I'd like to ask:
def forward(self, samples):
    if 'conv_type' in samples.keys() and samples['conv_type'] == 'multi':
        im_patch_token_id = self.IMAGE_PATCH_TOKEN_ID
        image = samples["images"]
        input_ids = samples['input_ids']
        if len(image.size()) == 4:
            # a single image: add a dummy time dimension so it looks like a 1-frame video
            time = 1
            image = einops.repeat(image, 'b c h w -> b c t h w', t=time)
        if self.train_flag == 0:
            # visual branch: encode the frames with the video Q-Former
            num_patch_tokens = self.num_video_query_token
            img_embeds, atts_img = self.encode_videoQformer_visual(image)
        elif self.train_flag == 1:
            # audio branch: route the frames through ImageBind + the audio Q-Former
            num_patch_tokens = self.num_audio_query_token
            image = einops.rearrange(image, 'b c t h w -> b t c h w')
            img_embeds, atts_img = self.encode_audioQformer(image, modality_type=ModalityType.VISION)

        # embed the text tokens, zeroing out the placeholder patch-token ids first
        temp_input_ids = copy.deepcopy(input_ids)
        temp_input_ids[temp_input_ids == im_patch_token_id] = 0
        temp_input_embedding = self.llama_model.model.embed_tokens(temp_input_ids)

        # splice the query embeddings into each sequence in place of its run of patch tokens
        new_input_embeds = []
        cur_image_idx = 0
        for cur_input_ids, cur_input_embeds in zip(input_ids, temp_input_embedding):
            cur_image_features = img_embeds[cur_image_idx]
            if (cur_input_ids == im_patch_token_id).sum() != num_patch_tokens:
                raise ValueError("The number of image patch tokens should be the same as the number of image patches.")
            masked_indices = torch.where(cur_input_ids == im_patch_token_id)[0]
            mask_index_start = masked_indices[0]
            if (masked_indices != torch.arange(mask_index_start, mask_index_start + num_patch_tokens, device=masked_indices.device, dtype=masked_indices.dtype)).any():
                raise ValueError("The image patch tokens should be consecutive.")
            cur_new_input_embeds = torch.cat((cur_input_embeds[:mask_index_start], cur_image_features, cur_input_embeds[mask_index_start + num_patch_tokens:]), dim=0)
            new_input_embeds.append(cur_new_input_embeds)
            cur_image_idx += 1
        inputs_embeds = torch.stack(new_input_embeds, dim=0)

        targets = samples['labels']
        attention_mask = samples['attention_mask']
        with self.maybe_autocast():
            outputs = self.llama_model(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                return_dict=True,
                labels=targets,
            )
            loss = outputs.loss
        return {"loss": loss}
The checkpoint I get after training with video_llama_stage1_pretrain.yaml is far smaller than your public checkpoint.
The size of your public checkpoint (finetune-vicuna7b-v2.pth) is 254 MB, but the one I got is only around 37 MB.
Are you using a different training config than the one in video_llama_stage1_pretrain.yaml?
Can you share your training config? I wanted to fine-tune by loading the checkpoint from finetune-vicuna7b-v2.pth, but I get mismatch errors.
I understand that the output of the audio model is the input required by the 7B model, not the 13B model. So can the audio branch use the 7B model while the LLM is the 13B model? If not, what's the point of releasing vicuna13b-v2?
https://github.com/DAMO-NLP-SG/Video-LLaMA/blob/main/video_llama/processors/blip_processors.py#L49C1-L49C1
This preprocessing is strange: it strips periods and exclamation marks, lowercases everything, and also applies a max_word cap that truncates the training text to 50 words. I spent a long time tracking down why model outputs were truncated before discovering the problem was here. I'm not sure what the reason for this design is.
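For context, a paraphrased sketch of the BLIP-style caption cleaning being discussed; this is an approximation of the linked code's behavior, not a verbatim copy:

import re

def pre_caption(caption: str, max_words: int = 50) -> str:
    # lowercase and strip sentence punctuation (including '.' and '!')
    caption = re.sub(r'[.!"()*#:;~]', ' ', caption.lower())
    caption = re.sub(r'\s{2,}', ' ', caption).strip()
    # hard-truncate to max_words, the source of the clipped outputs described above
    words = caption.split(' ')
    if len(words) > max_words:
        caption = ' '.join(words[:max_words])
    return caption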
Video-LLaMA/video_llama/models/video_llama.py
Line 146 in a9f3237
There is an error here: model should be self.llama_proj.
The import "from video_llama.processors.video_processor import AlproVideoTrainProcessor" should be added to video_llama/datasets/datasets/llava_instruct_dataset.py.
Llama 2 buzz
The model loads without any problem; the model parameters are as follows:
model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False
  frozen_llama_proj: False
  llama_model: "vicuna-7b-delta-v0"
  imagebind_ckpt_path: "imagebind_huge.pth"
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  ckpt: "finetune-vicuna7b-v2.pth"
  ckpt_2: "finetune_vicuna7b_audiobranch.pth"

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain
I have downloaded LLaVA-CC3M-Pretrain-595K but can't find filter_cap.json. How can I get filter_cap.json?
Thanks for sharing the code and this brilliant work! I wonder if I can instruction-finetune Video-LLaMA on 8 × 24 GB Nvidia A5000 GPUs using my own video datasets?
Have you ever tried to complete sentiment analysis or emotion recognition tasks in videos? How does the model perform?
I would like to inquire about the max_epoch values set for pre-training and fine-tuning. Are they set to 5 and 3, respectively, as in the config files? I tried running the fine-tuning code with these values, but the results were not as coherent as the fine-tuned models provided by the official repository, so I would like to confirm whether these are the correct values.
Thank you for your help and for sharing your work with the community.
Hi, I run the demo with
python demo_audiovideo.py --cfg-path eval_configs/video_llama_eval_withaudio.yaml --gpu-id 0
It launches successfully, but the input field is stuck and I cannot type anything, and the 'Video-LLaMA' section of the interface is empty. I wonder what a possible solution might be.
The following is how I filled in video_llama_eval_withaudio.yaml:
model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False
  frozen_llama_proj: False
  # models/vicuna-7b-v1 uses llama-7b-hf + vicuna-7b-delta-v0
  llama_model: "models/vicuna-7b-v1"
  imagebind_ckpt_path: "models/imageBind/"
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  ckpt: "models/Video-LLaMA-Series/finetune-vicuna7b-v2.pth"               # path/visual_branch_ckpt/  # pretrain: miniGPT4
  ckpt_2: "models/Video-LLaMA-Series/finetune_vicuna7b_audiobranch.pth"    # path/audio_branch_ckpt/   # pretrain: ImageBind

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain
Great project!
I would like to ask three questions:
1. Does your public checkpoint include the parameters of the 2-layer Q-Former and the linear projection layer?
2. Seeing that freeze_qformer is set to True in your stage-1 and stage-2 yaml files, is it because you froze the parameters of the Q-Former and only fine-tuned llama_proj? But in your model diagram, it looks like the Q-Former parameters were fine-tuned.
3. Is the number of fine-tuned parameters the same in pre-training stage 1 and fine-tuning stage 2?
Thank you very much~
I'm also training this... I haven't downloaded webvid2.5m yet and then I found that you have done everything I want to do, hahahaha
Hi.
Thanks for your great work.
I'm a new researcher, and I'm interested in the video understanding field.
I read your paper, and I wonder which conference you submitted it to?
The format looks like NeurIPS.
Thanks.
Kind regards.
If I want to use the demo, what should I write in video_llama_eval_withaudio.yaml for:
ckpt: ?
ckpt_2: ?
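For reference, another config quoted on this page fills these two fields with the visual-branch and audio-branch checkpoints, respectively (adjust the paths to wherever you downloaded them):

ckpt: "models/Video-LLaMA-Series/finetune-vicuna7b-v2.pth"               # visual branch
ckpt_2: "models/Video-LLaMA-Series/finetune_vicuna7b_audiobranch.pth"    # audio branch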
May I know how much data was used in the fine-tuning stage? How much data was used from MiniGPT-4, LLaVA, and VideoChat, respectively?
By the way, I would like to share my WeChat ID: suozhang717. I hope the author and fellow researchers will add me.
As the title says. I need to get this running locally, and the docs say a few configuration items need to be modified, but I have questions about several of them; please help me out. Thanks!
The config file flags four items to change:

# Should this be the path to the merged vicuna-13b weights,
# i.e. the files in the target directory produced by running the apply_delta.py script?
llama_model: "ckpt/vicuna-13b/" or "ckpt/vicuna-7b/"
# I couldn't find a corresponding file for this one. I assume a model file needs to be
# downloaded, but I couldn't find the link in the docs. How should I handle it?
imagebind_ckpt_path: "ckpt/imagebind_path/"
# Is this the model file listed in the Vision-Language Branch table?
# If so, which entry should I use? For Chinese support, should I use
# finetune-ziya13b-zh directly, or pretrain-ziya13b-zh?
ckpt: path/visual_branch_ckpt/
# If I don't need audio, can I leave this unset, or is it required?
ckpt_2: path/audio_branch_ckpt/

Please help answer the questions in the comments above. Thanks again!
Hi! It's interesting to build a chatbot for video understanding!
In our project "VideoChat🦜: Chat-Centric Video Understanding", we also build an interesting chatbot🤖 for both image and video. More importantly, we have released 11K video instruction data for spatiotemporal reasoning. You can also utilize them to enhance your project 💪🏻!
If you have tried it, don't forget to give me feedback. We will improve the data in the future.
Project: https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat
Paper: https://arxiv.org/abs/2305.06355
I tuned the demo file and ran it, but I got terrible results, not as good as the Hugging Face demo, and the model also repeats the question at the end of each answer. I don't know what caused it.
The following is my configuration file; is there any problem with it?
model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False
  frozen_llama_proj: False
  llama_model: "vicuna-7b/"
  imagebind_ckpt_path: "imagebind/"
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  ckpt: "finetune-vicuna7b-v2.pth"
  ckpt_2: "finetune_vicuna7b_audiobranch.pth"

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain