damo-nlp-sg / video-llama
[EMNLP 2023 Demo] Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
License: BSD 3-Clause "New" or "Revised" License
Hi, thanks for your brilliant work. I want to fine-tune the model on my own dataset. Can it be done with 4 × RTX 3090s? How much GPU memory is necessary?
Pip subprocess error:
ERROR: Could not find a version that satisfies the requirement pytorch==1.12.1 (from versions: 0.1.2, 1.0.2)
ERROR: No matching distribution found for pytorch==1.12.1
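A note for anyone hitting this: on PyPI the package is named torch, not pytorch (the only versions pip finds for "pytorch", 0.1.2 and 1.0.2, belong to an unrelated placeholder package). Assuming the pin lives in a pip requirements list or the pip section of environment.yml, renaming it should clear the error:

torch==1.12.1   # instead of: pytorch==1.12.1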
I just installed "Video-LLaMA finetune vicuna7B". When running app.py I get KeyError: 'API_TOKEN' when I reach the Loading Q-Former stage. I can see that video_llama.py looks for os.environ['API_TOKEN'], but where do I go to get an API token? Is this a Hugging Face token, or possibly a LLaMA token? And what is the best way to save the token so that os.environ['API_TOKEN'] can find it? Thank you!
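A minimal sketch of how to make that lookup succeed, assuming it is a Hugging Face access token (an assumption; the 401 traceback from huggingface.co further down this page points that way, and the token value here is a placeholder):

import os

# Either export it in the shell before launching:
#     export API_TOKEN=hf_xxxxxxxx
# or set it in Python before the model code reads it:
os.environ["API_TOKEN"] = "hf_xxxxxxxx"  # placeholder Hugging Face token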
is it necessary?
I want to load a Ziya-based Video-LLaMA model to run a video captioning task. Do you currently provide a similar inference script, or should I just modify demo.py directly?
Hi, I want to generate instruction data on my dataset with GPT-4, but I don't know how to write the code. I also noticed that OpenAI imposes a rate limit, so I'd appreciate some suggestions or help from you~~~
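A minimal sketch of such a generation script, using the openai Python client (v1 API) with exponential backoff for the rate limit; the prompt, model name, and retry settings are illustrative assumptions, not the authors' actual pipeline:

import time

import openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate_instruction(caption: str, retries: int = 5) -> str:
    # Turn one video caption into an instruction-answer pair via GPT-4,
    # backing off exponentially whenever the rate limit is hit.
    for attempt in range(retries):
        try:
            resp = client.chat.completions.create(
                model="gpt-4",
                messages=[{
                    "role": "user",
                    "content": f"Write one instruction-answer pair about this video: {caption}",
                }],
            )
            return resp.choices[0].message.content
        except openai.RateLimitError:
            time.sleep(2 ** attempt)
    raise RuntimeError("rate limit: all retries exhausted")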
First of all - Amazing work on this one.
I'm getting a bit lost in the repo; may I request a simple few-line script that does something like the following:
model = CLIPViP("pretrain_clipvip_base_32.pt")
text_features = model.encode_text("This is a very cute cat")
video_features = model.encode_video("vid_file.mp4")
cosine(text_features, video_features)
[Extra] Preferably, I'd like to get the video features for a batch of mp4 files of different lengths.
Thank you
Since Video-LLaMA still uses fp16, I think it would be better to integrate a lossless quantization method such as SqueezeLLM, which already supports Vicuna 7B and 13B.
Thanks for the great work! One small question: what is the advantage of your proposed Video-LLaMA over VideoChat?
Same as the title.
Great work!
I noticed the image size is fixed to 224 × 224 in a few files despite being a parameter in the .yaml files. I also wonder whether any video size would work (if someone wants to use your code to fine-tune on their own data) if we just change the image-size parameter, considering some of the pre-trained models may work better with that fixed image size. I'd appreciate it if you could explain the best approach if I want to use your code (not your model weights) to fine-tune on a different video size; the snippet below shows the parameter I mean.
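For reference, the parameter in question as it appears in the eval configs quoted elsewhere on this page; whether the hard-coded 224s in the code override it is exactly what is being asked:

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224   # the .yaml parameter that several files reportedly fix at 224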
Video-LLaMA/video_llama/datasets/datasets/video_instruct_dataset.py
Lines 250 to 252 in 1728e14
I was reading _mask_targets(). I guess this function uses a mask to ignore the loss from the instruction text, but why do you manually keep [curr_idx: curr_idx+2], which is the "###" before the actual instruction text?
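For readers unfamiliar with the pattern, here is a generic sketch of this kind of target masking; it illustrates the technique, not the repo's exact code, and assumes IGNORE_INDEX = -100, the label value PyTorch's cross-entropy ignores by default:

import torch

IGNORE_INDEX = -100  # labels with this value contribute no loss

def mask_targets(targets: torch.Tensor, instruction_len: int) -> torch.Tensor:
    # Supervise only the response tokens: everything up to the end of the
    # instruction is masked so the model is not trained to reproduce it.
    targets = targets.clone()
    targets[:instruction_len] = IGNORE_INDEX
    return targets

The question above is about the boundary of that mask: the repo apparently leaves the two tokens of the "###" separator unmasked.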
Thanks for your great work. Please fix the IndentationError in train_configs/video_llama_stage2_finetune.yaml, lol.
I notice that in your work the required transformers version is 4.28.0, which, if I'm not mistaken, had a bug with the LLaMA tokenizer. Have you tested higher versions of transformers? If I use, for example, version 4.29, will this affect the converted model?
Same as the title.
Thanks for your great work.
I really wonder how it works.
Although there is a Paper button on the README, it leads to a Not Found error.
Could you provide the Video-LLaMA paper, please?
Thanks.
Kind regards.
Initializing Chat
Loading VIT
Loading VIT Done
Loading Q-Former
Traceback (most recent call last):
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_errors.py", line 259, in hf_raise_for_status
response.raise_for_status()
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/requests/models.py", line 1021, in raise_for_status
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/pretrain_vicuna7b-v2.pth/resolve/main/tokenizer.model
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/utils/hub.py", line 409, in cached_file
resolved_file = hf_hub_download(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1166, in hf_hub_download
metadata = get_hf_file_metadata(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
return fn(*args, **kwargs)
Traceback (most recent call last):
File "demo_audiovideo.py", line 66, in
model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))
File "/home/akilliceviribilisim/Video-LLaMA/video_llama/models/video_llama.py", line 567, in from_config
model = cls(
File "/home/akilliceviribilisim/Video-LLaMA/video_llama/models/video_llama.py", line 120, in init
self.llama_tokenizer = LlamaTokenizer.from_pretrained(llama_model, use_fast=False)
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1770, in from_pretrained
resolved_vocab_files[file_id] = cached_file(
File "/home/akilliceviribilisim/Video-LLaMA/venv/lib/python3.8/site-packages/transformers/utils/hub.py", line 424, in cached_file
raise EnvironmentError(
OSError: pretrain_vicuna7b-v2.pth is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True.
Can you help me figure out how to solve this?
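A hedged reading of the traceback: LlamaTokenizer.from_pretrained() is being handed pretrain_vicuna7b-v2.pth, which suggests llama_model in the eval config points at the Video-LLaMA checkpoint instead of a Hugging Face-format Vicuna directory. Judging by the configs quoted elsewhere on this page, the two settings are separate (paths below are placeholders):

model:
  llama_model: "path/to/vicuna-7b"           # HF-format LLM directory (this is where tokenizer.model lives)
  ckpt: "path/to/pretrain_vicuna7b-v2.pth"   # Video-LLaMA visual-branch checkpoint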
Based on my understanding, you use only 8 frames sampled uniformly across the video. I have the following questions:
I hope you can help me understand your work better so that I can use it on my own data.
Thanks for your great work! I have two questions I'd like to ask:
def forward(self, samples):
    if 'conv_type' in samples.keys() and samples['conv_type'] == 'multi':
        im_patch_token_id = self.IMAGE_PATCH_TOKEN_ID
        image = samples["images"]
        input_ids = samples['input_ids']
        if len(image.size()) == 4:
            # a single image: add a dummy time dimension so it looks like a 1-frame video
            time = 1
            image = einops.repeat(image, 'b c h w -> b c t h w', t=time)
        if self.train_flag == 0:
            # visual branch: encode the frames with the video Q-Former
            num_patch_tokens = self.num_video_query_token
            img_embeds, atts_img = self.encode_videoQformer_visual(image)
        elif self.train_flag == 1:
            # audio branch: route the frames through ImageBind + the audio Q-Former
            num_patch_tokens = self.num_audio_query_token
            image = einops.rearrange(image, 'b c t h w -> b t c h w')
            img_embeds, atts_img = self.encode_audioQformer(image, modality_type=ModalityType.VISION)

        # embed the text tokens, zeroing out the placeholder patch-token ids first
        temp_input_ids = copy.deepcopy(input_ids)
        temp_input_ids[temp_input_ids == im_patch_token_id] = 0
        temp_input_embedding = self.llama_model.model.embed_tokens(temp_input_ids)

        # splice the query embeddings into each sequence in place of its run of patch tokens
        new_input_embeds = []
        cur_image_idx = 0
        for cur_input_ids, cur_input_embeds in zip(input_ids, temp_input_embedding):
            cur_image_features = img_embeds[cur_image_idx]
            if (cur_input_ids == im_patch_token_id).sum() != num_patch_tokens:
                raise ValueError("The number of image patch tokens should be the same as the number of image patches.")
            masked_indices = torch.where(cur_input_ids == im_patch_token_id)[0]
            mask_index_start = masked_indices[0]
            if (masked_indices != torch.arange(mask_index_start, mask_index_start + num_patch_tokens, device=masked_indices.device, dtype=masked_indices.dtype)).any():
                raise ValueError("The image patch tokens should be consecutive.")
            cur_new_input_embeds = torch.cat((cur_input_embeds[:mask_index_start], cur_image_features, cur_input_embeds[mask_index_start + num_patch_tokens:]), dim=0)
            new_input_embeds.append(cur_new_input_embeds)
            cur_image_idx += 1
        inputs_embeds = torch.stack(new_input_embeds, dim=0)

        targets = samples['labels']
        attention_mask = samples['attention_mask']
        with self.maybe_autocast():
            outputs = self.llama_model(
                inputs_embeds=inputs_embeds,
                attention_mask=attention_mask,
                return_dict=True,
                labels=targets,
            )
            loss = outputs.loss
        return {"loss": loss}
The checkpoint I get after training with video_llama_stage1_pretrain.yaml is far smaller than your public checkpoint.
The size of your public checkpoint (finetune-vicuna7b-v2.pth) is 254 MB, but the one I got is only around 37 MB.
Are you using a different training config than the one in video_llama_stage1_pretrain.yaml?
Can you share your training config? I wanted to fine-tune by loading the checkpoint from finetune-vicuna7b-v2.pth, but I get mismatch errors.
I understand that the output of the audio model is the input required by the 7B model, not the 13B model. So can the audio branch use the 7B model while the LLM is the 13B model? If not, what's the point of releasing vicuna13b-v2?
https://github.com/DAMO-NLP-SG/Video-LLaMA/blob/main/video_llama/processors/blip_processors.py#L49C1-L49C1
This preprocessing is strange: it strips periods and exclamation marks, lowercases everything, and also applies a max_word cap that truncates the training text to 50 words. I spent a long time tracking down why model outputs were truncated before discovering the problem was here. I'm not sure what the reason for this design is.
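For context, a paraphrased sketch of the BLIP-style caption cleaning being discussed; this is an approximation of the linked code's behavior, not a verbatim copy:

import re

def pre_caption(caption: str, max_words: int = 50) -> str:
    # lowercase and strip sentence punctuation (including '.' and '!')
    caption = re.sub(r'[.!"()*#:;~]', ' ', caption.lower())
    caption = re.sub(r'\s{2,}', ' ', caption).strip()
    # hard-truncate to max_words, the source of the clipped outputs described above
    words = caption.split(' ')
    if len(words) > max_words:
        caption = ' '.join(words[:max_words])
    return caption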
Video-LLaMA/video_llama/models/video_llama.py
Line 146 in a9f3237
There is an error here: model should be self.llama_proj.
The import "from video_llama.processors.video_processor import AlproVideoTrainProcessor" should be added to video_llama/datasets/datasets/llava_instruct_dataset.py.
Llama 2 buzz
The model loads without any problem; the model parameters are as follows:
model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False
  frozen_llama_proj: False
  llama_model: "vicuna-7b-delta-v0"
  imagebind_ckpt_path: "imagebind_huge.pth"
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  ckpt: "finetune-vicuna7b-v2.pth"
  ckpt_2: "finetune_vicuna7b_audiobranch.pth"

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain
I have downloaded LLaVA-CC3M-Pretrain-595K but can't find filter_cap.json. How can I get filter_cap.json?
Thanks for sharing the code and this brilliant work! I wonder if I can instruction-finetune Video-LLaMA on 8 × 24 GB Nvidia A5000 GPUs using my own video datasets?
Have you ever tried to complete sentiment analysis or emotion recognition tasks in videos? How does the model perform?
I would like to inquire about the max_epoch values set for pre-training and fine-tuning. Are they set to 5 and 3, respectively, as in the config files? I tried running the fine-tuning code with these values, but the results were not as coherent as the fine-tuned models provided by the official repository, so I would like to confirm whether these are the correct values.
Thank you for your help and for sharing your work with the community.
Hi, I run the demo with
python demo_audiovideo.py --cfg-path eval_configs/video_llama_eval_withaudio.yaml --gpu-id 0
It launches successfully, but the input field is stuck and I cannot type anything, and the 'Video-LLaMA' section of the interface is empty. I wonder what a possible solution might be.
The following is how I filled in video_llama_eval_withaudio.yaml:
model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False
  frozen_llama_proj: False
  # models/vicuna-7b-v1 uses llama-7b-hf + vicuna-7b-delta-v0
  llama_model: "models/vicuna-7b-v1"
  imagebind_ckpt_path: "models/imageBind/"
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  ckpt: "models/Video-LLaMA-Series/finetune-vicuna7b-v2.pth"               # path/visual_branch_ckpt/  # pretrain: miniGPT4
  ckpt_2: "models/Video-LLaMA-Series/finetune_vicuna7b_audiobranch.pth"    # path/audio_branch_ckpt/   # pretrain: ImageBind

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain
Great project!
I would like to ask three questions:
1. Does your public checkpoint include the parameters of the 2-layer Q-Former and the linear projection layer?
2. Seeing that freeze_qformer is set to True in your stage-1 and stage-2 yaml files, is it because you froze the parameters of the Q-Former and only fine-tuned llama_proj? But in your model diagram, it looks like the Q-Former parameters were fine-tuned.
3. Is the number of fine-tuned parameters the same in pre-training stage 1 and fine-tuning stage 2?
Thank you very much~
I'm also training this... I haven't downloaded webvid2.5m yet and then I found that you have done everything I want to do, hahahaha
Hi.
Thanks for your great work.
I'm a new researcher, and I'm interested in the video understanding field.
I read your paper, and I wonder which conference you submitted it to?
The format looks like NeurIPS.
Thanks.
Kind regards.
If I want to use the demo, what should I write in video_llama_eval_withaudio.yaml for:
ckpt: ?
ckpt_2: ?
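For reference, another config quoted on this page fills these two fields with the visual-branch and audio-branch checkpoints, respectively (adjust the paths to wherever you downloaded them):

ckpt: "models/Video-LLaMA-Series/finetune-vicuna7b-v2.pth"               # visual branch
ckpt_2: "models/Video-LLaMA-Series/finetune_vicuna7b_audiobranch.pth"    # audio branch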
May I know how much data was used in the fine-tuning stage? How much data was used from MiniGPT-4, LLaVA, and VideoChat, respectively?
By the way, I would like to share my WeChat ID: suozhang717. I hope the author and fellow researchers will add me.
As the title says. I need to get this running locally, and the docs say a few configuration items need to be modified, but I have questions about several of them; please help me out. Thanks!
The config file flags four items to change:

# Should this be the path to the merged vicuna-13b weights,
# i.e. the files in the target directory produced by running the apply_delta.py script?
llama_model: "ckpt/vicuna-13b/" or "ckpt/vicuna-7b/"
# I couldn't find a corresponding file for this one. I assume a model file needs to be
# downloaded, but I couldn't find the link in the docs. How should I handle it?
imagebind_ckpt_path: "ckpt/imagebind_path/"
# Is this the model file listed in the Vision-Language Branch table?
# If so, which entry should I use? For Chinese support, should I use
# finetune-ziya13b-zh directly, or pretrain-ziya13b-zh?
ckpt: path/visual_branch_ckpt/
# If I don't need audio, can I leave this unset, or is it required?
ckpt_2: path/audio_branch_ckpt/

Please help answer the questions in the comments above. Thanks again!
Hi! It's interesting to build a chatbot for video understanding!
In our project "VideoChat🦜: Chat-Centric Video Understanding", we also build an interesting chatbot🤖 for both image and video. More importantly, we have released 11K video instruction data for spatiotemporal reasoning. You can also utilize them to enhance your project 💪🏻!
If you have tried it, don't forget to give me feedback. We will improve the data in the future.
Project: https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat
Paper: https://arxiv.org/abs/2305.06355
I tuned the demo file and ran it, but I got terrible results, not as good as the Hugging Face demo, and the model also repeats the question at the end of each answer. I don't know what caused it.
The following is my configuration file; is there any problem with it?
model:
  arch: video_llama
  model_type: pretrain_vicuna
  freeze_vit: True
  freeze_qformer: True
  max_txt_len: 512
  end_sym: "###"
  low_resource: False
  frozen_llama_proj: False
  llama_model: "vicuna-7b/"
  imagebind_ckpt_path: "imagebind/"
  fusion_head_layers: 2
  max_frame_pos: 32
  fusion_header_type: "seqTransf"
  ckpt: "finetune-vicuna7b-v2.pth"
  ckpt_2: "finetune_vicuna7b_audiobranch.pth"

datasets:
  webvid:
    vis_processor:
      train:
        name: "alpro_video_eval"
        n_frms: 8
        image_size: 224
    text_processor:
      train:
        name: "blip_caption"

run:
  task: video_text_pretrain