antoyang / frozenbilm

[NeurIPS 2022] Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Home Page: https://arxiv.org/abs/2206.08155

License: Apache License 2.0

frozenbilm's Introduction

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

FrozenBiLM is a new model for video question answering that builds on a frozen bidirectional language model. FrozenBiLM excels in settings without any manual annotation (zero-shot) or with limited training data (few-shot), while performing competitively when trained on standard datasets (fully-supervised).

This repository provides the code for our FrozenBiLM paper (NeurIPS 2022), including:

Environment setup
Data downloading instructions
Data preprocessing and visual feature extraction scripts, as well as preprocessed data and features
Pretrained checkpoints
Training and evaluation scripts for cross-modal training, downstream fully-supervised, few-shot and zero-shot VideoQA, including various baselines
VideoQA demo script

Setup

To install requirements, run:

conda create -n frozenbilm_env python=3.8 
conda activate frozenbilm_env
conda install pytorch==1.8.1 torchvision==0.9.1 cudatoolkit=11.1 -c pytorch -c nvidia 
pip install -r requirements.txt

You may fill the global paths in args.py.
To use a given text-pretrained language model, you should download the corresponding weights from the Hugging Face Hub and put them in TRANSFORMERS_CACHE.

Quick Start

Follow this section if you wish to start VideoQA training or inference quickly.

Download preprocessed data, visual features and checkpoints

To download pretrained checkpoints, pre-processed data, ASR and visual features, run:

bash download/download_checkpoints.sh <MODEL_DIR>
bash download/download_downstream.sh <DATA_DIR>

If you have issues with gshell, you can access the processed data here and the checkpoints here.
The downloads require about 8GB for the models and 12GB for the data.
Note that, due to storage limitations, most pretrained checkpoints contain only the updated parameters (and not the frozen parameters).
This means that, when using a provided checkpoint, you must make sure you have properly downloaded the Hugging Face weights for the language model of your choice.
For completeness, frozenbilm.pth, frozenbilm_bertbase_noadapter.pth and frozenbilm_bertlarge_noadapter.pth contain all parameters.
Also note that, due to storage constraints, we do not publicly host visual features for the WebVid10M dataset.
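Since most checkpoints hold only the updated parameters, using one amounts to overlaying it on the frozen language-model weights. The sketch below illustrates that idea with plain dictionaries standing in for state dicts. The key names and helper functions are hypothetical, and this is not the repository's loading code.

```python
def overlay_checkpoint(frozen_weights, checkpoint):
    """Return a full parameter dict: frozen weights overridden by checkpoint entries."""
    merged = dict(frozen_weights)
    merged.update(checkpoint)
    return merged


def missing_after_load(expected_keys, frozen_weights, checkpoint):
    """List parameters that neither source provides (a sign of an incomplete download)."""
    available = set(frozen_weights) | set(checkpoint)
    return sorted(set(expected_keys) - available)


# Hypothetical key names, for illustration only.
frozen = {"lm.embeddings": [0.1, 0.2], "lm.layer0.weight": [0.3]}
ckpt = {"adapter.layer0.weight": [0.5], "visual_proj.weight": [0.7]}
expected = list(frozen) + list(ckpt)

full = overlay_checkpoint(frozen, ckpt)
print(sorted(full))

# Simulate forgetting to download the frozen language-model weights.
still_missing = missing_after_load(expected, {}, ckpt)
print(still_missing)
```

In PyTorch, a partial checkpoint is commonly loaded with `load_state_dict(..., strict=False)`; whether the repository does exactly that is not stated here.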

Long Start

Data Downloading

**WebVid10M** Download the annotations and videos from [the dataset providers](https://m-bain.github.io/webvid-dataset/). The annotations should be in `<DATA_DIR>/WebVid`.

LSMDC-FiB Download the annotations and videos from the dataset providers. The annotations should be in <DATA_DIR>/LSMDC.

TGIF-FrameQA Download the annotations and GIFs from the dataset providers. The annotations should be in <DATA_DIR>/TGIF-QA.

How2QA Download the annotations and videos from the dataset providers. The annotations should be in <DATA_DIR>/How2QA.

TVQA Download the annotations and videos from the dataset providers. The annotations should be in <DATA_DIR>/TVQA.

For iVQA, MSRVTT-QA, MSVD-QA and ActivityNet-QA, we use the preprocessed files from Just Ask and download them in <DATA_DIR>/iVQA, <DATA_DIR>/MSRVTT-QA, <DATA_DIR>/MSVD-QA and <DATA_DIR>/ActivityNet-QA.

To download automatic speech subtitles, we use youtube-dl, except for LSMDC, How2QA and TVQA, for which the authors provide them. We then convert the vtt files of each dataset into a single pickle file subtitles.pkl. It contains a dictionary mapping each video_id to a dictionary with start, end and text keys, which describe the speech in that video.
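As a rough illustration of the conversion described above, the following sketch parses WebVTT text into a dictionary with start, end and text keys and pickles the result. It assumes that the three keys hold parallel lists and that times are stored in seconds; neither detail is confirmed here. The `video_0001` identifier and the helper names are hypothetical, and inline styling tags found in some auto-generated subtitles are not handled.

```python
import pickle
import re
import tempfile
from pathlib import Path

# Matches WebVTT cue timings such as "00:00:01.000 --> 00:00:04.500".
CUE_RE = re.compile(
    r"((?:\d+:)?\d{2}:\d{2}\.\d{3})\s+-->\s+((?:\d+:)?\d{2}:\d{2}\.\d{3})"
)


def to_seconds(stamp):
    """Convert "hh:mm:ss.mmm" or "mm:ss.mmm" to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def parse_vtt(vtt_text):
    """Return {"start": [...], "end": [...], "text": [...]} for one video."""
    subs = {"start": [], "end": [], "text": []}
    lines = vtt_text.splitlines()
    i = 0
    while i < len(lines):
        match = CUE_RE.match(lines[i].strip())
        i += 1
        if not match:
            continue
        # Collect the cue payload up to the next blank line.
        cue_lines = []
        while i < len(lines) and lines[i].strip():
            cue_lines.append(lines[i].strip())
            i += 1
        subs["start"].append(to_seconds(match.group(1)))
        subs["end"].append(to_seconds(match.group(2)))
        subs["text"].append(" ".join(cue_lines))
    return subs


example = """WEBVTT

00:00:01.000 --> 00:00:04.500
hello and welcome

00:01:04.500 --> 00:01:08.000
today we will cook
pasta
"""

subtitles = {"video_0001": parse_vtt(example)}  # hypothetical video_id

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = Path(tmp) / "subtitles.pkl"
    with open(pkl_path, "wb") as f:
        pickle.dump(subtitles, f)
    with open(pkl_path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded["video_0001"])
```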

Annotation Preprocessing

To preprocess annotations for the different datasets, run:

```
python preproc/preproc_webvid.py
python preproc/preproc_lsmdc.py
python preproc/preproc_tgifqa.py
python preproc/preproc_how2qa.py
python preproc/preproc_tvqa.py
```

iVQA, MSRVTT-QA, MSVD-QA, and ActivityNet-QA are already preprocessed (see Data Downloading instructions).

Visual Feature extraction

We provide in the `extract` folder the code to extract visual features from videos with CLIP ViT-L/14@224px. It requires downloading the pretrained weights available at [this repository](https://github.com/openai/CLIP).

Extraction You should prepare for each dataset a csv with columns video_path and feature_path. Then use (you may launch this script on multiple GPUs to speed up the extraction process):

python extract/extract.py --csv <csv_path>
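One way to prepare such a csv is sketched below. It walks a video folder and writes one row per video with the two columns named above. The `.npy` feature extension, the list of video extensions and the helper names are assumptions made for illustration, not details taken from the repository.

```python
import csv
import tempfile
from pathlib import Path

# Assumed set of video extensions; adjust to your dataset.
VIDEO_EXTENSIONS = (".mp4", ".webm", ".mkv", ".avi", ".gif")


def build_rows(video_dir, feature_dir):
    """Pair every video file with the path where its features should be saved."""
    rows = []
    for video in sorted(Path(video_dir).rglob("*")):
        if video.suffix.lower() in VIDEO_EXTENSIONS:
            # The ".npy" suffix is an assumption about the extractor's output format.
            feature = Path(feature_dir) / (video.stem + ".npy")
            rows.append({"video_path": str(video), "feature_path": str(feature)})
    return rows


def write_csv(rows, csv_path):
    """Write the rows with the header expected by the extraction script."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["video_path", "feature_path"])
        writer.writeheader()
        writer.writerows(rows)


# Small self-contained demonstration on a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    videos = Path(tmp) / "videos"
    videos.mkdir()
    (videos / "clip_a.mp4").touch()
    (videos / "notes.txt").touch()  # ignored: not a video
    rows = build_rows(videos, Path(tmp) / "features")
    csv_path = Path(tmp) / "extract.csv"
    write_csv(rows, csv_path)
    csv_text = csv_path.read_text()

print(csv_text)
```

The resulting file can then be passed to the extraction script through `--csv`.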

Merge files To merge the extracted features into a single file for each downstream dataset, use:

python extract/merge_features.py --folder <features_path> \
--output_path <DEFAULT_DATASET_DIR>/clipvitl14.pth --dataset <dataset>

For WebVid10M, you may keep the features in separate files (one per video), as the dataset is too big for the features to be stored in a single file.
Preferably, put these features on an SSD to speed up on-the-fly reading during training.

Available checkpoints

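| Training data | LSMDC | iVQA | MSRVTT-QA | MSVD-QA | ActivityNet-QA | TGIF-QA | How2QA | TVQA | url | size |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WebVid10M | 51.5 | 26.8 | 16.7 | 33.8 | 25.9 | 41.9 | 58.4 | 59.7 | Drive | 3.7GB (inc. frozen weights) |
| WebVid10M + LSMDC | 63.5 | | | | | | | | Drive | 114MB |
| WebVid10M + iVQA | | 39.6 | | | | | | | Drive | 114MB |
| WebVid10M + MSRVTT-QA | | | 47.0 | | | | | | Drive | 114MB |
| WebVid10M + MSVD-QA | | | | 54.8 | | | | | Drive | 114MB |
| WebVid10M + ActivityNet-QA | | | | | 43.2 | | | | Drive | 114MB |
| WebVid10M + TGIF-QA | | | | | | 68.6 | | | Drive | 114MB |
| WebVid10M + How2QA | | | | | | | 86.3 | | Drive | 114MB |
| WebVid10M + TVQA | | | | | | | | 82.0 | Drive | 114MB |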

Note that checkpoints finetuned on 10% or 1% of downstream datasets (few-shot setting) are also made accessible here.
Variants using a BERT-Base or BERT-Large language model (without adapters) instead of DeBERTa are also present in this folder.

Cross-modal training

FrozenBiLM

To train FrozenBiLM on WebVid10M, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env main.py \
--combine_datasets webvid --combine_datasets_val webvid --save_dir=trainwebvid \
--lr=3e-5 --ds_factor_ff=8 --ds_factor_attn=8 \
--batch_size=16 --batch_size_val=16 --epochs=2

Baselines

Based on the previous command:

- Variant without adapters: Pass `--lr=3e-4 --ds_factor_ff=0 --ds_factor_attn=0`
- UnFrozenBiLM variant: Pass `--lr=1e-5 --ft_lm --ft_mlm --ds_factor_ff=0 --ds_factor_attn=0 --batch_size=8`
- UnFrozenBiLM variant with no language initialization: Pass `--lr=1e-5 --ft_lm --ft_mlm --scratch --ds_factor_ff=0 --ds_factor_attn=0 --batch_size=8`
- Other language models: Pass `--model_name=bert-large-uncased` or `--model_name=bert-base-uncased` to use BERT-Large or BERT-Base instead of DeBERTa-V2-XLarge
- Train on a subpart of WebVid10M: Sample a random subpart of the train dataframe file and change the `--webvid_train_csv_path`. The random subsets used in the paper will be released soon.
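The baseline variants above differ from the main command only in a few flags, so they can be expressed as overrides of a base configuration. The sketch below assembles the launch command for a given variant. The dictionary names and the `build_command` helper are hypothetical conveniences; only the flags themselves come from the commands in this README.

```python
# Flags of the FrozenBiLM cross-modal training command given above.
BASE_FLAGS = {
    "--combine_datasets": "webvid",
    "--combine_datasets_val": "webvid",
    "--save_dir": "trainwebvid",
    "--lr": "3e-5",
    "--ds_factor_ff": "8",
    "--ds_factor_attn": "8",
    "--batch_size": "16",
    "--batch_size_val": "16",
    "--epochs": "2",
}

# Per-variant overrides; a value of None marks a boolean switch.
VARIANTS = {
    "frozenbilm": {},
    "no_adapters": {"--lr": "3e-4", "--ds_factor_ff": "0", "--ds_factor_attn": "0"},
    "unfrozenbilm": {
        "--lr": "1e-5", "--ft_lm": None, "--ft_mlm": None,
        "--ds_factor_ff": "0", "--ds_factor_attn": "0", "--batch_size": "8",
    },
    "unfrozenbilm_scratch": {
        "--lr": "1e-5", "--ft_lm": None, "--ft_mlm": None, "--scratch": None,
        "--ds_factor_ff": "0", "--ds_factor_attn": "0", "--batch_size": "8",
    },
}


def build_command(variant, nproc=8):
    """Return the argument list for launching main.py with the chosen variant."""
    flags = {**BASE_FLAGS, **VARIANTS[variant]}
    cmd = ["python", "-m", "torch.distributed.launch",
           "--nproc_per_node", str(nproc), "--use_env", "main.py"]
    for name, value in flags.items():
        cmd.append(name)
        if value is not None:
            cmd.append(value)
    return cmd


scratch_cmd = build_command("unfrozenbilm_scratch")
print(" ".join(scratch_cmd))
```

The returned list can be passed to `subprocess.run` or joined into a shell command.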

Autoregressive variants

To train the GPT-J-6B-based autoregressive variant on WebVid10M, run:

```
python -m torch.distributed.launch --nproc_per_node 8 --use_env main_ar.py \
--combine_datasets webvid --combine_datasets_val webvid --save_dir=trainarwebvid \
--lr=3e-4 --model_name=gpt-j-6b \
--batch_size=4 --batch_size_val=4 --epochs=2
```

Other language models: Pass `--model_name=gpt-neo-1p3b --batch_size=16 --batch_size_val=16` or `--model_name=gpt-neo-2p7b --batch_size=8 --batch_size_val=8` to use GPT-Neo-1.3B or GPT-Neo-2.7B instead of GPT-J-6B.

Zero-shot VideoQA

Fill-in-the-blank and open-ended VideoQA

FrozenBiLM

To evaluate the cross-modal trained FrozenBiLM on LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA or TGIF-QA FrameQA, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env videoqa.py --test --eval \
--combine_datasets <dataset> --combine_datasets_val <dataset> --save_dir=zs<dataset> \
--ds_factor_ff=8 --ds_factor_attn=8 --suffix="." \
--batch_size_val=32 --max_tokens=256 --load=<CKPT_PATH> --<dataset>_vocab_path=$DATA_DIR/<dataset>/vocab1000.json

Baselines

Based on the previous command:

- Variant without adapters: Pass `--ds_factor_ff=0 --ds_factor_attn=0`
- UnFrozenBiLM variant: Pass `--ft_lm --ft_mlm --ds_factor_ff=0 --ds_factor_attn=0`
- UnFrozenBiLM variant with no language initialization: Pass `--ft_lm --ft_mlm --scratch --ds_factor_ff=0 --ds_factor_attn=0`
- Other language models: Pass `--model_name=bert-large-uncased` or `--model_name=bert-base-uncased` to use BERT-Large or BERT-Base instead of DeBERTa-V2-XLarge
- Text-only: Pass `--no_video` and no `--load`
- No speech: Pass `--no_context` to remove the speech
- No suffix: Pass `--no_context` and no `--suffix` argument

Autoregressive variants

To evaluate the cross-modal trained GPT-J-6B-based autoregressive variant on iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA or TGIF-QA FrameQA, run:

```
python -m torch.distributed.launch --nproc_per_node 8 --use_env videoqa_ar.py --test --eval \
--combine_datasets <dataset> --combine_datasets_val <dataset> --save_dir=zsar<dataset> \
--model_name=gpt-j-6b --batch_size_val=8 --max_tokens=256 --load=<CKPT_PATH>
```

Other language models: Pass `--model_name=gpt-neo-1p3b --batch_size_val=32` or `--model_name=gpt-neo-2p7b --batch_size_val=16` to use GPT-Neo-1.3B or GPT-Neo-2.7B instead of GPT-J-6B.

CLIP baseline

To run the CLIP baseline on LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA or TGIF-QA FrameQA, run:

```
python -m torch.distributed.launch --nproc_per_node 8 --use_env videoqa_clip.py --test --eval \
--combine_datasets <dataset> --combine_datasets_val <dataset> --save_dir=zsclip<dataset> \
--batch_size_val=16 --max_feats=1
```

Multiple-choice VideoQA

FrozenBiLM

To evaluate the cross-modal trained FrozenBiLM on How2QA or TVQA, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env mc.py --eval \
--combine_datasets <dataset> --combine_datasets_val <dataset> --save_dir=zs<dataset> \
--ds_factor_ff=8 --ds_factor_attn=8 --suffix="." \
--batch_size_val=32 --max_tokens=512 --load=<CKPT_PATH>

Baselines

Based on the previous command:

- Variant without adapters: Pass `--ds_factor_ff=0 --ds_factor_attn=0`
- UnFrozenBiLM variant: Pass `--ft_lm --ft_mlm --ds_factor_ff=0 --ds_factor_attn=0`
- UnFrozenBiLM variant with no language initialization: Pass `--ft_lm --ft_mlm --scratch --ds_factor_ff=0 --ds_factor_attn=0`
- Other language models: Pass `--model_name=bert-large-uncased` or `--model_name=bert-base-uncased` to use BERT-Large or BERT-Base instead of DeBERTa-V2-XLarge
- Text-only: Pass `--no_video` and no `--load`
- No speech: Pass `--no_context` to remove the speech

CLIP baseline

To run the CLIP baseline on How2QA or TVQA, run:

```
python -m torch.distributed.launch --nproc_per_node 8 --use_env mc_clip.py --test --eval \
--combine_datasets <dataset> --combine_datasets_val <dataset> \
--save_dir=zsclip<dataset> --batch_size_val=8 --max_feats=1
```

Fully-supervised VideoQA

Fill-in-the-blank and open-ended VideoQA

To finetune the cross-modal trained FrozenBiLM on LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA or TGIF-QA FrameQA, run:

python -m torch.distributed.launch --nproc_per_node 4 --use_env videoqa.py \
--combine_datasets <dataset> --combine_datasets_val <dataset> --save_dir=ft<dataset> \
--lr=5e-5 --schedule=linear_with_warmup --load=<CKPT_PATH> \
--ds_factor_ff=8 --ds_factor_attn=8 --suffix="." \
--batch_size=8 --batch_size_val=32 --max_tokens 256 --epochs=20

Pass --ft_lm --ft_mlm --ds_factor_ff=0 --ds_factor_attn=0 for the UnFrozenBiLM variant.

Multiple-choice VideoQA

To finetune the cross-modal trained FrozenBiLM on How2QA or TVQA, run:

python -m torch.distributed.launch --nproc_per_node 8 --use_env mc.py \
--combine_datasets <dataset> --combine_datasets_val <dataset> --save_dir=ft<dataset> \
--lr=5e-5 --schedule=linear_with_warmup --load=<CKPT_PATH> \
--ds_factor_ff=8 --ds_factor_attn=8 --suffix="." \
--batch_size=2 --batch_size_val=8 --max_tokens=256 --epochs=20

Pass --ft_lm --ft_mlm --ds_factor_ff=0 --ds_factor_attn=0 --batch_size=1 for the UnFrozenBiLM variant.

Few-shot VideoQA

For few-shot VideoQA, we sample a subpart of the train dataframe file and change --<dataset>_train_csv_path.
The random subsets used in the paper are released here.
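For experiments on other data, a subset of a training csv can be drawn along the lines of the sketch below. The fixed seed, the rounding rule and the column names in the demonstration are assumptions; to reproduce the paper's numbers, use the released subsets rather than resampling.

```python
import csv
import random
import tempfile
from pathlib import Path


def sample_csv(src_csv, dst_csv, fraction, seed=0):
    """Write a random subset of the rows of src_csv to dst_csv and return its size."""
    with open(src_csv, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    keep = max(1, round(len(rows) * fraction))
    subset = random.Random(seed).sample(rows, keep)  # sampling without replacement
    with open(dst_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(subset)
    return keep


# Demonstration on a synthetic 100-row file (column names are hypothetical).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "train.csv"
    dst = Path(tmp) / "train_10p.csv"
    with open(src, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["video_id", "question", "answer"])
        for i in range(100):
            writer.writerow([f"vid{i}", f"question {i}", f"answer {i}"])
    kept = sample_csv(src, dst, fraction=0.1)
    n_lines = len(dst.read_text().splitlines())

print(kept, n_lines)
```

The path of the sampled file is then what you would pass as `--<dataset>_train_csv_path`.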

VideoQA Demo

Using a trained checkpoint, you can also run a VideoQA example with a video file of your choice, and the question of your choice. For that, use (the answer vocabulary is taken from msrvtt_vocab_path):

python demo_videoqa.py --combine_datasets msrvtt --combine_datasets_val msrvtt \
--suffix="." --max_tokens=256 --ds_factor_ff=8 --ds_factor_attn=8 \
--load=<CKPT_PATH> --msrvtt_vocab_path=<VOCAB_PATH> \
--question_example <question> --video_example <video_path>

This demo can run on CPUs, with at least 4 physical cores. For this, use --device='cpu'. Note that this demo does not use speech input which would require using an off-the-shelf ASR extractor.

Acknowledgements

The transformer models implementation is inspired by Hugging Face's transformers library.
The feature extraction code is inspired by Just Ask.

Licenses

This code is released under the Apache License 2.0.
The licenses for datasets used in the paper are available at the following links: iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, How2QA and TVQA.

Citation

If you found this work useful, consider giving this repository a star and citing our paper as follows:

@inproceedings{yang2022frozenbilm,
title = {Zero-Shot Video Question Answering via Frozen Bidirectional Language Models},
author = {Antoine Yang and Antoine Miech and Josef Sivic and Ivan Laptev and Cordelia Schmid},
booktitle = {NeurIPS},
year = {2022}}

frozenbilm's Issues

Do you L2 Normalize the CLIP Features?

Hi @antoyang,

After extracting the features from CLIP-L/14, do you L2 normalize the features before passing it to the subsequent pipeline? Thanks

Unexpected Zero-shot Results

Hi,

I tried to evaluate the fine-tuned checkpoints provided in the repo. My environment has been correctly configured and I followed all steps up to Zero-shot VideoQA section. As I only have one GPU, I didn't use distributed inference.
Here is what I used to run the evaluation:
python videoqa.py --test --eval --combine_datasets <dataset> --combine_datasets_val <dataset> --save_dir=zs<dataset> --ds_factor_ff=8 --ds_factor_attn=8 --suffix="." --batch_size_val=32 --max_tokens=256 --load=checkpoints/frozenbilm_<dataset>.pth --<dataset>_vocab_path <data_folder>/vocab1000.json
I tried with ActivityNet-VQA and iVQA and couldn't get any expected results.
For instance, here is what I got when testing on ActivityNet-VQA:

number of params: 29735424
loading from checkpoints/frozenbilm_activitynet.pth
test:  [  0/250]  eta: 0:07:27  acc: 0.0000 (0.0000)  time: 1.7891  data: 0.3052  max mem: 6485
test:  [100/250]  eta: 0:03:35  acc: 0.0000 (0.0006)  time: 1.4358  data: 0.0020  max mem: 7765
test:  [200/250]  eta: 0:01:11  acc: 0.0000 (0.0005)  time: 1.4355  data: 0.0021  max mem: 7765
test:  [249/250]  eta: 0:00:01  acc: 0.0000 (0.0006)  time: 1.4344  data: 0.0020  max mem: 7765
test: Total time: 0:05:59 (1.4361 s / it)
activitynet
test acc1:  0.06%
test acc10:  0.55%
acc motion:  0.00%
acc spatial:  0.12%
acc temporal:  0.00%
acc yesno:  0.00%
acc color:  0.57%
acc object:  0.00%
acc location:  0.00%
acc number:  0.00%
acc other:  0.00%
acc sub:  0.10%; proportion  25.25%

And results on iVQA:
number of params: 29735424

loading from checkpoints/frozenbilm_ivqa.pth
test:  [ 0/63]  eta: 0:02:40  acc: 0.0000 (0.0000)  time: 2.5405  data: 0.2846  max mem: 6485
test:  [62/63]  eta: 0:00:01  acc: 0.0000 (0.0000)  time: 1.1953  data: 0.0018  max mem: 7766
test: Total time: 0:01:16 (1.2169 s / it)
ivqa
test acc1:  0.00%
test acc10:  0.95%
acc sub:  0.00%; proportion  14.20%

Do you have any ideas on this issue?

Cheers

Few-shot VideoQA training details

Hi,

Thanks for the great work and publicly available code.

Could you please share the few-shot training parameters (batch size, learning rate, etc.)? I could not reproduce the results.

Thanks in advance.

About the stage of pre-training

Hi, I noticed that you only trained on WebVid10M for two epochs. Has the model already converged by then?

Autoregressive Model Checkpoint

Hi,

Thank you for the great work!
Is it possible for you to upload the checkpoint for GPT-J-6B as well?

Conda Environment Setting

Hi.

Instruction says to run "pip install requirements.txt", but it is "pip install -r requirements.txt", right?

My question is about this error:

$ pip install -r requirements.txt
ERROR: Could not find a version that satisfies the requirement clip==1.0 (from versions: 0.0.1, 0.1.0, 0.2.0)
ERROR: No matching distribution found for clip==1.0

How can I download clip==1.0?

Problems in reproducing the code process

Hello, thank you very much for sharing your work! I've run into a couple of problems while trying to reproduce it:

- Running main.py or videoqa.py gives an error: "ERROR:root:No token file found. Also make sure that a [prod] section with a 'token = value' assignment exists."
- How should I set combine_datasets and combine_datasets_val?

I hope you can take time out of your busy schedule to help me out!

Errors in MSRVTT-QA test set

Hi, I have found some spelling errors in the test set of MSRVTT. For example, "badmitten", "peson", "tenni". How did you handle such ground truth errors during the testing?

Problematic Tokenizer?

Hi! I am trying zeroshot inference with the code below

DATA_DIR=data
DATASET=activitynet
DATASET_FILE=ActivityNet-QA
CKPT_PATH=checkpoints/frozenbilm_activitynet.pth

TRANSFORMERS_CACHE=/root/.cache/huggingface/transformers \
CUDA_VISIBLE_DEVICES=4,5,6,7 \
CUDA_LAUNCH_BLOCKING=1 \
python -m torch.distributed.run --nproc_per_node 4 videoqa.py --test --eval \
--combine_datasets $DATASET --combine_datasets_val $DATASET --save_dir=zs${DATASET} \
--ds_factor_ff=8 --ds_factor_attn=8 --suffix="." \
--batch_size_val=32 --max_tokens=256 --load=$CKPT_PATH \
"--${DATASET}_vocab_path"=$DATA_DIR/$DATASET_FILE/vocab1000.json \
"--${DATASET}_train_csv_path"=$DATA_DIR/$DATASET_FILE/train.json "--${DATASET}_test_csv_path"=$DATA_DIR/$DATASET_FILE/test.csv

However, I encountered the following sentencepiece error:

*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
ERROR:root:No token file found. Also make sure that a [prod] section with a 'token = value' assignment exists.
ERROR:root:No token file found. Also make sure that a [prod] section with a 'token = value' assignment exists.
ERROR:root:No token file found. Also make sure that a [prod] section with a 'token = value' assignment exists.
ERROR:root:No token file found. Also make sure that a [prod] section with a 'token = value' assignment exists.
| distributed init (rank 0): env://
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 2): env://
Namespace(combine_datasets=['activitynet'], combine_datasets_val=['activitynet'], webvid_features_path='webvid_clipvitl14_features', webvid_train_csv_path='data/WebVid/train_captions.csv', webvid_val_csv_path='data/WebVid/val_captions.csv', lsmdc_features_path='data/LSMDC/clipvitl14.pth', lsmdc_train_csv_path='data/LSMDC/training.csv', lsmdc_val_csv_path='data/LSMDC/val.csv', lsmdc_test_csv_path='data/LSMDC/test.csv', lsmdc_vocab_path='data/LSMDC/vocab.json', lsmdc_subtitles_path='data/LSMDC/subtitles.pkl', ivqa_features_path='data/iVQA/clipvitl14.pth', ivqa_train_csv_path='data/iVQA/train.csv', ivqa_val_csv_path='data/iVQA/val.csv', ivqa_test_csv_path='data/iVQA/test.csv', ivqa_vocab_path='data/iVQA/vocab.json', ivqa_subtitles_path='data/iVQA/subtitles.pkl', msrvtt_features_path='data/MSRVTT-QA/clipvitl14.pth', msrvtt_train_csv_path='data/MSRVTT-QA/train.csv', msrvtt_val_csv_path='data/MSRVTT-QA/val.csv', msrvtt_test_csv_path='data/MSRVTT-QA/test.csv', msrvtt_vocab_path='data/MSRVTT-QA/vocab.json', msrvtt_subtitles_path='data/MSRVTT-QA/subtitles.pkl', msvd_features_path='data/MSVD-QA/clipvitl14.pth', msvd_train_csv_path='data/MSVD-QA/train.csv', msvd_val_csv_path='data/MSVD-QA/val.csv', msvd_test_csv_path='data/MSVD-QA/test.csv', msvd_vocab_path='data/MSVD-QA/vocab.json', msvd_subtitles_path='data/MSVD-QA/subtitles.pkl', activitynet_features_path='data/ActivityNet-QA/clipvitl14.pth', activitynet_train_csv_path='data/ActivityNet-QA/train.json', activitynet_val_csv_path='data/ActivityNet-QA/val.csv', activitynet_test_csv_path='data/ActivityNet-QA/test.csv', activitynet_vocab_path='data/ActivityNet-QA/vocab1000.json', activitynet_subtitles_path='data/ActivityNet-QA/subtitles.pkl', tgif_features_path='data/TGIF-QA/clipvitl14.pth', tgif_frameqa_train_csv_path='data/TGIF-QA/train_frameqa.csv', tgif_frameqa_test_csv_path='data/TGIF-QA/test_frameqa.csv', tgif_vocab_path='data/TGIF-QA/vocab.json', how2qa_features_path='data/How2QA/clipvitl14_split.pth', 
how2qa_train_csv_path='data/How2QA/train.csv', how2qa_val_csv_path='data/How2QA/public_val.csv', how2qa_subtitles_path='data/How2QA/subtitles.pkl', tvqa_features_path='data/TVQA/clipvitl14.pth', tvqa_train_csv_path='data/TVQA/train.csv', tvqa_val_csv_path='data/TVQA/val.csv', tvqa_test_csv_path='data/TVQA/test_public.csv', tvqa_subtitles_path='data/TVQA/subtitles.pkl', vqa_features_path='data/VQA/clipvitl14.pth', vqa_train_pkl_path='data/VQA/train_list.pkl', vqa_val_pkl_path='data/VQA/val_list.csv', vqa_vocab_path='data/VQA/vocab.json', mlm_prob=0.15, lr=0.0003, beta1=0.9, beta2=0.95, batch_size=32, batch_size_val=32, weight_decay=0, epochs=10, lr_drop=10, optimizer='adam', clip_max_norm=0.1, schedule='', fraction_warmup_steps=0.1, eval_skip=1, print_freq=100, freeze_lm=True, model_name='/root/.cache/huggingface/transformers/deberta-v2-xlarge', ds_factor_attn=8, ds_factor_ff=8, ft_ln=True, freeze_mlm=True, dropout=0.1, scratch=False, n_ans=0, freeze_last=True, test=True, save_dir='zsactivitynet', presave_dir='', device='cuda', seed=42, load='checkpoints/frozenbilm_activitynet.pth', resume=False, start_epoch=0, eval=True, num_workers=3, world_size=4, dist_url='env://', max_feats=10, features_dim=768, use_video=True, use_context=True, max_tokens=256, max_atokens=5, prefix='', suffix='.', rank=0, gpu=0, distributed=True, dist_backend='nccl')
Traceback (most recent call last):
  File "/mnt/lustre/lychen/code/sm/FrozenBiLM/videoqa.py", line 530, in <module>
    main(args)
  File "/mnt/lustre/lychen/code/sm/FrozenBiLM/videoqa.py", line 266, in main
    tokenizer = get_tokenizer(args)
  File "/mnt/lustre/lychen/code/sm/FrozenBiLM/model/__init__.py", line 96, in get_tokenizer
    tokenizer = DebertaV2Tokenizer.from_pretrained(
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1777, in from_pretrained
    return cls._from_pretrained(
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 1932, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/transformers/models/deberta_v2/tokenization_deberta_v2.py", line 149, in __init__
    self._tokenizer = SPMTokenizer(vocab_file, split_by_punct=split_by_punct, sp_model_kwargs=self.sp_model_kwargs)
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/transformers/models/deberta_v2/tokenization_deberta_v2.py", line 301, in __init__
    spm.load(vocab_file)
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/sentencepiece/__init__.py", line 905, in Load
    return self.LoadFromFile(model_file)
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/sentencepiece/__init__.py", line 310, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 1066196) of binary: /mnt/lustre/anaconda3/envs/dream/bin/python
Traceback (most recent call last):
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/torch/distributed/run.py", line 766, in <module>
    main()
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/mnt/lustre/anaconda3/envs/dream/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
videoqa.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2022-11-07_10:48:31
  host      : localhost.vm
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 1066197)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2022-11-07_10:48:31
  host      : localhost.vm
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1066198)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[3]:
  time      : 2022-11-07_10:48:31
  host      : localhost.vm
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 1066199)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-11-07_10:48:31
  host      : localhost.vm
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 1066196)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

This issue is the same as the one below. It looks like a problem with the vocab file. How can we fix it?

sentencepiece\sentencepiece\src\sentencepiece_processor.cc(1102) [model_proto->ParseFromArray(serialized.data(), serialized.size())] · Issue #20011 · huggingface/transformers
huggingface/transformers#20011
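
The launcher summary above reports `error_file: <N/A>` for every rank, so the workers' actual Python traceback is not shown. The PyTorch page linked in the log describes decorating the entry point with `record` so that the exception is captured. A minimal sketch follows, assuming the script exposes a `main` function. The `main` body here is a hypothetical stand-in, not the real `videoqa.py` code, and the `ImportError` fallback exists only so the sketch runs without torch installed.

```python
# Sketch: let the torch elastic launcher record the worker's real traceback.
try:
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # torch is not installed: fall back to a no-op decorator (illustration only).
    def record(fn):
        return fn


@record
def main(argv=None):
    # Hypothetical stand-in for the real training/evaluation entry point.
    # Any uncaught exception raised in here is recorded by the decorator
    # before being re-raised, so the launcher can report it.
    return 0


if __name__ == "__main__":
    main()
```

With the decorator in place, the failure summary should point at the first worker's exception rather than only listing exit codes.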

[Import Error] with demo_videoqa.py

python demo_videoqa.py --combine_datasets msrvtt --combine_datasets_val msrvtt --suffix="." --max_tokens=256 --ds_factor_ff=8 --ds_factor_attn=8 --load=checkpoints/frozenbilm_msrvtt10p.pth --msrvtt_vocab_path=data/MSRVTT-QA/vocab.json --question_example "what is that dog doing?" --video_example ./angry_cute_dog.mp4

I downloaded all the data and checkpoint files, and I also downloaded the transformers library from Hugging Face. Please check my error message:

ImportError: cannot import name 'GreedySearchOutput' from 'transformers.generation_utils' (FrozenBiLM/transformers/src/transformers/generation_utils.py)

What version of the transformers library are you using?
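
A quick way to answer the version question locally is to print which versions are installed and where Python would import `transformers` from. The path in the error message points inside `FrozenBiLM/transformers/src`, which suggests (an assumption, not confirmed in this thread) that a local checkout is shadowing the version installed by `pip install -r requirements.txt`. The helper names below are hypothetical; the sketch uses only the standard library.

```python
from importlib import metadata
from importlib.util import find_spec


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def import_location(module):
    """Return the file a top-level `module` would be imported from, or None if not found."""
    spec = find_spec(module)
    return spec.origin if spec is not None else None


# Report the packages involved in the errors discussed in these issues.
for name in ("transformers", "torch", "sentencepiece"):
    print(f"{name}: version={installed_version(name)} location={import_location(name)}")
```

If the reported location is a cloned source tree rather than `site-packages`, running the demo from a directory that does not contain that clone is one thing to try.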

Bad zero-shot results on TVQA

Hi, I ran zero-shot evaluation on the TVQA dataset with the provided zero-shot checkpoint frozenbilm.pth and the provided TVQA video features clipvitl14.pth. I also used the microsoft/deberta-v2-xlarge checkpoint. However, I got a validation accuracy of 31.59 instead of the reported 59.7.

Error on zero-shot VQA

Hi. Thanks for providing code! I'm having the same issue as #3 on the VQA demo. I have the Microsoft deberta-v2-xlarge ( https://huggingface.co/microsoft/deberta-v2-xlarge ) downloaded from huggingface in a folder called transformers_cache. I've set the TRANSFORMERS_CACHE environment variable to point at it (if I remove this, it complains that deberta is missing, so I assume this part is correct). Do you have any idea why it might be failing?

The command I'm running is:

python demo_videoqa.py --combine_datasets msrvtt --combine_datasets_val msrvtt \
--suffix="." --max_tokens=256 --ds_factor_ff=8 --ds_factor_attn=8 \
--load=models/frozenbilm.pth --msrvtt_vocab_path=data/MSRVTT-QA/vocab.json \
--question_example question --video_example test.mp4 --device='cpu'

And the error is:

Traceback (most recent call last):
  File "demo_videoqa.py", line 170, in <module>
    main(args)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "demo_videoqa.py", line 32, in main
    tokenizer = get_tokenizer(args)
  File "/user/work/tp8961/FrozenBiLM/model/__init__.py", line 96, in get_tokenizer
    tokenizer = DebertaV2Tokenizer.from_pretrained(
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1788, in from_pretrained
    return cls._from_pretrained(
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 1923, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/transformers/models/deberta_v2/tokenization_deberta_v2.py", line 145, in __init__
    self._tokenizer = SPMTokenizer(vocab_file, split_by_punct=split_by_punct, sp_model_kwargs=self.sp_model_kwargs)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/transformers/models/deberta_v2/tokenization_deberta_v2.py", line 296, in __init__
    spm.load(vocab_file)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/user/work/tp8961/conda_envs/frozenbilm_env/lib/python3.8/site-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(890) [model_proto->ParseFromArray(serialized.data(), serialized.size())]
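
The `ParseFromArray` failure means SentencePiece could not parse the tokenizer model file it was given. One possible cause (an assumption, not confirmed in this thread) is that the `spm.model` file in the cache is empty or is a Git LFS pointer stub rather than the real binary, which can happen when a model repository is cloned without Git LFS. The sketch below, with a hypothetical helper name, performs that basic check using only the standard library. It does not validate the model contents.

```python
from pathlib import Path

# Git LFS pointer files are small text files that begin with this line.
LFS_MAGIC = b"version https://git-lfs.github.com/spec/v1"


def diagnose_spm_file(path):
    """Classify a tokenizer model file as 'missing', 'empty', 'git-lfs pointer' or 'present'."""
    p = Path(path)
    if not p.is_file():
        return "missing"
    if p.stat().st_size == 0:
        return "empty"
    with p.open("rb") as f:
        head = f.read(len(LFS_MAGIC))
    if head.startswith(LFS_MAGIC):
        return "git-lfs pointer"
    # 'present' only means the file exists and is not an obvious stub.
    return "present"
```

Calling `diagnose_spm_file` on the `spm.model` path inside the `TRANSFORMERS_CACHE` folder and getting anything other than `present` would suggest re-downloading the tokenizer files.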

Verification code is needed for downloading the data features

Hi @antoyang ,

Thanks a lot for open-sourcing the project! I ran into trouble when trying to download the pre-computed features, since the download requires a verification code that does not seem to be provided in the repo.

Best,
Junting

webvid_clipvitl14_features

Hi Antoine,
Thanks for your great and open work! I could not find the WebVid video features in the files you provided. Could you please share the download link?
