[NAACL 2024] Self-adaptive Sampling for Efficient Video Question Answering on Image-Text Models

🔥 [14/03/2024] This paper has been accepted to NAACL 2024 (Findings)!

Introduction

This repository contains the official implementation of the paper "Self-adaptive Sampling for Efficient Video Question Answering". In this work, we introduce and study two simple sampling strategies, MIF and MDF, for tuning pretrained Visual Language Models (VLMs) on Video Question Answering tasks.

Specifically, we first systematically test the performance of MIF (Most Implied Frames) with various backbone models serving as captioner and scorer. The two work together to perform "question-and-vision-aware" sampling. Drawing on these results and their analysis, we then propose the more lightweight MDF (Most Dominant Frames), which goes one step further by dropping the dependence on the question and performs "question-agnostic, vision-aware" sampling. This routine significantly boosts efficiency and achieves competitive or higher performance on the tested datasets.
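
To make the idea of "question-agnostic, vision-aware" sampling concrete, here is a minimal sketch. It is not the algorithm from the paper or this repository: it assumes each frame has already been encoded into a feature vector, scores a frame by its mean cosine similarity to all other frames (an illustrative stand-in for "dominance"), and keeps the top-scoring frames subject to a minimum temporal gap. The names `dominant_frames` and `min_gap` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dominant_frames(features, k, min_gap=1):
    """Pick up to k frame indices from visual features alone (no question).

    Assumption: a frame is "dominant" if it is, on average, similar to
    the rest of the video. This scoring rule is illustrative only.
    """
    n = len(features)
    scores = []
    for i in range(n):
        sims = [cosine(features[i], features[j]) for j in range(n) if j != i]
        scores.append(sum(sims) / len(sims) if sims else 0.0)

    # Greedily take the best-scoring frames that are at least
    # min_gap positions away from every frame already chosen.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    chosen = []
    for i in ranked:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return sorted(chosen)


# Toy features: frame 3 is an outlier that resembles no other frame.
toy_features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.95, 0.0]]
picked = dominant_frames(toy_features, k=2, min_gap=2)
```

Because the question never enters the computation, a selection like this can be computed once per video and reused for every question about it, which is the source of the efficiency gain described above.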

Once sampling completes, the sampled frames are saved in an HDF5 (.h5) file as a "dataset" for fast loading at training and test time. We test our methods on three models (CLIP, GIT and All-in-one) and four datasets (MSVD-QA, MSRVTT-QA, TGIF-Frame, NExT-QA). The implementations on CLIP (including our refined structure CLIP-Dec, which significantly improves on raw CLIP) and GIT are in the folder clip_and_git, while the implementation on All-in-one is in the folder all_in_one.

Usage

1. Downloading Datasets

Please visit the repository of each dataset and follow the instructions there to download it.

The suggested path to store these datasets is "model/dataset/<dataset_name>"

2. Preprocessing

The sampling code is the same for all three models and is located in the folder "clip_and_git/src/preprocessing".

To sample with the MDF method, run the Python script as follows:
```
python extract_features.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --model_name=<vlm_model_name> ... (other hps)
```
If the script raises an out-of-memory exception, use a smaller chunksize (default=512) to reduce the input size per computation.
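
The effect of a smaller chunk size can be sketched as follows. This illustrates chunked processing in general, not the repository's code: `encode_in_chunks` and `encode_fn` are hypothetical names, and `encode_fn` stands in for whatever model call turns a batch of frames into features.

```python
def iter_chunks(items, chunksize=512):
    """Yield consecutive slices of items with at most chunksize elements."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


def encode_in_chunks(frames, encode_fn, chunksize=512):
    """Encode frames chunk by chunk so peak memory scales with chunksize."""
    features = []
    for chunk in iter_chunks(frames, chunksize):
        # Only one chunk is passed to the (hypothetical) encoder at a time.
        features.extend(encode_fn(chunk))
    return features


# Toy usage: "encoding" doubles each value; record the chunk sizes seen.
seen_sizes = []


def toy_encode(chunk):
    seen_sizes.append(len(chunk))
    return [x * 2 for x in chunk]


toy_features = encode_in_chunks(list(range(10)), toy_encode, chunksize=4)
```

The output is the same for any chunk size. Only the number of frames held in a single forward pass changes, so halving the chunk size trades speed for a lower memory peak.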

To sample with the MIF method, first run uniform sampling with a large K (e.g., 16 or 32) to obtain a sparse frame sequence:

```
python extract_features.py --sampling_strategy='uni' --K 16 ...
```
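
As a rough illustration of uniform sampling, the sketch below picks K evenly spaced frame indices by taking the centre of each of K equal segments. This is one common convention and an assumption here; the 'uni' strategy in `extract_features.py` may choose its indices differently. The function name `uniform_indices` is hypothetical.

```python
def uniform_indices(num_frames, k):
    """Return up to k evenly spaced frame indices in [0, num_frames)."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)  # cannot sample more frames than exist
    # Split the video into k equal segments and take each segment's centre.
    return [int((i + 0.5) * num_frames / k) for i in range(k)]


indices_long = uniform_indices(160, 16)   # long clip, K = 16
indices_short = uniform_indices(3, 16)    # clip shorter than K
```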

Then run the Python script twice, first with --task='gen_cap' to caption the frames and then with --task='gen_inds' to produce the sampled frame indices:

```
python gen_sample.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --vlm_model=<vlm_model_name> --sim_model=<sim_model_name> --task='gen_cap'

python gen_sample.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --vlm_model=<vlm_model_name> --sim_model=<sim_model_name> --task='gen_inds'
```
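
Conceptually, the two stages amount to captioning each candidate frame and then ranking the captions against the question. The sketch below shows only that ranking step and rests on loud assumptions: the captions are given as plain strings, and a Jaccard token overlap replaces the similarity model passed via `--sim_model`. It does not reproduce what `gen_sample.py` does internally, and `most_implied_frames` is a hypothetical name.

```python
def tokenize(text):
    """Lowercase and split into a set of words, ignoring '?' and '.'."""
    return set(text.lower().replace("?", " ").replace(".", " ").split())


def jaccard(a, b):
    """Token-overlap similarity; a toy stand-in for a learned scorer."""
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def most_implied_frames(question, captions, k, score_fn=jaccard):
    """Return the indices of the k captions best matching the question."""
    scores = [score_fn(question, caption) for caption in captions]
    ranked = sorted(range(len(captions)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore temporal order


question = "what is the man playing?"
captions = [
    "a dog runs on grass",
    "a man is playing a guitar",
    "an empty street",
    "the man is playing on stage",
]
selected = most_implied_frames(question, captions, k=2)
```

Unlike question-agnostic sampling, this selection must be recomputed for every question, which is why the captions are generated once in a separate stage and reused.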

3. Training and Inference

For experiments on CLIP and GIT, please modify the reference scripts we provide (in src/scripts). For All-in-one, please see its README file for details.

Results (Partial)

The following results report prediction accuracy, which is defined for each dataset and model in our paper.

CLIP-Dec (3 Frame)

Sampling	MSVD-QA	MSRVTT-QA	TGIF-Frame
noDec	27.7	30.3	42.8
Uniform	33.8	33.7	47.2
MDF	35.0	35.2	63.2
MIF	35.0	35.4	61.8

GIT-Base (6 Frame)

Sampling	MSVD-QA	MSRVTT-QA	TGIF-Frame
Report	51.2	41.0	69.1
Uniform	52.2	41.1	67.5
MDF	55.3	42.0	69.9
MIF	54.5	42.3	69.6

AIO-Base (3 Frame)

Sampling	MSVD-QA	MSRVTT-QA	TGIF-Frame
Report	46.5	42.9	64.2
Reprd.	46.1	42.7	64.0
MDF	46.9	43.8	66.2
MIF	46.7	44.0	65.9

AIO-Base+ on NExT-QA (3 Frame)

Method	Val	Test
Base	48.4	48.1
MIF	49.7	49.5
MDF	50.2	49.8

BLIP2-T5XXL on NExT-QA (3 Frame)

Method	Val	Test
Base	60.1	59.7
MIF	61.5	61.2
MDF	61.8	61.1

Citation

Please cite our paper if you find this project relevant to your work:

@misc{han2023sas,
      title={SAS Video-QA: Self-Adaptive Sampling for Efficient Video Question-Answering}, 
      author={Wei Han and Hui Chen and Min-Yen Kan and Soujanya Poria},
      year={2023},
      eprint={2307.04192},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

The code for AIO is adapted from the official AIO implementation.

Contact

If you have any enquiries about our code and paper, feel free to contact us at [email protected] or [email protected].
