declare-lab / sealing

[NAACL 2024] Official Implementation of paper "Self-Adaptive Sampling for Efficient Video Question Answering on Image--Text Models"

Home Page: https://arxiv.org/pdf/2307.04192.pdf

License: MIT License

[NAACL 2024] Self-adaptive Sampling for Efficient Video Question Answering on Image--Text Models

🔥 [14/03/2024] This paper has been accepted to NAACL 2024 (Findings)!

Introduction

This repository contains the official implementation of the paper "Self-adaptive Sampling for Efficient Video Question Answering". In this work we introduce and study two simple frame sampling strategies (MIF and MDF) for tuning pretrained Visual Language Models (VLMs) on Video Question Answering tasks.

Specifically, we first systematically test the performance of MIF (Most Implied Frames) with various backbone models serving as captioner and scorer, which work together to perform "question-and-vision-aware" sampling. Drawing on the results and analysis, we then propose the more lightweight MDF (Most Dominant Frames), which goes one step further by dropping the dependence on the question and executes "question-agnostic, vision-aware" sampling. This routine significantly boosts efficiency and achieves competitive or higher performance on the tested datasets.
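
As a rough illustration of the two strategies described above, the sketch below picks K frame indices from toy inputs. It is not the implementation in this repository: the helper names (cosine, top_k_with_gap, mif_select, mdf_select), the min_gap parameter, and the use of mean cosine similarity as a "dominance" score are all assumptions made for this example. In the real pipeline the question-caption scores and frame embeddings would come from the chosen captioner, scorer and VLM.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def top_k_with_gap(scores: Sequence[float], k: int, min_gap: int = 1) -> List[int]:
    """Greedily pick up to k highest-scoring indices at least min_gap apart."""
    if k <= 0:
        return []
    # Highest score first; ties are broken by the earlier frame index.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    chosen: List[int] = []
    for i in order:
        if all(abs(i - j) >= min_gap for j in chosen):
            chosen.append(i)
        if len(chosen) == k:
            break
    return sorted(chosen)  # restore temporal order


def mif_select(question_caption_scores: Sequence[float], k: int,
               min_gap: int = 1) -> List[int]:
    """Question-and-vision-aware (assumed form): rank frames by how well
    their generated caption matches the question."""
    return top_k_with_gap(question_caption_scores, k, min_gap)


def mdf_select(frame_embeddings: Sequence[Sequence[float]], k: int,
               min_gap: int = 1) -> List[int]:
    """Question-agnostic, vision-aware (assumed form): rank frames by their
    mean similarity to all other frames of the same video."""
    n = len(frame_embeddings)
    dominance = []
    for i in range(n):
        others = [cosine(frame_embeddings[i], frame_embeddings[j])
                  for j in range(n) if j != i]
        dominance.append(sum(others) / len(others) if others else 0.0)
    return top_k_with_gap(dominance, k, min_gap)


if __name__ == "__main__":
    # Toy inputs: five caption scores, and four 2-d frame embeddings.
    print(mif_select([0.1, 0.9, 0.8, 0.2, 0.7], k=2, min_gap=2))
    print(mdf_select([[1, 0], [1, 0], [0.9, 0.1], [0, 1]], k=2))
```

The only structural difference between the two functions is where the scores come from: mif_select needs the question (through the caption scores), while mdf_select looks at the frames alone, which is why it can be computed once per video and reused for every question.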

Once sampling completes, the sampled frames are saved in an HDF5 (.h5) file as a "dataset" for fast loading at training and test time. We test our methods on three models (CLIP, GIT and All-in-one) and four datasets (MSVD-QA, MSRVTT-QA, TGIF-Frame, NExT-QA). The implementations on CLIP (including our refined structure CLIP-Dec, which significantly improves on raw CLIP) and GIT are in the folder clip_and_git, while the implementation on All-in-one is in the folder all_in_one.
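
For readers unfamiliar with the format, here is a minimal sketch of writing per-video sampled frame features to an HDF5 dataset and reading one video back with h5py. The dataset key "sampled_frames", the file name, the array shape and the helper names save_samples and load_samples are hypothetical; the layout actually written by the preprocessing scripts may differ.

```python
import os
import tempfile

import h5py
import numpy as np


def save_samples(path: str, features: np.ndarray,
                 key: str = "sampled_frames") -> None:
    """Write an array of shape (num_videos, K, dim) as one HDF5 dataset."""
    with h5py.File(path, "w") as f:
        f.create_dataset(key, data=features, compression="gzip")


def load_samples(path: str, video_idx: int,
                 key: str = "sampled_frames") -> np.ndarray:
    """Read the K sampled frames of a single video without loading the rest."""
    with h5py.File(path, "r") as f:
        return f[key][video_idx]


if __name__ == "__main__":
    # Toy data: 4 videos, 3 sampled frames each, 8-d feature per frame.
    feats = np.random.rand(4, 3, 8).astype("float32")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")
        save_samples(path, feats)
        print(load_samples(path, video_idx=2).shape)
```

Because HDF5 datasets support partial reads, a data loader can fetch one video at a time by index instead of holding the whole file in memory.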

Usage

1. Downloading Datasets

Please visit each dataset's repository and follow the instructions there to download it.

The suggested path to store these datasets is "model/dataset/<dataset_name>"

2. Preprocessing

The sampling code is the same for all three models and is located in the folder "clip_and_git/src/preprocessing".

  • To sample with the MDF method, run the Python script as follows:

    python extract_features.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --model_name=<vlm_model_name> ... (other hps)
    

    If your code raises an out-of-memory exception, please use a smaller chunksize (default=512) to shrink the input size per computation.

  • To sample with the MIF method, first run uniform sampling with a large K (e.g., 16 or 32) to obtain a sparse frame sequence:

    python extract_features.py --sampling_strategy='uni' --K 16 ...
    

    Then run the Python script twice, first to generate captions (task 'gen_cap') and then to sample frame indices (task 'gen_inds'):

    python gen_sample.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --vlm_model=<vlm_model_name> --sim_model=<sim_model_name> --task='gen_cap'
    
    python gen_sample.py --dataset=<dataset_name> --dataset_root=<root_path> --sampling_strategy='repr' --vlm_model=<vlm_model_name> --sim_model=<sim_model_name> --task='gen_inds'
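
The two knobs mentioned in the steps above, the number of uniformly sampled frames K and the chunksize, can be pictured with the following sketch. uniform_indices and iter_chunks are hypothetical helpers written for this example (they are not functions from extract_features.py), and taking the midpoint of each of K equal segments is only one common way to define uniform sampling.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def uniform_indices(num_frames: int, k: int) -> List[int]:
    """Return up to k frame indices spread evenly over num_frames frames."""
    if num_frames <= 0 or k <= 0:
        return []
    k = min(k, num_frames)  # cannot sample more frames than exist
    seg = num_frames / k    # length of each equal segment
    # Take the midpoint of every segment.
    return [int(seg * i + seg / 2) for i in range(k)]


def iter_chunks(items: Sequence[T], chunksize: int = 512) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of at most chunksize items."""
    if chunksize <= 0:
        raise ValueError("chunksize must be positive")
    for start in range(0, len(items), chunksize):
        yield items[start:start + chunksize]


if __name__ == "__main__":
    print(uniform_indices(160, 16))
    # A smaller chunksize means fewer frames per forward pass,
    # and therefore a lower peak memory footprint.
    frames = list(range(1000))
    print([len(chunk) for chunk in iter_chunks(frames, chunksize=256)])
```

Halving the chunk size roughly halves the number of frames processed at once, at the cost of more iterations, which is the trade-off behind the out-of-memory advice above.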
    

3. Training and Inference

For experiments on CLIP and GIT, please adapt the reference scripts we provide (in src/scripts). For All-in-one, please see its attached README file for more details.

Results (Partial)

The following results are prediction accuracies; the metric is defined and customized for each dataset/model in our paper.

CLIP-Dec (3 Frames)

Sampling   MSVD-QA   MSRVTT-QA   TGIF-Frame
noDec      27.7      30.3        42.8
Uniform    33.8      33.7        47.2
MDF        35.0      35.2        63.2
MIF        35.0      35.4        61.8

GIT-Base (6 Frames)

Sampling   MSVD-QA   MSRVTT-QA   TGIF-Frame
Report     51.2      41.0        69.1
Uniform    52.2      41.1        67.5
MDF        55.3      42.0        69.9
MIF        54.5      42.3        69.6

AIO-Base (3 Frames)

Sampling   MSVD-QA   MSRVTT-QA   TGIF-Frame
Report     46.5      42.9        64.2
Reprd.     46.1      42.7        64.0
MDF        46.9      43.8        66.2
MIF        46.7      44.0        65.9

AIO-Base+ on NExT-QA (3 Frames)

Method   Val    Test
Base     48.4   48.1
MIF      49.7   49.5
MDF      50.2   49.8

BLIP2-T5XXL on NExT-QA (3 Frames)

Method   Val    Test
Base     60.1   59.7
MIF      61.5   61.2
MDF      61.8   61.1

Citation

Please cite our paper if you find this project related to your work:

@misc{han2023sas,
      title={SAS Video-QA: Self-Adaptive Sampling for Efficient Video Question-Answering}, 
      author={Wei Han and Hui Chen and Min-Yen Kan and Soujanya Poria},
      year={2023},
      eprint={2307.04192},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

The code for AIO is adapted from the official AIO implementation.

Contact

If you have any enquiries about our code and paper, feel free to contact us at [email protected] or [email protected].
