
Home Page: https://arxiv.org/abs/2310.19060

License: MIT License

Language: Python (100%)
Topics: long-video-understanding, video-qa, video-text-retrieval, video-understanding

TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding
Shuhuai Ren, Sishuo Chen, Shicheng Li, Xu Sun, Lu Hou


🚀 News

  • (Nov 11, 2023)
    • Uploaded the 32-frame fine-tuned checkpoint for paragraph-to-video retrieval.
  • (Oct 29, 2023)
    • Released code for video pre-training, video QA, and video-paragraph retrieval.
    • Released the checkpoint of the pre-trained TESTA-base model.
  • (Oct 8, 2023)
    • Our paper was accepted to EMNLP 2023 (Findings).

Highlights

[Figure: TESTA visualization]

Main Contributions

  1. We introduce an efficient method named TESTA (TEmporal-Spatial Token Aggregation) for long-form video understanding. TESTA progressively aggregates similar visual tokens during video encoding, reducing the number of visual tokens by 75% and thus accelerating video encoding (see the merging sketch after this list).
  2. Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video encoder block.
  3. Experimental results on five datasets for paragraph-to-video retrieval and long-form VideoQA show that TESTA improves computational efficiency by 1.7x and achieves significant performance gains from its scalability to longer input frames, e.g., +13.7 R@1 on QuerYD and +6.5 R@1 on Condensed Movies.
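
To make the aggregation idea concrete, below is a minimal, hypothetical PyTorch sketch of one similarity-based merging step in the spirit of ToMe-style bipartite matching, which TESTA's aggregation builds on. It is illustrative only, not the repository's implementation; the `merge_tokens` helper and its simple averaging scheme are assumptions. In TESTA this kind of reduction is applied separately along the temporal and spatial axes inside each encoder block.

```python
import torch
import torch.nn.functional as F


def merge_tokens(tokens: torch.Tensor, r: int) -> torch.Tensor:
    """Merge the r most redundant tokens into their nearest neighbors.

    tokens: (N, D) token embeddings along one axis (spatial patches of a
    frame, or temporal tokens at one spatial location).
    Returns (N - r, D) tokens.
    """
    # Split tokens into two alternating sets A and B and score cross-set
    # cosine similarity (bipartite matching keeps this step cheap).
    a, b = tokens[0::2], tokens[1::2]
    scores = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T

    # For each token in A, find its most similar partner in B; merge the r
    # highest-similarity pairs and keep the remaining A tokens unchanged.
    best_score, best_match = scores.max(dim=-1)
    order = best_score.argsort(descending=True)
    merged, kept = order[:r], order[r:]

    b = b.clone()
    for i in merged.tolist():      # average each merged token into its partner
        j = best_match[i].item()
        b[j] = 0.5 * (b[j] + a[i])
    return torch.cat([a[kept], b], dim=0)


# Toy usage: 196 spatial patch tokens of one frame, merging 25% in one step.
x = torch.randn(196, 768)
print(merge_tokens(x, r=49).shape)  # torch.Size([147, 768])
```

Applied repeatedly across encoder blocks, this kind of merging is what allows the encoder to end up with roughly 75% fewer visual tokens than it started with.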

[Figure: TESTA architecture]

Currently, the repository contains the code for pre-training a general-purpose video-language model and fine-tuning it on downstream video understanding tasks including video-paragraph retrieval and VideoQA.

Installation

To install the dependencies, run

# create 
conda env create -f environment.yml
# activate
conda activate testa

Data preparation

Please follow the instructions at DATASETS.md to prepare all datasets.

Models

Pre-trained model

Zero-shot performance on paragraph-to-video retrieval:

| Model | Frames | QuerYD R@1 | DiDeMo R@1 | ActivityNet Caption R@1 | GFLOPs | Checkpoint |
|---|---|---|---|---|---|---|
| TESTA-base (ViT-B/16) | 32 | 64.4 | 64.9 | 37.1 | 786 | testa_model_base_pretrain.pth |

Fine-tuned model

QuerYD paragraph-to-video retrieval

| Model | Frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
|---|---|---|---|---|---|---|
| TESTA-base (ViT-B/16) | 32 | 77.0 | 90.8 | 92.6 | 420 | testa_model_base_queryd_f32_f1p12.pth |

ActivityNet paragraph-to-video retrieval

| Model | Frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
|---|---|---|---|---|---|---|
| TESTA-base (ViT-B/16) | 32 | 51.6 | 79.1 | 88.3 | 420 | testa_model_base_anet_f32_f1p12.pth |

DiDeMo paragraph-to-video retrieval

| Model | Frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
|---|---|---|---|---|---|---|
| TESTA-base (ViT-B/16) | 32 | 57.7 | 83.3 | 89.4 | 420 | testa_model_base_didemo_f32_f1p12.pth |

Condensed Movies paragraph-to-video retrieval

| Model | Frames | R@1 | R@5 | R@10 | GFLOPs | Checkpoint |
|---|---|---|---|---|---|---|
| TESTA-base (ViT-B/16) | 32 | 21.5 | 42.4 | 50.7 | 420 | testa_model_base_cm_f32_f1p12.pth |

Training and Evaluation

Please refer to RUN.md for detailed instructions on training, evaluation, and reproducing the results.

Todo list

  • Upload fine-tuned checkpoints
  • Add visualization code
  • Add demos

Contact

If you have any questions, please feel free to create an issue on this repository.

Citation

If you find this code useful for your research, please consider citing:

@article{Ren2023TESTA,
  title={TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding},
  author={Shuhuai Ren and Sishuo Chen and Shicheng Li and Xu Sun and Lu Hou},
  journal={ArXiv},
  year={2023},
  volume={abs/2310.19060},
}

Acknowledgement

The codebase relies on resources from BLIP, ToMe, and TimeSformer. We thank the original authors for open-sourcing their work.

