efficient-video-recognition's Introduction

Frozen CLIP models are Efficient Video Learners

This is the official implementation of the paper "Frozen CLIP Models are Efficient Video Learners".

@article{lin2022frozen,
  title={Frozen CLIP Models are Efficient Video Learners},
  author={Lin, Ziyi and Geng, Shijie and Zhang, Renrui and Gao, Peng and de Melo, Gerard and Wang, Xiaogang and Dai, Jifeng and Qiao, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2208.03550},
  year={2022}
}

Introduction

The EVL framework consists of a trainable Transformer decoder, trainable local temporal modules, and a pretrained, fixed image backbone (for instance, CLIP).

Using a fixed backbone substantially reduces training time: we trained a ViT-B/16 with 8 frames for 50 epochs in 60 GPU-hours (NVIDIA V100).

Despite the small training computation and memory consumption, EVL models achieve high performance on Kinetics-400. A comparison with state-of-the-art methods is as follows.

Installation

We tested the released code with the following conda environment:

conda create -n pt1.9.0cu11.1_official -c pytorch -c conda-forge pytorch=1.9.0=py3.9_cuda11.1_cudnn8.0.5_0 cudatoolkit torchvision av

Data Preparation

The --train_list_path and --val_list_path command line arguments are each expected to point to a data list file in the following format:

<path_1> <label_1>
<path_2> <label_2>
...
<path_n> <label_n>

where <path_i> points to a video file, and <label_i> is an integer between 0 and num_classes - 1. --num_classes should also be specified on the command line.

Additionally, <path_i> may be a relative path when --data_root is specified, in which case it is resolved relative to the path passed as --data_root.
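
The format above can be checked before training with a small script. The following is a minimal sketch, not the loader used by the released code: the function name read_data_list is hypothetical, and splitting on the last whitespace run (so that paths containing spaces still parse) is an assumption about how such lines could be read.

```python
import os
import tempfile


def read_data_list(list_path, num_classes, data_root=None):
    """Parse '<path> <label>' lines into a list of (path, label) tuples.

    Hypothetical helper for validating a list file; not part of the repo.
    """
    samples = []
    with open(list_path) as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            # Assumption: the label is the last whitespace-separated token.
            path, label_str = line.rsplit(maxsplit=1)
            label = int(label_str)
            if not 0 <= label < num_classes:
                raise ValueError(
                    f"{list_path}:{line_no}: label {label} is outside "
                    f"[0, {num_classes - 1}]"
                )
            if data_root is not None:
                # Relative paths are resolved against data_root.
                path = os.path.join(data_root, path)
            samples.append((path, label))
    return samples


# Usage: write a two-line list file and read it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("train/a.mp4 0\ntrain/b.mp4 399\n")
    demo_list_path = tmp.name

demo_samples = read_data_list(demo_list_path, num_classes=400, data_root="videos")
os.remove(demo_list_path)
print(demo_samples)
```

Running a check like this on both list files catches out-of-range labels and malformed lines before a training job is launched.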

The class mappings used in the open-source weights are provided at Kinetics-400 class mappings.

Backbone Preparation

CLIP weights need to be downloaded from the official CLIP repo and passed to the --backbone_path command line argument.

Script Usage

Training and evaluation scripts are provided in the scripts folder. Scripts should be ready to run once the environment is set up and --backbone_path, --train_list_path and --val_list_path are replaced with your own paths.

For other command line arguments, please see the help message.

Kinetics-400 Main Results

This is a re-implementation for open-source use. We are still re-running some models, and their scripts, weights and logs will be released later. The following table reports the re-run accuracy, which may differ slightly from the original paper (typically +/-0.1%).

| Backbone | Decoder Layers | #frames x stride | top-1 | top-5 | Script | Model | Log |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-B/16 | 4 | 8 x 16 | 82.8 | 95.8 | script | google drive | google drive |
| ViT-B/16 | 4 | 16 x 16 | 83.7 | 96.2 | script | google drive | google drive |
| ViT-B/16 | 4 | 32 x 8 | 84.3 | 96.6 | script | google drive | google drive |
| ViT-L/14 | 4 | 8 x 16 | 86.3 | 97.2 | script | google drive | google drive |
| ViT-L/14 | 4 | 16 x 16 | 86.9 | 97.4 | script | google drive | google drive |
| ViT-L/14 | 4 | 32 x 8 | 87.7 | 97.6 | script | google drive | google drive |
| ViT-L/14 (336px) | 4 | 32 x 8 | 87.7 | 97.8 | | | |
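
To make the "#frames x stride" column concrete, the sketch below assumes it means the number of sampled frames and the gap, in source-video frames, between consecutive samples. That reading, the function names, and the fixed start index are assumptions for illustration; the released data loader may choose the clip start and handle short videos differently.

```python
def sample_frame_indices(num_frames, stride, start=0):
    """Indices of `num_frames` frames taken every `stride` frames from `start`."""
    return [start + i * stride for i in range(num_frames)]


def temporal_span(num_frames, stride):
    """Number of source frames covered from the first to the last sampled frame."""
    return (num_frames - 1) * stride + 1


print(sample_frame_indices(8, 16))  # [0, 16, 32, 48, 64, 80, 96, 112]
print(temporal_span(8, 16))         # 113
print(temporal_span(16, 16))        # 241
print(temporal_span(32, 8))         # 249
```

Under this reading, the 16 x 16 and 32 x 8 settings cover a similar stretch of video, with the latter sampling it twice as densely.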

Data Loading Speed

As the training process is fast, video frames are consumed at a very high rate. For easier installation, the current version uses PyTorch's built-in data loaders. They are not very efficient and can become a bottleneck when using ViT-B as the backbone. We provide a --dummy_dataset option that bypasses actual video decoding so that training speed can be measured. The model accuracy should not be affected. Our internal data loader is implemented in pure C++ and does not bottleneck training by much on a machine with 2x Xeon Gold 6148 CPUs and 4x V100 GPUs.
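
One way to judge whether data loading is the bottleneck is to time the loader on its own and compare the result with the training throughput obtained under --dummy_dataset. The helper below is a minimal sketch with hypothetical names: it accepts any iterable of batches (a PyTorch DataLoader would be one option) and is not part of the released code.

```python
import time


def measure_throughput(batches, batch_size, warmup=2, clock=time.perf_counter):
    """Return samples per second for iterating `batches`.

    The first `warmup` batches are consumed but not timed, since worker
    start-up usually makes them unrepresentative.
    """
    it = iter(batches)
    for _ in range(warmup):
        next(it, None)
    start = clock()
    count = 0
    for _ in it:
        count += 1
    elapsed = clock() - start
    return count * batch_size / elapsed if elapsed > 0 else float("inf")


def slow_batches(n, delay):
    """Stand-in for a real loader: yields `n` dummy batches, `delay` seconds apart."""
    for i in range(n):
        time.sleep(delay)
        yield i


rate = measure_throughput(slow_batches(7, 0.01), batch_size=8)
print(f"{rate:.0f} samples/s")
```

If the loader-only rate is well below the rate the model sustains with --dummy_dataset, video decoding is what limits training speed.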

Acknowledgements

The data loader code is modified from PySlowFast. Thanks for their awesome work!
