Giter Site home page Giter Site logo

liuguoyou / video-swin-transformer Goto Github PK

View Code? Open in Web Editor NEW

This project forked from swintransformer/video-swin-transformer

0.0 2.0 0.0 42.02 MB

This is an official implementation for "Video Swin Transformers".

License: Apache License 2.0

Python 97.58% Dockerfile 0.30% Shell 2.06% Makefile 0.03% Batchfile 0.03%

video-swin-transformer's Introduction

Video Swin Transformer

By Ze Liu*, Jia Ning*, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin and Han Hu.

This repo is the official implementation of "Video Swin Transformer". It is based on mmaction2.

Updates

06/25/2021 Initial commits

Introduction

Video Swin Transformer is initially described in "Video Swin Transformer", which advocates an inductive bias of locality in video Transformers, leading to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2).

teaser

Results and Models

Kinetics 400

Backbone Pretrain Lr Schd spatial crop acc@1 acc@5 #params FLOPs config model
Swin-T ImageNet-1K 30ep 224 78.8 93.6 28M 87.9G config github/baidu
Swin-S ImageNet-1K 30ep 224 80.6 94.5 50M 165.9G config github/baidu
Swin-B ImageNet-1K 30ep 224 80.6 94.6 88M 281.6G config github/baidu
Swin-B ImageNet-22K 30ep 224 82.7 95.5 88M 281.6G config github/baidu

Kinetics 600

Backbone Pretrain Lr Schd spatial crop acc@1 acc@5 #params FLOPs config model
Swin-B ImageNet-22K 30ep 224 84.0 96.5 88M 281.6G config github/baidu

Something-Something V2

Backbone Pretrain Lr Schd spatial crop acc@1 acc@5 #params FLOPs config model
Swin-B Kinetics 400 60ep 224 69.6 92.7 89M 320.6G config github/baidu

Notes:

Usage

Installation

Please refer to install.md for installation.

We also provide docker file cuda10.1 (image url) and cuda11.0 (image url) for convenient usage.

Data Preparation

Please refer to data_preparation.md for a general knowledge of data preparation. The supported datasets are listed in supported_datasets.md.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <CHECKPOINT_FILE> --eval top_k_accuracy

# multi-gpu testing
bash tools/dist_test.sh <CONFIG_FILE> <CHECKPOINT_FILE> <GPU_NUM> --eval top_k_accuracy

Training

To train a video recognition model with pre-trained image models (for Kinetics-400 and Kineticc-600 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-T model for Kinetics-400 dataset with 8 gpus, run:

bash tools/dist_train.sh configs/recognition/swin/swin_tiny_patch244_window877_kinetics400_1k.py 8 --cfg-options model.backbone.pretrained=<PRETRAIN_MODEL> 

To train a video recognizer with pre-trained video models (for Something-Something v2 datasets), run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
bash tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options load_from=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Swin-B model for SSv2 dataset with 8 gpus, run:

bash tools/dist_train.sh configs/recognition/swin/swin_base_patch244_window1677_sthv2.py 8 --cfg-options load_from=<PRETRAIN_MODEL>

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.

Apex (optional):

We use apex for mixed precision training by default. To install apex, use our provided docker or run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, comment out the following code block in the configuration files:

# do not use mmcv version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citation

If you find our work useful in your research, please cite:

@article{liu2021video,
  title={Video Swin Transformer},
  author={Liu, Ze and Ning, Jia and Cao, Yue and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Hu, Han},
  journal={arXiv preprint arXiv:2106.13230},
  year={2021}
}

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

Other Links

Image Classification: See Swin Transformer for Image Classification.

Object Detection: See Swin Transformer for Object Detection.

Semantic Segmentation: See Swin Transformer for Semantic Segmentation.

Self-Supervised Learning: See MoBY with Swin Transformer.

video-swin-transformer's People

Contributors

carolinecheng233 avatar congee524 avatar domdu avatar dreamerlin avatar hellock avatar hust-nj avatar hypnosxc avatar innerlee avatar irvingzhang0512 avatar jackytown avatar jin-s13 avatar joannalxy avatar kennymckormick avatar magicdream2222 avatar mmeendez8 avatar parskatt avatar rlleshi avatar sczwangxiao avatar sebastienlinker avatar shoufachen avatar sunnyxiaohu avatar tangh avatar wangruohui avatar wjn922 avatar wwdok avatar xwen99 avatar yaochaorui avatar yrquni avatar yuta1125tp avatar yzfly avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.