This project forked from mcg-nju/videomae

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

Home Page: https://arxiv.org/abs/2203.12602

License: Other

Python 95.11% Shell 4.89%

Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [Arxiv]

VideoMAE Framework

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong, Yibing Song, Jue Wang, Limin Wang
Nanjing University, Tencent AI Lab

📰 News

[2022.4.24] Code and pre-trained models are available now! Please give a star⭐️ for our best efforts.😆
[2022.4.15] The LICENSE of this project has been upgraded to CC-BY-NC 4.0.
[2022.3.24] Code and pre-trained models will be released here. Welcome to watch this repository for the latest updates.

✨ Highlights

🔥 Masked Video Modeling for Video Pre-Training

VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%-95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
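As a rough illustration of tube masking, the sketch below samples one spatial mask and repeats it along the temporal axis, so a masked patch stays masked in every frame. This is a minimal NumPy sketch, not the repository's implementation: the function name `tube_mask` and the token grid (8 temporal slices of 14 x 14 patches) are assumptions chosen for the example.

```python
import numpy as np


def tube_mask(num_temporal, num_spatial, mask_ratio, rng):
    """Return a boolean mask of shape (num_temporal, num_spatial); True = masked.

    Hypothetical helper for illustration only.
    """
    # Decide how many spatial patches to hide in a single temporal slice.
    num_masked = int(round(mask_ratio * num_spatial))

    # Sample the hidden patch positions once, without replacement.
    frame_mask = np.zeros(num_spatial, dtype=bool)
    masked_idx = rng.choice(num_spatial, size=num_masked, replace=False)
    frame_mask[masked_idx] = True

    # Repeat the same spatial mask for every temporal slice: this is the "tube".
    return np.tile(frame_mask, (num_temporal, 1))


# Assumed token grid: 8 temporal slices, 14 x 14 spatial patches per slice.
rng = np.random.default_rng(0)
mask = tube_mask(num_temporal=8, num_spatial=14 * 14, mask_ratio=0.9, rng=rng)

visible_tokens = int((~mask).sum())
print(f"masked fraction: {mask.mean():.3f}, visible tokens: {visible_tokens}")
```

Because the mask is identical across time, a patch hidden in one slice cannot be recovered by copying the same location from a neighbouring slice, which is what makes the reconstruction task hard.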

⚡️ A Simple, Efficient and Strong Baseline in SSVP

VideoMAE uses a simple masked autoencoder and a plain ViT backbone to perform video self-supervised learning. Due to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2x speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.

😮 High performance, but NO extra data required

VideoMAE works well for video datasets of different scales and achieves 84.7% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51. To the best of our knowledge, VideoMAE is the first to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones, without using any extra data or pre-trained models.

🚀 Main Results

✨ Something-Something V2

| Method | Extra Data | Backbone | Frames x Clips x Crops | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 16x2x3 | 70.3 | 92.7 |
| VideoMAE | no | ViT-L | 16x2x3 | 74.2 | 94.7 |
| VideoMAE | no | ViT-L | 32x1x3 | 75.3 | 95.2 |

✨ Kinetics-400

| Method | Extra Data | Backbone | Frames x Clips x Crops | Top-1 | Top-5 |
| --- | --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 16x5x3 | 80.9 | 94.7 |
| VideoMAE | no | ViT-L | 16x5x3 | 84.7 | 96.5 |
| VideoMAE | Kinetics-700 | ViT-L | 16x5x3 | 85.8 | 96.8 |
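The "Frames x Clips x Crops" column in the tables above describes multi-view testing: each video is evaluated on several temporal clips and spatial crops, and the per-view scores are combined into one prediction. The sketch below shows one common way to read such a setting and to average view scores. The helper names `parse_views` and `aggregate_views` are hypothetical, and simple averaging is an assumption about the aggregation rule, not a statement of what the repository's evaluation script does.

```python
import numpy as np


def parse_views(spec):
    """Split a setting such as '16x2x3' into (frames, clips, crops)."""
    frames, clips, crops = (int(part) for part in spec.split("x"))
    return frames, clips, crops


def aggregate_views(view_scores):
    """Average per-view class scores (shape: views x classes) into one vector."""
    return np.asarray(view_scores, dtype=float).mean(axis=0)


frames, clips, crops = parse_views("16x2x3")
num_views = clips * crops  # each (clip, crop) pair is one forward pass
print(f"{frames} frames per view, {num_views} views per video")

# Toy scores for two views over two classes (made-up numbers for illustration).
video_scores = aggregate_views([[0.2, 0.8], [0.6, 0.4]])
predicted_class = int(video_scores.argmax())
```

Under this reading, 16x5x3 costs 15 forward passes per video and 32x1x3 costs 3, which is worth keeping in mind when comparing rows with different settings.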

✨ UCF101 & HMDB51

| Method | Extra Data | Backbone | UCF101 | HMDB51 |
| --- | --- | --- | --- | --- |
| VideoMAE | no | ViT-B | 90.8 | 61.1 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |

🔨 Installation

Please follow the instructions in INSTALL.md.

➡️ Data Preparation

Please follow the instructions in DATASET.md for data preparation.

🔄 Pre-training

The pre-training instruction is in PRETRAIN.md.

⤴️ Fine-tuning with pre-trained models

The fine-tuning instruction is in FINETUNE.md.

📍Model Zoo

We provide pre-trained and fine-tuned models in MODEL_ZOO.md.

👀 Visualization

We provide a visualization script in vis.sh. A Colab notebook for better visualization is coming soon.

☎️ Contact

Zhan Tong: [email protected]

👍 Acknowledgements

Thanks to Ziteng Gao, Lei Chen and Chongjian Ge for their kind support. This project is built upon MAE-pytorch and BEiT. Thanks to the contributors of these great codebases.

🔒 License

This project is released under the CC-BY-NC 4.0 license as found in the LICENSE file.

✏️ Citation

If you think this project is helpful, please feel free to give a star⭐️ and cite our paper:

@article{videomae,
  title={VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training},
  author={Tong, Zhan and Song, Yibing and Wang, Jue and Wang, Limin},
  journal={arXiv preprint arXiv:2203.12602},
  year={2022}
}
