DeepMIM

Introduction

This repository is the official implementation of our paper:

DeepMIM: Deep Supervision for Masked Image Modeling

[arxiv] [code]

Sucheng Ren, Fangyun Wei, Samuel Albanie, Zheng Zhang, Han Hu

Deep supervision, which adds extra supervision to the intermediate features of a neural network, was widely used in image classification in the early deep learning era because it significantly reduces training difficulty and eases optimization, for example by avoiding the vanishing gradients seen in vanilla training. Nevertheless, with the emergence of normalization techniques and residual connections, deep supervision in image classification was gradually phased out. In this paper, we revisit deep supervision for masked image modeling (MIM), which pre-trains a Vision Transformer (ViT) via a mask-and-predict scheme. Experimentally, we find that deep supervision drives the shallower layers to learn more meaningful representations, accelerates model convergence, and expands attention diversity. Our approach, called DeepMIM, significantly boosts the representation capability of each layer. In addition, DeepMIM is compatible with many MIM models across a range of reconstruction targets.
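
To make the idea of deep supervision concrete, here is a minimal sketch in plain Python. It is an illustration, not the repository's implementation: it assumes that each supervised layer produces its own reconstruction of the patches and that the per-layer losses on masked patches are simply summed. The helper names (`masked_mse`, `deep_supervision_loss`), the equal weighting, and the toy data are all hypothetical.

```python
def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only.

    pred, target: lists of patches, each patch a list of floats.
    mask: list of 0/1 flags, where 1 marks a masked patch.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
            count += 1
    return total / max(count, 1)


def deep_supervision_loss(layer_preds, target, mask, weights=None):
    """Weighted sum of reconstruction losses from several layers.

    Equal weights are an assumption made for this sketch.
    """
    if weights is None:
        weights = [1.0] * len(layer_preds)
    return sum(w * masked_mse(p, target, mask)
               for w, p in zip(weights, layer_preds))


# Toy data: two patches of two values each; only the first patch is masked.
target = [[1.0, 1.0], [0.0, 0.0]]
mask = [1, 0]
layer_preds = [
    [[0.0, 0.0], [5.0, 5.0]],  # reconstruction from an intermediate layer
    [[1.0, 1.0], [5.0, 5.0]],  # reconstruction from the last layer
]
loss = deep_supervision_loss(layer_preds, target, mask)
print(loss)
```

In this toy case the intermediate layer contributes all of the loss, which is the point of the technique: shallower layers receive a direct training signal instead of relying only on gradients from the final output.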

News

  • Code and checkpoints are released!

Installation

This repository is built on top of MAE.

Pretraining

We pretrain DeepMIM on 32 V100 GPUs (4 nodes with 8 GPUs each) with an overall batch size of 4096, identical to that used in MAE.

python -m torch.distributed.launch \
--nnodes 4 --node_rank $noderank \
--nproc_per_node 8 --master_addr $ip --master_port $port \
main_pretrain.py \
    --batch_size 128 \
    --model mae_vit_base_patch16 \
    --norm_pix_loss --clip_path /path/to/clip \
    --mask_ratio 0.75 \
    --epochs 1600 \
    --warmup_epochs 40 \
    --blr 1.5e-4 --weight_decay 0.05 \
    --data_path /path/to/imagenet/
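
The overall batch size of 4096 follows from the launch flags above. The short sketch below works through that arithmetic and the resulting learning rate. It assumes this repository keeps MAE's linear scaling rule (absolute lr = blr × effective batch size / 256) and that no gradient accumulation is used; the function names are hypothetical helpers, not part of the codebase.

```python
def effective_batch_size(nnodes, nproc_per_node, batch_size, accum_iter=1):
    """Global batch size: nodes x GPUs per node x per-GPU batch x accumulation steps."""
    return nnodes * nproc_per_node * batch_size * accum_iter


def absolute_lr(blr, eff_batch_size, base_batch=256):
    """Linear scaling rule as used in MAE (assumed unchanged here)."""
    return blr * eff_batch_size / base_batch


# Values taken from the pretraining command above.
eff = effective_batch_size(nnodes=4, nproc_per_node=8, batch_size=128)
lr = absolute_lr(blr=1.5e-4, eff_batch_size=eff)
print(eff)  # 4096
print(lr)
```

If you train on fewer GPUs, keeping the product of these factors at 4096 (for example by raising the accumulation steps) should preserve the same effective batch size under these assumptions.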

Fine-tuning on ImageNet-1K (Classification)

Expected results: 85.6% Top-1 accuracy [log]

python -m torch.distributed.launch --nproc_per_node=8 main_finetune.py \
    --batch_size 128 \
    --model vit_base_patch16 \
    --finetune ./output_dir/checkpoint-1599.pth \
    --epochs 100 \
    --output_dir ./out_finetune/ \
    --blr 1e-4 --layer_decay 0.6 \
    --weight_decay 0.05 --drop_path 0.1 --reprob 0.25 --mixup 0.8 --cutmix 1.0 \
    --dist_eval --data_path /path/to/imagenet
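
The `--layer_decay 0.6` flag controls layer-wise learning rate decay. The sketch below shows how the per-layer multipliers would come out for ViT-B, assuming the fine-tuning script keeps MAE's scheme (layer 0 is the patch embedding and tokens, layers 1 to 12 are the transformer blocks, and the last id is the classification head). The function name is a hypothetical helper, and the base learning rate again assumes MAE's linear scaling rule with no gradient accumulation.

```python
def layer_lr_scales(layer_decay, depth):
    """Per-layer lr multipliers under MAE-style layer-wise decay.

    Layer ids: 0 = patch embedding / tokens, 1..depth = transformer blocks,
    depth + 1 = classification head (multiplier 1.0).
    """
    num_layers = depth + 1
    return [layer_decay ** (num_layers - i) for i in range(num_layers + 1)]


# ViT-B has 12 transformer blocks.
scales = layer_lr_scales(layer_decay=0.6, depth=12)

# 8 GPUs x 128 per GPU = 1024 effective batch size, scaled from blr = 1e-4.
base_lr = 1e-4 * (8 * 128) / 256

for layer_id in (0, 6, 12, 13):
    print(layer_id, scales[layer_id] * base_lr)
```

With a decay of 0.6 the earliest layers receive a learning rate roughly three orders of magnitude smaller than the head, so fine-tuning mostly adapts the later blocks while leaving the pretrained shallow features largely intact.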

Fine-tuning on ADE20K (Semantic Segmentation)

Please refer to Segmentation/README.md.

Checkpoint

The models pretrained and fine-tuned on ImageNet-1K are available at

[Google Drive]

Comparison

Performance comparison on ImageNet-1K classification (Top-1 accuracy, %) and ADE20K semantic segmentation (mIoU).

Method         Model Size   Top-1   mIoU
MAE            ViT-B        83.6    48.1
DeepMIM-CLIP   ViT-B        85.6    53.1

Citation

If you have any questions, feel free to contact Sucheng Ren :)

@article{ren2023deepmim,
    title={DeepMIM: Deep Supervision for Masked Image Modeling},
    author={Sucheng Ren and Fangyun Wei and Samuel Albanie and Zheng Zhang and Han Hu},
    year={2023},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}
