DaViT: Dual Attention Vision Transformer (ECCV 2022)
This repo contains the official detection and segmentation implementation of the paper "DaViT: Dual Attention Vision Transformer", by Mingyu Ding, Bin Xiao, Noel Codella, Ping Luo, Jingdong Wang, and Lu Yuan. See Introduction.md for an introduction.
The official implementation for image classification will be released at https://github.com/microsoft/DaViT.
Getting Started
Python 3, PyTorch >= 1.8.0, and torchvision >= 0.7.0 are required for the current codebase.
```bash
# An example on CUDA 10.2
pip install torch==1.9.0+cu102 torchvision==0.10.0+cu102 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install thop pyyaml fvcore pillow==8.3.2
```
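To confirm that the installed versions satisfy the requirements above, version strings should be compared numerically rather than lexicographically (as strings, "1.10.0" sorts before "1.9.0"). The helper names below are illustrative, not part of the repo; a minimal sketch:

```python
def version_tuple(version: str) -> tuple:
    """Parse a version like '1.9.0+cu102' into (1, 9, 0), ignoring any local '+cuXXX' suffix."""
    core = version.split("+")[0]
    return tuple(int(part) for part in core.split("."))

def meets_minimum(installed: str, minimum: str) -> bool:
    """True if the installed version is at least the required minimum."""
    return version_tuple(installed) >= version_tuple(minimum)

# PyTorch >= 1.8.0 and torchvision >= 0.7.0, as stated above
assert meets_minimum("1.9.0+cu102", "1.8.0")
assert meets_minimum("0.10.0+cu102", "0.7.0")
# Lexicographic comparison would get this wrong; tuple comparison does not:
assert version_tuple("1.10.0") > version_tuple("1.9.0")
```

In practice the installed versions come from `torch.__version__` and `torchvision.__version__`.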
Object Detection and Instance Segmentation
- `cd mmdet` & install mmcv/mmdet:

  ```bash
  # An example on CUDA 10.2 and pytorch 1.9
  pip install mmcv-full==1.3.0 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.9.0/index.html
  pip install -r requirements/build.txt
  pip install -v -e .  # or "python setup.py develop"
  ```

- `mkdir data` & prepare the dataset in data/coco/ (Format: ROOT/mmdet/data/coco/annotations, train2017, val2017)
- Finetune on COCO:

  ```bash
  bash tools/dist_train.sh configs/davit_retinanet_1x_coco.py 8 \
      --cfg-options model.pretrained=PRETRAINED_MODEL_PATH
  ```
Semantic Segmentation
- `cd mmseg` & install mmcv/mmseg:

  ```bash
  # An example on CUDA 10.2 and pytorch 1.9
  pip install mmcv-full==1.3.0 -f https://download.openmmlab.com/mmcv/dist/cu102/torch1.9.0/index.html
  pip install -e .
  ```

- `mkdir data` & prepare the dataset in data/ade/ (Format: ROOT/mmseg/data/ADEChallengeData2016)
- Finetune on ADE:

  ```bash
  bash tools/dist_train.sh configs/upernet_davit_512x512_160k_ade20k.py 8 \
      --options model.pretrained=PRETRAINED_MODEL_PATH
  ```

- Multi-scale Testing:

  ```bash
  bash tools/dist_test.sh configs/upernet_davit_512x512_160k_ade20k.py \
      TRAINED_MODEL_PATH 8 --aug-test --eval mIoU
  ```
Benchmarking
Image Classification on ImageNet-1K

Model | Pretrain | Resolution | acc@1 | acc@5 | #params | FLOPs | Checkpoint | Log |
---|---|---|---|---|---|---|---|---|
DaViT-T | IN-1K | 224 | 82.8 | 96.2 | 28.3M | 4.5G | download | log |
DaViT-S | IN-1K | 224 | 84.2 | 96.9 | 49.7M | 8.8G | download | log |
DaViT-B | IN-1K | 224 | 84.6 | 96.9 | 87.9M | 15.5G | download | log |
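The table shows the usual accuracy/compute trade-off as the model scales. A quick sketch using the numbers above (the figures are copied from the table; the comparison logic is illustrative only):

```python
# ImageNet-1K top-1 accuracy, parameter count (M), and FLOPs (G) from the table above
results = {
    "DaViT-T": {"acc1": 82.8, "params_m": 28.3, "flops_g": 4.5},
    "DaViT-S": {"acc1": 84.2, "params_m": 49.7, "flops_g": 8.8},
    "DaViT-B": {"acc1": 84.6, "params_m": 87.9, "flops_g": 15.5},
}

for name, r in results.items():
    print(f"{name}: {r['acc1']:.1f}% top-1 at {r['flops_g']}G FLOPs, {r['params_m']}M params")

# Going from DaViT-S to DaViT-B roughly 1.8x's the compute for +0.4 top-1
gain = results["DaViT-B"]["acc1"] - results["DaViT-S"]["acc1"]
flops_ratio = results["DaViT-B"]["flops_g"] / results["DaViT-S"]["flops_g"]
```

DaViT-T is the natural choice when compute is tight; the larger variants buy roughly one more point of top-1 per doubling of FLOPs.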
Object Detection and Instance Segmentation on COCO

Mask R-CNN
Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | mask mAP | Checkpoint | Log |
---|---|---|---|---|---|---|---|---|
DaViT-T | ImageNet-1K | 1x | 47.8M | 263G | 45.0 | 41.1 | download | log |
DaViT-T | ImageNet-1K | 3x | 47.8M | 263G | 47.4 | 42.9 | download | log |
DaViT-S | ImageNet-1K | 1x | 69.2M | 351G | 47.7 | 42.9 | download | log |
DaViT-S | ImageNet-1K | 3x | 69.2M | 351G | 49.5 | 44.3 | download | log |
DaViT-B | ImageNet-1K | 1x | 107.3M | 491G | 48.2 | 43.3 | download | log |
DaViT-B | ImageNet-1K | 3x | 107.3M | 491G | 49.9 | 44.6 | download | log |
RetinaNet
Backbone | Pretrain | Lr Schd | #params | FLOPs | box mAP | Checkpoint | Log |
---|---|---|---|---|---|---|---|
DaViT-T | ImageNet-1K | 1x | 38.5M | 244G | 44.0 | download | log |
DaViT-T | ImageNet-1K | 3x | 38.5M | 244G | 46.5 | download | log |
DaViT-S | ImageNet-1K | 1x | 59.9M | 332G | 46.0 | download | log |
DaViT-S | ImageNet-1K | 3x | 59.9M | 332G | 48.2 | download | log |
DaViT-B | ImageNet-1K | 1x | 98.5M | 471G | 46.7 | download | log |
DaViT-B | ImageNet-1K | 3x | 98.5M | 471G | 48.7 | download | log |
Semantic Segmentation on ADE20K

Backbone | Pretrain | Method | Resolution | Iters | #params | FLOPs | mIoU | Checkpoint | Log |
---|---|---|---|---|---|---|---|---|---|
DaViT-T | ImageNet-1K | UPerNet | 512x512 | 160k | 60M | 940G | 46.3 | download | log |
DaViT-S | ImageNet-1K | UPerNet | 512x512 | 160k | 81M | 1030G | 48.8 | download | log |
DaViT-B | ImageNet-1K | UPerNet | 512x512 | 160k | 121M | 1175G | 49.4 | download | log |
Citation
If you find this repo useful for your project, please consider citing it with the following BibTeX entry:
```
@inproceedings{ding2022davit,
  title={DaViT: Dual Attention Vision Transformer},
  author={Ding, Mingyu and Xiao, Bin and Codella, Noel and Luo, Ping and Wang, Jingdong and Yuan, Lu},
  booktitle={ECCV},
  year={2022},
}
```
Acknowledgement
Our codebase is built on top of timm, MMDetection, and MMSegmentation. We thank the authors for their nicely organized code!