Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet ([arXiv:2101.11986](https://arxiv.org/abs/2101.11986))
Update: token_performer.py has been updated; T2T-ViT-7 can now be trained on 4 GPUs with 12GB of memory each, and the other T2T-ViT models can also be trained on 4 or 8 GPUs.
Our code is based on the official ImageNet example from PyTorch and on pytorch-image-models (timm) by Ross Wightman.
Requirements:
- timm (`pip install timm`)
- torch>=1.4.0
- torchvision>=0.5.0
- pyyaml
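As a quick sanity check of the environment (a minimal sketch; nothing here is specific to this repo):

```python
# Check that the dependencies above import and meet the stated minimums
# (torch>=1.4.0, torchvision>=0.5.0).
import timm
import torch
import torchvision
import yaml  # provided by the pyyaml package

print("timm:", timm.__version__)
print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())  # training and evaluation below assume GPUs
```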
Model | T2T Transformer | Top-1 Acc (%) | #Params | Download |
---|---|---|---|---|
T2T-ViT-7 | Performer | 71.1 | 4.2M | coming |
T2T-ViT-10 | Performer | 74.1 | 5.9M | coming |
T2T-ViT-12 | Performer | 75.5 | 6.9M | here |
T2T-ViT-14 | Performer | 80.6 | 21.5M | here |
T2T-ViT-19 | Performer | 81.4 | 39.0M | here |
T2T-ViT-24 | Performer | 81.8 | 64.1M | here |
T2T-ViT_t-14 | Transformer | 80.7 | 21.5M | here |
T2T-ViT_t-19 | Transformer | 81.75 | 39.0M | here |
T2T-ViT_t-24 | Transformer | 82.2 | 64.1M | here |
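As a rough sketch of using one of these checkpoints outside main.py (assuming the repo's model definitions live in a `models` package and are registered with timm under the same names the `--model` flag uses, and assuming the checkpoints follow timm's `state_dict`/`state_dict_ema` layout; neither is confirmed by this README):

```python
import torch
import timm

import models  # assumed: importing the repo's model package registers the architectures with timm

# Same model name that main.py's --model flag uses.
model = timm.create_model('T2t_vit_14', num_classes=1000)

# Load a downloaded checkpoint; the key layout is an assumption based on timm's convention.
checkpoint = torch.load('path/to/checkpoint', map_location='cpu')
state_dict = checkpoint.get('state_dict_ema') or checkpoint.get('state_dict') or checkpoint
model.load_state_dict(state_dict)
model.eval()
```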
To test T2T-ViT-12 (which uses a Performer in the T2T module), download the T2T-ViT-12 checkpoint and run:
CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model T2t_vit_12 -b 100 --eval_checkpoint path/to/checkpoint
To test T2T-ViT-14 (which uses a Performer in the T2T module), download the T2T-ViT-14 checkpoint and run:
CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model T2t_vit_14 -b 100 --eval_checkpoint path/to/checkpoint
To test T2T-ViT_t-24 (which uses a standard Transformer in the T2T module), download the T2T-ViT_t-24 checkpoint and run:
CUDA_VISIBLE_DEVICES=0 python main.py path/to/data --model T2t_vit_t_24 -b 100 --eval_checkpoint path/to/checkpoint
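For a quick programmatic check of a loaded model (a sketch building on the loading snippet above; the image path and the resize/crop/normalization values are generic ImageNet defaults, not necessarily the exact evaluation protocol behind the numbers in the table):

```python
import torch
from PIL import Image
from torchvision import transforms

# Generic ImageNet preprocessing at the 224x224 resolution used above.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open('cat.jpg').convert('RGB')).unsqueeze(0)  # (1, 3, 224, 224)

with torch.no_grad():
    logits = model(img)  # `model` from the loading sketch above
print("predicted class index:", logits.argmax(dim=1).item())
```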
To train T2T-ViT-14, T2T-ViT-19, or T2T-ViT-24 (with a Performer in the T2T module), run, e.g. for T2T-ViT-14:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model T2t_vit_14 -b 64 --lr 5e-4 --weight-decay .05 --img-size 224
(In the timm training script, -b is the per-GPU batch size, so the effective batch size here is 8 × 64 = 512.)
To train T2T-ViT-12 (with a Performer in the T2T module), run:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model T2t_vit_12 -b 64 --lr 5e-4 --weight-decay .035 --cutmix 0.0 --reprob 0.25 --img-size 224
To train T2T-ViT_t-14, T2T-ViT_t-19, or T2T-ViT_t-24 (with a standard Transformer in the T2T module), run, e.g. for T2T-ViT_t-14:
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 path/to/data --model T2t_vit_t_14 -b 64 --lr 5e-4 --weight-decay .05 --img-size 224
To visualize the image features of ResNet50, open and run the visualization-resnet.ipynb file in Jupyter Notebook or JupyterLab; example results are included in the notebook.
To visualize the image features of ViT, open and run the visualization-vit.ipynb file in Jupyter Notebook or JupyterLab; example results are included in the notebook.
To visualize attention maps, you can refer to this file. A simple example, visualizing the attention maps of attention blocks 4 and 5, is sketched below:
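The referenced example is not reproduced here; the following is a minimal sketch of one way to capture those attention maps with forward hooks. It assumes a timm-style block layout (`model.blocks[i].attn.attn_drop` applied to the softmaxed attention matrix) and 0-based block indices; both are assumptions about this codebase:

```python
import torch
import matplotlib.pyplot as plt

# Collect attention maps from blocks 4 and 5 via forward hooks.
# In a timm-style attention module, attn_drop is applied to the softmaxed
# attention matrix, so its output has shape (batch, heads, tokens, tokens).
attn_maps = {}

def make_hook(name):
    def hook(module, inputs, output):
        attn_maps[name] = output.detach()
    return hook

handles = [
    model.blocks[idx].attn.attn_drop.register_forward_hook(make_hook(f'block{idx}'))
    for idx in (4, 5)  # assumed 0-based indexing for "blocks 4 and 5"
]

with torch.no_grad():
    model(img)  # `model` and `img` from the sketches above

for handle in handles:
    handle.remove()

# Average over heads and take the CLS token's attention to the patch tokens.
cls_attn = attn_maps['block4'][0].mean(0)[0, 1:]  # (196,) for 14x14 patch tokens at 224px
plt.imshow(cls_attn.reshape(14, 14).numpy())
plt.title('Block 4 CLS attention')
plt.show()
```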
Updating...
If you find this repo useful, please consider citing:
@misc{yuan2021tokenstotoken,
    title={Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet},
    author={Li Yuan and Yunpeng Chen and Tao Wang and Weihao Yu and Yujun Shi and Francis EH Tay and Jiashi Feng and Shuicheng Yan},
    year={2021},
    eprint={2101.11986},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}