
UM-MAE

Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality

Xiang Li, Wenhai Wang, Lingfeng Yang, Jian Yang

ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.
Visualization: See Colab notebook.

@article{Li2022ummae,
  author  = {Li, Xiang and Wang, Wenhai and Yang, Lingfeng and Yang, Jian},
  journal = {arXiv:2205.10063},
  title   = {Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality},
  year    = {2022},
}

Updates

30/May/2022: Visualization code/demo is updated at Colab notebook.

26/May/2022: The Chinese blog post of this paper is available on Zhihu.

23/May/2022: The preprint is available on arXiv.

Motivation

(a) In MAE, the global attention window of the vanilla ViT can process an arbitrary subset of image patches, so a random 75% of them can simply be skipped; (b) in contrast, skipping those 75% of patches is unacceptable for Pyramid-based ViTs, because patch elements are not equivalent across the local windows. (c) A straightforward workaround is to feed mask tokens to the encoder (e.g., SimMIM), at the cost of slower training. (d) Our Uniform Masking (UM) approach (Uniform Sampling followed by Secondary Masking) enables efficient MAE-style pre-training for Pyramid-based ViTs while keeping competitive fine-tuning accuracy.

Introduction

UM-MAE is an efficient and general technique that supports MAE-style MIM Pre-training for popular Pyramid-based Vision Transformers (e.g., PVT, Swin).

  • We propose Uniform Masking, which successfully enables MAE pre-training (i.e., UM-MAE) for popular Pyramid-based ViTs.
  • We empirically show that UM-MAE speeds up pre-training by ~2X and reduces GPU memory consumption by at least ~2X compared to the existing state-of-the-art Masked Image Modelling (MIM) framework for Pyramid-based ViTs (i.e., SimMIM), whilst maintaining competitive fine-tuning performance. Notably, using the HTC++ detector, a Swin-Large backbone pre-trained with UM-MAE on ImageNet-1K only (57.4 AP^bbox, 49.8 AP^mask) can even outperform the one supervised on ImageNet-22K (57.1 AP^bbox, 49.5 AP^mask).
  • We also reveal and discuss several notably different behaviors between the Vanilla ViT and Pyramid-based ViTs under MIM.

Main Results on ImageNet-1K

| Models | Pre-train Method | Sampling Strategy | Secondary Mask Ratio | Encoder Ratio | Pretrain Epochs | Pretrain Hours | FT acc@1 (%) | FT weight/log |
|--------|------------------|-------------------|----------------------|---------------|-----------------|----------------|--------------|---------------|
| ViT-B  | MAE    | RS | --  | 25%  | 200 | todo | 82.88 | weight/log |
| ViT-B  | MAE    | UM | 25% | 25%  | 200 | todo | 82.88 | weight/log |
| PVT-S  | SimMIM | RS | --  | 100% | 200 | 38.0 | 79.28 | weight/log |
| PVT-S  | UM-MAE | UM | 25% | 25%  | 200 | 21.3 | 79.31 | weight/log |
| Swin-T | SimMIM | RS | --  | 100% | 200 | 49.3 | 82.20 | weight/log |
| Swin-T | UM-MAE | UM | 25% | 25%  | 200 | 25.0 | 82.04 | weight/log |
| Swin-L | SimMIM | RS | --  | 100% | 800 | --   | 85.4  | link |
| Swin-L | UM-MAE | UM | 25% | 25%  | 800 | todo | 85.2  | weight/log |

RS: Random Sampling; UM: Uniform Masking, consisting of Uniform Sampling and Secondary Masking
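
For a concrete picture of the two stages, below is a minimal, illustrative sketch of a UM-style mask generator. It is not the repository's mask_transform.py implementation; it only assumes a square token grid with an even side length and a 25% secondary mask ratio.

import numpy as np

def uniform_masking(grid_size=16, secondary_ratio=0.25, rng=None):
    """Illustrative sketch of Uniform Masking (not the repo's exact code).

    Stage 1 (Uniform Sampling): keep exactly one random patch in every
    2x2 block, so 25% of patches survive and remain evenly distributed.
    Stage 2 (Secondary Masking): further mask `secondary_ratio` of the
    sampled patches (these positions carry the shared mask token).
    Returns boolean grids: `sampled` (kept by stage 1) and `secondary`
    (sampled but masked again by stage 2).
    """
    rng = np.random.default_rng() if rng is None else rng
    sampled = np.zeros((grid_size, grid_size), dtype=bool)
    for i in range(0, grid_size, 2):
        for j in range(0, grid_size, 2):
            di, dj = rng.integers(0, 2, size=2)  # pick one cell per 2x2 block
            sampled[i + di, j + dj] = True

    kept_idx = np.flatnonzero(sampled)
    n_secondary = int(len(kept_idx) * secondary_ratio)
    secondary = np.zeros(grid_size * grid_size, dtype=bool)
    secondary[rng.choice(kept_idx, size=n_secondary, replace=False)] = True
    return sampled, secondary.reshape(grid_size, grid_size)

sampled, secondary = uniform_masking()
print(sampled.mean(), secondary.sum() / sampled.sum())  # ~0.25 and ~0.25

Because exactly one patch of every 2x2 block survives, the visible patches form a regular quarter-resolution grid that local-window operators in PVT/Swin can still process, while the secondary masking keeps the reconstruction task from becoming too easy.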

Acknowledgement

The pretraining and finetuning of our project are based on DeiT, MAE and SimMIM. The object detection and semantic segmentation parts are based on MMDetection and MMSegmentation respectively. Thanks for their wonderful work.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.


um-mae's Issues

Can't use provided checkpoint to pre-train

I've been trying to load the checkpoint provided in README.md for Swin-T using UM-MAE, but I haven't been able to use it. Maybe I'm doing it completely wrong, but I'm passing it as an argument with --load_from (I've also tried --resume), and I get this in the console:

[screenshot of the console error]

Any idea why this is happening? Thanks!

Hello, may I ask you some questions?

Hello, first of all, thank you very much for your excellent work. I would like to ask whether your downstream object detection task uses Mask R-CNN. Have you considered adding the project to MMSelfSup? I think it would let more people know about your model. I am also a CV practitioner who is very interested in MAE. My understanding is that the images in the pre-training phase are masked (and in your case secondarily masked), while downstream tasks take unmasked images as input. Will this introduce some bias in the parameters? Do you have any better ideas? I'd like to discuss them.

Vanilla ViT for Segmentation and Detection

Hi @implus
Thanks for the great work and released code!

I have checked the configs in both DET and SEG, but found there are no configs for ViT, which is reported in Table 1. Could you provide the configs and the corresponding models.py?

Question about the Uniform Sampling

Thanks for your great work.
I have read the code for Uniform Sampling, and its realization in mask_transform.py is interesting. However, I am confused by the following condition:

if len(candidate_list) * 4 >= self.num_patches * 2:

For example,

# Give
self.height = self.width = 14
self.num_patches = 196
# We have
len(candidate_list) = 196 / 2 = 98

In this way, the candidate_list is repeated 98 // 4 = 24 times. In fact, Grid Sampling is a special case of Uniform Sampling. However, it seems impossible to repeat the same mask pattern 196 // 4 = 49 times under the above condition.

Should the condition instead be set as follows?

if len(candidate_list) // 4 >= self.num_patches // 4:
# i.e.,
if len(candidate_list) >= self.num_patches:

Thus Grid Sampling could be realized after shuffling.

Of course, this may not affect the results; on the contrary, it avoids Grid Sampling, which leads to better convergence as in Table 1. I am just confused about the reason for this particular condition.

Can you provide the train.txt file or a specific description of it?

[17:17:52.401794] using regular, mask_candidate shape = (128, 4)
[17:17:52.401841] load info from /home/UM-MAE-main/path/to/ImageNet/train.txt
Traceback (most recent call last):
File "main_pretrain.py", line 247, in
main(args)
File "main_pretrain.py", line 139, in main
dataset_train = ImageListFolder(train_folder, transform=transform_train, ann_file=train_ann_file)
File "/home/UM-MAE-main/util/datasets.py", line 32, in init
ann = open(ann_file)
FileNotFoundError: [Errno 2] No such file or directory: '/home/UM-MAE-main/path/to/ImageNet/train.txt'
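
(Not an official answer.) The exact format is whatever ImageListFolder in util/datasets.py parses; the sketch below only assumes the common "relative_path class_index" per-line convention that such annotation lists typically use, and shows how one could generate a hypothetical train.txt from a standard ImageNet train/ folder layout.

import os

# Hypothetical generator for a "<relative/path> <class_index>" per-line list.
# Check util/datasets.py (ImageListFolder) for the format actually parsed.
def write_train_list(train_dir, out_file="train.txt"):
    classes = sorted(os.listdir(train_dir))              # e.g. n01440764, n01443537, ...
    class_to_idx = {c: i for i, c in enumerate(classes)}
    with open(out_file, "w") as f:
        for c in classes:
            for name in sorted(os.listdir(os.path.join(train_dir, c))):
                f.write(f"{c}/{name} {class_to_idx[c]}\n")

# write_train_list("/path/to/ImageNet/train")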

How to produce reconstruction visualizations during pre-training?

I haven't been able to produce images to visualize the effectiveness of the model during each epoch of pre-training. I was wondering if a method was included in the code to do this, or if you could provide the code to do this, just like in figure 8 of your paper where we can see the original picture, the mask and the reconstruction side-by-side.

Thanks!
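
(Not an official answer.) Here is a minimal sketch of how one might dump reconstructions during pre-training, assuming an MAE-style interface where the model's forward returns (loss, pred, mask) and an unpatchify helper exists as in the original MAE code; the function below and its argument names are hypothetical and would need to be adapted to the UM-MAE models (e.g., de-normalizing pred if norm_pix_loss is enabled).

import torch
import torchvision

@torch.no_grad()
def save_reconstruction(model, imgs, mask_inputs, out_path):
    # Hypothetical hook, assuming MAE-style outputs: pred is [N, L, p*p*3]
    # and mask is [N, L] with 1 for masked patches.
    loss, pred, mask = model(imgs, mask_inputs)
    recon = model.unpatchify(pred)                                  # [N, 3, H, W]
    mask_img = model.unpatchify(
        mask.unsqueeze(-1).repeat(1, 1, pred.shape[-1]).float())    # 1 = masked area
    masked_input = imgs * (1 - mask_img)
    grid = torchvision.utils.make_grid(
        torch.cat([imgs, masked_input, recon], dim=0), nrow=imgs.shape[0])
    torchvision.utils.save_image(grid, out_path)

Calling this once per epoch on a fixed batch (wrapped in model.eval()/model.train()) would give a side-by-side strip of original, masked input, and reconstruction, similar to Figure 8.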

Failed to pretrain a model with multi-nodes

Hi, sorry to bother you again.

I failed to train UM-MAE with multiple nodes; each node gets stuck here:

*****************************************
| distributed init (rank 5): env://, gpu 5
| distributed init (rank 3): env://, gpu 3
| distributed init (rank 1): env://, gpu 1
| distributed init (rank 0): env://, gpu 0
| distributed init (rank 4): env://, gpu 4
| distributed init (rank 7): env://, gpu 7
| distributed init (rank 2): env://, gpu 2
| distributed init (rank 6): env://, gpu 6

However, MAE can be trained with multiple nodes.

Why can the model see mask tokens during training?

In the MAE paper, Kaiming He points out that if ViT-MAE sees mask tokens during training but not during testing, the inconsistency results in bad accuracy. But in your paper, you send mask tokens to the encoder; is that right?

CUDA error for Swin models

Hi, thanks for your great work in combining MAE with hierarchical vision transformers!

However, when I tried to reproduce the results using your code, I encountered a CUDA error when training MAE with Swin. Here is a part of the log.

[09:52:59.848646] base lr: 1.50e-04
[09:52:59.848669] actual lr: 1.87e-05
[09:52:59.848696] accumulate grad iterations: 1
[09:52:59.848714] effective batch size: 32
[09:52:59.905504] AdamW (
Parameter Group 0
    amsgrad: False
    betas: (0.9, 0.95)
    eps: 1e-08
    lr: 1.875e-05
    weight_decay: 0.0

Parameter Group 1
    amsgrad: False
    betas: (0.9, 0.95)
    eps: 1e-08
    lr: 1.875e-05
    weight_decay: 0.05
)
[09:52:59.906380] Checkpoint not founded in /data/code/pretrain/checkpoints/pretrain/simmim_swin_tiny_256_um_simmim_bs2048_ep200_temp.pth, train from random initialization
[09:52:59.906428] Start training for 200 epochs
[09:52:59.907566] log_dir: /data/code/pretrain/tb/simmim/simmim_swin_tiny_256_um_simmim_bs2048_ep200
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
[W reducer.cpp:1050] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters. This flag results in an extra traversal of the autograd graph every iteration, which can adversely affect performance. If your model indeed never has any unused parameters, consider turning this flag off. Note that this warning may be a false positive your model has flow control causing later iterations to have unused parameters. (function operator())
Traceback (most recent call last):
  File "main_pretrain.py", line 380, in <module>
    main(args)
  File "main_pretrain.py", line 266, in main
    train_stats = train_one_epoch(
  File "/home/wanghaochen/project/UM-MAE/engine_pretrain.py", line 58, in train_one_epoch
    loss_scaler(loss, optimizer, parameters=model.parameters(),
  File "/home/wanghaochen/project/UM-MAE/util/misc.py", line 256, in __call__
    self._scaler.scale(loss).backward(create_graph=create_graph)
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([1, 32, 64, 64], dtype=torch.half, device='cuda', requires_grad=True).to(memory_format=torch.channels_last)
net = torch.nn.Conv2d(32, 24, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().half().to(memory_format=torch.channels_last)
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
    data_type = CUDNN_DATA_HALF
    padding = [0, 0, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x7f4c4403d690
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 32, 64, 64,
    strideA = 131072, 1, 2048, 32,
output: TensorDescriptor 0x7f4c4403d2b0
    type = CUDNN_DATA_HALF
    nbDims = 4
    dimA = 1, 24, 64, 64,
    strideA = 98304, 1, 1536, 24,
weight: FilterDescriptor 0x7f4c440688f0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NHWC
    nbDims = 4
    dimA = 24, 32, 1, 1,
Pointer addresses:
    input: 0x7f4c6d848000
    output: 0x7f4c7b918000
    weight: 0x7f4ce2dff600

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:733 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f4e686dd2f2 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x5b (0x7f4e686da67b in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x809 (0x7f4e689351f9 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f4e686c53a4 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #4: std::vector<c10d::Reducer::Bucket, std::allocator<c10d::Reducer::Bucket> >::~vector() + 0x2f9 (0x7f4ebb5be8d9 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #5: c10d::Reducer::~Reducer() + 0x26a (0x7f4ebb5b389a in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #6: std::_Sp_counted_ptr<c10d::Reducer*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0x12 (0x7f4ebb5dab32 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #7: std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() + 0x46 (0x7f4ebaf17a86 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #8: <unknown function> + 0xa20e2f (0x7f4ebb5dde2f in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x369b90 (0x7f4ebaf26b90 in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #10: <unknown function> + 0x36adfe (0x7f4ebaf27dfe in /home/wanghaochen/.local/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
frame #11: /usr/bin/python3() [0x5d28f4]
frame #12: /usr/bin/python3() [0x5a729d]
frame #13: /usr/bin/python3() [0x5ec780]
frame #14: /usr/bin/python3() [0x5441f8]
frame #15: /usr/bin/python3() [0x54424a]
frame #16: PyDict_SetItemString + 0x536 (0x5d1686 in /usr/bin/python3)
frame #17: PyImport_Cleanup + 0x79 (0x684619 in /usr/bin/python3)
frame #18: Py_FinalizeEx + 0x7f (0x67f8af in /usr/bin/python3)
frame #19: Py_RunMain + 0x32d (0x6b70fd in /usr/bin/python3)
frame #20: Py_BytesMain + 0x2d (0x6b736d in /usr/bin/python3)
frame #21: __libc_start_main + 0xf3 (0x7f4ec081f0b3 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: _start + 0x2e (0x5fa5ce in /usr/bin/python3)
Killing subprocess 133
Killing subprocess 134
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 340, in <module>
    main()
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 326, in main
    sigkill_handler(signal.SIGTERM, None)  # not coming back
  File "/home/wanghaochen/.local/lib/python3.8/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
    raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'main_pretrain.py', '--local_rank=1', '--batch_size', '16', '--accum_iter', '1', '--model', 'simmim_swin_tiny_256', '--input_size', '256', '--token_size', '16', '--mask_ratio', '0.75', '--epochs', '200', '--warmup_epochs', '10', '--blr', '1.5e-4', '--weight_decay', '0.05', '--data_path', 's3://sky/datasets/imagenet/imagenet', '--dataloader_type', 'nori', '--output_dir', '/data/code/pretrain/checkpoints/pretrain/', '--log_dir', '/data/code/pretrain/tb/simmim/', '--experiment', 'um_simmim_bs2048_ep200']' died with <Signals.SIGABRT: 6>.

MAE with ViTs or PVTs trains successfully, but when I tried to train SimMIM with Swin, this issue came up again.

Getting Runtime Errors for Swin and PVT's MAE Pre-Training

Hi,

In both mae_swin and mae_pvt, I tested with the default configurations (i.e., swin-tiny and pvt-small).

What I pass to both models is:

imgs = torch.randn(4, 3, 256, 256)
masks = torch.randn(4, 256)

out = model(imgs, masks)

The following line (in the forward_decoder function) causes an error for both models:

 x = torch.cat([x_vis + pos_vis, self.mask_token + pos_mask], dim=1)

The error is:

RuntimeError: The size of tensor a (64) must match the size of tensor b (65536) at non-singleton dimension 1
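
(Not an official answer.) The shape mismatch suggests the second argument is not in the form the decoder indexes with: in MAE-style code the mask is typically a per-patch 0/1 (or boolean) tensor of shape [N, num_patches], not Gaussian noise, and for UM-MAE it should come from the repository's Uniform Masking generator so the visible patches keep their regular layout. Below is a hedged sketch of the expected shape and dtype only (the 75% ratio is an assumption based on the default config):

import torch

# Hypothetical input construction: 256x256 images with 16x16-pixel patches
# -> 256 tokens per image; mask_ratio 0.75 -> 192 masked / 64 visible tokens.
# The real mask pattern should come from the repo's mask generator, not randperm.
N, L, mask_ratio = 4, 256, 0.75
num_mask = int(L * mask_ratio)

masks = torch.zeros(N, L, dtype=torch.bool)
for n in range(N):
    masks[n, torch.randperm(L)[:num_mask]] = True   # True = masked (dropped) patch

imgs = torch.randn(N, 3, 256, 256)
# out = model(imgs, masks)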

How to carry out single-GPU detection training

Sorry to bother you again. I have downloaded your fine-tuning weights and tested the DET part, but the following error appears:
RuntimeError: GFL: FPN: Default process group has not been initialized, please make sure to call init_process_group.
Since I only have one GPU, my running code is
python DET/train.py DET/configs/gfl/gfl_mae_swin_tiny_256_mask_vmr025_200e_100e_fpn_25ep_1024x1024_coco.py --gpus 1
How can I adjust it so that training can continue? Thank you very much.

Semantic segmentation Pretrained model?

I use upernet_mae_swin_tiny_256_mask...py, but when I use checkpoint-99-model.pth as the pretrained backbone model, it reports this:

mmseg - WARNING - The model and loaded state dict do not match exactly
unexpected key in source state_dict: fc_norm.weight, fc_norm.bias, head.weight, head.bias

When training with this model, I get gradient overflow.

full checkpoint

Hi @implus, could you kindly provide the full checkpoints (including the decoder) of Swin-T and PVT-S? Many thanks!

May be a bug?

Hi, thanks for your code.
But there may be a bug in this line. I tried your code on my own tasks, where the image shape is [382, 382]. args.token_size = 16 means self.num_patches = 256, but a 382×382 image does not tile into 16×16 = 256 patches (382/16 ≈ 23.9 tokens per side). In your demo the image shape is [256, 256], where 256/16 = 16 happens to coincide. When the image is not [256, 256], the number of patches may be wrong.

So I guess maybe we should use args.input_size//args.token_size instead of args.token_size.

Please take a look; excuse me for the trouble.
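
As a quick sanity check of the mismatch described above (a hedged illustration, assuming 16×16-pixel patches as in the default configs):

# Hypothetical check, assuming 16x16-pixel patches.
input_size = 382
patch_size = 16
tokens_per_side = input_size // patch_size   # 23, not 16 (and 382 % 16 != 0)
num_patches = tokens_per_side ** 2           # 529, not 256
print(tokens_per_side, num_patches)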

Reported Train-Loss at epoch 200

Thank you for the code!
I'm pretraining Swin-Tiny on my dataset (around 799K training samples). The train loss is slowly decreasing and reached around 0.115 at epoch 100. Can you please report the train loss at epoch 200 for the ImageNet pretraining, so I can get an idea of how my model is doing?
