
mae's Introduction

Masked Autoencoders: A PyTorch Implementation

This is a PyTorch/GPU re-implementation of the paper Masked Autoencoders Are Scalable Vision Learners:

@Article{MaskedAutoencoders2021,
  author  = {Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Doll{\'a}r and Ross Girshick},
  journal = {arXiv:2111.06377},
  title   = {Masked Autoencoders Are Scalable Vision Learners},
  year    = {2021},
}
  • The original implementation was in TensorFlow+TPU. This re-implementation is in PyTorch+GPU.

  • This repo is a modification on the DeiT repo. Installation and preparation follow that repo.

  • This repo is based on timm==0.3.2, for which a fix is needed to work with PyTorch 1.8.1+ (a sketch of the commonly used patch is shown below).
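As a hedged sketch, assuming the incompatibility is the widely reported removal of torch._six.container_abcs (an assumption on my part, not stated in this README), the commonly circulated patch edits timm/models/layers/helpers.py roughly as follows:

    # hedged sketch of the commonly used patch to timm/models/layers/helpers.py
    # assumption: timm 0.3.2 imports container_abcs from torch._six, which newer
    # PyTorch releases no longer provide
    import torch

    TORCH_MAJOR = int(torch.__version__.split('.')[0])
    TORCH_MINOR = int(torch.__version__.split('.')[1])

    if TORCH_MAJOR == 1 and TORCH_MINOR < 8:
        from torch._six import container_abcs
    else:
        import collections.abc as container_abcs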

Catalog

  • Visualization demo
  • Pre-trained checkpoints + fine-tuning code
  • Pre-training code

Visualization demo

Run our interactive visualization demo using a Colab notebook (no GPU is needed).

Fine-tuning with pre-trained checkpoints

The following table provides the pre-trained checkpoints used in the paper, converted from TF/TPU to PT/GPU:

                         ViT-Base   ViT-Large   ViT-Huge
pre-trained checkpoint   download   download    download
md5                      8cad7c     b8b06e      9bdbb0
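To sanity-check a download against the md5 prefixes above, a minimal sketch (the file name below is a placeholder for whichever checkpoint you downloaded):

    # compute the md5 of a downloaded checkpoint and compare its prefix with the table
    import hashlib

    path = 'mae_pretrain_vit_base.pth'   # placeholder: use whichever file you downloaded
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):   # read in 1 MB chunks
            md5.update(chunk)
    print(md5.hexdigest()[:6])   # e.g. should start with 8cad7c for ViT-Base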

The fine-tuning instruction is in FINETUNE.md.

By fine-tuning these pre-trained models, we rank #1 in these classification tasks (detailed in the paper):

                                   ViT-B   ViT-L   ViT-H   ViT-H448   prev best
ImageNet-1K (no external data)     83.6    85.9    86.9    87.8       87.1

The following are evaluations of the same model weights (fine-tuned on the original ImageNet-1K):
ImageNet-Corruption (error rate)   51.7    41.8    33.8    36.8       42.5
ImageNet-Adversarial               35.9    57.1    68.2    76.7       35.8
ImageNet-Rendition                 48.3    59.9    64.4    66.5       48.7
ImageNet-Sketch                    34.5    45.3    49.6    50.9       36.0

The following are transfer learning results, obtained by fine-tuning the pre-trained MAE on the target dataset:
iNaturalist 2017                   70.5    75.7    79.3    83.4       75.4
iNaturalist 2018                   75.4    80.1    83.0    86.8       81.2
iNaturalist 2019                   80.5    83.4    85.7    88.3       84.1
Places205                          63.9    65.8    65.9    66.8       66.0
Places365                          57.9    59.4    59.8    60.3       58.0

Pre-training

The pre-training instruction is in PRETRAIN.md.

License

This project is under the CC-BY-NC 4.0 license. See LICENSE for details.

mae's People

Contributors

endernewton


mae's Issues

Different masking strategies

Thank you so much for releasing the code.

In the paper, there are experiments with different masking strategies such as random, block, and grid. The code implementation includes only the random strategy. Are there any plans to also release the code for the other strategies?

mae_visualize models vs mae_pretrain_full models

Hello,

thank you for the great work and the great repo. I was playing with different pre-trained models for visualization. When I use the mae_visualize_vit_base.pth checkpoint, I get reconstruction results like those in the demo and the paper, such as below:
[screenshot: reconstructions with mae_visualize_vit_base.pth]

However, when I use the mae_pretrain_vit_base_full.pth checkpoint, the results are as below:
[screenshot: reconstructions with mae_pretrain_vit_base_full.pth]

mask_ratio=0.75 for both results.
So here are my questions:

  1. Can you please clarify the difference between the visualize and full checkpoints, and why the results look worse with the full checkpoint?
  2. If I want to fine-tune an MAE model (both the encoder and decoder) for reconstruction on a custom dataset, which checkpoint is recommended?

I would appreciate it if you could help me with these questions.

fine-tuned checkpoint

Hello, do you by any chance have a checkpoint for the model fine-tuned on ImageNet, rather than just the pre-trained one?

Sample log files / Tensorboard summaries

It would be great to publish the log files generated by the pre-training/fine-tuning/linear-probe runs, or their TensorBoard summary files (or screenshots), for reference.

More concretely, I am trying to replicate the base-config pre-training and linear probe as closely as possible in JAX, but I am getting 60% accuracy, which is much lower than the expected 68%. Inspecting these logs/TB summaries would help me find the possible discrepancies.

This is the loss curve for pixel prediction (not normalized): [plot omitted]

And these are the linear probe results: [plot omitted]

Does this implementation support timm > 0.3.2?

Hi,

Thanks for releasing this awesome work!

I noticed that the code checks the timm version:

assert timm.__version__ == "0.3.2" # version check

Does this implementation support a newer timm, e.g., 0.4.12?

Thanks!

Best,
Vera
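For reference, one unofficial workaround is to relax the hard version pin so a newer timm can at least be tried; this is an assumption on my part, not something the repo supports, and API differences may still surface elsewhere:

    # hedged sketch: replace the exact-version assert with a minimum-version check
    import timm
    from packaging import version

    assert version.parse(timm.__version__) >= version.parse("0.3.2"), \
        "this repo was developed against timm 0.3.2"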

Problems reproducing ViT-Base

Thanks for your great work.
My reproduction result for ViT-B is only 83.3, which is 0.3 lower than the paper's result, and I have no idea what may cause this.
My experiment follows this repo exactly, except for the following changes:
a. I use 32 V100s with batch size 128 (32 x 128 = 4096). The recommended setting in this repo is 64 V100s with batch size 64 (64 x 64 = 4096).
b. I didn't use submitit_pretrain.py but directly used main_pretrain.py for multi-node training.

Are these two changes potential causes of the performance gap?
Any suggestions would be deeply appreciated~

Interpolation mode for downstream tasks

Hello, thank you for creating this awesome algorithm!
I have a problem applying MAE pre-trained weights to downstream tasks. Is the interpolation mode of the position encoding bicubic?
If I train Mask R-CNN, will the performance improve by simply replacing the DeiT weights with MAE weights?
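For context, here is a minimal sketch of how a square position-embedding grid is typically resized with bicubic interpolation when the input resolution changes; the repo ships a similar helper (interpolate_pos_embed in util/pos_embed.py), and the function below is an illustrative stand-in rather than the repo's code:

    import torch
    import torch.nn.functional as F

    def resize_pos_embed(pos_embed, old_grid, new_grid, num_extra_tokens=1):
        # pos_embed: (1, num_extra_tokens + old_grid**2, D), cls token(s) first
        extra = pos_embed[:, :num_extra_tokens]
        grid = pos_embed[:, num_extra_tokens:]
        dim = grid.shape[-1]
        grid = grid.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
        grid = F.interpolate(grid, size=(new_grid, new_grid),
                             mode='bicubic', align_corners=False)
        grid = grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
        return torch.cat([extra, grid], dim=1)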

Mentioning a community tutorial

Hi @endernewton!

We were in touch via email regarding MAE as well. Thanks for your help there!

@ariG23498 and I implemented a minimal version of MAE in TensorFlow/Keras with the CIFAR-10 dataset, in the hope that it would benefit the community. We have open-sourced our experiments here: https://github.com/ariG23498/mae-scalable-vision-learners. We have also written an introductory blog post on it: https://keras.io/examples/vision/masked_image_modeling/.

We would really appreciate it if you could give it a mention from the README of this repository.

Questions on object detection model

Hi, thank you for releasing such wonderful work. I have a question regarding the modifications made to the ViT model to adapt it to FPN: as mentioned in the "Benchmarking Detection" paper, in addition to up/down-sampling operations (which are straightforward), the ViT attention is also modified to use local window attention, interleaved with 4 global attention blocks. In this case, how can a ViT pre-trained by MAE (which uses global attention throughout) be adapted to the new attention mechanism?

Thanks! I look forward to your reply.

Fine-Tuning in unsupervised way

Hello. After reading the paper, I wanted to train ViT-MAE to extract features in an unsupervised way (to do image clustering). After looking at the code, I have the impression that the fine-tuning script performs supervised classification.

So I would like to know how to load the pre-trained ViT-Base model and continue unsupervised training on my own dataset. Should I use submitit_pretrain.py or main_pretrain.py?

Sorry to ask such a basic question; I am new to PyTorch.

Thank you very much.
Respectfully

Include cross-entropy loss in pre-training?

According to the paper, the reconstruction loss is the only loss used in pre-training. However, I noticed that the cls token is appended to the decoder's input sequence but dropped after the decoder.

mae/models_mae.py

Lines 193 to 194 in 6a2ba40

# remove cls token
x = x[:, 1:, :]

I am wondering if you have tried including a cross-entropy loss between the prediction from the cls token and the ground-truth label during pre-training. Just like the NSP (next sentence prediction) task in BERT training, intuitively, including this cross-entropy loss would help the model learn more high-level information.
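To make the question concrete, here is a hedged sketch of what such an auxiliary loss could look like. Method names follow models_mae.py; the classification head, the 1:1 loss weighting, and the use of labels during pre-training are hypothetical additions, and the released MAE uses the reconstruction loss only.

    import torch
    import torch.nn as nn
    import models_mae  # from this repo

    model = models_mae.mae_vit_base_patch16()
    cls_head = nn.Linear(768, 1000)           # hypothetical auxiliary head (768 = ViT-Base dim)

    imgs = torch.randn(4, 3, 224, 224)
    labels = torch.randint(0, 1000, (4,))     # would require labels during pre-training

    latent, mask, ids_restore = model.forward_encoder(imgs, mask_ratio=0.75)
    pred = model.forward_decoder(latent, ids_restore)
    recon_loss = model.forward_loss(imgs, pred, mask)
    cls_loss = nn.functional.cross_entropy(cls_head(latent[:, 0]), labels)  # cls token
    loss = recon_loss + cls_loss              # hypothetical 1:1 weighting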

Replication of ImageNet Scratch Vit-Base and Vit-Large

Hi @KaimingHe,
thank you for your wonderful work and the initiative to open-source it.

I have been banging my head against this for a month now, and any help would be deeply appreciated!
I am trying to replicate the IN-1K from-scratch training of ViT-Base and ViT-Large, and so far I have failed. I get around 81.9 for ViT-Base (82.3 in the paper) and 82 for ViT-Large (82.6 in the paper) using the recipes mentioned below, both taken from here.

I also saw your reply in #28 and the FINETUNE.md page. I noticed that the supervised recipe in the paper is different from the fine-tuning one, hence the two different recipes I tried.

PyTorch version: 1.10.0
CUDA: 11.3
timm version: 0.5.0
Setup: single host with 8 GPUs; base lr = 1e-4, effective batch size 4096 => effective lr = 0.0016.
With the recipe mentioned in the paper (ViT-Base 82.3), I get 81.9:

            --grad_accum=4 \
            --batch-size=128 \
            --lr=0.0016 \
            --weight-decay=0.30 \
            --opt="AdamW" \
            --opt-betas 0.9 0.95 \
            --sched="cosine" \
            --warmup-epochs=20 \
            --epochs=300 \
            --aa="rand-m9-mstd0.5" \
            --smoothing=0.1 \
            --mixup=0.8 \
            --cutmix=1.0 \
            --drop-path=0.1 \
            --model-ema \
            --model-ema-decay=0.9999 \
            --model="vit_base_patch16_224" \

Following your advice in #28 and the FINETUNE.md page, I was inspired to try this different recipe, with which I get 81.2 accuracy.
Here too, single host with 8 GPUs; base lr = 1e-3, batch size 1024 (as mentioned in the fine-tuning instructions), so effective lr = 0.004.
Also, the paper does not mention layer decay or a different --aa strategy.

            --grad_accum=4 \
            --batch-size=32 \
            --lr=0.004 \
            --weight-decay=0.05 \
            --layer_decay=0.65 \
            --opt="AdamW" \
            --opt-betas 0.9 0.95 \
            --sched="cosine" \
            --warmup-epochs=20 \
            --epochs=300 \
            --aa="rand-m9-mstd0.5-inc1" \
            --smoothing=0.1 \
            --mixup=0.8 \
            --cutmix=1.0 \
            --drop-path=0.1 \
            --model-ema \
            --model-ema-decay=0.9999 \
            --model="vit_base_patch16_224" \
            --reprob=0.25 \

What is wrong with these recipes, and why don't they replicate the from-scratch performance reported in the paper?

decoder checkpoint of the pre-trained MAE

Hi,

Thank you for the awesome work and glad to see that the codes are released!

Are there any plans to release the decoder checkpoints of ViT-Base or ViT-Huge in the near future?

I was trying to run the visualization demo with the ViT-Base model and found that the released checkpoint consists only of the encoder part. (I only found the ViT-Large model's decoder checkpoint in the visualization demo, i.e., mae_visualize_vit_large.pth.)

Thank you very much for your time!

Best regards,

Jihoon

list of available models

Hello and thanks for the great work and the great repo :)

I wonder if you could add a single list with a link and a description for every available model. I think it would help clarify things and speed up adoption.

Something like

model name                    model link   description
mae_visualize_vit_large.pth   link         pre-trained encoder + decoder
mae_visualize_vit_base.pth    link         pre-trained encoder + decoder
mae_pretrain_vit_base.pth     link         pre-trained encoder
-                             -            segmentation model
etc.

Mask ratio in forward method

I was wondering why you put the mask_ratio parameter in the forward method.

Did you experiment with changing the mask ratio during training?
Would that be possible?

Thanks for this great repo!
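Because mask_ratio is an argument of forward() rather than a constructor argument, varying it per iteration is at least mechanically possible. A hedged sketch follows; the schedule is hypothetical, and the released results use a fixed ratio of 0.75:

    import torch
    import models_mae  # from this repo

    model = models_mae.mae_vit_base_patch16()
    imgs = torch.randn(4, 3, 224, 224)

    for epoch in range(3):
        ratio = 0.6 + 0.05 * epoch                      # hypothetical curriculum: 0.6 -> 0.7
        loss, pred, mask = model(imgs, mask_ratio=ratio)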

ADE20k Learning Rate

Section A.4 of the paper says "We search for the optimal lr for each entry in Table 5" when referring to segmentation on ADE20K. Could you share these learning rates? My reproduced baselines are about 2 points too low.

Thanks!

There is a spelling error in FINETUNE.md

The patch size of the ViT-Huge model is 14, but the model parameter is written as vit_huge_patch16; the correct value should be vit_huge_patch14. I switched it to vit_huge_patch14, and the fine-tuning result is *Acc@1 86.926 Acc@5 98.094 loss 0.584.

How sensitive is MAE to batch size?

Thanks for releasing the code!

I am trying to play with MAE but with only a limited number of GPUs. I saw in the docs that we can use accum_iter to make the effective batch size the same as in the paper (4096). But what if we use smaller batch sizes (the typical case for a university lab)? Do you have any data points on how MAE performs with batch sizes smaller than 4096, e.g., ranging from 256 to 1024?

Thank you!
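For reference, here is a minimal sketch of the scaling convention the repo documents (effective batch size = batch size per GPU x number of GPUs x accum_iter, and lr = blr x effective batch size / 256); the concrete numbers below are just an example:

    batch_size_per_gpu = 64
    num_gpus = 8
    accum_iter = 8                    # example value chosen to reach 4096
    blr = 1.5e-4                      # base learning rate used for MAE pre-training

    eff_batch_size = batch_size_per_gpu * num_gpus * accum_iter   # 4096
    lr = blr * eff_batch_size / 256                               # 0.0024
    print(eff_batch_size, lr)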

MAE with extra GAN loss

Hello,

thank you for releasing the code! I have seen that the Colab notebook includes a visualization of an MAE trained with an extra GAN loss, which was not mentioned in the paper.
I was wondering, does it improve the results (linear probe/fine-tuning) in any way? If so, are you planning to release its code?

Thanks

How to pre-train MAE on 1 node with 8 GPUs?

I'm trying to pre-train a vit_small model on 1 node with 8 GPUs, but submitit_pretrain.py fails with:
Traceback (most recent call last):
  File "submitit_pretrain.py", line 131, in <module>
    main()
  File "submitit_pretrain.py", line 89, in main
    args.job_dir = get_shared_folder() / "%j"
  File "submitit_pretrain.py", line 39, in get_shared_folder
    raise RuntimeError("No shared folder available")
RuntimeError: No shared folder available

It seems that I have no shared folder, so how can I pre-train MAE on 1 node with 8 GPUs?
Thanks for your exciting work!

Fine-tuning on IN-1K with 1 node and 8 GPUs

I use the following command to fine-tune ViT-Base.

OMP_NUM_THREADS=1 python -m torch.distributed.launch --nproc_per_node=8 main_finetune.py \
    --accum_iter 4 \
    --batch_size 32 \
    --model vit_base_patch16 \
    --finetune mae_pretrain_vit_base.pth \
    --epochs 100 \
    --blr 5e-4 --layer_decay 0.65 \
    --weight_decay 0.05 --drop_path 0.1 --mixup 0.8 --cutmix 1.0 --reprob 0.25 \
    --dist_eval --data_path ${IMAGENET1K_DIR}

However, the loss is always 6.9068 and the accuracy is 0.10%. I find that len(data_loader) == 3 and data_iter_step only takes the values 0, 1, 2. The parameters do not update because

loss_scaler(loss, optimizer, clip_grad=max_norm,
            parameters=model.parameters(), create_graph=False,
            update_grad=(data_iter_step + 1) % accum_iter == 0)

the condition (data_iter_step + 1) % accum_iter == 0 is never satisfied.
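A tiny illustration of the arithmetic, using the values from the report above:

    accum_iter = 4
    for data_iter_step in range(3):               # len(data_loader) == 3
        update = (data_iter_step + 1) % accum_iter == 0
        print(data_iter_step, update)             # False, False, False -> never updates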

Importing a fine-tuned MAE into another project

I am trying to load an MAE model into another project. I use the VisionTransformer class from https://github.com/facebookresearch/mae/blob/main/models_vit.py#L56 (I also tried the MAE class from https://github.com/facebookresearch/mae/blob/main/models_mae.py#L223).

I tried creating the model using that class and then loading the weights from https://github.com/facebookresearch/mae/blob/main/FINETUNE.md.

However, my code does not seem to load the weights from the checkpoint: when I print the model before and after loading the checkpoint, the weights haven't changed. Does anyone have an idea why this happens?

Here is my implementation:

# Create the model (imports added for completeness)
from functools import partial

import torch
import torch.nn as nn
from models_vit import VisionTransformer

model = VisionTransformer(patch_size=16, embed_dim=768, depth=12, num_heads=12,
                          mlp_ratio=4, qkv_bias=True,
                          norm_layer=partial(nn.LayerNorm, eps=1e-6))

# Load the checkpoint
checkpoint = torch.load('PATH/TO/mae_finetuned_vit_base.pth', map_location='cpu')
checkpoint_model = checkpoint['model']

# Load the state dict; with strict=False, inspect the returned message to see
# which keys were actually matched
msg = model.load_state_dict(checkpoint_model, strict=False)
print(msg.missing_keys, msg.unexpected_keys)

Edit: simplified code and clarified my error

loss fluctuates during training

Thanks for the great work!

I wanted to pre-train a mae_vit_base_patch16 model on ImageNet-1K, and I found that the loss fluctuates during training. I was wondering if this is normal for MAE pre-training.
[screenshot: training loss curve]

Distributed training with Horovod costs more time per epoch

Bandwidth: 10 Gbps
Machine: 8x V100
Nodes: 8
Dataset: ImageNet

When training ResNet-50, the training time is the same as in the paper (about 3 min per epoch), but when training vit_base_patch16 the training time is 10 min per epoch (peak receive bandwidth 7 Gbps).
After using the method in the link, the training time is 6 min per epoch (peak receive bandwidth 4.4 Gbps).
I wonder whether bandwidth is the bottleneck for MAE training. What bandwidth did you use in the paper?

Doubts about the Position Encoding

Thank you so much for releasing the code.

I had a small doubt about the position encoding generated for the 2-D grid. In the file pos_embed.py, the embedding dimension is divided into two groups of size D/2, and I am not sure of the reason for it. If we have to move to higher dimensions, say 3, how should the same be done? Should we divide the dimension into three blocks (and make sure that each block has an even size)?
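For the 3-D case the question describes, here is a hedged sketch; it is my own extension, not repo code, and it assumes the helper get_1d_sincos_pos_embed_from_grid from util/pos_embed.py, splitting the embedding dimension into three even-sized chunks, one per axis:

    import numpy as np
    from util.pos_embed import get_1d_sincos_pos_embed_from_grid  # repo helper

    def get_3d_sincos_pos_embed(embed_dim, grid_size):
        # embed_dim divisible by 6 so each of the three chunks has even size
        assert embed_dim % 6 == 0
        d, h, w = np.meshgrid(np.arange(grid_size), np.arange(grid_size),
                              np.arange(grid_size), indexing='ij')
        emb_d = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, d.reshape(-1))
        emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, h.reshape(-1))
        emb_w = get_1d_sincos_pos_embed_from_grid(embed_dim // 3, w.reshape(-1))
        return np.concatenate([emb_d, emb_h, emb_w], axis=1)   # (grid_size**3, embed_dim)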

Question about positional embedding?

Hi, thanks for the great work!

I have a small question regarding the positional embedding:
In this line

grid = np.meshgrid(grid_w, grid_h) # here w goes first

it seems to return [grid_w, grid_h]

However, in line

emb_h = get_1d_sincos_pos_embed_from_grid(embed_dim // 2, grid[0]) # (H*W, D/2)

it seems that grid_w is used for making emb_h? Or did I miss something?
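A small standalone illustration of numpy's ordering here (not repo code): with the default indexing='xy', the first array returned by np.meshgrid varies along the width axis, so grid[0] indeed holds the w coordinates.

    import numpy as np

    grid_w = np.arange(3)                 # 0, 1, 2
    grid_h = np.arange(2)                 # 0, 1
    grid = np.meshgrid(grid_w, grid_h)    # here w goes first
    print(grid[0])   # [[0 1 2] [0 1 2]]  -> w coordinates
    print(grid[1])   # [[0 0 0] [1 1 1]]  -> h coordinates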

expected loss value on custom dataset

Thank you so much for releasing the code.

I wanted to train the model from scratch on a custom dataset. I was wondering if you could share the loss value obtained during the pre-training step. I was getting a very high loss value and was wondering what the possible reason could be.
[screenshot: training loss]

Also, is there some correlation between the amount of masking and the reconstruction loss? Can a high loss value indicate that the masking ratio is low?
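For what it's worth, here is a paraphrase of how the reconstruction loss is normalized, modeled on forward_loss in models_mae.py and ignoring the optional --norm_pix_loss target normalization: the per-patch MSE is averaged over the masked patches only, so the loss is not directly scaled by how many patches are masked.

    import torch

    def masked_mse(pred, target, mask):
        # pred, target: (N, L, patch_dim); mask: (N, L) with 1 for masked patches
        loss = (pred - target) ** 2
        loss = loss.mean(dim=-1)                    # mean squared error per patch, (N, L)
        return (loss * mask).sum() / mask.sum()     # averaged over masked patches only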
