This repository hosts the code for our NUS DSA5204 Project (AY 2023/2024 Semester 2), which aims to reproduce and extend the work done in Masked Autoencoders Are Scalable Vision Learners.
The diagram below shows the model architecture from the paper: the encoder is a ViT, and the decoder is a lightweight stack of Transformer blocks.
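The core of the architecture is random patch masking: most patches are dropped before the encoder, and the decoder reconstructs the full image. The sketch below is not the project's actual code, just a minimal NumPy illustration of that masking step; the function name `random_masking` and the 75% default ratio follow the paper's convention but are otherwise our own choices.

```python
import numpy as np

def random_masking(patches, mask_ratio=0.75, seed=0):
    """Illustrative sketch of MAE-style random masking.

    patches: (N, L, D) array of N images, each with L patch embeddings of dim D.
    Returns the kept patches, a binary mask in the original patch order
    (0 = kept, 1 = masked), and the indices needed to restore patch order.
    """
    n, length, dim = patches.shape
    len_keep = int(length * (1 - mask_ratio))
    rng = np.random.default_rng(seed)

    # Shuffle patch indices independently per image, keep the first len_keep.
    noise = rng.random((n, length))
    ids_shuffle = np.argsort(noise, axis=1)
    ids_restore = np.argsort(ids_shuffle, axis=1)
    ids_keep = ids_shuffle[:, :len_keep]

    kept = np.take_along_axis(patches, ids_keep[:, :, None], axis=1)

    # Build the binary mask and unshuffle it back to the original order.
    mask = np.ones((n, length))
    mask[:, :len_keep] = 0
    mask = np.take_along_axis(mask, ids_restore, axis=1)
    return kept, mask, ids_restore
```

Only the kept 25% of patches are fed to the encoder, which is what makes MAE pretraining cheap relative to a full ViT forward pass.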
To set up using pip, run the following:

```shell
cd <project_root>
pip install -r requirements.txt
```
```
.
├── src                          # Source code
│   ├── dataset                  # Dataloaders
│   ├── checkpoints              # Place reproduction weights in this folder
│   ├── model                    # Model architectures
│   ├── utilities                # Utility functions
│   └── scripts                  # Training and evaluation scripts
├── inference_notebook_examples  # Quick look at our model inference results after training
├── .gitignore
├── README.md
└── requirements.txt
```
We make use of the following datasets:
- Reproduction, TinyImageNet: Data is available at https://huggingface.co/datasets/zh-plus/tiny-imagenet. It can be loaded directly with the Hugging Face `datasets` library, so no manual download is needed.
- Time Series, ETTh1: Data is available at https://github.com/zhouhaoyi/ETDataset
- 2D Segmentation, ADE20K: Data is available at http://data.csail.mit.edu/places/ADEchallenge/ADEChallengeData2016.zip. Unzip and place the extracted folder directly in the data folder.
- 3D Segmentation, BTCV: Data is available at https://www.synapse.org/#!Synapse:syn3193805
- Imputation of missing data, California Housing Prices: Data is available at https://www.kaggle.com/datasets/camnugent/california-housing-prices
Reproduction of paper results using the TinyImageNet dataset. Example results of image reconstruction on TinyImageNet data using the MAE architecture. For each triplet, the leftmost image shows the original image, the middle image shows the masked input, and the rightmost image shows the reconstruction.
Extension 1: Time Series forecast. The smoothed forecasts of the models on the test dataset are shown. Smoothing uses a moving average with a window of 20 timesteps, so the plots capture only the general trend.
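The smoothing described above can be sketched as a trailing moving average; this is an illustrative snippet, not the project's plotting code, and the function name `smooth` is our own.

```python
import numpy as np

def smooth(series, window=20):
    """Moving-average smoothing of a 1-D series.

    Convolving with a uniform kernel in 'valid' mode returns
    len(series) - window + 1 values, each the mean of `window`
    consecutive timesteps, which suppresses high-frequency noise
    and keeps only the general trend.
    """
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(series, dtype=float), kernel, mode="valid")
```

For example, `smooth(forecast, window=20)` applied to both the predictions and the ground truth makes the trend comparison in the plots easier to read.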
Extension 2: 2D Segmentation. Example results of the models performing semantic segmentation. MAE pretraining shows improvement over no MAE pretraining.
Extension 3: 3D Segmentation. Example 3D Segmentation result. MAE pretraining shows improvement over no MAE pretraining.
Extension 4: Data Imputation. Comparison of training loss over epochs. With MAE, training is unstable and the final results are poorer than without MAE.
Trained model weights can be accessed via this Google Drive link: https://drive.google.com/drive/u/0/folders/1oI8RIMEDl6vOW0-mutafXjPSTByXshNO