In this project we implement the flow-based deep generative model Glow from the paper "Glow: Generative Flow with Invertible 1x1 Convolutions" by Kingma and Dhariwal (https://arxiv.org/abs/1807.03039). We train the model on MNIST and reproduce some of the main results from the original paper. Additionally, we investigate the model's performance on Out-of-Distribution (OOD) detection using typicality tests. The model is implemented in TensorFlow using Keras, and the project was carried out as part of the course DD2412 Advanced Deep Learning at KTH. The report detailing our method and findings can be seen here.
The architecture of Glow builds heavily upon the work done in NICE and Real NVP. The main contribution of the Glow paper was the introduction of an invertible 1x1 convolution in the flow for permuting the channel dimensions. Each step of flow in Glow consists of an activation normalization layer, an invertible 1x1 convolution, and an affine coupling layer. The architecture can be seen in the figure below and is described in more detail in the report.
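The invertible 1x1 convolution can be viewed as multiplying the channel vector at every spatial position by a single learned c × c matrix W, whose Jacobian contributes h · w · log|det W| to the log-likelihood. The following is a minimal NumPy sketch of that matrix view (not our TensorFlow implementation; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def invertible_1x1_conv(x, W):
    """Apply a c x c matrix W at every spatial position of x with shape (h, w, c).

    This is the matrix view of Glow's invertible 1x1 convolution; the
    log-determinant of its Jacobian is h * w * log|det W|.
    """
    h, w, c = x.shape
    y = x.reshape(-1, c) @ W.T
    logdet = h * w * np.log(np.abs(np.linalg.det(W)))
    return y.reshape(h, w, c), logdet

def inverse_1x1_conv(y, W):
    """Invert the 1x1 convolution by applying W^{-1} at every position."""
    h, w, c = y.shape
    x = y.reshape(-1, c) @ np.linalg.inv(W).T
    return x.reshape(h, w, c)

# As in the paper, initialize W as a random rotation (orthogonal) matrix,
# so the transform starts out volume-preserving (log|det W| = 0).
c = 4
W = np.linalg.qr(rng.normal(size=(c, c)))[0]

x = rng.normal(size=(8, 8, c))
y, logdet = invertible_1x1_conv(x, W)
x_rec = inverse_1x1_conv(y, W)
print(np.allclose(x, x_rec))    # True: the operation inverts exactly
print(abs(logdet) < 1e-8)       # True: orthogonal init => zero log-det
```

Because W is learned by gradient descent rather than fixed, the permutation of channels between coupling layers can adapt during training, which is what a shuffle or reverse operation cannot do.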
A single step of flow and the multi-scale architecture of Glow. Image from here.
When training the model on MNIST, we found that the invertible 1x1 convolution achieved a lower average negative log-likelihood than both a shuffle and a reverse operation, reproducing a key result of Kingma and Dhariwal and demonstrating the benefit of a learnable channel permutation (see figure below).
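The negative log-likelihoods are reported in bits/dimension, i.e. the per-image NLL in nats divided by the number of dimensions and converted from base e to base 2. A small sketch of that conversion, using a hypothetical NLL value for illustration:

```python
import numpy as np

def bits_per_dim(nll_nats, dims):
    """Convert a per-image negative log-likelihood in nats to bits/dimension."""
    return nll_nats / (dims * np.log(2.0))

# Hypothetical per-image NLL of 1000 nats on MNIST (28 * 28 * 1 dimensions):
print(round(bits_per_dim(1000.0, 28 * 28), 2))  # 1.84
```

Reporting bits/dimension makes likelihoods comparable across datasets with different image sizes.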
Average negative log-likelihood in bits/dimension of the MNIST test set. The 1x1 convolution clearly outperforms the shuffle and reverse operations.
Images generated by the model can be seen below. We also performed linear interpolation in latent space between test images for each class (see below) and found that this produces realistic images with smooth transitions.
New samples generated by the model with temperature=0.7.
Linear interpolation in latent space.
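Because the flow is invertible, two test images can be encoded into latent codes z1 and z2, interpolated linearly, and each intermediate point decoded back through the inverse flow. A minimal sketch of the interpolation step (z1 and z2 stand in for encoded test images):

```python
import numpy as np

def linear_interpolation(z1, z2, steps=8):
    """Linearly interpolate between two latent codes. Each row of the
    result would be decoded with the inverse flow to produce an image."""
    ts = np.linspace(0.0, 1.0, steps)
    return np.stack([(1.0 - t) * z1 + t * z2 for t in ts])

z1, z2 = np.zeros(4), np.ones(4)
path = linear_interpolation(z1, z2, steps=5)
print(path.shape)  # (5, 4)
print(path[2])     # midpoint: [0.5 0.5 0.5 0.5]
```

The smooth transitions we observe suggest the latent space learned by the model is well-behaved between encoded test images.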
To assess the model's ability to distinguish between In-Distribution samples (i.e. samples from MNIST) and Out-of-Distribution (OOD) samples, we performed so-called typicality tests. In essence, a typicality test tries to determine whether a batch of unseen data is OOD by estimating whether the batch belongs to the typical set of the model's probability distribution. We performed typicality tests on batches of data from the test sets of MNIST, EMNIST-Letters, and Fashion MNIST; the results for different batch sizes can be seen in the table below. Using typicality tests, the model classifies samples from Fashion MNIST as OOD almost perfectly for all batch sizes. For EMNIST-Letters, the model's OOD accuracy differs greatly depending on the batch size, which can most likely be explained by the similarities between certain digits in MNIST and certain letters in EMNIST-Letters.
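The core of a typicality test is a simple threshold rule: a batch is flagged as OOD when its average negative log-likelihood deviates from an entropy estimate of the model (computed on training data) by more than some ε, itself calibrated on held-out in-distribution batches. A minimal sketch with hypothetical NLL values (all numbers are illustrative, not our measured results):

```python
import numpy as np

def typicality_test(batch_nll, entropy_estimate, epsilon):
    """Flag a batch as OOD if its average NLL deviates from the model's
    entropy estimate by more than epsilon."""
    return abs(np.mean(batch_nll) - entropy_estimate) > epsilon

rng = np.random.default_rng(0)

# Hypothetical per-image NLLs in bits/dim: an in-distribution batch near
# the entropy estimate, and an OOD batch far from it.
entropy_hat = 1.84
in_dist_batch = entropy_hat + 0.02 * rng.normal(size=16)
ood_batch = 3.5 + 0.02 * rng.normal(size=16)

print(typicality_test(in_dist_batch, entropy_hat, epsilon=0.1))  # False
print(typicality_test(ood_batch, entropy_hat, epsilon=0.1))      # True
```

The dependence on batch size seen in our results follows from this rule: the sample mean of the batch NLL concentrates as the batch grows, so larger batches make small distributional differences (such as EMNIST letters resembling MNIST digits) easier or harder to detect depending on how the threshold is calibrated.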