In this repository I aim to give a brief description and demonstration of Dumoulin's style transfer model. The implementation used is part of my Python package nstesia and can be found in its module dumoulin_2017.
Note
There are two simple ways to run the demo notebook yourself without installing any software on your local machine:
- View the notebook on GitHub and click the Open in Colab button (requires a Google account).
- Create a GitHub Codespace for this repository and run the notebook in VSCode (requires a GitHub account).
In 2016, Johnson, Alahi, and Fei-Fei published their article Perceptual Losses for Real-Time Style Transfer and Super-Resolution, in which they introduced a style transfer network which, once trained for a given style image, allows for fast neural stylization of arbitrary input images using a single forward pass. Soon after, Ulyanov, Vedaldi, and Lempitsky submitted their note Instance Normalization: The Missing Ingredient for Fast Stylization, in which they showed that replacing the batch normalization layers of the model by instance normalization led to a significant improvement in visual results.
In their 2017 article A Learned Representation for Artistic Style, Dumoulin, Kudlur, and Shlens presented a way to perform neural style transfer for multiple styles, or a mix of styles, using the forward pass of a single style transfer network trained on a collection of style images simultaneously. Their key realization was that the learned filters of the style transfer network could be shared between different styles, and that it is sufficient to learn a different set of normalization parameters for each style. To allow for this, they proposed to replace each batch or instance normalization layer by their new conditional instance normalization layer, which learns a separate pair of scale and offset parameters (gamma, beta) for each style in the collection.
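To make this concrete, here is a minimal sketch of a conditional instance normalization layer in TensorFlow. It is not the nstesia implementation: the layer name, the epsilon value, and the choice of passing the style as a weight vector are assumptions made for illustration.

```python
import tensorflow as tf

class ConditionalInstanceNorm(tf.keras.layers.Layer):
    """Instance normalization with one (gamma, beta) pair per style."""

    def __init__(self, num_styles, epsilon=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.num_styles = num_styles
        self.epsilon = epsilon

    def build(self, input_shape):
        channels = input_shape[0][-1]  # first input: the feature maps
        # One row of scale/offset parameters per style in the collection.
        self.gamma = self.add_weight(name="gamma",
                                     shape=(self.num_styles, channels),
                                     initializer="ones")
        self.beta = self.add_weight(name="beta",
                                    shape=(self.num_styles, channels),
                                    initializer="zeros")

    def call(self, inputs):
        # style_weights: a (num_styles,) vector; one-hot selects a single
        # style, a convex combination mixes styles.
        x, style_weights = inputs
        # Normalize each sample and channel over its spatial dimensions.
        mean, variance = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        x = (x - mean) / tf.sqrt(variance + self.epsilon)
        # Blend the per-style parameters according to the style weights.
        gamma = tf.tensordot(style_weights, self.gamma, axes=1)
        beta = tf.tensordot(style_weights, self.beta, axes=1)
        return gamma * x + beta
```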
The style transfer network has three main parts: the encoder, the bottleneck, and the decoder. In addition, some pre- and post-processing is performed. Below I give a description of the various parts of the network. The given output dimensions are based on an input image size of 256x256, but note they are provided for illustration only; the network is fully convolutional and handles input images of any size.
Input images are assumed to take RGB values in `0.0..255.0`. The preprocessing layer centers the RGB values around their ImageNet means. Note that the padding layer of Johnson's model was removed: here, padding is applied to each convolution layer individually instead of all at once at the start.
| Layer      | Description    | Output Size |
|------------|----------------|-------------|
| preprocess | Pre-Processing | 256x256x3   |
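As a rough illustration, the centering step could look as follows. The mean values below are the RGB channel means commonly used for VGG-based models; whether the package uses exactly these values is an assumption.

```python
import tensorflow as tf

# RGB channel means commonly used for ImageNet/VGG preprocessing.
IMAGENET_MEAN_RGB = tf.constant([123.68, 116.779, 103.939])

def preprocess(image):
    """Center RGB values in 0.0..255.0 around the ImageNet channel means."""
    return image - IMAGENET_MEAN_RGB
```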
The encoder is composed of three convolutional blocks, each consisting of a reflection padding layer, a convolutional layer, a normalization layer, and a `relu` activation layer. All reflection layers of the transfer model use a `same` amount of reflection padding, i.e., the amount of padding is calculated by the same formulas as for `same` padding, but the type of padding applied is reflection instead of constant. The second and third blocks use strided convolutions to reduce the spatial dimensions by a total factor of 4.
| Block        | Layer | Description                      | Output Size |
|--------------|-------|----------------------------------|-------------|
| down_block_1 | rpad  | ReflectionPadding                |             |
|              | conv  | Convolution (32, 9x9, stride 1)  |             |
|              | norm  | ConditionalInstanceNormalization |             |
|              | act   | Activation (ReLU)                | 256x256x32  |
| down_block_2 | rpad  | ReflectionPadding                |             |
|              | conv  | Convolution (64, 3x3, stride 2)  |             |
|              | norm  | ConditionalInstanceNormalization |             |
|              | act   | Activation (ReLU)                | 128x128x64  |
| down_block_3 | rpad  | ReflectionPadding                |             |
|              | conv  | Convolution (128, 3x3, stride 2) |             |
|              | norm  | ConditionalInstanceNormalization |             |
|              | act   | Activation (ReLU)                | 64x64x128   |
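The following sketch shows how one such block might be expressed, reusing the ConditionalInstanceNorm layer sketched above. The helper name and the function-style composition are illustrative, not the package's actual code.

```python
def down_block(x, style_weights, filters, kernel_size, strides, num_styles):
    """Reflection padding -> convolution -> conditional instance norm -> ReLU."""
    # 'same'-style padding amount for an odd kernel size, applied as reflection.
    pad = kernel_size // 2
    x = tf.pad(x, [[0, 0], [pad, pad], [pad, pad], [0, 0]], mode="REFLECT")
    # Note: a real implementation would create these layers once and reuse
    # them; creating them inside the function is fine for a one-shot sketch.
    x = tf.keras.layers.Conv2D(filters, kernel_size, strides=strides,
                               padding="valid")(x)
    x = ConditionalInstanceNorm(num_styles)([x, style_weights])
    return tf.nn.relu(x)
```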
The bottleneck comprises five residual blocks, each of which consists of seven layers and a residual connection. In order, the layers are: padding, convolution, normalization, and `relu` activation, followed by another padding, convolution, and normalization layer.
| Block          | Layer | Description                      | Output Size |
|----------------|-------|----------------------------------|-------------|
| res_block_1..5 | rpad1 | ReflectionPadding                |             |
|                | conv1 | Convolution (128, 3x3, stride 1) |             |
|                | norm1 | ConditionalInstanceNormalization |             |
|                | relu1 | Activation (ReLU)                |             |
|                | rpad2 | ReflectionPadding                |             |
|                | conv2 | Convolution (128, 3x3, stride 1) |             |
|                | norm2 | ConditionalInstanceNormalization |             |
|                | res   | Residual                         | 64x64x128   |
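Continuing the sketch above, a residual block might look like this (again, illustrative only):

```python
def residual_block(x, style_weights, num_styles, filters=128):
    """Two padded 3x3 conv/norm stages with a shortcut around both."""
    shortcut = x
    # First stage: pad -> conv -> norm -> relu (reuses the sketch above).
    x = down_block(x, style_weights, filters, 3, 1, num_styles)
    # Second stage: pad -> conv -> norm, with no activation before the sum.
    x = tf.pad(x, [[0, 0], [1, 1], [1, 1], [0, 0]], mode="REFLECT")
    x = tf.keras.layers.Conv2D(filters, 3, padding="valid")(x)
    x = ConditionalInstanceNorm(num_styles)([x, style_weights])
    return x + shortcut  # residual connection
```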
The decoder roughly mirrors the encoder: it is again composed of three convolutional blocks, each of which consists of a reflection padding layer, a convolutional layer, a normalization layer, and an activation layer. The first two of these blocks start with an upsampling layer performing 2x nearest-neighbor upsampling before the other layers and use `relu` activations, while the final block ends with a `sigmoid` activation.
| Block      | Layer | Description                      | Output Size |
|------------|-------|----------------------------------|-------------|
| up_block_1 | up    | UpSampling (2x, nearest)         |             |
|            | rpad  | ReflectionPadding                |             |
|            | conv  | Convolution (64, 3x3, stride 1)  |             |
|            | norm  | ConditionalInstanceNormalization |             |
|            | act   | Activation (ReLU)                | 128x128x64  |
| up_block_2 | up    | UpSampling (2x, nearest)         |             |
|            | rpad  | ReflectionPadding                |             |
|            | conv  | Convolution (32, 3x3, stride 1)  |             |
|            | norm  | ConditionalInstanceNormalization |             |
|            | act   | Activation (ReLU)                | 256x256x32  |
| up_block_3 | rpad  | ReflectionPadding                |             |
|            | conv  | Convolution (3, 9x9, stride 1)   |             |
|            | norm  | ConditionalInstanceNormalization |             |
|            | act   | Activation (Sigmoid)             | 256x256x3   |
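An upsampling block can then be sketched as 2x nearest-neighbor upsampling followed by the same padded convolution pattern; the final block would instead use a 9x9 convolution and a sigmoid activation.

```python
def up_block(x, style_weights, filters, num_styles):
    """2x nearest-neighbor upsampling followed by a padded conv block."""
    x = tf.keras.layers.UpSampling2D(size=2, interpolation="nearest")(x)
    return down_block(x, style_weights, filters, 3, 1, num_styles)
```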
The final `sigmoid` activation layer of the decoder gives an output image with RGB values in `0.0..1.0`. These values are multiplied by a factor of `255.0` to obtain an output image with values in the desired range.
| Layer   | Description              | Output Size |
|---------|--------------------------|-------------|
| rescale | Rescaling (factor 255.0) | 256x256x3   |
Let $p$ denote the pastiche the network produces for a content image $x$ and style $s$. The network is trained to minimize the total loss

$$\mathcal{L}(p) = \lambda_c \, \mathcal{L}_{content}(p, x) + \lambda_s \, \mathcal{L}_{style}(p, s),$$

where $\mathcal{L}_{content}$ and $\mathcal{L}_{style}$ are the content and style losses introduced by Gatys et al. in gatys_2015.
Note the absence of a variation loss term: the use of upsampling instead of
transposed (or fractionally strided) convolutions makes its use unnecessary.
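For orientation, here is a minimal sketch of such a loss. The feature extraction (typically a fixed VGG network, with different layer sets for content and style) is omitted, and the default weights are placeholder assumptions, not the values used for the saved model.

```python
import tensorflow as tf

def gram_matrix(features):
    """Gram matrix of a (batch, height, width, channels) feature tensor."""
    b, h, w, c = tf.unstack(tf.shape(features))
    f = tf.reshape(features, [b, h * w, c])
    return tf.matmul(f, f, transpose_a=True) / tf.cast(h * w * c, tf.float32)

def total_loss(pastiche_content_feats, content_feats,
               pastiche_style_feats, style_feats,
               content_weight=1.0, style_weight=1e-4):
    """Weighted sum of content and style losses; note there is no TV term."""
    content_loss = tf.add_n([tf.reduce_mean(tf.square(p - c))
                             for p, c in zip(pastiche_content_feats,
                                             content_feats)])
    style_loss = tf.add_n([tf.reduce_mean(tf.square(gram_matrix(p) -
                                                    gram_matrix(s)))
                           for p, s in zip(pastiche_style_feats, style_feats)])
    return content_weight * content_loss + style_weight * style_loss
```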
Training is performed for 8 epochs over the images of the Microsoft COCO/2014 dataset using an `adam` optimizer with a learning rate of `1e-3`. All images are resized to 256x256 and served in batches of 16.
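The training configuration described above might be set up along these lines; the use of tensorflow_datasets and the exact preprocessing are assumptions, not necessarily what `train.py` does.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# COCO/2014 images, resized to 256x256 and batched as described above.
ds = tfds.load("coco/2014", split="train")
ds = ds.map(lambda ex: tf.image.resize(tf.cast(ex["image"], tf.float32),
                                       (256, 256)))
ds = ds.batch(16).prefetch(tf.data.AUTOTUNE)

optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)

# Training then runs for 8 epochs, minimizing the total loss sketched above:
# for epoch in range(8):
#     for batch in ds:
#         with tf.GradientTape() as tape:
#             ...  # forward pass, loss, optimizer.apply_gradients(...)
```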
This repository contains a Python script `train.py` which takes a collection of style images as well as some training parameters as input, downloads the training dataset, trains the style transfer model, and finally saves the trained model to disk. The directory `saved/model` contains a model trained in this way for the 32 images in `img/style`. To try the model out yourself, have a look at the notebook `multi-style-transfer.ipynb`. All images below were produced using it.
Note The images included here are lower-quality JPEG files. I have linked them to their lossless PNG versions.
The following are two sets of stylizations of the same content images in the same styles as used to demonstrate Johnson et al.'s style transfer network. Note that all of these pastiches were produced using a single style transfer network. The quality of the results is comparable to that of pastiches produced by Johnson's networks trained for individual styles; see my repository johnson-fast-style-transfer.
Note that the use of upsampling layers instead of transposed or fractionally strided convolutions can lead to improved results by eliminating checkerboard artifacts. This is particularly clear when comparing the first of the following stylizations to the corresponding one created using Johnson's network.
The following demonstrates the ability of Dumoulin et al.'s network to produce pastiches in mixed styles.
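In terms of the conditional instance normalization sketch above, mixing styles amounts to feeding the network a convex combination of one-hot style vectors instead of a single one-hot vector, for example:

```python
import numpy as np

NUM_STYLES = 32  # matches the 32 style images in img/style

# A 50/50 blend of two (hypothetically chosen) styles, 3 and 7: the learned
# normalization parameters of the two styles are interpolated with these
# weights when the vector is used as the style input.
style_weights = np.zeros(NUM_STYLES, dtype=np.float32)
style_weights[3] = 0.5
style_weights[7] = 0.5
```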
- Dumoulin, Kudlur, Shlens - A Learned Representation for Artistic Style, 2017. [arxiv] [code]
- Johnson, Alahi, Fei-Fei - Perceptual Losses for Real-Time Style Transfer and Super-Resolution, 2016. [pdf] [suppl] [code]
- Ulyanov, Vedaldi, Lempitsky - Instance Normalization: The Missing Ingredient for Fast Stylization, 2016. [arxiv]
- Gatys, Ecker, Bethge - A Neural Algorithm of Artistic Style, 2015. [pdf]
- Lin et al. - Microsoft COCO: Common Objects in Context, 2014. [www] [arxiv]