Multi Style Transfer (Dumoulin et al., 2017)

In this repository I aim to give a brief description and demonstration of Dumoulin et al.'s style transfer model. The implementation used is part of my Python package nstesia and can be found in its module dumoulin_2017.

Note

There are two simple ways to run the demo notebook yourself without installing any software on your local machine:

  1. View the notebook on GitHub and click the Open in Colab button (requires a Google account).
  2. Create a GitHub Codespace for this repository and run the notebook in VSCode (requires a GitHub account).

Introduction

In 2016, Johnson, Alahi, and Fei-Fei published their article Perceptual Losses for Real-Time Style Transfer and Super-Resolution, in which they introduced a style transfer network that, once trained for a given style image, allows fast neural stylization of arbitrary input images in a single forward pass. Soon after, Ulyanov, Vedaldi, and Lempitsky submitted their note Instance Normalization: The Missing Ingredient for Fast Stylization, in which they showed that replacing the model's batch normalization layers with instance normalization leads to a significant improvement in visual results.

In their 2017 article A Learned Representation for Artistic Style, Dumoulin, Shlens, and Kudlur presented a way to perform neural style transfer for multiple styles, or a mix of styles, using the forward pass of a single style transfer network trained on a collection of style images simultaneously. Their key realization was that the learned filters of the style transfer network can be shared between different styles; it suffices to learn a separate set of normalization parameters for each style. To allow for this, they proposed to replace each batch or instance normalization layer with their new conditional instance normalization layer, which learns $N$ sets of instance normalization parameters and takes an $N$-dimensional style vector as a second input to select which (linear combination of) parameters to use. For more details, see the 'Network Architecture' section below.
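
To make the idea concrete, here is a minimal sketch of a conditional instance normalization layer in TensorFlow/Keras. This is not the nstesia implementation; the class name, weight layout, and the convention of passing the style vector as a second input are assumptions made for illustration.

```python
import tensorflow as tf

class ConditionalInstanceNorm(tf.keras.layers.Layer):
    """Instance normalization with N learned (gamma, beta) pairs, mixed
    according to an N-dimensional style vector."""

    def __init__(self, n_styles, epsilon=1e-5, **kwargs):
        super().__init__(**kwargs)
        self.n_styles = n_styles
        self.epsilon = epsilon

    def build(self, input_shape):
        image_shape, _ = input_shape              # [(B,H,W,C), (B,N)]
        channels = image_shape[-1]
        # One row of scale/offset parameters per style.
        self.gamma = self.add_weight(name='gamma',
                                     shape=(self.n_styles, channels),
                                     initializer='ones')
        self.beta = self.add_weight(name='beta',
                                    shape=(self.n_styles, channels),
                                    initializer='zeros')

    def call(self, inputs):
        x, v = inputs                             # x: (B,H,W,C), v: (B,N)
        mean, var = tf.nn.moments(x, axes=[1, 2], keepdims=True)
        x = (x - mean) / tf.sqrt(var + self.epsilon)
        # Select a (linear combination of) per-style parameters via v.
        gamma = tf.matmul(v, self.gamma)[:, None, None, :]   # (B,1,1,C)
        beta = tf.matmul(v, self.beta)[:, None, None, :]
        return gamma * x + beta
```

A one-hot style vector selects a single style; any other convex combination of the basis vectors mixes styles.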

Network Architecture

The style transfer network has three main parts: the encoder, the bottleneck, and the decoder. In addition, some pre- and post-processing is performed. Below I give a description of the various parts of the network. The given output dimensions are based on an input image size of 256x256 and are provided for illustration only; the network is fully convolutional and handles input images of any size.

Pre-Processing

Input images are assumed to take RGB values in 0.0..255.0. The preprocessing layer centers the RGB values around their ImageNet means. Note that the single padding layer of Johnson's model was removed: here, padding is applied individually before each convolution layer instead of all at once.

Layer                   Description                        Output Size
----------------------------------------------------------------------
preprocess              Pre-Processing                     256x256x3
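
As a minimal illustration of the centering step (the constant below is the commonly used ImageNet RGB mean, not necessarily the exact values used in the package):

```python
import tensorflow as tf

# Commonly used ImageNet channel means for RGB input in 0..255.
IMAGENET_MEAN_RGB = tf.constant([123.68, 116.779, 103.939])

def preprocess(images):
    # Center the RGB values around the ImageNet means.
    return images - IMAGENET_MEAN_RGB
```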

The Encoder

The encoder is composed of three convolutional blocks, each consisting of a reflection padding layer, a convolutional layer, a normalization layer, and a ReLU activation layer. All reflection padding layers of the transfer model use the 'same' amount of padding, i.e., the amount of padding is calculated by the same formulas as for 'same' padding, but reflection padding is applied instead of constant padding. The second and third blocks use strided convolutions to reduce the spatial dimensions by a total factor of 4.

Block / Layer           Description                        Output Size
----------------------------------------------------------------------
down_block_1 / rpad     ReflectionPadding
             / conv     Convolution (32, 9x9, stride 1)
             / norm     ConditionalInstanceNormalization
             / act      Activation (ReLU)                  256x256x32

down_block_2 / rpad     ReflectionPadding
             / conv     Convolution (64, 3x3, stride 2)
             / norm     ConditionalInstanceNormalization
             / act      Activation (ReLU)                  128x128x64

down_block_3 / rpad     ReflectionPadding
             / conv     Convolution (128, 3x3, stride 2)
             / norm     ConditionalInstanceNormalization
             / act      Activation (ReLU)                  64x64x128
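
The 'same'-amount reflection padding described above might be sketched as follows; static input sizes are assumed, and when the total padding is odd, the extra pixel goes to the bottom/right as in TensorFlow's SAME convention.

```python
import tensorflow as tf

def reflection_pad_same(x, kernel_size, stride=1):
    """Reflection-pad x (B,H,W,C) by the amount SAME padding would use, so a
    following VALID convolution yields ceil(input/stride) outputs."""
    def pad_amount(size):
        out_size = -(-size // stride)                         # ceil division
        total = max((out_size - 1) * stride + kernel_size - size, 0)
        return total // 2, total - total // 2                 # (before, after)

    (top, bottom), (left, right) = pad_amount(x.shape[1]), pad_amount(x.shape[2])
    return tf.pad(x, [[0, 0], [top, bottom], [left, right], [0, 0]],
                  mode='REFLECT')
```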

The Bottleneck

The bottleneck comprises five residual blocks, each of which consists of seven layers and a residual connection. In order, the layers are: padding, convolution, normalization, and ReLU activation, followed by another padding, convolution, and normalization layer.

Block / Layer           Description                        Output Size
----------------------------------------------------------------------
res_block_1..5 / rpad1  ReflectionPadding
               / conv1  Convolution (128, 3x3, stride 1)
               / norm1  ConditionalInstanceNormalization
               / relu1  Activation (ReLU)
               / rpad2  ReflectionPadding
               / conv2  Convolution (128, 3x3, stride 1)
               / norm2  ConditionalInstanceNormalization
               + res    Residual                           64x64x128
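
Using the padding helper and conditional instance normalization layer sketched above, one residual block could look roughly like this (an illustration, not the package's code):

```python
def residual_block(x, v, n_styles, filters=128):
    """Two reflection-padded 3x3 convolutions with conditional instance
    normalization; ReLU after the first only, then the skip connection."""
    y = reflection_pad_same(x, kernel_size=3)
    y = tf.keras.layers.Conv2D(filters, 3)(y)          # 'valid' padding
    y = ConditionalInstanceNorm(n_styles)([y, v])
    y = tf.nn.relu(y)
    y = reflection_pad_same(y, kernel_size=3)
    y = tf.keras.layers.Conv2D(filters, 3)(y)
    y = ConditionalInstanceNorm(n_styles)([y, v])
    return x + y                                        # residual connection
```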

The Decoder

The decoder roughly mirrors the encoder: it is again composed of three convolutional blocks, each consisting of a reflection padding layer, a convolutional layer, a normalization layer, and an activation layer. The first two of these blocks start with a 2x nearest-neighbor upsampling layer before the other layers and use ReLU activations, while the final block ends with a sigmoid activation.

Block / Layer           Description                        Output Size
----------------------------------------------------------------------
up_block_1 / up         UpSampling (2x, nearest)
           / rpad       ReflectionPadding
           / conv       Convolution (64, 3x3, stride 1)
           / norm       ConditionalInstanceNormalization
           / act        Activation (ReLU)                  128x128x64

up_block_2 / up         UpSampling (2x, nearest)
           / rpad       ReflectionPadding
           / conv       Convolution (32, 3x3, stride 1)
           / norm       ConditionalInstanceNormalization
           / act        Activation (ReLU)                  256x256x32

up_block_3 / rpad       ReflectionPadding
           / conv       Convolution (3, 9x9, stride 1)
           / norm       ConditionalInstanceNormalization
           / act        Activation (Sigmoid)               256x256x3
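
In the same sketched style, a decoder block might look as follows; the final block skips the upsampling step and uses a 9x9 convolution with a sigmoid activation.

```python
def up_block(x, v, n_styles, filters, kernel_size=3, upsample=True, final=False):
    """Decoder block: optional 2x nearest-neighbor upsampling, reflection-padded
    convolution, conditional instance normalization, then ReLU or sigmoid."""
    if upsample:
        x = tf.keras.layers.UpSampling2D(size=2, interpolation='nearest')(x)
    x = reflection_pad_same(x, kernel_size)
    x = tf.keras.layers.Conv2D(filters, kernel_size)(x)
    x = ConditionalInstanceNorm(n_styles)([x, v])
    return tf.nn.sigmoid(x) if final else tf.nn.relu(x)
```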

Post-Processing

The final sigmoid activation layer of the decoder produces an output image with RGB values in 0.0..1.0. These values are multiplied by a factor of 255.0 to obtain an output image with values in the desired range.

Layer                   Description                        Output Size
----------------------------------------------------------------------
rescale                 Rescaling (factor 255.0)           256x256x3
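
In Keras terms this is a single rescaling layer, e.g.:

```python
# Map the sigmoid output from 0..1 back to the 0..255 range.
postprocess = tf.keras.layers.Rescaling(255.0)
```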

Training Method

Let $S = \{x_s^i\}$ be a collection of $N$ style images, and denote by $T_S$ the image transformation network. The goal is to produce, for any content image $x_c$ and any style vector $v = (v^1,...,v^N)$, a pastiche image $x_p = T_S(x_c, v)$ using a single forward pass of the network. The objective formulated to achieve this goal is to minimize a weighted sum

$$ \mathcal{L}(x_c,x_s,x_p) = w_C\cdot\mathcal{L}_C(x_c,x_p) + w_S \cdot \sum_{i=1}^N v^i\cdot\mathcal{L}_S(x_s^i,x_p) \quad, $$

where $\mathcal{L}_C$ and $\mathcal{L}_S$ denote the content and style losses as introduced by Gatys et al. For my implementation of these losses, see the module gatys_2015. Note the absence of a total variation loss term: the use of upsampling layers instead of transposed (or fractionally strided) convolutions makes it unnecessary.
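
As a sketch, the objective translates almost literally into code; `content_loss` and `style_loss` below stand in for the Gatys-style loss functions (e.g. those in gatys_2015), whose exact signatures are not assumed here.

```python
def total_loss(x_c, x_p, v, style_images, content_loss, style_loss,
               w_content, w_style):
    """Weighted content loss plus style losses weighted by the style vector v."""
    loss = w_content * content_loss(x_c, x_p)
    for i, x_s in enumerate(style_images):
        loss += w_style * v[i] * style_loss(x_s, x_p)
    return loss
```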

Training is performed for 8 epochs over the images of the Microsoft COCO/2014 dataset using the Adam optimizer with a learning rate of 1e-3. All images are resized to 256x256 and served in batches of 16.
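
A rough sketch of this setup follows; loading COCO/2014 via tensorflow_datasets is an assumption for illustration and not necessarily how train.py obtains the data.

```python
import tensorflow as tf
import tensorflow_datasets as tfds

# Training images: COCO/2014, resized to 256x256, served in batches of 16.
ds = (tfds.load('coco/2014', split='train')
      .map(lambda d: tf.image.resize(d['image'], (256, 256)))
      .batch(16)
      .prefetch(tf.data.AUTOTUNE))

# Adam optimizer with learning rate 1e-3; training runs for 8 epochs.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
```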

Results

This repository contains a Python script train.py which takes a collection of style images as well as some training parameters as input, downloads the training dataset, trains the style transfer model, and finally saves the trained model to disk. The directory saved/model contains a model trained in this way on the 32 images in img/style. To try the model out yourself, have a look at the notebook multi-style-transfer.ipynb; all images below were produced using it.

Note: The images included here are lower-quality JPEG files; I have linked them to their lossless PNG versions.

The following are two sets of stylizations of the same content images in the same styles as used to demonstrate Johnson et al.'s style transfer network. Note that all of these pastiches were produced using a single style transfer network. The quality of the results is comparable to that of pastiches produced by Johnson's networks trained for individual styles; see my repository johnson-fast-style-transfer.

Note that the use of upsampling layers instead of transposed or fractionally strided convolutions can lead to improved results by eliminating checkerboard artifacts. This is particularly clear when comparing the first of the following stylizations to the corresponding one created using Johnson's network.

The following demonstrates the ability of Dumoulin et al.'s network to produce pastiches in mixed styles.

References

  • Dumoulin, Shlens, Kudlur - A Learned Representation for Artistic Style, 2017. [arxiv] [code]
  • Johnson, Alahi, Fei-Fei - Perceptual Losses for Real-Time Style Transfer and Super-Resolution, 2016. [pdf] [suppl] [code]
  • Ulyanov, Vedaldi, Lempitsky - Instance Normalization: The Missing Ingredient for Fast Stylization, 2016. [arxiv]
  • Gatys, Ecker, Bethge - A Neural Algorithm of Artistic Style, 2015. [pdf]
  • Lin et al - Microsoft COCO: Common Objects in Context, 2014. [www] [arxiv]
