
Efforts in Normalization Layer to improve Arbitrary Style Transfer

Inspired by the power of Convolutional Neural Networks (CNNs), Gatys et al.[1] looked for a way to reproduce the style of a famous painting on an arbitrary photograph or painting. Knowing that a CNN can extract the features of an image, they proposed to regenerate the image features while incorporating style features: the image is modified iteratively until the output meets both requirements. Although computationally expensive, the model does a great job on the style-transfer task and opened a new AI field called Neural Style Transfer (NST).

In subsequent years, extensive research on style transfer emerged, ranging from non-neural-network methods to neural-network methods (NST). It generated many branches, including today's topic: arbitrary style transfer.

Within NST there are two families of methods: image-optimization-based and model-optimization-based. Image-optimization-based methods, like that of Gatys et al., iteratively optimize the output image itself until it reaches the objective. Despite their impressive results, these methods are limited by efficiency. Model-optimization-based methods instead use data to train a feed-forward neural network that generates the output in a single pass.

However, most such models can only transfer a single, fixed style. Arbitrary Style Transfer (AST) aims to transfer any style you want, given only a style image.

Ioffe and Szegedy[4] proposed Batch Normalization (BN) to reduce internal covariate shift and accelerate training. The idea was also adopted for NST by Wang et al.[5].

In that setup, the generator network is a stack of convolution blocks followed by BN layers, which learn the style features, while the descriptor network is a well-trained, fixed network used to compute the loss. This is reminiscent of a GAN, though not the same. Within the BN layer:
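
$$\mathrm{BN}(x) = \gamma\,\frac{x - \mu(x)}{\sqrt{\sigma^2(x) + \epsilon}} + \beta$$

where $\mu(x)$ and $\sigma^2(x)$ are the per-channel mean and variance computed over the whole mini-batch (across the batch and spatial dimensions), and $\epsilon$ is a small constant for numerical stability.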

We call γ and β the affine parameters; they are learned during training and carry the style characteristics. Since the mean and variance are computed from the mini-batch during training, they are replaced by population statistics at test time.

Inspired by the use of the batch normalization layer to retain the style information in the affine parameters, Ulyanov et al.[6] achieved excellent performance by replacing BN with an Instance Normalization (IN) layer:
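
$$\mathrm{IN}(x) = \gamma\,\frac{x - \mu_{nc}(x)}{\sqrt{\sigma_{nc}^2(x) + \epsilon}} + \beta$$

where $\mu_{nc}(x)$ and $\sigma_{nc}^2(x)$ are computed separately for each sample and each channel, over the spatial dimensions only.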

As you might notice, it is quite similar to BN; only the way the mean and variance are computed changes. IN computes μ and σ per sample and per channel, over the spatial dimensions only, which retains more of each image's style information. It also does not switch to population statistics in the testing phase.
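
To make the difference concrete, here is a minimal PyTorch sketch (the tensor shapes are illustrative) of where the two layers gather their statistics:

```python
import torch

# x: a mini-batch of feature maps with shape (N, C, H, W).
x = torch.randn(8, 64, 32, 32)
eps = 1e-5

# BN: one mean/variance per channel, pooled over the batch and spatial dims (N, H, W).
bn_mean = x.mean(dim=(0, 2, 3), keepdim=True)                   # shape (1, C, 1, 1)
bn_var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
x_bn = (x - bn_mean) / torch.sqrt(bn_var + eps)

# IN: one mean/variance per sample and per channel, over the spatial dims (H, W) only,
# so the style statistics of each image in the batch stay separate.
in_mean = x.mean(dim=(2, 3), keepdim=True)                      # shape (N, C, 1, 1)
in_var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
x_in = (x - in_mean) / torch.sqrt(in_var + eps)
```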

Even with IN's outstanding performance, a trained model can still only transfer one fixed style. But storing style information in the affine parameters provided great inspiration for further improvements and for AST.

Before reaching arbitrary style transfer, people tried to transfer a fixed set of styles. The Conditional Instance Normalization (CIN) layer proposed by Dumoulin et al.[7] uses a set of affine parameters to do this.

Rather than training a single pair of affine parameters, Dumoulin et al. use one pair per style to store the information of multiple styles, as sketched below. However, the training time, the number of parameters, and the required dataset all grow with the number of styles.
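
A minimal sketch of the idea in PyTorch (the class interface and names are assumptions, not the paper's implementation): normalize with plain instance normalization, then apply the (γ, β) pair selected by a style index.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalInstanceNorm2d(nn.Module):
    """One (gamma, beta) pair is learned per style in a fixed set of styles."""

    def __init__(self, num_features, num_styles):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(num_styles, num_features))
        self.beta = nn.Parameter(torch.zeros(num_styles, num_features))

    def forward(self, x, style_id):
        # Instance-normalize without affine parameters, then apply the
        # affine parameters of the selected style.
        x = F.instance_norm(x)
        g = self.gamma[style_id].view(1, -1, 1, 1)
        b = self.beta[style_id].view(1, -1, 1, 1)
        return g * x + b
```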

To reach the goal of AST, why not make the affine parameters (γ and β) predictable from any given style? Inspired by this idea, Huang et al.[8] proposed the Adaptive Instance Normalization (AdaIN) layer, the first method to do an excellent job at AST.
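
Concretely, AdaIN simply aligns the channel-wise mean and standard deviation of the content feature x to those of the style feature y:

$$\mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y)$$

where μ and σ are computed per sample and per channel over the spatial dimensions, so σ(y) and μ(y) play the role of the affine parameters γ and β.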

The architecture is simple but efficient. A content image and a style image are passed through a fixed VGG-19 encoder to obtain a content representation and a style representation, and both features are fed into the AdaIN layer. Since μ and σ are channel-wise statistics, the affine parameters are computed directly from the style representation, and the content feature is shifted and scaled by them. The resulting recombined feature is then passed to a decoder, which outputs the style-transferred image.
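
A minimal PyTorch sketch of the AdaIN operation itself (the function and variable names are illustrative):

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    # Channel-wise statistics per sample, over the spatial dimensions.
    c_mean = content_feat.mean(dim=(2, 3), keepdim=True)
    c_std = content_feat.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style_feat.mean(dim=(2, 3), keepdim=True)
    s_std = style_feat.std(dim=(2, 3), keepdim=True) + eps

    # Normalize the content feature, then scale and shift it with the
    # style feature's statistics (the "predicted" affine parameters).
    normalized = (content_feat - c_mean) / c_std
    return s_std * normalized + s_mean
```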

To update the model's parameters, the authors propose a weighted combination of a content loss and a style loss as the loss function.

The content loss is the L2 loss between the output image's feature, obtained via the VGG-19 encoder, and the content target. The style loss is a little more involved: both the style image and the output image are passed through the VGG-19 encoder, and for each of the selected hidden layers we compute the L2 loss between the means of the two features and the L2 loss between their standard deviations, then sum over layers. A weighted sum of content loss and style loss trains the model, and adjusting the weight shifts the preference between preserving content and matching style.
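
Following [8] (where the content target is the AdaIN output t itself rather than the raw content feature), the losses can be written as:

$$\mathcal{L} = \mathcal{L}_c + \lambda\,\mathcal{L}_s, \qquad \mathcal{L}_c = \lVert f(g(t)) - t \rVert_2,$$

$$\mathcal{L}_s = \sum_{i=1}^{L} \Big( \lVert \mu(\phi_i(g(t))) - \mu(\phi_i(s)) \rVert_2 + \lVert \sigma(\phi_i(g(t))) - \sigma(\phi_i(s)) \rVert_2 \Big),$$

where f is the VGG-19 encoder, g the decoder, s the style image, $\phi_i$ the i-th VGG-19 layer used for the style loss, and λ the weight that sets the preference between content and style.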

Huang et al.'s work is considered the first to make AST practical and has significantly inspired the field. More sophisticated models with better performance have since been driven by improving the structure of the normalization layer.

After the famous paper Attention Is All You Need by Vaswani et al.[9], extensive research has used self-attention architectures to improve performance, including in AST. AdaAttN, proposed by Liu et al.[10], replaces the adaptive instance normalization layer with an Adaptive Attention Normalization layer, which gives better performance.

Like AdaIN, the model takes one content image and one style image through a VGG-19 encoder. Unlike AdaIN, it takes the features from the 3rd, 4th, and 5th hidden blocks of VGG and feeds them into AdaAttN layers to learn the style and transform the content feature. The authors found that the last layer's output of VGG captures only high-level semantic information and loses low-level details. So they perform the adaptive transformation at several layers, from shallow to deep, generate multiple transferred features, combine them, and pass them to the decoder. Extracting information from shallow to deep lets the model decide which information to use.

Like the instance normalization layer, AdaAttN computes a mean and a standard deviation and uses them as affine parameters, but it generates them with self-attention. First, it produces a query Q from the (normalized) content feature and a key K from the (normalized) style feature, multiplies them, and applies a Softmax to obtain an attention map A. It also produces a value matrix V from the style feature with a 1 × 1 convolution. The attention-weighted mean of the style values, M = V ⊗ Aᵀ, plays the role of β; the attention-weighted standard deviation, S = Sqrt((V·V) ⊗ Aᵀ − M·M), where · denotes element-wise multiplication, plays the role of γ. Finally, the content feature F_c is transformed as S·Norm(F_c) + M.
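
A sketch of that computation in PyTorch (layer names, shapes, and epsilon values are assumptions, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mean_variance_norm(feat, eps=1e-5):
    # Instance-normalize without affine parameters (per sample, per channel).
    mean = feat.mean(dim=(2, 3), keepdim=True)
    std = feat.std(dim=(2, 3), keepdim=True) + eps
    return (feat - mean) / std

class AdaAttN(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Conv2d(channels, channels, 1)  # query from normalized content
        self.g = nn.Conv2d(channels, channels, 1)  # key from normalized style
        self.h = nn.Conv2d(channels, channels, 1)  # value from style

    def forward(self, content, style):
        n, c, hc, wc = content.shape
        q = self.f(mean_variance_norm(content)).flatten(2)   # (N, C, HcWc)
        k = self.g(mean_variance_norm(style)).flatten(2)      # (N, C, HsWs)
        v = self.h(style).flatten(2)                           # (N, C, HsWs)

        # Attention map over style positions for each content position.
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)   # (N, HcWc, HsWs)

        # Attention-weighted mean and standard deviation of the style values.
        mean = torch.bmm(v, attn.transpose(1, 2))                    # (N, C, HcWc)
        second_moment = torch.bmm(v * v, attn.transpose(1, 2))
        std = torch.sqrt(torch.clamp(second_moment - mean * mean, min=0.0) + 1e-6)

        mean = mean.view(n, c, hc, wc)
        std = std.view(n, c, hc, wc)
        return std * mean_variance_norm(content) + mean
```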

The authors also propose a weighted combination of a global stylized loss and a local feature loss to facilitate training. Similar to the loss in AdaIN[8], the local feature loss cares about preserving content, and the global stylized loss cares about transferring style.

The global stylized loss is the L2 loss between the means and standard deviations of the style image's and the output image's features at VGG hidden layers 2 to 5. The local feature loss is the L2 loss, at each AdaAttN layer, between the output image's feature and the target feature that the AdaAttN layer produces from the content and style features.
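
Written out (with $E_l$ the l-th VGG-19 layer, $I_{cs}$ the output image, $I_s$ the style image, and $F^l_{cs}$ the target feature an AdaAttN layer produces from the content and style features at layer l), the two terms are roughly:

$$\mathcal{L}_{gs} = \sum_{l=2}^{5} \Big( \lVert \mu(E_l(I_{cs})) - \mu(E_l(I_s)) \rVert_2 + \lVert \sigma(E_l(I_{cs})) - \sigma(E_l(I_s)) \rVert_2 \Big)$$

$$\mathcal{L}_{lf} = \sum_{l=3}^{5} \lVert E_l(I_{cs}) - F^{l}_{cs} \rVert_2$$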

AdaAttN proposes a fairly original way of using a self-attention structure to learn the affine parameters from the style image and transfer the content image. The authors also present a way to exploit both content and style features from low level to high level, which helps generate a good output image.
