Targeted style transfer using cycle consistent generative adversarial networks with quantitative analysis of different loss functions

Abstract

Targeted style transfer is a visual computing and deep learning problem in which a network is trained on sets of input and target images, learning the mapping between them so that an input image can be converted to the style of the target images. One popular method for this task is Cycle-GANs (Cycle Consistent Generative Adversarial Networks), which use Mean Squared Error, Binary Cross Entropy Error, and L1 loss functions. In this paper, a network is trained for image-to-image translation in which the style or content of the target image is changed, and the loss functions of Cycle GANs are modified. The most accurate translation could be learned from paired images, i.e. supervised learning, where the input and output images are known and the network learns to minimize the gap between the expected output and the observed output. However, this kind of paired data is not readily available and is strenuous to mass-produce. Cycle GANs use unpaired data, and our work is dedicated to finding the best possible loss function combination for making them even more efficient.

Cycle GANs combine 2 kinds of networks, a Discriminator and a Generator for each dataset, which compete to outperform each other. The Discriminator network uses Classification loss functions to distinguish the images of the 2 datasets, while the Generator network uses Regression loss functions to determine the Cycle loss and Identity loss. These loss functions play a vital role in style transfer because they determine how much the images have been modified. We experimented with Mean Squared Error loss, Binary Cross Entropy Error loss, Hinge loss, Huber loss, Log loss, Square loss and L1 loss to find the best combination of losses. We discuss the strengths and limitations of the loss functions already in use and propose different combinations of loss functions for better accuracy. A separate classifier was trained extensively for performance evaluation; according to it, the best-performing combination is Binary Cross Entropy loss as the Classification loss function and Huber loss as the Regression loss function.

Keywords

Cycle GANs, image-to-image translation, deep learning, loss functions, neural networks, visual computing

Figure 1.

An overview of system design.

1. Introduction

In recent years, our day-to-day life has come to involve images on a grand scale. It is not possible to manually generate or even modify all the required images, so we need computer vision and its applications, such as image-to-image translation. Targeted style transfer is beneficial for applications that visualize different features against different backgrounds, for example trying various spectacles, hairstyles or clothes on people, several flooring or furniture options for rooms, and movie editing. It can also be used to conjure an older version of a face in an image or, similarly, to learn how a face might have looked when it was younger. Simple doodles can be turned into solid objects, and the network can be trained to convert different animals into one another.

This task was initially treated as object identification and transformation [8] and was solved using only Convolutional Neural Networks (CNNs), a powerful class of deep learning models for image processing. CNNs consist of small computational units that process the feature information in the image hierarchically. The layers can be viewed as filters that work their way up from low-level information (points, edges) to higher-level information (object outlines). The style of an image and its content are separable entities, which the network learns from sets of input images, whose content is to be reproduced, and sets of target images, whose style is to be imposed on the output image.

Targeted style transfer requires only a part of the image to be altered, without changing its background information. For example, in the popular game Pokémon Go or in the Son of Zorn series, there are cartoons in real-world backgrounds. This work [2] is especially useful in augmented reality applications.

This image-to-image translation task is easier when paired datasets are available for training the network. However, creating paired training data is cumbersome and few repositories supply it. This challenge was overcome by the Cycle GANs work [7], whose approach deals with unaligned training data from a source domain and a target domain. The goal is to convert images in the source domain into the target domain using an adversarial loss, such that the converted images are indistinguishable from images of the target domain. In addition, the inverse mapping of the converted image back to the source domain is constrained through the Cycle Consistency loss.

Our method builds on the adversarial network framework, in which there are 2 functions: Discriminators (D) and Generators (G). The Generators produce fake images by translating from one domain to the other, while the Discriminators distinguish between real and fake images for their domain. There are 3 losses in the networks: the GAN loss, used by the Discriminator for generated images; the Cycle Consistency loss, used for images converted from A to B and back to A; and the Identity loss, used by the Discriminator for images of its own domain.

Figure 2.

Generator neural network architecture.

The advantages of Cycle GANs are that they use unpaired datasets, as a form of unsupervised learning, and that they perform better and are more versatile than previous methods. Our motivation for this work was to make the method even more efficient. There are many loss functions; some are widely used while others are little known. There was a chance that some combination of loss functions might perform better than the techniques in use. Hence, we tried all possible combinations of the loss functions that we considered appropriate. This led to experimentation at the most basic level of these neural networks.

In this work, we have experimented with various combinations of these losses. The losses [15, 1, 9] we have used are Binary Cross Entropy Error loss, Huber loss, L1 loss, Mean Squared Error loss, Hinge loss, Log loss and Square loss. Using a separate classifier trained on these datasets, we then calculated the accuracy and cross entropy [10] for each of the combinations and discuss the conclusions.

The rest of the paper is organized as follows: Section 2 presents the proposed system design. Section 3 presents the experimental results of the proposed system. Finally, Section 4 concludes the paper and discusses future work.

2. System design

The objective of the proposed system is to learn the mapping function between the two domains $A$ and $B$ . To do this, the proposed system uses two functions, the Generator and the Discriminator. The system design is shown in Fig. 1.

As can be observed from Fig. 1, when an image ( $I_{A}$ ) is fed into the Generator ( $G_{\textit{AtoB}}$ ), which converts source domain $A$ images to target domain $B$ images, a Fake image is constructed in domain $B$ . This Fake image ( $F_{B}$ ) is passed to the Discriminator for domain $B$ ( $D_{B}$ ), which outputs a decision (0 or 1) on whether or not the image belongs to that domain. $D_{B}$ computes its loss using Classification type loss functions and updates its network accordingly.

The fake image $F_{B}$ is also passed to the Generator ( $G_{\textit{BtoA}}$ ) for conversion back from domain $B$ to source domain $A$ . This results in the Reconstructed image ( $R_{A}$ ), which is then compared with the Input image ( $I_{A}$ ) to calculate the Regression type loss functions. Input images from both domains, $I_{A}$ and $I_{B}$ , are passed to the Discriminators $D_{A}$ and $D_{B}$ respectively to compute the Classification type loss functions. The Classification type loss ensures that each Discriminator is able to recognize images of its own domain. The system's networks are built from Convolutional Neural Networks (CNNs), through layers of computational units. Figs 2 and 3 show the layers used in the Generator and Discriminator networks.
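The data flow of Fig. 1 for one domain-$A$ image can be traced with a minimal sketch. The generators and the discriminator below are toy stand-ins operating on flat lists of pixel values (hypothetical placeholders, not the networks used in this paper), and the absolute-difference comparison between $R_{A}$ and $I_{A}$ is one possible choice assumed for illustration.

```python
def forward_a_to_b(image_a, g_a_to_b, g_b_to_a, d_b):
    """Trace one domain-A image through the A -> B -> A path of Fig. 1."""
    fake_b = g_a_to_b(image_a)          # F_B: fake image in domain B
    decision_b = d_b(fake_b)            # D_B: 0/1 decision on F_B
    reconstructed_a = g_b_to_a(fake_b)  # R_A: F_B mapped back to domain A
    # Compare R_A with I_A using summed absolute differences (assumed choice).
    reconstruction_error = sum(
        abs(r - i) for r, i in zip(reconstructed_a, image_a)
    )
    return fake_b, decision_b, reconstructed_a, reconstruction_error


# Toy stand-ins (hypothetical): real generators and discriminators are CNNs.
def toy_g_a_to_b(image):
    return [p + 1.0 for p in image]


def toy_g_b_to_a(image):
    return [p - 1.0 for p in image]


def toy_d_b(image):
    # Pretend that domain-B images have a mean pixel value above 0.5.
    return 1 if sum(image) / len(image) > 0.5 else 0


if __name__ == "__main__":
    print(forward_a_to_b([0.0, 0.25], toy_g_a_to_b, toy_g_b_to_a, toy_d_b))
```

Because the toy generators are exact inverses of each other, the reconstruction error is zero here; with trained networks it generally would not be.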

Fig. 2 shows the Generator neural network architecture. This network contains two stride-2 convolutions, several residual blocks, and two fractionally strided convolutions with stride 1/2. Instance normalization is used instead of Batch normalization. The layers used are Conv2d, a Norm layer (normalization), ReLU (Rectified Linear Unit) and a Tanh layer (to keep the output between $-$ 1 and 1).

Fig. 3 shows the Discriminator neural network architecture. For this network, 70 $\times$ 70 PatchGANs are used, which aim to classify whether 70 $\times$ 70 overlapping image patches are fake or real. Such a patch-level discriminator architecture can be applied to arbitrarily sized images in a fully convolutional fashion, and has fewer parameters than a full-image discriminator. The layers employed are Conv2d, Leaky ReLU (Leaky Rectified Linear Unit) and a Norm layer (for normalization).
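As a rough check on the 70 $\times$ 70 figure, the receptive field of a stack of convolutions can be computed by walking backwards from a single output unit. The sketch below assumes the layer layout commonly used for 70 $\times$ 70 PatchGAN discriminators (five 4 $\times$ 4 convolutions with strides 2, 2, 2, 1, 1); that layout is an assumption for illustration and is not stated in this paper.

```python
def receptive_field(layers):
    """Receptive field, in input pixels, of one output unit.

    `layers` is a list of (kernel_size, stride) pairs ordered from the
    input of the network to its output.
    """
    rf = 1
    for kernel, stride in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumed PatchGAN layout (hypothetical for this paper): 4x4 kernels.
PATCHGAN_LAYERS = [(4, 2), (4, 2), (4, 2), (4, 1), (4, 1)]

if __name__ == "__main__":
    print(receptive_field(PATCHGAN_LAYERS))  # 70
```

Under this assumed layout each discriminator output depends on a 70 $\times$ 70 patch of the input, which is why the output can be read as a grid of per-patch decisions.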

As can be observed from Fig. 1, the system relies on loss functions. A loss function is a function that maps the values of one or more variables, or an event, onto a real number intuitively representing the “cost” associated with the event. The loss function is an important part of artificial neural networks, where it is used to measure the inconsistency between the predicted value ( $y^{\prime}$ ) and the actual label ( $y$ ). It is a non-negative value, and the robustness of the model increases as the value of the loss function decreases.

Figure 3.

Discriminator neural network architecture.

An optimization problem seeks to minimize a loss function. Generative Adversarial Networks pose a challenging optimization problem because multiple loss functions must be optimized simultaneously. There are 2 major types of losses incorporated in this system:

Regression loss

Classification loss

In this work, we have dealt with the various types of loss functions and compared these to find out which of the combinations used is the best.

2.1 Regression losses

2.1.1 Absolute loss

The absolute loss function measures the absolute differences between the existing target values and the estimated values. It is defined in Eq. (1)

$\displaystyle L(r)=\frac{1}{n}\sum_{i=1}^{n}|y_{i}-y_{i}^{\prime}|$ (1)

where $L(r)$ is the loss function, $y$ is the existing target value, $y^{\prime}$ is the estimated value and $n$ is the number of values. Absolute loss is also called Laplace or L1 loss.
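A minimal sketch of Eq. (1) in plain Python follows; the function name and the sample values are illustrative only.

```python
def l1_loss(targets, estimates):
    """Mean absolute difference between targets y and estimates y', Eq. (1)."""
    if len(targets) != len(estimates) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(y - y_hat) for y, y_hat in zip(targets, estimates))
    return total / len(targets)


if __name__ == "__main__":
    # Residuals are -0.5, 0.0 and 2.0, so the loss is 2.5 / 3.
    print(l1_loss([1.0, 2.0, 3.0], [1.5, 2.0, 1.0]))
```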

2.1.2 Mean Squared Error loss

The Mean Squared Error (MSE), or quadratic, loss function is widely used in linear regression as the performance measure, and the method of minimizing MSE is called Ordinary Least Squares (OLS). The basic principle of OLS is that the optimized fitting line should be the line that minimizes the sum of squared distances from each point to the regression line, i.e. the quadratic sum. The standard form of the MSE loss function is defined in Eq. (2)

$\displaystyle L(r)=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-y_{i}^{\prime})^{2}$ (2)

where ( $y_{i}-y_{i}^{\prime}$ ) is called the residual, and the target of the MSE loss function is to minimize the residual sum of squares.
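The following sketch evaluates Eq. (2) on a few illustrative values. Squaring makes the single large residual dominate the result, which is the outlier sensitivity that motivates the Huber loss.

```python
def mse_loss(targets, estimates):
    """Mean of squared residuals (y - y'), Eq. (2)."""
    if len(targets) != len(estimates) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((y - y_hat) ** 2 for y, y_hat in zip(targets, estimates))
    return total / len(targets)


if __name__ == "__main__":
    # Squared residuals are 0.25, 0.0 and 4.0: the last one dominates.
    print(mse_loss([1.0, 2.0, 3.0], [1.5, 2.0, 1.0]))
```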

2.1.3 Huber loss

In statistics, the Huber loss is a loss function used in robust regression that is less sensitive to outliers in data than the squared error loss.

$\displaystyle L(r)=\frac{1}{n}\sum_{i=1}^{n}z_{i}$ (3)

$\displaystyle z_{i}=\left\{\begin{array}{ll}0.5(y_{i}-y_{i}^{\prime})^{2},&\text{if }|y_{i}-y_{i}^{\prime}|<1\\ |y_{i}-y_{i}^{\prime}|-0.5,&\text{otherwise}\end{array}\right.$ (4)

where $n$ is the number of values, $y$ is the existing target value and $y^{\prime}$ is the estimated value.
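Eqs. (3) and (4) can be sketched as below, with the threshold fixed at 1 as in Eq. (4). The sample values are illustrative; note that the residual of magnitude 2 contributes 1.5 instead of the 4.0 it would contribute under squared error.

```python
def huber_loss(targets, estimates):
    """Huber loss with threshold 1, following Eqs. (3) and (4)."""
    if len(targets) != len(estimates) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(targets, estimates):
        residual = abs(y - y_hat)
        if residual < 1.0:
            total += 0.5 * residual ** 2   # quadratic near zero
        else:
            total += residual - 0.5        # linear for large residuals
    return total / len(targets)


if __name__ == "__main__":
    # Per-sample terms are 0.125, 0.0 and 1.5.
    print(huber_loss([1.0, 2.0, 3.0], [1.5, 2.0, 1.0]))
```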

2.1.4 GAN loss

The GAN loss is the adversarial loss used to learn the mapping function from $x$ to $y$ . It is defined in Eq. (5)

$\displaystyle L(r)=\log D_{Y}(y)+\log(1-D_{Y}(G(x)))$ (5)

where $G$ is the generator function, $D_{Y}$ is the discriminator function for the target domain, $x$ is an input image and $y$ is a target-domain image.
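Eq. (5) can be evaluated for a single real/fake pair once the discriminator outputs are available. The sketch below takes those outputs as plain probabilities (hypothetical values, not measurements from this work). In the standard adversarial formulation the discriminator tries to increase this quantity while the generator tries to decrease it.

```python
import math


def gan_loss(d_real, d_fake):
    """Eq. (5) for one pair: log D_Y(y) + log(1 - D_Y(G(x))).

    d_real is D_Y(y), the discriminator output for a real target image;
    d_fake is D_Y(G(x)), its output for a generated image. Both must lie
    strictly between 0 and 1.
    """
    if not (0.0 < d_real < 1.0 and 0.0 < d_fake < 1.0):
        raise ValueError("discriminator outputs must lie in (0, 1)")
    return math.log(d_real) + math.log(1.0 - d_fake)


if __name__ == "__main__":
    print(gan_loss(0.9, 0.1))  # confident, correct discriminator
    print(gan_loss(0.5, 0.5))  # discriminator cannot tell real from fake
```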

2.2 Classification losses

2.2.1 Binary cross entropy error loss

The cross entropy loss is ubiquitous in modern deep neural networks. This function is not naturally represented as a product of the true label and the predicted value, but is convex and can be minimized using stochastic gradient descent methods. Cross-entropy loss increases as the predicted probability diverges from the actual label. It can be defined in Eq. (6)

$\displaystyle L(r)=-\frac{1}{n}\sum_{i=1}^{n}(y_{i}\circ\log y_{i}^{\prime}+(1-y_{i})\circ\log(1-y_{i}^{\prime}))$ (6)

where $L(r)$ is the loss function, $y$ is the actual label and $y^{\prime}$ is the predicted label, expressed as a probability between 0 and 1.
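A minimal sketch of binary cross entropy follows. It takes the logarithm of the predicted probability and weights it by the actual label, which is the same form as Eq. (12). The small clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero.

```python
import math


def binary_cross_entropy(labels, predictions, eps=1e-12):
    """Mean binary cross entropy between 0/1 labels and predicted probabilities."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, y_hat in zip(labels, predictions):
        p = min(max(y_hat, eps), 1.0 - eps)  # keep p away from 0 and 1
        total += y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return -total / len(labels)


if __name__ == "__main__":
    print(binary_cross_entropy([1, 0], [0.9, 0.2]))  # mostly correct
    print(binary_cross_entropy([1, 0], [0.2, 0.9]))  # mostly wrong, larger
```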

2.2.2 Hinge loss

The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Because it expects labels in the range $-$ 1 to $+$ 1, the Tanh function is applied to the output. In addition, the empirical risk minimization of this loss is equivalent to the classical formulation for support vector machines (SVMs). Correctly classified points lying outside the margin boundaries of the support vectors are not penalized, whereas points within the margin boundaries or on the wrong side of the hyperplane are penalized linearly in their distance from the correct boundary. It is defined in Eq. (7)

$\displaystyle L(r)=\frac{1}{n}\sum_{i=1}^{n}\max(0,1-y_{i}\circ y_{i}^{\prime})$ (7)

where $n$ is the number of values, $y_{i}$ is the existing target value and $y_{i}^{\prime}$ is the estimated value.
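Eq. (7) can be sketched as follows, assuming targets in {$-$1, $+$1} and real-valued estimates; the sample values are illustrative. The second sample is correctly classified with a margin above 1 and therefore contributes nothing.

```python
def hinge_loss(targets, estimates):
    """Mean hinge loss, Eq. (7), for targets in {-1, +1}."""
    if len(targets) != len(estimates) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(max(0.0, 1.0 - y * y_hat) for y, y_hat in zip(targets, estimates))
    return total / len(targets)


if __name__ == "__main__":
    # Margins y * y' are 0.8, 1.5 and -0.3.
    print(hinge_loss([1, -1, 1], [0.8, -1.5, -0.3]))
```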

2.2.3 Log loss

The Logistic loss function does not assign zero penalty to any points. Instead, functions that correctly classify points with high confidence (i.e., with high values of $|f(x)|$ ) are penalized less. This structure leads the logistic loss function to be sensitive to outliers in the data. It is defined in Eq. (8)

$\displaystyle L(r)=\frac{1}{n}\sum_{i=1}^{n}\ln(1+e^{-y_{i}^{\prime}\circ y_{i}})$ (8)

where $L(r)$ is the loss function, $y$ is the actual label and $y^{\prime}$ is predicted label.
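The sketch below evaluates Eq. (8), assuming labels in {$-$1, $+$1} and real-valued predictions; the sample values are illustrative. Every sample receives a strictly positive penalty, which shrinks as the margin $y^{\prime}\circ y$ grows.

```python
import math


def logistic_loss(labels, predictions):
    """Mean logistic loss, Eq. (8), for labels in {-1, +1}."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(
        math.log1p(math.exp(-y_hat * y)) for y, y_hat in zip(labels, predictions)
    )
    return total / len(labels)


if __name__ == "__main__":
    for margin in (-1.0, 0.0, 1.0, 3.0):
        # With label +1 the margin equals the prediction itself.
        print(margin, logistic_loss([1], [margin]))
```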

2.2.4 Square loss

While more commonly used in regression, the square loss function can be re-written and utilized for classification. The square loss function is both convex and smooth and matches the 0–1 indicator function when $y\circ f(x)=$ 0 and $y\circ f(x)=$ 1. Because it expects labels in the range $-$ 1 to $+$ 1, the Tanh function is applied to the output. The square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regard to sample complexity) than for the logistic loss or hinge loss functions. It is defined in Eq. (9)

$\displaystyle L(r)=\frac{1}{n}\sum_{i=1}^{n}(1-y_{i}^{\prime}\circ y_{i})^{2}$ (9)

where $n$ is the number of values, $y$ is the existing target value and $y^{\prime}$ is the estimated value.
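A minimal sketch of Eq. (9) follows, assuming targets in {$-$1, $+$1}; the sample values are illustrative. The penalty is zero only when the margin equals exactly 1, so even a confidently correct estimate (margin 1.5 below) is penalized.

```python
def square_loss(targets, estimates):
    """Mean square loss for classification, Eq. (9), targets in {-1, +1}."""
    if len(targets) != len(estimates) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((1.0 - y_hat * y) ** 2 for y, y_hat in zip(targets, estimates))
    return total / len(targets)


if __name__ == "__main__":
    # Margins are 0.8, 1.5 and -0.3; per-sample terms are 0.04, 0.25 and 1.69.
    print(square_loss([1, -1, 1], [0.8, -1.5, -0.3]))
```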

2.2.5 Cycle consistency loss

The cycle consistency loss constrains the inverse mapping, so that an image converted to the other domain and back again reproduces the original image. It is defined in Eq. (10)

$\displaystyle L(r)=\|F(G(x))-x\|_{1}+\|G(F(y))-y\|_{1}$ (10)

where $G$ and $F$ are the two generator functions, which map between the domains in opposite directions.
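Eq. (10) can be sketched with images represented as flat lists of pixel values. The generators used in the example are toy functions (hypothetical stand-ins for trained networks) chosen so that the effect of an imperfect inverse is easy to see.

```python
def l1_distance(a, b):
    """L1 norm of the difference between two equally sized images."""
    return sum(abs(p - q) for p, q in zip(a, b))


def cycle_consistency_loss(x, y, G, F):
    """Eq. (10): ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1."""
    return l1_distance(F(G(x)), x) + l1_distance(G(F(y)), y)


# Toy generators (hypothetical): G doubles each pixel, F halves it.
def toy_G(image):
    return [2.0 * p for p in image]


def toy_F(image):
    return [p / 2.0 for p in image]


def toy_F_biased(image):
    # An imperfect inverse of toy_G: adds a constant offset of 0.1.
    return [p / 2.0 + 0.1 for p in image]


if __name__ == "__main__":
    x, y = [1.0, 2.0], [4.0, 6.0]
    print(cycle_consistency_loss(x, y, toy_G, toy_F))         # exact inverses
    print(cycle_consistency_loss(x, y, toy_G, toy_F_biased))  # imperfect cycle
```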

3. Experimental results and performance analysis

3.1 Dataset

To evaluate the performance of the proposed method, images collected from the ImageNet database are used. The keywords searched in the ImageNet database for these datasets are wild horse and zebra [16]. The dataset contains 2 classes with 1973 images in total. The dimensions of each image are 256 $\times$ 256.

3.2 Implementation

3.2.1 Network details

In this paper, we have used the architecture from Johnson et al. [6], which contains two stride-2 convolutions, several residual blocks, and two fractionally strided convolutions with stride 1/2. For the discriminator network, 70 $\times$ 70 PatchGANs were used, which aim to classify whether 70 $\times$ 70 overlapping image patches are fake or real.

3.2.2 Training details

The training set contained 500 images for each class, horse (domain A) and zebra (domain B), and the test set used for evaluation contained 973 images.

In our research, we worked on 15 (5 $\times$ 3) combinations of classification and regression losses. The training details for all the experiments are as follows. We use $\lambda=$ 10, where $\lambda$ weights the relative importance of the Cycle Consistency loss, and the Adam solver [3] with a batch size of 1. Adam is an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. All networks are trained from scratch with a learning rate of 0.0002. We keep the same learning rate for the first half of the epochs and linearly decay the rate to zero over the next half of the epochs.
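The learning rate schedule described above can be sketched as a function of the epoch index. The total of 100 epochs matches the training runs reported in Section 3.4.1; the zero-based epoch indexing and the exact point at which the decay begins are assumptions made for illustration.

```python
def learning_rate(epoch, total_epochs=100, base_lr=0.0002):
    """Constant rate for the first half, then linear decay to zero.

    `epoch` is zero-based (an assumption); `epoch == total_epochs`
    corresponds to the end of training, where the rate reaches zero.
    """
    if not 0 <= epoch <= total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs]")
    half = total_epochs // 2
    if epoch < half:
        return base_lr
    decay_epochs = total_epochs - half
    return base_lr * (total_epochs - epoch) / decay_epochs


if __name__ == "__main__":
    for epoch in (0, 49, 50, 75, 99, 100):
        print(epoch, learning_rate(epoch))
```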

3.3 Performance evaluation measures

Accuracy and cross entropy metrics [5, 11, 12] are used for performance evaluation. TP, TN, FP and FN indicate True Positives, True Negatives, False Positives and False Negatives respectively. Accuracy is the best measure for classification problems [3, 4]. It is given in Eq. (11).

$\displaystyle\text{Accuracy}=\frac{\text{TP}+\text{TN}}{\text{TP}+\text{FP}+\text{FN}+\text{TN}}$ (11)
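Accuracy is the fraction of samples that are classified correctly, i.e. true positives plus true negatives over all samples. A minimal sketch with illustrative counts (not results from this work):

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples: (TP + TN) / all samples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return (tp + tn) / total


if __name__ == "__main__":
    # 40 + 45 correct decisions out of 100 samples.
    print(accuracy(tp=40, tn=45, fp=5, fn=10))
```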

Cross entropy measures network performance given targets and outputs, with optional performance weights and other parameters. For a target $t$ and an output $y$ , the function returns a result that heavily penalizes outputs that are extremely inaccurate ( $y$ near $1-t$ ), with very little penalty for fairly correct classifications ( $y$ near $t$ ). Minimizing cross-entropy leads to good classifiers. The binary cross-entropy expression is defined in Eq. (12).

$\displaystyle ce=-t\times\log(y)-(1-t)\times\log(1-y)$ (12)

In the special case ( $N=$ 1), where an output consists of only one element, the outputs and targets are interpreted as a binary encoding. That is, there are two classes with targets of 0 and 1, whereas in 1-of-N encoding there are two or more classes.

Cross entropy is similar to the negative log-likelihood loss function.
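The asymmetry described for Eq. (12) can be seen by evaluating it for a single target at several outputs. The sketch below is illustrative; it requires the output to lie strictly between 0 and 1 so that both logarithms are defined.

```python
import math


def cross_entropy(t, y):
    """Eq. (12) for one target t in {0, 1} and one output y in (0, 1)."""
    if not 0.0 < y < 1.0:
        raise ValueError("output y must lie strictly between 0 and 1")
    return -t * math.log(y) - (1 - t) * math.log(1.0 - y)


if __name__ == "__main__":
    # Target t = 1: the penalty grows sharply as y moves towards 1 - t = 0.
    for y in (0.99, 0.9, 0.5, 0.1, 0.01):
        print(y, cross_entropy(1, y))
```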

3.4 Experimental results

The experimental results of the proposed targeted style transfer method are presented in this section, in two stages. In the first stage, the training and test results are shown; in the second stage, the models of the targeted style transfer method are compared and the best one is chosen.

Figure 4.

The sample real images, fake images, reconstructed images and identity images from various experiments.

Figure 5.

The sample result of the proposed method with different loss functions.

3.4.1 Training and test results

For performance evaluation, a separate classifier, TensorFlow for Poets, was trained over an extensive test set and achieved an accuracy of 97%. This classifier is applied to the trained models for each loss combination: it classifies the fake images they produce and provides the relative accuracy and cross entropy for each model. All the loss function combinations were trained on sets of 500 images for each of domains A and B. Each model was trained for 100 epochs and then tested.

During training, the models are inspected at intermediate checkpoints to see which model performs better in the fewest iterations, which can be visualized. Each model is then tested on 973 images. For domains A and B, the horse and zebra datasets were taken from the implementation of [7].

These implementations can be related to the system architecture in terms of the images that are produced. Real_A, Fake_B, Rec_A, Real_B, Fake_A, Rec_B, Idt_A and Idt_B are the images involved in the system. Real_A and Real_B are the original images in domains A and B. Fake_A and Fake_B are the images translated into the opposite domains. The fake images are then passed to the Discriminator for classification. Rec_A and Rec_B are the images reconstructed from the fake images back into their original domains, for checking the cycle consistency loss. Idt_A and Idt_B are the Identity images, i.e. images from the same domain, passed to the discriminator for checking the Identity loss. Sample real (input) images, generated (fake) images, reconstructed images and identity images from various experiments are shown in Fig. 4.

Figure 5 displays sample results of the proposed method with the various combinations of regression and classification losses. From Fig. 5, it is observed that the Binary Cross Entropy loss is the best among the classification losses and Huber loss is the best among the regression losses.

3.4.2 Comparison of different loss function results

To evaluate the performance of the proposed method quantitatively, accuracy and cross entropy measures are used. The performance of the proposed method with different loss functions, measured by accuracy, is compared in Table 1. The Binary Cross Entropy loss and Huber loss combination provides better accuracy than the other loss function combinations. Similarly, the comparison using cross entropy values is shown in Table 2. From Table 2, it can be noted that the cross entropy of the Binary Cross Entropy loss and Huber loss combination is lower than that of the other combinations. This indicates that the Binary Cross Entropy loss and Huber loss combination is superior to the other loss function combinations.

Table 1
Comparison of proposed method with the different loss functions using accuracy values

S.no.	Classification loss	Regression loss	Accuracy
1.	BCE	HUBER	0.886
2.	BCE	L1	0.780
3.	BCE	MSE	0.859
4.	HINGE	HUBER	0.010
5.	HINGE	L1	0.012
6.	HINGE	MSE	0.009
7.	LOG	HUBER	0.012
8.	LOG	L1	0.017
9.	LOG	MSE	0.010
10.	MSE	HUBER	0.576
11.	MSE	L1	0.584
12.	MSE	MSE	0.541
13.	SQUARE	HUBER	0.012
14.	SQUARE	L1	0.012
15.	SQUARE	MSE	0.012

Table 2

Comparison of proposed method with the different loss functions using cross entropy values

S.no.	Classification loss	Regression loss	Cross-entropy
1.	BCE	HUBER	0.349
2.	BCE	L1	0.704
3.	BCE	MSE	0.462
4.	HINGE	HUBER	5.270
5.	HINGE	L1	5.342
6.	HINGE	MSE	5.368
7.	LOG	HUBER	5.248
8.	LOG	L1	5.307
9.	LOG	MSE	5.240
10.	MSE	HUBER	1.429
11.	MSE	L1	1.465
12.	MSE	MSE	1.525
13.	SQUARE	HUBER	5.198
14.	SQUARE	L1	5.214
15.	SQUARE	MSE	5.234

4. Conclusion

In this paper, a new method for targeted style transfer using Cycle-GANs with different loss functions is proposed. The results can be seen in Fig. 5 and Tables 1 and 2. A higher accuracy means that the model has performed better; the accuracies are in the range of 0 to 1. Lower cross entropy values are also better for the model. The ideal value is 0, and the cross entropy value increases as the calculated probability moves further from the expected probability. From the results, it is observed that the Binary Cross Entropy loss is the best among the classification losses and Huber loss is the best among the regression losses; together they achieve an accuracy of 0.886 and a cross entropy value of 0.349.

In future work, we will endeavor to make this system even more efficient. To this end, we are currently working on replacing the CNNs used, ResNets (state of the art), with more recent deep learning architectures under active research, such as Capsule networks. Ideas for generating better activation functions are also being assessed for future work.

References

1. Ghosh, Kumar and Sastry P.S., Robust loss functions under label noise for deep neural networks, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, 2017, pp. 1919–1925.

2. Castillo, Han, Singh, Yadav A.K. and Goldstein, Son of Zorn’s lemma: Targeted style transfer using instance-aware semantic segmentation, in: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Louisiana, 2017, pp. 1348–1352.

3. Kingma and Ba, Adam, a method for stochastic optimization, in: Proceedings of the 3rd International Conference for Learning Representations, San Diego, 2015, pp. 1–15.

4. D.N., Nguyen G.N., Bhateja and Satapathy S.C., Optimizing feature selection in video-based recognition using Max-Min Ant System for the online video contextual advertisement user-oriented system, Journal of Computer Science 21(1) (2017), 361–370.

5. Csurka, Larlus, Perronnin and Meylan, What is a good evaluation measure for semantic segmentation? The British Machine Vision Association and Society for Pattern Recognition 27(1) (2013), 1–11.

6. Johnson, Alahi and Fei-Fei, Perceptual losses for real-time style transfer and super-resolution, in: Proceedings of European Conference on Computer Vision, Netherlands, 2016, pp. 694–711.

7. Zhu J.Y., Park, Isola and Efros, Unpaired image-to-image translation using cycle-consistent adversarial networks, in: Proceedings of the IEEE International Conference on Computer Vision, Italy, 2017, pp. 2242–2251.

8. Gatys L.A., Ecker A.S. and Bethge, A neural algorithm of artistic style, arXiv preprint arXiv:1508.06576, 2015.

9. Berrada, Zisserman and Kumar M.P., Smooth loss functions for deep top-k classification, in: Proceedings of the Sixth International Conference on Learning Representations, Canada, 2018, pp. 1–25.

10. Sokolova and Lapalme, A systematic analysis of performance measures for classification tasks, Information Processing and Management 45(4) (2009), 427–437.

11. Soundrapandiyan and Mouli P.C., Adaptive pedestrian detection in infrared images using fuzzy enhancement and top-hat transform, International Journal of Computational Vision and Robotics 7(1/2) (2017), 49–67.

12. Soundrapandiyan and Mouli P.C., Adaptive pedestrian detection in infrared images using background subtraction and local thresholding, Procedia Computer Science 58(1) (2015), 706–713.

13. Satapathy S.C., El-Maleh and Bhateja, Intelligent computing in multidisciplinary engineering applications, Arabian Journal of Science and Engineering 43(8) (2018), 3861–3862.

14. Sabour, Frosst and Hinton G.E., Dynamic routing between capsules, in: Proceedings of the Advances in Neural Information Processing Systems, California, 2017, pp. 3859–3869.

15. LeCun, Chopra, Hadsell, Ranzato and Huang, A tutorial on energy-based learning, Predicting Structured Data 1 (2006), 1–59.

16. http://image-net.org/download-images.