Auto-encoder generative adversarial networks

Abstract

Generative Adversarial Networks (GANs) have demonstrated potential on a variety of generative tasks, although they are regarded as unstable and sometimes miss modes. We propose Auto-encoder Generative Adversarial Networks (AEGAN), a convolutional neural network that combines auto-encoders with Generative Adversarial Networks. The former brings more information to the GAN, which reduces the missing-modes problem, and the latter makes the generated pictures more coherent because it can better handle multiple modes in the output. We also show that image composition is possible with AEGAN, so that it can be used for many feature-based tasks. In addition, we can generate different samples by adding random noise to a feature vector.

Keywords

Generative adversarial networks, auto-encoder, joint loss, image composition

1 Introduction

Generative adversarial networks (GANs) have achieved strong results in semi-supervised as well as unsupervised learning and show good application prospects. Christian Ledig et al. (2016) presented the SRGAN model, the first framework capable of inferring photo-realistic natural images for 4× upscaling factors [1]. Ashish Shrivastava et al. (2016) developed a method for Simulated+Unsupervised (S+U) learning that uses an adversarial network, where the task is to learn a model that improves the realism of a simulator’s output using unlabeled real data while preserving the annotation information from the simulator [2]. Scott Reed et al. (2016) developed a novel deep architecture and GAN formulation to effectively bridge advances in text and image modeling, translating visual concepts from characters to pixels [3]. Phillip Isola et al. (2016) investigated conditional adversarial networks as a general-purpose solution to image-to-image translation problems [4]. Pauline Luc et al. (2016) proposed an adversarial training approach for semantic segmentation models [7]. Deepak Pathak et al. (2016) presented an unsupervised visual feature learning algorithm driven by context-based pixel prediction [8].

While the benefits of GANs are clear, there are also obvious disadvantages. First, GAN training is unstable and collapses easily. Optimization algorithms often approach a saddle point or a local minimum rather than the global minimum, and game-solving algorithms sometimes do not approach an equilibrium at all. Second, missing modes are a common problem when training GANs. When modes are missed, the generated pictures are restricted to a few modes and lose the richness of the training set.

Many recent papers address these two problems and achieve some good results. Salimans et al. (2016) add minibatch discrimination, which classifies each example by comparing it to the other members of the minibatch; these nearest-neighbor style features let the discriminator detect a minibatch whose samples are too similar to each other [9]. In principle, minibatch discrimination can help prevent mode collapse. Metz et al. (2016) proposed unrolled GANs, which back-propagate through k updates of the discriminator to prevent mode collapse [22]. That is to say, the generator is trained against the discriminator obtained after k further update steps, instead of against the current D net only. This works well too.

The structure of GAN determines that the generative net generates a picture from a random vector. It actually maps a random distribution, such as Gaussian, into a multi-dimensional space. The random vector is, in a way, the feature of the generated picture. However, as a result of the randomness, we cannot make use of the features effectively. When we need some features such as wearing glasses or smiling, we can only search in large amounts of pictures to look for corresponding vectors.

Considering that Convolutional Neural Networks [15, 16] are good at extracting features, we add a CNN before the traditional generative network. The structure of the new generative net is like an auto-encoder [13, 14]. Unlike the original G net, which maps a noise vector z into the sample space directly, the new G net first encodes the training data into a vector z and then decodes z (Fig. 1). As a result, the training process is more stable and the missing-modes problem is reduced because of the extra information.

Fig. 1

The Auto-encoder generative adversarial network. After training, we use the forward model f to map pictures to vectors and g to map a composited vector back to a picture.

2 Related works

2.1 GAN

Supervised learning relies on large amounts of data, but sometimes data are lacking. In this situation, we can use a generative model to turn the supervised problem into a semi-supervised one. Ian Goodfellow et al. [10] proposed Generative Adversarial Nets in 2014. Yann LeCun called it “the most interesting idea in the last 10 years in ML” in a Quora post [11]. A GAN trains two neural nets simultaneously. The discriminator D takes an input (e.g. an image) and outputs a scalar that indicates the probability that the input comes from the training set. The generator G produces images from a vector randomly sampled from a simple distribution (e.g. Gaussian). The G net trains itself to produce images that fool D, while D trains itself to better discriminate real data from generated data. That is why GAN is called “adversarial”. In other words, D and G play the following two-player minimax game with value function V (D, G): $\begin{matrix} \min_{G} \max_{D} V (D, G) = E_{x \sim p_{data} (x)} [\log D (x)] \\ + E_{z \sim p_{z} (z)} [\log (1 - D (G (z)))] \end{matrix}$ (1)

2.2 DCGAN

Radford et al. proposed DCGAN (Deep Convolutional Generative Adversarial Network) in 2015 [5]. Their G net is essentially a reversed convolutional neural network. In an ordinary CNN, convolutional filters extract meaningful features and decrease the picture size. In contrast, the G net in DCGAN expands features drawn from a random distribution and generates a new picture.

They also proposed many tricks, such as removing the pooling layers of a traditional CNN to make G fully differentiable. These tricks make the training of GAN more stable and manageable. In addition, the DCGAN paper shows that interesting results can be obtained from vector operations between features [5].

However, z is randomly distributed, so we cannot predict what picture will be generated from a given vector z. Furthermore, we consider vector space arithmetic meaningless if the generated picture is totally unpredictable.

We now introduce AEGAN: a convolutional neural network combining auto-encoders with Generative Adversarial Networks. We first give an overview of the general architecture, and then we provide details on the learning procedure.

2.3 Different distances

Arjovsky et al. [24] list elementary distances and divergences between two distributions P_r, P_g ∈ Prob(χ):

The Total Variation (TV) distance $δ (P_{r}, P_{g}) = \sup_{A \in Σ} | P_{r} (A) - P_{g} (A) |$ (2)

where Σ denotes the set of Borel subsets of χ.

The Kullback-Leibler (KL) divergence $KL (P_{r} ∥ P_{g}) = \int \log (\frac{P_{r} (x)}{P_{g} (x)}) P_{r} (x) d μ (x)$ (3)

The Jensen-Shannon (JS) divergence $JS (P_{r} ∥ P_{g}) = KL (P_{r} ∥ P_{m}) + KL (P_{g} ∥ P_{m})$ (4)

where $P_{m} = (P_{r} + P_{g}) / 2$ is the mixture distribution.

The Earth-Mover (EM) distance or Wasserstein-1 $W (P_{r}, P_{g}) = \inf_{γ \in Π (P_{r}, P_{g})} E_{(x, y) \sim γ} [∥ x - y ∥]$ (5)

where $Π (P_{r}, P_{g})$ is the set of joint distributions whose marginals are $P_{r}$ and $P_{g}$.
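To make these definitions concrete, the following sketch evaluates the four quantities for distributions on a finite, unit-spaced one-dimensional support. The setting, the function names and the example distributions are illustrative assumptions, not part of the original experiments; the JS divergence follows Eq. (4) as written, with P_m taken to be the mixture (P_r + P_g)/2 as in [24]. The example suggests why the EM distance is attractive: when the supports are disjoint, TV, KL and JS take the same value however far apart the distributions are, while the EM distance still grows with the separation.

```python
import math

def total_variation(p, q):
    # On a finite set the supremum over events equals half the L1 distance.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def kl(p, q):
    # KL(p || q); terms with p_i = 0 contribute 0, and the result is
    # infinite as soon as q_i = 0 while p_i > 0.
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total += pi * math.log(pi / qi)
    return total

def js(p, q):
    # Eq. (4) as written (no factors of 1/2), with P_m = (p + q) / 2.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return kl(p, m) + kl(q, m)

def wasserstein1(p, q):
    # For a 1-D support at positions 0, 1, ..., n-1 the EM distance is the
    # sum of absolute differences between the two cumulative distributions.
    w, cp, cq = 0.0, 0.0, 0.0
    for pi, qi in zip(p, q):
        cp += pi
        cq += qi
        w += abs(cp - cq)
    return w

# Point masses (hypothetical example): p_r at position 0,
# p_near at position 1, p_far at position 3.
p_r = [1.0, 0.0, 0.0, 0.0]
p_near = [0.0, 1.0, 0.0, 0.0]
p_far = [0.0, 0.0, 0.0, 1.0]

for name, p_g in (("near", p_near), ("far", p_far)):
    print(name, total_variation(p_r, p_g), kl(p_r, p_g),
          js(p_r, p_g), wasserstein1(p_r, p_g))
```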

3 Approach and model architecture

People are familiar with the generation of random numbers. In essence, a GAN can be seen as a large random number generator. Consider the simplest such algorithm, the linear congruential generator $x_{n + 1} = a x_{n} + b \pmod{m}$ (6)

where $a, b, m \in ℤ$. The sequence {x_n/m} forms uniformly distributed pseudo-random numbers on the unit interval.
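A minimal sketch of the recurrence in Eq. (6) is given below. The constants a, b and m are a commonly used illustrative choice (the ones popularized by Numerical Recipes) and are an assumption here, since the text does not fix them.

```python
from itertools import islice

def lcg(seed, a=1664525, b=1013904223, m=2**32):
    """Yield x_n / m for the recurrence x_{n+1} = (a * x_n + b) mod m."""
    x = seed % m
    while True:
        x = (a * x + b) % m
        yield x / m  # rescaled to the unit interval [0, 1)

# Draw a few pseudo-random numbers from a fixed seed.
uniform_samples = list(islice(lcg(seed=0), 10000))
print(uniform_samples[:3])
```

The same seed always reproduces the same sequence, which is what makes the generator pseudo-random rather than random.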

Next we generate random sample points from the two-dimensional Gaussian distribution. We first generate $u_{1}, u_{2} \sim Unif (0, 1)$ (7)

Then we define the mapping $T : (u_{1}, u_{2}) \mapsto (\sqrt{- 2 \log u_{1}}, 2 π u_{2})$ (8)

The image $(r, θ) = T (u_{1}, u_{2})$ gives the polar coordinates of the sample, so that $(r \cos θ, r \sin θ)$ follows a standard two-dimensional Gaussian (this is the Box-Muller transform). That means we can transform the uniform distribution into a Gaussian by a mapping. If we view a probability distribution as a mass density, the mapping changes areas and, accordingly, densities.
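The mapping of Eq. (8) can be sketched as follows, reading its two outputs as a radius and an angle and converting them to Cartesian coordinates (the Box-Muller transform). Python's built-in generator stands in for the uniform source here, which is an assumption made for brevity.

```python
import math
import random

def box_muller(u1, u2):
    """Map two uniform numbers (u1 in (0, 1], u2 in [0, 1)) to a 2-D Gaussian sample."""
    r = math.sqrt(-2.0 * math.log(u1))   # radius, first component of T
    theta = 2.0 * math.pi * u2           # angle, second component of T
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)
# 1 - random() lies in (0, 1], which avoids log(0).
samples = [box_muller(1.0 - rng.random(), rng.random()) for _ in range(20000)]
print(samples[:2])
```

Both coordinates of each sample should be approximately standard normal, so the empirical mean is close to 0 and the empirical variance close to 1.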

In image generation, a GAN essentially transforms a fixed probability distribution, such as a Gaussian, into the probability distribution of the training data, such as the distribution of face images. The ideal mathematical model of GAN is as follows: we form the space of all n × n images, denoted as the image space χ, and view each image as a point x in χ. We denote the probability that x is a real face picture by υ (x), so υ is the target probability measure for learning. In engineering practice, we only have some samples of face images {y_1, y_2, ..., y_n}, which constitute an empirical approximation of υ. The empirical distribution is $υ = \frac{1}{n} \sum_{i = 1}^{n} δ (x - y_{i})$ (9)

Most images are not face images, so the support set of υ, $Σ (υ) := \{x \in χ \mid υ (x) > 0\}$ (10), is a submanifold of the image space. The dimension of Σ is much smaller than the dimension of the image space χ. The parameter space of the support manifold is its feature space, or latent space Z. Therefore, the encoding map takes Σ to the latent space and the decoding map takes the latent space back to the support manifold Σ.

Suppose there is a fixed probability distribution ζ ∈ P (Z) on the latent space, such as a Gaussian. We can use a deep neural network with parameters θ to approximate the decoding map $g_{θ} : Z \to χ$ (11)

$g_{θ}$ pushes ζ forward to a probability distribution on the image space, $μ_{θ} = (g_{θ})_{\#} ζ$ (12)

We call $μ_{θ}$ the generated distribution.

3.1 Generator G

Unlike in GAN, the G net in AEGAN is an auto-encoder: it encodes before decoding. Instead of G(z) as in GAN, the output of AEGAN is G(x) = g(f(x)), where x is drawn from the distribution of the training set. Compared to mapping a random distribution (such as a Gaussian) into a high dimensional space directly, the additional encoder has two advantages. First, the generative process now has more information, since it forms a map from a high dimensional space into a high dimensional space, which reduces mode collapse. Second, if AEGAN works well, we can extract the feature z from an arbitrary picture, operate on z as required, and finally decode it. Owing to the good vector arithmetic properties of GAN, the final generated picture will lie in the distribution of the training set.

1) Encoder f: Since the encoder mainly extracts features from the input data, we use a CNN because of its good performance on feature extraction. Our encoder is derived from AlexNet [12]. Given a picture of size 64×64, we use convolutional layers and a fully connected layer to extract a 100-dimensional feature. It is worth noting that we use leaky ReLU [19, 20] instead of ReLU and remove the max pooling layers to avoid sparse gradients, and we add batch normalization layers [17] to make the data distribution more concentrated and to speed up convergence.

2) Decoder g: The decoder generates pictures from the extracted features. A fully connected layer is followed by a series of fractionally-strided convolutions [5], each with a ReLU [18] activation function, converting the high level representation into a 64×64 pixel image. We train the decoder to be a good generator: on the one hand, the generated picture retains the features of the input image, and on the other hand, the decoder maps a vector into the distribution of the training set.
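The text does not list the layer hyper-parameters, so the following sketch only illustrates how strided and fractionally-strided convolutions can move between a 64×64 image and a small feature map without pooling. The kernel size 4, stride 2, padding 1 and the count of four layers are assumptions in the style of DCGAN [5], not the configuration actually used.

```python
def conv_out(n, kernel=4, stride=2, padding=1):
    """Spatial output size of a strided convolution (assumed hyper-parameters)."""
    return (n + 2 * padding - kernel) // stride + 1

def deconv_out(n, kernel=4, stride=2, padding=1):
    """Spatial output size of a fractionally-strided convolution, no output padding."""
    return (n - 1) * stride - 2 * padding + kernel

def trace(size, step, layers):
    """Apply the same layer `layers` times and record every intermediate size."""
    sizes = [size]
    for _ in range(layers):
        sizes.append(step(sizes[-1]))
    return sizes

encoder_sizes = trace(64, conv_out, 4)   # encoder side: image to small feature map
decoder_sizes = trace(4, deconv_out, 4)  # decoder side: small feature map to image
print(encoder_sizes, decoder_sizes)
```

Under these assumptions each encoder layer halves the resolution and each decoder layer doubles it, so the decoder mirrors the encoder; the fully connected layers to and from the 100-dimensional feature would sit between the two.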

3.2 Discriminator D

The discriminator is a conventional convolutional neural network that distinguishes generated data from training data. When training is finished, the distribution of the generated data is supposed to be close to that of the training data, so that even the trained D net can hardly tell them apart; in the ideal case the optimal discriminator of Eq. (14) outputs 1/2 everywhere.

We update the discriminator by ascending its stochastic gradient: $\nabla_{θ_{d}} \frac{1}{m} \sum_{i = 1}^{m} [\log D (x^{(i)}) + \log (1 - D (G (z^{(i)})))]$ (13)

For G fixed, the optimal discriminator D is $D_{G}^{*} (x) = \frac{p_{data} (x)}{p_{data} (x) + p_{g} (x)}$ (14)
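As a small numerical check of Eq. (14), the sketch below evaluates the value function on a finite sample space, where a discriminator is just a list of probabilities. The two three-point distributions are hypothetical and chosen only for illustration.

```python
import math

def value(d, p_data, p_g):
    """V(G, D) on a finite sample space; d[i] is D at point i and must lie in (0, 1)."""
    return sum(pd * math.log(di) + pg * math.log(1.0 - di)
               for di, pd, pg in zip(d, p_data, p_g))

def optimal_d(p_data, p_g):
    """Eq. (14), evaluated pointwise."""
    return [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]

p_data = [0.5, 0.3, 0.2]   # hypothetical data distribution
p_g = [0.2, 0.3, 0.5]      # hypothetical generator distribution

d_star = optimal_d(p_data, p_g)
best = value(d_star, p_data, p_g)

# Shifted discriminators, clipped to stay inside (0, 1), for comparison with D*.
others = [
    value([min(max(di + eps, 1e-6), 1.0 - 1e-6) for di in d_star], p_data, p_g)
    for eps in (-0.1, -0.01, 0.01, 0.1)
]

# When the generator matches the data, D* is 1/2 everywhere.
balanced = value(optimal_d(p_data, p_data), p_data, p_data)
print(best, others, balanced)
```

Every shifted discriminator scores below `best`, and `balanced` equals −log 4, the value reached when the two distributions coincide.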

3.3 Loss function

Energy-based GANs (EBGANs) [25] have a discriminator D that is trained to minimize

$\begin{array}{l} L_{D} (D, g_{θ}) = E_{x \sim P_{r}} [D (x)] \\ + E_{z \sim p (z)} [{[m - D (g_{θ} (z))]}^{+}] \end{array}$ (15)

for some m > 0 and $[x]^{+} = max (0, x)$ (16)

and a generator trained to minimize $L_{G} (D, g_{θ}) = E_{z \sim p (z)} [D (g_{θ} (z))] - E_{x \sim P_{r}} [D (x)]$ (17)
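Eqs. (15) to (17) can be sketched on a batch of discriminator outputs as follows. Expectations are replaced by batch means, and the energies and the margin m = 1 are made-up values for illustration.

```python
def mean(values):
    return sum(values) / len(values)

def ebgan_losses(d_real, d_fake, m=1.0):
    """Batch estimates of L_D (Eq. 15) and L_G (Eq. 17).

    d_real: D(x) for real samples; d_fake: D(g(z)) for generated samples.
    """
    hinge = [max(0.0, m - e) for e in d_fake]   # [m - D(g(z))]^+ as in Eq. (16)
    loss_d = mean(d_real) + mean(hinge)
    loss_g = mean(d_fake) - mean(d_real)
    return loss_d, loss_g

# Hypothetical energies: low for real samples, higher for generated ones.
loss_d, loss_g = ebgan_losses(d_real=[0.1, 0.3], d_fake=[0.5, 2.0], m=1.0)
print(loss_d, loss_g)
```

The hinge means that generated samples whose energy already exceeds the margin m no longer contribute to the discriminator loss.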

In AEGAN, the G net is similar to an auto-encoder, and its input and output have the same size. Therefore we use a pixel-wise L2 loss to train the G net: $L_{2} = ∥ x - G (x) ∥_{2}^{2}$ (18)

On the other hand, the objective for the discriminator is the logistic likelihood of whether the input is a real sample or a generated one:

$\begin{matrix} L_{ad} = \min_{g} \max_{d} \{ E_{x \in χ} [\log (D (x))] \\ + E_{x \in χ} [\log (1 - D (G (x)))] \} \end{matrix}$ (19)

With the optimal discriminator $D_{G}^{*}$, the inner maximization can be written as:

$\begin{array}{l} \max_{D} V (G, D) \\ = E_{x \sim p_{data}} [\log D_{G}^{*} (x)] + E_{z \sim p_{z}} [\log (1 - D_{G}^{*} (G (z)))] \\ = E_{x \sim p_{data}} [\log D_{G}^{*} (x)] + E_{x \sim p_{g}} [\log (1 - D_{G}^{*} (x))] \\ = E_{x \sim p_{data}} [\log \frac{p_{data} (x)}{p_{data} (x) + p_{g} (x)}] + E_{x \sim p_{g}} [\log \frac{p_{g} (x)}{p_{data} (x) + p_{g} (x)}] \end{array}$ (20)
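As a worked continuation of Eq. (20), write $P_{m} = (p_{data} + p_{g}) / 2$ for the mixture distribution (a symbol assumed here). The value at the optimal discriminator can then be related to the JS divergence in the form of Eq. (4), that is, without factors of 1/2:

$\begin{array}{l} \max_{D} V (G, D) \\ = E_{x \sim p_{data}} [\log \frac{p_{data} (x)}{2 P_{m} (x)}] + E_{x \sim p_{g}} [\log \frac{p_{g} (x)}{2 P_{m} (x)}] \\ = - \log 4 + KL (p_{data} ∥ P_{m}) + KL (p_{g} ∥ P_{m}) \\ = - \log 4 + JS (p_{data} ∥ p_{g}) \end{array}$

Since the JS divergence is non-negative and vanishes only when the two distributions coincide, this expression attains its smallest value, $- \log 4$, when $p_{g} = p_{data}$, which is the standard result of [10].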

We adopt this framework because it has shown encouraging results in [5]. In practice, D and G are optimized jointly using alternating SGD.

We define the overall loss function as $L = λ_{L_{2}} L_{2} + λ_{ad} L_{ad}$ (21)

We use $λ_{L_{2}} = 0.999$ and $λ_{ad} = 0.001$.
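The weighted sum in Eq. (21) can be sketched for a single sample as below. Two points are assumptions: the L2 term is summed over pixels without further normalization, and only the generator-dependent part of the adversarial objective, log(1 − D(G(x))), is used as the adversarial term seen by the G net.

```python
import math

def l2_loss(x, gx):
    """Squared L2 distance between an input x and its reconstruction G(x), Eq. (18)."""
    return sum((a - b) ** 2 for a, b in zip(x, gx))

def adversarial_term(d_gx):
    """Generator-side part of Eq. (19): log(1 - D(G(x))), with D(G(x)) in (0, 1)."""
    return math.log(1.0 - d_gx)

def joint_loss(x, gx, d_gx, lam_l2=0.999, lam_ad=0.001):
    """Eq. (21) with the weights stated in the text."""
    return lam_l2 * l2_loss(x, gx) + lam_ad * adversarial_term(d_gx)

# Hypothetical two-pixel "image", its reconstruction and a discriminator score.
x = [0.0, 1.0]
gx = [0.5, 1.0]
print(joint_loss(x, gx, d_gx=0.5))
```

With these weights the reconstruction term dominates, and the adversarial term acts as a small correction.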

4 Details of training

We trained AEGAN on CelebA [23] and pre-processed the dataset with two operations, face detection and face alignment. We used the Adam solver [21] for stochastic optimization, with the default solver hyper-parameters suggested in [5]. As suggested in [9], we use one-sided label smoothing: we replace the real label with 0.9 and keep the fake label at 0, to make the gradient signal of the discriminator more reasonable.
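One-sided label smoothing can be sketched with a plain binary cross-entropy, as below. The function names and the example scores are hypothetical; only the targets 0.9 and 0 come from the text.

```python
import math

def bce(pred, target):
    """Binary cross-entropy for one prediction; pred must lie strictly in (0, 1)."""
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def discriminator_loss(d_real, d_fake, real_label=0.9, fake_label=0.0):
    """Mean BCE with a smoothed target for real samples only."""
    real = sum(bce(p, real_label) for p in d_real) / len(d_real)
    fake = sum(bce(p, fake_label) for p in d_fake) / len(d_fake)
    return real + fake

# With target 0.9, an over-confident score on a real sample is penalized
# more than a score of 0.9 itself.
print(bce(0.9, 0.9), bce(0.99, 0.9))
print(discriminator_loss(d_real=[0.8, 0.95], d_fake=[0.1, 0.2]))
```

Because the fake target stays at 0, the loss on generated samples is unchanged; only over-confidence on real samples is discouraged.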

5 Evaluation

The pictures generated by GAN can be composited through operations on feature vectors. AEGAN is more powerful because it can composite two or more given pictures, not just random ones.

We compare AEGAN to a plain auto-encoder ($λ_{L_{2}} = 1$, no adversarial loss). Pictures generated by the latter are blurry and its composited pictures are not sharp (Fig. 2(b)(c)). We also compare feature vectors of different dimension (100 and 4000). The former loses some characteristics when reconstructing the picture, but its composition is much more coherent (Fig. 2(a)(b)). As a result, we think that using the joint loss alleviates the weaknesses of both auto-encoders and adversarial networks.

Fig. 2

(a) AEGAN with a 4000-dimensional feature vector. (b) AEGAN with a 100-dimensional feature vector. (c) Auto-encoder. The pictures in the third column are composited from those in the first and second columns. The pictures in the sixth column are composited from those in the fourth and fifth columns.

Fig. 3

(a) A “turn” vector was created from five samples of faces. By adding interpolations to their feature vectors we were able to reliably transform their faces. (b) Normal noise sampled with scale ±0.3 was added to z to produce the other 17 samples.
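The feature-vector operations behind Figs. 2 and 3 can be sketched on plain lists as follows. The exact operations are not specified in the text, so everything here is an assumption made for illustration: composition is taken to be a convex combination of two feature vectors, the attribute direction is added with a scalar step, and the noise is Gaussian with standard deviation 0.3. In practice the vectors would come from the encoder f and the results would be passed to the decoder g.

```python
import random

def compose(z_a, z_b, alpha=0.5):
    """Blend two feature vectors (assumed: convex combination)."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(z_a, z_b)]

def add_direction(z, direction, t):
    """Move a feature vector by t steps along an attribute direction such as 'turn'."""
    return [zi + t * di for zi, di in zip(z, direction)]

def perturb(z, scale=0.3, rng=None):
    """Add Gaussian noise to a feature vector (assumed: standard deviation = scale)."""
    rng = rng or random.Random()
    return [zi + rng.gauss(0.0, scale) for zi in z]

# Hypothetical 4-dimensional features; the paper uses 100 or 4000 dimensions.
z_a = [0.0, 2.0, 1.0, -1.0]
z_b = [2.0, 0.0, 1.0, 1.0]
turn = [1.0, 0.0, 0.0, 0.0]

z_mix = compose(z_a, z_b)
z_path = [add_direction(z_mix, turn, t) for t in (-1.0, 0.0, 1.0)]
z_variants = [perturb(z_mix, rng=random.Random(seed)) for seed in range(17)]
print(z_mix, len(z_path), len(z_variants))
```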

The features extracted by AEGAN can be used to predict the appearance of a person at different ages, or to predict how a child’s face might look from the parents’ photos. We could even recover a criminal’s face by compositing witnesses’ clues and security video.

6 Future work

The results shown in this paper are extremely preliminary, but they show the potential of combining generative adversarial networks and auto-encoders and show promise for useful applications.

A problem with the L2 distance is that it constrains the generated picture to be the same as the input. That is why some composited pictures look unnatural. We would prefer pictures that keep the characteristics of the input picture while following the general distribution of the training set. Therefore, one of the major tasks is to find other ways to guide the G net in obtaining features, so that the weight of the L2 loss in the joint loss can be reduced and compositing becomes more robust.

7 Conclusion

AEGAN combines GAN and auto-encoder. The CNN in the G net plays an important role in extracting features, and the joint loss composed of the L2 loss and the discriminator loss controls the decoding process, making generated pictures both sharp and coherent. Unlike GAN, AEGAN is able to use the features of given pictures to composite new pictures, which enables some useful applications.

References

1. Ledig, Theis, Huszár et al., Photo-realistic single image super-resolution using a generative adversarial network, arXiv preprint arXiv:1609.04802, 2016, pp. 52–59.

2. Shrivastava, Pfister, Tuzel et al., Learning from simulated and unsupervised images through adversarial training, arXiv preprint arXiv:1612.07828, 2016.

3. Reed, Akata, Yan et al., Generative adversarial text to image synthesis, arXiv preprint arXiv:1605.05396, 2016.

4. Isola, J.Y. Zhu, Zhou et al., Image-to-image translation with conditional adversarial networks, arXiv preprint arXiv:1611.07004, 2016.

5. Radford, Metz and Chintala, Unsupervised representation learning with deep convolutional generative adversarial networks, arXiv preprint arXiv:1511.06434, 2015.

6. Mathieu, Couprie and LeCun, Deep multi-scale video prediction beyond mean square error, arXiv preprint arXiv:1511.05440, 2015.

7. Luc, Couprie, Chintala et al., Semantic segmentation using adversarial networks, arXiv preprint arXiv:1611.08408, 2016.

8. Pathak, Krahenbuhl, Donahue et al., Context encoders: Feature learning by inpainting, arXiv preprint arXiv:1604.07379, 2016.

9. Salimans, Goodfellow, Zaremba et al., Improved techniques for training GANs, Advances in Neural Information Processing Systems, 2016, pp. 2226–2234.

10. Goodfellow, Pouget-Abadie, Mirza et al., Generative adversarial nets, Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.

11. LeCun, What are some recent and potentially upcoming breakthroughs in deep learning?, Quora, 2016.

12. Krizhevsky, Sutskever and G.E. Hinton, ImageNet classification with deep convolutional neural networks, in NIPS, 2012.

13. Bengio, Learning deep architectures for AI, Foundations and Trends in Machine Learning, 2009.

14. G.E. Hinton and R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 2006.

15. Fukushima, Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position, Biological Cybernetics, 1980.

16. LeCun, Boser, J.S. Denker, Henderson, R.E. Howard, Hubbard and L.D. Jackel, Backpropagation applied to handwritten zip code recognition, Neural Computation, 1989.

17. Ioffe and Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, arXiv preprint arXiv:1502.03167, 2015.

18. Nair and G.E. Hinton, Rectified linear units improve restricted Boltzmann machines, in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.

19. A.L. Maas, A.Y. Hannun and A.Y. Ng, Rectifier nonlinearities improve neural network acoustic models, in Proc. ICML 30, 2013.

20. Xu, Wang, Chen and Li, Empirical evaluation of rectified activations in convolutional network, arXiv preprint arXiv:1505.00853, 2015.

21. Kingma and Ba, Adam: A method for stochastic optimization, ICLR, 2015.

22. Metz, Poole, Pfau et al., Unrolled generative adversarial networks, arXiv preprint arXiv:1611.02163, 2016.

23. Liu, Luo, Wang et al., Deep learning face attributes in the wild, Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3730–3738.

24. Arjovsky, Chintala and Bottou, Wasserstein GAN, arXiv preprint arXiv:1701.07875, 2017.

25. Zhao, Mathieu and LeCun, Energy-based generative adversarial network, CoRR, abs/1609.03126, 2016.