Abstract
Generative Adversarial Networks (GANs) have demonstrated potential on a variety of generative tasks, although they are regarded as unstable and sometimes miss modes. We propose Auto-encoder Generative Adversarial Networks (AEGAN), a convolutional neural network that combines auto-encoders with Generative Adversarial Networks. The auto-encoder supplies more information to the GAN and so reduces the missing-modes problem, while the GAN makes the generated picture more coherent because it can better handle multiple modes in the output. We also show that AEGAN supports image composition, so it can be used for many feature-based tasks. In addition, we can generate different samples by adding random noise to a feature vector.
Introduction
Generative adversarial networks (GANs) have achieved strong results in semi-supervised as well as unsupervised learning and show promising applications. Christian Ledig et al. (2016) presented SRGAN, the first framework capable of inferring photo-realistic natural images for 4× upscaling factors [1]. Ashish Shrivastava et al. (2016) developed a method for Simulated+Unsupervised (S+U) learning that uses an adversarial network, where the task is to learn a model that improves the realism of a simulator’s output using unlabeled real data while preserving the annotation information from the simulator [2]. Scott Reed et al. (2016) developed a novel deep architecture and GAN formulation to bridge advances in text and image modeling, translating visual concepts from characters to pixels [3]. Phillip Isola et al. (2016) investigated conditional adversarial networks as a general-purpose solution to image-to-image translation problems [4]. Pauline Luc et al. (2016) proposed an adversarial training approach for semantic segmentation models [7]. Deepak Pathak et al. (2016) presented an unsupervised visual feature learning algorithm driven by context-based pixel prediction [8].
While the benefits of GANs are clear, they also have obvious disadvantages. First, GAN training is unstable and collapses easily. Optimization algorithms often approach a saddle point or a local minimum rather than a global minimum, and game-solving algorithms sometimes do not approach an equilibrium at all. Second, missing modes are a common problem when training a GAN. When modes are missing, the generated pictures are restricted to a few modes and lose the richness of the training set.
Several papers have addressed these two problems with good results. Salimans et al. (2016) add minibatch discrimination, which lets the discriminator classify each example by comparing it to the other members of the minibatch; nearest-neighbor style features detect whether a minibatch contains samples that are too similar to each other [9]. In principle, minibatch discrimination can help prevent mode collapse. Metz et al. (2016) proposed unrolled GANs, which back-propagate through k updates of the discriminator to prevent mode collapse [22]. That is to say, the generator is updated against the discriminator as it would be after k further optimization steps, instead of against the current discriminator only. This approach works as well.
The structure of GAN determines that the generative net generates a picture from a random vector. It actually maps a random distribution, such as Gaussian, into a multi-dimensional space. The random vector is, in a way, the feature of the generated picture. However, as a result of the randomness, we cannot make use of the features effectively. When we need some features such as wearing glasses or smiling, we can only search in large amounts of pictures to look for corresponding vectors.
Because Convolutional Neural Networks (CNNs) [15, 16] are good at extracting features, we add a CNN in front of the traditional generative network. The structure of the new generative net resembles an auto-encoder [13, 14]. Unlike the original G net, which maps a noise vector z directly into the sample space, the new G net first encodes the training data into a vector z and then decodes z (Fig. 1). As a result, the training process is more stable, and the missing-modes problem is reduced because of the extra information.

Fig. 1. The auto-encoder generative adversarial network. After training, we use the forward model f (the encoder) to map pictures to vectors and g (the decoder) to map a composited vector back to a picture.
GAN
Supervised learning relies on large amounts of data, but sometimes data is scarce. In this situation, we can use a generative model to turn the supervised problem into a semi-supervised one. Ian Goodfellow et al. [10] proposed Generative Adversarial Nets in 2014. Yann LeCun called it “the most interesting idea in the last 10 years in ML” in a Quora post [11]. A GAN trains two neural nets simultaneously. The Discriminator net D takes an input (e.g. an image) and outputs a scalar that indicates the probability that the input comes from the training set. The Generator net G produces images from a vector randomly sampled from a simple distribution (e.g. Gaussian). G trains itself to produce images that fool D, while D trains itself to better discriminate real data from generated data. That is why the GAN is called “adversarial”. In other words, D and G play the following two-player minimax game with value function V(D, G):
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
Radford et al. proposed DCGAN (Deep Convolutional Generative Adversarial Network) in 2015 [5]. Their G net is essentially a reversed convolutional neural network. Convolutional filters extract meaningful features and decrease the picture size. Conversely, the G net in DCGAN enlarges the features drawn from a random distribution and generates a new picture.
Meanwhile, they proposed many tricks, such as removing the pooling layers of a traditional CNN to make G fully differentiable. These tricks make GAN training more stable and manageable. In addition, the DCGAN paper notes that vector operations between features give some interesting results [5].
However, z is randomly distributed, so we cannot predict what picture will be generated from a given vector z. Furthermore, we consider vector space arithmetic to be of limited use if the resulting picture is totally unpredictable.
We now introduce AEGAN: a convolutional neural network combining auto-encoders with Generative Adversarial Networks. We first give an overview of the general architecture, and then we provide details on the learning procedure.
Different distances
Arjovsky et al. [24] define several elementary distances and divergences between two distributions $P_r, P_g \in \mathrm{Pr}(\chi)$:
The Total Variation (TV) distance
The Kullback-Leibler (KL) divergence
The Jensen-Shannon (JS) divergence
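For intuition, the three quantities listed above can be computed directly for discrete distributions. The sketch below is only illustrative. It assumes the usual discrete forms: TV as half the L1 distance, KL(P‖Q) as the sum of p log(p/q), and JS as the average of the two KL divergences to the mixture (conventions for JS differ by a constant factor between papers). The function names and the example distributions are not from the paper.

```python
import math

def total_variation(p, q):
    """TV distance for discrete distributions: 0.5 * sum_i |p_i - q_i|."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl_divergence(p, q):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i); infinite if q_i = 0 where p_i > 0."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue  # the term 0 * log(0 / b) is taken as 0
        if b == 0.0:
            return math.inf
        total += a * math.log(a / b)
    return total

def js_divergence(p, q):
    """JS(P, Q) with weights 1/2 (assumed convention), using the mixture M = (P + Q) / 2."""
    m = [0.5 * (a + b) for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Two distributions with partially overlapping support (illustrative values).
p_r = [0.5, 0.5, 0.0]
p_g = [0.0, 0.5, 0.5]

tv = total_variation(p_r, p_g)
kl = kl_divergence(p_r, p_g)
js = js_divergence(p_r, p_g)
```

In this example KL is infinite because p_g assigns zero mass where p_r does not, while TV and JS stay finite. This is the kind of behavior that matters when the real and generated distributions do not fully overlap.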
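People are familiar with the generation of random numbers. In essence, a GAN can be seen as a large random number generator. Consider the simplest case, the linear congruential generator
$$x_{n+1} = (a x_n + c) \bmod m,$$
where $a$, $c$, and $m$ are fixed integer constants (the multiplier, increment, and modulus).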
Then we generate a random sample point for the Gaussian distribution on the unit disk. We first generate
Then we define the mapping
This means that a mapping can transform the uniform distribution into a Gaussian. If we view the probability distribution as a mass density, the mapping changes area and therefore density.
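The uniform-to-Gaussian construction described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's own construction. The LCG constants are a commonly used choice, and the Box-Muller transform is assumed as the mapping, since the text does not name one. All function names are hypothetical.

```python
import math

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    """Yield floats in [0, 1) from x_{n+1} = (a * x_n + c) mod m (assumed constants)."""
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m

def box_muller(u1, u2):
    """Map uniforms u1 in (0, 1] and u2 in [0, 1) to two standard normal samples."""
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)

def gaussian_samples(n_pairs, seed=12345):
    """Draw 2 * n_pairs approximately Gaussian samples from the LCG stream."""
    gen = lcg(seed)
    out = []
    for _ in range(n_pairs):
        u1 = 1.0 - next(gen)  # shift [0, 1) to (0, 1] so log(u1) is finite
        u2 = next(gen)
        out.extend(box_muller(u1, u2))
    return out

uniforms = [u for u, _ in zip(lcg(1), range(1000))]
samples = gaussian_samples(10000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The sample mean and variance should come out close to 0 and 1. A GAN generator plays the role of `box_muller` here, except that the mapping is learned and the target is the data distribution.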
In image generation, a GAN essentially transforms a fixed probability distribution, such as a Gaussian, into the probability distribution of the training data, such as the distribution of face images. The ideal mathematical model of a GAN is as follows. We form a space, denoted the image space χ, of all n × n images, and view each image as a point x in χ. We denote the probability that x is a real face picture by υ(x), so υ is the target probability measure to be learned. In engineering practice, we only have some samples of face images {y1, y2, . . . , yn}, whose empirical distribution approximates υ. The empirical distribution is
$$\hat{\upsilon} = \frac{1}{n} \sum_{i=1}^{n} \delta_{y_i}.$$
Most images are not facial images, so the support set of υ
Suppose there is a fixed probability distribution ζ ∈ P(Z), such as a Gaussian, in the latent space Z. We can use a deep neural network with parameters θ to approximate the decoding map
$$g_\theta : Z \to \chi.$$
$g_\theta$ pushes ζ forward to a probability distribution in the image space,
$$\mu_\theta = (g_\theta)_\# \zeta.$$
We call $\mu_\theta$ the generated distribution.
Unlike in a GAN, the G net in AEGAN resembles an auto-encoder, that is, it encodes before decoding. Instead of G(z) as in a GAN, the output of AEGAN is G(x) = g(f(x)), where x is drawn from the distribution of the training set. Compared with mapping a random distribution (such as a Gaussian) directly into a high-dimensional space, the additional encoder has two advantages. First, the generative process now has more information, forming a map from a high-dimensional space into a high-dimensional space, which reduces mode collapse. Second, if AEGAN works well, we can extract a feature z from an arbitrary picture, operate on z as required, and finally decode it. Because of the good vector-arithmetic properties of GANs, the final generated picture is expected to lie in the distribution of the training set.
1) Encoder f: Because the encoder mainly extracts features from the input data, we use a CNN for its good performance on feature extraction. Our encoder is derived from AlexNet [12]. Given a picture of size 64×64, we use convolutional layers and a fully connected layer to extract a 100-dimensional feature. Note that we use LeakyReLU [19, 20] instead of ReLU and remove the max pooling layers to avoid sparse gradients, and we add batchnorm layers [17] to make the data distribution more concentrated and speed up convergence.
2) Decoder g: The decoder generates pictures from the extracted features. The fully connected layer is followed by a series of fractionally-strided convolutions [5], each with a ReLU [18] activation function, converting the high-level representation into a 64×64 pixel image. We train the decoder to be a good generator. On the one hand, the generated picture retains the features of the input image; on the other, the decoder maps a vector into the distribution of the training set.
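The spatial sizes in the encoder and decoder can be checked with the standard output-size formulas for strided and fractionally-strided (transposed) convolutions. The sketch below assumes four layers with kernel 4, stride 2, and padding 1, a DCGAN-style configuration. The paper does not state its exact layer parameters, so these values and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a strided convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding):
    """Output size of a fractionally-strided (transposed) convolution."""
    return (size - 1) * stride - 2 * padding + kernel

def trace(size, layers, fn):
    """Apply fn layer by layer and record the spatial size after each layer."""
    sizes = [size]
    for kernel, stride, padding in layers:
        size = fn(size, kernel, stride, padding)
        sizes.append(size)
    return sizes

# Assumed configuration: four layers of (kernel=4, stride=2, padding=1).
LAYERS = [(4, 2, 1)] * 4

encoder_sizes = trace(64, LAYERS, conv_out)   # 64x64 input shrinks to 4x4
decoder_sizes = trace(4, LAYERS, deconv_out)  # 4x4 feature map grows to 64x64
```

Under these assumptions each encoder layer halves the resolution and each decoder layer doubles it, so a fully connected layer between the 4×4 feature maps and the 100-dimensional vector closes the loop on both sides.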
Discriminator D
The Discriminator is a traditional convolutional neural network that distinguishes generated data from training data. When training is finished, the distribution of the generated data is supposed to be close to that of the training data, so that even the trained D net can hardly differentiate them.
We update the discriminator by ascending its stochastic gradient:
For G fixed, the optimal discriminator D is
$$D^*_G(x) = \frac{p_{\mathrm{data}}(x)}{p_{\mathrm{data}}(x) + p_g(x)}.$$
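As a numerical check of the optimal-discriminator result from Goodfellow et al. [10], D*(x) = p_data(x) / (p_data(x) + p_g(x)), the sketch below maximizes the pointwise integrand of the value function, p_data(x) log D + p_g(x) log(1 − D), over a grid of D values. The density values and function names are illustrative assumptions.

```python
import math

def pointwise_value(d, p_data, p_g):
    """Integrand of V(D, G) at a single point x, as a function of d = D(x)."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)

def optimal_d(p_data, p_g):
    """Closed-form maximizer of the pointwise value."""
    return p_data / (p_data + p_g)

def grid_argmax(p_data, p_g, steps=9999):
    """Brute-force maximizer over an evenly spaced grid strictly inside (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda d: pointwise_value(d, p_data, p_g))

# Illustrative density values at one point x.
p_data, p_g = 0.3, 0.1
closed_form = optimal_d(p_data, p_g)
brute_force = grid_argmax(p_data, p_g)
```

When the generator matches the data distribution, p_data(x) = p_g(x) and the optimal discriminator outputs 1/2 everywhere.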
Energy-based GANs (EBGANs) [25] have a discriminator D that tries to minimize
$$L_D(x, z) = D(x) + [m - D(G(z))]^+$$
for some m > 0 and
$$[\cdot]^+ = \max(0, \cdot),$$
and a generator trained to minimize
$$L_G(z) = D(G(z)).$$
Likewise, our G net resembles an auto-encoder, and its input and output have the same size. We therefore use a pixel-wise L2 loss to train the G net.
On the other hand, the objective for the discriminator is the logistic likelihood of whether the input is a real sample or a generated one:
We adopt this framework because it has shown encouraging results in [5]. In practice, D and G are optimized jointly using alternating SGD.
We define the overall loss function as
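A minimal sketch of a joint loss of this kind is given below. It assumes a weighted sum of the pixel-wise L2 reconstruction loss and a generator-side adversarial term of the form −log D(G(x)). The weights `lambda_rec` and `lambda_adv`, their default values, and the exact adversarial term are assumptions for illustration and are not taken from the paper.

```python
import math

def l2_loss(x, x_hat):
    """Mean squared pixel-wise difference between input x and reconstruction x_hat."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)

def adversarial_loss(d_fake, eps=1e-12):
    """Assumed generator-side term -log D(G(x)); small when D scores the output as real."""
    return -math.log(max(d_fake, eps))

def joint_loss(x, x_hat, d_fake, lambda_rec=0.5, lambda_adv=0.5):
    """Weighted sum of reconstruction and adversarial terms (hypothetical weights)."""
    return lambda_rec * l2_loss(x, x_hat) + lambda_adv * adversarial_loss(d_fake)

# Toy example: a two-pixel "image", its reconstruction, and a discriminator score.
x = [0.0, 1.0]
x_hat = [0.5, 0.5]
loss = joint_loss(x, x_hat, d_fake=0.5)
```

Setting `lambda_adv` to 0 recovers a plain auto-encoder objective, while lowering `lambda_rec` lets the adversarial term dominate.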
We trained AEGAN on CelebA [23] and pre-processed the dataset with two operations, face detection and face alignment. We used the Adam stochastic gradient solver [21] for optimization, with the default solver hyper-parameters suggested in [5]. As suggested in [9], we use one-sided label smoothing. That is to say, we replace the real label with 0.9 and keep the fake label at 0 to make the gradient signal of the discriminator more reasonable.
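One-sided label smoothing can be illustrated with a plain binary cross-entropy. The sketch below uses the real label 0.9 and fake label 0 from the text. The helper names and the clipping constant `eps` are assumptions.

```python
import math

def bce(pred, target, eps=1e-12):
    """Binary cross-entropy for one prediction, clipped to avoid log(0)."""
    pred = min(max(pred, eps), 1.0 - eps)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))

def discriminator_loss(d_real, d_fake, real_label=0.9, fake_label=0.0):
    """Discriminator loss with one-sided smoothing: only the real label is softened."""
    real_term = sum(bce(p, real_label) for p in d_real) / len(d_real)
    fake_term = sum(bce(p, fake_label) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# With a target of 0.9, the loss on real samples is smallest at D(x) = 0.9,
# so the discriminator is not rewarded for becoming overconfident.
at_target = bce(0.9, 0.9)
overconfident = bce(0.99, 0.9)
underconfident = bce(0.8, 0.9)
```

Because the fake label stays at 0, the discriminator is still pushed to reject generated samples fully.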
Evaluation
Pictures generated by a GAN can be composited through feature operations. AEGAN is more powerful because it can composite two or more given pictures rather than random ones.
We compare AEGAN to a plain auto-encoder (L2 weight of 1 and no adversarial loss). Pictures generated by the auto-encoder are blurry, and its composited pictures are not sharp (Fig. 2(b)(c)). We also compare feature vectors of different dimensions (100 and 4000). The 100-dimensional model loses some characteristics when reconstructing the picture, but its compositions are much more coherent (Fig. 2(a)(b)). We therefore think that the joint loss alleviates the weaknesses of both auto-encoders and adversarial networks.

Fig. 2. (a) AEGAN with a 4000-dimensional feature vector. (b) AEGAN with a 100-dimensional feature vector. (c) Auto-encoder. The pictures in the third column are composited from the ones in the first and second columns. The pictures in the sixth column are composited from the ones in the fourth and fifth columns.

(a) A “turn” vector was created from five samples of faces. By adding interpolations to their feature vectors we were able to reliably transform their faces. (b) Normal noise sampled with scale ±0.3 was added to z to produce the other 17 samples.
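The feature-space operations used in these experiments can be sketched on plain vectors. The example assumes that composition is a linear blend of two feature vectors and that "scale ±0.3" means Gaussian noise with standard deviation 0.3. Both are interpretations, not details confirmed by the text. The encoder f and decoder g are not modeled here, and all names are hypothetical.

```python
import random

def compose(z_a, z_b, alpha=0.5):
    """Blend two feature vectors; alpha = 0.5 gives their average (assumed rule)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]

def perturb(z, scale, rng):
    """Add independent Gaussian noise with standard deviation `scale` to each entry."""
    return [v + rng.gauss(0.0, scale) for v in z]

def variations(z, count, scale=0.3, seed=0):
    """Produce `count` noisy copies of z from a seeded generator."""
    rng = random.Random(seed)
    return [perturb(z, scale, rng) for _ in range(count)]

# Toy 4-dimensional features standing in for f(x_a) and f(x_b).
z_a = [0.0, 2.0, 1.0, -1.0]
z_b = [2.0, 0.0, 1.0, 1.0]
z_mix = compose(z_a, z_b)
z_variants = variations(z_mix, count=17)
```

In the full model each resulting vector would be passed through the decoder g to obtain a picture.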
The features extracted by AEGAN can be used to predict how a person looks at different ages, or how a child’s face might look given the parents’ photos. We could even reconstruct a criminal’s face by compositing witnesses’ descriptions with security video.
The results shown in this paper are very preliminary, but they show the potential of combining generative adversarial networks with auto-encoders and show promise for useful applications.
A problem with the L2 distance is that it forces the generated picture to be identical to the input. That is why some composited pictures look unnatural. We would prefer pictures that keep the characteristics of the input picture while sharing the generality of the training set. One of the major remaining tasks is therefore to find other ways to guide the G net in obtaining features, so that the weight of the L2 loss in the joint loss can be reduced and composition becomes more robust.
Conclusion
AEGAN combines a GAN with an auto-encoder. The CNN in the G net plays an important role in extracting features, and the joint loss, composed of the L2 loss and the Discriminator loss, controls the decoding process so that the generated picture is both sharp and coherent. Unlike a GAN, AEGAN can use the features of given pictures to composite new pictures, which enables some useful applications.
