Abstract
In recent years, deep neural networks have made significant progress in image classification, object detection and face recognition. However, they still have the problem of misclassification when facing adversarial examples. In order to address security issue and improve the robustness of the neural network, we propose a novel defense network based on generative adversarial network (GAN). The distribution of clean - and adversarial examples are matched to solve the mentioned problem. This guides the network to remove invisible noise accurately, and restore the adversarial example to a clean example to achieve the effect of defense. In addition, in order to maintain the classification accuracy of clean examples and improve the fidelity of neural network, we input clean examples into proposed network for denoising. Our method can effectively remove the noise of the adversarial examples, so that the denoised adversarial examples can be correctly classified. In this paper, extensive experiments are conducted on five benchmark datasets, namely MNIST, Fashion-MNIST, CIFAR10, CIFAR100 and ImageNet. Moreover, six mainstream attack methods are adopted to test the robustness of our defense method including FGSM, PGD, MIM, JSMA, CW and Deep-Fool. Results show that our method has strong defensive capabilities against the tested attack methods, which confirms the effectiveness of the proposed method.
Introduction
With the rapid development of deep learning in the field of computer vision [5, 27], the efforts on image recognition have achieved great success. However, recent studies have revealed that deep neural networks are very vulnerable to adversarial examples. The phenomenon was firstly discovered in the work of Szegedy [24]. If a small disturbance is intentionally added to a clean sample as a input into neural network, the classifier will output a wrong prediction. For example, a disturbed picture of a cat can be misclassified as a dog by classifier, even if these disturbances are difficult to detect by human eyes. This feature exposes the vulnerability of deep neural networks. People use this leak [3, 4] to identify and understand the potential weakness of deep neural networks, thus improve the defense ability and robustness of the network.
Adversarial examples pose a potential security threat to real deep learning applications. Studies have shown that adversarial examples associated with traffic signs may cause autopilot vehicles recognition systems to make wrong decisions and behaviors, which may lead to dangerous maneuver [7]. For example, a stopping traffic sign could be classified as a sign of speed-limit 80, which is obviously dangerous for autonomous driving. So it is very important and urgent to develop defense methods against the adversarial examples. Since adversarial examples are constructed by adding noises to original images, a natural idea is to denoise adversarial examples before sending them to the target model.
The non-linear characteristics of neural networks were once considered to be the reason of adversarial examples [1]. In 2014, Goodfellow argued that the generation of adversarial examples was caused by the linear characteristics of deep neural network rather than the nonlinear characteristics, which was confirmed by Fast Gradient Sign Method (FGSM) [8]. Since the high dimensional linear properties of deep networks are difficult to avoid in practical applications, so it is more difficult to defend against adversarial examples. Small perturbations contained in adversarial examples(for a specific classifier) can lead the classifier to make a wrong classification. Currently, how to defend against adversarial examples is still a challenging problem. If a algorithm can eliminate disturbance on adversarial examples, at the same time, adversarial examples satisfy a condition that it is visually consistent with the clean images, the algorithm will be applied to defensive adversarial examples. In the method of Adversarial training [24] and defensive distillation [21], adversarial examples were fed back to the training process to train the model. Adversarial training adds adversarial examples to a training set for joint training, but this method requires large amount of calculation. It has been argued that no matter how many adversarial examples are added, there are always have new adversarial examples that can deceive the network. Defensive distillation uses soft labels to softer the final model output and improve the robustness of new model. However, this approach does not essentially solve the problem of poor robustness of the model against the adversarial examples. Therefore, reducing the impact of adversarial examples on neural networks remains a huge challenge.
In order to achieve the purpose of defense and improve the classification accuracy of adversarial examples, we propose a new method to defend against it. In this paper, due to its strong ability, we adopted Generative adversarial network (GAN) to simulate image distribution and provide some help for denoising. All in all, our propsosed method has three objectives: 1) eliminate the disturbance effectively, 2) make the classifier classify correctly, 3) enhance the robustness of the classifier.
As shown in Fig. 1, 66.17%, 7.88% and 46.27% are the score of coonfidence of the label of images being predicted as an image of digit “5”. 78.1%, 27.5% and 79.8% are the score of confidence of the label of images being predicted as a car. It can be seen that proposed method can be applied to both gray-scale images and RGB images.

Denoise of two different images, from left to right, the clean image, the adversarial image, and the reconstructed image. (Best viewed in color).
Our proposed method mainly focuses on retaining the features of clean samples, while denoising the adversarial samples, reducing training difficulty and faster fitting, so as to achieve good results. Existing defense methods ignore the characteristics of clean samples, and there is a problem that the classification success rate decreases when denoising the clean samples.
Therefore, our contributions are as follows: Firstly, a denoising model consisting of a generator and a discriminator is proposed, which simulates the approximate distribution of clean and adversarial examples to eliminate the difference.
Secondly, an isomorphic loss of images and labels between denoised adversarial examples and clean examples is proposed to generate better denoised images.
Thirdly, the effectiveness of the defense on five datasets, MNIST, fashion-MNIST, cifar10, cifar100, and ImageNet is demonstrated. Meanwhile, our defense model is tested against by six attack methods. Results show the proposed defense method has strong ability to defense against the methods.
The rest of the paper is organized as follows: In Section 2, we introduce the background of the problem and some common used attack methods and defense methods. In Sections 3, we discuss the details of the proposed method, including the description of the principle and the setting of the objective function. Sections 4 introduce the experimental settings and results, which include the experimental details and the algorithm settings of this paper. Finally, we summarize the views of the full text and show the prospects for future work in Sections 5.
Szegedy et al. [24] first discovered the perceived inconsistency between human and machine learning models. If an image is disturbed, the deep neural network may treat it as a completely different image and give totally different predictions, but humans may not even see this difference [12]. The data that can cause wrong predictions with little changes are called adversarial examples. These blind spots make neural network models vulnerable to malicious attacks.
Therefore, we can use an approximate function to model the task of generating adversarial examples, as Equation (1)
Many kinds of methods for generating adversarial examples have been proposed. FGSM (Fast Gradient Sign Method) is a primary method for generating adversarial examples. Goodfellow et al. [8] proposed this method to find a adversarial perturbation for a given input. Equation (2) is the formula for calculating noise.
FGSM calculates the sign of the gradient of the loss function and multiplies it by a constant, and finally adds the obtained result to the clean image as noise to obtain adversarial examples. The advantages of this method is that it is fast, as well as the generated adversarial examples have strong transferability, but the disadvantage of this method is that the added noise can be easily eliminated, such as using median filtering. Moreover, Moosavi-Dezfooli [18] proposed the deep-fool method. This method calculates the minimum norm perturbation for a given image in an iterative manner and finds the projection of an input x on the decision surface, by moving x a small amount along the found projection direction. The advantage of this approach is the high attack rate for white-box attacks. However, the disadvantage is that it’s hard to use them for black-box attacks. In addition, Madry [16] proposed the Project Gradient Descent (PGD) attack.
It is an iterative attack method, which can be regarded as a replica of K-FGSM (K represents the number of iterations). The idea of this method is to iterate multiple times. One step at a time and the disturbance will be clipped to specified range for each iteration. In general, PGD attacks are better than FGSM.
A momentum-based iterative attack method MIM (Momentum Iterative Method) is proposed by Dong [6], which can stably update the direction and avoid local maximum in the iterative process, thus generating a transferable adversarial example. It is effective for not only the "white-box" attack but also the "black-box" attack. Equally important but different in principle, JSMA (Jacobian Saliency Map) method proposed by Papernot [20] uses the Jacobian matrix to find the difference among the saliency map from input to output. It can change the prediction category by modifying only a small number of input features. C&W attack was proposed by Carlini [3]. The attack method has two conditions to achieve successful attack. One is that the adversarial and clean examples should be visually consistent. Another is to use as few disturbances as possible to achieve a higher classification error rate. A patch-based attack [3] was proposed by Liu in 2019. The purpose of this method is to make the area where PATCH exists the only effective RoI. The proposed method uses this area for classification and ignores the characteristics of other areas.
In addition, empirical data show that adversarial examples can be transferred among models trained on the same task [19], and Goodfellow illustrates this phenomenon by introducing the mobility of adversarial examples [8]. Further research indicates that adversarial examples appear in continuous regions or adversarial subspaces.
There are several main defense methods based on adversarial learning. One representative defense method is MagNet [17]. This method does not modify the protected classifier and knows nothing about the process of generating adversarial examples. MagNet contains one or more separate detector networks and a fine-tuning network. The detector part of this method also needs to distinguish between adversarial samples and clean samples, but the method used is different from other detection methods. The idea of MagNet is distinguished by approximating the manifold of normal samples. The article discusses the difficulty of defending against white-box attacks, and proposes a method of defending against gray-box 1 attacks(inspired by randomness in cryptography). Experiments prove that the method proposed in this paper is effective for both gray-box and black-box attacks. This method defends adversarial samples from the perspective of detection, while our method defends from the perspective of modifying the input image. The next thing to introduce is Defense-GAN [17]. This method uses the expression capabilities of the generative model of WGAN [2] to defend against adversarial samples. Defense-GAN is trained to simulate the distribution of undisturbed images. The purpose is to find a distribution close to a clean sample and output it to the classifier to achieve the purpose of defense. It also does not modify the structure or training process of the classifier. Although defense-gan defends from the perspective of modifying the input image, this method is to simulate the distribution of the adversarial sample after removing the disturbance, that is, there is no reference for the distribution of the clean image. However, our method is to input the clean image into the generator to obtain the prior knowledge of the distribution of clean samples, thereby reducing the training difficulty. In addition, it can also play a certain role in denoising the adversarial samples. The last defense method I want to introduce is APE-GAN [9], which is the origin of our motivation. The article defends from the perspective of changing the input image, that is, processing the image before classification. However, this article does not consider the processing of clean sample distributions, and it cannot handle more complex images. Therefore, we improved based on this to enable it to process more complex images. Moreover, our proposed method takes the distribution of clean samples into account, thereby reducing the difficulty and time of training. Our method aims to improve the generation quality of the generator, while reducing the training difficulty, making the neural network fit faster and achieving good denoising effects. At the same time, we reduce the impact of the adversarial examples on the classifier through multiple pairing losses.
Preliminary
In our method, the generative adversarial network is utilized to reduce the value of perturbation. The generator network is used for image denoising, and the discriminator network is used to discriminate adversarial examples and real examples. In the training progress, the discriminator distinguishes the real and fake examples as correctly as possible. The generator minimizes the gap between the clean and adversarial example at the image level, aiming to confuse the discriminator’s judgment so as to generate a better denoised adversarial example. The network architecture is shown in Fig. 2. Adversarial example x adv is input into the generator, while denoised adversarial example x fake is obtained as the output of the generator. Clean image x real is the input to the generator for denoising, finally we got denoised clean image x dec . In this paper, there are three pair of loss functions: the loss between x fake and x real at the pixel level is the repair loss l xf ; the loss between x real and x dec at the pixel level is the recovery loss l xd ; the loss between x fake and x dec at the pixel level is the isomorphic loss l df . The generative adversarial network is trained by the above three loss functions to obtain a good denoising result.

The method used by the defense network is to input clean samples and adversarial samples into the generator, obtain clean samples and adversarial samples after denoising, and use a discriminator to determine true and false to achieve the defense purpose.
All in all, our method is attributed to the powerful generation ability of the generator, which enables the denoising of the adversarial samples while retaining the features of the clean samples in the adversarial samples.
Combined with the network structure definition and main ideas of this paper, there are three loss functions for training different indicators: repair loss, recovery loss and isomorphic loss. The meaning is explained separately below.
Among them, the role of l
xf
-
mse
is to reduce the error of x
real
and x
fake
at the pixel level, prompting the generator to generate denoised images with higher quality. As shown in Equation (4):
The label-level error used in the two-category is shown in Equation (5), which is the cross-entropy loss. The difference between ground truth label and predicted label of the denoised adversarial example is calculated. The purpose is to reduce the loss at the label level between x
real
and x
fake
so that the denoised examples can be correctly classified and guide the generator.
In summary, the total loss function L
all
is composed of the above three loss functions, and the purpose is to obtain high quality denoised images. As shown in Equation (3):
In order to verify the feasibility of the proposed model, qualitative and quantitative experiments are conducted on various defense methods and datasets. Specifically, six attack methods described in Section 2 are used to generate adversarial examples so as to verify the effectiveness of the proposed defense method. Among them, for each dataset, three different Convolutional Neural Networks (CNNs) classification networks is used to classify the adversarial examples as correctly as possible. At the same time, we used universal indicators (classification accuracy) and image quality indicators (peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [25]) to measure the effectiveness of our defense.
Datasets
We used five common datasets for experiments, including grayscale and color datasets. The first one we are going to introduce is the simplest handwritten digital dataset MNIST [13] in the grayscale dataset, which contains ten classes (numbers 0-9). On this basis, we want to introduce a clothing dataset Fashion-MNIST [26] with relatively complex texture features in the grayscale dataset. The clothing dataset also contains ten classes (mainly contains clothing and bags). Grayscale images have a size of 28×28. The advantage of the grayscale image is that the image is relatively small and the operation when processing the image is simpler. Next we introduce the color data set, which is divided into the smaller size CIFAR [10] series dataset (CIFAR10 and CIFAR100) and the larger size ImageNet dataset [11]. The size of the CIFAR series dataset is 32×32. The difference between CIFAR10 and CIFAR100 is the number of categories: CIFAR10 contains 10 classes while CIFAR100 contains 100 classes. Compared with the previous data set, the largest ImageNet is a dataset collected by Li Feifei that can be used for classification tasks and object detection tasks. The size of the imagenet dataset we used is 128×128. The images are sampled in nature. Among these datasets, mnist is compiled by the National Institute of Standards and Technology, so it can be considered as a simulation experiment, and imagenet is a dataset created for all the objects in the world, so our experiment includes simulation experiments and real experiment.
Experimental details
Training methods on the generate adversarial network: the entire network is trained 70,00 epochs. Learning rate is initialized with 0.0002 and the Adam optimizer is used to update parameters and optimize the network. Batchsize is set to 32. The device parameters are CPU: Intel i7-8700, GPU: RTX2080Ti-11G, memory: DDR4-3000-32G. The code runs under the PyTorch deep learning framework, and the entire network training completion time is about 5 days. After calculation, the number of our floating point operands (FLOPS) is about 1.5B, mainly including the convolution and deconvolution operations and the operation of the full connection operation in the network structure.
Algorithm structure
Qualitative experiments
Due to the display complexity caused by multiple attack methods and multiple datasets, our qualitative experiments only show the defensive effect of the adversarial samples generated by the FGSM method [8], in which the displayed images are correctly classified. In this paper, according to the principle of FGSM, we have different settings for the perturbation multiplier of the grayscale image and the RGB image, where for the grayscale image we set the perturbation multiplier to 0.3, and for the RGB image we set it to 0.15, this setting can achieve the best attack effect, that is, it can achieve the effect of misclassification of the neural network and invisible disturbance to the human eye at the same time. As shown in Fig. 3, it can be seen that the denoised result of the proposed method is visually plausible. Fig. 4 is the experiment deployed on the Fashion-MNIST dataset. It is also well reflected in the features of the dataset. For example, in the third line, it can be seen that the effect is obvious, and the features of clothes are also revealed. It also indicates that the denoising effect is easier to reflect on MNIST, Fashion-MNIST. The proposed model is also verified by experiments on RGB image datasets. As shown in Figs. 5 and 6, the experiment is performed on CIFAR10 and CIFAR100 seperately. Compared with CIFAR10, CIFAR100 can better reflect the detailed features of the image, and the semantic features of the image are more clear. In order to make the experiment sufficiently, this paper has done experiments on the ImageNet dataset as well, as shown in Fig. 7. It can be seen from the experimental results that when we process the RGB image, although the image texture is more complex than the grayscale image, it can still achieve a good denoising and defense effect.

The experimental results of the FGSM method on the MNIST, the first line “Normal” is a normal clean sample, the second line “Adv” is the generated adversarial examples, and the third line “Ours” is the denoised results of our method.

The experimental results of the FGSM method on the Fashion-MNIST, the first line “Normal” is a normal clean sample, the second line “Adv” is the generated adversarial examples, and the third line “Ours” is the denoised results of our method.

The experimental results of the FGSM method on CIFAR10, the first line “Normal” is a normal clean sample, the second line “Adv” is the generated adversarial examples, and the third line “Ours” is the denoised results of our method.

The experimental results of the FGSM method on CIFAR100, the first line “Normal” is a normal clean sample, the second line “Adv” is the generated adversarial examples, and the third line “Ours” is the denoised results of our method.

The experimental results of the FGSM method on ImageNet, the first line “Normal” is a normal clean sample, the second line “Adv” is the generated adversarial examples, and the third line “Ours” is the denoised results of our method.
This paper uses the aforementioned six attack methods for quantitative experiments, including classification accuracy, peak signal-to-noise ratio, and structural similarity, as shown in Tables 3. Table 1 is the experiment of classification accuracy. Result shows the proposed method can effectively defend the adversarial examples generated by the above several attack methods, and the "base" column gives the classification accuracy on the adversarial examples of the classification model without defense operations. The "Ours" column is the classification accuracy of the samples denoised by the model proposed in this paper. Results show the adversarial examples generated by the five datasets selected in the six attack methods and the accuracy obtained by the three target classifiers. It can be seen from the Table 1 that the method of this paper has a good effect on the recovery of the adversarial examples, and the classification accuracy after denoising is significantly improved. Among them, for the attack of MIM method on the complex ImageNet dataset, the classification accuracy after denoising increases from 22.6% to 78.7%.
Classification accuracy of five datasets under six attack methods
Classification accuracy of five datasets under six attack methods
Comparative study of three defense methods on five classifiers
The PSNR of the six methods before and after denoising
To satisfy the purpose of classifier defense against attack, we compare our proposed method with existing methods, as shown in Table 2. The results show that our method is effective compared with the other two methods. This can be shown more clearly in the Imagenet dataset, and the classification accuracy of Defense-Gan [17] is lower than our proposed method, which is 4.3%. At the same time, our method is more effective than APE-GAN. In the Imagenet dataset, the classification accuracy of our method is higher than that of APE-GAN [9] 7.4%. In order to evaluate the effect of the proposed model, it is required not only to correctly classify the images, but also to make the denoised image as similar as possible to the original image. Therefore, the quantitative analysis of PSNR and SSIM of the denoised images and the adversarial examples is also carried out, as shown in Tables 3 and 4. PSNR is a common quality evaluation metric, which is used to evaluate the quality of the generated image compared with the clean image. The general value range is 20-40dB. SSIM (structural similarity index) is an index to measure the similarity of two images, the structure similarity ranges from 0 to 1. When two images are identical, the value of SSIM is equal to 1. For PSNR and SSIM, they have the same criteria, that is, the larger the value, the better the image quality.
The SSIM of the six methods before and after denoising
Tables 3 and 4 show the PSNR and SSIM comparison of the proposed method applied to the six attack methods. The “Base” column represents the PSNR and SSIM calculated by the adversarial examples and the clean image; the "Ours" column represents the PSNR and SSIM between the denoised image and the clean image. The increase of values is mainly due to the constraint of isomorphic loss, which enhances the correlation between the adversarial examples and the clean examples.
In this paper, we propose a new method to defend against invisible noise. A discriminator is used to distinguish the output of the generator to maintain the classification accuracy of clean samples, thereby improving the defense ability of the network against adversarial samples and improving the robustness of the network. We then deployed six common attack methods to attack the network and tested the effectiveness of our proposed defense network through qualitative and quantitative experiments. Experiments prove the denoising effect of our method and the robustness of the network. There are still some weaknesses in our method. The most typical one is that our mechanism has certain limitations, i.e. dealing with patch-based attack, which are mainly limited by our generators. In the future, we focus on improving the network for complex adversarial examples denoising problems, combining with new denoising methods to achieve better denoising effects at semantic level.
Acknowledgments
The authors are very indebted to the anonymous referees for their critical comments and suggestions for the improvement of this paper. This work was supported by the grants from the National Natural Science Foundation of China (Nos. 61673396, 61976245) and the Fundamental Research Funds for the Central Universities (18CX02140A).
Footnotes
Except for the parameters, attacker knows everything else about target model.
