Abstract
Deep learning is gaining significant traction in a wide range of areas. However, recent studies have demonstrated that deep learning is critically vulnerable to adversarial examples. Because deep learning models are black boxes that lack transparency, it is difficult to explain why adversarial examples exist and hard to defend against them. This study focuses on improving the adversarial robustness of convolutional neural networks. We first explore how adversarial examples behave inside the network through visualization. We find that adversarial examples produce perturbations in hidden activations, which are amplified layer by layer until they fool the network. Motivated by this observation, we propose an approach, termed sanitizing hidden activations, that helps the network correctly recognize adversarial examples by eliminating or reducing the perturbations in hidden activations. To demonstrate the effectiveness of our approach, we conduct experiments on three widely used datasets, MNIST, CIFAR-10 and ImageNet, and compare with state-of-the-art defense techniques. The experimental results show that our sanitizing approach generalizes better across different kinds of attacks and can effectively improve the adversarial robustness of convolutional neural networks.
Introduction
Machine learning is driving progress in many areas such as computer vision, natural language processing and self-driving cars. Like any technology, machine learning is not immune to attacks. Adversarial threats to machine learning fall into several categories, such as poisoning, extraction, inference and evasion. One well-known and serious weakness of machine learning is its sensitivity to adversarial examples, which are mainly exploited to launch evasion attacks. Adversarial examples are carefully crafted inputs that differ only slightly from legitimate examples but can fool or mislead machine learning models. They pose a serious threat to the application of machine learning in safety- and security-critical scenarios such as face recognition, autonomous driving and medical diagnosis. Since they were first discovered by Szegedy et al. [1], adversarial examples have remained a hot topic and have drawn widespread attention from researchers. A large number of attack approaches have been proposed for generating adversarial examples.
Machine learning models usually lack interpretability and do not explain their predictions. Deep learning models in particular, e.g., deep neural networks, behave as black boxes and lack transparency. These issues have contributed to the proliferation of adversarial examples. So far, the reason for the existence of adversarial examples has not been fully explained, which hinders research on defense techniques. Defenses against adversarial examples therefore remain largely reactive. More seriously, adversarial examples are difficult not only to defend against but also to detect. As Francois Chollet [2] argues, “There are no fixes for the fundamental brittleness of deep neural networks.”
In this paper, we focus on improving the adversarial robustness of convolutional neural networks so as to defend against adversarial examples. We first explore how adversarial examples behave inside the network through visualization. We find that adversarial examples produce perturbations that look like “noise” in hidden activations. These perturbations are amplified through forward propagation and finally accumulate enough error to fool the network. Motivated by this observation, we propose an approach, termed sanitizing hidden activations, to improve the adversarial robustness of convolutional neural networks. The sanitizing approach is designed to eliminate or reduce the perturbations in hidden activations, which helps the network correctly recognize adversarial examples.
In summary, we make the following contributions:
(1) We identify, through visualization, how adversarial examples fool the network, which supports research on improving the adversarial robustness of neural networks.
(2) We propose a novel approach—sanitizing hidden activations—to improve the adversarial robustness of convolutional neural networks.
(3) We present three specific sanitizing methods and verify their effectiveness through comprehensive experiments.
The remainder of the paper is organized as follows. In Section 2, we introduce the related work, including adversarial attacks and defenses. In Section 3, we analyze perturbations in hidden activations through visualization of an MNIST network. In Section 4, we present the sanitizing approach for improving the adversarial robustness of convolutional neural networks. In Section 5, we present the experimental results. In Section 6, we conclude the work.
Related work
Adversarial attacks
Since Szegedy et al. [1] first demonstrated the existence of adversarial examples in neural networks, many types of adversarial attacks have been proposed. Here we introduce only the three representative attacks that are tested in our experiments.
(1) Fast Gradient Sign Method (FGSM)
Goodfellow et al. [3] proposed the fast gradient sign method (FGSM) to craft adversarial examples. FGSM is a one-step algorithm and can be formulated as:
FGSM is regarded as a benchmark method for crafting adversarial examples. There are many variants of FGSM, e.g., the basic iterative method (BIM) [4], the iterative least-likely class method (ILCM) [4], and the momentum iterative method (MIM) [5].
(2) Carlini and Wagner Attack (C&W)
Carlini and Wagner [6] introduced an optimization-based method, termed C&W, to search for high-confidence adversarial examples with small perturbation magnitude. C&W is formulated as:
(3) Projected Gradient Descent (PGD)
Madry et al. [7] proposed an iterative attack called the projected gradient descent (PGD). The formulation of PGD is expressed as:
In each iteration, PGD first updates the perturbed example with a gradient step of size α. It then projects the perturbed example back into the feasible region, i.e., the set of examples within a maximum perturbation ɛ of the original example.
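As a minimal sketch of the two gradient-sign attacks described above, the following NumPy code implements a one-step FGSM update and an iterative PGD loop with projection onto the ɛ-ball. The names `fgsm`, `pgd` and `grad_fn` are illustrative and not taken from the cited papers; `grad_fn` is assumed to return the gradient of the loss with respect to the input, inputs are assumed to lie in [0, 1], the perturbation is assumed to be bounded in the L∞ norm, and the random start used in some PGD implementations is omitted.

```python
import numpy as np

def fgsm(x, grad_fn, eps):
    """One-step attack: shift every input element by eps along the gradient sign."""
    x_adv = x + eps * np.sign(grad_fn(x))
    # Assumption: valid inputs live in [0, 1] (e.g. normalized pixel values).
    return np.clip(x_adv, 0.0, 1.0)

def pgd(x, grad_fn, eps, alpha, steps):
    """Iterative attack: repeated sign-gradient steps, each followed by a projection."""
    x_adv = x.copy()
    for _ in range(steps):
        # Gradient step of size alpha on the current perturbed example.
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        # Projection onto the L-infinity ball of radius eps around the original x.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        # Keep the result a valid input.
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv

if __name__ == "__main__":
    # Toy loss L(x) = w . x, whose input gradient is the constant vector w.
    w = np.array([1.0, -2.0, 0.5])
    x = np.array([0.5, 0.5, 0.5])
    print(fgsm(x, lambda v: w, eps=0.1))
    print(pgd(x, lambda v: w, eps=0.3, alpha=0.01, steps=40))
```

The toy call uses 40 iterations and ɛ = 0.3 to match the PGD setting reported later for MNIST; the step size `alpha=0.01` is an assumed value chosen only for illustration.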
Adversarial defenses
Currently, defense techniques against adversarial attacks can be classified into five categories.
(1) Adversarial training
Adversarial training tries to train a robust model by injecting adversarial examples into training data. Kurakin et al. [8] proposed the naive adversarial training to scale adversarial training to large-scale datasets. Madry et al. [7] proposed the PGD-based adversarial training via training the model with adversarial examples generated by PGD attack. Tramer et al. [9] introduced the ensemble adversarial training that augments training data with adversarial examples transferred from other models.
(2) Gradient masking/regularization
Since numerous adversarial attacks utilize gradients of the model to calculate adversarial perturbations, some researchers propose reducing the effectiveness of adversarial attacks by hiding the gradients, which is termed as gradient masking or regularization. Lyu et al. [10] proposed a gradient regularization method by penalizing the gradient of the loss function to incorporate robustness into the network. Papernot et al. [11] introduced a defensive mechanism called defensive distillation to use the knowledge of the network to improve its own robustness. Ross and Doshi-Velez [12] studied input gradient regularization as a defense method for improving the robustness against adversarial attacks.
(3) Input transformation
Some research has attempted to remove the adversarial perturbations from inputs before feeding them into machine learning models, which is referred to as input transformation. Xie et al. [13] introduced a random transformation-based defense that first passes test examples through two additional randomization layers. Song et al. [14] proposed PixelDefend to purify adversarial perturbations in inputs. Buckman et al. [15] proposed the thermometer encoding method to increase the robustness of the network to adversarial examples. Guo et al. [16] investigated strategies that defend against adversarial attacks with different image transformation techniques, such as bit-depth reduction, JPEG compression and total variance minimization. Bafna et al. [17] presented a sparse discrete Fourier transform to thwart L0 attacks.
(4) Detection
Given the difficulty in recognizing adversarial examples correctly, some research aims to detect adversarial examples and then reject them. Meng and Chen [18] introduced a defense framework called MagNet, which includes one or more separate detector networks and a reformer network. Li and Li [19] used statistics of convolution filters in CNN-based neural networks to identify adversarial examples. Ma et al. [20] proposed a local intrinsic dimensionality-based detector to discriminate adversarial examples from legitimate ones. Xu et al. [21] proposed the feature squeezing method to detect adversarial examples in deep neural networks. Roth et al. [22] proposed a statistical test for detecting adversarial examples. Li et al. [23] proposed detection methods using the generative classifier’s logit values. Yin et al. [24] presented a detection method that provides performance guarantee to norm constrained adversarial examples.
(5) Miscellaneous
There are also some miscellaneous defense methods. For example, Gao et al. [25] proposed inserting a masking layer immediately before the layer handling the classification. Cao and Gong [26] introduced the region-based classification defense method, which takes the majority prediction on examples that are uniformly sampled from a hypercube around the adversarial example. Pang et al. [27] presented a method that explores the interaction among individual networks to improve robustness for ensemble models. Verma and Swami [28] proposed an approach that changes the way the output is represented and decoded to improve the adversarial robustness of neural networks. Xiao et al. [29] advocated the use of k-Winners-Take-All activation for defending against gradient-based adversarial attacks.
Although many types of defense methods have been proposed, progress in defense techniques remains limited. As stated in [30], most of these defense methods can be easily bypassed by newly crafted adversarial examples.
Perturbations in hidden activations
Before introducing our approach for improving the robustness of convolutional neural networks, we first explain why adversarial examples can successfully fool the network.
We trained a lightweight network, termed MNIST-CNN, for the MNIST classification task. The architecture of MNIST-CNN is given in Table 1. This architecture is also used in [6] and [11]. Excluding the input and output layers, MNIST-CNN has eight hidden layers. For convenience, we refer to these hidden layers in order as CONV1, CONV2, POOL3, CONV4, CONV5, POOL6, FC7 and FC8. Our training approach and hyperparameters are identical to those presented in [6]. MNIST-CNN achieves a test accuracy of 98.65%.
Model architecture of MNIST-CNN
We first feed the network with a legitimate example and an adversarial example, respectively, and then observe how the hidden activations change. The legitimate example is a clean digit “0” image randomly selected from the MNIST test data, and the adversarial example is the corresponding adversarial image generated by the untargeted PGD [7] attack with 40 iterations and a maximum perturbation ɛ = 0.3. For convolutional and max-pooling layers in MNIST-CNN, the hidden activation is a stack of feature maps, from which we select one feature map for illustration. For fully connected layers in MNIST-CNN, the hidden activation is a vector. To show the changes more clearly, we reshape the vector into a two-dimensional matrix.
The hidden activations of different layers of MNIST-CNN for a legitimate “0” image and an adversarial “0” image are shown in Fig. 1. The activations of the adversarial “0” image differ from those of the legitimate “0” image: the adversarial image perturbs the activations of the hidden layers. Visually, the perturbations in hidden activations grow gradually as the signal propagates forward through the network. The perturbations are thus amplified and finally accumulate enough error to fool the network. We therefore attribute the success of an adversarial example to the errors it introduces into hidden activations, which are amplified through forward propagation and eventually mislead the network into a wrong prediction.

Hidden activations of MNIST-CNN when the model is fed a legitimate “0” image and an adversarial “0” image, respectively. The adversarial “0” image is generated by PGD attack and is incorrectly recognized as the digit “6”. For convolutional and max-pooling layers, the feature maps for each pair of examples are from the same channels of hidden activations. For fully connected layers FC7 and FC8, the activation vector is reshaped to a matrix with size 20×10.
According to the analysis in Section 3, adversarial examples fool the network by perturbing hidden activations. A natural idea is to improve the robustness of convolutional neural networks by removing these perturbations. We propose an approach, termed sanitizing hidden activations, to eliminate or reduce the perturbations in hidden activations.
In convolutional neural networks, only the convolutional layers and the fully connected layers compute activations, so our sanitizing approach is applied to the hidden activations (i.e., the outputs) of the convolutional layers or the fully connected layers. As Fig. 2 shows, if the hidden activation is a stack of feature maps, e.g., the activation of a convolutional layer, the sanitizing operation is conducted on each feature map independently. If the hidden activation is a vector, e.g., the activation of a fully connected layer, the sanitizing operation is conducted directly on the activation vector. For ease of exposition, the object to be sanitized is simply called the hidden activation in the following text.

Sanitizing for the convolutional layer and the fully connected layer.
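The dispatch between the two cases can be sketched as follows. This is an illustrative NumPy helper, not the authors' code: the name `sanitize_hidden_activation`, the `(channels, height, width)` layout and the placeholder sanitizer used in the demo are all assumptions.

```python
import numpy as np

def sanitize_hidden_activation(activation, sanitizer):
    """Apply `sanitizer` per feature map (conv output) or to the whole vector (FC output)."""
    activation = np.asarray(activation, dtype=float)
    if activation.ndim == 3:
        # Assumed layout (channels, height, width): each feature map is
        # sanitized independently of the other channels.
        return np.stack([sanitizer(fmap) for fmap in activation])
    if activation.ndim == 1:
        # Fully connected layer: the activation vector is sanitized directly.
        return sanitizer(activation)
    raise ValueError("expected a 3-D stack of feature maps or a 1-D vector")

if __name__ == "__main__":
    # Placeholder operation standing in for a real sanitizer: subtract the mean.
    center = lambda a: a - a.mean()
    conv_out = np.stack([np.ones((2, 2)), 2.0 * np.ones((2, 2))])
    print(sanitize_hidden_activation(conv_out, center))
    print(sanitize_hidden_activation(np.array([1.0, 2.0, 3.0]), center))
```

Because each channel is processed on its own, the two constant feature maps in the demo are both mapped to zero; a joint operation over the whole stack would not do this.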
We first present the non-local-means sanitizer, which is adapted from a common image-denoising approach, the non-local means algorithm [31]. Given the hidden activation a = {a(i) ; i ∈ N}, the sanitized value
Another approach we introduce for sanitizing hidden activations is the median-filtering sanitizer, which uses a common non-linear technique for image denoising: median filtering. Median filtering is known to excel at removing “salt-and-pepper” noise. Our analysis indicates that some adversarial attacks perturb hidden activations in a “salt-and-pepper” manner, so we expect this type of sanitizer to be effective in some cases.
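Median filtering on a single feature map can be sketched as follows. This is a generic NumPy illustration rather than the exact sanitizer evaluated here: the function name `median_filter_2d`, the 3×3 window and the edge padding are assumed choices.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def median_filter_2d(feature_map, size=3):
    """Replace each element by the median of its size x size neighbourhood."""
    feature_map = np.asarray(feature_map, dtype=float)
    pad = size // 2
    # Assumption: borders are handled by repeating the edge values.
    padded = np.pad(feature_map, pad, mode="edge")
    # windows[i, j] is the size x size patch centred on element (i, j).
    windows = sliding_window_view(padded, (size, size))
    return np.median(windows, axis=(-2, -1))

if __name__ == "__main__":
    fmap = np.zeros((5, 5))
    fmap[2, 2] = 5.0  # an isolated "salt" spike
    print(median_filter_2d(fmap))
```

An isolated spike is outvoted by its neighbours and disappears, which is why median filtering suits “salt-and-pepper” perturbations while leaving flat regions unchanged.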
The median-filtering sanitizer is defined as:
Besides the above two approaches, we also propose another novel approach for sanitizing hidden activations. Its formulation is as follows:
Because this sanitizer works like a “mask” that directly covers the elements with small values, it is called the mask sanitizer. The intuition behind this sanitizer is our observation that only the elements with large values in hidden activations are crucial to the final prediction of the network, whereas adversarial attacks can usually perturb only the elements with small values. Experiments show that the mask sanitizer is highly effective in improving the robustness of convolutional neural networks.
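One way to read this description is as a thresholding operation that zeroes out small elements and keeps large ones unchanged. The sketch below follows that reading only; the function name `mask_sanitize`, the fixed `threshold` parameter and the choice of zero as the masked value are assumptions, since how the cut-off is selected is not specified in the text above.

```python
import numpy as np

def mask_sanitize(activation, threshold):
    """Keep elements at or above `threshold`; cover (zero out) the smaller ones."""
    activation = np.asarray(activation, dtype=float)
    # Assumption: "small" is judged by value, which coincides with magnitude
    # for non-negative activations such as ReLU outputs.
    return np.where(activation >= threshold, activation, 0.0)

if __name__ == "__main__":
    act = np.array([0.05, 0.9, 0.2, 1.5])
    print(mask_sanitize(act, threshold=0.5))
```

Under this reading, a small perturbation added to an already small element is discarded together with that element, while the large elements that drive the prediction pass through untouched.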
In this section, we conduct experiments to demonstrate the effectiveness of our sanitizing approach. In Section 5.1, we present the experimental setup. In Section 5.2, we evaluate the effectiveness of our sanitizing approach. In Section 5.3, we present the comparisons with state-of-the-art defenses. In Section 5.4, we present a case study on MNIST-CNN to explain why our sanitizing approach can improve the adversarial robustness of convolutional neural networks.
Experimental setup
Effectiveness of sanitizing approach
On MNIST, CIFAR-10 and ImageNet, we evaluate the effectiveness of our sanitizing approach against three typical adversarial attacks: FGSM, C&W and PGD. The evaluation results are shown in Table 2.
Evaluation results of sanitizing approach on MNIST, CIFAR-10 and ImageNet
On the MNIST dataset, MNIST-CNN has a baseline accuracy of 98.65% on legitimate examples. On adversarial examples generated by FGSM, C&W and PGD, the accuracy of MNIST-CNN is 70.53%, 2.34% and 53.47%, respectively. The non-local-means sanitizer improves the accuracy of MNIST-CNN to 75.84%, 80.27% and 76.28% on adversarial examples generated by FGSM, C&W and PGD, respectively. The median-filtering sanitizer improves the accuracy of MNIST-CNN to 75.27%, 82.94% and 77.32% on adversarial examples generated by FGSM, C&W and PGD, respectively. The mask sanitizer improves the accuracy of MNIST-CNN to 81.95%, 87.28% and 83.57% on adversarial examples generated by FGSM, C&W and PGD, respectively.
Our sanitizing approach performs best against the C&W attack. For example, the mask sanitizer improves the accuracy of MNIST-CNN by 84.94 percentage points, from 2.34% to 87.28%, under the C&W attack. All three sanitizers improve the adversarial robustness of MNIST-CNN, and the mask sanitizer performs best among them. Note that the sanitizing approach has only a minor impact on the legitimate accuracy. For example, the mask sanitizer reduces the accuracy of MNIST-CNN on legitimate examples from 98.65% to 96.24%, a drop of 2.41 percentage points that we consider acceptable.
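The per-attack gains behind these statements can be recomputed directly from the MNIST accuracies reported above. The short script below does only that arithmetic; the function name `accuracy_gains` and the dictionary layout are illustrative.

```python
# MNIST-CNN accuracy (%) on adversarial examples, as reported in the text.
baseline = {"FGSM": 70.53, "C&W": 2.34, "PGD": 53.47}
sanitized = {
    "non-local-means": {"FGSM": 75.84, "C&W": 80.27, "PGD": 76.28},
    "median-filtering": {"FGSM": 75.27, "C&W": 82.94, "PGD": 77.32},
    "mask": {"FGSM": 81.95, "C&W": 87.28, "PGD": 83.57},
}

def accuracy_gains(baseline, sanitized):
    """Gain in percentage points of each sanitizer over the unprotected model."""
    return {
        name: {attack: round(acc - baseline[attack], 2) for attack, acc in accs.items()}
        for name, accs in sanitized.items()
    }

gains = accuracy_gains(baseline, sanitized)
for name, per_attack in gains.items():
    print(name, per_attack)
```

For every sanitizer the largest gain is on C&W, which is also where the unprotected baseline is lowest.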
The evaluation results on CIFAR-10 and ImageNet show concordance with those on MNIST. All of the three sanitizers can improve the adversarial robustness of networks, of which the mask sanitizer performs the best. On the CIFAR-10 dataset, the mask sanitizer achieves an accuracy of 76.18%, 80.21% and 78.34% on adversarial examples generated by FGSM, C&W and PGD, respectively. It reduces the legitimate accuracy from 90.67% to 89.26%. On the tiny ImageNet dataset, the mask sanitizer achieves an accuracy of 67.37%, 73.84% and 69.25% on adversarial examples generated by FGSM, C&W and PGD, respectively. It reduces the legitimate accuracy from 83.29% to 82.20%.
Defensive measures usually add extra cost to the normal computation of neural networks. In our experiments, the non-local-means sanitizer adds about 5% extra overhead. The median-filtering sanitizer and the mask sanitizer have very low computational complexity and add about 0.2% extra overhead.
We compare our sanitizing approach with three state-of-the-art defense techniques: PGD-based adversarial training [7], PixelDefend [14] and feature squeezing [21]. PGD-based adversarial training uses the PGD attack to generate adversarial examples and represents the current state of the art in adversarial training. PixelDefend purifies the perturbed adversarial examples. Feature squeezing reduces the color bit depth of each pixel to help the network recognize adversarial examples.
The comparison results on the MNIST dataset are shown in Fig. 3. PGD-based adversarial training shows the best performance against the FGSM attack (with an accuracy of 87.62%) and the PGD attack (with an accuracy of 89.60%). However, it is the least effective against the C&W attack (with an accuracy of 3.86%). Such results are consistent with the nature of adversarial training, which mainly defends against attacks of the same type as the attack used to generate its training examples. Feature squeezing performs worst overall, with accuracies of 28.42%, 75.25% and 25.39% on FGSM, C&W and PGD, respectively, followed by PixelDefend, with accuracies of 73.40%, 85.49% and 75.74% on FGSM, C&W and PGD, respectively. Our mask sanitizer performs best overall, achieving accuracies of 81.95%, 87.28% and 83.57% on adversarial examples generated by FGSM, C&W and PGD, respectively.

Comparison results on MNIST.
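To make “best overall” concrete, the MNIST accuracies quoted above can be aggregated per defense. The mean and worst-case accuracy used below are illustrative summary statistics chosen for this sketch, not metrics reported in the experiments; the name `summarize` is likewise hypothetical.

```python
# MNIST accuracy (%) under FGSM, C&W and PGD, in that order, as quoted in the text.
accuracy = {
    "PGD-based adversarial training": [87.62, 3.86, 89.60],
    "PixelDefend": [73.40, 85.49, 75.74],
    "feature squeezing": [28.42, 75.25, 25.39],
    "mask sanitizer": [81.95, 87.28, 83.57],
}

def summarize(accuracy):
    """Return (mean accuracy, worst-case accuracy) over the three attacks."""
    return {name: (sum(v) / len(v), min(v)) for name, v in accuracy.items()}

summary = summarize(accuracy)
for name, (mean_acc, worst_acc) in sorted(summary.items(), key=lambda kv: -kv[1][0]):
    print(f"{name}: mean {mean_acc:.2f}, worst case {worst_acc:.2f}")
```

Adversarial training wins on two of the three attacks but has by far the lowest worst case, which is the sense in which a defense that is strong against only one attack family generalizes poorly.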
The comparison results on CIFAR-10 and ImageNet are shown in Figs. 4 and 5. We can see that whether on the CIFAR-10 dataset or on the ImageNet dataset, our mask sanitizer generally outperforms PGD-based adversarial training, PixelDefend and feature squeezing. Our sanitizing approach is a more generalized approach that can defend against different kinds of attacks.

Comparison results on CIFAR-10.

Comparison results on ImageNet.
To explain why the sanitizing approach can help the network correctly recognize adversarial examples, we carry out a case study on MNIST-CNN. As in Section 3, we feed the network with a legitimate “0” image and an adversarial “0” image, respectively, and then observe the changes in hidden activations. The difference is that this time the model is a sanitized MNIST-CNN, which uses the mask sanitizer. The hidden activations are shown in Fig. 6. On the sanitized MNIST-CNN, the differences that were sharp in the hidden activations of the un-sanitized MNIST-CNN are much less pronounced. Especially for the last two layers, FC7 and FC8, the hidden activations of the adversarial “0” image are largely consistent with those of the legitimate “0” image.

Hidden activations of the sanitized MNIST-CNN when the model is fed a legitimate “0” image and an adversarial “0” image, respectively. Both the legitimate “0” image and the adversarial “0” image (generated by PGD attack) can be correctly recognized by the network. The sanitizing operations are implemented on CONV1, CONV2, CONV4, CONV5, FC7 and FC8. For convolutional and max-pooling layers, the feature maps for each pair of examples are from the same channels of hidden activations. For fully connected layers FC7 and FC8, the activation vector is reshaped to a matrix with size 20×10.
Our sanitizing approach can eliminate the adversarial perturbations in hidden activations, which helps the network correctly recognize the adversarial examples. Please note that the hidden activation images in Fig. 6 are not obtained by sanitizing each hidden activation image in Fig. 1. That is because the network is a “dynamic” model. The hidden activation of the former layer is first sanitized, and then input to the next layer for recalculation. So the next sanitizing operation is implemented on a new hidden activation.
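The distinction between sanitizing inside the forward pass and sanitizing recorded activations afterwards can be illustrated with a toy example. The `forward` helper, the two toy layers and the threshold-style sanitizer below are all hypothetical stand-ins, not MNIST-CNN or the evaluated sanitizers.

```python
import numpy as np

def forward(x, layers, sanitizer=None, sanitize_at=()):
    """Run `layers` in order; sanitize the output of the layers listed in `sanitize_at`."""
    activations = []
    a = x
    for idx, layer in enumerate(layers):
        a = layer(a)
        if sanitizer is not None and idx in sanitize_at:
            # The sanitized activation, not the raw one, is passed on.
            a = sanitizer(a)
        activations.append(a)
    return activations

# Toy stand-ins for two hidden layers and a threshold-style sanitizer.
toy_layers = [lambda a: 2.0 * a, lambda a: a + 1.0]
toy_sanitizer = lambda a: np.where(a >= 1.0, a, 0.0)

if __name__ == "__main__":
    x = np.array([0.2, 1.0])
    print(forward(x, toy_layers))
    print(forward(x, toy_layers, toy_sanitizer, sanitize_at=(0,)))
```

In the sanitized run the second layer receives the cleaned output of the first, so its activation differs from what would be obtained by sanitizing the second-layer activation of the plain run.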
For quantitative analysis, we define a layer-wise distance between hidden activations of adversarial and legitimate examples. The distance is a normalized Euclidean distance and is computed as:
We randomly select 100 legitimate MNIST examples and the corresponding adversarial examples (generated by PGD attack). We then compute the average distances on different layers of MNIST-CNN before and after applying sanitizing operations. The results are shown in Fig. 7. Before sanitizing, the distances between adversarial examples and legitimate examples increase layer by layer. After sanitizing, the distances are effectively reduced. This further demonstrates that our sanitizing approach makes adversarial examples behave consistently with legitimate examples.

Normalized Euclidean distances between hidden activations of adversarial and legitimate examples on different layers of MNIST-CNN.
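A layer-wise distance of this kind could be computed as in the sketch below. The normalization is an assumption: here the Euclidean distance is divided by the L2 norm of the legitimate activation, which is only one possible choice, and the names `layer_distance` and `average_layer_distances` are hypothetical.

```python
import numpy as np

def layer_distance(act_adv, act_leg, tiny=1e-12):
    """Euclidean distance between two activations, relative to the legitimate one."""
    adv = np.ravel(np.asarray(act_adv, dtype=float))
    leg = np.ravel(np.asarray(act_leg, dtype=float))
    # Assumed normalization: divide by the norm of the legitimate activation;
    # `tiny` guards against division by zero for all-zero activations.
    return float(np.linalg.norm(adv - leg) / (np.linalg.norm(leg) + tiny))

def average_layer_distances(adv_acts, leg_acts):
    """Average the per-layer distance over examples.

    Both arguments are lists over examples, each a list of per-layer activations.
    """
    per_example = [
        [layer_distance(a, l) for a, l in zip(ex_adv, ex_leg)]
        for ex_adv, ex_leg in zip(adv_acts, leg_acts)
    ]
    return np.mean(per_example, axis=0)

if __name__ == "__main__":
    leg = [[np.array([3.0, 4.0]), np.array([1.0, 0.0])]]
    adv = [[np.array([3.0, 4.5]), np.array([1.0, 1.0])]]
    print(average_layer_distances(adv, leg))
```

Feeding this function the activations recorded with and without sanitizing would give the two curves compared per layer.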
We proposed an approach, termed sanitizing hidden activations, to improve the adversarial robustness of convolutional neural networks. The sanitizing approach is designed to eliminate or reduce the perturbations in hidden activations to help the network correctly recognize adversarial examples. Specifically, we designed three types of sanitizers: the non-local-means sanitizer, the median-filtering sanitizer and the mask sanitizer. We systematically evaluated the effectiveness of these sanitizers on the MNIST, CIFAR-10 and ImageNet datasets. On the MNIST dataset, our best sanitizer achieves accuracies of 81.95%, 87.28% and 83.57% on adversarial examples generated by FGSM, C&W and PGD, respectively. On the CIFAR-10 dataset, our best sanitizer achieves accuracies of 76.18%, 80.21% and 78.34% on adversarial examples generated by FGSM, C&W and PGD, respectively. On the ImageNet dataset, our best sanitizer achieves accuracies of 67.37%, 73.84% and 69.25% on adversarial examples generated by FGSM, C&W and PGD, respectively. The experimental results show that our proposed sanitizing approach can effectively improve the adversarial robustness of convolutional neural networks.
The main limitation of our work is that the sanitizing approach can be applied to any of the convolutional or fully connected layers in a convolutional neural network, and deciding which layers to sanitize for the best effect is a hyperparameter optimization problem. In practice, finding the optimal configuration is time-consuming, although such tuning is an unavoidable task in machine learning.
In future work, we will concentrate on designing more effective sanitizing approaches, researching new techniques for defending against adversarial examples and explaining the reason for the existence of adversarial examples.
