Abstract
With the help of network compression algorithms, deep neural networks can be applied on low-power embedded systems and mobile devices such as drones, satellites, and smartphones. Filter pruning is a sub-direction of network compression research, which reduces memory and computational consumption by reducing the number of parameters of model filters. Previous works utilized the “more-simple-less-important” criterion for pruning filters. That is, filters with the smaller norm or more sparse weights in the network are preferentially pruned. In this paper, we found that feature maps are not fully positively correlated with the sparsity of filter weights by observing the visualization of feature maps and the corresponding filters. Hence, we came up with the idea that the priority of filter pruning should be determined by redundancy rather than sparsity. The redundancy of a filter is the measure of whether the output of the filter is repeated with other filters. Based on this, we defined a criterion called redundancy index to rank the filters and introduced it into our filter pruning strategy. Extensive experiments demonstrate the effectiveness of our approach on different model architectures, including VGGNet, GoogleNet, DenseNet, and ResNet. The models compressed with our strategy surpass the state-of-the-art in terms of Floating Point Operations Per Second (FLOPs), parameters reduction, and classification accuracy.
Introduction
The emergence of the convolutional neural network (CNN) has pushed deep learning to the forefront of artificial intelligence. With the rapid development of GPU hardware support, the volume of CNN has become larger and larger. Super large-scale models follow one after another, covering various fields such as speech recognition [1], image understanding [2], and natural language processing [3]. Relying on the unique ability of convolution operation to extract deep features, these deep neural network models have greatly surpassed humans in many tasks and landed in a wide range of application scenarios, such as cancer detection [4], autonomous driving [5], game strategy search [6], and so on. In actual application scenarios, developers often need to trade-off between limited computing resources and excellent model performance. Therefore, the demand for network compression and acceleration for edge computing devices has arisen.
From the literatures, we can see six distinct families of methods for network compression: parameter sharing [7], network pruning [8], weight quantization [9], low-rank decomposition [10], knowledge distillation [11], and network architecture search [12]. Among them, structured pruning of filters [8, 13, 14, 15, 16, 17, 18] is the most intuitive research branch. Its purpose is to reduce the number of filters while retaining the feature extraction capability of convolution to the greatest extent. Specifically, it removes all nodes of the specified filter and leaves the network with stable structures for the efficient use of the Basic Linear Algebra Subprograms. This will benefit the pruned model both in volume (parameter quantity) and efficiency (Floating Point Operations Per Second). Therefore, in this research paradigm, the core of the problem lies in the choice of filters. A primary criterion is to sort the filters in order of importance and then prioritize the pruning of the non-essential ones.
A torrent of research yielded empirical and theoretical insights to describe the importance of filters. For instance, Li et al. [8] proposed a magnitude-based method that employs the
In our opinion, such an assumption is not reasonable. To further analyze this problem, we should go back to the essence of the convolution kernel. As a feature extractor, the convolution kernel is a small matrix, which is slid across the image and multiplied with the input. It is used for blurring, sharpening, embossing, edge detection, and more. The simplicity of the value in the convolution kernel does not accurately define its effectiveness. Perhaps only an elementary filter is enough to extract very accurate features for some specific features. In this case, the filter is effective, even if it has a small norm or a low rank. Pruning this filter will significantly damage the network performance. Therefore, simplicity should not be the criterion for pruning. Hence, we propose a new perspective, namely redundancy. The so-called redundancy principle considers whether the filters overlap with other filters in their feature extraction capabilities. If there is overlap, at least one of these filters is redundant. Pruning redundant filters can ensure that the feature extraction capability has remained and the performance of the model is not weakened. According to this strategy, when we sort filters, the importance should be determined by redundancy rather than simplicity.
Based on the above insight, we re-examined the relationship between the filters and the feature maps. Through visualization (detail in Section 3), we found that the feature maps are invariably similar for some filters, even if the input is changed. In other words, although these filters have different ways to extract features, they have similar feature extraction results. It means these filters are effective but functionally redundant. Therefore, we propose a filter pruning strategy founded on the redundancy evaluation of filters called
The framework of our method is called 
In summary, the contributions of FP-FMC are three-fold:
We visually explored the filters and feature maps in the same convolutional layer, and then empirically described the relationship between the redundancy and effectiveness of filters. We proposed a novel quantitative criterion to define the redundancy of filters, namely, the redundancy index. Based on the above criteria, we designed a new filter pruning strategy and verified its effectiveness on various types of network blocks.
The rest of paper is organized as follows. Firstly, we present a review of the related work in Section 2. Next, in Section 3, we describe observations about the relationship between filters and feature maps, then introduce the definition of filter redundancy which is the background and footstone of our research. The proposed pruning framework is explained in Section 4. A quantitative evaluation on public benchmark datasets as well as hyper-parameter experiments are shown in Section 5. Lastly, the paper concludes in Section 6.
A large body of work exists on the compression and acceleration of convolutional neural networks. Efforts in this field can be divided into parameter sharing [7], network pruning [8], weight quantization [9], low-rank decomposition [10], knowledge distillation [11], and network architecture search [12]. In this section, due to the limited space, we focus on pruning-based methods that are more related to our work. The pruning-based approaches aim to remove the unnecessary connections of the neural network. Depending on whether the removed connections are structural or not, we split them into unstructured and structured pruning.
Non-structural pruning.
Unstructured pruning aims to remove weights at arbitrary position. This work date back to the 1990s [21, 22]. As the most impactful work, [23] removes all weights below a threshold in a generic iterative way. Inspired by this work, the authors of [24] proposed Dynamic Network Surgery, aka DNS, to rebuild the weight connections in the dynamic learning process. Furthermore, another branch of work looked at different definitions of thresholds used to prune weights. Such as, HashedNets [7] used a hash function to randomly group connection weights into hash buckets, and all connections within the same hash bucket share a single parameter value. Besides, [25] treated the weight pruning of the network as a nonconvex optimization problem and then solved the decomposed subproblems by the alternating direction method of multipliers (ADMM). Although unstructured pruning significantly reduces the model size, these methods suffer from the same problem. That is, the compressed models became sparse, and these irregular weight matrices are difficult to parallel implement on commonly available hardware. This characteristic severely hinders the acceleration of the pruned network compared with the original one. To circumvent this issue, another line of work aims to remove the redundant structural parts, such as channels, filters or layers. These pruning methods produce networks that are easier to accelerate in inference time without specialized hardware or software support.
Structural pruning.
Structured pruning removes structured parts such as filters, channels or layers without changing the original convolutional structure. Various parallel off-the-shelf Basic Linear Algebra Subprograms libraries support this operation without specialized hardware. Among them, filter pruning is the most fine-grained method, which removes the entire filter according to specific metrics. It preserves the feature extraction capability of current convolution operations to the greatest extent while balancing computational speed and model size. The research of this branch mainly focuses on the selection criteria of filters. Some works aim to exploit the intrinsic properties of filters, such as
Our work, FP-FMC, also lies in the filter pruning branch. The difference is that for the principle of filter selection, we are not entangled in the simplification of parameters at the numerical level of filters or feature maps but focus on the functional demands of the convolution operation. In particular, we rethink the function and goal of filters by observing the feature map outputs of convolution operation as described in Section 3. Then, we find that there is redundancy in feature extraction of different filters. So, filter redundancy is come up as an indicator for pruning. This filter selection criterion is feasible to remove a large portion of connections without a significant performance drop compared with the previous works. In a word, this pruning principle designed from the essence of the filter is the reason why the performance of our FP-FMC can be among the best in different network structures.
Background
Before elaborating on our work, we review the convolution operation first. The purpose of convolution operation is to extract features. More specifically, each convolutional layer of a CNN is composed of a batch of filters. A filter is a collection of convolution kernels, and a convolution kernel refers to a 2D array of weights. The weights of the filter determine what specific features are detected. The convolution operation is a series of floating-point operations between the filter weights and the input. Through convolution operation, the filter filters out the information that it does not care about and draws out the features that it cares about, such as edges, highlight areas, backgrounds, and so on. After processing the input by filters, we get the output called feature maps.
Structural pruning compression aims to remove the specified filters. In order to minimize the damage to the feature extraction ability of the convolutional layer, we need to prune relatively redundant filters first. However, there is neither a unified definition of redundancy for filters nor a conventional route to know whether a filter is redundant. Next, we might as well try to define it in a visual way according to the purpose of the convolution operation.
We chose the first convolutional layer of the pre-trained VGGNet [28] as the observation object. It was trained on the CIFAR-10 [29] dataset to classify among ten classes. We chose it for three reasons: Firstly, as an images classification model, the feature extraction of the low convolutional layer is mainly for human-understandable information such as contours, edges, backgrounds, and highlights. These feature maps are more interpretable when viewed. Secondly, the channel number of the input for this convolutional layer is three, making all the kernels of the filters just right to be visualized as an RGB color image, which is more convenient for observation. It is worth noting that the visualization of the RGB color image makes the feature map look very similar to the heat map, but they are fundamentally different. The feature map is the output after feature extraction, which aims to strengthen some explicit or implicit visual features. It does not have the function of heat map for data statistics. Thirdly, the number of filters is 64. This number is suitable for the human eye to observe their relevance without being too dazzling.

In the first row of Fig. 2a, we visualized 64 filters and randomly picked six images from different categories to get the feature maps. The first thing we discovered was that the values of some feature maps are almost zeros, which means that their corresponding filters hardly extract any valuable features. In other words, these filters are ineffective. These zero feature maps will vanish due to the cross-channel addition by convolution operation in the next layer. That is to say, they will not affect the feature extraction of other filters but indeed increase the amount of useless floating-point calculations. This type of filter is obviously redundant. All existing filter pruning algorithms can pick them out easily, although the principles of each method are different. For example, some are based on sparsity [27], some on the norm [14], and some on statistics such as rank [17] or entropy [18].
Upon further observation, we found an interesting phenomenon. It is that some feature maps are always alike, although they come from different filters. Moreover, this similarity does not seem to be affected by the change in input. For instance, the feature maps of filters 52, 57, and 61 are always similar regardless of whether the input image is aircraft or frog. However, this similarity is not always stable. Take the feature maps corresponding to the 60 and 62 filters as an example. The feature maps under input 1 and input 3 are very similar, but the feature maps of input 5 and input 6 are obviously different. Based on the above observations, we make the following conjecture. Each filter is good at extracting a particular type of feature, such as contours, colors, bright parts, dark parts, and so on. If two filters excel at capturing the same feature type, the feature maps are always similar for them. In this case, these two filters are all effective, which means they both obtain the valuable and same feature of the input. Nevertheless, in some cases, there may also be two kinds of features in the same area. For example, the bright part of the image is also its green area. In this case, two filters may also get similar feature maps. At this time, although the feature maps are similar, the feature types the two filters focus on are not the same. From the perspective of the feature extractor, although both of these two cases output similar feature maps, the former case is a redundant filter, while the latter is not.
In brief, we have observed four situations:
the feature maps are all zeros; the feature maps of different filters are always similar; the feature maps of different filters are occasionally similar and occasionally different; the feature maps of filters are unique.
For the first situation, zero feature maps mean the filter is useless for the model. If there is only one such zero feature map, pruning it or not has little effect; if there are multiple such feature maps, then the situation is merged into the second one; that is, these feature maps are always similar. For the second situation, these same feature maps indicate that the filters overlap with each other in feature extraction capabilities. For the purpose of convolution operation, even if the weights of filters are entirely different, their functions are the same. So at least one of them is redundant, and filters under this situation are the focus of filter pruning. For the third situation, the same feature maps are caused by the co-occurrence features rather than the similar feature extractor. Some features are naturally related. Even though they exist independently, they often appear together. Filters corresponding to these features are efficient and different, but their correlation is high. When the demand for network compression ratio is high, we can consider pruning the filter under this situation. Doing so will inevitably affect the performance of the network but is not fatal. For the last situation, these filters are not only practical but also irreplaceable. Once pruning them will significantly affect the performance of the model. Therefore, we should keep them as much as possible when performing filter pruning operations.
Depending on the above analysis, we conclude a standard to define filter redundancy:
Preliminaries
Firstly, we would like to introduce the symbols and notations. Assume that a convolutional neural network model has
Filter pruning via feature map clustering (FP-FMC)
As elaborated in Section 3, we propose a new criterion for defining filter redundancy: the higher the homogeneity of the feature maps is, the more redundant the filter is. Therefore, the key lies in effectively quantifying the homogeneity among feature maps. The most direct way is to cluster the feature maps. The feature maps in the same cluster are more homogeneous than those in different clusters. Nevertheless, there are three obstacles here:
The feature maps are data points in a high-dimensional space, and the traditional clustering method cannot effectively measure their spatial distances in such space; The number of clusters is a hyper-parameter that significantly affects the clustering results. So how to choose an appropriate Refer to the third situation discussed in Section 3. Sometimes the root cause of similar feature maps is not the redundant filters but the highly correlated input features. How to effectively distinguish this situation is also a hard nut to crack.
Next, we solve the above problems one by one.
Distance measurement of feature maps
To start with, we use dimensionality reduction to solve the problem that traditional distance calculation methods are invalid in high-dimensional spaces. Among them, t-SNE [31] is a highly recommended solution. It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data into a low-dimensional space, especially into two or three dimensions. Precisely, it models each high-dimensional object by a 2D or 3D point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability. This characteristic just fits our purpose of seeking homogeneous feature maps. As shown in Fig. 2b, we utilize t-SNE to reduce the dimensions of six groups of feature maps in Fig. 2a to two-dimensional, respectively. Then we draw them with the corresponding filter ID number. Looking closely at Fig. 2b, you will find that the similarity relationship between the feature maps processed by t-SNE is precisely the same as our analysis of the visualization results in Section 3.
In the implementation, we used a trick. That is, before t-SNE, we normalized each feature map as follows.
On the one hand, this operation scales all the feature maps to a consistent interval
After solving the problem of similarity measure, traditional clustering algorithms can come in handy. We picked the K-means [30]. First of all, for most convolutional layers, the number of feature maps is not much, generally not more than 512, which is the proper interval where K-means excels. Secondly, K-means has a fast convergence speed and a good clustering performance. According to Occam’s razor principle, K-means is a reasonable choice to solve the current problem. Then we face a new challenge, determining the hyper-parameter of K-means, that is, the number of clusters
Here we have adopted a violent optimization manner. We first obtain feature maps of
where
Therefore,
where
Precisely, the purpose of this step is to determine the hyper-parameters, which should be the preparatory step of our algorithm. But it is the most time-consuming step. Fortunately, once the
In the implementation, we used another trick here. Before clustering, we use the 1.5
K-means is a clustering algorithm that is very sensitive to outliers. Removing outliers can increase the robustness of K-means. For feature maps, outlier means the corresponding filters are the unique ones that differ from others a lot. This corresponds to the fourth situation discussed in Section 3. Marking them out facilitates the subsequent pruning step.
Now, we come to the most important step. Take the convolutional layer
where
Thus, for each input, we get
In this equation,
where
Moreover,
That is, if
In total, the redundancy index for filter
With these
Redundancy Quantification
In short, the redundancy quantization algorithm scores filter by summing the comparison of pairwise clustering results. For the outlier filter, we naturally think it is not redundant and set its redundancy index to
The above is the mathematical definition of redundancy index. Intuitively, we draw the redundancy indices for
The redundancy index of GoogleNet in 
In summary, there are three steps for FP-FMC:
For t-SNE feature maps in each convolution layer Calculate the redundancy indices of all filters as described in Algorithm 4.2.3 and sort them in descending order. According to the compression ratio
In short, the core idea of this paper is to propose a new pruning strategy based on redundancy quantification of filters. Although there are some details can be optimized, the overall strategy is implementable, both theoretically and technically. In the next section, the superiority of this strategy far beyond its peers on various network architectures will be fully demonstrated in the experiments, which is exactly the value of this work.
In this section, we present the experimental evaluation of FP-FMC in four network architectures with two challenging datasets. We show the robustness and the flexibility of our proposed filter pruning approach. Also, we conducted analysis experiments on the hyper-parameters used in our method. The source codes are publicly available on GitHub,1 along with the log files which produce all experiment results. To sum up, we can say that the proposed method outperforms the counterparts in terms of Floating Point Operations Per Second (FLOPs), parameters reduction, and classification accuracy.
Experimental settings
Datasets and Baselines
To demonstrate the performance of our method on filter pruning, we used two datasets, CIFAR-10 [29] and ImagNet [34]. The former consists of 60,000
Evaluation Protocols
For classification tasks, we use classification accuracy to measure the performance of models. Specifically, we adopt top-1 accuracy for CIFAR-10, top-1 and top-5 accuracy for ImageNet. For the evaluation of model compression, we selected the number of parameters to assess model size and required Floating Points Operation Per Second to assess calculate consumption, respectively. Correspondingly, we also labeled the compression percentage relative to the original model on these two indicators.
Configurations
Our codes are built on Pytorch [37]. In the fine-tune period, we optimize the objective using Adam [38] with a batch size of 256 and a learning rate of 2E-2. The weight decay and momentun are set to 5E-3 and 0.9. The fine-tune iterations are 300 for all models. All experiments are conducted on NVIDIA GeForce RTX 3090.
Compression and analysis
The experiments of model compression involve six models which contain four different network blocks, including VGGNet [28], DenseNet [35], GoogleNet [33], ResNet-56, ResNet-110 and ResNet-50 [36]. we compare and analyse with other state-of-the-art filter pruning approaches as well as the original models. The filter pruning methods used for comparison are all proposed in the past five years. The experimental results of all other methods in the following tables are all from the original paper, respectively. For our FP-FMC, we use the given compression ratios to calculate the number of filters in each convolutional layer of the pruned model, and use this number as a new configuration to initialize the target model. Then, according to the redundancy index, copy the weights of specified filters from the benchmark model to the target model. Finally, fine-tune the pruned model on the target dataset.
VGGNet on CIFAR-10
As for VGGNet [28], the network architecture we used is VGG16-bn, which contains a linear structure. The VGG16-bn consists of 13 convolutional layers and three fully-connected layers. Moreover, a batch normalization operation follows each convolutional layer. The benchmark model was downloaded from the online public resource.2
Because the number of filters in the last convolutional layer needs to be aligned with the fully-connected layer, it is not allowed to change. Therefore, in addition to the last convolutional layer, filter pruning operations can be performed on the first 12 convolutional layers. All 12 convolutional layers can be compressed by user-specified ratios. We selected seven methods to compare, and the detailed results are in Table 1. For the best result of FP-FMC in Table 1, the compression ratios of each layer are [0.3]*7
Pruning results of VGGNet on CIFAR-10
Pruning results of VGGNet on CIFAR-10
Fine-tuning results of VGG16-bn pruning models under two compression ratios configuration compared with the baseline.
Fine-tuning results of DenseNet-40 pruning models under three compression ratios configuration compared with the baseline.
The competition on DenseNet [35] is fierce, and the gap between participants is not significant. The network architecture we used is DenseNet-40. The benchmark model was downloaded from the online public resource.3
Specifically, DenseNet-40 contains three dense blocks and transition layers between every two dense blocks. There are 12 convolution operations in each dense block and one in the transition layer. Filter pruning can be performed on all 38 convolutional layers (
Pruning results of DenseNet on CIFAR-10
Pruning results of DenseNet on CIFAR-10
Pruning results of GoogleNet on CIFAR-10
For the comparison on GoogleNet [33], FP-FMC is the absolute overlord. Both in terms of parameters and computational consumption, it surpasses others by a large margin. The network architecture we used is the GoogleNet with nine inception modules. The benchmark model was downloaded from the online public resource.4
In detail, for each inception module, there are four
Fine-tuning results of GoogleNet pruning models under two compression ratios configuration compared with the baseline.
For ResNet [36] on CIFAR-10 [29], the network architectures we used are ResNet-56 and ResNet-110. The benchmark models were downloaded from the online public resources.5 ,6 For ResNet-56, there are 55 convolutional layers and one fully-connected layer. In addition to the first convolutional layer, there are three groups of modules at channel numbers of 16, 32, and 64, respectively. Each group contains 9 serial residual blocks. And there are two convolution operations in each residual block. In total, filter pruning can be performed on 54 convolutional layers except for the last one, which must be unchanged to align the fully-connected layer. Similar to ResNet-56, there are 109 convolutional layers and one fully-connected layer in ResNet-110. In addition to the first convolutional layer, there are three groups of modules at channel numbers of 16, 32, and 64, respectively. Each group contains 18 serial residual blocks. And there are two convolution operations in each residual block. In total, filter pruning can be performed on 108 convolutional layers except for the last one, which must be unchanged to align the fully-connected layer.
Fine-tuning results of ResNet-56 pruning models under three compression ratios configuration compared with the baseline.
Fine-tuning results of ResNet-110 pruning models under three compression ratios configuration compared with the baseline.
The compression rates of ResNet are somewhat different from the above models. The given compression rates have two parts. The first half specifies the pruning configuration of the convolution layer before each residual block to make sure the output of the last residual block is well-matched for the input of the next residual block. And the second half specifies the pruning configuration of the convolution layers inside the residual blocks. Take [0.0]
Now, let us move to the comparison results of FP-FMC and other pruning methods. The situation on ResNet [36] is similar to GoogleNet. Whether it is ResNet-56 or ResNet-110, it has an overwhelming advantage in all indicators and surpasses the benchmark model significantly. Whether it is ResNet-56 or ResNet-110, the optimal model after compression exceeds the performance of the baseline. When the floating-point operation of the model is further compressed to less than 30%, FP-FMC yields an impressive classification accuracy (92.26% for ResNet-56 and 93.15% for ResNet-110). Figures 7 and 8 show the fine-tuning process for ResNet-56 and ResNet-110 under three compression ratios. The compression ratios for ResNet-56 are [0.0]
Pruning results of ResNet-56 on CIFAR-10
Pruning results of ResNet-110 on CIFAR-10
Pruning results of ResNet-50 on ImageNet
In control experiments on ImageNet [34], we compare the performance of the ResNet-50 model. The structure of ResNet-50 is more complicated. There are 53 convolutional layers and one fully-connected layer. In addition to the first convolutional layer, there are four groups of modules at channel numbers of 256, 512, 1024, and 2048 respectively. Each group contains 3, 4, 6, and 3 serial residual blocks. And there are three or four convolution operations in each residual block depending on whether there is downsample requirement. In total, filter pruning can be performed on 52 convolutional layers except for the last one, which must be unchanged to align the fully-connected layer. From the perspective of classification accuracy, the advantages of FP-FMC still exist. In addition to the same three metrics as in the CIFAR-10 experiments, we also compared the top-5 classification accuracy. As shown in Table 6, in addition to the benchmark model, FP-FMC surpasses all other models in the top-1 accuracy, and is equal to the current optimal GM [15] in the top-5 accuracy. Similarly, reducing the model compression ratio below the lowest record of existing methods, using only less than 30% of the original Floating Points Operation, our method only reduces the top-1 accuracy rate of 3.8% and the top-5 accuracy rate of 2.13% compared to the benchmark. In fact, due to the huge data volume of ImageNet, the fine-tune of 300 iterations cannot achieve the optimal. In theory, with more fine-tuning, FP-FMC is likely to surpass the benchmark. However, for the fairness of the controlled experiment, we only compare the results under 300 iterations fine-tunes. We also listed the compression ratio configurations as before for ResNet-50. They are [0.0]
Hyper-parameter analysis: The choice of cluster numbers
As explained in Section 4.2, the cluster number is the only hyper-parameter that affects the performance of our model. We use a crude solution to obtain the optimal
where
As shown in Fig. 9, there is the statistical visualization of the optimal
Statistical visualization of the optimal 
In this paper, we observed the visualization results of the feature maps and filters to propose a novel filter pruning criterion which is dubbed as the redundancy index. Under the guidance of the redundancy index, we have constructed the filter pruning algorithm called FP-FMC. We performed extensive experiments that show our algorithm is able to outperform state-of-the-art model compression approaches on almost all network architectures. Such a pruning algorithm can improve the performance of baseline models and can also be utilized as an accelerated technique to benefit downstream resource-scarce tasks. Furthermore, introducing traditional machine learning algorithms into the deep feature interpretation process would open doors for many black-box problems in deep learning.
Footnotes
Code:
VGG16-bn benchmark model:
DenseNet-40 benchmark model:
GoogleNet benchmark model:
ResNet-56 benchmark model:
ResNet-110 benchmark model:
Acknowledgments
This work is funded by the Key R&D Program of Zhejiang (No. 2022C03126) and the National Natural Science Foundation of China (No. 61773336). We would like to extend our gratitude to data providers.
