Filter pruning via feature map clustering

Abstract

With the help of network compression algorithms, deep neural networks can be deployed on low-power embedded systems and mobile devices such as drones, satellites, and smartphones. Filter pruning is a sub-direction of network compression research that reduces memory and computational cost by reducing the number of filter parameters in the model. Previous works used the “more-simple-less-important” criterion for pruning filters; that is, filters with smaller norms or sparser weights are pruned first. In this paper, by visualizing feature maps together with their corresponding filters, we found that the feature maps are not fully positively correlated with the sparsity of the filter weights. Hence, we propose that the priority of filter pruning should be determined by redundancy rather than sparsity. The redundancy of a filter measures whether its output duplicates the output of other filters. Based on this, we define a criterion called the redundancy index to rank the filters and introduce it into our filter pruning strategy. Extensive experiments demonstrate the effectiveness of our approach on different model architectures, including VGGNet, GoogleNet, DenseNet, and ResNet. The models compressed with our strategy surpass the state-of-the-art in terms of reduction in floating-point operations (FLOPs), parameter reduction, and classification accuracy.

Keywords

Filter pruning; clustering; K-means; dimensionality reduction; t-SNE

1. Introduction

The emergence of the convolutional neural network (CNN) has pushed deep learning to the forefront of artificial intelligence. With the rapid development of GPU hardware, CNNs have grown larger and larger. Very large-scale models have appeared one after another, covering fields such as speech recognition [1], image understanding [2], and natural language processing [3]. Relying on the ability of the convolution operation to extract deep features, these deep neural network models have surpassed humans in many tasks and have been deployed in a wide range of application scenarios, such as cancer detection [4], autonomous driving [5], and game strategy search [6]. In practical applications, developers often need to trade off limited computing resources against model performance. This has created a demand for network compression and acceleration on edge computing devices.

From the literature, we can identify six distinct families of methods for network compression: parameter sharing [7], network pruning [8], weight quantization [9], low-rank decomposition [10], knowledge distillation [11], and network architecture search [12]. Among them, structured pruning of filters [8, 13, 14, 15, 16, 17, 18] is the most intuitive research branch. Its purpose is to reduce the number of filters while retaining the feature extraction capability of convolution to the greatest extent. Specifically, it removes entire filters and leaves the network with a regular structure that can make efficient use of the Basic Linear Algebra Subprograms (BLAS). This benefits the pruned model both in size (number of parameters) and in efficiency (number of floating-point operations, FLOPs). Therefore, in this research paradigm, the core of the problem lies in the choice of filters. The primary approach is to rank the filters by importance and then prune the least important ones first.

A large body of research has yielded empirical and theoretical insights into the importance of filters. For instance, Li et al. [8] proposed a magnitude-based method that employs the $\ell_{1}$ -norm of filters to prune networks. Furthermore, in [19], the sparsity of the feature map is used to determine its importance. The works in [13, 20, 17] perform filter pruning by calculating statistics of filters or feature maps. He et al. [15] came up with the idea that the filters farther from the geometric median of all filters can replace the closer ones. However, the above approaches all follow the same criterion, “more-simple-less-important”. Simple here can mean smaller norms, sparser weights, closer proximity to the geometric median, or smaller statistics. This principle of simplicity rests on a strong assumption: the simpler the filter weight or feature map is, the safer it is to prune. Because of their simplicity, such filters are considered useless or replaceable by more complex filters, so pruning them is assumed to have the least impact on the network.

In our opinion, such an assumption is not reasonable. To further analyze this problem, we should go back to the essence of the convolution kernel. As a feature extractor, the convolution kernel is a small matrix that is slid across the image and multiplied with the input. It is used for blurring, sharpening, embossing, edge detection, and more. The simplicity of the values in the convolution kernel does not accurately define its effectiveness. For some specific features, a very simple filter may be enough to extract them accurately. In this case, the filter is effective, even if it has a small norm or a low rank, and pruning it will significantly damage the network performance. Therefore, simplicity should not be the criterion for pruning. Instead, we propose a new perspective, namely redundancy. The redundancy principle considers whether a filter overlaps with other filters in feature extraction capability. If there is overlap, at least one of these filters is redundant. Pruning redundant filters ensures that the feature extraction capability is retained and the performance of the model is not weakened. According to this strategy, when we rank filters, their importance should be determined by redundancy rather than simplicity.

Based on the above insight, we re-examined the relationship between the filters and the feature maps. Through visualization (detailed in Section 3), we found that for some filters the feature maps are invariably similar, even when the input changes. In other words, although these filters extract features in different ways, they produce similar results. This means that these filters are effective but functionally redundant. Therefore, we propose a filter pruning strategy founded on the redundancy evaluation of filters, called Filter Pruning via Feature Map Clustering (FP-FMC). As shown in Fig. 1, for convolutional layer $L_{i}$ , we cluster the output feature maps separately for each of $N$ different inputs. Then, according to the homogeneity of the clustering results, we develop a novel indicator, the redundancy index, which scores the redundancy of each filter. Unlike the cited prior works that strive to simplify filters, FP-FMC prunes filters by redundancy index rather than simplicity. Furthermore, the whole process is self-organizing.

Figure 1.

The framework of our method, FP-FMC. There are three steps to prune a network: 1) Cluster the feature maps separately for each of $N$ different inputs. The clustering results show the co-occurrence relationship of feature extraction. 2) According to the intersections of the clustering results, calculate the redundancy index (RI) of each filter and sort the filters. Specifically, if a filter always outputs feature maps similar to those of other filters, it is assigned a higher RI. 3) Prune the filters with high RI and fine-tune the remaining filters. This framework makes FP-FMC a model-free method that can prune any convolutional network without the help of domain experts or domain knowledge.

In summary, the contributions of FP-FMC are three-fold:

We visually explored the filters and feature maps in the same convolutional layer, and then empirically described the relationship between the redundancy and effectiveness of filters.

We proposed a novel quantitative criterion to define the redundancy of filters, namely, the redundancy index.

Based on the above criteria, we designed a new filter pruning strategy and verified its effectiveness on various types of network blocks.

The rest of the paper is organized as follows. We first review the related work in Section 2. Next, in Section 3, we describe our observations about the relationship between filters and feature maps, and then introduce the definition of filter redundancy, which is the background and foundation of our research. The proposed pruning framework is explained in Section 4. A quantitative evaluation on public benchmark datasets, as well as hyper-parameter experiments, is presented in Section 5. Lastly, the paper concludes in Section 6.

2. Related work

A large body of work exists on the compression and acceleration of convolutional neural networks. Efforts in this field can be divided into parameter sharing [7], network pruning [8], weight quantization [9], low-rank decomposition [10], knowledge distillation [11], and network architecture search [12]. In this section, due to the limited space, we focus on pruning-based methods that are more related to our work. The pruning-based approaches aim to remove the unnecessary connections of the neural network. Depending on whether the removed connections are structural or not, we split them into unstructured and structured pruning.

Unstructured pruning.

Unstructured pruning aims to remove weights at arbitrary positions. This line of work dates back to the 1990s [21, 22]. As the most impactful work, [23] removes all weights below a threshold in a generic iterative way. Inspired by this work, the authors of [24] proposed Dynamic Network Surgery (DNS) to rebuild the weight connections during the dynamic learning process. Furthermore, another branch of work looked at different definitions of the thresholds used to prune weights. For example, HashedNets [7] used a hash function to randomly group connection weights into hash buckets, with all connections in the same hash bucket sharing a single parameter value. In addition, [25] treated the weight pruning of the network as a nonconvex optimization problem and then solved the decomposed subproblems by the alternating direction method of multipliers (ADMM). Although unstructured pruning significantly reduces the model size, these methods suffer from the same problem: the compressed models become sparse, and their irregular weight matrices are difficult to process in parallel on commonly available hardware. This characteristic severely limits the speed-up of the pruned network over the original one. To circumvent this issue, another line of work aims to remove redundant structural parts, such as channels, filters, or layers. These pruning methods produce networks that are easier to accelerate at inference time without specialized hardware or software support.

Structured pruning.

Structured pruning removes structured parts such as filters, channels, or layers without changing the original convolutional structure. Various off-the-shelf parallel Basic Linear Algebra Subprograms (BLAS) libraries support the resulting networks without specialized hardware. Among these methods, filter pruning is the most fine-grained, removing entire filters according to specific metrics. It preserves the feature extraction capability of the current convolution operations to the greatest extent while balancing computational speed and model size. Research in this branch mainly focuses on the selection criteria for filters. Some works exploit intrinsic properties of filters, such as the $\ell_{1}$ -norm [8], gradient [26], and sparsity [27], and use them as the criteria for pruning. Meanwhile, other works focus on feature maps. In [18], the authors compute the information entropy from the global average pooling (GAP) of the output feature map as the criterion for pruning. [20] presented a LASSO selection strategy that minimizes the least-squares reconstruction error of the feature map outputs. Moreover, [17] observed the invariance of feature map rank and removed filters with low-rank feature maps. Whether filter-based or feature map-based, these methods all build the pruning strategy on numerical analysis from a statistical point of view. The difference lies only in the types of indicators used for the statistics. This numerical approach does not pay attention to the essential purpose of the convolution operation but only simplifies the data distribution of the filter or feature map values. In other words, these works use the “more-simple-less-important” criterion for pruning filters.

Our work, FP-FMC, also lies in the filter pruning branch. The difference is that, for the principle of filter selection, we do not dwell on simplifying parameters at the numerical level of filters or feature maps but focus on the functional demands of the convolution operation. In particular, we rethink the function and goal of filters by observing the feature map outputs of the convolution operation, as described in Section 3. We find that there is redundancy in the feature extraction of different filters, so we introduce filter redundancy as an indicator for pruning. Compared with previous works, this filter selection criterion makes it feasible to remove a large portion of connections without a significant performance drop. In short, this pruning principle, designed from the essence of the filter, is the reason why FP-FMC is among the best-performing methods across different network structures.

3. Background

Before elaborating on our work, we review the convolution operation first. The purpose of convolution operation is to extract features. More specifically, each convolutional layer of a CNN is composed of a batch of filters. A filter is a collection of convolution kernels, and a convolution kernel refers to a 2D array of weights. The weights of the filter determine what specific features are detected. The convolution operation is a series of floating-point operations between the filter weights and the input. Through convolution operation, the filter filters out the information that it does not care about and draws out the features that it cares about, such as edges, highlight areas, backgrounds, and so on. After processing the input by filters, we get the output called feature maps.
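The relationship between kernels, filters, and feature maps described above can be made concrete with a minimal pure-Python sketch. It assumes stride 1, no padding, and no bias term; the function names, the toy image, and the edge-detecting kernel are illustrative choices, not taken from the paper.

```python
def conv2d_single(image, kernel):
    """Slide one 2D kernel over one 2D channel (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            # Multiply the kernel with the patch under it and sum the products.
            out[r][c] = sum(
                image[r + dr][c + dc] * kernel[dr][dc]
                for dr in range(kh)
                for dc in range(kw)
            )
    return out


def apply_filter(channels, filter_kernels):
    """A filter holds one kernel per input channel; the per-channel
    responses are added together to give a single feature map."""
    maps = [conv2d_single(ch, k) for ch, k in zip(channels, filter_kernels)]
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) for c in range(w)] for r in range(h)]


# Hypothetical 4x5 single-channel image with a vertical edge after column 2.
image = [[0, 0, 0, 1, 1]] * 4
# Hypothetical horizontal-difference kernel that responds to vertical edges.
kernel = [[-1, 0, 1]] * 3

feature_map = apply_filter([image], [kernel])
print(feature_map)  # [[0, 3, 3], [0, 3, 3]]
```

The flat region on the left produces zeros, while the positions covering the edge produce a strong response, which is the sense in which a filter "draws out the features that it cares about".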

Structured pruning aims to remove specified filters. In order to minimize the damage to the feature extraction ability of the convolutional layer, we need to prune relatively redundant filters first. However, there is neither a unified definition of redundancy for filters nor a conventional way to determine whether a filter is redundant. We therefore try to define it visually, according to the purpose of the convolution operation.

We chose the first convolutional layer of the pre-trained VGGNet [28] as the object of observation. The network was trained on the CIFAR-10 [29] dataset to classify among ten classes. We chose this layer for three reasons. Firstly, in an image classification model, the low convolutional layers mainly extract human-understandable information such as contours, edges, backgrounds, and highlights, so their feature maps are more interpretable when viewed. Secondly, the input of this convolutional layer has three channels, so the kernels of each filter can be visualized together as an RGB color image, which is convenient for observation. It is worth noting that the RGB visualization makes the feature map look very similar to a heat map, but the two are fundamentally different. The feature map is the output of feature extraction, which aims to strengthen some explicit or implicit visual features; it does not serve the statistical purpose of a heat map. Thirdly, the number of filters is 64, which is small enough for the human eye to observe their relationships without being overwhelmed.

Figure 2.

The observation of the $\textit{conv}_{1}$ layer in VGGNet [28] trained on CIFAR-10 [29]. a) 64 filters with $3\times 3$ weights and $32\times 32$ feature maps of six images from different classes. b) Feature map clustering results. c) The redundancy indices of the filters.

In the first row of Fig. 2a, we visualized 64 filters and randomly picked six images from different categories to obtain the feature maps. The first thing we discovered was that some feature maps are almost entirely zero, which means that their corresponding filters hardly extract any valuable features. In other words, these filters are ineffective. These zero feature maps vanish in the cross-channel addition of the convolution operation in the next layer. That is to say, they do not affect the feature extraction of other filters but do add useless floating-point calculations. This type of filter is obviously redundant. All existing filter pruning algorithms can pick them out easily, although the principles of each method are different. For example, some are based on sparsity [27], some on the norm [14], and some on statistics such as rank [17] or entropy [18].

Upon further observation, we found an interesting phenomenon: some feature maps are always alike, although they come from different filters. Moreover, this similarity does not seem to be affected by changes in the input. For instance, the feature maps of filters 52, 57, and 61 are always similar regardless of whether the input image is an aircraft or a frog. However, this similarity is not always stable. Take the feature maps of filters 60 and 62 as an example. Their feature maps under input 1 and input 3 are very similar, but their feature maps under input 5 and input 6 are obviously different. Based on these observations, we make the following conjecture. Each filter is good at extracting a particular type of feature, such as contours, colors, bright parts, or dark parts. If two filters excel at capturing the same feature type, their feature maps are always similar. In this case, both filters are effective, which means they both obtain the same valuable feature of the input. Nevertheless, in some cases, two kinds of features may occupy the same area. For example, the bright part of the image may also be its green area. In this case, two filters may also produce similar feature maps, although the feature types they focus on are not the same. From the perspective of the feature extractor, although both cases produce similar feature maps, the former involves a redundant filter, while the latter does not.

In brief, we have observed four situations:

the feature maps are all zeros;

the feature maps of different filters are always similar;

the feature maps of different filters are occasionally similar and occasionally different;

the feature maps of filters are unique.

For the first situation, zero feature maps mean the filter is useless for the model. If there is only one such zero feature map, pruning it or not has little effect; if there are multiple such feature maps, the situation merges into the second one, since these feature maps are always similar. For the second situation, the identical feature maps indicate that the filters overlap with each other in feature extraction capability. With respect to the purpose of the convolution operation, even if the weights of the filters are entirely different, their functions are the same. So at least one of them is redundant, and filters in this situation are the focus of filter pruning. For the third situation, the similar feature maps are caused by co-occurring features rather than by similar feature extractors. Some features are naturally related: even though they exist independently, they often appear together. The filters corresponding to these features are effective and distinct, but their correlation is high. When a high compression ratio is required, we can consider pruning filters in this situation. Doing so will inevitably affect the performance of the network, but not fatally. For the last situation, the filters are not only useful but also irreplaceable. Pruning them will significantly affect the performance of the model. Therefore, we should keep them as much as possible when performing filter pruning.

Based on the above analysis, we arrive at a standard for defining filter redundancy: a filter that always outputs feature maps similar to those of other filters is redundant. Furthermore, the redundancy of a filter is positively correlated with the homogeneity of its feature map. Although we only perform visual analysis on a shallow layer of the network, the experimental results in Section 5 show that this standard applies to all convolutional layers. In the next section, we describe a quantitative criterion that follows this definition to measure the redundancy of filters and incorporate it into a novel filter pruning approach called FP-FMC.

4. Method

4.1 Preliminaries

We first introduce the symbols and notation. Assume that a convolutional neural network model has $\mathcal{L}$ convolutional layers. The $i^{\text{th}}$ layer is denoted as $\mathcal{L}_{i}$ . For the convolutional layer $\mathcal{L}_{i}$ , the input tensors are $\mathcal{F}_{in}$ , and the output feature maps are $\mathcal{F}_{\textit{out}}$ . In $\mathcal{L}_{i}$ , there are $\mathcal{C}^{\textit{out}}_{L_{i}}$ filters. Each filter contains $\mathcal{C}^{in}_{L_{i}}$ convolutional kernels. $\mathcal{C}^{in}_{L_{i}}$ is also the channel number of $\mathcal{F}_{in}$ and $\mathcal{C}^{\textit{out}}_{L_{i}}$ is the channel number of $\mathcal{F}_{\textit{out}}$ . As mentioned in Section 3, we need to evaluate the similarity between feature maps, so a clustering algorithm is introduced. Here we use K-means [30], which involves a hyper-parameter that defines the number of clusters. Since the clustering is performed on the output of each convolutional layer, we denote this hyper-parameter for layer $\mathcal{L}_{i}$ as $K_{L_{i}}$ .

4.2 Filter pruning via feature map clustering (FP-FMC)

As elaborated in Section 3, we propose a new criterion for defining filter redundancy: the higher the homogeneity of the feature maps is, the more redundant the filter is. Therefore, the key lies in effectively quantifying the homogeneity among feature maps. The most direct way is to cluster the feature maps. The feature maps in the same cluster are more homogeneous than those in different clusters. Nevertheless, there are three obstacles here:

The feature maps are data points in a high-dimensional space, and the traditional clustering method cannot effectively measure their spatial distances in such space;

The number of clusters is a hyper-parameter that significantly affects the clustering results. So how to choose an appropriate $K_{L_{i}}$ for $\mathcal{L}_{i}$ to gather feature maps into distinct subcategories is a pending issue.

Refer to the third situation discussed in Section 3. Sometimes the root cause of similar feature maps is not the redundant filters but the highly correlated input features. How to effectively distinguish this situation is also a hard nut to crack.

Next, we solve the above problems one by one.

4.2.1 Distance measurement of feature maps

To start with, we use dimensionality reduction to solve the problem that traditional distance calculation methods are ineffective in high-dimensional spaces. Among the available techniques, t-SNE [31] is a highly recommended solution. It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data into a low-dimensional space, especially into two or three dimensions. Specifically, it models each high-dimensional object by a 2D or 3D point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points. This characteristic fits our purpose of seeking homogeneous feature maps. As shown in Fig. 2b, we use t-SNE to reduce each of the six groups of feature maps in Fig. 2a to two dimensions. Then we plot them with the corresponding filter ID numbers. Looking closely at Fig. 2b, one can see that the similarity relationships between the feature maps processed by t-SNE agree with our analysis of the visualization results in Section 3.

In the implementation, we used a trick. That is, before t-SNE, we normalized each feature map as follows.

$\displaystyle\overline{F}_{\textit{out}_{i,j}}=\frac{F_{\textit{out}_{i,j}}-\min(F_{\textit{out}})}{\max(F_{\textit{out}})-\min(F_{\textit{out}})}$ (1)

On the one hand, this operation scales all feature maps to the same interval $[0,1]$ without bias. On the other hand, it brings linearly correlated feature maps much closer together in the 2D space. In our opinion, two feature maps with linear correlation are also essentially redundant. After normalization, they are grouped more tightly within the same cluster. This resolves the first problem: after reducing the high-dimensional feature maps to the order-preserving 2-dimensional space, the Euclidean distance can be used to measure similarity.
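As an illustration of Eq. (1), the following sketch normalizes a single feature map stored as a nested list. It assumes that the minimum and maximum are taken over each feature map individually, as the phrase "each feature map" suggests; the handling of a constant map (where Eq. (1) is undefined) is our own assumption, and the function name is hypothetical.

```python
def min_max_normalize(feature_map):
    """Scale one feature map to [0, 1] following Eq. (1)."""
    values = [v for row in feature_map for v in row]
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant map (e.g. all zeros) is mapped to all zeros,
        # because Eq. (1) would divide by zero here.
        return [[0.0 for _ in row] for row in feature_map]
    return [[(v - lo) / (hi - lo) for v in row] for row in feature_map]


# Two feature maps related by a positive linear transform (b = 2a + 7).
a = [[1.0, 2.0], [3.0, 5.0]]
b = [[2 * v + 7 for v in row] for row in a]

norm_a = min_max_normalize(a)
norm_b = min_max_normalize(b)
print(norm_a)  # [[0.0, 0.25], [0.5, 1.0]]
print(norm_b)  # [[0.0, 0.25], [0.5, 1.0]]
```

The two maps coincide after normalization, which is why positively scaled and shifted copies of a feature map end up close together before dimensionality reduction.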

4.2.2 Hyper-parameter of clustering

With the similarity measure in place, traditional clustering algorithms become applicable. We chose K-means [30]. First of all, for most convolutional layers, the number of feature maps is small, generally not more than 512, which is a range where K-means performs well. Secondly, K-means converges quickly and clusters well. Following Occam’s razor, K-means is a reasonable choice for the current problem. We then face a new challenge: determining the hyper-parameter of K-means, that is, the number of clusters $K_{L_{i}}$ .

Here we adopt a brute-force search. We first obtain the feature maps of $B$ inputs. Then, for each input, we cluster its feature maps with every number of clusters from $2$ to $\mathcal{C}^{\textit{out}}_{L_{i}}-1$ and record the number of clusters that gives the best clustering result. Finally, we take the mode of the $B$ records as the final $K_{L_{i}}$ . By the law of large numbers, this estimate becomes reliable as long as $B$ is large enough. To measure the quality of a clustering result, we use the Silhouette Coefficient [32] as an indicator as follows:

$\displaystyle a_{i}=\frac{1}{|\mathcal{C}_{i}|-1}\sum_{j\in\mathcal{C}_{i},j\neq i}d(i,j)$ (2)

$\displaystyle b_{i}=\min_{\mathcal{C}_{k}\neq\mathcal{C}_{i}}\frac{1}{|\mathcal{C}_{k}|}\sum_{j\in\mathcal{C}_{k}}d(i,j)$ (3)

where $i$ and $j$ are data points, $\mathcal{C}_{i}$ is the cluster to which data point $i$ belongs, $\mathcal{C}_{k}$ ranges over the other clusters, $|\mathcal{C}_{i}|$ is the number of data points in the cluster $\mathcal{C}_{i}$ , and $d(i,j)$ is the distance between data points $i$ and $j$ .

Therefore, $a_{i}$ is the mean distance between data point $i$ and all other data points in the same cluster $\mathcal{C}_{i}$ , and $b_{i}$ is the smallest mean distance from point $i$ to all points in any other cluster. We can interpret $a_{i}$ as a measure of how well $i$ is assigned to its cluster and $b_{i}$ as the mean dissimilarity to the neighboring cluster. Then the silhouette coefficient $SC$ is defined as:

$\displaystyle s_{i}=\left\{\begin{array}{ll}1-\frac{a_{i}}{b_{i}},&\text{if }a_{i}<b_{i}\\ 0,&\text{if }a_{i}=b_{i}\\ \frac{b_{i}}{a_{i}}-1,&\text{if }a_{i}>b_{i}\end{array}\right.$ (4)

$\displaystyle SC=\max_{k}\tilde{s}(k)$ (5)

where $\tilde{s}(k)$ represents the mean $s_{i}$ over all data of the entire dataset for a specific number of clusters $k$ . The Silhouette Coefficient falls in $[-1,1]$ . The closer the Silhouette Coefficient is to $1$ , the better the clustering is. We set the hyper-parameter $K_{L_{i}}$ as the cluster number corresponding to the maximum Silhouette Coefficient.
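The selection procedure of Eqs. (2) to (5) can be sketched as follows. This is a minimal illustration under several assumptions: the points are 2D embeddings with Euclidean distance, the candidate cluster labelings are supplied by the caller (any K-means implementation could produce them), a point in a singleton cluster receives $s_{i}=0$ by the usual convention, and all function and variable names are hypothetical.

```python
import math
from collections import Counter


def silhouette_score(points, labels):
    """Mean s_i over all points (Eqs. (2)-(4)); needs at least two clusters."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # Eq. (2): mean distance to the other members of the same cluster.
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # Eq. (3): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        # Eq. (4)
        if a < b:
            scores.append(1 - a / b)
        elif a > b:
            scores.append(b / a - 1)
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


def select_k(points, candidate_labelings):
    """Eq. (5): return the k whose labeling has the largest mean silhouette."""
    return max(candidate_labelings,
               key=lambda k: silhouette_score(points, candidate_labelings[k]))


def mode_of_records(best_ks):
    """Final K for a layer: the most frequent best k over the B inputs."""
    return Counter(best_ks).most_common(1)[0][0]


# Toy 2D embedding with two well-separated groups.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}  # hypothetical K-means outputs
best_k = select_k(points, candidates)
k_layer = mode_of_records([4, 5, 4, 6, 4])  # hypothetical records from B = 5 inputs
print(best_k, k_layer)  # 2 4
```

Splitting the right-hand group into two singletons lowers the mean silhouette, so the two-cluster labeling is preferred in this toy case.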

Strictly speaking, the purpose of this step is to determine the hyper-parameters, so it is a preparatory step of our algorithm. It is, however, the most time-consuming step. Fortunately, once $K_{L_{i}}$ has been obtained, it can be reused, and there is no need to repeat the calculation. For a detailed hyper-parameter analysis, see Section 5.3.

In the implementation, we used another trick here. Before clustering, we use the 1.5 $\cdot$ IQR method by dimension to find outliers in 2D t-SNE feature maps. These outliers will not participate in clustering. After clustering, we mark the clustered feature maps by $\mathcal{C}_{i}\in[0,K_{L_{i}}-1]$ and the outliers by $-1$ . We did this for two reasons:

K-means is a clustering algorithm that is very sensitive to outliers. Removing outliers can increase the robustness of K-means.

For feature maps, outlier means the corresponding filters are the unique ones that differ from others a lot. This corresponds to the fourth situation discussed in Section 3. Marking them out facilitates the subsequent pruning step.
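The outlier step described above can be sketched as follows. The text does not specify how the quartiles are computed, so the use of `statistics.quantiles` with the inclusive method is an assumption, as are the function names and the toy points.

```python
import statistics


def iqr_bounds(values, factor=1.5):
    """Lower and upper fences Q1 - factor*IQR and Q3 + factor*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def flag_outliers(points_2d):
    """A point is an outlier if it falls outside the fences in either dimension."""
    bounds = [iqr_bounds([p[d] for p in points_2d]) for d in range(2)]
    return [
        any(not (lo <= p[d] <= hi) for d, (lo, hi) in enumerate(bounds))
        for p in points_2d
    ]


# Hypothetical 2D t-SNE coordinates of eight feature maps; the last one is far away.
embedded = [(0, 0), (1, 1), (2, 0), (1, 2), (0, 1), (2, 2), (1, 0), (100, 1)]
flags = flag_outliers(embedded)
print(flags)  # [False, False, False, False, False, False, False, True]

# Outliers get label -1 and are excluded from K-means; the remaining points
# would then receive cluster labels in [0, K - 1].
to_cluster = [p for p, is_out in zip(embedded, flags) if not is_out]
```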

4.2.3 Redundancy quantification

Now, we come to the most important step. Take the convolutional layer $L_{i}$ as an example. There are $\mathcal{C}^{\textit{out}}_{L_{i}}$ filters in this layer, denoted as $f_{L_{i}}=\{f_{L_{i}}^{1},f_{L_{i}}^{2},\dots,f_{L_{i}}^{\mathcal{C}^{\textit{out}}_{L_{i}}}\}$ . Using the $K_{L_{i}}$ prepared before, we cluster the $\mathcal{C}^{\textit{out}}_{L_{i}}$ feature maps separately for each of $N$ inputs. As in the previous step, we treat outliers specially and mark them as $-1$ , while the rest receive a cluster label in $[0,K_{L_{i}}-1]$ from K-means. In this way, we obtain an $N\times\mathcal{C}^{\textit{out}}_{L_{i}}$ matrix $M$ whose value $M[i][j]$ is the cluster label of the $j^{\text{th}}$ filter for the $i^{\text{th}}$ input. Then, we re-represent $M$ in the form of sets as follows:

$\displaystyle\textit{Set}_{n}^{k}=\{j:M[n][j]=k\}$ (6)

where $n\in[1,N],j\in[1,\mathcal{C}^{\textit{out}}_{L_{i}}]$ and $k\in[-1,K_{L_{i}}-1]$ .

Thus, for each input, we get $K_{L_{i}}+1$ sets, and each set contains filters whose feature maps belong to the same cluster $k$ . Then, we define the redundancy index of filter $f$ as two parts:

$\displaystyle RI^{f}=RI^{f}_{1}-RI^{f}_{2}$ (7)

In this equation, $RI^{f}_{1}$ measures how similar the feature map of $f$ is to that of other filters under different inputs:

$\displaystyle RI^{f}_{1}=\sum\limits_{\begin{subarray}{c}j>i\\ M[i][f]\neq-1\\ M[j][f]\neq-1\end{subarray}}||\textit{Set}_{i}^{M[i][f]}\cap\textit{Set}_{j}^{M[j][f]}||$ (8)

where $||\cdot||$ denotes the cardinality of a set. That is, if $f$ always outputs a feature map similar to those of other filters, the redundancy index of $f$ will be larger.

Moreover, $RI^{f}_{2}$ measures the situation where the feature map of $f$ is the outlier:

$\displaystyle RI^{f}_{2}=||\{i:M[i][f]=-1\}||$ (9)

That is, if $f$ always outputs a feature map that is an outlier, $RI^{f}$ will be smaller.

In total, the redundancy index for filter $f$ is defined as in Eq. (7). For convenience, we normalize the redundancy indices of all $\mathcal{C}^{\textit{out}}_{L_{i}}$ filters in convolutional layer $L_{i}$ to $[0,1]$ :

$\displaystyle\overline{RI^{f}}=\frac{RI^{f}-\min\limits_{F\in[1,\mathcal{C}^{\textit{out}}_{L_{i}}]}RI^{F}}{\max\limits_{F\in[1,\mathcal{C}^{\textit{out}}_{L_{i}}]}RI^{F}-\min\limits_{F\in[1,\mathcal{C}^{\textit{out}}_{L_{i}}]}RI^{F}}$ (10)
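To make Eqs. (6) to (10) concrete, the sketch below computes the raw and normalized redundancy indices from a small label matrix $M$. It follows the equations literally, so the outlier term counts the inputs for which the filter is an outlier, as in Eq. (9). Filters are indexed from 0 rather than 1, the all-equal case in the normalization is handled by an assumed convention, and the function name and the toy matrix are hypothetical.

```python
from itertools import combinations


def redundancy_indices(M):
    """M[n][f] is the cluster label of filter f for input n; -1 marks an outlier.
    Returns (raw RI per filter, normalized RI per filter)."""
    n_inputs, n_filters = len(M), len(M[0])

    # Eq. (6): for every input n, group the filter indices by cluster label k.
    sets = [{} for _ in range(n_inputs)]
    for n, row in enumerate(M):
        for f, k in enumerate(row):
            sets[n].setdefault(k, set()).add(f)

    raw = []
    for f in range(n_filters):
        # Eq. (8): over all input pairs (i, j) with j > i where f is not an
        # outlier, count the filters that share f's cluster in both inputs.
        ri1 = sum(
            len(sets[i][M[i][f]] & sets[j][M[j][f]])
            for i, j in combinations(range(n_inputs), 2)
            if M[i][f] != -1 and M[j][f] != -1
        )
        # Eq. (9): number of inputs for which f is an outlier.
        ri2 = sum(1 for n in range(n_inputs) if M[n][f] == -1)
        raw.append(ri1 - ri2)  # Eq. (7)

    # Eq. (10): min-max normalization over the filters of the layer.
    lo, hi = min(raw), max(raw)
    if hi == lo:
        return raw, [0.0] * n_filters  # assumption: Eq. (10) is undefined here
    return raw, [(v - lo) / (hi - lo) for v in raw]


# Toy example: N = 3 inputs, 4 filters. Filters 0 and 1 always share a cluster,
# filter 2 joins them only for the third input, filter 3 is mostly an outlier.
M = [
    [0, 0, 1, -1],
    [1, 1, 0, -1],
    [0, 0, 0, 1],
]
raw_ri, norm_ri = redundancy_indices(M)
print(raw_ri)   # [6, 6, 3, -2]
print(norm_ri)  # [1.0, 1.0, 0.625, 0.0]
```

Because the intersection in Eq. (8) always contains the filter itself, every non-outlier pair contributes at least 1; the filters that consistently share a cluster (0 and 1) receive the highest normalized score and the frequent outlier (3) the lowest.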

With these $N$ clustering results, we calculate the redundancy index of each filter as described in Algorithm 4.2.3.

Algorithm 4.2.3 Redundancy Quantification
Input: $M$ ( $N\times\mathcal{C}^{\textit{out}}_{L_{*}}$ matrix): the clustering results of $N$ inputs; each value is the label of the cluster that the filter belongs to.
Output: $\overline{RI}$ ( $\mathcal{C}^{\textit{out}}_{L_{*}}$ -dimensional vector): each value is the normalized redundancy index of a filter.
Initialization: all $RI\leftarrow 0$
// re-represent $M$ in the sets form
for $i\leftarrow 1$ to $N$ do
    for $k\leftarrow-1$ to $K_{L_{*}}$ do
        for $f\leftarrow 1$ to $\mathcal{C}^{\textit{out}}_{L_{*}}$ do
            if $M[i][f]==k$ then add $f$ to $\textit{Set}_{i}^{k}$
// calculate the redundancy index
for $i\leftarrow 1$ to $N$ do
    for $j>i$ do
        for $f\leftarrow 1$ to $\mathcal{C}^{\textit{out}}_{L_{*}}$ do
            if $M[i][f]=-1$ or $M[j][f]=-1$ then // for outlier
                $RI^{f}+=-1$
            else
                $RI^{f}+=||\textit{Set}_{i}^{M[i][f]}\cap\textit{Set}_{j}^{M[j][f]}||$
// normalize all redundancy indices to $[0,1]$
$\overline{RI}=\frac{RI-\min(RI)}{\max(RI)-\min(RI)}$

In short, the redundancy quantification algorithm scores each filter by summing pairwise comparisons of clustering results. An outlier filter is naturally regarded as not redundant, so each comparison in which its feature map is an outlier contributes $-1$ to its redundancy index. For filters whose feature maps are consistently similar to others across clustering results, we add the cardinality of the intersection of the sets that contain the filter. Hence, the more feature maps a cluster contains, the more redundant its filters are. Because inputs differ widely, the dissimilarity between feature maps emerges gradually as $N$ increases, and the more pairs of clustering results are used, the more accurate the redundancy indices become. Fig. 3 plots the distribution of redundancy indices for the $\textit{conv}_{2}$ layer of GoogleNet [33], which has 32 filters. The distribution tends to stabilize as $N$ increases. When $N$ is small, the instability comes from highly correlated input features, as discussed in Section 3. As the number of inputs increases, the proportion of such co-occurring features decreases, and the differences between filters are revealed.
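
The procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: $M$ is passed as a nested list, the names `redundancy_index` and `OUTLIER` are hypothetical, and the handling of the degenerate case in which all raw indices are equal is our own choice, since the text does not cover it.

```python
from itertools import combinations

OUTLIER = -1  # label used in M for a feature map flagged as an outlier


def redundancy_index(M):
    """Return the normalized redundancy index of each filter.

    M is an N x C nested list: M[i][f] is the cluster label of filter f
    for input i, or OUTLIER when its feature map was flagged as an outlier.
    """
    n_inputs, n_filters = len(M), len(M[0])

    # Re-represent each clustering result as {label: set of filter indices}.
    sets = []
    for row in M:
        by_label = {}
        for f, label in enumerate(row):
            by_label.setdefault(label, set()).add(f)
        sets.append(by_label)

    ri = [0] * n_filters
    # Compare every pair of clustering results (i, j) with j > i.
    for i, j in combinations(range(n_inputs), 2):
        for f in range(n_filters):
            li, lj = M[i][f], M[j][f]
            if li == OUTLIER or lj == OUTLIER:
                ri[f] -= 1  # an outlier counts against redundancy
            else:
                ri[f] += len(sets[i][li] & sets[j][lj])

    lo, hi = min(ri), max(ri)
    if hi == lo:  # assumption: all filters equally redundant, return zeros
        return [0.0] * n_filters
    return [(r - lo) / (hi - lo) for r in ri]
```

Note that the intersection always contains $f$ itself, so a filter that is never an outlier gains at least 1 per pair of clustering results.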

The above is the mathematical definition of the redundancy index. For intuition, we plot the redundancy indices of the $\textit{conv}_{1}$ layer of VGGNet in Fig. 2c. As expected, the redundancy indices quantify the four situations discussed in Section 3 well and provide guidance for the pruning step.

Figure 3.

The redundancy indices of the $\textit{conv}_{2}$ layer of GoogleNet, which has 32 filters. The six graphs correspond to the values calculated with sample sizes of 32, 64, 128, 256, 512, and 1024, respectively. In each subfigure, the x-axis represents the filter index and the y-axis the redundancy index, which falls in $[0,1]$. We draw the six graphs separately rather than overlaying them in one graph because the focus is on the shape of the distribution of redundancy rather than on specific value changes. When the sample size is small, the distribution of redundancy indices differs significantly from that obtained with larger samples. As the sample size increases, the distribution becomes stable. When the sample size is large, the values of individual redundancy indices still fluctuate, but the overall order remains unchanged, and it is this order of the filters' redundancy indices that serves as the pruning criterion.
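
One simple way to check the stability claim numerically is to compare which filters would be pruned under two redundancy-index vectors computed from different sample sizes. The sketch below is only an illustration: the function names are hypothetical and the overlap measure is our own choice, not a metric used in the paper.

```python
def pruned_set(ri, n_prune):
    """Indices of the n_prune filters with the largest redundancy index."""
    order = sorted(range(len(ri)), key=lambda f: ri[f], reverse=True)
    return set(order[:n_prune])


def selection_overlap(ri_a, ri_b, n_prune):
    """Fraction of pruned filters shared by two redundancy-index vectors."""
    a, b = pruned_set(ri_a, n_prune), pruned_set(ri_b, n_prune)
    return len(a & b) / n_prune
```

An overlap of 1.0 means that both sample sizes lead to exactly the same pruning decision, even if the individual index values differ.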

4.3 Summary for FP-FMC

In summary, there are three steps for FP-FMC:

Step 1:
For the feature maps reduced by t-SNE in each convolutional layer $L_{i}$ , pick out the outliers and cluster the rest by K-means with the predetermined $K_{L_{i}}$ .
Step 2:
Calculate the redundancy indices of all filters as described in Algorithm 4.2.3 and sort them in descending order.
Step 3:
According to the compression ratio $r_{L_{i}}$ given by the user, calculate the number of filters to be pruned $N_{\textit{prune}}=r_{L_{i}}\times\mathcal{C}^{\textit{out}}_{L_{i}}$ . Remove the first $N_{\textit{prune}}$ filters in the sorted filter list. Repeat this removal layer by layer. Then, fine-tune the model.
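
Steps 2 and 3 can be illustrated for a single layer as follows. This is a hedged sketch: `select_filters` is a hypothetical helper, and rounding $N_{\textit{prune}}$ down to an integer is an assumption, since the text does not state how fractional values are handled.

```python
def select_filters(ri, ratio):
    """Split the filter indices of one layer into (pruned, kept).

    ri    : normalized redundancy index of each filter
    ratio : compression ratio r of this layer, in [0, 1]
    """
    n_prune = int(ratio * len(ri))  # assumption: round down
    # Sort filter indices by redundancy, most redundant first.
    order = sorted(range(len(ri)), key=lambda f: ri[f], reverse=True)
    pruned = sorted(order[:n_prune])
    kept = sorted(order[n_prune:])
    return pruned, kept
```

For example, `select_filters([0.2, 0.9, 0.0, 0.5], 0.5)` prunes filters 1 and 3 and keeps filters 0 and 2.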

In short, the core idea of this paper is a new pruning strategy based on the redundancy quantification of filters. Although some details can still be optimized, the overall strategy is implementable both theoretically and technically. The next section evaluates this strategy on various network architectures and demonstrates its advantage over its peers.

5. Experiments

In this section, we evaluate FP-FMC on four network architectures with two challenging datasets. We show the robustness and flexibility of the proposed filter pruning approach, and we analyze the hyper-parameters used in our method. The source code is publicly available on GitHub,$^{1}$ along with the log files that produced all experimental results. In summary, the proposed method outperforms its counterparts in terms of floating-point operations (FLOPs), parameter reduction, and classification accuracy.

5.1 Experimental settings

Datasets and Baselines

To demonstrate the performance of our method on filter pruning, we used two datasets, CIFAR-10 [29] and ImageNet [34]. The former consists of 60,000 $32\times 32$ color images in 10 classes. The latter spans 1000 object classes and contains 1,281,167 images for training, 50,000 for validation, and 100,000 for testing. Both are well-known benchmark datasets for the classification task. Compared with other classification datasets, the proportions and characteristics of objects in these two datasets are highly diverse. Besides, because the images are taken in natural scenes with complex backgrounds, object recognition and classification are considerably challenging. As for models, we selected diverse architectures with different modules, including VGGNet [28], DenseNet [35], GoogleNet [33], and ResNet [36]. All pre-trained models come from open-source contributions on GitHub. In all experiments, we randomly sampled 128 images to calculate the clustering hyper-parameters and the redundancy index.

Evaluation Protocols

For classification tasks, we use classification accuracy to measure model performance. Specifically, we adopt top-1 accuracy for CIFAR-10, and top-1 and top-5 accuracy for ImageNet. To evaluate model compression, we use the number of parameters to assess model size and the number of required floating-point operations (FLOPs) to assess computational cost. For both indicators, we also report the percentage reduction relative to the original model.
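
The percentage reduction reported next to each FLOPs and parameter value can be reproduced with a one-line formula. The sketch below uses the VGGNet values from Table 1; the helper name `reduction_percent` is hypothetical.

```python
def reduction_percent(baseline, pruned):
    """Percentage reduction of `pruned` relative to `baseline`."""
    return 100.0 * (1.0 - pruned / baseline)


# VGGNet on CIFAR-10, values taken from Table 1 (in millions).
flops_drop = reduction_percent(313.73, 104.78)
params_drop = reduction_percent(14.98, 2.50)
print(f"FLOPs: -{flops_drop:.1f}%, parameters: -{params_drop:.1f}%")
# prints "FLOPs: -66.6%, parameters: -83.3%"
```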

Configurations

Our code is built on PyTorch [37]. During fine-tuning, we optimize the objective using Adam [38] with a batch size of 256 and a learning rate of 2E-2. The weight decay and momentum are set to 5E-3 and 0.9, respectively. All models are fine-tuned for 300 iterations. All experiments are conducted on an NVIDIA GeForce RTX 3090.

5.2 Compression and analysis

The model compression experiments involve six models built from four different network blocks: VGGNet [28], DenseNet [35], GoogleNet [33], and ResNet-56, ResNet-110, and ResNet-50 [36]. We compare FP-FMC with other state-of-the-art filter pruning approaches as well as with the original models. The filter pruning methods used for comparison were all proposed in the past five years. The experimental results of all other methods in the following tables are taken from the respective original papers. For FP-FMC, we use the given compression ratios to calculate the number of filters in each convolutional layer of the pruned model, and use these numbers as a new configuration to initialize the target model. Then, according to the redundancy index, we copy the weights of the selected filters from the benchmark model to the target model. Finally, we fine-tune the pruned model on the target dataset.
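
The weight-copy step can be sketched without any deep learning framework. The example below treats a convolution weight as a nested list indexed by output and input channel; the helper `prune_conv_weight` is hypothetical, and slicing the input channels by the filters kept in the previous layer is our assumption about how the shapes are kept consistent, not a detail stated in the text.

```python
def prune_conv_weight(weight, kept_out, kept_in):
    """Copy the kept filters of one convolution layer.

    weight   : nested list indexed as weight[out_channel][in_channel]
               (each entry stands for one k x k kernel)
    kept_out : indices of the filters kept in this layer
    kept_in  : indices of the filters kept in the previous layer, i.e. the
               input channels that still exist after pruning
    """
    return [[weight[o][c] for c in kept_in] for o in kept_out]


# Toy layer with 4 filters and 3 input channels; strings stand in for kernels.
weight = [[f"w{o}{c}" for c in range(3)] for o in range(4)]
small = prune_conv_weight(weight, kept_out=[0, 2], kept_in=[1, 2])
```

Here `small` has two filters with two input channels each, which matches a target layer initialized with the reduced configuration.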

5.2.1 VGGNet on CIFAR-10

For VGGNet [28], we used the VGG16-bn architecture, which has a linear structure. VGG16-bn consists of 13 convolutional layers and three fully-connected layers, and a batch normalization operation follows each convolutional layer. The benchmark model was downloaded from an online public resource.$^{2}$ Because the number of filters in the last convolutional layer must be aligned with the fully-connected layer, it cannot be changed. Therefore, filter pruning is performed only on the first 12 convolutional layers, each of which can be compressed by a user-specified ratio. We selected seven methods for comparison, and the detailed results are in Table 1. For the best result of FP-FMC in Table 1, the compression ratios are [0.3]*7 $+$ [0.75]*5, which means that the first seven convolutional layers are compressed at a ratio of 30% and the last five at a ratio of 75%. Among all models with FLOPs above 100M, the model compressed by FP-FMC requires the fewest floating-point operations, has the fewest parameters, and achieves the best classification accuracy. Take CHIP [39] as an example: FP-FMC reduces FLOPs by a further 26.39M and parameters by 0.26M while increasing the classification accuracy by 0.03%. When the model is further compressed to fewer than 2.00M parameters, FP-FMC still achieves 92.93% classification accuracy while reducing floating-point operations by 78.7%. A model of this size is friendly to edge devices such as mobile phones, and the performance remains competitive. In this case, the compression ratios of the 12 convolutional layers are [0.45]*7 $+$ [0.78]*5. Fig. 4 plots the classification accuracy during fine-tuning for these two compression ratio configurations. The x-axis represents the fine-tuning iteration, and the y-axis is the classification accuracy. After little more than 50 iterations of fine-tuning, the classification accuracy of the compressed model becomes stable.
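
The configuration notation [0.3]*7 $+$ [0.75]*5 happens to be valid Python list syntax, so it can be expanded directly into per-layer ratios. The sketch below applies it to the standard VGG16 channel widths; rounding $N_{\textit{prune}}$ down and appending 0.0 for the unpruned last layer are our assumptions, and the resulting widths are illustrative rather than the exact configuration of the released model.

```python
# Output channels of the 13 convolutional layers of VGG16.
VGG16_CHANNELS = [64, 64, 128, 128, 256, 256, 256,
                  512, 512, 512, 512, 512, 512]

# Per-layer compression ratios of "pruned 1"; the last layer is left unpruned.
ratios = [0.3] * 7 + [0.75] * 5 + [0.0]


def remaining_filters(channels, ratios):
    """Number of filters kept per layer, with N_prune = int(r * C_out)."""
    return [c - int(r * c) for c, r in zip(channels, ratios)]


kept = remaining_filters(VGG16_CHANNELS, ratios)
```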

Table 1
Pruning results of VGGNet on CIFAR-10

Model	Top-1% $\uparrow$	FLOPs $\downarrow$	Parameters $\downarrow$
VGGNet (Baseline)	93.96	313.73M	14.98M
FP-FMC (Ours: pruned 1) ${}^{*}$	93.89	104.78M ( $-$ 66.6%)	2.50M ( $-$ 83.3%)
CHIP [39] (2021)	93.86	131.17M ( $-$ 58.1%)	2.76M ( $-$ 81.6%)
GM [15] (2019)	93.54	206.43M ( $-$ 34.2%)	–
HRank [17] (2020)	93.43	145.61M ( $-$ 53.5%)	2.51M ( $-$ 82.9%)
L1 [8] (2017)	93.40	206.00M ( $-$ 34.3%)	5.40M ( $-$ 64.0%)
Zhao et al. [40] (2019)	93.18	190.00M ( $-$ 39.1%)	3.92M ( $-$ 73.3%)
SSS [41] (2018)	93.02	183.13M ( $-$ 41.6%)	3.93M ( $-$ 73.8%)
FP-FMC (Ours: pruned 2) ${}^{**}$	92.93	66.95M ( $-$ 78.7%)	1.90M ( $-$ 87.3%)
GAL-0.05 [42] (2019)	92.03	189.49M ( $-$ 39.6%)	3.36M ( $-$ 77.6%)

${}^{*}$ Compression ratio configurations of pruned 1: [0.3]*7 $+$ [0.75]*5. ${}^{**}$ Compression ratio configurations of pruned 2: [0.45]*7 $+$ [0.78]*5.

Figure 4.

Fine-tuning results of the pruned VGG16-bn models under two compression ratio configurations, compared with the baseline.

Figure 5.

Fine-tuning results of the pruned DenseNet-40 models under three compression ratio configurations, compared with the baseline.

5.2.2 DenseNet on CIFAR-10

The competition on DenseNet [35] is fierce, and the gap between methods is small. The network architecture we used is DenseNet-40. The benchmark model was downloaded from an online public resource.$^{3}$ Specifically, DenseNet-40 contains three dense blocks, with a transition layer between every two dense blocks. There are 12 convolution operations in each dense block and one in each transition layer, so filter pruning can be performed on 38 convolutional layers ( $12\times 3+1\times 2=38$ ). In addition, DenseNet-40 includes a convolution operation on the input layer and a fully-connected operation on the output layer, for a total of 40 layers. We do not change these two layers. In Table 2, among all methods with classification accuracy above 94%, FP-FMC has the fewest parameters. When the parameter reduction exceeds 66%, it still maintains 93.28% classification accuracy while requiring only 37.7% of the floating-point operations of the baseline. Fig. 5 plots the classification accuracy during fine-tuning for three compression ratio configurations. The x-axis represents the fine-tuning iteration, and the y-axis is the classification accuracy. The compression ratio configurations are [0.2]*12 $+$ [0.0] $+$ [0.2]*12 $+$ [0.0] $+$ [0.2]*12, [0.3]*12 $+$ [0.1] $+$ [0.3]*12 $+$ [0.1] $+$ [0.3]*12, and [0.4]*12 $+$ [0.1] $+$ [0.4]*12 $+$ [0.1] $+$ [0.4]*12, respectively. After little more than 150 iterations of fine-tuning, the classification accuracy of the compressed model becomes stable.

Table 2
Pruning results of DenseNet on CIFAR-10

Model	Top-1% $\uparrow$	FLOPs $\downarrow$	Parameters $\downarrow$
DenseNet-40 (Baseline)	94.81	282.00M	1.04M
FP-FMC (Ours: pruned 1) ${}^{*}$	94.45	173.38M ( $-$ 38.5%)	0.62M ( $-$ 40.4%)
GAL-0.01 [42] (2019)	94.29	182.92M ( $-$ 35.3%)	0.67M ( $-$ 35.6%)
HRank [17] (2020)	94.24	167.41M ( $-$ 40.8%)	0.66M ( $-$ 36.5%)
FP-FMC (Ours: pruned 2) ${}^{**}$	93.72	133.17M ( $-$ 52.78%)	0.45M ( $-$ 56.7%)
GAL-0.05 [42] (2019)	93.53	128.11M ( $-$ 54.7%)	0.45M ( $-$ 56.7%)
FP-FMC (Ours: pruned 3) ${}^{***}$	93.28	106.23M ( $-$ 62.3%)	0.35M ( $-$ 66.3%)
Zhao et al. [40] (2019)	93.16	156.00M ( $-$ 44.8%)	0.42M ( $-$ 59.7%)

${}^{*}$ Compression ratio of pruned 1: [0.2]*12 $+$ [0.0] $+$ [0.2]*12 $+$ [0.0] $+$ [0.2]*12. ${}^{**}$ Compression ratio of pruned 2: [0.3]*12 $+$ [0.1] $+$ [0.3]*12 $+$ [0.1] $+$ [0.3]*12. ${}^{***}$ Compression ratio of pruned 3: [0.4]*12 $+$ [0.1] $+$ [0.4]*12 $+$ [0.1] $+$ [0.4]*12.

Table 3

Pruning results of GoogleNet on CIFAR-10

Model	Top-1% $\uparrow$	FLOPs $\downarrow$	Parameters $\downarrow$
FP-FMC (Ours: pruned 1) ${}^{*}$	95.11	0.65B ( $-$ 57.2%)	2.86M ( $-$ 53.50%)
GoogleNet (Baseline)	95.05	1.52B ( $-$ 0.0%)	6.15M ( $-$ 0.0%)
FP-FMC (Ours: pruned 2) ${}^{**}$	94.83	0.39B ( $-$ 74.3%)	2.09M ( $-$ 65.80%)
SFM [43] (2022)	94.82	0.64B ( $-$ 57.5%)	2.85M ( $-$ 53.7%)
Random [17] (2020)	94.54	0.96B ( $-$ 36.8%)	3.58M ( $-$ 41.8%)
L1 [8] (2017)	94.54	1.02B ( $-$ 32.9%)	3.51M ( $-$ 42.9%)
HRank [17] (2020)	94.53	0.69B ( $-$ 54.6%)	2.74M ( $-$ 55.4%)
GAL-0.05 [42] (2019)	93.93	0.94B ( $-$ 38.2%)	3.12M ( $-$ 49.3%)
GAL-ApoZ ${}^{\text{a}}$ [19] (2016)	92.11	0.76B ( $-$ 50.0%)	2.85M ( $-$ 53.7%)

${}^{*}$ Compression ratio configurations of pruned 1: [0.6]*2 $+$ [0.7]*5 $+$ [0.8]*2. ${}^{**}$ Compression ratio configurations of pruned 2: [0.85]*2 $+$ [0.9]*5 $+$ [0.9]*2. ${}^{\text{a}}$ GAL-ApoZ are results standardized by HRANK [17] and GAL [42].

5.2.3 GoogleNet on CIFAR-10

For the comparison on GoogleNet [33], FP-FMC compares favorably: in terms of both parameters and computational cost, it surpasses most of the other methods by a large margin. The network architecture we used is the GoogleNet with nine inception modules. The benchmark model was downloaded from an online public resource.$^{4}$ In detail, each inception module has four $1\times 1$ convolutional layers, one $3\times 3$ convolutional layer, and two $5\times 5$ convolutional layers. We prune all $3\times 3$ and $5\times 5$ convolutional layers in the first eight inception modules. In the last inception module, we compress only the first $5\times 5$ convolutional layer to ensure that the number of output channels is aligned with the fully-connected layer. In total, filter pruning can be performed on 25 convolutional layers ( $8\times 3+1=25$ ). In the experiment, we set the compression ratios per inception module, and the $3\times 3$ convolutional layers share the same compression ratio as the $5\times 5$ convolutional layers in the same inception module. It is worth noting that, compared with the baseline, FP-FMC achieves better accuracy with less than half of the resources. When the model is further compressed to 2.09M parameters, FP-FMC still achieves 94.83% classification accuracy while reducing floating-point operations by 74.3%. This outperforms the other state-of-the-art methods on all three metrics. Fig. 6 plots the classification accuracy during fine-tuning for these two compression ratio configurations. As in Figs. 4 and 5, the x-axis represents the fine-tuning iteration, and the y-axis is the classification accuracy. The compression ratio configurations are [0.6]*2 $+$ [0.7]*5 $+$ [0.8]*2 and [0.85]*2 $+$ [0.9]*5 $+$ [0.9]*2. After little more than 150 iterations of fine-tuning, the classification accuracy of the compressed model becomes stable or even exceeds the baseline.

Figure 6.

Fine-tuning results of the pruned GoogleNet models under two compression ratio configurations, compared with the baseline.

5.2.4 ResNet on CIFAR-10

For ResNet [36] on CIFAR-10 [29], the network architectures we used are ResNet-56 and ResNet-110. The benchmark models were downloaded from online public resources.$^{5,6}$ ResNet-56 has 55 convolutional layers and one fully-connected layer. In addition to the first convolutional layer, there are three groups of modules with 16, 32, and 64 channels, respectively. Each group contains 9 serial residual blocks, and each residual block contains two convolution operations. In total, filter pruning can be performed on 54 convolutional layers, that is, all except the last one, which must remain unchanged to align with the fully-connected layer. Similarly, ResNet-110 has 109 convolutional layers and one fully-connected layer. In addition to the first convolutional layer, there are three groups of modules with 16, 32, and 64 channels, respectively. Each group contains 18 serial residual blocks, and each residual block contains two convolution operations. In total, filter pruning can be performed on 108 convolutional layers, that is, all except the last one, which must remain unchanged to align with the fully-connected layer.

Figure 7.

Fine-tuning results of the pruned ResNet-56 models under three compression ratio configurations, compared with the baseline.

Figure 8.

Fine-tuning results of the pruned ResNet-110 models under three compression ratio configurations, compared with the baseline.

The compression rates of ResNet are specified somewhat differently from those of the above models. The given compression rates have two parts. The first part specifies the pruning configuration of the convolution layer before each group of residual blocks, so that the output of one residual block matches the input of the next. The second part specifies the pruning configuration of the convolution layers inside the residual blocks. Take [0.0] $+$ [0.4]*2 $+$ [0.5]*9 $+$ [0.6]*9 $+$ [0.7]*9 of ResNet-56 as an example: [0.0] $+$ [0.4]*2 specifies the compression rates of the convolution layers before the three groups of 9 serial residual blocks, and [0.5]*9 $+$ [0.6]*9 $+$ [0.7]*9 specifies the compression rates of the middle filters inside the three groups.
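
The two-part layout of a ResNet configuration can be made explicit with a small helper. This is an illustrative sketch only: `split_resnet_config` is a hypothetical function, and the length check reflects our reading of the notation (one rate per group plus one rate per residual block).

```python
def split_resnet_config(config, blocks_per_group):
    """Split a flat ratio list into its two parts.

    Returns (stage_rates, block_rates): one rate for the convolution before
    each group, then one list of per-block rates for each group.
    """
    n_groups = len(blocks_per_group)
    if len(config) != n_groups + sum(blocks_per_group):
        raise ValueError("config length does not match the architecture")
    stage_rates = config[:n_groups]
    block_rates, start = [], n_groups
    for n_blocks in blocks_per_group:
        block_rates.append(config[start:start + n_blocks])
        start += n_blocks
    return stage_rates, block_rates


# ResNet-56 example from the text: three groups of 9 residual blocks.
config = [0.0] + [0.4] * 2 + [0.5] * 9 + [0.6] * 9 + [0.7] * 9
stage_rates, block_rates = split_resnet_config(config, [9, 9, 9])
```

With `blocks_per_group=[18, 18, 18]`, the same helper accepts the 57-entry ResNet-110 configurations.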

We now turn to the comparison between FP-FMC and other pruning methods. The situation on ResNet [36] is similar to that on GoogleNet. For both ResNet-56 and ResNet-110, FP-FMC has a clear advantage in all indicators, and the best compressed model exceeds the accuracy of the baseline. When the floating-point operations of the model are further reduced to less than 30% of the baseline, FP-FMC still yields competitive classification accuracy (92.26% for ResNet-56 and 93.15% for ResNet-110). Figures 7 and 8 show the fine-tuning process for ResNet-56 and ResNet-110 under three compression ratio configurations. The compression ratios for ResNet-56 are [0.0] $+$ [0.18]*29, [0.0] $+$ [0.15]*2 $+$ [0.4]*27, and [0.0] $+$ [0.4]*2 $+$ [0.5]*9 $+$ [0.6]*9 $+$ [0.7]*9, respectively. The compression ratios for ResNet-110 are [0.0] $+$ [0.2]*2 $+$ [0.3]*18 $+$ [0.35]*36, [0.0] $+$ [0.25]*2 $+$ [0.4]*18 $+$ [0.55]*36, and [0.0] $+$ [0.4]*2 $+$ [0.5]*18 $+$ [0.65]*36, respectively. Similar to the results in Fig. 6, after little more than 150 iterations of fine-tuning, the classification accuracy of the compressed model becomes stable and, for the first two configurations, exceeds the baseline.

Table 4

Pruning results of ResNet-56 on CIFAR-10

Model	Top-1% $\uparrow$	FLOPs $\downarrow$	Parameters $\downarrow$
FP-FMC (Ours: pruned 1) ${}^{*}$	95.28	90.35M ( $-$ 28.0%)	0.66M ( $-$ 22.4%)
FP-FMC (Ours: pruned 2) ${}^{**}$	94.22	65.94M ( $-$ 47.4%)	0.48M ( $-$ 42.8%)
CHIP [39] (2021)	94.16	65.94M ( $-$ 47.4%)	0.48M ( $-$ 42.8%)
SCOP [44] (2020)	93.64	55.22M ( $-$ 56%)	0.37M ( $-$ 56.3%)
RFPruning [45] (2021)	93.41	62.00M ( $-$ 51.2%)	0.49M ( $-$ 42.4%)
NPPM [46] (2021)	93.40	62.75M ( $-$ 50%)	–
DECORE [47] (2021)	93.34	92.48M ( $-$ 26.3%)	0.64M ( $-$ 24.2%)
ResNet-56 (Baseline)	93.26	125.49M	0.85M
HRank [17] (2020)	93.17	62.72M ( $-$ 50.0%)	0.49M ( $-$ 42.4%)
L1 [8] (2017)	93.06	90.90M ( $-$ 27.6%)	0.73M ( $-$ 14.1%)
NISP [48] (2018)	93.01	81.00M ( $-$ 35.5%)	0.49M ( $-$ 42.4%)
GAL-0.6 [42] (2019)	92.98	78.30M ( $-$ 37.6%)	0.75M ( $-$ 11.8%)
FP-FMC (Ours: pruned 3) ${}^{***}$	92.26	34.78M ( $-$ 74.1%)	0.24M ( $-$ 71.8%)
CHIP [39] (2021)	92.05	34.79M ( $-$ 72.3%)	0.24M ( $-$ 71.8%)
CP [20] (2017)	91.8	62.75M ( $-$ 50.0%)	–
GAL-0.8 [42] (2019)	90.36	49.99M ( $-$ 60.2%)	0.29M ( $-$ 65.9%)

${}^{*}$ Compression ratio of pruned 1: [0.0] $+$ [0.18]*29. ${}^{**}$ Compression ratio of pruned 2: [0.0] $+$ [0.15]*2 $+$ [0.4]*27. ${}^{***}$ Compression ratio of pruned 3: [0.0] $+$ [0.4]*2 $+$ [0.5]*9 $+$ [0.6]*9 $+$ [0.7]*9.

Table 5

Pruning results of ResNet-110 on CIFAR-10

Model	Top-1% $\uparrow$	FLOPs $\downarrow$	Parameters $\downarrow$
FP-FMC (Ours: pruned 1) ${}^{*}$	95.32	140.54M ( $-$ 44.4%)	1.04M ( $-$ 39.1%)
FP-FMC (Ours: pruned 2) ${}^{**}$	95.17	101.97M ( $-$ 59.6%)	0.72M ( $-$ 58.1%)
CHIP [39] (2021)	94.44	121.09M ( $-$ 52.1%)	0.89M ( $-$ 48.3%)
DECORE [47] (2021)	93.88	163.30M ( $-$ 35.4%)	1.11M ( $-$ 35.7%)
ResNet-110 [36] (Baseline)	93.50	252.89M	1.72M
L1 [8] (2017)	93.30	155.00M ( $-$ 38.7%)	1.16M ( $-$ 32.6%)
FP-FMC (Ours: pruned 3) ${}^{***}$	93.15	71.69M ( $-$ 71.6%)	0.54M ( $-$ 68.3%)
HRank [17] (2020)	92.65	79.30M ( $-$ 68.6%)	0.53M ( $-$ 68.7%)
GAL-0.5 [42] (2019)	92.55	130.20M ( $-$ 48.5%)	0.95M ( $-$ 44.8%)

${}^{*}$ Compression ratio of pruned 1: [0.0] $+$ [0.2]*2 $+$ [0.3]*18 $+$ [0.35]*36. ${}^{**}$ Compression ratio of pruned 2: [0.0] $+$ [0.25]*2 $+$ [0.4]*18 $+$ [0.55]*36. ${}^{***}$ Compression ratio of pruned 3: [0.0] $+$ [0.4]*2 $+$ [0.5]*18 $+$ [0.65]*36.

Table 6

Pruning results of ResNet-50 on ImageNet

Model	Top-1% $\uparrow$	Top-5% $\uparrow$	FLOPs $\downarrow$	Parameters $\downarrow$
ResNet-50 [36] (Baseline)	76.15	92.87	4.09B	25.50M
FP-FMC (Ours: pruned 1) ${}^{*}$	75.64	92.63	2.26B ( $-$ 44.74%)	15.09M ( $-$ 40.82%)
GM [15] (2019)	75.59	92.63	1.72B ( $-$ 57.95%)	–
HRank [17] (2020)	74.98	92.33	2.30B ( $-$ 43.77%)	16.15M ( $-$ 36.67%)
SSS-32 [41] (2018)	74.18	91.91	2.82B	18.60M ( $-$ 27.06%)
RFPruning [45] (2021)	73.29	–	2.14B ( $-$ 47.9%)	13.87M ( $-$ 45.8%)
FP-FMC (Ours: pruned 2) ${}^{**}$	72.35	90.74	0.95B ( $-$ 76.77%)	8.02M ( $-$ 68.55%)
CP [20] (2017)	72.30	90.80	2.73B ( $-$ 33.25%)	–
GAL-0.5 [42] (2019)	71.95	90.94	2.33B ( $-$ 43.03%)	21.20M ( $-$ 16.86%)
SSS-26 [41] (2018)	71.82	90.79	2.33B ( $-$ 43.03%)	15.60M ( $-$ 38.82%)
GAL-0.5-joint [42] (2019)	71.80	90.82	1.84B ( $-$ 55.01%)	19.31M ( $-$ 24.27%)
GDP-0.6 [16] (2018)	71.19	90.71	1.88B ( $-$ 54.03%)	–
GDP-0.5 [16] (2018)	69.58	90.14	1.57B ( $-$ 61.61%)	–
GAL-1 [42] (2019)	69.88	89.75	1.58B ( $-$ 61.37%)	14.67M ( $-$ 42.47%)
GAL-1-joint [42] (2019)	69.31	89.12	1.11B ( $-$ 72.86%)	10.21M ( $-$ 59.96%)
ThiNet-50 [13] (2017)	68.42	88.30	1.10B ( $-$ 73.11%)	8.66M ( $-$ 66.04%)
SFP [49] (2018)	62.14	84.6	1.71B ( $-$ 58.19%)	–

${}^{*}$ Compression ratio of pruned 1: [0.0] $+$ [0.1]*3 $+$ [0.35]*16. ${}^{**}$ Compression ratio of pruned 2: [0.0] $+$ [0.5]*3 $+$ [0.6]*16.

5.2.5 ResNet on ImageNet

In controlled experiments on ImageNet [34], we compare performance on the ResNet-50 model, whose structure is more complicated. There are 53 convolutional layers and one fully-connected layer. In addition to the first convolutional layer, there are four groups of modules with 256, 512, 1024, and 2048 channels, respectively. The groups contain 3, 4, 6, and 3 serial residual blocks, respectively, and each residual block contains three or four convolution operations, depending on whether downsampling is required. In total, filter pruning can be performed on 52 convolutional layers, that is, all except the last one, which must remain unchanged to align with the fully-connected layer. In terms of classification accuracy, the advantages of FP-FMC remain. In addition to the three metrics used in the CIFAR-10 experiments, we also compare top-5 classification accuracy. As shown in Table 6, apart from the benchmark model, FP-FMC surpasses all other models in top-1 accuracy and equals the current best, GM [15], in top-5 accuracy. Similarly, when the model is compressed beyond the lowest record of existing methods, using less than 30% of the original floating-point operations, our method loses only 3.8% top-1 accuracy and 2.13% top-5 accuracy compared to the benchmark. In fact, due to the huge data volume of ImageNet, 300 fine-tuning iterations are not enough to reach the optimum. In theory, with more fine-tuning, FP-FMC could surpass the benchmark. However, for the fairness of the controlled experiment, we only compare the results after 300 fine-tuning iterations. As before, we list the compression ratio configurations for ResNet-50: [0.0] $+$ [0.1]*3 $+$ [0.35]*16 for a top-1 accuracy of 75.64%, and [0.0] $+$ [0.5]*3 $+$ [0.6]*16 for a top-1 accuracy of 72.35%.

5.3 Hyper-parameter analysis: The choice of cluster numbers

As explained in Section 4.2, the cluster number is the only hyper-parameter that affects the performance of our method. We use a brute-force search to obtain the optimal $K$ :

$\displaystyle K_{L_{i}}=\mathop{\arg\min}_{k\in[2,\mathcal{C}^{\textit{out}}_{L_{i}}-1]}\tilde{s}(k)$ (11)

where $\tilde{s}(k)$ is defined by Eq. (4). Following the law of large numbers, we repeat the above calculation on a large number of inputs and take the mode as the final result. Because this process is time-consuming, the required sample size becomes a practical concern. Fortunately, the experimental results show that a small sample size is enough.
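
The search in Eq. (11) followed by the mode over inputs can be sketched as below. The scoring function $\tilde{s}(k)$ is defined elsewhere in the paper, so here it is passed in as a callable per input; `best_k`, `choose_k`, and `per_input_scores` are hypothetical names introduced only for this illustration.

```python
from collections import Counter


def best_k(score, c_out):
    """Eq. (11): the k in [2, c_out - 1] that minimizes score(k)."""
    return min(range(2, c_out), key=score)


def choose_k(per_input_scores, c_out):
    """Mode of the per-input optimal k over all sampled inputs."""
    votes = Counter(best_k(score, c_out) for score in per_input_scores)
    return votes.most_common(1)[0][0]
```

Each element of `per_input_scores` stands for the score curve of one sampled input, and the layer needs at least three filters for the search range to be non-empty.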

Fig. 9 shows a statistical visualization of the optimal $K_{L_{i}}$ under different network architectures. We choose VGGNet, which contains only linear structures, and DenseNet and ResNet, which contain skip structures. The x-axis represents the number of samples used to calculate $K_{L_{i}}$ , and boxplots show the five-number summary. As the sample size changes, the mode of the optimal $K_{L_{i}}$ remains almost unchanged for all models. This shows that only a few samples are required to determine the hyper-parameter, which makes the most time-consuming part of FP-FMC acceptable even with limited computing power.

Figure 9.

Statistical visualization of the optimal $K_{L_{i}}$ under different network architectures and layers. In each subfigure, the x-axis represents the number of samples used to calculate $K_{L_{i}}$ . Each boxplot displays the five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles. Outliers are plotted as diamonds. The orange dotted line represents the mode of all results. A small number of samples yields an estimate of the optimal $K_{L_{i}}$ that is almost consistent with that of a large sample size, and no significant fluctuations appear as the sample size increases.

6. Conclusion

In this paper, based on observations of visualized feature maps and filters, we proposed a novel filter pruning criterion called the redundancy index. Guided by the redundancy index, we constructed a filter pruning algorithm called FP-FMC. Extensive experiments show that our algorithm outperforms state-of-the-art model compression approaches on almost all tested network architectures. Such a pruning algorithm can improve the performance of baseline models and can also be used as an acceleration technique to benefit resource-scarce downstream tasks. Furthermore, introducing traditional machine learning algorithms into the interpretation of deep features may open doors for many black-box problems in deep learning.

Footnotes

1. Code: https://github.com/liweileev/FP-FMC.

2. VGG16-bn benchmark model: https://drive.google.com/file/d/1i3ifLh70y1nb8d4mazNzyC4I27jQcHrE.

3. DenseNet-40 benchmark model: https://drive.google.com/file/d/12rInJ0YpGwZd_k76jctQwrfzPubsfrZH.

4. GoogleNet benchmark model: https://drive.google.com/file/d/1rYMazSyMbWwkCGCLvofNKwl58W6mmg5c.

5. ResNet-56 benchmark model: https://drive.google.com/file/d/1f1iSGvYFjSKIvzTko4fXFCbS-8dw556T.

6. ResNet-110 benchmark model: https://drive.google.com/file/d/1uENM3S5D_IKvXB26b1BFwMzUpkOoA26m.

Acknowledgments

This work is funded by the Key R&D Program of Zhejiang (No. 2022C03126) and the National Natural Science Foundation of China (No. 61773336). We would like to extend our gratitude to data providers.

References

Chung

Y.-A.

Zhang

Han

Chiu

C.-C.

Qin

Pang

and Wu

, W2v-BERT: Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training, arXiv preprint arXiv:2108.06209, 2021.

Zeng

Ren

Wang

Liao

Wang

Jiang

Yang

Wang

Zhang

et al., PanGu-

\alpha

: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation, arXiv preprint arXiv:2104.12369, 2021.

Brown

T.B.

Mann

Ryder

Subbiah

Kaplan

Dhariwal

Neelakantan

Shyam

Sastry

Askell

et al., Language models are few-shot learners, arXiv preprint arXiv:2005.14165, 2020.

Han

and Jiang

, Computer-aided diagnosis system of lung carcinoma using Convolutional Neural Networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 690–691.

Chen

Seff

Kornhauser

and Xiao

, Deepdriving: Learning affordance for direct perception in autonomous driving, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2722–2730.

Liu

Sun

Shi

Zhao

Yang

Guo

et al., Mastering complex control in moba games with deep reinforcement learning, in: Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), Vol. 34, 2020, pp. 6672–6679.

Chen

Wilson

Tyree

Weinberger

and Chen

, Compressing neural networks with the hashing trick, in: Proceedings of the International Conference on Machine Learning (ICML), PMLR, 2015, pp. 2285–2294.

Kadav

Durdanovic

Samet

and Graf

H.P.

, Pruning filters for efficient convnets, in: Proceedings of the International Conference on Learning Representations (ICLR), 2017.

Yan

Lin

Zheng

Zhang

Yang

and Ji

, Pams: Quantized super-resolution via parameterized max scale, in: Proceedings of the European Conference on Computer Vision (ECCV), Springer, 2020, pp. 564–580.

10.

Lin

Chen

Tao

and Luo

, Holistic cnn compression via low-rank decomposition with knowledge transfer, TPAMI 41(12) (2018), 2889–2905.

11.

Liu

and Zhang

, Few sample knowledge distillation for efficient network compression, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 14639–14647.

12.

Zheng

Tang

Zhang

Liu

and Tian

, Multinomial distribution learning for effective neural architecture search, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019, pp. 1304–1313.

13.

Luo

J.-H.

and Lin

, Thinet: A filter level pruning method for deep neural network compression, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5058–5066.

14.

Lin

and Wang

J.Z.

, Rethinking the smaller-norm-less-informative assumption in channel pruning of convolution layers, arXiv preprint arXiv:1802.00124, 2018.

15.

Liu

Wang

and Yang

, Filter pruning via geometric median for deep convolutional neural networks acceleration, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4340–4349.

16.

Lin

Huang

and Zhang

, Accelerating Convolutional Networks via Global & Dynamic Filter Pruning, in: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Vol. 2, 2018, p. 8.

17. Lin, Wang, Zhang, Tian and Shao, HRank: Filter pruning using high-rank feature map, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 1529–1538.

18. Luo J.-H. and Wu, An entropy-based pruning method for CNN compression, arXiv preprint arXiv:1706.05791, 2017.

19. Peng, Tai Y.-W. and Tang C.-K., Network trimming: A data-driven neuron pruning approach towards efficient deep architectures, arXiv preprint arXiv:1607.03250, 2016.

20. Zhang and Sun, Channel pruning for accelerating very deep neural networks, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1389–1397.

21. Hanson and Pratt, Comparing biases for minimal network construction with back-propagation, in: Proceedings of the Neural Information Processing Systems (NeurIPS), 1988.

22. LeCun, Denker and Solla, Optimal brain damage, in: Proceedings of the Neural Information Processing Systems (NeurIPS), 1989.

23. Han, Pool, Tran and Dally, Learning both weights and connections for efficient neural network, in: Proceedings of the Neural Information Processing Systems (NeurIPS), 2015.

24. Guo, Yao and Chen, Dynamic network surgery for efficient DNNs, in: Proceedings of the Neural Information Processing Systems (NeurIPS), 2016.

25. Zhang, Tang, Wen, Fardad and Wang, A systematic DNN weight pruning framework using alternating direction method of multipliers, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 184–199.

26. Molchanov, Tyree, Karras, Aila and Kautz, Pruning convolutional neural networks for resource efficient inference, arXiv preprint arXiv:1611.06440, 2016.

27. Liu, Shen, Huang, Yan and Zhang, Learning efficient convolutional networks through network slimming, in: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2736–2744.

28. Simonyan and Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556, 2014.

29. Krizhevsky and Hinton, Learning multiple layers of features from tiny images, Technical Report, Citeseer, 2009.

30. Lloyd, Least squares quantization in PCM, IEEE Transactions on Information Theory 28(2) (1982), 129–137.

31. Van der Maaten and Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9(11) (2008).

32. Rousseeuw P.J., Silhouettes: A graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics 20 (1987), 53–65.

33. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke and Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.

34. Russakovsky, Deng, Krause, Satheesh, Huang, Karpathy, Khosla, Bernstein, Berg A.C. and Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision (IJCV) 115(3) (2015), 211–252. doi: 10.1007/s11263-015-0816-y.

35. Huang, Liu, Van Der Maaten and Weinberger K.Q., Densely connected convolutional networks, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.

36. Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.

37. Paszke, Gross, Chintala, Chanan, Yang, DeVito, Lin, Desmaison, Antiga and Lerer, Automatic Differentiation in PyTorch, in: NIPS Autodiff Workshop, 2017.

38. Kingma and Ba, Adam: A method for stochastic optimization, in: International Conference on Learning Representations (ICLR), 2014.

39. Sui, Yin, Xie, Phan, Aliari Zonouz and Yuan, CHIP: CHannel Independence-based Pruning for Compact Neural Networks, in: Proceedings of the Neural Information Processing Systems (NeurIPS), 2021.

40. Zhao, Zhang, Zhao, Zhang and Tian, Variational convolutional neural network pruning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2780–2789.

41. Huang and Wang, Data-driven sparse structure selection for deep neural networks, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 304–320.

42. Lin, Yan, Zhang, Cao, Huang and Doermann, Towards optimal structured CNN pruning via generative adversarial learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2790–2799.

43. Liu, Ding and Liao, Learning compact ConvNets through filter pruning based on the saliency of a feature map, IET Image Processing 16(1) (2022), 123–133.

44. Tang, Wang, Tao and Xu, SCOP: Scientific Control for Reliable Neural Network Pruning, in: Proceedings of the Neural Information Processing Systems (NeurIPS), 2020, pp. 10936–10947.

45. Wang, Xie and Shi, RFPruning: A retraining-free pruning method for accelerating convolutional neural networks, Applied Soft Computing 113 (2021), 107860.

46. Gao, Huang, Cai and Huang, Network Pruning via Performance Maximization, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 9270–9280.

47. Alwani, Madhavan and Wang, DECORE: Deep Compression with Reinforcement Learning, arXiv preprint arXiv:2106.06091, 2021.

48. Chen C.-F., Lai J.-H., Morariu V.I., Han, Gao, Lin C.-Y. and Davis L.S., NISP: Pruning networks using neuron importance score propagation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9194–9203.

49. Kang, Dong and Yang, Soft filter pruning for accelerating deep convolutional neural networks, in: Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI), 2018, pp. 2234–2240.
