A CNN channel pruning low-bit framework using weight quantization with sparse group lasso regularization

Abstract

The deployment of large-scale Convolutional Neural Networks (CNNs) on power-limited devices is hindered by their high computation and storage costs. In this paper, we propose a novel framework for CNNs that simultaneously achieves channel pruning and low-bit quantization by combining weight quantization with Sparse Group Lasso (SGL) regularization. We model this framework as a discretely constrained problem and solve it with the Alternating Direction Method of Multipliers (ADMM). Unlike previous approaches, the proposed method reduces not only the model size but also the number of computational operations. We evaluate the proposed framework on the CIFAR datasets with several popular models, namely VGG-7/16/19 and ResNet-18/34/50. The results demonstrate that the proposed method can obtain low-bit networks and dramatically reduce the number of redundant channels with only a slight loss in inference accuracy. Furthermore, we visualize and analyze the weight tensors, which exhibit a compact group-sparse structure.

Keywords

Convolutional neural network (CNN), weight quantization, sparse group lasso (SGL), alternating direction method of multipliers (ADMM), channel pruning

1 Introduction

In recent years, a wide range of computer vision tasks, e.g., image classification [1], object detection [2], and video segmentation [3], have succeeded owing to the development of deep neural networks (DNNs), especially convolutional neural networks (CNNs). However, as vision tasks become more refined, CNN models grow larger, which imposes a heavy storage burden on devices. For example, GoogleNet, ResNet101, AlexNet and VGG-Net involve 50, 200, 250 and 500 Mbytes of model parameters respectively [4]. Such large, high-performance models are difficult to embed into resource-constrained platforms, which motivates CNN models with smaller memory and computation costs for fast inference without affecting task performance. The deployment of CNNs in real applications is mostly constrained by (a) model size, (b) run-time memory and (c) number of computing operations [6]. The computation cost of CNNs is dominated by the convolution operation, which is essentially a dot product between weights and activations [5], so the number of parameters in the model is critical to all three factors. Considerable effort has been devoted to compressing large CNNs and speeding up inference. These works include weight pruning and sharing [7–10], low rank approximation [11–13], network quantization [14–20], special model architectures [21–27], and sparsity regularization constraints [28–30]. However, most of these techniques address only one of the constraints mentioned above, whereas channel pruning can address all of them to some extent.

In this paper, we propose a CNN channel pruning and low-bit framework that combines weight quantization with SGL regularization and addresses both low-bit weight quantization and channel pruning when deploying CNNs under limited resources. For weight quantization, our approach introduces a scaling factor for each output channel, and the weights of the network are restricted to be either zero or powers of two so that expensive floating-point multiplications can be replaced by cheaper shift operations. Opportunities for channel pruning arise when a scaling factor equals zero. Moreover, SGL not only yields sparsity within a group but also promotes group sparsity (identifying insignificant channels), which facilitates channel pruning. In fact, pruning unimportant channels may sometimes lead to higher generalization accuracy. Unlike previous channel pruning work [6, 40], we also obtain low-bit networks using the proposed weight quantization function. After training with our framework, the resulting network is much more compact than the initial model. Furthermore, channel pruning can be applied iteratively to obtain even more compact networks. The main contributions of this paper are summarized as follows:

We propose a low-bit weight quantization method that introduces scaling factors and forms the basis of the unified framework. In particular, channel pruning is achieved when a scaling factor equals zero.

We introduce the basic form of SGL and model the low-bit network as a discretely constrained optimization problem with an added SGL term. Given the structure of this problem, we use the alternating direction method of multipliers (ADMM) to decompose it into several subproblems and propose an iterative algorithm that solves them efficiently.

Experiments on several well-known neural network models and benchmark datasets show that the proposed framework can obtain compact and low-bit models with little accuracy loss.

2 Related work

In this section, we briefly review CNN model compression and the ADMM algorithm.

2.1 Model compression

Recent work on neural network compression and acceleration can be roughly divided into the following five categories.

Weight pruning and sharing: [7–10] propose to find and prune unimportant connections with small weights in the neural network model. The network is then fine-tuned to recover accuracy, and the resulting network size is further reduced by a special coding scheme such as Huffman coding. However, the compression ratio of these methods is very limited, and such simple parameter pruning often produces non-structured connectivity, which may hurt inference accuracy drastically.

Low rank approximation: During training, the weights are represented by high-order tensors. [11–13] approximate these high-order tensors with the product of several low-order matrices, using techniques such as Singular Value Decomposition (SVD), while maintaining the significant information of each layer. Low rank approximation can indeed obtain a compact structure, but it needs more resources to decompose the high-order tensors and does not yield notable compression.

Low-bit weight quantization: Low-bit quantization methods represent the weights and activations by discrete values, which can replace the original floating-point operations by accumulations alone or even by binary logic operations [14]. [15] and [16] pioneered constraining the weights to the binary and ternary spaces. Subsequent works map both weights and activations into the binary or ternary space, e.g., binary neural networks (BNN) [17], XNOR-Net [18], and ternary neural networks (TNN) [19], which directly replace multiply-accumulate operations by logic operations. DoReFa-Net [20] quantizes not only weights and activations but also gradients to low-bit-width floating-point numbers with discrete states in the backward propagation. Our quantization method also falls into this category.

Special model architecture: These methods simplify the network architecture by using small convolutional filters, adopting depth-wise convolutions, or constraining convolutions to specific input channels. Related work includes Squeeze-Net [21], Mobile-Net [22, 23] and Shuffle-Net [24, 25]. Some recent works [26, 27] propose to learn a suitable model architecture automatically using reinforcement learning.

Sparsity regularization constraints: Group Lasso is an efficient regularizer for learning sparse structures and has been used in numerous studies. Kim [28] and [29] used group lasso to regularize the structure of filters in DNNs. Wen [30] applied group lasso to regularize multiple DNN structures (filters, channels, filter shapes and layer depth). These methods sparsify only part of the model structure to achieve network compression, and SGL has the same kind of effect.

2.2 ADMM

ADMM is a useful algorithm [31, 32] for solving nonconvex optimization problems, potentially with combinatorial constraints, since it can converge to a solution that may not be globally optimal but is good enough in many respects [33]. ADMM solves problems of the following form: $\begin{matrix} \min & f_{1} (u) + f_{2} (v) \\ s.t. & v = Gu \end{matrix}$ where $u \in R^{n}$, $v \in R^{d}$ and $G \in R^{d \times n}$. Using the augmented Lagrangian method, the above problem can be transformed into two subproblems that are solved separately. ADMM turns out to be a powerful method for our framework; more details regarding its use are given in the next section.
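For reference, one common way to write the two subproblems is the scaled-form ADMM iteration sketched below. The penalty parameter $ρ > 0$ and the scaled dual variable $λ$ are not named in the paragraph above; they are introduced here as assumptions, chosen to match the symbols used in Eq. (16)–(19) of Section 3.3.

```latex
% Scaled-form ADMM for  min f_1(u) + f_2(v)  s.t.  v = Gu
% Assumed symbols: \rho > 0 (penalty parameter), \lambda (scaled dual variable)
\begin{align*}
L_{\rho}(u, v, \lambda) &= f_{1}(u) + f_{2}(v)
    + \frac{\rho}{2}\,\| Gu - v + \lambda \|^{2} - \frac{\rho}{2}\,\| \lambda \|^{2} \\
u^{k+1} &= \underset{u}{\arg\min}\; f_{1}(u) + \frac{\rho}{2}\,\| Gu - v^{k} + \lambda^{k} \|^{2} \\
v^{k+1} &= \underset{v}{\arg\min}\; f_{2}(v) + \frac{\rho}{2}\,\| Gu^{k+1} - v + \lambda^{k} \|^{2} \\
\lambda^{k+1} &= \lambda^{k} + Gu^{k+1} - v^{k+1}
\end{align*}
```

Setting $u = W$, $v = B$ and $G$ to the identity recovers the form used for the proposed framework.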

3 The proposed method

In this work, an SGL regularization constraint is proposed to regularize the convolutional layers. We also concentrate on training low-bit quantized neural networks. The overall framework is shown in Figure 1. A general DNN with L layers can be summarized in the following form [5]:

Fig.1

Proposed framework. We introduce a scaling factor for each channel in the convolutional layers for weight quantization (Weight quantization). The SGL regularization is imposed as a constraint on the weights. The channels with zero scaling factors will be pruned (Sparse channel).

$\begin{matrix} x_{L} = f_{L} (x_{L - 1} W_{L} + b_{L}) \\ \dots \\ x_{1} = f_{1} (I W_{1} + b_{1}) \end{matrix}$ (1) where f_i is the nonlinear activation function, b_i the bias, x_i the output activation of layer i, and I the input image data. In convolutional layers, the weights (denoted by W_i) are organized in a four-dimensional tensor $W \in R^{c_{out} \times c_{in} \times s \times s}$, where (c_out, c_in, s, s) represents the number of output channels, the number of input channels and the filter size respectively, while in fully-connected layers they are denoted by a two-dimensional matrix. In this section, we aim to establish a framework for low-bit weight quantization combined with SGL regularization.

Let f (W) be the general loss function between predictions and data labels. The total objective function with regularization loss f_r (W) is then

$\min_{W} f (W) + f_{r} (W)$ (2) where the first term is typically the cross-entropy loss, and f_r (W) is the SGL regularization constraint for each layer. In addition, we train networks in low-bit regions, which can be mathematically modeled as a discretely constrained optimization problem.

3.1 Sparse group lasso regularization

In this section, we concentrate on the related models of Lasso, Group Lasso and Sparse Group Lasso. Consider the general linear regression model:

$y = X θ + b$ (3) where $y \in R^{n}$ denotes the n-dimensional observation vector, $θ \in R^{p}$ is an unknown regression vector, $X \in R^{n \times p}$ is the design matrix and $b \in R^{n}$ is a noise vector. Assuming the index set {1, 2, ⋯ , p} can be divided into T groups, recorded as G = {G₁, ⋯ , G_T}, the corresponding SGL estimator is

$\hat{θ} = \underset{θ \in R^{p}}{argmin} \left\{ \frac{1}{2 n} ∥ y - X θ ∥_{2}^{2} + λ_{1} ∥ θ ∥_{G, 2} + λ_{2} ∥ θ ∥_{1} \right\}$ (4) where $∥ θ ∥_{G, 2} = \sum_{i = 1}^{T} {∥ θ_{G_{i}} ∥}_{2}$, and the penalty is a convex combination of the Lasso and Group Lasso penalties governed by a mixing weight α ∈ [0, 1] (λ₂ = 0 gives the Group Lasso and λ₁ = 0 gives the Lasso fit) [35]. With a suitable regularization parameter, SGL can drive the entire parameter vector of a group to zero, so that the group can be removed from the over-parameterized model. At the same time, the method retains the important variables within the remaining groups. Besides, the three terms in Eq. (4) are all convex functions of θ, so Eq. (4) is a convex optimization problem, and the penalty terms are easy to handle when training neural networks. We follow the formulation in [35–37] and divide groups by output channels.
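To make the penalty in Eq. (4) concrete, the following minimal sketch evaluates its two regularization terms for a weight matrix whose rows are the groups. It assumes each group is one output channel flattened into a list of floats; the function name `sgl_penalty` and the toy values are illustrative only and do not come from the paper.

```python
import math

def sgl_penalty(weight_rows, lam1, lam2):
    """Return lam1 * sum_g ||w_g||_2 + lam2 * ||w||_1, the penalty part of Eq. (4).

    weight_rows: list of groups; here each group is assumed to be one
    output channel flattened to a list of floats.
    """
    # Group Lasso term: sum of the Euclidean norms of the groups
    group_term = sum(math.sqrt(sum(w * w for w in row)) for row in weight_rows)
    # Lasso term: sum of absolute values of all weights
    lasso_term = sum(abs(w) for row in weight_rows for w in row)
    return lam1 * group_term + lam2 * lasso_term

# Toy example: the second channel is entirely zero and contributes nothing.
rows = [[3.0, 4.0], [0.0, 0.0]]
print(sgl_penalty(rows, lam1=0.001, lam2=0.001))  # approximately 0.001*5 + 0.001*7
```

The group term is what pushes whole rows (channels) to zero, while the Lasso term encourages individual zeros inside the rows that survive.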

3.2 Low-bit weight quantization

This section concentrates on low-bit weight quantization for networks. Consider an L-layer CNN described by [I, W, *], where I is the input tensor, W is the weight tensor, and * represents the convolution operation between I and W (in this section, we ignore the bias terms of convolutional filters). Specifically, the weights of the network are restricted to be either zero or powers of two so that expensive floating-point multiplications can be replaced by cheaper shift operations. Let $W \in R^{c_{out} \times c_{in} \times s \times s}$, where (c_out, c_in, s, s) represents the number of output channels, the number of input channels and the filter size respectively. For convolutional layers in CNNs, we first reshape the 4-dimensional weight tensor W into a 2-dimensional matrix, i.e., W^{c_out×c_in×s×s} ⟶ W^{c_out×c_inss}, and apply our quantization constraint to it. In order to constrain the weights to a low-bit region and achieve channel pruning in DNNs, we introduce a scaling factor $α \in R^{+}$ for every output channel and a low-bit vector b such that w ≈ αb (where w is the real-valued weight vector corresponding to a row of the weight matrix W^{c_out×c_inss}). Each layer therefore has c_out scaling factors α. Introducing scaling factors in the weight tensors drastically reduces memory usage and the computation of convolution operations. Furthermore, weight channels can be pruned when α = 0 during training. Following the least squares method, our goal is to solve the following optimization problem

$\underset{α, b}{argmin} F (α, b) = \underset{α, b}{argmin} ∥ w - α b ∥^{2}$ (5) Expanding the above equation, we have

$α^{*}, b^{*} = \underset{α, b}{argmin} F (α, b) = \underset{α, b}{argmin} (α^{2} b^{T} b - 2 α w^{T} b + w^{T} w)$ (6) We assume b ∈ {±1, 0}ⁿ (where n equals c_inss), which is similar to the ternary quantization function. Since b^Tb and w^Tw are treated as constants and α is positive as stated above, Eq. (6) can be simplified to

$b^{*} = \underset{b}{argmax} {w^{T} b}$ (7) Therefore, the optimal solution of Eq. (7) is b^* = sign (w). We can extend the quantization level to k bits (replacing the sign(w) function) by using the following expression

$q (x, k) = Clip \left\{ 2^{- (k - 1)} \cdot round [2^{k - 1} (x + 1)] - 1, - 1, 1 \right\}$ (8) where round [*] refers to the mathematical rounding operation and Clip [*] is the saturation function that clips its argument to [-1, 1] (Clip [x] equals 1 when x is greater than or equal to 1, equals -1 when x is less than or equal to -1, and leaves other values unchanged). We approximate the sign function by q (x, k) in Eq. (8), which is shown in Figure 2 for k = 1, 2. It is worth noting that q (x, k) is discontinuous and non-differentiable at some points, which makes back propagation difficult. In this paper, we adopt Eq. (9) to approximate the partial derivative of q (x, k)

$\frac{\partial q (x, k)}{\partial x} = \begin{cases} \frac{1}{2 ɛ}, & if\ r - ɛ \leq | x | \leq r + ɛ \\ x, & otherwise \end{cases}$ (9) where r is a discontinuity point of q (x, k), and ɛ is an arbitrarily small positive parameter to be determined during training. The significance of this formula lies in its special treatment of the non-differentiable points. Next, we consider finding the optimal scaling factor α. To this end, we compute the first- and second-order partial derivatives of F in Eq. (6) with respect to α and b

$\begin{cases} \frac{\partial F}{\partial α} = 2 α b^{T} b - 2 w^{T} b = 0 \\ \frac{\partial^{2} F}{\partial α^{2}} = 2 b^{T} b \geq 0 \end{cases}$ (10)

$\begin{cases} \frac{\partial F}{\partial b} = 2 α^{2} b^{T} - 2 α w^{T} = 0 \\ \frac{\partial^{2} F}{\partial b^{2}} = 2 α^{2} \geq 0 \end{cases}$ (11) Both second-order derivatives are always greater than or equal to zero, so the stationary points obtained by solving Eq. (10) and Eq. (11) satisfy the necessary conditions for a minimum. Combining Eq. (10) and Eq. (11), the optimal α is

$\begin{cases} α^{*} = 0, & if\ b = 0 \\ α^{*} = \frac{w^{T} b}{b^{T} b}, & if\ b = q_{k} (*) \end{cases}$ (12) Therefore, all weights of an output channel become zero when α = 0, which means that this channel can be pruned during subsequent training. In particular, the weights of the network are restricted to {0, ±1} for k = 1 and {0, ±0.5, ±1} for k = 2 (up to the scaling factor α), so that floating-point multiplications can be replaced by cheaper bit shift operations. In DNNs, the computational complexity is dominated by the convolution operation and decreases dramatically after our quantization. In this paper, we conduct experiments mainly with k = 1 in order to obtain sparser parameters, which facilitates subsequent channel pruning.
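The quantizer of Eq. (8) and the scaling factor of Eq. (12) can be sketched for a single output channel as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the leading factor in Eq. (8) is taken to be 2^{-(k-1)}, which is what reproduces the levels {0, ±1} for k = 1 and {0, ±0.5, ±1} for k = 2 given above; rounding is round-half-up; and the weights are assumed to lie roughly in [-1, 1]. The names `q` and `quantize_channel` are hypothetical.

```python
import math

def q(x, k):
    """k-bit quantizer in the spirit of Eq. (8).

    Assumptions: scale factor 2^-(k-1) and round-half-up rounding.
    """
    levels = 2 ** (k - 1)
    y = math.floor(levels * (x + 1) + 0.5) / levels - 1
    return max(-1.0, min(1.0, y))  # Clip to [-1, 1]

def quantize_channel(w, k=1):
    """Return (alpha, b) with w ~ alpha * b for one flattened output channel."""
    b = [q(x, k) for x in w]
    btb = sum(v * v for v in b)
    if btb == 0:
        # Eq. (12), first case: an all-zero code gives alpha = 0 (prunable channel)
        return 0.0, b
    # Eq. (12), second case: least-squares scaling factor w^T b / b^T b
    alpha = sum(wi * bi for wi, bi in zip(w, b)) / btb
    return alpha, b

alpha, b = quantize_channel([0.8, -0.6, 0.1, 0.9], k=1)
print(alpha, b)  # b is [1.0, -1.0, 0.0, 1.0]; alpha is about 2.3 / 3
```

A channel whose weights are all small in magnitude, such as `[0.1, -0.2, 0.05]`, is coded as all zeros and receives `alpha = 0.0`, which is the pruning condition described above.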

3.3 Proposed framework

In this paper, training neural networks with SGL is modeled as a discretely constrained optimization problem. With the above weight quantization method, the weights in the network are restricted to {-2⁰α_i, 0, 2⁰α_i} when k = 1. Note that each layer has c_out scaling factors α. In general, the objective function of our proposed framework can be formulated as

$\begin{matrix} \min_{W} & f (W) + f_{1} (W) + f_{2} (W) \\ s.t. & W_{i} \in C_{i} \ (i = 1, 2, \dots, L) \end{matrix}$ (13) where C_i = {0, ±2⁰α_i,1, ±2⁰α_i,2, ⋯ , ±2⁰α_i,cout}, which means that the weights in the network are all mapped into a discrete domain, L is the number of layers, and each α is positive. Furthermore, f (W) is the general loss of the network output, and f₁ (W) and f₂ (W) (which contain the penalty coefficients λ₁ and λ₂ respectively) denote the Lasso and Group Lasso constraint losses respectively. The grouping criterion in the Group Lasso constraint is the same as in the above quantization method. Introducing a scaling factor in each output channel induces channel sparsity in the convolutional layers, which can be pruned later during training, and incurs less computation in the convolution operator due to the efficient convolution with quantized weight tensors. The optimization problem of Eq. (13) is difficult to solve because the weights in each layer are constrained to a discrete space. To deal with this, we adopt a systematic framework called the alternating direction method of multipliers (ADMM).

Following similar studies, we define an indicator function I_C (constraining the weights to the prescribed quantized space) for whether W ∈ C, so that the objective function in Eq. (13) can be rewritten as

$\min_{W} f (W) + f_{1} (W) + f_{2} (W) + I_{C} (W)$ (14) where I_C (W) = 0 if W ∈ C, and I_C (W) = +∞ otherwise. As in the general ADMM method, by introducing an auxiliary variable B and denoting F (W) = f (W) + f₁ (W) + f₂ (W), Eq. (14) can be expressed as

$\begin{matrix} \min_{W, B} & F (W) + I_{C} (B) \\ s.t. & W = B \end{matrix}$ (15) Here C is a nonconvex set and F (W) is in general a nonconvex function for neural networks, so this optimization problem is difficult to solve directly. As in previous work [31, 32], problems of this form can be handled by ADMM. The augmented Lagrangian of the above problem is

$L_{ρ} (W, B, λ) = F (W) + I_{C} (B) + \frac{ρ}{2} ∥ W - B + λ ∥^{2} - \frac{ρ}{2} ∥ λ ∥^{2}$ (16) Following the standard ADMM procedure, three steps are computed in each iteration to solve the above problem (for k = 0, 1, ⋯)

$W^{k + 1} := \underset{W}{argmin} L_{ρ} (W, B^{k}, λ^{k})$ (17)

$B^{k + 1} := \underset{B}{argmin} L_{ρ} (W^{k + 1}, B, λ^{k})$ (18)

$λ^{k + 1} := λ^{k} + W^{k + 1} - B^{k + 1}$ (19) where Eq. (17), Eq. (18) and Eq. (19) denote the proximal step, projection step and dual update step respectively. For the proximal step, Eq. (17) can be formulated as

$\underset{W}{argmin} L_{ρ} (W, B^{k}, λ^{k}) = \underset{W}{argmin} F (W) + \frac{ρ}{2} ∥ W - B^{k} + λ^{k} ∥^{2}$ (20) The above objective can be viewed as the general loss function F (W) with a special regularization term $\frac{ρ}{2} ∥ W - B^{k} + λ^{k} ∥^{2}$, which can be minimized via stochastic gradient descent (SGD) in PyTorch. Note that we cannot prove optimality for the problem in Eq. (17), just as optimality cannot be proven for the general DNN training problem [33]. For the projection step, Eq. (18) can be written as

$\underset{B}{argmin} L_{ρ} (W^{k + 1}, B, λ^{k}) = \underset{B}{argmin} I_{C} (B) + \frac{ρ}{2} ∥ W^{k + 1} - B + λ^{k} ∥^{2}$ (21) Since I_C (B) is the indicator function of the set C, the optimal solution of Eq. (21) is [32]

$B^{k + 1} = Π_{C} (W^{k + 1} + λ^{k})$ (22) where Π_C (*) denotes the Euclidean projection onto the set C. The projection of W^{k+1} + λ^k (denoted by U_i for layer i) onto C can be formulated as

$\begin{matrix} \min_{B_{i}, α_{i, c_{out}}} & ∥ U_{i} - B_{i} ∥^{2} \\ s.t. & B_{i} \in \{ 0, \pm 2^{0} α_{i, 1}, \dots, \pm 2^{0} α_{i, c_{out}} \} \end{matrix}$ (23) The quantization method proposed in the previous section can be applied to Eq. (23). Finally, we update the dual variable λ^k according to Eq. (19), which concludes one iteration of the whole ADMM algorithm.
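The three steps of Eq. (17)–(19) can be illustrated on a toy problem. The sketch below is not the training procedure of the paper: it assumes a simple quadratic loss F(W) = ½‖W − T‖² for a fixed target matrix T (so the W-step has a closed form instead of requiring SGD), omits the SGL terms, and approximates the projection of Eq. (23) row by row with the k = 1 quantizer followed by the least-squares scaling factor. All function names and values are hypothetical.

```python
import math

def ternary(x):
    # q(x, 1) from Eq. (8) with round-half-up: returns -1.0, 0.0 or 1.0
    return max(-1.0, min(1.0, math.floor(x + 1.5) - 1.0))

def project_row(u):
    """Heuristic projection of one row onto {0, +alpha, -alpha} (cf. Eq. (23))."""
    b = [ternary(x) for x in u]
    btb = sum(v * v for v in b)
    alpha = sum(x * v for x, v in zip(u, b)) / btb if btb else 0.0
    return [alpha * v for v in b]

def admm(T, rho=1.0, iters=20):
    """Toy ADMM with the assumed loss F(W) = 0.5 * ||W - T||^2."""
    W = [row[:] for row in T]
    B = [project_row(row) for row in W]
    lam = [[0.0] * len(row) for row in T]
    for _ in range(iters):
        # Proximal step, Eq. (20): closed-form minimizer for the quadratic toy loss
        W = [[(t + rho * (b - l)) / (1.0 + rho) for t, b, l in zip(tr, br, lr)]
             for tr, br, lr in zip(T, B, lam)]
        # Projection step, Eq. (22): project W + lambda onto the discrete set
        B = [project_row([w + l for w, l in zip(wr, lr)])
             for wr, lr in zip(W, lam)]
        # Dual update, Eq. (19)
        lam = [[l + w - b for l, w, b in zip(lr, wr, br)]
               for lr, wr, br in zip(lam, W, B)]
    return W, B

T = [[0.9, -0.8, 0.1, 0.7],      # a channel with large weights
     [0.05, -0.1, 0.02, 0.08]]   # a channel with small weights
W, B = admm(T)
print(B)
```

In this toy run the first row of `B` settles on a single scaling factor shared by its non-zero entries, while the second row is projected to all zeros, which corresponds to a channel that could be pruned.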

4 Experiments

In this section, we demonstrate the effectiveness of the proposed channel pruning framework on several benchmark datasets. We implement our method in PyTorch for image classification. Like most previous works, we do not quantize the first and last layers of the network. All models are trained with the SGD optimizer, and we report results averaged over three runs for each experiment.

4.1 Datasets

We conduct experiments on CIFAR-10 and CIFAR-100, each of which contains 50,000 training images and 10,000 test images. These two datasets [38] consist of natural images with a resolution of 32 × 32. CIFAR-10 has 10 classes while CIFAR-100 has 100 classes. We follow the data augmentation strategy in [39]: 4 pixels are padded on each side, and a patch is randomly cropped. The input data is also normalized using the channel means and standard deviations. The learning rate starts at 0.1 with a batch size of 128, and a learning rate decay factor of 0.2 is applied at epochs 60, 120 and 160 over a total of 200 epochs. Moreover, we use a weight decay of 5 × 10^-4 and a momentum of 0.9 for the SGD optimizer.
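The learning-rate schedule described above can be written as a small step function. This sketch assumes that "decay" means multiplying the current rate by 0.2 and that the change takes effect from the listed epoch onward (epochs counted from 0); the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.1, gamma=0.2, milestones=(60, 120, 160)):
    """Step schedule: multiply base_lr by gamma once for every milestone reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Rates at the start of each phase of the 200-epoch run
for epoch in (0, 60, 120, 160):
    print(epoch, learning_rate(epoch))
```

Under these assumptions the rate takes the values 0.1, 0.02, 0.004 and 0.0008 (up to floating-point rounding) over the four phases.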

4.2 Models

On CIFAR-10, we evaluate our framework on two popular and simple network architectures: VGG7 and ResNet18. The network structure of “VGG7” is “2(128-C3)+MP2+2(256-C3)+MP2+2(512-C3)+MP2+2(1024-FC)+Softmax”, and “ResNet-18” is the standard ResNet-18 provided on the official PyTorch website. To further show the effectiveness of the proposed method, we also conduct experiments on CIFAR-100. For VGGNet, we adopt VGG16 and VGG19. For ResNet, we use ResNet34 and ResNet50. All of these network architectures are taken from the official website.

Because of the limitations of our experimental conditions, we cannot evaluate our method on ImageNet or on more complex networks with more layers.

4.3 Relevant evaluation indicator

We define several evaluation terms that will be used in the following sections. Channel sparsity ratio = (number of zero channels) / (total number of channels). Parameters pruned = (number of parameters pruned) / (total number of parameters before pruning). FLOPs pruned = (FLOPs reduction) / (total FLOPs before pruning).

FLOPs are a commonly used indicator for comparing the computational complexity of CNNs. To compute the number of FLOPs, we assume that convolution is implemented as a sliding window and that the nonlinearity is computed for free. For convolutional layers we have [34] $FLOPs = 2 HW (c_{in} s^{2} + 1) c_{out}$ where H and W are the height and width of the input feature map, s is the filter size, and c_in (c_out) is the number of input (output) channels. The bias is ignored in our experiments, which simplifies the above computation.
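As a worked illustration of the FLOPs formula and the "FLOPs pruned" ratio, the sketch below counts the FLOPs of a single convolutional layer with and without the bias term. The layer sizes are made-up values chosen for the example, and the helper names are hypothetical.

```python
def conv_flops(h, w, c_in, c_out, s, bias=True):
    """FLOPs = 2 * H * W * (c_in * s^2 + 1) * c_out; the +1 is dropped when bias is ignored."""
    per_output = c_in * s * s + (1 if bias else 0)
    return 2 * h * w * per_output * c_out

def flops_pruned(before, after):
    """FLOPs pruned = (FLOPs reduction) / (total FLOPs before pruning)."""
    return (before - after) / before

# Hypothetical 3x3 layer on a 32x32 feature map, 64 -> 64 channels, bias ignored
before = conv_flops(32, 32, 64, 64, 3, bias=False)
# Same layer after removing 16 of its 64 output channels
after = conv_flops(32, 32, 64, 48, 3, bias=False)
print(before, after, flops_pruned(before, after))
```

Removing output channels reduces this layer's FLOPs in proportion to c_out; in a full network the next layer also benefits, because its c_in shrinks by the same number of channels.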

4.4 Results

Performance on CIFAR-10 and CIFAR-100:

Throughout the experiments, we compare performance under the same conditions, quantizing only the weights and not the activations. In fact, our quantization method is a special XNOR network combined with an SGL regularization constraint. Unlike the XNOR network, parameter sparsity and channel sparsity are ubiquitous in our network model. The comparison with some classical methods on CIFAR-10 is shown in Table 1, where ‘FWN’ denotes the full-precision weight network without any quantization, ‘BWN’ denotes binary weight networks, which constrain the weights to the binary space {-1, +1} (1 bit) [15], ‘TWN’ denotes ternary weight networks, which constrain the weights to the ternary space {-1, 0, +1} (2 bits) [16], ‘XNOR’ refers to XNOR-Net, which constrains both the weights and the activations to the binary space (1 bit) and can directly replace multiply-accumulate operations by binary logic operations [18], and ‘OURS’ is our framework with λ₁ = λ₂ = 0.001.

Table 1
Test Error Comparison on CIFAR-10

%	FWN	BWN	XNOR	TWN	OURS
VGG7	6.39	8.27	6.63	7.44	6.59
ResNet18	4.87	-	5.39	-	5.52
Bits	32	1	1	2	2

It can be seen that the proposed method achieves accuracy close to that of the full-precision network and, on VGG7, slightly outperforms the other quantized methods. Moreover, our method achieves channel pruning while the others do not. Since ‘XNOR’ performs best on CIFAR-10 among the classical methods, we mainly compare ‘XNOR’ with ‘OURS’ on CIFAR-100 using different networks. The results on CIFAR-100 are shown in Table 2: our method achieves lower top-1 and top-5 errors than ‘XNOR’ on all four networks and is close to ‘FWN’. This suggests that our method is robust to some extent on more complex datasets.

Effects of SGL regularization:

Table 2

Test Error Comparison on CIFAR-100

top1/top5(%)	FWN	XNOR	OURS
VGG16	26.14/8.39	31.37/10.85	30.21/10.18
VGG19	27.15/8.82	32.16/12.15	29.64/9.72
ResNet34	21.02/5.69	26.13/7.89	22.78/6.16
ResNet50	20.94/5.16	25.70/7.73	22.71/6.15

We explore the effects of adding regularization in this section. The proposed quantization method by itself leads to channel sparsity that can be exploited for channel pruning. Experimental results show that the channel sparsity ratio increases when SGL regularization is added, with little accuracy loss, as shown in Figure 3. Table 3 and Table 4 show the channel sparsity of each convolutional layer of ResNet18 and VGG7 on CIFAR-10, where “Q” uses our quantization only, “Q+SGL” uses the proposed framework, “Q+SGL1” refers to λ₁ = λ₂ = 0.001 and “Q+SGL2” refers to λ₁ = λ₂ = 0.01.

Fig.2

Quantization function q(x, k) with k = 1 and k = 2.

Fig.3

Sparse channel numbers(vertical axis) of each layer on ResNet-18.

Table 3

Channel Sparsity on ResNet18

Layers(channels)	Q	Q+SGL1	Q+SGL2
conv2(64)	4	11	15
conv3(64)	6	10	14
conv4(64)	7	16	21
conv5(64)	2	10	15
conv6(128)	0	7	10
conv7(128)	5	4	4
conv8(128)	10	93	108
conv9(128)	19	27	46
conv10(128)	1	5	4
conv11(256)	9	6	8
conv12(256)	3	3	3
conv13(256)	7	182	215
conv14(256)	34	50	66
conv15(256)	2	3	4
conv16(512)	3	10	13
conv17(512)	5	80	124
conv18(512)	60	473	487
conv19(512)	7	252	322
sparsity ratio	4.36%	29.40%	35.01%
accuracy	94.93%	94.48%	94.51%

Table 4

Channel Sparsity on VGG7

Layers(channels)	Q	Q+SGL1	Q+SGL2
conv2(128)	6	9	11
conv3(256)	12	26	33
conv4(256)	0	0	0
conv5(512)	13	37	49
sparsity ratio	2.60%	6.25%	8.07%
accuracy	93.56%	93.41%	93.47%
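The channel sparsity ratio defined in Section 4.3 can be recomputed from the per-layer counts. The sketch below does this for the Q+SGL1 and Q+SGL2 columns of Table 4, using the layer widths given in the first column; the function name is hypothetical.

```python
def channel_sparsity_ratio(zero_channels, total_channels):
    """Channel sparsity ratio = (number of zero channels) / (total number of channels)."""
    return sum(zero_channels) / sum(total_channels)

# Per-layer values copied from Table 4 (VGG7, conv2 to conv5)
vgg7_total = [128, 256, 256, 512]
vgg7_sgl1 = [9, 26, 0, 37]
vgg7_sgl2 = [11, 33, 0, 49]

print(f"Q+SGL1: {100 * channel_sparsity_ratio(vgg7_sgl1, vgg7_total):.2f}%")  # 6.25%
print(f"Q+SGL2: {100 * channel_sparsity_ratio(vgg7_sgl2, vgg7_total):.2f}%")  # 8.07%
```

Both values agree with the "sparsity ratio" row of Table 4 for these two settings.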

From Table 3 and Table 4 we observe that, on ResNet18 (VGG7), about 35% (8%) of the channels can be pruned while achieving performance similar to the original model. We thus obtain a compact model by pruning the many redundant channels identified by the proposed method. With larger values of λ₁ and λ₂, we obtain higher sparsity in most layers with almost no accuracy loss. We also visualize a convolutional kernel of ResNet18 in Figure 4 (the darker the color, the smaller the weight value; black pixels mean that the parameter value equals zero). Moreover, the channel sparsity results for CIFAR-100 are summarized in Table 5.

Fig.4

Visualization result of conv2 in Table 3.

Table 5

Channel Sparsity Ratio on CIFAR-100

(%)	Q+SGL1	Q+SGL2
VGG16	13.54	25.44
VGG19	14.35	26.79
ResNet34	22.68	30.66
ResNet50	34.87	39.98

The experimental results in Figure 4 and Table 5 agree with the above discussion on more complex datasets. From Table 4 and Table 5, it can be concluded that channel sparsity is more sensitive to the value of λ in simple networks with fewer layers, and that there are apparently more redundant channels in complex networks.

Analysis of channel pruning:

Using the proposed framework, we obtain low-bit networks in which floating-point multiplications can be replaced by cheaper bit shift operations. Besides, plenty of redundant channels exist in each layer of the network. If we prune those sparse channels, the savings in resources and computation can be considerable. The results are shown in Table 6 and Table 7. They indicate that more redundancy exists in ResNet than in VGG-Net. Moreover, we can prune more parameters and FLOPs with a larger λ (λ_SGL2 > λ_SGL1) with almost no accuracy loss.

Table 6

Channel Pruning Ratio on CIFAR-10

%	Test error		Parameters pruned		FLOPs pruned
	SGL1	SGL2	SGL1	SGL2	SGL1	SGL2
VGG7	6.63	6.60	16.9	22.0	12.1	15.3
ResNet18	6.28	6.07	32.1	41.7	26.0	34.8

Table 7

Channel Pruning Ratio on CIFAR-100

%	Test error(top1)		Parameters pruned		FLOPs pruned
	SGL1	SGL2	SGL1	SGL2	SGL1	SGL2
VGG16	30.13	31.14	26.8	50.7	25.7	40.1
VGG19	30.53	30.83	28.5	50.0	26.2	45.3
ResNet34	23.48	23.97	39.0	46.0	34.0	40.9
ResNet50	23.59	24.03	45.5	53.5	41.2	50.3

We also apply iterative channel pruning with the proposed method on the CIFAR datasets. The test errors of the models in each iteration are shown in Table 8. We count the number of pruned parameters and FLOPs in each iteration. As pruning proceeds, we obtain more compact quantized models. Notably, we can prune nearly 80% of the parameters and 70% of the FLOPs of ResNet18 without obvious accuracy loss.

Run-time memory, wall-clock time and model size savings:

Table 8

Iterative channel pruning on CIFAR-10(ResNet18)(%)

Iteration	Test error		Parameters pruned		FLOPs pruned
	SGL1	SGL2	SGL1	SGL2	SGL1	SGL2
1	6.18	6.07	32.1	41.7	26.0	34.8
2	6.09	5.88	53.7	60.6	40.3	44.4
3	6.01	6.42	60.7	67.3	46.5	52.4
4	6.18	6.49	64.1	70.4	51.5	58.1
5	6.33	6.87	70.3	76.7	58.8	66.4
6	6.55	7.38	73.6	81.1	61.7	73.5
7	6.49	7.34	76.0	83.5	64.3	76.8
8	6.53	7.52	78.1	85.7	66.6	80.1
9	7.00	7.96	80.7	87.6	70.8	83.3
10	6.84	8.31	82.8	88.8	74.1	85.6
11	7.50	-	84.3	-	76.8	-
12	7.32	-	85.8	-	79.2	-
13	7.48	-	86.7	-	81.4	-
14	8.03	-	88.0	-	83.7	-

The experiments are conducted in PyTorch with a batch size of 128. We record the run-time memory and wall-clock time of VGG16 and ResNet34 on CIFAR-100 at inference time. We also compare the sizes of the models as stored by PyTorch in the different settings. The results are shown in Table 9 and roughly match the parameters pruned in Table 6 and Table 7. Note that run-time memory, wall-clock time and model size could be further reduced on specialized hardware thanks to our low-bit quantized network.

Table 9

Run-time memory and wall-clock time savings on CIFAR-100

Net	Model	Memory	Time/Iteration	Model size
VGG16	Baseline	404MB	0.020s	61.0MB
	SGL1	282MB	0.015s	47.7MB
	SGL2	266MB	0.013s	35.9MB
ResNet34	Baseline	1118MB	0.038s	61.0MB
	SGL1	762MB	0.031s	55.9MB
	SGL2	698MB	0.029s	50.6MB

5 Conclusion and future work

This paper focused on neural network compression with low-bit weights and channel pruning. We proposed a unified framework that combines weight quantization with SGL regularization and formulated it as an optimization problem with discrete constraints, which can be solved by ADMM. Unimportant channels are identified by the proposed framework during training and then pruned. Experiments on different CNNs have shown that the proposed method is able to obtain low-bit networks and achieve channel pruning, learning more compact CNNs with little accuracy loss. Moreover, model size, computation cost and run-time memory can be reduced drastically by our method thanks to channel pruning. Because of limited experimental resources, however, we could not conduct experiments on more complex datasets such as ImageNet or on deeper networks such as ResNet101. In future work, we plan to run more experiments on other datasets and networks to demonstrate the effectiveness of the proposed framework. Furthermore, layer pruning for more complex networks is a possible direction for later research.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (NSFC) (Grants No. 61602494 and No. 61906206) and the Natural Science Foundation of Hunan Province, China (Grant No. 2019JJ50746).

References

1. Krizhevsky, Sutskever and Hinton G.E., ImageNet Classification with Deep Convolutional Neural Networks, Neural information processing systems 141(5) (2012), 1097–1105.

2. Girshick R.B., Donahue, Darrell and Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, Computer vision and pattern recognition (2014), 580–587.

3. Yang, Wang, Xiong, Yang and Katsaggelos A.K., Efficient Video Object Segmentation via Network Modulation, Computer vision and pattern recognition (2018), 6499–6507.

4. Bahdanau, Cho and Bengio, Neural Machine Translation by Jointly Learning to Align and Translate, International conference on learning representations (2015).

5. Cai, He, Sun and Vasconcelos, Deep Learning with Low Precision by Half-Wave Gaussian Quantization, Computer vision and pattern recognition (2017), 5406–5414.

6. Liu, Li, Shen, Huang, Yan and Zhang, Learning Efficient Convolutional Networks through Network Slimming, International conference on computer vision (2017), 2755–2763.

7. Han, Mao and Dally W.J., Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding, International conference on learning representations (2016).

8. Han, Liu, Mao, Pu, Pedram, Horowitz and Dally W.J., EIE: efficient inference engine on compressed deep neural network, International symposium on computer architecture 44(3) (2016), 243–254.

9. Han, Liu, Mao, Pu, Pedram, Horowitz and Dally, Deep compression and EIE: Efficient inference engine on compressed deep neural network, IEEE hot chips symposium (2016).

10. Liu, Wang, Foroosh, Tappen M.F. and Pensky, Sparse Convolutional Neural Networks, Computer vision and pattern recognition (2015).

11. Yu, Liu, Wang and Tao, On Compressing Deep Models by Low Rank and Sparse Decomposition, Computer vision and pattern recognition (2017).

12. Denton E.L., Zaremba, Bruna, Lecun and Fergus, Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation, Neural information processing systems (2014), 1269–1277.

13. Jaderberg, Vedaldi and Zisserman, Speeding up Convolutional Neural Networks with Low Rank Expansions, British machine vision conference (2014).

14. Deng, Jiao, Pei, Wu and Li, GXNOR-Net: Training deep neural networks with ternary weights and activations without full-precision memory under a unified discretization framework, Neural Networks (2018), 49–58.

15. Courbariaux, Bengio and David, BinaryConnect: training deep neural networks with binary weights during propagations, Neural information processing systems (2015), 3123–3131.

16. Li and Liu, Ternary Weight Networks, arXiv: Computer Vision and Pattern Recognition (2016).

17. Courbariaux, Hubara, Soudry, El-Yaniv and Bengio, Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or -1, arXiv: Learning (2016).

18. Rastegari, Ordonez, Redmon and Farhadi, XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks, European conference on computer vision (2016), 525–542.

19. Mellempudi, Kundu, Mudigere, Das, Kaul and Dubey, Ternary Neural Networks with Fine-Grained Quantization, arXiv: Learning (2017).

20. Zhou, Ni, Zhou, Wen, Wu and Zou, DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients, arXiv: Neural and Evolutionary Computing (2016).

21. Iandola F.N., Han, Moskewicz M.W., Ashraf, Dally W.J. and Keutzer, SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size, arXiv: Computer Vision and Pattern Recognition (2017).

22. Howard A.G., Zhu, Chen, Kalenichenko, Wang, Weyand and Adam, MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications, arXiv: Computer Vision and Pattern Recognition (2017).

23. Sandler M.B., Howard A.G., Zhu, Zhmoginov and Chen, MobileNetV2: Inverted Residuals and Linear Bottlenecks, Computer vision and pattern recognition (2018), 4510–4520.

24. Zhang, Zhou, Lin and Sun, ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices, Computer vision and pattern recognition (2018), 6848–6856.

25. Ma, Zhang, Zheng and Sun, ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design, European conference on computer vision (2018), 122–138.

26. Baker, Gupta, Naik and Raskar, Designing Neural Network Architectures using Reinforcement Learning, International conference on learning representations (2017).

27. Zoph and Le Q.V., Neural Architecture Search with Reinforcement Learning, International conference on learning representations (2017).

28. Kim and Xing E.P., Tree-Guided Group Lasso for Multi-Task Regression with Structured Sparsity, International conference on machine learning (2010).

29. Feng and Darrell, Learning the Structure of Deep Convolutional Networks, International conference on computer vision (2015).

30. Wen, Wu, Wang, Chen and Li, Learning Structured Sparsity in Deep Neural Networks, Neural information processing systems (2016), 2074–2082.

31. Takapoui, Moehle, Boyd S.P. and Bemporad, A simple effective heuristic for embedded mixed-integer quadratic programming, Advances in computing and communications (2016), 5619–5625.

32. Leng, Dou, Li, Zhu and Jin, Extremely Low Bit Neural Network: Squeeze the Last Bit Out with ADMM, National conference on artificial intelligence (2018), 3466–3473.

33. Zhang, Ye, Zhang, Tang, Wen, Fardad and Wang, A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers, European conference on computer vision (2018), 191–207.

34. Molchanov, Tyree, Karras, Aila and Kautz, Pruning Convolutional Neural Networks for Resource Efficient Inference, International conference on learning representations (2017).

35. Friedman J.H., Hastie and Tibshirani, A note on the group lasso and a sparse group lasso, arXiv: Statistics Theory (2010).

36. Simon, Friedman J.H., Hastie and Tibshirani, A Sparse-Group Lasso, Journal of Computational and Graphical Statistics 22(2) (2013), 231–245.

37. Scardapane, Comminiello, Hussain and Uncini, Group sparse regularization for deep neural networks, Neurocomputing 22(2) (2017), 81–89.

38. Krizhevsky and Hinton, Learning multiple layers of features from tiny images, Tech Report (2009).

39. He, Zhang, Ren and Sun, Deep Residual Learning for Image Recognition, Computer vision and pattern recognition (2016), 770–778.

40. He, Zhang and Sun, Channel Pruning for Accelerating Very Deep Neural Networks, International conference on computer vision (2017), 1398–1406.

