Modifying the softening process for knowledge distillation

Abstract

Knowledge distillation (KD) trains a lightweight proxy, termed the student, to mimic the outputs of a heavy neural network, termed the teacher, so that the student can run in real time on resource-limited devices. This paradigm requires aligning the soft logits of both teacher and student. However, few works have questioned whether the process of softening the logits truly gives full play to the teacher-student paradigm. In this paper, we conduct several analyses to examine this issue from scratch. Subsequently, several simple yet effective functions are devised to replace the vanilla KD. The final function can be an effective alternative to its original counterpart and works well with other techniques such as FitNets. To support this claim, we conduct several visual tasks on individual benchmarks, and the experimental results verify the potential of our proposed function in terms of performance gains. For example, when the teacher and student networks are ShuffleNetV2-1.0 and ShuffleNetV2-0.5, our proposed method achieves a 40.88% top-1 error rate on Tiny ImageNet.

Keywords

Neural network compression, knowledge distillation, knowledge transfer

1 Introduction

With the spread of mobile devices, the demand for automatic visual tasks such as object recognition and verification using machine learning methods has grown. Many advanced deep neural networks, compared with previous shallow models, are however too large to be deployed on such devices because of their expensive inference overhead. This drives the development of network compression technologies. Among them, a fundamental and versatile approach, now commonly incorporated in many visual tasks including object tracking [2] and object detection [20, 32], is knowledge distillation (KD). The early idea of KD was introduced by Ba and Caruana [1], but the concrete concept of KD was proposed by Hinton et al. [14] and is presented in Figure 1. Unlike its siblings such as network pruning [9, 12] and layer decomposition [15], which remove so-called redundant components according to a predefined rule, KD searches for a light proxy as the student network to mimic the outputs of a heavy neural network termed the teacher.

Fig. 1

A demonstration of knowledge distillation.

To date, softening the logits of both teacher and student has become the standard for KD and is widely adopted in prior work. The soft logits produce the final soft probabilities via the softmax function. The soft probabilities can thus carry, to some degree, more information about the data structure learned by the teacher. FitNets (FN) [34], Attention Transfer (AT) [43] and Similarity Preserving (SP) [38] have been combined with KD to further enhance the performance of the student.

A student generally has a lighter structure than its teacher, so the learned student might not always behave like its teacher. It is therefore natural to pose the following questions: 1) Does the process of softening logits/probabilities truly give full play to the current teacher-student paradigm? 2) If not, why, and how can this issue be addressed efficiently without extra hassle?

For the first question, we conduct several empirical studies to show that this issue can arise. Specifically, we examine whether the outputs of the student have a data distribution or structure similar to that of the softened teacher. These intuitive analyses suggest that an incompatibility between the two can emerge. This issue remains unexplored in existing studies. In our work, we delve into its cause.

Then, we respond to the second question via several simple parameter decoupling strategies. In detail, we first change the conventional softening process by removing the temperature parameter of the student. With this simple strategy, the learned student outputs softer probabilities, which are more similar to those of the softened teacher than before. Moreover, as mentioned in [42], the main hidden knowledge encapsulated in the similarity information between categories from the teacher is sometimes regarded as "dark knowledge". To reveal more dark knowledge, we introduce a segmented softening function (SSF) into the softmax function to replace the traditional softening method for the teacher. In this way, more discriminative similarity information between classes can be revealed. Building on the two strategies above, we further integrate them into a unified formulation. Experiments conducted on various visual tasks show the broad applicability of this study, which implies that our work can be seamlessly coupled with existing methods.

The rest of the paper is organized as follows. After reviewing related work (Sec. 2), we provide the background of knowledge distillation and the motivations of our proposed method (Sec. 3). Then, extensive experiments and analyses are presented (Sec. 4). Finally, we conclude (Sec. 5).

2 Related work

To the best of our knowledge, Bucilua et al. [3] first explored how to use an ensemble of networks to train a single network. Furthermore, Ba and Caruana [1] helped a shallow net imitate a deep net by penalizing the difference between their logits. Inspired by [1], Hinton et al. [14] introduced the concept of knowledge distillation, which leverages a cumbersome teacher network to educate a lightweight student network through soft targets and is not tied to specific software/hardware platforms. This knowledge distillation framework is the one explored in this paper, and it is briefly described in Section 3.1. On the theoretical side, Phuong and Lampert [30] gave a theoretical explanation of KD and identified three key factors: data geometry, optimization bias and strong monotonicity.

Recently, some approaches have been introduced to eliminate defects of KD itself. Cho and Hariharan [6] showed that a small student is unable to mimic a large teacher, and they used an "early-stopped" teacher to mitigate this. Mirzadeh et al. [25] bridged the gap between an oversized teacher and an undersized student via an intermediate-size teacher assistant. Tan et al. [36] defined an expressive teacher to educate the student. Wen et al. [39] amended the incorrect supervision and uncertain supervision of a teacher to improve KD.

However, the teacher cannot fulfil its function by only transferring soft targets. Therefore, numerous additive methods for KD have been proposed. Romero et al. [34] used intermediate-level hints from the teacher's hidden layers to guide the student. Based on the boundary supporting sample (BSS), Heo et al. [13] aimed to transfer more accurate information about the decision boundary. The works in [21, 38] concentrated on transferring the instance relationship between classes from teacher to student. Afterwards, in [5, 40], contrastive learning was applied to transfer knowledge.

Besides, KD can easily be combined with other model compression methods thanks to its flexibility. Polino et al. [31] and Mishra and Marr [26] combined quantization with KD to reduce the network weights and activations. Lee et al. [18] used singular value decomposition to enhance the accuracy of knowledge distillation. Lin et al. [19] pruned a transformable architecture via KD. However, only a little research has discussed the combination of knowledge distillation and other model compression approaches.

3 Approach

3.1 Background

The key idea behind KD, proposed by Hinton et al. [14], is to train the student not only with true labels but also with the information hidden in the teacher network. Let F_t and F_s be the functions of the teacher and student networks, respectively. For an image x with true label y, the logits (the inputs of the softmax) of teacher and student are α_t = [α_t0, α_t1, …, α_ti] and α_s = [α_s0, α_s1, …, α_si]. Their final probabilities are P_t = softmax (α_t) and P_s = softmax (α_s). Hinton et al. [14] used a temperature τ to produce a softer probability distribution over classes. Knowledge distillation thus tries to match the soft probabilities (soft targets) $P_{t}^{τ} = softmax (α_{t} / τ)$ and $P_{s}^{τ} = softmax (α_{s} / τ)$ via the KL-divergence:

$Loss\text{-}kd = τ^{2} KL (P_{s}^{τ}, P_{t}^{τ})$ (1)

Besides, a cross-entropy loss H (P_s, y) with the true label is added to the soft-target term. The final loss function is then a weighted average of the two objectives:

$Loss\text{-}distill = (1 - λ) H (P_{s}, y) + λ τ^{2} KL (P_{s}^{τ}, P_{t}^{τ})$ (2)

τ and λ are hyperparameters. τ ∈ {3, 4, 5} and λ = 0.9 are most recommended.
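As a rough illustration of Equations 1 and 2, the following dependency-free sketch computes the vanilla distillation loss for a single sample. It rests on assumptions rather than the authors' implementation: the argument order KL (P_s, P_t) is read here as Σ_i P_t[i] log (P_t[i] / P_s[i]), the form commonly used in KD implementations, and the function names and toy logits are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; tau=1 gives the ordinary softmax."""
    scaled = [a / tau for a in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(a - m) for a in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_div(p_student, p_teacher, eps=1e-12):
    """Assumed direction: sum_i p_t[i] * log(p_t[i] / p_s[i])."""
    return sum(pt * math.log((pt + eps) / (ps + eps))
               for ps, pt in zip(p_student, p_teacher))

def cross_entropy(p_student, label, eps=1e-12):
    """H(P_s, y) for a single integer class label."""
    return -math.log(p_student[label] + eps)

def loss_distill(logits_s, logits_t, label, tau=4.0, lam=0.9):
    """Equation 2: (1 - lam) * H(P_s, y) + lam * tau^2 * KL(P_s^tau, P_t^tau)."""
    hard = cross_entropy(softmax(logits_s), label)
    soft = kl_div(softmax(logits_s, tau), softmax(logits_t, tau))
    return (1.0 - lam) * hard + lam * tau ** 2 * soft

# Toy logits for three classes (hypothetical values).
teacher_logits = [5.0, 2.0, -1.0]
student_logits = [3.0, 1.0, 0.0]
loss = loss_distill(student_logits, teacher_logits, label=0)
print(f"Loss-distill = {loss:.4f}")
```

With λ = 1 and identical student and teacher logits, the soft term vanishes and the loss is zero, which is a quick sanity check on the sketch.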

3.2 Removing the temperature τ of $P_{s}^{τ}$

As mentioned above, forcing the student to mimic the soft targets of the teacher aims to help the student reveal the dark knowledge that the teacher has discovered. However, we observe that the probability distribution of a student trained by KD is not similar to that of its softened teacher. To show this visually, we use t-SNE visualization [24] following steps similar to [27]: (1) choose three classes; (2) find an orthonormal basis of the plane crossing the templates of these three classes; (3) project the final probabilities of examples from these three classes onto this plane. The resulting 2-D visualization helps us observe whether the student truly imitates the soft targets of the teacher.

Figure 2 presents the visualization results of the final probabilities $P_{s}^{'}$ trained on CIFAR-10 [16]. WideResNet16-2 and WideResNet16-1 [44] are selected as the teacher and student networks. τ is set to 3 and 4. The three classes are "Airplane", "Automobile" and "Bird", taken from the test set of CIFAR-10. The first column shows the soft targets of the teacher network. The second column shows the final probabilities $P_{s}^{'}$ of the student network trained by KD. As the dark knowledge is mainly encapsulated in the correlations between the teacher’s primary and secondary soft probabilities, it can be observed that the projections in the second column are tighter than those of the soft teacher, which means that the student does not truly grasp dark knowledge from its teacher.

Fig. 2

t-SNE visualization of the probabilities on CIFAR-10 test set. The first column shows the soft targets of the teacher network. The second column and third column are the final probabilities $P_{s}^{'}$ of the student network trained with and without τ, respectively.

We find that applying the temperature τ to $P_{s}^{τ}$ seems to hinder the student from mimicking the soft targets of the teacher. We therefore make the first change to the conventional softening process by removing the temperature parameter of the student, so that Loss-kd becomes $τ^{2} KL (P_{s}, P_{t}^{τ})$. The t-SNE visualization of this method is plotted in the last column of Figure 2. The projections in the last column are clearly more like the "soft" teacher than those in the second, and the clusters are broader. Both phenomena indicate that the $P_{s}^{'}$ of the student is softer and grasps more dark knowledge from its teacher.

More directly, suppose the logits α_t and α_s are zero-meaned for each transfer case, so that ∑_i α_ti = ∑_i α_si = 0. Then the gradients of Loss-kd $τ^{2} KL (P_{s}, P_{t}^{τ})$ and Loss-kd $τ^{2} KL (P_{s}^{τ}, P_{t}^{τ})$ can be approximated, respectively, as [44]:

$\frac{\partial KL}{\partial α_{si}} \approx \frac{1 + α_{si}}{N + \sum_{j} α_{sj}} - \frac{1 + α_{ti} / T}{N + \sum_{j} α_{tj} / T}$ (3)

$\frac{\partial KL}{\partial α_{si}} \approx \frac{1}{T} \left( \frac{1 + α_{si} / T}{N + \sum_{j} α_{sj} / T} - \frac{1 + α_{ti} / T}{N + \sum_{j} α_{tj} / T} \right)$ (4)

where N is the number of classes and T denotes the temperature τ. Equation 3 and Equation 4 can then be simplified as:

$\frac{\partial KL}{\partial α_{si}} \approx \frac{1}{NT} (T α_{si} - α_{ti})$ (5)

$\frac{\partial KL}{\partial α_{si}} \approx \frac{1}{N T^{2}} (α_{si} - α_{ti})$ (6)
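The approximations in Equations 5 and 6 can be checked numerically. The sketch below compares them with exact gradients for small zero-mean logits. It assumes the KL term is Σ_i P_t[i] log (P_t[i] / P_s[i]), whose gradient with respect to the student logits is P_s − P_t^τ without a student temperature and (P_s^τ − P_t^τ) / τ with one; the logit values and function names are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    scaled = [a / tau for a in logits]
    m = max(scaled)
    exps = [math.exp(a - m) for a in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def exact_grad_no_student_tau(a_s, a_t, tau):
    """Exact gradient of KL(P_s, P_t^tau) w.r.t. student logits: P_s - P_t^tau."""
    p_s, p_t = softmax(a_s), softmax(a_t, tau)
    return [ps - pt for ps, pt in zip(p_s, p_t)]

def exact_grad_with_student_tau(a_s, a_t, tau):
    """Exact gradient of KL(P_s^tau, P_t^tau): (P_s^tau - P_t^tau) / tau."""
    p_s, p_t = softmax(a_s, tau), softmax(a_t, tau)
    return [(ps - pt) / tau for ps, pt in zip(p_s, p_t)]

def approx_eq5(a_s, a_t, tau):
    """Equation 5: (tau * a_si - a_ti) / (N * tau)."""
    n = len(a_s)
    return [(tau * s - t) / (n * tau) for s, t in zip(a_s, a_t)]

def approx_eq6(a_s, a_t, tau):
    """Equation 6: (a_si - a_ti) / (N * tau^2)."""
    n = len(a_s)
    return [(s - t) / (n * tau ** 2) for s, t in zip(a_s, a_t)]

# Small zero-mean logits (hypothetical), so the first-order expansion is accurate.
a_t = [0.06, -0.02, -0.04]
a_s = [0.03, 0.01, -0.04]
tau = 3.0

err5 = max(abs(e - a) for e, a in zip(exact_grad_no_student_tau(a_s, a_t, tau),
                                      approx_eq5(a_s, a_t, tau)))
err6 = max(abs(e - a) for e, a in zip(exact_grad_with_student_tau(a_s, a_t, tau),
                                      approx_eq6(a_s, a_t, tau)))
print(f"max error of Eq. 5: {err5:.2e}, of Eq. 6: {err6:.2e}")
```

The approximations are first-order expansions of the exponential, so the agreement degrades as the logits grow in magnitude.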

As $\frac{1}{NT} (T α_{si} - α_{ti}) \geq \frac{1}{NT} (α_{si} - α_{ti}) \geq \frac{1}{N T^{2}} (α_{si} - α_{ti})$, Loss-kd $τ^{2} KL (P_{s}, P_{t}^{τ})$ converges faster than Loss-kd $τ^{2} KL (P_{s}^{τ}, P_{t}^{τ})$. In particular, for τ = 3, we show the test error and test loss over epochs in Figure 3 and Figure 4. It can be observed that the student trained by $KL (P_{s}, P_{t}^{τ})$ has a smaller test error but a larger test loss during the last 100 epochs, which demonstrates that the $P_{s}^{'}$ of the student is softer and obtains more dark knowledge. Moreover, we show the error rates of the above experiments in Table 1. When τ = 3, the error rate of the student trained by Loss-kd $τ^{2} KL (P_{s}, P_{t}^{τ})$ is 0.26% lower than with the original KD. When τ = 4, the error rate is 0.32% lower. From these perspectives, using P_s to imitate $P_{t}^{τ}$ can help the student learn better from its teacher. An easily overlooked but important detail is the factor τ², which compensates for the 1/τ² scaling of the gradients produced by the soft targets. However, when we use $τ^{2} KL (P_{s}, P_{t}^{τ})$ to educate the student, τ² may be too large for Loss-kd, especially when τ ≥ 5. It is therefore necessary to set a new trade-off weight to balance the gradients. We recommend 2 (τ + 1) when τ ∈ {3, 4, 5} for $P_{t}^{τ}$. Our first modified function, called the Student Compatibility Function (SCF), is given in Equation 7.

Fig. 3

Test loss comparison between student trained by $τ^{2} KL (P_{s}^{τ}, P_{t}^{τ})$ and $τ^{2} KL (P_{s}, P_{t}^{τ})$ when τ = 3.

Fig. 4

Top-1 error rate (%) comparison between student trained by $τ^{2} KL (P_{s}^{τ}, P_{t}^{τ})$ and $τ^{2} KL (P_{s}, P_{t}^{τ})$ when τ = 3.

Table 1

Top-1 error rate (%) comparison between student trained by $τ^{2} KL (P_{s}^{τ}, P_{t}^{τ})$ and $τ^{2} KL (P_{s}, P_{t}^{τ})$

τ	Teacher network	Student network	Student baseline	Student w/τ	Student w/o τ
3	WideResNet 16-2	WideResNet 16-1	8.48	7.12	6.86
4				7.15	6.83

$Loss\text{-}SCF = (1 - λ) H (P_{s}, y) + 2 λ (1 + τ) KL (P_{s}, P_{t}^{τ})$ (7)
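To make the change of trade-off weight in Equation 7 concrete, the short sketch below tabulates the vanilla weight τ² against the SCF weight 2 (τ + 1) for the recommended temperatures. The function name is hypothetical and the snippet only restates arithmetic from the text.

```python
def kl_weights(tau):
    """Return (vanilla KD weight tau^2, SCF weight 2 * (tau + 1))."""
    return tau ** 2, 2 * (tau + 1)

for tau in (3, 4, 5):
    vanilla, scf = kl_weights(tau)
    print(f"tau = {tau}: tau^2 = {vanilla}, 2(tau + 1) = {scf}")
```

The SCF weight grows linearly in τ, so the gap to τ² widens as the temperature increases, which matches the concern about τ ≥ 5 raised above.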

3.3 Segmenting the temperature τ of $P_{t}^{τ}$

A good teacher is indispensable to a studious student. When we distill a student network, the τ for $P_{t}^{τ}$ is closely related to how much knowledge the teacher can impart. As mentioned in [42], the main knowledge is the similarity information between categories. Besides, Lopez-Paz et al. [22] reported that soft class-probability predictions reveal label dependencies. However, when the temperature is low, distillation pays much less attention to the similarity information between categories. Conversely, when the temperature is high, the teacher can make uncertain predictions about some samples and misguide the student. As shown in Figure 5, we choose an image from the CIFAR-10 training set whose ground-truth label is "Ship". When τ = 3, the soft class probabilities only depict the relationship between the primary category "Ship" and the secondary category "Airplane"; the other relationships are almost negligible. When τ = 20, the distinction between the "Ship" and "Airplane" probabilities is not obvious, which may induce the student to acquire wrong information. The value of τ is therefore a double-edged sword: it either fails to dig out all the dark knowledge or may undermine the distillation process. For this reason, τ ∈ {3, 4, 5} seems to be a compromise choice.

Fig. 5

The soft probabilities of the teacher softened by τ ∈ {3, 6, 10, 20} and SSF on CIFAR-10 training set.

Why does this happen? We observe that using a single τ to soften α_t may be too simple to extract the knowledge. We therefore propose a new softening strategy for the teacher’s logits α_t. As shown in Figure 6, we divide α_t = [α_t0, α_t1, …, α_ti] into three groups according to their numerical values and call these groups the first logits, the middle logits and the last logits, respectively. We assign a larger temperature τ to the first logits and the last logits, and a smaller temperature τ′ = τ - 1 to the middle logits. However, if we directly soften the logits with τ and τ′, the primary probability may become smaller than the secondary probability. To avoid this, a segmented softening function (SSF) is proposed, as defined in Equation 8.

Fig. 6

The illustration of the segmented softening function of teacher.

$φ (α_{ti}) = \begin{cases} \frac{α_{ti}}{τ}, & α_{ti} \leq k_{0} \\ \frac{α_{ti}}{τ^{'}} + \frac{τ^{'} - τ}{τ τ^{'}} k_{0}, & k_{0} < α_{ti} \leq k_{1} \\ \frac{α_{ti}}{τ} + \frac{τ - τ^{'}}{τ τ^{'}} (k_{1} - k_{0}), & k_{1} < α_{ti} \end{cases}$ (8)

where k₀ and k₁ are manually selected anchor points that divide the groups, and $P_{t}^{φ} = softmax (φ (α_{ti}))$. The main intent of this softening function is to reallocate value from the first logits and the last logits to the middle logits, so that the important similarity information between categories is amplified and the unimportant information is restrained. The second modified function, called the Teacher Compatibility Function (TCF), is $Loss\text{-}TCF = (1 - λ) H (P_{s}, y) + λ τ^{2} KL (P_{s}^{τ}, P_{t}^{φ})$ (9)

In the remainder of this section, we demonstrate the advantages of SSF and TCF in more detail. First, as shown in Figure 5, compared with τ = 3, the soft probabilities of "Airplane" and "Automobile" under SSF are enlarged and the probabilities of "Ship" and "Dog" under SSF are reduced. Compared with τ ∈ {6, 10, 20}, the distinction between the "Ship" and "Airplane" probabilities under SSF is clearer. More discriminative similarity information is thus revealed by SSF. Second, we use WideResNet16-2 to educate WideResNet16-1 on CIFAR-10. The error rates of vanilla KD and TCF are presented in Table 2. Compared with the vanilla KD with τ = 3 and τ = 4, the error rates of TCF are lower by 0.18% and 0.10%, respectively.
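The piecewise definition in Equation 8 can be sketched in a few lines. The version below is an illustration under assumptions, not the authors' code: the teacher logits and the anchor points k₀ and k₁ are hypothetical values chosen only to exercise all three branches.

```python
import math

def ssf(alpha, tau, tau_prime, k0, k1):
    """Segmented softening of one teacher logit (Equation 8)."""
    if alpha <= k0:
        return alpha / tau                      # branch 1: temperature tau
    offset = (tau_prime - tau) / (tau * tau_prime)
    if alpha <= k1:
        return alpha / tau_prime + offset * k0  # branch 2: temperature tau'
    # branch 3: temperature tau, shifted so that the pieces join at k1
    return alpha / tau - offset * (k1 - k0)

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits (sorted in descending order) and anchor points.
teacher_logits = [8.0, 4.0, 3.0, 0.5, -1.0, -3.0]
tau, tau_prime, k0, k1 = 4.0, 3.0, 0.0, 5.0

softened = [ssf(a, tau, tau_prime, k0, k1) for a in teacher_logits]
p_t_phi = softmax(softened)                           # P_t^phi
p_t_tau = softmax([a / tau for a in teacher_logits])  # single-temperature P_t^tau
print([round(p, 3) for p in p_t_phi])
print([round(p, 3) for p in p_t_tau])
```

The offsets in the second and third branches make φ continuous at k₀ and k₁ and strictly increasing, so the ranking of the teacher's classes is preserved after softening.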

Table 2

Top-1 error rate (%) comparison between student trained by KD and TCF

Teacher network	Student network	Student baseline	KD	TCF
WideResNet 16-2	WideResNet 16-1	8.48	7.12 (τ = 3)	6.94 (τ = 3, τ′ = 2)
			7.15 (τ = 4)	7.05 (τ = 4, τ′ = 3)

Building on SCF and TCF, we further design a third modified function to form a more efficient teacher-student paradigm, called the Teacher-Student Compatibility Function (TSCF):

$Loss\text{-}TSCF = (1 - λ) H (P_{s}, y) + 2 λ (1 + τ) KL (P_{s}, P_{t}^{φ})$ (10)
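Equation 10 combines the un-tempered student probabilities with the SSF-softened teacher. A minimal single-sample sketch follows, assuming the KL term is computed as Σ_i P_t[i] log (P_t[i] / P_s[i]); the anchor points, logits and function names are hypothetical.

```python
import math

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def ssf(alpha, tau, tau_prime, k0, k1):
    """Segmented softening function of Equation 8."""
    if alpha <= k0:
        return alpha / tau
    offset = (tau_prime - tau) / (tau * tau_prime)
    if alpha <= k1:
        return alpha / tau_prime + offset * k0
    return alpha / tau - offset * (k1 - k0)

def loss_tscf(logits_s, logits_t, label,
              tau=4.0, tau_prime=3.0, k0=0.0, k1=5.0, lam=0.9):
    """Equation 10: (1 - lam) * H(P_s, y) + 2 * lam * (1 + tau) * KL(P_s, P_t^phi)."""
    p_s = softmax(logits_s)  # no temperature on the student
    p_t_phi = softmax([ssf(a, tau, tau_prime, k0, k1) for a in logits_t])
    hard = -math.log(p_s[label])
    # Assumed KL direction: teacher distribution as the reference.
    soft = sum(pt * math.log(pt / ps) for ps, pt in zip(p_s, p_t_phi))
    return (1.0 - lam) * hard + 2.0 * lam * (1.0 + tau) * soft

# Toy logits for four classes (hypothetical values).
teacher_logits = [7.0, 3.0, 1.0, -2.0]
student_logits = [4.0, 2.0, 0.5, -1.0]
loss = loss_tscf(student_logits, teacher_logits, label=0)
print(f"Loss-TSCF = {loss:.4f}")
```

Replacing the SSF-softened teacher with softmax (α_t / τ) in this sketch gives the SCF loss of Equation 7.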

4 Experiments

4.1 Image classification on CIFAR-100

CIFAR-100 [16] contains 50K training images and 10K test images in 100 classes, all at a resolution of 32 × 32. To better illustrate the advantages of the proposed functions, we use several experimental setups. Table 3 presents the setup of each experiment, including the teacher and student architectures, model sizes and compression ratio. Four compression styles are chosen. Because their width and depth can be flexibly modified, we choose variants of WideResNet and ResNet [11] as teachers or students trained on CIFAR-100 (SGD with Nesterov momentum; weight decay 5e-4; batch size of 128; 200 epochs; initial learning rate of 0.1, which is divided by 10 at epochs 100 and 150; padding, random flipping and random cropping used for data augmentation). For KD, the hyperparameters are λ = 0.9 and τ = 4. For the three proposed functions, λ = 0.9, τ = 4 and τ′ = 3. The top-1 error rate is adopted as the performance metric.
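The step learning-rate schedule described above can be written as a small function. This is a sketch under one assumption: epochs are counted from 0 and each division by 10 takes effect at the start of the milestone epoch. The function name is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, milestones=(100, 150), gamma=0.1):
    """Learning rate at a given epoch: base_lr times gamma for each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

for epoch in (0, 99, 100, 149, 150, 199):
    print(epoch, step_lr(epoch))
```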

Table 3
Experimental settings with various compression styles on CIFAR-100 dataset

Setup	Compression style	Teacher network	Student network	# of params teacher	# of params student	Compression ratio
(a)	Depth	WideResNet28-2	WideResNet16-2	1.48M	0.73M	49.32%
(b)		WideResNet34-2	WideResNet16-2	1.87M	0.73M	39.03%
(c)		WideResNet40-2	WideResNet16-2	2.26M	0.73M	32.30%
(d)	Width	WideResNet28-2	WideResNet28-1	1.48M	0.38M	25.68%
(e)		WideResNet34-2	WideResNet34-1	1.87M	0.47M	25.13%
(f)		WideResNet40-2	WideResNet40-1	2.26M	0.56M	24.78%
(g)	Depth&Width	WideResNet28-2	WideResNet16-1	1.48M	0.18M	12.16%
(h)		WideResNet34-2	WideResNet16-1	1.87M	0.18M	9.63%
(i)		WideResNet40-2	WideResNet16-1	2.26M	0.18M	7.96%
(j)	Architecture	WideResNet22-4	ResNet20	4.32M	0.28M	6.48%
(k)		ResNet56	WideResNet16-1	0.86M	0.18M	20.93%
(l)		ResNet110	WideResNet16-1	1.74M	0.18M	10.34%
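The compression ratios in Table 3 are consistent with the student parameter count divided by the teacher parameter count. The sketch below reproduces two rows from the rounded counts; the function name is hypothetical, and because the table reports parameters rounded to 0.01M, a recomputed ratio can differ from the table in the last digit.

```python
def compression_ratio(student_params_m, teacher_params_m):
    """Student size as a percentage of teacher size (parameter counts in millions)."""
    return 100.0 * student_params_m / teacher_params_m

setup_a = compression_ratio(0.73, 1.48)  # WideResNet16-2 vs WideResNet28-2
setup_i = compression_ratio(0.18, 2.26)  # WideResNet16-1 vs WideResNet40-2
print(f"setup (a): {setup_a:.2f}%, setup (i): {setup_i:.2f}%")
```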

The results of the different experiments are shown in Table 4. On average, SCF, TCF and TSCF reduce the error rate of the original KD by 0.36%, 0.53% and 0.93%, respectively. More precisely, across the different experimental settings, the error rates of the three proposed functions are all lower than those of the original KD. In particular, the accuracies of SCF, TCF and TSCF are 0.50%, 0.51% and 0.85% higher than that of the teacher network in setup (a).

Table 4

Classification performance of various knowledge distillation methods on CIFAR-100. Top-1 error rate (%) is applied as the metric

Setup	Teacher baseline	Student baseline	KD	SCF	TCF	TSCF
(a)	25.39	26.74	25.32	24.89	24.88	24.54
(b)	24.68	26.74	25.23	24.91	24.82	24.81
(c)	23.53	26.74	25.22	24.86	25.06	24.82
(d)	25.39	29.56	28.03	27.56	27.33	27.31
(e)	24.68	29.28	27.41	27.28	27.18	26.96
(f)	23.53	28.32	27.33	26.90	27.23	26.63
(g)	25.39	32.27	32.55	32.32	31.57	30.99
(h)	24.68	32.27	32.90	32.71	31.99	31.46
(i)	23.53	32.27	33.06	32.76	32.13	31.50
(j)	22.08	31.24	30.49	30.28	29.65	29.17
(k)	27.05	32.27	31.92	31.42	31.44	31.06
(l)	26.44	32.27	32.17	31.38	31.94	31.21
Average	24.70	30.00	29.30	28.94	28.77	28.37

Besides, when the teacher is too large for the student, as in setups (g), (h) and (i), KD cannot help the student imitate its teacher, and its results are 0.28%, 0.63% and 0.79% worse than the student baseline. This is a typical shortcoming of KD, which has been studied in [6, 25]. Yet, the proposed TCF and TSCF are still useful for training the student.

4.2 Image classification on Tiny ImageNet

Tiny ImageNet [17] contains 200 image classes with 500 training examples, 50 validation examples and 50 test examples per class. Its images are down-sampled to 64 × 64 from the original ImageNet [7]. Nevertheless, Tiny ImageNet is large enough to be regarded as a challenging and realistic problem. Table 5 presents the different experimental settings. For the teacher and student architectures, we select variants of the state-of-the-art mobile architecture ShuffleNetV2 [23]. All students are trained by SGD with Nesterov momentum, weight decay 5e-4 and 200 epochs, with an initial learning rate of 0.01 that is divided by 10 at epochs 100 and 150. The data augmentation consists of random rotation and random flipping. Other methods are also combined with KD and TSCF, with hyperparameters β taken from [37]. In total, the hyperparameters are KD: {λ = 0.9, τ = 4}; the three proposed functions: {λ = 0.9, τ = 4, τ′ = 3}; FN [34]: {β = 100}; AT [43]: {β = 1000}; PKT [28]: {β = 30000}; SP [28]: {β = 3000}; CC [38]: {β = 0.02}. The top-1 error rate is adopted as the performance metric.

Table 5
Experimental settings with the variants of ShuffleNetV2 architectures on Tiny ImageNet dataset

Setup	Teacher network	Student network	# of params teacher	# of params student	Compression ratio
(a)	ShuffleNetV2-2.0	ShuffleNetV2-1.0	5.73M	1.46M	25.48%
(b)	ShuffleNetV2-2.0	ShuffleNetV2-0.5	5.73M	0.55M	9.60%
(c)	ShuffleNetV2-1.0	ShuffleNetV2-0.5	1.46M	0.55M	37.67%

Table 6 presents the results of the experiments. Surprisingly, KD and the three proposed functions are all better than the teacher in setup (a). Compared with the vanilla KD, the average error rates of the three proposed functions are lower by 1.27%, 1.02% and 2.10%, respectively. TSCF shows the best results among the three functions. Compared with FN, AT, PKT, SP and CC, the average error rate of TSCF is lower by 2.03%, 1.95%, 2.12%, 1.96% and 1.98%, respectively. Moreover, it can be observed in setup (b) that when the gap between teacher and student is huge, KD loses its efficacy in training the student. Yet, our proposed functions still work and alleviate this shortcoming. In particular, relative to KD, TSCF improves setup (b) by a larger margin than the other settings. Besides, adding other methods further improves the results. For setups (a) and (c), TSCF combined with AT gives the best results. For setup (b), TSCF combined with SP gives the best result.

Table 6

Classification performance of various knowledge distillation methods on Tiny ImageNet. Top-1 error rate (%) is applied as the metric

Setup	(a)	(b)	(c)	Average
Teacher baseline	37.84	37.84	40.22	38.63
Student baseline	40.22	43.39	43.39	42.40
KD	37.41	43.81	42.62	41.28
FN	37.25	43.30	42.99	41.21
AT	37.31	43.21	42.87	41.13
PKT	37.42	43.55	42.94	41.30
SP	37.22	43.16	43.05	41.14
CC	37.20	43.18	43.09	41.16
SCF	36.82	42.00	41.20	40.01
TCF	36.93	42.39	41.47	40.26
TSCF	36.49	40.08	40.88	39.18
KD+FN	37.28	43.70	42.56	41.18
KD+AT	37.26	43.61	42.50	41.12
KD+PKT	37.37	43.69	42.60	41.22
KD+SP	37.32	43.57	42.57	41.15
KD+CC	37.30	43.59	42.58	41.16
TSCF+FN	36.17	40.02	40.79	38.99
TSCF+AT	36.12	39.95	40.71	38.93
TSCF+PKT	36.37	40.10	41.00	39.16
TSCF+SP	36.34	39.89	40.97	39.07
TSCF+CC	36.32	39.91	40.97	39.07

Besides, a deep network is able to automatically learn semantically similar classes for each image individually [41]. We therefore compare the probabilities assigned to the top-5 highest-ranked classes output by the student on the CIFAR-100 test set in Figure 7. The methods are added one by one to measure their effects. Setup (a) and setup (g) are chosen. It can be observed that the primary and secondary probabilities of the student trained by SCF and TSCF are more correlated than those of the student trained by KD, which means that the student learns more semantically similar classes for each image from its teacher. Moreover, compared with SCF, TSCF (SCF+TCF) further strengthens the correlation between the primary and secondary probabilities of the student.

Fig. 7

(a) The top-5 highest ranked probabilities of the student for setup (a); (b) The top-5 highest ranked probabilities of the student for setup (g).

4.3 Person re-identification

Person re-identification (ReID) is an important technique for automatically searching for a person in surveillance video. Market-1501 [45] and DukeMTMC-reID [33] are chosen as the datasets to evaluate the performance of our proposed functions. The Market-1501 dataset, collected from six cameras, contains 19,732 images for testing and 12,936 images for training. The training set and testing set have 751 and 750 identities, respectively. The DukeMTMC-reID dataset includes eight 85-minute high-resolution videos from eight different cameras and is divided into 16,522 training images of 702 identities and 19,889 test images of 702 identities. ResNet50 (26.99M) pre-trained on ImageNet and ResNet18 (12.34M) are the teacher and student. The student is trained by SGD with Nesterov momentum and a batch size of 64. Weight decay is 5e-4. The initial learning rate is 0.01 and is divided by 10 at epoch 40, for a total of 60 epochs. In total, the hyperparameters are KD: {λ = 0.9, τ = 4}; TSCF: {λ = 0.9, τ = 4, τ′ = 3}; FN: {β = 100}; AT: {β = 1000}; PKT: {β = 30000}; SP: {β = 3000}; CC: {β = 0.02}. Rank-1, Rank-5 and mean average precision (mAP) are used as the performance metrics.

Table 7 and Table 8 give the results for the student without and with ImageNet pre-training, respectively. It can be observed that TSCF outperforms the original KD in both tables, especially for the student without ImageNet pre-training. For example, for DukeMTMC-reID in Table 7, the Rank-1, Rank-5 and mAP of the student trained by TSCF are 4.89%, 6.29% and 5.07 higher than those of the student trained by KD. For Market-1501 in Table 7, the Rank-1, Rank-5 and mAP of the student trained by TSCF are 5.52%, 4.10% and 5.13 higher than those of the student trained by KD. Besides, we also combine other methods with KD and TSCF, which further enhances the performance. TSCF integrated with FN is significantly superior to the other methods. For example, in Table 7, its Rank-1, Rank-5 and mAP are 81.41%, 91.59% and 60.65 on Market-1501. In Table 8, its Rank-1, Rank-5 and mAP are 77.65%, 88.15% and 61.26 on DukeMTMC-reID.
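For readers unfamiliar with the metrics, Rank-k accuracy and mAP can be sketched as follows. This is a simplified illustration with hypothetical identities: each query comes with a gallery list already sorted by decreasing similarity, and the camera-based filtering used in standard ReID evaluation protocols is deliberately omitted.

```python
def rank_k_accuracy(ranked_ids, query_ids, k):
    """Fraction of queries whose identity appears among the top-k gallery results."""
    hits = sum(1 for ranked, q in zip(ranked_ids, query_ids) if q in ranked[:k])
    return hits / len(query_ids)

def average_precision(ranked, query_id):
    """Mean of the precision values at each position where a correct match occurs."""
    hits, precisions = 0, []
    for position, gallery_id in enumerate(ranked, start=1):
        if gallery_id == query_id:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_ids, query_ids):
    aps = [average_precision(r, q) for r, q in zip(ranked_ids, query_ids)]
    return sum(aps) / len(aps)

# Two toy queries with identities 1 and 2, and their ranked gallery identities.
query_ids = [1, 2]
ranked_ids = [[1, 3, 1, 2], [3, 2, 2, 1]]

rank1 = rank_k_accuracy(ranked_ids, query_ids, k=1)
rank2 = rank_k_accuracy(ranked_ids, query_ids, k=2)
m_ap = mean_average_precision(ranked_ids, query_ids)
print(rank1, rank2, round(m_ap, 4))
```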

Table 7
Person re-identification performance of various knowledge distillation methods on Market-1501 and DukeMTMC-reID when the student is not pre-trained by ImageNet. Rank-1 (%), Rank-5 (%) and mAP are applied as performance metric

Method	Market-1501			DukeMTMC-reID
	Rank-1	Rank-5	mAP	Rank-1	Rank-5	mAP
Teacher baseline	85.24	93.88	66.97	77.65	87.88	58.89
Student baseline	56.35	77.23	32.75	43.18	62.21	27.24
KD	61.40	79.75	36.85	48.02	64.63	30.79
TSCF	66.92	83.85	41.98	52.91	70.92	35.86
KD+FN	80.23	91.36	59.05	71.41	82.99	52.28
KD+AT	63.18	82.45	39.68	54.35	72.40	36.99
KD+PKT	63.60	82.39	39.53	54.46	72.58	37.54
KD+SP	63.78	82.42	39.67	56.60	71.72	37.68
KD+CC	63.76	82.40	39.68	56.61	71.74	37.69
TSCF+FN	81.41	91.59	60.65	72.83	84.17	53.72
TSCF+AT	69.83	86.31	45.26	59.29	75.54	40.96
TSCF+PKT	69.39	85.84	45.65	61.01	77.13	41.94
TSCF+SP	69.57	85.99	45.67	61.13	77.24	42.03
TSCF+CC	69.59	86.01	45.70	61.17	77.29	42.11

Table 8

Person re-identification performance of various knowledge distillation methods on Market-1501 and DukeMTMC-reID when the student is pre-trained by ImageNet. Rank-1 (%), Rank-5 (%) and mAP are applied as performance metric

Method	Market-1501			DukeMTMC-reID
	Rank-1	Rank-5	mAP	Rank-1	Rank-5	mAP
Teacher baseline	85.24	93.88	66.97	77.65	87.88	58.89
Student baseline	72.95	88.30	47.87	67.46	81.06	45.22
KD	81.44	92.84	60.48	73.46	85.77	55.12
TSCF	83.64	93.74	63.53	74.60	86.31	57.01
KD+FN	86.93	95.01	69.45	77.42	87.79	60.83
KD+AT	81.91	92.79	60.65	74.60	85.10	54.88
KD+PKT	81.58	92.49	60.56	74.07	85.58	55.11
KD+SP	81.77	92.64	60.72	74.27	85.77	55.29
KD+CC	81.80	92.69	60.76	74.31	85.82	55.34
TSCF+FN	87.23	95.31	69.73	77.65	88.15	61.26
TSCF+AT	84.09	93.62	63.79	75.63	86.44	56.96
TSCF+PKT	84.22	93.64	63.62	75.20	86.43	56.96
TSCF+SP	84.35	93.74	63.71	75.31	86.54	57.07
TSCF+CC	84.39	93.78	63.76	75.33	86.59	57.12

4.4 Semantic segmentation

Semantic segmentation aims to assign a categorical label to every pixel in an image, and it plays an important role in image understanding and self-driving systems. In this section, KD and the proposed functions are applied to semantic segmentation. DeeplabV3+ [4] is selected as the basic model to verify the performance. ResNet101 pre-trained on ImageNet is the backbone of the teacher model. ResNet50 and MobileNetV2 [35] pre-trained on ImageNet are the backbones of the student models. PASCAL VOC 2012 Aug [4] is adopted as the dataset; it is augmented from PASCAL VOC 2012 [8] with the extra annotations of [10], resulting in 10,582 training images. The student is trained by SGD with Nesterov momentum, a batch size of 16 and 30K iterations. The initial learning rate is 0.01 and is updated by the "poly" policy. All images are cropped to 513 × 513, which is larger than in the other tasks and thus more difficult to handle. The loss function is the sum of Loss-distill multiplied by ε = 0.1 and the semantic segmentation loss. In total, the hyperparameters are KD: {λ = 1, τ = 4}; TSCF: {λ = 1, τ = 4, τ′ = 3}. The performance is measured in terms of pixel intersection-over-union averaged over 21 classes (mIOU).
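The "poly" learning-rate policy and the weighted loss described above can be sketched as follows. The decay exponent is not stated in the text; the value 0.9 used here is an assumption (it is a common choice for DeepLab-style training), and both function names are hypothetical.

```python
def poly_lr(iteration, base_lr=0.01, max_iter=30000, power=0.9):
    """'Poly' policy: base_lr * (1 - iteration / max_iter) ** power.

    power=0.9 is an assumed value, not taken from the paper.
    """
    return base_lr * (1.0 - iteration / max_iter) ** power

def total_loss(loss_distill, loss_segmentation, eps=0.1):
    """Sum of the segmentation loss and the distillation loss scaled by eps."""
    return eps * loss_distill + loss_segmentation

for it in (0, 15000, 30000):
    print(it, poly_lr(it))
```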

The results are presented in Table 9. Compared with the student baseline, the proposed TSCF increases the mIOU by 1.13 and 0.38 percentage points for ResNet50 and MobileNetV2, respectively, and its results exceed those of KD in both cases. Because of the large gap between teacher and student, KD fails to transfer knowledge to the MobileNetV2 student: its mIOU (70.23) falls below the student baseline (70.98). TSCF, by contrast, remains effective. We also visualize the effect of the different methods for MobileNetV2 in Figure 8. It can be observed that the results of TSCF are more similar to the "Label" than those of the other methods.

Table 9
Semantic segmentation performance of various knowledge distillation methods on PASCAL VOC 2012 Aug. mIOU is applied as the metric

Backbone	# of params	Method	mIOU
ResNet101	58.75M	Teacher baseline	78.33
ResNet50	39.76M	Student baseline	76.03
		KD	76.77
		TSCF	77.16
MobileNetV2	5.32M	Student baseline	70.98
		KD	70.23
		TSCF	71.36

Fig. 8

Qualitative effect of employing different methods for student MobileNetV2.
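The mIOU metric used in this section can be illustrated with a small pure-Python sketch. It is a simplified version under stated assumptions: the function name `mean_iou` is hypothetical, the inputs are flat lists of per-pixel class ids, classes absent from both prediction and label are skipped, and no "ignore" label is handled. A benchmark evaluation would normally accumulate the counts over the whole validation set before averaging.

```python
def mean_iou(pred, target, num_classes=21):
    """Mean intersection-over-union over the classes that occur.

    pred, target: flat sequences of integer class ids, one per pixel.
    Classes that appear in neither sequence are skipped (an assumption).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes
    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            inter[p] += 1
    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - inter[c]
        if union > 0:
            ious.append(inter[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with four pixels and two occurring classes.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=3))
```

In the toy example, class 0 has an IoU of 1/2 and class 1 an IoU of 2/3, so the mean is 7/12.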

4.5 Analysis on λ and τ

λ and τ are two critical hyperparameters for KD and TSCF. We have mainly set λ = 0.9 and τ = 4 in the above experiments. In this section, we vary λ and τ to analyze whether TSCF is sensitive to these two parameters. The experiments are conducted on the CIFAR-100 dataset. WideResNet28-2 and WideResNet40-2 are selected as the teacher networks, and WideResNet16-1 is selected as the student network.

Figure 9 and Figure 10 show the error rates for various λ when τ = 4. It can be observed that the error rates of TSCF are consistently lower than those of KD. At both the large (λ = 1.0) and the small (λ = 0.1) extremes, the results of KD and TSCF get significantly worse. When λ = 1.0, $P_{t}^{τ}$ and $P_{t}^{φ}$ cannot be revised by the true labels. When λ = 0.1, $P_{t}^{τ}$ and $P_{t}^{φ}$ cannot efficiently educate the student. So, the value of λ should be properly selected.

Fig. 9

Top-1 error rate (%) of WideResNet16-1 supervised by WideResNet28-2 on CIFAR-100 when λ ∈ {0.1, 0.4, 0.7, 1.0} and τ = 4.

Fig. 10

Top-1 error rate (%) of WideResNet16-1 supervised by WideResNet40-2 on CIFAR-100 when λ ∈ {0.1, 0.4, 0.7, 1.0} and τ = 4.
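The role of λ in this analysis can be made concrete with a small sketch of the vanilla KD objective. This assumes the common Hinton-style weighting, (1 − λ) · CE + λ · τ² · KL, which may differ in detail from the objective defined earlier in the paper; it does not implement TSCF, and the names `softmax` and `kd_loss` are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, lam=0.9, tau=4.0):
    """(1 - lam) * CE(label, student) + lam * tau^2 * KL(teacher_tau || student_tau).

    Assumed Hinton-style weighting, used here only for illustration.
    """
    ce = -math.log(softmax(student_logits)[label])
    p_t = softmax(teacher_logits, tau)
    p_s = softmax(student_logits, tau)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_t, p_s) if pt > 0)
    return (1.0 - lam) * ce + lam * tau ** 2 * kl


# Hypothetical logits for a 3-class problem.
student = [1.0, 2.5, 0.3]
teacher = [0.5, 4.0, 0.2]
for lam in (0.1, 0.4, 0.7, 1.0):
    print(lam, kd_loss(student, teacher, label=1, lam=lam))
```

Under this weighting, λ = 1.0 removes the cross-entropy term entirely, so the true label no longer influences the loss, while a small λ leaves little weight on the teacher's soft targets.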

Figure 11 and Figure 12 show the error rates for various τ when λ = 0.9. We observe that when τ = 3, the improvement of TSCF is larger than in the other settings, because τ = 3 lets TSCF generate more discriminative similarity information. When τ is large, $P_{t}^{φ}$ becomes ambiguous, so the effectiveness of TSCF is weakened. Yet, the performance of TSCF is still better than that of KD.
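The effect of τ on how ambiguous a softened distribution becomes can be seen with a plain temperature softmax. The sketch below uses the standard softmax rather than the segmented softening function of TSCF, and the teacher logits are hypothetical values chosen only for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


teacher_logits = [6.0, 2.0, 1.0, 0.5]  # hypothetical values
for tau in (1, 3, 4, 5):
    probs = softmax(teacher_logits, tau)
    print(tau, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

As τ grows, the probabilities move toward uniform and the entropy rises, so the differences between the non-target classes carry less discriminative information.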

Fig. 11

Top-1 error rate (%) of WideResNet16-1 supervised by WideResNet28-2 on CIFAR-100 when λ = 0.9 and τ ∈ {3, 4, 5}.

Fig. 12

Top-1 error rate (%) of WideResNet16-1 supervised by WideResNet40-2 on CIFAR-100 when λ = 0.9 and τ ∈ {3, 4, 5}.

5 Conclusion

In this paper, we rethink knowledge distillation from its essence. In general, the teacher and the student differ in size or architecture, so softening their logits in the same way cannot make full use of knowledge distillation. To find a more effective teacher-student paradigm, we propose three simple but effective strategies. First, we change the conventional softening process by removing the temperature parameter of the student. Second, to reveal more dark knowledge, we introduce a segmented softening function (SSF) into the softmax function to replace the traditional softening method for the teacher. Third, building on these two strategies, we integrate them into a unified formulation. We have conducted extensive experiments. For image classification in particular, the proposed method outperforms vanilla knowledge distillation and even the teacher network, and it remains competitive with state-of-the-art approaches. When the gap between teacher and student is large, vanilla knowledge distillation loses its efficacy in training the student, whereas our proposed functions alleviate this shortcoming. For person re-identification, the proposed method cooperates well with other methods, especially FitNets. We also carry out experiments on semantic segmentation, where the proposed method further improves on the results of vanilla knowledge distillation.

Acknowledgments

This work was supported by the National Key Research and Development Program of China under Grants 2017YFB0202104 and 2018YFB0204301.

Conflict of interest

We declare that we have no financial or personal relationships with other people or organizations that could inappropriately influence our work, and that there is no professional or other personal interest of any nature or kind in any product, service and/or company that could be construed as influencing the position presented in, or the review of, this manuscript.

References

1. Ba and Caruana, Do deep nets really need to be deep? In Advances in Neural Information Processing Systems (2014), 2654–2662.
2. Bertinetto, Valmadre, Henriques J.F., Vedaldi and Torr P.H.S., Fully-convolutional siamese networks for object tracking, In European Conference on Computer Vision, 850–865. Springer (2016).
3. Bucilua, Caruana and Niculescu-Mizil, Model compression, In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006), 535–541.
4. Chen L.-C., Zhu, Papandreou, Schroff and Adam, Encoder-decoder with atrous separable convolution for semantic image segmentation, In Proceedings of the European Conference on Computer Vision (ECCV) (2018), 801–818.
5. Chen, Wang, Gan, Liu, Henao and Carin, Wasserstein contrastive representation distillation, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2021), 16296–16305.
6. Cho J.H. and Hariharan, On the efficacy of knowledge distillation, In Proceedings of the IEEE International Conference on Computer Vision (2019), 4794–4802.
7. Deng, Dong, Socher, Li L.-J., Li and Fei-Fei, ImageNet: A large-scale hierarchical image database, In 2009 IEEE Conference on Computer Vision and Pattern Recognition, 248–255. IEEE (2009).
8. Everingham, Ali Eslami S.M., Van Gool, Williams C.K.I., Winn and Zisserman, The pascal visual object classes challenge: A retrospective, International Journal of Computer Vision 111(1) (2015), 98–136.
9. Han, Mao and Dally W.J., Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding, arXiv preprint arXiv:1510.00149 (2015).
10. Hariharan, Arbeláez, Bourdev, Maji and Malik, Semantic contours from inverse detectors, In 2011 International Conference on Computer Vision, 991–998. IEEE (2011).
11. He, Zhang, Ren and Sun, Deep residual learning for image recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 770–778.
12. He, Zhang and Sun, Channel pruning for accelerating very deep neural networks, In Proceedings of the IEEE International Conference on Computer Vision (2017), 1389–1397.
13. Heo, Lee, Yun and Choi J.Y., Knowledge distillation with adversarial samples supporting decision boundary, In Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 3771–3778.
14. Hinton, Vinyals and Dean, Distilling the knowledge in a neural network, arXiv preprint arXiv:1503.02531 (2015).
15. Jaderberg, Vedaldi and Zisserman, Speeding up convolutional neural networks with low rank expansions, arXiv preprint arXiv:1405.3866 (2014).
16. Krizhevsky, Hinton, et al., Learning multiple layers of features from tiny images (2009).
17. Le and Yang, Tiny imagenet visual recognition challenge, CS 231N 7 (2015).
18. Lee S.H., Kim D.H. and Song B.C., Self-supervised knowledge distillation using singular value decomposition, In European Conference on Computer Vision, 339–354. Springer (2018).
19. Lin, Ji, Yan, Zhang, Cao, Ye, Huang and Doermann, Towards optimal structured cnn pruning via generative adversarial learning, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), 2790–2799.
20. Lin T.-Y., Dollár, Girshick, He, Hariharan and Belongie, Feature pyramid networks for object detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), 2117–2125.
21. Liu, Cao, Li, Yuan, Hu, Li and Duan, Knowledge distillation via instance relationship graph, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019), 7096–7104.
22. Lopez-Paz, Bottou, Schölkopf and Vapnik, Unifying distillation and privileged information, arXiv preprint arXiv:1511.03643 (2015).
23. Ma, Zhang, Zheng H.-T. and Sun, Shufflenet v2: Practical guidelines for efficient cnn architecture design, In Proceedings of the European Conference on Computer Vision (ECCV) (2018), 116–131.
24. van der Maaten and Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9(Nov) (2008), 2579–2605.
25. Mirzadeh S.-I., Farajtabar, Li, Levine, Matsukawa and Ghasemzadeh, Improved knowledge distillation via teacher assistant, arXiv preprint arXiv:1902.03393 (2019).
26. Mishra and Marr, Apprentice: Using knowledge distillation techniques to improve low-precision network accuracy, arXiv preprint arXiv:1711.05852 (2017).
27. Müller, Kornblith and Hinton G.E., When does label smoothing help? In Advances in Neural Information Processing Systems (2019), 4694–4703.
28. Passalis and Tefas, Learning deep representations with probabilistic knowledge transfer, In Proceedings of the European Conference on Computer Vision (ECCV) (2018), 268–284.
29. Peng, Jin, Liu, Li, Wu, Liu, Zhou and Zhang, Correlation congruence for knowledge distillation, In Proceedings of the IEEE International Conference on Computer Vision (2019), 5007–5016.
30. Phuong and Lampert, Towards understanding knowledge distillation, In International Conference on Machine Learning (2019), 5142–5151.
31. Polino, Pascanu and Alistarh, Model compression via distillation and quantization, arXiv preprint arXiv:1802.05668 (2018).
32. Redmon, Divvala, Girshick and Farhadi, You only look once: Unified, real-time object detection, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), 779–788.
33. Ristani, Solera, Zou, Cucchiara and Tomasi, Performance measures and a data set for multi-target, multi-camera tracking, In European Conference on Computer Vision, 17–35. Springer (2016).
34. Romero, Ballas, Kahou S.E., Chassang, Gatta and Bengio, Fitnets: Hints for thin deep nets, arXiv preprint arXiv:1412.6550 (2014).
35. Sandler, Howard, Zhu, Zhmoginov and Chen L.-C., Mobilenetv2: Inverted residuals and linear bottlenecks, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), 4510–4520.
36. Tan, Liu and Zhang, Improving knowledge distillation via an expressive teacher, Knowledge-Based Systems 218 (2021), 106837.
37. Tian, Krishnan and Isola, Contrastive representation distillation, arXiv preprint arXiv:1910.10699 (2019).
38. Tung and Mori, Similarity-preserving knowledge distillation, In Proceedings of the IEEE International Conference on Computer Vision (2019), 1365–1374.
39. Wen, Lai and Qian, Preparing lessons: Improve knowledge distillation with better supervision, Neurocomputing 454 (2021), 25–33.
40. Xu, Liu, Li and Loy C.C., Knowledge distillation meets self-supervision, In European Conference on Computer Vision, 588–604. Springer (2020).
41. Yang, Xie, Qiao and Yuille A.L., Training deep neural networks in generations: A more tolerant teacher educates better students, In Proceedings of the AAAI Conference on Artificial Intelligence 33 (2019), 5628–5635.
42. Yuan, Tay F.E.H., Li, Wang and Feng, Revisiting knowledge distillation via label smoothing regularization, In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 3903–3911.
43. Zagoruyko and Komodakis, Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer, arXiv preprint arXiv:1612.03928 (2016).
44. Zagoruyko and Komodakis, Wide residual networks, arXiv preprint arXiv:1605.07146 (2016).
45. Zheng, Shen, Tian, Wang and Tian, Scalable person re-identification: A benchmark, In Proceedings of the IEEE International Conference on Computer Vision (2015), 1116–1124.