FEFR: Feature early fusion and reconstruction for online knowledge distillation

Abstract

Online knowledge distillation removes the pre-determined strong-teacher and weak-student relationship and provides a new way of thinking about knowledge distillation. However, current online methods mostly rely on logit-based prediction distributions, and features containing rich semantic information are rarely used. Even when feature-based methods are used, they operate only on the last layer of the network, without further exploring the representation knowledge in the intermediate-layer feature maps. To address these issues, we propose a feature early fusion and reconstruction (FEFR) method for online knowledge distillation, which comprises four essential components: multi-scale feature extraction with intermediate-layer feature early fusion, feature reconstruction, dual attention, and an overall fusion module. We perform early fusion with a “sum” operation on feature matrices from different layers to improve the feature map representation. The features are then reconstructed to enhance communication between feature groups. A dual-attention module adaptively enhances the critical channels and spatial regions in order to collect more accurate information. Finally, the processed feature maps are combined by a fusion module, which also aids the training of the student models. Experiments with several network architectures on CIFAR-10, CIFAR-100, CINIC-10 and ImageNet 2012 show that FEFR provides more useful representation knowledge for distillation and improves accuracy by about 0.5% compared to other methods.

Keywords

Online knowledge distillation; Teacher-student models; Multi-scale; Feature early fusion

1 Introduction

Deep convolutional neural networks have shown promise in several applications, including image classification [1, 2], object detection [3, 4] and segmentation [5, 6]. However, modern convolutional neural networks typically need a lot of compute and storage to attain high performance, which drastically restricts their use on devices with limited resources.

Over the last few years, this problem has been the subject of extensive research, and numerous model compression and acceleration strategies have been proposed as solutions. Representative methods include pruning [7, 8], quantization [9, 10], low-rank factorization [11, 12], and knowledge distillation [13]. Knowledge distillation is one of the most effective approaches to model compression: an over-parameterized neural network is first trained as the teacher, and a small student network is then trained to replicate the teacher's output. Because it inherits the teacher's knowledge, the student model can take the place of the over-parameterized teacher model and achieve model compression and rapid inference.

Traditional KD [13] approaches use offline distillation, often known as a two-stage training strategy, to transfer knowledge from pre-trained high-capacity models to compact student models [14]. High-capacity teachers are not always accessible, and even when they are, complex teachers incur greater computational costs and longer training periods. KD also suffers from a capacity gap when there is a large difference in size between the student and teacher models [15]. To address these problems, online knowledge distillation (OKD) has been proposed [16, 17]. This method is more appealing because it removes the pre-defined strong and weak roles and reduces the training procedure to an end-to-end, one-stage process rather than relying on pre-trained high-performance teachers. All models are trained simultaneously and learn from one another throughout training. In other words, all networks distill and share knowledge. Online knowledge distillation achieves superior performance compared to offline KD while maintaining a simpler structure. However, traditional approaches focus on transferring logits as soft targets. Although soft targets offer more detailed information than one-hot labels, relying on logits alone is comparatively simplistic. Feature maps can reveal important information about spatial locations, channels, and their relationships, but straightforward alignment or merging does not fully exploit these representations.

To solve the aforementioned problems, we present a novel feature early fusion and reconstruction approach (FEFR) for online knowledge distillation. FEFR consists of four essential elements: multi-scale feature extraction with intermediate-layer feature early fusion, feature reconstruction, dual attention, and an overall fusion module. In order to gain more useful representational knowledge for distillation, we first extract multi-scale features from the middle layers and the last layer, which can attend to both local details and global regions. We then fuse these features by “sum” and perform early fusion of the features from the different middle layers. The features are reconstructed in order to enhance communication between feature groups. Finally, to aid the learning of the student models, the student networks use feature fusion to combine the resulting feature maps and feed them into a fused classifier. In summary, the major contributions of this paper are as follows:

We extract features from several intermediate layers and the last layer to enhance the multi-scale representation of features and provide deeper information beyond straightforward alignment. We then propose intermediate-layer feature early fusion to obtain more comprehensive features.

We reconstruct the features to enhance communication between feature groups. We design a dual-attention module that adaptively enhances the critical channels and spatial regions in order to collect more accurate information. The processed feature maps are then combined by feature fusion, which also aids the training of the student models.

Extensive experiments on CIFAR-10/100 [18], CINIC-10 [19] and ImageNet2012 [20] demonstrate the effectiveness of the proposed FEFR. Our method improves the feature representation and produces additional information for knowledge distillation.

2 Related work

2.1 Vanilla knowledge distillation

The concept of transferring knowledge from a large model to a small one without a major loss in accuracy is inspired by [21]. Conventional KD has two stages and requires a pre-trained teacher. Because the feature maps of some student models do not match those of the teacher models, some works propose to distill from the hidden layers in the middle of the network [14]. To exploit more accurate information, [22] combines attention with distillation to make better use of more specific information. By mimicking the teacher's flow matrix computed from intermediate outputs, [23] investigates the interaction between layers. In [24], an adversarial training approach enables the student and teacher networks to learn realistic data distributions. In order to close the capacity gap between the teacher and student models, a teacher assistant is added in [15]. [25] proposes distillation utilizing the activation boundaries created by hidden neurons.

2.2 Online knowledge distillation

Online knowledge distillation enhances the performance of student models while removing the dependency on high-capacity teacher models that are costly in time and resources. By sharing predictions throughout the training process, this paradigm allows student models to learn from one another. DML [16] is a representative strategy in which several networks cooperate: each network uses the KL divergence to mimic the class probabilities of its peer networks. [25] pushes DML further by constructing an ensemble logit as a teacher, averaging the predictions of a set of students to enhance generalization. [26] suggests a switchable learning mode strategy to raise student performance. [17] introduces a fusion module to create a fused classifier that guides the training of the sub-networks. A gate module is incorporated in [27] to produce importance scores for each branch and produce a more effective teacher. [28] proposes a two-level distillation between a group leader and several auxiliary peers to increase the diversity of student models. [29] suggests using a weight assessment system to build a virtual teacher. In terms of architectural design, [30] proposes to embed an integration branch and an adaptive fusion branch between two parallel peer networks for learning. In [31, 32], knowledge is distilled inside the network itself, moving from the deeper to the shallower portions of the network, so the teacher and student models share one network. OKDPH [33] constructs a learning method with mixed parameters.

2.3 Feature fusion methods

Feature fusion is applied to dual learning in [17, 34–36]. In a bilinear CNN [34], the outputs of two distinct networks are combined and projected to a bilinear vector. DualNet [35] builds a fused classifier by training two parallel networks with the same topology and combining their features using the “sum” function. Iterative training is also used to update the weights of the sub-networks alternately in order to discover complementary features. FFL [17] fuses the last-layer feature maps of the two student networks, so that the two student networks and the fused classifier work well together. In MFEF [36], the last-layer feature maps of the two student networks are subjected to multiple split and concatenate operations, and dual attention is then applied to these features to obtain the feature map. By combining the resulting feature maps and sending them into a fused classifier, MFEF obtains multi-scale features to aid learning.

3 Proposed method

In this section, we describe the framework and the loss function in detail. Figure 1 depicts FEFR. In contrast to other KD approaches, FEFR goes further into the information provided by the middle-layer feature maps, including using them to connect the features of the first and last layers. It takes into account the performance of both the fused classifier and the student networks. During training, the fusion module fuses the features of the parallel sub-networks, and the fused classifier then produces the final classification results.

Fig. 1

Overview of feature early fusion and reconstruction for online knowledge distillation. Features extracted from subnetwork are reconstructed, dual attention and fusion modules form the final classifier to guide subnetwork learning.

3.1 Definition of the problem

A labeled dataset of N samples is given as $D = \{(x_{i}, y_{i})\}_{i = 1}^{N}$, where x_i is the ith input sample and y_i ∈ {1, 2, …, M} is the matching ground-truth label. M is the number of classes in the dataset. The logits produced by the final fully connected layer of the student S_k are represented as $z_{k} = \{z_{k}^{1}, z_{k}^{2}, \ldots, z_{k}^{M}\}$. Given n student models $\{S_{k}\}_{k = 1}^{n}$, the probability $p_{k}^{m} (x_{i})$ that the kth student assigns to the mth class for the sample x_i is computed with the softmax function [37]:

$p_{k}^{m} (x_{i}) = \frac{\exp (z_{k}^{m} / T)}{\sum_{j = 1}^{M} \exp (z_{k}^{j} / T)}$ (1)

where $z_{k}^{m}$ is the output of the kth student network for class m and T is the temperature factor, which makes the probability distribution softer as it rises. The probability is written as $p_{k}^{m} (x_{i})$ when T = 1 and as ${\tilde{p}}_{k}^{m} (x_{i})$ when T > 1. The goal of multi-class classification is to reduce the cross-entropy loss [38] between the ground-truth labels and the softmax outputs. The cross-entropy of the kth student network is defined as:

$L_{k}^{CE} = - \sum_{m = 1}^{M} l_{m} \log (p_{k}^{m} (x_{i}))$ (2)

where l_m = 1 if y_i = m and l_m = 0 otherwise. Knowledge is transferred by matching the softened probabilities of the student model ${\tilde{p}}_{k}^{m} (x_{i})$ and the teacher model ${\tilde{p}}_{t}^{m} (x_{i})$. We use the Kullback-Leibler divergence as the distillation loss of the kth student model:

$L_{k}^{D} = \sum_{m = 1}^{M} {\tilde{p}}_{t}^{m} (x_{i}) \log \frac{{\tilde{p}}_{t}^{m} (x_{i})}{{\tilde{p}}_{k}^{m} (x_{i})}$ (3)
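Equations (1) to (3) can be illustrated with a minimal sketch in plain Python. The function names and the toy logits are hypothetical and chosen only for illustration; class labels are 0-indexed here, whereas the text uses 1 to M.

```python
import math

def softened_softmax(logits, temperature=1.0):
    """Eq. (1): softmax over logits divided by the temperature T."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    """Eq. (2): cross-entropy with a one-hot label, computed at T = 1."""
    probs = softened_softmax(logits, 1.0)
    return -math.log(probs[label])

def kl_distillation(teacher_logits, student_logits, temperature):
    """Eq. (3): KL(teacher || student) between softened distributions."""
    p_t = softened_softmax(teacher_logits, temperature)
    p_k = softened_softmax(student_logits, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_t, p_k))

# Made-up logits for M = 3 classes.
student = [2.0, 0.5, -1.0]
teacher = [1.5, 1.0, -0.5]
print(softened_softmax(student, 1.0))
print(softened_softmax(student, 3.0))  # flatter than the T = 1 distribution
print(cross_entropy(student, label=0))
print(kl_distillation(teacher, student, temperature=3.0))
```

Raising the temperature lowers the largest probability and raises the smaller ones, which is the softening effect described for T > 1.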

3.2 FEFR framework

As shown in Fig. 1, the primary goal of FEFR is to sequentially extract several intermediate-layer features. We then use a specific intermediate layer as a bridge to connect the information carried by the intermediate-layer features on either side of it, which produces additional knowledge for distillation. Feature reconstruction increases the communication between feature groups. The dual-attention mechanism lets us focus more on the details of the features. The fusion module is responsible for fusing the newly acquired features. The following sections discuss each essential component.

3.2.1 Multi-scale feature extraction and intermediate layer feature early fusion

In addition to soft targets, and inspired by [39], we extract multi-scale features, which are important for visual tasks. As shown in Fig. 2, the feature outputs of two intermediate layers and the last layer are extracted. The features of the intermediate layers provide geometric information, and the feature map of the last layer is used because it carries richer and more specific high-level semantic information. In this study, the analysis is conducted using two student networks. For notational convenience, we name the intermediate feature maps of the ith student model F_i1 and F_i2, and the feature map of its last layer F_i3. Similarly, we refer to the feature maps of the jth student model as F_j1, F_j2, and F_j3. Multi-scale feature extraction produces feature maps with receptive fields of several scales. The receptive field expands as additional features are concatenated. Smaller receptive fields concentrate on details, whereas larger receptive fields gather global information. Combining the two can result in more meaningful feature maps and higher distillation efficiency.

Fig. 2

Multi-scale feature extraction. Features are extracted from the middle layers and the last layer.

As shown in Fig. 3, after obtaining the two intermediate feature maps and the last-layer feature map, we first perform a “sum” operation (matrix addition) to fuse the two intermediate feature maps, F_i1 + F_i2, which is denoted F_i12 for convenience. Then, we “sum” the second intermediate feature map with the feature map of the last layer, F_i2 + F_i3, denoted F_i23. We concatenate the feature maps F_i12 and F_i23, which connects the last-layer features with the shallow ones, and denote the result F_i. Performing exactly the same operations on the other network gives F_j. The two “sum” operations reduce the load on the network. This method is equivalent to providing the network with a prior: the semantic features of corresponding channels in the two input feature maps are similar, so mixing of semantic information between different channels is avoided. The subsequent concatenation fully fuses the middle-layer features of the network into effective and rich semantic information, which improves the expressiveness of the features.
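The “sum” and concatenate steps described above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: feature maps are stored as lists of channels with flattened spatial positions, the three maps are assumed to have already been brought to the same shape (the role of “conv” in Fig. 3), and all function names are hypothetical.

```python
def add_maps(a, b):
    """Element-wise "sum" of two feature maps stored as [channel][position]."""
    assert len(a) == len(b) and all(len(x) == len(y) for x, y in zip(a, b))
    return [[x + y for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]

def concat_channels(a, b):
    """Concatenate two feature maps along the channel axis."""
    return a + b

def early_fusion(f1, f2, f3):
    """F_i = concat(F_i1 + F_i2, F_i2 + F_i3); F_i2 acts as the bridge."""
    f12 = add_maps(f1, f2)
    f23 = add_maps(f2, f3)
    return concat_channels(f12, f23)

# Toy maps with 2 channels and 4 spatial positions each.
f_i1 = [[1, 1, 1, 1], [2, 2, 2, 2]]
f_i2 = [[10, 10, 10, 10], [20, 20, 20, 20]]
f_i3 = [[100, 100, 100, 100], [200, 200, 200, 200]]
f_i = early_fusion(f_i1, f_i2, f_i3)
print(len(f_i))  # 4 channels: 2 from F_i12 and 2 from F_i23
print(f_i)
```

Because “sum” keeps the channel count and only the final concatenation doubles it, the fused map has twice the channels of a single input in this sketch.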

Fig. 3

Intermediate layer feature “sum” and fusion. ⊕ denotes “sum” fusion of matrices, ø denotes feature fusion, and “conv” denotes a convolution used to match dimensions. The multi-scale summed features are then concatenated to obtain the fused features. Low-dimensional and high-dimensional features are connected through the features in between, and “sum” fusion improves the semantic information.

3.2.2 Reconstruction of features

After the “sum” and “concatenate” fusion operations, we obtain the fused feature matrix. In order to generate features that are more helpful for visual tasks, we enhance communication between different groups. As shown in Fig. 4, we use the output of Section 3.2.1 as the input. Specifically, after we obtain F_i, we first divide this matrix into p groups (p = 4 in this paper). We then add the resulting sub-matrices in an interleaved manner, skipping one group each time. We average the sums to reduce the data variability caused by matrix summation. Finally, we concatenate the summed matrices in order.

Fig. 4

Reconstruction of features. ⊕ denotes “sum” fusion, ø denotes integrated features, “divide” denotes splitting the feature, and “concatenate” denotes joining features. We split the concatenated feature F_i into groups, sum and fuse the groups according to the rule below, and then concatenate the results to form a new, information-rich feature.

$\left(\oplus (\{o_{1}, o_{3}\})\right) \ \mathrm{concatenate} \ \left(\oplus (\{o_{2}, o_{4}\})\right)$ (4)
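The grouping rule in Eq. (4) can be sketched as below. This is a hedged illustration, not the authors' implementation: the split is assumed to be along the channel axis, the averaging step is assumed to divide each pairwise sum by 2, and `reconstruct` is a hypothetical name.

```python
def reconstruct(feature, p=4):
    """Split channels into p = 4 groups o1..o4, average (o1, o3) and (o2, o4),
    then concatenate the two results in order, following Eq. (4)."""
    channels = len(feature)
    assert p == 4 and channels % p == 0
    size = channels // p
    groups = [feature[g * size:(g + 1) * size] for g in range(p)]

    def average(a, b):
        # Assumption: "averaging" means dividing the pairwise sum by 2.
        return [[(x + y) / 2 for x, y in zip(ca, cb)] for ca, cb in zip(a, b)]

    return average(groups[0], groups[2]) + average(groups[1], groups[3])

# Toy feature with 4 channels (one per group) and 2 positions per channel.
f_i = [[1, 1], [2, 2], [3, 3], [4, 4]]
f_re = reconstruct(f_i)
print(f_re)  # [[2.0, 2.0], [3.0, 3.0]]
```

Under these assumptions the reconstructed feature has half as many channels as the input, since each output group merges two input groups.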

3.2.3 Dual-attention

The output of Section 3.2.2 is denoted F_rei. After reconstructing the features, we use dual attention to dig deeper into the feature map (Figs. 5 and 6). We apply channel and spatial attention sequentially, emphasizing “what” and “where”, respectively. The input is the output of Section 3.2.2, F_rei ∈ G^C×H×W, where C, H, and W denote the number of channels, the height, and the width, respectively. To obtain better attention, average pooling and maximum pooling are used together.

Fig. 5

Channel attention module. ⊕ denotes the “sum” fusion operation and ⊖ denotes the sigmoid function. The features are sent to the multi-layer perceptron after maximum pooling and average pooling, respectively. The new features are then generated by the sum and the sigmoid activation function.

Fig. 6

Spatial attention module. “conv” denotes a convolution used to match dimensions and ⊖ denotes the sigmoid function. After maximum pooling and average pooling, new features are generated by the convolution and the sigmoid activation function.

For channel attention, we denote by a_c, m_c ∈ G^C×1×1 the vectors obtained by average pooling and maximum pooling. The channel weight w_c ∈ G^C×1×1 is

$w_{c} = σ (L (a_{c}) + L (m_{c}))$ (5)

where L is the weight of a multi-layer perceptron and σ stands for the sigmoid function. The output of the channel attention, ${AT}_{i}^{c}$, is

$A T_{i}^{c} = w_{c} ⊙ F_{M}$ (6)

where ⊙ denotes element-wise multiplication and F_M refers to vectors of class M. For spatial attention, the average-pooling and maximum-pooling maps are denoted a_s, m_s ∈ G^1×H×W, and the spatial weight is w_s ∈ G^1×H×W:

$w_{s} = σ (conv (a_{s}; m_{s}))$ (7)

where conv represents a convolution operation. The output ${AT}_{i}^{s}$ is

$A T_{i}^{s} = w_{s} ⊙ A T_{i}^{c}$ (8)
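A minimal sketch of the sequential channel and spatial attention in Eqs. (5) to (8) follows. Several choices here are assumptions for illustration and are not specified in the text: the perceptron L is taken to be a shared two-layer network with a ReLU, the convolution in Eq. (7) is reduced to a 1×1 kernel with two scalar weights (`k_avg`, `k_max`), and identity matrices stand in for learned weights.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp(v, w1, w2):
    """Shared two-layer perceptron L with a ReLU in between (placeholder weights)."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

def channel_attention(feat, w1, w2):
    """Eqs. (5)-(6): w_c = sigmoid(L(a_c) + L(m_c)), output = w_c * feat."""
    flat = [[x for row in ch for x in row] for ch in feat]
    a_c = [sum(ch) / len(ch) for ch in flat]  # average pooling per channel
    m_c = [max(ch) for ch in flat]            # maximum pooling per channel
    w_c = [sigmoid(a + m) for a, m in zip(mlp(a_c, w1, w2), mlp(m_c, w1, w2))]
    return [[[w * x for x in row] for row in ch] for w, ch in zip(w_c, feat)]

def spatial_attention(feat, k_avg, k_max):
    """Eqs. (7)-(8) with the convolution simplified to a 1x1 kernel."""
    channels, height, width = len(feat), len(feat[0]), len(feat[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for h in range(height):
        for w in range(width):
            column = [feat[c][h][w] for c in range(channels)]
            a_s = sum(column) / channels  # average pooling across channels
            m_s = max(column)             # maximum pooling across channels
            w_s = sigmoid(k_avg * a_s + k_max * m_s)
            for c in range(channels):
                out[c][h][w] = w_s * feat[c][h][w]
    return out

def dual_attention(feat, w1, w2, k_avg=1.0, k_max=1.0):
    """Channel attention first, then spatial attention on its output."""
    return spatial_attention(channel_attention(feat, w1, w2), k_avg, k_max)

# Toy input with C = 2 channels and H = W = 2.
feat = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 1.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]
out = dual_attention(feat, identity, identity)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 2
```

Both attention weights lie in (0, 1), so the output keeps the input shape and each non-negative activation is scaled down rather than amplified.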

3.2.4 Fusion module

The outputs of Section 3.2.3 are denoted F_reci and F_recj. As shown in Fig. 7, the distinction from DualNet [35] and FFL [17] is that our technique neither combines only the last-layer features of the sub-networks nor applies a simple “sum” or average operation to them. In order to gather more information, we also input features from the intermediate layers, fuse those features beforehand, and then perform the convolution operation in the fusion module. To decrease the number of parameters, we employ the pointwise convolution used in MobileNet [40]. The intermediate-layer features are first fused by “sum”, and this operation is performed twice to obtain two new feature maps, which carry a sufficiently rich representation of the network. If C₁ and C₂ are the numbers of channels of the two fused feature maps mentioned above, the fused feature map H has C₁ + C₂ channels, which can be adjusted as necessary. In order to merge the feature maps, as shown in the figure, we first apply a 3×3 depthwise convolution, which applies one filter to each input channel, and then a pointwise convolution.
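To make the parameter saving of a depthwise convolution followed by a pointwise convolution concrete, the following sketch counts weights for both designs. The channel counts are hypothetical and biases are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical channel counts for the two fused maps and the module output.
c1, c2 = 64, 64
c_in = c1 + c2  # the fused feature map has C1 + C2 channels
c_out = 64
print(standard_conv_params(c_in, c_out))        # 73728
print(depthwise_separable_params(c_in, c_out))  # 9344
```

With these example sizes the separable design needs roughly one eighth of the weights of a standard 3×3 convolution.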

Fig. 7

Overall fusion module. ⊗ marks for convolution. After obtaining the reconstructed features of the sub-network, they are sent to the regressor and undergo multiple convolution operations to form the final fused classifier.

3.3 Loss function

As described in Equation 2, the cross-entropy loss of the kth student network is $L_{k}^{CE}$. The cross-entropy of the fusion classifier is $L_{f}^{CE}$. To make full use of the available knowledge, a powerful ensemble classifier is built. The logits of the sub-networks are combined to form the ensemble classifier, which is used to train the fusion module. Assuming n sub-networks, the ensemble logit is calculated as follows:

$z_{e}^{m} = \frac{1}{n} \sum_{k = 1}^{n} z_{k}^{m}$ (9)

The ensemble probability is $p_{e}^{m}$, and the prediction probability after fusion is $p_{f}^{m}$. The KL divergence is used to train the fusion classifier:

$L_{e}^{D} = L_{e}^{KL} ({\tilde{p}}_{e}^{m}, {\tilde{p}}_{f}^{m})$ (10)

The distillation loss between the fusion classifier and each student is then minimized so that the knowledge in the fused feature representation is transferred back to the student models:

$L_{f}^{D} = \sum_{i = 1}^{M} L_{f}^{KL} ({\tilde{p}}_{f}^{m}, {\tilde{p}}_{k}^{m})$ (11)

Finally, we reach the overall training objective:

$L_{total} = L_{CE} + T^{2} L_{D}$ (12)

where

$L_{CE} = \sum_{k = 1}^{n} L_{k}^{CE} + L_{f}^{CE}$ (13)

$L_{D} = L_{e}^{D} + \sum_{k = 1}^{n} L_{f}^{D}$ (14)

Since the gradient generated by the soft target is scaled by 1/T², L_D is multiplied by T² to keep the contributions of L_CE and L_D approximately balanced.
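The way Eqs. (9) to (14) combine can be sketched for a single sample as follows. This is an illustrative reading rather than the authors' code: each KL term is assumed to be already summed over the M classes and to follow the direction KL(first argument || second argument) as in Eq. (3); the function names and the 0-indexed label are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label):
    return -math.log(softmax(logits)[label])

def kl(p, q):
    """KL(p || q), summed over classes."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def total_loss(student_logits, fused_logits, label, temperature=3.0):
    n = len(student_logits)
    num_classes = len(fused_logits)
    # Eq. (9): the ensemble logit is the mean of the student logits.
    ensemble = [sum(z[m] for z in student_logits) / n for m in range(num_classes)]
    p_e = softmax(ensemble, temperature)
    p_f = softmax(fused_logits, temperature)
    # Eq. (13): cross-entropy of every student plus the fusion classifier.
    l_ce = sum(cross_entropy(z, label) for z in student_logits)
    l_ce += cross_entropy(fused_logits, label)
    # Eq. (10): ensemble -> fusion classifier.
    l_e = kl(p_e, p_f)
    # Eqs. (11) and (14): fusion classifier -> each student.
    l_f = sum(kl(p_f, softmax(z, temperature)) for z in student_logits)
    # Eq. (12): the distillation term is scaled by T^2.
    return l_ce + temperature ** 2 * (l_e + l_f)

students = [[2.0, 0.5, -1.0], [1.0, 1.5, -0.5]]  # two students, M = 3 classes
fused = [1.8, 0.9, -0.8]
print(total_loss(students, fused, label=0))
```

When all students and the fusion classifier produce identical logits, every KL term vanishes and the objective reduces to the sum of the cross-entropy terms.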

4 Experiment

In this section, we run in-depth experiments to evaluate the performance of FEFR on four datasets and a wide range of well-known neural networks. To demonstrate the generality of FEFR across various numbers and types of models, we compare with a variety of similar methods under various conditions and present the results.

4.1 Experimental setup

Datasets, network architecture and parameter settings. The following analyses use four image classification datasets. (1) CIFAR-10, which has 60,000 natural color images from 10 categories (50,000 training samples and 10,000 test samples). (2) CIFAR-100, which has 60,000 images from 100 categories, split into 50,000 training samples and 10,000 test samples. (3) CINIC-10, whose images come from ImageNet and CIFAR. It presents a greater challenge than CIFAR-10 and has 90,000 training samples and 90,000 test samples. (4) ImageNet2012, which contains about 1.3 million images with 64 × 64 pixels in 1000 classes for training and 50 thousand for validation. In addition to the classic ResNet, WRN, and DenseNet, we also use VGG and MobileNet. For CINIC-10 and ImageNet2012, we use MobileNetV2 and ResNet-18, configured following [20]. For data augmentation in training, we apply horizontal flips and random crops from images padded by 4 pixels. We use SGD as the optimizer with Nesterov momentum 0.9, a weight decay of 1e-4 for the student models and 1e-5 for the fusion module, and a mini-batch size of 128. The models are trained for 300 epochs on all datasets. We set the initial learning rate to 0.1 and multiply it by 0.1 at epochs 150 and 225. We empirically set the temperature T to 3 and α = 80 for ramp-up weighting. To keep the comparison fair, we use only two student models. We report the top-1 error rate (%) of the best student, obtained from three runs.
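The step schedule described above (initial rate 0.1, multiplied by 0.1 at epochs 150 and 225 over 300 epochs) can be written as a small function. The sketch assumes 0-indexed epochs and that the decay takes effect at the start of the milestone epoch; `learning_rate` is a hypothetical helper, not part of any library.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(150, 225), gamma=0.1):
    """Step decay: multiply base_lr by gamma once per milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

for epoch in (0, 149, 150, 224, 225, 299):
    print(epoch, round(learning_rate(epoch), 6))
```

Under these assumptions the rate is 0.1 for epochs 0 to 149, 0.01 for epochs 150 to 224, and 0.001 from epoch 225 onward.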

4.2 Experimental results

Results on CIFAR-10/100 are presented in Tables 1 and 2. Based on these results, we evaluate the performance of FEFR on CIFAR-10 and CIFAR-100. We compare FEFR with offline KD, the logit-only online method DML, and the fusion-only technique FFL. We also compare with MFEF, which extracts multi-scale features from the last layer, because our objective is to develop a more robust feature representation for online distillation. The pre-trained ResNet-110 serves as the teacher model for offline KD. For DML, we report the top-1 error rate of the best student. The results for the best students are denoted FFL-S, MFEF-S, and FEFR-S, whereas the results for the fusion classifier are denoted FFL, MFEF, and FEFR.

Table 1

Comparison with closely related methods on eight different networks on CIFAR-10. The top-1 error rate (%) is given. The two student models used in each method are the same. FFL, MFEF, and FEFR refer to the results of the fusion classifier, while FFL-S, MFEF-S, and FEFR-S correspond to the results of the student models

Network	Baseline	KD	DML	FFL-S	FFL	MFEF-S	MFEF	FEFR-S	FEFR
ResNet-20	7.32	7.18	6.63	6.49	6.22	6.38	6.08	6.20	5.99
ResNet-32	6.77	6.69	6.52	6.06	5.78	5.59	5.41	5.55	5.38
ResNet-56	6.30	6.14	5.82	5.46	5.26	5.28	4.82	5.12	4.72
ResNet-110	5.64	5.47	5.21	5.18	4.83	4.81	4.52	5.25	4.86
WRN-16-2	6.78	6.40	5.49	6.09	5.97	5.33	4.99	4.77	4.44
WRN-40-2	5.34	5.24	4.72	4.75	4.60	4.51	4.02	4.45	3.93
DenseNet40-12	6.87	6.81	6.50	6.72	6.24	5.79	5.30	5.71	5.22
VGG16	6.04	5.88	5.74	5.43	5.21	5.36	5.02	5.12	4.90

Table 2

Comparison with closely related methods on eight different networks on CIFAR-100. The top-1 error rate (%) is reported. The two student models used in each method are the same. FFL, MFEF, and FEFR refer to the results of the fusion classifier, while FFL-S, MFEF-S, and FEFR-S refer to the results of the student models

Network	Baseline	KD	DML	FFL-S	FFL	MFEF-S	MFEF	FEFR-S	FEFR
ResNet-20	31.08	29.94	29.61	28.56	26.87	28.46	26.30	27.83	26.18
ResNet-32	30.34	29.82	26.89	27.06	25.56	26.36	24.84	26.06	24.44
ResNet-56	29.31	28.61	25.51	24.84	23.53	24.22	23.15	24.17	23.12
ResNet-110	26.30	25.67	24.49	23.95	22.79	23.37	22.16	23.16	22.14
WRN-16-2	27.74	26.78	26.16	25.72	24.74	24.66	22.93	24.46	22.85
WRN-40-2	25.13	24.43	22.77	22.06	21.05	21.76	20.60	21.48	20.47
DenseNet40-12	28.74	28.74	26.94	27.21	24.76	26.81	24.27	26.65	24.12
VGG16	25.68	25.43	24.48	24.31	24.02	24.23	23.81	24.19	23.75

Results on CIFAR-10. The results in Table 1 demonstrate the performance advantages of FEFR. In particular, FEFR raises the accuracy of every backbone network relative to the baseline. FEFR also outperforms comparable online distillation algorithms in terms of top-1 error rate on all architectures except ResNet-110. For example, compared to MFEF-S, FEFR-S lowers the error rate by 0.18% and 0.04% on ResNet-20 and ResNet-32, and by 0.16% on ResNet-56. Compared with MFEF, the error rate of FEFR is about 0.03% lower on ResNet-32. The proposed method is also suitable for architectures without skip connections, such as VGG16.

Results on CIFAR-100. As shown in Table 2, compared to MFEF-S, FEFR-S lowers the error rate by 0.30% and 0.28% on ResNet-32 and WRN-40-2, and by 0.16% on DenseNet40-12. Compared with MFEF, the error rate of FEFR is about 0.40% lower on ResNet-32. These improvements are attributed to the “sum” and early fusion of the features obtained by multi-scale extraction from the middle layers of the network, to the reconstruction of features, and to the overall feature fusion performed after all student models have completed these steps. The proposed method is also suitable for architectures without skip connections, such as VGG16.
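The gaps quoted above can be recomputed directly from the Table 2 entries. The short sketch below copies the relevant error rates from the table; the dictionary layout and the `gain` helper are hypothetical.

```python
# Top-1 error rates (%) on CIFAR-100, copied from Table 2.
table2 = {
    "ResNet-32": {"MFEF-S": 26.36, "MFEF": 24.84, "FEFR-S": 26.06, "FEFR": 24.44},
    "WRN-40-2": {"MFEF-S": 21.76, "MFEF": 20.60, "FEFR-S": 21.48, "FEFR": 20.47},
    "DenseNet40-12": {"MFEF-S": 26.81, "MFEF": 24.27, "FEFR-S": 26.65, "FEFR": 24.12},
}

def gain(network, reference, method):
    """Error-rate reduction (percentage points) of `method` over `reference`."""
    return round(table2[network][reference] - table2[network][method], 2)

print(gain("ResNet-32", "MFEF-S", "FEFR-S"))      # 0.3
print(gain("WRN-40-2", "MFEF-S", "FEFR-S"))       # 0.28
print(gain("DenseNet40-12", "MFEF-S", "FEFR-S"))  # 0.16
print(gain("ResNet-32", "MFEF", "FEFR"))          # 0.4
```

These differences are absolute reductions in error rate, not relative improvements.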

Results on CINIC-10 and ImageNet2012. In this section, we compare the top-1 error rates of FEFR, FFL, and MFEF with MobileNetV2 and ResNet-18 backbones. Table 3 shows that both MFEF and FEFR lower the baseline error rate, with FEFR showing larger gains for both the student models and the fusion classifier. On ImageNet2012 with MobileNetV2, FEFR lowers the error rate of the student model by about 0.43% and 0.24% compared to FFL and MFEF. For the fused model, FEFR lowers the error rate by 0.62% relative to the baseline, and by about 0.22% and 0.15% compared to FFL and MFEF. On CINIC-10, the fusion classifier of FEFR improves on that of MFEF by 0.13%. These results suggest that all student models provide higher-quality knowledge thanks to the early fusion of intermediate-layer features, which enhances the representation and results in a lower error rate than previous approaches.

Table 3

Comparison of the top-1 error rate (%) on the CINIC-10 and ImageNet2012 with FFL and MFEF

Dataset	Network	Baseline	FFL-S	FFL	MFEF-S	MFEF	FEFR-S	FEFR
CINIC-10	MobileNetV2	18.07	17.85	16.10	17.56	15.66	17.48	15.53
	ResNet-18	13.94	13.33	12.67	13.22	12.39	13.15	12.26
ImageNet2012	MobileNetV2	28.00	27.85	27.60	27.66	27.53	27.42	27.38
	ResNet-18	12.95	12.63	12.54	12.61	12.39	12.53	12.21

To test the generality of FEFR across different model topologies, we run the ResNet and WRN experiments in Table 4, with ResNet as Net1 and WRN as Net2. For both Net1 and Net2, FEFR performs better than DML and MFEF. Interestingly, with FEFR the smaller network (Net1) gains noticeably more than the larger network (Net2). For instance, with ResNet-32 and WRN-16-2, the error rates of FEFR are roughly 2.18% and 1.45% lower than those of DML. This is due to FEFR's ability to better transfer knowledge from larger networks to smaller networks by combining and aggregating the feature maps of all networks.

Table 4

Comparing the top-1 error rate (%) of different architecture student models on the CIFAR-100 to various online distillation methods

Method	Net1:ResNet-32	Net2:WRN-16-2	Net1:ResNet-56	Net2:WRN-40-2
DML	28.31	26.45	26.75	23.33
FFL	27.06	25.93	26.23	23.06
MFEF	26.38	25.16	25.70	22.39
FEFR	26.13	25.00	25.37	22.28

Expansion of student models. Figure 8 shows the influence of adding more student models. We run the experiments on ResNet-56. As expected, the performance of both the students and the fusion classifier improves as the number of student models increases. As shown in Table 5, FEFR still performs well against MFEF and FFL when the number of student models is increased to three. On ResNet-32, the error rates of MFEF-S and FFL-S are roughly 0.14% and 0.40% higher than that of FEFR-S, respectively. On ResNet-56, the fusion classifier of FEFR outperforms those of MFEF and FFL by around 0.15% and 0.84%, respectively.

Fig. 8

Evaluating the impact of student model expansion on CIFAR-100 using ResNet56.

Table 5

Comparison of top-1 error rates (%) for three student models trained on CIFAR-100 with alternative online distillation techniques (ONE-S and ONE-E denote student model and gated ensemble teacher outcomes)

Method	ResNet-32	ResNet-56
ONE-S	26.64	24.63
FFL-S	26.30	24.51
MFEF-S	26.04	24.12
FEFR-S	25.90	24.07
ONE-E	24.75	23.27
FFL	24.31	23.20
MFEF	24.03	22.51
FEFR	23.73	22.36

Parameter quantitative analysis. We additionally compare the number of parameters. The results are shown in Table 6. We compare against the parameters of OKDPH [33] on four network architectures. Pair 1 and Pair 2 refer to two student network combinations for online distillation, ResNet-32 & WRN-16-2 and ResNet-56 & WRN-40-2. The table shows that FEFR has fewer parameters than OKDPH in most cases. Where FEFR has more parameters, one possible reason is the large difference in capacity between the two student networks, which requires more parameters to balance and adjust during training.

Table 6

Comparison of the number of parameters (M, millions) of OKDPH and FEFR for single student architectures and for the two student pairs

Method	ResNet-32	ResNet-56	WRN-16-2	WRN-40-2	Pair 1	Pair 2
OKDPH	0.98M	1.66M	1.43M	4.43M	1.35M	3.36M
FEFR	0.94M	1.63M	1.39M	4.47M	0.94M	3.47M
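The "most cases" statement can be checked by counting the settings in Table 6 where FEFR is smaller. This is a small sketch of that count; the variable and function names are illustrative assumptions, and the values are those reported in the table.

```python
# Parameter counts in millions, copied from Table 6.
params_m = {
    "OKDPH": {"ResNet-32": 0.98, "ResNet-56": 1.66, "WRN-16-2": 1.43,
              "WRN-40-2": 4.43, "Pair 1": 1.35, "Pair 2": 3.36},
    "FEFR":  {"ResNet-32": 0.94, "ResNet-56": 1.63, "WRN-16-2": 1.39,
              "WRN-40-2": 4.47, "Pair 1": 0.94, "Pair 2": 3.47},
}


def settings_where_smaller(method, reference):
    """Settings in which `method` has strictly fewer parameters than `reference`."""
    return [s for s in params_m[method]
            if params_m[method][s] < params_m[reference][s]]


fefr_smaller = settings_where_smaller("FEFR", "OKDPH")
# ['ResNet-32', 'ResNet-56', 'WRN-16-2', 'Pair 1']
fefr_larger = settings_where_smaller("OKDPH", "FEFR")
# ['WRN-40-2', 'Pair 2']
```

Both settings in which FEFR is larger involve WRN-40-2, the largest of the four networks.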

4.3 Ablation studies

We conducted ablation experiments on CIFAR-100 with ResNet-32 and ResNet-56 to further validate the contribution of each component. Three cases were considered, as shown in Table 7. Case A uses early fusion only (FE): the features of the intermediate layer and the last layer are summed and fused in advance. Case B adds the reconstruction of features (FC) to Case A. Case C is our proposed method, which adds the dual-attention module (FD) to Case B. On ResNet-32, when only the FE module is used, the accuracy of the model is 0.31% lower than with the full method. Similarly, when only the FE and FC modules are used, the accuracy is 0.04% lower. The same trend appears on ResNet-56. These ablation experiments show that feature reconstruction has the greatest effect on online knowledge distillation. By fusing the middle and last layers and reconstructing the features, we can significantly enhance the representation ability of the features. Furthermore, the dual-attention mechanism combines channel and spatial information, further improving the saliency of the features.
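To make the FE and FD components more concrete, the following is a minimal, parameter-free sketch in plain Python. It is not the implementation used in the experiments. It assumes that the intermediate feature map has the same number of channels as the last-layer map and twice its spatial size, that the resolution gap is closed with 2 x 2 average pooling, and that both attention branches are simple sigmoid gates with no learned weights. The reconstruction step (FC) is omitted, and all function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def avg_pool_2x2(channel):
    """Downsample one H x W channel by averaging non-overlapping 2 x 2 blocks."""
    h, w = len(channel), len(channel[0])
    return [[(channel[i][j] + channel[i][j + 1]
              + channel[i + 1][j] + channel[i + 1][j + 1]) / 4.0
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]


def early_fusion_sum(mid, last):
    """FE (assumed form): element-wise sum of the pooled intermediate map
    and the last-layer map. Both are lists of channels (C x H x W)."""
    pooled = [avg_pool_2x2(c) for c in mid]
    return [[[p + q for p, q in zip(p_row, q_row)]
             for p_row, q_row in zip(p_ch, q_ch)]
            for p_ch, q_ch in zip(pooled, last)]


def channel_attention(feat):
    """Scale each channel by a sigmoid gate of its global average."""
    out = []
    for ch in feat:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gate = sigmoid(mean)
        out.append([[gate * v for v in row] for row in ch])
    return out


def spatial_attention(feat):
    """Scale each spatial position by a sigmoid gate of its mean over channels."""
    n, h, w = len(feat), len(feat[0]), len(feat[0][0])
    gate = [[sigmoid(sum(feat[k][i][j] for k in range(n)) / n)
             for j in range(w)] for i in range(h)]
    return [[[gate[i][j] * feat[k][i][j] for j in range(w)]
             for i in range(h)] for k in range(n)]


def dual_attention(feat):
    """FD (assumed form): channel gating followed by spatial gating."""
    return spatial_attention(channel_attention(feat))


# Toy input: 2 channels, 4 x 4 intermediate map and 2 x 2 last-layer map.
mid = [[[1.0] * 4 for _ in range(4)] for _ in range(2)]
last = [[[2.0] * 2 for _ in range(2)] for _ in range(2)]
fused = early_fusion_sum(mid, last)   # every entry is 1.0 + 2.0 = 3.0
attended = dual_attention(fused)      # same shape, entries rescaled by the gates
```

In a real network the gates would typically be produced by small learned layers, and a 1 x 1 convolution would be needed whenever the channel counts differ; both are left out here to keep the sketch self-contained.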

Table 7
Evaluating the effectiveness of each component on CIFAR-100

Network	Case	FE	FC	FD	Student fused
ResNet-32	A	✓			25.15
	B	✓	✓		24.88
	C	✓	✓	✓	24.84
ResNet-56	A	✓			25.18
	B	✓	✓		23.35
	C	✓	✓	✓	23.30
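The contribution of each added component can be read off Table 7 as the difference between consecutive cases. The sketch below recomputes these differences, assuming the values in Table 7 are top-1 error rates (%) as in the other tables; the names `ablation` and `step_gains` are illustrative.

```python
# Values copied from Table 7 (assumed to be top-1 error rates in %).
ablation = {
    "ResNet-32": {"A": 25.15, "B": 24.88, "C": 24.84},
    "ResNet-56": {"A": 25.18, "B": 23.35, "C": 23.30},
}


def step_gains(network):
    """Error reduction (percentage points) from adding FC, then FD."""
    r = ablation[network]
    return {
        "FC (A -> B)": round(r["A"] - r["B"], 2),
        "FD (B -> C)": round(r["B"] - r["C"], 2),
        "total (A -> C)": round(r["A"] - r["C"], 2),
    }


gains_32 = step_gains("ResNet-32")
# {'FC (A -> B)': 0.27, 'FD (B -> C)': 0.04, 'total (A -> C)': 0.31}
gains_56 = step_gains("ResNet-56")
# {'FC (A -> B)': 1.83, 'FD (B -> C)': 0.05, 'total (A -> C)': 1.88}
```

On both networks the step from Case A to Case B accounts for most of the total reduction, which is consistent with the observation that feature reconstruction has the greatest effect.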

4.4 Visualization analysis

By examining the relevant feature maps, this section offers further insight into how the proposed FEFR method enhances network capability. To this end, we use t-SNE [50] to visualize the features taken from the fully connected layer. Fig. 9 compares the t-SNE visualizations of ResNet-32 networks trained on CIFAR-10 with FFL and with FEFR. While FEFR makes use of the rich information in the embedded feature maps, FFL and MFEF do not use feature information in the KD process. As our visualizations show, including feature information encourages the network to produce more meaningful features.

Figure 9 illustrates that the improvement in sub-network effectiveness and fusion classifier performance is attributable to the rich feature maps obtained from the embedded teachers, as well as to the improved sub-network classifier performance resulting from the auxiliary teachers. Specifically, the different colors in the four panels of Fig. 9 represent the different classes in CIFAR-10. The more tightly points of the same color cluster together and the farther apart different colors lie, the better the classification. The final classification performance can be improved by using the FEFR approach to find more precise classification margins for large-scale data. The t-SNE visualization results presented here offer additional evidence of the effectiveness of the FEFR method.
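The visual criterion above (same-colored points close together, different colors far apart) can be turned into a rough numeric proxy. The following sketch is not part of the reported experiments; it assumes 2-D embedded points with class labels and defines a hypothetical `separation_ratio` as the mean distance between class centroids divided by the mean distance of points to their own centroid.

```python
import math
from collections import defaultdict


def centroid(points):
    """Arithmetic mean of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def separation_ratio(points, labels):
    """Mean inter-centroid distance divided by mean intra-class distance.

    Larger values mean tighter and better separated clusters. Requires at
    least two classes and at least one point away from its class centroid.
    """
    groups = defaultdict(list)
    for p, y in zip(points, labels):
        groups[y].append(p)
    cents = {y: centroid(ps) for y, ps in groups.items()}

    intra = sum(math.dist(p, cents[y])
                for p, y in zip(points, labels)) / len(points)

    keys = sorted(cents)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    inter = sum(math.dist(cents[a], cents[b]) for a, b in pairs) / len(pairs)
    return inter / intra


labels = [0, 0, 1, 1]
tight = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
loose = [(0.0, 0.0), (0.0, 6.0), (4.0, 0.0), (4.0, 6.0)]

tight_score = separation_ratio(tight, labels)   # 10 / 0.5 = 20.0
loose_score = separation_ratio(loose, labels)   # 4 / 3, about 1.33
```

Distances in a t-SNE embedding are not metric-preserving, so a score of this kind is only an informal companion to the plots and not a substitute for the error rates.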

Fig. 9

t-SNE images of FFL, FEFR and their sub-networks.

5 Conclusion

We propose FEFR, a method for online knowledge distillation that enhances the multi-scale representation of feature maps and then fuses the feature maps from the student models to support training by transferring richer knowledge. It unites multi-scale feature extraction and early fusion of middle-layer and last-layer features in one fusion framework. The feature representation is then enhanced through feature reconstruction and a dual-attention mechanism. Finally, the resulting features are fused to form a classifier that guides the learning of the student networks. Extensive experiments demonstrate that the proposed method further improves the accuracy of online knowledge distillation, with gains of between 0.1% and 0.6% across various datasets. In the future, we hope that FEFR can be applied in more scenarios.

Although the accuracy of online knowledge distillation methods has been improved, some combinations of student networks are somewhat redundant in terms of the number of parameters. Our goal for the next stage of research is to improve the accuracy while using as few parameters as possible.

Funding This research was supported by the Research Foundation of the Institute of Environment-friendly Materials and Occupational Health (Wuhu), Anhui University of Science and Technology (No. ALW2021YF04), the Science and Technology Research Project of Wuhu City (No. 2020yf48), the National Natural Science Foundation of China (No. 62102003) and the University Synergy Innovation Program of Anhui Province (No. GXXT-2021-006).

Availability of data and material The datasets used during the current study are available from the corresponding author on reasonable request.

Declarations The authors declare that there is no conflict of interest in this paper.

References

1. Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

2. Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu and Kaiming He, Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1492–1500.

3. Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu and Xindong Wu, Object detection with deep learning: A review, in: IEEE transactions on neural networks and learning systems, 2019, pp. 3212–3232.

4. Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.

5. Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross Girshick, Mask r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.

6. Olaf Ronneberger, Philipp Fischer and Thomas Brox, U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical image computing and computer-assisted intervention, 2015, pp. 234–241.

7. Gongfan Fang, Xinyin Ma, Mingli Song, Michael Bi Mi and Xinchao Wang, Depgraph: Towards any structural pruning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16091–16101.

8. Seul-Ki Yeom, Philipp Seegerer, Sebastian Lapuschkin, Alexander Binder, Simon Wiedemann, Klaus-Robert Müller and Wojciech Samek, Pruning by explaining: A novel criterion for deep neural network pruning, in: Pattern Recognition, 2021.

9. Lushuai Niu, Zhi Xu, Longyang Zhao, Daojing He, Jianqiu Ji, Xiaoli Yuan and Mian Xue, Residual vector product quantization for approximate nearest neighbor search, in: Expert Systems with Applications, 2023.

10. Shuai Feng, Ahmet Cetinkaya, Hideaki Ishii, Pietro Tesi and Claudio De Persis, Resilient quantized control under Denial-of-Service: Variable bit rate quantization, in: Automatica, 2022.

11. Lin Chen, Xue Jiang, Xingzhao Liu and Martin Haardt, Reweighted low-rank factorization with deep prior for image restoration, in: IEEE Transactions on Signal Processing, 2022, pp. 3514–3529.

12. Sijie Wang, Kewen Xia, Li Wang, Zhixian Yin, Ziping He, Jiangnan Zhang and Naila Aslam, Low-rank matrix factorization with nonconvex regularization and bilinear decomposition, in: Signal Processing, 2022.

13. Geoffrey Hinton, Oriol Vinyals and Jeff Dean, Distilling the knowledge in a neural network, in: arXiv preprint arXiv:1503.02531, 2015.

14. Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta and Yoshua Bengio, Fitnets: Hints for thin deep nets, in: arXiv preprint arXiv:1412.6550, 2014.

15. Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa and Hassan Ghasemzadeh, Improved knowledge distillation via teacher assistant, in: Proceedings of the AAAI conference on artificial intelligence, 2020, pp. 5191–5198.

16. Ying Zhang, Tao Xiang, Timothy Hospedales and Huchuan Lu, Deep mutual learning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 4320–4328.

17. Jangho Kim, Minsung Hyun, Inseop Chung and Nojun Kwak, Feature fusion for online mutual knowledge distillation, in: International Conference on Pattern Recognition (ICPR), 2020.

18. Maneesh Ayi and Mohamed El-Sharkawy, Rmnv2: Reduced mobilenet v2 for cifar10, in: 2020 10th Annual Computing and Communication Workshop and Conference (CCWC), 2020, pp. 0287–0292.

19. Luke Darlow, Elliot Crowley, Antreas Antoniou and Amos Storkey, Cinic-10 is not imagenet or cifar-10, in: arXiv preprint arXiv:1810.03505, 2018.

20. Mohsin Sharif, Asia Kausar, Jin Hyuck Park and Dong Ryeol Shin, Tiny image classification using Four-Block convolutional neural network, in: arXiv preprint arXiv:1810.03505, 2019, pp. 1–6.

21. Cristian Bucilua, Rich Caruana and Alexandru Niculescu-Mizil, Model compression, in: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 2006, pp. 535–541.

22. Yuenan Hou, Zheng Ma, Chunxiao Liu and Chen Change Loy, Learning lightweight lane detection cnns by self attention distillation, in: Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 1013–1021.

23. Junho Yim, Donggyu Joo, Jihoon Bae and Junmo Kim, A gift from knowledge distillation: Fast optimization, network minimization and transfer learning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4133–4141.

24. Xiaojie Wang, Rui Zhang, Yu Sun and Jianzhong Qi, KDGAN: knowledge distillation with generative adversarial networks, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 783–794.

25. Qiushan Guo, Xinjiang Wang, Yichao Wu, Zhipeng Yu, Ding Liang, Xiaolin Hu and Ping Luo, Online knowledge distillation via collaborative learning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11020–11029.

26. Biao Qian, Yang Wang, Hongzhi Yin, Richang Hong and Meng Wang, Switchable Online Knowledge Distillation, in: European Conference on Computer Vision, 2022, pp. 449–466.

27. Xu Lan, Xiatian Zhu and Shaogang Gong, Knowledge distillation by on-the-fly native ensemble, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 7528–7538.

28. Defang Chen, Jian-Ping Mei, Can Wang, Yan Feng and Chun Chen, Online knowledge distillation with diverse peers, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2020, pp. 3430–3437.

29. Yu-e Lin, Xingzhu Lian, Gan Hu and Xianjin Fang, Smarter peer learning for online knowledge distillation, in: Multimedia Systems, 2022, pp. 1059–1067.

30. Chuanxiu Li, Guangli Li, Hongbin Zhang and Donghong Ji, Embedded mutual learning: A novel online distillation method integrating diverse knowledge sources, in: Applied Intelligence, 2022, pp. 1–14.

31. Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao and Kaisheng Ma, Be your own teacher: Improve the performance of convolutional neural networks via self distillation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 3713–3722.

32. Zuxiang Long, Fuyan Ma, Bin Sun, Mingkui Tan and Shutao Li, Diversified branch fusion for self-knowledge distillation, in: Information Fusion, 2023, pp. 12–22.

33. Tianli Zhang, Mengqi Xue, Jiangtao Zhang, Haofei Zhang, Yu Wang, Lechao Cheng, Jie Song and Mingli Song, Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20176–20185.

34. Tsung-Yu Lin, Aruni Roy Chowdhury and Subhransu Maji, Bilinear CNN models for fine-grained visual recognition, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1449–1457.

35. Saihui Hou, Xu Liu and Zilei Wang, Dualnet: Learn complementary features for image recognition, in: Proceedings of the IEEE international conference on computer vision, 2017, pp. 502–510.

36. Panpan Zou, Yinglei Teng and Tao Niu, Multi-scale Feature Extraction and Fusion for Online Knowledge Distillation, in: International Conference on Artificial Neural Networks, 2022, pp. 126–138.

37. Mingyang Jiang, Yanchun Liang, Xiaoyue Feng, Xiaojing Fan, Zhili Pei, Yu Xue and Renchu Guan, Text classification based on deep belief network and softmax regression, in: Neural Computing and Applications, 2018, pp. 61–70.

38. Pieter-Tjerk De Boer, Dirk Kroese, Shie Mannor and Reuven Rubinstein, A tutorial on the cross-entropy method, in: Annals of Operations Research, 2005, pp. 19–67.

39. Xiang Li, Wei Zhang and Qian Ding, Deep learning-based remaining useful life estimation of bearings using multiscale feature extraction, in: Reliability engineering & system safety, 2019, pp. 208–218.

40. Laurens Van der Maaten and Geoffrey Hinton, Visualizing non-metric similarities in multiple maps, in: Machine Learning, 2012, pp. 33–55.