Abstract
Online knowledge distillation removes the pre-determined strong-teacher and weak-student roles and thus provides a new way of thinking about knowledge distillation. However, current online methods mostly rely on the logits-based prediction distribution, and features, which contain rich semantic information, are rarely used. Even when feature-based methods are used, they operate only on the last layer of the network, without further exploring the representation knowledge in the feature maps of the middle layers. To address these issues, we propose a feature early fusion and reconstruction (FEFR) method for online knowledge distillation, which has four essential components: multi-scale feature extraction with early fusion of intermediate-layer features, reconstruction of features, dual attention, and an overall fusion module. We fuse feature matrices from different layers early by a “sum” operation to improve the feature map representation. The features are then reconstructed to enhance communication between feature groups. A dual-attention module adaptively enhances the critical channels and spatial regions in order to collect more accurate information. The processed feature maps are finally combined by feature fusion, which also aids the training of the student models. Experiments with several network architectures on CIFAR-10, CIFAR-100, CINIC-10 and ImageNet 2012 show that FEFR provides more useful representation knowledge for distillation and improves accuracy by about 0.5% compared to other methods.
Introduction
Deep convolutional neural networks have shown promise in several applications, including image classification [1, 2], object detection [3, 4] and segmentation [5, 6]. However, modern convolutional neural networks need substantial compute and storage to attain high performance, which drastically restricts their use on devices with low resources.
Over the last few years, this problem has been the subject of extensive research, and numerous model compression and acceleration strategies have been proposed as solutions. Representative methods include pruning [7, 8], quantization [9, 10], low-rank factorization [11, 12], and knowledge distillation [13]. Knowledge distillation is one of the most effective approaches to model compression: an over-parameterized neural network is first trained as the teacher, and a small student network is then trained to replicate the teacher's output. Because it inherits the teacher's knowledge, the student model can take the place of the over-parameterized teacher model and achieve model compression and rapid inference.
Traditional KD [13] approaches use offline distillation, often known as a two-stage training strategy, to transfer knowledge from high-capacity large models that have already been trained to compact student models [14]. High-capacity teachers are not always accessible, and even when they are, complex teachers unavoidably incur greater computational costs and longer training periods. KD also suffers from a model capacity gap when there is a large difference in size between the student and teacher models [15]. To address these problems, online knowledge distillation (OKD) has been proposed [16, 17]. This method is more appealing because it removes the pre-defined strong and weak roles and reduces training to an end-to-end, one-stage procedure rather than relying on pre-trained high-performance teachers. All models are trained at once and learn from one another as training proceeds. In other words, all networks distill and share knowledge. Online knowledge distillation achieves superior performance compared to offline KD while maintaining a simpler structure. However, traditional approaches focus on transferring logits as soft targets. Although soft targets offer more detailed information than one-hot labels, relying on the logits alone is comparatively simplistic. Feature maps can reveal important information about spatial locations, channels, and their relationships, but straightforward alignment or merging does not fully utilize these representations.
To solve the aforementioned problems, we present a novel feature early fusion and reconstruction approach (FEFR) for online knowledge distillation. FEFR consists of four essential elements: multi-scale feature extraction with early fusion of intermediate-layer features, reconstruction of features, dual attention, and an overall fusion module. In order to gain more useful representational knowledge for distillation, we first extract multi-scale features from the middle layers and the last layer, which can concentrate on both local details and global regions. We then fuse the features from the various middle layers early by a “sum” operation. The features are reconstructed to enhance communication between feature groups. Finally, to aid the learning of the student models, the student networks use feature fusion to combine the produced feature maps and feed them into a fused classifier. In summary, the major contributions of this paper are as follows. First, we extract features from several intermediate layers and the last layer to enhance the multi-scale representation of features and provide deeper information beyond straightforward alignment. Second, we propose early fusion of intermediate-layer features to obtain more comprehensive features. Third, the features are reconstructed to enhance communication between feature groups. Fourth, we create a dual-attention module that adaptively enhances the critical channels and spatial regions in order to collect more accurate information. Fifth, the processed feature maps are combined by feature fusion, which also aids the training of the student models. Extensive experiments on CIFAR-10/100 [18], CINIC-10 [19] and ImageNet2012 [20] demonstrate the effectiveness of the proposed FEFR. FEFR improves the representation of features and produces additional information for knowledge distillation.
Related work
Vanilla knowledge distillation
The concept of transferring knowledge from a large model to a small one without a major loss in accuracy originates from [21]. Conventional KD has two stages and requires a pre-trained teacher. Because the feature maps of some student models do not match those of the teacher models, some works propose to distill from the hidden layers in the middle of the network [14]. To exploit more accurate information, [22] combines attention with distillation to make better use of more specific information. By imitating the teacher's flow matrix computed from intermediate outputs, [23] investigates the interaction between layers. In [24], an adversarial training approach enables the student and teacher networks to learn realistic data distributions. In order to close the capacity gap between the teacher and student models, a teacher assistant is added in [15]. [25] proposes distillation utilizing the activation boundaries formed by hidden neurons.
Online knowledge distillation
Online knowledge distillation enhances the performance of student models while removing the dependency on high-capacity teacher models that are costly in time and money. By sharing predictions throughout the training process, this paradigm allows student models to learn from one another. Deep mutual learning (DML) [16] is a representative strategy in which several networks cooperate: each network uses the KL divergence to mimic the class probabilities of its peer networks. [25] pushes DML further by averaging the predictions of a set of students to construct an ensemble logit that acts as a teacher and enhances generalization. [26] suggests a switchable learning mode strategy to improve student performance. [17] introduces a fusion module to create a fused classifier that guides the training of the sub-networks. A gate module is incorporated in [27] to produce importance scores for each branch and build a more effective teacher. [28] proposes a two-level distillation between a group leader and several auxiliary peers to increase the diversity of student models. [29] suggests using a weight assessment system to build a virtual teacher. In terms of architectural design, [30] proposes to embed an integration branch and an adaptive fusion branch between two parallel peer networks for learning. In [31, 32], knowledge is distilled inside the network itself, moving from the deeper to the shallower portions of the network, so that the teacher and student models share one network. OKDPH [33] constructs a learning method with mixed parameters.
Feature fusion methods
Feature fusion has been applied to dual learning in [17, 34–36]. In a bilinear CNN [34], the outputs of two distinct networks are combined and projected to a bilinear vector. DualNet [35] builds a fused classifier by training two parallel networks with the same topology and combining their features using the “sum” function. Iterative training is also used to update the weights of the sub-networks alternately in order to discover complementary features. FFL [17], which fuses the feature maps of the last layer of the two student networks, helps the two student networks and the fused classifier work well together. In MFEF [36], the last-layer feature maps of the two student networks are subjected to multiple split and concatenate operations, and dual attention is applied to these features to obtain the feature map. By combining the resulting feature maps and sending them into a fused classifier, MFEF [36] obtains multi-scale features to aid learning.
Proposed method
This section describes the framework and loss function in detail. Figure 1 depicts FEFR. In contrast to other KD approaches, FEFR exploits the information provided by the middle-layer feature maps more deeply, including using them to connect the features of the first and last layers. It takes into account the performance of both the fused classifier and the student networks. During training, the fusion module fuses the features of the parallel sub-networks, and the fused classifier then produces the final classification results.

Overview of feature early fusion and reconstruction for online knowledge distillation. Features extracted from the sub-networks are reconstructed; the dual-attention and fusion modules then form the final classifier, which guides sub-network learning.
A labeled dataset of N samples is given as
Overall, as shown in Fig. 1, the primary goal of FEFR is to sequentially extract several intermediate-layer features. We then use a specific intermediate layer as a bridge that connects the information carried by the features on either side of it, which produces additional knowledge for distillation. Reconstructing the features increases communication between feature groups. The dual-attention mechanism lets us focus more on the details of the features. The fusion module is responsible for fusing the newly acquired features. The following sections discuss each essential component.
Multi-scale feature extraction and intermediate layer feature early fusion
In addition to soft targets, inspired by [39], we introduce multi-scale feature extraction to generate multi-scale features that are important for visual tasks. As shown in Fig. 2, the feature outputs of two intermediate layers and the last layer are extracted. The features of the intermediate layers provide geometric information, and the feature map of the last layer is used because it carries high-level semantic information that is richer and more specific. In this study, the analysis is conducted using two student networks. For notational convenience, we denote the intermediate feature maps of the ith student model as $F_{i1}$ and $F_{i2}$, and the feature map of the last layer as $F_{i3}$. Similarly, we refer to the feature maps of the jth student model as $F_{j1}$, $F_{j2}$, and $F_{j3}$. Multi-scale feature extraction creates feature maps with receptive fields of several scales. The receptive field expands as additional features are concatenated. Smaller receptive fields concentrate on details, whereas larger receptive fields gather global information. Combining the two can result in feature maps with richer meaning and higher distillation efficiency.

Multi-scale feature extraction in the middle and last layers. Extract the feature of middle and last layers.
As shown in Fig. 3, after obtaining the two intermediate feature maps and the last-layer feature map, we first perform a “sum” operation (matrix addition) to fuse the two intermediate feature maps, $F_{i1} + F_{i2}$, which is denoted as $F_{i12}$ for convenience. Then we sum the second intermediate feature map with the feature map of the last layer, $F_{i2} + F_{i3}$, denoted as $F_{i23}$. We concatenate the feature maps $F_{i12}$ and $F_{i23}$, which connects the last-layer features with the shallow ones, and denote the result as $F_i$. Performing exactly the same operation on the other network yields $F_j$. The two “sum” operations reduce the load on the network. This is equivalent to providing the network with a prior: the semantic features of corresponding channels in the two input feature maps are similar, so mixing of semantic information between different channels is avoided. The subsequent concatenation fully fuses the middle-layer features of the network to form effective and rich semantic information, which improves the expressiveness of the features.

Intermediate layer feature “sum” and fusion. ⊕ denotes “sum” fusion of matrices, ø denotes the fused features, and “conv” denotes a convolution used to match dimensions. Multi-scale summed and fused features are then obtained by concatenating the new features. Low-dimensional and high-dimensional features are connected through the features in between, and “sum” fusion improves the semantic information.
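The “sum” and concatenate steps described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the helper `match_dims` (a 1×1 convolution followed by average pooling) is a hypothetical stand-in for the “conv” boxes in Fig. 3, and the tensor shapes and random weights are assumptions chosen for the example.

```python
import numpy as np

def match_dims(x, weight, out_hw):
    """Hypothetical stand-in for the "conv" boxes: a 1x1 convolution
    (channel mixing) followed by average pooling down to out_hw x out_hw.

    x: (C_in, H, W), weight: (C_out, C_in); H and W must be multiples of out_hw.
    """
    _, h, w = x.shape
    y = np.einsum("oc,chw->ohw", weight, x)            # 1x1 convolution
    fh, fw = h // out_hw, w // out_hw
    # Average over non-overlapping fh x fw blocks.
    return y.reshape(-1, out_hw, fh, out_hw, fw).mean(axis=(2, 4))

def early_fusion(f1, f2, f3, weights, out_hw):
    """Sum neighbouring layers, then concatenate along the channel axis."""
    g1, g2, g3 = (match_dims(f, w, out_hw) for f, w in zip((f1, f2, f3), weights))
    f12 = g1 + g2                                      # F_i12 = F_i1 + F_i2
    f23 = g2 + g3                                      # F_i23 = F_i2 + F_i3
    return np.concatenate([f12, f23], axis=0)          # F_i

rng = np.random.default_rng(0)
f1 = rng.standard_normal((16, 32, 32))   # first intermediate layer (assumed shape)
f2 = rng.standard_normal((32, 16, 16))   # second intermediate layer (assumed shape)
f3 = rng.standard_normal((64, 8, 8))     # last layer (assumed shape)
weights = [rng.standard_normal((64, f.shape[0])) * 0.1 for f in (f1, f2, f3)]

fused = early_fusion(f1, f2, f3, weights, out_hw=8)
print(fused.shape)  # (128, 8, 8)
```

Because the middle feature map appears in both sums, it acts as the bridge between the shallow and the deep features, while the concatenation keeps the two sums in separate channels.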
Section 3.2.1 yields the feature matrix produced by the “sum” and “concatenate” fusion operations. In order to generate features that are more helpful for visual tasks, we enhance communication between different groups. As shown in Fig. 4, we use the output of Section 3.2.1 as the input. Specifically, after obtaining $F_i$, we first divide this matrix into p groups (p = 4 in this paper). We then sum the resulting sub-matrices in a skipping (non-adjacent) pattern. We average the sums to reduce the data variability caused by matrix summation. Finally, we concatenate the summed matrices in order.

Reconstruction of features. ⊕ denotes “sum” fusion, ø denotes the integrated features, “divide” denotes splitting a feature, and “concatenate” denotes joining features. We group the concatenated new features from $F_i$, then sum and fuse the groups according to certain rules to activate the new features, and finally concatenate the resulting features. In short, features are split into groups and regrouped to form a new, information-rich feature.
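A minimal sketch of the divide, sum, average, and concatenate steps is given below. The text does not spell out which groups are summed together, so the pairing rule here (group k with group k + `stride`, wrapping around) and the default `stride=2` are assumptions made purely for illustration; splitting along the channel axis is likewise an assumption.

```python
import numpy as np

def reconstruct(features, p=4, stride=2):
    """Split features into p channel groups, average each group with the
    group `stride` positions away (assumed pairing rule), and concatenate.

    features: (C, H, W) with C divisible by p.
    """
    c = features.shape[0]
    if c % p != 0:
        raise ValueError("channel count must be divisible by p")
    groups = np.split(features, p, axis=0)               # "divide"
    mixed = [
        (groups[k] + groups[(k + stride) % p]) / 2.0     # "sum", then average
        for k in range(p)
    ]
    return np.concatenate(mixed, axis=0)                  # "concatenate" in order

x = np.arange(8, dtype=float).reshape(8, 1, 1)            # channels hold 0..7
y = reconstruct(x)
print(y.ravel())  # [2. 3. 4. 5. 2. 3. 4. 5.]
```

The output keeps the input shape, so the reconstructed feature can be passed to the next stage unchanged. With p = 4 and a stride of 2, groups k and k + 2 receive the same mixture; a different pairing rule would change this, which is why the rule is exposed as a parameter.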
The output of Section 3.2.2 is denoted as $F_{rei}$. After reconstructing the features, we use dual attention to dig deeper into the feature map (Figs. 5 and 6). We apply channel attention and spatial attention sequentially, which emphasize “what” and “where”, respectively. The input is $F_{rei} \in \mathbb{R}^{C \times H \times W}$, where C, H, and W denote the number of channels, height, and width, respectively. To obtain better attention, average pooling and maximum pooling are used together.

Channel attention module. ⊕ denotes the “sum” fusion operation and ⊖ denotes the sigmoid function. The features are sent to the multi-layer perceptron after maximum pooling and average pooling, respectively. The new features are then generated by summation followed by the sigmoid activation function.

Spatial attention module. “conv” denotes a convolution used to match dimensions and ⊖ denotes the sigmoid function. After maximum pooling and average pooling, the new features are generated by summation followed by the sigmoid activation function.
We designate $a_c, m_c \in \mathbb{R}^{C \times 1 \times 1}$ as the vectors following average-pooling and maximum-pooling for channel attention. The weight $w_c \in \mathbb{R}^{C \times 1 \times 1}$ of channel is
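A channel weight of this kind can be sketched as follows, assuming the form $w_c = \sigma(\mathrm{MLP}(a_c) + \mathrm{MLP}(m_c))$ that the channel attention figure caption describes (pooling, multi-layer perceptron, sum, sigmoid). The shared two-layer MLP with a ReLU, the reduction ratio of 4, and the random weights are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(f, w1, w2):
    """Channel attention sketch.

    f:  (C, H, W) input feature map
    w1: (C // r, C) and w2: (C, C // r), weights of an assumed shared MLP
    Returns the channel weights w_c with shape (C, 1, 1) and the reweighted map.
    """
    a_c = f.mean(axis=(1, 2))                 # average pooling -> (C,)
    m_c = f.max(axis=(1, 2))                  # maximum pooling -> (C,)

    def mlp(v):
        return w2 @ np.maximum(w1 @ v, 0.0)   # two layers with ReLU in between

    w_c = sigmoid(mlp(a_c) + mlp(m_c)).reshape(-1, 1, 1)
    return w_c, f * w_c                       # broadcast over H and W

rng = np.random.default_rng(0)
C, r = 32, 4                                  # r is a hypothetical reduction ratio
f = rng.standard_normal((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1

w_c, out = channel_attention(f, w1, w2)
print(w_c.shape, out.shape)  # (32, 1, 1) (32, 8, 8)
```

Each entry of `w_c` lies strictly between 0 and 1, so the module rescales channels rather than removing them outright.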
The outputs of Section 3.2.3 are denoted as $F_{reci}$ and $F_{recj}$. As shown in Fig. 7, the distinction from DualNet [35] and FFL [17] is that our technique neither combines only the last-layer features of the sub-networks nor applies a simple “sum” or average operation to do so. In order to gather more information, we also take features from the intermediate layers, fuse those features beforehand, and then perform the convolution operation in the fusion module. To decrease the number of parameters, we employ the depthwise and pointwise convolutions used in MobileNet [40]. We first apply the “sum” operation to the features of the middle layers and perform this operation twice to obtain two new feature maps, which provide a sufficient representation of the network. If C1 and C2 are the numbers of channels of the two feature maps being fused, the fused feature map H has C1 + C2 channels, which can be adjusted as necessary. To merge the feature maps, as shown in the figure, we first conduct a 3×3 depthwise convolution, applying one filter to each input channel, and then apply a pointwise convolution.

Overall fusion module. ⊗ denotes convolution. After the reconstructed features of the sub-networks are obtained, they are sent to the regressor and undergo multiple convolution operations to form the final fused classifier.
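The depthwise-then-pointwise fusion described above can be illustrated with a short NumPy sketch. It assumes stride 1 and zero padding of 1 for the 3×3 depthwise step, uses random weights, and omits normalization and activation layers; the function names and shapes are hypothetical. The helper `param_counts` compares the weight count of a standard 3×3 convolution with that of the depthwise separable version.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def depthwise_conv3x3(x, kernels):
    """One 3x3 filter per input channel (stride 1, zero padding 1).

    x: (C, H, W), kernels: (C, 3, 3) -> (C, H, W)
    """
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    windows = sliding_window_view(padded, (3, 3), axis=(1, 2))  # (C, H, W, 3, 3)
    return np.einsum("chwij,cij->chw", windows, kernels)

def pointwise_conv(x, weight):
    """1x1 convolution that mixes channels. weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def fuse(f_a, f_b, dw_kernels, pw_weight):
    """Concatenate two feature maps (C1 + C2 channels), then apply the
    depthwise convolution followed by the pointwise convolution."""
    h = np.concatenate([f_a, f_b], axis=0)
    return pointwise_conv(depthwise_conv3x3(h, dw_kernels), pw_weight)

def param_counts(c_in, c_out):
    """Weights of a standard 3x3 conv vs. depthwise 3x3 + pointwise 1x1."""
    return 9 * c_in * c_out, 9 * c_in + c_in * c_out

rng = np.random.default_rng(0)
c1 = c2 = 64                                   # assumed channel counts
f_a = rng.standard_normal((c1, 8, 8))
f_b = rng.standard_normal((c2, 8, 8))
dw = rng.standard_normal((c1 + c2, 3, 3)) * 0.1
pw = rng.standard_normal((128, c1 + c2)) * 0.1

fused = fuse(f_a, f_b, dw, pw)
print(fused.shape)                 # (128, 8, 8)
print(param_counts(c1 + c2, 128))  # (147456, 17536)
```

For these example sizes the separable form needs roughly one eighth of the weights of a standard 3×3 convolution, which illustrates why it reduces the parameter count of the fusion module.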
As described in Equation 2, the cross-entropy loss of the kth student network is
Since the gradient generated by the soft target is scaled by $1/T^2$, $L_D$ is multiplied by $T^2$ to keep the contributions of $L_{CE}$ and $L_D$ approximately balanced.
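The role of the $T^2$ factor can be illustrated with a small sketch of a cross-entropy term and a temperature-scaled KL distillation term. This follows the standard knowledge distillation formulation and is not FEFR's exact objective: using the fused classifier's logits as the teacher signal, the temperature T = 3, and the plain sum of the two terms are all assumptions made for the example.

```python
import numpy as np

def log_softmax(logits, T=1.0):
    """Numerically stable log-softmax of logits / T along the last axis."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def cross_entropy(logits, labels):
    """L_CE: mean negative log-likelihood of the true labels."""
    logp = log_softmax(logits)
    return -logp[np.arange(len(labels)), labels].mean()

def distill_kl(student_logits, teacher_logits, T):
    """L_D: KL(teacher || student) on temperature-softened outputs, times T^2."""
    log_p_t = log_softmax(teacher_logits, T)
    log_p_s = log_softmax(student_logits, T)
    kl = (np.exp(log_p_t) * (log_p_t - log_p_s)).sum(axis=-1).mean()
    return T ** 2 * kl

rng = np.random.default_rng(0)
labels = np.array([0, 3, 1, 2])
student_logits = rng.standard_normal((4, 10))
fused_logits = rng.standard_normal((4, 10))    # assumed teacher signal
T = 3.0                                        # hypothetical temperature

l_ce = cross_entropy(student_logits, labels)
l_d = distill_kl(student_logits, fused_logits, T)
total = l_ce + l_d                             # assumed unweighted sum
print(float(l_ce), float(l_d), float(total))
```

Without the `T ** 2` factor, raising the temperature would shrink the gradient of the distillation term relative to the cross-entropy term, so the balance between the two would change with T.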
In this section, we conduct extensive experiments to evaluate the performance of FEFR on four datasets and a wide range of well-known neural networks. To demonstrate the generality of FEFR across various numbers and types of models, we compare it with a variety of related methods under several comparison settings and present the results.
Experimental setup
Experimental results
Results on CIFAR-10/100 are presented in Tables 2. Based on these results, we evaluate the performance of FEFR on CIFAR-10 and CIFAR-100. Because our objective is to develop a more robust feature representation for online distillation, we compare FEFR with offline KD, the logit-only online method DML, the fusion-only technique FFL, and MFEF, which extracts multi-scale features from the last layer. The pre-trained ResNet-110 serves as the teacher model for the offline KD. For DML, we report the top-1 error rate of the best student. The results of the best students are denoted by FFL-S, MFEF-S, and FEFR-S, whereas the results of the fused classifier are denoted by FFL, MFEF, and FEFR.
Comparison with closely related methods from seven different networks on CIFAR-10. The top-1 error rate (%) is given. The two student models utilized in each procedure were the same. While FFL, MFEF, and FEFR refer to the results of the fusion classifier, FFL-S, MFEF-S, and FEFR-S correspond to the results of the student models
Comparison with closely related methods from seven different networks on CIFAR-100. The error rate (%) for the top-1 is reported. The two student models utilized in each procedure were the same. In contrast to FFL, MFEF, and FEFR, which refer to the results of the fusion classifier, FFL-S, MFEF-S, and FEFR-S refer to the outcomes of the student model
Comparison of the top-1 error rate (%) on the CINIC-10 and ImageNet2012 with FFL and MFEF
We run experiments with ResNet and WRN in Table 4 to test the generality of FEFR on different model architectures. ResNet is set as Net1 and WRN as Net2. For both Net1 and Net2, FEFR performs better than DML and MFEF. Interestingly, when FEFR is used, the smaller network (Net1) improves noticeably more than the larger network. For instance, using ResNet-32 and WRN-16-2, FEFR improves on DML by roughly 2.18% and 1.45%, respectively. This is due to FEFR’s ability to better transfer knowledge from larger networks to smaller networks by combining and aggregating the feature maps of all networks.
Comparing the top-1 error rate (%) of different architecture student models on the CIFAR-100 to various online distillation methods

Evaluating the impact of student model expansion on CIFAR-100 using ResNet56.
Comparison of top-1 error rates (%) for three student models trained on CIFAR-100 with alternative online distillation techniques (ONE-S and ONE-E denote student model and gated ensemble teacher outcomes)
We conducted a variety of ablation tests on CIFAR-100 with ResNet-32 and ResNet-56 to further validate the advantages of each component. Specifically, experiments were conducted for three ablation cases, as shown in Table 7. Case A refers to the case in which only the features of the intermediate layers and the last layer are summed and fused in advance (FE). Case B adds the reconstruction of features (FC) to the “sum” and early fusion of the intermediate-layer and last-layer features. Case C refers to our proposed method, in which the dual-attention module (FD) is added to Case B. When only the FE module is used, the accuracy of the model is 0.31% lower than that of the full method (Case C). Similarly, when only the FE and FC modules are used, the accuracy is 0.04% lower. The same phenomenon also appears on ResNet-56. These ablation experiments show that feature reconstruction has the greatest effect on online knowledge distillation. By fusing the middle and last layers and reconstructing the features, we can significantly enhance the representation ability of the features. Furthermore, the dual-attention mechanism combines channel and spatial information, which further improves the quality of the features.
Evaluating the effectiveness of each component on CIFAR-100
In this part, we examine the relevant feature maps to give more insight into how the proposed FEFR technique enhances network capabilities. To this end, we use t-SNE [50] to visualize the features taken from the fully connected layer. Fig. 9 compares the t-SNE visualization results of ResNet-32 networks trained with FFL and with FEFR on the CIFAR-10 dataset. While FEFR makes use of the rich information in the embedded feature maps, FFL and MFEF do not use this feature information in the KD process. As our visualization results show, including feature information encourages the network to produce more discriminative features.
Figure 9 illustrates that the improvement in sub-network and fused classifier performance is attributed to the rich feature maps obtained from the embedded teachers, as well as to the improved sub-network classifier performance resulting from the auxiliary teachers. Specifically, the different colors in the four panels of Fig. 9 represent the different classes in CIFAR-10. The more tightly points of the same color cluster together and the farther apart different colors are spread, the better the classification. By finding more precise classification margins for large-scale data, the FEFR approach can improve the final classification performance. The t-SNE visualization results presented here offer additional evidence of the effectiveness of the FEFR method.

t-SNE images of FFL, FEFR and their sub-networks.
We propose FEFR, a method for online knowledge distillation that enhances the multi-scale representation of feature maps and then fuses the feature maps of the student models to support training by transferring more knowledge. It unites multi-scale feature extraction and early fusion of middle-layer and last-layer features in a single fusion framework. The feature representation is then enhanced by feature reconstruction and a dual-attention mechanism. Finally, the obtained features are fused to form a classifier that guides the learning of the student networks. Extensive experiments demonstrate that the proposed method further improves the accuracy of online knowledge distillation, achieving gains of between 0.1% and 0.6% across various datasets. In future work, we hope that FEFR can be applied in more scenarios.
Although the accuracy of online knowledge distillation methods has been improved, some combinations of student networks are somewhat redundant in terms of the number of parameters. Our goal for the next stage of research is to improve the accuracy while using as few parameters as possible.
2021YF04). The Science and Technology Research Project of Wuhu City (No. 2020yf48). The National Natural Science Foundation of China (No. 62102003) and the University Synergy Innovation Program of Anhui Province (No. GXXT-2021-006).
