Abstract
Vehicle Re-Identification (Re-ID) aims to discover and match target vehicles across different road-surveillance cameras. The high similarity between vehicle appearances and the dramatic variations in viewpoint and illumination make vehicle Re-ID challenging. Meanwhile, safety supervision and intelligent traffic systems need a fast and efficient method of identifying target vehicles. In this paper, we propose a Multi-Attention Guided Feature Enhancement Network (MAFEN) to extract robust vehicle appearance features. Specifically, the Fusing Spatial-Channel information multi-receptive fields Feature Enhancement (FSCFE) module is first proposed to aggregate richer and more representative features across receptive fields of different sizes. It also learns the spatial structure information and channel dependencies of the multi-receptive fields features and embeds them to enhance the feature. Then, we construct the Spatial Attention-Guided Adaptive Feature Erasure (SAAFE) module, which uses spatial attention to erase the most distinguishing features. This shifts the network’s attention to potentially salient features and strengthens its ability to extract them. In addition, a multi-loss knowledge distillation (MLKD) method using MAFEN as the teacher network is designed to improve computational efficiency. It uses multiple loss functions to jointly supervise the student network. Experimental results on three public datasets demonstrate the merits of the proposed method over state-of-the-art methods.
Introduction
Vehicle re-identification (Re-ID) is an image retrieval task that uses a given vehicle image to identify the same vehicle captured by other cameras. It has become popular in recent years due to its wide application in intelligent transportation [1] and video surveillance [2]. As shown in Fig. 1, vehicle Re-ID calculates the distance between the target image and each image in the gallery set and returns the five most similar vehicles as the Re-ID results. The challenge of vehicle Re-ID lies not only in the intra-class variations between different viewpoints of the same vehicle, but also in the minor inter-class differences between different vehicles of the same model and color, which is unique among re-identification tasks [3, 4].

Example of vehicle re-ID. The area selected by the yellow circle is the location of the detail with the difference.
To address the above issues, traditional approaches focused on handcrafted features [5, 6], including geometric attributes, color histograms, and textures, to handle the sophisticated relationships between intra-class and inter-class discrepancies. Although these methods can depict the vehicle’s appearance, they are not robust to large variations in illumination conditions and viewpoints, which limits their application in practical situations. Recently, deep learning has become prominent in many computer vision tasks, such as action recognition [7], object detection [8], and object segmentation [9]. As a result, numerous deep learning algorithms have been used to improve the feature representation in vehicle Re-ID.
Some deep learning methods aim to discover and utilize multi-modal information (e.g., type, color, and spatial-temporal information) to assist in extracting robust vehicle features. Liu et al. [10] suggested a coarse-to-fine framework that combines attributes and spatial-temporal information for vehicle Re-ID. Zakria et al. [11] first filtered an initial list of vehicles from the gallery based on appearance information and then used license plate information to verify whether the target vehicle matches a vehicle in that list. Zheng et al. [12] introduced a deep network that fuses multiple attributes (viewpoint, type, and color) into the vehicle’s feature representation. Hou et al. [13] used a multi-label learning approach to optimize the algorithm’s performance. These methods focus on learning global representations, which are effective for vehicle Re-ID in simple scenarios but fail to distinguish vehicles with high visual similarity.
Therefore, some researchers have constructed robust feature representations by combining global and local features. Chen et al. [14] designed a part alignment network to build a robust vehicle feature representation. Refs. [15, 16] utilized object detectors and part parsers to extract a vehicle’s fine-grained features. These methods greatly improve vehicle Re-ID performance by locating informative regions. However, accurately locating such regions (e.g., car head and windscreen) requires a lot of manual annotation, resulting in additional costs.
Based on the above discussion, a dual-stream Multi-Attention Guided Feature Enhancement Network (MAFEN) is proposed to extract robust enhanced features for vehicle Re-ID without introducing manually annotated attribute labels. The fusing spatial-channel information multi-receptive fields feature enhancement (FSCFE) module aggregates features with parallel filters of different sizes, learns the spatial structure information and channel dependencies of these features, and embeds them to enhance the feature. The spatial attention-guided adaptive feature erasure (SAAFE) module first computes spatial attention over the input features and thresholds it to obtain a binary erasure mask. The input features are then element-wise multiplied with the erasure mask to erase the regions that spatial attention finds most salient, shifting the network’s attention to potentially significant regions so that it extracts fine-grained features. Meanwhile, the FSCFE module and the SAAFE module are trained alternately and iteratively to learn better features.
In addition, to make the MAFEN easier to deploy in practical applications, a multi-loss knowledge distillation (MLKD) method using MAFEN as the teacher network is designed to reduce the number of parameters while maintaining high performance, thereby improving processing efficiency. The student network is a ResNet-50 with a structure similar to the teacher network. The MLKD method builds on traditional knowledge distillation (which uses only the cross-entropy loss and the knowledge distillation loss) and additionally supervises the student network by using the L2 norm. This allows the student network to better imitate the teacher network in both the feature distribution and the predicted probabilities. More specifically, the triplet loss only weakly constrains highly similar vehicle samples, and the similarity loss helps to constrain them further. As a result, the student network achieves a large improvement in processing efficiency with little performance degradation.
The main contributions of the proposed method are as follows:
The MAFEN constructs appearance representations through a dual-stream structure. In the MAFEN, the FSCFE module aggregates features under different receptive fields while learning and fusing their spatial structure information and channel dependencies. The SAAFE module erases the most significant local features so that the network focuses on potential fine-grained features.
The MLKD is designed to simplify the MAFEN by applying reasonable loss functions. The knowledge of the pre-trained MAFEN is distilled and transferred to a shallow student network (ResNet-50), allowing the student network to effectively improve processing efficiency while maintaining retrieval performance.
Experiments on three vehicle Re-ID datasets verify the effectiveness and robustness of the MAFEN and the MLKD method.
The rest of this paper is organized as follows: Section 2 reviews the related work on vehicle Re-ID. Section 3 introduces the proposed MAFEN and knowledge distillation methods. Section 4 presents the experimental results and analyses. Section 5 presents the conclusion.
Feature representation-based vehicle Re-ID
Existing vehicle Re-ID methods focus on how to construct robust visual representations. Liu et al. [17] proposed fusing handcrafted features with deep learning features to obtain a robust vehicle appearance representation. Wang et al. [18] aggregated multiple part key point features in different directions to model the vehicle’s appearance. Meanwhile, Khorramshahi et al. [19] addressed the problem of misfocus by estimating the orientation of the vehicle and adaptively selecting key points to focus on, providing complementary information to the global appearance features. Furthermore, several works [20, 21] divided high-level semantic features into multiple local features and constructed robust vehicle features by fusing global features with local features. Some researchers complemented the visual appearance representation with multi-modal cues, especially spatial-temporal information [22, 23]. The experimental results demonstrate that these algorithms perform effectively by constructing robust feature representations of vehicles. However, these methods mostly rely on a single receptive field, which limits feature extraction and does not inherently improve the network’s ability to extract features. In contrast, the FSCFE module is designed to aggregate features from different receptive fields, obtaining richer features than a single receptive field. Meanwhile, the SAAFE module is designed to improve the network’s sensitivity to potentially fine-grained features by erasing the most significant ones, which improves the network’s ability to extract salient features. The two modules complement each other and effectively improve the accuracy of vehicle Re-ID.
Attention-based vehicle Re-ID
The attention mechanism enables convolutional neural networks (CNNs) to focus on local critical regions and ignore irrelevant information in the image. Ref. [24] strengthened the network’s discriminative ability for local cues by using three feature learning branches with spatial attention. However, this approach cannot capture large-scope information to determine spatial attention globally. Zhang et al. [25] accurately pinpointed salient part regions via part-guided attention and effectively combined coarse-grained and fine-grained cues to learn discriminative features. Guo et al. [26] learned an effective feature by using hard part-level attention focused on the windshield and car head. Zhou et al. [27] forced the network to focus on key regions of different viewpoints by using viewpoint-aware attention. These works effectively use external semantic cues (e.g., windshield or viewpoint) to guide attention learning. However, they usually require external annotations or an additional network for estimating local regions. In this paper, the attention in the FSCFE module learns the global-scope relationships of each feature node, and no additional semantic information is required for guidance. Experiments demonstrate the effectiveness of our attention.
Knowledge distillation
To deploy algorithms in practical applications, network compression is a key step from academia to industry. Knowledge distillation has been investigated in [28, 29] for network compression: the idea is to train a lightweight neural network under the guidance of a complex neural network so that the output of the lightweight network replicates the output of the complex network as closely as possible. Hinton et al. [30] employed a knowledge distillation loss in classification tasks to recover the inter-class and intra-class similarity information that is missing in one-hot coding, and effectively lightened the network. Building on the techniques proposed in [30], the proposed MLKD method jointly constrains the student network by combining the knowledge distillation loss and cross-entropy loss with a triplet loss and a similarity loss. This further constrains the consistency of the feature distributions in high-dimensional space and transfers the dark knowledge of the teacher network more effectively.
Methodology
As shown in Fig. 2, the ResNet-50 [31] is utilized as the backbone. The FSCFE module and SAAFE module are embedded in the backbone so that the network can extract more significant features. However, erasing the most significant features may lead to degradation of the performance.

The framework of MAFEN. It mainly includes the FSCFE and SAAFE.
Hence, an FSCFE module is introduced in parallel within the SAAFE module. An equal-probability random selection pattern then routes the features originally input to the SAAFE module to either the FSCFE module or the SAAFE module, and GeM pooling [32] is used to aggregate the features. In the MAFEN, the FC consists of two fully connected layers. Dropout is applied after the first fully connected layer to randomly ignore some neurons and prevent overfitting. Each part is described in detail in the following subsections.
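As a rough illustration of the GeM pooling step mentioned above, the following minimal sketch pools a single channel with the generalized mean. The function name `gem_pool`, the exponent `p = 3`, and the clamp `eps` are assumptions made for this example; the paper does not state the values it uses.

```python
def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling of one channel (a 2-D list of activations).

    p and eps are illustrative defaults, not values taken from the paper.
    """
    # Clamp to a small positive value so fractional powers stay well defined.
    values = [max(v, eps) for row in feature_map for v in row]
    mean_pow = sum(v ** p for v in values) / len(values)
    return mean_pow ** (1.0 / p)


channel = [[1.0, 2.0], [3.0, 4.0]]
avg_like = gem_pool(channel, p=1.0)    # p = 1 reduces to average pooling
gem_3 = gem_pool(channel, p=3.0)       # p = 3 weights large activations more
max_like = gem_pool(channel, p=50.0)   # large p approaches max pooling
```

With `p = 1` the result equals average pooling, and as `p` grows it approaches max pooling, so GeM sits between the two.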
Before discussing the proposed method, we formally define the vehicle Re-ID task. Let $X = \{x_1, x_2, \cdots, x_m\}$ denote an image dataset taken from different cameras. $X$ is divided into a training set $T = \{x_1, x_2, \cdots, x_t;\ t < m\}$, a query set $Q = \{x_{t+1}, \cdots, x_q;\ t < q < m\}$, and a gallery set $G = \{x_{q+1}, \cdots, x_m;\ q < g \leqslant m\}$. The labels of the training set $T$ are defined as $Y_T = \{y_1, y_2, \cdots, y_N\}$. In general, in the training phase, vehicle Re-ID transforms a training image $x_i \in T$ into a high-dimensional feature by a nonlinear mapping function $f(\cdot\,; \theta)$, where $\theta$ denotes the parameters of a CNN. Then, a fully connected (FC) layer is added to the CNN to transform the feature representing $x_i$ into class vectors. In the testing phase, the distance between different images is calculated to measure the similarity between different vehicles, as shown in Equation (1).
where $j \in [t+1, q]$, $k \in [q+1, m]$, and $\|\cdot\|$ denotes the Euclidean distance.
These distances construct the distance matrix L D . As shown in Equation (2), vehicle Re-ID is achieved via searching for a smaller distance.
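The test-phase retrieval described above can be sketched as follows. This is only an illustration under stated assumptions: the toy feature vectors stand in for the outputs of $f(\cdot\,;\theta)$, and the helper names `euclidean`, `distance_matrix`, and `rank_gallery` are hypothetical rather than taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(query_feats, gallery_feats):
    """Rows index query images, columns index gallery images."""
    return [[euclidean(q, g) for g in gallery_feats] for q in query_feats]


def rank_gallery(dist_row, top_k=5):
    """Gallery indices sorted by increasing distance (most similar first)."""
    order = sorted(range(len(dist_row)), key=lambda k: dist_row[k])
    return order[:top_k]


# Toy features standing in for f(x; theta); the values are made up.
query_feats = [[0.0, 0.0], [1.0, 1.0]]
gallery_feats = [[0.0, 1.0], [3.0, 4.0], [1.0, 1.0]]

D = distance_matrix(query_feats, gallery_feats)
ranking = [rank_gallery(row, top_k=2) for row in D]
```

Each row of `D` holds the distances from one query to every gallery image, and the smallest entries of that row give the retrieved vehicles.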
The receptive field can represent the scope of feature mapping in the input image, and different levels of receptive fields contain semantic targets in various sizes. The proposed FSCFE module contains two operations: 1) the multi-receptive fields feature extraction and 2) the multi-receptive fields feature enhancement.
The multi-receptive field feature extraction operation, as shown in Fig. 3, captures vehicle features through four different receptive field branches (1 × 1, 3 ×3, 5 ×5, 7 ×7) arranged in parallel and then aggregates the features via a shared parameter multilayer perceptron (MLP) structure to obtain multi-receptive fields features. Specifically, the multi-receptive fields feature extraction operation first used four parallel-arranged filters of different sizes to convolve and dimensionality reduce the original input feature

The detailed structure of the multi-receptive field features extraction.
Each sub-feature $F_a$ is then aggregated in spatial dimension by a max pooling layer and an average pooling layer to obtain pooling features. Meanwhile, pooling features are input to a shared parameter MLP structure and a Sigmoid activation function to learn the weight dependencies of the features under different receptive fields. The sub-features $F_a$ and the weight dependence of the features in different receptive fields are element-wise multiplied and element-wise added to obtain aggregated features
The aggregated features
Discussion: In [33], Liu et al. designed the receptive field block (RFB) module, which obtains different receptive fields by employing different kernels and dilated convolutional layers. It imitates the structure of receptive fields in the human visual system and generates more discriminative and robust features. However, the features from different branches are merely collocated during feature aggregation, so the aggregation cannot be effectively guided during back propagation. In contrast, the FSCFE module uses an MLP with shared parameters for feature aggregation. The advantage is that the parameter information of all branches jointly guides and supervises the output features during back propagation.
The multi-receptive fields feature enhancement operation learns the spatial structure information and channel dependencies of the multi-receptive fields feature

The detailed structure of the multi-receptive field features enhancement.
To learn the spatial structure information of
To learn the channel dependencies of
Discussion: In [34], Woo et al. proposed a lightweight Convolutional Block Attention Module (CBAM). It uses a large 7 × 7 filter to generate spatial attention and an MLP structure to generate channel attention. It can be seamlessly integrated into any CNN architecture and refines features using spatial and channel information. However, limited by its practical receptive field, it is ineffective at capturing large-scope information to assign weights to the input features globally. In this paper, the FSCFE module first takes multi-receptive field features as input for the subsequent attention learning, which enlarges the practical receptive field. Furthermore, we divide the input feature into several feature nodes and globally learn the relationships (spatial location relationships and channel dependencies) between all the feature nodes. This determines a global attention, which is important for extracting abundant vehicle appearance features.
In some challenging cases, the network is required to focus on salient localities to distinguish vehicles due to the influence of similar vehicles, or the occurrence of image blurring and illumination variations. Thus, the SAAFE is proposed to improve the network’s ability to extract significant features by shifting the network’s attention to potentially salient features. The detailed structure is illustrated in Fig. 5.

The detailed structure of the Spatial Attention-Guided Adaptive Feature Erasure (SAAFE) Module.
The SAAFE takes the high-level semantic feature
Then, the erasure threshold τ is calculated in Equation (9).
The binary erasure mask
After obtaining the binary erasure mask, the mask is element-wise multiplied with the input feature to obtain the erasure map. The erasure map is comparable to the disappearance of salient features due to occlusion by obstacles in practical environments, so the network is guided to focus on potentially important features. However, if the binary erasure mask is applied at every iteration, the network fails to explore the most discriminative part features during the initial training phase. Hence, as shown in Fig. 6, a random selection pattern with equal probability is applied before the subsequent classification, so that the high-level semantic features $F_s$ are selectively input to either the FSCFE module or the SAAFE module.
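The erasure and random-selection steps can be sketched in plain Python as below. Several details are assumptions made only for illustration: the spatial attention is taken as the channel-wise mean, the threshold is set to a fixed ratio of the maximum attention value (the paper’s own threshold rule in Equation (9) is not reproduced here), and all function names are hypothetical.

```python
import random


def spatial_attention(feature):
    """Channel-wise mean of a C x H x W feature (nested lists) -> H x W map."""
    channels = len(feature)
    height, width = len(feature[0]), len(feature[0][0])
    return [[sum(feature[c][i][j] for c in range(channels)) / channels
             for j in range(width)] for i in range(height)]


def erasure_mask(attention, ratio=0.8):
    """Binary mask: 0 where attention exceeds tau, 1 elsewhere.

    tau = ratio * max(attention) is an assumed rule, not Equation (9).
    """
    tau = ratio * max(max(row) for row in attention)
    return [[0.0 if a > tau else 1.0 for a in row] for row in attention]


def erase(feature, mask):
    """Element-wise product of every channel with the H x W mask."""
    return [[[v * m for v, m in zip(f_row, m_row)]
             for f_row, m_row in zip(channel, mask)] for channel in feature]


def select_branch(rng):
    """Equal-probability choice between the two branches."""
    return "SAAFE" if rng.random() < 0.5 else "FSCFE"


# Toy 2 x 2 x 2 feature; the strongest response sits at position (0, 0).
feature = [[[4.0, 1.0], [1.0, 0.0]],
           [[6.0, 1.0], [3.0, 2.0]]]
attention = spatial_attention(feature)
mask = erasure_mask(attention)
erased = erase(feature, mask)

rng = random.Random(0)
branches = [select_branch(rng) for _ in range(1000)]
```

In this toy case only the most salient location is zeroed in every channel, while the remaining responses pass through unchanged, and roughly half of the iterations route the feature to each branch.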

The detailed structure of equal probability random selection.
The random selection pattern not only simulates the random noise in practical scenarios, but also weakens the negative impact caused by the disappearance of salient features while preserving the network’s ability to mine potential fine-grained features, allowing the network to work in practical situations where the cues are unclear.
Considering the importance of vehicle Re-ID in building intelligent transportation [1] and video surveillance [2], for quickly locating and tracking suspicious vehicles, we also need to improve its efficiency, apart from its accuracy. Therefore, the MLKD method with four loss functions is introduced, as shown in Fig. 7.

The architecture of the knowledge distillation method.

The parts of the images selected in the VeRi-776.

The parts of the images selected in the VERI-Wild.

The parts of the images selected in the VehicleID.
The MAFEN is utilized as the teacher network. We then design a student network that uses ResNet-50 as the backbone and has a structure similar to the teacher network. The student network is supervised jointly by $L_{ce}$, $L_{tri}$, $L_{kd}$, and $L_s$, which denote the cross-entropy loss, triplet loss, knowledge distillation loss, and similarity loss, respectively. Among them, the similarity loss uses the L2 norm to force the high-level semantic features of the student network to be distributed similarly to those of the teacher network in the feature space. The triplet loss uses the high-level semantic features of the student network to pull similar samples closer and push dissimilar samples farther apart, indirectly addressing the problem that the triplet loss lacks strong constraints on triplet samples. The distillation loss uses the Kullback-Leibler divergence to bring the predicted class scores of the student network closer to the soft targets of the teacher network. The cross-entropy loss uses the student network’s prediction scores to classify vehicles more effectively. In this way, the student network greatly improves processing efficiency while maintaining vehicle Re-ID performance. The detailed loss functions are described in Section 3.5.
To train the MAFEN, the cross-entropy loss and the triplet loss are combined to jointly optimize the network. The total loss of the teacher network $L_{ter}$ is the weighted sum of the two losses in Equation (11).
To mitigate overfitting of the network, a label smoothing strategy [35] is employed in the cross-entropy loss. Formally, $L_{ce}$ is defined as Equation (12):
$p_{i,n}$ is a smoothed label, given by Equation (13).
The hyperparameter $\varepsilon \in [0, 1]$ is a weight factor, and $y_n \in Y_T$ is the ground truth of the $i$-th sample. Furthermore, the triplet loss $L_{tri}$ is described in Equation (14).
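For readers who want a concrete picture of the two teacher losses, the sketch below implements one common form of each. It is not the paper’s exact formulation: the smoothing variant (mixing the one-hot label with a uniform distribution), the value `eps = 0.1`, and the reading of β as the triplet margin are all assumptions, and the function names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def smoothed_targets(label, num_classes, eps=0.1):
    """(1 - eps) on the true class plus eps / N on every class (assumed variant)."""
    return [(1.0 - eps) * (1.0 if n == label else 0.0) + eps / num_classes
            for n in range(num_classes)]


def smoothed_cross_entropy(logits, label, eps=0.1):
    """Cross-entropy between the smoothed label and the predicted distribution."""
    targets = smoothed_targets(label, len(logits), eps)
    return -sum(t * lp for t, lp in zip(targets, log_softmax(logits)))


def triplet_loss(d_pos, d_neg, margin=0.3):
    """Hinge on anchor-positive versus anchor-negative distance.

    margin = 0.3 assumes that beta in the paper plays the role of the margin.
    """
    return max(d_pos - d_neg + margin, 0.0)


logits = [2.0, 0.0, 0.0, 0.0]
targets = smoothed_targets(0, 4, eps=0.1)
ce = smoothed_cross_entropy(logits, label=0, eps=0.1)
tri = triplet_loss(d_pos=1.0, d_neg=0.9)
```

Setting `eps = 0` recovers the ordinary cross-entropy, and the triplet term is zero whenever the negative is already farther than the positive by at least the margin.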
We next optimize the student network jointly with a cross-entropy loss, triplet loss, knowledge distillation loss, and similarity loss. As illustrated in Equation (15), the student network’s total loss $L_{stu}$ is the weighted sum of the four losses.
In particular, the knowledge distillation loss, as shown in Equation (16), constrains the probability distributions of the student and teacher networks to be more similar.
In addition to fitting the output of the teacher network, we also impose a constraint on the feature distribution learned by the student network. Therefore, we seek to minimize Equation (17).
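The two distillation terms can be illustrated with the following sketch. It assumes the usual temperature-softened Kullback-Leibler form for the distillation loss (including a `temperature ** 2` scaling that is common practice but not confirmed by the paper) and a plain L2 distance between feature vectors for the similarity loss; the function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-softened softmax over a list of class scores."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=10.0):
    """KL(teacher || student) on softened class probabilities.

    The temperature ** 2 factor is an assumption borrowed from common practice.
    """
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s))
    return kl * temperature ** 2


def similarity_loss(student_feat, teacher_feat):
    """L2 distance between student and teacher feature vectors."""
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)))


teacher_logits = [3.0, 1.0, 0.0]
kd_same = kd_loss(teacher_logits, teacher_logits)
kd_diff = kd_loss([0.0, 0.0, 0.0], teacher_logits)
feat_gap = similarity_loss([0.0, 0.0], [3.0, 4.0])
```

Both terms vanish when the student reproduces the teacher exactly and grow as the predicted probabilities or the features drift apart.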
Datasets
We evaluate the proposed algorithm on three mainstream vehicle Re-ID datasets, including VeRi-776 [10], VERI-Wild [36], and VehicleID [17], as shown in Figs. 8–10.
Evaluation metrics
We adopted the widely used evaluation protocols to evaluate the proposed network, including mean Average Precision (mAP), rank-1, and rank-5. In addition, we also use the testing time to measure the computational efficiency of the network.
The mAP is used to evaluate the comprehensive performance of the network and represents the average precision of all retrieval results. The higher value of mAP indicates better network generalization. The average precision (AP) is computed for each query as:
The mAP is the mean of the AP over all queries and can be calculated as:
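The metrics above can be computed as in the following sketch, which uses the standard retrieval definitions of AP, mAP, and rank-k accuracy. The input format (a 0/1 relevance list per query, ordered from the most to the least similar gallery image) and the function names are assumptions made for this example.

```python
def average_precision(ranked_relevance, num_relevant=None):
    """AP for one query.

    ranked_relevance: 1 (match) or 0 (non-match) per gallery image, best rank first.
    num_relevant: number of true matches in the gallery; defaults to the
    matches present in the ranked list.
    """
    if num_relevant is None:
        num_relevant = sum(ranked_relevance)
    if num_relevant == 0:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, rel in enumerate(ranked_relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this cut-off
    return precision_sum / num_relevant


def mean_average_precision(all_rankings):
    """Mean of the per-query AP values."""
    return sum(average_precision(r) for r in all_rankings) / len(all_rankings)


def rank_k_accuracy(all_rankings, k):
    """Fraction of queries with at least one match in the top k."""
    return sum(1 for r in all_rankings if any(r[:k])) / len(all_rankings)


# Two toy queries: matches at ranks 1 and 3, and a single match at rank 2.
rankings = [[1, 0, 1, 0], [0, 1, 0, 0]]
ap_first = average_precision(rankings[0])
m_ap = mean_average_precision(rankings)
rank1 = rank_k_accuracy(rankings, 1)
rank5 = rank_k_accuracy(rankings, 5)
```

For the first toy query the precision is 1 at rank 1 and 2/3 at rank 3, so its AP is 5/6, and averaging with the second query’s AP of 1/2 gives an mAP of 2/3.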
In this paper, ResNet-50 pre-trained on ImageNet is used as the backbone network. All experiments are conducted in PyTorch. Each training image is first resized to 256 × 256 and padded with 10 pixels on the image borders, then randomly cropped to 224 × 224. Each testing image is only resized to 256 × 256. During the training phase, we apply the Adam optimizer. The initial learning rate is 3.5e-4 and drops to 3.5e-5 and 3.5e-6 at the 30th and 60th epochs, respectively. In the training loss of the teacher network, λ in Equation (11) is set to 0.5 and β in Equation (14) is set to 0.3. In the training loss of the student network, λ1 and λ3 in Equation (15) are set to 0.1. The temperature $t_p$ in Equation (16) is set to 10. In the testing phase, we combine coarse-grained and fine-grained features as the final feature representation of the vehicle image.
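The learning-rate schedule stated above amounts to a step decay, which the following sketch reproduces. The function name `learning_rate` is hypothetical, and whether epochs are counted from 0 or 1 is an assumption; here the drop takes effect from epoch index 30 and 60 onward.

```python
def learning_rate(epoch, base_lr=3.5e-4, milestones=(30, 60), gamma=0.1):
    """Step schedule: multiply the rate by gamma at each milestone epoch."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


schedule = {epoch: learning_rate(epoch) for epoch in (0, 29, 30, 59, 60, 75)}
```

In PyTorch the same behavior is typically obtained with a multi-step scheduler attached to the Adam optimizer.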
Performance comparison on VeRi-776
This subsection compares the MAFEN with several traditional and current state-of-the-art methods on the VeRi-776 dataset. Among them, LOMO [5] handles viewpoint and lighting variations via hand-crafted local features. GoogLeNet [37] directly learns global vehicle features with a GoogLeNet pre-trained on CompCars. FACT [38] jointly discriminates vehicles with color and texture features, and PROVID [10] further optimizes the algorithm by utilizing license plate and spatial-temporal information. AAVER [19] focuses on vehicle orientation and local key points to capture discriminative local features. To reduce the effect of viewpoint variation, [39, 40] use additional viewpoint classifiers to extract salient features. SAN [21] obtains the final vehicle descriptor by using multi-modal information. MADRL [41] accurately focuses on multi-attention regions to extract discriminative features. Table 1 shows the comparison results with other methods in detail. The bold numbers denote the optimal results, while the suboptimal results are underlined.
Comparison with the state-of-the-art on the VeRi-776
As shown in Table 1, deep learning-based methods, i.e., GoogLeNet [37], FACT [38], PROVID [10], AAVER [19], PAMTRI [39], VANet [40], SAN [21], and MADRL [41], obtain remarkable improvements over the hand-crafted feature-based method, i.e., LOMO [5]. Compared with methods using additional attribute labels, i.e., FACT [38] and SAN [21], or spatial-temporal information, i.e., PROVID [10], the MAFEN is highly competitive. This may indicate that our network can extract more significant vehicle features without using additional information. Compared with the methods that address the viewpoint variation problem, i.e., PAMTRI [39] and VANet [40], the MAFEN also achieves the best Rank-1 performance. This suggests that the MAFEN can alleviate the problem of overlooking key vehicle information caused by viewpoint changes. Compared with the method of extracting robust features, i.e., AAVER [19], the MAFEN improves rank-1/rank-5 by 6.53%/2.62%, respectively. Compared with the method using the attention mechanism, i.e., MADRL [41], the MAFEN improves rank-1/rank-5 by 4.23%/2.13%, respectively. The training loss curve of the MAFEN on the VeRi-776 is shown in Fig. 11; it decreases steadily and exhibits good convergence. Figure 12 visualizes the ranking lists of the MAFEN on the VeRi-776. Images with green bounding boxes are true positive samples, while those with red bounding boxes are false positive samples. In summary, the above results illustrate the effectiveness of the MAFEN in handling the disappearance of detailed features due to tree occlusion, image blurring, and viewpoint changes, as well as dramatic changes in vehicle color due to illumination variations.

The Graphs of training loss.

Visualization of the ranking lists of MAFEN on VeRi-776. The green (red) boxes denote the correct (wrong) results.
The MAFEN is compared with state-of-the-art methods on the VERI-Wild [36]. Tables 2 and 3 show the comparison results on the three test subsets. FDA-Net [36] utilizes similarity constraints and attentional regularization to generate highly similar negative samples that improve the discriminator’s discriminative ability. VARID [42] proposes a viewpoint-aware triplet loss and updates viewpoint centers through clustering to address the intra-class and inter-class differences caused by viewpoint variations. The experimental results of the other state-of-the-art methods are taken directly from [36].
Comparison with the state-of-the-art on the VERI-Wild. This shows the results of mAP
Comparison with the state-of-the-art on the VERI-Wild. This shows the results for rank-1 and rank-5
Table 2 shows the experimental results of mAP on the three test subsets. Compared with the FDA-Net [36], the MAFEN achieves 39.62%, 39.52%, and 41.25% improvement in mAP, respectively. This improvement in mAP benefits from the SAAFE and FSCFE modules, which enable the network to extract robust features and also enhance its ability to capture discriminative cues. Compared with the VARID [42], the MAFEN does not achieve the best result on the small test subset, but it shows the best robustness on the other test subsets. The experimental results demonstrate that using viewpoint information suitably is effective in dealing with the high similarity between samples.
The rank-1 and rank-5 of the different methods on the three test datasets are shown in Table 3. A clear observation is that the MAFEN outperforms all other methods in two evaluation metrics. This indicates that the SAAFE and FSCFE modules enhance the network’s ability to capture discriminatory features and can address well the problem of local features being overlooked in complex scenes.
We compare the MAFEN with previous works on the VehicleID. Table 4 shows the experimental results on three test subsets.
Comparison with the state-of-the-art on the VehicleID
We observe that VANet [40] and SAN [21] obtain significant performance improvements compared with other methods. A possible reason is that these methods introduce additional attribute labels or viewpoint classifiers to complement the appearance features. Besides, the vehicles in the VehicleID dataset are captured from only two viewpoints, i.e., the front and the back. Therefore, combining multiple features is more likely to improve vehicle Re-ID performance in this setting. In contrast, the proposed SAAFE improves the sensitivity of the network to potentially salient features by erasing the most significant features, which has a negative effect on the VehicleID with only two viewpoints. Therefore, the proposed algorithm does not achieve the best performance on all three test subsets.
Considering the application of vehicle Re-ID in real scenarios, we compare the efficiency of the student network and the teacher network on two datasets (i.e., VeRi-776 and VehicleID) on the same machine. The results are shown in Table 5. Note that all networks in our experiments are trained and tested on a computer with an Nvidia GeForce GTX 1080 Ti GPU. The test time is measured during the test phase, from the input of all target vehicle images into the network until matching against every gallery set image is completed.
Comparison of the teacher network and student network in terms of size and testing time on the three datasets
On the VeRi-776, the student network takes 9.52 s to complete the overall Re-ID process. Compared with the 13.01 s spent by the teacher network, the student network achieves a 26.83% improvement in efficiency (reduction in test time). On the three test subsets of VehicleID, the teacher network takes 8.67 s, 9.18 s, and 11.21 s to finish the testing process, respectively. In contrast, the student network takes only 5.71 s, 6.86 s, and 8.56 s, which is at least a 23.64% improvement in efficiency. The performance of the teacher network and the student network is also compared in Fig. 13, which shows that the performance of the student network is higher than that of the teacher network on the different test datasets. Hence, the knowledge distillation method can be used to deploy vehicle Re-ID in the future: it simplifies the complex network and improves efficiency while maintaining Re-ID performance.
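The efficiency figures quoted above can be checked with a few lines of arithmetic. The sketch assumes that "improvement in efficiency" means the relative reduction in test time with respect to the teacher network; the function name `time_reduction` is hypothetical.

```python
def time_reduction(teacher_s, student_s):
    """Relative reduction in test time, as a percentage of the teacher's time."""
    return (teacher_s - student_s) / teacher_s * 100.0


# Test times in seconds, as reported in the text.
veri = time_reduction(13.01, 9.52)
vehicle_id = [time_reduction(t, s)
              for t, s in [(8.67, 5.71), (9.18, 6.86), (11.21, 8.56)]]
```

Under this reading, the VeRi-776 value rounds to 26.83% and the smallest of the three VehicleID values rounds to 23.64%, matching the reported numbers.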

The CMC of the teacher-student network performance comparison.
To demonstrate the effectiveness of the FSCFE module, we designed a set of experiments in which it is added at different stages of the backbone. √ indicates that the FSCFE module is added after the corresponding stage of the backbone.
The first observation in Table 6 is that adding the FSCFE module at different stages of the baseline yields different degrees of improvement over the baseline. In particular, when the FSCFE module is added after Res3 and Res4 of the backbone, the mAP and Rank-1 increase by 7.61% and 2.98%, respectively, compared with the backbone. Adding the FSCFE module after Res3, Res4, and Res5 achieves the best performance. Note, however, that in the proposed network the FSCFE module is added to ResNet-50 only after Res3 and Res4, because adding it at other stages brings only a small further improvement while consuming considerable computation. Figure 14 shows the heatmap obtained when the FSCFE module is added after Res3 and Res4. Most of the activated areas are the headlights, windows, and edges of the car, which again demonstrates that the FSCFE module can extract rich discriminative features.
Ablation experimental results of adding FSCFE modules at different stages of the backbone network on the VeRi-776

Visualization of the heatmap of the FSCFE module on VeRi-776.
To validate the effectiveness of the SAAFE module, we conduct an experiment in which SAAFE is added to the backbone. Here, SAAFE w/o FSCFE means that only the SAAFE module is used: no FSCFE module is introduced into it and no equal-probability random selection is performed. SAAFE w/ FSCFE means that the FSCFE module is introduced into the SAAFE module and equal-probability random selection is performed. SAAFE w/ CBAM means that the CBAM module [34] is introduced into the SAAFE module and equal-probability random selection is performed. The results are shown in Table 7.
Erasure probability selection of the SAAFE on the VeRi-776 dataset
Compared with the backbone, SAAFE w/o FSCFE improves Rank-1 and mAP by 1.78% and 5.52%, respectively. This improvement shows that the SAAFE module strengthens the network's ability to extract potentially salient features. Because the disappearance of the most salient features can reduce the network's learning ability, this section further compares introducing the CBAM module and the FSCFE module into the SAAFE module with random selection. Compared with SAAFE w/o FSCFE, the mAP and Rank-1 of SAAFE w/ FSCFE improve by 2.39% and 0.36%, respectively, while the mAP and Rank-1 of SAAFE w/ CBAM decrease by 1.02% and 0.83%, respectively. This demonstrates that introducing the FSCFE module into the SAAFE module effectively suppresses the performance degradation caused by the disappearance of the most salient features, and that the FSCFE module is more effective than the CBAM module for this purpose.
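For readers less familiar with the two metrics compared in these ablations, the following is a minimal sketch of how Rank-k accuracy and mAP are commonly computed from ranked gallery lists. It is an illustration under simplifying assumptions, not the evaluation code used in the paper: each query comes with gallery identities already sorted by increasing distance, dataset-specific protocol details (such as gallery filtering) are omitted, and all function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def rank_k_hit(ranked_ids: Sequence[int], query_id: int, k: int) -> bool:
    """True if the query identity appears among the top-k gallery results."""
    return query_id in ranked_ids[:k]


def average_precision(ranked_ids: Sequence[int], query_id: int) -> float:
    """Average of the precision values at each position holding a correct match."""
    hits = 0
    precision_sum = 0.0
    for position, gallery_id in enumerate(ranked_ids, start=1):
        if gallery_id == query_id:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


def evaluate(queries: List[Tuple[int, List[int]]], k: int = 1) -> Tuple[float, float]:
    """Return (Rank-k accuracy, mAP) over (query_id, ranked_gallery_ids) pairs."""
    rank_k = sum(rank_k_hit(ranked, qid, k) for qid, ranked in queries) / len(queries)
    mean_ap = sum(average_precision(ranked, qid) for qid, ranked in queries) / len(queries)
    return rank_k, mean_ap


# Toy data: two queries, gallery identities sorted from most to least similar.
toy_queries = [
    (7, [7, 3, 7, 5]),  # correct matches at positions 1 and 3
    (3, [5, 3, 3, 7]),  # correct matches at positions 2 and 3
]
rank1, mean_ap = evaluate(toy_queries, k=1)
```

In the toy data, the first query is matched at rank 1 and the second is not, so Rank-1 accuracy is 0.5, while mAP averages the per-query average precision.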
To verify the effect of the erasure probability α in the SAAFE module on vehicle Re-ID performance, experiments were conducted on VeRi-776 using the SAAFE w/ FSCFE module while varying the erasure probability α. Detailed results are shown in Fig. 15.

Erasure probability selection of the SAAFE on the VeRi-776.
The results show that an erasure probability of 0.8 achieves the best vehicle Re-ID results on all metrics. Erasing salient features with a suitable probability can therefore improve the ability of the network to extract potential features. In this paper, the erasure probability α is set to 0.8 for training on VeRi-776, 0.9 for training on VehicleID, and 0.8 for training on VERI-Wild.
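To make the role of α concrete, the sketch below illustrates one plausible reading of attention-guided erasure. It is not the authors' SAAFE implementation. The following are assumptions made only for illustration: erasure is applied to a sample with probability α, spatial attention is approximated by the channel-wise mean of the feature map, and the hypothetical parameter `erase_ratio` controls how many of the most attended locations are zeroed.

```python
import random
from typing import List, Optional

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][col]


def spatial_attention(features: FeatureMap) -> List[List[float]]:
    """Channel-wise mean per location (assumed stand-in for learned attention)."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [
        [sum(features[c][i][j] for c in range(channels)) / channels for j in range(width)]
        for i in range(height)
    ]


def erase_salient(
    features: FeatureMap,
    alpha: float,
    erase_ratio: float = 0.25,  # hypothetical: fraction of locations to erase
    rng: Optional[random.Random] = None,
) -> FeatureMap:
    """With probability alpha, zero the most attended locations in every channel."""
    rng = rng or random.Random()
    if rng.random() >= alpha:
        # No erasure for this sample: return an unchanged copy.
        return [[row[:] for row in channel] for channel in features]
    attention = spatial_attention(features)
    scores = sorted((v for row in attention for v in row), reverse=True)
    n_erase = max(1, int(len(scores) * erase_ratio))
    threshold = scores[n_erase - 1]
    # Locations tied with the threshold are erased as well.
    return [
        [
            [0.0 if attention[i][j] >= threshold else value for j, value in enumerate(row)]
            for i, row in enumerate(channel)
        ]
        for channel in features
    ]


# Toy 2-channel, 2x2 feature map; the bottom-right location is the most attended.
toy = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[1.0, 2.0], [3.0, 8.0]],
]
always_erased = erase_salient(toy, alpha=1.0)
never_erased = erase_salient(toy, alpha=0.0)
```

Under these assumptions, α = 0.8 would mean that roughly four out of five training samples have their most attended region suppressed, which pushes the network toward the remaining, less dominant cues.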
In this paper, we propose a Multi-Attention Guided Feature Enhancement Network (MAFEN) for vehicle Re-ID. The network mainly consists of the Fusing Spatial-Channel information multi-receptive fields Feature Enhancement (FSCFE) module and the Spatial Attention-Guided Adaptive Feature Erasure (SAAFE) module. The FSCFE module aggregates features from different receptive fields while learning the spatial structure and channel dependencies of features, which enables the network to extract rich appearance features. The SAAFE module shifts the network’s attention to potentially significant parts, effectively improving the network’s ability to extract significant local features. Considering the application of the vehicle Re-ID task in intelligent transportation, video surveillance, and smart city applications, a knowledge distillation method was applied to effectively improve the processing efficiency of the vehicle Re-ID task. The experimental results on three datasets demonstrate the competitiveness of the proposed MAFEN and knowledge distillation methods.
Acknowledgments
This research was supported by the National Natural Science Foundation of China under Grant 61806071, the Natural Science Foundation of Hebei Province under Grants F2019202381 and F2019202464, the National Key R&D Program of China under Grant 2018YFC08, the Open Projects Program of National Laboratory of Pattern Recognition under Grant 201900043, and the Innovative Research and Experiment Project of Young Researchers of Tianjin Academy of Agricultural Sciences under Grant 2021005.
