Abstract
Person re-identification (ReID) is widely used in intelligent security, monitoring, criminal investigation and other fields. To address local occlusion, scale misalignment and pose change of pedestrian images in real scenes, we propose a Multi-local Feature and Attention fused network (MFA) for the person re-identification task. First, a Channel Point Affinity Attention module (CPAA) is embedded in the backbone network to enhance its ability to extract local details. The feature map output by the backbone network is horizontally segmented into four local feature maps, and four branch networks are further attached to the backbone feature map. The four local feature maps guide the four branch networks to pay more attention to different areas of the pedestrian through the Global Local Aligned (GLA) loss function. Finally, a pedestrian feature vector containing multi-local features is obtained. On the Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 datasets, the network achieves mAP of 88.6%, 81.4%, 79.5% and 64.7%, and Rank-1 of 95.8%, 90.1%, 81.2% and 84.1%, respectively. In addition, the model obtains Rank-1 of 73.2% and 68.1% on the partial datasets Partial-REID and Partial-iLIDS, respectively. The MFA model has 28.3M parameters, and its inference speed is approximately 32 fps for an image with a resolution of 256
Introduction
Person Re-identification (ReID) [1] is a technology that can be widely used in video monitoring, intelligent security, and other fields by combining with pedestrian detection, target tracking, and other technologies. It plays an important role in criminal investigation or specific scenarios of person search. The ReID task essentially belongs to the image retrieval problem. In recent years, with the breakthrough of deep learning in the image field, the deep learning-based ReID method has also received widespread attention.
Unlike pedestrian detection, the ReID task must attend to finer-grained features of pedestrians. In an open scene, the combined features of pedestrians, such as clothing, accessories, and personal belongings, are ever-changing. In addition, due to factors such as viewpoint, posture, occlusion, and lighting, the appearance of the same pedestrian differs considerably across time periods, which makes the ReID task very difficult.
At present, although many person re-identification methods have been proposed, it remains difficult to solve the problems of background interference and scale variation caused by pedestrian movement. Extracting more fine-grained local features easily introduces background interference, while learning only a few features limits model performance. In response to these problems, our proposed MFA method attends to multiple local features while addressing background interference and pedestrian scale variation. Tests on the commonly used datasets Market-1501, DukeMTMC-reID, CUHK03 and MSMT17, as well as on the occlusion datasets Partial-REID and Partial-iLIDS, show that the network achieves competitive performance. Visualization analysis using Grad-CAM [27] shows that the proposed method effectively avoids interference from backgrounds and occlusions, allowing purer pedestrian features to be extracted. Furthermore, by incorporating multiple local features of pedestrians, the model’s recognition capability is further improved. When combined with pedestrian detection models, it can efficiently identify target individuals in surveillance videos, which is convenient for criminal investigation work.
Our work contributions are summarized as follows:
(1) We propose a novel person re-identification method that improves recognition by focusing on multiple local features of pedestrians. Test results on standard datasets and occlusion datasets show that our method achieves competitive performance compared to similar methods.
(2) We use multiple modules and the proposed Global Local Aligned (GLA) loss for training. Through Grad-CAM visualization, we find that our method effectively handles complex backgrounds and occlusions in pedestrian activity scenes, addressing the background interference commonly encountered in existing methods and providing new insights for future research.
(3) We propose a flexible and effective Channel Point Affinity Attention module (CPAA), which further enhances the model’s performance.
The remaining sections of this paper are organized as follows:
In Section 2, we introduce the related work. In Section 3, we provide detailed explanations of the basic principles and specific implementation of our method. In Section 4, we conduct experimental verification of the method, presenting specific evaluation metrics and comparisons with other methods, as well as visualization results to further validate its effectiveness. Finally, in Section 5, we conclude the paper.
Related work
Currently, the ReID task combines representation learning [2] and metric learning [3]. Representation learning enables the network to extract more discriminative features, while metric learning maps features to a specific subspace in which the distance between samples of different classes is increased, so that different pedestrians can be distinguished.
In the ReID task, representation learning methods are mainly based on global features, generative adversarial networks (GANs), pose estimation, masks, image blocks, and attention mechanisms. Networks based on global features, such as OSNet [4], BagTricks [5], CDNet [48] and SVDNet [6], only allow the network to learn partial discriminative features from the dataset and do not consider local features, so their performance is mediocre in scenes where pedestrians are occluded. GAN-based methods such as Camstyle [7] and PN-GAN [8] improve the model’s generalization ability by expanding the training data. Pose estimation-based methods such as CPM [9], GLAD [10], PIE [11], and SpindleNet [12] use human keypoint information to alleviate local misalignment; however, they increase computational complexity and require keypoint-annotated data during training. Mask segmentation-based methods such as SPReID [13], ISP [51] and MaskReID [14] use image segmentation to separate the human from the background, generating binary masks that suppress background interference. Image block-based methods such as AlignedReID [15], SCPNet [16], PCB [17], Pyramid [18], BFE [19] and DAAF-BOT [45] force the network to focus on features in different regions during learning, enhancing feature robustness; however, if the pedestrian alignment problem is not solved, background interference is introduced. Attention mechanism-based methods, such as AGW [20], Mancs [21], DuATM [22], HA-CNN [23], Transreid [46], AAFormer [49] and DenseFormer [50], selectively improve the feature extraction ability of the network.
The attention mechanism improves the model’s feature extraction ability, but the methods based on attention mechanisms only utilize the global features and discriminative local features of the human body and ignore other minor features, resulting in mediocre performance in occlusion and viewpoint changes. Currently, in supervised ReID tasks, more attention is given to the fusion of global and local information to extract more discriminative pedestrian features.
After extracting discriminative pedestrian features, the recognition performance can be further improved by metric learning methods that design loss functions to increase inter-class distance and decrease intra-class distance. Commonly used losses include classification loss [24], triplet loss [25], RA-Loss [47] and quadruplet loss [26]. Classification loss sets the number of IDs in the dataset as the number of classes in the network, feeds the feature vector into a fully connected layer, and calculates the cross-entropy loss after Softmax. Triplet loss constrains the distance of a positive sample pair to be smaller than the distance of a negative sample pair by at least a set margin during training, thus clustering samples of the same class. RA-Loss addresses the intra-pair variance by exploring the informative relation across pairs. Quadruplet loss adds a further sample from a different class to the triplet and additionally constrains the distance of the positive pair to be smaller, by a set margin, than the distance between samples of two other classes, thus reducing intra-class distance and increasing inter-class distance.
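As a minimal illustration of the triplet loss discussed above, the sketch below computes a hinge-style triplet loss with NumPy. The function name `triplet_loss`, the Euclidean distance, and the margin value of 0.3 are assumptions made for illustration and are not taken from this paper; MFA itself uses a Soft-triplet variant rather than this plain form.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Plain hinge triplet loss over a batch of (anchor, positive, negative) features.

    All inputs have shape (batch, dim). The margin value is an illustrative assumption.
    """
    d_ap = np.linalg.norm(anchor - positive, axis=1)  # distance of the positive pair
    d_an = np.linalg.norm(anchor - negative, axis=1)  # distance of the negative pair
    # The loss is zero once the negative is farther than the positive by at least `margin`.
    return float(np.maximum(d_ap - d_an + margin, 0.0).mean())

anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
far_negative = np.array([[3.0, 4.0]])   # d_an = 5, constraint satisfied
near_negative = np.array([[0.0, 0.5]])  # d_an = 0.5, constraint violated

print(triplet_loss(anchor, positive, far_negative))
print(triplet_loss(anchor, positive, near_negative))
```

In the first call the negative is already far enough away, so the loss vanishes; in the second the negative is closer than the positive, so the loss is positive and would push the two apart during training.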
Existing works have difficulty balancing network complexity and fine-grained recognition. Mask-based methods require pixel-level annotations and have high computational costs. Pose estimation-based methods suffer from low efficiency due to complex network structures. Block-based methods fail to address background interference and scale variations. Attention-based methods guide the network to focus on more discriminative features but often neglect other local features. To address these issues, our work balances computational costs with multi-local feature recognition. We use global and local association features to guide the network’s attention to specific regions of the human body, effectively filtering out background interference.
Method
Overall network architecture of MFA. We use ResNet50 as the backbone network and embed CPAA modules in the last three layers. The output feature map is processed through multiple convolutional and GeM modules to obtain multiple global embeddings. Simultaneously, the output feature map is divided into spatially corresponding local embeddings. During training, global and local association learning is achieved through the GLA loss. Each global embedding is learned using Soft-triplet loss, Center loss, and cross-entropy-based ID loss.
The ReID network structure proposed in this article, which integrates local features and attention mechanisms, is shown in Fig. 1. The ReID task requires deep features with rich semantic information, but increasing the depth of the network can lead to vanishing gradients, and the training error increases accordingly. ResNet, proposed by He et al. [28], solved this degradation problem of deep networks, so ResNet50 is used as the backbone network in this article. To better utilize the associated information in neighboring regions and extract discriminative local semantic features of pedestrians, a Channel Point Affinity Attention (CPAA) module is embedded in the last 3 layers of the network. At the same time, to better deal with occlusion, scale changes, and similar situations, the output feature map of the backbone network is horizontally divided into multiple regions, each horizontal region is reduced to a feature vector through global average pooling (GAP), and corresponding local branches are added. The feature map generated by each local branch is passed through a GeM [29] layer to produce a feature vector, which is paired with the region feature vector obtained by horizontal segmentation. The two are combined through the proposed Global Local Aligned (GLA) loss function to guide each local branch to learn the features of a different region of the pedestrian. To reduce overfitting, better focus on local detailed features, and increase feature robustness, this article uses cross-entropy loss with label smoothing as the identity loss (ID loss), and adds Soft-triplet loss and Center loss for joint training. Note that batch normalization (BN) is applied to the features used in the ID loss.
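The horizontal division followed by global average pooling described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function name `horizontal_stripe_features`, the channel-first layout, and the example tensor shape are hypothetical.

```python
import numpy as np

def horizontal_stripe_features(feature_map, num_stripes=4):
    """Split a (C, H, W) feature map into horizontal stripes and average-pool each one.

    Returns an array of shape (num_stripes, C): one feature vector per stripe.
    """
    c, h, w = feature_map.shape
    if h % num_stripes != 0:
        raise ValueError("feature map height must be divisible by num_stripes")
    # Group consecutive rows: (C, H, W) -> (C, num_stripes, H // num_stripes, W)
    stripes = feature_map.reshape(c, num_stripes, h // num_stripes, w)
    # Global average pooling inside each stripe, then put the stripe axis first.
    return stripes.mean(axis=(2, 3)).T

# Toy example: 2 channels, 4 rows, 2 columns, so each stripe is a single row.
fmap = np.arange(16, dtype=float).reshape(2, 4, 2)
local_vectors = horizontal_stripe_features(fmap, num_stripes=4)
print(local_vectors.shape)
print(local_vectors[0])
```

Each row of `local_vectors` plays the role of one region feature vector; in the network these vectors are paired with the outputs of the local branches through the GLA loss.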
CPAA structure diagram. The part before feature map B employs a channel-wise self-attention mechanism, while the part after it utilizes spatial pixel-wise self-attention.
In ReID networks, attention mechanisms can generally promote the network to learn more pedestrian features and suppress irrelevant background information. The proposed Channel Point Affinity Attention (CPAA) in this article can better enable the network to learn associated information in neighboring regions and thus extract more discriminative local semantic features. In Fig. 2, the input feature map of the network
Let
Subsequently, the feature map B is passed through three
In this paper, the output feature map of the backbone network is horizontally divided into
where
In the formula,
Different from global average pooling that only performs pooling operation on the feature map, GeM continuously updates the final convergence to the best value during the backpropagation process when generating the feature vector, and when
After the feature vector is generated in the local branch, a batch normalization (BN) [5] module is applied to normalize the features. Batch normalization standardizes each feature dimension across the samples in a batch, which helps the ID loss converge faster and also facilitates the convergence of the other auxiliary losses.
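For reference, generalized-mean (GeM) pooling as used in the local branches is commonly written as the p-th root of the mean of the p-th powers of the activations. The sketch below is a minimal NumPy version under stated assumptions: the function name `gem_pool`, the default exponent of 3, and the clamping constant are illustrative, and the exponent is kept fixed here, whereas in a trained network it would be a learnable parameter updated by backpropagation.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the spatial axes of a (C, H, W) feature map.

    p = 1 reduces to average pooling; as p grows, the result approaches max pooling.
    """
    x = np.clip(feature_map, eps, None)  # keep activations positive before taking powers
    return np.power(np.power(x, p).mean(axis=(1, 2)), 1.0 / p)

fmap = np.array([[[1.0, 2.0], [3.0, 4.0]]])  # one channel, 2 x 2 spatial grid
print(gem_pool(fmap, p=1.0))   # equals the spatial mean of the channel
print(gem_pool(fmap, p=3.0))   # lies between the mean and the maximum
print(gem_pool(fmap, p=50.0))  # close to the spatial maximum
```

The three calls show how the exponent interpolates between average pooling and max pooling, which is why a learned exponent lets each branch choose how strongly to emphasize its most active locations.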
The overall loss function of the MFA network can be represented as follows:
In the formula,
Firstly, the model uses cross-entropy loss with added label smoothing as the basic classification loss function, which can be represented by the formula:
where
In addition, to improve the network’s learning of finer-grained features, the Soft-Triplet loss [30] is added, where
The problem with triplet-based losses, including the Soft-Triplet loss, is that the function only focuses on the difference between
In the equation,
In the formula, presents the number of horizontally divided regions,
To provide a clearer representation of the implementation details, the pseudocode is as follows:
Experimental dataset introduction
Parameters of the datasets. The datasets used in our experiments include Market-1501, DukeMTMC-reID, CUHK03, and MSMT17, which consist of both training and testing sets. The Partial-REID and Partial-iLIDS datasets are used for testing occluded scenarios only.
Examples of cropped pedestrian images on 6 different datasets where (e) and (f) are occlusion datasets.
To demonstrate the effectiveness of our proposed method, we conducted experiments on four widely used datasets, Market-1501, DukeMTMC-reID, CUHK03 and MSMT17, as well as on two occlusion datasets, Partial-REID and Partial-iLIDS, to evaluate the model’s generalization ability under occlusion. Figure 3 shows some examples from each dataset. Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 were all collected in university campus environments and contain a certain amount of occlusion; Partial-REID was captured outdoors, where pedestrians are occluded by trees, buildings, and other objects; Partial-iLIDS was captured by airport cameras, where pedestrians are occluded by luggage, signs, and other pedestrians. Table 1 shows the basic parameters of each dataset, where #identities is the number of identities, #imgs is the number of images, and #camera is the number of cameras. Partial-REID and Partial-iLIDS are relatively small in scale and do not include a training set, so they are used for testing only.
The experiments were conducted on the Ubuntu 20.04 operating system with a Core i9-11900 CPU and an NVIDIA GeForce RTX 3090 graphics card with 24 GB of memory. The model was built with the PyTorch framework, and a ResNet50 network pre-trained on the ImageNet dataset was used as the backbone. Since the pedestrian images in the datasets are generally small and the feature map resolution is low after 32× downsampling, the stride of the last layer of the backbone network was set to 1. During training, the Adam optimizer was used with a batch size of 64 (16
Considering a model’s prediction, True Positive (TP) refers to correctly predicted positive samples, False Positive (FP) indicates incorrectly predicted positive samples, True Negative (TN) represents correctly predicted negative samples, and False Negative (FN) denotes incorrectly predicted negative samples. Precision is denoted as:
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
Consider an ID labeled as
mAP represents the mean average precision across all IDs, with
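To make the evaluation metrics concrete, the sketch below computes average precision (AP), mAP and Rank-1 from ranked retrieval results. It uses the standard ranking-based definition of AP, which is an assumption about the exact form used in this paper; the function names and the toy match lists are hypothetical.

```python
import numpy as np

def average_precision(ranked_matches):
    """AP for one query; ranked_matches[k] is truthy if the k-th ranked gallery image has the query ID."""
    matches = np.asarray(ranked_matches, dtype=bool)
    if not matches.any():
        return 0.0
    hits = np.cumsum(matches)                 # number of correct matches up to each rank
    ranks = np.arange(1, len(matches) + 1)
    precision_at_k = hits / ranks             # Precision = TP / (TP + FP) at each cut-off
    return float(precision_at_k[matches].mean())  # average over the ranks of correct matches

def mean_average_precision(all_ranked_matches):
    """mAP: mean of the per-query AP values."""
    return float(np.mean([average_precision(m) for m in all_ranked_matches]))

def rank1(all_ranked_matches):
    """Rank-1: fraction of queries whose top-ranked gallery image is a correct match."""
    return float(np.mean([bool(m[0]) for m in all_ranked_matches]))

queries = [[1, 0, 1, 0], [0, 1]]  # toy ranked match lists for two queries
print(average_precision(queries[0]))
print(mean_average_precision(queries))
print(rank1(queries))
```

For the first toy query the correct matches sit at ranks 1 and 3, so its AP is the mean of 1/1 and 2/3; the second query has its only match at rank 2, which lowers both mAP and Rank-1.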
Effects of CPAA module position on model performance. Adding the CPAA module in the 2nd and 3rd layers of the backbone network leads to significant improvement in metrics. Adding the CPAA module in the last layer still has a positive effect.
To investigate the optimal location of the CPAA module in the network, four groups of control experiments shown in Table 2 were designed. Baseline only uses the ResNet50 backbone network to extract features without multiple local branches and the CPAA attention structure, and the loss function does not contain GLA. Layer2, Layer3, and Layer4 represent adding the CPAA module at the corresponding positions. To ensure the fairness of the comparison, the basic experimental parameters were set according to Section 4.2, and each method was trained and tested on the Market-1501 and DukeMTMC-reID datasets separately.
The data in Table 2 show that with the Baseline alone (Method 1), the Rank-1 indicators on Market-1501 and DukeMTMC-reID are 93.6% and 85.1%, respectively. Methods 2, 3, and 4 all improve the Rank-1 indicators to some degree, because the CPAA module aggregates channel features and related information from neighboring regions, resulting in feature maps with better representation capabilities. Method 3 improves more than Method 2, because the CPAA module in Method 2 sits at a shallower position in the network, where there is more interference from low-level features. In Method 3, the feature map output after Layer3 already contains rich semantic features, so adding the module at this point better aggregates deep semantic features and improves network performance. Method 4 improves slightly over Method 3, but the gain is limited: important semantic information has already been attended to through the feature aggregations in earlier stages, and the output features tend to be stable.
To further verify the effectiveness of each module, six control experiments are designed, as shown in Table 3, where Localbranches represents multiple local branches.
Effects of different module combinations on model performance. On the Market-1501 and DukeMTMC-reID datasets, CPAA directly improves the metrics, and further improvements are observed with the influence of GLA and local branches. Despite scale variations and background interference, each local branch focuses on the corresponding pedestrian region and ignores the background area.
A comparative analysis of the data in Table 3 reveals the following: (1) CPAA, when added to Baseline, increases the Rank-1 performance by 1.6% and 1.7% on the Market-1501 and DukeMTMC-reID datasets, respectively. Similar improvements are observed in other combinations that include CPAA, indicating that CPAA can effectively aggregate semantic features from neighboring regions in the feature map and enhance the representation ability of the backbone network. (2) Compared to Baseline, adding only Localbranches (without CPAA or GLA) to Baseline has no significant impact on the Rank-1 performance on both datasets, and this observation holds for other combinations that only include Localbranches, indicating that Localbranches alone cannot achieve the goal of diversifying local features. (3) Adding GLA to Localbranches (Baseline
Grad-CAM results of MFA. With the influence of the GLA loss and local branches, even if the pedestrian scale changes and there is background interference, each local branch focuses on the corresponding pedestrian area and ignores the background area.
To visualize the regions of interest in the image that the network attends to, Grad-CAM [27] was used for visual analysis. Four representative scene images were randomly selected from the Market-1501 test dataset, as shown in Fig. 4. The leftmost image in each group is the original image, and the middle four images show the visualization of the network’s attention regions overlaid with four different local branches. The image on the far right shows the visualization of the fused attention region after combining the four local branches.
In Fig. 4, (a) shows a normal cropped image, as described in Section 4.2. The different branches of the network attend in turn to four regions of the pedestrian’s body: above the chest, from the chest to the waist, from the waist to the knees, and below the knees. The fused attention regions cover almost the entire outline of the pedestrian. At the same time, the fused features tend to focus on more discriminative regions, such as special texture patterns on the upper body and the transition area between the pants and the legs. (b) shows a case of scale change, where the pedestrian occupies only a local area of the image. Here, the local branch responsible for the upper part of the image does not attend to the background above the pedestrian: under the influence of the global semantic features, no semantic information in the upper region of the image meets the requirements, so this branch defaults to the most discriminative global feature, i.e., the upper body of the pedestrian. The other branches continue to focus on different local regions. (c) shows a partially occluded pedestrian. Because the lower part of the body is covered, the branch responsible for the area below the knees defaults to the most prominent global feature, and the other branches focus only on the upper body. (d) shows a difficult case with both background interference and scale change. Many other pedestrians and objects appear in the background, but the attention regions of each local branch still meet the requirements.
In all four situations, the network’s final attention regions cover the entire pedestrian and ignore interference such as occlusion and background, thus achieving rich and pure pedestrian features in the network output.
Grad-CAM results of MFA on Partial-REID dataset images. The model trained on the Market-1501 dataset still pays attention to the pedestrian region and ignores the occluded area when tested on the Partial-REID dataset.
To observe the generalization ability of the network on other occlusion datasets, we selected four representative images from the Partial-REID dataset, as shown in Fig. 5, covering half-body, side-view, heavily occluded, and partially occluded cases. Grad-CAM visualization shows that the model trained on the Market-1501 dataset still generalizes well to Partial-REID. The network’s attention areas cover the entire person and focus on multiple discriminative local features while ignoring interference from the background and occlusions.
The proposed method is compared with current state-of-the-art (SOTA) methods, including block-based methods such as DAAF-BOT [45], MGN [32], GCP [33] and MSCN [34]; global-based methods such as OSNet [4], BagTricks [5], CDNet [48] and AEFLN [35]; mask-based segmentation methods such as ISP [51], SPReID [13] and GASM [36]; and attention-based methods such as AAFormer [49], DenseFormer [50], AGW [20], ABD-Net [37], and SONA [38]. To ensure a fair comparison, re-ranking [39], multi-query [40], and similar techniques are not used in this study.
Performance comparison of different methods on 4 datasets. The best results are shown in bold, while underline indicates the second-best result. ‘–’ denotes that the result is unavailable. Our model achieves competitive performance across the evaluated datasets.
Table 4 presents the performance of these networks on the Market-1501, DukeMTMC-reID, CUHK03 and MSMT17 datasets. OSNet, AGW, and MGN, which perform best among the open-source methods described above, were selected for analysis. OSNet incorporates a fusion mechanism of convolution layers with different receptive fields in its module, which can extract the most discriminative features but ignores other secondary features, limiting the generalization ability of the network. AGW further enhances the feature extraction ability of the backbone by adding an attention module on top of BagTricks, but it does not use local feature information. MGN uses global and block-based local features and can extract fine-grained features, but it does not address scale variation and occlusion, so it may introduce more interference when the network faces such situations. In this paper, CPAA and the multi-branch structure allow the network to extract diverse fine-grained features, and GLA makes the network consider global semantic information while focusing on local features, which helps it deal with scale variation and occlusion. The indicators in Table 4 show that the proposed MFA network has the best performance.
The top-10 retrieval results for each model under the same query are shown, with red indicating incorrect ID matches and green indicating correct ID matches.
To visually reflect the effectiveness of the proposed method, the retrieval results of AGW, OSNet, MGN, and the proposed method were compared. Figure 6 shows the Rank-10 retrieval results of each method and the proposed method, with dotted boxes indicating incorrect matches. Although AGW and OSNet had correct Rank-1 matches, they could not pay attention to more detail information due to the absence of local features, leading to more incorrect matches. MGN failed to solve the problem of scale variation and could result in misidentification in such cases. MFA, due to its ability to extract more diverse local fine-grained features and address the issue of scale variation, achieved better Rank-10 matching.
For visualization purposes, we compare our results by Grad-CAM. As shown in (a) and (b), from left to right: the original image, RA-Loss [47] results, ISP [51] results, AGW [20] results, and our results. Our model accurately focuses on more discriminative pedestrian features while filtering out background and occlusion interference.
To visually demonstrate the effectiveness of our proposed method in capturing detailed information in images, we compared it with other methods RA-Loss, ISP, and AGW using Grad-CAM visualization. Figure 7 presents the specific visualization results. Our method demonstrates more effective attention to multiple local features in the pedestrian region while filtering out background and occlusion interference.
To compare model complexity and execution efficiency, we selected models that performed well on multiple datasets for comparative analysis. The models include AGW, GASM, ISP, DAAF-BOT, and MGN. Table 5 lists the parameters and execution efficiency (fps) for each model, with a test image resolution of 256
Finally, to verify the model’s generalization ability under various occlusion scenarios, tests were conducted on the Partial-REID and Partial-iLIDS datasets using the single-shot method. The occluded image was used as the query and the complete pedestrian image was used as the gallery. Table 6 shows the indicators for each method. On Partial-REID, MFA achieved a Rank-1 indicator of 73.2%, which is 3.5% higher than AGW, the strongest of the compared methods. On Partial-iLIDS, the Rank-1 indicator of MFA reached 68.1%, an increase of 0.8% over the better-performing PGFA.
Comparison of different models in terms of parameter size and inference efficiency
Performance comparison of different methods on 2 partial datasets. Underline represents the second-best result. Our model still performs the best on the occlusion datasets containing only test images.
We propose a pedestrian re-identification network that integrates local features with attention mechanisms. The CPAA attention module is embedded in the ResNet50 backbone network, effectively aggregating semantic features from different channels and neighboring regions. In addition, through the joint action of the local branches and the GLA loss, the model can focus on multiple local features of pedestrians. Visual analysis shows that the fused pedestrian features are rich and pure. Compared with similar networks, the proposed method achieves competitive performance, demonstrating its effectiveness. However, our work still has room for improvement. Currently, pedestrian re-identification models have high accuracy but large parameter sizes, making them challenging to deploy on edge devices. In future work, we will investigate how to streamline the model while maintaining current performance, enabling lightweight deployment of the network on edge devices.
Acknowledgments
This research was funded by the National Natural Science Foundation of China (NSFC) (grant no. U21A6003), the Program of Promoting the Development of University – Diligence Talents (grant no. 5112111145).
