Abstract
Person re-identification (ReID) is a crucial task in identifying pedestrians of interest across multiple surveillance camera views. ReID methods in recent years have shown that using global features or part features of the pedestrian is extremely effective, but many models lack further design to make more reasonable use of global and part features. We propose a new model that uses global features more rationally and extracts more fine-grained part features. Specifically, our model captures global features with a multi-scale attention global feature extraction module, and we design a new context-based adaptive part feature extraction module that considers the continuity between different body parts of pedestrians. In addition, we add enhancement modules to the model to further improve its performance. Experiments show that our model achieves competitive results on the Market1501, DukeMTMC-ReID, and MSMT17 datasets. The ablation experiments demonstrate the effectiveness of each module of our model. The code of our model is available at:
Introduction
The purpose of the person re-identification (ReID) task is to determine whether pedestrians captured by different cameras are the same person, and ReID is generally used to address cross-camera tracking and surveillance security issues. Compared with face recognition, the ReID task faces many practical difficulties: the appearance of pedestrians is easily influenced by lighting conditions, posture, viewpoint, and resolution. Some examples are shown in Fig. 1.

Some of the challenges faced by person re-identification in the Market1501 dataset. (a) The same pedestrian under different lighting conditions. (b) The same pedestrian in different postures. (c) The same pedestrian from different viewpoints. (d) The same pedestrian captured by cameras of different resolutions.
Traditional methods focused on low-level features such as color [7], texture, and combinations of low-level features [8]. In recent years, convolutional neural networks (CNNs) have achieved great success in many computer vision tasks. Compared with traditional methods, deep learning performs better in the ReID task [49]. Most current deep-learning-based ReID methods use global features, because global features are effective when external factors vary little. However, global features cannot effectively describe fine-grained information about pedestrians. Therefore, to further improve the performance of the ReID model, additional descriptions of pedestrians are required. After the emergence of part-based methods, some improved part-based methods [24,41,45] use enhanced part features as pedestrian descriptors, which leads to a significant improvement in the performance of person re-identification.
Part-based methods of recent years can be classified into three categories. (1) The first category divides body parts according to the pedestrian’s posture [17,20,29]. (2) The second category is direct division, also known as ‘hard division’, in which the image is cut into horizontal stripes [4,22,44]. (3) The third category is the flexible division of pedestrians, also called ‘soft division’ [25,37]; this division method is more accurate and reasonable for ReID tasks.
In this paper, we propose a new global feature extraction method and a new part division method to obtain fine-grained features of pedestrians. Compared with existing global and part feature methods, our method makes the following contributions. (1) For global features, our method uses multi-scale convolution kernels and a new lightweight spatial attention module to extract more refined global features. (2) For part features, our method employs a part feature fusion strategy that drives the model to focus on continuous part features of pedestrians. (3) To further enhance the feature mining capability of the model, we add a Non-local module after stage1 and after stage2 of the backbone network. Our experimental results indicate that the Non-local module effectively improves model performance.
In addition, dividing the person image into several part features has proven effective. For example, the authors of [5,47] divided the image into the upper and lower body, the authors of [52] divided the pedestrian into several horizontal stripes, and the authors of [17] used sliding windows to construct local descriptors. In [17], the authors proposed a GOG descriptor for ReID, which uses a hierarchical Gaussian operator to describe the color and texture information of different segmented parts. Although these methods perform relatively well, hand-crafted methods cannot adapt to changes in image viewpoint and do not match deep learning methods in performance.
To make more rational use of global features and part features, we propose a multi-scale global feature module and a weight-driven part feature fusion module. The multi-scale global feature module extracts fine-grained global features by using convolution kernels of several different scales and a new lightweight spatial attention module. The weight-driven part feature fusion module uses an adaptive part feature fusion strategy to drive the model to consider the correlation between each part and its neighboring parts.
Method
Our model uses ResNet50 [10] as the backbone network, and we add extra modules to improve its feature extraction ability. Specifically, we introduce three feature mining modules. (1) Non-local modules [34]. The traditional ResNet50 network contains five basic modules as feature extraction units (stage0-stage4), and we add a Non-local module after stage1 and after stage2 to enhance the model’s salient feature extraction ability. (2) Multi-scale global feature module. We introduce a multi-scale global feature module to extract more comprehensive and refined global features by using multiple convolution kernels of different scales and the new lightweight spatial attention module. (3) Weight-driven part feature fusion module. We fuse each part feature of a pedestrian with neighboring part information by using a weight-driven adaptive method. Compared with mainstream part-based methods, our method not only considers the information contained in a single part but also uses an adaptive weight-driven method to learn the correlation between each pedestrian part feature and its adjacent part features. The architecture of our model is shown in Fig. 2.

Overview of our training/testing framework.
Convolutional neural networks have some shortcomings when applied to the ReID task. (1) Obtaining global features with large receptive fields generally requires large convolutional kernels, but large kernels introduce a large number of parameters, which leads to inefficient model training. (2) As the model deepens, the overall model gradient needs to be considered when designing other enhancement modules. (3) The traditional convolution-based backbone network is slightly inadequate in mining salient features of pedestrian images, and additional feature extraction modules are needed to help the whole network improve its ability to mine salient features. The Non-local module, which can capture a larger receptive field over the image, was first proposed in [34].
The computational flow of the Non-local module is shown in Fig. 3. The stage features are fed into three identical convolutional layers to obtain three feature maps of the same size,
Where x is the input feature map, the

Non-local module.
In this section, we introduce the multi-scale global feature module. We use different combinations of convolution and pooling layers to extract global features at different scales. After several sets of experiments, we finally use branch 1 (

Multi-scale global feature module.
To further enhance the feature extraction capability of the multi-scale global feature module, we design a tiny spatial attention (TSA) module and add it to the multi-scale global feature module, as shown in Fig. 5. We perform spatial construction of the feature map with four sub-operations (10 parameters), which include a global cross-channel average pooling operation (0 parameters), a

Tiny spatial-attention module.
Inspired by the ResNet residual module [10], we know that the use of residual structure can effectively solve the model gradient problem, so we sum up the multi-scale global feature map and stage4 feature map to get the final features
In this section, we introduce the weight-driven part feature fusion module, which is independent of the multi-scale global feature module. The stage4 feature

Weight-driven part feature fusion module. We divide the pedestrian image into six parts. Correspondingly, we calculate six ‘remainder parts’, which are the sum of one or two parts adjacent to the part. Then we use two FC layers and the sigmoid function to obtain two sets of weights.
Firstly, we define the sum of the parts adjacent to the current part (p1–p6 in Fig. 6) as the ‘remainder part’ (t1–t6 in Fig. 6). In general, it is difficult to quickly determine the relative importance of part features and ‘remainder part’ features by setting the weights manually. To drive the model to explore the importance of each part and remainder part adaptively, we use two FC layers and a sigmoid function to obtain adaptive weights for the part and the ‘remainder part’. Specifically, we use
where FC denotes the two fully connected layers and sigmoid denotes the sigmoid function.
Where we use adaptive weights
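The remainder-part construction described above can be illustrated with a small sketch. It assumes each part is already a pooled feature vector, and it takes the two sigmoid inputs as plain numbers standing in for the outputs of the two FC layers. The function names and the final combination (a weighted sum of the part and its remainder part) are assumptions made for illustration, not the exact fusion equation of the model.

```python
import math


def remainder_parts(parts):
    """For each part vector, sum its one or two adjacent part vectors."""
    n = len(parts)
    dim = len(parts[0])
    remainders = []
    for i in range(n):
        # The first and last parts have one neighbor, the others have two.
        neighbors = [parts[j] for j in (i - 1, i + 1) if 0 <= j < n]
        remainders.append([sum(v[d] for v in neighbors) for d in range(dim)])
    return remainders


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(part, remainder, part_logit, remainder_logit):
    """Assumed fusion: sigmoid-weighted sum of a part and its remainder part.

    part_logit and remainder_logit are hypothetical stand-ins for the
    outputs of the learned FC layers.
    """
    w_part = sigmoid(part_logit)
    w_remainder = sigmoid(remainder_logit)
    return [w_part * p + w_remainder * t for p, t in zip(part, remainder)]


parts = [[1.0, 0.0], [2.0, 1.0], [3.0, 2.0]]
remainders = remainder_parts(parts)
fused_first = fuse(parts[0], remainders[0], 0.0, 0.0)
```

In a trained model the two weights would be predicted per part from the features themselves, so parts with unreliable content can lean more on their neighbors.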
After the images are input to the backbone network, we obtain the features of stage 3 and stage 4. The stage 3 features are shallow features and are not sufficient to serve as the final classification features. The stage 4 features are the final features obtained from the ResNet50 network and contain precise image information. We perform average pooling and maximum pooling on the features of the two stages, sum the average-pooled features and the maximum-pooled features, and finally input the result to the FC layer to obtain the prediction used for calculating the cross-entropy loss. In addition, we calculate the triplet loss for the average-pooled features and the maximum-pooled features respectively, as shown in Fig. 2.
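As a minimal sketch of the pooling step just described, the following assumes a stage feature map stored as nested lists of shape channels × height × width. The function names are illustrative only; a real implementation would use the tensor operations of a deep learning framework.

```python
def global_avg_pool(feature_map):
    """Average each channel over all spatial positions."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]


def global_max_pool(feature_map):
    """Take the maximum of each channel over all spatial positions."""
    return [max(max(row) for row in channel) for channel in feature_map]


def pooled_descriptors(feature_map):
    """Return the average-pooled, max-pooled, and summed descriptors.

    The two pooled vectors would feed the triplet loss, and their sum
    would feed the FC classifier for the cross-entropy loss.
    """
    avg = global_avg_pool(feature_map)
    mx = global_max_pool(feature_map)
    return avg, mx, [a + m for a, m in zip(avg, mx)]


# Two channels, each 2 x 2.
feature_map = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
avg, mx, summed = pooled_descriptors(feature_map)
```

Average pooling summarizes the whole region while maximum pooling keeps the strongest response, so their sum carries both kinds of evidence.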
Loss computation
Label smoothing cross-entropy loss can weaken the supervision and prevent the overfitting of the model to some extent, and this strategy is also widely used in classification tasks. The label-smoothed cross-entropy loss is shown in Eq. (7).
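For illustration, the sketch below implements the commonly used form of label smoothing, in which the target class receives probability 1 − ε + ε/N and every other class receives ε/N, where N is the number of identities. The value ε = 0.1 and the function name are assumptions, and the exact form of Eq. (7) may differ in detail.

```python
import math


def label_smoothing_cross_entropy(logits, target, epsilon=0.1):
    """Cross-entropy between softmax(logits) and a smoothed one-hot target."""
    n = len(logits)
    # Numerically stable log-softmax.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_probs = [x - log_z for x in logits]
    loss = 0.0
    for i, log_p in enumerate(log_probs):
        # Smoothed target: eps/N everywhere, plus (1 - eps) on the true class.
        q = epsilon / n + (1.0 - epsilon if i == target else 0.0)
        loss -= q * log_p
    return loss


confident = [5.0, 0.0, 0.0]
plain = label_smoothing_cross_entropy(confident, 0, epsilon=0.0)
smoothed = label_smoothing_cross_entropy(confident, 0, epsilon=0.1)
```

With ε = 0 the function reduces to the ordinary cross-entropy. For a confident correct prediction the smoothed loss is larger, which is the weakened supervision that discourages overfitting to the training identities.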
In addition, to measure the most similar pedestrians from the dataset, we introduce the triplet loss to drive the model to find useful features, and the triplet loss is used to improve the final ranking performance. We calculate the triplet loss according to Eq. (8).
Where
In this paper, our loss computation is divided into three main parts. (1). For stage3 and stage4 features, we compute the triplet loss for the average pooled feature and the maximum pooled feature and compute the cross-entropy loss for the sum of the average pooled feature and the maximum pooled feature. (2). For the multi-scale global feature module, we only compute its cross-entropy loss. (3). For the weight-driven part feature fusion module, we calculate the cross-entropy loss for each
Finally, we jointly train the multi-stage network end to end with intermediate supervision, optimizing the model according to the total loss in Eq. (9).
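The triplet term used in the loss computation above can be sketched as follows. This is the widely used batch-hard variant with Euclidean distance. The margin of 0.3, the mining strategy, and the function names are assumptions made for illustration and are not confirmed to match Eq. (8).

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def batch_hard_triplet_loss(features, labels, margin=0.3):
    """Mean over anchors of max(d(a, hardest_pos) - d(a, hardest_neg) + margin, 0)."""
    losses = []
    for i, (anchor, label) in enumerate(zip(features, labels)):
        positives = [
            euclidean(anchor, f)
            for j, f in enumerate(features)
            if j != i and labels[j] == label
        ]
        negatives = [
            euclidean(anchor, f)
            for j, f in enumerate(features)
            if labels[j] != label
        ]
        if not positives or not negatives:
            continue  # This anchor cannot form a valid triplet.
        # Hardest positive is the farthest, hardest negative is the closest.
        losses.append(max(max(positives) - min(negatives) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0


labels = [0, 0, 1, 1]
separated = batch_hard_triplet_loss([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [6.0, 0.0]], labels)
interleaved = batch_hard_triplet_loss([[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [3.0, 0.0]], labels)
```

When the two identities are well separated the loss is zero, and when their features interleave every anchor contributes a positive penalty.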
Model test
In the model testing phase, we input the query image to the backbone network to obtain the stage 3 features and stage 4 features. Then we sum the global maximum-pooled features and global average-pooled features of stage3 to obtain
Experiments
Datasets
In this paper, we use three datasets to evaluate our model, which are Market-1501 [48], DukeMTMC-ReID [53], and MSMT17 [36]. We summarize the detailed data of the three datasets in Table 1.
Statistics of the datasets used for training and testing
Where
Performance evaluation
Evaluation metrics
We use the cumulative matching characteristic (CMC) and mean average precision (mAP) metrics on all datasets. The CMC curve reports the probability that a correct match appears within the top-k ranks, while mAP evaluates the overall retrieval performance of the method by averaging the precision over all correct matches of each query.
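To make the two metrics concrete, the sketch below computes Rank-k accuracy and mAP from ranked gallery lists, where each list holds one boolean per gallery image, ordered from most to least similar. The function names are illustrative. The standard ReID protocol also filters out gallery images of the same identity taken by the same camera as the query, which is omitted here for brevity.

```python
def rank_k_hit(ranked_matches, k):
    """1.0 if a correct match appears within the top-k results, else 0.0."""
    return 1.0 if any(ranked_matches[:k]) else 0.0


def average_precision(ranked_matches):
    """Mean of the precision values measured at each correct match."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def evaluate(all_ranked_matches, ks=(1, 5, 10)):
    """Return ({k: Rank-k accuracy}, mAP) averaged over all queries."""
    n = len(all_ranked_matches)
    cmc = {k: sum(rank_k_hit(r, k) for r in all_ranked_matches) / n for k in ks}
    mean_ap = sum(average_precision(r) for r in all_ranked_matches) / n
    return cmc, mean_ap


# Two queries with three gallery images each.
queries = [[True, False, True], [False, True, False]]
cmc, mean_ap = evaluate(queries, ks=(1, 2))
```

Rank-k only asks whether at least one correct match is near the top, whereas mAP also penalizes correct matches that are ranked late, so the two metrics can move independently.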
Ablation studies
Effect of each module on the model performance. NL represents the non-local module, MSG represents the multi-scale global characterization module, and WDP represents the weight-driven part feature fusion module
Effect of each branch of C-P2 on the model performance. TSA represents tiny spatial attention module
Effect of TSA on the model performance. TSA represents tiny spatial attention module
Effect of weight-driven part feature fusion operation on the model performance
The results under different image sizes and different multi-scale global feature modules on Market-1501

Effect of the different number of parts on model performance. It is observed that the model performance is optimal when the number of
Results are obtained by using L2 distance and RK on the Market1501 dataset
Results are obtained by using single-query and multi-query on the Market1501 dataset
In this section, we compare our model with existing models on the Market1501, DukeMTMC-ReID, and MSMT17 datasets.
A comparison of our method with those of recent years on the Market-1501 dataset
With the help of multi-query, Rank-1, Rank-5, and Rank-10 are further improved to 96.1%, 99.1%, and 99.3%, and the mAP reaches 90.3%. By using RK and multi-query, our Rank-k and mAP results are greatly improved.
A comparison of our method with those of recent years on the DukeMTMC-ReID dataset
A comparison of our method with those of recent years on the MSMT17 dataset
PCB [27] is a simple and strong part-based method that divides the body into horizontal stripes of the same size. Many existing methods use the PCB model as a comparison method. To compare the PCB model and our model intuitively, we visualize their activation maps and retrieval results, as shown in Figs. 8 and 9.
Visualization of the activation maps on Market-1501

Activation maps of pedestrians for different stages. The region covered by the color represents the scope of the model’s attention, and the shade of the color represents the intensity of the model’s attention.
To visualize the working mechanism of our model, we take two pedestrian images and compare the activation maps of stage1-stage3 of the PCB model and our model (the shade of the color represents the degree of focus of the model), as shown in Fig. 8. In Fig. 8, the first row is the visualization result of the PCB model and the second row is the visualization result of our model. For Image 1, the PCB model focuses on the book held in the pedestrian’s hand and covers a smaller region of attention, while our model tends to focus on diverse and comprehensive pedestrian features. For Image 2, factors such as the background have more influence on the PCB model, while our model is rarely disturbed by the background, is more robust to changes in factors such as the color and texture of the pedestrian’s upper and lower body clothes, and has a greater capability for salient feature extraction (e.g., the clothing logo of the pedestrian in Image 2).
HPM [6] is a model based on part features proposed in recent years, which describes pedestrians by dividing parts more finely. We reproduce this model for comparison.

Retrieval results for our model. Green numbers indicate correct identifications, and red numbers indicate incorrect identifications.
To show the retrieval results of our model more intuitively, we reproduce three models, query them with the same pedestrian image, and visualize the retrieval results, which are shown in Fig. 9. From Fig. 9, we can observe that the query image and the gallery images of the same pedestrian differ in many factors, such as lighting conditions, posture, viewpoint, and resolution. The error rate of the ResNet50 model is high because it uses only global features to describe pedestrians, and global features are not robust to changes in pedestrian details. These details include the texture of the pedestrian’s clothes, logos, and the color shades of clothes.
The accuracy of the part-feature-based models (PCB and HPM) is higher than that of the global-feature-based ResNet50, which shows that local features are effective for ReID tasks. Our model uses enhanced global and local features, which greatly improves retrieval accuracy and indicates that our improvements are effective.
Our method aims to enhance the robustness of the ReID model to changes in various external factors, and we propose a ReID model based on a multi-scale global feature module and a weight-driven part feature fusion module. For global features, we use multiple convolution kernels and a novel spatial attention module to extract refined global features. For part features, we divide the pedestrian into body parts and propose a novel adaptive weighting method that weights adjacent part features to construct enhanced pedestrian part features. In addition, we introduce the Non-local module into the backbone network to enhance its feature extraction capability, and then use the two secondary modules as feature complements. Comparative experiments show that our method achieves high accuracy compared with other similar methods. Ablation experiments also clearly demonstrate the effectiveness of each component of our model.
Acknowledgements
This work was supported by the Provincial Natural Science Research Projects in Anhui Universities-Postgraduate Projects (YJS20210508), by the University Synergy Innovation Program of Anhui Province under Grant GXXT-2019-007, by the National Natural Science Foundation of China under Grant 61901006, Grant 62105002, and Grant 62001004, by the Anhui Provincial Natural Science Foundation under Grant 1908085QF281 and Grant 2008085MF182, by the Anhui Provincial DOHURD Science Foundation under Grant 2020-YF22, and by the Educational Commission of Anhui Province of China under Grant KJ2020A0471.
