Multi-scale and multi-patch feature fusion network for person re-identification

Abstract

Person re-identification relies on discriminative features. However, most studies focus on extracting features from the high layers of the network and ignore the middle-layer features, so important details are frequently overlooked. To address this issue, we propose a Multi-Scale and Multi-Patch Feature Fusion Network (MSPF). We employ a modified OSFA to extract, align, and fuse the feature maps in the middle layers of the network, which compensates for the lack of detailed information in the high-level features. To obtain global pedestrian features with richer detail, we construct a multi-patch feature fusion module (MPF) and concatenate the global features extracted by the modified OSFA and by MPF. Cross-entropy loss, triplet loss and center loss are combined to constrain the model. We evaluate the model on Market-1501, CUHK03_labeled and DukeMTMC. The results show that our method outperforms the compared state-of-the-art approaches.

Keywords

Person re-identification, multi-scale, multi-patch, feature fusion

1 Introduction

Person re-identification aims to match the identities of people in images captured by different cameras. Given a full-body image of a person of interest (the query), the task is to find the closest image or images of that person in a database (the gallery). Person re-identification is an extremely challenging task. When a single person is captured by different cameras, the visible body parts, illumination conditions, background clutter, and occlusion can differ greatly. Even within the same camera, these conditions may vary due to human movement and other factors. It is therefore difficult to obtain relevant features.

Early studies focus on discriminative feature representation [1–3], distance metrics [4, 5], or their combination in a deep learning framework [6, 7]. However, these methods tend to lose some detailed features and have high time complexity. With the increasing computational power of computers, the outstanding performance of convolutional neural networks (CNN) in person Re-ID has been extensively reported [8–13]. Some studies regarded pedestrian re-identification as a multi-class classification problem and only utilized global-scale features [11–13]. However, these methods cannot handle pedestrian occlusion well, and background clutter in pedestrian images makes it difficult to extract global-scale features. To solve this problem, researchers have recently proposed fusing global-scale and local-scale features to extract more discriminative feature representations [14–16]. Extracting local-scale features from multiple parts of a pedestrian image and fusing them with the global-scale features is robust to factors such as occlusion and background clutter and produces better results. However, simply fusing local and global features still leaves room for improvement.

In the past few years, multi-scale feature combination (fusing local features extracted from different convolutional layers) for person re-identification has received much attention. For example, Wang [17] fused features across different scales to emphasize ID-related factors and ignore irrelevant factors. Rao et al. [18] proposed a multi-scale graph relation network that learns structural relations to capture more discriminative skeleton graph features. Huang et al. [19] proposed a multi-scale feature combination network, which combines structural feature information with global comprehensive feature information of the person image through a convolutional neural network and effectively retains distinguishing detail. Combining multi-scale features with global-scale features can extract sufficiently discriminative features of a person. Although this approach performs well in person Re-ID and has higher identification accuracy than single-scale methods, it increases the complexity of the network. Therefore, it is necessary to find a lightweight network for multi-scale person re-identification.

A lightweight network with a small number of parameters is less prone to overfitting and reduces model complexity. Fortunately, a number of lightweight networks have been proposed for various recognition tasks [20–22]. Aksu et al. [21] evaluated and compared networks such as ResNet-50 and GoogleNet, which had proved themselves in object recognition tasks. Jha et al. [22] proposed a novel Lightweight Attribute Localizing Model (LWALM) for pedestrian attribute recognition, which demonstrated a significant reduction in parameters and computational complexity with less than 2% accuracy drop. Building on research into lightweight networks and multi-scale feature learning, Zhou et al. [12] designed a novel deep re-ID CNN termed omni-scale network (OSNet). This is achieved by designing a residual block composed of multiple convolutional streams, each detecting features at a certain scale. OSNet is lightweight because its building blocks involve factorized convolutions. In particular, OSNet exploits the multi-layer information of the network, so it can learn discriminative pedestrian representations under the optimization of multiple losses, which is why many works employ it as the backbone network [23–25]. Notably, Li et al. [30] proposed an Omni-Scale Feature Aggregation method (OSFA) with OSNet as the backbone. OSFA can exploit multi-layer discriminative pedestrian features, learn discriminative pedestrian representations, and reduce the model complexity of the network.

Based on the above, in this paper we employ OSNet as the basic CNN backbone network. Furthermore, we construct a multi-patch feature fusion module (MPF). Feature aggregation over multiple patches can extract rich, detailed global features of pedestrians and has been widely used in image classification tasks [26, 31]. The contributions of this work are summarized as follows:

We employ the Omni-Scale Feature Aggregation Network (OSFA) [30] to exploit multi-layer discriminative pedestrian features and to learn discriminative pedestrian features based on global and local representation learning. Different from OSFA, we put the Spatial Attention Module behind the Channel Attention Module, since Woo et al. [36] reported better performance with this order.

We construct a multi-patch feature fusion module (MPF), using OSNet as the baseline, to extract discriminative global pedestrian features. We then concatenate the global features extracted by the modified OSFA and by MPF to obtain global features with richer detail.

We evaluate our method on three benchmark person Re-ID datasets (Market-1501, CUHK03_labeled and DukeMTMC), compare its performance with several advanced Re-ID models, and confirm that our method outperforms the compared methods in the person re-identification task.

2 Related work

Multi-scale feature learning. Features extracted from a single convolutional layer are inadequate for tasks that require multi-scale pedestrian representations for inference. The significance of multi-scale feature learning has been recognized recently. For instance, Qian et al. [28] proposed a multi-scale deep learning model that learns deep discriminative feature representations at different scales and automatically determines the most suitable scales for matching. Chen et al. [29] demonstrated the benefits of multi-scale person representation learning by jointly learning discriminative scale-specific features and maximizing multi-scale feature fusion selections. The methods [17–19] mentioned earlier also use multi-scale feature learning. However, these methods cannot learn the abstract and concrete information in the underlying features well. To solve this problem, Li et al. [30] proposed the Omni-Scale Feature Aggregation method (OSFA), which fuses the feature maps in the middle layers of the network to make up for the lack of detailed information in the high-level features. To extract more meaningful information, OSFA constructs a Smooth Aggregation Module (SAM), which aligns feature maps of different scales through unified division. It also introduces an attention mechanism in the middle of the network, composed of a Channel Attention Module (CAM) and a Position Attention Module (PAM). OSFA can therefore learn the abstract and concrete information in the underlying features well.

Multi-patch feature learning. Multi-scale feature learning can extract more discriminative features of a person. However, people can look very similar when they wear similar clothes, so extracting discriminative features from the entire image is not enough. To obtain pedestrian features with richer detail, some researchers have adopted the multi-patch idea. For instance, Zhang et al. [31] presented a deep hierarchical multi-patch network for image deblurring, which achieved state-of-the-art performance on the GoPro dataset compared with the multi-scale methods of that time. For person re-identification, Wang et al. [32] designed a Multiple Granularity Network (MGN) that takes multi-patch images as input and is useful for obtaining local feature representations with multiple granularities. Xue et al. [33] proposed a multi-division attention network that learns robust and discriminative person feature representations from the global image and different local images simultaneously. Multi-patch feature learning is thus valuable for extracting discriminative person feature representations.

Lightweight network framework. As mentioned above, researchers are committed to finding a lightweight network for person re-identification. Zhou et al. [12] proposed the omni-scale feature learning network (OSNet), which is built from a residual block composed of multiple convolutional streams. OSNet is lightweight and achieves good performance. Based on OSNet, Ploco et al. [34] combined it with a spatial-temporal constraint to create a hybrid model titled Spatial-Temporal OmniScale Network (st-OSNet), which achieved better performance than Zhou et al. [12]. Herzog et al. [35] combined global, part-based, and channel features in a unified multi-branch architecture that builds on the resource-efficient OSNet baseline and achieved good results on CUHK03 labeled, CUHK03 detected, and Market-1501. OSFA [30], mentioned above, likewise uses OSNet as its backbone and employs multi-scale feature learning. OSFA can exploit multi-layer discriminative pedestrian features, learn discriminative pedestrian representations, and reduce the model complexity of the network.

Based on the above research, we combine multi-scale learning, multi-patch learning, and a lightweight network, and propose a Multi-Scale and Multi-Patch Feature Fusion Network (MSPF) for person re-identification. MSPF includes two parts: a modified OSFA and a multi-patch feature fusion module (MPF). The modified OSFA extracts rich local and global features, and MPF extracts richer discriminative global features of pedestrians. Detailed descriptions are given in Section 3.

3 Our proposed network architecture

In this section, we describe the proposed Multi-Scale and Multi-Patch Feature Fusion Network(MSPF) in detail.

3.1 Network architecture overview

Our network architecture consists of a modified OSFA and a multi-patch feature fusion module (MPF), which are described as follows:

The modified OSFA is illustrated in Fig. 1. OSFA [30] extracts global and local features simultaneously using OSNet as the baseline. Compared with traditional baselines, OSNet is lightweight and has lower computational cost. OSFA consists of two parts: the baseline and the Smooth Aggregation Module (SAM). The baseline learns the abstract and concrete information in the underlying features by applying an attention mechanism in the middle of the network. In OSFA, the attention module is composed of a Position Attention Module and a Channel Attention Module, which decreases the influence of background clutter, and the Channel Attention Module is placed behind the Position Attention Module. Different from OSFA, we employ a Spatial Attention Module and a Channel Attention Module in the second block of the baseline and put the Spatial Attention Module behind the Channel Attention Module, since CBAM [36] reported better performance with this order. After an image is fed into the baseline and passed through Global Max Pooling, we obtain a 512-dimensional global feature vector f_G. SAM extracts multi-scale feature maps from different layers and fuses them to obtain the local features f_L. Cross-entropy loss, triplet loss and center loss are employed to constrain the model, which helps the network learn robust pedestrian image information.

Fig. 1

The framework of modified OSFA. It consists of two parts: Omni-Scale Network(OSNet) and Smooth Aggregation Module(SAM). OSNet is used as the CNN backbone to extract the high-level global feature f_G. Smooth Aggregation Module (SAM) is used to fuse multi-scale features to obtain local feature f_L. Cross-entropy loss, triplet loss and center loss are employed to constrain the model.

The multi-patch feature fusion module (MPF) is illustrated in Fig. 2. MPF contains two levels: the input pedestrian image is divided into 4 patches at level 2 (P2) and into 2 patches at level 1 (P1), so each level-2 patch is half the size of a level-1 patch. Each level of MPF consists of OSNet and Global Max Pooling. The output of the lower level (the finer grid) is added to the output of the level above it, so that the output of the top level contains all the information extracted at the finer level, which gives the global feature $f_{G}^{'}$. Then we add $f_{G}^{'}$ to f_G to obtain the overall global feature $f_{G}^{''}$. Our loss function is therefore based on the overall global feature $f_{G}^{''}$ and the local feature f_L, and combines cross-entropy loss, triplet loss and center loss during training.

Fig. 2

The framework of the multi-patch feature fusion module (MPF). It contains two levels. Multi-patch images are sent to the baseline at each of the two levels to obtain the global feature $f_{G}^{'}$. Then we add $f_{G}^{'}$ to f_G to obtain $f_{G}^{''}$ with richer detail representations.

3.2 Feature aggregation learning

Multi-scale feature aggregation learning. The Smooth Aggregation Module (SAM) fuses multi-scale features from different levels; its framework is shown in Fig. 1. Feature maps $F_{1} \in R^{C_{1} \times H_{1} \times W_{1}}$, $F_{2} \in R^{C_{2} \times H_{2} \times W_{2}}$, $F_{3} \in R^{C_{3} \times H_{3} \times W_{3}}$ are extracted from the second, third, and fourth blocks of the baseline, respectively. These feature maps first pass through a 1×1 convolutional layer to unify their dimensions. Next, global average pooling (GAP) brings the three feature maps to a uniform scale, which accurately locates the pedestrian and reduces the computational complexity of feature aggregation. Since the underlying features are redundant and scattered, each feature map is divided into four horizontal parts to learn discriminative local information. The parts can be regarded as local features $f_{n}^{i} \in R^{C_{n} \times \frac{H_{n}}{4} \times W_{n}}$ (n = 1, 2, 3; i = 1, 2, 3, 4), where n indexes the feature map and i indexes the part. The $f_{n}^{i}$ with the same i are then horizontally linked to form a local connection vector $f^{i}$, so that the local information of the pedestrian is captured at different scales. Finally, the $f^{i}$ are connected to obtain the local feature tensor f_L. Connecting the $f^{i}$ in this way reduces the impact of irrelevant components on the network's prediction of pedestrian identity.
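The partition-and-link step described above can be sketched in NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the 1×1 convolutions are omitted, the three inputs are assumed to share a channel count and spatial size already, each stripe is summarized by average pooling, and the function name `partition_and_link` and all shapes are hypothetical.

```python
import numpy as np

def partition_and_link(feature_maps, num_parts=4):
    """Split each (C, H, W) map into horizontal stripes, pool every stripe,
    and link the stripes with the same index across scales.

    Returns an array of shape (num_parts, num_scales * C).
    """
    parts = []
    for i in range(num_parts):
        stripe_vectors = []
        for fmap in feature_maps:
            c, h, w = fmap.shape
            step = h // num_parts  # stripe height H / num_parts
            stripe = fmap[:, i * step:(i + 1) * step, :]
            # Average-pool the stripe to one C-dimensional vector (assumption).
            stripe_vectors.append(stripe.mean(axis=(1, 2)))
        # Link the i-th stripe of every scale into one vector f^i.
        parts.append(np.concatenate(stripe_vectors))
    return np.stack(parts)

rng = np.random.default_rng(0)
maps = [rng.standard_normal((8, 16, 4)) for _ in range(3)]  # three aligned scales
f_parts = partition_and_link(maps)  # shape (4, 24): one row per part
f_L = f_parts.reshape(-1)           # connect the f^i into f_L, shape (96,)
```

Each row of `f_parts` plays the role of one $f^{i}$, and flattening the rows corresponds to connecting them into f_L.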

Multi-patch feature aggregation learning. To obtain global pedestrian features with richer detail, we use OSNet with an attention mechanism as our baseline and construct a multi-patch feature fusion module (MPF). As shown in Fig. 2, the feature extraction process of MPF starts at the bottom level, level-2. The pedestrian image is sliced into 4 non-overlapping patches $P_{2, j}$, j = 1, 2, 3, 4, which are sent into the baseline $F_{2}$ to generate the following convolutional feature representation:

$C_{2, j} = F_{2} (P_{2, j}), j \in \{1, \dots, 4\}$ (1)

Then, we concatenate spatially adjacent features to obtain a new feature representation $C_{2, j}^{*}$, which has the same size as the convolutional feature representation at level-1:

$C_{2, j}^{*} = C_{2, 2 j - 1} \oplus C_{2, 2 j}, j \in \{1, 2\}$ (2)

where ⊕ denotes the concatenation operator. Next, we move one level up to level-1. The pedestrian image is sliced into 2 non-overlapping patches $P_{1, j}$, j = 1, 2, which are sent into the baseline $F_{1}$. Once the output of $F_{1}$ is generated, we add it to $C_{2, j}^{*}$:

$C_{1, j} = F_{1} (P_{1, j}) + C_{2, j}^{*}, j \in \{1, 2\}$ (3)

Then, we concatenate spatially adjacent features to obtain a new feature representation $C_{1, j}^{*}$ (depicted as $f_{G}^{'}$ in Fig. 2):

$C_{1, j}^{*} = C_{1, 2 j - 1} \oplus C_{1, 2 j}, j = 1, C_{1, j}^{*} = f_{G}^{'}$ (4)

Finally, we add $C_{1, j}^{*}$ to the global feature f_G extracted from OSFA. The final output $f_{G}^{''}$ is given by:

$f_{G}^{''} = f_{G} + f_{G}^{'}$ (5)

Our loss function is based on the overall global feature $f_{G}^{''}$ and the local feature f_L. MPF and the modified OSFA share the network weights.
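The data flow of Eqs. (1) to (5) can be traced with a small NumPy sketch. Several parts are assumptions made only for illustration: `toy_backbone` is a hypothetical stand-in for the shared OSNet baseline, patches are taken as horizontal stripes, and global max pooling is applied once at the top level to turn $C_{1, j}^{*}$ into the vector $f_{G}^{'}$.

```python
import numpy as np

def toy_backbone(patch):
    # Hypothetical stand-in for the shared OSNet baseline: any map that
    # keeps the spatial layout is enough for this shape-level sketch.
    return np.tanh(patch)

def split_rows(image, n):
    # Slice a (C, H, W) array into n non-overlapping horizontal patches.
    return np.split(image, n, axis=1)

def mpf_forward(image, f_g):
    # Level 2, Eq. (1): four patches through the baseline.
    c2 = [toy_backbone(p) for p in split_rows(image, 4)]
    # Eq. (2): concatenate spatially adjacent level-2 features.
    c2_star = [np.concatenate([c2[0], c2[1]], axis=1),
               np.concatenate([c2[2], c2[3]], axis=1)]
    # Level 1, Eq. (3): two patches through the baseline plus the level-2 output.
    c1 = [toy_backbone(p) + s for p, s in zip(split_rows(image, 2), c2_star)]
    # Eq. (4): concatenate, then global max pooling gives f_G' (assumed placement).
    c1_star = np.concatenate(c1, axis=1)
    f_g_prime = c1_star.max(axis=(1, 2))
    # Eq. (5): add the global feature from the modified OSFA branch.
    return f_g + f_g_prime

rng = np.random.default_rng(0)
image = rng.standard_normal((3, 256, 128))  # toy (C, H, W) input
f_g = np.zeros(3)                           # placeholder for the OSFA feature f_G
f_g_fused = mpf_forward(image, f_g)         # plays the role of f_G''
```

Because the same `toy_backbone` is used at both levels, the sketch also mirrors the weight sharing mentioned in the text.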

3.3 Loss function

Like most person re-identification work, we use metric loss and classification loss as constraints. To better represent the difference between inter-class and intra-class variation, we employ multiple loss functions to strengthen the feedback to the network. The total loss is the weighted sum of the following three losses:

$L = L_{cross - entropy} + \alpha_{1} L_{triplet} + \alpha_{2} L_{C}$ (6)

Triplet loss is used to maximize the distance between the anchor and the negative sample and to minimize the distance between the anchor and the positive sample. In this paper, we employ Batch Hard Triplet Loss: we randomly sample P pedestrian identities and S images from each identity, so there are P × S images in a batch. The triplet loss is computed by:

$L_{triplet} = \overbrace{\sum_{i = 1}^{P} \sum_{a = 1}^{S}}^{\text{all anchors}} \Big[ \overbrace{\max_{p = 1 \dots S} D (f_{a}^{i}, f_{p}^{i})}^{\text{hardest positive}} - \overbrace{\min_{\substack{j = 1 \dots P \\ n = 1 \dots S \\ j \neq i}} D (f_{a}^{i}, f_{n}^{j})}^{\text{hardest negative}} \Big]$ (7)

$f_{a}^{i}$ and $f_{p}^{i}$ denote the features of the anchor and the positive sample of the i-th identity, and $f_{n}^{j}$ denotes the feature of a negative sample from the j-th identity (j ≠ i). D (·) is the non-squared Euclidean distance.
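A minimal NumPy sketch of the batch-hard mining in Eq. (7) follows. The function name `batch_hard_triplet` and the toy features are hypothetical. The optional `margin` argument adds the hinge that is common in batch-hard implementations; it is an assumption and is not part of Eq. (7) as written.

```python
import numpy as np

def batch_hard_triplet(features, labels, margin=None):
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    # Pairwise non-squared Euclidean distances D(f_a, f_b).
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same = labels[:, None] == labels[None, :]
    # Hardest positive: the farthest sample with the same identity.
    hardest_pos = np.where(same, dist, -np.inf).max(axis=1)
    # Hardest negative: the closest sample with a different identity.
    hardest_neg = np.where(same, np.inf, dist).min(axis=1)
    per_anchor = hardest_pos - hardest_neg
    if margin is not None:
        # Optional hinge with margin (assumption, not in Eq. (7)).
        per_anchor = np.maximum(per_anchor + margin, 0.0)
    return float(per_anchor.sum())

# Toy batch with P = 2 identities and S = 2 images each.
feats = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [4.0, 3.0]]
ids = [0, 0, 1, 1]
loss_plain = batch_hard_triplet(feats, ids)
loss_hinge = batch_hard_triplet(feats, ids, margin=0.3)
```

In the toy batch only the third anchor has a hardest positive as far away as its hardest negative, so it is the only anchor that contributes once the hinge is applied.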

Meanwhile, to enhance inter-class discrimination, we employ cross-entropy loss with label-smoothing regularization:

$L_{cross - entropy} = - \frac{1}{P \times S} \sum_{i = 1}^{P} \sum_{a = 1}^{S} p_{i, a} \log \left( (1 - \varepsilon) q_{i, a} + \frac{\varepsilon}{N} \right)$ (8)

$p_{i, a}$ and $q_{i, a}$ are the ground-truth identity distribution and the predicted identity probability, respectively. $\frac{1}{N}$ represents a uniform distribution, and $1 - \varepsilon$ and $\varepsilon$ are the weights of the predicted and the uniform distribution, respectively.
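Eq. (8) can be evaluated with a few lines of NumPy. The sketch assumes that N is the number of identities, that $q_{i, a}$ is the softmax probability of the ground-truth identity, and that ε = 0.1; the text does not state the value of ε, and the function name `smoothed_cross_entropy` is hypothetical.

```python
import numpy as np

def smoothed_cross_entropy(logits, targets, eps=0.1):
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets)
    n_classes = logits.shape[1]  # N in Eq. (8), assumed to be the identity count
    # Softmax with the usual max-shift for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    # q: predicted probability of the ground-truth identity of each image.
    q = probs[np.arange(len(targets)), targets]
    # Mix q with the uniform distribution 1/N, then average over the batch.
    return float(-np.log((1.0 - eps) * q + eps / n_classes).mean())

uniform_loss = smoothed_cross_entropy(np.zeros((2, 4)), [0, 3])
plain_loss = smoothed_cross_entropy([[2.0, 0.0], [0.0, 2.0]], [0, 1], eps=0.0)
```

With ε = 0 the expression reduces to the ordinary cross-entropy, which is a quick sanity check on the mixing term.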

In addition, we add the center loss to the above losses:

$L_{C} = \frac{1}{2} \sum_{i = 1}^{S} {∥ x_{i} - c_{y_{i}} ∥}_{2}^{2}$ (9)

For S images in a batch, $x_{i}$ is the deep feature of the i-th image, which belongs to class $y_{i}$, and $c_{y_{i}}$ is the feature center of class $y_{i}$. In some instances, the intra-class variation of the same person is likely to be greater than the inter-class variation. Keeping each class compact therefore leads to more robust decisions for samples with large intra-class changes.
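The center loss of Eq. (9) and the weighted sum of Eq. (6) can be sketched as follows. The class centers are passed in as a plain array here, whereas in training they would be learned or updated; that simplification and the function names are assumptions. The default weights are the α₁ and α₂ values reported in Section 4.1.

```python
import numpy as np

def center_loss(features, labels, centers):
    features = np.asarray(features, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Distance of every feature x_i to the center c_{y_i} of its own class.
    diff = features - centers[np.asarray(labels)]
    return 0.5 * float((diff ** 2).sum())

def total_loss(l_ce, l_triplet, l_center, alpha1=1.0, alpha2=0.0007):
    # Weighted sum of Eq. (6).
    return l_ce + alpha1 * l_triplet + alpha2 * l_center

l_c = center_loss([[1.0, 0.0], [0.0, 2.0]], [0, 1], [[0.0, 0.0], [0.0, 0.0]])
l_total = total_loss(1.0, 2.0, l_c)
```

The small α₂ keeps the center term from dominating, since it sums squared distances over the whole batch.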

4 Experiments

4.1 Implementation details

Our model is based on the pre-trained OSFA and is built on the PyTorch framework. Training images are resized to 256 × 128, and each batch contains 64 randomly selected images, which helps to avoid overfitting. During testing, the test images are also resized to 256 × 128. The model is trained for 200 epochs. α₁, α₂ and the learning rate are set to 1, 0.0007 and 3.5e-5, respectively. In SAM, the number of horizontal parts is set to 4. All experiments are run on an 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz CPU and an NVIDIA GeForce RTX 3060 GPU.

4.2 Datasets and settings

In this paper, we use three popular person Re-ID datasets to evaluate the performance of our proposed MSPF model. The three datasets statistics are provided in Table 1.

Table 1
Summary of three popular person Re-ID datasets

Datasets	Cameras	Training IDs	Training images	Testing IDs	Testing images
Market-1501	6	751	12936	750	19732
CUHK03	10	767	7368	700	6728
DukeMTMC	8	702	16522	702	19889

Market-1501 [39] contains 32,668 images of 1501 pedestrians captured by 6 different cameras. The training set includes 751 identities with 12,936 images, and the testing set includes 750 identities with 19,732 images. It has two query modes: single-query and multiple-query. Single-query uses one query image at a time, whereas multiple-query takes the average pooling and max pooling of multiple images as query features.
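The multiple-query mode can be illustrated with a short NumPy sketch that pools the feature vectors of several images of the same person into one query feature. The function name `pool_query_features` and the toy vectors are hypothetical.

```python
import numpy as np

def pool_query_features(features, mode="avg"):
    features = np.asarray(features, dtype=float)  # shape (num_images, dim)
    if mode == "avg":
        return features.mean(axis=0)  # average pooling over the images
    if mode == "max":
        return features.max(axis=0)   # element-wise max pooling over the images
    raise ValueError(f"unknown mode: {mode}")

query_feats = [[1.0, 4.0], [3.0, 2.0]]
avg_query = pool_query_features(query_feats, "avg")
max_query = pool_query_features(query_feats, "max")
```

Either pooled vector is then compared with the gallery features in the same way as a single-query feature.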

CUHK03 [40] contains images of 1467 pedestrians captured by 10 cameras. Each identity is captured by two non-overlapping cameras. Unlike the other datasets, CUHK03 provides two kinds of bounding boxes: DPM-detected and manually labeled. The former includes 7365 images for training, 5332 images for testing, and 1400 query images, and the latter includes 7368 images for training, 5328 images for testing, and 1400 query images.

DukeMTMC [41] includes 36,411 images of 1404 pedestrians captured by 8 different cameras. It consists of two parts: training set and testing set. The training set includes 702 people with 16,522 images. The remaining 702 people form the testing set which includes 2228 images in query set and 17,661 images in gallery set.

4.3 Comparison with State-of-the-Arts

We compare the performance of our model MSPF with that of other recent methods. Our model achieves state-of-the-art results on Market-1501, CUHK03, and DukeMTMC both in terms of Rank-1 and mAP. The comparison results on Market-1501 are shown in Table 2. We can see that our model is 1.2% and 0.9% ahead of OSFA on Rank-1 and mAP, respectively, and has better performance than other methods on Rank-1 and mAP.
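For reference, the two metrics reported in the comparisons, Rank-1 and mAP, can be computed from a query-gallery distance matrix as in the sketch below. It is a simplified illustration: standard re-ID protocols also discard gallery images taken by the same camera as the query, which is omitted here, and the function name `rank1_and_map` and the toy data are hypothetical.

```python
import numpy as np

def rank1_and_map(dist, query_ids, gallery_ids):
    dist = np.asarray(dist, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    rank1_hits, average_precisions = [], []
    for q in range(dist.shape[0]):
        order = np.argsort(dist[q])  # closest gallery image first
        matches = gallery_ids[order] == query_ids[q]
        if not matches.any():
            continue  # query without a true match is skipped
        rank1_hits.append(float(matches[0]))
        hit_positions = np.flatnonzero(matches)
        # Precision at each true-match position, averaged over the matches.
        precisions = np.arange(1, len(hit_positions) + 1) / (hit_positions + 1)
        average_precisions.append(precisions.mean())
    return float(np.mean(rank1_hits)), float(np.mean(average_precisions))

dist = [[0.1, 0.5, 0.3],
        [0.2, 0.4, 0.9]]
rank1, mean_ap = rank1_and_map(dist, query_ids=[1, 2], gallery_ids=[1, 2, 1])
```

In the toy example the first query retrieves both of its true matches first, while the second query finds its only match at rank 2, so Rank-1 is 0.5 and mAP is 0.75.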

Table 2
Performance comparison on Market-1501 dataset


Methods	Rank-1(%)	mAP(%)
DPFL [29]	92.3	82.7
OSFA [30]	95.5	88.5
SMGN [10]	94.1	79.2
Self-supervised person [38]	94.7	86.7
OSNet [12]	93.6	81.0
ABD-Net [37]	94.8	86.0
LG-OSNet [24]	94.8	86.7
ISP [15]	95.3	88.6
Ours(MSPF)	96.7	89.4

Since OSFA [30] reports unsatisfactory performance on the CUHK03_detected dataset, we run the experiment only on the CUHK03_labeled dataset. The comparison results on CUHK03_labeled are shown in Table 3. Our Rank-1 and mAP are 2.8% and 1.2% higher than those of OSFA, respectively.

Table 3

Performance comparison on CUHK03_labeled dataset

Methods	Rank-1(%)	mAP(%)
OSFA [30]	80.7	77.1
SMGN [10]	73.1	70.3
Self-supervised person [38]	72.2	67.8
LG-OSNet [24]	72.3	67.8
ISP [15]	76.5	74.1
Ours(MSPF)	83.5	78.3

The comparison results on the DukeMTMC dataset are shown in Table 4. Our model outperforms OSFA on Rank-1/mAP by 0.5%/0.6% and performs better than the other methods on both Rank-1 and mAP. These results show that our model performs well on the pedestrian re-identification task.

Table 4

Performance comparison on DukeMTMC dataset

Methods	Rank-1(%)	mAP(%)
DPFL [29]	79.2	60.6
OSFA [30]	90.2	79.7
SMGN [10]	87.1	76.0
Self-supervised person [38]	89.0	78.2
OSNet [12]	84.7	68.6
ABD-Net [37]	89.0	78.6
LG-OSNet [24]	88.7	76.6
ISP [15]	89.6	80.0
Ours(MSPF)	90.7	80.3

4.4 Effectiveness study on multi-patch

As shown in Fig. 2, MPF includes two levels. To determine the effectiveness of multi-patch learning, we conduct experiments on Market-1501 and DukeMTMC and compare four variations of MPF: MPF with 2-4-8 patches, the regular MPF with 2-4 patches, MPF with only 2 patches, and the model without patches. The results are shown in Table 5. MPF with 2-4 patches achieves the best result on both datasets, which indicates that the proposed MPF helps to obtain global pedestrian features with richer detail.

Table 5
Effectiveness of multi-patch on mAP and Rank-1

Network	Market-1501 Rank-1(%)	Market-1501 mAP(%)	DukeMTMC Rank-1(%)	DukeMTMC mAP(%)
2-4-8	96.5	89.3	90.6	80.0
2-4	96.7	89.4	90.7	80.3
2	96.0	89.0	90.2	79.9
0	95.9	88.7	89.8	79.7

4.5 Effectiveness study on attention module

In the modified OSFA framework, we insert an attention module composed of a Channel Attention Module (CAM) and a Spatial Attention Module into the baseline to extract global features. In this subsection and in Table 6, SAM denotes the Spatial Attention Module, not the Smooth Aggregation Module. To determine the effectiveness of the attention module, we conduct experiments on the Market-1501 and DukeMTMC datasets and compare five variations of MSPF (all with the same MPF module with 2-4 patches): MSPF without the attention module, MSPF with only SAM, MSPF with only CAM, MSPF with CAM behind SAM, and the regular MSPF. The mAP and Rank-1 results are shown in Table 6. After adding SAM or CAM to the network, Rank-1 and mAP on the Market-1501 dataset improve by approximately 1% and 3%, respectively. When SAM and CAM are used together, the network learns better from the samples in terms of both pixels and channels. Therefore, we confirm that the attention module is useful for obtaining global features with richer detail.

Table 6
Effectiveness of attention module on mAP and Rank-1(× means not using the attention, ✓ means using the attention)

Network	CAM	SAM	Market-1501 Rank-1(%)	Market-1501 mAP(%)	DukeMTMC Rank-1(%)	DukeMTMC mAP(%)
MSPF	×	×	94.2	85.3	84.5	67.4
	×	✓	95.4	88.5	88.1	69.2
	✓	×	95.8	88.7	89.2	70.1
	behind	front	96.2	89.1	90.1	79.8
	✓	✓	96.7	89.4	90.7	80.3

4.6 Experiments on self-constructed dataset

To confirm the utility of our model, we also test it on a self-constructed dataset. The dataset includes 300 pictures of 30 pedestrians captured by the same camera, with about 10 images per person. All images are hand-cropped to 256 × 128. The testing set includes 30 images in the query set (one query image per pedestrian) and 270 images in the gallery set. Three example query images are shown in Fig. 3. Each sub-figure contains, from left to right, a query image, two true matches, and three false matches (distractors). In all three sets of images, the false matches are similar to the query image because of illumination and image resolution. Since our model is built on OSNet [12] and OSFA [30], we compare its performance with that of OSNet and OSFA on this dataset. The results are shown in Table 7. Both our model and OSFA reach a Rank-1 of 100%, and the mAP of our model is 3.5% and 0.2% higher than that of OSNet and OSFA, respectively. These results suggest that our model is also useful on real-world data.

Fig. 3

Three example query images. Each sub-figure contains, from left to right, a query image, two true matches, and three false matches (distractor).

Table 7

Performance comparison on self-constructed dataset

Methods	Rank-1(%)	mAP(%)
OSNet [12]	96.7	94.2
OSFA [30]	100	97.5
Ours(MSPF)	100	97.7

5 Conclusion

In this paper, we propose a novel Multi-Scale and Multi-Patch Feature Fusion Network (MSPF) for pedestrian re-identification. To enhance the feature extraction capability of the network, we first employ a modified OSFA to extract global and local features. The Smooth Aggregation Module (SAM) of OSFA extracts rich detailed information from the shallow layers of the network to obtain local features. We then construct a multi-patch feature fusion module (MPF) to extract richer global features of pedestrians. Under the constraint of multiple losses, our model learns robust feature representations. Finally, we compare the performance of our model with that of other methods on Market-1501, CUHK03_labeled, and DukeMTMC. The experimental results show that our model outperforms the compared state-of-the-art methods in person re-identification. In addition, experiments on a self-constructed dataset support the utility of our model on real-world data.

Acknowledgements

This work is partially supported by the School-level Scientific Research Fund Projects at Minnan University of Science and Technology (Grant No. 23KJX017).

References

Zhao

, OuYang

, Wang

Unsupervised salience learning for person re-identification[C/OL], IEEE Conferenceon Computer Vision and Pattern Recognition, Portland, OR, USA, 2013.

Liao

, Hu

, Zhu

, Li

S.Z.

Person Re-identification by Local Maximal Occurrence Representation and Metric Learning[J], IEEE Conference on Computer Vision and Pattern Recognition, Cornell University - arXiv, 2015.

Matsukawa

, Okabe

, Suzuki

, Sato

Hierarchical gaussian descriptor for person re-identification[C], IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.

Kostinger

, Hirzer

, Wohlhart

, Roth

P.M.

, Bischof

Large scale metric learning from equivalence constraints[C/OL], IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, 2012.

Fei

, Gou

, Camps

, Sznaier

Person Re-Identification Using Kernel-Based Metric Learning Methods[M/OL], Computer Vision –ECCV 2014, Lecture Notes in Computer Science (2014), 1–16.

Wang

, Zuo

, Lin

, Zhang

Joint Learning of Single-Image and Cross-Image Representations for Person Re-identification[C/OL], IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.

Xiao

, Li

, OuYang

, Xiao

Learning Deep Feature Representations with Domain Guided Dropout for Person Re-identification[C/OL], 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 2016.

Chang

, Hospedales

T.M.

, Xiang

Multi-level Factorisation Net for Person Re-identification[C/OL], 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, 2018.

, Guo

, Shyh

W.T.

Learning Large Margin Multiple Granularity Features with an Improved Siamese Network forPerson Re-Identification[J/OL], Symmetry (2020), 92.

10.

Xie

, Wu

, zhang

, Li

Learning Diverse Features with Part-Level Resolution for Person Re-Identification[J], Cornell University - arXiv, 2020.

11.

Zheng

, Zheng

, Yang

A Discriminatively Learned CNN Embedding for Person Re-identification [J/OL],ACMTransactions on Multimedia Computing, Communications, and Applications (2018), 1–20.

12.

Zhou

, Yang

, Cavallaro

, Tao

Omniscale feature learning for person re-identification[C/OL], 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019.

13.

, Shen

, Hengel

PersonNet: Person Reidentification with Deep Convolutional Neural Networks[J], Cornell University - arXiv, 2016.

14. Zhu, Guo, Liu, Tang, Wang, Identity-Guided Human Semantic Parsing for Person Re-Identification[M/OL], Computer Vision – ECCV, Lecture Notes in Computer Science (2020), 346–363.

15. Chen, Xie, He, An Empirical Study of Training Self-Supervised Vision Transformers[C/OL], 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021.

16. Caron, Touvron, Misra, Jegou, Mairal, Emerging Properties in Self-Supervised Vision Transformers[C/OL], 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021.

17. Wang, Adversarial Multi-scale Feature Learning for Person Re-identification[J], Cornell University - arXiv, 2020.

18. Rao, Hu, Cheng, Hu, SM-SGE: A Self-Supervised Multi-Scale Skeleton Graph Encoding Framework for Person Re-Identification[C/OL], Proceedings of the 29th ACM International Conference on Multimedia, 2021.

19. Huang, Piao, Zhang, Multi-scale feature combination for person re-identification[J], IET Image Processing, 2022.

20. Sah, Direkoglu, Lightweight Deep Convolutional Neural Networks for Vehicle Re-Identification using Diffusion-based Image Masking[C], International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), 2021.

21. Aksu, Direkoglu, Lightweight Convolutional Neural Networks for Person Re-Identification[C/OL], 2021 3rd International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA), Ankara, Turkey, 2021.

22. Ashish, Dimitrii, Konstantin, Lightweight Attribute Localizing Models for Pedestrian Attribute Recognition[C], Computer Vision and Pattern Recognition (cs.CV), 2023.

23. Cheng, Wang, Wei, Learning discriminative and generalizable features with multi-branch for person reidentification[J/OL], Journal of Intelligent & Fuzzy Systems 42(6) (2022), 5987–6001.

24. Zhou, Yang, Cavallaro, Tao, Learning Generalisable Omni-Scale Representations for Person Re-Identification[J/OL], IEEE Transactions on Pattern Analysis and Machine Intelligence (2021), 1–1.

25. Sovrasov, Sidnev, Building Computationally Efficient and Well-Generalizing Person Re-Identification Models with Metric Learning[C/OL], 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 2021. DOI: 10.1109/icpr48806.2021.9412598.

26. Wang, Li, Wu, Mpanet: Multi-Patch Attention for Infrared Small Target Object Detection[C], IGARSS 2022 – 2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 2022.

27. Zhang, Zhang, Dai, Li, Piotr, Event-guided Multi-patch Network with Self-supervision for Nonuniform Motion Deblurring[J/OL], International Journal of Computer Vision (2023), 453–470.

28. Qian, Fu, Jiang Y.G., Tao, Xue, Multi-scale Deep Learning Architectures for Person Reidentification[J], Cornell University - arXiv, 2017.

29. Chen, Zhu, Gong, Person Re-identification by Deep Learning Multi-scale Representations[C/OL], IEEE International Conference on Computer Vision Workshops (ICCVW), 2017.

30. [...], Liu, Zhu, Zhang, Person re-identification based on multi-scale feature learning[J/OL], Knowledge-Based Systems, 2021.

31. Zhang, Dai, Li, Deep Stacked Hierarchical Multipatch Network for Image Deblurring[C/OL], IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019.

32. Wang, Yuan, Chen, Li, Zhou, Learning Discriminative Features with Multiple Granularities for Person Re-Identification[C/OL], Proceedings of the 26th ACM International Conference on Multimedia, 2018.

33. Xue, Zhu, Wang, Yang, Person re-identification by multi-division attention[J], Opto-Electronic Engineering, 2020.

34. Ploco, Rodriguez A.M., Geradts, Spatial-Temporal Omni-Scale Feature Learning for Person Re-Identification[C], International Workshop on Biometrics and Forensics (IWBF), Porto, Portugal, 2020.

35. Herzog, Ji, Teepe, Hormann, Gilg, Lightweight Multi-Branch Network for Person Re-Identification[C/OL], IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, 2021.

36. Woo, Park, Lee J.Y., Inso, CBAM: Convolutional Block Attention Module[M/OL], Computer Vision – ECCV, Lecture Notes in Computer Science, 2018.

37. Chen, Ding, Xie, Yuan, Chen, Yang, Ren, Wang, ABD-Net: Attentive but Diverse Person Re-Identification[C/OL], IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019.

38. Chen, Wang, Tang, Zhu, A feature disentangling approach for person re-identification via self-supervised data augmentation[J/OL], Applied Soft Computing (2021), 106939.

39. Zheng, Shen, Tian, Wang, Tian, Scalable Person Re-identification: A Benchmark[C/OL], IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015.

40. [...], Zhao, Xiao, Wang, DeepReID: Deep filter pairing neural network for person re-identification[C/OL], in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Columbus, OH, USA, 2014, pp. 152–159.

41. Ristani, Solera, Zou R.S., Cucchiara, Tomasi, Performance measures and a data set for multi-target, multi-camera tracking[C], in Proceedings of European Conference on Computer Vision, Vol. 9914, ECCV, 2016, pp. 17–35.

Table 1
Summary of three popular person Re-ID datasets

| Dataset | Cameras | Training IDs | Training images | Testing IDs | Testing images |
|---|---|---|---|---|---|
| Market-1501 | 6 | 751 | 12936 | 750 | 19732 |
| CUHK03 | 10 | 767 | 7368 | 700 | 6728 |
| DukeMTMC | 8 | 702 | 16522 | 702 | 19889 |

Table 2
Performance comparison on Market-1501 dataset

| Method | Rank-1 (%) | mAP (%) |
|---|---|---|
| DPFL [29] | 92.3 | 82.7 |
| OSFA [30] | 95.5 | 88.5 |
| SMGN [10] | 94.1 | 79.2 |
| Self-supervised person [38] | 94.7 | 86.7 |
| OSNet [12] | 93.6 | 81.0 |
| ABD-Net [37] | 94.8 | 86.0 |
| LG-OSNet [24] | 94.8 | 86.7 |
| ISP [15] | 95.3 | 88.6 |
| Ours (MSPF) | 96.7 | 89.4 |

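Table 5
Effectiveness of multi-patch on mAP and Rank-1

| Network | Market-1501 Rank-1 (%) | Market-1501 mAP (%) | DukeMTMC Rank-1 (%) | DukeMTMC mAP (%) |
|---|---|---|---|---|
| 2-4-8 | 96.5 | 89.3 | 90.6 | 80.0 |
| 2-4 | 96.7 | 89.4 | 90.7 | 80.3 |
| 2 | 96.0 | 89.0 | 90.2 | 79.9 |
| 0 | 95.9 | 88.7 | 89.8 | 79.7 |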