Learning discriminative and generalizable features with multi-branch for person re-identification

Abstract

Finer-grained local features complement global pedestrian features, and combining the two has become an essential way to improve discriminative performance in person re-identification (PReID). Existing part-based methods mostly extract representative semantic parts according to human visual habits or prior knowledge and focus on spatial partition strategies, but they ignore the significant influence of channel information on PReID. We therefore propose an end-to-end multi-branch network architecture (MCSN) that jointly learns multi-level global fusion features, channel features and spatial features to obtain more diverse and discriminative pedestrian features. Notably, the effect of multi-level fusion features on model performance is taken into account when extracting global features. In addition, to enhance training stability and generalization ability, BNNeck and a joint loss function strategy are applied to all vector representation branches. Extensive comparative evaluations on three mainstream image-based benchmarks, Market-1501, DukeMTMC-ReID and MSMT17, validate the advantages of the proposed model, which outperforms previous state-of-the-art methods in ReID tasks.

Keywords

Person re-identification; multi-branch deep network; multi-level global fusion feature; spatial-channel partition

1 Introduction

PReID is an important vision task with wide real-world applications such as intelligent video surveillance, smart retailing, etc., aiming at matching person images captured from non-overlapping cameras. However, some problems remain to be solved owing to the challenges of ReID, including changes in camera viewpoints, illumination changes, human pose variation and occlusion. With the rapid development of deep learning based techniques, supervised ReID methods, such as [1, 2], have gained remarkable advances.

Traditional approaches to pedestrian representation commonly extract discriminative features from the whole body in an image, aiming to capture the most salient appearance cues that distinguish the identities of different pedestrians. However, PReID methods that rely solely on global features are prone to errors under occlusion and misalignment. In addition, since global features carry no local spatial information, key local features are easily ignored, which limits further improvement of ReID accuracy.

To relieve this dilemma, complementing global features with local features to build better person representations has been confirmed by many previous works to be an effective way to improve ReID accuracy. Part-based models have therefore attracted a lot of attention in the PReID research community. Part-based methods for PReID can be divided into three main pathways according to how they locate parts: 1) methods that rely on external tools, such as human pose estimation models [1, 3]; 2) methods based on horizontal partitioning or slicing [4, 5]; 3) methods based on semantic attributes [6, 7]. However, these methods tend to focus only on specific parts with fixed semantics and do not cover all the discriminative information. Recently, multi-branch architectures have been proposed [2, 9], in which a shared network is followed by multiple sub-network branches, allowing the network to focus on different pedestrian features in individual branches, e.g., on distinct spatial parts or channels. They mostly learn global and spatial part features in individual branches, or combine part, channel, and global features through either pooling or attention. However, most multi-branch networks ignore the following points: 1) when extracting global features, only the impact of the final features on the PReID result is considered, and the intermediate-layer features are ignored; 2) spatial partition strategies for person images are the most common in PReID tasks, and channel partitions are rarely considered; 3) most ReID models are optimized by combining ID loss and triplet loss, but the targets of these two losses are inconsistent in the embedding space, so one loss may decrease while the other oscillates or even increases.

This motivates the work in this paper, where we propose a novel three-branch architecture for PReID. Specifically, we propose an end-to-end multi-branch network that jointly learns multi-level global fusion features, channel partition features and spatial partition features. It builds robust pedestrian representations for the ReID task by complementing global features with local features extracted through partitions. Fig. 1 illustrates the spatial and channel partitions.

Fig. 1

Example of a spatial and channel partition. H, W and C refer to height, width and channel in a deep feature map, respectively. In this example, a whole feature map is partitioned into two channel groups and four spatial parts.

In summary, our contributions are threefold:

1) Based on the ResNet-50 baseline, we propose an end-to-end three-branch network for PReID. Its global branch fuses multi-level global features to enhance the richness of global features, its channel branch employs an average-pooling layer to extract channel features, and its spatial branch employs a max-pooling layer to extract spatial features. The proposed architecture is shown to be effective at achieving feature diversity.

2) The BNNeck [10] structure is introduced to address the fluctuation of recognition results that arises because the targets of the two losses used in this paper (ID loss and weighted regularization triplet (WRT) loss) are inconsistent in the embedding space when they simultaneously optimize the same feature vector. The experimental results in Fig. 6, Fig. 7 and Fig. 8 show that BNNeck not only speeds up the convergence of MCSN but also improves its recognition performance by a large margin.

Fig. 6

Comparison of the loss function curves before and after BNNeck is used. Here, (a), (b) and (c) show the experiments conducted on the Market-1501, DukeMTMC-ReID and MSMT17 datasets, respectively.

Fig. 7

Comparison of the accuracy rate curves before and after BNNeck is used. Here, (a), (b) and (c) show the experiments conducted on the Market-1501, DukeMTMC-ReID and MSMT17 datasets, respectively.

Fig. 8

Ablation study of the influence of BNNeck on the Market-1501, DukeMTMC-ReID and MSMT17 datasets in terms of rank-1 and mAP. “MCSN (W/O BNNeck)” means that BNNeck is removed from our model.

3) Extensive experiments demonstrate the superiority of the proposed MCSN over a wide range of state-of-the-art ReID models on three large benchmarks, Market-1501 [11], DukeMTMC-ReID [12] and MSMT17 [13]. In rigorous ablation studies, we show how three branches improve model performance and why our network performs better than other approaches.

2 Related work

We review three lines of related work: 1) multi-branch models for PReID, 2) part-based models, and 3) methods based on multi-loss.

2.1 Multi-branch models for PReID

Combining global and part-based features is among the most effective feature representation learning strategies. Wang et al. [2] proposed a multi-task attentional network with curriculum sampling that includes attention loss, triplet loss and focal loss branches. Wang et al. [8] proposed a multi-branch deep network, which splits the last layer of ResNet into three branches and partitions the feature maps into one, two and three horizontal stripes. In HPM [9], Fu et al. split the entire body feature map into one, two, four and eight identical horizontal stripes, which takes both local and global features into account. In [14], a striped pyramidal block is introduced into a deep architecture at different depths. However, these methods consider only multiple spatial partitions and neglect the potentially important effects of channel groups.

2.2 Part-based models

Finding spatial partitions of person images is an important research direction. Generally, the input image is divided into several disjoint horizontal parts to obtain discriminative partition features for pedestrian matching. In previous studies, some methods based on hand-crafted features [15, 16] were used for part-based feature learning. However, these strategies are not robust on large datasets. Recently, part-based deep learning methods have been developed to extract more discriminative pedestrian features. Fan et al. [17] proposed a spatial-channel loss to ensure that each channel in the representation pays attention to a dedicated partitioned part of the body. In [18], Sun et al. split the entire body feature map into identical horizontal stripes to extract discriminative pedestrian features. Some studies have demonstrated the potential of partitioned channel groups as an effective dimension in various visual tasks. For example, MobileNet [19] adopts channel-wise convolutions where the number of groups equals the number of channels. In our proposed model, channels are partitioned into two groups and, simultaneously, the entire pedestrian body is partitioned spatially into four identical horizontal stripes for the PReID task.

2.3 Methods based on Multi-Loss

Loss functions are used as supervisory signals in feature learning. The following two types of losses are widely used during the training phase of deep ReID systems: 1) classification loss, e.g., softmax cross-entropy loss, which treats the training process of PReID as an image classification problem [20]; 2) ranking loss, e.g., WRT loss [21], which treats the ReID model training process as a retrieval ranking problem. The classification task and the ranking task are complementary to each other. Recently, classification loss and ranking loss have been used simultaneously to optimize the network [8, 22], which shows how important their combination is to PReID tasks. In this paper, we employ the softmax cross-entropy loss and the WRT loss for PReID tasks.

3 The proposed approach

In this section, the detailed introduction of network architecture is given first in Sec. 3.1, followed by the training and loss functions used in our proposed model in Sec. 3.2. Finally, the differences between our model and other related work are discussed in Sec. 3.3.

3.1 Network architecture

The general architecture of our proposed MCSN is shown in Fig. 2. With reference to the most advanced methods [8, 23], we designed an end-to-end neural network architecture based on a strong image feature extraction backbone, ResNet-50 [24]. Based on this backbone, three modifications are made: 1) the part after the res_conv2 layer is divided into three independent branches, the global (G) branch and the channel (P1) and spatial (P2) partition branches, which have an architecture similar to that of the original ResNet-50; 2) in the P1 and P2 branches, the down-sampling with a stride-2 convolution is replaced by a stride-1 convolution in the res_conv5_1 block; 3) in the G branch, a multi-level global feature fusion structure is adopted because the outputs of different layers differ semantically. Table 1 lists the settings of these branches. Let $X \in ℝ^{256 \times 128 \times 3}$ be an input image. Before separating into distinct branches, the image X is passed through a truncated ResNet-50 backbone, up until the third block of the first layer, i.e., res_conv2_2. After forwarding X through the initial layers, the network forms the three branches, which comprise the remaining layers of ResNet-50 up to the fourth layer. By this design, only the layers up to conv2_2 are shared by all the branches. The G branch outputs a tensor of size 2048×8×4, and the P1 and P2 branches each output a tensor of size 2048×16×8.

Fig. 2

The overall network architecture of MCSN. For the backbone network, the subsequent part after res_conv2 layer is divided into three independent branches, global (G) branch, channel (P1) and spatial (P2) partition branches, sharing the similar architecture with the original ResNet-50. Then, multiple spatial and channel partitions are conducted on the feature maps. After multi-type pooling, dimensions of global branch(dim=2048), channel branch(dim=2048, 1024*2) and spatial branch(dim=2048, 2048*4) features are unified by 1×1 convolution (reduction) and batch normalization (BN) to 256. Then, pedestrian identity predictions of input images are given by fully connected layers (FC). Note that the tensors obtained after the BN layer are concatenated as the final feature to test model performance during testing phases.

Table 1

Comparison of the settings of three branches in MCSN. Here, the size of input images is 256×128. “Branch” represents the abbreviation for three branches. “Part number” means the number of partitions on each branch. “Map Size” refers to the size of output feature maps after res_conv5 layer from each branch. “Dims” refers to the dimensionality for the output feature representations after BN layer. “Feature” means the symbols for the output feature representations

Branch	Part number	Map Size	Dims	Feature
G	1	(8,4)	256	$\hat{f}_{g}^{G}$
P1	2	(16,8)	256*2+256	$\hat{f}_{g}^{P_{1}}, \hat{f}_{p_{i} |_{i = 0}^{i = 1}}^{P_{1}}$
P2	4	(16,8)	256*4+256	$\hat{f}_{g}^{P_{2}}, \hat{f}_{p_{i} |_{i = 0}^{i = 3}}^{P_{2}}$

In the G branch, first, down-sampling with a stride-2 convolution layer is employed in the res_conv5_1 block, followed by a generalized-mean (GeM) pooling operation on the corresponding output feature map, which yields a 2048-dimensional tensor g₂. Second, adaptive average pooling is applied to the intermediate tensors of size 512×32×16 and 1024×16×8 to obtain the tensor g₁ of size 512×4×1 and the tensor g₃ of size 1024×2×1, each of which contains 2048 elements. Third, the 2048-dimensional multi-level global fusion feature vector $y_{g}^{G}$ is obtained through feature stacking:

$y_{g}^{G} = g_{1} + g_{2} + g_{3} .$ (1)

Then, a 1×1 convolution layer with batch normalization and ReLU is used to reduce the 2048-dimensional feature $y_{g}^{G}$ to the 256-dimensional feature $f_{g}^{G}$.
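The pooling and fusion steps of the G branch can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names `gem_pool` and `fuse_levels` are hypothetical, the GeM exponent `p = 3.0` is an assumed value that the text does not specify, and the toy vectors stand in for the real 2048-dimensional tensors.

```python
def gem_pool(values, p=3.0, eps=1e-6):
    """Generalized-mean pooling over the activations of one channel.

    p = 1 gives average pooling; larger p moves the result toward max pooling.
    The default p = 3.0 is an assumed value, not taken from the paper.
    """
    clamped = [max(v, eps) for v in values]
    return (sum(v ** p for v in clamped) / len(clamped)) ** (1.0 / p)


def fuse_levels(g1, g2, g3):
    """Element-wise sum of three equally sized vectors, as in Eq. (1)."""
    if not (len(g1) == len(g2) == len(g3)):
        raise ValueError("all levels must have the same length")
    return [a + b + c for a, b, c in zip(g1, g2, g3)]


# The pooled intermediate tensors hold as many elements as the 2048-d vector g2.
assert 512 * 4 * 1 == 2048 and 1024 * 2 * 1 == 2048

channel = [0.5, 1.0, 2.0, 4.0]
avg_like = gem_pool(channel, p=1.0)   # plain mean of the channel
sharper = gem_pool(channel, p=3.0)    # lies between the mean and the max
fused = fuse_levels([1.0, 2.0], [0.5, 0.5], [0.25, 0.25])
```

With `p = 1` the pooled value equals the arithmetic mean, which makes the relation between GeM pooling and the average pooling used for g₁ and g₃ easy to check.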

In the P1 branch and P2 branch, the global feature maps $y_{g}^{P_{1}}$ and $y_{g}^{P_{2}}$ are extracted from each branch in the same way. Different from G branch, the down-sampling with a stride-2 convolution layer is replaced by a stride-1 convolution layer in the res_conv5_1 block to ensure that the partitioned areas have more distinctive pedestrian information. The output feature maps in P1 and P2 branches are split into 2 channel groups and 4 spatial parts, respectively.

In the P1 branch, the initial tensor of size 2048×16×8 is first reduced to a 2048-dimensional vector and then partitioned into two vectors of length 1024 each. Then, a 1×1 convolution layer with batch normalization and ReLU is employed to reduce the 1024-dimensional features to 256-dimensional vectors ${f_{p_{i} |_{i = 0}^{i = 1}}^{P_{1}}}$. Here, the parameters of the 1×1 convolutions are shared between the two channel parts.

In the P2 branch, inspired by PCB [23], the initial tensor of size 2048×16×8 is transformed into four representations. Specifically, max pooling is used to obtain a tensor of size 2048×4×1, which is split into four 2048-dimensional part-based representations ${y_{p_{i} |_{i = 0}^{i = 3}}^{P_{2}}}$ that correspond to four non-overlapping body regions. Then, a 1×1 convolution layer with batch normalization and ReLU is employed to reduce the 2048-dimensional features to 256-dimensional vectors ${f_{p_{i} |_{i = 0}^{i = 3}}^{P_{2}}}$.
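The two partition schemes described for the P1 and P2 branches can be sketched as follows. This is a toy illustration on a small nested-list feature map rather than the 2048×16×8 tensor; the helper names `channel_parts` and `spatial_parts` are hypothetical, and the 1×1 reduction layers are omitted.

```python
def channel_parts(feature_map, groups):
    """Average-pool each channel to a scalar, then split the channel axis."""
    pooled = [
        sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        for ch in feature_map
    ]
    size = len(pooled) // groups
    return [pooled[g * size:(g + 1) * size] for g in range(groups)]


def spatial_parts(feature_map, stripes):
    """Max-pool each horizontal stripe, keeping every channel per stripe."""
    height = len(feature_map[0])
    step = height // stripes
    parts = []
    for s in range(stripes):
        part = [
            max(max(row) for row in ch[s * step:(s + 1) * step])
            for ch in feature_map
        ]
        parts.append(part)
    return parts


# Toy map with C=4 channels, H=8 rows, W=2 columns (the paper uses 2048 x 16 x 8).
C, H, W = 4, 8, 2
fmap = [[[float(c * 100 + h * W + w) for w in range(W)] for h in range(H)]
        for c in range(C)]

p1 = channel_parts(fmap, groups=2)    # 2 vectors of length C / 2
p2 = spatial_parts(fmap, stripes=4)   # 4 vectors of length C
```

The channel split shortens each part vector (C/2 entries per group), whereas the spatial split keeps all C channels for every stripe, which mirrors the 1024-dimensional and 2048-dimensional part features in the text.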

Then, BNNeck is applied to the nine 256-dimensional tensors to improve model stability during training. It is worth noting that the bias of the fully connected layer is turned off. Finally, the 3 global feature tensors $\{f_{g}^{G}, f_{g}^{P_{1}}, f_{g}^{P_{2}}\}$ obtained before the batch normalization layer are optimized with the WRT loss, while the 9 tensors $\{F_{g}^{G}, F_{g}^{P_{1}}, F_{g}^{P_{2}}, F_{p_{i} |_{i = 0}^{i = 1}}^{P_{1}}, F_{p_{i} |_{i = 0}^{i = 3}}^{P_{2}}\}$ obtained after the fully connected layer are optimized with the softmax cross-entropy loss. In addition, the tensors obtained after the BN layer are concatenated as the final feature for evaluating the model in the testing phase. From the resulting embeddings we form two sets, given by

$S = \{F_{g}^{G}, F_{g}^{P_{1}}, F_{g}^{P_{2}}, F_{p_{0}}^{P_{1}}, F_{p_{1}}^{P_{1}}, F_{p_{0}}^{P_{2}}, F_{p_{1}}^{P_{2}}, F_{p_{2}}^{P_{2}}, F_{p_{3}}^{P_{2}}\}$ (2)

$T = \{f_{g}^{G}, f_{g}^{P_{1}}, f_{g}^{P_{2}}\}$ (3)

for training in identity and rank spaces, respectively, where set S represents the tensors after the fully connected layer and set T denotes the tensors after the 1×1 convolutional layer.
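The data flow through BNNeck described above (metric loss on the feature before BN, testing on the feature after BN, ID loss on the logits of a bias-free fully connected layer) can be sketched in plain Python. This is only an illustration under simplifying assumptions: the batch normalization here has no learned scale or shift and no running statistics, and the toy features and the weight matrix are hypothetical values.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize each feature dimension over the batch (no learned affine)."""
    n, dim = len(batch), len(batch[0])
    means = [sum(x[d] for x in batch) / n for d in range(dim)]
    variances = [sum((x[d] - means[d]) ** 2 for x in batch) / n for d in range(dim)]
    return [
        [(x[d] - means[d]) / math.sqrt(variances[d] + eps) for d in range(dim)]
        for x in batch
    ]


def linear_no_bias(batch, weight):
    """Fully connected layer without bias; weight has shape [classes][dim]."""
    return [[sum(w * v for w, v in zip(row, x)) for row in weight] for x in batch]


def bnneck(batch, weight):
    """Return (features for the metric loss, features for testing, ID logits)."""
    before_bn = batch                           # member of T, used by the WRT loss
    after_bn = batch_norm(before_bn)            # concatenated at test time
    logits = linear_no_bias(after_bn, weight)   # member of S, used by the ID loss
    return before_bn, after_bn, logits


feats = [[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]]    # 3 samples, 2-d toy features
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 identities, hypothetical weights
f_t, f_i, logits = bnneck(feats, weight)
```

Separating the two feature spaces in this way lets each loss act on its own tensor instead of pulling the same vector toward inconsistent targets.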

3.2 Training and loss functions

Different loss functions serve different purposes, and the representations they learn differ accordingly. To improve the discriminative ability of deep embedding learning, we train the network with a combination of softmax cross-entropy loss and WRT loss. The former is employed for classification and the latter for metric learning; both are widely used in deep ReID methods, and their combination can greatly improve the recognition performance of the network.

Softmax Cross-Entropy loss. PReID is usually regarded as a multi-class classification task, whose goal is to learn the most discriminative features of pedestrians by optimizing the classification model and then accurately predict the identity of each pedestrian with high probability. The softmax cross-entropy loss in a mini-batch can be described as:

$l_{ID} = \sum_{i = 1}^{N} - q_{i} \log (p_{i}), \quad q_{i} = \begin{cases} 0, & y \neq i \\ 1, & y = i \end{cases}$ (4)

where y represents the ground-truth label, p_i represents the predicted probability of class i (the softmax of the ID prediction logits), and q_i is the true probability that the current sample belongs to class i. The last layer of the network is a fully connected layer whose output size equals the number of pedestrian identities N. This loss function is called ID loss in this paper because the number of classes is determined by the number of pedestrian identities. Additionally, different from the traditional softmax cross-entropy loss function, the label smoothing (LS) [25] method is adopted in this paper, which can effectively avoid overfitting in classification tasks by reducing the weight of q_i for the true class when calculating the loss value. It changes the construction of q_i to:

$q_{i} = \begin{cases} 1 - ω + \frac{ω}{N}, & y = i \\ \frac{ω}{N}, & y \neq i . \end{cases}$ (5)

Therefore, the ID loss can be re-expressed as:

$l_{ID} = \sum_{i = 1}^{N} - q_{i} \log (p_{i}), \quad q_{i} = \begin{cases} 1 - ω + \frac{ω}{N}, & y = i \\ \frac{ω}{N}, & y \neq i \end{cases}$ (6)

where ω denotes an adjustment factor that encourages the model to be less confident on the training set. In this paper, ω=0.1. In addition, the ID loss is applied to the global tensors $\{F_{g}^{G}, F_{g}^{P_{1}}, F_{g}^{P_{2}}\}$ after the fully connected layer and to the local tensors $\{F_{p_{0}}^{P_{1}}, F_{p_{1}}^{P_{1}}, F_{p_{0}}^{P_{2}}, F_{p_{1}}^{P_{2}}, F_{p_{2}}^{P_{2}}, F_{p_{3}}^{P_{2}}\}$.
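As a worked illustration of Eqs. (5) and (6), the following sketch computes the label-smoothed ID loss for a single sample in pure Python. The function names and the example logits are hypothetical; only ω = 0.1 comes from the text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the logits of one sample."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def smoothed_targets(label, num_classes, omega=0.1):
    """Target distribution q_i of Eq. (5)."""
    off = omega / num_classes
    return [1.0 - omega + off if i == label else off for i in range(num_classes)]


def id_loss(logits, label, omega=0.1):
    """Label-smoothed cross-entropy of Eq. (6) for a single sample."""
    probs = softmax(logits)
    targets = smoothed_targets(label, len(logits), omega)
    return -sum(q * math.log(p) for q, p in zip(targets, probs))


q = smoothed_targets(label=1, num_classes=4)
hard = id_loss([0.2, 2.0, -1.0, 0.5], label=1, omega=0.0)     # plain cross-entropy
smooth = id_loss([0.2, 2.0, -1.0, 0.5], label=1, omega=0.1)   # smoothed variant
```

Setting ω = 0 recovers the one-hot targets of Eq. (4). When the true class already has the largest predicted probability, as in this example, the smoothed loss is larger than the plain one, which is how smoothing discourages over-confident predictions.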

Weighted Regularization Triplet loss. Triplet loss is often used for metric learning; its goal is to keep the feature embeddings of a positive pair as close as possible while pushing those of a negative pair as far apart as possible. All the global features $\{f_{g}^{G}, f_{g}^{P_{1}}, f_{g}^{P_{2}}\}$ after the 1×1 convolutional layer are trained with triplet loss to enhance metric learning performance. To avoid introducing an additional margin parameter, the WRT loss [21] is adopted, which is entirely parameter free. The WRT loss function is formulated as follows:

$l_{wrt} (i, j, k) = \log (1 + \exp (w_{i}^{p} d_{ij}^{p} - w_{i}^{n} d_{ik}^{n}))$ (7)

$w_{i}^{p} = \frac{\exp (d_{ij}^{p})}{\sum_{d^{p} \in ρ} \exp (d^{p})}, \quad w_{i}^{n} = \frac{\exp (- d_{ik}^{n})}{\sum_{d^{n} \in η} \exp (- d^{n})}$ (8)

where (i, j, k) represents the mined hard triplet in a mini-batch, η is the corresponding negative set, and ρ is the corresponding positive set. $d_{ij}^{p}$ and $d_{ik}^{n}$ are the feature distances of the positive pair and the negative pair, respectively, and $w_{i}^{p}$ and $w_{i}^{n}$ are the corresponding weights of the positive pair and the negative pair.
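Eqs. (7) and (8) can be evaluated for a single anchor with the short sketch below. It follows the formulas as written here; the mining rule (farthest positive, closest negative) is an assumption, since the text only states that a hard triplet is mined, and the distance values are made up for illustration.

```python
import math


def softmax_weights(scores):
    """Softmax over a list of scores, shifted for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def wrt_loss(pos_dists, neg_dists):
    """WRT loss of Eqs. (7)-(8) for one anchor.

    pos_dists: distances from the anchor to its positives (the set rho).
    neg_dists: distances from the anchor to its negatives (the set eta).
    The hard triplet is taken as the farthest positive and the closest
    negative, which is an assumed mining rule.
    """
    w_pos = softmax_weights(pos_dists)                 # larger distance, larger weight
    w_neg = softmax_weights([-d for d in neg_dists])   # smaller distance, larger weight
    j = max(range(len(pos_dists)), key=pos_dists.__getitem__)
    k = min(range(len(neg_dists)), key=neg_dists.__getitem__)
    gap = w_pos[j] * pos_dists[j] - w_neg[k] * neg_dists[k]
    return math.log1p(math.exp(gap))


easy = wrt_loss(pos_dists=[0.2, 0.3], neg_dists=[2.0, 2.5])   # well-separated anchor
hard = wrt_loss(pos_dists=[1.8, 2.0], neg_dists=[0.4, 0.5])   # poorly separated anchor
```

Because the soft-plus form log(1 + exp(·)) replaces the hinge of the standard triplet loss, no margin has to be chosen, and a poorly separated anchor still yields a larger loss than a well-separated one.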

Total loss. The overall objective loss at either the global branch or the local branch is the weighted sum of the ID loss and the WRT loss, namely,

$L_{total} = λ l_{ID} + l_{wrt} (i, j, k)$ (9)

where the parameter λ is used to balance the contributions of the two loss functions. Several values of λ are tested in the next section to find an optimal setting for all experiments.

To further boost performance, a warmup strategy [26] is applied to bootstrap the network instead of a traditional step learning rate schedule. Specifically, as shown in Fig. 3, the learning rate first grows linearly from 3.5 × 10^-5 to 3.5 × 10^-4 in the first 10 epochs. Then, the learning rate is decayed to 3.5 × 10^-5 and 3.5 × 10^-6 at the 40th and 70th epochs, respectively. The learning rate lr(t) at epoch t is computed as

$lr (t) = \begin{cases} 3.5 \times 10^{- 4} \times \frac{t}{10}, & \text{if } t ⩽ 10 \\ 3.5 \times 10^{- 4}, & \text{if } 10 < t ⩽ 40 \\ 3.5 \times 10^{- 5}, & \text{if } 40 < t ⩽ 70 \\ 3.5 \times 10^{- 6}, & \text{if } 70 < t ⩽ 120 . \end{cases}$ (10)

Fig. 3

Learning rate schedules with warmup strategy.
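The warmup schedule can be written as a small function. The sketch follows the prose description (linear growth from 3.5 × 10^-5 to 3.5 × 10^-4 over the first 10 epochs, then two tenfold decays) and assumes 1-based epochs with a warmup factor of t/10 applied to the peak rate; `learning_rate` is a hypothetical helper name.

```python
def learning_rate(epoch, base_lr=3.5e-4, warmup_epochs=10):
    """Warmup followed by step decay; epoch is 1-based and runs from 1 to 120."""
    if epoch <= warmup_epochs:
        # Linear warmup: 3.5e-5 at epoch 1, 3.5e-4 at epoch 10.
        return base_lr * epoch / warmup_epochs
    if epoch <= 40:
        return base_lr
    if epoch <= 70:
        return base_lr * 0.1
    return base_lr * 0.01


schedule = [learning_rate(t) for t in range(1, 121)]
```

Listing the schedule for all 120 epochs makes it easy to compare against the curve in Fig. 3.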

3.3 Discussion

In this section, the differences between MCSN and two other categories of CNN-based methods are discussed.

The first category is single-branch networks, such as PCB [23]. Given an input image, PCB outputs a convolutional descriptor consisting of several part-level features obtained by a uniform spatial partition strategy, which focuses on spatially discriminative parts of the feature maps while neglecting channel information. Compared with this spatial part-level feature extraction method, our proposed method combines global, spatial, and channel features in a unified architecture to better exploit both spatial and channel information. With this design, MCSN achieves results competitive with state-of-the-art methods and shows that channel partition can complement the feature representations of spatial partition.

The second category is multi-branch networks, such as MGN [8], which is a multi-branch deep network architecture consisting of one branch for global feature representations and two branches for local feature representations. The images are uniformly partitioned into several stripes to obtain local feature representations with multiple granularities. Like MGN, our proposed model uses a multi-branch network and considers both global and spatial part-based features. Unlike MGN, our method introduces the channel partition strategy and BNNeck, and its performance exceeds that of many previous methods. Besides, our proposed method is a completely end-to-end learning process, which is easy to train and implement.

4 Experiments

Extensive experiments have been performed to evaluate the effectiveness of our proposed approach on three public PReID datasets: Market-1501, DukeMTMC-ReID and MSMT17. The results are compared with state-of-the-art methods.

4.1 Implementation details

Input images are first resized to 256×128 and then augmented by random horizontal flip and random erasing with a probability of 0.5. The images are normalized with per-channel RGB means of [0.485, 0.456, 0.406] and standard deviations of [0.229, 0.224, 0.225]. The backbone ResNet-50 is initialized from the ImageNet pre-trained model. All the layers after the res_conv2 layer are duplicated into three independent branches. Each reduction layer is followed by the BNNeck architecture. Models are trained for 120 epochs on Market-1501, DukeMTMC-reID and MSMT17 with a batch size of 64; each batch consists of 16 identities with 4 instances per identity. Adam is adopted as the optimizer with the learning rate initialized to 3.5 × 10^-5. In the first 10 epochs, the learning rate increases linearly from 3.5 × 10^-5 to 3.5 × 10^-4; it is then decayed to 3.5 × 10^-5 at epoch 40 and further decayed to 3.5 × 10^-6 at epoch 70. To balance the losses, we chose λ=2. To obtain an unbiased comparison, all the experiments were performed on the same PC, whose detailed settings are shown in Table 2.

Table 2
Detailed settings

Name		Detailed settings
Hardware
	GPU	NVIDIA GeForce GTX 1070
	Frequency	1506/1683MHz
	DRAM	8192MB
	Hard drive	519GB
Software
	Operating system	Ubuntu 18.04
	Language	PyTorch 1.8
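The batch composition stated in Section 4.1 (16 identities with 4 instances each, 64 images in total) can be reproduced with a simple identity-balanced sampler. The sketch below is an assumption about how such batches may be drawn, not the authors' data loader; `sample_pk_batch`, the fallback to sampling with replacement, and the toy dataset are all hypothetical.

```python
import random


def sample_pk_batch(images_by_id, num_ids=16, num_instances=4, rng=None):
    """Draw one batch of num_ids identities with num_instances images each.

    images_by_id maps an identity label to a list of image names.
    Identities with too few images are sampled with replacement (an assumption).
    """
    rng = rng or random.Random()
    batch = []
    for pid in rng.sample(sorted(images_by_id), num_ids):
        images = images_by_id[pid]
        if len(images) >= num_instances:
            chosen = rng.sample(images, num_instances)
        else:
            chosen = [rng.choice(images) for _ in range(num_instances)]
        batch.extend((pid, img) for img in chosen)
    return batch


# Hypothetical toy dataset: 20 identities with 3 to 6 images each.
toy = {pid: [f"id{pid}_img{i}.jpg" for i in range(3 + pid % 4)] for pid in range(20)}
batch = sample_pk_batch(toy, rng=random.Random(0))
```

Guaranteeing several images per identity in every batch ensures that each anchor has positives as well as negatives available for the triplet-style WRT loss.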

4.2 Dataset description

Market-1501 dataset: It includes 1,501 pedestrian identities captured by six cameras. The training set has 751 identities with 12,936 images, i.e., 17.2 images per identity on average, while the testing set has 750 identities with 3,368 query images and 19,732 gallery images, i.e., 30.8 images per identity on average.

DukeMTMC-reID dataset: It is derived from a video dataset recorded by 8 synchronized cameras with more than 7,000 single-camera tracks and more than 2,700 independent identities. The dataset consists of 16,522 images of 702 persons in the training set and 2,228 query images and 17,661 gallery images of 702 persons for testing. There are 23.5 images per identity on average in the training set.

MSMT17 dataset: It contains 126,441 bounding boxes with 4,101 pedestrians. Among them, the training set contains 1,041 pedestrians with a total of 32,621 bounding boxes, while the testing set contains 3,060 pedestrians with a total of 93,820 bounding boxes.

4.3 Evaluation metrics

Rank-1 accuracy, taken from the cumulative matching characteristic (CMC), and mean average precision (mAP) are used to evaluate the performance of our proposed model. (1) CMC describes the probability that the target identity is found among the first n elements of the sorted list, i.e., the top-n hit probability, which is defined as:

$cmc (n) = \sum_{i = 1}^{n} r (i),$ (11)

where r(i) indicates the probability that the i-th element in the sorted list is consistent with the target identity. When n=1, cmc(1) = rank-1. Rank-1 is called the first hit rate, which represents the probability that the first element in the sorted list is consistent with the identity of the query from the detection set. (2) For each query, the average precision (AP) is the area under the precision-recall curve:

$AveP = \sum_{k = 1}^{N} p (k) Δ r (k),$ (12)

where p(k) represents the precision when the k-th sample in the candidate set is retrieved, and Δr(k) represents the change in recall when the number of retrieved samples changes from k - 1 to k.

The mAP is the mean of the average precision over all the targets to be queried:

$mAP = \frac{\sum_{q = 1}^{Q} AveP (q)}{Q},$ (13)

where AveP(q) represents the average precision of the q-th target to be queried, and Q represents the number of all samples in the detection set.
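The two metrics can be computed from ranked match flags as in the sketch below. It is a simplified illustration: CMC is taken as the fraction of queries with a correct match within the top n (a common convention, assumed here), the same-camera filtering of standard ReID protocols is omitted, and the toy rankings are hypothetical.

```python
def average_precision(matches):
    """AP for one query; matches[k] is True if the k-th ranked item is correct."""
    total_hits = sum(matches)
    if total_hits == 0:
        return 0.0
    hits, ap = 0, 0.0
    for k, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            ap += (hits / k) * (1.0 / total_hits)   # p(k) * delta r(k)
    return ap


def cmc_at(all_matches, n):
    """Fraction of queries with at least one correct item in the top n."""
    return sum(any(m[:n]) for m in all_matches) / len(all_matches)


def mean_average_precision(all_matches):
    """mAP: the mean of AP over all Q queries."""
    return sum(average_precision(m) for m in all_matches) / len(all_matches)


# Three toy queries with ranked gallery match flags (hypothetical data).
ranked = [
    [True, False, True, False],
    [False, True, False, False],
    [False, False, False, True],
]
rank1 = cmc_at(ranked, 1)
rank2 = cmc_at(ranked, 2)
m_ap = mean_average_precision(ranked)
```

Recall only changes at positions where a correct item is retrieved, so the sum of p(k)Δr(k) reduces to averaging the precision at each hit, which is what `average_precision` does.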

4.4 Ablation studies

To verify the effectiveness of each component and setting of MCSN, we designed several ablation studies with different settings on Market-1501, DukeMTMC-ReID and MSMT17, including the influence of partition strategies, branches, multi-level global fusion features, channel features, backbones, loss functions, parameters in total loss and BNNeck. Note that all unrelated settings are the same as MCSN implementation detailed in Section 4.1.

4.4.1 Influence of partition strategies

Extensive experiments are conducted to validate the configuration in which the channel branch is split into 2 groups and the spatial branch is divided into 4 parts (i.e., P1(2)+P2(4)) by comparing it with three other partition strategies. These partition strategies are compared on the Market-1501, DukeMTMC-ReID and MSMT17 datasets, and the results are reported in Fig. 4. On the Market-1501 dataset, the model with the P1(2)+P2(4) partition strategy outperforms the one with the P1(2)+P2(3) partition strategy by 0.1% rank-1 and 0.2% mAP. On the MSMT17 dataset, the model with the P1(2)+P2(4) partition strategy outperforms the one with the P1(3)+P2(4) partition strategy by 0.8% rank-1 and 0.5% mAP. In general, although the performance of the proposed P1(2)+P2(4) partition strategy on the DukeMTMC-ReID dataset is only average, it is still the best choice overall compared with the other partition strategies.

Fig. 4

Performance comparison of different partition strategies (P1(2)+P2(3), P1(3)+P2(3), P1(3)+P2(4), P1(2)+P2(4)(Ours)) on the Market-1501, DukeMTMC-ReID and MSMT17 datasets. “P1(2)” and “P1(3)” mean that the entire feature map is partitioned into 2 and 3 channel groups in the P1 branch, while “P2(3)” and “P2(4)” refer to 3 and 4 spatial parts in the P2 branch, respectively.

4.4.2 Influence of branches

Our proposed network consists of a global (G) branch and channel (P1) and spatial (P2) partition branches. Table 3 reports the network’s performance under different branch combinations. The results suggest that the performance of the model increases with the number of branches. Among all combinations, the model has the lowest performance on the three datasets when there is only a global branch. As can be seen from Table 3, using all three branches together achieves the best results among all branch combinations on the three datasets, except for mAP on DukeMTMC-ReID, where G(w/o Multi)+P1+P2 is 0.3% higher, indicating that local features are crucial for generalization.

Table 3
Ablation study of branch influences. The models’ performance is studied under different branch configurations, where “G+P1+P2” refers to our proposed model and “P1(spatial)” means that spatial partition is employed rather than channel partition in the P1 branch. In addition, “G(w/o Multi)” means that the multi-level global feature fusion method is removed and only the feature after the res_conv5 layer is used. The bold font denotes the best result

Branch	Market-1501		DukeMTMC-ReID		MSMT17	
	rank-1	mAP	rank-1	mAP	rank-1	mAP
G	92.3	82.4	84.4	72.1	66.8	47.4
G+P1	93.6	85.5	87.1	75.6	67.7	49.4
G+P1(spatial)+P2	93.5	85.3	87.6	76.5	69.9	53.0
G(w/o Multi)+P1+P2	94.2	85.7	87.6	77.1	70.4	53.0
G+P1+P2(Ours)	94.6	86.2	87.8	76.8	70.6	53.1

4.4.3 Effectiveness of multi-level global fusion feature

Previous global feature extraction methods generally do not consider multi-layer global feature representations; PCB, for example, employs a uniform partition strategy to produce part-level features. In contrast, the proposed MCSN learns the global feature representation of pedestrians by extracting multi-level global fusion features. Several experiments are performed on the Market-1501, DukeMTMC-ReID and MSMT17 datasets to confirm the necessity and efficiency of the proposed multi-level global fusion feature setting. The results are shown in the last two rows of Table 3 for all three datasets. Compared with the model without multi-level global feature fusion, our proposed model achieves improvements of 0.4% rank-1 and 0.5% mAP on the Market-1501 dataset, 0.2% rank-1 on the DukeMTMC-ReID dataset (where mAP is 0.3% lower), and 0.2% rank-1 and 0.1% mAP on the MSMT17 dataset. The results with multi-level global feature fusion are generally better, which supports the effectiveness of the proposed method. A possible reason is that low-level and high-level features play complementary roles when combined.

4.4.4 Effectiveness of channel feature

As can be seen in Table 3, when the two channel groups in the P1 branch (G+P1+P2) are replaced with two spatial parts (G+P1(spatial)+P2), performance decreases. That is, the combination of spatial and channel partitions outperforms the variant that uses spatial partition only. Specifically, our proposed method outperforms G+P1(spatial)+P2 by 1.1% rank-1 accuracy and 0.9% mAP on the Market-1501 dataset, 0.2% rank-1 accuracy and 0.3% mAP on the DukeMTMC-ReID dataset, and 0.7% rank-1 accuracy and 0.1% mAP on the MSMT17 dataset. A possible reason is that the uniform spatial partition into 2 and 4 stripes introduces no overlapping areas between stripes, so discriminative information in the overlapping areas cannot be learned.
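The gains quoted in Sections 4.4.3 and 4.4.4 can be re-derived from Table 3 by simple subtraction. The following sketch hard-codes the three relevant rows of Table 3; the dictionary layout and the helper name `gain` are illustrative choices and not part of the original experiments.

```python
# (rank-1, mAP) pairs copied from Table 3.
TABLE3 = {
    "G+P1(spatial)+P2": {
        "Market-1501": (93.5, 85.3),
        "DukeMTMC-ReID": (87.6, 76.5),
        "MSMT17": (69.9, 53.0),
    },
    "G(w/o Multi)+P1+P2": {
        "Market-1501": (94.2, 85.7),
        "DukeMTMC-ReID": (87.6, 77.1),
        "MSMT17": (70.4, 53.0),
    },
    "G+P1+P2": {
        "Market-1501": (94.6, 86.2),
        "DukeMTMC-ReID": (87.8, 76.8),
        "MSMT17": (70.6, 53.1),
    },
}


def gain(variant, full="G+P1+P2"):
    """Return {dataset: (rank-1 gain, mAP gain)} of the full model over a variant."""
    result = {}
    for dataset, (full_r1, full_map) in TABLE3[full].items():
        var_r1, var_map = TABLE3[variant][dataset]
        # Round to one decimal to match the precision reported in the table.
        result[dataset] = (round(full_r1 - var_r1, 1), round(full_map - var_map, 1))
    return result


if __name__ == "__main__":
    for variant in ("G(w/o Multi)+P1+P2", "G+P1(spatial)+P2"):
        print(variant, gain(variant))
```

A negative entry marks a metric on which the ablated variant is ahead of the full model; in Table 3 this occurs only for mAP on DukeMTMC-ReID in the comparison with G(w/o Multi)+P1+P2.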

4.4.5 Influence of backbones

Table 4 compares the performance of different backbones. As shown in Table 4, the plain ResNet-50 has the weakest performance among all models. Comparing ResNet-50 with our proposed MCSN, MCSN(ResNet-50) improves rank-1/mAP from 87.5%/71.4% to 94.6%/86.2% (+7.1%/+14.8%). In addition, the same experiment is carried out with SE-ResNet-50 and ResNet_se_ibn as backbones. The results indicate that MCSN(ResNet-50) achieves gains of 1.1%/1.4% and 5.6%/12.5% in rank-1/mAP over MCSN(SE-ResNet-50) and MCSN(ResNet_se_ibn), respectively. This suggests that additional backbone components are not the most important factor in improving PReID performance; a well-designed network architecture matters more. These results indicate that the proposed MCSN has a strong feature representation capability for PReID.

Table 4
Ablation study of backbone influences on the Market-1501 dataset. “MCSN(SE-ResNet-50)”, “MCSN(ResNet_se_ibn)” and “MCSN(ResNet-50)” mean that SE-ResNet-50, ResNet_se_ibn and ResNet-50 are used as the backbone of our proposed model, respectively. Here, “SE-ResNet-50” refers to ResNet-50 with the Squeeze-and-Excitation structure added, and “ResNet_se_ibn” refers to ResNet-50 with the IN and BN structure added. The bold font denotes the best result

Model	Market-1501
	rank-1	mAP
ResNet-50	87.5	71.4
MCSN(SE-ResNet-50)	93.5	84.8
MCSN(ResNet_se_ibn)	89.0	73.7
MCSN(ResNet-50)	94.6	86.2

4.4.6 Influence of loss functions

Model performance under different loss configurations is reported in Table 5. Two conclusions can be drawn from the results: 1) when only the classification loss is used, the model trained with label smoothing cross-entropy loss achieves higher recognition performance than the one trained with plain softmax cross-entropy loss; 2) when both classification loss and metric loss are used, there is a noticeable performance gap between batch hard triplet loss and WRT loss. The combination Softmax+LS+WRT performs best: it exceeds the best of the other combinations by 0.4% and 0.6% in rank-1 accuracy and by 0.3% and 0.6% in mAP on the Market-1501 and DukeMTMC-ReID datasets, respectively. To form a unified framework, we apply the combination Softmax+LS+WRT in all experiments.

Table 5
Ablation study of loss function influences on the Market-1501 and DukeMTMC-ReID datasets. Here, “Softmax” means the softmax cross-entropy loss, “Softmax+LS” means softmax cross-entropy loss with label smoothing, “BHTP” means batch hard triplet loss, and “WRT” represents weighted regularization triplet loss. The bold font denotes the best result

Softmax	Softmax+LS	BHTP	WRT	Market-1501	DukeMTMC-ReID
				rank-1	mAP	rank-1	mAP
✓				91.5	80.0	82.1	69.4
✓		✓		94.0	84.7	85.6	73.9
✓			✓	93.6	84.5	86.5	74.4
	✓			92.7	81.8	84.2	72.7
	✓	✓		94.2	85.9	87.2	76.2
	✓		✓	94.6	86.2	87.8	76.8
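To make the difference between the “Softmax” and “Softmax+LS” columns of Table 5 concrete, the following framework-free sketch computes the cross-entropy for a single sample with and without label smoothing. It assumes the usual formulation in which the one-hot target is mixed with a uniform distribution over the K identities; the smoothing factor `epsilon=0.1` and the function names are assumptions made for illustration, since the value used in the experiments is not restated in this section.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(z - top) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy(logits, target, epsilon=0.0):
    """Cross-entropy for one sample.

    epsilon = 0 gives the plain softmax cross-entropy ("Softmax").
    epsilon > 0 gives the label-smoothed variant ("Softmax+LS"): the target
    distribution is (1 - epsilon) on the true class plus epsilon / K on
    every class.
    """
    num_classes = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for index, log_p in enumerate(log_probs):
        q = epsilon / num_classes
        if index == target:
            q += 1.0 - epsilon
        loss -= q * log_p
    return loss


if __name__ == "__main__":
    confident = [10.0, 0.0, 0.0]  # hypothetical logits for 3 identities
    print(cross_entropy(confident, target=0))               # plain
    print(cross_entropy(confident, target=0, epsilon=0.1))  # smoothed (assumed epsilon)
```

For a very confident correct prediction, the smoothed loss stays clearly above zero, which discourages the classifier from becoming over-confident on the training identities.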

4.4.7 Parameter in total loss

To balance the contributions of the softmax cross-entropy loss with label smoothing and the WRT loss, the weight parameter λ must be determined. Five values, λ = 1, 2, 3, 4, 5, are tested on the Market-1501, DukeMTMC-ReID and MSMT17 datasets. The results in Fig. 5 show that our proposed model achieves the best performance with λ = 2 on all three datasets. To form a unified framework, we set λ = 2 for all experiments.

Fig. 5

Ablation study of different parameter values for λ from L_total. Experiments are conducted on the Market-1501 (as shown in (a) & (b)), DukeMTMC-ReID (as shown in (c) & (d)) and MSMT17 (as shown in (e) & (f)) datasets, respectively.
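The role of λ can be illustrated with a small, framework-free sketch of the joint loss. The WRT term below follows the commonly used definition from [21]: for each anchor, positive distances are weighted by a softmax over the distances, negative distances by a softmax over the negated distances, and the difference is passed through a softplus. Two points are assumptions of this sketch and not statements about the original implementation: the anchor itself is excluded from its positive set, and λ is taken to scale the metric term (`total_loss = id_loss + lam * wrt`).

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def softmax(values):
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def wrt_loss(features, labels):
    """Mean weighted regularization triplet loss over valid anchors."""
    losses = []
    for i, anchor in enumerate(features):
        d_pos = [euclidean(anchor, f) for j, f in enumerate(features)
                 if labels[j] == labels[i] and j != i]
        d_neg = [euclidean(anchor, f) for j, f in enumerate(features)
                 if labels[j] != labels[i]]
        if not d_pos or not d_neg:
            continue  # anchor has no positive or no negative in the batch
        w_pos = softmax(d_pos)                # far positives get more weight
        w_neg = softmax([-d for d in d_neg])  # near negatives get more weight
        pos_term = sum(w * d for w, d in zip(w_pos, d_pos))
        neg_term = sum(w * d for w, d in zip(w_neg, d_neg))
        losses.append(softplus(pos_term - neg_term))
    return sum(losses) / len(losses) if losses else 0.0


def total_loss(id_loss, metric_loss, lam=2.0):
    # ASSUMPTION: lambda multiplies the metric (WRT) term.
    return id_loss + lam * metric_loss


if __name__ == "__main__":
    feats = [[0.0, 0.0], [0.0, 0.1], [5.0, 5.0], [5.0, 5.1]]  # toy embeddings
    print(total_loss(id_loss=1.0, metric_loss=wrt_loss(feats, [0, 0, 1, 1])))
```

Unlike the batch hard triplet loss, this formulation uses every positive and negative pair in the batch and needs no margin hyperparameter, so λ is the only extra parameter to tune.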

4.4.8 Influence of BNNeck

Most works combine the ID loss and the triplet loss to train ReID models. However, the targets of these two losses are inconsistent in the embedding space. When both losses (the ID loss and the WRT loss in this paper) simultaneously optimize the same feature vector, this inconsistency causes the recognition result to fluctuate. The BNNeck [10] structure is therefore introduced to alleviate this problem. Comparative experiments are conducted on the loss curve and the accuracy curve with and without BNNeck, and the results are shown in Fig. 7. The model with BNNeck converges faster and reaches a higher accuracy on the three datasets, which indicates that BNNeck increases the stability of the training process. It also improves the final performance of the model. As shown in Fig. 8, MCSN achieves additional gains of +0.6%/+1.5%, +1.0%/+0.8% and +2.4%/+2.8% in rank-1/mAP over MCSN(w/o BNNeck) on the Market-1501, DukeMTMC-ReID and MSMT17 datasets, respectively.
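The idea behind BNNeck [10] can be sketched without any deep learning framework. The feature before batch normalization is passed to the metric loss, and the normalized feature is passed to the classifier for the ID loss, so the two losses no longer constrain exactly the same vector. The code below is a simplified illustration: it uses batch statistics only and omits the learnable scale and shift as well as the running statistics used at inference; the function names are hypothetical.

```python
import math


def batch_norm(batch, eps=1e-5):
    """Normalize every feature dimension to zero mean and unit variance
    over the batch (training-mode statistics, no affine parameters)."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[d] for row in batch) / n for d in range(dims)]
    variances = [sum((row[d] - means[d]) ** 2 for row in batch) / n
                 for d in range(dims)]
    return [[(row[d] - means[d]) / math.sqrt(variances[d] + eps)
             for d in range(dims)]
            for row in batch]


def bnneck(batch):
    """Return (f_t, f_i): f_t feeds the metric loss, f_i feeds the ID loss."""
    f_t = batch              # feature before normalization
    f_i = batch_norm(f_t)    # feature after normalization
    return f_t, f_i


if __name__ == "__main__":
    features = [[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]]  # toy batch
    f_t, f_i = bnneck(features)
    print(f_t)
    print(f_i)
```

After normalization, all dimensions share a comparable scale, which is the property that [10] argues makes the ID loss easier to optimize alongside the metric loss.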

4.5 Comparison with state-of-the-art

To verify the effectiveness of the proposed MCSN for the PReID task, we compare it with several recent representative works. The methods are grouped into different types, including pose-guided, mask-guided, global feature, stripe-based, GAN-based and attention-based, and are compared on the popular benchmark datasets Market-1501, DukeMTMC-ReID and MSMT17. For a fair comparison, neither re-ranking [27] nor multi-query fusion [11] is used in our proposed method.

4.5.1 Results on Market-1501

Comparisons between MCSN and state-of-the-art methods on Market-1501 are shown in Table 6. All results are reported in single-query mode. Our proposed model outperforms the pose-guided GLAD by 4.7% rank-1 and 12.3% mAP. Compared with the mask-guided SPReID, MCSN is better by 2.1% rank-1 and 4.9% mAP. It performs slightly better than the global-feature baseline BagOfTricks (+0.1% rank-1 and +0.3% mAP). Among the stripe-based models, MCSN ranks second; the best performance is obtained by MGN [8]. MCSN surpasses the best GAN-based and attention-based model (Mancs) by 1.5% rank-1 and 3.9% mAP. Overall, our proposed method is competitive on the Market-1501 dataset and is exceeded only by MGN.

Table 6
Comparison of our method with state-of-the-art. The table lists our results on the two most used benchmarks, Market-1501 and DukeMTMC-ReID. Note that all results are reported without re-ranking. Bold font denotes the performance of our proposed method

Type	Methods	Market-1501	DukeMTMC-ReID
		rank-1	mAP	rank-1	mAP
Pose-guided	GLAD [28]	89.9	73.9	80.0	62.2
	PIE [1]	87.33	69.25	80.84	64.09
Mask-guided	SPReID [29]	92.5	81.3	84.4	71.0
	MGCAM [30]	83.55	74.25	—	—
Global feature	BagOfTricks [10]	94.5	85.9	86.4	76.4
	DLCE [31]	79.5	59.9	68.9	49.3
	OG-Net [32]	86.19	68.09	76.93	57.20
Stripe-based	PCB+RPP [23]	93.8	81.6	83.3	69.2
	MGN [8]	95.7	86.9	88.7	78.4
	AlignedReID [33]	91.8	79.3	81.2	71.0
GAN-based	Camstyle [34]	88.1	68.7	75.3	53.5
	PN-GAN [35]	89.4	72.6	73.6	53.2
Attention-based	HA-CNN [36]	91.2	75.7	80.5	63.8
	Mancs [2]	93.1	82.3	84.9	71.8
Ours	MCSN	94.6	86.2	87.8	76.8

4.5.2 Results on DukeMTMC-ReID

As for Market-1501, the comparisons between our proposed method and related methods on DukeMTMC-ReID are shown in Table 6. This dataset is challenging, as it has 8 different cameras and the person bounding box size varies drastically across camera views. Even so, our proposed model performs well on this dataset. Apart from MGN, the strongest competing method is BagOfTricks, over which our proposed method achieves improvements of 1.4% rank-1 and 0.4% mAP.

4.5.3 Results on MSMT17

The results on MSMT17 are shown in Table 7. Our proposed model obtains 70.1% rank-1 accuracy and 52.7% mAP, which significantly outperforms some existing methods. Although its rank-1 accuracy is not the best, its mAP is the highest among the methods listed. This indicates that the proposed architecture is able to learn meaningful feature representations as well as a similarity measure with limited data.

Table 7
Comparison to state-of-the-art methods on the MSMT17 dataset

Methods	MSMT17
	rank-1	mAP
DG-Net [37]	77.2	52.3
MGN + CircleLoss [38]	76.9	52.1
ResNet50 + CircleLoss [38]	76.3	50.2
OSNet-IAP 1.0x [39]	77.97	48.66
CBN [40]	72.8	42.9
DLCE [31]	60.48	31.58
OG-Net [32]	47.82	22.82
UTAL [41]	31.4	13.1
TAUDL [42]	28.4	12.5
MCSN (Ours)	70.1	52.7
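A method can lead on mAP while trailing on rank-1 because the two metrics measure different things: rank-1 only checks whether the first retrieved image is correct, whereas average precision rewards placing all true matches of a query near the top. The sketch below computes both for a single query from a ranked list of match flags. The input format and function name are assumptions for illustration, and the sketch omits the filtering of same-camera gallery images that the standard protocols apply.

```python
def rank1_and_ap(ranked_matches):
    """ranked_matches: list of bools for one query, gallery sorted by
    decreasing similarity; True means same identity as the query."""
    hits = 0
    precisions = []
    for position, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precisions.append(hits / position)  # precision at this recall point
    ap = sum(precisions) / hits if hits else 0.0
    rank1 = 1.0 if ranked_matches and ranked_matches[0] else 0.0
    return rank1, ap


def evaluate(all_queries):
    """Average rank-1 and AP over queries, giving rank-1 accuracy and mAP."""
    scores = [rank1_and_ap(q) for q in all_queries]
    n = len(scores)
    return (sum(r for r, _ in scores) / n, sum(a for _, a in scores) / n)


if __name__ == "__main__":
    # Hypothetical rankings of one query with three true matches.
    first_hit_only = [True, False, False, False, False,
                      False, False, False, True, True]
    all_hits_early = [False, True, True, True]
    print(rank1_and_ap(first_hit_only))
    print(rank1_and_ap(all_hits_early))
```

In this toy example, the first ranking wins on rank-1 but the second wins on average precision, which mirrors the pattern reported for MCSN in Table 7.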

4.6 Visualization of retrieval results

Although there are visual differences between the datasets, such as scene bias and detection bias, the results above show that our proposed model consistently improves the performance of PReID. In addition, PReID can be regarded as an image retrieval problem. Therefore, to better understand how our proposed MCSN improves on previous methods, we compare some retrieved results of PCB and our MCSN on the Market-1501 dataset, as shown in Fig. 9. These results support the effectiveness of spatial and channel partition in keeping more salient information.

Fig. 9

Two examples of MCSN and PCB on the Market-1501 dataset. The retrieved images are sorted from left to right according to the similarity score. For each query, the results from rank-1 to rank-3 are displayed. From left to right: query image, top-3 results of PCB, and top-3 results of MCSN. Images in red boxes are negative results and images in green boxes are positive results. The results show that our proposed model boosts the retrieval performance. Best viewed when magnified. Note that the retrieval results of PCB are taken from [43].
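The ranking step behind such a retrieval figure can be sketched as follows: the query feature is compared with every gallery feature, and the gallery is sorted by decreasing similarity. Cosine similarity is assumed here purely for illustration, since the similarity measure is not restated in this section; the gallery names and feature values are hypothetical, and all feature vectors are assumed to be non-zero.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, gallery, k=3):
    """Return the names of the k gallery items most similar to the query.

    gallery maps an image name to its feature vector.
    """
    scored = [(cosine_similarity(query, feature), name)
              for name, feature in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]


if __name__ == "__main__":
    gallery = {
        "g1": [1.0, 0.0],
        "g2": [0.9, 0.1],
        "g3": [0.0, 1.0],
        "g4": [0.5, 0.5],
    }
    print(retrieve([1.0, 0.0], gallery, k=3))
```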

5 Conclusion and future work

We have presented a multi-branch neural network that combines multi-level global fusion features, channel features and spatial features to make better use of spatial and channel information. In addition, the BNNeck and the joint loss function strategy are applied to all vector representation branches to improve the stability of model training and the generalization ability of the model. Experiments conducted on three mainstream image-based benchmarks, Market-1501, DukeMTMC-ReID and MSMT17, validate the performance of our proposed model, which compares favorably with previous state-of-the-art methods in PReID tasks.

Since this paper proposes a multi-branch network, the large number of model parameters reduces efficiency, which is an obstacle to further improving the model. Thus, now that the recognition performance of the model has been improved, efficiency must also be considered. In future work, a channel-attention “soft” pruning algorithm based on local channel correlation will be applied to our proposed model to improve its efficiency. It is worth mentioning that pruning redundancy with more local clustering can effectively maintain the original channel distribution of the network.

References

1. Zheng, Huang, Lu and Yang, Pose-invariant embedding for deep person re-identification, IEEE Transactions on Image Processing 28(9) (2019), 4500–4509.
2. Wang, Zhang, Huang, Liu and Wang, Mancs: A multi-task attentional network with curriculum sampling for person re-identification, in: Proceedings of the European Conference on Computer Vision (ECCV), (2018), 365–381.
3. Zhao, Tian, Sun, Shao, Yan, Yi, Wang and Tang, Spindle net: Person re-identification with human body region guided feature decomposition and fusion, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), 1077–1085.
4. Li, Chen, Zhang and Huang, Learning deep context-aware features over body and latent parts for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), 384–393.
5. Yao, Zhang, Hong, Zhang, Xu and Tian, Deep representation learning with part loss for person re-identification, IEEE Transactions on Image Processing 28(6) (2019), 2860–2871.
6. Liu, Zhao, Tian, Sheng, Shao, Yi, Yan and Wang, Hydraplus-net: Attentive deep features for pedestrian analysis, in: Proceedings of the IEEE international conference on computer vision, (2017), 350–359.
7. Zhao, Li, Zhuang and Wang, Deeply-learned part-aligned representations for person re-identification, in: Proceedings of the IEEE international conference on computer vision, (2017), 3219–3228.
8. Wang, Yuan, Chen, Li and Zhou, Learning discriminative features with multiple granularities for person re-identification, in: Proceedings of the 26th ACM international conference on Multimedia, (2018), 274–282.
9. Fu, Wei, Zhou, Shi, Huang, Wang, Yao and Huang, Horizontal pyramid matching for person re-identification, in: Proceedings of the AAAI conference on artificial intelligence 33 (2019), 8295–8302.
10. Luo, Gu, Liao, Lai and Jiang, Bag of tricks and a strong baseline for deep person re-identification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, (2019), 0–0.
11. Zheng, Shen, Tian, Wang and Tian, Scalable person re-identification: A benchmark, in: Proceedings of the IEEE international conference on computer vision, (2015), 1116–1124.
12. Ristani, Solera, Zou, Cucchiara and Tomasi, Performance measures and a data set for multi-target, multi-camera tracking, in: European conference on computer vision, Springer, (2016), 17–35.
13. Wei, Zhang, Gao and Tian, Person transfer gan to bridge domain gap for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), 79–88.
14. Martinel, Luca Foresti and Micheloni, Aggregating deep pyramidal representations for person re-identification, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, (2019), 0–0.
15. Farenzena, Bazzani, Perina, Murino and Cristani, Person re-identification by symmetry-driven accumulation of local features, in: 2010 IEEE computer society conference on computer vision and pattern recognition, IEEE, (2010), 2360–2367.
16. Gray and Tao, Viewpoint invariant pedestrian recognition with an ensemble of localized features, in: European conference on computer vision, Springer, (2008), 262–275.
17. Fan, Luo, Zhang, He, Zhang and Jiang, Scpnet: Spatial-channel parallelism network for joint holistic and partial person re-identification, in: Asian conference on computer vision, Springer, (2018), 19–34.
18. Sun, Zheng, Deng and Wang, Svdnet for pedestrian retrieval, in: Proceedings of the IEEE international conference on computer vision, (2017), 3800–3808.
19. Howard A.G., Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto and Adam, Mobilenets: Efficient convolutional neural networks for mobile vision applications, arXiv preprint arXiv:1704.04861 (2017).
20. Zheng, Zhang, Sun, Chandraker, Yang and Tian, Person re-identification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, (2017), 1367–1376.
21. Ye, Shen, Lin, Xiang, Shao and Hoi S.C., Deep learning for person re-identification: A survey and outlook, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
22. Chen, Chen, Zhang and Huang, A multi-task deep network for person re-identification, in: Proceedings of the AAAI Conference on Artificial Intelligence 31 (2017).
23. Sun, Zheng, Yang, Tian and Wang, Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline), in: Proceedings of the European conference on computer vision (ECCV), (2018), 480–496.
24. He, Zhang, Ren and Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), 770–778.
25. Szegedy, Vanhoucke, Ioffe, Shlens and Wojna, Rethinking the inception architecture for computer vision, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2016), 2818–2826.
26. Fan, Jiang, Luo and Fei, Spherereid: Deep hypersphere manifold embedding for person re-identification, Journal of Visual Communication and Image Representation 60 (2019), 51–58.
27. Zhong, Zheng, Cao and Li, Re-ranking person re-identification with k-reciprocal encoding, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2017), 1318–1327.
28. Wei, Zhang, Yao, Gao and Tian, Glad: Global-local-alignment descriptor for pedestrian retrieval, in: Proceedings of the 25th ACM international conference on Multimedia, (2017), 420–428.
29. Kalayeh M.M., Basaran, Gokmen, Kamasak M.E. and Shah, Human semantic parsing for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), 1062–1071.
30. Song, Huang, Ouyang and Wang, Mask-guided contrastive attention model for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), 1179–1188.
31. Zheng, Zheng and Yang, A discriminatively learned cnn embedding for person reidentification, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14(1) (2017), 1–20.
32. Zheng and Yang, Person re-identification in the 3d space, arXiv preprint arXiv:2006.04569 (2020).
33. Zhang, Luo, Fan, Xiang, Sun, Xiao, Jiang, Zhang and Sun, Alignedreid: Surpassing human-level performance in person re-identification, arXiv preprint arXiv:1711.08184 (2017).
34. Zhong, Zheng, Li and Yang, Camstyle: A novel data augmentation method for person re-identification, IEEE Transactions on Image Processing 28(3) (2018), 1176–1190.
35. Qian, Fu, Xiang, Wang, Qiu, Wu, Jiang Y.-G. and Xue, Pose-normalized image generation for person re-identification, in: Proceedings of the European conference on computer vision (ECCV), (2018), 650–667.
36. Li, Zhu and Gong, Harmonious attention network for person re-identification, in: Proceedings of the IEEE conference on computer vision and pattern recognition, (2018), 2285–2294.
37. Zheng, Yang, Yu, Zheng, Yang and Kautz, Joint discriminative and generative learning for person re-identification, in: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020).
38. Sun, Cheng, Zhang, Zheng, Wang and Wei, Circle loss: A unified perspective of pair similarity optimization, in: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), (2020).
39. Sovrasov and Sidnev, Building computationally efficient and well-generalizing person re-identification models with metric learning (2020).
40. Zhuang, Wei, Xie, Zhang, Wu, Ai and Tian, Rethinking the distribution gap of person re-identification with camera-based batch normalization (2020).
41. Li, Zhu and Gong, Unsupervised tracklet person re-identification, IEEE Transactions on Pattern Analysis and Machine Intelligence 42(7) (2019), 1770–1782.
42. Li, Zhu and Gong, Unsupervised person re-identification by deep learning tracklet association, in: European Conference on Computer Vision, (2018).
43. Chen, Lagadec and Bremond, Learning discriminative and generalizable representations by spatial-channel partition for person re-identification, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, (2020), 2483–2492.

Table 2
Detailed settings

Name	Detailed settings
Hardware
GPU	NVIDIA GeForce GTX 1070
Frequency	1506/1683 MHz
DRAM	8192 MB
Hard drive	519 GB
Software
Operating system	Ubuntu 18.04
Language	PyTorch 1.8