Joint deep learning of angular loss and hard sample mining for person re-identification

Abstract

Person re-identification (ReID) is an important task in intelligent image processing and deep learning that has attracted attention from industry. Person ReID focuses on matching person images captured by non-overlapping camera views and finding a person-of-interest. An important unresolved problem is obtaining an efficient metric for measuring the similarity among pedestrian images. Recently, deep learning combined with metric learning has become a common approach to person ReID. However, previous methods mainly use various distances to measure the similarity among samples, and distance-based measures are sensitive to changes in scale. In this paper, we propose angular loss with hard sample mining (ALHSM) to learn a better similarity metric for person ReID. Our work uses the angular relationship in a triangle as the measure of similarity, minimizing the angle at the negative point of the triangle. ALHSM combines angular loss with a hard negative mining strategy, which learns a better similarity metric and achieves strong performance on several benchmark datasets. The experimental results show that our work is competitive with the state-of-the-art.

Keywords

Person re-identification, deep learning, angular loss, intelligent image processing

1 Introduction

In recent years, person re-identification (ReID) has drawn significant attention in surveillance security and the retrieval of suspects. Person ReID aims at matching persons observed in non-overlapping camera views using feature descriptors and finding a person-of-interest (query) among a gallery of person images [1]. Person ReID remains a challenging problem owing to changes in lighting, body pose, background and view angle, as well as occlusions, low-resolution images and the high similarity among different persons [2].

In the person ReID task, deep learning has recently attained better results than traditional approaches [3–5]. Existing methods mainly consist of two stages. The first stage extracts discriminative feature descriptors from samples; a convolutional neural network (CNN) is frequently used to extract discriminative features from the query and gallery images [6, 7], and the main concern is making these features more robust. The second stage computes distances between samples by comparing their features, which involves metric learning. Suitable distance metrics are indispensable for resolving the person ReID ranking problem, and metric learning and ranking approaches have been applied to person ReID [8]. Metric learning seeks a mapping function that minimizes the distance between samples of the same person and maximizes the distance between samples of different persons. Across multiple camera views, measuring the similarity between person images becomes more difficult due to changes in camera distance, body posture, etc. This article focuses on the metric learning stage of person ReID.

Several recent person ReID approaches show that identification loss combined with verification loss can learn a more discriminative person embedding [4, 9–11]. However, the number of parameters of the identification loss grows with the number of identities, and many of these parameters are discarded after training. Verification loss is applied to a pair of images and requires training data with real-valued labels, which are usually not available in practice. Moreover, using verification loss to determine the similarity of two images requires one-to-one comparison, so its efficiency is relatively low. Therefore, different distance-based losses, such as improved triplet loss [8], quadruplet loss [12], margin sample mining loss [2] and multi-class N-pair loss [13], have been proposed and achieve better performance. However, these methods mainly adopt a distance metric to measure the similarity of samples, and distance-based measures are sensitive to changes in scale. A single triplet loss margin is also not suitable for classes with different intra-class variation. Previous measures mainly consider optimizing similarity (e.g. contrastive loss) or relative similarity (e.g. triplet loss and quadruplet loss). The loss functions in the previous literature are defined by distances between sample points, and other possible forms of loss function are rarely discussed [14].

In our work, we introduce angular loss combined with extremely hard sample mining (ALHSM), which performs better than several metric learning losses on the person ReID problem. Our method achieves strong performance on common person ReID datasets, i.e. Market1501 [15], CUHK03 [1] and MARS [16], and is competitive with the state-of-the-art.

We use angles rather than distances to define the core of the metric learning loss, which describes the local structure more accurately, and we build on the existing N-pair loss [13] and MSML [2] to further enhance their performance. Angular loss uses the relationship between angles as the measure of similarity and captures local structure more accurately than the distance-based triplet loss. The approach in this paper constrains the angle n shown in Fig. 1, i.e. the angle at the negative point between the two sides of the triplet triangle. The main advantage is scale invariance, which makes the objective resistant to scale changes and more robust to feature differences. This work captures the complementary local structure of triplet triangles and introduces hard sample mining to achieve better convergence. To the best of our knowledge, this is the first work to explore angular loss combined with hard sample mining in the person ReID field.

Fig. 1

Illustration of the angular loss.

2 Related work

Learning a Mahalanobis distance in a Euclidean space is a traditional metric learning method; for example, Keep It Simple and Straightforward Metric Learning (KISSME) [17] has been applied to person ReID. More recent deep metric learning methods usually extract features with a CNN or another model and then compute feature distances in Euclidean space. The triplet loss function [5, 19] is used to model the relative similarity of pairs of person images, and it is commonly applied to the distance metric problem in person ReID retrieval [3] and face recognition [20]. In deep metric learning, a positive pair consists of two image samples of the same identity, whereas a negative pair consists of samples of two different identities. A triplet is made up of three samples, which form one positive pair and one negative pair. To achieve correct classification, the distance of the negative pair is enforced to be larger than that of the positive pair; the triplet loss is built around a margin (threshold) between positive and negative pairs. However, the traditional triplet loss needs hard sample mining to learn efficiently; otherwise training stagnates or becomes unstable and time-consuming [8]. Conversely, if the samples are too hard, training oscillates and may fail to converge. The traditional triplet loss may also generalize only moderately to the test set, mainly because the intra-class variance remains relatively large.

Several variants of the triplet loss have been proposed [2, 22] to solve the above problems. Wu et al. [21] proposed the DeepLDA method, which uses Fisher vectors combined with the LDA objective function; however, this method appears more difficult to train. In [5], variations among images of the same identity are effectively reduced: on top of the triplet loss, the distance between pairs of the same identity is further restricted to be less than a pre-set value. Unfortunately, this method partly neglects the relative relationships between pairs. Ding et al. [22] proposed the batch-all triplet loss, which counts all possible triplets when calculating the loss. The lifted structured feature embedding method [7] fills the batch with triplets, considers all samples except the anchor-positive pair as negatives, and optimizes a smooth bound of the loss. Hermans et al. [8] proposed a generalization of the lifted embedding loss, based on [7] and [22], which considers all anchor-positive pairs. Chen et al. [12] addressed the sample ranking problem in ReID by introducing the quadruplet loss. A quadruplet consists of four samples, and the loss enlarges inter-class distances while reducing intra-class distances. The quadruplet loss additionally considers the absolute distance between positive and negative pairs, pushing negative pairs away from positive pairs, whether or not the two pairs share the same query image.

Xiao et al. [2] proposed margin sample mining loss (MSML) with hard sample mining, which combines the advantages of the quadruplet loss and the TriHard loss. MSML picks only the hardest positive pair and the hardest negative pair in the batch to calculate the loss; thus, MSML mines harder samples than TriHard. MSML pushes apart the boundaries of positive and negative sample pairs, hence the name margin sample mining loss. Because MSML uses only two pairs of samples to calculate the loss, it may seem to waste a lot of training data. However, the two pairs are chosen based on the results of the entire batch, so the other images in the batch also indirectly affect the final loss, and as training proceeds, almost all the data participate in the loss calculation. In summary, MSML is a metric learning method that takes both relative and absolute distances into consideration and introduces hard sample mining.

Based on the work of [2, 23–25], we propose angular loss combined with hard sample mining for person ReID. The common triplet loss does not perform well when classes are unbalanced and vary widely. This problem can be alleviated by minimizing the angle at the negative point [25]. The angle is determined by the geometry of the triangle, so angular loss is both rotation invariant and scale invariant, which makes the objective more robust to changes in the local feature map. ALHSM introduces hard sample mining into angular loss: within the triplet structure, ALHSM uses angular relations as the similarity measure and selects the hardest positive pair and the hardest negative pair in the batch to form the constraints.

3 Methodology

In this section, we introduce angular loss with hard sample mining (ALHSM) for person ReID, using a deep network for feature extraction. First, we review the common triplet loss and its variants. Then we present angular loss combined with hard sample mining. Finally, we discuss in detail how the method is optimized over a batch.

3.1 The triplet loss and the variants

The purpose of metric embedding learning is to learn a function $f (x) : ℝ^{F} \to ℝ^{D}$ which maps semantically similar instances from the data manifold in $ℝ^{F}$ onto metrically close points in $ℝ^{D}$. Triplet loss has been shown to be more effective than softmax classification loss for learning discriminative feature descriptors and is widely used in person ReID and face recognition. However, due to the deficiencies of triplet loss mentioned in Section 2, many improvements have been proposed in the literature.

We first review the common triplet loss. In each mini-batch iteration, a triplet {x_a, x_p, x_n} contains an anchor x_a, a positive x_p and a negative x_n. x_a and x_p are two pedestrian images of the same identity, and x_n is an image of a different person. The triplet loss attempts to minimize the distance between the anchor and the positive while maximizing the distance between the anchor and the negative. It is formulated as Equation 1,

$L_{triplet} = \sum_{a, p, n}^{m} {[\overbrace{{∥ f (x_{a}) - f (x_{p}) ∥}_{2}^{2}}^{\text{minimize}} - \overbrace{{∥ f (x_{a}) - f (x_{n}) ∥}_{2}^{2}}^{\text{maximize}} + α_{triplet}]}_{+}$ (1)

where α_triplet is a distance margin that separates positive pairs from negative pairs, f (x_a), f (x_p) and f (x_n) are the normalized embedded features, and [x]₊ = max(x, 0).
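Equation 1 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the helper names `l2_normalize` and `triplet_loss`, the toy 2-D features and the margin value 0.3 are all assumptions chosen for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row of x to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def triplet_loss(f_a, f_p, f_n, margin=0.3):
    """Sum over triplets of [||f_a - f_p||^2 - ||f_a - f_n||^2 + margin]_+.

    f_a, f_p, f_n: arrays of shape (m, D), one row per triplet.
    margin: hypothetical value for alpha_triplet (not taken from the paper).
    """
    d_ap = np.sum((f_a - f_p) ** 2, axis=1)  # squared anchor-positive distance
    d_an = np.sum((f_a - f_n) ** 2, axis=1)  # squared anchor-negative distance
    return float(np.sum(np.maximum(d_ap - d_an + margin, 0.0)))

# Toy features: positives lie close to their anchors, negatives far away.
f_a = l2_normalize(np.array([[1.0, 0.0], [0.0, 1.0]]))
f_p = l2_normalize(np.array([[1.0, 0.1], [0.1, 1.0]]))
f_n = l2_normalize(np.array([[0.0, 1.0], [1.0, 0.2]]))

loss_easy = triplet_loss(f_a, f_p, f_n)  # margin already satisfied
loss_hard = triplet_loss(f_a, f_n, f_p)  # roles swapped: margin violated
```

With well-separated features the hinge is inactive and the loss is zero; swapping the positive and negative roles violates the margin and gives a positive loss.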

In triplet loss, the Euclidean distance is applied to measure the similarity of the feature descriptors extracted from two input samples. In this paper, we use a learned metric g (x_a, x_p) instead of the Euclidean distance to improve robustness. The full derivation and proof can be found in [26]. Regardless of the threshold α_triplet, the model can multiply g (x_a, x_p) and g (x_a, x_n) by appropriate values to meet the margin requirement. In addition, a softmax constraint is added so that the similarity lies in [0, 1].

The quadruplet loss improves the triplet loss by adding another negative pair. A quadruplet involves four different images {x_a, x_p, x_s, x_t}, where x_a and x_p are pedestrian images of the same identity, while x_s and x_t are images of two other, different persons. The quadruplet loss is formulated as Equation 2,

$\begin{matrix} L_{quad} = \sum_{a, p, s}^{m} {[\overbrace{{∥ g (x_{a}) - g (x_{p}) ∥}_{2} - {∥ g (x_{a}) - g (x_{s}) ∥}_{2} + α}^{\text{relative distance}}]}_{+} \\ + \sum_{a, p, s, t}^{m} {[\overbrace{{∥ g (x_{a}) - g (x_{p}) ∥}_{2} - {∥ g (x_{s}) - g (x_{t}) ∥}_{2} + β}^{\text{absolute distance}}]}_{+} \\ = {[d_{a, p} - d_{a, s} + α]}_{+} + {[d_{a, p} - d_{s, t} + β]}_{+} \end{matrix}$ (2)

where α and β are the margins. β is set smaller than α so that the relative distance constraint dominates, i.e. the first term plays the major role.

Furthermore, if we ignore the effects of the parameters α and β, the quadruplet loss can be written in a more general form as Equation 3,

$\begin{matrix} L_{{quad}^{'}} = \sum_{a, p, s, t}^{m} {[{∥ g (x_{a}) - g (x_{p}) ∥}_{2} - {∥ g (x_{s}) - g (x_{t}) ∥}_{2} + α]}_{+} \\ = {[d_{a, p} - d_{s, t} + α]}_{+} \end{matrix}$ (3)

where s and t form a negative pair, while s and a may form either a positive pair or a negative pair.
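The two hinge terms of Equation 2 can be illustrated for a single quadruplet as follows. This is a sketch under stated assumptions: the inputs are taken to be already-embedded feature vectors, and the function name `quadruplet_loss` and the margin values (0.3 and 0.15, with β smaller than α) are hypothetical.

```python
import numpy as np

def dist(u, v):
    """Euclidean distance between two embedded feature vectors."""
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

def quadruplet_loss(g_a, g_p, g_s, g_t, alpha=0.3, beta=0.15):
    """Equation 2 for one quadruplet of embedded features.

    alpha, beta: hypothetical margins with beta < alpha.
    """
    d_ap, d_as, d_st = dist(g_a, g_p), dist(g_a, g_s), dist(g_s, g_t)
    relative = max(d_ap - d_as + alpha, 0.0)  # shares the anchor a
    absolute = max(d_ap - d_st + beta, 0.0)   # compares two unrelated pairs
    return relative + absolute

# The anchor-positive pair is tight and the negative s is far from a,
# but s and t (two different persons) are even closer than a and p.
loss_quad = quadruplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.0, 0.05])
```

In this toy case the relative term is zero while the absolute term is still active, which is the situation the second term of the quadruplet loss is meant to penalize.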

Using L_quad′ directly does not produce very good results, because the number of quadruplets grows dramatically as the amount of data increases and most sample pairs are relatively easy, which limits the performance of the model. To solve this problem, margin sample mining loss (MSML) adopts the idea of the TriHard loss. For each image in the batch, TriHard picks the hardest positive sample and the hardest negative sample to form a triplet with that image. The MSML loss is defined as Equation 4,

$L_{MSML} = {[max_{a, p} d_{a, p} - min_{s, t} d_{s, t} + α]}_{+}$ (4)

where a, p, s, t are images in the batch, $max_{a, p} d_{a, p}$ is the distance of the positive pair with the largest distance in the batch, $min_{s, t} d_{s, t}$ is the distance of the negative pair with the smallest distance in the batch, and a and s may form either a positive pair or a negative pair. In summary, TriHard picks one triplet for each image in the batch, whereas MSML picks only the hardest positive pair and the hardest negative pair in the batch for the loss calculation. MSML is thus a metric learning method that takes both relative and absolute distance into consideration and introduces hard sample mining.
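The batch-level mining in Equation 4 amounts to one maximum and one minimum over the pairwise distance matrix. The sketch below assumes plain Euclidean distance, a batch containing at least one positive pair and one negative pair, and a hypothetical margin of 0.3; the names `pairwise_dist` and `msml_loss` are illustrative.

```python
import numpy as np

def pairwise_dist(feats):
    """(B, B) matrix of Euclidean distances between rows of feats."""
    diff = feats[:, None, :] - feats[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=2))

def msml_loss(feats, labels, alpha=0.3):
    """[max positive-pair distance - min negative-pair distance + alpha]_+.

    Assumes the batch holds at least one positive and one negative pair.
    alpha: hypothetical margin value.
    """
    labels = np.asarray(labels)
    d = pairwise_dist(np.asarray(feats, dtype=float))
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)  # exclude (i, i) pairs
    hardest_pos = d[same & off_diag].max()       # most dissimilar positive pair
    hardest_neg = d[~same].min()                 # most similar negative pair
    return float(max(hardest_pos - hardest_neg + alpha, 0.0))

feats = [[0.0, 0.0], [0.4, 0.0], [1.0, 0.0], [1.0, 0.3]]
labels = [0, 0, 1, 1]
loss_msml = msml_loss(feats, labels)  # 0.4 - 0.6 + 0.3
```

Only two pairs enter the loss value, but which pairs are selected depends on every image in the batch, as noted above.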

3.2 The angular loss with hard sample mining (ALHSM)

Traditional triplet loss and contrastive loss are based on distance measurement, which cannot handle scale changes, and they consider only second-order relations between samples. The angular loss constrains the angle at the negative point of the triplet, thereby considering third-order relations rather than only second-order ones. The angular loss is scale invariant, which improves the robustness of the objective function to feature changes. In essence, angular loss adds a third-order geometric constraint, which captures additional local structure compared with triplet loss or contrastive loss and yields better convergence.

We represent the anchor, positive and negative samples of the angular loss as the vertices of a triangle. Our purpose is to make the anchor-positive distance as small as possible and the anchor-negative distance as large as possible. Intuitively, angular loss requires the angle at the negative point of the triplet to be less than a predefined upper bound α, as shown in Fig. 1, where x_a and x_p belong to the same person and x_n is a negative sample of a different person. By the geometry of the triangle, keeping the negative sample away from the anchor and the positive requires making ∠n smaller. However, when the angle at the anchor, ∠a, is greater than 90°, minimizing ∠n can pull the negative sample closer to the anchor, which works against enlarging the distance between different persons. Therefore, a new triangle Δ_mcn is constructed by moving the anchor sample x_a and the positive sample x_p to x_c and x_m, respectively. Using the tangent relation and the angle n defined above, the angular loss minimizes L_ang (Equation 5).

$\begin{matrix} L_{ang} = {[{∥ x_{a} - x_{p} ∥}^{2} - 4 {tan}^{2} α {∥ x_{n} - x_{c} ∥}^{2}]}_{+} \\ = {[d_{a, p} - 4 {tan}^{2} α \cdot d_{n, c}]}_{+} \end{matrix}$ (5)
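Equation 5 can be evaluated directly for a single triplet. The sketch below assumes that x_c is the midpoint of x_a and x_p, as in the original angular loss formulation; the text here does not state this explicitly, so it is labeled as an assumption. The angle bound of 45° and the name `angular_loss` are also hypothetical choices.

```python
import numpy as np

def angular_loss(x_a, x_p, x_n, alpha_deg=45.0):
    """[||x_a - x_p||^2 - 4 tan^2(alpha) ||x_n - x_c||^2]_+ for one triplet.

    Assumption: x_c = (x_a + x_p) / 2, the midpoint of anchor and positive.
    alpha_deg: hypothetical upper bound on the angle at the negative point.
    """
    x_a, x_p, x_n = (np.asarray(v, dtype=float) for v in (x_a, x_p, x_n))
    x_c = 0.5 * (x_a + x_p)
    tan_sq = np.tan(np.deg2rad(alpha_deg)) ** 2
    d_ap = np.sum((x_a - x_p) ** 2)
    d_nc = np.sum((x_n - x_c) ** 2)
    return float(max(d_ap - 4.0 * tan_sq * d_nc, 0.0))

# Negative far from the anchor-positive segment: constraint satisfied.
loss_far = angular_loss([0, 0], [1, 0], [0.5, 2.0])
# Negative close to the segment: constraint violated.
loss_near = angular_loss([0, 0], [1, 0], [0.5, 0.2])
# Same violated triplet with every feature scaled by s = 3.
loss_near_scaled = angular_loss([0, 0], [3, 0], [1.5, 0.6])
```

Both terms inside the hinge scale by s² when all features are scaled by s, so a triplet that satisfies (or violates) the angular constraint does so at every scale; only the magnitude of the loss changes. A fixed distance margin, by contrast, does not rescale with the features.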

Building on the work reviewed in Section 3.1, we propose angular loss with hard sample mining (ALHSM) for person ReID. Using L_ang alone does not give better results: as the amount of data increases, the number of triplets grows dramatically, and most samples are relatively easy, which limits the performance of the model. Our method therefore uses two sample pairs and an angular constraint to calculate the loss. The two pairs are picked based on the results of the entire batch, so the other images in the batch also indirectly affect the final loss, and as training proceeds, almost all the data participate in the loss calculation.

We combine the angular loss with a margin sample mining strategy. ALHSM selects the most dissimilar positive pair and the most similar negative pair in the batch while satisfying the angular constraint, as in Equation 6,

$L_{ALHSM} = {[max_{a, p} d_{a, p} - 4 {tan}^{2} α \cdot d_{n, c} - min_{s, t} d_{s, t}]}_{+}$ (6)

where a and p are images of the same person with the largest distance in the batch, s and t are images of different persons with the smallest distance in the batch, and s and t may or may not share an identity with a. In Equation 6, the distance between two sample points is measured with the learned metric described in Section 3.1 instead of the traditional Euclidean distance. Figure 2 summarizes the relationship between these losses.

Fig. 2

Several loss function relations.

In our experiments, we optimize a smooth upper bound of Equation 6. Assuming that the features in Equation 6 have unit length, we use Equation 7 to represent the angular loss of a batch.

$L_{ALHSM} (Ψ) = \frac{1}{N} \sum_{x \in Ψ} {log [1 + \sum_{y_{n} \neq y_{a}, y_{p}} exp (g_{a, p, n})]}$ (7)

Dropping the constant terms gives Equation 8, where ϒ is the smallest distance between different persons in the batch, $min_{s, t} d_{s, t}$.

$g_{a, p, n} = 4 {tan}^{2} α (x_{a} + x_{p})^{T} x_{n} - 2 (1 + {tan}^{2} α) x_{a}^{T} x_{p} - ϒ$ (8)
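A batch-level sketch of Equations 7 and 8 is given below. Several choices are assumptions made for illustration and are not specified in the text: the first term of g is taken against the negative sample x_n (which is what expanding Equation 5 gives for unit-length features with x_c = (x_a + x_p)/2), each anchor is paired with its farthest same-identity sample, ϒ is computed from squared Euclidean distances rather than the learned metric, and α is set to 45°. The function name `alhsm_batch_loss` is hypothetical.

```python
import numpy as np

def alhsm_batch_loss(feats, labels, alpha_deg=45.0):
    """Mean over anchors of log(1 + sum_n exp(g_{a,p,n})) (Equations 7-8).

    Assumptions: unit-length features, hardest (farthest) positive per
    anchor, squared Euclidean distance for upsilon, and a batch with at
    least two identities that each have two or more images.
    """
    labels = np.asarray(labels)
    x = np.asarray(feats, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    tan_sq = np.tan(np.deg2rad(alpha_deg)) ** 2
    sim = x @ x.T                      # inner products x_i^T x_j
    sq_dist = 2.0 - 2.0 * sim          # squared distance for unit vectors
    same = labels[:, None] == labels[None, :]
    not_self = ~np.eye(len(labels), dtype=bool)
    upsilon = sq_dist[~same].min()     # closest negative pair in the batch
    losses = []
    for a in range(len(labels)):
        pos = np.flatnonzero(same[a] & not_self[a])
        neg = np.flatnonzero(~same[a])
        if pos.size == 0 or neg.size == 0:
            continue                   # anchor has no usable pairs
        p = pos[np.argmax(sq_dist[a, pos])]
        g = (4.0 * tan_sq * ((x[a] + x[p]) @ x[neg].T)
             - 2.0 * (1.0 + tan_sq) * sim[a, p] - upsilon)
        losses.append(np.log1p(np.sum(np.exp(g))))
    return float(np.mean(losses))

feats = [[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0], [-1.0, 0.1]]
loss_separated = alhsm_batch_loss(feats, [0, 0, 1, 1])  # identities far apart
loss_mixed = alhsm_batch_loss(feats, [0, 1, 0, 1])      # identities interleaved
```

When the two identities occupy opposite sides of the unit circle the loss is close to zero, while interleaving the labels so that positives are far apart and negatives are adjacent gives a much larger value.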

In summary, compared with other metric learning losses, ALHSM has the following advantages. It applies angular loss to person ReID: within a triplet, the negative sample point is constrained by an angle, which makes the method robust to scale and feature variation. It also considers both relative and absolute distance and combines hard sample mining with angular loss.

4 Experiments

In this section, we assess the proposed method on three common person ReID benchmark datasets, i.e. Market1501 [15], CUHK03 [1] and MARS [16]. We present results obtained with the GoogLeNet and ResNet-50 network structures separately, and then compare the proposed approach with state-of-the-art methods.

4.1 Datasets

The Market1501 dataset [15] is one of the most widely used datasets in the person ReID field. It contains 32,668 annotated bounding boxes of 1,501 identities collected from six cameras, with 12,936 images for training and 19,732 images for testing. There are on average 17.2 images per identity in the training set.

CUHK03 [1] contains 14,097 images of 1,467 identities collected on the CUHK campus. Each identity is captured by two cameras and has 4.8 images on average in each view. The dataset provides two kinds of bounding boxes; we evaluate our model on the bounding boxes detected by DPM, which is closer to a realistic setting. We report the result averaged over 15 rounds of training/testing, and we report single-shot results on all the datasets.

The MARS (Motion Analysis and Recognition Set) dataset [16] is an extended version of the Market1501 dataset and is a large video-based person ReID dataset. All bounding boxes and tracklets are automatically generated. MARS has a total of 20,478 tracklets of 1,261 identities from 6 camera views.

4.2 Experimental results

The Chainer package is used throughout the experiments. Chainer [27] is an open-source deep learning framework featuring the define-by-run approach. Each image is resized to 256×256 pixels before data augmentation such as random horizontal flipping, random cropping and zooming. The final features of GoogLeNet and ResNet-50 are mapped to 1024 dimensions through a fully-connected layer. The Adam optimizer is used with an initial learning rate of 0.0001. We use the SGD solver to train our model and set the batch size to 120. We use the HDF5 format to read and write data files, since data of different types can be stored in a single HDF5 file.
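The flip-and-crop part of the augmentation described above could look like the following NumPy sketch. It is not the authors' Chainer pipeline: the crop size of 224 is a hypothetical value (the text does not give one), the zoom step is omitted, and the function name `augment` is illustrative.

```python
import numpy as np

def augment(image, crop=224, rng=None):
    """Random horizontal flip followed by a random square crop.

    image: array of shape (H, W, C), e.g. a 256x256 RGB image.
    crop: hypothetical crop size; must not exceed H or W.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    if rng.random() < 0.5:
        image = image[:, ::-1]                    # flip along the width axis
    top = int(rng.integers(0, h - crop + 1))      # upper bound is exclusive
    left = int(rng.integers(0, w - crop + 1))
    return image[top:top + crop, left:left + crop]

sample = np.zeros((256, 256, 3), dtype=np.float32)
augmented = augment(sample, rng=np.random.default_rng(0))
```

Passing a seeded generator makes the augmentation reproducible, which is convenient when comparing losses under identical inputs.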

We evaluate our method with rank-1, rank-5 and rank-10 accuracy and mean average precision (mAP), and compare it with representative person ReID methods on several benchmark datasets. The results are shown in Tables 1–3.
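For reference, rank-k accuracy and mAP can be computed from a query-to-gallery distance matrix as sketched below. This is a simplified illustration: it ignores protocol details such as filtering gallery images from the same camera as the query, and the function name `evaluate` and the toy distances are assumptions.

```python
import numpy as np

def evaluate(dist, q_ids, g_ids, ks=(1, 5, 10)):
    """Return ({k: rank-k accuracy}, mAP) for a (Q, G) distance matrix.

    Simplification: no same-camera or junk-image filtering is applied.
    Queries without any matching gallery image are skipped; at least one
    query is assumed to have a match.
    """
    dist = np.asarray(dist, dtype=float)
    q_ids, g_ids = np.asarray(q_ids), np.asarray(g_ids)
    hits = {k: 0 for k in ks}
    aps = []
    for i in range(len(q_ids)):
        order = np.argsort(dist[i])               # most similar first
        matches = g_ids[order] == q_ids[i]
        if not matches.any():
            continue
        first = int(np.argmax(matches))           # position of first correct match
        for k in ks:
            hits[k] += first < k
        ranks = np.flatnonzero(matches) + 1       # 1-based ranks of correct matches
        precision = np.arange(1, len(ranks) + 1) / ranks
        aps.append(precision.mean())              # average precision of this query
    n = len(aps)
    return {k: hits[k] / n for k in ks}, float(np.mean(aps))

dist = [[0.1, 0.5, 0.3, 0.9],   # query 0 (id 1): both matches ranked first
        [0.2, 0.8, 0.6, 0.4]]   # query 1 (id 2): matches at ranks 2 and 4
cmc, mean_ap = evaluate(dist, q_ids=[1, 2], g_ids=[1, 2, 1, 2])
```

In the toy example the first query has average precision 1.0 and the second 0.5, so rank-1 accuracy is 0.5 and mAP is 0.75.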

Table 1
Comparison on Market1501 with single query

Methods	mAP	Rank-1	Rank-5	Rank-10
BoW + KISSME [15]	20.76	44.42	63.90	72.18
SL [28]	26.35	51.90	–	–
Null Reid [29]	29.87	55.43	–	–
Fisher Net [30]	29.94	48.15	–	–
Multiregion CNN [31]	41.17	66.36	85.01	90.17
Past (ResNet-50) [6]	47.80	73.90	87.68	91.54
SOMAnet [32]	47.89	73.87	88.03	92.22
Spindle Net [33]	–	76.90	91.50	94.60
Triplet loss (ResNet-50) [20]	54.80	75.90	89.60	–
Pose [34]	55.95	79.33	90.76	94.41
DCGAN [35]	56.23	78.06	–	–
TOMM(ResNet-50) [4]	59.87	79.51	90.91	94.09
Quad (ResNet-50) [12]	61.10	80.00	91.80	–
Pose-driven [36]	63.41	84.14	92.73	94.92
Attribute [37]	64.67	84.29	93.20	95.19
Deep Transfer [10]	65.50	83.70	–	–
Deep joint [11]	65.50	85.10	–	–
Defense (ResNet-50) [8]	69.14	84.92	94.21	–
MSML (ResNet-50) [2]	69.60	85.20	93.70	–
GoogLeNet-Basel. Ours (GoogLeNet)	71.32	86.87	94.61	96.91
ResNet-50-Basel. Ours (ResNet-50)	73.85	87.93	95.09	97.29

Table 2

Comparison on CUHK03 with single query

Methods	mAP	Rank-1	Rank-5	Rank-10
KISSME [1]	–	19.90	49.30	64.70
BoW + KISSME [15]	–	24.30	–	–
LOMO+XQDA [38]	–	46.30	78.90	88.60
SI-CI [26]	–	52.20	84.30	94.80
Null Reid [29]	–	58.90	85.60	92.45
Ensembles [19]	–	62.10	89.10	94.30
DeepLDA [21]	–	63.23	89.95	92.73
GOG [39]	–	67.30	91.00	96.00
Gated Siamese [41]	–	68.10	88.10	94.60
SOMAnet [29]	–	72.40	92.10	95.80
Triplet loss (ResNet-50) [20]	–	73.00	92.00	96.00
DGD [42]	–	80.50	94.90	97.10
Quad (ResNet-50) [12]	–	79.10	95.30	97.90
TOMM(ResNet-50) [4]	86.40	83.40	97.10	98.70
Deep Transfer [10]	–	84.10	–	–
Defense(ResNet-50) [8]	–	79.50	95.00	98.00
MSML(ResNet-50) [2]	–	84.00	96.70	98.20
DCGAN [35]	87.40	84.60	97.60	98.90
GoogLeNet-Basel. Ours (GoogLeNet)	88.11	85.23	98.07	98.97
ResNet-50-Basel. Ours (ResNet-50)	89.29	86.81	98.15	99.10

Table 3

Comparison on MARS with single query

Methods	mAP	Rank-1	Rank-5	Rank-10
MARS [16]	42.40	60.00	77.90	87.90
IDE (R) + ML [44]	55.12	70.51	–	–
Latent Parts (Fusion) [41]	56.05	71.77	86.57	–
Triplet loss (ResNet-50) [20]	62.10	76.10	89.60	–
Quad (ResNet-50) [12]	62.10	74.90	88.90	–
Defense (ResNet-50) [8]	71.30	82.50	92.10	–
MSML(ResNet-50) [2]	72.00	83.00	92.60	–
GoogLeNet-Basel. Ours (GoogLeNet)	74.02	84.96	94.21	96.25
ResNet-50-Basel. Ours (ResNet-50)	75.39	85.67	95.09	96.99

As shown in Tables 1 and 2, BoW + KISSME [15] is the baseline. On Market1501, we obtain 86.87% and 87.93% rank-1 accuracy with GoogLeNet and ResNet-50, respectively, and the improvement over the baselines is observed for both network architectures. Specifically, compared with MSML [2], we gain 1.67 and 2.73 percentage points in rank-1 accuracy using GoogLeNet and ResNet-50, respectively. Comparisons with existing models on the CUHK03 and MARS datasets are presented in Tables 2 and 3, respectively. Similarly, we obtain 85.23% and 86.81% rank-1 accuracy on the CUHK03 dataset in the single-shot setting using GoogLeNet and ResNet-50. As shown in the tables, our method outperforms all the other listed methods, achieving 89.29% and 75.39% mAP on the CUHK03 and MARS datasets, respectively.

Overall, ALHSM achieves better accuracy on most of the experimental datasets with both base models. These results show that our approach can be applied to different networks and improves their performance.

As depicted in Fig. 3, we further visualize some retrieval results, e.g. on Market1501 [15] and CUHK03 [1]. The images in the first column are the query images. The retrieved images are shown in the second to eleventh columns, sorted by similarity score from high to low. Correct and false matches are marked with green and red bounding boxes, respectively (best viewed in color). As the figure shows, most candidate images are retrieved correctly. The CUHK03 dataset is more challenging because it contains pedestrians with occlusions and similar appearance, so the proposed model retrieves some incorrect candidates.

Fig. 3

Examples of pedestrian retrieval results on three datasets using the proposed method in single query mode.

5 Conclusion

This paper has presented angular loss combined with hard sample mining (ALHSM) for person ReID. Based on the geometry of the triangle, we use angular loss to constrain the angle at the negative point of the triplet, which counters scale and feature changes. We compute the distance matrix of the batch and constrain the angular loss through hard sample mining, selecting the positive pair with the largest distance and the negative pair with the smallest distance to train the model, which improves its robustness. We use GoogLeNet and ResNet-50 as base models in comparative experiments with different metric learning losses. On several benchmark datasets, including CUHK03, Market1501 and MARS, the results show that our approach performs better than most related methods. In the future, we will extend ALHSM to learning medical image processing from one view.

Funding

The authors acknowledge the Major Program of Natural Science Foundation of the Higher Education Institutions of Jiangsu Province (Grant: 18KJA520002), the Natural Science Foundation of Jiangsu Province (Grant: BK20171267), the Fifth Issue 333 High-Level Talent Training Project of Jiangsu Province (Grant: BRA2018333), and the Horizontal Project (Grant: Z421A19830).

References

Li ,

Zhao ,

Xiao and

Wang , Deepreid: Deep filter pairing neural network for person re-identification. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 9(24) (2014), 152–159.

Xiao ,

Luo and

Zhang , Margin sample mining loss: A deep learning based method for person re-identification. arXiv preprint arXiv:1710.00478, (2017).

Wang ,

Jia ,

He and

Jiang , Joint learning of body and part representation for person re-identification, IEEE Access 6(8) (2018), 44199–44210.

Zheng ,

Zheng and

Yang , A discriminatively learned cnn embedding for person reidentification, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14(1) (2017), 13–21.

Cheng ,

Gong ,

Zhou ,

Wang and

Zheng , Person re-identification by multi-channel parts-based cnn with improved triplet loss function. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(9) (2016), 1335–1344.

Zheng ,

Yang and

A.G.

Hauptmann , Person reidentification: Past, present and future. arXiv preprint arXiv:1610.02984, (2016).

Song ,

Xiang ,

Jegelka and

Savarese , Deep metric learning via lifted structured feature embedding. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(9) (2016), 4004–4012.

Hermans ,

Beyer and

Leibe , In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, (2017).

Chen ,

Zhang and

Huang , A multi-task deep network for person re-identification. in Association for the Advance of Artificial Intelligence 1(2) (2017), 3988–3994.

10.

Ni ,

Gu ,

Wang ,

Zhang ,

Chen and

Jin , Discriminative deep transfer metric learning for cross-scenario person re-identification, Journal of Electronic Imaging 27(4) (2018), 26–43.

11.

Li ,

Zhu and

Gong , Person re-identification by deep joint learning of multi-loss classification. in Proceedings of International Joint Conference on Artificial Intelligence 8(19) (2017), 2194–2200.

12.

Chen ,

Zhang and

Huang , Beyond triplet loss: A deep quadruplet network for person re-identification. in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 11(6) (2017), 403–412.

13.

Sohn , Improved deep metric learning with multi-class N-pair loss objective. in Advances in Neural Information Processing Systems 12(5) (2016), 1857–1865.

14.

Ustinova and

V.S.

Lempitsky , Learning deep embeddings with histogram loss. in Advances in Neural Information Processing Systems 12(5) (2016), 4177–4185.

15.

Naik and Dr.

Rama Mohan Reddy , Optimized Configuration for OSPF Protocol to Aggregate Routes using Prim’s MST Algorithm, International Journal of Innovative Research in Computer and Communication Engineering (IJIRCCE) 6(8), 2018.

16.

Zheng ,

Shen ,

Tian ,

Wang ,

Wang and

Tian , Scalable person re-identification: A benchmark. in Proceedings of the IEEE International Conference on Computer Vision 2(17) (2015), 1116–1124.

17.

Zheng ,

Bie ,

Sun ,

Wang ,

Su and

Wang , MARS: A video benchmark for large-scale person re-identification. in Proceedings European Conference on Computer Vision 99(10) (2016), 868–884.

18. Koestinger, Hirzer, Wohlhart, P.M. Roth and Bischof, Large scale metric learning from equivalence constraints, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 6(16) (2012), 2288–2295.

19. Liao, Ying Yang, Zhan and Rosenhahn, Triplet-based deep similarity learning for person re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 1(19) (2017), 385–393.

20. Paisitkriangkrai, Shen and van den Hengel, Learning to rank in person re-identification with metric ensembles, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 10(14) (2015), 1846–1855.

21. Schroff, Kalenichenko and Philbin, FaceNet: A unified embedding for face recognition and clustering, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 10(14) (2015), 815–823.

22. Wu, Shen and van den Hengel, Deep linear discriminant analysis on Fisher networks: A hybrid architecture for person re-identification, Pattern Recognition 65(5) (2017), 238–250.

23. Ding, Lin, Wang and Chao, Deep feature learning with relative distance comparison for person re-identification, Pattern Recognition 48(10) (2015), 2993–3003.

24. Fix, Gruber, Boros and Zabih, A graph cut algorithm for higher-order Markov random fields, in Proceedings of the IEEE International Conference on Computer Vision 11(6) (2011), 1020–1027.

25. Kundu, Techniques Used for Mining Data in Educational System, International Journal of Innovative Research in Computer and Communication Engineering (IJIRCCE) 6(6) (2018).

26. Duchenne, F.R. Bach, Kweon and Ponce, A tensor-based algorithm for high-order graph matching, IEEE Transactions on Pattern Analysis and Machine Intelligence 33(12) (2011), 2383–2395.

27. Wang, Zhou, Wen, Liu and Lin, Deep metric learning with angular loss, in International Conference on Computer Vision and Pattern Recognition 12(22) (2017), 2593–2601.

28. Wang, Zuo, Lin, Zhang and Zhang, Joint learning of single-image and cross-image representations for person re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(9) (2016), 1288–1296.

29. Oda and Tetsuya, A vegetable category recognition system: A comparison study for Caffe and Chainer DNN frameworks, Soft Computing 23(9) (2019), 3129–3136.

30. Chen, Yuan, Chen and Zheng, Similarity learning with spatial constraints for person re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(9) (2016), 1268–1277.

31. Zhang, Xiang and Gong, Learning a discriminative null space for person re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(9) (2016), 1239–1248.

32. Wu, Shen and van den Hengel, Deep linear discriminant analysis on Fisher networks: A hybrid architecture for person re-identification, Pattern Recognition 65(5) (2016), 238–250.

33. Ustinova, Ganin and Lempitsky, Multi-region bilinear convolutional neural networks for person re-identification, in Advanced Video and Signal Based Surveillance 10(20) (2017), 1–6.

34. I.B. Barbosa, Cristani, Caputo, Rognhaugen and Theoharis, Looking beyond appearances: Synthetic training data for deep CNNs in re-identification, Computer Vision and Image Understanding 167(2) (2018), 50–62.

35. Zhao, Tian, Sun, Shao, Yan, Yi and Tang, Spindle net: Person re-identification with human body region guided feature decomposition and fusion, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 11(6) (2017), 907–915.

36. Arathi, M.S. Vinutha and Pai, Survey on Smart City Essentials Using Artificial Intelligence, International Journal of Innovative Research in Computer and Communication Engineering (IJIRCCE) 6(7) (2018).

37. Liu, Ni, Yan, Zhou, Cheng and Hu, Pose transferrable person re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(4) (2018), 4099–4108.

38. Zheng, Zheng and Yang, Unlabeled samples generated by GAN improve the person re-identification baseline in vitro, in Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition 12(22) (2017), 3754–3762.

39. Su, Li, Zhang, Xing, Gao and Tian, Pose-driven deep convolutional model for person re-identification, in IEEE International Conference on Computer Vision and Pattern Recognition 12(22) (2017), 3980–3989.

40. Lin, Zheng, Wu and Yang, Improving person re-identification by attribute and identity learning, arXiv preprint arXiv:1703.07220 (2017).

41. Liao, Hu, Zhu and S.Z. Li, Person re-identification by local maximal occurrence representation and metric learning, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 10(14) (2015), 2197–2206.

42. Matsukawa, Okabe, Suzuki and Sato, Hierarchical Gaussian descriptor for person re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(9) (2016), 1363–1372.

43. Li, Chen, Zhang and Huang, Learning deep context-aware features over body and latent parts for person re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(22) (2017), 7398–7407.

44. R.R. Varior, Haloi and Wang, Gated Siamese convolutional neural network architecture for human re-identification, in Proceedings of the European Conference on Computer Vision 99(12) (2016), 791–808.

45. Xiao, Li, Ouyang and Wang, Learning deep feature representations with domain guided dropout for person re-identification, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 12(9) (2016), 1249–1258.

46. Zhong, Zheng, Cao and Li, Re-ranking person re-identification with k-reciprocal encoding, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition 11(6) (2017), 3652–3661.