Cross-view pedestrian clustering via graph convolution network for unsupervised person re-identification

Abstract

Supervised person re-identification methods currently achieve high identification performance. However, practical application scenarios involve many cameras whose data are unlabeled. The high cost of annotation greatly limits how well a supervised model transfers to other scene domains. Therefore, unsupervised learning for person re-identification is more attractive in the real world. In addition, due to changes in camera angle, illumination and posture, the extracted representation of a person generally differs between non-overlapping camera views, but existing algorithms ignore the differences among cross-camera images caused by camera parameters and environments. To overcome these problems, we propose an unsupervised metric learning method for person re-identification. The model learns a shared space to reduce the discrepancy between different cameras. A graph convolution network is further employed to cluster the cross-view image features extracted from the shared space. Our model improves the scalability of pedestrian re-identification in practical application scenarios. Extensive experiments on four large-scale public person re-identification datasets demonstrate the effectiveness of the proposed model.

Keywords

Person re-identification, unsupervised clustering, graph convolution network, cross-view

1 Introduction

Given one or more images of a target person, person re-identification (person ReID) aims at matching or retrieving that person across cameras or terminals [1]. Due to variations in illumination, camera parameters and body pose, person ReID remains a challenging problem in the open-set setting, as shown in Fig. 1. Recent research focuses on person representation [2–6], metric learning [7–12], and dataset extension [13–16]. Most existing person ReID methods focus on robust feature extraction, on distance or similarity measures for comparing features, or on a combination of both [17–19].

Fig.1

Person re-identification challenges.

However, the high performance of existing person ReID methods relies on the availability of large-scale annotated training data. In supervised person ReID, a large number of cross-camera sample pairs need to be manually annotated. In current person ReID datasets, the number of training samples of the same pedestrian ID across non-overlapping cameras is insufficient. The lack of labeled pairs makes learning discriminative features challenging. Due to the high cost of manual annotation, the scalability and practicability of supervised methods in real video surveillance applications are greatly restricted when they are applied to a new scenario. Because of cross-domain differences in scene, light intensity, pedestrian age span and camera parameters, a model trained on the source domain and applied directly to an unlabeled target domain generalizes poorly and suffers a large performance drop on the target domain. Our method tries to reduce the dependence on labeled data in the target domain. To make full use of cheap unlabeled data, recently published approaches are mainly based on learning distance metrics or subspaces [20–26]. However, their performance still relies mainly on substantial labeled samples. Without labeled data, it is difficult to model the large differences across cameras. This leads to a deviation between the original feature spaces of different camera views, which reduces the accuracy of cross-camera person matching. Existing unsupervised learning models generally ignore this deviation and process image samples from different camera views in the same manner.

In addition, recent research on face identification shows that clustering unlabeled faces can improve identification performance [27–29]. However, few approaches focus on effective clustering for person ReID, especially on large-scale datasets. To utilize unlabeled data effectively, we need a clustering algorithm that can handle the complex topology found in practical applications. Traditional clustering methods such as K-means and spectral clustering are based on simple assumptions. These methods lack the ability to cluster complex topological structures and tend to produce noisy clusters, especially on large-scale image training sets collected in real scenes. This limits the improvement of performance. In this paper, we propose an unsupervised deep learning framework that learns a shared space to reduce the deviation across camera views. A learnable clustering method based on a graph convolutional network is proposed. The model aims to capture the general patterns in unlabeled pedestrian clusters using the expressive power of the graph convolutional network. Extensive experiments indicate that our method can significantly improve the accuracy of person clustering and further improve the performance of person ReID.

2 Related work

2.1 Unsupervised person re-identification

To take direct advantage of unlabeled data for person ReID, several unsupervised person ReID models have been proposed [23–25, 30–34]. Peng et al. [23] extend the dataset by constructing a multi-task dictionary to learn a feature representation biased toward the target dataset. Yu et al. [24] learn asymmetric metrics through asymmetric clustering of person images, an unsupervised approach based on asymmetric metric learning. To utilize unlabeled data across domains, Lv et al. [25] transfer the model trained with labels in the source domain to the unlabeled target domain, learn the spatio-temporal patterns of persons, and combine them with visual features in an unsupervised model to improve the classifier. Li et al. [26] use trajectory information to extract weak labels and train the model based on trajectory correlation between cameras. Deng et al. [30] use CycleGAN to learn the translation between the source domain and the target domain, improving overall performance by preserving the category information of persons. Because images in existing datasets lack illumination diversity and annotating all lighting conditions is costly, Bak et al. [31] use HDR and other environment rendering methods to model real indoor and outdoor lighting conditions, and construct a synthetic person ReID dataset containing hundreds of lighting conditions. However, the virtual 3D character models generated by software are somewhat distorted, and it is difficult to generalize them to real application scenarios. Zhou et al. [32] perform local metric learning on negative samples, reducing the need for a large number of positive sample pairs, which is a data augmentation method. Wang et al. [33] transfer the labeled information in the source dataset and jointly learn the attribute information and identity labels of the target domain to achieve unsupervised learning in the target domain. Qi et al. [34] employ the graph Laplacian to project the images from different camera views in each domain into a shared subspace.

To reduce the impact of cross-domain distribution variation, several recent methods based on dictionary learning have been proposed and have achieved significant progress [35–38]. Peng et al. [35] propose a dictionary learning model which decomposes the dictionary space into parts with corresponding semantics. Cross-view dictionary learning models are used to solve the multi-view learning problem [36–38]. However, the above methods focus only on the overall cross-domain differences and do not explicitly take into account the deviation among the original feature spaces of different camera views. Existing unsupervised approaches treat samples from different views in the same way, and thus the effects of view bias are ignored. To solve these problems, this paper proposes learning a shared space, into which the original training data are projected, to reduce the deviation among camera views.

2.2 Clustering

Clustering is one of the basic unsupervised learning methods in person ReID. Most existing clustering methods are unsupervised. Due to differences in scene, light intensity, season, person age span and camera parameters, pedestrian clusters vary significantly in size, shape and density. Pedestrian clustering provides a way to utilize limited unlabeled data. However, traditional methods make rigid assumptions about the data distribution, and the complex distribution of pedestrian representations makes them unsuitable. Clustering is generally used to estimate labels for unlabeled data: the unlabeled data are clustered into pseudo classes, and the model is then trained with the initial labeled data and selected pseudo-labeled data, which are treated like labeled data in supervised learning [20]. However, the model trained only with the initial labeled data is too weak, and the amount of reliable pseudo-labeled data is small. Therefore, data selection introduces many wrong training samples, and the performance of a network trained on a large number of erroneous samples is limited.

Hierarchical clustering based on unsupervised learning is robust in grouping data with complex distributions. Fan et al. [39] improve the performance of the model on the target dataset by K-means clustering and iterative training on the unlabeled source dataset. Lin et al. [40] propose a bottom-up clustering method that jointly optimizes a convolutional neural network and the relationship among unlabeled samples to solve the unsupervised person ReID problem. In contrast, Yang et al. [27] propose a top-down face clustering method. Lin et al. [41] design a nearest-neighbor similarity measure for data samples based on linear SVMs. Zhan et al. [42] train a classifier to aggregate information and obtain clusters through connected components.

However, the above work focuses mainly on designing new similarity measures while still relying on traditional clustering methods. Traditional clustering methods, such as K-means, spectral clustering and hierarchical clustering, rely on strict assumptions about the data distribution, lack the ability to cluster complex structures, and easily produce noisy clusters. For example, K-means assumes that each cluster has a single center, and spectral clustering requires clusters of similar size. However, due to dramatic changes in factors such as body pose and illumination, clusters of pedestrian images vary greatly in size, shape and density. The complex distribution of such features makes traditional clustering algorithms unsuitable. Especially for large-scale person image sets from real application scenarios, this problem limits the improvement of performance. In this paper, we propose a clustering method based on a graph convolutional network to learn the complex feature topology.

Fig.2

The framework of our proposed network.

2.3 Graph convolutional network (GCN)

In person ReID problems, cross-view samples can be organized into a graph structure. Existing work has shown the advantages of GCNs, such as their strong capability of modeling complex input structures [43–47]. Traditional clustering methods lack the ability to cluster complex structures and produce noisy clusters, especially on large-scale datasets, which severely limits performance. To exploit unlabeled pedestrian data effectively, we need to learn how to cluster from data. We employ the expressive power of the graph convolutional network to capture the representation of pedestrian sample clusters and leverage it to classify the unlabeled data. However, there is still little research on using graph convolutional neural networks to solve person ReID. A graph convolutional network stacks multiple layers, each with its own parameters, whereas a graph neural network is solved iteratively, with parameters shared across layers.

Zhao et al. [43] apply a semantic graph convolutional network to three-dimensional human pose regression, completing the three-dimensional shape from known partial points. Chen et al. [44] apply the graph convolutional network to the more difficult task of multi-label image classification and build a directed graph over the target labels. Each node in the graph is the embedded feature representation of a label, and the graph convolution maps this label graph to the target classifiers. Gao et al. [45] apply the graph convolutional network to visual tracking and use the context of the frame to learn adaptive features for target localization. Li et al. [46] apply graph convolution to learn spatio-temporal features for action recognition. Ma et al. [47] apply graph convolutional adversarial networks to jointly model data structures, domain labels, and class labels for unsupervised domain adaptation. In this paper, we adopt a graph convolutional network to cluster the cross-view image features extracted from the shared space.

3 Methodology

3.1 The framework of our proposed network

In this section, we introduce in detail an unsupervised person ReID learning framework based on graph convolutional clustering. Even in the unlabeled target domain dataset, the camera from which each image sample was acquired is known when the original unlabeled pedestrian images are collected. This camera number can therefore be used as supervisory information for mapping the different views in the original feature space to the shared feature space. Our unsupervised person ReID method based on graph convolutional clustering is shown in Fig. 2. First, the model uses a metric learning method to learn a shared space for the different views, which reduces the feature deviation between cross-camera pedestrian samples with the same ID. The model learns a transformation that maps source samples into a shared space in which different persons are better separated. Second, the basic deep neural network model is initialized on the existing labeled dataset. Third, this original model is used to extract image features from the unlabeled dataset, and a graph convolutional network is used to cluster the cross-view image features extracted from the shared space. The clustering results constitute the training set. Finally, the training set composed of clusters is used to fine-tune the initial model, and the result is stored as the new original model. The fine-tuned model is then used to extract pedestrian sample features again, and clustering and fine-tuning are repeated until the number of reliable samples in the training set becomes stable. The fine-tuning of the model and the correction of the clustering adopt the diversity self-paced learning method proposed in [48].
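
The iterative procedure described above can be summarized as a control-flow sketch. The callables extract_features, cluster and fine_tune, the tolerance tol and the cap max_rounds are hypothetical placeholders introduced only for illustration; only the loop structure (extract features, cluster, fine-tune, stop once the number of reliable samples stabilizes) follows the text.

```python
from typing import Callable, List, Sequence, Tuple


def iterative_cluster_finetune(
    model,
    unlabeled_images: Sequence,
    extract_features: Callable,  # hypothetical: (model, images) -> features
    cluster: Callable,           # hypothetical: features -> (reliable indices, pseudo labels)
    fine_tune: Callable,         # hypothetical: (model, images, pseudo labels) -> model
    tol: int = 0,
    max_rounds: int = 20,
) -> Tuple[object, List[int]]:
    """Alternate clustering and fine-tuning until the number of reliable
    samples changes by at most `tol` between two consecutive rounds."""
    previous_count = -1
    history: List[int] = []
    for _ in range(max_rounds):
        features = extract_features(model, unlabeled_images)
        reliable_idx, pseudo_labels = cluster(features)
        history.append(len(reliable_idx))
        # Stop when the size of the reliable training set has stabilized.
        if previous_count >= 0 and abs(len(reliable_idx) - previous_count) <= tol:
            break
        selected = [unlabeled_images[i] for i in reliable_idx]
        model = fine_tune(model, selected, pseudo_labels)
        previous_count = len(reliable_idx)
    return model, history
```

The returned history of reliable-sample counts makes it easy to check whether the stopping criterion or the round cap ended the loop.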

3.2 Shared space learning

In unsupervised ReID scenarios, clustering is susceptible to interference from cross-view bias. Samples from the same view are easier to cluster. Due to the deviation among views, samples with the same ID from different views do not easily form clusters. Our model reduces the deviation between images of the same ID under different cameras by learning a shared space, thus alleviating the inaccurate clustering caused by cross-view deviation.

Suppose there are C camera views, each with $N_{c}$ (c = 1, . . . , C) person images, so that the training set contains $N = N_{1} + . . . + N_{C}$ person images. The samples in the source feature space are $X = [x_{1}^{1}, . . ., x_{N_{C}}^{C}] \in ℝ^{S \times N}$ , where $x_{i}^{c}$ (i = 1, . . . , $N_{c}$ ; c = 1, . . . , C) is the S-dimensional representation of the i-th image from the c-th camera view. The purpose of our model is to learn mappings $U^{1}, . . ., U^{C}$ with $U^{c} \in ℝ^{S \times T}$ , so that each feature representation $x_{i}^{c}$ is mapped from the original space $ℝ^{S}$ to the shared space $ℝ^{T}$ by $U^{cT} x_{i}^{c}$ . This alleviates the negative impact of different camera views on feature clustering, separating the centers of different classes as much as possible across views while narrowing the intra-class differences. We use the objective function of the K-means-based clustering model proposed in [24], shown in Equation 1. K is the number of clusters, $G_{k}$ is the set of samples assigned to the k-th cluster, $g_{k}$ is the k-th cluster center, and c is the index of the camera view. $\Sigma^{c} = X^{c} X^{cT} / N_{c} + I$ , where $X^{c}$ collects the samples from the c-th view and I is the identity matrix, which avoids singularity of the covariance matrix. Each sample feature $x_{i}^{c}$ is transformed by the mapping $U^{c}$ of its own camera view, while the orthogonal constraints prevent the model from collapsing to a zero matrix. δ is the regularization weight across camera views, and ∥ · ∥ _F denotes the Frobenius norm of a matrix.

$\min_{U^{c}} L_{s} = \frac{1}{N} \sum_{k = 1}^{K} \sum_{i \in G_{k}} {∥ U^{cT} x_{i}^{c} - g_{k} ∥}^{2} + δ \sum_{c \neq c^{'}} {∥ U^{c} - U^{c^{'}} ∥}_{F}^{2} \quad \mathrm{s.t.} \quad U^{cT} \Sigma^{c} U^{c} = I$ (1)

We formulate our idea as the optimization problem in Equation (1). The model learns mappings from the different camera views to the shared space, reducing the feature deviation caused by cross-view differences.
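
To make the two terms of Equation (1) concrete, the following plain-Python sketch evaluates the loss for given projections and cluster assignments. It is an illustration under stated assumptions, not the authors' optimizer: the cluster centers are taken as the means of the projected members (the usual K-means choice), the regularizer sums over ordered pairs of distinct views, and the constraint involving the covariance matrices is not enforced. All function and argument names are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def project(U: Matrix, x: Vector) -> Vector:
    """Compute U^T x for an S x T matrix U and an S-dimensional vector x."""
    return [sum(U[s][t] * x[s] for s in range(len(x))) for t in range(len(U[0]))]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((u - v) ** 2 for u, v in zip(a, b))


def frobenius_sq(A: Matrix, B: Matrix) -> float:
    """Squared Frobenius norm of A - B."""
    return sum(sq_dist(ra, rb) for ra, rb in zip(A, B))


def shared_space_loss(
    X: Sequence[Vector],     # N samples in the original S-dimensional space
    cams: Sequence[int],     # camera index c of every sample
    labels: Sequence[int],   # cluster index k of every sample
    U: Dict[int, Matrix],    # one S x T projection per camera view
    delta: float,            # weight of the cross-view regularizer
) -> float:
    n = len(X)
    projected = [project(U[c], x) for x, c in zip(X, cams)]
    # Assumption: each center g_k is the mean of the projected cluster members.
    members: Dict[int, List[Vector]] = {}
    for p, k in zip(projected, labels):
        members.setdefault(k, []).append(p)
    centers = {k: [sum(col) / len(ps) for col in zip(*ps)] for k, ps in members.items()}
    data_term = sum(sq_dist(p, centers[k]) for p, k in zip(projected, labels)) / n
    # Penalize differences between the projections of distinct views (ordered pairs).
    reg = sum(frobenius_sq(U[a], U[b]) for a in U for b in U if a != b)
    return data_term + delta * reg
```

With identical projections for all views the regularizer vanishes and the loss reduces to the mean squared distance to the cluster centers in the shared space.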

3.3 Graph convolutional learning

Our model uses a metric learning method to learn a shared space across views, which reduces the feature deviation of samples with the same ID. However, the traditional clustering method it relies on makes assumptions about the feature distribution and is susceptible to the complex distribution of sample feature representations, so its assumptions about the data distribution may be false. Link-based clustering methods can achieve higher accuracy without any assumptions about data distribution [49]. Current link-based clustering generally uses traditional threshold metrics to calculate the link likelihood. Wang et al. [28] propose using the context of nodes to predict whether a node and its neighbors belong to the same category. Compared with traditional methods, this link-based clustering method considers important information in the context of samples and overcomes the difficulty of selecting fixed thresholds in metric learning methods.

The valuable information in the context of a node is unstructured, graph-structured data. Thus, it cannot be processed by a traditional discrete convolutional neural network, which operates on image features in Euclidean space. A graph convolutional neural network is more suitable for data with graph topology. We adopt the graph convolution network (GCN) of [28] to reduce the clustering task to link prediction. The model predicts whether an anchor point and each of its nearest neighbors share the same label. If the labels are the same, the two nodes are linked. To represent the local context information, a node pair is constructed between each anchor point and each of its neighbors. Finally, the GCN is trained on this task and outputs a set of link likelihoods between nodes in the graph, that is, a set of weighted edges whose weights are the link likelihoods. The final clustering results are obtained by merging the linked nodes.
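
The last step, merging linked nodes into clusters, can be illustrated with a small union-find sketch. The single fixed threshold and the function name are assumptions made for illustration; the text only states that linked nodes are combined, and the merging strategy actually used with the GCN of [28] may differ.

```python
from typing import Dict, Iterable, List, Tuple


def merge_linked_nodes(
    n: int,
    edges: Iterable[Tuple[int, int, float]],  # (node i, node j, link likelihood)
    threshold: float = 0.5,                   # assumed cut-off, not from the paper
) -> List[int]:
    """Return one cluster label per node by merging every pair of nodes
    whose predicted link likelihood is at least `threshold`."""
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps trees shallow
            i = parent[i]
        return i

    for i, j, prob in edges:
        if prob >= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Relabel the roots as consecutive cluster ids 0, 1, 2, ...
    relabel: Dict[int, int] = {}
    return [relabel.setdefault(find(i), len(relabel)) for i in range(n)]
```

Nodes connected through a chain of accepted links end up in the same cluster even if they are not direct neighbors of each other.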

For the sample features mapped to the shared space, $P = [p_{1}, . . ., p_{N}]^{T} \in ℝ^{N \times T}$ , N is the number of training sample images and T is the feature dimension. The goal of clustering is to assign a label to each sample so that samples with the same label form a cluster; for this, it is only necessary to compute the link likelihood between each instance and its k nearest neighbors. A graph convolution layer takes the node feature matrix and the adjacency matrix A as inputs and outputs the transformed node feature matrix Y. In the first layer, the input node feature matrix is the original node feature matrix, $H^{0} = P$ . The goal of the graph convolution network is to learn a function f (· , ·) on a graph topology G. The inputs are the feature matrix $H^{u} \in ℝ^{n \times t_{in}}$ and the corresponding adjacency matrix $A \in ℝ^{n \times n}$ , and the output $H^{u + 1} \in ℝ^{n \times t_{out}}$ updates the node features. Here n denotes the number of nodes in the graph, and t_in and t_out denote the feature dimensions of the input and output nodes respectively. We set t_out ≤ t_in, so that the graph convolution reduces the feature dimension. Each hidden layer of such a convolution can be represented by Equation 2, with u = 0, 1, . . . , U-1.

$H^{u + 1} = f (H^{u}, A)$ (2)
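
As an illustration of how an adjacency matrix over the k nearest neighbors might be built from the shared-space features, consider the following sketch. It uses a single global graph with Euclidean distance and no self-loops; these choices and the function name are assumptions, since the text does not specify how A is constructed.

```python
import math
from typing import List


def knn_adjacency(P: List[List[float]], k: int) -> List[List[int]]:
    """Binary adjacency matrix with A[i][j] = 1 when node j is one of the
    k nearest neighbors of node i (Euclidean distance, no self-loops)."""
    n = len(P)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        # Sort the other nodes by distance; ties are broken by node index.
        neighbors = sorted((math.dist(P[i], P[j]), j) for j in range(n) if j != i)
        for _, j in neighbors[:k]:
            A[i][j] = 1
    return A
```

The resulting matrix is generally not symmetric, because the nearest-neighbor relation is not mutual.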

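After applying the convolution operation, the function f (· , ·) can be expressed as Equation 3, where $W^{u}$ is a learnable graph convolutional weight matrix, $\hat{A} \in ℝ^{n \times n}$ is the row-normalized adjacency matrix A, and δ (·) is the nonlinear activation function LeakyReLU [50] (not to be confused with the regularization weight δ in Equation 1). For Equation 3 as written, $W^{u} \in ℝ^{t_{in} \times t_{out}}$ ; its size becomes 2t_in×t_out when the input features $H^{u}$ are concatenated with the aggregated features $\hat{A} H^{u}$ along the feature dimension before the transformation, as in [28].

$H^{u + 1} = δ (\hat{A} H^{u} W^{u})$ (3)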
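
A minimal plain-Python sketch of one layer of Equation 3 is given below. It implements the form written in the equation, with a weight matrix of size t_in × t_out; the negative slope of LeakyReLU is an assumed default because the text does not state it, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def row_normalize(A: Matrix) -> Matrix:
    """Divide every row by its sum; all-zero rows are left as zeros."""
    out = []
    for row in A:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out


def gcn_layer(H: Matrix, A: Matrix, W: Matrix, slope: float = 0.01) -> Matrix:
    """One layer H_next = LeakyReLU(A_hat H W), with A_hat the row-normalized
    adjacency matrix. `slope` is an assumed value, not taken from the paper."""
    aggregated = matmul(row_normalize(A), H)  # average features over neighbors
    transformed = matmul(aggregated, W)       # map t_in features to t_out features
    return [[z if z > 0 else slope * z for z in row] for row in transformed]
```

Row normalization turns the aggregation into an average over the neighbors marked in each row of A, so the scale of the output does not grow with the number of neighbors.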

Recent part-based methods have achieved much better performance for person ReID than learning only full-body discriminative features [11, 13]. In this paper, spatial local information of different body parts is used to represent pedestrian images. In the first layer of the graph structure, the input node feature matrix is the original node feature matrix $P = H^{0} \in ℝ^{n \times t_{0}}$ , and the different local features are represented by P = [P¹||P²||P³], where the operator “||” denotes matrix concatenation along the feature dimension.

The model passes the node features through multiple hidden GCN layers, which summarize the basic information about each node's neighbors. Then, the input node feature P is concatenated with the clustering information along the feature dimension to obtain a set of edges weighted by link likelihood. During fine-tuning, noisy clustering results make the latent variables prone to falling into bad local optima or oscillating. To alleviate this, the diversity self-paced learning method (SPLD) proposed in [48] encourages self-paced learning to select samples from multiple subsets, masking noisy data while avoiding an imbalance in the number of samples selected from different components in each iteration. Finally, the cross-entropy loss function is used to train the graph convolution network.
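
The sample selection step of diversity self-paced learning can be sketched as follows. The rule used here, ranking the samples of each group by loss and accepting the sample of rank i when its loss is below λ + γ / (√i + √(i−1)), is the selection rule commonly stated for SPLD; treating clusters as the groups, and the names lam and gamma, are assumptions made for illustration.

```python
import math
from typing import Dict, List


def spld_select(losses: List[float], groups: List[int], lam: float, gamma: float) -> List[int]:
    """Return the sorted indices of the samples selected for the next round.
    Within each group the threshold decreases with the rank, which favors
    picking easy samples from many groups over many samples from one group."""
    by_group: Dict[int, List[int]] = {}
    for idx, g in enumerate(groups):
        by_group.setdefault(g, []).append(idx)

    selected: List[int] = []
    for members in by_group.values():
        ranked = sorted(members, key=lambda i: losses[i])  # easiest first
        for rank, i in enumerate(ranked, start=1):
            threshold = lam + gamma / (math.sqrt(rank) + math.sqrt(rank - 1))
            if losses[i] < threshold:
                selected.append(i)
    return sorted(selected)
```

Setting gamma to zero removes the diversity term and recovers plain self-paced selection with the single threshold lam.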

Table 1

Performance comparison of our method with recent unsupervised person ReID methods

Method	DukeMTMC-reID⟶Market1501					Market1501⟶DukeMTMC-reID
	rank-1	rank-5	rank-10	rank-20	mAP	rank-1	rank-5	rank-10	rank-20	mAP
PTGAN	38.6	–	66.1	–	–	27.4	–	50.7	–	–
PUL	44.7	59.1	65.6	71.7	20.1	30.4	44.5	50.7	56.0	16.4
SPGAN	51.5	70.1	76.8	–	22.8	41.1	56.6	63.0	69.6	22.3
CAMEL	54.5	–	–	–	26.3	–	–	–	–	–
TJ-AIDL	58.2	74.8	81.1	86.5	26.5	44.3	59.6	65.0	70.0	23.0
HHL	62.2	78.8	84.0	–	31.4	46.9	61.0	66.7	–	27.2
Ours	64.1	80.6	85.2	88.4	32.3	48.4	62.3	67.5	71.6	27.9
Method	DukeMTMC-reID⟶MSMT17					Market1501⟶MSMT17
	rank-1	rank-5	rank-10	rank-20	mAP	rank-1	rank-5	rank-10	rank-20	mAP
PTGAN	18.0	–	36.4	–	6.2	17.7	–	35.9	–	6.0
PUL	20.2	–	38.8	–	7.1	20.6	–	37.3	–	6.3
HHL	22.1	–	40.5	–	7.6	21.4	–	38.1	–	7.0
Ours	23.4	34.7	41.1	46.3	8.1	22.5	33.1	39.8	44.7	7.6
Method	CUHK03⟶Market1501					CUHK03⟶DukeMTMC-reID
	rank-1	rank-5	rank-10	rank-20	mAP	rank-1	rank-5	rank-10	rank-20	mAP
PTGAN	31.5	–	60.2	–	–	17.6	–	38.5	–	–
PUL	41.9	57.3	64.3	70.5	18.0	23.0	34.0	39.5	44.2	12.0
SPGAN	42.3	–	–	–	19.0	–	–	–	–	–
HHL	56.8	74.7	81.4	86.3	29.8	42.7	57.5	64.2	69.1	23.4
Ours	58.5	76.3	83.5	88.4	31.1	43.5	59.3	66.4	72.2	23.9

4 Results and discussion

4.1 Datasets

We use four benchmark person ReID datasets, MSMT17 [14], Market1501 [51], DukeMTMC-reID [52] and CUHK03 [53], to evaluate the performance of our model. MSMT17 is the latest and largest person ReID dataset, with a total of 4,101 pedestrian IDs. Its images come from 15 cameras inside and outside the campus buildings of Peking University. Market1501 was collected on the campus of Tsinghua University in summer. It contains 1,501 pedestrian IDs in total, with an average of 17.2 training images per person. DukeMTMC-reID was collected at Duke University in winter, with a total of 1,812 pedestrian IDs. It contains many cases in which different persons wear similar clothes, which makes the dataset challenging. CUHK03 was captured by two cameras with non-overlapping coverage.

4.2 Implementation details

We evaluate with the commonly used mAP (mean average precision) and rank-i metrics and compare the results with those of existing methods. The model searches the gallery for each probe person image by computing the distances between the probe features and the gallery samples and ranking them. Rank-i is the probability that a correct match for a query sample appears among the top i entries of the ranking list. We adopt ResNet-50 pre-trained on ImageNet as the backbone network. Since the training and test sets are assumed to be independent and identically distributed, a batch normalization layer is added before the fully connected classification layer, so that the inputs of each layer keep the same distribution during training. All training and test images are resized to 224×128 pixels. We apply common data augmentation strategies, namely random horizontal flipping with a probability of 0.5, zero-padding by 10 pixels and normalization, together with fine-tuning, to alleviate overfitting. In the experiments, the dropout rate is set to 0.5 and the batch size to 128. The SGD solver and the Adam optimizer are used to train the network model. We set the initial learning rate to 0.0001 and reduce it to 0.00001 at the 50th iteration, training until the model converges.
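
The two evaluation metrics can be illustrated with a short plain-Python sketch that computes rank-i and mAP from a probe-to-gallery distance matrix. It is a simplified illustration: the function name is hypothetical, and the filtering of gallery images taken by the same camera as the probe, which standard ReID protocols usually apply, is omitted.

```python
from typing import Dict, List, Sequence, Tuple


def evaluate(
    dist: List[List[float]],   # dist[q][g]: distance between probe q and gallery g
    q_ids: Sequence[int],      # identity of every probe
    g_ids: Sequence[int],      # identity of every gallery sample
    ranks: Sequence[int] = (1, 5, 10, 20),
) -> Tuple[Dict[int, float], float]:
    """Return ({i: rank-i matching rate}, mAP) over all probes that have at
    least one correct match in the gallery."""
    hits = {k: 0 for k in ranks}
    average_precisions: List[float] = []
    for row, qid in zip(dist, q_ids):
        order = sorted(range(len(g_ids)), key=lambda j: row[j])  # nearest first
        matches = [g_ids[j] == qid for j in order]
        if not any(matches):
            continue  # a probe without any true match cannot be scored
        first_hit = matches.index(True)
        for k in ranks:
            if first_hit < k:
                hits[k] += 1
        # Average precision: mean of the precision at every correct match.
        n_correct, precisions = 0, []
        for position, is_match in enumerate(matches, start=1):
            if is_match:
                n_correct += 1
                precisions.append(n_correct / position)
        average_precisions.append(sum(precisions) / len(precisions))
    if not average_precisions:
        raise ValueError("no probe has a correct match in the gallery")
    n_valid = len(average_precisions)
    cmc = {k: hits[k] / n_valid for k in ranks}
    return cmc, sum(average_precisions) / n_valid
```

Rank-i only depends on the position of the first correct match, whereas mAP also rewards ranking the remaining correct matches early.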

4.3 Comparison with the state-of-the-art

We compare the proposed method with current unsupervised person ReID methods. The performance comparison using the MSMT17, Market1501, DukeMTMC-reID and CUHK03 datasets as source or target domain is shown in Table 1. “DukeMTMC-reID⟶Market1501” means that the model is trained on the labeled source dataset DukeMTMC-reID and tested on the target dataset Market1501. The rank-1 and mAP accuracy of the proposed method exceeds that of the compared methods from the recent literature. On the target dataset MSMT17 (DukeMTMC-reID⟶MSMT17), the rank-1 accuracy is 5.4 percentage points higher than that of PTGAN (23.4% versus 18.0%), and according to the values in Table 1 the mAP is 1.9 percentage points higher (8.1% versus 6.2%).

Table 2
Influence of shared space learning and graph convolutional learning on person ReID task

Method		DukeMTMC-reID⟶Market1501					Market1501⟶DukeMTMC-reID
		r1	r5	r10	r20	mAP	r1	r5	r10	r20	mAP
Supervised learning		85.6	93.3	95.6	97.1	65.8	72.3	84.1	88.1	90.9	51.8
Unsupervised learning	w./o. SHL	50.4	68.1	75.3	81.3	24.1	36.3	52.6	58.5	65.7	21.7
	w./o. GCL	58.3	74.9	79.6	83.1	26.8	42.1	56.9	62.1	68.4	23.4
	Ours w./o. SPLD	59.7	75.4	81.5	84.2	27.4	43.6	57.4	63.3	68.1	24.2
	Ours	64.1	80.6	85.2	88.4	32.3	48.4	62.3	67.5	71.6	27.9
Method		CUHK03⟶Market1501					CUHK03⟶DukeMTMC-reID
		r1	r5	r10	r20	mAP	r1	r5	r10	r20	mAP
Supervised learning		85.6	93.3	95.6	97.1	65.8	72.3	84.1	88.1	90.9	51.8
Unsupervised learning	w./o. SHL	47.3	65.1	72.3	78.9	23.4	29.1	44.1	51.6	57.1	15.3
	w./o. GCL	52.4	70.6	76.9	82.6	24.3	35.3	51.6	59.3	64.3	16.7
	Ours w./o. SPLD	53.9	71.7	78.4	83.8	25.7	36.9	53.2	61.4	66.4	17.2
	Ours	58.5	76.3	83.5	88.4	31.1	43.5	59.3	66.4	72.2	23.9
Method		Market1501⟶CUHK03					DukeMTMC-reID⟶CUHK03
		r1	r5	r10	r20	mAP	r1	r5	r10	r20	mAP
Supervised learning		82.4	95.5	96.2	97.6	68.5	82.4	95.5	96.2	97.6	68.5
Unsupervised learning	w./o. SHL	34.7	60.8	69.3	76.1	15.8	36.2	58.8	63.4	70.9	13.8
	w./o. GCL	41.6	66.9	75.1	80.3	20.4	40.6	64.9	69.6	76.2	19.3
	Ours w./o. SPLD	42.7	67.2	76.8	81.6	22.5	42.1	66.2	71.0	75.8	20.1
	Ours	47.1	73.4	81.7	86.1	26.1	46.2	70.6	75.3	80.7	24.1
Method		DukeMTMC-reID⟶MSMT17					Market1501⟶MSMT17
		r1	r5	r10	r20	mAP	r1	r5	r10	r20	mAP
Supervised learning		61.4	76.8	81.6	85.9	34.0	61.4	76.8	81.6	85.9	34.0
Unsupervised learning	w./o. SHL	10.6	24.3	30.8	35.1	2.1	9.8	23.0	28.8	33.7	1.9
	w./o. GCL	16.4	28.9	35.4	40.6	5.6	15.6	26.4	34.2	38.7	4.9
	Ours w./o. SPLD	17.1	28.4	36.8	42.5	6.9	16.8	27.9	35.3	41.2	5.5
	Ours	23.4	34.7	41.1	46.3	8.1	22.5	33.1	39.8	44.7	7.6

4.4 Discussion and analysis

In this section, we discuss the effect of shared space learning and of the graph convolution network on the person ReID datasets. In Table 2, “Supervised learning” indicates that both training and testing of the model are performed on the target dataset; its accuracy reaches the level of current person ReID baselines. In the unsupervised learning rows, the training set is the source dataset before the arrow, and the test set is the target dataset after the arrow. “w./o. SHL” denotes the identification performance of the model without shared space learning. “w./o. GCL” denotes the model without the graph convolution network; in this case, the clustering process uses a traditional threshold measure for link prediction. “Ours w./o. SPLD” denotes the performance of the model without diversity self-paced learning in the clustering process. “Ours” denotes the identification performance of the full proposed model under unsupervised learning. The results show that the full model is superior to each of the ablated variants. As Table 2 shows, our method attains a rank-1 matching rate of 22.5% on the target dataset MSMT17 (Market1501⟶MSMT17), which is 6.9 and 5.7 percentage points higher than “w./o. GCL” and “Ours w./o. SPLD” respectively. The experimental results show that learning a shared space can reduce the deviation among views. Applying a graph convolutional network, which uses the context information in the graph structure to link sample clusters, has a positive impact on the performance of person ReID.

5 Conclusion

In this paper, we have proposed a novel unsupervised metric learning algorithm for person ReID that reduces the deviation between cross-camera views by learning a shared space, and thereby improves cross-view matching. A graph convolutional network is employed to cluster the cross-view features extracted from the shared space, performing linked person clustering by using context information in the graph structure. Experimental results on four large-scale public person ReID datasets are encouraging and show that the proposed method generally outperforms existing ones, demonstrating its effectiveness.
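As an illustration of linkage-based clustering, the sketch below merges samples whose pairwise link score passes a threshold, using a union-find structure. It is a minimal sketch and not the implementation used in this paper: the link scores are assumed to come from some link predictor (a graph convolution network, or a plain similarity measure in a threshold-only variant), and the function name `cluster_by_links`, the example scores, and the threshold of 0.5 are hypothetical.

```python
from collections import defaultdict


def cluster_by_links(num_samples, scored_pairs, threshold=0.5):
    """Group samples into clusters by merging pairs whose link score
    is at least `threshold` (connected components via union-find).

    scored_pairs: iterable of (i, j, score) with 0 <= i, j < num_samples.
    The scores are assumed to be produced by an external link predictor.
    """
    parent = list(range(num_samples))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, score in scored_pairs:
        if score >= threshold:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    clusters = defaultdict(list)
    for i in range(num_samples):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


# Hypothetical link scores among six samples.
example_pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.7), (2, 3, 0.2)]
clusters = cluster_by_links(6, example_pairs)
print(clusters)  # [[0, 1, 2], [3, 4], [5]]
```

Samples 0 and 2 end up in the same cluster without a direct link because both are linked to sample 1; the low-scoring pair (2, 3) is rejected, which keeps the two groups apart.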

6 Funding

The authors acknowledge the Natural Science Foundation Project of Huaiyin Institute of Technology (Grant: 19HGZ004), the Natural Science Research of Jiangsu Higher Education Institutions of China (Grant: 18KJA520002), the Natural Science Foundation of Jiangsu Province (Grant: BK20171267), the Fifth Issue 333 High-Level Talent Training Project of Jiangsu Province (Grant: BRA2018333), the Qing Lan Project of Jiangsu Province, and the Horizontal Project (Grants: Z421A19830, Z421A19888).

References

1. Cheng, Gong, Zhou, Wang, Zheng, Person re-identification by multi-channel parts-based CNN with improved triplet loss function, In Proceedings Computer Vision Pattern Recognition 12(9) (2016), 1335–1344.
2. Ahmed, Jones, Marks, An improved deep learning architecture for person re-identification, In Proceedings Computer Vision Pattern Recognition 10(14) (2015), 3908–3916.
3. Matsukawa, Suzuki, Person re-identification using CNN features learned from combination of attributes, In Proceedings International Conference Pattern Recognition 12(4) (2016), 2428–2433.
4. Chen, Chen, Zhang, Huang, A multi-task deep network for person re-identification, In Proceedings AAAI Conference Artificial Intelligence 2(4) (2017), 3988–3994.
5. Yang, Wen, Lyu, Li, Unsupervised learning of multi-level descriptors for person re-identification, In Proceedings AAAI Conference Artificial Intelligence 2(4) (2017), 4306–4312.
6. Schumann, Stiefelhagen, Person re-identification by deep learning attribute-complementary information, In Proceedings Computer Vision Pattern Recognition 8(22) (2017), 1435–1443.
7. Varior R.R., Haloi, Wang, Gated siamese convolutional neural network architecture for human re-identification, In Proceedings European Conference on Computer Vision 10(11) (2016), 791–808.
8. , Zhu, Gong, Person re-identification by deep joint learning of multi-loss classification, In Proceedings International Joint Conference on Artificial Intelligence 8(19) (2017), 2194–2200.
9. Liu, Feng, Qi, Jiang, Yan, End-to-end comparative attention networks for person re-identification, IEEE Trans Image Processing 26(7) (2017), 3492–3506.
10. Ergys, Carlo, Features for multi-target multi-camera tracking and re-identification, In Proceedings Computer Vision Pattern Recognition 12(14) (2018), 6036–6046.
11. Chen, Chen, Zhang, Huang, Beyond triplet loss: a deep quadruplet network for person re-identification, In Proceedings Computer Vision Pattern Recognition 12(6) (2017), 1320–1329.
12. , Ding, Li, Zhang, Fu, Support neighbor loss for person re-identification, ACM Transactions on Multimedia Computing, Communications and Applications 10(15) (2018), 1492–1500.
13. Zhong, Zheng, Li, Yang, Camera style adaptation for person re-identification, In Proceedings Computer Vision Pattern Recognition 12(14) (2018), 5157–5166.
14. Wei, Zhang, Gao, Tian, Person transfer GAN to bridge domain gap for person re-identification, In Proceedings Computer Vision Pattern Recognition 12(14) (2018), 79–88.
15. Huang, Li, Zhang, Chen, Huang, Adversarially occluded samples for person re-identification, In Computer Vision Pattern Recognition 12(14) (2018), 5098–5107.
16. Zhong, Zheng, Li, Yang, Generalizing a person retrieval model hetero- and homogeneously, European Conference on Computer Vision 9(8) (2018), 176–192.
17. Chang, Huang, Shen, Liang, Yang, Alexander, RCAA: Relational context-aware agents for person search, In European Conference on Computer Vision 9(8) (2018), 86–102.
18. Lan, Zhu, Gong, Person search by multi-scale matching, In European Conference on Computer Vision 9(8) (2018), 553–569.
19. Zheng, Zhang, Sun, Chandraker, Yang, Tian, Person re-identification in the wild, In Computer Vision Pattern Recognition 11(6) (2017), 1367–1376.
20. Huang, Xu, Wu, Zheng, Zhang, Multi-pseudo regularized label for generated data in person re-identification, IEEE Trans Image Processing 28(3) (2019), 1391–1403.
21. Yang, Wang, Hong, Tian, Rui, Enhancing person re-identification in a self-trained subspace, ACM Transactions on Multimedia Computing, Communications and Applications 13(3) (2016), 1–23.
22. , Zheng, Wu, Guo, Gong, Lai, Unsupervised person re-identification by soft multilabel learning, In Computer Vision Pattern Recognition 6(16) (2019), 2148–2157.
23. Peng, Xiang, Wang, Pontil, Gong, Huang, Tian, Unsupervised cross-dataset transfer learning for person re-identification, In Computer Vision Pattern Recognition 12(9) (2016), 1306–1315.
24. , Wu, Zheng, Cross-view asymmetric metric learning for unsupervised person re-identification, In International Conference on Computer Vision 12(22) (2017), 994–1002.
25. , Chen, Li, Yang, Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns, Computer Vision Pattern Recognition 12(14) (2018), 7948–7956.
26. , Zhu, Gong, Unsupervised person re-identification by deep learning tracklet association, In European Conference on Computer Vision 9(8) (2018), 772–788.
27. Yang, Zhan, Chen, Yan, Loy, Lin, Learning to cluster faces on an affinity graph, In Computer Vision Pattern Recognition 6(16) (2019), 2298–2306.
28. Wang, Zheng, Li, Wang, Linkage based face clustering via graph convolution network, In Computer Vision Pattern Recognition 6(16) (2019), 1117–1125.
29. Zhan, Liu, Yan, Lin, Loy C.C., Consensus-driven propagation in massive unlabeled data for face recognition, European Conference on Computer Vision 9(8) (2018), 568–583.
30. Deng, Zheng, Ye, Kang, Yang, Jiao, Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification, In Computer Vision Pattern Recognition 12(14) (2018), 994–1003.
31. Bak, Carr, Lalonde, Domain adaptation through synthesis for unsupervised person re-identification, In European Conference on Computer Vision 9(8) (2018), 193–209.
32. Zhou, Yu, Tang, Wu, Efficient online local metric adaptation via negative samples for person re-identification, In International Conference on Computer Vision 12(22) (2017), 2439–2447.
33. Wang, Zhu, Gong, Li, Transferable joint attribute-identity deep learning for unsupervised person re-identification, Computer Vision Pattern Recognition 12(14) (2018), 2275–2285.
34. , Huo, Fan, Shi, Gao, Unsupervised joint subspace and dictionary learning for enhanced cross-domain person re-identification, IEEE Journal of Selected Topics in Signal Processing 12(6) (2018), 1263–1275.
35. Peng, Tian, Xiang, Wang, Pontil, Huang, Joint semantic and latent attribute modelling for cross-class transfer learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 40(7) (2018), 1625–1638.
36. , Ding, Li, Fu, Discriminative semi-coupled projective dictionary learning for low-resolution person re-identification, AAAI Conference on Artificial Intelligence 2(2) (2018), 2331–2338.
37. , Xu, Zhu, Tao, Yu, Top distance regularized projection and dictionary learning for person re-identification, Information Sciences 502(1) (2019), 472–491.
38. , Shao, Fu, Person re-identification by cross-view multi-level dictionary learning, IEEE Transactions on Pattern Analysis and Machine Intelligence 40(12) (2018), 2963–2977.
39. Fan, Zheng, Yan, Yang, Unsupervised person re-identification: clustering and fine-tuning, ACM Transactions on Multimedia Computing, Communications and Applications 14(4) (2018), 83–92.
40. Lin, Dong, Zheng, Yan, Yang, A bottom-up clustering approach to unsupervised person re-identification, In AAAI Conference Artificial Intelligence 1(27) (2019), 1121–1125.
41. Lin, Chen, Castillo, Chellappa, Deep density clustering of unconstrained faces, In Proceedings Computer Vision Pattern Recognition 12(14) (2018), 8128–8137.
42. Zhan, Liu, Yan, Lin, Loy, Consensus-driven propagation in massive unlabeled data for face recognition, European Conference on Computer Vision 9(8) (2018), 568–583.
43. Zhao, Peng, Tian, Kapadia, Metaxas, Semantic graph convolutional networks for 3D human pose regression, Proceedings Computer Vision Pattern Recognition 6(16) (2019), 3425–3435.
44. Chen, Wei, Wang, Guo, Multi-label image recognition with graph convolutional networks, Computer Vision Pattern Recognition 6(16) (2019), 5177–5186.
45. Gao, Zhang, Xu, Graph convolutional tracking, In Computer Vision Pattern Recognition 6(16) (2019), 4649–4659.
46. , Chen, Zhang, Wang, Tian, Actional-structural graph convolutional networks for skeleton-based action recognition, In Computer Vision Pattern Recognition 6(16) (2019), 3595–3603.
47. , Zhang, Xu, GCAN: Graph convolutional adversarial network for unsupervised domain adaptation, In Computer Vision Pattern Recognition 6(16) (2019), 8266–8276.
48. Jiang, Meng, Yu S.I., Lan, Shan, Hauptmann A.G., Self-paced learning with diversity, In Neural Information Processing Systems 12(8) (2014), 2078–2086.
49. Zhang, Chen, Link prediction based on graph neural networks, In Neural Information Processing Systems 12(2) (2018), 5165–5175.
50. Maas, Hannun, Ng, Rectifier nonlinearities improve neural network acoustic models, International Conference on Machine Learning 28(6) (2013), 1–6.
51. Zheng, Shen, Tian, Wang, Tian, Scalable person re-identification: a benchmark, In International Conference on Computer Vision 2(17) (2015), 1116–1124.
52. Zheng, Zheng, Yang, Unlabeled samples generated by GAN improve the person re-identification baseline in vitro, In International Conference on Computer Vision 12(22) (2017), 3774–3782.
53. , Zhao, Xiao, Wang, DeepReID: deep filter pairing neural network for person re-identification, In Proceedings Computer Vision Pattern Recognition 9(24) (2014), 152–159.