Abstract
At present, supervised person re-identification method achieves high identification performance. However, there are a lot of cross cameras with unlabeled data in the actual application scenarios. The high cost of marking data will greatly reduce the effect of the supervised learning model transferring to other scene domains. Therefore, unsupervised learning of person re-identification becomes more attractive in the real world. In addition, due to changes in camera angle, illumination and posture, the extracted person image representation is generally different in the non-cross camera view, but the existing algorithm ignores the difference among cross camera images under camera parameters and environments. In order to overcome the above problems, we propose unsupervised person re-identification metric learning method. The model learns a shared space to reduce the discrepancy under different cameras. The graph convolution network is further employed to cluster the cross-view image features extracted from the shared space. Our model improves the scalability of pedestrian re-identification in practical application scenarios. Extensive experiments on four large-scale person re-identification public datasets have been conducted to demonstrate the effectiveness of the proposed model.
Introduction
Given one or more target person image, person re-identification (person ReID) aims at matching or retrieving persons across cameras or terminals [1]. Due to the variance of illumination, camera parameters and body pose, person ReID still is a challenging problem under the open-set, as shown in Fig. 1. In most recent works, research focuses on person representation [2–6], metric learning [7–12], and dataset extension [13–16]. Most of the existing person ReID methods focus on either robust feature extraction or distance similarity measurement for comparing features, or a combination of both [17–19].

Person re-identification challenges.
However, the high performance of the existing person ReID methods rely on the availability of large-scale annotated training data. In the supervised person ReID learning method, a large number of people sample pairs in the range of cross-camera need to be manually annotated. In the current person ReID dataset, the number of non-cross-camera training samples for pedestrians in the same ID is insufficient. The lack of label pairs makes learning discriminant features challenging. Due to the high cost of manual annotation, the scalability and practicability of supervised method in real video surveillance applications are greatly restricted when applied to a new scenario. Without the help of labeled data, such as the differences of scene, light intensity, pedestrian age span and camera parameters under cross domain, the model trained on source domain is directly applied to another unlabeled target domain, which has poor generalization ability and a great decline on target domain. Our method tries to reduce the dependence on the labeled data in the target domain. In order to make full use of the limited cheap unlabeled data, the recently published approaches are mainly based on learning distance metrics or subspace strategies [20–26]. However, the performance still mainly relies on substantial labeled samples. Without labeled data, it is difficult to model the large differences among across cameras. This leads to the deviation between the original feature spaces in different camera views, which reduces the accuracy of person matching in the case of cross-camera. Existing unsupervised learning models generally ignore this deviation and process image samples from different camera views in the same manner.
In addition, recent research on face identification shows that the application of unlabeled face clustering method can improve the identification performance [27–29]. However, few approaches focus on clustering effectively on person ReID problems, especially in the large-scale dataset. In order to effectively utilize unlabeled data, we use an effective clustering algorithm to deal with the complex topology in the actual application. However, traditional clustering methods such as K-means and spectral clustering are based on simple assumptions. These methods lack the ability to cluster complex topological structures, and lead to bring noise clustering, especially large-scale image training sets collected in real scenes. This situation limits the improvement of performance. In this paper, we propose an unsupervised deep learning framework which learns a shared space to reduce the deviation under cross-camera view. A learnable clustering method is proposed based on graph convolutional network. The model desire to capture the general mode in unlabeled pedestrian clusters using the powerful of graph convolutional network. Extensive experiments indicate that our method can significantly improve the accuracy of person clustering and further improve the performance of person ReID.
Unsupervised person re-identification
To straightly take advantage of unlabeled data for person ReID, several unsupervised person ReID models have been proposed [23–25, 30–34]. Peng et al. [23] focus on extending the dataset by constructing the multi-task dictionary to learn the feature expression of the target dataset biased. Yu et al. [24] learn asymmetric metrics through asymmetric clustering of person images, which belongs to unsupervised learning based on asymmetric metrics. To utilize the unlabeled data under the cross domain, Lv et al. [25] transfer the labeled model trained in the source domain to the unlabeled target domain, learn the temporal and spatial information of persons, and construct the unsupervised network model based on the temporal and spatial model by combining visual features to improve the classifier effect. Li et al. [26] use the trajectory information to extract weak marks to train the model based on trajectory correlation between cameras. Deng et al. [30] use CycleGAN to improve the overall performance by preserving the category information of persons, focusing on the transformation model between the learning source domain and the target domain. Aiming at the problem that the image in the currently existing dataset lacks illumination diversity and the costly of marking all the lighting conditions, Bak et al. [31] use the HDR and other environmental rendering methods to model the real indoor and outdoor lighting conditions, and construct a synthetic person ReID dataset containing hundreds of lighting conditions. However, this virtual 3D character model generated by software has a certain distortion, and it is difficult to generalize to a real application scenario. Zhou et al. [32] perform local metric learning on negative samples, reducing the need for a large number of positive sample pairs, which is a data enhancement method. Wang et al. [33] transfer the labeled information in the source dataset, and jointly learn the attribute information and identity tags of the target domain to achieve unsupervised learning in the target domain. Qi et al. [34] employ the graph Laplacian to project the images from different camera views in each domain into a shared subspace.
To reduce the impact of cross-domain distribution variation, most recent work based on dictionary learning are proposed and achieved significant progress [35–38]. Peng et al. [35] propose a dictionary learning model which decomposes the dictionary space into parts corresponding to semantic. Cross-view dictionary learning model is used to solve multi-view learning problem [36–38]. However, the above methods only focus on the overall differences under cross domain and does not explicitly take into account the deviation among the original feature spaces under different camera views. Existing unsupervised approaches treat the samples from different views in the same way, and thus the effects of view bias have been ignored. In order to solve the above problems, this paper proposes a method of learning a shared space to reduce the deviation among camera views by projecting original training data.
Clustering
Clustering is one of the basic unsupervised learning methods in person ReID. Most existing clustering methods are unsupervised. Due to various differences in scenes, light intensity, season, person age span and camera parameters, pedestrian clusters vary significantly in size, shape and density. Pedestrian clustering provides a manner to utilize limited unlabeled data. The complex distribution of pedestrian representations makes it unsuitable to apply traditional clustering algorithms. The traditional methods have rigid assumptions on data distribution. It is generally used to estimate a label for unlabeled data, cluster unlabeled data into pseudo classes, and then train the model with initial labeled data and partially selected data with pseudo-label, so that they can be used like labeled data and used for supervised learning [20]. However, the model trained only with the initial label data is too weak, and the reliable pseudo-labeled data is small. Therefore, selecting data will introduce many wrong training samples, and the network constraint performance based on a large number of error training samples is improved.
Hierarchical clustering algorithm based on unsupervised learning is robust in data grouping with complex distribution. Fan et al. [39] improve the performance of the model on the person target dataset by K-means clustering and iterative training of the unlabeled source dataset. Lin et al. [40] propose a bottom-up clustering method to jointly optimize the relationship between convolutional neural networks and unlabeled samples to solve the unsupervised person ReID problem. On the contrary, Yang et al. [27] propose a top-down face clustering method. Lin et al. [41] design the similarity measure of the nearest neighbor based on data samples based on linear SVM. Zhan et al. [42] train the classifier to aggregate information and obtain clustering through connected components.
However, the above work is mainly focused on designing new similarity measures, relying on traditional clustering methods. Traditional clustering methods, such as K-means, spectral clustering and hierarchical clustering relying on strict assumptions of data distribution lack the ability of clustering complex structure and are easy to bring in noise clustering. For example, K-means assumes that each cluster has only a single center, and spectral clustering requires similar clustering scale. However, due to the dramatic changes in body poses such as illumination, the clustering of pedestrian images varies greatly in size, shape and density. The complex distribution of such features makes them unsuitable for applying traditional clustering algorithms. Especially in the case of a large-scale person image of a real application scenario, the above problem limits the improvement of performance. In this paper, we propose a clustering method based on graph convolutional network to learn the complex feature topology.

The framework of our proposed network.
In person ReID problems, the samples of cross view can be organized into a graphical structure. Existing work has shown the advantages of GCNs, such as the strong capability of modeling complex input structures [43–47]. The traditional clustering method lacks the ability of complex clustering structure, which will produce noise clustering, especially in large-scale datasets. This problem severely limits performance. To effectively exploit unlabeled face data, we need to learn how to cluster from data. We employ the expressive power of graph convolutional network to capture the representation in pedestrian samples clusters, and leverage them to classify the unlabeled data. However, there is still few researches on the difficulty of using the graph convolutional neural network to solve the person ReID. Graph convolutional network is multi-layer stacked, and the parameters of different network layers are different. Graph neural network is iterative solution, and the parameters of each layer of network are shared.
Zhao et al. [43] apply the semantic graph convolutional network to the three-dimensional pose regression prediction of human body, and complete the three-dimensional shape of the known partial scatter. Chen et al. [44] apply the graph convolutional network to the more difficult image multi-label classification and identification, and construct a directed graph between multiple target labels. Each node in the graph is embedded into the feature representation of the label, and the graph convolution maps this label map to the target classifier. Gao et al. [45] apply the graph convolutional network to visual tracking, and use the context of the frame to learn the adaptive features of target location. Li et al. [46] apply graph convolution to learn temporal and spatial characteristics of motion identification. Ma et al. [47] apply graph convolutional confrontation networks to jointly model data structures, domain labels, and class labels for unsupervised domain adaptation. In this paper, we adopt graph convolutional network to cluster the cross-view image features extracted from the shared space.
Methodology
The framework of our proposed network
In this section, we introduce an unsupervised person ReID learning framework based on the convolutional clustering in detail. Even in the unlabeled target domain dataset, it is known that the camera information where the image samples are located when acquiring the original unlabeled pedestrian image samples. The camera number of the sample is known, and this information can be utilized as supervised information that maps different views in the original feature space to the shared feature space. Our unsupervised person ReID method based on graph convolutional clustering presented is shown in Fig. 2. Firstly, the model uses metric learning method to learn a shared space of different views, which can reduce the feature deviation between the same ID of pedestrian samples under cross camera. The model learns a transformation to map source sample into a shared space which can better separate different person samples; Secondly, the basic deep neural network model is initialized by the existing labeled dataset; Thirdly, the original model is used to extract the image features from the unlabeled dataset. Graph convolutional network is used to cluster the cross-view image features extracted from the shared space. The clustering results constitute the data of the training set; Finally, the training model composed of clusters is used to fine tune the initial model, which is stored as the original model. The fine tune model is used to extract pedestrian sample features, repeat clustering and fine tune until the scale of reliable samples in the training set becomes stable. The fine tune of the model and correction process of clustering adopt the diversity self-paced learning method proposed in [48].
Shared space learning
In unsupervised RE-ID scenarios, clustering is susceptible to interfered by the bias of cross view. Samples from the same view are easier to cluster. Due to the deviation among cross view, samples with the same ID across views are not easy to form clusters. However, our model reduces the deviation between images of the same ID under different cameras by learning the shared space, thus alleviating the inaccurate clustering results in the model due to the deviation of the cross view.
We define that there are C camera views, each with Nc (c = 1,...,C) person images. There are N = N1 + ...+NC person images in the training set, the characteristic of the source domain feature space samples is
We formulate our idea as optimizing the Equation (1). The model learns the mapping of different camera views to the shared space, reducing the feature deviation caused by cross view.
Our model uses metric learning method to learn a shared space of cross view, which reduces the feature deviation of the same ID samples. However, this traditional clustering method has assumptions on feature distribution, and is susceptible to the negative impact of complex feature representation distribution of samples, thereby making false assumptions on data distribution clustering. Link-based clustering methods can achieve higher accuracy without any assumptions about data distribution [49]. The current link-based clustering generally utilizes traditional threshold metrics to calculate the possibility. Wang et al. [28] propose the possibility of using the context of nodes to predict whether the nodes and neighbors belong to the same category. Compared with traditional methods, the link-based clustering method considers important information in the context of samples, and overcomes the difficulty of selecting fixed thresholds in metric learning methods.
In order to utilize valuable information in the context of a node, this information belongs to unstructured graph structure data. Thus, the features of the image extracted from the traditional discrete convolutional neural network in Euclidean space cannot be used. In this case, graph convolution neural network is more suitable to solve this problem if the problem refers to graph topology data. We adopt graph convolution network (GCN) [28] to simplify the clustering task to link prediction. The model predicts the similarities of labels between the anchor point and its nearest neighbor. If the labels are the same, they are linked. To represent the local context information, a node pair with each neighbor is constructed for each anchor point. Finally, GCN is used to learn the above task, and a set of values of link possibilities between nodes in the graph is output. That is, a set of weighted edges is output. The weights of edges are the link possibilities. The final clustering results are obtained by combining the linked nodes.
For the sample characteristics that map to the shared space
After applying the convolution operation, the function f (· , ·) can be expressed as Equation 3, among which
In most recent works, part-based methods have achieved much better performance for person ReID than just learning full-body discriminative features [11, 13]. In this paper, spatial local information of different body parts is used for pedestrian images representation. In the first layer of the graph structure, the input node characteristic matrix is the original node characteristic matrix
The model learns the nodes through the hidden layers of multiple GCNs and summarizes the basic information about the node neighbors. Then, the input node feature P is cascaded with the clustering information along the feature dimension to obtain a set of edges weighted by link possibility. In the fine tune process of the model, the clustering results with noise will make the potential variables prone to fall into the bad local optimal value or oscillation state. In order to alleviate the above problems, the diversity self-paced learning method (SPLD) proposed in the model literature [48] encourages the selection of samples in multiple subsets in self-paced learning, masking noise data while avoiding self-paced learning to select the imbalance in the number of samples between components during each iteration. Finally, the cross-entropy loss function is used to train the graph convolution network.
Comparison of our method with the latest unsupervised person ReID performance
Datasets
We use four benchmark datasets of MSMT17 [14], Market1501 [51], DukeMTMC-reID [52] and CUHK03 [53] for person reID to evaluate the performance of our model. MSMT17 dataset is the latest and largest person ReID dataset, with a total of 4,101 pedestrian IDs. The images are from 15 cameras inside and outside the campus building of Peking University. Market-1501dataset is collected from the campus of Tsinghua University in the summer. There are totally 1501 pedestrians IDs, with an average of 17.2 training data per person. DukeMTMC-reID dataset is collected in Duke University during the winter, with a total of 1812 pedestrian IDs. There are many cases that different persons wear similar clothes. This dataset is challenging. CUHK03 is captured in two cameras with no cross over coverage.
Implementation details
In this paper, we compare the commonly used mAP (mean average precision) and rank-i with existing methods in the experiment. The model searches for probe person image in gallery, and sequentially calculates the distance between the probe features and other samples. The rank-i indicates the probability which a query sample is found among the top i matching in the ranking list. We adopt ResNet-50 pre-trained on ImageNet as the backbone network. Since the training set and the test dataset are independent and identical distribution hypotheses, the batch norm layer is added before the fully connected classification layer. Thus, the input samples of each layer in the training process maintain the same distribution. All training and test images are standardized to 224×128 pixels. We apply common data enhancement strategies. By performing a random horizontal flip with a probability of 0.5, zero-padding with 10 pixels, normalization processing, and fine-tuning to alleviate the problem of overfitting. In the experiment, dropout value is set to be 0.5 and batch size is set to be 128. SGD solver and adam optimizer are used to train the network model. We set the initial learning rate to 0.0001, and reduce the learning rate to 0.00001 in the 50th iteration until the model convergence.
Comparison with the state-of-the-art
We compare the proposed method in this paper with the current unsupervised learning person ReID methods. The performance comparison of unsupervised learning using MSMT17, Market1501, DukeMTMC-reID datasets as source domain or target domain dataset is shown in Table 1. “Duke MTMC-reID⟶Market1501” means that the model is fine-tuned on Duke MTMC-reID dataset and tested on the target dataset of Market1501. The rank-1 and mAP accuracy of the method proposed in this paper exceeds the methods proposed in the latest literature. The accuracy of rank-1 tested in the target dataset MSMT17 is increased by 5.4%, and the mAP value is increased by 2.9%.
Influence of shared space learning and graph convolutional learning on person ReID task
Influence of shared space learning and graph convolutional learning on person ReID task
In this section, we mainly discuss the effect of learning shared space and application of graph convolution network on person ReID datasets. As shown in Table 2, “Supervised Learning” indicates that the training and testing of the model are completed in the target dataset. The accuracy reaches the current level of person ReID baseline. The results in the unsupervised learning section indicate that the training set is the source dataset before the arrow, and the testing set is the target dataset after the arrow. As shown in the Table 2, “w./o. SHL” represents the identification performance of the model without the application of shared space learning; “w./o. GCL” represents that the model does not apply graph convolution network. In the clustering process, we adopt traditional threshold measurement to learn the identification performance under the link prediction. “Ours w./o. SPLD” represents the performance of the in the case of self-paced learning in the clustering process. “Ours” represents the identification performance of the proposed model under unsupervised learning. The results show that the performance of the proposed model is superior to the single method. It can be seen from the results in Table 2 that our method attains 22.5% matching rate at rank-1 in the target dataset MSMT17. “w./o. GCL” and “Ours w./o. SPLD” rank-1 increase by 6.9% and 5.7% respectively. Experimental results show that learning shared space can reduce the deviation among views. The application graph convolutional network that uses context information in graph structure to link sample clusters has a positive impact on the performance of person ReID.
Conclusion
In this paper, we have proposed a novel unsupervised person ReID metric learning algorithm to reduce the deviation of the cross-camera view by learning the shared space, and thus better performance of cross-view matching can be achieved. Graph convolutional networks are employed to extract cross-view features from shared spaces. The model performs the linked person clustering by using context information in the graph structure. Encouraging experimental results on four large-scale person ReID public datasets can outperform existing ones in general showing the effectiveness of our method.
Funding
The authors acknowledge the Natural Science Foundation Project of Huaiyin Institute of Technology (Grant: 19HGZ004), the Natural Science Research of Jiangsu Higher Education Institutions of China (Grant: 18KJA520002), the Natural Science Foundation of Jiangsu Province (Grant: BK20171267), the Fifth Issue 333 High-Level Talent Training Project of Jiangsu Province (Grant: BRA2018333), Qing Lan Project of JiangSu Province, the Horizontal Project (Grant: Z421A19830, Z421A19888).
