Abstract
Recent successful person re-identification (person Re-ID) methods based on deep learning have mostly adopted supervised learning algorithms, which require large amounts of manually labelled data to achieve good performance. However, two important problems remain unresolved: dataset annotation is an expensive and time-consuming process, and the performance of recognition models is seriously affected by visual changes. In this paper, we primarily study an unsupervised method for learning visual invariant features using networks with temporal coherence for person Re-ID; this method exploits unlabelled data to learn representations from video. In addition, we propose an unsupervised learning integration framework for pedestrian detection and person Re-ID for practical applications in natural scenarios. To evaluate the unsupervised person Re-ID algorithm based on visual invariant features, experiments were run on the iLIDS-VID, PRID2011 and MARS datasets, and a better performance of 57.5% (R-1) and 73.9% (R-5) was achieved on the iLIDS-VID and MARS datasets, respectively. The efficiency of the algorithm was validated by using BING + R-CNN as the pedestrian detector, with which the person Re-ID system achieved a computation speed of 0.09 s per frame on the PRW dataset.
Introduction
Person re-identification (person Re-ID) is the computer vision task of recognising an individual in a network of video surveillance cameras with non-overlapping fields of view [1]. Person Re-ID has received considerable attention in a wide range of areas in recent years [2]. With scientific and technological progress and the rapid development of information technology in the 21st century [3–5], intelligent security technology has entered a new era, and the boundary between intelligent security technology and computers is gradually disappearing. Two applications of person Re-ID [6] are tracking criminal suspects and finding lost elderly people or children. In such cases, a query person (probe) is to be found among a group of candidates (gallery), where the probe and gallery images are obtained from different non-overlapping camera views [2]. Compared with face recognition, which relies on strong biometric characteristics, this task is more difficult because of the low resolution of face images, unconstrained poses, illumination changes and occlusions; person Re-ID is therefore a challenging task [7]. A key characteristic of person Re-ID is that it is cross-camera, so evaluating person Re-ID technologies requires retrieving images of the same pedestrian from different cameras [8]. To address this multi-view matching problem, the method of Wang et al. [9] first learned a subspace using canonical correlation analysis (CCA), with the goal of maximizing the correlation between data from different cameras that correspond to the same people; given a probe from one camera view, the method then represented it sparsely using a jointly learned coupled dictionary in the CCA subspace. Because the angle between pedestrians and cameras varies greatly across views, our main goal has been to identify visual invariant features.
Most of the existing techniques [10, 11] are based on defining pedestrian feature descriptors (usually including clothing colour and texture) and a measure of the similarity between a pair of descriptors to obtain an evaluation score; the measure can be defined manually or learned from data [15]. Khedher et al. [16] presented a multi-shot person Re-ID system for video sequences based on SURF matching, and proposed a new method of SURF matching via sparse representation. To bridge the human appearance variations across cameras [17], two coupled dictionaries that relate to the gallery and probe cameras are jointly learned from both labelled and unlabelled images. Wang et al. [12] proposed angular loss with hard sample mining (ALHSM) to learn a better similarity metric for person Re-ID; they used the angular relationship in triangles as a measure of similarity, minimizing the angle at the negative point of the triangle. Maria Jose et al. [13] exploited the transfer of learning previously acquired from a multi-object-tracking (MOT) domain. A single deep triplet architecture was trained on both domains, and six different levels of transfer learning were implemented and evaluated, showing that transferring learning from a different domain remarkably increases person Re-ID performance. Grzegorz et al. [14] addressed the problem of detecting and identifying persons with a mobile robot by sensory fusion of thermal and colour vision information. Despite extensive research, this strategy still faces difficult challenges, because the appearance of a person captured by different cameras often varies significantly due to changes in the camera angle. Fig. 1 shows multiple images of two people at different angles from different cameras. The images of the same person show great differences in shape, colour and background.

Fig. 1. Multiple images of two people at different angles from different cameras.
In an actual scene, the position and shooting angle of a surveillance camera are generally fixed, whereas the relative positions of foreground targets in the scene, such as pedestrians, are difficult to predict [8]. To ensure that the extracted features have a certain degree of visual invariance, this paper improves the algorithm so that person Re-ID technology can be applied more effectively. The human visual system can identify objects from different angles and directions despite variations in shape, proportions and lighting conditions [18]. This ability depends on strong visual signals or continuous representations, and it is difficult to achieve in computer vision. Bak et al. [19] proposed eliminating perspective distortion by using 3D scene information, as this minimises perspective changes. Sun et al. [20] narrowed the viewpoint problem to that of determining the pedestrian's angle of rotation and quantitatively analysed the influence of this angle on recognition accuracy. Li et al. [21] proposed two multitemporal dictionary learning algorithms, expanding on their KSVD and Bayesian counterparts, to make better use of temporal correlations. The expanded KSVD algorithm seeks an optimized temporal path, and the expanded Bayesian method adaptively weights the temporal correlations. Karanam et al. [22] suggested reducing the impact of the angle by learning a dictionary that can discriminatively and sparsely encode features that represent different people. The above discussion shows that the visual invariance of pedestrians is very important in person Re-ID, but temporal coherence [18, 24] has not yet been applied to person Re-ID. In deep neural networks, visual invariance can be learnt by training on a large amount of unlabelled data.
In our work, we extract visual invariant features with temporal coherence [23, 24] for person Re-ID to limit the loss of recognition accuracy caused by changes in the viewing angle.
The ultimate goal of research on person Re-ID is to build a system that is fast, accurate and robust. Our model is designed so that person Re-ID performance is affected as little as possible when the viewing angle changes. This study makes the following contributions: (1) it designs visual invariant feature descriptors of pedestrians based on temporal coherence for person Re-ID; (2) it proposes an unsupervised learning algorithm for person Re-ID; and (3) it integrates the pedestrian detection and pedestrian recognition algorithms to build a prototype person Re-ID system.
The remainder of this paper is organized as follows. Section 2 presents the methodology of person re-identification. Our core work is introduced in Section 3 and Section 4, which includes the methods of rapid pedestrian estimation and unsupervised learning of visual invariance. Section 5 presents the experimental results, and Section 6 presents the conclusions.
In this section, we propose a new person Re-ID system that enables us to identify a person more efficiently. The system can be divided into two parts: the first trains the perspective-invariant feature model, and the second performs pedestrian detection and recognition in real scenes. To improve the speed of person Re-ID and ensure that its accuracy is affected as little as possible by angle transformation, we first use a two-layer neural network with temporal coherence [24] to train a model that yields pedestrian visual invariant features. Then, we use the binarised normed gradients (BING) algorithm to estimate the pedestrian position in the original query image (probe), extract the visual invariant features from the estimated image and the gallery images, and send them to the classifier to obtain the person Re-ID results. The flow chart of the system is shown in Fig. 2.

Fig. 2. Unsupervised learning integration framework of pedestrian detection and person Re-ID.
As the relative position of the surveillance camera and pedestrians changes, the viewing angle also changes. To reduce the influence of angle change on person Re-ID, we propose visual invariant features of pedestrians. Because one-layer models are limited in the invariance they can achieve, we build a two-layer model. The second layer of the network combines the features of the first layer by means of a set of overcomplete neurons. We do not hard-wire the weights of the second layer. Instead, we optimise a different objective: keeping the activations of the second layer consistent over time. In both layers of the network, we follow the classical complex-cell model, grouping neurons by energy pooling with a pool size of 2. Then, classifiers are applied to the learnt features to perform the person Re-ID task.
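The pooling and temporal consistency ideas in this paragraph can be illustrated with a minimal NumPy sketch. The function names, the small `eps` stabiliser and the L1 form of the penalty are assumptions made for illustration; the text above does not specify the exact cost used in the model.

```python
import numpy as np

def energy_pool(h, pool_size=2, eps=1e-8):
    """Group adjacent hidden units and return the L2 energy of each group.

    h: array of shape (n_frames, n_units), one row of activations per frame.
    """
    h = np.asarray(h, dtype=float)
    n_frames, n_units = h.shape
    assert n_units % pool_size == 0, "n_units must be divisible by pool_size"
    groups = h.reshape(n_frames, n_units // pool_size, pool_size)
    return np.sqrt((groups ** 2).sum(axis=2) + eps)

def temporal_coherence_penalty(pooled):
    """Mean L1 change of pooled activations between consecutive frames."""
    return np.abs(np.diff(pooled, axis=0)).sum(axis=1).mean()

# Two consecutive frames: the individual responses change, the pooled energy does not.
frames = np.array([[3.0, 4.0, 0.0, 0.0],
                   [4.0, 3.0, 0.0, 0.0]])
pooled = energy_pool(frames)
penalty = temporal_coherence_penalty(pooled)
```

In this toy input the two units of the first pool swap their responses between frames, yet the pooled value stays at 5, so the penalty is zero. This is the kind of stability over time that the second layer is encouraged to learn.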
In actual monitoring scenarios, estimating pedestrian locations quickly is a challenging problem. In this paper, a fast pedestrian estimation method based on BING features is proposed. BING is an accelerated version of normed gradients that aims to speed up feature extraction and detection. First, to effectively quantify the object state in the image window, we combine the magnitudes of the pixel gradients of the window into a 64-dimensional feature and train a classifier on it to estimate pedestrians quickly. Then, the pedestrians' visual invariant features are extracted and sent to the classifier to identify pedestrians.
Finally, a complete person Re-ID system will be formed.
As a key task in computer vision, object detection has attracted much attention in recent years. Rapid pedestrian estimation is also an indispensable part of person Re-ID. Most existing classifiers evaluate images by sliding windows, which is a time-consuming process. In recent years, the primary goal of pedestrian estimation has been to speed up evaluation, so training a generic, category-independent objectness measure has been proposed [25]. Generic objects with well-defined closed boundaries are strongly correlated in the normed gradient space when the corresponding image window is resized to a small fixed size [26]. The image window is resized to 8×8, and the normed gradients are used as a simple 64D feature to describe the pedestrian and explicitly train a generic objectness measure. The binarised version of this feature is BING, which can be used for effective and fast pedestrian estimation. This is a simple, fast and high-quality object detection method. Evaluating the BING feature for an image window of any scale and aspect ratio requires only a few atomic operations (addition, bitwise operations, etc.).
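As a rough illustration of the 64D normed-gradient feature described above, the following sketch resizes a greyscale window to 8×8 and computes a clipped gradient magnitude. The nearest-neighbour resize, the [-1, 0, 1] difference mask and the clipping of |gx| + |gy| at 255 are assumptions chosen to keep the example self-contained; they are not taken from this paper.

```python
import numpy as np

def resize_nearest(img, out_h=8, out_w=8):
    """Nearest-neighbour resize, a simple stand-in for a proper image resize."""
    img = np.asarray(img, dtype=float)
    rows = np.arange(out_h) * img.shape[0] // out_h
    cols = np.arange(out_w) * img.shape[1] // out_w
    return img[np.ix_(rows, cols)]

def normed_gradient_feature(window):
    """Return a 64D normed-gradient feature for a greyscale window."""
    small = resize_nearest(window)
    gx = np.zeros_like(small)
    gy = np.zeros_like(small)
    gx[:, 1:-1] = small[:, 2:] - small[:, :-2]   # horizontal [-1, 0, 1] mask
    gy[1:-1, :] = small[2:, :] - small[:-2, :]   # vertical [-1, 0, 1] mask
    ng = np.minimum(np.abs(gx) + np.abs(gy), 255)  # clip to the byte range
    return ng.reshape(64)

# A 32x16 window with a vertical step edge in the middle.
window = np.zeros((32, 16))
window[:, 8:] = 200
feature = normed_gradient_feature(window)
```

For the step edge, only the two columns next to the boundary respond, which shows how a closed object boundary produces a compact pattern in the 8×8 gradient map.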
The method estimates pedestrians by scanning over a predefined set of quantised windows, scoring each window with a linear model $w \in \mathbb{R}^{64}$. Let $s_l$ denote the filter score, $g_l$ the normed gradient (NG) feature, $l$ the coordinates, $i$ the scale and $(x, y)$ the window position; the score can then be expressed as $s_l = \langle w, g_l \rangle$, with $l = (i, x, y)$. Since the likelihood that a window contains an object varies with its scale, we define the object state score (calibrated filter score) $o_l = v_i \cdot s_l + t_i$, where $v_i$ is a separately learnt coefficient and $t_i$ is a bias term for each quantised size $i$. This feature is insensitive to changes in the position, scale and aspect ratio of an object, and it can be used to detect any kind of object effectively. Moreover, the feature represents a target accurately while remaining cheap to compute and verify, so it has good potential for real-time applications.
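The two-stage scoring above (a linear filter score followed by a per-size calibration) can be sketched as follows. The function names and the toy numbers are hypothetical and serve only to show how the filter score and the calibration terms for each quantised size combine.

```python
import numpy as np

def objectness_scores(w, feats, size_idx, v, t):
    """Calibrated scores o_l = v_i * <w, g_l> + t_i for a batch of windows.

    feats: (n_windows, 64) NG features g_l
    size_idx: quantised size index i of each window
    v, t: per-size calibration coefficients and bias terms
    """
    feats = np.asarray(feats, dtype=float)
    size_idx = np.asarray(size_idx)
    s = feats @ np.asarray(w, dtype=float)  # filter scores s_l
    return np.asarray(v)[size_idx] * s + np.asarray(t)[size_idx]

def rank_windows(scores):
    """Indices of windows sorted from highest to lowest calibrated score."""
    return np.argsort(-scores)

w = np.full(64, 0.01)
feats = np.stack([np.full(64, 10.0), np.full(64, 20.0)])
scores = objectness_scores(w, feats, size_idx=[0, 1], v=[1.0, 0.5], t=[0.0, 1.0])
order = rank_windows(scores)
```

Calibrating per size lets windows of different scales be compared on one ranking, even though their raw filter scores are not directly comparable.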
To estimate pedestrians more efficiently, BING is proposed as an accelerated version of the NG feature. Following [26], the linear model $w \in \mathbb{R}^{64}$ can be approximated by a combination of $N_w$ basis vectors:

$$w \approx \sum_{j=1}^{N_w} \beta_j a_j,$$

where $a_j \in \{-1, 1\}^{64}$ is a basis vector and $\beta_j$ is its coefficient. Through a simple operation [26], we keep only the first $N_g$ bits of each gradient magnitude for binarisation. Thus, the 64D NG feature $g_l$ can be approximated by the first $N_g$ bits of the binary gradient magnitude (BGM) image as follows:

$$g_l \approx \sum_{k=1}^{N_g} 2^{8-k} b_{k,l}.$$

For a BING feature $\mathbf{b}_{x,y}$, its last row is $r_{x,y}$ and its last element is $b_{x,y}$. By calculating a series of binary patterns in a fixed 8×8 range using certain atomic operations [26], the filter score is obtained as:

$$s_l \approx \sum_{j=1}^{N_w} \beta_j \sum_{k=1}^{N_g} C_{j,k},$$

where $C_{j,k} = 2^{8-k}\left(2\langle a_j^{+}, b_{k,l}\rangle - |b_{k,l}|\right)$, with $a_j^{+} \in \{0, 1\}^{64}$ marking the positive entries of $a_j$, can be computed with fast BITWISE and POPCNT SSE operators.
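The term $C_{j,k}$ rests on the identity $\langle a, b\rangle = 2\langle a^{+}, b\rangle - |b|$ for $a \in \{-1, 1\}^{64}$ and $b \in \{0, 1\}^{64}$, which turns an inner product into a bitwise AND and two bit counts. The sketch below checks this numerically using Python integers as 64-bit words. The helper names are hypothetical, and `bin(x).count("1")` merely stands in for the hardware POPCNT instruction.

```python
import numpy as np

def popcount(x):
    """Number of set bits; a software stand-in for the POPCNT instruction."""
    return bin(x).count("1")

def pack_bits(bits):
    """Pack a sequence of 64 zeros and ones into one Python integer."""
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value

def binarised_score(betas, a_plus, g, n_g=4):
    """Approximate <w, g> for w ~= sum_j beta_j * a_j with a_j in {-1, +1}^64.

    a_plus: packed masks marking the +1 entries of each a_j
    g: 64 byte-valued normed gradients (0..255)
    n_g: number of most significant bits of g that are kept
    """
    g = np.asarray(g, dtype=np.uint8)
    score = 0.0
    for beta, a_p in zip(betas, a_plus):
        for k in range(1, n_g + 1):
            b_k = pack_bits((g >> (8 - k)) & 1)  # k-th most significant bit plane
            c_jk = (1 << (8 - k)) * (2 * popcount(a_p & b_k) - popcount(b_k))
            score += beta * c_jk
    return score

rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=64)
g = rng.integers(0, 256, size=64)
a_plus = [pack_bits(a > 0)]
full = binarised_score([0.5], a_plus, g, n_g=8)     # all 8 bits kept
coarse = binarised_score([0.5], a_plus, g, n_g=4)   # top 4 bits only
```

With all eight bits kept the result equals the exact inner product; keeping fewer bits trades a small quantisation error for fewer operations.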
To estimate pedestrians quickly, in the first stage we use the normed gradients as a simple 64D feature to describe an image region that is normalized to 8 × 8 and train a generic objectness measure on it. In the second stage, we use the binarized normed gradients (BING) as a binarized version of this feature, which requires only a few atomic operations [26], such as ADD and BITWISE SHIFT.
Most existing person Re-ID features are based on information such as the appearance and texture of pedestrians, which is easily affected by the angle of view, thus reducing accuracy. Recently, most of the success of deep learning in person Re-ID has been achieved by supervised learning, which requires a large amount of labelled data. However, label collection is a cumbersome and time-consuming process. In this work, we use an unsupervised learning framework to extract visual invariant features from a large amount of unlabelled data. Specifically, we use a two-layer neural network with temporal coherence [23, 24] to learn these features.
The first layer of the algorithm is an unsupervised learning architecture based on an autoencoder for pre-training deep networks. To learn features from a data sample $x_i$, ordinary autoencoders aim to reconstruct the data by minimising a reconstruction cost function [18], in which $g$ and $S$ are sigmoid functions.
The linear autoencoder described above can also be expressed with complex numbers: the real weights are replaced by a complex weight matrix, and the hidden activations become complex-valued [18].
The pooling in the first-layer network only enables the model to learn local invariance. For real-world scenarios, we want to learn visual invariant features on a larger scale. To achieve this invariance at a higher level, the second-layer network should capture persistent changes in the features of the first-layer network. If an object is moving, the first-layer network needs one feature to describe the object, and the second-layer network needs another feature to describe the same object appearing elsewhere. Only greyscale features are used throughout this process.
If the first-layer network has m hidden layer units, then there are m response maps after convolution. The second layer of the network is built on top of these response maps. It includes a principal component analysis (PCA) process for dimensionality reduction. The second-layer network should be identical in structure to the first-layer network.
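The PCA step used for dimensionality reduction between the two layers can be sketched with a singular value decomposition. This is a generic PCA under the assumption that each row holds one flattened response vector; the number of components kept and any whitening used in the actual system are not stated in the text.

```python
import numpy as np

def pca_reduce(features, n_components):
    """Project row-vector features onto their top principal components."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                                   # centre the data
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components]                  # principal directions
    return Xc @ components.T, components, mean

# Toy responses that vary along a single direction in 3D.
t = np.arange(-2.0, 3.0)[:, None]
responses = t * np.array([[1.0, 2.0, 2.0]])
reduced, components, mean = pca_reduce(responses, n_components=1)
restored = reduced @ components + mean
```

Because the toy data lie on one line, a single component reconstructs them exactly; real response maps would lose some variance, which is the price of the reduced dimensionality.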
Fig. 3 shows the invariant features learnt in the first and second layers with the temporal coherence cost. Convolution is used to extract image features: the first-layer features are obtained by sampling with fixed steps, and the visualization results are shown in Fig. 3(a). The second-layer features are obtained in the same way from the first-layer response maps, and the visualization results are shown in Fig. 3(b). Overall, we use the two-layer feature extraction network, averaging the features of each layer, to extract the visual invariant features.

Fig. 3. A comparison of first-layer and second-layer invariance learnt from data with the temporal coherence cost. (a) Visualization results of the first-layer features; (b) visualization results of the second-layer features.
In this paper, unsupervised experiments are carried out on three challenging datasets: iLIDS-VID, PRID2011 and MARS. In iLIDS-VID, based on the assumption that a real person Re-ID system should track each identity, 600 tracks of 300 identities are extracted from the iLIDS-MCTS dataset. The PRID2011 dataset has a total of 1134 trajectories from 2 different cameras, and only 200 people appear in both cameras. MARS is the first large-scale video-based person Re-ID dataset; it contains 1,191,003 images belonging to 1261 pedestrians.
We compare our proposed method with current popular methods, and the results are shown in Table 1. On the iLIDS-VID dataset, our approach achieves a much better recognition rate than the other methods; in particular, it achieves the best recognition rate at Rank-1. On the PRID2011 dataset, which has been studied extensively, existing methods already reach a high level at each rank, and the accuracy of our method is also high. On the MARS dataset, our method is more accurate than the other methods at Rank-5. Although it does not achieve the best recognition rate at the other ranks, the gap between its accuracy and the best recognition rate is within 2%.
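For readers unfamiliar with the Rank-m metric used in this comparison, the following sketch computes it from a probe-to-gallery distance matrix. It assumes a simple single-gallery protocol without camera filtering; the exact evaluation protocol of each dataset may differ, and the function name is hypothetical.

```python
import numpy as np

def rank_k_accuracy(dist, probe_ids, gallery_ids, ks=(1, 5, 10, 20)):
    """Percentage of probes whose first correct match is within the top k.

    dist: (n_probes, n_gallery) distances, smaller means more similar
    """
    dist = np.asarray(dist, dtype=float)
    probe_ids = np.asarray(probe_ids)
    gallery_ids = np.asarray(gallery_ids)
    order = np.argsort(dist, axis=1)                   # gallery sorted per probe
    matches = gallery_ids[order] == probe_ids[:, None]
    first_hit = matches.argmax(axis=1)                 # position of first correct match
    has_match = matches.any(axis=1)
    return {k: 100.0 * np.mean(has_match & (first_hit < k)) for k in ks}

dist = [[0.1, 0.5, 0.9],
        [0.2, 0.3, 0.4]]
result = rank_k_accuracy(dist, probe_ids=[1, 3], gallery_ids=[1, 2, 3], ks=(1, 2, 3))
```

In this toy case the first probe is matched at rank 1 and the second only at rank 3, so Rank-1 accuracy is 50% and Rank-3 accuracy is 100%.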
Table 1. Comparison to state-of-the-art unsupervised results on the iLIDS-VID, PRID2011 and MARS datasets according to Rank-m accuracies (%)
In real-life images or videos, pedestrian detection is also an important part of person Re-ID. To further verify the effectiveness of the proposed method, we compared our detectors and recognisers with existing object detection methods and recognisers on the PRW dataset; the results are shown in Table 2. All experiments were carried out in the same environment, on a machine with 16 GB of memory, a K40 GPU and an Intel i7-4770 processor. Our method is more accurate than the other methods, particularly when three or five targets are detected. When five targets are detected, with Faster R-CNN as the detector and the proposed method as the recogniser, the Rank-20 person Re-ID accuracy reaches 80.1% and the mAP is 23.2%; the person Re-ID system with BING + R-CNN achieves a computation speed of 0.09 s per frame. It is worth noting that our detection rate is much higher than that of classic object detection methods. Compared with the classic DPM-AlexNet method, our detection rate with Faster R-CNN is 92 times higher, and compared with the BING + R-CNN method, our detection rate is 71 times higher. Compared with the relatively fast locally decorrelated channel features (LDCF) detection method, our detection rate with Faster R-CNN is also at least 19 times higher, and it is 14 times higher than that of the BING + R-CNN method. This improvement in the detection rate will benefit future detection on large-scale pedestrian datasets.
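The mAP figure reported above summarises the whole ranking of each query instead of only its first correct match. A minimal sketch of the usual definition is given below; it assumes each query is represented by a ranked list of match flags, and it may differ in detail from the official PRW evaluation code.

```python
def average_precision(ranked_matches):
    """Average precision of one query given ranked True/False match flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(ranked_matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank   # precision at this correct match
    return precision_sum / hits if hits else 0.0

def mean_average_precision(all_ranked_matches):
    """Mean of the per-query average precisions."""
    aps = [average_precision(m) for m in all_ranked_matches]
    return sum(aps) / len(aps) if aps else 0.0

ap_first = average_precision([True, False, True])
ap_second = average_precision([False, True])
map_value = mean_average_precision([[True, False, True], [False, True]])
```

A query whose correct matches all appear at the top of the list scores 1.0, so mAP rewards rankings that retrieve every instance of a pedestrian early, which is stricter than Rank-m accuracy.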
Table 2. Performance comparison of detectors and recognisers on the PRW dataset
Fig. 4 shows sample person Re-ID visual results on the PRW dataset with the BING + R-CNN algorithm. The three images in column 1 are the queries, the images with green borders are successful matches, and the rank list ranges from 1 to 10. We typically normalize the bounding boxes to a scale of 128×64.

Fig. 4. Sample person Re-ID visual results on the PRW dataset with the BING + R-CNN algorithm.
In this paper, we studied person Re-ID based on visual invariant features with unsupervised learning. A general unsupervised deep learning recognition model was established to address the influence of angle changes on the performance of person Re-ID models. Without any labels for fine-tuning, visual transformations and visual invariant features were learned through temporal coherence, which yielded a visual invariant feature descriptor. To meet the demands of practical applications in natural scenes, we used BING features as an attention mechanism for pedestrian detection: they quickly locate the approximate position of pedestrians in the original surveillance image and thus accelerate pedestrian detection. By balancing the speed and accuracy of the pedestrian recognition system, we built a prototype of an unsupervised person Re-ID system. To evaluate the unsupervised person Re-ID algorithm based on visual invariance, the results were verified on the iLIDS-VID, PRID2011 and MARS datasets, and better performance of 57.5% (R-1) and 73.9% (R-5) was obtained on the iLIDS-VID and MARS datasets, respectively. Using BING + R-CNN as the pedestrian detector, the person Re-ID system achieved a computation speed of 0.09 s per frame.
Acknowledgments
This work was supported by the Natural Science Foundation of China (No. 61762023), the Sprouts Come Special Project of GuiZhou Department of Science and Technology (No. QKHPTRC [2017]5726), the Shaoguan Science and Technology Plan Project (2019sn064), the Shaoguan University Research Project (No. SY2018KJ03), and the Startup Project of Doctoral Research of Guizhou Normal University (2017).
