Abstract
Learning to discriminate whether two person images correspond to the same person is a daunting challenge when only two images per person are available. This task is called single-shot person re-identification (re-id), and it assumes that each of the two available images was captured from a different camera view, entailing variations in pose, resolution, scale, illumination and background. Addressing this task through supervised training of a deep convolutional neural network is prone to overfitting because of the critical lack of labelled data. This paper proposes to exploit the transfer of learning previously acquired in a multi-object-tracking (MOT) domain. In this context, a single deep triplet architecture has been trained on both domains. Six different levels of transfer learning have been implemented and evaluated, showing that transferring learning from a different domain remarkably increases the re-id performance. Experimental results validate the accuracy and robustness of the proposed method as comparable to other state-of-the-art techniques. These results also confirm that, despite the data problem, deep learning is applicable to the single-shot re-id task.
Introduction
Person re-identification (re-id) consists of matching people across non-overlapping camera views, at different locations and times. It has become one of the most studied tasks in video surveillance, since many other applications, like tracking or behaviour analysis, rely on the person re-id performance [1].
Even though many research efforts from the computer vision field have focused on solving the re-id problem [2], it remains unsolved [3]. This is because, in unconstrained scenarios, a person’s appearance often changes dramatically across camera views due to changes in body pose, view angle, occlusion and illumination conditions. These changes cause intra-class variations, that is, quite different representations of the same identity. Moreover, the presence of different people with a similar appearance introduces inter-class ambiguities, i.e. similar representations corresponding to different people. The intra-class variations tend to push apart the representations of a person in the feature space. Conversely, inter-class ambiguities bring together the features corresponding to different people. Therefore, clustering the features according to the identity that they represent is a hard task, which turns the recognition of identities into a daunting challenge.
Recently, the use of deep re-id models has been boosted by the success of Convolutional Neural Networks (CNNs) in various vision problems, like image classification [4], object detection [5], face identification [6], transportation [7, 8, 9], structural analysis [10, 11, 12, 13, 14, 15, 16], medical diagnosis [17, 18], or even in computational modelling [19, 20], combinatorial optimization [21], and big data management [22]. Most of the networks used in person re-id treat this task as a pairwise binary classification that discriminates between matched and mismatched pairs of images, and they are trained within a supervised learning framework, like the work presented in [23].
However, person re-identification poses a unique challenge to deep supervised learning, since a deep model with millions of parameters must be learnt from a small training set. Two types of re-id datasets can be found: multi-shot recognition [24], where a tracklet of every individual (i.e. a short sequence of images) is available for each camera view; and single-shot recognition, where only one image per person and per view is used. Figure 1 shows examples of matched pairs (in each column) from the single-shot PRID2011 dataset [25]. Every pair is formed by one image captured from each of the two different camera views, a and b. This paper focuses on the single-shot case, where the overfitting of the model is particularly acute due to the small size of the available labelled data.
Examples of matched pairs belonging to the single shot PRID2011 dataset.
The objective of the presented work is to address the re-id data problem in deep model learning. Learning a non-overfitted model, able to generalise to unseen samples, requires a large amount of labelled data, whereas single-shot re-id datasets are small. For that reason, the transfer of learning from a different domain has been adopted.
In contrast to the use of other re-id datasets or classification datasets, this paper proposes the use of a Multi-object tracking (MOT) dataset to pre-train a re-id model, which has been subsequently fine-tuned with the re-id target dataset.
With this approach, the re-id network architecture does not require any modification to transfer learning from the people tracking task to the re-identification task. Instead, we propose adapting the MOT dataset to feed a re-id network. Therefore, a new source dataset has been created by extracting people images from MOT sequences.
The use of the proposed cross-sequence dataset provides certain benefits to the re-id model learning. First, the unbalanced nature of the re-id data, where there are only two instances of a given individual against a high number of different people, is attenuated. In the new dataset, many detections of the same person are extracted, which eases the convergence of the model learning process.
In addition, samples from different MOT sequences have been compared in the training. This setting helps to avoid the dependence on the characteristics of a certain camera view, and consequently, to avoid the negative transfer of learning.
Therefore, this paper addresses the single-shot re-identification problem by transferring the degree of appearance similarity learnt from MOT datasets, and its main contributions are: i) the creation of a new source dataset to pre-train a person re-id model from MOT sequences, by designing and using a data generation tool, which has been made publicly available;1 ii) the implementation and evaluation of different stages of transfer learning, according to different values for the number of layers that are initialized with the pre-trained weights, and for the number of layers that are fine-tuned on the target re-id dataset. This prior analysis provides practical knowledge about the best configuration for transferring learning from the MOT domain to the re-id domain, which is applicable beyond the presented experiments. iii) Finally, the last contribution is the learning of a measure of appearance similarity applicable to both tracking and re-id tasks through a single deep network architecture.
The pairwise binary classification performance of the pre-trained model has been tested on the MOT17 dataset. Furthermore, the effects of the different stages of transfer learning on the final re-id capacity have been evaluated over two of the most challenging re-id datasets, PRID2011 [25] and VIPeR [26], showing the effectiveness of the proposed approach.
The rest of the paper is organized as follows: Section 2 presents the existing related work. The used learning model is introduced in Section 3 and Section 4 describes the different conducted stages of transfer learning. Section 5 evaluates the learning process evolution and presents the experimental results, and some concluding remarks are given in Section 6.
Related work
Traditionally, the re-identification problem has been addressed through two different types of approaches. The first type enhances the design of distinctive feature representations, like graph representations [27] and spatial co-occurrence representations [28]. The second type comprises methods to optimally combine and quantify visual features, such as boosting with AdaBoost [26], Ranking Support Vector Machines (Rank-SVM) [29], Probabilistic Relative Distance Comparison (PRDC) [30], or Metric Learning algorithms, such as Linear Discriminant Analysis (LDA) [31] and Logistic Discriminant Metric Learning (LDML) [32].
These methods learn a global weighting that reflects the stability of each feature component across two cameras, so they can be grouped under the paradigm of Global Feature Importance (GFI) methods. In contrast, in [33], a Prototype-Sensitive Feature Importance based method is proposed to adaptively weight features according to different clusters of population.
More recently, a new approach has been adopted. Inspired by the success of deep Convolutional Neural Networks (CNNs) in image classification [4], it is based on the learning of deep re-id models.
In general, two types of CNN models are commonly employed in computer vision. The first type is the classification model, which is used in image classification [4] and object detection [34]. The second is the Siamese model, using image pairs [35] or triplets [36] as input. The first two works in re-id to use deep learning [37, 38] employed a Siamese neural network [39] to determine whether a pair of images belongs to the same identity. Yi et al. [38] add an additional cost function to the network, while Li et al. [37] use a finer body partitioning. Siamese networks consist of two deep CNN branches, sharing parameters and joined in the last layer, where the loss function performs a pairwise verification. Later, Ahmed et al. [40] improved the Siamese model by computing cross-input neighbourhood difference features, which compare the features from one input image to features in neighbouring locations of the other image.
Another variation of the Siamese network is the one known as the Triplet model, presented in [36], where face re-identification is addressed. This model has been widely used to automatically find salient high-level representations from raw images, as in the work presented in [41]. The work presented in [42] demonstrated how the triplet model helps to alleviate the unbalanced nature of the re-id datasets, by allowing multiple triplet combinations of re-id samples.
In the Triplet model, each training input is a set of three samples: two represent the same person and the third one, a different identity. Therefore, this model allows the comparison between a matched and a mismatched pair from each triplet input, so the objective function can maximize the relative distance between them. For that reason, this is the learning architecture used in this work to train VGG16 networks [43]. Hence, each branch of the proposed Triplet model is a VGG16 net. Other triplet approaches that chose the VGG16 model to learn representative features are those presented in [44, 23] to solve face and person re-identification problems, respectively. This last work also adds LSTM networks, since it adopts a multi-shot re-id approach, in contrast to the single-shot re-id presented in this work. In [45], a combination of LSTM and VGG networks is used to solve people tracking.
However, addressing the re-id problem by means of Deep learning involves an exceptional challenge, because of the lack of training data. Single-shot re-id datasets, such as PRID2011 [25] or VIPeR [26], provide only two images for each identity so currently most CNN-based re-id methods focus on the Siamese and Triplet model, treating the re-id task as a pairwise binary classification.
The requirement of a large amount of labelled data to learn a non-overfitted model, able to generalise the solution with unknown samples, severely limits its scalability in real applications. To overcome this limitation, some works employ unlabelled data from video-surveillance people detection [46, 47]. Nevertheless, without labelled matching pairs across camera views, existing unsupervised models are not capable of learning what makes a person recognisable under severe appearance changes. Therefore, the matching performance of unsupervised approaches is generally lower than that for supervised methods.
Given the inefficiency of unsupervised methods in re-id and the insufficiency of labelled data to apply supervised ones, transferring the models learned from a larger auxiliary dataset becomes necessary. The literature presents transfer learning, a.k.a. domain adaptation [48], as a widely applied tool in deep re-id model learning, where the target task is short of labelled data. The most common deep transfer learning strategy is fine-tuning [49]: firstly, a base network is trained on a large source data, then the weights corresponding to the first n layers are copied in the target network, and the remaining layers are randomly initialised; and secondly, fine-tuning is performed over these layers or over all layers.
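The initialisation step of this fine-tuning strategy can be sketched as follows. This is a minimal illustration and not the implementation used in this work: each layer is represented as a flat list of weights, and the function name, the Gaussian initialisation and its standard deviation are assumptions made only for the example.

```python
import random


def init_target_weights(source_weights, n_transferred, seed=0):
    """Build the initial weights of a target network.

    The first `n_transferred` layers are copied from the pre-trained
    source network; the remaining layers are randomly initialised.
    Layers are plain lists of floats (a simplification for illustration).
    """
    rng = random.Random(seed)
    target = []
    for index, layer in enumerate(source_weights):
        if index < n_transferred:
            # Transferred layer: reuse the pre-trained weights.
            target.append(list(layer))
        else:
            # Remaining layer: small random values (assumed Gaussian, std 0.01).
            target.append([rng.gauss(0.0, 0.01) for _ in layer])
    return target


# Toy "pre-trained" network with four layers of two weights each.
source = [[0.5, -0.2], [0.1, 0.3], [0.7, 0.9], [-0.4, 0.6]]
target = init_target_weights(source, n_transferred=2)
```

After this initialisation, training on the target data would update either the randomly initialised layers only or all layers, depending on the chosen setting.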
In theory, domain adaptation can be performed by any unsupervised deep learning method, such as, auto-encoder [50] and dictionary learning [51] which can be integrated as the later layers of a CNN network [52]. The method presented in [53] transfers a pre-trained model to solve re-id on an unlabelled target dataset. In addition, soft-label self-training based deep unsupervised learning has become popular recently [54]. In [55], a novel co-training based unsupervised domain adaptation method is proposed, in [56] a semi-supervised approach is presented, and in [57], both supervised and unsupervised settings are used to perform a two-stepped finetuning strategy.
On the other hand, to keep the advantages of supervised models and to cope with the lack of labelled training data, some recent works use auxiliary source datasets from a different domain, taking a multi-task joint training approach. These methods go beyond existing re-id datasets and consider much larger sources, like classification datasets. An example of vast classification dataset is ImageNet dataset [58], which contains millions of images of thousands of object categories. This has been shown useful as source dataset for model pre-training in re-id. In [55], the ImageNet dataset is employed in a two-stepped fine-tuning strategy, to solve a re-id model.
However, transferring knowledge from a classification dataset to a re-id dataset has a number of drawbacks, due to the differences in the tasks for which they were meant, and the dissimilarities in their input data. In [55], several model architecture modifications, and different loss functions are proposed, to combine the learning on two different domains. Indeed, the inputs of a re-id model are person detection images, which have very different aspect ratios and much lower resolutions than classification images. Consequently, the architectures used to train models with them are quite different. That implies the modification and adaptation of the traditional re-id architectures to be able to receive the transferred knowledge from the classification models.
Other methods aim to minimise the discrepancy between the marginal [59, 60] or joint [61] distributions of the source and target datasets, by blurring the domain boundary [59], with a cross-domain loss. However, this method is unsuitable when the source and target domains have completely different tasks, like in [55]. A systematic study is presented in [49] which examines how transferable features of different layers are between different domains. This work concludes that the generalisation ability decreases when the discrepancy between the source and target tasks increases.
For that reason, instead of adapting the re-id model network to a dataset from a completely different domain, in this work a new auxiliary source dataset is created from a different domain while keeping the characteristics needed to feed a re-id model. A multi-object tracking dataset has been chosen, specifically the MOT17 dataset from the MOT challenge. Like most datasets from the people tracking domain, MOT17 provides the location of the people detections and their identities at every frame of a set of video sequences.
In that way, the resolution and aspect ratio of the created auxiliary dataset are quite similar to those of re-id datasets, so we keep the advantages of using an auxiliary re-id dataset, with the added possibility of obtaining a larger training dataset. Moreover, a pairwise binary classification can be adopted both in the pre-training step, with the new dataset, and in the fine-tuning stage, with the re-id target dataset. Hence, the same network is used as source and target model.
In order to use a unique architecture, other works use different re-id datasets as source and target datasets. The cross-dataset transfer learning has been recently adopted for re-id to provide transferable discriminative information from other re-id labelled data to a given target dataset. Among the existing cross-dataset transfer learning works, the literature presents [62], where an SVM multi-kernel learning transfer strategy is adopted, and [63, 64], where multi-task metric learning models are employed. In [65], a single model is learnt across multiple re-id datasets before the fine-tuning in each one of them.
The size of the training data, even when it is formed by a combination of re-id datasets, is relatively small, which tends to produce an overfitted, camera-pair-specific model. Consequently, that model cannot be directly transferred to solve re-id in a target re-id dataset that contains people who never appeared in the source datasets and that was captured from a new camera pair.
Besides that, the use of existing re-id labelled datasets, each captured from camera pairs with different viewing conditions, to pre-train a model, allows learning the view-invariant features of a person’s appearance. However, the differences between the source and target domains, due to the drastically different camera viewing conditions could result in a negative transfer [53].
Therefore, the negative transfer of the pre-learnt view-to-view transitions should be avoided. For that reason, contrary to the cross-dataset approach, this work deals with the lack of data by creating a new, large dataset from multi-object tracking sequences. In addition, people detections from different sequences are compared in the pre-training of the model. In that way, the model is fed with input data captured from different camera views, which avoids learning, and consequently negatively transferring, specific camera-to-camera transformations.
Re-identification model
Considering single-shot re-identification task as an isolated module, its objective is to identify the person represented by an image from one view (probe image) among all the images from the other view (gallery images).
This is achieved by calculating the distances between the probe image and all the gallery images. Then, the gallery image presenting the smallest distance is selected as the correct match.
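The matching step can be illustrated with the following minimal sketch. It assumes that every image is already represented by a numeric vector and that the squared Euclidean distance is used as the metric; the function names and the toy vectors are hypothetical.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two equally sized vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def rank_gallery(probe, gallery):
    """Return the gallery indices sorted from the closest to the farthest."""
    distances = [squared_euclidean(probe, item) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: distances[i])


# Toy vectors standing in for one probe and three gallery representations.
gallery = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
probe = [1.0, 0.0]

ranking = rank_gallery(probe, gallery)
best_match = ranking[0]  # index of the gallery item selected as the match
```

In this toy case the second gallery item is the closest one, so it is the item selected as the match.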
The distances are not computed directly over the raw images but over a representative feature array. Therefore, it is necessary to learn an embedding
This feature embedding has been modelled by a deep convolutional neural network, whose architecture is described in Section 3.1. To train the weights of this neural network, a Triplet model has been employed, as described in Section 3.2.
The objective of this model is to perform person re-identification, so re-id domain is considered as the target domain. However, learning previously acquired in a different domain, called source or auxiliary domain can be transferred to initialize the weights of the model.
The presented learning model has been used both for its pre-training on a source dataset and for its fine-tuning on a target dataset, allowing the use of a single architecture to transfer learning between two different data domains.
Model architecture
The transformation of every image into its corresponding point in the feature space is performed by a deep convolutional neural network (DCNN), specifically the 16-layer network presented as version D of the set of Very Deep Convolutional Networks in [43], hereafter called VGG16. The selection of this architecture has been motivated by its extensive use in both domains, multi-object tracking [45] and re-identification [23], with demonstrated good results. In addition, this architecture is well suited to the proof of concept of the proposed approach conducted in this work, and its configuration is easily implemented with the Caffe libraries [66].
VGG16 presents thirteen convolution layers and three fully connected (FC) layers. Moreover, this network presents a SoftMax layer as the final layer, which has been removed in this work. Besides that, the input size has been modified in order to adapt it to the re-identification training samples dimensions. Therefore, the input of the proposed DCNN is a person RGB image of 128
Structure of the used VGG16-based model. The input and output sizes are given as #rows × #cols × #filters; the kernels, as #rows × #cols × #filters, stride; and the FC layers, as #outputs.
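As a rough check of the convolutional part of this structure, the sketch below lists the layers of configuration D of [43] and propagates an input size through them. The 128 × 64 RGB input is assumed here only for illustration, the helper name is hypothetical, and the fully connected layers are left out because their output sizes are not restated in the text.

```python
# Configuration D of [43]: integers are the output channels of 3x3
# convolutions (stride 1, padding 1); "M" marks a 2x2 max pooling (stride 2).
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def conv_output_shape(rows, cols, channels=3, config=VGG16_CONV):
    """Return (#rows, #cols, #filters) after the convolutional stack."""
    for item in config:
        if item == "M":
            # Max pooling halves both spatial dimensions.
            rows, cols = rows // 2, cols // 2
        else:
            # Padded 3x3 convolution keeps rows/cols and sets the channels.
            channels = item
    return rows, cols, channels


n_conv_layers = sum(1 for item in VGG16_CONV if item != "M")
feature_map = conv_output_shape(128, 64)  # assumed input size
```

With these assumptions the stack contains thirteen convolution layers and reduces a 128 × 64 input to a 4 × 2 × 512 feature map before the fully connected layers.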
To learn the weights of the previously described deep neural network, a Triplet model approach is proposed since it has been demonstrated to alleviate the unbalanced nature of the re-id datasets [42]. Indeed, the number of instances of a query identity is very limited, specifically when compared with the vast quantity of potentially available different people representations.
According to the triplet model, the network is triplicated in three branches, which are forced to share the same parameters. That means that the learnt weights are identical in the three branches, so only one feature descriptor is learnt, but it is computed for each one of the three different input images, as Fig. 2 shows.
The first branch receives an anchor sample,
Triplet learning architecture.
The goal is to learn a transformation from the image space to the feature space such that the representations of the same person lie close together and far from the representations of different people. These constraints are imposed by the triplet loss function.
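For orientation, the margin-based triplet loss introduced in [36] can be sketched as below. This is the standard formulation and may differ from the exact loss minimised in this work; the margin value, the function names and the toy feature vectors are assumptions.

```python
def squared_euclidean(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss of [36] for one (anchor, positive, negative).

    The loss is zero once the anchor is closer to the positive than to
    the negative by at least `margin` (an assumed value).
    """
    d_pos = squared_euclidean(anchor, positive)
    d_neg = squared_euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


anchor = [0.0, 0.0]
positive = [0.1, 0.0]   # same identity, close to the anchor
negative = [1.0, 0.0]   # different identity, far from the anchor

loss_satisfied = triplet_loss(anchor, positive, negative)
loss_violated = triplet_loss(anchor, negative, positive)  # roles swapped
```

The first triplet already satisfies the constraint and contributes no loss, whereas the swapped triplet violates it and is penalised.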
The feature representation for an image,
The similarities between the input images,
This equation ensures that the anchor sample is closer to all samples of the same person, in the feature space, than it is to any sample of other people since
However, in the single-shot person re-identification challenge, there is only one positive sample for every anchor image. The available data is therefore not enough to adopt an identity-oriented approach, which clusters the samples of the same person close to each other and far from the clusters of other identities.
In contrast, the proposed method treats all the possible positive pairs as a set representing the similarity class. In the same way, negative pairs represent the dissimilarity class. The discrimination between similarity and dissimilarity is learnt by comparing every positive pair with all the negative pairs generated from the same anchor, taking advantage of the triplet model architecture.
The huge number of possible triplet combinations makes it impossible to use all the data in every iteration, owing to the limits of the available memory. This has led to the implementation of a mini-batch-based learning algorithm. Therefore, eventually, the loss function to minimise,
Batch gradient descent, where the batch is formed by all the training samples, provides more accuracy in the parameters updating than Stochastic gradient descent (
The value of
The re-id performance of the proposed model is measured over some re-id datasets, so these are known as target datasets. Because of the insufficiency of labelled data provided by the target re-id datasets to avoid the overfitting of the trained neural model, transferring the parameters learned on a larger auxiliary dataset becomes necessary.
This auxiliary dataset, also called source dataset, is used to pre-train the neural model. In that way, the learning acquired by the whole model or by some of its layers could be transferred to solve the re-id task with improved performance.
The employed source and target datasets, and the levels of learning transference from one to the other domain are described in the following sections.
Target and source datasets
The following two datasets have been used as target datasets: PRID2011 (Person Re-ID) dataset2 [25] and VIPeR (Viewpoint Invariant Pedestrian Recognition) dataset3 [26]. These are two of the most widely used datasets for evaluating re-identification approaches.
Both are composed of two sets (A and B) of person images captured from two static non-overlapping camera views with remarkable differences in camera characteristics, illumination, person’s poses, and background. The person images have a resolution of 64
Triplet samples from PRID2011 dataset.
The number of images and of different individuals in each dataset (in its single-shot version) is given below:
PRID2011: Set A contains 385 different images, and set B, 749 images. 200 of the captured individuals appear in both sets; 90 of them were randomly selected to generate the training set and 10 to generate the cross-validation set. The test set has been formed by following the procedure described in [25], i.e., the images of set A for the 100 remaining individuals with representation in both sets have been used as the probe set. The gallery set has been formed by 649 images belonging to set B (all images of set B except the 100 corresponding to the training and cross-validation individuals).
VIPeR: It presents 632 pedestrians, each one with representation in both sets, A and B. For evaluation, the procedure described in [26] has been followed. The set of 632 pairs has been randomly split into two sets of 316 pairs each. One of them has been divided in a 90%-10% proportion to create the training and cross-validation sets, respectively, and the other set has been used for testing the algorithm. Therefore, the gallery set is formed by 316 images from set A, and the probe set by their matching images from set B.
Triplet samples generated from MOT17 dataset.
For both datasets, the individuals selected as training samples have been combined to form a large number of training triplets, and likewise for the cross-validation set. This has been performed by a data generation tool,4 which takes the anchor samples from one of the views (A) and the positive and negative samples from the other one (B), as Fig. 3 shows through some triplet examples extracted from PRID2011. In order to generate a vast number of triplets, every available positive pair has been grouped with every possible negative sample.
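The combination scheme just described can be sketched as follows. This is a minimal illustration rather than the released data generation tool: the dictionary layout, the function name and the file names are hypothetical, and negatives are assumed to be drawn only from the identities present in view B of the given split.

```python
def make_reid_triplets(view_a, view_b):
    """Combine every positive pair with every possible negative sample.

    `view_a` and `view_b` map a person identifier to that person's image
    in each camera view (hypothetical layout). Anchors come from view A;
    positives and negatives come from view B.
    """
    triplets = []
    for person_id, anchor in view_a.items():
        if person_id not in view_b:
            continue  # no positive sample available for this anchor
        positive = view_b[person_id]
        for other_id, negative in view_b.items():
            if other_id != person_id:
                triplets.append((anchor, positive, negative))
    return triplets


# Toy split with three identities seen in both views.
view_a = {1: "a_001.png", 2: "a_002.png", 3: "a_003.png"}
view_b = {1: "b_001.png", 2: "b_002.png", 3: "b_003.png"}

triplets = make_reid_triplets(view_a, view_b)
```

Under these assumptions, N identities yield N(N - 1) triplets, so the number of training inputs grows quadratically with the number of labelled individuals.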
Before training and testing on the described re-id datasets, the model has been pre-trained using an auxiliary dataset. This auxiliary dataset, also called source dataset, has been created by extracting people detections from the MOT17 dataset.5 This dataset includes seven varied, labelled real-world surveillance sequences, meant to train and test multi-person tracking algorithms. The MOT17 dataset belongs to the MOT challenge and contains the same set of sequences as MOT16 [67], but with extended ground truth.
A data generation tool6 has been employed to extract and combine people images to create the training and cross-validation sets of triplets. From each frame, the people that appear have been extracted using their bounding boxes, which are defined in the dataset ground truth together with their identification numbers. Subsequently, triplets have been generated from the extracted people images, taking pairs of samples from the same individual as the anchor and positive images. These samples need not be consecutive in time; that is, a certain time step is allowed between the frames from which they come. Then, for each pair, a negative sample is added to form a triplet. The negative sample is an image of a different individual and can even be taken from a different sequence of the MOT17 dataset, as Fig. 4 shows.
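A simplified version of this generation procedure is sketched below. It is not the released tool: the tuple layout, the function name, the `max_frame_gap` parameter and the sample labels are hypothetical, and identifiers are assumed to be unique only within a sequence, so that the same number in two sequences denotes two different people.

```python
import itertools


def make_mot_triplets(detections, max_frame_gap):
    """Build (anchor, positive, negative) triplets from tracking detections.

    Each detection is a tuple (sequence, frame, person_id, image).
    Anchor and positive share sequence and identifier and are at most
    `max_frame_gap` frames apart; the negative is any other individual,
    possibly from a different sequence.
    """
    triplets = []
    for anchor, positive in itertools.permutations(detections, 2):
        same_person = (anchor[0], anchor[2]) == (positive[0], positive[2])
        if not same_person or abs(anchor[1] - positive[1]) > max_frame_gap:
            continue
        for negative in detections:
            if (negative[0], negative[2]) != (anchor[0], anchor[2]):
                triplets.append((anchor[3], positive[3], negative[3]))
    return triplets


# Toy detections from two hypothetical sequences.
detections = [
    ("seqA", 1, 7, "seqA_f1_id7"),
    ("seqA", 5, 7, "seqA_f5_id7"),
    ("seqA", 1, 9, "seqA_f1_id9"),
    ("seqB", 3, 7, "seqB_f3_id7"),
]
triplets = make_mot_triplets(detections, max_frame_gap=10)
```

In this toy case the two detections of individual 7 in the first sequence form the positive pairs, and the negatives include a detection taken from the second sequence.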
The person images have been resized to the same resolution as the re-id dataset samples, 64
Fine-tuning [49] has been the adopted transfer learning strategy in this work. This is composed of two training processes: firstly, the neural model (presented in Section 3) is trained on the source dataset, created from tracking sequences. Secondly, the model is fine-tuned over the target dataset. To do that, the pre-trained weights corresponding to the first
Therefore,
Depending on the values given to
Set-up values for the number of layers initialised with the pre-trained weights and the number of layers fine-tuned on the target dataset, for every level of transfer learning
Contribution of the source (blue) and target (yellow) datasets on the learning of the different layers weights for an example model, and for given values of 
No layer is pre-trained on the source dataset (
Seven of the layers pre-trained on the source dataset (0
Seven of the layers pre-trained on the source dataset (0
All the layers pre-trained on the source dataset (
All the layers pre-trained on the source dataset (
All the layers are trained on the source dataset (
In general, the intuition behind fine-tuning is that low-level features are learnt from the source dataset. This dataset is formed by person images from MOT sequences, so many of the learnt features can be transferred to solve the task of re-identifying people. Then, by fine-tuning the parameters and learning the final layers of the model on the re-id target dataset, high-level descriptors are learnt. These descriptors automatically acquire the capacities of discriminating people and recognising an identity in images from different views, through their training on the target dataset. That is because, contrary to the source dataset, the re-id target dataset contains samples of images captured from different camera views.
The transfer learning method consists of a first stage of pre-training on the source dataset and a second stage of fine-tuning on the target dataset. Instead of only evaluating the re-id performance of the final model, a prior test of the pre-trained model has also been conducted. Therefore, both stages of transfer learning have been evaluated.
The following Subsections 5.1 and 5.2 describe the evaluation of the pre-trained and final model respectively. Both present the employed metrics and datasets, as well as the conducted experiments and the obtained results.
Pre-trained model evaluation
In this work, a pairwise binary classification model has been pre-trained on samples extracted from a MOT dataset. The resulting model is meant to initialize the weights of a re-identification network. A prior test of its ability to discriminate between positive and negative pairs of person images is therefore essential to judge whether it is a suitable initializer.
A test set has been generated by extracting pairs of samples from the MOT17 sequences [67], as explained in Section 4.1. For that, individuals different from those appearing in the training and cross-validation sets have been used. In order to provide a fair evaluation, the test set has been formed by 20000 positive pairs and the same number of negative ones, giving a completely balanced proportion of samples from each class.
The performance of the pre-trained model in classifying the sample pairs of the test set is represented by a Relative Operating Characteristic (ROC) curve [69], shown in Fig. 6. This curve plots the True Positive Rate (TPR), also called Sensitivity or Recall, against the False Positive Rate (FPR), also known as fall-out rate, at various settings of the discrimination threshold, th.
ROC curve of the pre-trained model.
The TPR and FPR metrics are defined by Eqs (3) and (4), respectively, where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives.
$$\mathrm{TPR} = \frac{TP}{TP + FN} \qquad (3)$$
$$\mathrm{FPR} = \frac{FP}{FP + TN} \qquad (4)$$
The obtained values for these metrics are listed in Table 3.
The classification test shows that the proposed re-id architecture has a high capacity to learn to discriminate between positive pairs of images, which belong to the same person, and negative ones, which correspond to different people.
The distance between the features of the two images of a pair measures their degree of dissimilarity. The chosen threshold, th, therefore divides the distance space into two ranges of values, one per class. The objective was to learn features that make the distance between two images of the same person lower than the threshold, and the distance between images of different people higher, so as to provide a high TPR and a low FPR.
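The thresholding rule described above can be illustrated with a minimal Python sketch. It assumes that a pair is predicted positive when its distance is strictly below th; the function names and the toy distances are hypothetical and are not results from this work.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tpr_fpr(distances, labels, th):
    """Return (TPR, FPR) when pairs with distance < th are predicted positive.

    labels holds 1 for a positive pair (same person) and 0 for a negative pair.
    """
    tp = fp = tn = fn = 0
    for d, y in zip(distances, labels):
        predicted_positive = d < th
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    # Guard against a class with no samples.
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy pair distances (hypothetical values, for illustration only).
distances = [0.2, 0.5, 0.9, 0.3, 1.4, 0.6]
labels = [1, 1, 1, 0, 0, 0]
for th in (0.4, 0.7, 1.0):
    print(th, tpr_fpr(distances, labels, th))
```

Sweeping th over a range of values and plotting the resulting (FPR, TPR) points yields a ROC curve of the kind discussed in this subsection.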
Metrics scores resulting from the binary classification performed by the model pretrained on the MOT dataset, for different values of th
The results show that, with th set to 0.7, the model keeps the TPR above 90% and the FPR below 10%, which are good values for a task as challenging as re-identification, even in a MOT sequence. These results also indicate the potential applicability of this model not only to the re-identification task but also to the visual association of identities in a tracking algorithm.
In the second stage of transfer learning, the weights of a certain number of layers of the pre-trained model are transferred to the network, and then the weights of some of its layers are fine-tuned. In this work, six different levels of transfer learning (defined in Section 4.2) have been applied to two different re-id datasets, PRID2011 and VIPeR. Therefore, twelve different final models have been obtained and evaluated. The procedures described in Section 4.1 have been applied to select the test data.
The performance of the learnt models has been evaluated by computing their Cumulative Matching Characteristic (CMC) curve [70], which is a standard re-id performance measurement.
To obtain the CMC curve, first, every image from the probe set is coupled with every image from the gallery set and the distance metric (squared Euclidean distance) between them is computed. The matches presenting the lowest values of the distance metric are considered as the top matches since two images belonging to the same person should be rendered close in the feature space and further from different people representations. The distance metrics obtained by comparing a probe image with all the gallery set images are ranked. This process is repeated for each one of the probe images. Then the CMC curve renders the expectation of finding the correct match within the top
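The ranking procedure described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code used in this work: it assumes the single-shot setting, in which every probe identity appears exactly once in the gallery, and the function names and toy features are hypothetical.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cmc_curve(probe_feats, probe_ids, gallery_feats, gallery_ids, max_rank):
    """Return the CMC scores (as fractions) for ranks 1..max_rank.

    Assumes each probe identity appears exactly once in the gallery.
    """
    hits = [0] * max_rank
    for feat, pid in zip(probe_feats, probe_ids):
        # Compare the probe with every gallery image and rank by distance.
        dists = [squared_euclidean(feat, g) for g in gallery_feats]
        order = sorted(range(len(dists)), key=lambda i: dists[i])
        ranked_ids = [gallery_ids[i] for i in order]
        rank = ranked_ids.index(pid)  # 0-based position of the correct match
        # A correct match at position `rank` counts for every larger rank too.
        for r in range(rank, max_rank):
            hits[r] += 1
    n = len(probe_feats)
    return [h / n for h in hits]


# Toy 2-D features (hypothetical values, for illustration only).
gallery_feats = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
gallery_ids = ["a", "b", "c"]
probe_feats = [[0.1, 0.0], [0.4, 0.1], [0.2, 0.9]]
probe_ids = ["a", "b", "c"]
cmc = cmc_curve(probe_feats, probe_ids, gallery_feats, gallery_ids, max_rank=3)
print(cmc)
```

In this toy case two of the three probes find their correct match in first place and the third finds it in second place, so the curve reaches 1.0 from rank 2 onwards.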
The time required to compute the proposed image descriptor on an Nvidia Titan Xp graphics processing unit is 11.34 ms. This value is the average time taken to compute the output of the trained network over 65,000 different samples. Such a computation time makes it possible to use the proposed descriptor in a real-time surveillance system.
The CMC scores obtained for the models learnt on the PRID2011 dataset with the different transfer learning levels are listed in Table 4 and, for ease of exposition, plotted in Fig. 7. Analogously, Table 5 and Fig. 8 present the CMC scores for the models obtained with the VIPeR dataset.
CMC scores (in [%]) for different levels of transfer learning, tested over PRID2011 dataset
CMC curves for different levels of transfer learning, tested over PRID2011 dataset.
CMC scores (in [%]) for different levels of transfer learning, tested over VIPeR dataset
CMC curves for different levels of transfer learning, tested over VIPeR dataset.
In general, both figures show that the extreme levels, 0 and 5, yield the worst results, while the intermediate levels, 3 and 4, achieve the best scores.
In level 0, the model has been trained and tested on the target dataset. Because of the small number of samples in the target dataset (especially positive samples), this model suffers from overfitting and cannot generalise the learnt solution to the unseen samples of the test set. Figure 9 shows the learning curves for the level 0 training on the PRID2011 and VIPeR datasets.
Learning curves for level 0 training with PRID2011 (a) and VIPeR dataset (b).
The curves plot the loss value,
Conversely, the problem in level 5 is that the model has been trained on an auxiliary dataset whose samples, despite their huge number, were not captured from two different camera views. As a consequence, the model cannot learn features that are robust against intra-class variations.
Accuracy metric values comparison
On the other hand, in levels 3 and 4, all the weights pre-trained in the source dataset have been transferred to the model. Then, in level 3 all the layers have been fine-tuned with the target dataset and, in level 4, only the final
In addition, model overfitting is reduced by pre-training on a larger auxiliary dataset, as the reduction of the cross-validation loss in Fig. 10 shows.
Learning curves for the fine-tuning of level 4 with PRID2011 (a) and VIPeR dataset (b).
The intuition behind this approach is that the most representative features of a person are automatically learnt on the MOT dataset. From the set of learnt descriptors, the low-level ones, which are learnt in the earlier layers of the network model, are kept. These low-level features have been trained on a larger and more varied dataset, so they do not overfit the re-id training data. Then, the higher-level representations, encoded in the deeper layers, are fine-tuned on a re-id target dataset to make them more discriminative.
Furthermore, prediction accuracy has also been estimated during the training process. This metric is the percentage of well-classified triplets over the total, where a triplet counts as well classified when its negative pair distance is larger than its positive pair distance by a threshold,
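A minimal sketch of this accuracy metric is given below. The name `margin` stands for the threshold mentioned above, and both its value and the toy distances are hypothetical. The strict inequality is also an assumption.

```python
def triplet_accuracy(pos_dists, neg_dists, margin):
    """Percentage of triplets whose negative-pair distance exceeds the
    positive-pair distance by more than `margin`.

    pos_dists[i] and neg_dists[i] are the anchor-positive and anchor-negative
    distances of the i-th triplet.
    """
    if len(pos_dists) != len(neg_dists):
        raise ValueError("one positive and one negative distance per triplet")
    if not pos_dists:
        return 0.0
    correct = sum(
        1 for dp, dn in zip(pos_dists, neg_dists) if dn - dp > margin
    )
    return 100.0 * correct / len(pos_dists)


# Toy distances for four triplets (hypothetical values).
pos_dists = [0.2, 0.5, 0.9, 0.1]
neg_dists = [1.0, 0.6, 0.8, 0.7]
print(triplet_accuracy(pos_dists, neg_dists, margin=0.3))
```

With these toy values, only the first and last triplets satisfy the margin, so the sketch reports an accuracy of 50%.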
The models that have been obtained by transferring the MOT dataset learning clearly outperform the models directly learnt on the re-id dataset. This result can be taken as proof of concept to verify that transfer learning can be exploited to solve the re-id data problem, which is the main objective of this work.
Although the proposed method does not directly address the problems of intra-class variations and inter-class ambiguities, it implicitly alleviates them through the pre-training of the model on a larger and more varied dataset. The higher CMC scores show that the proposed transference of learning from the MOT domain makes the re-id model capable of successfully handling a larger number of intra-class variation and inter-class ambiguity cases.
Concretely, the best performance is achieved when all the network layers are pre-trained on the source dataset and fine-tuned on the target one. This is what we have called level 3 of transfer learning.
For that reason, the models obtained by applying level 3 of transfer learning have been compared with other state-of-the-art methods, and their CMC scores are listed in Tables 7 and 8, for PRID2011 and VIPeR dataset, respectively.
Re-id methods comparison on PRID2011 dataset, ‘–’ indicates no result was reported
Re-id methods comparison on VIPeR dataset, ‘–’ indicates no result was reported
Instead of automatically finding high-level features by means of deep learning algorithms, the compared methods are mainly based on the computation of hand-crafted low-level features and the learning of a distance metric to compare them.
Linear Discriminant Analysis (LDA) [31] and Logistic Discriminant Metric Learning (LDML) [32] are discriminant methods. They learn a distance metric over a set of previously computed features, in contrast to the proposed approach, where the whole re-id model is jointly trained, resulting in a notable improvement.
In [33], a Prototype-Sensitive Feature Importance based method is proposed to adaptively weight features according to different clusters of the population. In contrast, [71] presents a Global Feature Importance (GFI) approach, which learns a global weighting, i.e. a vector of generic weights that is invariant to the population. Other examples of GFI methods are Ranking Support Vector Machines (Rank-SVM) [29] and Probabilistic Relative Distance Comparison (PRDC) [30]. No population discrimination has been made in this work, and the general weighting of the features to create a global person descriptor has been implicitly and automatically performed by the learning of the proposed deep neural model. This proposed approach outperforms the previously cited methods and even the combination of some of them (PSFI
In [72], the Euclidean distance is directly applied to compare two person representations. This paper also proposes to use the Euclidean distance, with the difference that the compared descriptors have been learnt by a deep re-id neural model, which produces a remarkable improvement in the performance.
In general, the proposed re-id model presents better results than the other compared methods, which do not use deep learning models. Therefore, this paper proves that, as long as the lack of re-id data is faced by a proper strategy like transfer learning, a deep learning approach is applicable to solve the challenging task of single-shot re-id.
The main goal of this paper was to solve a problem derived from the lack of data when addressing the single-shot re-identification task through the learning of a deep triplet neural network. With that objective, this paper proposes the transference of learning from the Multi-Object Tracking (MOT) domain to the Re-Identification (re-id) one.
In order to avoid the overfitting of the proposed neural model, a large number of triplet samples has been generated from a set of MOT sequences. Therefore, the MOT domain has been adapted to feed a re-id neural model, and not the other way around.
Different levels of transfer learning have been designed and analysed, and the results have proved that the transfer of learning from a larger and different domain dataset provides a remarkable improvement in the re-identification performance of the proposed deep model.
The method has been evaluated on two of the most challenging and commonly used re-id datasets, PRID2011 [25] and VIPeR [26], proving the effectiveness of the proposed approach, which outperforms some of the most used metric learning methods.
Besides proving the potential application of deep and transfer learning to solve single-shot re-identification, this paper proposes a generalized solution to learn a metric of the degree of appearance similarity. The conducted evaluation presents this metric as a potential tool for comparing images of people not only in the context of re-identification but also in people tracking applications. In people tracking, an identity association process based on visual appearance can be built on our degree of appearance similarity measurement. This measurement is the result of comparing two images with our modelled deep features, and it can be used as the cost of matching a certain detection with previous captures of a certain tracked identity.
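As an illustration of how such a matching cost might be used in a tracker, the sketch below builds a cost matrix between tracked identities and new detections and then associates them. Several choices here are assumptions rather than part of this work: the cost of a track-detection pair is taken as the minimum distance to any stored capture of the track, the association is a simple greedy one, every track is assumed to hold at least one capture, and `max_cost` is a hypothetical gating value.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def association_costs(track_galleries, detection_feats):
    """cost[t][d]: smallest distance between detection d and any stored
    capture of track t (each track must hold at least one capture)."""
    return [
        [min(squared_euclidean(det, cap) for cap in captures)
         for det in detection_feats]
        for captures in track_galleries
    ]


def greedy_match(costs, max_cost):
    """Greedy one-to-one association in order of increasing cost.

    Pairs whose cost exceeds max_cost are never matched.
    Returns a sorted list of (track_index, detection_index) pairs.
    """
    candidates = sorted(
        (c, t, d)
        for t, row in enumerate(costs)
        for d, c in enumerate(row)
        if c <= max_cost
    )
    used_tracks, used_dets, matches = set(), set(), []
    for _, t, d in candidates:
        if t not in used_tracks and d not in used_dets:
            matches.append((t, d))
            used_tracks.add(t)
            used_dets.add(d)
    return sorted(matches)


# Toy descriptors (hypothetical values): two tracks, three new detections.
track_galleries = [[[0.0, 0.0], [0.1, 0.0]], [[1.0, 1.0]]]
detection_feats = [[0.9, 1.0], [0.1, 0.1], [5.0, 5.0]]
costs = association_costs(track_galleries, detection_feats)
matches = greedy_match(costs, max_cost=0.7)
print(matches)
```

In this toy case the third detection is far from both tracks and stays unmatched, so a tracker could treat it as a new identity.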
Footnotes
The data generation tool has been implemented as a set of c
Publicly available under:
Publicly available under:
Acknowledgments
This work was supported by the Spanish Government through the CICYT projects (TRA2015-63708-R and TRA2016-78886-C3-1-R), and Ministerio de Educación, Cultura y Deporte para la Formación de Profesorado Universitario (FPU14/02143), Ayudas a la movilidad para estancias breves y traslados temporales (2018) of Programa Estatal de promoción del talento y su empleabilidad, and Comunidad de Madrid through SEGVAUTO-TRIES (S2013/MIT-2713). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
