Abstract
Automating the Re-Identification of an individual across different video-surveillance cameras poses a significant challenge due to the presence of a vast number of potential candidates with a similar appearance. This task requires learning discriminative features from person images and a distance metric to properly compare them and decide whether they belong to the same person or not. Nevertheless, acquiring images of the same person from different, distant and non-overlapping views produces changes in illumination, perspective, background, resolution and scale between the person’s representations, resulting in appearance variations that hamper his/her re-identification. This article focuses the feature learning on automatically finding discriminative descriptors able to reflect the dissimilarities mainly due to changes in people’s actual appearance, independently of the variations introduced by the acquisition point. To that end, such variations are implicitly embedded in the Mahalanobis distance. This article presents a learning algorithm to jointly model the features and the Mahalanobis distance through a Deep Neural Re-Identification model. The Mahalanobis distance learning has been implemented as a novel neural layer, forming part of a Triplet Learning model that has been evaluated on the PRID2011 dataset, providing satisfactory results.

Introduction
Person Re-Identification (Re-Id) consists of visually recognising an individual through images from non-overlapping and distant camera views. Automating Person Re-Identification is fundamental to extending the monitoring capacity of a large distributed Intelligent Surveillance System (ISS) across multiple camera views at different times and locations.
Typically, in real-world surveillance scenarios, accurate temporal and spatial constraints are difficult to establish, and fine biometric cues cannot be acquired from distant sensors. For that reason, research has mainly focused on Re-Id models that are based on visual data. These models extract features from surveillance images and compute a distance metric (a.k.a. connection function) to compare them.
In the last decade, the Re-Id task has been addressed as a Single-Shot recognition problem, using only one capture of each person from each camera view. In that context, the distance metric is a contrastive measurement that predicts whether two images correspond to the same person or not. Hence, the Re-Id problem is treated as a pairwise binary classification task. Many metric learning algorithms have been developed under this paradigm. These algorithms optimise a distance function to compare features, once these features have been previously extracted by a hand-crafted descriptor.
The emergence of the Multi-Target Multi-Camera Tracking challenge (MTMCT) has brought a new generation of Re-Id algorithms that follow a Multi-Shot recognition approach [10, 37, 41, 73, 90]. These methods acquire temporal consistency from the previous tracking of the individuals. Unlike Single-Shot Re-Id, Multi-Shot Re-Id relies on the availability of track fragments formed by consecutive detections. Consequently, the Single-Shot approach reduces the quantity of processed data and speeds up communications in a distributed network of cooperative sensors.
To take advantage of the automatic feature extraction provided by neural models, without losing the benefits of Single-Shot strategy, this article refocuses the Single-Shot metric learning approach through the backwards propagation of the distance embedding as a set of neural parameters.
This approach entails several computer vision challenges. Firstly, inter-class ambiguities are caused by the presence of different people with a similar appearance. For that reason, the learning of discriminative features is fundamental. However, modelling Deep Convolutional Neural Networks (DCNN) with that purpose is a daunting task due to the lack of Re-Id data, since there is no previous knowledge about the queried individual, but only the current captures from two different views. Hence, the number of instances of a query identity is very limited, especially when compared with the vast quantity of potentially available representations of different people. The shortage of data and its underlying unbalanced nature tend to make the neural models overfit and collapse.
Secondly, the challenge of Person Re-Id is to match images of the same person that were captured by different non-overlapping camera views against significant and unknown cross-view feature distortion. The differences between the camera characteristics and their perspectives cause large changes in scale, resolution, illumination, background and pose, and consequently the misalignment in the compared human shapes. This results in dramatically different appearances and consequently, different representations of the same person, which are called intra-class variations.
In the case of having a fixed pair of camera views, a view-to-view transformation can be modelled. This transformation encompasses the variation between the cameras’ perspectives and their visual characteristics. Subsequently, this transformation can be considered by the Re-Id model through the proper distance metric. In that way, the extraction of representative person features is reduced to the learning of those descriptors able to compare the actual individual’s appearance, without the perturbation caused by the view-to-view variations.
To code the view-to-view transitions, Roth et al. [69] proposed the use of the Mahalanobis distance as the distance metric. They estimated the Mahalanobis matrix using multiple well-known metric learners from previously computed hand-crafted features. Then, in [22], an estimation method integrated into the learning of a neural Re-Id model was introduced. In that work, the Mahalanobis matrix was updated every certain number of learning iterations, after the features were calculated by forward propagating the neural model. However, the estimation of the matrix was not involved in the backpropagation process.
This article proposes not only the use of the Mahalanobis distance to compare person images and its integration in the process to train visual features, but also the learning of the Mahalanobis matrix as an extra set of neural parameters in a triplet model. Therefore, the main contributions of this work are:
The design of a deep neural model for person Re-Identification, where discriminative visual features and the Mahalanobis matrix to compare them are simultaneously learnt.
The re-formulation of the triplet loss function, taking the Mahalanobis distance as the connection function between the features to compare.
The formulation of the back-propagation stage of the new loss function.
The design of the learning procedure for training the elements of the Mahalanobis matrix as an extra set of neural parameters.
The vectorized implementation of the new loss and connection functions as neural layers, which have been publicly delivered in the format of Caffe-python layers (footnote 1).
Two major challenges are addressed in this work: the learning of person features capable of rendering the most discriminant aspects of an individual’s appearance, to cope with the inter-class ambiguities; and the learning of a proper distance metric that optimally combines the visual features and reflects the camera-to-camera transitions, to face the intra-class variations.
Fig. 1. Examples of matched pairs belonging to the PRID2011 dataset.
A Triplet approach has been used to train the proposed Re-Id neural model, which has been evaluated on the PRID2011 [32] dataset, providing successful results. This is one of the most used Single-Shot Re-Id datasets, and it presents an uncommon and particular acquisition setting that perfectly fits the requirements of the conducted research. This dataset was captured from two fixed camera views, and groups the images from each camera separately, which allows demonstrating the effectiveness of the proposed Mahalanobis distance learning method to embed the intra-class variations between two fixed capturing points.
The rest of the article is structured as follows: Section 2 presents a review of the existing related works. Section 3 describes the proposed Re-Id neural model, and Section 4, the developed learning algorithm to train it. Finally, Sections 5 and 6 present the obtained results and some concluding remarks, respectively.
Related work
Person Re-Identification (Re-Id) has become one of the most studied tasks in intelligent video surveillance since many other applications, like tracking [86] or behaviour analysis [1, 50], rely on solving the Re-Id problem [78]. A good Re-Id performance is necessary to extend the functionalities of an Intelligent Surveillance System through several camera views, which is the basis of large-scale distributed ISSs [7, 74].
In the context of the visual appearance Re-Id paradigm, the literature presents two main Re-Id strategies: Multi-Shot, e.g. [42, 76], and Single-Shot recognition, [56]. Multi-Shot recognition is performed by matching two tracklets (i.e. short sequences of images), one captured from each camera view. These methods have been boosted by the rise of Multi-Target Multi-Camera Tracking (MTMCT) systems and datasets like DukeMTMC, CUHK03 and Market-1501.
Conversely, the Single-Shot approach avoids the dependence on the availability of a sequence of detections of a tracked individual to re-identify him/her. In Single-Shot frameworks, only one image per person and per view is given. Figure 1 shows examples of matched pairs (in each column) from a Single-Shot dataset, specifically PRID2011 [32]. Every pair is formed by images that were captured from two different camera views (cam a and cam b).
Under the Single-Shot constraints, the Re-Id task aims to identify the person represented by an image from one view, called probe image, among all the images of different people from the other view, called gallery of images. Hence, the Single-Shot Re-Id problem is commonly treated as a pairwise binary classification task, composed of two main stages: features extraction and their comparison through a certain distance metric.
Consequently, literature methods for Re-Id generally fall into two categories: methods focused on enhancing the person features design to represent the most discriminant aspects of an individual’s appearance, e.g. [52, 92], and those meant to learn an effective discriminative distance to optimally combine the previously extracted visual features, e.g. [44, 89].
The earliest works made use of handcrafted features based on low-level local features, like colour, texture and shape [6, 27, 34, 59, 80, 88]. Other works proposed the integration of several types of features with complementary nature, into a global signature, such as BoW models [19], or covariance descriptors [14].
Besides, many research efforts have been dedicated to learning salient features [91, 92]. In the context of Re-Id, human salience is different from general image salience in the way of drawing visual attention.
Concretely, for describing Re-Id images, there exist two types of visual features that are desirable to learn: invariant and discriminative ones. Invariant features, e.g. [6], are both distinctive and stable under changing viewing conditions between different cameras. However, the large intra-class appearance variations often make the computation of these features impossible under realistic conditions. To overcome this limitation, discriminative methods exploit the class information to find a more distinctive representation, e.g. [13]. However, such methods tend to suffer from overfitting. Moreover, they are often based on local image descriptors, which might be a severe disadvantage: the dependence on salient local attributes makes it impossible to re-identify a person in two images if the salient attribute is not visible in both captures.
The tendency in feature design is to learn a lower number of high-level representations to describe a person. Recently, some works in the literature have also applied neural models to address the Re-Id problem. This has been inspired by the demonstrated capacity of deep learning techniques to automatically find salient high-level features from the pixels of an image.
Deep learning has been proven to provide successful results in many fields of application, such as image recognition or classification, e.g. [30, 36, 75], objects and people detection [8, 12, 54, 72, 77], face identification [79], transportation [4, 57], data anomaly detection [58] and structures analysis [63, 64, 66, 82]. Nevertheless, its potential to learn a unique appearance model, able to represent any anonymous individual in a scene, has not been sufficiently exploited yet.
The new stream of methods based on the learning of Deep Convolutional Neural Networks (DCNNs) for Re-Id, e.g. [3, 11, 17, 46], follows a supervised learning framework and employs a contrastive model to perform a pair-wise binary classification to discriminate between matched and mismatched pairs of images.
The first two works in Re-Id to use deep learning, [40, 87], employed a Siamese neural network. Siamese Networks, earlier used to verify signatures [9], consist of two DCNNs sharing parameters and joined in the last layer, where the loss function performs a pair-wise verification. Each network computes the feature representation for one of the images of a pair. The objective is to make the distances between the features of matched pairs smaller than that for mismatched ones, achieving a binary classification in the distance space.
Traditionally, Siamese networks have used the Contrastive Loss function, proposed in [29]. This function forces the distance between matched pairs of images to be lower than a set margin, and higher for mismatched pairs. Later, an improved version which uses two separated margins was presented in [21] to learn more discriminative features.
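To make the idea concrete, the following NumPy sketch implements a contrastive loss in its widely used margin-based form, which is assumed here rather than taken from [29] verbatim: matched pairs are penalised by their squared distance, and mismatched pairs only while they lie inside the margin. The function name, the default margin and the toy distances are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(distances, is_match, margin=1.0):
    """Mean contrastive loss over a batch of pair distances (assumed form)."""
    distances = np.asarray(distances, dtype=float)
    is_match = np.asarray(is_match, dtype=bool)
    # Matched pairs are pulled together: penalise their squared distance.
    pull = distances ** 2
    # Mismatched pairs are pushed apart only while closer than the margin.
    push = np.maximum(0.0, margin - distances) ** 2
    return float(np.mean(np.where(is_match, pull, push)))

# A matched pair at distance 0 and a mismatched pair beyond the margin
# both contribute nothing to the loss.
print(contrastive_loss([0.0, 2.0], [True, False]))
```

The two-margin variant of [21] would replace the single `margin` with separate thresholds for matched and mismatched pairs; it is not sketched here.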
Another variation of the Siamese network is the one known as Triplet model. The basics of the Triplet loss function are presented in [71], where face recognition is addressed. In the Triplet model, each input is a set of three samples. Two of them are rendering the same person and the third one, a different identity. Therefore, this model allows the comparison between a matched and a mismatched pair so the objective function can maximise the relative distance between them.
Some works, e.g. [17, 81], extend the triplet model to the Re-Id problem with an efficient learning algorithm and a triplet generation scheme. In recent years, researchers have shown that hard-triplet mining is essential for the success of the triplet loss. To that end, different upgraded formulations of the Triplet loss can be found in the recent literature, like the Triplet Focal Loss [90], the Attribute-aware Identity-hard triplet Loss [10] and the Compact Triplet Loss [73].
Many works build each branch of the triplet model with neural architectures initially meant for different purposes, like the VGG network, introduced in [75]. For instance, Zhuang et al. [99] chose the VGG16 model to solve Face Re-Id with a triplet model, and Liu et al. [45] chose it to address Person Re-Id.
The recognition of a person utilizing an appearance neural model presents an intrinsically unbalanced nature, given the lack of data about the people to identify and the huge number of possible false assignments with surrounding agents. This results in the overfitting and collapse of the neural models.
On the other hand, instead of designing high-dimensional feature representations to capture all relevant information, another widely extended Re-Id stream is composed of methods to optimally combine and quantify visual features. The learning of a metric with that purpose can boost Re-Id performance using quite simple features. In that way, an appropriate distance metric can contribute to reducing the model overfitting, since it allows the use of simpler models with fewer parameters to embed the person features.
Some approaches perform feature selection by evaluating the discriminative importance of different types of features to properly weight them, [47]. Examples of this type of methods are AdaBoost [27], RankSVM [62], or discriminative classification methods that properly weight previously extracted features, such as LDA, [70], and LDML, [28]. Moreover, in [47], a PSFI based method is proposed to adaptively weight features according to different clusters of population.
Once the extracted features have been properly weighted to generate a person descriptor, pairs of these descriptors are compared by a distance metric to measure the similarity between the images that they represent. In [33], Hirzer et al. used the Euclidean distance, also referred to as L2-norm in [34]. However, in [51], Ma et al. discussed the idea that different descriptors need different distance metrics to properly re-identify people. They claimed that, when a non-Euclidean distance is used, even less distinctive features, which do not need to capture the visual invariances between the cameras, are sufficient for getting considerable results.
There is a vast variety of supervised methods encompassed under the paradigm of Metric Learning algorithms. Metric learning approaches use training data to learn effective distance metrics, searching for strategies that combine given features maximising interclass variation whilst minimising intra-class variation.
Some widely used metric learners are Probabilistic Relative Distance Comparison (PRDC) [94], Large Margin Nearest Neighbor (LMNN) [84], Information-Theoretic Metric Learning (ITML) [15], Logistic Discriminant Metric Learning (LDML) [28], and Pairwise Constrained Component Analysis (PCCA) [53]. These algorithms are based on complex optimisation schemes, with high computational cost and memory requirements, making them unfeasible in practice.
There is another branch of research composed of methods which seek to embed a camera-to-camera transition function in feature space, using labelled samples. To account for the appearance changes between non-overlapping cameras, some early methods modelled the transfer of colours associated with the two specific cameras, e.g. [43], which can be understood as inter-camera colour calibration. Following the work of Porikli [61], designed for overlapping cameras, some early studies proposed different ways to estimate a Brightness Transfer Function (BTF) for modelling the appearance transfer between two cameras. However, this approach has some limitations. First, it assumes that a perfect foreground-background segmentation is available both for training and when the re-identification system makes predictions in real time; second, it is not always sufficient for modelling all the variability in the possible appearance changes.
On the other side, the Implicit Camera Transfer (ICT) method [5] introduces a novel way of modelling the camera-dependent transfer. The camera transfer is modelled by a binary relation that provides a generalised representation of the variabilities of the changes in appearance, without relying on high-level features or previous background subtraction. Instead of that, the figure-ground separation is performed implicitly by automatic feature selection.
The embedding of cross-view transforms can also be used to deal with the misaligned features problem. Indeed, Li and Wang [38] learnt a mixture of cross-view transformations, and features were projected into a common space that allowed their alignment.
Moreover, the Mahalanobis matrix can be employed to code the view-to-view transitions, so that the cross-view variations are reduced. For example, in [69] a generative standard Mahalanobis distance learning is presented. Mahalanobis learning exploits the structure of the data under the assumption that the classes present the same distribution. In contrast, RankSVM is used to learn an independent weight for each feature element, and PRDC outputs an orthogonal matrix that encodes the global importance of each feature. Instead of that, Mahalanobis metric learning optimises a full matrix that relates the features computed from one of the view sets with those calculated from the other view set of images, to code the view-to-view transition.
The approach presented in [69], also proposed in [70, 28], optimises a Mahalanobis matrix after the person features have been computed for a dataset, treating the features extraction and the distance learning as two independent stages.
Instead of that, the works presented in [22, 23] proposed the use of Deep Re-Id neural models and the integration of a process to estimate the Mahalanobis matrix into the features learning process. In that way, the matrix estimation affects the features learning evolution and improves it simultaneously. Those methods employed a discriminative data analysis process to calculate the Mahalanobis matrix, which was optimised by the adaptive estimation of the covariance matrices of two features spaces, the similar and dissimilar ones. Nevertheless, those methods require the analysis of the data structure through several learning iterations to observe the effects of the Mahalanobis matrix on the learnt features, and, consequently, correct the values of the elements of that matrix.
The existing methods to estimate the Mahalanobis matrix for Re-Id do it after the features extraction or during its training in a parallel (but not the same) estimation process. However, the learning of the Mahalanobis matrix elements as an extra set of neural parameters has not been explored yet.
In contrast to the works in [22, 23], this article presents a Mahalanobis Learning method that directly updates the elements of the matrix according to the objective to achieve, which is defined by the loss function. The novelty resides in the fact that the matrix estimation is not only integrated into the learning of the features, but is learnt simultaneously with them. The elements of the matrix are considered as an additional set of neural parameters and are consequently backpropagated and updated by a novel learning algorithm.
Re-Identification model
This article addresses the Single-Shot Re-Identification task, assuming that person images have been previously detected in two different camera views. Therefore, the objective is to identify the person represented by an image from one view, called probe image, among a set of images captured from the other view, denoted as the gallery of images. Hence, the comparison of the two images must provide a prediction of whether they belong to the same identity or not. A pair of images of the same person is hereinafter called a matched or positive pair. Analogously, a pair of images of different people is denoted as a mismatched or negative pair.
Due to the contrastive essence of the pair-wise approach, the proposed Re-Id appearance model measures a comparative distance between two images. Instead of computing the distances directly over the raw images, these are calculated from representative features. Therefore, it is necessary to learn a certain distance metric, dm, and an embedding F(I) to map an image I to a feature space, such that the distances between samples rendering the same person are smaller than those between different people in that feature space.
In general, the Re-Id algorithm consists of the computation of features for a pair of person images and its comparison through a connection function, a.k.a. distance metric, dm, as Fig. 2 shows. The feature embedding has been modelled by a DCNN, and the Mahalanobis distance has been used as the distance metric.
Fig. 2. Architecture of the Siamese Re-Id model.
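The probe-versus-gallery comparison described above can be sketched as a simple ranking routine. This is a minimal illustration under stated assumptions: the features are taken as already computed by the embedding F(I), the helper names `rank_gallery` and `squared_euclidean` are hypothetical, and the squared Euclidean distance stands in for any connection function dm.

```python
import numpy as np

def squared_euclidean(x, y):
    """Squared Euclidean distance between two feature vectors."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ d)

def rank_gallery(probe_feat, gallery_feats, dist_fn):
    """Sort gallery indices from most to least similar to the probe.

    Returns the sorted indices and the raw distances, so that the
    top-ranked gallery image is the predicted match for the probe.
    """
    dists = np.array([dist_fn(probe_feat, g) for g in gallery_feats])
    return np.argsort(dists, kind="stable"), dists

# Toy example with 2-D features: gallery item 1 is closest to the probe.
order, dists = rank_gallery([1.0, 0.0],
                            [[0.0, 1.0], [1.0, 0.1], [3.0, 3.0]],
                            squared_euclidean)
print(order.tolist())
```

Passing a different `dist_fn`, for example one based on a learnt matrix, changes the ranking without touching the feature extraction stage.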
Deep learning has been used to automatically find the most salient features of the individuals’ appearance. Consequently, the transformation of every image, I, to its corresponding representation in the feature space,
The architecture of the used DCNN is an adapted version of the well-known and widely used VGG architecture. Concretely, an 11-layered network, hereafter called VGG11, presented as the A version of a set of Very Deep CNN in [74], has been implemented. The layers specifications for the proposed VGG11-based embedding are listed in Table 1.
Table 1. Structure of the used VGG11-based model. The input and output sizes are described in #rows × #cols × #filters; the kernel, in #rows × #cols × #filters, stride, or #outputs for FC layers.
VGG11 presents eight convolutional layers, three fully connected layers and a SoftMax final layer. The SoftMax layer has been removed to get a feature array as output instead of a classification probability value. Hence, its output is a point in the feature space represented by a 1000-dimensional array.
In previous experiments, two different VGG architectures were evaluated, with sixteen and eleven layers. The effects of shortening the network architecture in the Single-Shot Re-identification domain were analysed, concluding that, given the lack of training data, the simpler the model, the less prone it is to overfitting. Therefore, the VGG11 network has been selected to train the person features.
Person Re-Identification requires a distance metric to compare two images and decide whether they belong, or not, to the same person. This involves a great challenge, due to the presence of people with a similar appearance. For that reason, a metric to properly combine discriminative features is needed. However, the variations of illumination, perspective, background, resolution and scale between two images of the same person, which were captured from different views, make his or her appearance vary, hampering the re-identification. This article proposes coding the view-to-view transition in a matrix, so that the learnt features focus on rendering the dissimilarity mainly due to appearance changes instead of the view changes.
To achieve that purpose, the Mahalanobis distance has been taken as the distance metric. Equation (1) defines the Mahalanobis distance,
The formulation of the Mahalanobis distance exploits the structure of the data. Therefore, using the Mahalanobis distance as connection distance, and forcing this to be small for positive pairs and large for negative pairs, the Mahalanobis Matrix can be trained. The resulting Mahalanobis matrix embeds the view-to-view transitions that relate the characteristics of the images from each camera.
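As a point of reference, the Mahalanobis distance between two feature vectors x_i and x_j is usually written as (x_i - x_j)^T M (x_i - x_j), possibly under a square root. The sketch below uses the squared form, which is an assumption: the exact form of Eq. (1) cannot be recovered from the text. The function name and the example matrices are illustrative.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, M):
    """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    M = np.asarray(M, dtype=float)
    return float(diff @ M @ diff)

x_a = [1.0, 2.0]   # hypothetical feature from camera a
x_b = [0.0, 0.0]   # hypothetical feature from camera b

# With the identity matrix the metric reduces to the squared Euclidean distance.
print(mahalanobis_sq(x_a, x_b, np.eye(2)))
# A non-identity matrix re-weights (and, if non-diagonal, couples) the feature elements.
print(mahalanobis_sq(x_a, x_b, np.diag([2.0, 0.5])))
```

A full (non-diagonal) M relates every element of the difference vector with every other one, which is what allows the matrix to absorb systematic view-to-view variations.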
Theoretically,
The main contribution of this article is the design of a unique supervised learning process to train the Re-Id model, presented in the previous section, where the DCNN and the distance metric are jointly learnt. With that purpose, the triplet approach has been adopted to learn the model. Figure 3 shows the architecture of the proposed learning approach, where the parameters of the feature embedding,
Fig. 3. Architecture of the proposed learning model.
Every
The objective is to learn a transformation from the image to the feature space such that the representations of the same person lie close together and far away from the representations of different people. This constraint is imposed by the Triplet objective formulation, presented in [71], which establishes a relative distance relationship. For a certain triplet of images, the triplet loss requires the squared Euclidean distance for the negative pair,
These equations ensure that the anchor sample is closer to all samples of the same person, in the feature space, than it is to any sample of other people since
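For reference, the triplet constraint and loss of [71] are commonly written as shown next. The notation is an assumption chosen to match the embedding F(I) used in this article: I_i^a, I_i^p and I_i^n denote the anchor, positive and negative images of the i-th triplet, \alpha is the margin, and [\cdot]_+ stands for \max(0, \cdot). The symbols used in the original equations may differ.

```latex
\left\| F(I_i^{a}) - F(I_i^{p}) \right\|_2^2 + \alpha
  < \left\| F(I_i^{a}) - F(I_i^{n}) \right\|_2^2 ,
\qquad
\mathcal{L} = \sum_{i=1}^{N} \left[
  \left\| F(I_i^{a}) - F(I_i^{p}) \right\|_2^2
  - \left\| F(I_i^{a}) - F(I_i^{n}) \right\|_2^2
  + \alpha \right]_+
```

The hinge makes triplets that already satisfy the constraint contribute zero loss, so only violating triplets drive the parameter updates.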
By contrast, the proposed method treats all the possible positive pairs as a set rendering the condition of similarity. Analogously, negative pairs represent the dissimilarity situation. The discrimination between similarity and dissimilarity is learnt by comparing every positive pair with all the possible negative pairs. In that way, the network is trained to identify a person among a huge number of negative samples. Instead of using an online triplets generation mechanism, like in [71], all the possible triplet combinations from the available data have been previously determined, using the Triplet Permutation tool formulated in [24].
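The offline enumeration of triplets can be illustrated with a short sketch. It is not the Triplet Permutation tool of [24]; it is a hypothetical helper that assumes the anchor is always taken from one view (cam a) and the positive and negative samples from the other view (cam b), with identity labels given as plain lists.

```python
def enumerate_triplets(ids_cam_a, ids_cam_b):
    """List all (anchor, positive, negative) index triplets across two views.

    ids_cam_a[i] and ids_cam_b[j] are the identity labels of image i in
    view a and image j in view b. Anchors index view a; positives and
    negatives index view b.
    """
    triplets = []
    for a, id_a in enumerate(ids_cam_a):
        positives = [j for j, id_b in enumerate(ids_cam_b) if id_b == id_a]
        negatives = [j for j, id_b in enumerate(ids_cam_b) if id_b != id_a]
        for p in positives:
            for n in negatives:
                triplets.append((a, p, n))
    return triplets

# Three identities, one image per identity and per view (Single-Shot setting):
# each anchor has 1 positive and 2 negatives, giving 3 * 1 * 2 triplets.
triplets = enumerate_triplets([0, 1, 2], [0, 1, 2])
print(len(triplets))
```

In a Single-Shot setting with P identities, this yields P × (P − 1) triplets, which grows quadratically and motivates processing them in mini-batches.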
This approach of global comparison calls for the use of a Batch learning algorithm. However, the huge amount of possible triplet combinations composing the training set, and the limitations in processing memory resources have led to implementing a Mini-Batch Triplet loss function.
The Triplet loss function has been reformulated according to these three specifications: first, the loss value has to be computed from a mini-batch of training triplets. Therefore, the loss function to minimise,
Second, the connection functions,
A Triplet-based Mini-Batch Gradient Descent learning algorithm has been designed and implemented to learn both the deep feature model and the Mahalanobis matrix simultaneously. The main procedures of this optimisation algorithm are presented in Algorithm 1.
A feature descriptor is computed for every input image by the forward propagation of the DCNN, and subsequently, they are contrasted by the Triplet cost function,
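A forward pass of a mini-batch triplet cost with a Mahalanobis connection function might look like the sketch below. Several details are assumptions, since the article's equations are not reproduced here: the loss is taken as the mean hinge over the mini-batch, the distance is the squared Mahalanobis form, and the function name, the margin and the array layout are illustrative.

```python
import numpy as np

def triplet_mahalanobis_loss(fa, fp, fn, M, margin=1.0):
    """Mean hinge triplet loss over a mini-batch with a Mahalanobis metric.

    fa, fp, fn: (N, D) arrays of anchor, positive and negative features.
    M: (D, D) Mahalanobis matrix.
    Returns the scalar loss and the per-triplet positive/negative distances.
    """
    u = fa - fp                                  # positive-pair differences
    v = fa - fn                                  # negative-pair differences
    d_pos = np.einsum("nd,de,ne->n", u, M, u)    # u_n^T M u_n for every n
    d_neg = np.einsum("nd,de,ne->n", v, M, v)    # v_n^T M v_n for every n
    hinge = np.maximum(0.0, d_pos - d_neg + margin)
    return float(hinge.mean()), d_pos, d_neg

fa = np.zeros((2, 2))
fp = np.array([[1.0, 0.0], [0.0, 0.0]])
fn = np.array([[0.0, 2.0], [0.0, 1.0]])
loss, d_pos, d_neg = triplet_mahalanobis_loss(fa, fp, fn, np.eye(2), margin=2.0)
print(loss, d_pos, d_neg)
```

The `einsum` call evaluates all quadratic forms of the batch at once, which is the kind of vectorisation a loss layer needs to stay efficient on large mini-batches.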
Then, the equations to perform its backpropagation have been formulated. Backpropagation is performed by calculating the derivatives of the loss function with respect to each parameter to learn (set
As explained above, the distance metric computation has been integrated into the triplet loss function. In that way, the loss function can be directly differentiated with respect to the features,
The back-propagation of the DCNN, according to each one of the three obtained descriptors, is performed to obtain
Moreover, the Mahalanobis matrix,
Once
A novel layer (footnote 2) has been implemented and delivered to compute the forward and backward propagation of the triplet loss function with the Mahalanobis distance. The inputs of this layer are the feature arrays,
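The backward pass of such a layer can be sketched by differentiating a hinge triplet loss with respect to the three feature arrays and the matrix. This is not the delivered Caffe-python layer: it assumes the mean hinge loss over the mini-batch with squared Mahalanobis distances, and all names are hypothetical. It relies on the identities d(u^T M u)/du = (M + M^T) u and d(u^T M u)/dM = u u^T.

```python
import numpy as np

def triplet_mahalanobis_backward(fa, fp, fn, M, margin=1.0):
    """Gradients of the mean hinge triplet loss w.r.t. features and M.

    fa, fp, fn: (N, D) anchor, positive and negative features.
    M: (D, D) Mahalanobis matrix, treated as a learnable parameter.
    """
    n_trip = fa.shape[0]
    u, v = fa - fp, fa - fn
    d_pos = np.einsum("nd,de,ne->n", u, M, u)
    d_neg = np.einsum("nd,de,ne->n", v, M, v)
    # Only triplets with a positive hinge contribute to the gradients.
    active = (d_pos - d_neg + margin > 0).astype(float)[:, None]
    S = M + M.T                           # symmetric, so u @ S has rows S u_n
    g_pos = active * (u @ S) / n_trip     # d loss / d u_n
    g_neg = active * (v @ S) / n_trip     # -(d loss / d v_n)
    grad_a = g_pos - g_neg                # anchor appears in both u and v
    grad_p = -g_pos                       # u = a - p
    grad_n = g_neg                        # v = a - n, with a minus sign in the loss
    # Sum of u_n u_n^T minus v_n v_n^T over the active triplets.
    grad_M = ((active * u).T @ u - (active * v).T @ v) / n_trip
    return grad_a, grad_p, grad_n, grad_M
```

In a gradient descent step, `grad_M` would update the matrix elements exactly like any other weight, while `grad_a`, `grad_p` and `grad_n` would be propagated back through the three DCNN branches.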
The learnt Mahalanobis matrix,
However, a reliable estimation for the Mahalanobis matrix is not achieved until executing a certain number of learning iterations in the features training process. This is because the matrix elements are simultaneously learnt in the training process. For that reason, the distance metric, dm, takes two different formulations along the learning process, as Eq. (15) defines, so that the Euclidean distance is employed until the number of learning iterations,
At the beginning of the learning process, the Euclidean distance, which does not rely on any parameter, is used as the connection function. Meanwhile, the Mahalanobis distance learning is simultaneously conducted from the start. The learning of the Mahalanobis matrix depends on the person feature weights, which are learnt at the same time. Because of that, the iteration number that was chosen as the threshold,
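The two-stage connection function can be sketched as a simple switch on the iteration counter. The names `connection_distance` and `switch_iter` are hypothetical, the squared form of both distances is assumed, and whether the comparison with the threshold in Eq. (15) is strict is not recoverable from the text, so the choice below is also an assumption.

```python
import numpy as np

def connection_distance(diff, M, iteration, switch_iter):
    """Squared distance for a feature-difference vector.

    Uses the Euclidean distance while iteration < switch_iter and the
    Mahalanobis distance with matrix M from switch_iter onwards.
    """
    diff = np.asarray(diff, dtype=float)
    if iteration < switch_iter:
        return float(diff @ diff)
    return float(diff @ np.asarray(M, dtype=float) @ diff)

M = np.diag([2.0, 0.5])            # hypothetical learnt matrix
diff = [1.0, 2.0]                  # hypothetical feature difference
print(connection_distance(diff, M, iteration=0, switch_iter=10))
print(connection_distance(diff, M, iteration=10, switch_iter=10))
```

Keeping the matrix out of the distance during the first iterations prevents a still-unreliable M from distorting the early feature updates.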
The designed learning algorithm has been used in a training process whose setting is explained below.
The Mahalanobis distance relates each element of the features difference vector array with each other through the Mahalanobis matrix elements. For that reason, the anchor image has to be always taken from the same camera view, and the positive and negative samples from the other one, since according to the triplet model, these two types of vectors of features differences are employed,
To alleviate the Re-Id data problem, knowledge previously acquired in the Multi-Object Tracking (MOT) domain has been transferred to the Re-Id model, by initialising its weights,
The intuition behind this approach is that the most representative features of a person are automatically learnt on a MOT dataset. From the set of learnt descriptors, the low-level ones, which are learnt in the earlier layers of the network model, are kept. Then, the most high-level representations, coded in the deeper layers, are fine-tuned on a Re-Id target dataset to make them invariant and discriminative.
Moreover, a triplet selection mechanism has been designed to accelerate the training process, as it is explained below.
Triplets selection
An online triplet selection stage has been formulated to speed up the learning. Faster convergence can be obtained by selecting triplets that violate the triplet constraint in Eq. (2). Hence, this selection step acts as a filter that makes the loss function consider only the samples relevant for the training and ignore those that are easy to classify. In that way, the back-propagation is ruled by the hardest samples, which produce larger increments (or decrements) in the weight updates, accelerating the training process.
Triplet selection was first proposed by Schroff et al. [71]. They addressed face recognition by bringing the samples of a given identity closer together in the feature space. They sought an identity clustering and trained their network on multiple instances of every individual. In that framework, triplet selection is performed through the online generation of the triplets: given a certain anchor image, hard-positive and hard-negative images are selected. That is, for a given anchor, the samples that generate the hardest triplets are selected.
The intrinsic constraints of the Single-Shot Re-Id challenge do not allow the use of an anchor-based approach, either in the triplet generation or in the triplet selection. Instead, a wide variety of samples has been generated through an offline combination process, and a Mini-Batch Triplet loss function and learning algorithm have been proposed.
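The filtering step can be sketched as follows. The sketch assumes that the triplet constraint has the usual form, positive pair distance plus a margin smaller than the negative pair distance; the function names, the `margin` parameter, and the averaging over the selected triplets are illustrative assumptions, not the authors' TripletSelection layer.

```python
def select_violating_triplets(d_pos, d_neg, margin):
    """Return the indices of triplets that violate d_pos + margin < d_neg.

    d_pos[i] and d_neg[i] are the positive and negative pair distances of
    triplet i; `margin` is an assumed hyper-parameter.
    """
    return [i for i, (dp, dn) in enumerate(zip(d_pos, d_neg))
            if dp + margin >= dn]


def selected_triplet_loss(d_pos, d_neg, margin):
    """Mean hinge loss computed over the selected (violating) triplets only."""
    selected = select_violating_triplets(d_pos, d_neg, margin)
    if not selected:
        return 0.0  # every triplet already satisfies the constraint
    return sum(d_pos[i] + margin - d_neg[i] for i in selected) / len(selected)
```

Triplets that already satisfy the constraint contribute a zero hinge loss and therefore a zero gradient, so dropping them leaves only the samples that actually drive the weight updates.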
Therefore, a new neural layer has been designed to perform online triplet selection (see footnote 4) under the constraints of the Re-Id task, to be integrated into the proposed learning algorithm (Algorithm 1), and it has been placed before the loss function layer. At every iteration
Triplets selection according to the value of the positive and the negative metric distance.
Triplets presenting the lowest values for
According to Eq. (16),
The proposed Re-Identification model has been trained using the designed learning algorithm over a target Re-Id dataset. Its Re-Id capacity has been evaluated following the methodology explained in the next subsection. Finally, a comparative analysis of the obtained experimental results has been performed.
Evaluation methodology
The model has been trained and tested over a benchmark Single-Shot Re-Id dataset, PRID2011 [32].
A decisive factor for choosing PRID2011 as the only evaluation dataset was its particular and uncommon acquisition setting. In most of the Single-Shot datasets, such as VIPeR [27], GRID [49] or CUHK [39], the images were captured from many different views, even inside the same set (gallery or probe set). However, in PRID2011, all the probe images were captured from the same camera view, and all the gallery samples were acquired from a second, different camera view. PRID2011 is composed of two sets of person images, captured from two fixed static surveillance cameras, placed outdoors, with notable differences in camera parameters, illumination, poses, and background. This setting allows the effectiveness of the proposed Mahalanobis distance learning method in embedding the view-to-view variations to be demonstrated.
In the single-shot version, used in this work, camera view A contains 385 different images, and camera view B, 749. Of these, 200 individuals appear in both sets, and the rest are distractor samples that do not form matched pairs. 100 of the 200 matched pairs were randomly extracted to be used as training and validation samples. The test set has been formed by following the procedure described in [31], i.e., the images of view A for the 100 remaining individuals have been used as the probe set, and the gallery has been formed by 649 images belonging to camera view B (all images of view B except the 100 images corresponding to the training individuals). The resolution of the images is 64
The performance of the learnt models has been evaluated by computing their Cumulative Matching Characteristic (CMC) curve [55], which is a standard Re-Id performance measurement. To obtain the CMC curve, every image from the probe set is first paired with every image from the gallery set, and the distance metric, dm, (squared Mahalanobis distance) between them is computed. The match presenting the lowest value of dm is considered the top match, since two images belonging to the same person should lie close together in the feature space and further from the representations of different people. The distances obtained by comparing a probe image with all the gallery images are ranked. This process is repeated for each of the probe images. The rank value, i.e. the position of the correct match in the ranking, is calculated for each probe image and, subsequently, the percentage of probes in which each rank appears. Then the CMC curve renders the expectation of finding the correct match within the top
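The ranking procedure described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the function name, the argument layout (a probe-by-gallery distance matrix plus identity labels), and the `max_rank` cut-off are hypothetical, and every probe is assumed to have exactly one correct match in the gallery, as in the single-shot setting.

```python
def cmc_curve(dist, probe_ids, gallery_ids, max_rank):
    """Cumulative Matching Characteristic from a probe-by-gallery distance matrix.

    dist[i][j] is the distance between probe i and gallery image j.
    Entry k-1 of the result is the percentage of probes whose correct
    match appears within the top k ranked gallery images.
    """
    hits = [0] * max_rank
    for i, row in enumerate(dist):
        # Rank the gallery images by increasing distance to the probe.
        order = sorted(range(len(row)), key=lambda j: row[j])
        ranked_ids = [gallery_ids[j] for j in order]
        # 0-based position of the correct match (assumes it exists in the gallery).
        rank = ranked_ids.index(probe_ids[i])
        # A match found at position `rank` counts for every larger cut-off too.
        for k in range(rank, max_rank):
            hits[k] += 1
    return [100.0 * h / len(dist) for h in hits]
```

For example, with three probes whose correct matches are ranked first, second and third respectively, the curve evaluated up to rank 3 rises from one third of the probes to all of them.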
Distance metrics comparison
The ranking capacity of the model obtained with the proposed Mahalanobis distance learning algorithm has been evaluated through an experiment called Exp.MahaLearning. To quantify the improvement of the proposed method over the use of the Euclidean distance as the Re-Id distance metric, dm, the obtained CMC scores have been compared with those of a second experiment, named Exp.Euclidean. In this second experiment, the triplet Re-Id model is used with the Euclidean distance, as described in [25].
In both experiments, Exp.Euclidean and Exp.MahaLearning, the same transfer learning method has been applied. The results are given in Table 2 and their corresponding curves are rendered in Fig. 5 for visual comparison.
The scores are generally better when the Mahalanobis distance is used, especially at the first ranks, which are the most critical ones for the Re-Id task. An increase in performance at the first ranks allows a reduction in the number of candidates provided by a Re-Id system, within which the sought identity must be found. The reason is that the Mahalanobis matrix encompasses the visual camera-to-camera transitions, related to changes in illumination, resolution, and point of view. This reduces the effect of the intra-class variation and makes the Re-Id task easier.
CMC scores (in [%]) for models using Euclidean and Mahalanobis distance metric on PRID2011 dataset
CMC curves of models using Euclidean and Mahalanobis distance metric on PRID2011 dataset.
Although the Mahalanobis distance has already been employed in previous works, the main novelty of the proposed method is the learning of the Mahalanobis matrix elements as neural weights through back-propagation. In previous works, estimation methods were used. In [25], the estimation of the Mahalanobis matrix is even integrated into a neural model to learn discriminative features. To provide a fair comparison, this method has been integrated into the proposed neural architecture, and this experiment has been named Exp.MahaEstimation. Therefore, Exp.MahaEstimation and Exp.MahaLearning share the same neural architecture. The only difference between them is the method used to obtain the Mahalanobis matrix: a discriminative estimation process in the first one, and the proposed method in the second one.
The learning curves for both experiments are shown in Fig. 6. The loss value is measured by the triplet loss function, and the accuracy metric renders the percentage of well-classified triplets over the total, considering as well-classified those triplets where the negative pair distance,
Comparison of the learning process evolution using different methods to calculate the Mahalanobis matrix, over PRID2011.
In both experiments, the method explained in Section 4.3 for integrating the Mahalanobis distance computation into the feature learning process has been implemented. The effect of the change in the distance formulation at iteration
Even though the validation loss increases from this iteration onwards, the Re-Id accuracy also increases, so a higher number of samples are well-classified by the Mahalanobis distance, although the cost value of the misclassified ones is higher. In Exp.MahaEstimation, the use of the Mahalanobis distance produces larger oscillations in the loss and accuracy values. This is due to the adaptation of the Mahalanobis matrix elements through its estimation process.
With the proposed method (Exp.MahaLearning), the Mahalanobis matrix is learnt as the set of parameters of a neural layer. This approach reduces the amplitude of the oscillations: a smoother, continuously decreasing trend is observed for the loss value, and an increasing trend for the accuracy value. This behaviour indicates that the Mahalanobis distance is being learnt properly. Moreover, the gap between both losses (training and validation) is continuously reduced, which means that the algorithm is able to give a more generalised solution for unknown samples. This learning improvement results in an enhancement of the Re-Id capacity provided by the model trained with the proposed method, as Table 3 demonstrates.
Besides, as explained above, the proposed Mahalanobis distance learning has been accelerated by the triplet selection mechanism, without affecting the Re-Id performance of the learnt model. Figure 7 shows the learning curves of a training process with and without the triplet selection layer.
CMC scores (in [%]) for models using estimation and learning of the Mahalanobis matrix on PRID2011 dataset
Comparison of the learning curves with (a) and without (b) the triplets selection layer.
To provide a fair comparison, the selection layer is not applied to the validation samples. When the selection layer is used to train the model, the validation loss decreases from 0.8 to 0.4 in 25,000 iterations. A million iterations are needed to produce that effect when triplet selection is not used.
Moreover, the ranking power of the obtained model over PRID2011 dataset has been analysed through the image presented in Fig. 8. This figure shows the top 20 gallery images taken as most similar for some probe images (first column). The correct match is bounded by a yellow box. Examples of correct matches found at several ranks are given.
Top 15 ranking provided by the proposed Re-Id model over some samples of PRID2011 dataset.
Although the model sometimes fails to find the correct match in the first rank, it is able to rank the images according to their visual appearance similarity. The model assigns the smallest distances, dm, to the images of people most similar to the query probe sample, much as a human would. For a given probe image, people wearing similar clothes, accessories or bags, or the same colours, are ranked in the top positions, regardless of pose.
A new generation of Re-Identification methods, boosted by the development of MTMCT algorithms, is achieving superior results over multi-shot datasets, such as Market-1501 [40], MARS [97], CUHK03 [98], DukeMTMC-ReID [67] and DukeMTMC-VID [85]. Si et al. [73] achieved a matching score of 94.67% on the first rank over the Market-1501 dataset, and 69.6% over the CUHK03 dataset. Moreover, Zhang et al. [90] obtained 85.55% over DukeMTMC-ReID. Besides, Chen et al. [10] achieved scores of 88.2% in the first rank over the MARS dataset, and 95.4% over DukeMTMC-VID.
Nevertheless, the goal of this article is to refocus the Single-Shot metric learning approach, through a novel method for back-propagating the Mahalanobis distance embedding as a set of neural parameters. This method is a potential tool for learning contrastive neural models, keeping the Single-Shot approach of the offline metric learning methods. For that reason, the suitability of the proposed method for that purpose is demonstrated by comparing its performance with an extensive list of well-known metric learners. Their CMC scores are shown in Table 4.
Comparison of CMC rates (in [%]) of Re-Id methods on PRID2011 dataset, ‘–’ indicates no result was reported
The listed methods are based on the design of hand-crafted features and metric distance learning. Some approaches are focused on finding the proper combination of the features to represent a person image, like Ranking Support Vector Machines (RankSVM) [62]. Other works apply general metric learners to Person Re-Identification, such as Probabilistic Relative Distance Comparison (PRDC) [95], Large Margin Nearest Neighbor (LMNN) [83], Information-Theoretic Metric Learning (ITML) [15], Logistic Discriminant Metric Learning (LDML) [28] and Linear Discriminant Analysis (LDA) [20]. Some of these methods have been adapted to the Re-Id task, like Large Margin Nearest Neighbor with Rejection (LMNN-R) [16]. Moreover, in [47], a method based on Prototype-Sensitive Feature Importance (PSFI) is proposed to adaptively weight features according to different groups of the population and combined with two previously cited methods (PSFI
In general, the proposed deep Re-Id model outperforms the listed methods. It automatically learns the feature embedding and the Mahalanobis matrix as the set of parameters of a neural layer, using a vectorized implementation of its forward and backward propagation.
The two methods presenting the highest ranking ability, i.e. the proposed one and the one presented by Roth et al. [69], use the Mahalanobis distance as the connection function. Roth et al. estimated the Mahalanobis matrix with a discriminative study of the features, once these had been computed in a separate process. In contrast, the proposed method back-propagates the gradients of the elements of the Mahalanobis matrix to embed the view-to-view transitions and to learn deep features simultaneously.
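The idea of treating the Mahalanobis matrix as a set of trainable parameters can be illustrated with a small sketch. For a squared Mahalanobis distance d = δᵀMδ, the derivative with respect to M is the outer product δδᵀ, so a hinge triplet loss has the gradient δ⁺δ⁺ᵀ − δ⁻δ⁻ᵀ whenever it is active. The code below assumes that hinge form, an unconstrained matrix M, and a plain gradient step; the function names, `margin` and `lr` are hypothetical, and the authors' TripletMahaLoss layer may differ in all of these respects.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]


def quad_form(diff, M):
    """Squared Mahalanobis distance diff^T M diff."""
    n = len(diff)
    return sum(diff[i] * M[i][j] * diff[j] for i in range(n) for j in range(n))


def triplet_loss_and_grad_M(diff_pos, diff_neg, M, margin):
    """Hinge triplet loss and its gradient with respect to the elements of M.

    diff_pos and diff_neg are the feature difference vectors of the
    positive and the negative pair.
    """
    n = len(M)
    loss = margin + quad_form(diff_pos, M) - quad_form(diff_neg, M)
    if loss <= 0.0:
        # Constraint satisfied: no loss and no gradient.
        return 0.0, [[0.0] * n for _ in range(n)]
    gp = outer(diff_pos, diff_pos)
    gn = outer(diff_neg, diff_neg)
    grad = [[gp[i][j] - gn[i][j] for j in range(n)] for i in range(n)]
    return loss, grad


def gradient_step(M, grad, lr):
    """One plain gradient-descent update of the matrix elements."""
    n = len(M)
    return [[M[i][j] - lr * grad[i][j] for j in range(n)] for i in range(n)]
```

Note that an unconstrained update of this kind does not by itself keep M positive semi-definite; how that is handled is outside the scope of this sketch.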
This article formulates a unified neural framework to jointly and automatically find the most salient features and the optimal Mahalanobis metric distance to compare people appearance.
Nevertheless, the application of deep neural models to the Single-Shot Re-Identification task poses a daunting challenge due to the lack of data and its unbalanced nature. The traditional metric learners for Single-Shot Re-Id have been displaced by a new generation of Multi-Shot algorithms that require a multi-camera tracking setting.
Nevertheless, one of the main purposes of this work was to refocus the Single-Shot metric learning approach to bring together its low data requirements and the advantages of the automatic embedding learning provided by neural models. The Single-Shot metric learning approach has been reformulated through the back-propagation of the Mahalanobis distance embedding as a set of neural parameters. In that way, the proposed method enables the integration of deep neural networks into the pair-wise binary classification approach of the early metric learning methods, to perform Single-Shot Re-Identification. This endeavour has been successfully achieved, as the results over a challenging benchmark dataset demonstrate.
Even using the standard VGG architecture, the designed distance learning method yields considerable performance. The conducted experiments have demonstrated the effectiveness of the proposed approach, whose results are comparable to those of the benchmark metric learners that estimate the metric distance from a set of previously computed features.
These results are taken as proof of concept to claim the back-propagation of the Mahalanobis distance as a valid method for learning Re-Id models, and a potential tool for training contrastive models with more sophisticated architectures in their compared branches.
Under this approach, enhancements can be obtained through the use of novel neural architectures for the backbone network in each branch of the proposed contrastive model.
The knowledge generated in the classification domain [2, 60, 65, 68] can be transferred to the design of novel architectures for the feature branches.
Conclusions
Metric learning has been deeply researched and targeted by a wide assortment of methods, but the learning of the parameters of a distance function through neural networks had not been addressed yet. This article presents the formulation of a novel method to backpropagate the gradients of the elements of a Mahalanobis matrix as additional neural parameters of a deep model to learn the Mahalanobis metric distance for Person Re-identification.
In the case of re-identifying from two fixed camera views, the use of the Mahalanobis distance to compare two images implicitly helps to deal with the intra-class variations, since the Mahalanobis matrix embeds the camera-to-camera transitions.
The proposed novel method aims to automatically model the camera-to-camera transition and the proper comparison of features jointly with the feature learning in a single process. To achieve that, a new learning algorithm has been designed following the triplet approach. The triplet loss function has been re-formulated, as well as its derivatives with respect to both the neural weights defining the person features and the Mahalanobis matrix elements, to perform their back-propagation. Besides, a triplets selection layer has been implemented to accelerate the learning process.
The proposed method has been shown to outperform both the use of the Euclidean distance and the estimation of the Mahalanobis matrix in the forward propagation of the model [22, 23]. The previous estimation methods require the analysis of the data structure through several learning iterations to observe the effects of the Mahalanobis matrix on the learnt features and, consequently, correct the values of the elements of that matrix. In contrast, this article presents a Mahalanobis Learning method that directly updates the elements of the matrix according to the objective to achieve, enhancing the features and the distance learning and their global performance.
Through the proposed method, a unified Re-Id neural framework has been obtained. It jointly and automatically finds the most salient features and the optimal Mahalanobis metric distance to perform Single-Shot Person Re-Identification.
Footnotes
Layers to learn Mahalanobis matrix are publicly available under
The triplet loss function with Mahalanobis distance learning has been formulated in a new Caffe-python layer called TripletMahaLoss, which is publicly available under
Triplets Permutation tool is publicly available under
The triplet selection mechanism has been coded in a new Caffe-python layer called TripletSelection, which is publicly available under
Acknowledgments
Research supported by the Spanish Government through the CICYT projects (PID2019-104793RB-C31 and RTI2018-096036-BC21), Universidad Carlos III of Madrid through (PEAVAUTO-CMUC3M) and the Comunidad de Madrid through SEGVAUTO-4.0-CM (P2018/EMT-4362). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.
