Back-propagation of the Mahalanobis distance through a deep triplet learning model for person Re-Identification

Abstract

Automating the Re-Identification of an individual across different video-surveillance cameras is a significant challenge due to the vast number of potential candidates with a similar appearance. This task requires learning discriminative features from person images, together with a distance metric to compare them and decide whether they belong to the same person or not. However, acquiring images of the same person from different, distant and non-overlapping views produces changes in illumination, perspective, background, resolution and scale between the person’s representations, resulting in appearance variations that hamper his/her re-identification. This article focuses the feature learning on automatically finding discriminative descriptors that reflect the dissimilarities mainly due to changes in people's actual appearance, independently of the variations introduced by the acquisition point. For that purpose, such variations are implicitly embedded in the Mahalanobis distance. This article presents a learning algorithm to jointly model the features and the Mahalanobis distance through a Deep Neural Re-Identification model. The Mahalanobis distance learning has been implemented as a novel neural layer, forming part of a Triplet Learning model that has been evaluated on the PRID2011 dataset, providing satisfactory results.

Keywords

Re-Identification, deep learning, metric learning, Mahalanobis distance

1. Introduction

Person Re-Identification (Re-Id) consists of visually recognising an individual through images from non-overlapping and distant camera views. Automating Person Re-Identification is fundamental to extend the monitoring capacity of a large distributed Intelligent Surveillance System (ISS) across multiple camera views at different times and locations.

Typically, in real-world surveillance scenarios, accurate temporal and spatial constraints are difficult to establish, and fine biometric cues cannot be acquired from distant sensors. For that reason, research has mainly focused on Re-Id models that are based on visual data. These models extract features from surveillance images and compute a distance metric (a.k.a. connection function) to compare them.

In the last decade, the Re-Id task has been addressed as a Single-Shot recognition problem, using only one capture of each person from each camera view. In that context, the distance metric is a contrastive measurement that predicts whether two images correspond to the same person or not. Hence, the Re-Id problem is treated as a pairwise binary classification task. Many metric learning algorithms have been developed under this paradigm. These algorithms optimise a distance function to compare features that have previously been extracted by a hand-crafted descriptor.

The emergence of the Multi-Target Multi-Camera Tracking challenge (MTMCT) has brought a new generation of Re-Id algorithms that follow a Multi-Shot recognition approach [10, 37, 41, 73, 90]. These methods acquire temporal consistency from the previous tracking of the individuals. Unlike Single-Shot Re-Id, Multi-Shot Re-Id relies on the availability of track fragments formed by consecutive detections. Since the Single-Shot approach needs only one image per person and view, it reduces the quantity of processed data and speeds up communications in a distributed network of cooperative sensors.

To take advantage of the automatic feature extraction provided by neural models, without losing the benefits of Single-Shot strategy, this article refocuses the Single-Shot metric learning approach through the backwards propagation of the distance embedding as a set of neural parameters.

This approach entails several computer vision challenges. Firstly, inter-class ambiguities are caused by the presence of different people with a similar appearance. For that reason, the learning of discriminative features is fundamental. However, modelling Deep Convolutional Neural Networks (DCNN) for that purpose is a daunting task due to the lack of Re-Id data, since there is no previous knowledge about the queried individual, but only the current captures from two different views. Hence, the number of instances of a query identity is very limited, especially when compared with the vast quantity of potentially available representations of different people. The shortage of data and its underlying unbalanced nature tend to cause the neural models to overfit and collapse.

Secondly, the challenge of Person Re-Id is to match images of the same person that were captured by different non-overlapping camera views against significant and unknown cross-view feature distortion. The differences between the camera characteristics and their perspectives cause large changes in scale, resolution, illumination, background and pose, and consequently the misalignment in the compared human shapes. This results in dramatically different appearances and consequently, different representations of the same person, which are called intra-class variations.

In the case of having a fixed pair of camera views, a view-to-view transformation can be modelled. This transformation encompasses the variation between the cameras’ perspectives and their visual characteristics. Subsequently, this transformation can be considered by the Re-Id model through the proper distance metric. In that way, the extraction of representative person features is reduced to the learning of those descriptors able to compare the actual individual’s appearance, without the perturbation caused by the view-to-view variations.

To code the view-to-view transitions, Roth et al. [69] proposed the use of the Mahalanobis distance as the distance metric. They estimated the Mahalanobis matrix using multiple well-known metric learners on previously computed hand-crafted features. Then, [22] introduced an estimation method that was integrated into the learning of a neural Re-Id model. In that work, the Mahalanobis matrix was updated every certain number of learning iterations, after the features were calculated by forward propagating the neural model. Hence, the estimation of the matrix was not involved in the backpropagation process.

This article proposes not only the use of the Mahalanobis distance to compare person images and its integration in the process to train visual features, but also the learning of the Mahalanobis matrix as an extra set of neural parameters in a triplet model. Therefore, the main contributions of this work are:

i. The design of a deep neural model for person Re-Identification, where discriminative visual features and the Mahalanobis matrix to compare them are simultaneously learnt.
ii. The re-formulation of the triplet loss function, taking the Mahalanobis distance as the connection function between the features to compare.
iii. The formulation of the back-propagation stage of the new loss function.
iv. The design of the learning procedure for training the elements of the Mahalanobis matrix as an extra set of neural parameters.
v. The vectorized implementation of the new loss and connection functions as neural layers, which have been publicly delivered in the format of Caffe-python layers$^{1}$.

Two major challenges are addressed in this work: the learning of person features capable of rendering the most discriminant aspects of an individual’s appearance, to cope with the inter-class ambiguities, and the learning of a proper distance metric that optimally combines the visual features and reflects the camera-to-camera transitions, to face the intra-class variations.

Figure 1.
Examples of matched pairs belonging to PRID2011 dataset.

A Triplet approach has been used to train the proposed Re-Id neural model, which has been evaluated on the PRID2011 [32] dataset, providing successful results. This is one of the most used Single-Shot Re-Id datasets, and it presents an uncommon acquisition setting that fits the requirements of the conducted research. The dataset was captured from two fixed camera views and groups the images from each camera separately, which makes it possible to demonstrate the effectiveness of the proposed Mahalanobis distance learning method in embedding the intra-class variations between two fixed capturing points.

The rest of the article is structured as follows: Section 2 presents a review of the existing related works. Section 3 describes the proposed Re-Id neural model, and Section 4, the developed learning algorithm to train it. Finally, Sections 5 and 6 present the obtained results and some concluding remarks, respectively.

2. Related work

Person Re-Identification (Re-Id) has become one of the most studied tasks in intelligent video surveillance, since many other applications, like tracking [86] or behaviour analysis [1, 50], rely on solving the Re-Id problem [78]. A good Re-Id performance is necessary to extend the functionalities of an Intelligent Surveillance System across several camera views, which is the basis of large-scale distributed ISSs [7, 74].

In the context of the visual appearance Re-Id paradigm, the literature presents two main Re-Id strategies: Multi-Shot, e.g. [42, 76], and Single-Shot recognition [56]. Multi-Shot recognition is performed by matching two tracklets (i.e. short sequences of images), one captured from each camera view. These methods have been boosted by the rise of Multi-Target Multi-Camera Tracking (MTMCT) systems and datasets like DukeMTMC, CUHK03 and Market1501.

Conversely, the Single-Shot approach does not depend on the availability of a sequence of detections of a tracked individual to re-identify him/her. In Single-Shot frameworks, only one image per person and per view is given. Figure 1 shows examples of matched pairs (in each column) from a Single-Shot dataset, specifically PRID2011 [32]. Every pair is formed by images that were captured from two different camera views (cam a and cam b).

Under the Single-Shot constraints, the Re-Id task aims to identify the person represented by an image from one view, called probe image, among all the images of different people from the other view, called gallery of images. Hence, the Single-Shot Re-Id problem is commonly treated as a pairwise binary classification task, composed of two main stages: features extraction and their comparison through a certain distance metric.

Consequently, literature methods for Re-Id generally fall into two categories: methods focused on enhancing the person features design to represent the most discriminant aspects of an individual’s appearance, e.g. [52, 92], and those meant to learn an effective discriminative distance to optimally combine the previously extracted visual features, e.g. [44, 89].

The earliest works made use of handcrafted features based on low-level local features, like colour, texture and shape [6, 27, 34, 59, 80, 88]. Other works proposed the integration of several types of features with complementary nature, into a global signature, such as BoW models [19], or covariance descriptors [14].

Besides, many research efforts have been dedicated to learning salient features [91, 92]. In the context of Re-Id, human salience is different from general image salience in the way of drawing visual attention.

Concretely, for describing Re-Id images, there are two types of visual features that are desirable to learn: invariant and discriminative ones. Invariant features, e.g. [6], are both distinctive and stable under changing viewing conditions between different cameras. However, the large intra-class appearance variations often make the computation of these features impossible under realistic conditions. To overcome this limitation, discriminative methods exploit the class information to find a more distinctive representation, e.g. [13]. However, such methods tend to suffer from overfitting. Moreover, they are often based on local image descriptors, which might be a severe disadvantage: the dependence on salient local attributes makes it impossible to re-identify a person in two images if the salient attribute is not visible in both captures.

The tendency in feature design is to learn a lower number of high-level representations to describe a person. Recently, some works have also applied neural models to address the Re-Id problem. This has been inspired by the demonstrated capacity of deep learning techniques to automatically find salient high-level features from the pixels of an image.

Deep learning has been proven to provide successful results in many fields of application, such as image recognition or classification, e.g. [30, 36, 75], objects and people detection [8, 12, 54, 72, 77], face identification [79], transportation [4, 57], data anomaly detection [58] and structures analysis [63, 64, 66, 82]. Nevertheless, its potential to learn a unique appearance model, able to represent any anonymous individual in a scene, has not been sufficiently exploited yet.

The new stream of methods based on the learning of Deep Convolutional Neural Networks (DCNNs) for Re-Id, e.g. [3, 11, 17, 46], follows a supervised learning framework and employs a contrastive model to perform a pair-wise binary classification to discriminate between matched and mismatched pairs of images.

The first two works in Re-Id to use deep learning, [40, 87], employed a Siamese neural network. Siamese Networks, earlier used to verify signatures [9], consist of two DCNNs sharing parameters and joined in the last layer, where the loss function performs a pair-wise verification. Each network computes the feature representation for one of the images of a pair. The objective is to make the distances between the features of matched pairs smaller than that for mismatched ones, achieving a binary classification in the distance space.

Traditionally, Siamese networks have used the Contrastive Loss function, proposed in [29]. This function forces the distance between matched pairs of images to be lower than a set margin, and higher for mismatched pairs. Later, an improved version which uses two separated margins was presented in [21] to learn more discriminative features.

Another variation of the Siamese network is the one known as Triplet model. The basics of the Triplet loss function are presented in [71], where face recognition is addressed. In the Triplet model, each input is a set of three samples. Two of them are rendering the same person and the third one, a different identity. Therefore, this model allows the comparison between a matched and a mismatched pair so the objective function can maximise the relative distance between them.

Some works, e.g. [17, 81], extend the triplet model to the Re-Id problem with an efficient learning algorithm and a triplet generation scheme. In recent years, researchers have shown that hard triplet mining is essential for the success of the triplet loss. For that purpose, different upgraded formulations of the Triplet loss can be found in the recent literature, like the Triplet Focal Loss [90], the Attribute-aware Identity-hard Triplet Loss [10] and the Compact Triplet Loss [73].

Many works build each branch of the triplet model with neural architectures initially meant for different purposes, like the VGG network, introduced in [75]. For instance, Zhuang et al. [99] chose the VGG16 model to solve Face Re-Id with a triplet model, and Liu et al. [45] to address Person Re-Id.

The recognition of a person utilizing an appearance neural model presents an intrinsically unbalanced nature, given the lack of data about the people to identify and the huge number of possible false assignments with surrounding agents. This results in the overfitting and collapse of the neural models.

On the other hand, instead of designing high dimensional feature representations to capture all relevant information, another widespread Re-Id stream is composed of methods to optimally combine and quantify visual features. Learning a metric for that purpose can boost Re-Id performance using quite simple features. In that way, an appropriate distance metric can contribute to reducing model overfitting, since it allows the use of simpler models with fewer parameters to embed the person features.

Some approaches perform feature selection by evaluating the discriminative importance of different types of features to properly weight them, [47]. Examples of this type of methods are AdaBoost [27], RankSVM [62], or discriminative classification methods that properly weight previously extracted features, such as LDA, [70], and LDML, [28]. Moreover, in [47], a PSFI based method is proposed to adaptively weight features according to different clusters of population.

Once the extracted features have been properly weighted to generate a person descriptor, pairs of these descriptors are compared by a distance metric to measure the similarity between the images that they represent. In [33], Hirzer et al. used the Euclidean distance, also referred to as L2-norm in [34]. However, in [51], Ma et al. discussed the idea that different descriptors need different distance metrics to properly re-identify people. They claimed that, with a non-Euclidean distance, even less distinctive features, which do not need to capture the visual invariances between the cameras, are sufficient to obtain considerable results.

There is a vast variety of supervised methods encompassed under the paradigm of Metric Learning algorithms. Metric learning approaches use training data to learn effective distance metrics, searching for strategies that combine given features maximising interclass variation whilst minimising intra-class variation.

Some widely used metric learners are Probabilistic Relative Distance Comparison (PRDC) [94], Large Margin Nearest Neighbor (LMNN) [84], Information-Theoretic Metric Learning (ITML) [15], Logistic Discriminant Metric Learning (LDML) [28], and Pairwise Constrained Component Analysis (PCCA) [53]. These algorithms are based on complex optimisation schemes with high computational cost and memory requirements, which makes them hard to apply in practice.

There is another branch of research composed of methods which seek to embed a camera-to-camera transition function in feature space, using labelled samples. To account for the appearance changes between non-overlapping cameras, some early methods modelled the transfer of colours associated with the two specific cameras, e.g. [43], which can be understood as inter-camera colour calibration. Following the work of Porikli [61], designed for overlapping cameras, some early studies proposed different ways to estimate a Brightness Transfer Function (BTF) for modelling the appearance transfer between two cameras. However, this approach has some limitations. First, it assumes that a perfect foreground-background segmentation is available both for training and when the re-identification system makes predictions in real time. Second, it is not always sufficient for modelling all the variability in the possible appearance changes.

On the other side, the Implicit Camera Transfer (ICT) method [5] introduces a novel way of modelling the camera-dependent transfer. The camera transfer is modelled by a binary relation that provides a generalised representation of the variabilities of the changes in appearance, without relying on high-level features or previous background subtraction. Instead of that, the figure-ground separation is performed implicitly by automatic feature selection.

The embedding of cross-view transforms can also be used to deal with the problem of misaligned features. Indeed, Li and Wang [38] learnt a mixture of cross-view transformations, and features were projected into a common space that allowed their alignment.

Moreover, a Mahalanobis matrix can be employed to code the view-to-view transitions, so that the cross-view variations are reduced. For example, in [69] a generative standard Mahalanobis distance learning is presented. Mahalanobis learning exploits the structure of the data under the assumption that the classes present the same distribution. In contrast, RankSVM is used to learn an independent weight for each feature element, and PRDC outputs an orthogonal matrix that encodes the global importance of each feature. Mahalanobis metric learning, instead, optimises a full matrix that relates the features computed from one view set with those calculated from the other view set of images, to code the view-to-view transition.

The approach presented in [69], also proposed in [70, 28], optimises a Mahalanobis matrix after the person features have been computed for a dataset, treating the features extraction and the distance learning as two independent stages.

Instead of that, the works presented in [22, 23] proposed the use of Deep Re-Id neural models and the integration of a process to estimate the Mahalanobis matrix into the features learning process. In that way, the matrix estimation affects the features learning evolution and improves it simultaneously. Those methods employed a discriminative data analysis process to calculate the Mahalanobis matrix, which was optimised by the adaptive estimation of the covariance matrices of two features spaces, the similar and dissimilar ones. Nevertheless, those methods require the analysis of the data structure through several learning iterations to observe the effects of the Mahalanobis matrix on the learnt features, and, consequently, correct the values of the elements of that matrix.

The existing methods to estimate the Mahalanobis matrix for Re-Id do it after the features extraction or during its training in a parallel (but not the same) estimation process. However, the learning of the Mahalanobis matrix elements as an extra set of neural parameters has not been explored yet.

Unlike the works in [22, 23], this article presents a Mahalanobis Learning method that directly updates the elements of the matrix according to the objective to achieve, which is defined by the loss function. The novelty resides in the fact that the matrix estimation is not only integrated into the learning of the features, but is learnt simultaneously with them. The elements of the matrix are considered as an additional set of neural parameters and, consequently, are backpropagated and updated by a novel learning algorithm.

3. Re-Identification model

This article addresses the Single-Shot Re-Identification task, assuming that person images have been previously detected in two different camera views. Therefore, the objective is to identify the person represented by an image from one view, called probe image, among a set of images captured from the other view, denoted as the gallery of images. Hence, the comparison of the two images must provide a prediction of whether they belong to the same identity or not. A pair of images of the same person is hereinafter called a matched or positive pair. Analogously, a pair of images of different people is denoted as a mismatched or negative pair.

Due to the contrastive essence of the pair-wise approach, the proposed Re-Id appearance model measures a comparative distance between two images. Instead of computing the distances directly over the raw images, these are calculated from representative features. Therefore, it is necessary to learn a certain distance metric, dm, and an embedding F(I) that maps an image I to a feature space, such that the distances between samples rendering the same person are smaller than those between different people in that feature space.

In general, the Re-Id algorithm consists of the computation of features for a pair of person images and their comparison through a connection function, a.k.a. distance metric, dm, as Fig. 2 shows. The feature embedding has been modelled by a DCNN, and the Mahalanobis distance has been used as the distance metric.

Figure 2.

Architecture of the Siamese Re-Id model.
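
The matching step described above can be sketched as follows: the probe feature is compared with every gallery feature, and the gallery is sorted by increasing distance. This is only an illustrative sketch. The names `rank_gallery` and `squared_euclidean` and the toy feature values are assumptions; in the proposed model the features would come from the DCNN, and the connection function would be the Mahalanobis distance rather than the squared Euclidean distance used here as a placeholder.

```python
def squared_euclidean(u, v):
    """Squared L2 distance between two equally sized feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def rank_gallery(probe, gallery, distance=squared_euclidean):
    """Return gallery indices sorted from most to least similar to the probe."""
    scored = [(distance(probe, g), idx) for idx, g in enumerate(gallery)]
    scored.sort()  # ascending distance; ties are broken by gallery index
    return [idx for _, idx in scored]


# Toy 2-D features (hypothetical values, for illustration only).
probe = [1.0, 0.0]
gallery = [[4.0, 0.0], [1.0, 0.5], [0.0, 3.0]]
ranking = rank_gallery(probe, gallery)
```

Passing the distance as a parameter keeps the ranking logic independent of the chosen connection function, so any other metric with the same two-argument form can be swapped in.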

3.1 DCNN architecture

Deep learning has been used to automatically find the most salient features of the individuals’ appearance. Consequently, the transformation of every image, I, to its corresponding representation in the feature space, $F_{W}\left(I\right)$ , is performed by a DCNN, whose output depends on the values of its weights, W.

The architecture of the used DCNN is an adapted version of the well-known and widely used VGG architecture. Concretely, an 11-layered network, hereafter called VGG11, presented as the A version of the set of very deep CNNs in [75], has been implemented. The layer specifications for the proposed VGG11-based embedding are listed in Table 1.

Table 1
Structure of the used VGG11-based model. The input and output sizes are described in #rows $\times$ #cols $\times$ #filters; the kernel, in #rows $\times$ #cols $\times$ #filters, stride, or #outputs for FC layers

Layer	Input size	Output size	Kernel
Conv-1-1	128 $\times$ 64 $\times$ 3	128 $\times$ 64 $\times$ 64	3 $\times$ 3 $\times$ 3
Pool-1	128 $\times$ 64 $\times$ 64	64 $\times$ 32 $\times$ 64	2 $\times$ 2 $\times$ 64, 2
Conv-2-1	64 $\times$ 32 $\times$ 64	64 $\times$ 32 $\times$ 128	3 $\times$ 3 $\times$ 64
Pool-2	64 $\times$ 32 $\times$ 128	32 $\times$ 16 $\times$ 128	2 $\times$ 2 $\times$ 128, 2
Conv-3-1	32 $\times$ 16 $\times$ 128	32 $\times$ 16 $\times$ 256	3 $\times$ 3 $\times$ 128
Conv-3-2	32 $\times$ 16 $\times$ 256	32 $\times$ 16 $\times$ 256	3 $\times$ 3 $\times$ 256
Pool-3	32 $\times$ 16 $\times$ 256	16 $\times$ 8 $\times$ 256	2 $\times$ 2 $\times$ 256, 2
Conv-4-1	16 $\times$ 8 $\times$ 256	16 $\times$ 8 $\times$ 512	3 $\times$ 3 $\times$ 256
Conv-4-2	16 $\times$ 8 $\times$ 512	16 $\times$ 8 $\times$ 512	3 $\times$ 3 $\times$ 512
Pool-4	16 $\times$ 8 $\times$ 512	8 $\times$ 4 $\times$ 512	2 $\times$ 2 $\times$ 512, 2
Conv-5-1	8 $\times$ 4 $\times$ 512	8 $\times$ 4 $\times$ 512	3 $\times$ 3 $\times$ 512
Conv-5-2	8 $\times$ 4 $\times$ 512	8 $\times$ 4 $\times$ 512	3 $\times$ 3 $\times$ 512
Pool-5	8 $\times$ 4 $\times$ 512	4 $\times$ 2 $\times$ 512	2 $\times$ 2 $\times$ 512, 2
FC-6	4 $\times$ 2 $\times$ 512	1 $\times$ 1 $\times$ 4096	4096
FC-7	1 $\times$ 1 $\times$ 4096	1 $\times$ 1 $\times$ 4096	4096
FC-8	1 $\times$ 1 $\times$ 4096	1 $\times$ 1 $\times$ 1000	1000

VGG11 presents eight convolution layers, three fully connected layers and a final SoftMax layer. The SoftMax layer has been removed to get a feature array as output instead of a classification probability value. Hence, its output is a point in the feature space represented by a 1000-dimensional array $F\left(I\right)\in{\mathbb{R}}^{n}$ ( $n=$ 1000). Moreover, the input size used in [75] has been modified to adapt its value to the dimensions of the Re-Id samples. Therefore, the input of the proposed DCNN is an RGB image of a fixed size, which has been set to 64 $\times$ 128 pixels (width $\times$ height, i.e. the 128 rows $\times$ 64 columns listed in Table 1). All hidden layers are equipped with a Rectified Linear Unit, ReLU, [36], as the activation function, as described in [75].
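
The sizes listed in Table 1 can be checked with a short shape-propagation sketch. It assumes that every 3 $\times$ 3 convolution preserves the spatial size (stride 1, padding 1), which is what the equal input and output sizes in Table 1 suggest, and that every pooling layer halves both spatial dimensions. The names `LAYERS` and `propagate` are illustrative.

```python
# (layer name, kind, number of output filters or units) following Table 1.
LAYERS = [
    ("Conv-1-1", "conv", 64), ("Pool-1", "pool", None),
    ("Conv-2-1", "conv", 128), ("Pool-2", "pool", None),
    ("Conv-3-1", "conv", 256), ("Conv-3-2", "conv", 256), ("Pool-3", "pool", None),
    ("Conv-4-1", "conv", 512), ("Conv-4-2", "conv", 512), ("Pool-4", "pool", None),
    ("Conv-5-1", "conv", 512), ("Conv-5-2", "conv", 512), ("Pool-5", "pool", None),
    ("FC-6", "fc", 4096), ("FC-7", "fc", 4096), ("FC-8", "fc", 1000),
]


def propagate(rows, cols, channels, layers=LAYERS):
    """Return a list of (layer name, output shape) pairs."""
    shapes = []
    for name, kind, out in layers:
        if kind == "conv":
            # Assumed size-preserving convolution: only the depth changes.
            channels = out
        elif kind == "pool":
            # 2x2 pooling with stride 2 halves rows and columns.
            rows, cols = rows // 2, cols // 2
        else:
            # Fully connected layer: the output is a 1 x 1 x out array.
            rows, cols, channels = 1, 1, out
        shapes.append((name, (rows, cols, channels)))
    return shapes


shapes = dict(propagate(128, 64, 3))  # 128 rows x 64 columns x 3 channels
n_conv = sum(1 for _, kind, _ in LAYERS if kind == "conv")
n_fc = sum(1 for _, kind, _ in LAYERS if kind == "fc")
```

Under these assumptions the output of Pool-5 is 4 $\times$ 2 $\times$ 512, i.e. 4096 values feeding FC-6, and the final feature array has 1000 elements.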

In previous experiments, two different VGG architectures were evaluated, with sixteen and eleven layers. The effects of shortening the network architecture in the Single-Shot Re-identification domain were analysed, concluding that the simpler the model, the less prone to be over-fitted it is, due to the lack of training data. Therefore, the VGG11 network has been selected to train the person features.

3.2 Mahalanobis distance metric

Person Re-Identification requires a distance metric to compare two images and decide whether or not they belong to the same person. This is a great challenge, due to the presence of people with a similar appearance. For that reason, a metric to properly combine discriminative features is needed. However, the variations of illumination, perspective, background, resolution and scale between two images of the same person captured from different views make his or her appearance vary, hampering the re-identification. This article proposes coding the view-to-view transition in a matrix, so that the learnt features focus on rendering the dissimilarity mainly due to appearance changes instead of the view changes.

To achieve that purpose, the Mahalanobis distance has been taken as the distance metric. Equation (1) defines the Mahalanobis distance, $d_{M}\left(F_{W}\left(I^{a}\right),F_{W}\left(I^{b}\right)\right)$ , or simply $d_{M}$ , between the features of the pair of images to compare, where $M$ is the Mahalanobis matrix.

$\displaystyle d_{M}=\left(F_{W}\left(I^{a}\right)-F_{W}\left(I^{b}\right)\right)^{T}M\left(F_{W}\left(I^{a}\right)-F_{W}\left(I^{b}\right)\right)$ (1)

The formulation of the Mahalanobis distance exploits the structure of the data. Therefore, using the Mahalanobis distance as connection distance, and forcing this to be small for positive pairs and large for negative pairs, the Mahalanobis Matrix can be trained. The resulting Mahalanobis matrix embeds the view-to-view transitions that relate the characteristics of the images from each camera.

Theoretically, $M$ is defined as the inverse of the covariance matrix of the variable formed by the difference of the feature arrays, $F_{W}\left(I^{a}\right)-F_{W}\left(I^{b}\right)$ . Therefore, $M$ accounts for the data structure of that new array, $F_{W}\left(I^{a}\right)-F_{W}\left(I^{b}\right)$ , encoding the relationship of each of its elements with every other. In that way, this matrix intrinsically embeds the variations underlying all the comparisons between images from the camera views $a$ and $b$ , which are mainly due to the view-to-view transition. To attain this aim, the image $I^{a}$ always has to be taken from the same camera ( $a$ ), and the image $I^{b}$ , from the other one ( $b$ ). The PRID2011 dataset [32] fits this requirement, and it was selected for this research for that reason.
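
A minimal sketch of Eq. (1) in plain Python is given below, with features as lists and $M$ as a list of rows. The function name `mahalanobis_distance` and the numeric values are illustrative assumptions, not part of the published Caffe-python layers. Note that when $M$ is the identity matrix, Eq. (1) reduces to the squared Euclidean distance.

```python
def mahalanobis_distance(f_a, f_b, M):
    """Evaluate d_M = (f_a - f_b)^T M (f_a - f_b), following Eq. (1)."""
    diff = [a - b for a, b in zip(f_a, f_b)]
    # Matrix-vector product M * diff, one row of M at a time.
    m_diff = [sum(m_ij * d_j for m_ij, d_j in zip(row, diff)) for row in M]
    # Inner product diff^T * (M * diff).
    return sum(d_i * md_i for d_i, md_i in zip(diff, m_diff))


# Hypothetical 3-D features, for illustration only.
f_a = [1.0, 2.0, 3.0]
f_b = [0.0, 0.0, 1.0]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weighted = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.5]]

d_identity = mahalanobis_distance(f_a, f_b, identity)  # squared Euclidean case
d_weighted = mahalanobis_distance(f_a, f_b, weighted)  # per-element weighting
```

A diagonal $M$ only rescales each feature element, whereas off-diagonal entries let the distance account for how pairs of feature elements vary together.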

4. Learning algorithm

The main contribution of this article is the design of a unique supervised learning process to train the Re-Id model, presented in the previous section, where the DCNN and the distance metric are jointly learnt. With that purpose, the triplet approach has been adopted to learn the model. Figure 3 shows the architecture of the proposed learning approach, where the parameters of the feature embedding, $W$ , and the parameters of the connection function, $M$ , are simultaneously learnt.

Figure 3.

Architecture of the proposed learning model.

Every sample $i$ of the training dataset, $X$ , is constituted by a triplet of person images, $X_{i}=\left\langle I^{a}_{i},I^{p}_{i},I^{n}_{i}\right\rangle$ . The first image is an anchor sample, $I^{a}_{i}$ , the second one is an image, $I^{p}_{i}$ , rendering the same person as the anchor image, and the third one is a different person’s image, $I^{n}_{i}$ . Therefore $I^{a}_{i}$ and $I^{p}_{i}$ form a positive pair of samples, and $I^{a}_{i}$ and $I^{n}_{i}$ , a negative one. The triplet model tries to maximise the relative distance between the values of the distance metric for the positive and the negative pair.

The objective is to learn a transformation from the image space to the feature space such that the representations of the same person lie close together and far away from the representations of different people. This constraint is imposed by the Triplet objective formulation, presented in [71], which establishes a relative distance relationship. For a certain triplet of images, the triplet loss requires the squared Euclidean distance for the negative pair, ${dm}^{-}$ , to be larger than that for the positive one, ${dm}^{+}$ , by a predefined margin, $\tau$ . This constraint is shown by Eq. (2), where ${dm}^{+}$ and ${dm}^{-}$ are defined by Eqs (3) and (4) respectively.

$\displaystyle dm^{+}\left(X_{i}\right)+\tau<dm^{-}\left(X_{i}\right),\quad\forall X_{i}\in X$ (2)

$\displaystyle dm^{+}\left(X_{i}\right)=\left\|F_{W}\left(I^{a}_{i}\right)-F_{W}\left(I^{p}_{i}\right)\right\|^{2}_{2}$ (3)

$\displaystyle dm^{-}\left(X_{i}\right)=\left\|F_{W}\left(I^{a}_{i}\right)-F_{W}\left(I^{n}_{i}\right)\right\|^{2}_{2}$ (4)

These equations ensure that, in the feature space, the anchor sample is closer to all samples of the same person than to any sample of other people, since $X$ is the set of all possible triplets in the training set. However, in the Single-Shot Re-Id task, there is only one positive sample for every anchor image. Hence, the available data are not enough to adopt an identity-oriented training approach, which clusters the samples of the same person close to each other and far from the clusters of other identities.

By contrast, the proposed method treats all the possible positive pairs as a set representing the condition of similarity. Analogously, negative pairs represent the condition of dissimilarity. The discrimination between similarity and dissimilarity is learnt by comparing every positive pair with all the possible negative pairs. In that way, the network is trained to identify a person among a huge number of negative samples. Instead of using an online triplet generation mechanism, as in [71], all the possible triplet combinations from the available data have been determined beforehand, using the Triplet Permutation tool formulated in [24].

This approach of global comparison calls for the use of a Batch learning algorithm. However, the huge amount of possible triplet combinations composing the training set, and the limitations in processing memory resources have led to implementing a Mini-Batch Triplet loss function.

4.1 Mini-batch triplet loss with Mahalanobis distance

The Triplet loss function has been reformulated according to three specifications. First, the loss value has to be computed from a mini-batch of training triplets. Therefore, the loss function to minimise, $f_{L}$ , is defined by Eq. (5), where $B$ is the number of elements in a batch of triplets, $X^{t}=\left\{X^{t}_{i}\right\}$ . $f_{C}$ is the cost produced by the deviation of the features of a triplet sample from the objective, defined by Eq. (6), which follows the constraint imposed by Eq. (2). Equation (5) defines the average loss produced by the costs of an entire mini-batch of triplets. The chosen batch size ( $B=64$ ) was the largest one that the available memory resources allowed.

Second, the connection functions, $dm^{+}\left(X_{i}\right)$ and $dm^{-}\left(X_{i}\right)$ , are calculated by the Mahalanobis distance function (Eq. (1)) instead of the Euclidean distance. Third, the loss and the connection function need to be jointly calculated in a single step. Therefore, ${dm}^{+}$ and ${dm}^{-}$ have been substituted by the Mahalanobis distance definition (see Eq. (1)) in Eq. (7). Thus, Eq. (7) formulates the loss value as a function of the features, $F_{W}\left(I^{a}\right)$ , $F_{W}\left(I^{p}\right)$ and $F_{W}\left(I^{n}\right)$ , by implicitly computing the distance metrics, $dm^{+}$ and $dm^{-}$ . In that way, the derivatives of the loss value with respect to the input features can be defined. For ease of exposition, the features $F_{W}\left(I^{a}\right)$ , $F_{W}\left(I^{p}\right)$ and $F_{W}\left(I^{n}\right)$ are hereinafter denoted as $F^{a}$ , $F^{p}$ and $F^{n}$ , respectively.

$\displaystyle f_{L}\left(W^{t};\ X^{t}\right)=\frac{1}{2B}\sum^{B}_{i=1}{\left[f_{C}\left(W^{t};X^{t}_{i}\right)\right]}$ (5)

$\displaystyle f_{C}=\textit{max}\left(\tau+{dm}^{+}-{dm}^{-},\ 0\right)$ (6)

$\displaystyle f_{C}=\textit{max}\left(\tau+{\left(F^{a}-F^{p}\right)}^{T}\cdot M\cdot\left(F^{a}-F^{p}\right)-{\left(F^{a}-F^{n}\right)}^{T}\cdot M\cdot\left(F^{a}-F^{n}\right),\ 0\right)$ (7)
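The mini-batch loss of Eqs (5)–(7) can be illustrated with a minimal NumPy sketch. This is not the delivered layer: the function names `mahalanobis_sq` and `minibatch_triplet_loss`, the random features and the margin value are assumptions chosen only for illustration.

```python
import numpy as np

def mahalanobis_sq(u, v, M):
    """Row-wise squared Mahalanobis distance (u - v)^T M (u - v)."""
    d = u - v
    return np.einsum("bi,ij,bj->b", d, M, d)

def minibatch_triplet_loss(Fa, Fp, Fn, M, tau):
    """Hinge cost per triplet (Eq. (7)) and its batch average (Eq. (5))."""
    dm_pos = mahalanobis_sq(Fa, Fp, M)
    dm_neg = mahalanobis_sq(Fa, Fn, M)
    f_C = np.maximum(tau + dm_pos - dm_neg, 0.0)
    B = Fa.shape[0]
    return f_C.sum() / (2.0 * B), f_C

# Toy batch: B triplets of N-dimensional features (values are arbitrary).
rng = np.random.default_rng(0)
B, N = 4, 3
Fa, Fp, Fn = (rng.normal(size=(B, N)) for _ in range(3))
loss, costs = minibatch_triplet_loss(Fa, Fp, Fn, np.eye(N), tau=1.0)
```

With $M=I$ the distance reduces to the squared Euclidean distance of Eqs (3) and (4), which is the situation at the start of training, when $M$ is initialised to the identity.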

4.2 Mahalanobis distance learning algorithm

A Triplet-based Mini-Batch Gradient Descent learning algorithm has been designed and implemented to learn both the deep feature model and the Mahalanobis matrix simultaneously. The main steps of this optimisation algorithm are presented in Algorithm 1.

A feature descriptor is computed for every input image by the forward propagation of the DCNN, and subsequently, they are contrasted by the Triplet cost function, $f_{C}$ . This process is performed for every sample of a batch to get the value of the loss function, $f_{L}$ .

Then, the equations to perform its backpropagation have been formulated. Backpropagation is performed by calculating the derivatives of the loss function with respect to each parameter to learn (sets $W$ and $M$ ). The chain rule has been applied to obtain them. Therefore, the derivative with respect to each feature ( $F$ ) is calculated first, and then the derivative of these features with respect to the parameters. This is described by Eqs (8)–(14), detailed below.

Algorithm 1 Triplet-based Mini-Batch Gradient Descent learning algorithm with Mahalanobis distance learning
Require: Batches of triplets, $X^{t}=\left\{X^{t}_{i}\right\}$ .
Ensure: The network parameters $W^{T}=\{W^{T}_{j}\}$
1:	$W^{0}=\{W^{0}_{j}\}$
2:	$M=I_{k}$
3:	while $t<T$ do
4:	$t\leftarrow t+1$ ;
5:	$\frac{\partial f_{L}}{\partial W}=0$ and $\frac{\partial f_{L}}{\partial M}=0;$
6:	for all training triplet $X_{i}$ of the batch set $X^{t}$ do
7:	Calculate $F^{a}$ , $F^{p}$ and $F^{n}$ by forward propagation;
8:	Calculate $f_{C}$ by Eq. (7);
9:	end for
10:	Calculate $f_{L}$ by Eq. (5);
11:	for all training triplet $X_{i}$ of the batch set $X^{t}$ do
12:	Calculate $\frac{\partial f_{C}}{\partial F^{a}}$ , $\frac{\partial f_{C}}{\partial F^{p}}$ and $\frac{\partial f_{C}}{\partial F^{n}}$ by Eqs (8)–(10);
13:	Calculate $\frac{\partial F^{a}}{\partial W_{j}}$ , $\frac{\partial F^{p}}{\partial W_{j}}$ and $\frac{\partial F^{n}}{\partial W_{j}}$ by back propagation;
14:	Calculate the gradient $\frac{\partial f_{C}}{\partial W}$ according to Eq. (12);
15:	Calculate the gradient $\frac{\partial f_{C}}{\partial M}$ according to Eq. (14);
16:	end for
17:	Calculate the gradient $\frac{\partial f_{L}}{\partial W}$ according to Eq. (11);
18:	Calculate the gradient $\frac{\partial f_{L}}{\partial M}$ according to Eq. (13);
19:	Update parameters $W^{t}_{j}$ using Adagrad method;
20:	Update parameters $M^{t}_{j}=M^{t-1}_{j}-\alpha\frac{\partial f_{L}}{\partial M^{t-1}_{j}}$
21:	end while

As explained above, the distance metric computation has been integrated into the triplet loss function. In that way, the loss function can be directly differentiated with respect to the features, $F^{a}$ , $F^{p}$ and $F^{n}$ , according to Eqs (8)–(10), respectively, where $\otimes$ denotes the Kronecker product between two matrices. $I_{N}$ is an identity matrix of size $N\times N$ , where $N$ is the number of elements of each feature array, $F_{W}\left(I\right)$ .

The back-propagation of the DCNN, according to each one of the three obtained descriptors, is performed to obtain $\frac{\partial F^{a}}{\partial W^{t}_{j}}$ , $\frac{\partial F^{p}}{\partial W^{t}_{j}}$ and $\frac{\partial F^{n}}{\partial W^{t}_{j}}$ for every sample of the batch. Subsequently, the derivative of the loss function with respect to the neural network weights $W$ has been computed by Eqs (11) and (12).

$\displaystyle\frac{\partial f_{C}}{\partial F^{a}}=\left\{\begin{array}{ll}I_{N}\cdot M\cdot\left(F^{n}-F^{p}\right)+\left({\left(F^{n}-F^{p}\right)}^{T}\otimes I_{N}\right)\cdot\left(M\otimes I_{N}\right)\cdot\textit{vec}\left(I_{N}\right)&\textit{if}\ {dm}^{-}-{dm}^{+}<\tau\\ 0&\textit{otherwise}\end{array}\right.$ (8)

$\displaystyle\frac{\partial f_{C}}{\partial F^{p}}=\left\{\begin{array}{ll}I_{N}\cdot M\cdot\left(F^{p}-F^{a}\right)+\left({\left(F^{p}-F^{a}\right)}^{T}\otimes I_{N}\right)\cdot\left(M\otimes I_{N}\right)\cdot\textit{vec}\left(I_{N}\right)&\textit{if}\ {dm}^{-}-{dm}^{+}<\tau\\ 0&\textit{otherwise}\end{array}\right.$ (9)

$\displaystyle\frac{\partial f_{C}}{\partial F^{n}}=\left\{\begin{array}{ll}I_{N}\cdot M\cdot\left(F^{a}-F^{n}\right)+\left({\left(F^{a}-F^{n}\right)}^{T}\otimes I_{N}\right)\cdot\left(M\otimes I_{N}\right)\cdot\textit{vec}\left(I_{N}\right)&\textit{if}\ {dm}^{-}-{dm}^{+}<\tau\\ 0&\textit{otherwise}\end{array}\right.$ (10)

$\displaystyle\frac{\partial f_{L}\left(W^{t};\ X^{t}\right)}{\partial W^{t}_{j}}=\frac{1}{2B}\sum^{B}_{i=1}{\left[\frac{\partial f_{C}\left(W^{t};X^{t}_{i}\right)}{\partial W^{t}_{j}}\right]}$ (11)

$\displaystyle\frac{\partial f_{C}\left(W;X_{i}\right)}{\partial W_{j}}=\frac{\partial f_{C}\left(W;X_{i}\right)}{\partial F^{a}}\cdot\frac{\partial F^{a}}{\partial W_{j}}+\frac{\partial f_{C}\left(W;X_{i}\right)}{\partial F^{p}}\cdot\frac{\partial F^{p}}{\partial W_{j}}+\frac{\partial f_{C}\left(W;X_{i}\right)}{\partial F^{n}}\cdot\frac{\partial F^{n}}{\partial W_{j}}$ (12)
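As a sanity check on the feature derivatives, the following sketch evaluates the active branch of Eq. (8) with explicit Kronecker products and compares it with the compact form $\left(M+M^{T}\right)\left(F^{n}-F^{p}\right)$ , which is what direct differentiation of the quadratic forms in Eq. (7) gives. The function names and the random test values are illustrative assumptions; the margin is set very large only to guarantee that the hinge is active.

```python
import numpy as np

def cost(Fa, Fp, Fn, M, tau):
    """Triplet cost of Eq. (7) for a single triplet."""
    dp, dn = Fa - Fp, Fa - Fn
    return max(tau + dp @ M @ dp - dn @ M @ dn, 0.0)

def grad_Fa_kron(Fp, Fn, M):
    """Active branch of Eq. (8), written with explicit Kronecker products."""
    N = M.shape[0]
    I = np.eye(N)
    d = (Fn - Fp).reshape(N, 1)
    vec_I = I.reshape(-1, 1, order="F")  # column-stacking vec(I_N)
    term1 = I @ M @ d
    term2 = np.kron(d.T, I) @ np.kron(M, I) @ vec_I
    return (term1 + term2).ravel()

def grad_Fa_compact(Fp, Fn, M):
    """Equivalent closed form: (M + M^T)(F^n - F^p)."""
    return (M + M.T) @ (Fn - Fp)

rng = np.random.default_rng(1)
N = 4
Fa, Fp, Fn = rng.normal(size=(3, N))
M = rng.normal(size=(N, N))  # deliberately non-symmetric
tau = 1e3                    # large margin keeps the hinge active

g_kron = grad_Fa_kron(Fp, Fn, M)
g_compact = grad_Fa_compact(Fp, Fn, M)

# Central finite differences on F^a as an independent reference.
eps = 1e-5
g_num = np.array([
    (cost(Fa + eps * e, Fp, Fn, M, tau) - cost(Fa - eps * e, Fp, Fn, M, tau)) / (2 * eps)
    for e in np.eye(N)
])
```

The Kronecker form materialises an $N^{2}\times N^{2}$ matrix, so for large feature arrays the compact form is much cheaper to evaluate; the two agree because $\left(B^{T}\otimes A\right)\textit{vec}\left(X\right)=\textit{vec}\left(AXB\right)$ .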

Moreover, the Mahalanobis matrix, $M$ , has been treated as an additional set of parameters to learn. Therefore, the derivative of the loss function with respect to $M$ has been computed by Eqs (13) and (14).

$\displaystyle\frac{\partial f_{L}}{\partial M}=\frac{1}{B}\sum^{B}_{i=1}{\frac{\partial f_{C}\left(X_{i}\right)}{\partial M}}$ (13)

$\displaystyle\frac{\partial f_{C}}{\partial M}=\left\{\begin{array}{ll}\left({\left(F^{n}-F^{p}\right)}^{T}\otimes I_{K}\right)\cdot\frac{\partial M}{\partial M}\cdot\left(\left(F^{n}-F^{p}\right)\otimes I_{K}\right)&\textit{if}\ {dm}^{-}-{dm}^{+}<\tau\\ 0&\textit{otherwise}\end{array}\right.$ (14)
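For readers who want to check a gradient with respect to $M$ numerically, the sketch below uses the standard matrix-calculus result that the derivative of $x^{T}Mx$ with respect to $M$ is $xx^{T}$ when all entries of $M$ are treated as independent. Applied to Eq. (7), this gives a difference of two outer products. This is an assumption-labelled reference computation, not a transcription of the vectorised expression above, and the function names are hypothetical.

```python
import numpy as np

def grad_M(Fa, Fp, Fn, M, tau):
    """Derivative of the single-triplet cost of Eq. (7) w.r.t. M.

    Uses d(x^T M x)/dM = x x^T; returns zeros when the hinge is inactive.
    """
    dp, dn = Fa - Fp, Fa - Fn
    active = (dn @ M @ dn) - (dp @ M @ dp) < tau
    if not active:
        return np.zeros_like(M)
    return np.outer(dp, dp) - np.outer(dn, dn)

def grad_M_batch(Fa, Fp, Fn, M, tau):
    """Batch average of the per-triplet gradients, as in Eq. (13)."""
    B = Fa.shape[0]
    return sum(grad_M(a, p, n, M, tau) for a, p, n in zip(Fa, Fp, Fn)) / B

rng = np.random.default_rng(2)
K = 3
Fa, Fp, Fn = rng.normal(size=(3, 5, K))  # 5 triplets of K-dim features
M = np.eye(K)
G = grad_M_batch(Fa, Fp, Fn, M, tau=1e3)  # large margin: all triplets active
```

A difference of outer products is symmetric, so a symmetric initial $M$ , such as the identity used in Algorithm 1, stays symmetric under plain gradient steps with this form.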

Once $\frac{\partial f_{L}}{\partial W}$ and $\frac{\partial f_{L}}{\partial M}$ are computed, the neural weights and the elements of the Mahalanobis matrix can be updated, as indicated in Algorithm 1. This formulation includes the Adagrad method [18] to update the learning rate value, $\alpha$ . Both processes, forward and backward propagation, are repeated until the pre-established maximum number of iterations, $T$ , is reached.
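The two update rules named in Algorithm 1 can be sketched as follows. The Adagrad step is the textbook form (a per-parameter step scaled by the accumulated squared gradients); the learning rate, the stabilising constant `eps` and the function names are illustrative assumptions, since the actual hyper-parameter values are not stated here.

```python
import numpy as np

def adagrad_step(w, grad, accum, lr=0.01, eps=1e-8):
    """One textbook Adagrad update; returns the new weights and accumulator."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

def gradient_step(M, grad_M, alpha):
    """Plain gradient step for M, as in line 20 of Algorithm 1."""
    return M - alpha * grad_M

# Toy values, chosen only to make the arithmetic easy to follow.
w, accum = np.array([1.0, 2.0]), np.zeros(2)
w_new, accum_new = adagrad_step(w, np.array([0.5, -0.5]), accum)
M_new = gradient_step(np.eye(2), np.ones((2, 2)), alpha=0.1)
```

On the first step the accumulator equals the squared gradient, so every weight moves by almost exactly `lr` in the direction opposite to its gradient sign.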

A novel layer$^{2}$ has been implemented and delivered to compute the forward and the backward propagation of the triplet loss function with Mahalanobis distance. The inputs of this layer are the feature arrays, $F^{a}$ , $F^{p}$ and $F^{n}$ , for a batch of triplets.

4.3 Integration of the Mahalanobis distance into the feature learning process

The learnt Mahalanobis matrix, $M$ , encodes the relationship of each one of the elements of the feature arrays with each other.

However, a reliable estimation of the Mahalanobis matrix is not achieved until a certain number of learning iterations of the feature training process have been executed. This is because the matrix elements are learnt simultaneously with the features. For that reason, the distance metric, dm, takes two different formulations during the learning process, as Eq. (15) defines: the Euclidean distance, $d_{E}$ , is employed until the number of learning iterations, $t$ , reaches a certain threshold, $T_{t}$ , and from then on the comparison is made by the Mahalanobis distance, $d_{M}$ .

$\displaystyle dm=\left\{\begin{array}{ll}d_{E}=\left\|F_{W}\left(I^{a}_{i}\right)-F_{W}\left(I^{p}_{i}\right)\right\|^{2}_{2}&\textit{if}\ t<T_{t}\\ d_{M}={\left(F^{a}-F^{p}\right)}^{T}M\left(F^{a}-F^{p}\right)&\textit{if}\ t\geqslant T_{t}\end{array}\right.$ (15)

At the beginning of the learning process, the Euclidean distance, which does not rely on any parameter, is used as the connection function. Meanwhile, the Mahalanobis distance learning is conducted from the start. The learning of the Mahalanobis matrix depends on the person feature weights, which are learnt at the same time. Because of that, the iteration number chosen as the threshold, $T_{t}$ , to start using the Mahalanobis distance is large enough to ensure at least the convergence of the feature learning. Hence, the threshold, $T_{t}$ , has been established after several experimental observations of the evolution of the loss value during the training process.
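The switch defined by Eq. (15) amounts to a single branch on the iteration counter. A minimal sketch, in which the function name `dm` and the toy vectors are illustrative assumptions:

```python
import numpy as np

def dm(u, v, M, t, T_t):
    """Distance of Eq. (15): squared Euclidean before T_t, Mahalanobis after."""
    d = u - v
    if t < T_t:
        return float(d @ d)   # d_E, ignores M
    return float(d @ M @ d)   # d_M, uses the matrix learnt so far

u, v = np.array([1.0, 0.0]), np.array([0.0, 0.0])
M = np.diag([2.0, 1.0])
before = dm(u, v, M, t=10, T_t=60000)     # Euclidean branch
after = dm(u, v, M, t=60000, T_t=60000)   # Mahalanobis branch
```

Note that only the distance fed to the loss changes at $T_{t}$ ; according to the text, $M$ itself keeps being updated from the first iteration.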

4.4 Training process setting

The designed learning algorithm has been used in a training process whose setting is explained below.

The Mahalanobis distance relates the elements of the feature difference vector to each other through the elements of the Mahalanobis matrix. For that reason, the anchor image always has to be taken from the same camera view, and the positive and negative samples from the other one, since, according to the triplet model, two types of feature difference vectors are employed, $(F^{a}-F^{p})$ and $(F^{a}-F^{n})$ , describing the similarity and the dissimilarity subspaces. Specifically, the first training set formulation (set I) of the Triplets Permutation method, presented in [24], has been employed.$^{3}$

To alleviate the Re-Id data scarcity problem, knowledge previously acquired in the Multi-Object Tracking (MOT) domain has been transferred to the Re-Id model by initialising its weights, $\left\{w^{0}_{j}\right\}$ , with the pre-trained values, as explained in [25].

The intuition behind this approach is that the most representative features of a person are automatically learnt on a MOT dataset. From the set of learnt descriptors, the low-level ones, which are learnt in the earlier layers of the network model, are kept. Then, the higher-level representations, encoded in the deeper layers, are fine-tuned on a Re-Id target dataset to make them invariant and discriminative.

Moreover, a triplet selection mechanism has been designed to accelerate the training process, as it is explained below.

4.5 Triplets selection

An online triplet selection stage has been formulated to speed up the learning. Faster convergence can be obtained by selecting triplets that violate the triplet constraint in Eq. (2). Hence, this selection step acts as a filter that makes the loss function consider only the samples relevant for the training and disregard those that are easy to classify. In that way, the back-propagation is ruled by the hardest samples, which produce larger increments (or decrements) in the weight updates, accelerating the training process.

Triplet selection was first proposed by Schroff et al. [71]. They addressed face recognition by pulling together the samples of each identity in the feature space. They sought an identity clustering and trained their network on multiple instances of every individual. In that framework, triplet selection is performed during the online generation of the triplets: given a certain anchor image, hard-positive and hard-negative images are selected. That is, for a given anchor, the samples that generate the hardest triplets are selected.

The intrinsic constraints of the Single-Shot Re-Id challenge do not allow the use of an anchor-oriented approach, either in the triplet generation or in the triplet selection. Instead, a wide variety of samples has been generated by an offline combination process, and a Mini-Batch Triplet loss function and learning algorithm have been proposed.

Therefore, a new neural layer has been designed to perform online triplet selection$^{4}$ under the constraints of the Re-Id task, to be integrated into the proposed learning algorithm (Algorithm 1), and it has been placed before the loss function layer. At every iteration $t$ , this layer receives features from a batch of $B$ triplets, so the layer input is given by $Y^{t}={\left(F^{a}_{i},F^{p}_{i},F^{n}_{i}\right)}^{B}_{i=1}$ . The layer acts as an online filter that outputs a smaller batch containing the features of the triplets that are helpful for the loss computation. Hence, the output batch of the remaining features, $R^{t}={\left(F^{a}_{i},F^{p}_{i},F^{n}_{i}\right)}^{B^{\prime}}_{i=1}$ , has a smaller size, which has been set to half of the input batch size ( $B^{\prime}=B/2$ ).

Figure 4.

Triplets selection according to the value of the positive and the negative metric distance.

Triplets presenting the lowest values for ${dm}^{-}\left(X_{i}\right)$ are considered the hardest ones. However, the goal is to consider those triplets that are difficult to classify but not the hardest ones. In practice, selecting the hardest samples can lead to bad local minima early in training, and it can result in a collapsed model. For that reason, the input batch of triplets has been divided into two sorted sequences, $N$ and $H$ , to distinguish between mid-hard (or neutral) and hard triplets, depending on whether ${dm}^{-}\left(X_{i}\right)$ is larger than ${dm}^{+}\left(X_{i}\right)$ or not, respectively. Equations (16) and (17) define both sequences. These sequences of triplets are sorted in order of increasing negative distance, ${dm}^{-}$ , that is, in descending order of difficulty, as Fig. 4 shows.

$\displaystyle N=\{N_{j}:N_{j}=Y_{i}=\left\langle F^{a}_{i},F^{p}_{i},F^{n}_{i}\right\rangle,\ Y_{i}\in Y\ |\ {dm}^{-}\left(N_{j}\right)\geqslant{dm}^{-}\left(N_{j-1}\right),\ {dm}^{+}\left(N_{j}\right)<{dm}^{-}\left(N_{j}\right),\ \forall i\in[1,B]\}$ (16)

$\displaystyle H=\{H_{j}:H_{j}=Y_{i}=\left\langle F^{a}_{i},F^{p}_{i},F^{n}_{i}\right\rangle,\ Y_{i}\in Y\ |\ {dm}^{-}\left(H_{j}\right)\geqslant{dm}^{-}\left(H_{j-1}\right),\ {dm}^{+}\left(H_{j}\right)\geqslant{dm}^{-}\left(H_{j}\right),\ \forall i\in[1,B]\}$ (17)

According to Eq. (16), $N$ can contain not only mid-hard triplets but also easy ones (those already well classified). The $B^{\prime}$ hardest triplets of this sequence are selected. If the number of samples in that sequence (even picking the easy ones) is not enough to form the output batch, triplets from the hard sequence, $H$ , are added to the final batch in decreasing order of the negative distance, that is, in ascending order of difficulty. The resulting batch, $R$ , is defined by Eqs (18) and (19), where $n$ is the number of elements in $N$ , and $h$ is the number of elements in $H$ . Therefore, if $n$ is greater than or equal to $B^{\prime}$ , the resulting batch, $R$ , only takes triplets from $N$ . Otherwise, it is formed by samples from both sequences, $N$ and $H$ .

$\displaystyle R=\{R_{i},\ \forall i\in\left[1,B^{\prime}\right]\}$ (18)

$\displaystyle R_{i}=\left\{\begin{array}{ll}N_{i},\ N_{i}\in N&\textit{if}\ i\leqslant n\\ H_{h+n-i+1},\ H_{h+n-i+1}\in H&\textit{if}\ i>n\end{array}\right.$ (19)
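The selection rule of Eqs (16)–(19) can be sketched on the per-triplet distances alone. The function below is an illustration under stated assumptions, not the delivered layer: its name, the use of index arrays as output and the toy distances are all hypothetical.

```python
import numpy as np

def select_triplets(dm_pos, dm_neg, B_out):
    """Return the indices of the B_out triplets kept by Eqs (16)-(19)."""
    idx = np.arange(len(dm_pos))
    in_N = dm_pos < dm_neg  # neutral/easy triplets; the rest go to H
    # Both sequences are sorted by increasing negative distance.
    N_seq = idx[in_N][np.argsort(dm_neg[in_N], kind="stable")]
    H_seq = idx[~in_N][np.argsort(dm_neg[~in_N], kind="stable")]
    n = len(N_seq)
    if n >= B_out:
        return N_seq[:B_out]  # the B' hardest triplets of N
    # Not enough triplets in N: fill from H, largest negative distance first.
    fill = H_seq[::-1][: B_out - n]
    return np.concatenate([N_seq, fill])

# Six triplets: the first four belong to N, the last two to H.
dm_pos = np.array([1.0, 1.0, 1.0, 1.0, 5.0, 5.0])
dm_neg = np.array([3.0, 2.0, 4.0, 6.0, 1.0, 2.0])
kept = select_triplets(dm_pos, dm_neg, B_out=3)
```

In this toy batch $n=4\geqslant B^{\prime}=3$ , so only triplets from $N$ are kept, starting with the one whose negative distance is smallest.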

5. Experimental results

The proposed Re-Identification model has been trained using the designed learning algorithm over a target Re-Id dataset. Its Re-Id capacity has been evaluated following the methodology explained in the next subsection. Finally, a comparative analysis of the obtained experimental results has been performed.

5.1 Evaluation methodology

The model has been trained and tested over a benchmark Single-Shot Re-Id dataset, PRID2011 [32].

A decisive factor for choosing PRID2011 as the only evaluation dataset was its particular and uncommon acquisition setting. In most of the Single-Shot datasets, such as VIPeR [27], GRID [49] or CUHK [39], the images were captured from many different views, even inside the same set (gallery or probe set). However, in PRID2011, all the probe images were captured from the same camera view, and all the gallery samples were acquired from a second, different camera view. PRID2011 is composed of two sets of person images, captured from two fixed static surveillance cameras, placed outdoors, with notable differences in camera parameters, illumination, poses, and background. Hence, this dataset allows demonstrating the effectiveness of the proposed Mahalanobis distance learning method in embedding the view-to-view variations.

In the single-shot version, used in this work, camera view A contains 385 different images, and camera view B, 749. Besides, 200 of the individuals appear in both sets, and the rest are distraction samples which do not form matched pairs. Of the 200 matched pairs, 100 were randomly extracted to be used as training and validation samples. The test set has been formed by following the procedure described in [31], i.e., the images of view A for the 100 remaining individuals have been used as the probe set, and the gallery has been formed by 649 images belonging to camera view B (all images of view B except the 100 images corresponding to the training individuals). The resolution of the images is 64 $\times$ 128 pixels.

The performance of the learnt models has been evaluated by computing their Cumulative Matching Characteristic (CMC) curve [55], which is a standard Re-Id performance measurement. To obtain the CMC curve, every image from the probe set is first paired with every image from the gallery set, and the distance metric, dm, (squared Mahalanobis distance) between them is computed. The match presenting the lowest value of dm is considered the top match, since two images belonging to the same person should lie close together in the feature space and far from the representations of other people. The distance metrics obtained by comparing a probe image with all the gallery images are ranked. This process is repeated for each one of the probe images. The rank value, i.e. the position of the correct match in the ranking, is calculated for each probe image and, subsequently, the percentage of probes for which each rank occurs. The CMC curve then gives the expectation of finding the correct match within the top $r$ matches, for different values of $r$ , called ranks. The computed percentages are cumulative, so the CMC curve never decreases, since the correct matches found within the top $r$ matches will also be found within the top $r+1$ matches.
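The CMC computation described above can be sketched in a few lines. The sketch assumes a precomputed probe-by-gallery distance matrix and exactly one correct gallery image per probe, as in the single-shot setting; the function name, the identity labels and the toy distances are illustrative.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank):
    """CMC scores (in %) for ranks 1..max_rank.

    dist[i, j] is the distance between probe i and gallery image j.
    """
    order = np.argsort(dist, axis=1)            # gallery sorted per probe
    ranked_ids = gallery_ids[order]
    hits = ranked_ids == probe_ids[:, None]
    first_hit = hits.argmax(axis=1)             # 0-based rank of the match
    cmc = [(first_hit < r).mean() for r in range(1, max_rank + 1)]
    return 100.0 * np.array(cmc)

# Three probes against a gallery of three identities.
dist = np.array([[0.1, 0.5, 0.9],
                 [0.8, 0.2, 0.3],
                 [0.4, 0.6, 0.5]])
gallery_ids = np.array([10, 20, 30])
probe_ids = np.array([10, 30, 20])
scores = cmc_curve(dist, probe_ids, gallery_ids, max_rank=3)
```

Here the correct matches are found at ranks 1, 2 and 3, respectively, so the curve rises by one third of the probes at each rank.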

5.2 Distance metrics comparison

The ranking capacity of the model obtained with the proposed Mahalanobis distance learning algorithm has been evaluated through an experiment called Exp.MahaLearning. To quantify the improvement of the proposed method over the use of the Euclidean distance as the Re-Id distance metric, dm, the obtained CMC scores have been compared with those obtained from a second experiment, named Exp.Euclidean. In this second experiment, the triplet Re-Id model is used with the Euclidean distance, as described in [25].

In both experiments, Exp.Euclidean and Exp.MahaLearning, the same transfer learning method has been applied. The results are given in Table 2, and their corresponding curves are shown in Fig. 5 for visual comparison.

The scores are generally better with the Mahalanobis distance, especially at the first ranks, which are the most critical ones for the Re-Id task. An increase in the performance at the first ranks allows reducing the number of candidates provided by a Re-Id system, within which the sought identity must be found. The reason is that the Mahalanobis matrix encompasses the visual camera-to-camera transitions, related to changes in illumination, resolution, and point of view. This reduces the effect of the intra-class variation and makes the Re-Id task easier.

Table 2
CMC scores (in [%]) for models using Euclidean and Mahalanobis distance metric on PRID2011 dataset

Rank	1	5	10	20	50	100
Exp.Euclidean	4	14	20	33	58	77
Exp.MahaLearning (proposed method)	13	31	41	47	63	82

Figure 5.

CMC curves of models using Euclidean and Mahalanobis distance metric on PRID2011 dataset.

5.3 Results of learning the Mahalanobis matrix

Although the Mahalanobis distance has already been employed in previous works, the main novelty of the proposed method is the learning of the Mahalanobis matrix elements as neural weights through backpropagation. In previous works, estimation methods were used. In [25], the estimation of the Mahalanobis matrix is even integrated into a neural model to learn discriminative features. To provide a fair comparison, this method has been integrated into the proposed neural architecture, and this experiment has been named Exp.MahaEstimation. Therefore, Exp.MahaEstimation and Exp.MahaLearning share the same neural architecture. The only difference between them is the method used to obtain the Mahalanobis matrix: a discriminative estimation process in the former, and the proposed method in the latter.

The learning curves for both experiments are shown in Fig. 6. The loss value is measured by the triplet loss function, and the accuracy metric renders the percentage of well-classified triplets over the total, considering as well-classified those triplets where the negative pair distance, ${dm}^{-}$ , is larger than the positive pair distance, ${dm}^{+}$ , by a threshold, $\tau$ .

Figure 6.

Comparison of the learning process evolution using different methods to calculate the Mahalanobis matrix, over PRID2011.

In both experiments, the method explained in Section 4.3 for integrating the Mahalanobis distance computation into the feature learning process has been implemented. The effect of the change in the distance formulation at iteration $T_{t}$ ( $T_{t}=60{,}000$ ) can be observed in Fig. 6, where the loss and accuracy values in Exp.MahaEstimation undergo a dramatic variation at learning iteration $T_{t}$ . From that iteration on, the Mahalanobis distance, instead of the previously employed Euclidean distance, is used as the metric distance, dm, to feed the loss function.

Even though the validation loss increases from this iteration on, the Re-Id accuracy also increases, so a higher number of samples are well classified by the Mahalanobis distance, although the cost value of the misclassified ones is higher. In Exp.MahaEstimation, the use of the Mahalanobis distance produces larger oscillations in the loss and accuracy values. This is due to the adaptation of the Mahalanobis matrix elements through its estimation process.

With the proposed method (Exp.MahaLearning), the Mahalanobis matrix is learnt as the set of parameters of a neural layer. This approach reduces the amplitude of the oscillations: a smoother but continuously decreasing trend is observed for the loss value, and an increasing trend for the accuracy value. This behaviour indicates that the Mahalanobis distance is being well learnt. Moreover, the gap between both losses (training and validation) is continuously reduced. This means that the algorithm is able to give a more generalised solution for unknown samples. This learning improvement results in an enhancement of the Re-Id capacity of the model trained with the proposed method, as Table 3 shows.

Besides, as explained above, the proposed Mahalanobis distance learning has been accelerated by the triplet selection mechanism, without affecting the Re-Id performance of the learnt model. Figure 7 shows the learning curves of a training process with and without the triplets selection layer.

Table 3
CMC scores (in [%]) for models using estimation and learning of the Mahalanobis matrix on PRID2011 dataset

Rank	1	5	10	20	50	100
Exp.MahaEstimation	8	22	33	42	60	77
Exp.MahaLearning (proposed method)	13	31	41	47	63	82

Figure 7.

Comparison of the learning curves with (a) and without (b) the triplets selection layer.

To provide a fair comparison, the selection layer is not applied to the validation samples. When the selection layer is used to train the model, the validation loss decreases from 0.8 to 0.4 in 25,000 iterations. A million iterations are needed to produce that effect when triplet selection is not used.

Moreover, the ranking power of the obtained model over PRID2011 dataset has been analysed through the image presented in Fig. 8. This figure shows the top 20 gallery images taken as most similar for some probe images (first column). The correct match is bounded by a yellow box. Examples of correct matches found at several ranks are given.

Figure 8.

Top 15 ranking provided by the proposed Re-Id model over some samples of PRID2011 dataset.

Although the model sometimes fails to find the correct match in the first rank, it is able to rank the images according to their visual appearance similarity. The model gives the smallest distances, dm, to those images of people most similar to the query probe sample, much as a human would do. For a certain probe image, people wearing similar clothes, accessories or bags, or the same colours, are ranked in the top positions, independently of the pose.

5.4 Comparison of metric learners

A new generation of Re-Identification methods, boosted by the development of MTMCT algorithms, is achieving superior results over multi-shot datasets, such as Market-1501 [40], MARS [97], CUHK03 [98], DukeMTMC-ReID [67] and DukeMTMC-VID [85]. Si et al. [73] achieved a matching score of 94.67% at the first rank over the Market-1501 dataset, and 69.6% over the CUHK03 dataset. Moreover, Zhang et al. [90] obtained 85.55% over DukeMTMC-ReID. Besides, Chen et al. [10] achieved scores of 88.2% at the first rank over the MARS dataset, and 95.4% over DukeMTMC-VID.

Nevertheless, the goal of this article is to refocus the Single-Shot metric learning approach, through a novel method for backwards propagating the Mahalanobis distance embedding as a set of neural parameters. This method is a potential tool for learning contrastive neural models, keeping the Single-Shot approach of the offline metric learning methods. For that reason, the suitability of the proposed method to address that purpose is demonstrated by comparing the performance of the proposed approach with an extensive list of well-known metric learners. Their CMC scores are shown in Table 4.

Table 4
Comparison of CMC rates (in [%]) of Re-Id methods on PRID2011 dataset, ‘–’ indicates no result was reported

Method	Rank
	1	5	10	20	50	100
Proposed method	13	31	41	47	63	82
Mahalanobis [69]	16	–	41	51	64	76
ITML [15]	12	–	36	47	64	79
LMNN-R [16]	9	–	32	43	60	76
LMNN [83]	10	–	30	42	59	73
PSFI $+$ PRDC [47]	3	9	16	24	39	–
PRDC [95]	3	10	15	23	38	–
PSFI $+$ RankSVM [47]	4	9	13	20	32	–
RankSVM [62]	4	9	13	19	32	–
LDA [20]	4	–	14	21	35	48
GFI [49]	4	–	10	17	32	–
Euclidean [33]	3	–	10	14	28	45
LDML [28]	2	–	6	11	19	32
PSFI [47]	1	2	4	7	14	–

The listed methods are based on the design of hand-crafted features and metric distance learning. Some approaches focus on finding the proper combination of features to represent a person image, like Ranking Support Vector Machines (RankSVM) [62]. Other works apply general metric learners to Person Re-Identification, such as Probabilistic Relative Distance Comparison (PRDC) [95], Large Margin Nearest Neighbor (LMNN) [83], Information-Theoretic Metric Learning (ITML) [15], Logistic Discriminant Metric Learning (LDML) [28] and Linear Discriminant Analysis (LDA) [20]. Some of these methods have been adapted to the Re-Id task, like Large Margin Nearest Neighbor with Rejection (LMNN-R) [16]. Moreover, in [47], a method based on Prototype-Sensitive Feature Importance (PSFI) is proposed to adaptively weight features according to different groups of the population, and it is combined with two previously cited methods (PSFI $+$ PRDC and PSFI $+$ RankSVM). By contrast, [49] presents a Global Feature Importance (GFI) approach, which learns a global weighting, i.e. a vector of generic weights that is invariant to the population. No population discrimination has been made in the proposed method, and the general weighting of the features to create a global person descriptor has been implicitly and automatically performed by the proposed deep model learning.

In general, the performance of the listed methods is surpassed by the proposed deep Re-Id model, which automatically learns the feature embedding and the Mahalanobis matrix as the set of parameters of a neural layer, using the vectorised implementation of its forward and backward propagation.

The two methods presenting the highest ranking ability, i.e. the proposed one and the one presented by Roth et al. [69], use the Mahalanobis distance as the connection function. Roth et al. estimated the Mahalanobis matrix with a discriminative study of the features, once these had been computed in a separate process. In contrast, the proposed method back-propagates the gradients of the elements of the Mahalanobis matrix to embed the view-to-view transitions and to learn deep features simultaneously.

5.5 Discussion

This article formulates a unified neural framework to jointly and automatically find the most salient features and the optimal Mahalanobis metric distance to compare people's appearance.

Nevertheless, the application of deep neural models to the Single-Shot Re-Identification task poses a daunting challenge due to the lack of data and its unbalanced nature. The traditional metric learners for Single-Shot Re-Id have been displaced by a new generation of Multi-Shot algorithms that require a multi-camera tracking setting.

One of the main purposes of this work was therefore to refocus the Single-Shot metric learning approach so as to bring together its low data requirements and the advantages of the automatic embedding learning provided by neural models. The Single-Shot metric learning approach has been reformulated through the back-propagation of the Mahalanobis distance embedding as a set of neural parameters. In that way, the proposed method enables the integration of deep neural networks into the pair-wise binary classification approach of the early metric learning methods, to perform Single-Shot Re-Identification. This goal has been achieved, as the results on a challenging benchmark dataset demonstrate.

Even with the standard VGG architecture, the designed distance learning method performs well. The conducted experiments have demonstrated the effectiveness of the proposed approach, whose results are comparable to those of the benchmark metric learners that estimate the metric distance from a set of previously computed features.

These results serve as a proof of concept that back-propagating the Mahalanobis distance is a valid method for learning Re-Id models, and a potential tool for training contrastive models with more sophisticated architectures in their branches.

Under this approach, enhancements can be obtained through the use of novel neural architectures for the backbone network in each branch of the proposed contrastive model.

The knowledge generated in the classification domain [2, 60, 65, 68] can be transferred to the design of novel architectures for the feature branches.

6. Conclusions

Metric learning has been extensively researched through a wide assortment of methods, but learning the parameters of a distance function through neural networks had not yet been addressed. This article presents the formulation of a novel method to back-propagate the gradients of the elements of a Mahalanobis matrix as additional neural parameters of a deep model, in order to learn the Mahalanobis metric distance for Person Re-Identification.

When re-identifying across two fixed camera views, using the Mahalanobis distance to compare two images implicitly helps to deal with intra-class variations, since the Mahalanobis matrix embeds the camera-to-camera transitions.
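One way to read this statement rests on a common assumption that the text here does not state, namely that the learnt matrix $M$ is symmetric positive semi-definite. In that case $M$ can be factorised as $M = L^\top L$, and the Mahalanobis distance between two feature vectors $x_i$ and $x_j$ (notation chosen for this illustration) becomes a squared Euclidean distance after the linear map $L$:

$$
d_M(x_i, x_j) = (x_i - x_j)^\top M \,(x_i - x_j) = (x_i - x_j)^\top L^\top L \,(x_i - x_j) = \lVert L x_i - L x_j \rVert_2^2 .
$$

Under that reading, $L$ acts as a learnt projection that can absorb systematic view-to-view changes before the features are compared, and $M = I$ recovers the plain Euclidean distance.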

The proposed method aims to automatically model the camera transitions and the proper comparison of features jointly with the feature learning, in a single process. To achieve that, a new learning algorithm has been designed following the triplet approach. The triplet loss function has been reformulated, together with its derivatives with respect to both the neural weights defining the person features and the Mahalanobis matrix elements, so that they can be back-propagated. In addition, a triplet selection layer has been implemented to accelerate the learning process.
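The following sketch shows, under stated assumptions, how a triplet hinge loss over Mahalanobis distances yields gradients for both the features and the matrix elements, and how a simple selection rule restricts the update to informative triplets. The function name `triplet_maha_loss`, the margin value, the squared quadratic distance, and the selection criterion (keep only triplets that violate the margin) are hypothetical choices for illustration; they are not taken from the authors' TripletMahaLoss and TripletSelection layers.

```python
import numpy as np

def triplet_maha_loss(anchor, positive, negative, M, margin=1.0):
    """Mean hinge triplet loss over squared Mahalanobis distances (assumed form).

    anchor, positive, negative: (batch, dim) features; M: (dim, dim) matrix.
    Returns the loss, a dict of gradients, and the mask of selected triplets.
    """
    dp = anchor - positive                      # anchor-positive differences
    dn = anchor - negative                      # anchor-negative differences
    d_pos = np.einsum("bi,ij,bj->b", dp, M, dp)
    d_neg = np.einsum("bi,ij,bj->b", dn, M, dn)
    per_triplet = np.maximum(0.0, margin + d_pos - d_neg)

    # Assumed selection rule: only margin-violating triplets carry gradient.
    active = per_triplet > 0.0
    w = active.astype(float)[:, None] / len(per_triplet)

    S = M + M.T
    grads = {
        # Weighted sums of outer products dp dp^T and dn dn^T.
        "M": (w * dp).T @ dp - (w * dn).T @ dn,
        "anchor": w * ((dp - dn) @ S),
        "positive": -w * (dp @ S),
        "negative": w * (dn @ S),
    }
    return per_triplet.mean(), grads, active
```

A plain gradient step on the matrix would then be `M -= lr * grads["M"]`, with `lr` a hypothetical learning rate, while the feature gradients would be passed back to the branches of the network.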

The proposed method has been shown to outperform both the use of the Euclidean distance and the estimation of the Mahalanobis matrix in the forward propagation of the model [22, 23]. Those earlier estimation methods require an analysis of the data structure over several learning iterations to observe the effects of the Mahalanobis matrix on the learnt features and, consequently, to correct the values of the elements of that matrix. In contrast, this article presents a Mahalanobis learning method that directly updates the elements of the matrix according to the training objective, improving the feature learning, the distance learning and their overall performance.

Through the proposed method, a unified Re-Id neural framework has been obtained, which jointly and automatically finds the most salient features and the optimal Mahalanobis metric distance to perform Single-Shot Person Re-Identification.

Footnotes

The layers to learn the Mahalanobis matrix are publicly available under http://github.com/magomezs/MahalanobisMatrixLearning.

The triplet loss function with Mahalanobis distance learning has been formulated in a new Caffe-python layer called TripletMahaLoss, which is publicly available under http://github.com/magomezs/MahalanobisMatrixLearning.

The Triplets Permutation tool is publicly available under http://github.com/magomezs/dataset_factory/tree/master/data_factory_from_reid.

The triplet selection mechanism has been coded in a new Caffe-python layer called TripletSelection, which is publicly available under http://github.com/magomezs/MahalanobisMatrixLearning.

Acknowledgments

Research supported by the Spanish Government through the CICYT projects (PID2019-104793RB-C31 and RTI2018-096036-BC21), Universidad Carlos III of Madrid through (PEAVAUTO-CMUC3M) and the Comunidad de Madrid through SEGVAUTO-4.0-CM (P2018/EMT-4362). We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.

References

1. Acharya, Hagiwara, Tan, Adeli, Subha DP. Automated EEG-based screening of depression using deep convolutional neural network. Computer Methods and Programs in Biomedicine. 2018; 161: 103-113.

2. Ahmadlou, Adeli. Enhanced probabilistic neural network with local decision circles: A robust classifier. Integrated Computer-Aided Engineering. 2010; 17(3): 197-210.

3. Ahmed, Jones, Marks TK. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3908-3916.

4. Arabi, Haghighat, Sharma. A deep-learning-based computer vision solution for construction vehicle detection. Computer-Aided Civil and Infrastructure Engineering. 2020; 35(7): 753-767.

5. Avraham, Gurvich, Lindenbaum, Markovitch. Learning implicit transfer for person re-identification. In European Conference on Computer Vision. Springer, 2012, pp. 381-390.

6. Bazzani, Cristani, Murino. Symmetry-driven accumulation of local features for human characterization and re-identification. Computer Vision and Image Understanding. 2013; 117(2): 130-144.

7. Benito-Picazo, Domínguez, Palomo, López-Rubio. Deep learning-based video surveillance system managed by low cost hardware and panoramic cameras. Integrated Computer-Aided Engineering. 2020; 27(4).

8. Bonet, Caraffini, Peña, Puerta, Gongora. Oil Palm Detection via Deep Transfer Learning. 2020 IEEE Congress on Evolutionary Computation (CEC), Glasgow, United Kingdom, 2020, pp. 1-8. doi: 10.1109/CEC48606.2020.9185838.

9. Bromley, Guyon, LeCun, Säckinger, Shah. Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems. 1994, pp. 737-744.

10. Chen, Jiang, Wang. Attribute-aware Identity-hard Triplet Loss for Video-based Person Re-identification. 2020, arXiv preprint arXiv:2006.07597.

11. Cheng, Gong, Zhou, Wang, Zheng. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1335-1344.

12. Simon C-D, Caraffini, Kuhn, Gongora, Florez-Lozano, Parra. Shallow buried improvised explosive device detection via convolutional neural networks. Integrated Computer-Aided Engineering. 2020, pp. 1-14.

13. Corvee, Bremond, Thonnat, et al. Person re-identification using haar-based and dcd-based signature. In Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance. 2010, pp. 1-8.

14. Corvee, Bremond, Thonnat, et al. Person re-identification using spatial covariance regions of human body parts. In Seventh IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). 2010, pp. 435-440.

15. Davis, Kulis, Jain, Sra, Dhillon IS. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning. ACM, 2007, pp. 209-216.

16. Dikmen, Akbas, Huang, Ahuja. Pedestrian recognition with a learned metric. In Asian Conference on Computer Vision. Springer, 2010, pp. 501-512.

17. Ding, Lin, Wang, Chao. Deep feature learning with relative distance comparison for person re-identification. Pattern Recognition. 2015; 48(10): 2993-3003.

18. Duchi, Hazan, Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research. 2011; 12: 2121-2159.

19. Farenzena, Bazzani, Perina, Cristani, Murino. Person re-identification by symmetry-driven accumulation of local features. In IEEE Conference on Computer Vision and Pattern Recognition. 2010, pp. 2360-2367.

20. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936; 7(2): 179-188.

21. Gómez-Silva, Armingol, de la Escalera. Deep part features learning by a normalised double-margin-based contrastive loss function for person re-identification. In Proceedings of the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2017) (6: VISAPP). 2017, pp. 277-285.

22. Gómez-Silva, Armingol, de la Escalera. Deep parts similarity learning for person re-identification. In Proceedings of the 13th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISIGRAPP 2018). 2018, pp. 419-428.

23. Gómez-Silva, Armingol, de la Escalera. Balancing people re-identification data for deep parts similarity learning. Journal of Imaging Science and Technology. 2019; 63(2): 20401-14.

24. Gómez-Silva, Armingol, de la Escalera. Triplet permutation method for deep learning of single-shot person re-identification. 9th International Conference on Imaging for Crime Detection and Prevention (ICDP 2019), IET. 2019, pp. 10-56.

25. Gómez-Silva, Izquierdo, de la Escalera, Armingol JM. Transferring learning from multi-person tracking to person re-identification. Integrated Computer-Aided Engineering. 2019; 26(4): 329-344.

26. Gong, Cristani, Loy, Hospedales TM. The re-identification challenge. In Person Re-identification. 2014, pp. 1-20.

27. Gray, Tao. Viewpoint invariant pedestrian recognition with an ensemble of localized features. In European Conference on Computer Vision (ECCV). Springer, 2008, pp. 262-275.

28. Guillaumin, Verbeek, Schmid. Is that you? Metric learning approaches for face identification. IEEE 12th International Conference on Computer Vision. 2009, pp. 498-505.

29. Hadsell, Chopra, LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06). 2006; 2: 1735-1742.

30. Hamreras, Boucheham, Molina-Cabello, Benitez-Rochel, Lopez-Rubio. Content-based image retrieval by ensembles of deep learning object classifiers. Integrated Computer-Aided Engineering. 2020; 27(3): 317-331.

31. Hirzer, Beleznai, Roth, Bischof. Person re-identification by descriptive and discriminative classification. In Scandinavian Conference on Image Analysis. Springer, 2011, pp. 91-102.

32. Hirzer, Roth, Bischof. Person re-identification by efficient impostor-based metric learning. In IEEE Ninth International Conference on Advanced Video and Signal-Based Surveillance (AVSS). 2012, pp. 203-208.

33. Hirzer, Roth, Köstinger, Bischof. Relaxed pairwise learned metric for person reidentification. Computer Vision-ECCV. 2012, pp. 780-793.

34. Zhou, Lou, Tan, Maybank. Principal axis-based correspondence between multiple cameras for people tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2006; 28(4): 663-671.

35. Kostinger, Hirzer, Wohlhart, Roth, Bischof. Large scale metric learning from equivalence constraints. In Computer Vision and Pattern Recognition (CVPR). 2012, pp. 2288-2295.

36. Krizhevsky, Sutskever, Hinton GE. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 2012, pp. 1097-1105.

37. Leng, Tian. A survey of open-world person re-identification. IEEE Transactions on Circuits and Systems for Video Technology. 2019; 30(4): 1092-1108.

38. Wang. Locally aligned feature transforms across views. In Computer Vision and Pattern Recognition (CVPR). 2013, pp. 3594-3601.

39. Mukunoki, Minoh. Common-near-neighbor analysis for person re-identification. In 2012 19th IEEE International Conference on Image Processing. 2012, pp. 1621-1624.

40. Zhao, Xiao, Wang. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 152-159.

41. Zhang, Zhu, Jiang, Huang. State-aware re-identification feature for multi-target multi-camera tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 2019.

42. Zhu, Gong. Harmonious attention network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 2285-2294.

43. Lian, Lai, Suen, Chen. Matching of tracked pedestrians across disjoint camera views using ci-dlbp. IEEE Transactions on Circuits and Systems for Video Technology. 2012; 22(7): 1087-1099.

44. Lisanti, Masi, Bagdanov, Del Bimbo. Person reidentification by iterative re-weighted sparse ranking. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2015; 37(8): 1629-1642.

45. Liu, Anguelov, Erhan, Szegedy, Reed, Berg AC. Ssd: Single shot multibox detector. In European Conference on Computer Vision. Springer, 2016, pp. 21-37.

46. Liu, Feng, Jiang, Yan. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing. 2017; 26(7): 3492-3506.

47. Liu, Gong, Loy, Lin. Evaluating feature importance for re-identification. In Person Re-identification. Springer, 2014, pp. 203-228.

48. Liu, Gong, Loy, Lin. Person re-identification: What features are important? In European Conference on Computer Vision. Springer, 2012, pp. 391-401.

49. Loy, Xiang, Gong. Time-delayed correlation analysis for multi-camera activity understanding. International Journal of Computer Vision. 2010; 90(1): 106-129.

50. Luo, Zhou, Cao. Combining Deep Features and Activity Context to Improve Recognition of Activities of Workers in Groups. Computer-Aided Civil and Infrastructure Engineering. 2020; 35(9): 965-978.

51. Jurie. Discriminative image descriptors for person re-identification. In Person Re-identification. Springer, 2014, pp. 23-42.

52. Matsukawa, Okabe, Suzuki, Sato. Hierarchical gaussian descriptor for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1363-1372.

53. Mignon, Jurie. Pcca: A new approach for distance learning from sparse pairwise constraints. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2012, pp. 2666-2672.

54. Molina-Cabello, Luque-Baena, López-Rubio, Thurnhofer-Hemsi. Vehicle type detection by ensembles of convolutional neural networks operating on super resolved images. Integrated Computer-Aided Engineering. 2018; 25(4): 321-333.

55. Moon, Phillips PJ. Computational and performance aspects of pca-based face-recognition algorithms. Perception. 2001; 30(3): 303-321.

56. Munaro, Fossati, Basso, Menegatti, Van Gool. One-shot person re-identification with a consumer depth camera. In Person Re-Identification. Springer, 2014, pp. 161-181.

57. Nabian, Meidani. Deep Learning for Accelerated Reliability Analysis of Transportation Networks. Computer-Aided Civil and Infrastructure Engineering. 2018; 33(6): 459-480.

58. Zhang, Noori MN. Deep Learning for Data Anomaly Detection and Data Compression of a Long-span Suspension Bridge. Computer-Aided Civil and Infrastructure Engineering. 2020; 35(7): 685-700.

59. Oreifej, Mehran, Shah. Human identity recognition in aerial images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2010, pp. 709-716.

60. Pereira, Piteri, Souza, Papa, Adeli. FEMa: A Finite Element Machine for Fast Learning. Neural Computing and Applications. 2020; 32(10): 6393-6404. doi: 10.1007/s00521-019-04146-4.

61. Porikli. Inter-camera color calibration by correlation model function. In Proceedings 2003 International Conference on Image Processing (ICIP). 2003; 2(II): 133.

62. Prosser, Zheng, Gong, Xiang. Person re-identification by support vector ranking. In British Machine Vision Conference. 2010; 2(6).

63. Rafiei, Adeli. A Novel Machine Learning Based Algorithm to Detect Damage in Highrise Building Structures. The Structural Design of Tall and Special Buildings. 2017; 26(18).

64. Rafiei, Adeli. A Novel Unsupervised Deep Learning Model For Global and Local Health Condition Assessment Of Structures. Engineering Structures. 2018; 156(1): 598-607.

65. Rafiei, Adeli. A New Neural Dynamic Classification Algorithm. IEEE Transactions on Neural Networks and Learning Systems. 2017; 28(12): 3074-3083. doi: 10.1109/TNNLS.2017.2682102.

66. Rafiei, Khushefati, Demirboga, Adeli. Supervised Deep Restricted Boltzmann Machine for Estimation of Concrete Compressive Strength. ACI Materials Journal. 2017; 114(2): 237-244.

67. Ristani, Solera, Zou, Cucchiara, Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In European Conference on Computer Vision (ECCV). 2016, pp. 17-35.

68. Rokibul Alam, Siddique, Adeli. A Dynamic Ensemble Learning Algorithm for Neural Networks. Neural Computing and Applications. 2020; 32(10): 6393-6404. doi: 10.1007/s00521-019-04359-7.

69. Roth, Hirzer, Köstinger, Beleznai, Bischof. Mahalanobis distance learning for person re-identification. In Person Re-Identification. Springer, 2014, pp. 247-267.

70. Sánchez, Perronnin, Mensink, Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision. 2013; 105(3): 222-245.

71. Schroff, Kalenichenko, Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 815-823.

72. Shen, Xiong, Xue, Bian. A Convolutional Neural Network-Based Pedestrian Counting Model for Various Crowded Scenes. Computer-Aided Civil and Infrastructure Engineering. 2019; 34(10).

73. Zhang, Liu. Compact triplet loss for person re-identification in camera sensor networks. Ad Hoc Networks. 2019; 95: 101984.

74. Simoes, Lau, Reis. Exploring Communication Protocols and Centralized Critics in Multi-Agent Deep Learning. Integrated Computer-Aided Engineering. 2020; 27(4).

75. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR. 2015.

76. Song, Huang, Ouyang, Wang. Mask-guided contrastive attention model for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1179-1188.

77. Vera-Olmos, Pardo, Melero, Malpica. DeepEye: Deep Convolutional Network for Pupil Detection in Real Environments. Integrated Computer-Aided Engineering. 2019; 26(1): 85-95.

78. Vezzani, Baltieri, Cucchiara. People reidentification in surveillance and forensics: A survey. ACM Computing Surveys (CSUR). 2013; 46(2): 29.

79. Wang, Bai. Regional parallel structure based cnn for thermal infrared face identification. Integrated Computer-Aided Engineering. 2018; 25(3): 247-260.

80. Wang, Doretto, Sebastian, Rittscher. Shape and appearance context modeling. In IEEE International Conference on Computer Vision. 2007, pp. 1-8.

81. Wang, Song, Leung, Rosenberg, Wang, Philbin, Chen. Learning fine-grained image similarity with deep ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 1386-1393.

82. Wang, Zhao, Zou, Zhao. Autonomous Damage Segmentation and Measurement of Glazed Tiles in Historic Buildings via Deep Learning. Computer-Aided Civil and Infrastructure Engineering. 2020; 35(3): 277-291.

83. Weinberger, Saul LK. Fast solvers and efficient implementations for distance metric learning. In Proceedings of the 25th International Conference on Machine Learning. ACM, 2008, pp. 1160-1167.

84. Weinberger, Saul LK. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research. 2009; 10: 207-244.

85. Lin, Dong, Yan, Ouyang, Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR. 2018.

86. Yang, Cappelle, Ruichek, El Bagdouri. Multi-object Tracking with Discriminant Correlation Filter Based Deep Learning Tracker. Integrated Computer-Aided Engineering. 2019; 26(3): 273-284.

87. Lei, Liao, et al. Deep metric learning for person re-identification. In 22nd International Conference on Pattern Recognition (ICPR). IEEE, 2014, pp. 34-39.

88. Zhang. Gabor-lbp based region covariance descriptor for person re-identification. In Sixth International Conference on Image and Graphics (ICIG). 2011, pp. 368-371.

89. Zhang, Irie, Ruan. Sample-specific svm learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

90. Zhang, Wei, Zhang, Xia. Person re-identification with triplet focal loss. IEEE Access. 2018; 6: 78092-78099.

91. Zhao, Ouyang, Wang. Unsupervised salience learning for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3586-3593.

92. Zhao, Ouyang, Wang. Learning mid-level filters for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 144-151.

93. Zheng, Gong, Xiang. Associating groups of people. In Proceedings of the British Machine Vision Conference. 2009; 2(6): 231-2311.

94. Zheng, Gong, Xiang. Person reidentification by probabilistic relative distance comparison. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2011, pp. 649-656.

95. Zheng, Gong, Xiang. Reidentification by relative distance comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2013; 35(3): 653-668.

96. Zheng, Yang, Hauptmann AG. Person reidentification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.

97. Zheng, Bie, Sun, Wang, Tian. Mars: A video benchmark for large-scale person re-identification. In ECCV. 2016.

98. Zheng, Shen, Tian, Wang, Tian. Scalable person re-identification: a benchmark. In IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1116-1124.

99. Zhuang, Lin, Shen, Reid. Fast training of triplet-based deep binary embedding networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 5955-5964.


Table 1
Structure of the used VGG11-based model. The input and output sizes are described in #rows $\times$ #cols $\times$ #filters; the kernel, in #rows $\times$ #cols $\times$ #filters, stride, or #outputs for FC layers

Table 2
CMC scores (in [%]) for models using Euclidean and Mahalanobis distance metric on PRID2011 dataset

Table 4
Comparison of CMC rates (in [%]) of Re-Id methods on PRID2011 dataset, ‘–’ indicates no result was reported