Multi-modal deep distance metric learning

Abstract

In many real-world applications, data contain heterogeneous input modalities (e.g., web pages include images, text, etc.). Moreover, data such as images are usually described using different views (i.e. different sets of features). Learning a distance metric or similarity measure that draws on all input modalities or views is essential for many tasks such as content-based retrieval. In these cases, similar and dissimilar pairs of data can be used to find a representation of the data in which similarity and dissimilarity constraints are better satisfied. In this paper, we incorporate supervision in the form of pairwise similarity and/or dissimilarity constraints into multi-modal deep networks to combine different modalities into a shared latent space. Using properties of multi-modal data, we design multi-modal deep networks and propose a pre-training algorithm for these networks. The proposed network can learn intra- and inter-modal high-order statistics from raw features, and we control its high flexibility via an efficient multi-stage pre-training phase that reflects the properties of multi-modal data. Experimental results show that the proposed method outperforms recent methods on image retrieval tasks.

Keywords

Multi-modal data, metric learning, deep networks, similar-dissimilar pairs, pre-training

1. Introduction

A proper distance metric (or similarity measure) plays an important role in many learning and retrieval tasks. Until now, many methods have been proposed for metric learning [1, 2, 3, 4, 5]. In these methods, it is usually assumed that supervisory information in the form of relative distance constraints or similar/dissimilar pairs is available. Some of these methods learn linear [1, 2] or nonlinear [4] transformations on the feature space to find a new representation space in which distance (or similarity) constraints are better satisfied. However, those methods that learn nonlinear transformations (or implicitly kernel matrices) are usually either limited to learning a restricted form of non-linear transformations or very time consuming when they are flexible (i.e. they learn the whole kernel matrix). Moreover, the flexible methods that learn the whole kernel matrix are transductive and cannot be used to find similarities for new data. Recently, some deep metric learning methods [6, 7, 8, 9] have been proposed that can learn a non-linear transformation to achieve a new representation space in which the distance constraints are better satisfied. However, these deep models of metric learning have been designed for input data containing one modality. Therefore, they have not used properties of multi-modal data for designing the architecture and training of the deep networks.

On the other hand, real-world data usually contains different modalities such as text, image, and video. Video and corresponding audio [10] and annotated images [11, 12] are examples of multi-modal data. In recent years, several multi-modal methods have been proposed to incorporate heterogeneous modalities in the classification and retrieval tasks [11, 12, 13, 14]. Besides, there are similar challenges in multi-view descriptions of data. Any view is indeed a description of data obtained by a particular feature extraction method [15, 16]. Because of the somewhat similar nature of multi-modal and multi-view data, they pose similar challenges and even in some studies identical models have been introduced for both of them [17, 18].

Figure 1.

(a) Unfolded MMD [10] (b) Unfolded MMD-DML.

Deep networks are flexible and effective models that have been used in a wide range of applications [19]. They can model distributions of different modalities and make a connection between them by a common or shared space that is obtained using a layer above modality specific networks. Multi-modal deep models have recently attracted much attention in many applications such as cross-modal [11, 20] and multi-modal retrieval tasks [14, 21]. A popular deep network architecture for multi-modal data is shown in Fig. 1a. This architecture, which includes modality specific networks and a single layer on top of these networks (to find a shared representation), has been used in many multi-modal deep learning models [10, 14]. However, this architecture has not yet been used for metric learning.

In this paper, we propose a deep metric learning method for multi-modal data (to the best of our knowledge, our method is the first deep metric learning method for multi-modal data). The proposed Multi-modal Deep Distance Metric Learning (MMD-DML) framework (see Fig. 1b) can include some layers for non-linear metric learning on the top of the single layer presenting the shared representation of modalities. We design an effective method for unsupervised pre-training of this model using the properties of multi-modal data. Since we intend to use the multi-modal deep network for metric learning, an optimization problem is presented that considers supervisory information in the form of similar/dissimilar pairs. Stochastic gradient descent is employed for training MMD-DML with batches of similar/dissimilar pairs.

In this work, our goal is learning a distance metric that incorporates multiple modalities. Retrieval is one of the tasks in which the distance metric has a critical role and also some data views or modalities usually exist in retrieval applications. For example, in Content-Based Image Retrieval (CBIR), different views of an image, which are obtained through various feature extraction techniques, can act as different modalities. Experimental results show the effectiveness of our pre-training method in such retrieval tasks.

The rest of this paper is organized as follows. Some related works are reviewed in Section 2. We first present some definitions and preliminaries of our proposed model in Section 3 and then the proposed method is described with details in Section 4. Experimental settings and results of our method for CBIR are presented in Section 5. Finally, we conclude our work in Section 6.

2. Related works

The existing methods of representation learning for multi-modal data can be categorized as below:

Multiple Kernel Learning (MKL)

Shallow Probabilistic Models

Deep Models

These approaches are widely used in unsupervised [10, 11, 20, 22] and supervised [4, 12, 21, 23] multi-modal retrieval.

2.1 Multiple kernel learning

MKL methods can be used to learn a kernel that is a combination of a set of fixed basis kernels. Although these methods were first applied to single modal data, they can also be utilized for multi-modal data where different kernels are considered for different modalities [24, 25].

Weighted kernel combination is one of the earliest MKL methods [24, 25, 26, 27], in which the kernel space is equivalent to a weighted concatenation of kernel Hilbert spaces. Lanckriet et al. [24] employed weighted kernel combination in the Support Vector Machine (SVM) and learned optimal kernel weights and SVM parameters simultaneously. Even though most MKL methods have been designed for classification purposes and use labels as supervisory information, they can be adapted to use supervisory information in the form of pairwise distance constraints [4] or triplet distance constraints [25, 27]. Recently, Chen et al. [25, 27] proposed methods for learning the similarity of mobile applications using the weighted kernel combination approach.

Lin et al. [28] introduced Weighted Multiple Kernel Embedding (WMKE) method for learning a linear transformation on spaces resulted from weighted kernels combination. Although this method can model correlation between modalities, simple scaling and selecting kernels are the only degrees of freedom considered for integrating modalities. In Multiple Kernel Partial Order Embedding (MKPOE) method [4], distinct linear transformations are learned on kernel spaces simultaneously. Unlike WMKE, this method cannot directly model correlation between diverse modalities. However, it can transform modality spaces efficiently.

The learning stages of MKL approaches usually include optimization of a very large positive semi-definite (PSD) matrix. Therefore, these methods do not scale to massive data in real-world applications such as multi-media retrieval tasks. Xia et al. [18] extended the MKPOE method to an online mode by converting the constrained optimization problem to its unconstrained equivalent and then projecting the parameters onto the constraint space. Xia et al. [23] proposed a similar method called Online Multiple Kernel Similarity Learning (OMKS) for CBIR applications. They used different features extracted from an image as modalities. To increase performance, several different kernels are considered for each modality and these kernels are combined to find the similarity measure. They proposed an efficient two-stage optimization technique to find kernel space transformations and optimal combination weights. Wu et al. [29] proposed a similar method called OM-DML which simultaneously learns a distinct linear transformation on each modality and also optimal weights for combining modalities. By directly learning a linear transformation instead of learning a Mahalanobis metric, OM-DML eliminates the time-consuming PSD projection step required in the OMKS algorithm. This method is also able to seek low-rank solutions by setting the number of dimensions of the new spaces to be less than the number of input dimensions. In this paper, we propose the MMD-DML method, which explicitly learns a non-linear transform and thus has the advantage of kernel-based approaches while not needing the PSD constraints (similar to methods like OM-DML). As opposed to the MKL methods that fix the base kernels, our method can learn a flexible non-linear transform on each modality. Furthermore, unlike the MKPOE, OMKS, and OM-DML methods, our method has the ability to model inter-modal correlations using the joint multi-layer network on top of the modality specific networks.

2.2 Probabilistic shallow and deep network models for multi-modal data

Shallow and deep networks are capable of providing a powerful framework using nonlinear activation functions or diverse conditional probability distributions and have been used extensively in various areas including multi-modal tasks [10, 12, 14, 15, 16, 22].

Harmonium [30] is a shallow probabilistic model containing a layer of latent variables as a hidden representation of data. Dual-Wing Harmonium (DWH) [22] is an extension from exponential Harmonium [31] which is applicable to data with two modalities in the visible layer. In this model, image and annotations (along with image) are embedded into a shared latent space. Assumptions about conditional probability distribution can be leveraged as prior information about data. Xie and Xing [12] extended DWH in their Multi-Modal Distance Metric Learning (MM-DML) method for distance metric learning through minimizing the cost function that has been defined according to similar and dissimilar pairs. Chen et al. proposed supervised extensions of DWH for large margin predictive subspace learning [15, 16]. Supervisory information in the form of labeled data is utilized in these methods.

Several models of multi-modal deep networks have been proposed in recent years. Most of them are unsupervised methods that model data distribution [10, 14]. Some of the existing methods try to find a latent space that can be constructed by each modality [11, 13]. These methods are useful in cross-modal tasks. For example, in multi-modal retrieval based on Stacked Auto Encoders (SAEs) [11], an SAE is trained for each of the two modalities of image-tag bimodal data. After that, these methods try to minimize Euclidean distance between the latent representation of the images and that of their associated tags. Feng et al. [13] proposed a similar method based on Restricted Boltzmann Machine (RBM) to map image and text into a low-dimensional common space for cross-modal retrieval task. They used correlation-based loss function to maintain correspondence between distinct deep RBMs of modalities. A deep model using Canonical Correlation Analysis (CCA) [32] to find a shared latent space has also been introduced in [33]. In this model, each modality is transformed through a separate deep network to a space where the inter-modal correlation of the transformed modalities is maximized. Ngiam et al. [10] proposed an effective Multi-Modal Deep Network (MMD) model that learns a shared representation from different modalities in an unsupervised manner. The MMD model is pre-trained in a greedy layer-wise manner and then fine-tuned for multi-modal or cross-modal tasks by back-propagation. Srivastava et al. [14] proposed Multi-Modal Deep Boltzmann Machines (MMDBM) as an unsupervised method that assigns a deep network to each modality and uses a layer on the top of these networks to find a shared latent space. In this method, for each layer, an RBM is used and the model is trained in a layer-wise manner using contrastive divergence. This method is similar to the MMD method [10] but uses DBM instead of SAE.

3. Preliminaries

In this section, we present some definitions and also some basic ideas about metric learning that have been presented in the previous works.

3.1 Definitions

In this part, we define the terms used in the following sections.

DEFINITION 1: Multi-modal space

A multi-modal vector space is ${\mathbb{D}}_{M}={\mathbb{R}}^{d_{1}}\times\ldots\times{\mathbb{R}}^{d_{M}}$ for which any ${\bm{x}}=({\bm{x}}_{1},\ldots,{\bm{x}}_{M})\in{\mathbb{D}}_{M}$ has $M$ modalities such that ${\bm{x}}_{1}\in{\mathbb{R}}^{d_{1}}$ , $\ldots$ , ${\bm{x}}_{M}\in{\mathbb{R}}^{d_{M}}$ .

DEFINITION 2: Multi-modal retrieval

Given a query object $q\in{\mathbb{D}}_{M}$ and a target domain $D_{t}\subset{\mathbb{D}}_{M}$ with $T$ objects, we intend to find an order $O=(o_{1},\ldots o_{T})$ of $D_{t}$ such that $\forall i<j$ , $\textit{dist}\left(q,o_{i}\right)<\textit{dist}\left(q,o_{j}\right)$ .
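As a minimal sketch of Definition 2, the ordering $O$ can be obtained by sorting the target objects by their distance to the query. The names `rank_targets` and `concat_sq_euclidean` are hypothetical, and the squared Euclidean distance over all modalities is only a placeholder for $\textit{dist}$, which the definition leaves unspecified.

```python
from typing import Callable, List, Sequence, Tuple

# A multi-modal object: one feature vector per modality (Definition 1).
MultiModal = Tuple[Sequence[float], ...]


def concat_sq_euclidean(a: MultiModal, b: MultiModal) -> float:
    """Assumed placeholder distance: squared Euclidean over all modalities."""
    return sum((u - v) ** 2 for xa, xb in zip(a, b) for u, v in zip(xa, xb))


def rank_targets(
    query: MultiModal,
    targets: List[MultiModal],
    dist: Callable[[MultiModal, MultiModal], float] = concat_sq_euclidean,
) -> List[int]:
    """Return indices of `targets` ordered by increasing distance to `query`."""
    return sorted(range(len(targets)), key=lambda i: dist(query, targets[i]))


# Toy example with M = 2 modalities (dimensions 2 and 1).
query = ((0.0, 0.0), (1.0,))
targets = [((3.0, 4.0), (1.0,)), ((0.0, 1.0), (1.0,)), ((1.0, 1.0), (0.0,))]
order = rank_targets(query, targets)
```

In the learned setting, `dist` would be replaced by a distance computed in the shared representation space.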

DEFINITION 3: Similar/dissimilar pairs

Similar and dissimilar pair sets are defined as:

$\displaystyle{\cal S}=\left\{\left({\bm{x}},{\bm{x}^{\prime}}\right)\right\}\subset{\mathbb{D}}_{M}\times{\mathbb{D}}_{M},\qquad{\cal D}=\left\{\left({\bm{x}},{\bm{x}^{\prime}}\right)\right\}\subset{\mathbb{D}}_{M}\times{\mathbb{D}}_{M}.$ (1)

For each $\left({\bm{x}},{\bm{x}^{\prime}}\right)\in{\cal S}$ , ${\bm{x}}$ and ${\bm{x}^{\prime}}$ are regarded as similar pairs in the training stage and pairs in the set ${\cal D}$ are regarded as dissimilar ones.

3.2 Metric learning

In this section, we first present some important and popular optimization problems for metric learning. Then, the most popular multi-modal metric learning method is introduced. Xing et al. proposed a distance metric learning method that minimizes the distance between similar pairs while separating dissimilar pairs by a margin [1]. Hence, the optimization problem does not consider any loss for dissimilar pairs that are far enough from each other:

$\displaystyle\mathop{\text{arg min}}\limits_{{\bm{A}}}\sum_{({\bm{x}},{\bm{y}})\in{\cal S}}||{\bm{x}}-{\bm{y}}||_{A}^{2}\quad\text{s.t.}\quad\forall({\bm{x}},{\bm{y}})\in{\cal D},\ ||{\bm{x}}-{\bm{y}}||^{2}_{A}\geqslant 1,\quad{\bm{A}}\succeq 0,$ (2)

where $||{\bm{x}}-{\bm{y}}||^{2}_{A}=({\bm{x}}-{\bm{y}})^{T}{\bm{A}}({\bm{x}}-{\bm{y}})=d_{A}({\bm{x}},{\bm{y}})$ denotes the (squared) Mahalanobis distance between data points ${\bm{x}}$ and ${\bm{y}}$ . Davis et al. [3] proposed an optimization problem that imposes a margin on similar pairs as well as dissimilar ones. Indeed, similar pairs that are already sufficiently close to each other do not contribute to the loss:

$\displaystyle\mathop{\text{arg min}}\limits_{{\bm{A}}}r({\bm{A}})=tr({\bm{A}})-\log\det({\bm{A}})\quad\text{s.t.}\quad d_{A}({\bm{x}},{\bm{y}})\leqslant u,\ ({\bm{x}},{\bm{y}})\in{\cal S},\quad d_{A}({\bm{x}},{\bm{y}})\geqslant\ell,\ ({\bm{x}},{\bm{y}})\in{\cal D}.$ (3)

Here, $r({\bm{A}})$ is a special case of LogDet divergence which has some properties, such as the scale and translation invariance, that are suitable for metric learning [34].
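To make the constraints of Eq. (3) concrete, the following sketch computes $d_{A}$ and checks the upper bound $u$ on similar pairs and the lower bound $\ell$ on dissimilar pairs. It does not solve the optimization problem; the function names, the toy matrices, and the bound values are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]


def mahalanobis_sq(x: Sequence[float], y: Sequence[float],
                   A: List[List[float]]) -> float:
    """d_A(x, y) = (x - y)^T A (x - y) for a PSD matrix A."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * A[i][j] * diff[j] for i in range(n) for j in range(n))


def constraints_satisfied(A: List[List[float]], similar: List[Pair],
                          dissimilar: List[Pair], u: float, ell: float) -> bool:
    """True if every similar pair has d_A <= u and every dissimilar pair d_A >= ell."""
    ok_s = all(mahalanobis_sq(x, y, A) <= u for x, y in similar)
    ok_d = all(mahalanobis_sq(x, y, A) >= ell for x, y in dissimilar)
    return ok_s and ok_d


similar = [((0.0, 0.0), (0.0, 1.0))]
dissimilar = [((0.0, 0.0), (1.0, 0.0))]
identity = [[1.0, 0.0], [0.0, 1.0]]
# A hand-picked PSD matrix that stretches axis 0 and shrinks axis 1.
A = [[2.0, 0.0], [0.0, 0.5]]

euclidean_ok = constraints_satisfied(identity, similar, dissimilar, u=1.0, ell=1.5)
rescaled_ok = constraints_satisfied(A, similar, dissimilar, u=1.0, ell=1.5)
```

With the identity matrix the dissimilar pair is too close, whereas the rescaled metric satisfies both bounds on this toy data.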

Xie and Xing proposed the MM-DML method [12] with the following optimization problem based on the dual-wing harmonium:

$\displaystyle\mathop{\text{arg min}}\limits_{\Theta}\frac{1}{|{\cal X}|}{\cal L}({\cal X};\Theta)+\lambda\frac{1}{|{\cal S}|}\sum_{({\bm{x}},{\bm{y}})\in{\cal S}}||t({\bm{x}})-t({\bm{y}})||^{2}\quad\text{s.t.}\quad\forall({\bm{x}},{\bm{y}})\in{\cal D},\ ||t({\bm{x}})-t({\bm{y}})||^{2}\geqslant 1,$ (4)

where $\Theta$ denotes the model parameters, ${\cal L}({\cal X};\Theta)$ is the data likelihood term in DWH, $\lambda$ is a regularization parameter, and $t({\bm{x}})$ is the latent representation of ${\bm{x}}$ . The MM-DML optimization problem in Eq. (4) is an extension of the one introduced in Eq. (2). By softening the constraints, the optimization problem in Eq. (4) can be reformulated as:

$\displaystyle\mathop{\text{arg min}}\limits_{\Theta}\frac{1}{|{\cal X}|}{\cal L}({\cal X};\Theta)+\lambda_{1}\frac{1}{|{\cal S}|}\sum_{({\bm{x}},{\bm{y}})\in{\cal S}}||t({\bm{x}})-t({\bm{y}})||^{2}+\lambda_{2}\frac{1}{|{\cal D}|}\sum_{({\bm{x}},{\bm{y}})\in{\cal D}}\max(0,1-||t({\bm{x}})-t({\bm{y}})||^{2}),$ (5)

where $\lambda_{1}$ and $\lambda_{2}$ are regularization parameters.

The MM-DML method uses stochastic gradient descent to directly optimize the feature transformation instead of learning the Mahalanobis metric ( ${\bm{A}}$ ) used by Xing et al. [1]. Although the optimization problem of the MM-DML method is not convex and, without an intelligent parameter initialization strategy, MM-DML is prone to falling into a poor local minimum, it can provide some benefits. For example, a low-rank solution that is desirable in the context of Mahalanobis metric learning [3] can be achieved by explicitly learning a feature transformation that provides dimensionality reduction.

In general, learning a non-linear transformation has some advantages over learning a Mahalanobis metric or a kernel matrix. Deep networks provide a powerful framework to learn flexible non-linear transformations. Nonetheless, all of the existing deep metric learning methods [6, 7, 8, 9] are designed for input data containing only one modality.

4. Proposed method

In this section, we propose the MMD-DML method that uses the deep learning approach to find a flexible non-linear transformation leading to an effective distance metric for multi-modal data. We use a multi-stage pre-training phase utilizing unlabeled multi-modal data. Then, we impose margin constraints for both similar and dissimilar pairs via an optimization problem inspired by the ITML method [3]. Batch-mode stochastic gradient descent is used to solve the proposed optimization problem over similar/dissimilar pairs.

4.1 Optimization problem

Figure 1b shows the unfolded structure of the proposed architecture in our MMD-DML method. This model has a separate SAE with an arbitrary number of layers for each modality. Joint SAE (JSAE) takes the concatenation of the latent representations of the modalities as its input layer and provides a shared representation as the output.

The depth of the SAE for the $m$-th modality is denoted by $h_{m}$ and the depth of the JSAE by $h_{\textit{joint}}$ . Let ${\bm{x}}^{0}=({\bm{x}}_{1}^{0},\ldots,{\bm{x}}_{M}^{0})\in{\mathbb{D}}_{M}$ be an input. The representations produced by the encoder layers of the SAE for the $m$-th modality are denoted by ${\bm{x}}_{m}^{1},\ldots,{\bm{x}}_{m}^{h_{m}}$ (Fig. 1b), and ${\bm{x}}_{m}^{h_{m}+1},\ldots,{\bm{x}}_{m}^{2h_{m}}$ are the decoded representations obtained in the unfolded MMD-DML (Fig. 1b). The concatenation of the outputs of the modality specific SAEs, ${\bm{j}}^{0}=({\bm{x}}_{1}^{h_{1}},\ldots,{\bm{x}}_{M}^{h_{M}})$ , is the input of the JSAE. The representations produced by the encoder layers of the JSAE are denoted by ${\bm{j}}^{1},\ldots,{\bm{j}}^{h_{\textit{joint}}}$ , and ${\bm{j}}^{h_{\textit{joint}}+1},\ldots,{\bm{j}}^{2h_{\textit{joint}}}$ denote the outputs of its decoder layers (Fig. 1b). The mapping function corresponding to the whole MMD-DML model is denoted by $f_{M}\left({\bm{x}};\Theta\right)$ , where $\Theta$ denotes all model parameters and $f_{M}\left({\bm{x}}^{0};\Theta\right)={\bm{j}}^{h_{\textit{joint}}}$ . Let $\hat{\bm{x}}_{m}^{l}$ ( $l=0,\ldots,h_{m}-1$ ) be the reconstruction of ${\bm{x}}_{m}^{l}$ obtained by applying an encoder and the corresponding decoder (of an auto-encoder network with one hidden layer) to ${\bm{x}}_{m}^{l}$ , as shown in Fig. 2a–b. Similarly, let $\hat{\bm{j}}^{0},\ldots,\hat{\bm{j}}^{h_{\textit{joint}}-1}$ be the reconstructions obtained for ${\bm{j}}^{0},\ldots,{\bm{j}}^{h_{\textit{joint}}-1}$ . We also denote the reconstruction of the $m$-th modality using the corresponding unfolded SAE by $\hat{\bm{x}}_{m}$ ( $m=1,\ldots,M$ ), as shown in Fig. 2c. The notation used in our method is summarized in Table 1.

Table 1
The notation symbols used in our method

Symbol	Description
${\cal S}$ and ${\cal D}$	Sets of pairwise similarity and dissimilarity constraints
${\cal X}$	The set of available training data (containing only feature vectors and not labels)
${\cal L}_{r}^{m}(\cdot,\cdot)$	The loss function used for the reconstruction of the $m$-th modality as in Eq. (9) (square loss in our experiments)
${\bm{x}}^{0}=({\bm{x}}_{1}^{0},\ldots,{\bm{x}}_{M}^{0})$	The input containing $M$ modalities
$h_{m}$	The depth of the SAE considered for the $m$-th modality
${\bm{x}}_{m}^{l}$	The representation obtained in the $l$-th encoder layer of the SAE considered for the $m$-th modality
$\hat{\bm{x}}_{m}^{l}$	The reconstruction of ${\bm{x}}_{m}^{l}$ obtained by applying an auto-encoder network with one hidden layer to ${\bm{x}}_{m}^{l}$
$h_{\textit{joint}}$	The depth of the encoder of the JSAE, which is the shared SAE on top of the modality specific networks
${\bm{j}}^{0}=({\bm{x}}_{1}^{h_{1}},\ldots,{\bm{x}}_{M}^{h_{M}})$	The input of the JSAE, i.e. the concatenation of the outputs of the modality specific SAEs
${\bm{j}}^{l}$	The representation obtained by the $l$-th encoder layer of the JSAE
$\hat{\bm{j}}^{l}$	The reconstruction of ${\bm{j}}^{l}$ obtained by applying an auto-encoder network with one hidden layer to ${\bm{j}}^{l}$
$f_{M}\left(\cdot;\Theta\right)$	The mapping function corresponding to the whole MMD-DML model
$\hat{\bm{x}}_{m}$	The reconstruction of the $m$-th modality using the corresponding unfolded SAE shown in Fig. 2c
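The notation above can be illustrated with a small forward-pass sketch of the encoder side of the network: each modality passes through its own stack of encoder layers, the top-level outputs are concatenated into ${\bm{j}}^{0}$ , and the joint encoder produces ${\bm{j}}^{h_{\textit{joint}}}$ . The tanh activation, the layer sizes, and the helper `const_layer` are assumptions made only for this example; the text does not fix them.

```python
import math
from typing import List, Sequence, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def dense(x: Sequence[float], layer: Layer) -> List[float]:
    """One encoder layer, tanh(W x + b); tanh is an assumed activation."""
    weights, biases = layer
    return [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def encode(x: Sequence[float], layers: List[Layer]) -> List[float]:
    """Apply a stack of encoder layers in order."""
    for layer in layers:
        x = dense(x, layer)
    return list(x)


def f_M(x0: List[Sequence[float]], modality_encoders: List[List[Layer]],
        joint_encoder: List[Layer]) -> List[float]:
    """Map a multi-modal input x^0 to the shared representation j^{h_joint}."""
    tops = [encode(x_m, enc) for x_m, enc in zip(x0, modality_encoders)]
    j0 = [v for top in tops for v in top]  # concatenation j^0
    return encode(j0, joint_encoder)


def const_layer(n_in: int, n_out: int, value: float) -> Layer:
    """Hypothetical helper: all weights equal `value`, zero biases."""
    return ([[value] * n_in for _ in range(n_out)], [0.0] * n_out)


# Two modalities (dimensions 3 and 2), h_1 = h_2 = 1, h_joint = 1.
x0 = [[1.0, 0.0, -1.0], [0.5, 0.5]]
modality_encoders = [[const_layer(3, 2, 0.1)], [const_layer(2, 1, 0.2)]]
joint_encoder = [const_layer(3, 2, 0.3)]
shared = f_M(x0, modality_encoders, joint_encoder)
```

The decoder layers used during pre-training mirror these encoders and are omitted here for brevity.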

Finally, we define the optimization problem of MMD-DML as:

$\displaystyle\mathop{\text{arg min}}_{\Theta}\ {\cal L}_{r}({\cal X};\Theta)\quad\text{s.t.}\quad\forall({\bm{x}},{\bm{x}^{\prime}})\in{\cal S},\ d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))\leqslant u,\quad\forall({\bm{x}},{\bm{x}^{\prime}})\in{\cal D},\ d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))\geqslant\ell,$ (6)

where $d\left(\cdot,\cdot\right):{\mathbb{R}}^{l}\times{\mathbb{R}}^{l}\to{\mathbb{R}}$ is a distance metric defined on ${\mathbb{R}}^{l}$ and $l$ is the number of units in the last encoder layer of the joint network, i.e. the dimension of $f_{M}\left({\bm{x}};\Theta\right)={\bm{j}}^{h_{\textit{joint}}}$ . The loss term ${\cal L}_{r}\left({\cal X};\Theta\right)$ is the average reconstruction error over ${\cal X}$ and is defined as:

$\displaystyle{\cal L}_{r}\left({\cal X};\Theta\right)=\frac{1}{|{\cal X}|}\sum_{{\bm{x}}^{0}\in{\cal X}}\sum_{m=1}^{M}{\cal L}_{r}^{m}({\bm{x}}_{m}^{0},{\bm{x}}_{m}^{2h_{m}}),$ (7)

where ${\cal L}_{r}^{m}({\bm{x}}_{m}^{0},{\bm{x}}_{m}^{2h_{m}})$ denotes the reconstruction loss used for the $m$-th modality. As suggested by Wang et al. [11], these functions can be selected depending on the modality distributions. Since various features extracted from images usually follow Gaussian distributions [11], we use the convenient squared Euclidean distance loss in all of our CBIR experiments.

Using hinge losses instead of the hard margin constraints in Eq. (6), we obtain:

$\displaystyle\mathop{\text{arg min}}_{\Theta}{\cal L}_{r}({\cal X};\Theta)+\lambda_{1}\frac{1}{|{\cal S}|}\sum_{({\bm{x}},{\bm{x}^{\prime}})\in{\cal S}}\max(0,d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))-u)+\lambda_{2}\frac{1}{|{\cal D}|}\sum_{({\bm{x}},{\bm{x}^{\prime}})\in{\cal D}}\max(0,\ell-d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))),$ (8)

where $\lambda_{1}$ and $\lambda_{2}$ are regularization parameters.
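The two hinge terms of Eq. (8) can be sketched as follows. Here `embed` stands in for $f_{M}(\cdot;\Theta)$ , and the squared Euclidean distance is an assumed choice for $d$ ; the function names and toy values are illustrative only, and the reconstruction term ${\cal L}_{r}$ is left out.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]


def sq_euclidean(a: Vector, b: Vector) -> float:
    """Assumed distance d: squared Euclidean distance."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def metric_loss(embed: Callable[[Vector], Vector], similar: List[Pair],
                dissimilar: List[Pair], u: float, ell: float,
                lam1: float, lam2: float,
                dist: Callable[[Vector, Vector], float] = sq_euclidean) -> float:
    """Second and third terms of Eq. (8): hinge losses on both margins."""
    s_term = sum(max(0.0, dist(embed(x), embed(y)) - u)
                 for x, y in similar) / len(similar)
    d_term = sum(max(0.0, ell - dist(embed(x), embed(y)))
                 for x, y in dissimilar) / len(dissimilar)
    return lam1 * s_term + lam2 * d_term


def identity(x: Vector) -> Vector:
    """Stand-in embedding used only for this toy example."""
    return x


similar = [((0.0, 0.0), (1.0, 0.0)), ((0.0, 0.0), (2.0, 0.0))]      # d = 1, 4
dissimilar = [((0.0, 0.0), (1.0, 1.0)), ((0.0, 0.0), (3.0, 0.0))]   # d = 2, 9
loss = metric_loss(identity, similar, dissimilar,
                   u=2.0, ell=4.0, lam1=1.0, lam2=0.5)
```

Only the second similar pair (distance above $u$ ) and the first dissimilar pair (distance below $\ell$ ) contribute; pairs that already respect their margin add nothing to the loss.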

In the next subsections, we first introduce a pre-training algorithm to initialize the parameters of our MMD-DML model. Then, a gradient descent optimization technique is used to solve the optimization problem in Eq. (8) as the fine-tuning step of the proposed model. Since the hinge loss terms in this equation are not differentiable everywhere, we simply use the sub-gradient technique and take the gradient of the hinge loss to be zero at the non-differentiable points.

4.2 Unsupervised Pre-Training of MMD-DML

Ngiam et al. [10] proposed a pre-training method for MMD in which the network is first initialized in a greedy layer-wise manner by sparse RBMs. After that, the unfolded MMD network is pre-trained by the backpropagation algorithm.

In our method, unsupervised pre-training of the network consists of three major steps, which are shown in Algorithm 1. The first step is pre-training the SAE of each modality (Fig. 2a). To achieve a proper starting point, every layer is first initialized by Singular Value Decomposition (SVD) and then pre-trained by the backpropagation algorithm to provide a suitable dimensionality reduction for the next layer (Fig. 2b). The SAE whose layers are found in this greedy manner (one after the other) is then trained as a whole multi-layer network by the backpropagation algorithm (Fig. 2c). In other words, we train the network allocated to the $m$-th modality to lower the reconstruction error of the representation obtained by this network. As mentioned in Section 4.1, the reconstruction loss functions of the modalities are chosen as:

$\displaystyle{\cal L}_{r}^{m}({\bm{x}}_{m},\hat{\bm{x}}_{m})=\frac{1}{2}\left|\left|{\bm{x}}_{m}-\hat{\bm{x}}_{m}\right|\right|_{2}^{2},$ (9)

where $\hat{\bm{x}}_{m}$ is the reconstruction of ${\bm{x}}_{m}$ obtained by the SAE of the $m$-th modality as shown in Fig. 2c. However, the loss functions used to measure the input reconstruction error need not be the square loss. They can be chosen depending on the modality distributions, as recommended by Wang et al. [11].

In the second step, the JSAE is pre-trained in a similar manner using the inputs provided by the modality specific SAEs (Fig. 3). Finally, in Step 3 of Algorithm 1, the whole unfolded network (Fig. 1b) is pre-trained by the backpropagation algorithm to find the shared representation that minimizes the sum of the squared reconstruction errors over all the modalities (i.e. the first term in Eq. (8)).
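The SVD-based initialization of a single layer can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the data matrix holds one example per row, the auto-encoder is taken to be linear at initialization, and the top-$k$ right singular vectors are used as encoder weights (with their transpose as decoder weights and zero biases). The paper's own convention for which SVD factor is used may differ, and the function name `svd_init_autoencoder` is hypothetical.

```python
import numpy as np


def svd_init_autoencoder(X: np.ndarray, k: int):
    """Initialize a one-hidden-layer linear auto-encoder from the SVD of X.

    X has shape (n_examples, d). Returns encoder weights of shape (d, k)
    and decoder weights of shape (k, d); biases are implicitly zero.
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    W_enc = Vt[:k].T  # top-k right singular vectors as columns
    W_dec = Vt[:k]    # tied decoder: transpose of the encoder
    return W_enc, W_dec


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))   # toy data for one modality
W_enc, W_dec = svd_init_autoencoder(X, k=3)

H = X @ W_enc                  # hidden representation for the next layer
X_hat = H @ W_dec              # reconstruction before any backpropagation
init_error = 0.5 * np.sum((X - X_hat) ** 2)  # summed square loss of Eq. (9)
```

For a linear layer this starting point is the best rank-$k$ reconstruction in the least-squares sense, so the subsequent backpropagation steps mainly need to adapt the weights to the non-linearity and to the deeper layers.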

Figure 2.

Pre-training of the modality SAEs (Step 1 of Algorithm 1). (a) SAE of the m-th modality. (b) Layer-wise SAE pre-training of the m-th modality (for the l-th layer): the weights of this one-hidden-layer network are first initialized using SVD and then updated by backpropagating the reconstruction error. (c) Backpropagation to minimize the reconstruction error for the unfolded SAE of each modality.

Figure 3.

Pre-training of the Joint SAE (Step 2 of Algorithm 1). (a) Greedy layer-wise pre-training of the Joint SAE, first using SVD initialization and then using backpropagation to minimize the reconstruction error of each layer. (b) Backpropagation to minimize the reconstruction error in the whole unfolded Joint SAE.

4.3 Supervised fine-tuning of MMD-DML

In this section, we use the gradient descent method to fine-tune the pre-trained MMD-DML network by considering the similar/dissimilar distance losses in the second and third terms of Eq. (8). Using these distance losses, we optimize the MMD-DML parameters (weights and biases of the MMD-DML encoders) as:

$\displaystyle\Theta^{*}=\mathop{\text{arg min}}_{\Theta}{\cal L}_{\textit{metric}}(\Theta;{\cal S};{\cal D})=\lambda_{1}\frac{1}{|{\cal S}|}\sum_{({\bm{x}},{\bm{x}^{\prime}})\in{\cal S}}\max(0,d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))-u)+\lambda_{2}\frac{1}{|{\cal D}|}\sum_{({\bm{x}},{\bm{x}^{\prime}})\in{\cal D}}\max(0,\ell-d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))).$ (10)

As mentioned in Section 4.1, the hinge losses in the above objective function are not differentiable at zero, so we use a sub-gradient strategy to train our model. In other words, the sub-gradient of the hinge loss is defined as:

$\displaystyle\nabla_{\Theta}\max(0,z(\Theta))={\mathbb{I}}(z(\Theta)>0)\nabla_{\Theta}z(\Theta).$ (11)

Finally, the gradient of the cost function in Eq. (10) is calculated as:

$\displaystyle\nabla_{\Theta}{\cal L}_{\textit{metric}}(\Theta;{\cal S};{\cal D})=\lambda_{1}\frac{1}{|{\cal S}|}\sum_{({\bm{x}},{\bm{x}^{\prime}})\in{\cal S}}{\mathbb{I}}(d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))>u)\nabla_{\Theta}d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))-\lambda_{2}\frac{1}{|{\cal D}|}\sum_{({\bm{x}},{\bm{x}^{\prime}})\in{\cal D}}{\mathbb{I}}(d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))<\ell)\nabla_{\Theta}d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta)).$ (12)

We use the batch-mode stochastic gradient descent technique. Therefore, in each step, we calculate Eq. (12) for a mini-batch $B$ of similar/dissimilar pairs that is a subset of ${\cal S}\cup{\cal D}$ . Note that Eq. (12) is a summation of the gradients attributed to the violating similar/dissimilar pairs in the batch $B$ . We can calculate the gradient originating from every $({\bm{x}},{\bm{x}^{\prime}})\in B$ as:

$\displaystyle\nabla_{\Theta}d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))=\nabla_{f_{M}({\bm{x}};\Theta)}d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))\times\nabla_{\Theta}f_{M}({\bm{x}};\Theta)+\nabla_{f_{M}({\bm{x}^{\prime}};\Theta)}d(f_{M}({\bm{x}};\Theta),f_{M}({\bm{x}^{\prime}};\Theta))\times\nabla_{\Theta}f_{M}({\bm{x}^{\prime}};\Theta).$ (13)

Algorithm 1: Pre-Training of MMD-DML
Inputs: A set of multi-modal vectors ${\bm{X}}^{0}=\left({\bm{X}}_{1}^{0}\ldots{\bm{X}}_{M}^{0}\right)$ (each row is one of the examples in ${\cal X}$ and ${\bm{X}}_{m}^{0}$ shows the matrix containing the $m$ -th modality of examples).
Outputs: Parameters of MMD-DML initialized using pre-training
Step 1: Pre-training an SAE for each modality
for $m=1$ to $M$ do
  Greedy layer-wise pre-training of the $m$-th SAE corresponding to the $m$-th modality:
  for $l=1$ to $h_{m}$ do
    // In each iteration, initialize an auto-encoder (called AE) with one hidden layer, update its weights, and finally add its encoder layer as the $l$-th layer of the $m$-th SAE.
    ${\bm{U}}{\bm{\Sigma}}{\bm{V}}^{T}\leftarrow\text{SVD}({\bm{X}}^{l-1}_{m})$.
    Initialize the weights of the AE's encoder layer and decoder layer using the ${\bm{U}}$ and ${\bm{U}}^{T}$ matrices, respectively, and the biases to 0.
    Apply the AE (the encoder and decoder layer of this new AE) to ${\bm{X}}_{m}^{l-1}$ to find $\hat{\bm{X}}_{m}^{l-1}$.
    Update the weights of this auto-encoder by minimizing the reconstruction error between ${\bm{X}}_{m}^{l-1}$ and $\hat{\bm{X}}_{m}^{l-1}$ via the backpropagation algorithm.
    Add the AE's encoder layer as the $l$-th layer of the $m$-th SAE.
    Apply the encoder layer of the AE to ${\bm{X}}_{m}^{l-1}$ to find ${\bm{X}}_{m}^{l}$.
  Backpropagation in the unfolded modality SAE:
    Apply the encoder layers of the $m$-th SAE and then their corresponding decoder layers, as in Fig. 2c, to find $\hat{\bm{X}}_{m}$.
    Update all weights of the $m$-th SAE by minimizing the reconstruction loss between $\hat{\bm{X}}_{m}$ and ${\bm{X}}_{m}^{0}$ using backpropagation.
Step 2: Pre-training JSAE
Greedy layer-wise pre-training of the JSAE:
for $l=1$ to $h_{\textit{joint}}$ do
  ${\bm{U}}{\bm{\Sigma}}{\bm{V}}^{T}\leftarrow\text{SVD}({\bm{J}}^{l-1})$.
  Initialize the weights of the AE's encoder layer and decoder layer using the ${\bm{U}}$ and ${\bm{U}}^{T}$ matrices, respectively, and the biases to 0.
  Apply the AE (the encoder and decoder layer of this new AE) to ${\bm{J}}^{l-1}$ to find $\hat{\bm{J}}^{l-1}$.
  Update the weights of this auto-encoder by minimizing the reconstruction error between ${\bm{J}}^{l-1}$ and $\hat{\bm{J}}^{l-1}$ via the backpropagation algorithm.
  Add the AE's encoder layer as the $l$-th layer of the JSAE.
  Apply the encoder layer of the AE to ${\bm{J}}^{l-1}$ to find ${\bm{J}}^{l}$.
Backpropagation in unfolded JSAE:
  Apply the encoder layers of the JSAE and then their corresponding decoder layers, as in Fig. 1b, to find ${\bm{J}}^{2h_{\textit{joint}}}$.
  Update all weights of the JSAE by minimizing the reconstruction loss between ${\bm{J}}^{2h_{\textit{joint}}}$ and ${\bm{J}}^{0}$ using the backpropagation algorithm.
Step 3: Backpropagation in unfolded MMD-DML
Update the weights of the whole network by minimizing ${\cal L}_{r}\left({\bm{X}}^{0};\Theta\right)$ in Eq. (7) via the backpropagation algorithm.
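The SVD-based initialization used inside the greedy layer-wise loops of Algorithm 1 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: samples are stored as rows, the encoder is applied as `X @ W_enc`, the decoder is linear, the hidden layer keeps the leading `k` left singular vectors, and the names `svd_init_autoencoder` and `reconstruct` are hypothetical. The subsequent backpropagation updates are omitted.

```python
import numpy as np

def svd_init_autoencoder(X_rows, k):
    """Initialise a one-hidden-layer auto-encoder from the SVD of its input.

    X_rows: array of shape (n_samples, d), one sample per row (assumed layout).
    k: number of hidden units, assumed to satisfy k <= min(n_samples, d).
    """
    # SVD of the d x n data matrix; the columns of U live in feature space.
    U, _, _ = np.linalg.svd(X_rows.T, full_matrices=False)
    U_k = U[:, :k]                                     # leading k left singular vectors
    W_enc, b_enc = U_k, np.zeros(k)                    # encoder from U, biases set to 0
    W_dec, b_dec = U_k.T, np.zeros(X_rows.shape[1])    # decoder from U^T, biases set to 0
    return W_enc, b_enc, W_dec, b_dec

def reconstruct(X_rows, params, act=np.tanh):
    """Return the hidden representation and the reconstruction of X_rows."""
    W_enc, b_enc, W_dec, b_dec = params
    H = act(X_rows @ W_enc + b_enc)    # plays the role of X^{l}
    X_hat = H @ W_dec + b_dec          # plays the role of \hat{X}^{l-1}
    return H, X_hat

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
params = svd_init_autoencoder(X, k=4)
H, X_hat = reconstruct(X, params)
print("reconstruction error before any weight update:", np.mean((X - X_hat) ** 2))
```

With an identity activation and `k` equal to the input dimension, this initialization reproduces the input exactly, which is one way to see why it gives backpropagation a reasonable starting point.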

The first term on the right-hand side of Eq. (4.3) is the gradient of $d\left(f_{M}\left({\bm{x}};\Theta\right),o\right)$ w.r.t. the model parameters, where $o=f_{M}\left({\bm{x}^{\prime}};\Theta\right)$ is treated as the fixed desired output of the MMD-DML network for ${\bm{x}}$ and $d\left(f_{M}\left({\bm{x}};\Theta\right),o\right)$ measures the loss of the network in generating $o$. Similarly, the second term of Eq. (4.3) is the gradient of $d\left(f_{M}\left({\bm{x}^{\prime}};\Theta\right),o^{\prime}\right)$ w.r.t. the model parameters, where $o^{\prime}=f_{M}\left({\bm{x}};\Theta\right)$ is treated as the fixed desired network output for ${\bm{x}^{\prime}}$. In other words, both gradients in Eq. (4.3) have the same form as the gradient of a neural network trained on a regression problem. Thus, the partial derivatives in Eq. (4.3) w.r.t. the parameters of each layer can be calculated through the backpropagation algorithm [11].
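The two-term decomposition described above can be checked numerically on a toy model. The sketch below is illustrative only and rests on assumptions that are not the paper's setting: the network is replaced by a linear map `W @ x` with shared weights, and the distance is the squared Euclidean distance rather than Eq. (14). All function names are hypothetical.

```python
import numpy as np

def pair_distance(W, x, x_prime):
    """Squared Euclidean distance between the two outputs of a shared linear map."""
    diff = W @ x - W @ x_prime
    return float(diff @ diff)

def regression_style_grad(W, x, target):
    """Gradient w.r.t. W of ||W x - target||^2 with the target held fixed."""
    return 2.0 * np.outer(W @ x - target, x)

def pair_grad(W, x, x_prime):
    """Sum of two regression-style gradients, one per branch of the pair."""
    o = W @ x_prime         # fixed desired output for x
    o_prime = W @ x         # fixed desired output for x'
    return regression_style_grad(W, x, o) + regression_style_grad(W, x_prime, o_prime)

def numerical_grad(W, x, x_prime, eps=1e-6):
    """Central finite differences of pair_distance w.r.t. every entry of W."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp, Wm = W.copy(), W.copy()
            Wp[i, j] += eps
            Wm[i, j] -= eps
            G[i, j] = (pair_distance(Wp, x, x_prime) - pair_distance(Wm, x, x_prime)) / (2 * eps)
    return G

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
x, x_prime = rng.normal(size=4), rng.normal(size=4)
print(np.max(np.abs(pair_grad(W, x, x_prime) - numerical_grad(W, x, x_prime))))
```

The printed difference should be close to zero, which indicates that treating the other branch's output as a fixed target in each term recovers the full gradient of the pairwise distance in this toy case.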

5. Experiments

In this paper, we use various feature types such as SIFT and GIST as different modalities of image data and evaluate MMD-DML in CBIR as a multi-modal retrieval task.

5.1 Datasets

To assess the efficacy of our method in CBIR tasks, we evaluate it on three widely-used datasets (available from the project website of the OMKS method; see Footnotes): Caltech-256, Corel5k, and Indoor. These are the most common datasets for CBIR tasks, and we select them following [23], the work most closely related to ours. The Caltech-256 dataset has 256 image categories plus an extra class named “Clutter” [35]. Similar to the work by Chechik et al. [36], we choose 10, 20, and 50 classes from this dataset. In our experiments, these subsets are referred to as “Cal10”, “Cal20”, and “Cal50”, respectively. Corel5k has 50 diverse image categories collected from COREL image CDs [37]. Contrary to Caltech-256, which has a varying number of images in each category, each class in Corel5k contains exactly 100 images. Indoor is a dataset previously used for indoor scene recognition [38]. This dataset has 67 categories, each of which contains at least 100 images.

Following the work of Xia et al. [23], in order to avoid the dominating effect of a class with a high number of images, we find the number of images in the smallest class and randomly choose that many samples from each class, so that the number of images is the same for all classes. Then, we randomly split the data into four partitions: a training set, a validation set, a query set, and a test set. The training set contains 50% of the images and is used for pre-training the network and extracting pairwise constraints. The validation set contains 10% of the images and is used for tuning the hyper-parameters. The query set and the test set contain 10% and 30% of the images, respectively, and are used for evaluating the method. For comparison of different methods, query objects are chosen from the query set and the test set is regarded as the target domain. To extract pairwise constraints, we create all possible similar pairs in the training set, and for each similar pair $({\bm{x}}_{1},{\bm{x}}_{2})$ we randomly choose a point ${\bm{x}}_{3}$ from another class and create a dissimilar pair $({\bm{x}}_{1},{\bm{x}}_{3})$. After that, we keep half of the constraints to train the methods. The effect of using varied numbers of constraints on the performance of the methods is shown in Section 5.7. For OMKS, which uses triplets rather than pairwise constraints, we merge each similar pair $({\bm{x}}_{1},{\bm{x}}_{2})$ and dissimilar pair $({\bm{x}}_{1},{\bm{x}}_{3})$ to create a triplet $({\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3})$.
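The constraint extraction described above can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes that similar pairs are unordered pairs of training indices with the same label, that each similar pair and its companion dissimilar pair are kept or dropped together, and that at least two classes are present. The function name `build_constraints` and the parameters `keep_ratio` and `seed` are hypothetical.

```python
import itertools
import random

def build_constraints(labels, keep_ratio=0.5, seed=0):
    """Build similar pairs, dissimilar pairs, and triplets from class labels."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    triplets = []
    for y, members in by_class.items():
        # Candidate points from any other class (assumes at least two classes).
        others = [i for i, other_y in enumerate(labels) if other_y != y]
        for x1, x2 in itertools.combinations(members, 2):
            x3 = rng.choice(others)          # random point from another class
            triplets.append((x1, x2, x3))

    # Keep only a fraction of the constraints (one half in the text above).
    rng.shuffle(triplets)
    kept = triplets[: int(len(triplets) * keep_ratio)]
    similar = [(x1, x2) for x1, x2, _ in kept]
    dissimilar = [(x1, x3) for x1, _, x3 in kept]
    return similar, dissimilar, kept

labels = [0, 0, 0, 1, 1, 1]
similar, dissimilar, triplets = build_constraints(labels)
print(len(similar), "similar pairs,", len(dissimilar), "dissimilar pairs")
```

The returned triplets correspond to the merged form $({\bm{x}}_{1},{\bm{x}}_{2},{\bm{x}}_{3})$ used for triplet-based methods such as OMKS, while the two pair lists serve the pairwise methods.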

5.2 Extracted features

Similar to [23], we use several types of features from each image. These features are Local Binary Pattern, GIST features, Gabor wavelets, color histogram and color moments, edge direction histogram, SIFT features, and SURF features. For SIFT and SURF features, we use 200 and 1000 as codebook size, thus generating four types of features called SIFT200, SIFT1000, SURF200, and SURF1000. Using PCA, we extract 100 features from each feature set whose dimension is more than 100.
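The dimensionality reduction step can be sketched as below. This is a minimal sketch under assumptions: PCA is computed with an SVD of the centered feature matrix, the number of samples is at least the target dimension, and the basis is fitted on whatever matrix is passed in (the text does not say which partition is used to fit it). The function name `pca_reduce` and the example shapes are hypothetical.

```python
import numpy as np

def pca_reduce(F, max_dim=100):
    """Project a feature set onto its first max_dim principal components.

    F: array of shape (n_samples, d). Feature sets with d <= max_dim are
    returned unchanged, mirroring the rule stated in the text.
    """
    if F.shape[1] <= max_dim:
        return F
    centered = F - F.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:max_dim].T

rng = np.random.default_rng(0)
wide = rng.normal(size=(150, 300))     # hypothetical feature set wider than 100
narrow = rng.normal(size=(150, 59))    # hypothetical feature set already below 100
print(pca_reduce(wide).shape, pca_reduce(narrow).shape)
```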

5.3 Choosing distance metric and margins

We use the measure defined below as the distance metric in Eq. (4.1):

$\displaystyle d\left({\bm{h}},{\bm{h}^{\prime}}\right)=1-\frac{\langle{\bm{h}},{\bm{h}^{\prime}}\rangle}{\left\|{\bm{h}}\right\|\left\|{\bm{h}^{\prime}}\right\|}.$ (14)

This is the distance related to cosine similarity, and it takes values in $[0,2]$ regardless of the representation space. Using this distance, we can restrict the values of the margins $u$ and $l$. Indeed, we set these margins to $u=1-\cos\left(\frac{\pi}{12}\right)\approx 0.034$ and $l=1-\cos\left(\frac{\pi}{6}\right)\approx 0.134$ in all the experiments below.
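The distance of Eq. (14) and the two margin values can be computed with a few lines of NumPy. This is a small sketch for illustration; the function name `cosine_distance` is hypothetical, and the vectors are assumed to be non-zero.

```python
import math
import numpy as np

def cosine_distance(h, h_prime):
    """Eq. (14): one minus the cosine of the angle between h and h'."""
    return 1.0 - float(h @ h_prime) / (np.linalg.norm(h) * np.linalg.norm(h_prime))

# Margins used in the experiments, expressed through their angles.
u = 1.0 - math.cos(math.pi / 12)    # angle of 15 degrees
l = 1.0 - math.cos(math.pi / 6)     # angle of 30 degrees

a = np.array([1.0, 2.0, 3.0])
b = np.array([3.0, -1.0, 0.5])
print(round(u, 4), round(l, 4))
# Rescaling either vector leaves the distance unchanged.
print(cosine_distance(a, b), cosine_distance(10.0 * a, 0.1 * b))
```

Because the distance depends only on the angle between the two representations, the same margin values remain meaningful whatever the scale of the features in the shared space.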

As mentioned by Xing et al. [1], the margin value in Eq. (2) corresponds only to scaling, and different margin values may yield equivalent solutions. Suppose we have a near-optimal (w.r.t. visual similarity) pre-trained MMD-DML model (see Section 4.2 for the pre-training stage). The distance between pairs in this model, w.r.t. the Euclidean metric, ranges over $(0,V)$. The upper bound $V$ can be derived from the number and the type of activation functions of the neurons in the last layer, or it can be estimated from the representation of the examples in $X$ by finding the maximum distance between data points. Choosing suitable values for the margins $u$ and $l$ in this range results in fast convergence of the gradient descent algorithm, since it reduces the number of iterations and requires only insignificant changes to the pre-trained MMD-DML. For example, suppose the margins are chosen so that $u<l\ll{\mathbb{E}}_{X}[d\left({\bm{x}},{\bm{y}}\right)]$, where ${\mathbb{E}}_{X}[d\left({\bm{x}},{\bm{y}}\right)]$ denotes the expected value of the distances between data points in the last layer of MMD-DML. In this situation, the dissimilar hinge loss term in Eq. (4.1) is mostly inactive and, thus, gradient descent tries to shrink the distance between similar pairs while disregarding a significant portion of the dissimilar pairs until ${\mathbb{E}}_{X}[d\left({\bm{x}},{\bm{y}}\right)]$ gets close to the dissimilarity margin. As a result, the network will require more iterations to achieve a desirable solution and will be unable to take advantage of the initial point found by pre-training. Notice that this performance degradation is due to blindly choosing margins that are inconsistent with the scale of the obtained features in the shared representation. Consequently, we use the cosine-based distance, which is scale invariant, so that the range of margin values does not depend on the properties of the shared representation space. Experiments in the following subsections show that the angular distance of Eq. (14) achieves state-of-the-art results in few iterations.

Figure 4.

Performance of the network with different depths.

5.4 Network architecture

In this section, we evaluate networks with various numbers of layers and different numbers of units to find the desired architecture of the network. We start with a network having single-layer modality-specific SAEs and no JSAE. At each step, we increase the number of layers in the modality-specific SAEs by one while fixing the other hyper-parameters, and we evaluate the resulting model using mean Average Precision (mAP). This process is continued until adding a new layer decreases the network performance. The number of units in each layer is chosen so that the first layer reduces the dimensionality of each modality to 50 and the subsequent layers do not further reduce the dimensionality. Figure 4 shows the performance of the network with respect to the number of layers on the different datasets. According to these results, performance tends to degrade when the number of layers goes beyond three (especially for datasets with a smaller number of samples), since networks with a higher number of parameters are more prone to overfitting. Indeed, increasing the number of layers increases the flexibility of the model. However, the number of adjustable parameters (i.e. weights and biases) also grows, and since in many datasets the number of training samples is not sufficient, overfitting may occur when new layers are added. For the Indoor dataset, the largest dataset, the network can be five layers deep since we have more samples to train the network.
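The depth search described above is a simple greedy loop, sketched below. The sketch is an assumption-laden illustration: `evaluate_map` stands for a hypothetical callback that trains a network with the given number of modality-specific layers and returns its validation mAP, `max_depth` is a safety cap that the text does not mention, and the scores in the example are made up for demonstration only.

```python
def select_depth(evaluate_map, max_depth=10):
    """Increase the depth one layer at a time until the score decreases."""
    best_depth, best_score = 1, evaluate_map(1)
    for depth in range(2, max_depth + 1):
        score = evaluate_map(depth)
        if score < best_score:       # adding this layer hurt performance: stop
            break
        best_depth, best_score = depth, score
    return best_depth, best_score

# Made-up validation scores, used only to exercise the loop.
fake_scores = {1: 0.30, 2: 0.35, 3: 0.38, 4: 0.36, 5: 0.40}
depth, score = select_depth(lambda d: fake_scores[d], max_depth=5)
print(depth, score)
```

Note that a greedy search of this kind stops at the first decrease, so it never evaluates deeper networks that might recover, such as the made-up depth 5 in the example.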

For each dataset, we then pick the network with the highest performance and replace its last layer with a single-layer JSAE. This network is evaluated using 64, 128, and 256 output neurons. The results are summarized in Table 2.

As recommended in [39], the hyperbolic tangent is used as the activation function of all encoders and decoders except the decoders of the first layers, since its symmetry around the origin allows faster convergence. The decoders of the first layers employ linear activation functions. The network is then trained for 300 iterations with a batch size of 250.

Table 2
mAP of networks with different output widths

Output width	Cal10	Cal20	Cal50	Corel5k	Indoor
64	0.38761	0.26526	0.16651	0.48642	0.07384
128	0.41103	0.29705	0.18786	0.48981	0.08843
256	0.42248	0.26665	0.18763	0.48206	0.06734

5.5 Compared methods

We compare our method with three recent methods introduced for multi-modal retrieval. As a baseline, we also report the results of the Unsupervised Multi-Modal Deep SAE (U-MMD-DML), the unsupervised version of our method, which can be considered an extension of the Bimodal Deep Network proposed in [10] with some differences in training (mentioned in Section 4.1). Details of the methods used for comparison are provided below:

OMKS [23]: Using training triplets, OMKS optimizes several kernel functions for each modality while learning the optimal weights for the linear combination of these functions. Similar to [23], we used three RBF kernels with $\sigma\in\{2^{-1},2^{0},2^{1}\}$ and also a cosine similarity kernel for each modality as the base kernels.

MM-DML [12]: This method was described in Section 3.2. For this model, we fixed the number of outputs to 128 and set the values of the $\lambda_{1}$ and $\lambda_{2}$ parameters through cross-validation.

OM-DML [29]: This method, mentioned in Section 2.1, simultaneously learns a distinct linear transformation on each modality and finds optimal weights for combining the transformed modalities.

Proposed Unsupervised MMD-DML (U-MMD-DML): As opposed to MMD-DML, this version of our method does not use the pairwise constraints to fine-tune the whole network and only uses the unsupervised pre-training shown in Algorithm 1.

Proposed Multi-Modal Deep Distance Metric Learning (MMD-DML): This method was described in Section 4.

Table 3
mAP of different retrieval methods

Method	Cal10	Cal20	Cal50	Corel5k	Indoor
OMKS	0.28544	0.24715	0.14299	0.42695	0.06846
OM-DML	0.28648	0.21567	0.13078	0.34496	0.06236
MM-DML	0.30247	0.21796	0.11671	0.30356	0.05431
U-MMD-DML	0.24418	0.19869	0.11475	0.27710	0.05630
MMD-DML	0.35760	0.27923	0.17082	0.46844	0.07453

Figure 5.

Evaluation of methods using precision at top-k.

Figure 6.

Evaluation of methods using precision-recall curve.

5.6 Evaluation in retrieval and classification

We evaluate each method on five random splits of the datasets and average the obtained results. First, we use the mAP measure to compare the performance of the different methods and summarize the average results in Table 3. In Fig. 5, we report the performance of these methods in terms of precision at top-k. We also compare these methods using the 11-point interpolated precision-recall curve on the same datasets in Fig. 6.
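The three retrieval measures used here can be sketched as follows. These are the textbook definitions, given as an assumption since the text does not spell out the exact variants used: each query yields a ranked list of binary relevance flags over the whole target set (so recall reaches 1), average precision is averaged over the relevant items, and the function names are hypothetical.

```python
import numpy as np

def precision_at_k(relevance, k):
    """Fraction of relevant items among the top-k results (assumes k <= list length)."""
    return float(np.mean(relevance[:k]))

def average_precision(relevance):
    """Mean of the precision values measured at the rank of each relevant item."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    hits = np.cumsum(relevance)
    ranks = np.arange(1, len(relevance) + 1)
    return float(np.sum(relevance * hits / ranks) / relevance.sum())

def mean_average_precision(relevance_lists):
    """mAP: average precision averaged over all queries."""
    return float(np.mean([average_precision(r) for r in relevance_lists]))

def interpolated_precision_11pt(relevance):
    """Maximum precision at recall >= r for r in {0.0, 0.1, ..., 1.0}."""
    relevance = np.asarray(relevance, dtype=float)
    hits = np.cumsum(relevance)
    precision = hits / np.arange(1, len(relevance) + 1)
    recall = hits / relevance.sum()      # assumes at least one relevant item
    levels = np.linspace(0.0, 1.0, 11)
    return [float(precision[recall >= r - 1e-12].max()) for r in levels]

ranked = [1, 0, 1, 0]    # relevance of the ranked results for one toy query
print(precision_at_k(ranked, 2), average_precision(ranked))
print(interpolated_precision_11pt(ranked))
```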

It can be seen from the results that MMD-DML outperforms all the other methods on every dataset in Table 3. MMD-DML’s ability to learn a nonlinear transform for each modality can serve as a reason for the remarkable difference between the performance of our method and that of the MM-DML method. The relatively high performance of the shallow MM-DML on the Cal10 dataset does not scale well to larger datasets, where it becomes suboptimal compared with the other methods. This suggests that deep models such as MMD-DML can improve the results for large-scale tasks. Moreover, comparing the results of MMD-DML and U-MMD-DML, we find that the supervisory information in MMD-DML improves performance by a large margin.

Figure 7.

Evaluation of methods in terms of k-NN classification accuracy.

We also compare the methods in terms of k-nearest neighbor (k-NN) classification accuracy for various values of $k$. The results are summarized in Fig. 7. Our proposed method achieves the highest classification accuracy on all the datasets. According to Table 3 and Figs 5–7, our MMD-DML method outperforms the other methods, with a larger margin when the number of classes in the dataset is lower (e.g. the margin between our method and the second best method is larger on Cal10 than on Cal50).
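A k-NN classifier over learned representations can be sketched as below. This is an illustrative sketch rather than the evaluation code of the paper: it assumes that neighbors are ranked with the cosine-based distance of Eq. (14), that the reference representations and labels are passed in explicitly as NumPy arrays, and that ties in the vote are broken by neighbor order. The function name `knn_predict` is hypothetical.

```python
from collections import Counter

import numpy as np

def knn_predict(reference_H, reference_y, query_H, k=5):
    """Predict labels for query_H by majority vote among the k nearest references."""
    # Normalise rows so that a dot product equals the cosine similarity.
    A = reference_H / np.linalg.norm(reference_H, axis=1, keepdims=True)
    B = query_H / np.linalg.norm(query_H, axis=1, keepdims=True)
    dist = 1.0 - B @ A.T                              # shape (n_query, n_reference)
    nearest = np.argsort(dist, axis=1)[:, :k]         # indices of the k closest references
    return np.array([Counter(reference_y[row].tolist()).most_common(1)[0][0]
                     for row in nearest])

reference_H = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
reference_y = np.array([0, 0, 1, 1])
query_H = np.array([[2.0, 0.1], [0.2, 3.0]])
print(knn_predict(reference_H, reference_y, query_H, k=3))
```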

Figure 8.

mAP measure w.r.t. the ratio of pairwise constraints.

5.7 The impact of the ratio of pairwise constraints

As mentioned in Section 5.1, we keep a ratio of the pairwise constraints (i.e. supervisory information in the form of similar and dissimilar pairs) to train the supervised methods. We evaluate the methods while changing this ratio and summarize the results in Fig. 8. Several empirical observations can be inferred from these results. First, MMD-DML performs better than the other methods in most cases. Second, the biggest leap in the performance of the three compared methods results from the first 20% of the constraints. Third, as mentioned in [23], OMKS becomes nearly saturated after receiving the first 20% of the constraints, and a similar phenomenon occurs for the MM-DML and OM-DML methods. However, MMD-DML keeps taking advantage of supervisory information beyond this level, since the MMD-DML model is more flexible and additional supervisory information helps it to be trained more properly.

6. Conclusion

In this paper, we proposed the MMD-DML framework for distance metric learning on multi-modal data when supervisory information is available in the form of similar/dissimilar pairs. MMD-DML is capable of learning a complicated nonlinear similarity function on multi-modal data (with heterogeneous modalities). In other words, MMD-DML has the ability to learn intra- and inter-modal high-order statistics from raw features. The high degree of freedom in the MMD-DML hypothesis space is controlled using an efficient multi-stage pre-training phase. In fact, we first used the properties of multi-modal data to pre-train the network and then fine-tuned it using the supervisory information. Experimental results show the superiority of the proposed method in the retrieval and classification tasks. Our method improves the mAP measure on the Cal10, Corel5k, and Indoor datasets by 7.2, 4.1, and 0.6 percentage points, respectively, compared to OMKS, which is the second best method on Corel5k and Indoor (on Cal10, the second best method in Table 3 is MM-DML, over which the improvement is 5.5 percentage points).

Footnotes

If we have large-scale data, we can simply ignore SVD steps or calculate SVD over a subset of examples.

The reconstruction loss function used in the first layer of each modality-specific SAE can be selected depending on the modality distribution. For the other layers of every SAE, however, the Euclidean reconstruction loss is common. All reconstruction loss minimization steps in Algorithm 1 are done by batch-mode gradient descent.

The datasets used in our experiments are available on the project website of the OMKS method: http://www.cais.ntu.edu.sg/∼chhoi/OMKS/.

References

1. Xing E.P., Jordan M.I., Russell S.J. and Ng A.Y., Distance metric learning with application to clustering with side-information, in: Advances in Neural Information Processing Systems, 2003, pp. 521–528.

2. Weinberger K.Q., Blitzer and Saul L.K., Distance metric learning for large margin nearest neighbor classification, in: Advances in Neural Information Processing Systems, 2006, pp. 1473–1480.

3. Davis J.V., Kulis, Jain, Sra and Dhillon I.S., Information-theoretic metric learning, in: Proceedings of the 24th International Conference on Machine Learning, ACM, 2007, pp. 209–216.

4. McFee and Lanckriet, Learning multi-modal similarity, Journal of Machine Learning Research 12 (2011), 491–523.

5. Baghshah M.S. and Shouraki S.B., Metric learning for semi-supervised clustering using pairwise constraints and the geometrical structure of data, Intelligent Data Analysis 13 (2009), 887–899.

6. and Tan Y.-P., Discriminative deep metric learning for face verification in the wild, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1875–1882.

7. Oh Song, Xiang, Jegelka and Savarese, Deep metric learning via lifted structured feature embedding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.

8. and Tan Y.-P., Deep transfer metric learning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 325–333.

9. Hoffer and Ailon, Deep metric learning using triplet network, in: International Workshop on Similarity-Based Pattern Recognition, Springer, 2015, pp. 84–92.

10. Ngiam, Khosla, Kim, Nam, Lee and Ng A.Y., Multimodal deep learning, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 689–696.

11. Wang, Ooi B.C., Yang, Zhang and Zhuang, Effective multi-modal retrieval based on stacked auto-encoders, Proceedings of the VLDB Endowment 7 (2014), 649–660.

12. Xie and Xing E.P., Multi-modal distance metric learning, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI), 2013, pp. 1806–1812.

13. Feng and Wang, Deep correspondence restricted Boltzmann machine for cross-modal retrieval, Neurocomputing 154 (2015), 50–60.

14. Srivastava and Salakhutdinov R.R., Multimodal learning with deep Boltzmann machines, in: Advances in Neural Information Processing Systems, 2012, pp. 2222–2230.

15. Chen, Zhu and Xing E.P., Predictive subspace learning for multi-view data: a large margin approach, in: Advances in Neural Information Processing Systems, 2010, pp. 361–369.

16. Chen, Zhu, Sun and Xing E.P., Large-margin predictive latent subspace learning for multiview data analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (2012), 2365–2378.

17. Wang, Nie, Huang and Ding, Heterogeneous visual features fusion via sparse multimodal machine, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3097–3102.

18. Xia and Hoi S.C., Online multi-modal distance learning for scalable multimedia retrieval, in: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, ACM, 2013, pp. 455–464.

19. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015), 85–117.

20. Wang, Yang, Ooi B.C., Zhang and Zhuang, Effective deep learning-based multi-modal retrieval, The VLDB Journal 25 (2016), 79–101.

21. Hoi S.C., Xia, Zhao, Wang and Miao, Online multimodal deep similarity learning with application to image retrieval, in: Proceedings of the 21st ACM International Conference on Multimedia, ACM, 2013, pp. 153–162.

22. Xing E.P., Yan and Hauptmann A.G., Mining associated text and images with dual-wing harmoniums, arXiv preprint arXiv:1207.1423, 2012.

23. Xia, Hoi S.C., Jin and Zhao, Online multiple kernel similarity learning for visual search, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (2014), 536–549.

24. Lanckriet G.R., Cristianini, Bartlett, Ghaoui L.E. and Jordan M.I., Learning the kernel matrix with semidefinite programming, Journal of Machine Learning Research 5 (2004), 27–72.

25. Chen, Hoi S.C. and Xiao, SimApp: A framework for detecting similar mobile applications by online kernel learning, in: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, ACM, 2015, pp. 305–314.

26. Sonnenburg, Rätsch, Schäfer and Schölkopf, Large scale multiple kernel learning, Journal of Machine Learning Research 7 (2006), 1531–1565.

27. Chen, Hoi S.C. and Xiao, Mobile app tagging, in: Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, ACM, 2016, pp. 63–72.

28. Lin Y.-Y., Liu T.-L. and Fuh C.-S., Dimensionality reduction for data in multiple feature representations, in: Advances in Neural Information Processing Systems, 2009, pp. 961–968.

29. Hoi S.C., Zhao, Miao and Liu Z.-Y., Online multi-modal distance metric learning with application to image retrieval, IEEE Transactions on Knowledge and Data Engineering 28 (2016), 454–467.

30. Smolensky, Information processing in dynamical systems: Foundations of harmony theory, in: COLORADO UNIV AT BOULDER DEPT OF COMPUTER SCIENCE, 1986.

31. Welling, Rosen-Zvi and Hinton G.E., Exponential family harmoniums with an application to information retrieval, in: Advances in Neural Information Processing Systems, 2005, pp. 1481–1488.

32. Hotelling, Relations between two sets of variates, Biometrika 28 (1936), 321–377.

33. Andrew, Arora, Bilmes and Livescu, Deep canonical correlation analysis, in: International Conference on Machine Learning, 2013, pp. 1247–1255.

34. Kulis, Metric learning: A survey, Foundations and Trends in Machine Learning 5 (2013), 287–364.

35. Griffin, Holub and Perona, Caltech-256 object category dataset, 2007.

36. Chechik, Sharma, Shalit and Bengio, Large scale online learning of image similarity through ranking, Journal of Machine Learning Research 11 (2010), 1109–1135.

37. Hoi S.C., Liu, Lyu M.R. and Ma W.-Y., Learning distance metrics with contextual constraints for image retrieval, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, IEEE, 2006, pp. 2072–2078.

38. Quattoni and Torralba, Recognizing indoor scenes, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 413–420.

39. LeCun Y.A., Bottou, Orr G.B. and Müller K.-R., Efficient backprop, in: Neural Networks: Tricks of the Trade, Springer, 2012, pp. 9–48.
