Abstract
In many real-world applications, data contain heterogeneous input modalities (e.g., web pages include images, text, etc.). Moreover, data such as images are usually described using different views (i.e. different sets of features). Learning a distance metric or similarity measure that originates from all input modalities or views is essential for many tasks such as content-based retrieval ones. In these cases, similar and dissimilar pairs of data can be used to find a better representation of data in which similarity and dissimilarity constraints are better satisfied. In this paper, we incorporate supervision in the form of pairwise similarity and/or dissimilarity constraints into multi-modal deep networks to combine different modalities into a shared latent space. Using properties of multi-modal data, we design multi-modal deep networks and propose a pre-training algorithm for these networks. In fact, the proposed network has the ability of learning intra- and inter-modal high-order statistics from raw features and we control its high flexibility via an efficient multi-stage pre-training phase corresponding to properties of multi-modal data. Experimental results show that the proposed method outperforms recent methods on image retrieval tasks.
Introduction
A proper distance metric (or similarity measure) plays an important role in many learning and retrieval tasks. Until now, many methods have been proposed for metric learning [1, 2, 3, 4, 5]. In these methods, it is usually assumed that supervisory information in the form of relative distance constraints or similar/dissimilar pairs is available. Some of these methods learn linear [1, 2] or nonlinear [4] transformations on the feature space to find a new representation space in which distance (or similarity) constraints are better satisfied. However, those methods that learn nonlinear transformations (or implicitly kernel matrices) are usually either limited to learning a restricted form of non-linear transformations or very time consuming when they are flexible (i.e. they learn the whole kernel matrix). Moreover, the flexible methods that learn the whole kernel matrix are transductive and cannot be used to find similarities for new data. Recently, some deep metric learning methods [6, 7, 8, 9] have been proposed that can learn a non-linear transformation to achieve a new representation space in which the distance constraints are better satisfied. However, these deep models of metric learning have been designed for input data containing one modality. Therefore, they have not used properties of multi-modal data for designing the architecture and training of the deep networks.
On the other hand, real-world data usually contains different modalities such as text, image, and video. Video and corresponding audio [10] and annotated images [11, 12] are examples of multi-modal data. In recent years, several multi-modal methods have been proposed to incorporate heterogeneous modalities in the classification and retrieval tasks [11, 12, 13, 14]. Besides, there are similar challenges in multi-view descriptions of data. Any view is indeed a description of data obtained by a particular feature extraction method [15, 16]. Because of the somewhat similar nature of multi-modal and multi-view data, they pose similar challenges and even in some studies identical models have been introduced for both of them [17, 18].
(a) Unfolded MMD [10] (b) Unfolded MMD-DML.
Deep networks are flexible and effective models that have been used in a wide range of applications [19]. They can model distributions of different modalities and make a connection between them by a common or shared space that is obtained using a layer above modality specific networks. Multi-modal deep models have recently attracted much attention in many applications such as cross-modal [11, 20] and multi-modal retrieval tasks [14, 21]. A popular deep network architecture for multi-modal data is shown in Fig. 1a. This architecture which includes modality specific networks and a single layer on the top of these networks (to find a shared representation) has been used in many multi-modal deep learning models [10, 14]. However, this architecture has not been already used for metric learning.
In this paper, we propose a deep metric learning method for multi-modal data (to the best of our knowledge, our method is the first deep metric learning method for multi-modal data). The proposed Multi-modal Deep Distance Metric Learning (MMD-DML) framework (see Fig. 1b) can include some layers for non-linear metric learning on the top of the single layer presenting the shared representation of modalities. We design an effective method for unsupervised pre-training of this model using the properties of multi-modal data. Since we intend to use the multi-modal deep network for metric learning, an optimization problem is presented that considers supervisory information in the form of similar/dissimilar pairs. Stochastic gradient descent is employed for training MMD-DML with batches of similar/dissimilar pairs.
In this work, our goal is learning a distance metric that incorporates multiple modalities. Retrieval is one of the tasks in which the distance metric has a critical role and also some data views or modalities usually exist in retrieval applications. For example, in Content-Based Image Retrieval (CBIR), different views of an image, which are obtained through various feature extraction techniques, can act as different modalities. Experimental results show the effectiveness of our pre-training method in such retrieval tasks.
The rest of this paper is organized as follows. Some related works are reviewed in Section 2. We first present some definitions and preliminaries of our proposed model in Section 3 and then the proposed method is described with details in Section 4. Experimental settings and results of our method for CBIR are presented in Section 5. Finally, we conclude our work in Section 6.
The existing methods of representation learning for multi-modal data can be categorized as below:
Multiple Kernel Learning (MKL) Shallow Probabilistic Models Deep Models
These approaches are widely used in unsupervised [10, 11, 20, 22] and supervised [4, 12, 21, 23] multi-modal retrieval.
Multiple kernel learning
MKL methods can be used to learn a kernel that is a combination of a set of fixed basis kernels. Although these methods were first applied to single modal data, they can also be utilized for multi-modal data where different kernels are considered for different modalities [24, 25].
Weighted kernel combination is one of the earliest MKL methods [24, 25, 26, 27] in which the kernel space is equivalent to weighted concatenation of kernel Hilbert spaces. Lanckriet et al. [24] employed weighted kernel combination in the Support Vector Machine (SVM) and learned optimal kernel weights and SVM parameters simultaneously. Even though most of MKL methods have been designed for classification purposes and used labels as supervisory information. However, they can be adapted to use supervisory information in the form of pairwise distance constraints [4] or triplets distance constraints [25, 27]. Recently, Chen et al. [25, 27] proposed methods for learning mobile applications similarity using weighted kernel combination approach.
Lin et al. [28] introduced Weighted Multiple Kernel Embedding (WMKE) method for learning a linear transformation on spaces resulted from weighted kernels combination. Although this method can model correlation between modalities, simple scaling and selecting kernels are the only degrees of freedom considered for integrating modalities. In Multiple Kernel Partial Order Embedding (MKPOE) method [4], distinct linear transformations are learned on kernel spaces simultaneously. Unlike WMKE, this method cannot directly model correlation between diverse modalities. However, it can transform modality spaces efficiently.
The learning stages of MKL approaches usually include optimization of a very big positive semi-definite (PSD) matrix. Therefore, these methods are not scalable to massive data in real-world applications such as multi-media retrieval tasks. Xia et al. [18] extended MKPOE method to an online mode by converting constrained optimization problem to its unconstrained equivalent and then projecting parameters to the constraints space. Xia et al. [23] proposed a similar method called Online Multiple Kernel Similarity Learning (OMKS) for CBIR application. They used different features extracted from image as modalities. To increase performance, several different kernels are considered for each modality and these kernels are combined to find the similarity measure. They proposed an efficient two-stage optimization technique to find kernel space transformations and optimal combination weights. Wu et al. [29] proposed a similar method called OM-DML which simultaneously learns a distinct linear transformation on each modality and also optimal weights for combining modalities. By directly learning a linear transformation instead of learning a Mahalanobis metric, OM-DML eliminates the time-consuming Positive Semi-Definite (PSD) projection step required in the OMKS algorithm. This method is also able to seek low-rank solutions by setting the number of dimensions for new spaces to be less than the number of input dimensions. In this paper, we propose the MMD-DML method which explicitly learns a non-linear transform having the advantage of kernel-based approaches while it does not need the PSD constraints (similar to the methods like OM-DML). As opposed to the MKL methods that fix the base kernels, our method can learn a flexible non-linear transform on each modality. Furthermore, unlike MKPOE, OMKS, and MMDDML methods, our method has the ability to model intermodal correlations using the joint multi-layer network on the top of the modality specific networks.
Probabilistic shallow and deep network models for multi-modal data
Shallow and deep networks are capable of providing a powerful framework using nonlinear activation functions or diverse conditional probability distributions and have been used extensively in various areas including multi-modal tasks [10, 12, 14, 15, 16, 22].
Harmonium [30] is a shallow probabilistic model containing a layer of latent variables as a hidden representation of data. Dual-Wing Harmonium (DWH) [22] is an extension from exponential Harmonium [31] which is applicable to data with two modalities in the visible layer. In this model, image and annotations (along with image) are embedded into a shared latent space. Assumptions about conditional probability distribution can be leveraged as prior information about data. Xie and Xing [12] extended DWH in their Multi-Modal Distance Metric Learning (MM-DML) method for distance metric learning through minimizing the cost function that has been defined according to similar and dissimilar pairs. Chen et al. proposed supervised extensions of DWH for large margin predictive subspace learning [15, 16]. Supervisory information in the form of labeled data is utilized in these methods.
Several models of multi-modal deep networks have been proposed in recent years. Most of them are unsupervised methods that model data distribution [10, 14]. Some of the existing methods try to find a latent space that can be constructed by each modality [11, 13]. These methods are useful in cross-modal tasks. For example, in multi-modal retrieval based on Stacked Auto Encoders (SAEs) [11], an SAE is trained for each of the two modalities of image-tag bimodal data. After that, these methods try to minimize Euclidean distance between the latent representation of the images and that of their associated tags. Feng et al. [13] proposed a similar method based on Restricted Boltzmann Machine (RBM) to map image and text into a low-dimensional common space for cross-modal retrieval task. They used correlation-based loss function to maintain correspondence between distinct deep RBMs of modalities. A deep model using Canonical Correlation Analysis (CCA) [32] to find a shared latent space has also been introduced in [33]. In this model, each modality is transformed through a separate deep network to a space where the inter-modal correlation of the transformed modalities is maximized. Ngiam et al. [10] proposed an effective Multi-Modal Deep Network (MMD) model that learns a shared representation from different modalities in an unsupervised manner. The MMD model is pre-trained in a greedy layer-wise manner and then fine-tuned for multi-modal or cross-modal tasks by back-propagation. Srivastava et al. [14] proposed Multi-Modal Deep Boltzmann Machines (MMDBM) as an unsupervised method that assigns a deep network to each modality and uses a layer on the top of these networks to find a shared latent space. In this method, for each layer, an RBM is used and the model is trained in a layer-wise manner using contrastive divergence. This method is similar to the MMD method [10] but uses DBM instead of SAE.
Preliminaries
In this section, we present some definitions and also some basic ideas about metric learning that have been presented in the previous works.
Definitions
In this part, some definitions are provided for the terms used in the following sections and some basic ideas about metric learning are presented.
A multi-modal vector space is
Given a query object
Similar and dissimilar pair sets are defined as:
For each
In this section, we first present some important and popular optimization problems for metric learning. Then, the most popular multi-modal metric learning method is introduced. Xing et al. proposed a distance metric learning method that minimizes the distance between similar pairs while separating dissimilar pairs by a margin [1]. Hence, the optimization problem does not consider any loss for dissimilar pairs that are far enough from each other:
where
Here,
Xie et al. proposed the MM-DML method [11] with the following optimization problem based on dual-wing harmonium:
where
where
MM-DML method utilizes the stochastic gradient descent to directly optimize the feature transformation instead of learning the Mahalanobis metric (
In general, learning a non-linear transformation has some advantages to learning a Mahalanobis metric or learning a kernel matrix. Deep networks provide a powerful framework to learn flexible non-linear transformations. Nonetheless, all of the existing deep metric learning methods [6, 7, 8, 9, 10] are proper for input data containing only one modality.
In this section, we propose the MMD-DML method that uses the deep learning approach to find a flexible non-linear transformation leading to an effective distance metric for multi-modal data. We use a multi-stage pre-training phase utilizing unlabeled multi-modal data. Then, we impose margin constraints for both similar and dissimilar pairs via an optimization problem inspired by the ITML method [3]. The batch-mode gradient descent technique is utilized to find the solution of the proposed optimization problem that considers similar/dissimilar pairs.
Optimization problem
Figure 1b shows the unfolded structure of the proposed architecture in our MMD-DML method. This model has a separate SAE with an arbitrary number of layers for each modality. Joint SAE (JSAE) takes the concatenation of the latent representations of the modalities as its input layer and provides a shared representation as the output.
The depth of the SAE considered to the
The notation symbols used in our method
The notation symbols used in our method
Finally, we define the optimization problem of MMD-DML as:
where
where
Using hinge losses instead of hard margin constraints in Eq. (4.1), we obtain:
where
In the next subsections, we first introduce a pre-training algorithm to initialize the parameters of our MMD-DML model. Then, a gradient descent optimization technique is utilized to solve the optimization problem in Eq. (4.1) as the fine-tuning step in the proposed model. Since the hinge loss terms in this equation are not differentiable, we simply use sub-gradient technique by considering the gradient of hinge loss equal to zero in non-differentiable points.
Ngiam et al. [10] proposed a pre-training method for MMD in which the network is first initialized in a greedy layer-wise manner by sparse RBMs. After that, the unfolded MMD network is pre-trained by the backpropagation algorithm.
In our method, unsupervised pre-training of the network consists of three major steps. Different stages for pre-training are shown in Algorithm 1. The first step includes pre-training of the SAE of each modality (Fig. 2a). To achieve a proper starting point, every layer is first initialized by Singular Value Decomposition1 (SVD) and then pre-trained by the backpropagation2 algorithm to provide a suitable dimensionality reduction for the next layer (Fig. 2b). The SAE whose layers are found in this greedy manner (one after the other) is then trained as a whole multi-layer network by the backpropagation algorithm (Fig. 2c). Indeed, we train the network allocated to the
where
In the second step, the JSAE is pre-trained in a similar manner using inputs provided by the modality specific SAEs (Fig. 3). Eventually, in Step 3 of Algorithm 1, the whole unfolded network (Fig. 1b) is pre-trained by the backpropagation algorithm to find the shared representation that minimizes sum of the squared reconstruction error over all the modalities (i.e. the first term in Eq. (4.1)).
Pre-training of the modalities SAEs (Step 1 of Algorithm 1). (a) SAE of the m-th modality. (b) Layer-wise SAE pre-training of the m-th modality (for the l-th layer) by firstly using SVD initialization and then update weights of this network (that has one hidden layer) using error backpropagation on the reconstruction error. (c) Backpropagation to minimize the reconstruction error for the unfolded SAE of each modality.
Pre-training of Joint SAE (Step 2 of Algorithm 1). (a) Greedy layer-wise pre-training of Joint SAE by firstly using SVD initialization and then using backpropagation to minimize reconstruction error of each layer (b) Backpropagation to minimize the reconstruction error in the whole unfolded joint SAE.
In this section, we use the gradient descent method to fine-tune the pre-trained MMD-DML network by considering similar/dissimilar distance losses in the second and the third terms of Eq. (4.1). By utilizing distance losses in Eq. (4.1), we optimize MMD-DML parameters (weights and biases of MMD-DML encoders) as:
As mentioned in Section 4.1, hinge losses in the above objective function are not differentiable in zero and we use sub-gradient strategy to train our model. In other words, the sub-gradient of the hinge loss is defined as:
Finally, the gradient of the cost function Eq. (10) is calculated as:
We utilized the batch-mode stochastic gradient descent technique. Therefore, in each step, we calculate Eq. (4.3) for a mini-batch of similar/dissimilar pairs that is a subset of
The first term on the right-hand side of Eq. (4.3) is the gradient of
In this paper, we use various feature types such as SIFT and GIST as different modalities of image data and evaluate MMD-DML in CBIR as a multi-modal retrieval task.
Datasets
To assess the efficacy of our method in CIBR tasks, we evaluate it on three widely-used datasets:3 Caltech-256, Corel5k, and Indoor. These are the most common datasets for CIBR tasks and following [23] as the most related work to ours, we select these datasets. Caltech-256 dataset has 256 image categories plus an extra class named “Clutter” [35]. Similar to the work by Chechik et al. [36], we choose 10, 20, and 50 classes from this dataset. In our experiments, these subsets are referred to as “Cal10”, “Cal20”, and “Cal50” respectively. Corel5k has 50 diverse image categories collected from COREL image CDs [37]. Contrary to Caltech-256, which has varied number of images in each category, each of the classes in Corel5k contains exactly 100 images. Indoor is a dataset previously used for indoor scene recognition [38]. This dataset has 67 categories, each of which contains at least 100 images.
Following the work of Xia et al. [23], in order to avoid the dominating effect of one class with high number of images, we find the number of images in the smallest class and randomly choose samples of this size from each class so that the number of images is the same for all classes. Then, we randomly split the data into four partitions – that is – training set, validation set, query set, and test set. Training set contains 50% of images and is used for pre-training the network and extracting pairwise constraints. Validation set contains 10% of images and is used for tuning the hyper-parameters. Query set and test set contain 10% and 30% of the images respectively that are used for evaluation of the method. For comparison of different methods, query objects are chosen from the query set and the test set is regarded as the target domain. To extract pairwise constraints, we create all possible similar pairs in the training set and for each similar pair
Extracted features
Similar to [23], we use several types of features from each image. These features are Local Binary Pattern, GIST features, Gabor wavelets, color histogram and color moments, edge direction histogram, SIFT features, and SURF features. For SIFT and SURF features, we use 200 and 1000 as codebook size, thus generating four types of features called SIFT200, SIFT1000, SURF200, and SURF1000. Using PCA, we extract 100 features from each feature set whose dimension is more than 100.
Choosing distance metric and margins
We use the measure defined below as the distance metric in Eq. (4.1):
This is the distance metric related to cosine similarity and ranges over
As mentioned by Xing et al. [1], the margin value in Eq. (2) corresponds to only scaling and different margin values may yield equivalent solutions. Suppose a near optimal (w.r.t. visual similarity) pre-trained MMD-DML model (see Section 4.2 for the pre-training stage). Distance between pairs in this model, w.r.t. Euclidian metric, ranges over
Performance of the network with different depths.
In this section, we evaluate networks with various numbers of layers and different numbers of units to find the desired architecture of the network. We start with a network having single-layer modality-specific SAEs and no JSAE. At each step, we increase the number of layers in modality-specific SAEs by one while fixing other hyper-parameters and evaluate the resulted model using mean Average Precision (mAP). This process is continued until adding a new layer decreases the network performance. The number of units in each layer is chosen so that the first layer reduces the dimensionality of each modality to 50 and the subsequent layers do not further reduce the dimensionality. Figure 4 shows performance of the network with respect to the number of layers on the different datasets. According to these results, performance tends to degrade when the number of layers goes beyond three (especially for datasets with the smaller number of samples) since the networks with the higher number of parameters are more prone to overfitting. Indeed, when we increase the number of layers, the flexibility of the model can also be promoted. However, the number of adjustable parameters (i.e. weights and biases) are raised and since in many datasets the number of training samples is not sufficient, the overfitting may occur when new layers are added. For the Indoor dataset, the largest dataset, the network can be five layers deep since we have more samples to train the network.
For each dataset, we then pick the network with the highest performance and replace its last layer with a single-layer JSAE. This network is evaluated using 64, 128, and 256 output neurons. The results are summarized in Table 2.
For the activation functions of the network, as recommended in [39], hyperbolic tangent is used due to its symmetry around the origin which allows faster convergence for all encoders and decoders except decoders of the first layers. Indeed, decoders of the first layers employed linear activation functions. The network is then trained in 300 iterations with a batch size of 250.
mAP of networks with different output widths
mAP of networks with different output widths
We compare our method with three recent methods introduced for multi-modal retrieval. As a baseline, we also report the results of the Unsupervised Multi-Modal Deep SAE (U-MMD-DML) as the unsupervised version of our method that can be considered as an extension of Bimodal Deep Network proposed in [10] with some difference in training (that have been mentioned in Section 4.1). Details of the methods used for comparison are provided below:
mAP of different retrieval methods
mAP of different retrieval methods
Evaluation of methods using precision at top-k.
Evaluation of methods using precision-recall curve.
We evaluate each method on five random splits of the datasets and average the obtained results. First, we use mAP measure to compare the performance of different methods and summarize the average results in Table 3. In Fig. 5, we report the performance of these methods in terms of the precision at top-k. We also compare these methods using 11-point interpolated precision-recall curve on the same dataset in Fig. 6.
It can be seen from the results that MMD-DML significantly outperforms all the other methods. MMD-DML’s ability to learn a nonlinear transform for each modality can serve as a reason for the remarkable difference between performance of our method and that of the MM-DML method. The relatively high performance of shallow MM-DML on Cal10 dataset does not scale well to larger datasets and it becomes suboptimal compared with the other methods. This validates that we can improve the results using deep models such as MMD-DML for large-scale tasks. Moreover, comparing the results of MMD-DML and U-MMD-DML, we find that in MMD-DML supervisory information improves performance with a large margin.
Evaluation of methods in terms of k-NN classification accuracy.
We also compare the methods in terms of k-nearest neighbor (k-NN) classification accuracy for various values of
mAP measure w.r.t. the ratio of pairwise constraints.
As mentioned in Section 5.1, we keep a ratio of pairwise constraints (i.e. supervisory information in the form of similar and dissimilar pairs) to train the supervised methods. We evaluate the methods while changing this ratio and summarize the results in Fig. 8. Several empirical observations can be inferred from these results. First, MMD-DML performs better than the other methods in most cases. Second, the biggest leap in the performance of the three methods results from the first 20% of the constraints. Third, as mentioned in [23], OMKS becomes nearly saturated after receiving the first 20% of the constraints and a similar phenomenon happens for MM-DML and OM-DML methods too. However, MMD-DML keeps taking advantage of supervisory information beyond this level since the MMD-DML model is more flexible and more supervisory information can help it to be trained more properly.
Conclusion
In this paper, we proposed the MMD-DML framework for distance metric learning on multi-modal data when supervisory information is available in the form of similar/dissimilar pairs. MMD-DML is capable of learning a complicated nonlinear similarity function on multi-modal data (with heterogeneous modalities). In other words, MMD-DML has the ability of learning intra- and inter-modal high-order statistics from raw features. High degree of freedom in MMD-DML hypothesis space is well controlled using an efficient multi-stage pre-training phase. In fact, we first used the properties of multi-modal data to pre-train the network and then fine-tuned it using the supervisory information. Experimental results show the superiority of the proposed method in the retrieval and classification tasks. Our method improves mAP measure on Cal10, Corel5k, and Indoor datasets respectively 7.2%, 4.1%, and 0.6% compared to the second best method (OMKS).
Footnotes
If we have large-scale data, we can simply ignore SVD steps or calculate SVD over a subset of examples.
Reconstruction loss function used in the first layer of each modality specific SAE can be selected depending on modality distribution. For other layers of every SAE, however, Euclidean reconstruction loss functions loss is common. All reconstruction loss minimization steps in Algorithm 1 are done by batch-mode gradient descent.
The datasets used in our experiments are available in project website of OMKS method
