Efficient modal-aware feature learning with application in multimodal hashing

Abstract

Many retrieval applications can benefit from multiple modalities, for which the representation of multimodal data is the critical component. Most deep multimodal learning methods involve two steps to construct the joint representations: 1) learning multiple intermediate features, each corresponding to one modality, using separate and independent deep models; 2) merging the intermediate features into a joint representation using a fusion strategy. However, in the first step, these intermediate features have no knowledge of each other and cannot fully exploit the information contained in the other modalities. In this paper, we present a modal-aware operation as a generic building block to capture the non-linear dependencies among the heterogeneous intermediate features, so that the underlying correlation structures in the other modalities are learned as early as possible, before fusion. The modal-aware operation consists of a kernel network and an attention network. The kernel network is utilized to learn the non-linear relationships with other modalities. The attention network finds the informative regions of these modal-aware features that are favorable for retrieval. We verify the proposed modal-aware feature learning in the multimodal hashing task. The experiments conducted on three public benchmark datasets demonstrate significant improvements in the performance of our method relative to state-of-the-art methods.

Keywords

Multimodal hashing, feature learning, multimodal retrieval, nearest neighbour search, multimodal fusion

1. Introduction

Multimodal hashing [1, 2], which embeds multimodal data into a single binary code and aims to improve performance by using complementary information provided by the different types of data sources, has drawn much attention. Since good representations are important for multimodal hashing, we only focus on designing a better feature learning approach in this paper.

To learn the representations, multimodal fusion [3, 4] is proposed, which aims to generate a joint representation from two or more modalities in favor of the given task. These fusion methods mainly fall into two categories [3]: model-agnostic approaches [5] and model-based approaches [6]. The model-agnostic methods do not use a specific machine learning method. According to the data processing stage, model-agnostic methods can be mainly split into early and late fusion. Early fusion immediately combines multiple raw/preprocessed data into a joint representation. In contrast, late fusion performs integration after all of the modalities have made decisions. The model-based approaches fuse the heterogeneous data using different machine learning models [7], e.g., the graphical models [8] and the deep networks [9].

Figure 1.

Illustration of two different feature extractions for multimodal data: (A) each modality uses individual neural layers to learn intermediate features; (B) our proposed modal-aware feature learning that can learn the non-linear dependencies among the heterogeneous data.

Recently, deep multimodal fusion has attracted much attention because it is able to extract powerful feature representations from raw data. As shown in Fig. 1 (A), the common practices for deep multimodal fusion are as follows [10, 11, 12]: 1) Each modality starts with several individual neural layers to learn an intermediate feature. 2) These multiple intermediate features are merged into a joint representation via a fusion strategy. Such a fusion approach is called intermediate fusion [13] because the powerful intermediate features obtained by deep neural networks (DNNs) are merged to construct the joint representation. Deep multimodal learning has been shown to achieve remarkable performance for many machine learning tasks; examples include deep cross-modal hashing [14] and deep semantic multimodal hashing [15].

Most existing methods focus on designing better fusion strategies, e.g., gated multimodal units (GMUs) [16] and multimodal compact bilinear pooling (MCB) [17] for data fusion, and only limited attention has been paid to the intermediate features. The multiple intermediate features are learned separately and do not fully utilize the underlying correlation structures in other modalities. For example, suppose we have two pictures, one of a white egg and another of a white ping-pong ball. Since these two images are very similar, we need to look carefully to distinguish them. Thus, the intermediate features learned only from the image modality are not efficient. If we can utilize the information from the textual modality, e.g., the textual tags “egg” and “ping-pong”, we can obtain more powerful and efficient intermediate features. Thus, a natural question arises: Can we incorporate information from other modalities to learn the intermediate features?

To this end, we propose a deep architecture to learn the modal-aware features for multimodal hashing. Unlike in other deep multimodal approaches, in which each intermediate feature is learned via several individual neural layers, our method learns the dependent and joint intermediate features among the heterogeneous data sources. As shown in Fig. 1 (B), the modal-aware feature learning module is proposed to produce new intermediate features, in which these new intermediate features are learned jointly and dependently. Hence, each intermediate feature consists of information from other modalities, and we can incorporate the information from other modalities to learn more powerful intermediate features.

In the context of multimodal retrieval, two factors are considered in the proposed modal-aware operation. The first consideration is how to learn the non-linear dependencies from other modalities. Inspired by the kernel methods [7], we present a kernel network to learn the underlying correlation structures in other modalities. Given two intermediate features from two modalities, we first calculate the kernel similarities, i.e., dot-product similarities, between the two features. Then, the similarities are used as weights to reweigh the original features. The second consideration is how to learn powerful intermediate features in favor of the binary hash codes. We propose an attention network that focuses on selecting the discriminative parts of the multimodal data. The uninformative parts will be removed and will not be used to encode the binary codes. Thus, our method can learn more efficient binary codes to some extent because the binary codes are generated from the informative parts of multimodal data that are favorable for retrieval. To fully explore all modalities, all intermediate features of the modalities are incorporated to learn the attention maps.

Finally, the main contributions of this paper are listed as follows.

• We present a modal-aware operation to learn the intermediate features. This operation can learn the information contained in other modalities before fusion, which helps capture data correlations.

• We propose a kernel network to capture the non-linear dependencies and an attention network to find the informative regions. These two networks learn better intermediate features for generating binary hash codes.

• Extensive experiments are conducted on three benchmark multimodal databases. Our method achieves better performance than several state-of-the-art baselines. The experimental results show the usefulness of the proposed modal-aware operation.

2. Related work

2.1 Multimodal fusion

Multimodal fusion is an important step for multimodal learning. A simple approach for multimodal fusion is to concatenate or sum the features to obtain a joint representation [18]. For instance, Hu et al. [19] concatenated text embeddings and visual features for image segmentation. Reconstruction methods were also proposed to fuse the multimodal data, e.g., the autoencoders [20] and deep Boltzmann machines [21]. They used only one modality as the input to reconstruct both modalities. Subsequently, inspired by the success of bilinear pooling and gated recurrent networks, multimodal compact bilinear pooling [17] was proposed to efficiently combine multimodal features, and Arevalo et al. [16] used a gated multimodal unit to determine how much each modality affects unit activation. To capture cross-modal signal correlations, Liu et al. [22] multiplicatively combined a set of mixed source modalities. Although many approaches have been proposed for multimodal fusion, these deep learning methods do not fully explore the dependencies among the modalities prior to the fusion operations. In this paper, we argue that capturing the dependencies among the heterogeneous modalities will benefit multimodal fusion.

2.2 Multimodal retrieval

Closely related work is that on cross-modal hashing [23, 24, 25], which aims to retrieve relevant data from a different modality. Many algorithms have been proposed, e.g., cross-view hashing (CVH) [26] and semantic correlation maximization (SCM) [27]. They used hand-crafted features to represent the modal data. Recently, deep-network-based methods [28, 23] have drawn much attention. Representative works include deep cross-modal hashing (DCMH) [14], cross-modal Hamming hashing [29], pairwise relationship guided deep hashing (PRDH) [30] and so on. Attention-aware deep adversarial hashing [31], self-supervised adversarial hashing (SSAH) [32], multi-label variational hashing [33] and semi-supervised cross-modal hashing (SCH-GAN) [34] apply adversarial learning to generate better binary codes. Although many approaches have been designed for cross-modal hashing, multimodal hashing and cross-modal hashing are different. The proposed multimodal hashing aims to learn joint representations rather than coordinated representations: the joint approach combines data from multiple modalities into a single representation space, while the coordinated approach learns the representation of each modality separately and enforces similarity preservation among the different modalities [3].

Other similar works include those on multi-view hashing methods that leverage multiple views to learn better binary codes. Some representative works include multiple feature hashing (MFH) [35], multi-view latent hashing (MVLH) [36], composite hashing with multiple information sources (CHMIS) [37], dynamic multi-view Hashing (DMVH) [38] and so on. In this paper, we only consider the multimodal data but not the multiple views, e.g., SIFT and HOG from the same image modality.

Limited attention has been paid to multimodal hashing. The work in [39] integrates deep hashing into secure multimodal biometric systems. Wang et al. [1] proposed deep multimodal hashing with orthogonal regularization to exploit the intra-modality and inter-modality correlations. Cao et al. [40] proposed an extended probabilistic latent semantic analysis (pLSA) to integrate the visual and textual information. There are multiple applications of large-scale retrieval in other domains, e.g., medical image analytics [41], lifelogging [42], etc. Different from these methods, we aim to learn better intermediate features (modal-aware) for multimodal hashing in this paper.

Figure 2.

Overview of deep multimodal hashing. It consists of three sequential parts: (A) feature learning module; (B) fusion module; and (C) hashing module. Please note that the intermediate features are learned separately. In this paper, we focus on learning better intermediate features.

3. Overview of deep multimodal hashing

In this section, we first briefly introduce the existing deep multimodal hashing framework.

Let $S=\{S_{i}\}_{i=1}^{n}$ denote a set of training samples, where each sample is represented in multiple modalities. For ease of presentation, we only consider two modalities, image and text, to explain our main idea. Each sample is denoted as $S_{i}=\{I_{i},T_{i},Y_{i}\}$ , where $I_{i}$ and $T_{i}$ are the image and text descriptions of the $i$ -th sample, and $Y_{i}$ is the corresponding ground-truth label. Let $H=\{H_{i}\}_{i=1}^{n}$ denote the binary codes, where $H_{i}\in\{-1,1\}^{l}$ is the $l$ -dimensional binary code for the $i$ -th sample $S_{i}$ . The goal of multimodal hashing is to learn effective hash functions that encode the sample $S_{i}$ into one binary code $H_{i}$ while preserving the similarities between the instances. For example, the Hamming distance between $H_{i}$ and $H_{j}$ should be small if the samples $S_{i}$ and $S_{j}$ are similar. When $S_{i}$ and $S_{j}$ are dissimilar, the Hamming distance should be large.
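The similarity-preserving goal above is stated in terms of Hamming distance between codes in $\{-1,1\}^{l}$. As a minimal sketch (the function names and the toy 8-bit codes are illustrative assumptions, not taken from the paper), the distance can be computed either by counting differing bits or from the inner product, since $d_{H}(H_{i},H_{j})=(l-\langle H_{i},H_{j}\rangle)/2$ for such codes:

```python
def hamming_distance(h_i, h_j):
    """Count the positions where two {-1, +1} codes differ."""
    return sum(1 for a, b in zip(h_i, h_j) if a != b)


def hamming_from_inner_product(h_i, h_j):
    """Same distance via the identity d = (l - <h_i, h_j>) / 2."""
    length = len(h_i)
    dot = sum(a * b for a, b in zip(h_i, h_j))
    return (length - dot) // 2


# Toy 8-bit codes (hypothetical values chosen for illustration).
h_a = [1, -1, 1, 1, -1, -1, 1, -1]
h_b = [1, -1, 1, -1, -1, -1, 1, 1]   # differs from h_a in two positions
h_c = [-x for x in h_a]              # differs from h_a in every position

print(hamming_distance(h_a, h_b), hamming_from_inner_product(h_a, h_b))
print(hamming_distance(h_a, h_c), hamming_from_inner_product(h_a, h_c))
```

The inner-product form is the one usually exploited in practice, because it turns distance computation into a matrix product over the whole database.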

Different from unimodal data, each instance consists of multiple unimodal signals. Combining these signals into a joint representation becomes a critical step. Currently, deep multimodal learning (DML) approaches have been shown to achieve remarkable performance because they can learn powerful features from all of the modalities. Merging these powerful features into a joint representation leads to better and more flexible multimodal fusion.

An illustration of a deep architecture for multimodal hashing is shown in Fig. 2. The architecture has three sequential parts: 1) the feature learning module, which learns the efficient intermediate features from the image and text raw data; 2) the multimodal fusion module, which merges the two intermediate features into a joint representation; and 3) the hashing module, which encodes the joint representations to the binary codes, followed by a similarity-preserving loss.

In the feature learning module, the convolutional layers are used to learn the powerful feature maps for the image modality. These images go through several convolutional layers to obtain the intermediate feature maps. For the text modality, the feed-forward neural network with stacked fully-connected layers is applied to encode the textual data into semantic text features.

In the fusion module, with two intermediate features, a fusion strategy is utilized to obtain a joint representation. Many methods for fusion have been proposed, e.g., concatenation, gated multimodal units (GMUs) [16] and multimodal compact bilinear pooling (MCB) [17].

In the hashing module, the joint representation is mapped into a feature vector with the desired length, e.g., an $l$ -bit approximate binary code. Then, the similarity-preserving loss is constructed to preserve the relative similarities of multimodal data.

However, in the above deep multimodal hashing framework, these intermediate features are learned separately and have no prior knowledge of other modalities before the fusion. To this end, we propose a modal-aware operation that aims to learn better intermediate feature representations. It contains a kernel network to learn the correlations among different modalities and an attention network that discovers the discriminative regions. The two aspects are described in detail in the next section.

Figure 3.

Illustration of the kernel network. The image feature maps $f^{I}$ with size of $H\times W\times C$ and the text feature vector $f^{T}$ with feature-length $K$ . “ $\otimes$ ” denotes matrix multiplication and “ $\odot$ ” denotes element-wise multiplication. “conv” and “fc” denote the convolutional and fully connected layers, respectively. “GAP” represents the global average pooling layer.

4. Modal-aware operation

In this section, we present a modal-aware operation that has two parts, i.e., the kernel network and attention network, to learn the modal-aware features.

4.1 Kernel network

The kernel network takes two intermediate features as inputs: the feature maps from the image modality and the feature vector from the textual modality. More specifically, suppose that $f^{I}\in\mathbb{R}^{H\times W\times C}$ denotes the feature maps for the image modality, where $H$ , $W$ and $C$ are the height, width, and number of channels, respectively. $f^{T}\in\mathbb{R}^{K}$ is the corresponding textual feature, where $K$ is the feature length.

Inspired by the non-local features [43] and kernel methods, the outputs of the kernel network are defined as

$\displaystyle\hat{f}^{I}=\mathbf{K}^{I}(f^{I},f^{T})f^{I},$
$\displaystyle\hat{f}^{T}=\mathbf{K}^{T}(f^{I},f^{T})f^{T},$ (1)

where $\mathbf{K}^{I}(f^{I},f^{T})$ and $\mathbf{K}^{T}(f^{I},f^{T})$ are the kernel functions that measure the similarity between the inputs $f^{I}$ and $f^{T}$ . We use the kernel methods to exploit the correlation structures contained in other modalities. In Eq. (1), the intermediate features of the image modality are learned from both the textual and image features. First, the kernel similarity between the image feature and textual feature is calculated. Then, this similarity is used to reweight the original feature. Thus, using these operations, textual information is embedded into the image feature. The same approach is used for the text modality. We note that we use different kernel functions because the textual feature is a one-dimensional vector while the image feature maps are three-dimensional tensors.

To easily learn the kernel network in an end-to-end manner, the kernel function $\mathbf{K}(f^{I},f^{T})$ is further expressed as the inner product in another space $\mathcal{H}$ , which is reformulated as

$\displaystyle\mathbf{K}(f^{I},f^{T})=\langle\phi(f^{I}),\varphi(f^{T})\rangle_{\mathcal{H}},$ (2)

where the $\phi(\cdot)$ and the $\varphi(\cdot)$ are mapping functions to project the data into another space. Since we use deep networks to learn the multimodal data, we also design two networks as these two mapping functions. That is, a convolutional layer and a fully connected layer are utilized as the mapping functions: $\phi(\cdot)$ is a convolutional layer, and $\varphi(\cdot)$ is a fully-connected layer.

Figure 3 shows the detailed structure of the proposed kernel network. For the image modality, the feature maps $f^{I}$ and textual vector $f^{T}$ are taken as inputs for the kernel network. The approach consists of three parts: 1) two mapping functions $\phi^{I}(f^{I})$ (a convolutional layer) and $\varphi^{I}(f^{T})$ (a fully connected layer) are first learned; 2) the kernel similarity is calculated using the inner product layer; and 3) the original features are reweighted using the kernel similarity. In the first part, $\phi^{I}$ is a convolutional layer with kernel size $1\times 1$ , and $\varphi^{I}$ is a single-layer neural network with transformation matrix $W\in\mathbb{R}^{K\times C}$ , so that the textual and visual features are mapped to the same dimension. The two mappings are formulated as

$\displaystyle V^{I}=\phi^{I}(f^{I}),T^{I}=\varphi^{I}(f^{T}).$ (3)

Since $V^{I}$ is a tensor while $T^{I}$ is a vector, we first reshape the feature maps by flattening the height and width of the original features: $V^{I}=[V^{I}_{1},\cdots,V^{I}_{M}]$ , where $V^{I}_{i}\in\mathbb{R}^{C}$ and $M=H\times W$ . The inner products between these $M$ features and the text feature $T^{I}$ can be calculated. The output of $\hat{f}^{I}$ can be defined as

$\displaystyle\hat{f}^{I}_{i}=\langle V^{I}_{i},T^{I}\rangle\ f^{I}_{i},\ \ \forall i=1,\cdots,M,$ (4)

where $\hat{f}^{I}_{i}$ is the $i$ -th vector corresponding to $V_{i}^{I}$ .
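The image branch of Eqs. (3) and (4) can be sketched in a few lines. This is a minimal NumPy illustration rather than the authors' PyTorch implementation, and it rests on stated assumptions: the $1\times 1$ convolution $\phi^{I}$ is written as a per-position linear map with a $C\times C$ weight, biases and activations are omitted, and all sizes and weights are random toy values.

```python
import numpy as np


def kernel_image_branch(f_img, f_txt, w_phi, w_varphi):
    """Reweight each spatial feature of f_img by its similarity to the text."""
    height, width, channels = f_img.shape
    flat = f_img.reshape(height * width, channels)  # M x C, with M = H * W
    v = flat @ w_phi        # phi^I: a 1x1 convolution is a per-position linear map
    t = f_txt @ w_varphi    # varphi^I: maps the K-dim text feature to C dims
    sim = v @ t             # Eq. (4): one inner product per spatial position
    f_hat = sim[:, None] * flat  # scale every C-dim feature by its similarity
    return f_hat.reshape(height, width, channels), sim.reshape(height, width)


rng = np.random.default_rng(0)
H, W, C, K = 7, 7, 16, 8                      # toy sizes (assumptions)
f_img = rng.standard_normal((H, W, C))        # stands in for f^I
f_txt = rng.standard_normal(K)                # stands in for f^T
w_phi = 0.1 * rng.standard_normal((C, C))     # hypothetical 1x1 conv weight
w_varphi = 0.1 * rng.standard_normal((K, C))  # the K x C matrix W in the text

f_hat_img, sim_map = kernel_image_branch(f_img, f_txt, w_phi, w_varphi)
print(f_hat_img.shape, sim_map.shape)
```

The similarity map has one scalar per spatial position, so the text feature effectively decides which image locations are amplified or suppressed.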

A similar approach is used for the text modality. First, the global average pooling (GAP) layer reduces $f^{I}$ with dimensions $H\times W\times C$ to dimensions $1\times 1\times C$ . Let $\bar{f}^{I}$ denote the output of the GAP layer. Since $\bar{f}^{I}$ is a vector, $\phi^{T}$ and $\varphi^{T}$ are two fully connected layers:

$\displaystyle V^{T}=\phi^{T}(\bar{f}^{I}),\ T^{T}=\varphi^{T}(f^{T}),$ (5)

where $\phi^{T}$ is connected with the transformation matrix $W_{\phi^{T}}\in\mathbb{R}^{C\times K}$ and $\varphi^{T}$ is connected with the transformation matrix $W_{\varphi^{T}}\in\mathbb{R}^{K\times K}$ . Finally, the output for the text modality can be formulated as

$\displaystyle\hat{f}^{T}=\langle V^{T},T^{T}\rangle\ f^{T}.$ (6)
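The text branch of Eqs. (5) and (6) follows the same pattern but yields a single scalar similarity. The sketch below is a NumPy illustration under the same assumptions as before (no biases or activations, random toy weights); the weight shapes follow the $C\times K$ and $K\times K$ matrices given in the text.

```python
import numpy as np


def kernel_text_branch(f_img, f_txt, w_phi_t, w_varphi_t):
    """Reweight the text feature by its similarity to the pooled image feature."""
    f_bar = f_img.mean(axis=(0, 1))  # GAP: H x W x C -> C
    v = f_bar @ w_phi_t              # phi^T: C -> K
    t = f_txt @ w_varphi_t           # varphi^T: K -> K
    sim = float(v @ t)               # Eq. (6): a single scalar similarity
    return sim * f_txt, sim


rng = np.random.default_rng(1)
H, W, C, K = 7, 7, 16, 8                        # toy sizes (assumptions)
f_img = rng.standard_normal((H, W, C))
f_txt = rng.standard_normal(K)
w_phi_t = 0.1 * rng.standard_normal((C, K))     # W_{phi^T}
w_varphi_t = 0.1 * rng.standard_normal((K, K))  # W_{varphi^T}

f_hat_txt, sim = kernel_text_branch(f_img, f_txt, w_phi_t, w_varphi_t)
print(f_hat_txt.shape, sim)
```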

Figure 4.

Illustration of the attention network. “GAP” represents the global average pooling layer, and “fc” denotes the fully connected layer. “ $C$ ” denotes the concatenation of two vectors and “ $\odot$ ” denotes element-wise multiplication.

4.2 Attention network

Inspired by how humans process information, we propose an attention network that adaptively focuses on salient parts to learn more powerful multiple intermediate features. To compute the attention efficiently, we aggregate information from all intermediate features. That is, we exploit both features rather than using each independently to locate the informative regions. The detailed operations are described below.

Figure 5.

The proposed modal-aware feature learning for multimodal hashing. Two modal-aware operations were added in the feature learning module.

Figure 4 shows the specific structure of the attention network. First, the visual feature maps $\hat{f}^{I}$ are forwarded to a GAP layer to produce a visual vector $F^{I}$ . Then, we concatenate visual and textual features as $F=[F^{I};\hat{f}^{T}]$ , which contains information from different modalities. The feature $F$ goes through two different networks to separately produce attention maps for the image and textual features. Both the networks consist of a single layer and a softmax function to obtain the attention distributions.

$\displaystyle a^{I}=\textit{softmax}(W_{I}F+b_{I}),$
$\displaystyle a^{T}=\textit{softmax}(W_{T}F+b_{T}),$ (7)

where $W_{I}\in\mathbb{R}^{(C+K)\times C}$ and $W_{T}\in\mathbb{R}^{(C+K)\times K}$ are transformation matrices. $b_{I}$ and $b_{T}$ are model biases. Here, $a^{I}$ is also called the channel attention map [44], which exploits the inter-channel relationship of the features. The main difference is that our method uses both the visual and textual features from all modalities to find the salient channels. Then, element-wise multiplication is used to get the final outputs $\widetilde{f}^{I}$ and $\widetilde{f}^{T}$ , which can be defined as

$\displaystyle\widetilde{f}^{I}(:,:,i)=a^{I}_{i}\hat{f}^{I}(:,:,i),\ \forall i=1,\cdots,C,$
$\displaystyle\widetilde{f}^{T}_{i}=a^{T}_{i}\hat{f}^{T}_{i},\ \forall i=1,\cdots,K,$ (8)

where $\hat{f}(:,:,i)$ is the $i$ -th channel with size $H\times W$ and $a_{i}$ is the $i$ -th value in vector $a$ .
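Eqs. (7) and (8) amount to a shared joint vector feeding two softmax heads whose outputs rescale the channels of each modality. A minimal NumPy sketch follows; the weights are random toy values, and because the matrices are stored with shape $(C+K)\times C$ and $(C+K)\times K$ as in the text, the product is written as `joint @ w` (an implementation choice made here, not something the paper specifies).

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = np.exp(x - np.max(x))
    return z / z.sum()


def attention(f_hat_img, f_hat_txt, w_i, b_i, w_t, b_t):
    """Channel attention computed from both modalities, as in Eqs. (7)-(8)."""
    pooled = f_hat_img.mean(axis=(0, 1))         # F^I via GAP, length C
    joint = np.concatenate([pooled, f_hat_txt])  # F = [F^I; f_hat^T], length C + K
    a_img = softmax(joint @ w_i + b_i)           # attention over the C channels
    a_txt = softmax(joint @ w_t + b_t)           # attention over the K text dims
    f_tilde_img = f_hat_img * a_img              # broadcasts over the channel axis
    f_tilde_txt = f_hat_txt * a_txt
    return f_tilde_img, f_tilde_txt, a_img, a_txt


rng = np.random.default_rng(2)
H, W, C, K = 7, 7, 16, 8                         # toy sizes (assumptions)
f_hat_img = rng.standard_normal((H, W, C))
f_hat_txt = rng.standard_normal(K)
w_i, b_i = 0.1 * rng.standard_normal((C + K, C)), np.zeros(C)
w_t, b_t = 0.1 * rng.standard_normal((C + K, K)), np.zeros(K)

f_tilde_img, f_tilde_txt, a_img, a_txt = attention(
    f_hat_img, f_hat_txt, w_i, b_i, w_t, b_t
)
print(f_tilde_img.shape, f_tilde_txt.shape)
```

Because each attention vector sums to one, the operation redistributes weight among channels rather than changing the overall scale arbitrarily.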

5. Implementation details

The proposed modal-aware feature learning for multimodal hashing is shown in Fig. 5. We apply modal-aware operations in the earlier layers. Please note that the text modality has only two fully connected layers. The two proposed operations are applied after each fully connected layer.

5.1 Network architectures

For the image modality, ResNet-18 [45] is used as the basic architecture to learn the powerful feature representations, which has shown great success in many machine learning tasks. In the ResNet-18, the last global average pooling layer and the last 1000-d fully-connected layer are removed. The feature maps in Conv4_2 and Conv5_2 are utilized as the intermediate features $f^{I}$ , respectively. For the text modality, the well-known bag-of-words (BoW) vectors are used as the inputs. Then, the vectors go through a deep neural network (BoW $\to$ 8192 $\to$ 512) to obtain the semantic text features $f^{T}$ .

After the modal-aware operation, we have two features: $\widetilde{f}^{I}$ and $\widetilde{f}^{T}$ . Since $\widetilde{f}^{I}$ is a tensor, the GAP layer is utilized to encode $\widetilde{f}^{I}$ into a vector $\widetilde{F}^{I}$ . Then, a simple approach that concatenates these two features is applied to obtain a joint representation. Let $F=[\widetilde{F}^{I};\widetilde{f}^{T}]$ denote the joint representation. The joint representation is forwarded to an $l$ -way fully-connected layer to obtain the $l$ -bit binary codes $H$ .
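The fusion and hashing steps just described can be sketched as follows. This is a NumPy illustration with random toy weights; thresholding the layer output with a sign function to obtain codes in $\{-1,1\}^{l}$ is an assumption made here for illustration, since the text only says the layer yields the $l$-bit codes.

```python
import numpy as np


def hash_head(f_tilde_img, f_tilde_txt, w_hash, b_hash):
    """GAP + concatenation + l-way fully connected layer, then binarization."""
    pooled = f_tilde_img.mean(axis=(0, 1))         # F~^I, length C
    joint = np.concatenate([pooled, f_tilde_txt])  # F = [F~^I; f~^T], length C + K
    logits = joint @ w_hash + b_hash               # l real-valued outputs
    codes = np.where(logits >= 0, 1, -1)           # assumed sign binarization
    return logits, codes


rng = np.random.default_rng(3)
H, W, C, K, L = 7, 7, 16, 8, 32                    # toy sizes; L is the code length
f_tilde_img = rng.standard_normal((H, W, C))
f_tilde_txt = rng.standard_normal(K)
w_hash = 0.1 * rng.standard_normal((C + K, L))     # hypothetical layer weights
b_hash = np.zeros(L)

logits, codes = hash_head(f_tilde_img, f_tilde_txt, w_hash, b_hash)
print(codes.shape)
```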

5.2 Training objective

The triplet ranking loss [46] is used to train the deep network. We note that other losses, e.g., contrastive loss [47], can also be used in our framework and the loss function is not our focus in this paper. Specifically, given a triplet of instances $(S_{i},S_{j},S_{k})$ , in which the instance $S_{i}$ is more similar to $S_{j}$ than to $S_{k}$ , these three instances go through the deep multimodal network, and the outputs of the network are $H_{i},H_{j}$ , and $H_{k}$ , which are respectively associated with the instances. The triplet ranking loss function is defined as

$\displaystyle\sum_{\langle i,j,k\rangle}\max\{0,\varepsilon+||H_{i}-H_{j}||-||H_{i}-H_{k}||\}$ (9)

where $\langle i,j,k\rangle$ is a triplet and $\varepsilon$ is the margin.
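As a worked illustration of Eq. (9), the sketch below evaluates the loss on two toy triplets in plain Python. The norm is taken to be the Euclidean norm, which is an assumption since Eq. (9) does not name it, and the 2-bit codes and margin value are hypothetical.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_ranking_loss(triplets, margin):
    """Sum of hinge terms max(0, margin + d(H_i, H_j) - d(H_i, H_k))."""
    total = 0.0
    for h_i, h_j, h_k in triplets:
        total += max(0.0, margin + euclidean(h_i, h_j) - euclidean(h_i, h_k))
    return total


# Triplets are (anchor, similar, dissimilar).
satisfied = ([1.0, 1.0], [1.0, 1.0], [-1.0, -1.0])  # similar item is closer
violated = ([1.0, 1.0], [-1.0, -1.0], [1.0, 1.0])   # similar item is farther

loss = triplet_ranking_loss([satisfied, violated], margin=1.0)
print(loss)
```

The first triplet already respects the margin and contributes nothing; the second contributes $\varepsilon+2\sqrt{2}$, so only ranking violations drive the gradient.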

6. Experiments

In this section, we conduct extensive experiments and compare the proposed method with several state-of-the-art algorithms.

6.1 Datasets

• NUS-WIDE [48]: This dataset consists of 269,648 images and the associated tags from Flickr. Each image is associated with several textual tags. The text for each point is represented as a 1,000-dimensional bag-of-words vector.
• MIR-Flickr 25k [49]: This dataset contains 25,000 images collected from Flickr. Each image has associated textual tags. The textual tags are represented as a 1,386-dimensional bag-of-words vector.
• IAPR TC-12 [50]: This dataset consists of 20,000 still natural images. Each image is associated with a text caption, which is represented as a 2,912-dimensional bag-of-words vector.

For all of the experiments, the experimental protocols of DCMH [14] are followed to construct the query sets, retrieval databases, and training sets. The NUS-WIDE dataset contains 81 ground-truth concepts. To prune the data without sufficient tag information, we select a subset of 195,834 image-text pairs from the 21 most-frequent concepts as suggested by [14]. The randomly sampled 2,100 image-text pairs (100 pairs per concept) are used as the query set, and the rest of the image-text pairs are constructed as the retrieval database. In the retrieval database, 10,000 image-text pairs are randomly selected to train the hash functions. In the MIR-Flickr 25k and IAPR TC-12 databases, the randomly sampled 2,000 image-text pairs are used as the query set. The rest of the pairs are used as the database for retrieval. We randomly select 10,000 pairs from the retrieval database to form the training set. The detailed characteristics of datasets are listed in Table 1.
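The split procedure for MIR-Flickr 25k described above (2,000 random queries, the remainder as the retrieval database, and 10,000 training pairs drawn from the database) can be sketched as follows. The function name and the fixed seed are illustrative assumptions; the actual DCMH protocol [14] may differ in details such as per-concept sampling, which NUS-WIDE additionally requires.

```python
import random


def split_indices(n_total, n_query, n_train, seed=0):
    """Return (query, database, train) index lists for one dataset."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    query = indices[:n_query]             # randomly sampled query set
    database = indices[n_query:]          # everything else is the retrieval database
    train = rng.sample(database, n_train) # training pairs come from the database
    return query, database, train


# MIR-Flickr 25k sizes as reported in the text.
query, database, train = split_indices(n_total=25000, n_query=2000, n_train=10000)
print(len(query), len(database), len(train))
```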

Table 1
Statistics of the three datasets, where $n_{\textit{train}}$ is the total number of samples in the training set, and $n_{\textit{test}}$ and $n_{\textit{db}}$ are the numbers of samples in the test set and the retrieval set, respectively

Datasets	$n_{\textit{train}}$	$n_{\textit{test}}$	$n_{\textit{db}}$
NUS-WIDE	10,000	2,100	193,734
MIR-Flickr	10,000	2,000	23,000
IAPR-TC12	10,000	2,000	18,000

6.2 Experimental settings

Our code is based on the open-source deep learning platform PyTorch (https://pytorch.org/).

For the image modality, ResNet-18 is adopted as the basic architecture. The weights of ResNet-18 are initialized with the pre-trained ImageNet model. For the text modality, the weights of all fully connected layers are randomly initialized following a Gaussian distribution with a mean of 0 and a standard deviation of 0.01. The networks are trained with a stochastic gradient solver, i.e., ADAM (weight_decay = 0.00001). The batch size is set to 100. The base learning rate is 0.0001, and after every 20 epochs, it is reduced to one-tenth of its current value. For a fair comparison, all deep learning methods use the same network architectures and the same experimental settings.
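The learning-rate schedule described above is a step decay. The short sketch below (a plain-Python helper with hypothetical parameter names) computes the rate in effect at a given 0-indexed epoch:

```python
def learning_rate(epoch, base_lr=1e-4, step_size=20, gamma=0.1):
    """Learning rate used during the given (0-indexed) epoch."""
    return base_lr * gamma ** (epoch // step_size)


for epoch in (0, 19, 20, 40):
    print(epoch, learning_rate(epoch))
```

In PyTorch, the same settings would typically be expressed with `torch.optim.Adam(params, lr=1e-4, weight_decay=1e-5)` together with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)`; whether the authors used exactly these classes is not stated.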

Evaluations: Following the common practice, the mean average precision (MAP), precision w.r.t different numbers of top returned samples, and precision-recall are used as the evaluation metrics. MAP is a widely used metric to measure the accuracy of the whole binary codes based on the Hamming distances. The precision-recall aims to measure the hash lookup protocol and the precision considers only the top returned samples.
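MAP under Hamming ranking can be sketched as follows. This is a minimal plain-Python illustration: the relevance matrix is supplied directly (how relevance is derived from labels follows the DCMH protocol [14] and is not reproduced here), ties in distance are broken by database order, and the toy codes are hypothetical.

```python
def hamming(h_i, h_j):
    """Number of differing positions between two {-1, +1} codes."""
    return sum(1 for a, b in zip(h_i, h_j) if a != b)


def average_precision(relevance_ranked):
    """AP of one ranked list of 0/1 relevance flags."""
    hits = 0
    precisions = []
    for rank, rel in enumerate(relevance_ranked, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(query_codes, db_codes, relevance):
    """relevance[q][d] is 1 if database item d is relevant to query q."""
    aps = []
    for q_code, rel_row in zip(query_codes, relevance):
        dists = [hamming(q_code, d_code) for d_code in db_codes]
        order = sorted(range(len(db_codes)), key=lambda idx: dists[idx])
        aps.append(average_precision([rel_row[idx] for idx in order]))
    return sum(aps) / len(aps)


queries = [[1, 1, 1, 1]]
database = [[1, 1, 1, 1], [1, 1, 1, -1], [-1, -1, -1, -1]]
relevance = [[1, 0, 1]]  # items 0 and 2 are relevant to the single query
map_value = mean_average_precision(queries, database, relevance)
print(map_value)
```

In the toy example the relevant items land at ranks 1 and 3, so the AP is $(1/1+2/3)/2=5/6$.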

Table 2

Comparison with state-of-the-art methods on NUS-WIDE dataset

Methods	NUS-WIDE
	16 bits	32 bits	48 bits	64 bits
DPSH	0.7057	0.7216	0.7252	0.7298
DSH	0.5712	0.5952	0.5998	0.6039
HashNet	0.7115	0.7252	0.7286	0.7317
DTH	0.7096	0.7193	0.7267	0.7362
TextHash	0.6027	0.6037	0.6088	0.6104
CSQ	0.6962	0.7271	0.7294	0.7281
Concat	0.7274	0.7391	0.7432	0.7495
GMU	0.7250	0.7416	0.7458	0.7569
MCB	0.7262	0.7421	0.7481	0.7510
Ours	0.7395	0.7563	0.7627	0.7639

Table 3

Comparison with state-of-the-art methods on MIR-Flickr 25k dataset

Methods	MIR-Flickr 25k
	16 bits	32 bits	48 bits	64 bits
DPSH	0.8262	0.8316	0.8304	0.8301
DSH	0.7234	0.7312	0.7390	0.7403
HashNet	0.8297	0.8333	0.8331	0.8328
DTH	0.8251	0.8332	0.8418	0.8406
TextHash	0.7154	0.7142	0.7121	0.7065
CSQ	0.8045	0.8436	0.8494	0.8488
Concat	0.8352	0.8453	0.8554	0.8508
GMU	0.8398	0.8465	0.8505	0.8552
MCB	0.8379	0.8444	0.8524	0.8528
Ours	0.8564	0.8658	0.8697	0.8723

Table 4

Comparison with state-of-the-art methods on IAPR TC-12 dataset

Methods	IAPR TC-12
	16 bits	32 bits	48 bits	64 bits
DPSH	0.5386	0.5448	0.5383	0.5355
DSH	0.4746	0.4851	0.4892	0.4926
HashNet	0.5391	0.5451	0.5379	0.5386
DTH	0.5662	0.5854	0.5920	0.6032
TextHash	0.5238	0.5487	0.5542	0.5623
CSQ	0.4836	0.5398	0.5728	0.5844
Concat	0.5762	0.5993	0.6213	0.6206
GMU	0.5694	0.6006	0.6207	0.6241
MCB	0.5721	0.5975	0.6149	0.6151
Ours	0.5925	0.6194	0.6330	0.6384

Figure 6.

The comparison results of precision-recall curves with 32 bits.

6.3 Comparison with state-of-the-art methods

In the first set of experiments, we compare our method with state-of-the-art baselines. We use two sets of baselines.

The first set of baselines belongs to the unimodal approaches. In this set of baselines, only one modality is utilized to train the deep networks. For the image modality, several state-of-the-art image hashing algorithms are selected: deep pairwise-supervised hashing (DPSH) [51], deep supervised hashing (DSH) [52], HashNet [53], deep triplet hashing (DTH) [46] and central similarity quantization (CSQ) [54]. DPSH and DSH belong to deep pair-wise approaches, and DTH is a triplet-based approach. HashNet minimizes the quantization errors of the binary codes. CSQ is a recent state-of-the-art method that optimizes the central similarity between data points. For a fair comparison, the deep architectures for these five methods are all the same as ours. For the text modality, we utilize the same network for text data, which is referred to as TextHash. TextHash only uses text representations to learn the hash codes.

The second set of baselines consists of different fusion strategies used to combine multiple modalities. We note that only the fusion module in Figure 2 uses different fusion strategies and the other modules are the same.

• Concat: We concatenate the intermediate features of both the image and text modalities to train the hashing architectures.
• GMU: A gated multimodal unit (GMU) [16] is an internal unit in a neural network for data fusion. GMU uses multiplicative gates to determine how modalities influence the activation of the unit.
• MCB: Multimodal compact bilinear pooling (MCB) [17] uses bilinear pooling [55] to combine visual and text representations.
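To make the gating idea behind the GMU baseline concrete, the sketch below follows the commonly cited bimodal formulation, $h=z\odot\tanh(W_{v}x_{v})+(1-z)\odot\tanh(W_{t}x_{t})$ with $z=\sigma(W_{z}[x_{v};x_{t}])$. This formulation is an assumption here, since the present paper does not spell it out, and the sizes and weights are random toy values rather than the baseline's actual configuration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gmu(x_v, x_t, w_v, w_t, w_z):
    """Bimodal gated fusion: a learned gate mixes the two modality activations."""
    h_v = np.tanh(x_v @ w_v)                       # visual hidden activation
    h_t = np.tanh(x_t @ w_t)                       # textual hidden activation
    z = sigmoid(np.concatenate([x_v, x_t]) @ w_z)  # gate computed from both inputs
    return z * h_v + (1.0 - z) * h_t, z


rng = np.random.default_rng(4)
C, K, D = 16, 8, 12                                # toy input and output sizes
x_v = rng.standard_normal(C)
x_t = rng.standard_normal(K)
w_v = 0.1 * rng.standard_normal((C, D))
w_t = 0.1 * rng.standard_normal((K, D))
w_z = 0.1 * rng.standard_normal((C + K, D))

fused, gate = gmu(x_v, x_t, w_v, w_t, w_z)
print(fused.shape)
```

Note that the gate only mixes features that were each computed from a single modality, which is the limitation the modal-aware operation is meant to address.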

Figure 7.
The comparison results of precision curve w.r.t. different numbers of top returned samples.

Tables 2–4 show the comparison results of the obtained MAP values for the three multimodal datasets. Figures 6 and 7 show the precision-recall curves and the precision curves on 32 bits, respectively. Our proposed method achieves the highest accuracy and beats all the baselines for most levels. The following observations can be made from the results.

1) Compared with the unimodal approaches, our method performs significantly better. For instance, it yields higher accuracy than TextHash, which uses only the text modality. Among the image hashing methods, HashNet obtains a MAP of 0.7115 on NUS-WIDE at 16 bits, whereas our method obtains 0.7395. On MIR-Flickr 25k at 32 bits, the MAP of DTH is 0.8332, while that of the proposed method is 0.8658. On IAPR TC-12, our method shows a relative increase of 4.6% $\sim$ 6.9% over the DTH algorithm. Note that DTH and our method use the same triplet ranking loss function, and DTH achieves excellent performance; even so, our method performs better than DTH. These results indicate that multimodal approaches can improve performance.
2) Compared with the other deep fusion strategies, the proposed method also performs best on all datasets. First, the only difference from the Concat approach is whether the modal-aware operations are used, so this comparison shows whether the modal-aware features improve accuracy. The results indicate that they do: for example, on NUS-WIDE at 16 bits the MAP of our method is 0.7395, compared to 0.7274 for Concat. Thus it is desirable to learn powerful features for multimodal retrieval. Second, our method also outperforms GMU and MCB, two baselines that themselves achieve excellent performance. The main reason is that our method incorporates information from the other modality when learning the intermediate features, whereas the intermediate features of GMU and MCB are learned by separate, modality-specific layers.
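
For readers unfamiliar with the metrics reported above, the sketch below shows one common way to compute average precision and precision over the top-k returned samples from a Hamming ranking. The toy codes, the label sets and the relevance rule (a database item is relevant if it shares at least one label with the query) are assumptions made for illustration; the exact evaluation protocol is the one described in the experimental settings.

```python
def hamming(a, b):
    """Number of differing bits between two equal-length binary codes."""
    return sum(x != y for x, y in zip(a, b))

def rank_database(query_code, db_codes):
    """Indices of database items sorted by Hamming distance to the query."""
    return sorted(range(len(db_codes)),
                  key=lambda i: hamming(query_code, db_codes[i]))

def average_precision(flags):
    """AP of one ranked list; flags[i] is 1 if the item at rank i+1 is relevant."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(flags, start=1):
        if rel:
            hits += 1
            total += hits / rank   # precision at each relevant position
    return total / hits if hits else 0.0

def precision_at_k(flags, k):
    """Fraction of relevant items among the top-k returned samples."""
    return sum(flags[:k]) / k

# Toy example (assumed data): 4-bit codes and label sets.
query_code, query_labels = [1, 0, 1, 1], {"sky"}
db_codes = [[1, 0, 1, 1], [1, 0, 0, 1], [0, 0, 0, 1], [0, 1, 0, 0]]
db_labels = [{"sky", "sea"}, {"dog"}, {"sky"}, {"car"}]

order = rank_database(query_code, db_codes)
# Assumption: an item is relevant if it shares at least one label with the query.
flags = [1 if query_labels & db_labels[i] else 0 for i in order]
ap = average_precision(flags)
p_at_2 = precision_at_k(flags, 2)
print(order, flags, round(ap, 4), p_at_2)
```

MAP is then the mean of `average_precision` over all queries, and a precision curve is obtained by evaluating `precision_at_k` for increasing k.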

Table 5
Ablation study on each component on three datasets

Methods	NUS-WIDE
	16 bits	32 bits	48 bits	64 bits
w/o KN	0.7295	0.7398	0.7467	0.7508
w/o AN	0.7339	0.7420	0.7519	0.7583
Ours	0.7395	0.7563	0.7627	0.7639
	MIR-Flickr 25k
w/o KN	0.8349	0.8481	0.8564	0.8557
w/o AN	0.8430	0.8555	0.8625	0.8644
Ours	0.8564	0.8658	0.8697	0.8723
	IAPR TC-12
w/o KN	0.5796	0.6037	0.6228	0.6261
w/o AN	0.5839	0.6073	0.6242	0.6326
Ours	0.5925	0.6194	0.6330	0.6384

Figure 8.
The comparison results of precision curves for ablation study.

6.4 Ablation study

In this set of experiments, we conduct an ablation study to elucidate the impact of each component of the proposed method on the final performance.

First, we explore the effect of the kernel network. In this baseline, referred to as w/o KN, the attention network is kept and the kernel network is removed; that is, the features are forwarded directly to the attention network. The only difference from our method is whether the kernel network is used.

The second baseline, referred to as w/o AN, explores the effect of the attention network. Here the kernel network is first applied to obtain the two intermediate features, which are then concatenated to form the joint representation. The only difference from our method is whether the attention network is used.
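
The three variants can be summarized as switches on the two modules. The following sketch only illustrates that control flow: `toy_kernel_net` and `toy_attention_net` are hypothetical stand-ins that record when they are called, and concatenation is used as a placeholder for the fusion step in every variant, which is an assumption made for brevity rather than a description of the full model.

```python
def joint_feature(img, txt, kernel_net, attention_net, use_kn=True, use_an=True):
    """Run the enabled modules, then concatenate (placeholder fusion)."""
    if use_kn:
        img, txt = kernel_net(img, txt)
    if use_an:
        img, txt = attention_net(img, txt)
    return list(img) + list(txt)

trace = []

def toy_kernel_net(img, txt):
    # Stand-in: a real kernel network would reweight the features.
    trace.append("KN")
    return img, txt

def toy_attention_net(img, txt):
    # Stand-in: a real attention network would select informative parts.
    trace.append("AN")
    return img, txt

variants = {"Ours": (True, True), "w/o KN": (False, True), "w/o AN": (True, False)}
modules_run = {}
for name, (use_kn, use_an) in variants.items():
    trace.clear()
    joint_feature([0.1, 0.2], [0.3], toy_kernel_net, toy_attention_net,
                  use_kn, use_an)
    modules_run[name] = list(trace)
print(modules_run)
```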

The comparison results are shown in Table 5 and Figure 8. Our method achieves better performance than both baselines. For instance, on NUS-WIDE at 48 bits, our method obtains a MAP of 0.7627, compared to 0.7519 for w/o AN and 0.7467 for w/o KN. On the IAPR TC-12 dataset, our method gains 1.63% to 2.60% in MAP over w/o KN. Figure 8 shows the precision curves on the three datasets; again, our method yields better accuracy at all levels. These results indicate that it is desirable to learn the intermediate features with both the kernel network and the attention network, and that our approach achieves better feature learning for multimodal hashing.
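
The quoted gain over w/o KN on IAPR TC-12 can be checked directly from the Table 5 values. The short computation below assumes the gain is a relative improvement, i.e. (ours − baseline) / baseline; under that reading the per-length gains range from roughly 1.64% (48 bits) to 2.60% (32 bits), which agrees with the reported range up to rounding.

```python
# MAP values copied from Table 5 (IAPR TC-12).
bits = [16, 32, 48, 64]
ours = [0.5925, 0.6194, 0.6330, 0.6384]
wo_kn = [0.5796, 0.6037, 0.6228, 0.6261]

# Relative gain in percent for each code length (assumed definition).
gains = {b: 100.0 * (o - k) / k for b, o, k in zip(bits, ours, wo_kn)}

for b in bits:
    print(f"{b} bits: {gains[b]:.2f}%")

smallest = min(gains, key=gains.get)
largest = max(gains, key=gains.get)
print("smallest gain at", smallest, "bits; largest gain at", largest, "bits")
```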

In this paper, the textual data is represented as a bag-of-words (BoW) vector. Other text representations, e.g., Sent2Vec or BERT [56], can also be used in our framework. For example, on the IAPR TC-12 dataset, each image is associated with a text caption, so Sent2Vec features computed with the pre-trained model³ can be used as text representations. Table 6 shows the comparison results. First, Sent2Vec performs slightly better than BoW at 16, 32 and 48 bits, which we attribute to the pre-trained model extracting better sentence features. Second, the difference between the two representations is not significant. The main reason is that our method incorporates information from the other modality, e.g., images, to learn more effective modal-aware features. Thus it is desirable to learn the modal-aware features.

³ https://github.com/epfml/sent2vec.

Table 6
Comparison results of different text representations

Method	IAPR TC-12
	16 bits	32 bits	48 bits	64 bits
BoW	0.5925	0.6194	0.6330	0.6384
Sent2Vec	0.5961	0.6232	0.6336	0.6357
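
As a reminder of what the BoW input looks like, the sketch below builds a small vocabulary from two made-up captions and turns each caption into a count vector. The captions, the whitespace tokenization and the function names are illustrative assumptions; a binary presence vector is an equally common variant, and the vocabulary used in the experiments is the one described in the dataset settings.

```python
from collections import Counter

def build_vocabulary(captions):
    """Map each distinct lowercase word to a fixed index (alphabetical order)."""
    words = sorted({w for c in captions for w in c.lower().split()})
    return {w: i for i, w in enumerate(words)}

def bag_of_words(caption, vocab):
    """Count vector over the vocabulary; words outside the vocabulary are ignored."""
    counts = Counter(caption.lower().split())
    # Dicts preserve insertion order, so positions follow the vocabulary indices.
    return [counts.get(w, 0) for w in vocab]

captions = ["a dog on the beach", "a red car on the road the"]
vocab = build_vocabulary(captions)
vectors = [bag_of_words(c, vocab) for c in captions]
print(list(vocab))
print(vectors)
```

Every caption is mapped to a vector of the same length regardless of how many words it contains, which is what allows a fixed-size text network to consume it.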

7. Conclusion

In this work, we proposed a modal-aware operation for learning good feature representations. The key to its success is a generic building block that captures the underlying correlation structures in heterogeneous multimodal data before multimodal fusion. First, we proposed a kernel network to learn the non-linear relationships: the kernel similarities between the two modalities are learned and used to reweight the original features. Then, we proposed an attention network, which selects the informative parts of the intermediate features. Experiments conducted on three benchmark datasets demonstrated the appealing performance of the proposed modal-aware operation.

Footnotes

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grants U1811263 and 61772211. This work is also supported by the Guangdong Basic and Applied Basic Research Foundation (2021A1515012172) and the Pearl River Nova Program of Guangzhou (201906010080).

References

1. Wang, Cui and Zhu, Deep multimodal hashing with orthogonal regularization, In Proceedings of the International Joint Conference on Artificial Intelligence, 2015.
2. Qiang, Wan, Liu, Xiang and Meng, Discriminative deep asymmetric supervised hashing for cross-modal retrieval, Knowledge Based Systems 204 (2020), 106188.
3. Baltrušaitis, Ahuja and Morency L.-P., Multimodal machine learning: A survey and taxonomy, IEEE Transactions on Pattern Analysis and Machine Intelligence 41(2) (2019), 423–443.
4. Huang, Zhang, Zhao and Li, Image-text sentiment analysis via deep multimodal attentive fusion, Knowledge Based Systems 167 (2019), 26–37.
5. D’mello S.K. and Kory, A review and meta-analysis of multimodal affect detection systems, ACM Computing Surveys 47(3) (2015), 43.
6. Liu, Zhou, Shen and Yin, Multiple kernel learning in the primal for multimodal alzheimer’s disease classification, IEEE Journal of Biomedical and Health Informatics 18(3) (2014), 984–990.
7. Gönen and Alpaydın, Multiple kernel learning algorithms, Journal of Machine Learning Research 12(Jul) (2011), 2211–2268.
8. Fidler, Sharma and Urtasun, A sentence is worth a thousand pixels, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1995–2002.
9. Rajagopalan S.S., Morency L.-P., Baltrusaitis and Goecke, Extending long short-term memory for multi-view structured learning, In Proceedings of the European Conference on Computer Vision, 2016, pp. 338–353.
10. Antol, Agrawal, Mitchell, Batra, Lawrence Zitnick and Parikh, Vqa: Visual question answering, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2425–2433.
11. Mroueh, Marcheret and Goel, Deep multimodal learning for audio-visual speech recognition, In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2015, pp. 2130–2134.
12. Ouyang, Chu and Wang, Multi-source deep learning for human pose estimation, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2329–2336.
13. Kim, Koh, Kim, Choi, Hwang and Choi J.W., Robust deep multi-modal learning based on gated information fusion network, arXiv preprint arXiv:1807.06233, 2018.
14. Jiang Q.Y. and Li, Deep cross-modal hashing, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
15. Jin, Tang G.-J. and Xiao, Deep semantic multimodal hashing network for scalable multimedia retrieval, arXiv preprint arXiv:1901.02662, 2019.
16. Arevalo, Solorio, Montes-y Gómez and González F.A., Gated multimodal units for information fusion, arXiv preprint arXiv:1702.01992, 2017.
17. Fukui, Park D.H., Yang, Rohrbach, Darrell and Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, In Proceedings of the Annual Meeting of the Association for Computational Linguistics, 2016, pp. 457–468.
18. Kiela and Bottou, Learning image embeddings using convolutional neural networks for improved multi-modal semantics, In EMNLP, 2014, pp. 36–45.
19. Rohrbach and Darrell, Segmentation from natural language expressions, In Proceedings of the European Conference on Computer Vision, 2016, pp. 108–124.
20. Ngiam, Khosla, Kim, Nam, Lee and Ng A.Y., Multimodal deep learning, In Proceedings of The International Conference on Machine Learning, 2011, pp. 689–696.
21. Srivastava and Salakhutdinov R.R., Multimodal learning with deep boltzmann machines, In Proceedings of the Neural Information Processing Systems, 2012, pp. 2222–2230.
22. Liu and Natarajan, Learn to combine modalities in multimodal deep learning, arXiv preprint arXiv:1805.11730, 2018.
23. Meng, Wang, Chen and Wu, Asymmetric supervised consistent and specific hashing for cross-modal retrieval, IEEE Transactions on Image Processing 30 (2021), 986–1000.
24. Wang, Yin, Wang and Wang, A comprehensive survey on cross-modal retrieval, arXiv preprint arXiv:1607.06215, 2016.
25. Shen H.T., Liu, Yang, Huang, Shen and Hong, Exploiting subspace relation in semantic labels for cross-modal hashing, IEEE Transactions on Knowledge and Data Engineering, 2020, p. 1.
26. Sun and Ye, A least squares formulation for canonical correlation analysis, In Proceedings of The International Conference on Machine Learning, 2008, pp. 1024–1031.
27. Zhang and Li W.-J., Large-scale supervised multimodal hashing with semantic correlation maximization, In Proceedings of the AAAI Conference on Artificial Intelligence, volume 1, 2014, p. 7.
28. Zhang and Shen H.T., Deep fuzzy hashing network for efficient image retrieval, IEEE Transactions on Fuzzy Systems 29(1) (2021), 166–176.
29. Cao, Liu, Long and Wang, Cross-modal hamming hashing, In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 207–223.
30. Yang, Deng, Liu, Tao and Gao, Pairwise relationship guided deep hashing for cross-modal retrieval, In Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 1618–1625.
31. Zhang, Lai and Feng, Attention-aware deep adversarial hashing for cross-modal retrieval, In Proceedings of the European Conference on Computer Vision, 2018, pp. 591–606.
32. Deng, Liu, Gao and Tao, Self-supervised adversarial hashing networks for cross-modal retrieval, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4242–4251.
33. Liong V.E. and Tan Y.-P., Adversarial multi-label variational hashing, IEEE Transactions on Image Processing 30 (2021), 332–344.
34. Zhang, Peng and Yuan, Sch-gan: Semi-supervised cross-modal hashing by generative adversarial network, IEEE Transactions on Systems, Man and Cybernetics 50(2) (2020), 489–502.
35. Kim, Kang and Choi, Sequential spectral learning to hash with multiple representations, In Proceedings of the European Conference on Computer Vision, 2012, pp. 538–551.
36. Shen, Sun Q.-S. and Yuan Y.-H., Multi-view latent hashing for efficient multimedia search, In ACM MM, 2015, pp. 831–834.
37. Zhang, Wang and Si, Composite hashing with multiple information sources, In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, 2011, pp. 225–234.
38. Xie, Shen, Han, Zhu and Shao, Dynamic multi-view hashing for online image retrieval, In Proceedings of the International Joint Conference on Artificial Intelligence, 2017.
39. Talreja, Valenti M.C. and Nasrabadi N.M., Deep hashing for secure multimodal biometrics, IEEE Transactions on Information Forensics and Security 16 (2021), 1306–1321.
40. Cao, Steffey, Xiao, Tao, Chen and Müller, Medical image retrieval: a multimodal approach, Cancer Informatics 13 (2014), CIN–S14053.
41. Zhang, Müller and Zhang, Large-scale retrieval for medical image analytics: A comprehensive review, Medical Image Analysis 43 (2018), 66–84.
42. Tran V.-L., Mai-Nguyen A.-V., Phan T.-D., A.-K., Dao M.-S. and Zettsu, An interactive multimodal retrieval system for memory assistant and life organized support, In Proceedings of the 2020 International Conference on Multimedia Retrieval, 2020, pp. 416–420.
43. Wang, Girshick, Gupta and He, Non-local neural networks, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
44. Woo, Park, Lee J.-Y. and So Kweon, Cbam: Convolutional block attention module, In Proceedings of the European Conference on Computer Vision, 2018, pp. 3–19.
45. Zhang, Ren and Sun, Deep residual learning for image recognition, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
46. Lai, Pan, Liu and Yan, Simultaneous feature learning and hash coding with deep neural networks, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3270–3278.
47. Hadsell, Chopra and LeCun, Dimensionality reduction by learning an invariant mapping, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, 2006, pp. 1735–1742.
48. Chua T.S., Tang, Hong, Luo and Zheng, Nus-wide: a real-world web image database from national university of singapore, In Proceedings of the ACM international conference on image and video retrieval, 2009, p. 48.
49. Huiskes M.J. and Lew M.S., The mir flickr retrieval evaluation, In Proceedings of the 1st ACM international conference on Multimedia information retrieval, 2008, pp. 39–43.
50. Escalante H.J., Hernández C.A., Gonzalez J.A., López-López, Montes, Morales E.F., Sucar L.E., Villaseñor and Grubinger, The segmented and annotated iapr tc-12 benchmark, Computer Vision and Image Understanding 114(4) (2010), 419–428.
51. Feature learning based deep supervised hashing with pairwise labels, In Proceedings of the International Joint Conference on Artificial Intelligence, 2016, pp. 3485–3492.
52. Liu, Wang, Shan and Chen, Deep supervised hashing for fast image retrieval, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2064–2072.
53. Cao, Long, Wang and Yu P.S., Hashnet: Deep learning to hash by continuation, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
54. Yuan, Wang, Zhang, Tay F.E., Jie, Liu and Feng, Central similarity quantization for efficient image and video retrieval, In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 3083–3092.
55. Lin T.-Y., RoyChowdhury and Maji, Bilinear cnn models for fine-grained visual recognition, In ICCV, 2015, pp. 1449–1457.
56. Devlin, Chang M.-W., Lee and Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.

² https://pytorch.org/.
³ https://github.com/epfml/sent2vec.