Abstract
Due to the emergence of the era of big data, cross-modal learning have been applied to many research fields. As an efficient retrieval method, hash learning is widely used frequently in many cross-modal retrieval scenarios. However, most of existing hashing methods use fixed-length hash codes, which increase the computational costs for large-size datasets. Furthermore, learning hash functions is an NP hard problem. To address these problems, we initially propose a novel method named Cross-modal Variable-length Hashing Based on Hierarchy (CVHH), which can learn the hash functions more accurately to improve retrieval performance, and also reduce the computational costs and training time. The main contributions of CVHH are: (1) We propose a variable-length hashing algorithm to improve the algorithm performance; (2) We apply the hierarchical architecture to effectively reduce the computational costs and training time. To validate the effectiveness of CVHH, our extensive experimental results show the superior performance compared with recent state-of-the-art cross-modal methods on three benchmark datasets, WIKI, NUS-WIDE and MIRFlickr.
Introduction
Due to the explosive growth of big data, cross-modal retrieval has occupied an increasingly important status in information retrieval. However, due to the heterogeneous between different modalities, it is impossible to directly represent other modalities data in its original feature space. Therefore, how to find the correlations between different modalities, and to achieve accurately retrieval has become increasingly vital. At present, classic mainstream cross-modal retrieval algorithms can be divided into three categories: subspace learning-based methods, deep learning-based methods and hashing learning-based methods.
A kind of methods based on subspace learning [9, 10, 12, 11, 13], which first use the paired symbiosis information of different modalities to learn projection matrix. Then project the different features into a latent common subspace by projection matrix, and obtain the similarity of different modalities in this subspace. Finally, sort similarity and implement cross-modal retrieval process. Although existing cross-modal retrieval based on subspace learning have made some progress, there still suffered from problems of differences in spatial structure and matching problems between different modalities. On the other hand, subspace learning can only learn the linear mapping, which cannot effectively model the higher-order correlations of different modalities.
Another kinds of retrieval methods are based on deep learning [14, 15, 16], which mainly use deep learning to extract the underlying features of heteromodal data, and then use high-level network to represent the semantic correlation between the modalities. Although deep learning-based methods have better retrieval performance than subspace methods, it does not match the spatial structure within the modal and the semantic connection between different modalities, also because of its complex network structure and high training costs, we only use several classic traditional methods in this paper.
To address these problems above, another kind of methods are based on hashing learning [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35] has been increasingly used in machine learning, computer vision and information retrieval due to its efficient retrieval performance and low computational costs. Cross-modal hashing methods generate compact binary codes for fast and efficient retrieval by mapping the original high-dimensional data into a low-dimensional feature space. However, most hashing-based methods map original data to fixed-length hash codes, which may not represent different modalities data well. Therefore, the accuracy of cross-modal retrieval cannot be guaranteed. In addition, solving the binary hash codes is an NP-hard problem [17, 18, 25]. Most of methods try to relax the binary constraints of hash codes, which resulting the learned binary codes are not accurate enough.
In order to solve these issues, we initially propose a novel cross-modal retrieval method called Cross-modal Variable-length Hashing Based on Hierarchy (CVHH). CVHH uses variable-length hashing codes to represent different modalities data to learn more accurate representation of different modalities data. Furthermore, in order to take into account the correlation between different modalities data on the global manifold structure, we construct a hierarchy structure based on algebraic multigrid (AMG) model [31, 33]. We assume that high-dimensional neighbor data is also similar in projected low-dimensional space. In fact, the AMG model is currently only used for unimodal retrieval. Therefore, in order to further improve the retrieval performance of cross-modal data and reduce the computational costs of training process, we apply AMG model into our cross-modal variable-length hashing method. The main contributions of CVHH can be summarized as follows:
We utilize variable-length hash codes to represent different modalities data, which reflects the original structure in their respective feature spaces. We construct a cross-modal hierarchical structure based on AMG model, and select the core instances to represent the original entire training set, which can greatly reduce the computational costs and training time. Experimental results on three benchmark cross-modal datasets show that our proposed CVHH is superior to other comparison methods in cross-modal retrieval.
The rest of this paper is organized as follows. In Section 2, we introduce some related work in the field of cross-modal retrieval. Section 3 presents the theoretical analysis and algorithm optimization of our CVHH in detail. Section 4 analyses and compares the experimental results of CVHH with other comparison methods. Finally, the conclusion is summarized in section 5.
In this section, we introduce two kinds of cross-modal methods, called latent subspace learning [9, 10, 12, 11, 13] and binary representation learning [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35], namely hashing learning.
Latent subspace learning
Latent subspace learning methods attempt to eliminate heterogeneity between different modalities, and match different features by the same subspace. The most typical one is named Canonical Correlation Analysis (CCA) [11], which finds the mapping vector of each modal to the common subspace by maximizing the correlation coefficient between two modalities. However, since the features of original data are greatly different, how to express the correlation between different original data has become the most important problem. In order to more accurately obtain the correlation between different modalities and implement better retrieval performance, hash learning methods are widely used in cross-modal retrieval tasks in recent years.
Hashing learning
Hashing learning can effectively improve the efficiency by mapping the original high-dimensional data into low-dimensional binary strings, which can significantly reduce the computational costs. Hashing learning is to learn the binary hash codes representation of original data, and to maintain the neighbor relationship in the original space, i.e., to maintain the similarity of the original data. Specifically, each sample is encoded by a compact binary string, so that two similar samples in the original feature space should remain similarity in compact binary space. Currently, hashing learning-based methods can be divided into unimodal hashing learning [18, 19, 20, 21, 22, 23, 24, 25, 26, 27], and cross-modal hashing learning [28, 29, 30, 31, 32, 33, 34, 35].
Unimodal hashing learning
Unimodal hashing learning can be generally divided into data-independent hashing methods and data-dependent hashing methods. As the most famous data-independent method, the main idea of Locality Sensitive Hashing (LSH) [18] is that the neighbor data in high-dimensional space can still maintain the neighboring relationship by mapping into hashing space. However, LSH uses random projection matrix to generate hash functions, which cannot represent the original data well. In contrast, data-dependent methods learn hash codes through the relationship between original data, which can generate hash functions more accurately compared with data-independent methods.
According to whether use label information in training process, data-dependent hashing methods can be roughly divided into unsupervised hashing, semi-supervised hashing and supervised hashing. Spectral Hashing (SH) [19] leverages Principal Component Analysis (PCA) [20] to calculate and order the eigenvalues of each principal component direction of image data. On the basis of SH, Anchor Graph Hashing (AGH) [21] uses the neighbor graph between the data clustering center and each instance to replace the original neighbor graph, and uses neighbor graph to replace the original neighbor matrix. Discrete Graph Hashing (DGH) [22] solves the discrete constrains by directly establishing anchor graph and the relaxation constrains. These methods are all unsupervised hashing algorithms by using feature vectors or anchor graphs to solve hash codes. The advantage of unsupervised hashing algorithms is that it reduces the training time, but the retrieval results are not satisfactory enough because the label information is not been used. Semi-supervised hashing methods use both original data information and a small amount of completed label information to generate hash codes. Semi-supervised hashing methods reduce the training time because they do not use all label information, while maintaining the relationship of the original data. Semi-Supervised Hashing (SSH) [23] combines metric similarity and semantic similarity, uses singular value decomposition to establish semi-supervised hashing function, and obtains non-orthogonal and orthogonal solutions of the coefficient matrix. Another one called Semi-Supervised Discriminant Hashing (SSDH) [24] uses unlabeled data for regularization, and uses labeled data to distinguish different categories associated with binary codes to avoid overfitting. Although semi-supervised methods have achieved better results than unsupervised methods, since the original data features cannot fully represent the semantic correlation between different data, complete label information appears to be significant. Under these circumstances, many label-based supervised hashing methods are more used in retrieval. Kernel-based Supervised Hashing (KSH) [25] leverages inner product of vectors to generate the kernel-base hashing function. FastHash [26] solves hashing codes by the classic graph cuts algorithm. Supervised Discrete Hashing (SDH) [27] maps the hash codes to the label information and solves hash codes bit-by-bit under discrete constraints using discrete cyclic coordinate descent (DCC) algorithm proposed in [27].
Cross-modal hashing learning
Cross-modal hashing learning methods are based on unimodal hashing learning to find the correlation between different modalities and implement the retrieval process. Specifically, cross-modal hashing retrieval methods first extract effective features from different modalities by using their own feature extraction methods. Each original high-dimensional feature is then projected into a low-dimensional Hamming space. Finally query data by ordering the “Top n” nearest neighbors in this low-dimensional Hamming space.
Cross-modal hashing methods can be divided into unsupervised cross-modal hashing and supervised cross-modal hashing according to whether use the label information. Unsupervised cross-modal hashing methods mainly refer to different modalities data having similarities in feature representation, and they are considered to have the same semantic information. Typical unsupervised cross-modal hashing methods are Cross-View Hashing (CVH) [28], Inter-Media Hashing (IMH) [29], and Collective Matrix Factorization Hashing (CMFH) [31], etc. CVH [28] extends from unimodal method SH [19], which is essentially to construct the objective function by minimizing the Euclidean distance with weights, and obtain approximate hash codes by mapping the nearest neighbor instances. IMH [29] maintains the consistency of hash function between and within modalities, i.e., the similar data in their original different modalities are have similar hash codes in cross-modal retrieval, and neighbor data in same modal are still similar in Hamming space. CMFH [31] assumes that hash codes of different modalities in common latent Hamming subspace are consistent, and uses cooperative matrix factorization method to find the potential representation of each modal. Another supervised cross-modal hashing models learn common subspaces by making full use of label information. On account of the category information is more distinguished in cross-modal data and can better represent original information, the supervised methods usually have higher retrieval performance. The classic supervised cross-modal hashing algorithms are Semantic Correlation Maximization (SCM) [33], Semantics-Preserving Hashing (SePH) [34], Double Alignment Based Hashing (DASH) [30], and Generalized Semantic Preserving Hashing (GSPH) [35]. SCM [33] is a linear cross-modal hashing method which using learned hash codes to reconstruct the similarity matrix, and using label information to improve the retrieval performance. SePH [34] first transforms the semantic matrix into a probability distribution and makes it as close as possible by minimizing the Kullback-Leibler divergence. Then uses a logistic regression with a kernel to learn the non-linear hash function for each modal. Because the first step of the training process is so complicated, it takes much longer training time than other comparison methods [34]. DASH [30] first uses unimodal hash methods to obtain the hash codes of different modalities, then directly uses linear regression model to learn the hash function of each modal. GSPH [35] first learns the optimal hash codes of each modal, while the hash codes maintain the semantic similarity between different modalities. Then learns the hash function that maps original data to binary space.
Although cross-modal hashing methods have made some progress, there still have some unsolved issues. (1) How to effectively find the correlations between heterogenous data; (2) How to efficiently learn hash functions and hash codes; (3) How to reduce the training time of cross-modal hashing retrieval. In this paper, we propose CVHH to obtain hash functions more efficiently and also reduce the computational costs. The specific implementation and optimization process is in Section 3.
The proposed CVHH method
In this section, we describe the algorithm analysis and optimization process of CVHH in detail. First we construct hierarchical structure to obtain the representative image-text pairs. Then we use these pairs to learn the hash functions by variable-length hashing model. Finally, we implement the cross-modal retrieval by ordering the similarity of hash codes in common Hamming space.
Notations
Suppose we have a set of
CVHH retrieval algorithm
This section we focus on the implementation process of CVHH. First, we combine the manifold learning hypothesis and algebraic multigrid (AMG) method [36, 37] to construct the hierarchical structure of entire training set, which can reduce the size of instances that actually participate in hash training process. Then we construct cross-modal variable-length hashing model to accommodate the different modalities in different length of hash codes. Finally we optimize the objective function and obtain the hash functions of each modal. Figure 1 shows the framework of our proposed CVHH.
The framework of proposed CVHH. First construct hierarchical structure and select the top-layer representative image-text pairs, then project top-layer pairs to the common Hamming space, finally generate the entire hash codes by similarity transfer matrix.
Manifold learning assumes that the data we observe is actually mapped from a low-dimensional manifold to a higher-dimensional space. By means of manifold learning, we can get the manifold distribution of data in the original space more accurately. In the process of constructing the hierarchical model, we first assume that the image-text pairs in training set are distributed on a manifold structure, and their local neighbor pairs have similar features. In each local structure of current layer, we select representative pairs as original data for the next layer. According to the same way, we can construct a bottom-up pyramid hierarchical structure for training set. The top layer is the most representative pairs with great differences between each pair. The bottom layer are all instances of training set. Figure 2 shows the overall flow of CVHH algorithm.
The overall flow chart of CVHH, example of using a new image to retrieve similar images or texts.
The challenge of constructing hierarchy structure is how to choose representative pairs strongly connected to the unselected pairs, which can ensure that the representative pairs can substitute all unselected pairs. The specific process is as follows. First, the bottom layer is the entire training set. We first construct a k-nearest neighbor graph based on the bottom layer, as layer 0, and calculate the weight matrix of layer 0. Then build a strongly-connected graph to select the data from bottom layer which are strongly connected to unselected data. Thus, each representative data can represent entire data in this layer. We store the selected data into layer 1, then we construct a new weight matrix based on layer 1, and new representative data are selected as layer 2 according to the strongly-connected graph by layer 1, and so on. Finally, we select the top layer data as the representative data for the entire training set.
First construct a k-nearest neighbor graph
where
where
where
Similarity transfer matrix can transfer similarities between any two adjacent layers. Therefore, the affinity matrix
We now have a hierarchical structure for the entire training set, and also obtain representative pairs for each layer. So after debugging and continuing the above, we can select the representative pairs of data from appropriate layers. The complete process can be summarized as Algorithm 1.
Construction of Hierarchical StructureImage-text pairs
Let
Calculate similarity transfer matrix
Update affinity matrix
Through Algorithm 1, we obtain representative pairs and introduce them into cross-modal variable-length hashing retrieval process.
We assume that there has a common latent semantic space
The similarity between images and texts in
Let
The first two terms project original images and texts data into their own low-dimensional binary spaces, respectively. The last one represents the semantic similarity of original data in cross-modal latent space. Then we can optimize the objective function and obtain the projection matrices
Since the objective function is non-convex, we first solve for one variable by fixing others, and then iteratively solve all variables until the objective function converges.
(1) Fix other variables and solve
,
, separately
The objective function can be simplified as follows:
Therefore,
(2) Fix other variables and solve
The objective function can be simplified as follows:
Then correlation matrix
(3) Fix other variables and solve
The objective function can be simplified as follows:
Due to the binary constraint problem, we learn the binary codes
where
We can see from Eq. (16) that, every bit in
(4) Fix other variables and solve
The solution of
In summary, the overall algorithm can be summarized as Algorithm 2.
Cross-modal Variable-length Hashing Based on HierarchyThe training set
In the query phase, assume there is a new image instance
The complexity analysis of CVHH is as follows. The hierarchy process can be processed offline, so the main time cost is cross-modal hashing retrieval process, i.e., calculate and update projection matrices
Experiments
In this section, we evaluate the effectiveness of CVHH, and compare CVHH with several state-of-the-art cross-modal retrieval algorithms. The evaluation metrics are precision-recall curves and MAP values, indicating that the performance of CVHH is superior to other comparison methods.
Datasets and features
In order to verify the performance of CVHH, we adopt three benchmark large-scale datasets, Wiki [38], MIRFlickr [39] and NUS-WIDE [40], which commonly used for cross-modal retrieval methods [31, 32, 33, 34, 35]. These datasets are all provide bimodal data, images and texts. The statistics of the three datasets are summarized in Table 1.
Wiki [38] is a common bimodal dataset contains 2866 instances, which collected from Wikipedia website. All instances are separated by 10 classes and each instance has its own semantic label. For each instance, its image is represented as a 128-dimensional SIFT feature vector, and its text is represented by a 10-dimensional feature vector generated by Latent Dirichlet Allocation (LDA). Wiki was initially split into 2173 instances as training set and 693 instances as test set.
MIRFlickr [39] is a public large-scale dataset collected from Flickr website which includes 25000 instances. each instance has several correlative labels from 24 classes. According to [31], we select 16738 image-text pairs to be used in our experiments. We randomly select 5% instances as the testing set and others as the training set. Each image is represented as a 150-dimensional edge histogram and each text is represented as a 500-dimensional feature vector generated from PCA on its index tagging vector.
NUS-WIDE [40] is a real-world large-scale dataset which contains 269648 instances, and every instance is annotated by at least one label of 81 ground truth semantic categories. According to the settings in [30], we choose the most common 10 categories and totally 186577 instances for our experiments. Each image is represented as a 500-dimensional Bag-of-Words feature vector and each text is represented as a 1000-dimensional binary tagging vector of the most frequent labels. We randomly select 1% instances as the testing set and others as the training set.
Statistics of three benchmark datasets
Statistics of three benchmark datasets
This section mainly describe the evaluation matric and comparison methods of our experiments. We focus on two typical tasks: 1) Img2Txt use images to query texts; 2) Txt2Img use texts to query images. During the query phase, we sort the distances of Hamming distance matrix in ascending order, and return the top 10 instances as the results. Furthermore, we adopt two common-used performance measurements, mean average precision (MAP) value and precision-recall curve to evaluate our CVHH with other comparison methods. For these two measurements, higher values indicate better performance. The MAP value is defined as:
where
We choose several typical cross-modal retrieval algorithms as our comparison methods, namely CCA [32], SCM [33], SePH [34] and GSPH [35]. Our experiments use the codes published by the authors with the same parameter settings as the original papers. In original articles [34, 35], both have two methods to learn hash functions in SePH and GSPH models: (1) SePH
The precision-recall curves of CVHH and other comparison methods in Img2Txt and Txt2Img in 64-bit hashing length. Subgraphs (a), (b) are the results on Wiki, (c), (d) are the results on MIRFlickr, and (e), (f) are the results on NUS-WIDE.
In this section, we compare CVHH with four typical cross-modal retrieval methods on three benchmark datasets. For comparison, the experimental results show the retrieval performance of CVHH and all comparison methods under 64-bit length of hash coding.
The overall precision-recall curves are shown in Fig. 3. Subgraph (a), (b) show the experimental results on Wiki dataset, (c), (d) show the results on MIRFlickr, and (e), (f) show the results on NUS-WIDE. It can be seen that the performance of CVHH is generally better than other comparison methods, for both Img2Txt and Txt2Img. Figure 3 also shows that CVHH improving significantly over comparison methods on MIRFlickr and NUS-WIDE, probably because CVHH is more suitable for multi-label datasets than other comparison methods.
Tables 2 and 3 show the MAP values of Img2Txt and Txt2Img, respectively. We mark the maximum MAP values for each column as black. It can be seen that our proposed CVHH has the highest retrieval performance among all comparison methods in both Img2Txt and Txt2Img tasks, which proves our proposed CVHH has higher retrieval accuracy. And we can also observe that with the increase of hash codes length, the MAP values of all methods have generally increased, which represent the retrieval performance will vary with the length of hash codes.
MAP values of CVHH and comparison methods in Img2Txt with different hashing length on three datasets
MAP values of CVHH and comparison methods in Img2Txt with different hashing length on three datasets
MAP values of CVHH and comparison methods in Txt2Img with different hashing length on three datasets
To visually compare the training time of CVHH with other comparison methods, Table 4 shows the training time of several methods on the three large data sets with 16-bit, 32-bit and 64-bit hash coding lengths. In Table 4, the training time of all methods increases with the increase of the hash code length, and even the training time of many algorithms increases exponentially. It can also be seen from Table 4 that the training time of two comparison methods SePH and GSPH is costs too much, which because these two methods need to construct pairwise similarity matrices. Compared with other algorithms, the computational costs are quite high. At the same time, it can be seen that compared with other comparison algorithms, CVHH has an absolute advantage in training time and is in an advantageous position in large-scale cross-modal data set.
The training time (seconds) of CVHH with all comparison methods in 16-bit, 32-bit and 64-bit hashing lengths on three datasets
Comparison of MAP values on every layer of CVHH for text query image and image query text in three datasets with 64-bit hashing length
The precision-recall curves of CVHH in Img2Txt and Txt2Img with different hashing length. Subgraphs (a), (b) are the results on Wiki, (c), (d) are the results on MIRFlickr, and (e), (f) are the results on NUS-WIDE.
Figure 4 shows the experimental results of our proposed CVHH in different length of hash codes (image hashing length * text hashing length). Subgraphs (a), (b) show the retrieval performance of different length of hash codes on Wiki dataset, (c), (d) show the performance on MIRFlickr, and (e), (f) show the performance on NUS-WIDE. In order to show the variation trend of experimental results in different length of hash codes, the colors of curves from 16*16 to 128*128 gradually change from dark blue, light blue, light red to dark red. In general, with the increasing of hash codes length, the query results become more effective, especially in the subgraphs (b), (d) and (f) in Fig. 4. The experimental results also show that when the codes length is from 16*16 to 128*128, the query performances is gradually improving.
To demonstrate the validity of the hierarchical structure, Table 4 shows the experimental results of every layer on the above three datasets. The evaluation indicators are the number of hierarchical layers, the size of training sets, training time and MAP values. In order to reduce the computational costs, we take 10000 instances as layer 0 on NUS-WIDE.
The results in Table 4 fully demonstrate that with the increase of the number of layers, the training time is greatly reduced while the retrieval performance is guaranteed. As an example, we can see that the MAP values on layer 1 show a competitive advantage in both Img2Txt and Txt2Img on Wiki dataset. However, the MAP values on layer 2 is significantly reduced compared to layer 1. So we select layer 1 data to participate in subsequent cross-modal variable-length hashing. Similarly, we all select the layer 1 as the training set in MIRFlickr and NUS-WIDE. In the query phase, we can obtain the hash codes of entire training set, thus ensuring the authenticity of the experimental results.
Conclusion
In this paper, we propose a novel cross-modal retrieval method, called Cross-modal Variable-length Hashing based on Hierarchy (CVHH). First we construct the hierarchical structure based on training set, and select the representative image-text pairs from optimal layer. Then we build the variable-length hashing structure to project high-dimensional original data into low-dimensional hashing space. Finally we compare the distances between queried instances with the entire training set by similarity transfer matrix, and return the top 10 instances as final query results. The experiments show that our proposed CVHH can greatly reduce the training time, and also improve the retrieval efficiency. In addition, the performance of CVHH is superior to other typical comparison methods. Furthermore, we will try to apply this method to multi-modal domain in future researches.
Footnotes
Acknowledgments
The authors would like to thank the anonymous reviewers for their help. This work was supported by the National Natural Science Foundation of China (Grant No. 62076044), Chongqing Research Program of Basic Research and Frontier Technology (Grant No. cstc2019jcyj-zdxm0011), and The Postgraduate Scientific Research and Innovation Project of Chongqing (Grant No. CYS18248).
