Deep hash networks with cross-scale feature fusion for optimal binary encoding

Abstract

Deep hashing is a technique for large-scale image retrieval that encodes the latent representations of images as binary codes, which significantly reduces the computational and storage costs of retrieval and enables fast similarity comparison and search. However, the technique faces two significant challenges: extracting discriminative, category-specific image features, and the conflict between metric learning and quantization learning. The latter often leaves the binary representation of the latent codes considerably ambiguous. To tackle these challenges, this paper proposes a novel Cross-Scale Fusion Deep Hash Network (CSF-Net). The model is built on a dual-branch framework that aims to capture the most representative retrieval features. One branch employs Spatial Pyramid Pooling layers and a self-attention mechanism to extract local information, whereas the other uses a sliding-window approach to capture global information. The proposed Cross Feature Synergy Module then integrates the local and global information into a comprehensive feature vector that serves as the complete representation of the image. To address the conflict between metric learning and quantization learning and to further improve the binary codes, this paper introduces a threshold-dependent Hash-Guided Metric Loss (HGM-Loss). The proposed network demonstrates superior retrieval performance in standard benchmark tests on multiple datasets, including CIFAR-10, CIFAR-100, ImageNet, and MS-COCO, outperforming existing hashing methods.

Keywords

Deep hashing; Binary encoding; Image retrieval

1 Introduction

In recent years, large-scale image retrieval has become a highly active research area in computer vision and multimedia computing [1–4]. With the widespread use of social media platforms and search engines, billions of images are constantly uploaded to the internet. Retrieving images similar to a query image from these massive libraries quickly and accurately has therefore become a research topic of great practical value. Image descriptors [5, 6] play a critical role in encoding image content and measuring its similarity in image retrieval. Because the feature differences among image categories are subtle and potentially overlapping, precise classification is challenging. Furthermore, variations within the same category, such as changes in lighting, angle, or scale, make it even harder to distinguish the features of different categories. Extracting discriminative features that support a highly detailed visual understanding is therefore a significant challenge. Given the excellent performance of deep neural networks (DNNs) in feature extraction, deep hashing [7, 8] has become an important research branch in recent years. It aims to map the high-dimensional feature vectors extracted by DNNs to low-dimensional vectors through specific hash functions. The encoded binary hash codes minimize storage costs while preserving the distinguishable features of images [7, 9–11].

Most existing deep hashing algorithms first extract continuous floating-point features from the last fully connected layer before the classification layer of a deep network, and then apply a simple sign function as a post-processing step to compute binary codes. The network can then be optimized with a pair-based or triplet-based loss, either as a classification task or, using a cosine similarity loss, as a metric learning task. However, these methods ignore the loss of semantic information that occurs when continuous floating-point features are binarized. A challenge that needs to be addressed is therefore how to preserve the most important semantic information in the floating-point features in order to ensure retrieval quality. Previous works [12–16] have made significant efforts to address the inevitable information loss during binarization, but they have not resolved the conflict between metric loss and quantization loss. Metric loss is usually based on continuous-space measures, such as Euclidean distance or cosine similarity, to measure the similarity between images. It requires preserving as much semantic information as possible to express image similarity accurately in the low-dimensional space. Quantization loss, on the other hand, maps continuous semantic information to binary codes, which inevitably introduces information loss and can severely distort image similarity in the low-dimensional space. As shown in Fig. 1(a), the gradient direction of the metric loss for the red cluster is opposite to that of the quantization loss. This common conflict causes clusters to deviate from the ideal hashing position during convergence, leading to erroneous hashing and a decrease in retrieval quality.

To tackle these challenges, we propose a Hash-Guided Metric Loss (HGM-Loss) that addresses the conflicting objectives of preserving semantic information and generating compact binary codes. HGM-Loss uses a threshold ν, determined by the number of hash bits and the number of classes, to prevent different classes from being pushed apart without limit. As shown in Fig. 1(b), this threshold ensures that negative samples are penalized only when their distance is less than ν, which prevents clusters from deviating from the ideal hashing position during convergence and keeps the hashing position stable.

Fig. 1

(a) The opposing gradient directions of the metric loss and the quantization loss hold the red cluster in balance, leading to a higher likelihood of erroneous hashing during binarization. (b) With the help of HGM-Loss, the red clusters are affected only by the quantization loss until they reach the hash position.

To capture the most critical semantic information from an image, the network must extract both global and local features to generate better image descriptors. Global features, which can be designed to be invariant or robust to certain viewpoints and lighting conditions, serve as high-level semantic image signatures. Local features, on the other hand, provide geometrically discriminative information about specific regions of the image and are more sensitive to local geometry and texture [17–19]. Fusing local and global features can therefore effectively enhance the representation ability of the image descriptor. Our Cross-Scale Fusion Deep Hashing Network incorporates a global branch and a local branch, enabling the integration of high-level semantics and fine-grained details. In addition, the local branch is equipped with spatial pyramid pooling and self-attention mechanisms inspired by previous research [29], which focus on extracting representative local features. A Cross-Feature Synergy (CFS) module is introduced to enhance the mutual influence between global and local features. Inspired by the idea of orthogonalization, the module orthogonalizes global and local features to extract the critical local information and eliminate the components that are redundant with the global information. This enables the local and global components to reinforce each other and yields the final representative descriptors, which improves the discriminability and retrieval quality of the hash code. Our contributions can be summarized as follows:

Based on the expressive power of binary coding, we propose a novel network architecture called CSF-Net to achieve efficient image retrieval.

The Cross-Feature Synergy (CFS) module uses global semantics to extract discriminative local features and fuses global and local features to enhance the representation ability of the descriptors.

The proposed Hash-Guided Metric Loss (HGM-Loss) mitigates the conflict between metric learning and quantization learning by restricting the learning scope of the metric term through carefully designed thresholds.

Extensive experiments demonstrate that the proposed CSF-Net outperforms existing hashing methods and achieves significant performance improvements on four public benchmarks.

The remaining sections of this paper are structured as follows. Section 2 briefly reviews previous work related to deep hashing. Section 3 presents in detail our proposed Cross-Scale Fusion Deep Hashing Network, which uses both global and local features to enhance the representation ability of image descriptors and employs the HGM-Loss to alleviate the conflict between metric learning and quantization learning. Section 4 validates the effectiveness and flexibility of the proposed method through experiments on four benchmarks. Finally, Section 5 concludes our work.

2 Related work

With the rapid expansion of image datasets in recent years, efficient image retrieval has become a hot topic in the field of image processing. Among numerous retrieval methods, hashing [1–3, 9] has become one of the most popular for large-scale image datasets because of its high computational efficiency and low storage cost. However, before the advent of deep learning, the accuracy of traditional hash methods that generate binary codes from handcrafted features was limited, since these features cannot optimally represent the content of images while maintaining semantic similarity. With the emergence of deep neural networks (DNNs), image hashing has greatly benefited from the excellent performance of DNNs in extracting features from two-dimensional images. Deep learning methods have achieved significant success in this field, making deep hashing the mainstream approach to image retrieval. Examples of such methods include Convolutional Neural Network Hashing (CNNH) [9], Network in Network Hashing (NINH) [20], Bit-scalable Deep Hashing (BS-DRSCH) [21], and HashNet [13]. The workflow of CNNH consists of two phases: hash code learning and hash function learning. In the hash code learning phase, it decomposes the similarity matrix of the samples into binary codes, and in the hash function learning phase, it fits these binary codes using a CNN. Unlike CNNH, which follows a two-stage approach, both NINH and BS-DRSCH adopt a single-stage deep hashing framework. NINH uses a triplet ranking loss to maintain relative similarity and integrates image representation learning and hash code learning into a single framework. This framework consists of a shared subnetwork with multiple stacked convolutional layers for image feature extraction, and a segmentation coding module that uses a sigmoid activation function and a piecewise threshold function to output binary hash codes.

BS-DRSCH constructs an end-to-end architecture to learn the hash function and can flexibly control the length of the hash code by weighting each bit as needed. In addition, BS-DRSCH defines a weighted Hamming similarity to measure the dissimilarity between two hash codes. HashNet addresses the gradient difficulty in optimization by optimizing a weighted pairwise cross-entropy loss with the non-smooth sign activation function $\mathrm{sign}(\cdot)$. This method can generate binary hash codes and cluster similar samples in feature space. Unlike earlier methods, HashNet avoids the problem of the tanh gradient approaching 0 over its domain, which leads to only small penalties for the latent codes of similar samples that lie relatively far apart.

However, existing deep hashing frameworks based on contrastive loss and triplet loss often suffer from slow convergence, partly because they use only one negative sample per update and do not interact with the other negative classes. To address this issue, the authors of [22] proposed a metric learning objective called Multi-class N-pair Loss, based on triplet loss, which interacts with multiple negative samples per update. The objective first generalizes triplet loss by allowing joint comparison with N - 1 negative samples. It then reduces the computational burden of evaluating deep embedding vectors through an efficient batch construction strategy that contains only N pairs of examples.
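The batch construction described above can be illustrated with a minimal NumPy sketch. It assumes one (anchor, positive) pair per class, so that the positives of the other N - 1 classes act as negatives for each anchor; the function name `n_pair_loss` and the toy embeddings are hypothetical and are not taken from [22].

```python
import numpy as np

def n_pair_loss(anchors, positives):
    """Multi-class N-pair loss for N (anchor, positive) pairs, one per class.

    anchors, positives: arrays of shape (N, D). For anchor i, positives[j]
    with j != i are used as the N - 1 negatives.
    """
    logits = anchors @ positives.T                 # (N, N) inner products
    diff = logits - np.diag(logits)[:, None]       # a_i.p_j - a_i.p_i
    np.fill_diagonal(diff, -np.inf)                # exclude the positive itself
    return float(np.mean(np.log1p(np.exp(diff).sum(axis=1))))

# Toy embeddings (illustrative only): each anchor matches its own positive.
anchors = 2.0 * np.eye(3)
aligned = n_pair_loss(anchors, 2.0 * np.eye(3))
# Rolling the positives pairs every anchor with the wrong class.
misaligned = n_pair_loss(anchors, np.roll(2.0 * np.eye(3), 1, axis=0))
```

Only N pairs are embedded per batch, yet every anchor is compared against N - 1 negatives, which is the efficiency argument made in the text.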

Although the above methods learn high-level semantic features of images through deep networks and then generate compact binary codes using hash networks, the global representations obtained by deep learning alone may not be sufficient to accurately retrieve similar samples that contain a large amount of redundant background. Recognizing the need for more refined local features, our proposed method adopts a dual-branch feature extraction architecture with different scales to consider local and global features jointly. Furthermore, while the existing methods aim to generate compact binary codes using hash networks, they fail to address the loss of a significant amount of continuous semantic information when floating-point features are quantized. DHN [7] combines quantization loss with a pairwise metric loss to constrain the latent codes and control the quantization error, thereby improving hash quality. Recent studies [23–27] have added quantization loss to reduce information loss in the binarization process and have demonstrated potential performance improvements. However, because the two objectives conflict, simply combining metric loss with quantization loss can still lead to information loss during binarization. To address this issue, we propose HGM-Loss, a carefully designed approach that reconciles these two losses, resulting in significant performance improvements across various deep hashing techniques.

3 Method

In this paper, we propose a deep hash network for cross-scale feature fusion to optimize binary coding. The network consists of a dual-branch module and a cross-feature synergy module.

3.1 Problem Definition

As mentioned above, deep hash networks can learn a compact high-dimensional feature representation and transform it into binary hash codes through a hash function, enabling fast image retrieval. We are given N images $\chi = \{x_{i}\}_{i = 1}^{N}$ and the corresponding one-hot semantic labels $y = \{y_{i}\}_{i = 1}^{N} \in \{0, 1\}^{C \times N}$, where C is the number of categories in the whole dataset. For each training sample $(x_{i}, y_{i})$, deep hashing learns a feature extractor ψ parameterized by $\theta$ to extract a representative latent code $h_{i} = \{h_{i1}, h_{i2}, \ldots, h_{iK} \mid h_{ik} \in \mathbb{R}\}$ from the sample. Subsequently, the latent code $h_{i}$ is quantized into a compact K-bit binary code $b_{i} = \{b_{i1}, b_{i2}, \ldots, b_{iK} \mid b_{ik} \in \{\pm 1\}\}$ by the hashing function $H : b_{i} \leftarrow H(h_{i})$. Our objective is to learn a reliable feature extractor ψ and a robust hash function $H$ so that images of different categories are clustered at appropriate and distinct hash positions. Given a query image $x_{q}$, we sort the hash codes of all samples in the database by their Hamming distance to the query code and return the top-ranked images as the query results. It is reasonable to assume that the hash code $b_{i}$ in Hamming space carries semantic information similar to that of the representation $h_{i}$ in feature space, which enables similarity matching between images. Moreover, to improve retrieval performance, an appropriate hash code length K must be chosen based on the size of the dataset and the complexity of the feature representation.
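The retrieval step described above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the hash function is a plain sign function (with zeros mapped to +1 by convention) and using made-up 4-bit latent codes; the helper names `binarize`, `hamming_distance`, and `retrieve` are hypothetical.

```python
import numpy as np

def binarize(latent):
    """Map latent codes to {-1, +1}; zeros are sent to +1 by convention here."""
    return np.where(latent >= 0, 1, -1).astype(np.int32)

def hamming_distance(query_code, database_codes):
    """Hamming distance between one K-bit code and every row of a code matrix."""
    k = database_codes.shape[1]
    # For +/-1 codes, the number of differing bits equals (K - <b1, b2>) / 2.
    return (k - database_codes @ query_code) // 2

def retrieve(query_latent, database_latent, top_n):
    """Return the indices and distances of the top_n nearest database codes."""
    dist = hamming_distance(binarize(query_latent), binarize(database_latent))
    order = np.argsort(dist, kind="stable")[:top_n]
    return order, dist[order]

# Toy latent codes (illustrative values only).
database_latent = np.array([
    [0.9, -0.8, 0.7, -0.6],
    [-0.9, 0.8, -0.7, 0.6],
    [0.8, -0.1, -0.9, -0.7],
])
query_latent = np.array([0.5, -0.4, 0.3, -0.2])
indices, distances = retrieve(query_latent, database_latent, top_n=2)
```

Because the comparison reduces to an integer dot product, ranking a large database is cheap compared with comparing floating-point descriptors.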

3.2 Overview of CSF-Net

The proposed CSF-Net framework is shown in Fig. 2. Building upon the state-of-the-art image recognition model ResNet50, as in [1, 28], we introduce a dual-branch architecture, CSF-Net, to address the insufficient representation of ambiguous objects in image encoding by deep learning networks. It is well known that the shallow layers of deep neural networks tend to capture fine-grained details such as edges and corners, while deeper layers are more sensitive to global and semantic information. Inspired by this, the dual-branch design of CSF-Net leverages the semantic context in Res4 to align and filter better local features, thus suppressing noise interference. The approach thereby combines deep learning and image retrieval techniques to improve the representation of non-obvious targets in image encoding.

Fig. 2

Overall structure of the proposed CSF-Net, built upon ResNet50. The multi-scale features f_global are obtained by the global branch, which takes the output of the Res4 block as input. The local branch, which models more local details f_local after the Res3 block, uses spatial pyramid pooling (SPP) and self-attention. The CFS module takes both f_global and f_local as inputs to generate the final representation. The metric loss ($L_{HGM}$) and the quantization loss ($L_{Q}$) are used for optimizing the network.

As shown in Fig. 2, our global branch remains the same as the original ResNet50, but all pooling and fully connected layers after the Res4 block are removed to preserve spatial information in the feature maps. In addition, a multi-scale sliding window is introduced to capture multi-scale spatial global information by applying sliding windows of different sizes to the feature map. This allows the network to capture contextual information at multiple scales and helps to improve the overall performance of the model. The global features, abbreviated as f_g, can be expressed as:

$f_{g} = \mathrm{cat}\big(F.\mathrm{unfold}(f_{4_{1}}, s_{w_{1}}),\ F.\mathrm{unfold}(f_{4_{2}}, s_{w_{2}})\big)$ (1)
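The sliding-window operation in Equation (1) can be sketched in NumPy as follows. The sketch mirrors the layout of an unfold operation (all patches flattened into columns) with stride 1 and no padding, and uses the 3 × 3 and 5 × 5 windows named in the paper. The channel count, the 7 × 7 and 9 × 9 map sizes, and the concatenation axis are assumptions chosen only so that both unfolded outputs have the same number of window locations; the paper does not specify them.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def unfold(feature_map, window):
    """Extract all window x window patches (stride 1, no padding).

    feature_map: array of shape (C, H, W).
    Returns an array of shape (C * window * window, L), where
    L = (H - window + 1) * (W - window + 1) is the number of window locations.
    """
    c = feature_map.shape[0]
    patches = sliding_window_view(feature_map, (window, window), axis=(1, 2))
    # (C, H', W', window, window) -> (C, window, window, H', W')
    patches = patches.transpose(0, 3, 4, 1, 2)
    return patches.reshape(c * window * window, -1)

rng = np.random.default_rng(0)
channels = 4                                   # hypothetical channel count
f4_1 = rng.standard_normal((channels, 7, 7))   # hypothetical 7x7 map
f4_2 = rng.standard_normal((channels, 9, 9))   # hypothetical 9x9 map

u1 = unfold(f4_1, 3)                           # shape (4 * 9, 25)
u2 = unfold(f4_2, 5)                           # shape (4 * 25, 25)
# Assumed concatenation axis: stack patch features per window location.
f_g = np.concatenate([u1, u2], axis=0)
```

Each column of `f_g` then describes one window location at two scales, which is one plausible reading of "multi-scale spatial global information".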

In Equation (1), the feature maps $f_{4_{1}}$ and $f_{4_{2}}$ are obtained by processing the output of the Res4 layer with Conv2d, but they have different sizes. We use 3 × 3 and 5 × 5 as the window sizes (denoted as $s_{w_{1}}$ and $s_{w_{2}}$, respectively) to apply multi-scale sliding windows on them.

To enhance the ability of the network to extract detailed information from images and to jointly extract local descriptors, we also introduce a local branch after the Res3 block, which consists of spatial pyramid pooling layers and a self-attention module. Specifically, we perform max pooling on the output feature f3 of the Res3 block at different spatial scales to obtain multiple feature maps with different spatial receptive fields. Each feature map is then flattened into a one-dimensional vector, and these vectors are concatenated along the channel dimension. Pooling at different scales generates a fixed-size feature representation and allows the model to capture multi-scale information about the target. This enables the model to extract relevant information from features at different scales even when the position and size of the target vary in the image, thus maintaining effective recognition capabilities. The spatial pyramid pooling layer therefore contributes to the performance and robustness of the model, enabling it to adapt to targets of different scales and positions [29].

Finally, the concatenated feature map is passed to the self-attention module to further model the importance of each local feature point. The self-attention mechanism computes the similarity between input features and allocates weights accordingly. This allows the model to focus on the areas that have the greatest impact on the results and reduces the influence of distracting information. Because it emphasizes meaningful regions, it can better capture representative, distinguishable local features. This mechanism enhances the model's ability to extract local features and contributes to improving the accuracy of its predictions [16].
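The fixed-size property of spatial pyramid pooling can be illustrated with a short NumPy sketch. The pyramid levels (1, 2, 4) and the input sizes are assumptions made for illustration, since the paper does not list them, and the self-attention step that follows is not covered here.

```python
import math
import numpy as np

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a (C, H, W) map over n x n grids and concatenate the results.

    The output length is C * sum(n * n for n in levels), independent of H and W.
    The pyramid levels are illustrative defaults, not values from the paper.
    """
    _, h, w = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Bin boundaries follow the usual floor/ceil adaptive-pooling rule.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(feature_map[:, r0:r1, c0:c1].max(axis=(1, 2)))
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
small = rng.standard_normal((3, 8, 8))     # hypothetical feature map
large = rng.standard_normal((3, 13, 9))    # a different spatial size
v_small = spatial_pyramid_pool(small)
v_large = spatial_pyramid_pool(large)
```

Both inputs yield a vector of the same length, which is what lets the subsequent layers accept targets of varying size and position.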

The global feature component f_g and the local feature tensor f_l, as two important feature sources, undergo feature fusion and elaborately designed processing through the Cross-Feature Synergy module, resulting in a compact descriptor that effectively preserves spatial information from both branches and achieves adaptive context alignment, which will be discussed in detail later. The design of the Cross-Feature Synergy module aims to fully leverage the strengths of global and local features, achieve complementarity and synergy, and better capture semantic information from images.

3.3 Cross Feature Synergy Module, CFS

Our Cross-Feature Synergy (CFS) module operates as illustrated in Fig. 3. The input to CFS includes two components: the multi-scale global feature f_g based on sliding windows and the local feature f_l based on self-attention. To start, we align the local feature tensor and the global feature tensor by calculating the projection of each local feature point $f_{l}^{(i, j)}$ onto the global feature f_g using matrix multiplication (torch.mm). Mathematically, this can be expressed as:

$f_{t 1}^{(i, j)} = \frac{f_{l}^{(i, j)} \cdot f_{g}}{| f_{g} |^{2}} f_{g}$ (2)

Fig. 3

Framework of our proposed Cross-Feature Synergy (CFS) module.

The global feature tensor contains more semantic contextual information, while the local feature tensor contains more detailed local contextual information, and redundant context in complex backgrounds may carry unnecessary semantic information. To better extract fine-grained local features with multi-scale spatial context, we therefore adopt a strategy of eliminating redundant features. Specifically, we subtract from each local feature the component that it shares with the global feature, so that the remaining local component carries only information that is not already present in the global feature. This effectively separates global and local features and extracts more valuable features. The whole process can be expressed as follows:

$f_{t 2}^{(i, j)} = f_{l}^{(i, j)} - f_{t 1}^{(i, j)}$ (3)

By following this procedure, we obtain a tensor of size C × H × W in which each element is orthogonal to the global feature f_g. We then concatenate the C × 1 vector f_gc to each element of this tensor, resulting in a new tensor of size C₀ × H × W. We aggregate this concatenated tensor into a C₀ × 1 vector by pooling, and a fully connected layer then generates a K × 1 descriptor. Through the Cross-Feature Synergy (CFS) module, the extracted features align detailed local features with the global spatial context, thus focusing on the key instance regions.
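The CFS pipeline can be sketched end to end in NumPy. This is a minimal sketch under several assumptions that the text leaves open: the projection is taken onto the global vector, which is the reading under which the residual is orthogonal to f_g as stated above; C₀ is taken to be 2C; the pooling is average pooling; and the fully connected layer is a random, untrained matrix. All names are hypothetical.

```python
import numpy as np

def cross_feature_synergy(f_local, f_global, fc_weight, fc_bias):
    """Sketch of the CFS steps: project, subtract, concatenate, pool, FC.

    f_local:  (C, H, W) local feature tensor
    f_global: (C,) global feature vector
    fc_weight, fc_bias: parameters of an assumed (K, 2C) fully connected layer
    """
    c, h, w = f_local.shape
    # Projection coefficient of each local vector onto the global vector.
    coeff = np.einsum("chw,c->hw", f_local, f_global) / np.dot(f_global, f_global)
    f_proj = f_global[:, None, None] * coeff[None, :, :]     # (C, H, W)
    f_orth = f_local - f_proj                                 # orthogonal residual
    # Attach the global vector to every spatial location (assumes C0 = 2C).
    g_tiled = np.broadcast_to(f_global[:, None, None], (c, h, w))
    fused = np.concatenate([g_tiled, f_orth], axis=0)         # (2C, H, W)
    pooled = fused.mean(axis=(1, 2))                          # assumed average pooling
    return fc_weight @ pooled + fc_bias, f_orth

rng = np.random.default_rng(0)
C, H, W, K = 8, 5, 5, 16                       # hypothetical sizes
f_l = rng.standard_normal((C, H, W))
f_g = rng.standard_normal(C)
weight = rng.standard_normal((K, 2 * C))
bias = np.zeros(K)
descriptor, residual = cross_feature_synergy(f_l, f_g, weight, bias)
# Inner product of every residual vector with the global vector.
inner = np.einsum("chw,c->hw", residual, f_g)
```

The inner products in `inner` vanish up to rounding, which confirms that the residual keeps only the part of each local feature that the global vector does not already explain.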

3.4 Loss Functions for Deep Hashing Image

Metric Loss. In general, for any K-dimensional binary codes b₁ and b₂, the Hamming distance measures their dissimilarity by counting the number of differing bits with the XOR operation. However, the XOR operation is not differentiable and cannot be used to train neural networks with backpropagation. It is also not suitable for measuring distances between the continuous floating-point features used for training. Given that Hamming distance is related to cosine similarity, we use cosine similarity to approximate the distance between the continuous floating-point hash codes h₁ and h₂ and thereby enable backpropagation. The relationship between cosine similarity and Hamming distance can be expressed as Equation 4, where K represents the dimensionality of the hash codes.

$D_{Ham} (b_{1}, b_{2}) = \frac{1}{2} \sum_{i = 1}^{K} | b_{1 i} - b_{2 i} | = \frac{K}{2} - \frac{K}{2} \cos (b_{1}, b_{2}) \approx \frac{K}{2} D_{\cos} (h_{1}, h_{2})$ (4)

In Equation 4, $D_{\cos} (h_{1}, h_{2})$ represents the cosine distance between h₁ and h₂, which can be expressed as:

$D_{\cos} (h_{1}, h_{2}) = 1 - \frac{h_{1} \cdot h_{2}}{| h_{1} | | h_{2} |}$ (5)
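For codes in {±1}, the link between Hamming distance and cosine distance is exact: the number of differing bits equals K/2 times the cosine distance. The following check uses two made-up 8-bit codes; the function names are illustrative.

```python
import numpy as np

def hamming(b1, b2):
    """Number of positions at which two +/-1 codes differ."""
    return int(np.sum(b1 != b2))

def cosine_distance(h1, h2):
    """Cosine distance 1 - cos(h1, h2), as in Equation 5."""
    return 1.0 - float(np.dot(h1, h2) / (np.linalg.norm(h1) * np.linalg.norm(h2)))

# Two illustrative 8-bit codes that differ in three positions.
b1 = np.array([1, -1, 1, 1, -1, -1, 1, -1])
b2 = np.array([1, 1, 1, -1, -1, -1, -1, -1])
K = b1.size

d_ham = hamming(b1, b2)
d_from_cos = K / 2 * cosine_distance(b1, b2)
```

During training the same expression is evaluated on the continuous latent codes, where it is differentiable and only approximates the Hamming distance of the eventual binary codes.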

The goal of metric loss is clustering. It is a distance-based learning method that learns a mapping from a high-dimensional feature space to a low-dimensional embedding space, making samples of the same class more compact in the embedding space while making the distance between samples of different classes more distinct. This technique is widely used in image retrieval to optimize the performance of models. Existing techniques [30–33] typically use cosine similarity estimated in mini-batches to design the metric loss term, which can be represented as:

$L_{Metric} = \sum_{h_{1} \in \mathbb{S}} \sum_{h_{2} \in \mathbb{S}'} \mathbb{1} (y_{1} = y_{2})\, F_{+} (- \cos (h_{1}, h_{2})) + \sum_{h_{1} \in \mathbb{S}} \sum_{h_{2} \in \mathbb{S}'} \mathbb{1} (y_{1} \neq y_{2})\, F_{-} (\cos (h_{1}, h_{2}))$ (6)

In Equation 6, $F_{+}$ and $F_{-}$ represent the loss functions designed for positive and negative samples, respectively, which can be estimated using contrastive-based or class-proxy-based methods to measure similarity. $\mathbb{S}$ represents the latent codes of the given batch of images, and $\mathbb{S}'$ represents the latent code set of other samples or class proxies. $y_{i}$ represents the class label.

Quantization loss. The metric loss only considers the intra-class compactness and inter-class separability of continuous features in the original space, while neglecting the requirements of efficient query and easy storage. The metric loss by itself therefore cannot guarantee that the binary hash codes are located at the ideal hash positions. Directly binarizing the latent code h_i leads to a large loss of feature information, for example when bits close to 0 are binarized to ±1, which is not ideal for clustering images with similar hash codes. To alleviate this information loss, the quantization loss was proposed in [7]. It constrains the latent code by computing the L2 regularization loss between the latent code h_i and its corresponding binary vector b_i. Subsequent works [3, 25] have demonstrated that adding quantization loss can significantly improve the performance of image retrieval.

$L_{Q} = \frac{1}{N} \sum_{i = 1}^{N} \| h_{i} - b_{i} \|_{2}^{2}$ (7)
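Equation 7 can be evaluated directly. The sketch below assumes that b_i is obtained by applying the sign function to h_i, and compares two made-up batches to show that latent codes near ±1 incur a small loss while codes near 0 incur a large one.

```python
import numpy as np

def quantization_loss(latent):
    """Mean squared L2 distance between latent codes and their sign codes."""
    binary = np.where(latent >= 0, 1.0, -1.0)   # assumed hash function: sign
    return float(np.mean(np.sum((latent - binary) ** 2, axis=1)))

# Illustrative batches of 2-bit latent codes.
near_binary = np.array([[0.9, -1.1], [-1.0, 1.0]])
near_zero = np.array([[0.1, -0.1], [-0.2, 0.2]])

loss_near_binary = quantization_loss(near_binary)
loss_near_zero = quantization_loss(near_zero)
```

The gap between the two values is the information that binarization would discard, which is why the loss pulls every latent code towards a vertex of the hypercube.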

In image retrieval, deep learning models are commonly used to learn feature representations, which are then subjected to distance measurement to enable matching of similar images. The purpose of metric loss is to optimize this feature representation so that similar images are closer to each other in feature space, while dissimilar images are further apart. On the other hand, quantization loss aims to quantize continuous feature vectors into discrete binary codes for more efficient storage and retrieval. However, the learning objectives of metric loss and quantization loss are contradictory, since metric loss requires the feature representation to be more compact and continuous in feature space, while quantization loss requires the feature representation to be more discrete in the binary code space.

To illustrate this contradiction more clearly, we use the example of a two-dimensional feature space shown in Fig. 4, which contains three different categories. To better show their distribution in the two-dimensional feature space, all samples are scaled onto a feature circle with radius $r = \sqrt{2}$, where each point on the circle corresponds to the feature vector of one sample. As there are only four quadrants in the two-dimensional space, the binarization operation pulls all samples directly to the four binary points (i.e., (±1, ±1)) corresponding to the four quadrants. The farther a sample is from its binary point, the more information is lost after the hashing function.

Fig. 4

(a) The objective of metric loss is to maximize the inter-class distance while minimizing the intra-class distance in the feature space. (b) Once the quantization loss is incorporated, the class centroids tend to shift towards the binary points. However, the metric loss that repels the classes from one another prevents the centroids from aligning with the optimal hashing positions. (c) Upon integrating hash-guided metric loss, the solution obtained is the global optimum for this problem.

Figure 4(a) illustrates the goal of metric learning by showing samples of three different classes in a 2D feature space, where the centers of these classes are pushed as far apart from each other as possible to form an equilateral triangle, representing the best solution where each cluster is separated as much as possible from each other. In contrast, in Fig. 4 (b), the goal of quantization learning is to pull the class centers to one of the four binary points as closely as possible to minimize information loss during the binarization process, resulting in the optimal quantized solution (i.e., minimizing information loss as much as possible). It is visually clear that in metric learning, we want the center points of different classes to be separated as much as possible, while in quantization learning, we want the center points of different classes to be pulled towards one of the four binary points, to minimize information loss in the hashing process. Therefore, there is no solution that simultaneously satisfies the best metric and the best quantization goals. In fact, the ideal solution is to have the three class centers strictly located at the three of the four binary points, which is the optimal global solution with the least information loss during the binarization process (as shown in Fig. 4 (c)). When both metric loss and quantization loss are integrated into a method, an intermediate state is obtained (as shown in Fig. 4 (b)), which lies between the optimal metric solution and the optimal global solution. This integration leads to more information loss during the binarization process, thus reducing retrieval accuracy. Specifically, samples that are further away from the cluster center are more likely to be misclassified into other binary points.
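The trade-off in this two-dimensional example can be checked numerically. The sketch below places three class centers on the circle of radius √2 in two configurations: an equilateral triangle (the metric optimum) and three of the four binary points (the hash-aligned solution). The specific angles are illustrative choices, with the triangle rotated so that one vertex already sits on (1, 1).

```python
import numpy as np

RADIUS = np.sqrt(2.0)

def centers(angles_deg):
    """Class centers on the feature circle at the given angles (degrees)."""
    a = np.deg2rad(np.asarray(angles_deg, dtype=float))
    return RADIUS * np.stack([np.cos(a), np.sin(a)], axis=1)

def quantization_error(points):
    """Total squared distance from each center to its binary point."""
    return float(np.sum((points - np.sign(points)) ** 2))

def max_pairwise_cosine(points):
    """Largest cosine similarity between any two centers (lower = better separated)."""
    unit = points / np.linalg.norm(points, axis=1, keepdims=True)
    sim = unit @ unit.T
    return float(sim[np.triu_indices(len(points), k=1)].max())

triangle = centers([45.0, 165.0, 285.0])   # metric optimum: 120 degrees apart
aligned = centers([45.0, 135.0, 225.0])    # three of the four binary points

q_triangle = quantization_error(triangle)
q_aligned = quantization_error(aligned)
s_triangle = max_pairwise_cosine(triangle)
s_aligned = max_pairwise_cosine(aligned)
```

The triangle achieves the better separation (maximum pairwise cosine of -0.5 versus 0) but a nonzero quantization error, whereas the aligned configuration has zero quantization error, so no single configuration is optimal for both objectives.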

3.5 Hash-Guided Metric loss

Proxy-Anchor [37] was previously the state-of-the-art method; it uses the cosine similarity estimated in mini-batches to design the metric loss term. Our proposed HGM-Loss modifies the linear computation of the cosine similarity in the Proxy-Anchor metric loss to restrict the learning scope of the metric term. The metric loss of Proxy-Anchor is calculated as Equation 8:

$M_{loss} = \frac{1}{| \mathbb{P} |} \sum_{p \in \mathbb{P}} \log \Big( 1 + \sum_{h \in h_{p}^{-}} e^{\alpha (\cos (h, p) + \gamma)} \Big) + \frac{1}{| \mathbb{P}^{+} |} \sum_{p \in \mathbb{P}^{+}} \log \Big( 1 + \sum_{h \in h_{p}^{+}} e^{- \alpha (\cos (h, p) - \gamma)} \Big)$ (8)

In Equation 8, α is a scaling factor and γ is a margin. $\mathbb{P}$ denotes the set of proxies and $\mathbb{P}^{+}$ the positive proxies corresponding to samples in a mini-batch. In the Proxy-Anchor method, each class has a proxy vector, which is continually updated during training to maximize the distance between different classes while minimizing the distance within the same class. The latent code set $\mathbb{H}$ for a proxy is divided into two parts: $h_{p}^{+}$ contains all samples of the same category as the proxy, while the remaining set $h_{p}^{-} = \mathbb{H} \setminus h_{p}^{+}$ contains the samples of different categories.
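Equation 8 can be written out as a short NumPy function. This is a forward-pass sketch only (no gradients or proxy updates); the values α = 32 and γ = 0.1, the identity proxies, and the toy batch are illustrative assumptions.

```python
import numpy as np

def _unit(x):
    """L2-normalize each row so that dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def proxy_anchor_loss(latent, labels, proxies, alpha=32.0, gamma=0.1):
    """Forward pass of Equation 8 for one mini-batch.

    latent:  (B, K) latent codes, labels: (B,) integer class ids,
    proxies: (C, K) one proxy per class.
    """
    cos = _unit(latent) @ _unit(proxies).T                    # (B, C)
    pos_mask = labels[:, None] == np.arange(proxies.shape[0])[None, :]
    neg_sum = np.where(~pos_mask, np.exp(alpha * (cos + gamma)), 0.0).sum(axis=0)
    pos_sum = np.where(pos_mask, np.exp(-alpha * (cos - gamma)), 0.0).sum(axis=0)
    has_pos = pos_mask.any(axis=0)                            # proxies in P+
    neg_term = np.log1p(neg_sum).mean()                       # average over P
    pos_term = np.log1p(pos_sum[has_pos]).mean()              # average over P+
    return float(neg_term + pos_term)

proxies = np.eye(3)                  # hypothetical proxies for 3 classes
latent = 2.0 * np.eye(3)             # each sample points at one proxy
correct = proxy_anchor_loss(latent, np.array([0, 1, 2]), proxies)
shuffled = proxy_anchor_loss(latent, np.array([1, 2, 0]), proxies)
```

Even in the correctly labeled case, the negative term stays positive because every negative contributes through the exponential regardless of how far it already is from the proxy.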

In Equation 8, the exponential term is linearly related to the cosine similarity of all samples, which contradicts the principles of quantization learning discussed earlier. Therefore, our proposed HGM-Loss uses a specific threshold value ν to correct this part of the equation. The HGM-Loss is shown in Equation 9:

$L_{HGM} = \frac{1}{| \mathbb{P} |} \sum_{p \in \mathbb{P}} \log \Big( 1 + \sum_{h \in h_{p}^{-}} \big( e^{\alpha F (\cos (h, p),\, - \nu - \delta)} - 1 \big) \Big) + \frac{1}{| \mathbb{P}^{+} |} \sum_{p \in \mathbb{P}^{+}} \log \Big( 1 + \sum_{h \in h_{p}^{+}} e^{- \alpha F (\cos (h, p),\, - \delta)} \Big)$ (9)

where F(a, b) = max(0, a + b) and δ is a margin hyperparameter that prevents the model from fitting the training data too closely. The HGM-loss thus replaces the error term of the Proxy-Anchor metric loss with a thresholded one, which alleviates the conflict between the metric loss and the quantization loss. The threshold ν depends on the binary code length and the number of categories, and is derived from the minimum distance d_min of the optimal global solution. The goal of HGM-loss is not to push different classes infinitely far apart, but only until their separation reaches the threshold ν. Once the separation between different classes satisfies ν, the optimal metric solution is considered to have been achieved. Overall, the HGM-loss balances the separation of different class clusters against the minimization of information loss in the quantization process.
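A minimal sketch of Equation 9 in plain Python follows. It is an illustration under stated assumptions, not the authors' implementation: the names `hinge` and `hgm_loss`, the label-to-proxy dictionary, and the toy values for α, ν, and δ are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def hinge(a, b):
    """F(a, b) = max(0, a + b) from Equation 9."""
    return max(0.0, a + b)


def hgm_loss(codes, labels, proxies, alpha, nu, delta):
    """Equation 9; `proxies` maps each class label to its proxy vector."""
    neg_total, pos_total, positive_proxies = 0.0, 0.0, 0
    for cls, p in proxies.items():
        pos = [cosine(h, p) for h, y in zip(codes, labels) if y == cls]
        neg = [cosine(h, p) for h, y in zip(codes, labels) if y != cls]
        # Negatives contribute only while cos(h, p) exceeds nu + delta;
        # expm1(x) computes e^x - 1, so the term is exactly 0 below that.
        neg_total += math.log1p(
            sum(math.expm1(alpha * hinge(s, -nu - delta)) for s in neg)
        )
        if pos:
            positive_proxies += 1
            pos_total += math.log1p(
                sum(math.exp(-alpha * hinge(s, -delta)) for s in pos)
            )
    loss = neg_total / len(proxies)
    if positive_proxies:
        loss += pos_total / positive_proxies
    return loss


# Toy batch (illustrative values): negatives are orthogonal to each proxy,
# so their cosine (0) is below nu + delta and they add nothing to the loss.
toy_proxies = {0: [1.0, 0.0], 1: [0.0, 1.0]}
toy_codes = [[1.0, 0.0], [0.0, 1.0]]
toy_loss = hgm_loss(toy_codes, [0, 1], toy_proxies, alpha=1.0, nu=0.5, delta=0.2)
```

The point of the example is the gating of the negative term: once negatives are separated beyond the threshold, the metric term stops pushing them further apart.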

Calculating the threshold ν amounts to finding the minimum distance of a set of binary codes. Previous work [34] has proven that computing the exact minimum distance is an NP-hard problem. However, researchers [35] proposed a method to estimate the minimum distance of binary linear codes and provided a table of bounds on the minimum distance of binary linear codes, denoted as $𝕋$. Specifically, they constrain the minimum distance through strict inequalities and propose upper and lower bound estimation methods that sidestep the NP-hard problem. Among them, the most important is the Gilbert-Varshamov bound [36], which can be used to calculate the upper and lower bounds of the minimum distance of binary linear codes and can be computed in polynomial time. Inspired by [35], we set the minimum distance d_min according to the following rules: if the value obtained from $𝕋$ is a constant, we use that value as d_min; if the value obtained is a range of values, we choose the median of that range as d_min. Finally, we normalize d_min according to the length K of the hash code to obtain the threshold ν, ensuring a certain degree of separation between different categories: $ν = 1 - \frac{d_{\min}}{2 \sqrt{K}}$ (10)

Following [27, 28], our approach combines the metric loss with the quantization loss to reduce information loss during the binarization process. Our learning objective can be expressed as $\min (L_{HGM} + β L_{Q})$, where we use β = 1 to train CSF-Net in an end-to-end manner.
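The selection rule for d_min and Equation 10 can be sketched as below. The function names are illustrative, and the inputs are hypothetical stand-ins for table entries, not values looked up in $𝕋$; taking the midpoint as the median of a contiguous integer range is also an assumption of this sketch.

```python
import math


def select_d_min(table_entry):
    """Pick d_min from a table entry that is a single value or a (low, high) range."""
    if isinstance(table_entry, (int, float)):
        return float(table_entry)
    low, high = table_entry
    # Median of a contiguous range of integers equals its midpoint.
    return (low + high) / 2.0


def threshold_nu(d_min, code_length):
    """Equation 10: nu = 1 - d_min / (2 * sqrt(K))."""
    return 1.0 - d_min / (2.0 * math.sqrt(code_length))


# Hypothetical entries, used only to exercise the two branches of the rule.
nu_single = threshold_nu(select_d_min(16), code_length=64)
nu_range = threshold_nu(select_d_min((4, 6)), code_length=16)
```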

4 Experiments

4.1 Dataset

We conduct extensive experiments on four representative benchmarks (CIFAR-10 [38], CIFAR-100 [38], MS-COCO [39], and ImageNet [40]) to validate the performance of our proposed method and to demonstrate the effectiveness of the proposed CSF-Net.

ImageNet is a widely used large-scale image dataset containing over 14 million labeled images, covering more than 20,000 categories. We adopt the same dataset and setting as in [13], where the main challenge is the large number of categories and limited number of training samples.

CIFAR-10 is a public dataset commonly used for image recognition, but it is also suitable for image retrieval tasks. It contains 60,000 32x32 color images belonging to 10 classes, with 6,000 images per class. We selected 50,000 images as the training set and the database, and the remaining 10,000 images as the query set.

The CIFAR-100 dataset is an image classification dataset that consists of 100 classes, each containing 600 32x32 color images. There are 50,000 images for training and 10,000 images for testing. The 100 classes are grouped into 20 superclasses of 5 subclasses each. CIFAR-100 extends CIFAR-10 with more classes and a more diverse set of images.

The MS-COCO dataset is a commonly used multi-label image recognition dataset that contains over 120,000 images, each labeled with several of 80 categories. Following [28], the dataset is first preprocessed to remove images without any labels. From the remaining 122,218 images, each of which contains at least one label, 5,000 are randomly selected as the query set and 10,000 as the training set, and the rest serve as the database. The main challenges of this dataset are its multi-label nature and its large number of images.

4.2 Evaluation protocols and implementation details

CSF-Net is implemented in the PyTorch framework. We use a ResNet-50 pre-trained on ImageNet as the backbone. The convolutional layers, together with the proposed CFS module and HGM-loss, are optimized using backpropagation. We use stochastic gradient descent (SGD) as the optimizer with a momentum of 0.9 and a weight decay of 1e-5. The initial learning rate is set to 0.01. As for hyperparameters, we set δ to 0.2 based on empirical experience. All experiments use a batch size of 128 and are trained on an NVIDIA TITAN X (Pascal) GPU for 150 epochs. We employ the widely used mean average precision (mAP), a metric ranging between 0 and 1, to evaluate retrieval performance: $mAP@N = \frac{1}{| Q |} \sum_{i = 1}^{| Q |} \frac{1}{N} \sum_{j = 1}^{N} P (j) \cdot rel (j)$ (11) where |Q| is the size of the query set, N is the maximum number of results returned for each query, and P(j) is the precision of the top j retrieved results, i.e., the proportion of relevant samples among the top j retrieved results. rel(j) is 1 if the jth retrieved result is relevant and 0 otherwise.
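Equation 11 can be computed as in the following sketch. It follows the equation as written, including the normalization by N rather than by the number of relevant items; the function names and the toy relevance lists are illustrative assumptions.

```python
def average_precision_at_n(relevance, n):
    """Inner sum of Equation 11 for one query.

    `relevance` is the ranked list of rel(j) values (1 = relevant, 0 = not).
    """
    hits = 0
    total = 0.0
    for j, rel in enumerate(relevance[:n], start=1):
        hits += rel
        precision_at_j = hits / j  # P(j): share of relevant items in the top j
        total += precision_at_j * rel
    return total / n


def map_at_n(all_relevance, n):
    """Equation 11: mean of the per-query scores over the query set Q."""
    return sum(average_precision_at_n(r, n) for r in all_relevance) / len(all_relevance)


# Two toy queries with three returned results each.
toy_queries = [[1, 0, 1], [1, 1, 1]]
toy_map = map_at_n(toy_queries, n=3)
```

For the first toy query the relevant results sit at ranks 1 and 3, so the inner sum is 1 + 2/3 and the per-query score is 5/9; the second query scores 1.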

4.3 Experiment results

Table 1 compares the mean average precision (mAP) of our proposed CSF-Net with that of other leading methods across various hash bit lengths. On CIFAR-10 and CIFAR-100, CSF-Net outperforms the other techniques at every reported code length. The larger number of categories in CIFAR-100 makes retrieval more demanding and lowers the performance of all methods relative to CIFAR-10, yet CSF-Net remains robust. The DPSH and DSDH methods in particular drop sharply when moving from CIFAR-10 to CIFAR-100, whereas CSF-Net loses at most 4 points of mAP. This suggests that our solution mitigates the observed conflict between metric learning and quantization learning.

Table 1
mAP performance by Hamming ranking for various hash bit lengths on the CIFAR-10, CIFAR-100, ImageNet, and MS-COCO datasets

Dataset	CIFAR-10			CIFAR-100			ImageNet (mAP@1000)			MS-COCO (mAP@5000)		
Hash bits	12bits	32bits	48bits	12bits	32bits	48bits	16bits	32bits	64bits	16bits	32bits	64bits
DCWH [41]	0.818	0.848	0.854	0.723	0.757	0.766	0.782	0.799	0.849	0.742	0.776	0.779
DPSH [42]	0.797	0.82	0.802	0.06	0.12	0.159	-	-	-	0.349	0.355	0.367
IDCWH [27]	0.828	0.868	0.849	0.764	0.824	0.835	-	-	-	0.732	0.76	0.77
CSQ [28]	-	-	-	-	-	-	0.851	0.865	0.873	0.796	0.838	0.861
HashNet [13]	-	-	-	-	-	-	0.622	0.701	0.739	0.745	0.773	0.788
DSDH [43]	0.8	0.804	0.808	0.078	0.187	0.227	-	-	-	0.347	0.359	0.37
CSCE-Net [44]	-	-	-	-	-	-	0.869	0.887	0.897	0.807	0.852	0.888
CSF-Net	0.895	0.889	0.907	0.855	0.863	0.886	0.874	0.906	0.912	0.843	0.848	0.891

On the large-scale ImageNet and MS-COCO datasets, CSF-Net also performs consistently well. On ImageNet, it improves over the next-best method, CSCE-Net, by 0.5, 1.9, and 1.5 percentage points for 16-bit, 32-bit, and 64-bit hash codes, respectively. On MS-COCO, CSF-Net also outperforms CSCE-Net, with one minor exception at 32 bits, where it is slightly lower (0.848 versus 0.852). Across datasets and hash bit lengths, CSF-Net remains stable, which supports the algorithm's strength in extracting and isolating crucial fine-grained information. The newly proposed HGM-Loss also plays a significant part by reducing information loss during the binarization process.

In Fig. 5, we compare the precision of our method with that of other methods when employing 64-bit binary codes. Our method achieves the highest precision among all compared methods. We attribute this advantage to the effective use of the 64-bit binary codes to acquire and retain information, as well as to the stability and accuracy of our algorithm when handling high-dimensional data and complex scenarios.

Fig. 5

Precision@top-500 curves on the CIFAR-10 and MS-COCO datasets with 64-bit binary codes.

We also observed that although our algorithm performs well with 64-bit binary codes, precision may drop slightly under certain circumstances, such as complex image patterns or high noise. This indicates that there is room for improvement in how our method handles these challenges.

In Fig. 6, we present the retrieval results of our CSF-Net on the CIFAR-10 dataset using only 12-bit hash codes. The figure suggests that our method can retrieve images accurately even with 12-bit hash codes. This further emphasizes the capability of our method to limit information loss during the binarization process.

Fig. 6

Shown here are the top 7 images retrieved with our proposed method using 12 bits on the CIFAR-10 dataset
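Retrieval with hash codes, as in the example above, amounts to ranking database items by their Hamming distance to the query code. The sketch below assumes that binary codes are obtained by thresholding the latent code at zero; that binarization rule, the function names, and the toy 4-bit latent codes are illustrative assumptions rather than details taken from the paper.

```python
def binarize(latent):
    """Pack the signs of a real-valued latent code into an integer hash code."""
    code = 0
    for value in latent:
        code = (code << 1) | (1 if value > 0 else 0)
    return code


def hamming_distance(a, b):
    """Number of bit positions in which two integer hash codes differ."""
    return bin(a ^ b).count("1")


def hamming_rank(query_code, database_codes, top_k):
    """Indices of the top_k database items closest to the query in Hamming space."""
    order = sorted(
        range(len(database_codes)),
        key=lambda i: hamming_distance(query_code, database_codes[i]),
    )
    return order[:top_k]


# Toy 4-bit example with three database items.
database = [
    binarize(v)
    for v in ([0.8, 0.5, -0.3, -0.9], [-0.6, -0.2, 0.7, 0.4], [0.9, 0.1, -0.5, 0.3])
]
query = binarize([0.9, 0.3, -0.2, -0.7])
ranking = hamming_rank(query, database, top_k=3)
```

Because the codes are plain integers, the distance is one XOR followed by a bit count, which is what makes Hamming ranking cheap compared with comparing real-valued features.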

4.4 Ablation study

Investigating the Impact of Network Components on Key Retrieval Information Extraction. We conduct a series of ablation experiments to investigate which parts of the network are most beneficial for extracting key retrieval information. In these experiments, we use the same loss function ($L_{HGM} + L_{Q}$) and build a baseline model on the fourth layer of ResNet-50, which represents the global context information of the image. Table 2 shows the performance of the different models, where “Baseline+local” denotes the concatenated output of global and local features. This model performs better than the baseline on both datasets, possibly because the local features contain geometric discriminative information about specific image regions. However, applying a single sliding window operation to the global features, or concatenating sliding window features of different scales, does not yield a significant improvement, as shown in rows 3-5 of Table 2. To fuse global and local features, we therefore introduce the CFS module. In Table 2, “Baseline+CFS” denotes the final feature obtained after the global and local features are processed by the CFS module. This variant performs best on both the CIFAR-10 and ImageNet datasets, indicating that the CFS module has strong fusion ability, filters duplicate information, and retains important representative region information. Notably, the CFS module remains strongly representative even when only a small amount of information can be retained, as demonstrated by the 12-bit hash codes.

Table 2
mAP obtained with different architectural variants of CSF-Net

Struct	CIFAR-10			ImageNet(mAP@1000)		
	12bits	32bits	48bits	16bits	32bits	64bits
Baseline	0.879	0.873	0.887	0.867	0.882	0.901
Baseline+local	0.881	0.876	0.896	0.869	0.885	0.906
Baseline+global-sw3	0.892	0.885	0.885	0.872	0.887	0.905
Baseline+global-sw5	0.891	0.88	0.883	0.871	0.891	0.903
Baseline+global-sw3+global-sw5	0.887	0.882	0.901	0.873	0.889	0.908
Baseline+CFS	0.895	0.889	0.907	0.874	0.906	0.912

Table 3

Comparison of mAP using different loss variants

Loss	Imagenet100(mAP@1000)
	16bits	32bits	64bits
Proxy-NCA [8]+Qua	0.868	0.89	0.901
Proxy-Anchor [37]+Qua	0.871	0.891	0.908
CE	0.848	0.879	0.885
CE+Qua	0.862	0.886	0.893
CF	0.861	0.875	0.896
CF+Qua	0.863	0.887	0.894
HGM+Qua	0.874	0.906	0.912

Analysis on the Effects of loss functions. In our previous discussions, we explained in detail the incompatibility between metric learning and quantization learning. To tackle this issue, we proposed the HGM-Loss, which explicitly restricts the learning range of the metric term, thereby preventing infinite separation of class centers. Furthermore, drawing upon the minimum distance theory of binary linear codes, our HGM-Loss pays more attention to quantization learning while ensuring distinguishable clustering, thereby achieving a balance between metric learning and quantization learning. To verify the effectiveness of our method, we conducted several tests on the large-scale ImageNet dataset and made comparisons with other methods.

The loss settings used in this experiment are listed in Table 3. “Proxy-NCA” introduces a set of class centers that are learned with a proxy-based loss. “Proxy-Anchor” is also a proxy-based loss, which combines the advantages of proxy-based and pair-based methods. “Qua” stands for the quantization loss, “CE” for the Cross-Entropy loss, and “CF” for the CosFace loss.

Table 4

mAP performance with different margin hyperparameters δ for a 32-bit hash code on three datasets using the same model and loss function

Margin Params δ	0	0.1	0.2	0.25	0.3	0.4
CIFAR-10	0.868	0.873	0.889	0.886	0.881	0.62
CIFAR-100	0.851	0.859	0.863	0.861	0.858	0.511
ImageNet	0.882	0.888	0.906	0.889	0.882	0.532

Table 3 shows that our proposed HGM-Loss+Qua achieves the best performance on the Imagenet100 dataset, with the highest mAP of 0.874, 0.906, and 0.912 for 16-bit, 32-bit, and 64-bit codes, respectively. This supports the robustness and effectiveness of our HGM-Loss, even when a larger number of categories in the dataset exacerbates the conflict between metric learning and quantization learning.

“CE+Qua” and “CF+Qua” denote the Cross-Entropy loss and the CosFace loss, each combined with the quantization loss. Their mAP values for 16 bits, 32 bits, and 64 bits are 0.862, 0.886, 0.893 and 0.863, 0.887, 0.894, respectively. Although these scores are lower than those of our proposed HGM-Loss, they are in most cases higher than those of “CE” and “CF” alone, which indicates that adding the quantization loss can help improve the precision of image retrieval.

In conclusion, these experimental outcomes validate the superiority of our proposed HGM-Loss in achieving an effective balance between metric learning and quantization learning and thereby facilitating efficient and precise image retrieval.

Sensitivity Analysis of Hyperparameters. We conduct a sensitivity analysis of hyperparameters to explore the influence of δ on the model. As shown in Table 4, we perform experiments on three standard datasets with a hash code length of 32 bits and vary the margin hyperparameter δ from 0 to 0.4. The results indicate that the best performance occurs at δ of approximately 0.2. In the HGM-Loss, the main role of δ is to prevent similar samples from being simply mapped to the same latent codes, thus ensuring that the sample distribution is in the ideal hypersphere. Without δ (δ = 0), the mAP on each dataset decreases, which we attribute to overfitting. Conversely, if δ is increased too far beyond 0.2, performance declines rapidly: at δ = 0.4, the mAP drops to between 0.511 and 0.62 on all three datasets. Therefore, the experimental results demonstrate that choosing an appropriate δ is advantageous for improving hash performance.

In HGM-Loss, the scaling factor α has a direct impact on the model’s performance. Because α scales the exponent of the metric term (Equation 9), it determines how strongly the metric term weighs against the quantization loss in the overall objective. Adjusting α therefore affects two aspects: (1) metric learning, that is, how far apart different categories are pushed in the feature space; and (2) quantization learning of the hash code, that is, how the continuous feature vectors are mapped to discrete binary codes. Choosing an appropriate α balances these two aspects and is thus a key factor in optimizing the performance of HGM-Loss. As shown in Table 6, we conducted experiments on three standard datasets (CIFAR-10, CIFAR-100, and ImageNet), with the hash code length set to 32 bits and the scaling factor α ranging from 0.15 to 0.6. The results show that the best performance for each dataset occurs at a different value of α: 0.15 for CIFAR-10, 0.25 for CIFAR-100, and 0.45 for ImageNet.

Table 5

mAP performance with different feature fusion methods on the ImageNet dataset

Method	Imagenet100(mAP@1000)
	16bits	32bits	64bits
FFA-Net [45]	0.856	0.872	0.9
CFNet [46]	0.866	0.899	0.906
CFP [47]	0.872	0.885	0.889
Concat	0.809	0.846	0.858
Add	0.776	0.806	0.843
CFS	0.874	0.906	0.912

Table 6

mAP performance with different scaling factor α for a 32-bit hash code on three datasets using the same model and loss function

scaling factor α	0.15	0.25	0.3	0.4	0.45	0.6
CIFAR-10	0.889	0.853	0.826	0.831	0.821	0.811
CIFAR-100	0.859	0.863	0.861	0.842	0.838	0.711
ImageNet	0.879	0.881	0.885	0.89	0.906	0.887

On the CIFAR-10 dataset, the model’s mAP decreases markedly as α increases. The mAP reaches its maximum of 0.889 at α = 0.15, which suggests that a smaller α helps the model achieve higher performance on this dataset. This may be because a smaller α tolerates larger errors, which allows the model to learn the underlying patterns of the data without over-concentrating on noise and outliers. On the CIFAR-100 dataset, however, the model’s performance peaks at α = 0.25. This may be because, for datasets with more categories, a slightly higher α helps to distinguish more classes. On the ImageNet dataset, performance first rises with α, peaks at α = 0.45, and then declines as α continues to increase. This may be related to ImageNet’s higher image complexity and number of categories, which allow the model to benefit from a larger α within a certain range, whereas an overly large α may cause the model to overfit and degrade performance.

In summary, this ablation study reveals the impact of α on the model’s retrieval performance and indicates that the selection of α needs to take into account the characteristics and complexity of the dataset. No single α value suits all datasets, so it needs to be chosen and adjusted for the specific dataset and task at hand.
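The per-dataset choice of α can be read off Table 6 mechanically, as in the short sketch below. The numbers are copied from Table 6; the helper name `best_setting` and the data layout are illustrative.

```python
# Scaling factors and 32-bit mAP values copied from Table 6.
ALPHAS = [0.15, 0.25, 0.3, 0.4, 0.45, 0.6]
MAP_BY_DATASET = {
    "CIFAR-10": [0.889, 0.853, 0.826, 0.831, 0.821, 0.811],
    "CIFAR-100": [0.859, 0.863, 0.861, 0.842, 0.838, 0.711],
    "ImageNet": [0.879, 0.881, 0.885, 0.89, 0.906, 0.887],
}


def best_setting(values, scores):
    """Return the (value, score) pair with the highest score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return values[best_index], scores[best_index]


best_alpha = {
    name: best_setting(ALPHAS, scores) for name, scores in MAP_BY_DATASET.items()
}
for name, (alpha, score) in best_alpha.items():
    print(f"{name}: alpha={alpha}, mAP={score}")
```

The same helper applies to any single-hyperparameter sweep reported as one row of scores per dataset.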

Effect analysis of different feature fusion methods. In the ablation study, we compared different feature fusion methods on the ImageNet dataset. As shown in Table 5, we evaluated the performance of FFA-Net, CFNet, CFP, Concat, Add, and our proposed CFS under different hash bit lengths.

First, our CFS method achieves the best performance in Table 5 at every hash bit length. At 64 bits, it reaches an mAP@1000 of 0.912, ahead of the next-best method, CFNet (0.906). This indicates an advantage of our CFS module in cross-scale feature fusion, enabling better capture of both global and local feature information in images. Second, the margin is narrower at shorter codes: at 16 bits CFP comes closest (0.872 versus 0.874), and at 32 bits CFNet comes closest (0.899 versus 0.906). This suggests that although these methods are also effective at fusing features, our CFS method retains an edge.

Compared to other feature fusion methods, our CFS method exhibits stable performance across different hash bit lengths. In particular, it outperforms CFP in terms of mAP@1000 at every code length, further supporting the ability of our CFS module to extract key fine-grained information. It also clearly outperforms simple concatenation (“Concat”) and element-wise addition (“Add”). Overall, the results of the ablation study validate the effectiveness of our CFS module across different hash bit lengths. While other feature fusion methods come close at individual bit lengths, the advantage of our CFS module in cross-scale feature fusion is consistent, providing a more flexible and robust feature representation for image retrieval tasks.

5 Conclusion

In this paper, we consider the representational power of feature fusion and propose a cross feature synergy module for deep hashing. The proposed CSF-Net can extract fine-grained local information by utilizing global contextual information, thereby improving the recall rate for difficult samples. In addition, to alleviate the challenging conflict between metric learning and quantization learning in deep hashing, we propose a new loss function called Hash-Guided Metric loss (HGM-Loss), which achieves a balance between the metric and quantization terms by limiting the learning range of the metric term and alleviating the problem of semantic information loss. Extensive experiments on four standard datasets verify the superiority of our method, and the results show that our method achieves significant improvements in the mAP evaluation metric. Visualization results demonstrate that CSF-Net can achieve better retrieval performance and visual consistency.

Looking forward, there are several promising avenues to further evolve our work. Firstly, we believe there is room to expand our study of feature fusion, potentially integrating varying image feature types to provide a more comprehensive image representation. Secondly, our future research paths will include a deeper exploration into different hash coding techniques, such as investigating the impact of hash length on retrieval accuracy and speed, as well as accommodating a wider array of data types and scenarios through multi-hash coding techniques. We assert that these exploratory directions may not only improve the method proposed herein but can also broaden its application potential.

Our ongoing commitment to this line of enquiry, prompted by our findings in this paper, will continue to drive the improvement and innovation in deep hashing methods.

Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (No. 62173285 and 62103345), the Fujian Provincial Natural Science Foundation of China (No. 2021J011181, 2020J02160 and 2022J011234) and Xiamen Youth Innovation Fund Project (No. 3502Z20206072 and 3502Z20206076).

References

1. Noh, Araujo, Sim, et al., Large-scale image retrieval with attentive deep local features, Proceedings of the IEEE International Conference on Computer Vision (2017), 3456–3465.
2. Datta, Joshi, Li, et al., Image retrieval: Ideas, influences, and trends of the new age, ACM Computing Surveys (Csur) 40(2) (2008), 1–60.
3. Yan, Gong, Wei, et al., Deep multi-view enhancement hashing for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 43(4) (2020), 1445–1451.
4. Zhang and Rui, Image search—from thousands to billions in 20 years, ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 9(1s) (2013), 1–20.
5. Kim, Parra, Yue, et al., Robust local and global shape context for tattoo image matching, 2015 IEEE International Conference on Image Processing (ICIP) (2015), 2194–2198.
6. Fan, Wu and Hu, Aggregating gradient distributions into intensity orders: A novel local image descriptor, CVPR 2011. IEEE (2011), 2377–2384.
7. […], et al., Deep hash learning for efficient image retrieval, 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW) (2017), 579–584.
8. Movshovitz-Attias, Toshev, Leung T.K., et al., No fuss distance metric learning using proxies, Proceedings of the IEEE International Conference on Computer Vision (2017), 360–368.
9. a R, Y. Pan, H. Lai, et al., Supervised hashing for image retrieval via image representation learning, Proceedings of the AAAI Conference on Artificial Intelligence 28(1) (2014).
10. Erin Liong, Lu, Wang, et al., Deep hashing for compact binary codes learning, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 2475–2483.
11. Luo, Wang, Wu, et al., A survey on deep hashing methods, ACM Transactions on Knowledge Discovery from Data 17(1) (2023), 1–50.
12. Kulis, Metric learning: A survey, Foundations and Trends® in Machine Learning 5(4) (2013), 287–364.
13. Cao, Long, Wang, et al., Hashnet: Deep learning to hash by continuation, Proceedings of the IEEE International Conference on Computer Vision (2017), 5608–5617.
14. Qin, Gong, Liu, et al., Forward and backward information retention for accurate binary neural networks, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 2250–2259.
15. Jose, Horstmann and Ohm J.R., Optimized binary hashing codes generated by siamese neural networks for image retrieval, 2018 26th European Signal Processing Conference (EUSIPCO) (2018), 1487–1491.
16. Long, Wei, Qi, et al., A deep hashing method based on attention module for image retrieval, 2020 13th International Conference on Intelligent Computation Technology and Automation (ICICTA) (2020), 284–288.
17. […], Ji and Huang, Joint event extraction via structured prediction with global features, Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (2013), 73–82.
18. Cao, Araujo and Sim, Unifying deep local and global features for image search, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XX 16. Springer International Publishing (2020), 726–743.
19. Kabbai, Abdellaoui and Douik, Image classification by combining local and global features, The Visual Computer 35 (2019), 679–693.
20. Lai, Pan, Liu, et al., Simultaneous feature learning and hash coding with deep neural networks, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), 3270–3278.
21. Zhang, Lin, Zhang, et al., Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification, IEEE Transactions on Image Processing 24(12) (2015), 4766–4779.
22. Sohn, Improved deep metric learning with multi-class n-pair loss objective, Advances in Neural Information Processing Systems (2016), 29.
23. Hoe J.T., Ng K.W., Zhang, et al., One loss for all: Deep hashing with a single cosine similarity based learning objective, Advances in Neural Information Processing Systems 34 (2021), 24286–24298.
24. Sun, Ye, Li, et al., Unsupervised deep hashing through learning soft pseudo label for remote sensing image retrieval, Knowledge-Based Systems 239 (2022), 107807.
25. Roy, Sangineto, Demir, et al., Metric-learning-based deep hashing network for content-based retrieval of remote sensing images, IEEE Geoscience and Remote Sensing Letters 18(2) (2020), 226–230.
26. […], Li, Meng, et al., Discriminative deep metric learning for asymmetric discrete hashing, Neurocomputing 380 (2020), 115–124.
27. Zhang and Yan, Improved deep classwise hashing with centers similarity learning for image retrieval, 2020 25th International Conference on Pattern Recognition (ICPR) (2021), 10516–10523.
28. Yuan, Wang, Zhang, et al., Central similarity quantization for efficient image and video retrieval, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 3083–3092.
29. […], Zhang, Ren, et al., Spatial pyramid pooling in deep convolutional networks for visual recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 37(9) (2015), 1904–1916.
30. […], Wang, et al., Triplet Deep Hashing with Joint Supervised Loss for Fast Image Retrieval, 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI) (2019), 606–613.
31. Xuan, Shim and Lee S.G., Deep Semantic Hashing Using Pairwise Labels, IEEE Access 9 (2021), 91934–91949.
32. Wang, Chen, Zhang, et al., Weakly supervised deep hyperspherical quantization for image retrieval, Proceedings of the AAAI Conference on Artificial Intelligence 35(4) (2021), 2755–2763.
33. Zhou, Yang, Wang, et al., Scalable feature matching by dual cascaded scalar quantization for image retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 38(1) (2015), 159–171.
34. Vardy, The intractability of computing the minimum distance of a code, IEEE Transactions on Information Theory 43(6) (1997), 1757–1766.
35. Brouwer A.E. and Verhoeff, An updated table of minimum-distance bounds for binary linear codes, IEEE Transactions on Information Theory 39(2) (1993), 662–677.
36. Griesmer J.H., A bound for error-correcting codes, IBM Journal of Research and Development 4(5) (1960), 532–542.
37. Kim, Kim, Cho, et al., Proxy anchor loss for deep metric learning, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2020), 3238–3247.
38. Krizhevsky and Hinton, Learning multiple layers of features from tiny images, (2009), 7.
39. Lin T.Y., Maire, Belongie, et al., Microsoft coco: Common objects in context, Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13 (2014), 740–755.
40. Russakovsky, Deng, Su, et al., Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (2015), 211–252.
41. Zhe, Chen and Yan, Deep class-wise hashing: Semantics-preserving hashing via class-wise loss, IEEE Transactions on Neural Networks and Learning Systems 31(5) (2019), 1681–1695.
42. […] W.J., Wang and Kang W.C., Feature learning based deep supervised hashing with pairwise labels, arXiv preprint arXiv:1511.03855 (2015).
43. […], Sun, He, et al., Deep supervised discrete hashing, Advances in Neural Information Processing Systems (2017), 30.
44. Xue, Shi, He, et al., Cross-Scale Context Extracted Hashing for Fine-Grained Image Binary Encoding, arXiv preprint arXiv:2210.07572 (2022).
45. Qin, Wang, Bai, et al., FFA-Net: Feature fusion attention network for single image dehazing, Proceedings of the AAAI Conference on Artificial Intelligence 34(7) (2020), 11908–11915.
46. Zhang, Li, et al., Cfnet: Cascade fusion network for dense prediction, arXiv preprint arXiv:2302.06052 (2023).
47. Quan, Zhang, et al., Centralized feature pyramid for object detection, IEEE Transactions on Image Processing (2023).