Abstract
Owing to the rapid growth of multimedia archives, the semantic gap between the high-level semantic concepts learned by machines and the low-level local features of images has become a central obstacle to accurate image retrieval. To address this issue, this article introduces two novel methods for effective image retrieval, known as visual words integration after clustering (VWIaC) and feature integration before clustering (FIbC). Both methods use the complementary features of the histograms of oriented gradients (HOG) and oriented FAST and rotated BRIEF (ORB) descriptors, built on the bag-of-words (BoW) model, to represent salient objects within the images and to build codebooks of smaller and larger sizes. Larger codebooks are preferred for achieving higher specificity in an image retrieval system, but they yield lower sensitivity, and vice versa. The proposed VWIaC method builds two smaller codebooks to achieve higher sensitivity. The visual words of both smaller codebooks are then integrated to produce a larger codebook, which improves the specificity of the proposed method. The performance of the proposed method is tested on three standard image benchmarks, and the results show that it performs robustly compared with the FIbC method and recent CBIR methods.
Keywords
Introduction
The explosive growth and easy accessibility of public multimedia documents shared on the Internet have led to a surge of research activity in multimedia search. Approaches that apply text search methods to multimedia search have achieved limited success because they ignore the visual contents of the images as a ranking signal. Retrieving images from image benchmarks using content-based attributes of the images (i.e. texture, shape, and color) is called content-based image retrieval (CBIR) [1]. Within multimedia search, image and video re-ranking methods have attracted increasing attention because they enhance initial text-only searches by re-ordering visual documents or images on the basis of multimodal cues. Preliminary image retrieval results usually contain a great deal of noise, which is a challenging issue for CBIR, and discovering visual patterns in a noisy ranked list to guide the re-ranking process is a demanding task. A number of methods for re-ranking content-based image search results have been developed in different domains. Comprehensive surveys of research on multimedia search re-ranking classify the methods used to improve CBIR performance into self-re-ranking, example-based re-ranking, crowd re-ranking, and interactive re-ranking. Although much progress has been made in recent years in improving the accuracy of CBIR, several issues still demand attention. The common issues in CBIR are the semantic gap, scale- and rotation-invariant features, automatic annotation of salient objects within images, the time complexity of the CBIR system, and the lack of versatile feature extraction methods to represent the visual contents of images [2]. Despite these challenges, images contain rich information in the form of visual contents, which removes language barriers when information is shared at the international level. The common applications of CBIR in different domains include education, medical science, and military affairs.
In CBIR, because of their similarity to human perception, the pictorial characteristics of images are mostly computed by applying local feature extraction methods based on texture, shape, color, and spatial arrangement. Texture features represent the salient objects of images in terms of orientation, granularity, divergence, consistency, and seediness, capturing spatial distinctions among the salient objects. The limitations of representing the pictorial characteristics of an image via texture features are that texture segmentation is a difficult task and incurs a high computational cost [3]. Shape-based features extract salient objects (shapes) from the images, but they are easily corrupted by occlusion, distortion, defects, and noise [4]. Color is an important visual attribute for representing salient objects within an image [5]. Color-based feature extraction methods are also robust to rotation and scale changes of the salient objects. A commonly used color-based feature extraction method for CBIR is the color histogram. Its benefits are that the resultant feature vector is simple and computationally efficient to compute and is invariant to rotation, scale, and translation. However, the color histogram loses spatial information about the color distribution within the image, because a single histogram is formed to represent the pictorial characteristics of the entire image. This also reduces the discriminative capability among visual contents, which ultimately affects CBIR performance [6].
The human brain can easily classify images based on their visual contents, but it is difficult for a machine to classify images that are visually and semantically similar. The images in Fig. 1(a-b) are selected from two semantic classes, namely “Sky” and “Sea”. When a machine learning classifier tries to classify such images, it may misclassify them because of the similar visual contents of the two images, which may affect the performance of the CBIR system.

Semantic gap issue, (a-b) images taken from two different semantic classes of “Sky” and “Sea”, respectively with similar visual contents, (c-d) Inter-class variation, objects with similar visual appearances but belongs to two different semantic classes namely “Camel” and “Horse”, respectively.
An inter-class variation occurs between objects belonging to different classes, as shown in Fig. 1(c-d). The salient objects in these images have a similar visual appearance, but the context helps to distinguish the class they belong to. The images containing “Airplanes” and “Birds” shown in Fig. 1(a-b) and the images of “Camel” and “Horse” shown in Fig. 1(c-d) have similar visual appearances with some distinctive features. There are many object classes that differ only slightly in appearance, which makes it difficult for a machine to decide which class a salient object belongs to. This problem can be resolved during CBIR by taking context information into account. The images shown in Fig. 2(a-b) are an example of intra-class variation, in which the salient objects within the images belong to the same class. In many categories, large intra-class variations make the automatic categorization of salient objects of the same class difficult, which also affects the performance of CBIR.

Intra-class variation, each image belonging to the same object-class of “animals” having different appearances.
The HOG feature descriptor is a scale-variant and rotation-variant feature descriptor, while the ORB descriptor is a rotation- and scale-invariant feature descriptor. The ORB feature descriptor is less sensitive to noise than the HOG feature descriptor. The computational complexity of the ORB feature descriptor is about half that of the SIFT descriptor, while it gives improved performance compared with the SIFT and SURF descriptors. The HOG descriptor interprets the visual contents of the image as a whole owing to its global nature, and it also performs better in large-scale image retrieval, object-detection, and activity-recognition or scene-recognition applications. The ORB descriptor interprets the pictorial characteristics of the image in the form of patches owing to its local nature, and local features are preferred for object-recognition applications. In this article, the performance of the proposed CBIR method is improved by integrating global and local features to exploit the complementary features of the HOG and ORB descriptors [7, 8].
To improve CBIR performance and reduce the semantic gap, the proposed method extracts features from each image by applying the HOG and ORB descriptors. Kernel principal component analysis (KPCA) [9] is then applied to each resultant feature vector to reduce its length and thereby decrease the computational cost of the proposed method. Next, two smaller codebooks are formed, one from each reduced feature vector, by applying the k-means++ clustering method [10] to the images of each reported image benchmark. The codebooks contain the features of the HOG and ORB descriptors, respectively, in the form of visual words. A visual word is the name given to a centroid obtained by applying the k-means++ clustering method to the reduced feature space. The visual words of these two smaller codebooks are integrated to generate a larger codebook that comprises the complementary features of the HOG-ORB descriptors. This proposed CBIR method is known as VWIaC. The pictorial characteristics of each image are represented in the form of a histogram built from the visual words of the larger codebook. The histograms of the training images are used to train the classifier. After that, a query image is taken from the test image group, and similarity is calculated by applying the Euclidean distance between the query image and the database images to retrieve the most similar images. The performance of the proposed VWIaC method is compared with the latest CBIR approaches and with a second proposed method, known as FIbC, in which the features of both descriptors are computed from the training images and integrated before clustering to formulate a single codebook that also comprises the complementary features of the HOG-ORB descriptors. The comparisons show that VWIaC produces robust results.
The rest of this article is organized as follows: Section 2 presents relevant recent CBIR methods. Section 3 presents the detailed methodology of the proposed method. Section 4 presents the performance evaluation metrics, the experimental results, and a discussion of the reported image benchmarks. Section 5 concludes the article and outlines future directions.
Yuan et al. [11] propose a novel image retrieval method that integrates the SIFT and local binary pattern (LBP) descriptors using the bag-of-features (BoF) model. The experimental details show that the SIFT-LBP feature combination achieves better results, even when the background is noisy and the images contain ambiguous objects. In variants of the BoW model [12–16], multiple features are fused into a single feature space and a codebook comprising a variety of features is formed. Because the features take different forms, a single feature space produced by simple concatenation suffers from the submergence of feature salience. Moreover, fused features are not enough to achieve significant CBIR performance if they are not fused properly. Jiang et al. [17] therefore propose a hierarchical BoW model, which generates a separate codebook for each feature. Each feature thus provides its own partition and ranks for its candidate images, and the target images are chosen by aggregating the ranks from each feature. A query rank model based on ordinary least squares (OLS) regression is proposed for rank aggregation to enhance the feature salience criteria. Each feature is weighted according to its performance, and the respective images are retrieved. The hierarchical BoW model achieves significant performance by reducing the interactions between different features. Zhao et al. [18] introduce the multi-trend structure descriptor (MTSD) for efficient CBIR, a robust feature representation method that utilizes local structures and multi-trend structures. Local structures, defined using 3×3 blocks, are the basic unit for image content analysis. The multi-trend structures capture the internal correlations between pixels in local structures and describe the variations in information. The visual contents of each image are quantized, and the multi-trend structures serve as a bridge to the local structures, from which color, edge-orientation, and intensity maps are built to extract features. Besides providing color, shape, and texture features, MTSD can also characterize spatial information. The frequencies of the multi-trend structures are counted to obtain the resultant feature vector.
Singh et al. [19] propose a novel image retrieval method based on color and texture features for efficient CBIR. Color histograms represent the color features, whereas the texture features are characterized by the block variation of local correlation coefficients (BVLC) and the block difference of inverse probabilities (BDIP). The color features combined with texture features derived from the brightness component alone deliver results similar to those obtained with all three color components, with the additional benefits of dimension reduction and less processing time. The method also employs the square-chord distance as a similarity measure to achieve robust CBIR performance. Efficient image representation and ranking are critical stages of CBIR, but there is a compatibility issue between feature descriptors and ranking algorithms. Wu et al. [20] introduce a novel image representation method for efficient CBIR known as the texton uniform descriptor (TUD), which is motivated by human visual perception and manifold learning. The manifold structure is preserved by visualizing the distribution of the image representation on a two-dimensional manifold, which provides a baseline for manifold-based ranking. In manifold-based ranking, similarity is propagated from the query to its neighborhood. Preserving the neighborhood structure makes the ranking compatible with the descriptor and propagates the relationships among images. Additionally, to enhance image retrieval performance, an improved manifold ranking is introduced, which randomly selects small-scale images as landmarks to repeatedly propagate end-to-end similarity among images.
Alzubi et al. [21] introduce a novel CBIR method that uses convolutional neural networks (CNNs) with deep learning to achieve improved performance. In CNNs, image features are usually extracted at the last layer with an orderless quantization approach, which limits the use of intermediate convolution layers for identifying local patterns. In this CBIR method, a CNN-based architecture with two parallel feature extractors is first introduced. The convolution layers extract image features at different image locations and scales. The deep CNN is first initialized on a generalized dataset and then tuned for the CBIR task. Moreover, a bilinear root pooling method is presented to obtain a compact yet highly discriminative image descriptor. Lastly, back-propagation is performed and the parameters are learned for image retrieval. This method achieves remarkable performance on standard datasets without prior semantic knowledge about the image datasets, and it attains a noteworthy dimension reduction of the feature vector during the feature extraction phase. Sajjad et al. [22] introduce a novel CBIR method that constructs a discriminative representation by integrating two properties of each image. Initially, the query image is converted to the HSV color space and then quantized to limit the number of colors. After that, texture features are extracted on the basis of the uniform patterns of the rotated LBP (RLBP). Color histogram features are computed from the quantized images, and the texture features are matched to evaluate the representation of the image. The uniform-pattern histogram and the quantized color histogram of each image are integrated to formulate the resultant feature vector. This method gives robust performance under illumination variations by utilizing invariant color components, and the uniform patterns in RLBP further enhance performance.
Proposed methodology
The proposed CBIR method uses complementary features of HOG-ORB descriptors based on the BoW model to generate smaller and larger sizes of codebooks, which improves the performance of CBIR. Each image denoted by F is selected from the reported image benchmark. The detail of each step of the proposed CBIR methods based on the VWIaC and FIbC is as follows:
Extraction of the HOG features
For the extraction of HOG features [7], keypoints are detected by applying the FAST detector [8, 23]. Each image is decomposed into small patches of 32 × 32 pixels, and a histogram of oriented gradients is computed from each patch. A block-wise pattern is used to normalize the histogram of each patch, which yields a HOG feature descriptor for each patch. The feature descriptors of all patches are then concatenated to formulate the resultant HOG feature vector. Let the non-normalized histograms calculated from all the patches of each image be denoted by a feature vector X; each histogram is then normalized by applying the following equations to each patch of the image:
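The normalization equations did not survive in this copy of the text, so the following is only a minimal sketch of the idea, not the authors' formulation. It assumes unsigned gradients, nine orientation bins, and plain L2 normalization of a single patch histogram; the function names and the 8 × 8 toy patch are illustrative choices.

```python
import math

def orientation_histogram(patch, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations for one
    grayscale patch given as a list of rows (assumed layout)."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]  # central difference in x
            gy = patch[y + 1][x] - patch[y - 1][x]  # central difference in y
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0  # fold to [0, 180)
            hist[int(angle // (180.0 / bins)) % bins] += mag
    return hist

def l2_normalize(hist, eps=1e-6):
    """L2 normalization; the small eps avoids division by zero on flat patches."""
    norm = math.sqrt(sum(v * v for v in hist) + eps * eps)
    return [v / norm for v in hist]

# Toy patch: intensity increases from left to right, so every gradient
# points along the x axis and falls into the first orientation bin.
patch = [[float(x) for x in range(8)] for _ in range(8)]
hog = l2_normalize(orientation_histogram(patch))
```

Concatenating the normalized histograms of all patches of an image would then give a vector playing the role of the HOG feature vector described above.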
The ORB feature vector [8] is formulated by applying the FAST detector to each image to detect keypoints; at those keypoints, ORB features are computed by applying the steered BRIEF descriptor. The ORB descriptor is an alternative to the SIFT and SURF descriptors; it performs well in classification problems owing to its binary nature and requires less computational cost than the SIFT and SURF descriptors. The ORB descriptor utilizes an improved version of the BRIEF descriptor that overcomes BRIEF's lack of rotation invariance. It also ensures invariance to in-plane rotations and is less sensitive to noise than the BRIEF, SIFT, and SURF descriptors. The ORB feature vector of dimension 2 × m is computed at any keypoint $(a_j, b_j)$ of the image patch $i_p$ by applying the following equations:
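The equations referred to here are missing from this copy. As a hedged illustration of what "steered BRIEF" means, the sketch below follows the published ORB idea: the patch orientation is estimated from the intensity centroid, and the binary test locations are rotated by that angle before comparison. The test pairs, the nearest-pixel sampling, and all names are assumptions made for the example, not the authors' settings.

```python
import math

def patch_orientation(patch):
    """Intensity-centroid angle theta = atan2(m01, m10), with moments taken
    about the patch centre."""
    h, w = len(patch), len(patch[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    m10 = sum((x - cx) * patch[y][x] for y in range(h) for x in range(w))
    m01 = sum((y - cy) * patch[y][x] for y in range(h) for x in range(w))
    return math.atan2(m01, m10)

def steered_brief(patch, pairs, theta):
    """One bit per test pair: 1 if the first rotated sample is darker than the
    second. `pairs` holds (dx, dy) offsets from the patch centre."""
    h, w = len(patch), len(patch[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = math.cos(theta), math.sin(theta)

    def sample(dx, dy):
        # Rotate the offset by theta, then read the nearest pixel (clamped).
        x = min(max(int(round(cx + c * dx - s * dy)), 0), w - 1)
        y = min(max(int(round(cy + s * dx + c * dy)), 0), h - 1)
        return patch[y][x]

    return [1 if sample(*p) < sample(*q) else 0 for p, q in pairs]

# The same intensity ramp in two orientations (left-to-right, top-to-bottom).
ramp_x = [[float(x) for x in range(8)] for _ in range(8)]
ramp_y = [[float(y) for _ in range(8)] for y in range(8)]
pairs = [((-2, 0), (2, 0)), ((0, -2), (0, 2)), ((-3, -1), (1, 3))]
desc_x = steered_brief(ramp_x, pairs, patch_orientation(ramp_x))
desc_y = steered_brief(ramp_y, pairs, patch_orientation(ramp_y))
```

On this toy input the two steered descriptors coincide, whereas unsteered tests (`theta = 0.0`) on `ramp_y` give a different bit string, which is the in-plane rotation invariance the paragraph describes.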
In this step, the KPCA method [9] is applied to the extracted HOG and ORB features to select the best features and to reduce the length of each feature vector as well as the computational cost. KPCA is a non-linear version of PCA: a non-linear map takes the input space into a feature space, and the principal components are computed in that feature space. KPCA uses a non-linear function ω to map the data into a feature space F, and linear PCA is then applied to the mapped data. Because the feature space can be high-dimensional, KPCA uses Mercer kernels instead of computing the mapping ω explicitly. A Mercer kernel m(a, b) is a function that yields a positive matrix $M_{ij} = m(a_i, b_j)$ for all datasets $\{d_i\}$. Using the function m instead of the dot product is equivalent to mapping the data with some ω into a feature space F, i.e., m(a, b) = ω(a) · ω(b). KPCA is applied in the feature space F by finding eigenvalues λ > 0 and eigenvectors v ∈ F \ {0} that satisfy the eigenvalue equation of the covariance matrix, which is designated as follows:
Note that all solutions v must lie in the span of the ω-images of the sample data when the covariance matrix is substituted into the eigenvector equation. This implies the following equivalent equation:
and there exist coefficients $\beta_1, \dots, \beta_n$ such that
The non-linear principal components for the ω-image of a test point a are extracted by computing the projection onto the mth component, which is designated as follows:
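Since the KPCA equations are missing from this copy, the following sketch shows the standard computation under stated assumptions: an RBF kernel (the source does not say which Mercer kernel is used), a centred Gram matrix, and power iteration standing in for a full eigensolver. Here `lam` is an eigenvalue of the centred Gram matrix, which in the usual derivation equals n times the corresponding covariance eigenvalue. All names and the toy data are illustrative.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Example Mercer kernel m(a, b); the choice of RBF is an assumption."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def centered_gram(data, kernel):
    """Gram matrix M_ij = m(a_i, a_j), centred in feature space."""
    n = len(data)
    K = [[kernel(a, b) for b in data] for a in data]
    row_mean = [sum(row) / n for row in K]
    total_mean = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + total_mean
             for j in range(n)] for i in range(n)]

def leading_component(Kc, iters=200):
    """Power iteration: largest eigenvalue and unit eigenvector (the
    coefficients beta) of the centred Gram matrix."""
    n = len(Kc)
    beta = [1.0 / (i + 1) for i in range(n)]
    lam = 0.0
    for _ in range(iters):
        nxt = [sum(Kc[i][j] * beta[j] for j in range(n)) for i in range(n)]
        lam = math.sqrt(sum(v * v for v in nxt))
        beta = [v / lam for v in nxt]
    return lam, beta

# Two well-separated toy clusters of 2-D points.
data = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]]
Kc = centered_gram(data, rbf_kernel)
lam, beta = leading_component(Kc)
# Scaling beta by 1/sqrt(lam) makes v unit-length in feature space, so the
# projection of training point i is sum_j (beta_j / sqrt(lam)) * Kc[i][j],
# which simplifies to sqrt(lam) * beta_i.
projections = [math.sqrt(lam) * b for b in beta]
```

With these toy clusters the first non-linear component takes one sign on the first three points and the opposite sign on the last three. Keeping only the leading few components is what shortens the feature vectors.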
In this step, the k-means++ clustering method [10] is applied to each reduced feature vector of the HOG and ORB descriptors, which generates two codebooks designated by $C_{BHOG}$ and $C_{BORB}$, respectively. Each codebook is a collection of clusters or visual words. In the k-means++ clustering method, the n given data points are divided into k clusters, and each data point is assigned to the nearest centroid. The goal of the k-means++ clustering method is to minimize the objective function, which is designated as follows:
The drawback of the k-means clustering method is that the number of clusters must be specified in advance. The initial centroids are selected randomly, and the final results depend on this initialization: a bad selection of initial centroids may cause the algorithm to terminate at a local optimum. Generally, a large k decreases the error but may result in over-fitting. The method also fails to handle noisy data and outliers. Therefore, we have chosen the k-means++ clustering method, which uses a weighted method for selecting the initial centroids so that points that are farther apart tend to be selected. Although the initialization is more complex and takes longer, the clusters are more accurate and the overall computational cost is reduced because fewer iterations are needed. In the k-means++ clustering method, let X be a dataset having N points $(p_1, \dots, p_N)$ and let $c_1$ be an initial centroid selected by a weighted method from X. The squared distances between all the points in the dataset and $c_1$ are computed as $D_i^2 = \|p_i - c_1\|^2$. A second centroid $c_2$ is then selected from X by applying the probability distribution $D_i^2 / \sum_{j=1}^{N} D_j^2$.
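The seeding step can be sketched as follows. This is a minimal illustration with hypothetical names: each further centroid is drawn with probability proportional to the squared distance to the nearest centroid chosen so far. One labeled deviation from the text is that the first centroid is drawn uniformly at random, as in the original k-means++ formulation.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seeds(points, k, rng):
    """Pick k initial centroids from `points` using D^2 weighting."""
    centroids = [rng.choice(points)]  # assumption: uniform first pick
    while len(centroids) < k:
        # D_i^2: squared distance of each point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        # Sample the next centroid with probability D_i^2 / sum_j D_j^2.
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

rng = random.Random(0)  # fixed seed so the toy run is repeatable
points = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (9.0, 0.0), (9.2, 0.3)]
seeds = kmeanspp_seeds(points, 3, rng)
```

A point that is already a centroid has weight zero and cannot be drawn again, so the seeds are distinct as long as k does not exceed the number of distinct points. The usual k-means assignment and update iterations would follow from these seeds.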
The generated codebooks are vertically concatenated, which is designated as follows:
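The concatenation equation is not present in this copy. A minimal sketch of vertical concatenation follows, assuming each codebook is a list of visual words (rows) and that both codebooks share the same word dimensionality after KPCA. The function name and the toy values are hypothetical.

```python
def integrate_codebooks(codebook_hog, codebook_orb):
    """Stack the visual words of two codebooks row-wise (vertical concatenation)."""
    dim = len(codebook_hog[0])
    if any(len(word) != dim for word in codebook_hog + codebook_orb):
        raise ValueError("visual words must share one dimensionality")
    return codebook_hog + codebook_orb

# Two toy codebooks with k = 2 visual words each.
codebook_hog = [[0.1, 0.9], [0.8, 0.2]]
codebook_orb = [[0.5, 0.5], [0.3, 0.7]]
codebook_merged = integrate_codebooks(codebook_hog, codebook_orb)
```

Two codebooks of k words each give a merged codebook of 2k words, which matches the "twice the size" codebook discussed for VWIaC.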
In the case of the proposed FIbC method, HOG and ORB features are extracted from the images. The dimension of each feature vector is reduced by applying KPCA to the extracted features. The reduced feature vectors are then concatenated (integrated) to obtain the benefits of the complementary features of the HOG-ORB descriptors. The k-means++ clustering method is applied to the integrated features of the HOG-ORB descriptors, which generates a single codebook.
In this step, the visual contents of each image are represented in the form of a histogram by utilizing its visual words from the resultant codebook, which is designated by $C_{BR}$, according to the details given in [24, 25].
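As an illustration of this step, the sketch below uses the common bag-of-words construction: each descriptor votes for its nearest visual word and the counts form the image histogram. Hard assignment and normalization to unit sum are assumptions here; the exact procedure is the one in the cited references.

```python
def nearest_word(descriptor, codebook):
    """Index of the visual word closest to the descriptor (squared Euclidean)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(descriptor, codebook[i])))

def bow_histogram(descriptors, codebook):
    """Count how often each visual word is the nearest one, then normalize."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest_word(d, codebook)] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Toy codebook of three visual words and four descriptors from one image.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descriptors = [[0.1, 0.0], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]]
histogram = bow_histogram(descriptors, codebook)
```

In this toy case one descriptor falls on the first word, two on the second, and one on the third. The histogram length equals the codebook size, so a merged codebook of 2k words gives a 2k-bin image representation.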
Training of the classifier
In this step, the histograms of the training images are used to train the proposed CBIR model. We trained the proposed CBIR model using an SVM with a polynomial kernel [26]. The SVM algorithm relies on a mathematical function, referred to as the kernel, for training the model. The kernel function takes the data as input and transforms it into the required form. It returns the inner product between two data points in a suitable feature space and thereby defines similarity at low computational cost, even in high dimensions. We have used the polynomial kernel, which represents the similarity of the training vectors in a feature space over polynomials of the original variables, allowing non-linear models to be learned. The polynomial kernel of SVM is designated as follows:
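The kernel equation is missing from this copy. A commonly used form of the polynomial kernel is $K(x, y) = (\gamma\, x \cdot y + c)^d$, sketched below. The parameter names `gamma`, `coef0`, and `degree` and their default values are illustrative assumptions, not settings reported by the authors.

```python
def polynomial_kernel(x, y, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel K(x, y) = (gamma * <x, y> + coef0) ** degree."""
    dot = sum(a * b for a, b in zip(x, y))
    return (gamma * dot + coef0) ** degree

def gram_matrix(histograms, **params):
    """Pairwise kernel values between training histograms."""
    return [[polynomial_kernel(h1, h2, **params) for h2 in histograms]
            for h1 in histograms]

# Three toy image histograms over a three-word codebook.
train_histograms = [[0.25, 0.5, 0.25], [0.6, 0.2, 0.2], [0.1, 0.1, 0.8]]
K = gram_matrix(train_histograms, degree=2)
```

An SVM solver would take such a Gram matrix, or the kernel function itself, together with the class labels of the training images.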
In this step, we take a query image from the test group of the image database and measure the similarity between the histogram of the query image and the histograms of the database images by applying the Euclidean distance [27–30].
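A minimal sketch of this matching step follows, assuming the database is a mapping from image names to histograms. The image names and the `top_k` parameter are hypothetical.

```python
import math

def euclidean(h1, h2):
    """Euclidean distance between two histograms of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(h1, h2)))

def rank_database(query_hist, database, top_k=3):
    """Return the names of the top_k database images closest to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: euclidean(query_hist, item[1]))
    return [name for name, _ in ranked[:top_k]]

# Hypothetical image names with toy three-bin histograms.
database = {
    "beach_01": [0.7, 0.2, 0.1],
    "beach_02": [0.6, 0.3, 0.1],
    "dino_01": [0.1, 0.1, 0.8],
}
query = [0.65, 0.25, 0.10]
results = rank_database(query, database, top_k=2)
```

Smaller distances mean more similar visual-word distributions, so the two "beach" entries are returned ahead of the "dino" entry for this query.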

Block diagram of the proposed CBIR method which employs VWIaC.
Performance evaluation metrics
The standard performance evaluation metrics used to measure the performance of the proposed CBIR methods are “Specificity or Precision ($S_p$)”, “Sensitivity or Recall ($S_n$)”, “average specificity or average precision ($A_s$)”, “mean average precision (mAP)”, and the “precision-recall ($P_r$) curve”. These performance evaluation metrics are defined by the following equations:
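The defining equations are missing from this copy, so the sketch below uses the definitions common in information retrieval: precision is the fraction of retrieved images that are relevant, recall is the fraction of relevant images that are retrieved, and average precision averages the precision at each rank where a relevant image appears. The authors' own definition of $A_s$ may differ, for example by averaging precision over queries or classes. All names and data are illustrative.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved items that are relevant."""
    return sum(1 for r in retrieved if r in relevant) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant items that were retrieved."""
    return sum(1 for r in retrieved if r in relevant) / len(relevant)

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks of relevant items,
    divided by the total number of relevant items."""
    hits, score = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant)

def mean_average_precision(runs):
    """Mean of average precision over (ranked list, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Toy query: four retrieved images, three relevant images in the database.
ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
p = precision(ranked, relevant)
r = recall(ranked, relevant)
ap = average_precision(ranked, relevant)
m_ap = mean_average_precision([(ranked, relevant), (["c", "a"], {"c"})])
```

Evaluating precision and recall at increasing cut-offs of the ranked list gives the points of a precision-recall curve.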
The performance of the proposed CBIR methods based on VWIaC and FIbC is evaluated on three standard image benchmarks: COREL-1K [31], COREL-1.5K [31], and Holidays [32]. The COREL-1K image benchmark consists of 1000 images grouped into 10 semantic groups of 100 images each. The COREL-1.5K image benchmark contains 1500 images grouped into 15 semantic groups, each of which also contains 100 images. The resolution of each image in the COREL-1K and COREL-1.5K image benchmarks is either 384 × 256 pixels or 256 × 384 pixels. Both image benchmarks contain images belonging to different semantic groups such as “Flowers”, “Beach”, “Mountains”, “Dinosaurs”, and “Tigers”. The Holidays image benchmark contains a total of 1491 images organized into 500 semantic groups. This image benchmark comprises images with numerous variations in rotation, scale, blurring, and illumination, and the resolution of each image is 2448 × 3204 pixels. The first image of each semantic group of the Holidays image benchmark is taken as a test image, and the remaining images in each semantic group are used for training.
Experimental outcomes and discussions
This section presents the experimental outcomes and discussion of the proposed CBIR method based on VWIaC and compares its performance with the proposed CBIR method based on FIbC as well as with standard CBIR methods on the COREL-1K, COREL-1.5K, and Holidays image benchmarks. The proposed CBIR method based on VWIaC outperforms its competitor CBIR methods for the following reasons: it uses the complementary features of the HOG and ORB descriptors for effective image representation; it uses KPCA as an efficient feature reduction method; it uses a codebook of twice the size to represent the visual contents of the images; it uses smaller codebooks (i.e. the individual codebooks of the HOG and ORB descriptors), which improves its “Recall”; and it combines the smaller codebooks to generate a larger codebook, which improves its “Precision”. To demonstrate the robust performance of the proposed CBIR method based on VWIaC, its performance is evaluated on the COREL-1K, COREL-1.5K, and Holidays image benchmarks and compared with standard CBIR methods; the experimental details are shown in Tables 1–3.
Performance estimation of the proposed CBIR method of VWIaC with competitor CBIR methods on the COREL-1K image benchmark (best performances are highlighted in bold face)
Performance estimation of the proposed CBIR method of VWIaC with competitor CBIR methods on the COREL-1.5K image benchmark (best performances are highlighted in bold face)
Performance estimation of the proposed CBIR method of VWIaC with competitor CBIR methods on the Holidays image benchmark (best performances are highlighted in bold face)
The experimental details of the CBIR methods based on the single HOG and ORB descriptors, the proposed CBIR method based on VWIaC, and its competitor CBIR method FIbC for different codebook sizes are shown in Figs. 4–6 for the COREL-1K, COREL-1.5K, and Holidays image benchmarks, respectively. The best mAP performance of each CBIR method at a particular codebook size is highlighted for clarity. On the COREL-1K image benchmark, the best mAP performance of 81.94% is attained with a codebook size of 600 visual words using the proposed method based on VWIaC.

Performance estimation of the proposed CBIR method of VWIaC with competitor CBIR methods built on the BoW mechanism on diverse sizes of the codebook on the COREL-1K image benchmark.

Performance estimation of the proposed CBIR method of VWIaC with competitor CBIR methods built on the BoW mechanism on diverse sizes of the codebook on the COREL-1.5K image benchmark.

Performance estimation of the proposed CBIR method of VWIaC with competitor CBIR methods built on the BoW mechanism on diverse sizes of the codebook on the Holidays image benchmark.
Similarly, the experimental details presented in Figs. 5 and 6 also prove the robust mAP performance of the proposed CBIR method based on the VWIaC on a codebook of all reported sizes as compared with its competitor CBIR methods. The best mAP performances of 77.66% and 68.76% are achieved on codebook sizes of 800 and 600 visual words, respectively using COREL-1.5K and Holidays image benchmarks.
The image retrieval result for a query image chosen from the semantic group “Beach” of the COREL-1K image benchmark is shown in Fig. 7. Similarly, the image retrieval result for a query image chosen from the semantic group “Sunset” of the COREL-1.5K image benchmark is shown in Fig. 8. The image shown at the top of each figure is the query image, while the remaining images in each figure are the retrieved images. The numeric value shown at the top of each image is the score of the classifier, which is used as a measure of similarity between the visual contents of the query image and the retrieved images.

Recovered images as a response to the given query image (shown at top), which is picked from the group “Beach” of the COREL-1K image benchmark.

Recovered images as a response to the given query image (shown at top), which is picked from the group “Sunset” of the COREL-1.5K image benchmark.
The proposed research work introduces two novel methods for effective content-based image retrieval, known as VWIaC and FIbC, to analyze the effect of the complementary features of the HOG-ORB descriptors on codebook generation. The proposed VWIaC method performs better than the FIbC method because it uses the complementary features of the HOG-ORB descriptors and represents the visual contents of the images with a codebook of twice the size. In VWIaC, this larger codebook is formed by integrating two smaller codebooks (i.e. one built from HOG features and the other from ORB features). The smaller codebooks improve the sensitivity, while the larger codebook improves the specificity of the proposed method. The feature dimension is reduced by applying the KPCA method to the extracted HOG-ORB features, which lowers the computational cost. In future work, texture and color features, which have shown robust performance, will be considered along with relevance feedback to evaluate the performance of the proposed method using convolutional neural networks (CNNs).
Funding
This research work was supported by the Machine Learning Research Group [RG-CCIS-2017-06-02] of Prince Sultan University, Riyadh, Saudi Arabia. The authors are grateful for this financial support.
