Joint index and cache technique for improving the effectiveness of a similar image search in big data framework

Abstract

Nowadays, the exponentially increasing amount of digital images available imposes a great challenge to a content-based image retrieval (CBIR) system due to the requirement of extensive-computing. Considering this challenge, this paper presents an approach to achieve effectiveness and scalability of a CBIR system in a large-scale dataset. To do that, we propose a cache mechanism to spare the distance computation efforts of a retrieval task in the CBIR system. Additionally, a MapReduce technique is presented to exploit the cached data in a parallel facility, thereby not only improving the performance of a CBIR system but also ensuring scalability for the system. Additionally, a collaborative caching service has been introduced for enhancing the data availability, thus decreasing the network traffic load due to fetching data remotely in the distributed environment. Moreover, by clustering the dataset before a search, this system can be efficient at responding to a user query since only a portion of the dataset is actually operated at a time. Through experiments, our approach obtains significant efficiency gains compared to other methods in terms of response time and achieves an acceptable accuracy ratio, which is applicable in the practical environment.

Keywords

CBIR cache index scheme MapReduce cloud

1 Introduction

The widespread presence of digital cameras and the emerging trends of sharing multimedia data on social networks triggered an exponential increase of online image data. An exhaustive search is infeasible for these large-scale datasets due to its extensive time requirement. To address this obstacle, it is necessary to develop tools for efficiently indexing and retrieving these data. Generally, traditional approaches retrieved images based on image metadata and keywords might be unfeasible due to either the ambiguity in the image or semantics of words in the context surrounding an image. In this scenario, content-based image retrieval (CBIR) systems [1, 2] have been developed by researchers over the past decades as an alternative to overcome those limitations by considering the content of images for supporting retrieved images.

Basically, CBIR adopts an image as a query and retrieves images according to the match between two images. The critical points of similarity estimation depend on three important issues: 1) an effective feature description (the content of an image), 2) a proper indexing scheme, and 3) an appropriate distance metric. For image representation, researchers have spent many years attempting to identify and design effective feature descriptors for CBIR [3, 4], including global features [5] and local features (e.g., scale-invariant feature transform [6] and bag-of-visual-words representations [7]). In early CBIR, images were popularly represented by a single holistic feature, such as color correlogram [8], shape [9], etc. Later, with the introduction of the SIFT [10] and SURF feature [6], local visual features have been widely explored for CBIR. This feature demonstrates the excellent property of invariance to rotation and resizing transformations, and robustness to illumination changes. Generally, thousands of local features can be extracted from a single image. Meanwhile, the evaluation database has been increased to be on the scale of millions [11]. To address large-scale visual data, it becomes crucial to design an efficient indexing scheme.

Inspired by the success of information retrieval, the unsupervised index scheme is leveraged by grouping the similar image descriptor vectors into clusters. With local features quantized to a fixed-size vector, the training images are efficiently indexed by comparison with similar images using the Euclidean distance. Through the index structure, it is efficient to identify these target images sharing common representations with the query image and compute their similarity scores for ranking [12]. Note that unlike hashing techniques [13], the unsupervised index scheme still works well in high dimensional features, and therefore can still guarantee an acceptable ratio on the retrieval accuracy. With the above advantages, the unsupervised index structure has been exploited in many image retrieval systems [14 , 18].

With the aid of the indexing scheme, CBIR can avoid an exhaustive search over the whole database. However, the CBIR performance still degrades because of the exponentially increasing multimedia content. Recent research studies another approach, which leverages the power of the distributed processing big data framework Hadoop to enhance CBIR performance [16 –18]. The master-slave architecture controls the query requests from users in a parallel manner. A master node receives a query request, then based on data organization information, it divides a big task into sub-tasks and delivers them to slave nodes, where the required data to process the sub-tasks are located. The philosophy of pushing analysis code close to the data it is intended to analyze, rather than requiring code to read data across a network, has made Hadoop popular in the big data processing field. Hence, for a CBIR system integrated with Hadoop, each slave node will maintain its local index scheme for its local data [18]. The disadvantage of previous studies is that when a slave node is overloaded, the task is performed in another slave node, which requires remote access of the data from this node to the overloaded node. This degrades the performance of the CBIR system due to the I/O bottleneck access over the network. To overcome this issue, we propose a collaborative distance-based cache to alleviate the I/O access cost. Here, every slave node maintains a local distance-based cache when using an index scheme. Additionally, a global distance-based cache is maintained to collaborate the local caches, aiming at reducing the I/O cost during remote access.

For the above observations, we propose an efficient framework for retrieving similar images in the distributed environment, summarized as follows:

We employ a fast and robust feature extraction technique, namely SURF (speeded-up robust features), which extracted features based on the appearance of the object at particular points of interests and are invariant to image scale and rotation, then represented image descriptors for every point of interest identified previously. The goal is to provide a unique and robust description of an image feature, enabling image search for different image features.

We introduce an index structure based on the unsupervised clustering algorithm VP-Tree to improve the performance in processing a similarity query. This technique aims to accelerate data retrieval by grouping relevant objects based on distance criteria. Moreover, this approach narrows the search space, avoiding the computation of distances among all images of a given dataset.

We propose an in-memory distance-based cache for a CBIR system. This proposed method significantly increases the performance in terms of memory access rate while calculating a distance between pairs of images. It also avoids duplicate calculations during the search for similar images. In addition, a global collaborative cache service is implemented to enhance the data availability, thus avoiding the remote access data pattern, which causes performance degradation of a distributed application.

The three above techniques are intertwined inside a MapReduce framework for accomplishing the similarity query process.

The remainder of this paper is organized as follows. Section 2 provides related studies relevant to the topic. In Section 3 Methodology, our methodology is discussed in detail. Section 4 presents an architectural overview of the proposed approach while the performance evaluation is discussed in Section 4. Conclusions and future work are summarized in Section 6.

2 Related work

A large percentage of the increase in data management requirements consists of media data such as images and videos. For instance, there are 250 billion images stored in a Facebook datacenter, while YouTube claims that more than 300 hours of video content are uploaded by users every minute. Therefore, the scalability of a CBIR system has been significantly studied in previous work [11 , 19].

Some studies have focused on a centralized approach in CBIR systems. Jégou et al. [20] proposed an indexing scheme based on the concept of product quantization combined with an inverted file system for nearest neighbors search. They evaluated the performance by indexing two billion feature vectors. Lejsek et al. [21] introduced a version of the NV-tree for the nearest neighbors search that indexes 2.5 billion local feature vectors. While in [22], the authors describe results using the NV-tree to index a collection of 28.5 billion local features. In 2015, the authors in [23] employed the inverted multi-index to study the performance of the BIGANN dataset, containing one billion features. In contrast to these approaches, our indexing scheme is designed for parallel processing. Furthermore, N.Kumar et al. [24] evaluated VP-Tree performance in terms of the construction cost and search speed. Compared with other index methods, they claim that VP-Tree achieves better performance. It motivates us to employ VP-Tree as an index scheme in our work.

The earliest study of designing distributed indexing is presented in [25]. The authors report results on indexing approximately 1.5 billion feature vectors using 2,000 workstations. More recently, in [26], the authors utilize a multi-probe Kmedoids-LSH method, indexing two billion feature vectors to support a sketch-based retrieval system, but using only 10 servers. The first study of implementing the system of image retrieval on Hadoop is investigated by Zhang et al. [27]. Since then, variants of similar systems have been proposed, but most experiment on relatively small collections (e.g., see [17 , 28–30]). One study used the largest collection of these, called ImageTerrier [31], indexing 10.9 million images using bag-of-words features from approximately 10 billion SIFT feature vectors. Some studies done on medical image retrieval domains were also inspired by such Hadoop frameworks, again with relatively small collections (i.e., [29, 32]). Moise et al. [18] built a CBIR system on top of an extended cluster-pruning (eCP) algorithm and introduced an index search scheme for large batches of queries. The experiments on a dataset contain approximately 100 million images extracted approximately 30 billion SIFT feature vectors using a cluster of more than 100 machines. Additionally, with recent explosive research on deep neuron networks, there are some studies that have demonstrated the outstanding performance in various vision fields. A recent work investigating collaborative index embedding combined with a convolutional neuron network (CNN) is presented in [33]. While the authors in [34] combined Support Vector Machine (SVM) and CNN to enhance for enhancing image retrieval process, a study in [35] employed deep belief network (DBN) method to extract the image features and improve similar measurements. However, it is beyond the scope of our work.

Many recent approaches have proposed to enhance the performance of CBIR systems based on Hadoop framework. Herein, [27, 28] are two first studies that employed Hadoop framework for improving a CBIR performance. However, the scalability of these approaches are limited due to the small datasets. Moise et al. in [18] proposed a Haddop-based CBIR framework using eCP indexing technique for a large dataset. They evaluated the performance on very large cluster infrastructure, including 192 computing nodes. Otherwise, the work in [29, 32] are applied for medical images system. The datasets are only used in a specific area, while our approach is performed on large datasets, including many subjects such as building, castle, computers, etc. Lastly, the study in [30] designed a CBIR system based on Hadoop for quickly retrieval image. However, the similarity measure between images is performed by colors, which limit the accuracy of CBIR system. Overall, a study of Yin et al. [17] is one of the most relevant studies to ours.

While all the above works focused on using MapReduce parallelization facilities for improving the response time of a CBIR system, none utilize the cache mechanism to enhance the overall performance. Our work proposes an elegant collaborative cache using the Redis (Remote Dictionary Server) framework employed in the Hadoop distributed system. It not only reduces the I/O cost but also increases the data availability for the Hadoop distributed system. Therefore, the proposed CBIR system can gain better performance. To the best of our knowledge, this is the first study to propose a collaborative cache mechanism for a CBIR system in the distributed environment. The differences between our study and other works are summarized in the Table 1.

Table 1
Existing approaches and the approach presented here

Works Distributed framework Index scheme Local cache Collaborative cache Dataset

Zhang et al. [27] Hadoop No No No Small

C. Gu et al. [28] Hadoop No Yes No Small

DongSheng Yin et al. [17] Hadoop LSH No No Small

Moise [18] Hadoop eCP No No Large

Swapnil P. Dravyakar [30]) Hadoop No No No Small

R. K. Grace et al. [29] Hadoop No No No Small

S. Jai-Andaloussi et al. [32]) Hadoop No No No Small

Ours Hadoop VP-Tree Yes Yes Large

Works	Distributed framework	Index scheme	Local cache	Collaborative cache	Dataset
Zhang et al. [27]	Hadoop	No	No	No	Small
C. Gu et al. [28]	Hadoop	No	Yes	No	Small
DongSheng Yin et al. [17]	Hadoop	LSH	No	No	Small
Moise [18]	Hadoop	eCP	No	No	Large
Swapnil P. Dravyakar [30])	Hadoop	No	No	No	Small
R. K. Grace et al. [29]	Hadoop	No	No	No	Small
S. Jai-Andaloussi et al. [32])	Hadoop	No	No	No	Small
Ours	Hadoop	VP-Tree	Yes	Yes	Large

3 Methodology

3.1 Cache methodology

It is clear that access data from the cache is much faster as compared to disk access. With this motivation, a distance-based cache strategy, which was inspired by [36], is applied to our system in order to reduce the overhead of distance computations during a query. The main functionality of the distance-based cache is twofold. First, given a query object o_q and a database object o_i, the cache should quickly determine the desirable value in case it already existed in the cache. However, as the actual value could only be found in the condition, it has been computed in the previous sessions. Thus, this functionality is limited in some special cases, such as repeated queries or querying by a database object [36]. The second functionality is that given a query object o_q and a database object o_i, the cache must have the ability to quickly determine the tightest possible lower and upper bounds of σ (o_q, o_i), which denote the distance from o_q to o_i. This faster approximation of lower/upper bound distances is useful to filter out some nonrelevant objects or even a whole part of the index.

It is assumed that the track of distances is stored in the cache as a tuple form: $(id (o_{q}), id (o_{i}), σ (o_{q}, o_{i}))$ (1) where id (o_q) and id (o_i) denote unique identifiers of objects q and o, respectively, and σ (o_q, o_i), is a distance function which has to satisfy some conditions in a metric space, such as $σ (x, y) = 0 \Leftrightarrow x = y$ (2) $σ (x, y) > 0 \Leftrightarrow x \neq y$ (3) $σ (x, y) = σ (y, x)$ (4) $| σ (x, y) - σ (y, z) | \leq σ (x, z) \leq σ (x, y) + σ (y, z)$ (5) Otherwise, let us symbolize a lower-bound distance of σ (o_q, o_i) as σ_LB (o_q, o_i) ≤ σ (o_q, o_i) and an upper-bound distance as σ_UB (o_q, o_i) ≥ σ (o_q, o_i). In order to enable the second functionality, we need to know the information σ (o_{p
_k}, o_i) and σ (o_{p
_k}, o_q) in advance, where o_{p
_k} denotes some past queried objects. This means that the cache has to be filled with data from past sessions. In particular, the objects o_{p
_k}, along with their distances to database objects σ (o_{p
_k}, o_i), need to be maintained in the cache before a current query object actually starts. Moreover, the number of past queries should be considered carefully due to the limitation of the main memory and the extra cost when computing the distance σ (o_{p
_k}, o_q). Supposing that o_{p
_k} ∈ AP ⊂ PQ, where PQ is the set of all past queried objects within the current session and AP is an actual set of selected past queried objects, a fixed size |AP| = k < |PQ| is chosen to determine their distances to o_q. In addition, the distances of past queries to the current database objects σ (o_{p
_k}, o_i) already exist in the cache, where they were inserted during earlier sessions. Consequently, with respect to such available values σ (o_{p
_k}, o_i) and σ (o_{p
_k}, o_q), the tightest lower/upper bounds of a distance σ (o_q, o_i) are formulated as: $σ_{LB} (o_{q}, o_{i}) = \max_{o_{p_{i}}} (| σ (o_{p_{k}}, o_{i}) - σ (o_{p_{k}}, o_{q}) |)$ (6) $σ_{UB} (o_{q}, o_{i}) = \min_{o_{p_{i}}} (| σ (o_{p_{k}}, o_{i}) + σ (o_{p_{k}}, o_{q}) |)$ (7) with all objects o_{p
_k} belonging to a set of past queried objects AP. From Equations (6) and (7), it is worth noticing that the cache needs to keep o_{p
_k} objects which are very close to or very far from the current query object o_q, as the bigger or smaller σ (o_{p
_k}, o_q), the closer σ_LB (o_q, o_i) and σ_UB (o_q, o_i) to the actual distance σ (o_q, o_i). Furthermore, it is reasonable that the k number of past queries are selected as the most recent ones because there is a high possibility that recent query objects have more distances stored in the cache than the older ones.

After that, a filter can be applied by using lower/upper bound distances derived from the cache, instead of calculating the exact σ again. As shown in Fig. 1, suppose we have a query ball (q, r_q) with respect to a range query within the radius r_q, and a data ball (o_i, r_{o
_i}), which partitions the metric space into two half-spaces: one has distances to o_i less than a radius r_{o
_i} and other has distances to o_i greater than a radius r_{o
_i}. Noticeably, the data region can be eliminated from the search if there is no overlapping between the two balls (Fig. 1a) by using the tightest lower bound equation: $σ_{LB} (o_{q}, o_{i}) > r_{q} + r_{o_{i}}$ (8) Or, in the case of determining an overlap with the query ball (Fig. 1b), an upper bound equation is used: $σ_{UB} (o_{q}, o_{i}) < | r_{q} - r_{o_{i}} |$ (9) It is implied that a search process is not necessarily to cover the outside of a data region, thus saving the search cost for a query. With respect to the real implementation, a hash table is utilized as a primarily in-memory data structure, which can store and maintain data efficiently. The greatest advantage of a hash table is that it guarantees data is accessible from the cache as fast as possible.

Fig.1

An example of pruning index regions during a search employing the distance-based cache approximation.

3.2 Collaborative caching service

In many parallel and distributed systems equipped with the MapReduce engine, data availability is considered a crucial factor in the performance of MapReduce. In particular, if a map task is placed on or close to machines that store the input data chunks, then a problem is called data availability. Obviously, when the data chunk associated with a task is stored locally, a system can reduce both the processing time of map tasks and network traffic load due to fetching data remotely. In our proposed in-memory distanced-based cache model in the distributed environment, unless the system includes a very large number of computing nodes, remote data is unavoidable. Consequently, the distance cache does not take full advantages. For example, in Fig. 2a, suppose that two distinct data types can be stored in a computing node; M₁, M₂, M₃,..., M_n refer to in-memory data blocks, which encapsulated the distance computations by the cache algorithm, and A₁, A₂, A₃,..., A_n refer to data blocks handled by HDFS. When map0 is assigned to a node 1, it accesses M₁, A₀ and A₁ as inputs. While M₁ and A₀ are the memory block and data block stored in node 1, respectively, data block A₁ has been fetched remotely from node 2. In this scenario, the performance of the MapReduce job is reduced by the additional time needed to transfer the data from node 2 tonode 1.

Fig.2

a) An example of fetching data remotely in a Hadoop map task. b) An example of pre-fetching data using the collaborative cache service.

Motivated by this observation, a collaborative caching mechanism is introduced, which coordinates all of the dedicated local caches to form a single global cache. It aims to synchronize data in every cache, thus the data availability increases in further map jobs for better efficiency. Overall, the main idea is that even when the requested data does not exist in the arbitrary local cache, it can still be accessed from a global cache without the need of explicit disk access. For instance, in Fig. 2b, it is assumed that block M₁ in node 2 obtains the distance computations between a query and the data in block A₁. In the collaborative caching model, block M₁ will be collected by a global cache and then synchronized to the other node in the cluster. Consequently, node 1 will contain block M₁ in its memory, even when the data inside is not actually computed by a map task in node 1. Later on, when a map task is generated in node 1 and data block A₁ is required as its input, it can directly process data stored in memory block M₁ without waiting for the fetching data time as compared to the scenario in Fig. 2a.

From the perspective on implementation, Redis is a suitable solution for our purpose. Intuitively, Redis is designed for adapting for various application purposes. Specifically, the Redis replication model supports master-slave architecture, in which data from one master can be replicated to any number of slaves. In the case of a master node failing, one of the remaining slaves is chosen to become a new master node, which enables fault-tolerance and data accessibility for this architecture. Moreover, the Redis replication is non-blocking in both the master and slave [45], which means that the master can continue serving queries when one or more slaves are synchronizing and slaves can answer queries using the old version of the data during the synchronization. Regarding this observation, Redis is a suitable solution for our collaborative cache service. From the deployment perspective, Redis clients can be set up in the same machines with Hadoop slaves, while the Redis master is launched in a separate machine. In particular, Redis slaves serve as a local in-memory cache to retain all distance computations from MapReduce tasks. Alternatively, the Redis master ensures a parallel job has a memory locality cache to improve the accomplished time by incorporating all local cache data from slaves to serve as a global cache, as explained in the above scenarios.

4 The proposed architecture

Briefly, the proposed architecture includes two main components: the offline component and the online component.

4.1 Offline component

As described in Fig. 3, the offline component consists of three main modules: build sequence file, create index tree, and group data.

Fig.3

The offline architecture.

4.1.1 Build sequence file

This first module takes into account preparing a suitable image format which is input for all involved Map/Reduce processes in the system. It contains two main parts, as follows.

Feature Extractor: The SURF relies on the determinant of the Hessian matrix for both scale invariant and location. Given an image I, the Hessian matrix is defined as: $H (x, σ) = [\begin{matrix} L_{xx} (x, σ) & L_{xy} (x, σ) \\ L_{xy} (x, σ) & L_{yy} (x, σ) \end{matrix}]$ (10) at each image point x with scale σ. L_xx (x, σ) is the convolution of the Gaussian second order partial derivative $\frac{\partial^{2}}{\partial x^{2}} g (σ)$ with an image at point x, with a similar definition for L_xy (x, σ) and L_yy (x, σ).

Feature Descriptor: Every keypoint in the previous step is assigned to a reproducible orientation by applying the Haar wavelet responses in the x and y directions. Then, a square region aligned with the selected orientation is constructed and extracts the SURF descriptor from it. The Haar wavelets can be quickly computed by integral images. The window can be split into 4×4 sub-regions when the dominant orientation is estimated and included in the points of interests. Each sub-region contributes four values to the descriptor vector, leading to an overall vector length of 4×4×4 = 64. The resulting SURF descriptor is invariant to rotation, scaling, and brightness. According to a work in [46], we summarize the performance of three popular image feature extraction techniques in Table 2. This table shows that SURF and SIFT are two dominant techniques. Moreover, according to [47], SURF enhances the speed of feature matching as compared to SIFT.

Table 2

A comparison of some dominant feature extraction algorithms [46]

Features detector	Transformation
	Intensity	Rotation	Scale
SURF	++	++	++
SIFT	+++	+++	+
ORB	+	+	+++

(+) = efficient level

Eventually, the Hadoop SequenceFile is constructed based on image descriptors obtained from the previous steps. Basically, SequenceFile is a binary key-value pairs format in which the key is a unique identifier of an image and the value is the image descriptor vector. SequenceFile is usually used to solve the large number of small files problem in Hadoop, which is the advantage of processing a few large files rather than a large number of small files.

4.1.2 Create index tree

In this module, we proposed an indexing technique for organizing data efficiently for retrieving rather than the traditional sequence matching method. Given a data point, it is desirable to find the items in a dataset that are closest to this point. Theoretically, this task can be done by comparing a given point to the rest of the points in a database. However, it does not perform well on a large dataset. Hence, for accomplishing this task, a hierarchical data structure has been used in this module.

Typically, an index structure based on [24] is utilized, namely VP-Tree, which relies on distance computations among points in metric spaces. First, the algorithm selects a vantage point v in a dataset. The distance from the remaining points to v (d (p_i, v)) is computed. Second, the median distance among those is selected. The database points are sorted according to their distance d (p_i, v) compared with the median distance. This means that all points having a distance d (p_i, v) smaller than the median are assigned to the left subspace of v, while the remaining ones are assigned to the right subspace. This procedure is recursively applied to the left and right subspaces of v. Choosing an appropriate vantage point to separate a dataset plays an important role in the performance of the VP-Tree algorithm. However, for simplicity, we pick a random point to be the vantage point v for the index algorithm. Otherwise, VP-Tree is a binary search tree structure so the searching complexity is O (log₂N).

Apart from that, the distance function constructor and VP-Tree level assigner (Fig. 3) are two precursors used to define the VP-Tree properties. For instance, an Euclidean distance is employed to measure the distance between a vantage point and candidate points inserted into a tree, while a height factor is used to identify a terminate condition when building an index tree. For calculating the level of this index tree, the formula log₂N/S is used, where N refers to the number of images in the dataset and S is the number of machines in a cluster environment.

4.1.3 Grouping data

This module (refer to Algorithm 1) performs a process that distributes data points to their corresponding groups using the MapReduce framework. In the beginning, Hadoop SequenceFile is split into a set of InputSplits data. Then, each InputSplit will be assigned to individual mappers, which are generated by the framework for processing. Particularly, the mapper starts by loading the index tree from the HDFS as well as retrieving a cache object in the memory of a working machine. Then, for each data point, the mapper traverses through the index tree to find the closest group to this point. Afterward, the mapper generates a set of key-value pairs in the form of <GroupID, point> . Herein, pairs of values having a similar key are sent to the same Reducer through the intermediate Hadoop shuffling process. The Reducer aggregates all points with the same GroupID and writes them into HDFS.

Moreover, while using the traversed tree to find the closest group for a specific point, the algorithm also tracks all distances between an inserted point with nodes in its path. In this way, the function not only provides a GroupID but also brings back a distance from an inserted node to its parent node. As a result, these distances are fed to a local cache for estimating the lower and upper bounds of an unknown distance between two objects during querying.

4.2 Online component

Following that described in Fig. 4, the online component consists of two main modules, the pre-processing query image and the MapReduce retrieving module.

Fig.4

The online architecture.

4.2.1 Query image pre-processing

This task receives a query image from a user and process it in the same way with an offline phase in order to generate an image descriptor vector. Then, the index tree is loaded from HDFS, which is used to find the kNN nearest neighbors for this query. Additionally, the paths to its neighbor’s group are fed as inputs to further the mapper processes.

4.2.2 Mapper-reducer retrieval

This module (refer to Algorithm 2) aims at exploiting a MapReduce framework for finding the similarity between a query image and images stored in the dataset. The basic task is to calculate the similarity between two images, then use a pre-defined threshold to filter values which satisfy this user-defined condition, whereby the system might return a list of suitable images to the user.

Mapper: After the job is launched, the framework generates a set of map tasks depending on the number of data chunks that were split. Then, it delivers those map tasks to machines that have available slots for processing. Herein, a cache object is loaded from local memory before each map task actually executes its job. Then, each record <ImageId, ImageDescriptor> is read interactively to compare with a query descriptor using an Euclidean distance metric. However, instead of computing this distance directly, an in-memory cache object is a priority required for checking data availability first. As mentioned earlier, a cache object is used to store all distances computed during previous querying sessions. Thus, if a cache already contained the distance between two images, it is returned to a user without needing calculation again. Note that if a cache object does not contain a requested computation, it can still be used to approximate the lower/upper bound values based on information from the previous queries (Section 3.1). In case this estimated value is greater than a pre-defined threshold, the distance computation can be ignored. Intuitively, map tasks can save the computation cost during querying; thus, the performance of the entire processing is improved. Otherwise, if a cache object returns an estimated value that satisfies a pre-defined threshold, the distance between pairs of images is computed directly using the Euclidean metric. Consequently, if it meets the requirement, a key-value pair <Sim, ImageId> is emitted as the output for the Reducer.

Reducer: Herein, the reducer combines all keys and their corresponding values. This job is carried out by an aggregator compone nt to produce an output format like <Sim, List of ImageIDs> . Eventually, the process obtains original images based on its key ID stored in a database and dispatches the relevant dataset images to the user.

5 Experiments

5.1 Environment setup

The proposed framework has been implemented in Java with JDK version 1.8 and deployed on top of Hadoop 2.6.0. Furthermore, OpenCV library 2.4.9, which is wrapped under the Java component has been used as an auxiliary tool for developing and maintaining images involved in the pre-processing and searching phase. Additionally, a stable released package Redis 2.8 is set up under the master-slave replication model. It supported data synchronization from the global cache to each local cache for enhancing data availability. The Hadoop cluster includes one master node and seven data nodes, in which a master node is also configured as a data node. The details of the configuration used for the experiment can be found in Table 3.

Table 3
System configuration

Configuration

Platform 64bit

CPU Intel (R) Core (R) Xeon E31220 3.1GHz 1 cores

Storage 31GB for Name Node, 63GB for DataNode

Memory 2GB for Name Node, 4GB for DataNode

OS Ubuntu 14.04

Software Apache Hadoop version 2.6.0

	Configuration
Platform	64bit
CPU	Intel (R) Core (R) Xeon E31220 3.1GHz 1 cores
Storage	31GB for Name Node, 63GB for DataNode
Memory	2GB for Name Node, 4GB for DataNode
OS	Ubuntu 14.04
Software	Apache Hadoop version 2.6.0

5.2 Datasets

Since our work focuses on how to find similar images that were captured at different angles of the same object for a given query, some well-known large datasets, such as CIFAR-10, SIFT-1M, and Eighty Million Tiny Images, are not suitable for this target. Instead, we employ multiple different datasets (Table 4) in the experiment in order to ensure that the proposed framework can operate efficiently under a variety of image characteristics.

Table 4
Image dataset description

Image dataset Number of images Image resolution Average size

Holidays1-2 1491 24483264 2.7GB

UCID2 1338 512384 753MB

NUS-WIDE 269.648 240180 1.2GB

MIRFLICKR25000 25000 500500 2.9GB

Synthetic 1039 1080*720 531MB

Image dataset	Number of images	Image resolution	Average size
Holidays1-2	1491	2448*3264	2.7GB
UCID2	1338	512*384	753MB
NUS-WIDE	269.648	240*180	1.2GB
MIRFLICKR25000	25000	500*500	2.9GB
Synthetic	1039	1080*720	531MB

Holidays dataset: The Holidays dataset was introduced in [37]. It mainly contains photos that were taken in various conditions: rotation, view point, and illumination changes, blurring, etc., with the purpose of testing the robustness of the similarity matching image algorithm. Concretely, the dataset contains 1,491 images of 500 scenes and objects provided together with a ground truth. The first image of each scene is used as a query.

UCID dataset: The UCID dataset [38] collects over 1300 uncompressed images of which 262 are query images along with defined matching ground truth images. Its goal is to evaluate image retrieval techniques in CBIR applications.

NUS-WIDE dataset [39]: This dataset contains 269,648 images downloaded from the social photography site Flickr that have been manually annotated, with several tags (2-5 on average) per image. The tag dictionary is a collection of 81 different tags. This dataset is used to conduct the experiments of the proposed framework on a large dataset.

MIRFLICK-2500 Image Collection [40]: As its name suggests, this collection consists of 25,000 downloaded images from Flickr through its API, which are provided as distractors. It offers users the ability to search and share their pictures based on image tags.

Our synthetic dataset: This dataset accumulates from nearly 1,000 images, which were taken from different places around stadiums, buildings, etc. and also guarantees similarity characteristics for comparison.

Various dataset sizes are needed to evaluate the effectiveness on scalability for our proposed framework. Thus, three datasets are constructed as shown in Table 5.

Table 5

Image dataset of different scales

Data size	Number of images	Average size	Average processing time(s) on a Single Node
Small	8000	2.9GB	142.2 seconds
Medium	16000	4.29GB	265.5 seconds
Large	28000	8GB	301.1 seconds

5.3 Evaluation metrics

To evaluate the accuracy of the CBIR system, two commonly used metrics are precision and recall, which are calculated as follows: $Precision = \frac{Number of relevant images retrieved}{Total number of images retrieved}$ (11) $Recall = \frac{Number of relevant images retrieved}{Total number of relevant images}$ (12) The two metrics are often combined into their harmonic mean, namely F-measure, which can be formulated as follows: $F = \frac{(1 + β^{2}) * recall * precision}{(β^{2} * precision) + recall}$ (13) where β is the ratio which allows either precision or recall to be weighted more heavily. However, in most experiments, the precision and recall are considered equal in terms of the reliability and correctness of the CBIR system. Hence, the ratio is often set to β = 1. In contrast, the average precision is another widely used metric to estimate the accuracy of a system that returns the top k nearest neighbors given a query q. It is calculated as follows: $AvgPrecision @ K (q) = \frac{\sum_{i = 1}^{K} I (T_{i} \leq K)}{K}$ (14) where T_i is the rank of a true neighbor and I (T_i ≤ K) equals 1 if a true neighbor is within the retrieved K and equals 0 otherwise. The mean average precision is the average of all Precision @ K across all queries in Q.

$\begin{matrix} MeanAvgPrecision @ K \\ = \frac{\sum_{i = 1}^{| Q |} AvgPrecision @ K (i)}{| Q |} \end{matrix}$ (15) We use MeanAvgPrecision @ K (MAP) with different values of K in our experiment.

In addition, some other factors, such as total response time and cache hit ratio, are also considered. The total response time is determined by the average time that the system actually spends to process one query image, with the condition that the content caching has been filled by some previous queries. The cache hit ratio is the number of times the objects found in a cache divided by the number of times the systems found objects outside the cache. The higher the ratio, the more effective the cache is at improving the overall performance.

5.4 Results

5.4.1 Experiment I: Evaluate the accuracy ofthe proposed framework

In this experiment, the accuracy factors of the proposed framework in terms of recall and F-measure are validated against two methods SR-Tree and LSH [41] on the same Holidays dataset. We estimated the MAP@K metric against others methods, such as SR-Tree, PCAH [42], SH [43], and the approached used by Hong et al. [44]. This experiment investigates the influence of dataset features on the accuracy factor of the CBIR systems when a large number of distractor images are involved in the dataset. First, the featured category used for similarity matching is extracted from the 1,491 images in the Holidays dataset and 269,648 images in NUS-WIDE dataset. To improve the computation speed of the SURF method, the number of features detected in each image, known as the HessianThreshold, is limited to a value 500. Then, a set of descriptors is built based on the features extracted in the previous step. On average, the size of descriptors for each image is 64x250, which means that each image has approximately 250 descriptors for representing its visual content, modeled by a 64-dimension vector. Thus, the total number of data points that are stored in the HDFS file system in the pre-processing step is approximately 372,750 for the Holidays dataset and 67.5 million for the NUS-WIDE dataset. Afterwards, the descriptors are partitioned into blocks using the VP-Tree clustering algorithm. To ensure the balance of the index tree, the depth level is calculated by using the formula O (log ₂N/S) (Section 4.1.2). It is important to note that a balanced treestructure not only ensures the fairness of used resources among computing nodes but also avoids the bottleneck problem when a small portion of resources retain a large demand data placed on it. Lastly, threshold and sample query data are generated from each dataset. Concretely, the threshold is obtained by calculating the average distance for all possible candidate pairs of images.

Figure 6 represents the results according to the F-measure metric. The proposed framework achieves a higher accuracy and recall against the SR-Tree approach. Specifically, an F-measure attains 3.76% enhancement for our framework over SR-Tree and likely 2.8% enhancement for the recall factor. Against LSH, the proposed framework is approximately 2.53% and 2.29% higher for F-measure and the recall metric, respectively. Overall, the above results indicate that although our proposed method does not achieve remarkable performance compared to others, the outcome is still able to attain the requirement for a specific dataset in a CBIR system.

Fig.5

Several example images belong to the same group of the NUS-WIDE dataset with respect to “Building”, “Castle”, and “Computers”, but it is hard to determine the relevant images using a visual-content approach.

Fig.6

F-measure and recall for our proposed framework versus SR-Tree and LSH for the Holidays dataset.

Meanwhile, the mean average precision (MAP) of the proposed framework is shown in Fig. 7a, 7b, and 7c when it is performed on a large dataset like NUS-WIDE. As mentioned above, the NUS-WIDE dataset consists of a thousand images in which categories are manually tagged by the user. Many of the unique tags arise from spelling errors [39], thus decreasing the semantic correlation and increasing noise in the dataset. For example, some irrelevant images were tagged in the same categories by the user as shown in Fig. 5. The objects in each category are not totally related at all, but they are arranged in the same group. Therefore, the average precision archives generally below the expected. In particular, we performed experiments on varying parameters K, which indicate the number of neighbor results returned for a given query.

Fig.7

Evaluation of the MAP for the proposed framework.

According to results, our proposed framework outperforms PCAH, with 0.34% versus 0.2% at K = 1, and by 0.11% versus 0.015% at K = 100, respectively. While against SH, our approach is also better for most K values, for instances, by 0.34% versus 0.3% at K = 1, and by 0.11% versus 0.1% at K = 100. In contrast, comparing with the approach used by Hong et al., our proposed framework is not as good as for differences in K. The main reason is that the approach used by Hong et al. extracts different concept relationship features from three levels, including textual expression, visual images, and the occurrence frequency of concepts labeled on the image, whereas our feature-matching method only takes into account the visual content of images.

5.4.2 Experiment II: Evaluation of the time factor for the proposed framework for varying dataset sizes

In this second experiment, our proposed method is validated regarding the time factor. To conduct the experiment, all the datasets shown in Table 4 are accumulated into a large dataset with a total size approximately 8 GB. The cache size of each slave nodes is set to 50 MB. The descriptors built from this dataset can be up to nearly 75 million data points. For the classifier, we also utilize the VP-Tree clustering algorithm for grouping similar data into clusters.

Meanwhile, the Redis replication model and the distance-based cache method are used to store distance calculations when clustering data. In particular, the Redis master merges distance caches from local nodes. Then, it performs the synchronization. As a consequence, the slave computing nodes not only contain the cached data, which was performed on its own machine, but also retains cached data on other nodes in the cluster. Hence, the performance in computational cost is crucially improved. In other words, the more distance calculations between images filled the cache, the more significant the computation cost of the further matching are deduced. Otherwise, dataset of various sizes are also constructed to validate the impact of dataset size on the proposed framework performance. As shown in Table 5, along with the large dataset, we also split it into two different sizes. The small dataset contains approximately 8 * 10⁴ images in 2.9 GB and the medium dataset includes nearly 16 * 10⁴ images in 4.29 GB.

For the results, we show the response time of our proposed framework in a single node and Hadoop cluster environments. In a single node environment, Fig. 8a indicates that the performance increased in a logarithmic manner with an increased number of images. Therefore, it proves that our proposed framework is capable of stable operation in the variant scale of dataset sizes. In the Hadoop cluster environment, we construct a cluster of eight computing nodes to study the effectiveness of using the distributed environment to enhance the performance in term of a time factor. Concretely, our experiment result compares with Yin et al. [17] works in the same cluster environment. The performance results are presented in Fig. 8b for the response time, which assumes that sufficient training queries have been accomplished before the current query is processed. The purpose of this work is to guarantee the cache data is carried out enough to utilize the advantages of the distance cache model. According to the experimental outcomes, the proposed framework outperforms that of Yin et al. in most testing activities. In particular, our work has achieved a significant improvement by saving approximately 94% time consumption for running a Hadoop job. The benefit of the distance cache model is also considered by comparing the performance of the system with and without the cache. The result is shown in Fig. 8c, which emphasizes the usefulness of the caching model by reducing the time consumption of a job ranging from 5% in the small dataset up to 10% in the large dataset. Eventually, Fig. 8d shows the ratio time saving of our proposed framework when deployed in different environments (1 node vs. 8 nodes). Specifically, the system performance has gained approximately 76% speedup for the corresponding large dataset, approximately 75% for the medium size, and up to 74% for the small dataset.

Fig.8

Evaluation the response time of the proposed framework for varying dataset sizes.

5.4.3 Experiment III: Evaluation of the cache hit ratio of the proposed distance cache method

Figure 9 shows the exact, approximate, and total hit ratios as a function of the cache size. The approximate hit ratio is defined by the number of cache hits by using lower/upper bound distances. The first observation is that the distance cache has a remarkably large hit ratio, larger than 85% for the largest cache size tested. However, even with the lowest cache size of 10 MB, the distance cache model still obtains a higher hit ratio with 18 %. It is worth noticing that the approximate hit ratio is dramatically higher than the exact hit ratio, as shown in Fig. 9. The main reason is that when the cache is filled, the replacement policy retains only data which will give the best performance for the further approximate lower/upper bound calculation. Hence, instead of keeping the least recently used data (LRU) in the cache, it often replaces the old data with new data that ensures the policy above. Lastly, a trade-off between the time and accuracy factors arises when using the distance-based cache. In case of a miss, the distance cache will give an approximately distance instead of calculating directly. Obviously, this improves the overall speed of the system by reducing a part of the time-consuming computations. However, the accuracy metric may decrease due to the approximate calculation.

Fig.9

Hit ratio on varying cache size.

Combining the results between Figs. 9 and 8c, we can infer that the retrieval speed has inversely proportional to the hit ratio of distance-based cache service. When the total hits are increased, the performance of our proposed approach improves since the less distance computations are processed by the computing nodes. In contrast, if a cache miss occurs, the more distance computations are executed by the computing nodes. This leads to an increase in the response time of a job.

6 Conclusion

In this paper, we presented a scalable solution that was equipped with the most renowned parallel framework for improving the effectiveness of the CBIR system. Our contributions are considered as the following points. An indexing scheme based on an unsupervised learning method, namely VP-Tree, has been employed to classify the dataset, which gains the benefits of CBIR searching tasks, instead of using distance computation from the entire dataset. In addition, we proposed an in-memory distance-based cache to quickly determine the tightness lower/upper bounds between a query image and images in the database, thus improving the performance by avoiding the direct distance computations during the search process. Furthermore, a collaborative cache service is built together with the distance-based cache to increase the data availability for each computing node, which significantly accelerates the similar matching process due to the avoidance of remotely fetching data in a distributed application.

Through rigorous analysis on some well-known and public datasets, experimental results have demonstrated that the proposed method can achieve significant effectiveness compared to other studies with respect to response time and gain an acceptable accuracy ratio on varying datasets. Due to some limitations of the Hadoop MapReduce framework and the recent emerging growth of other technologies, we would like to investigate the effects of using other parallel computing frameworks for accelerating the proposed system.

Footnotes

Acknowledgment

This research was supported by the MSIT(Ministry of Science and ICT), Korea, under the ICT Consilience Creative program(IITP-2019-2015-0-00742) supervised by the IITP(Institute for Information & communications Technology Planning & Evaluation) and Institute for Information & communications Technology Promotion(IITP) grant funded by the Korea government(MSIT) (No. 2017-0-00294, Service mobility support distributed cloud technology

References

Veltkamp ,

Burkhardt and

H.P.

Kriegel , State-of-the-Art in Content-Based Image and Video Retrieval. Springer Science & Business Media, Dordrecht, 2013.

Datta ,

Joshi ,

Li and

Ze Wang , Image retrieval: Ideas, influences, and trends of the new age, ACM Comput Surv 40 (2008).

A.W.M.

Smeulders ,

Worring ,

Santini ,

Gupta and

Jain , Content-based image retrieval at the end of the early years, in IEEE Transactions on Pattern Analysis and Machine Intelligence 22(12) (2000), 1349–1380.

Rahmani ,

S.A.

Goldman ,

Zhang ,

S.R.

Cholleti and

J.E.

Fritts , Localized content-based image retrieval, IEEE Trans Pattern Anal Mach Intell 30(11) (2008), 1902–1912.

Wu ,

S.C.H.

Hoi ,

Jin ,

Zhu and

Yu , Distance metric learning from uncertain side information with application to automated photo tagging, in Proc ACM Int Conf Multimedia, Scottsdale, AZ, USA, 2009, pp. 135–144.

Bay ,

Tuytelaars and

Van Gool , SURF: Speeded Up Robust Features, In:

Leonardis ,

Bischof and

Pinz , (eds.) Computer Vision – ECCV 2006 ECCV 2006. Lecture Notes in Computer Science, vol 3951. Springer, Berlin, Heidelberg, 2006.

Wu ,

S.C.H.

Hoi and

Yu , Semantics-preserving bag-ofwords models and applications, IEEE Trans Image Process 19(7) (2010), 1908–1920.

Huang ,

S.R.

Kumar ,

Mitra ,

W.-J.

Zhu and

Zabih , Image indexing using color correlograms, in Proc IEEE Conf Comput Vis Pattern Recognit, 1997, pp. 762–768.

Belongie ,

Malik and

Puzicha , Shape matching and object recognition using shape contexts, IEEE Trans Pattern Anal Mach Intell 24(4) (2002), 509–522.

10.

D.G.

Lowe , Distinctive image features from scale-invariant keypoints, International Journal on Computer Vision 60(2) (2004).

11.

Jegou ,

Douze and

Schmid , Improving bag-offeatures for large scale image search, Int J Comput Vis 87(3) (2010), 316–336.

12.

Zheng ,

Wang ,

Liu and

Tian , Lp-norm IDF for large scale image search, in Proc IEEE Conf Comput Vis Pattern Recognit, 2013, pp. 1626–1633.

13.

He , et al., Mobile product search with bag of hash bits and boundary reranking, in Proc IEEE Conf Comput Vis Pattern Recognit, 2012, pp. 3005–3012.

14.

Feng ,

Yang and

Liu , An efficient indexing method for content-based image retrieval, Neurocomputing 106 (2013), 103–114. ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2012.10.021.

15.

T.D.T.

Nguyen and

E.-N.

Huh , An efficient similar image search framework for large-scale data on cloud, In Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication (IMCOM ’17) ACM, New York, NY, USA, 2017, 8. Article 65.

16.

J.S.

Hare ,

Samangooei and

P.H.

Lewis , Practical scalable image analysis and indexing using Hadoop, Multimedia Tools Appl 71(3) (2014), 1215–1248.

17.

D.S.

Yin and

D.B.

Liu , Content-based image retrial based on hadoop, Mathematical Problems in Engineering 2013 (2013), 7. Article ID 684615.

18.

Moise ,

Shestakov ,

Gudmundsson and

Amsaleg , Indexing and searching 100M images with map-reduce, In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval (ICMR ’13), ACM, New York, NY, USA, 2013, pp. 17–24.

19.

Sun ,

Wang ,

Xu and

Zhang , Indexing billions of images for sketch-based retrieval, In Proceedings of the 21st ACM International Conference on Multimedia (MM ’13), ACM, New York, NY, USA, 2013, pp. 233–242.

20.

Jegou ,

Douze and

Schmid , Product Quantization for Nearest Neighbor Search, in IEEE Transactions on Pattern Analysis and Machine Intelligence 33(1) (2011), 117–128.

21.

Herwig

Lejsek ,

Björn Þór

Jónsson and

Amsaleg , NV-Tree: Nearest neighbors at the billion scale, In Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR ’11), ACM, New York, NY, USA, Article 54, 2011, p. 8.

22.

Amsaleg , A database perspective on large scale highdimensional indexing, Habilitation à diriger des recherches, Université de Rennes 1, 2014.

23.

Babenko and

Lempitsky , The inverted multi-index, in IEEE Transactions on Pattern Analysis and Machine Intelligence 37(6) (2015), 1247–1260.

24.

Kumar ,

Zhang and

Nayar , What Is a Good Nearest Neighbors Algorithm for Finding Similar Patches in Images? In Proceedings of the 10th European Conference on Computer Vision: Part II (ECCV ’08),

David

Forsyth ,

Philip

Torr , and

Andrew

Zisserman (Eds.), Springer-Verlag, Berlin, Heidelberg, 2008, pp. 364–378.

25.

Liu ,

Rosenberg and

H.A.

Rowley , Clustering Billions of Images with Large Scale Nearest Neighbor Search, 2007 IEEE Workshop on Applications of Computer Vision (WACV ’07), Austin, TX, 2007, pp. 28–28.

26.

Jégou ,

Perronnin ,

Douze ,

Sánchez ,

Pérez and

Schmid , Aggregating local image descriptors into compact codes, in IEEE Transactions on Pattern Analysis and Machine Intelligence 34(9) (2012), 1704–1716.

27.

Zhang ,

Liu ,

Luo and

Lang , DIRS: Distributed image retrieval system based on MapReduce, 5th International Conference on Pervasive Computing and Applications, Maribor, 2010, pp. 93–98.

28.

Gu and

Gao , A Content-Based Image Retrieval System Based on Hadoop and Lucene, 2012 Second International Conference on Cloud and Green Computing, Xiangtan, 2012, pp. 684–687.

29.

R.K.

Grace ,

Manimegalai and

S.S.

Kumar , Medical Image Retrieval System in Grid Using Hadoop Framework, 2014 International Conference on Computational Science and Computational Intelligence, Las Vegas, NV, 2014, pp. 144–148.

30.

Dravyakar ,

Swapnil and

S.B.

Mane , Private Content Based Image Retrieval Using Hadoop, 2014.

31.

Hare ,

Samangooei ,

P.D.

Dupplaw and

H.P.

Lewis , ImageTerrier: An extensible platform for scalable high-performance image retrieval, Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR, 2012. 10.1145/2324796.2324844.

32.

Jai-Andaloussi ,

Elabdouli ,

Chaffai ,

Madrane and

Sekkaki , Medical content based image retrieval by using the Hadoop framework, ICT 2013, Casablanca, 2013, pp. 1–5.

33.

Zhou ,

Li ,

Sun and

Tian , Collaborative index embedding for image retrieval, in IEEE Transactions on Pattern Analysis and Machine Intelligence 40(5) (2018), 1154–1166.

34.

Mohamed ,

E.A.

Khalid ,

Mohammed and

Brahim , Content-Based Image Retrieval Using Convolutional Neural Networks, In:

Mizera-Pietraszko ,

Pichappan and

Mohamed , (eds.), Lecture Notes in Real-Time Intelligent Systems RTIS 2017 Advances in Intelligent Systems and Computing, vol 756. Springer, Cham.

35.

R.R.

Saritha ,

Paul and

P.G.

Kumar , Cluster Comput, (2018). https://doi.org/10.1007/s10586-018-1731-0

36.

Skopal ,

Lokoc and

Bustos , D-Cache: Universal Distance Cache for Metric Access Methods, in IEEE Transactions on Knowledge and Data Engineering 24(5) (2012), 868–881.

37.

Jegou ,

Douze and

Schmid , Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search. In:

Forsyth ,

Torr and

Zisserman , (eds.), Computer Vision –ECCV 2008 ECCV 2008 Lecture Notes in Computer Science, vol 5302, Springer, Berlin, Heidelberg.

38.

Schaefer and

Stich , UCID: An uncompressed color image database, Storage and Retrieval Methods and Applications for Multimedia 5307 (2004), 472–480.

39.

T.-S.

Chua ,

Tang ,

Hong ,

Li ,

Luo and

Zheng , NUS-WIDE: A real-world web image database from National University of Singapore, In Proceedings of the ACM International Conference on Image and Video Retrieval (CIVR ’09) ACM, New York, NY, USA, 2009, 9. Article 48.

40.

M.J.

Huiskes and

M.S.

Lew , The MIR flickr retrieval evaluation, In Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval (MIR ’08), ACM, New York, NY, USA, 2008, pp. 39–43.

41.

Mukherjee ,

Mukhopadhyay and

Mitra , A survey on image retrieval performance of different bag of visual words indexing techniques, Proceedings of the 2014 IEEE Students’ Technology Symposium, Kharagpur, 2014, pp. 99–104.

42.

Gong ,

Lazebnik ,

Gordo and

Perronnin , Iterative quantization: A procrustean approach to learning binary codes for large-scale image retrieval, in IEEE Transactions on Pattern Analysis and Machine Intelligence 35(12) (2013), 2916–2929.

43.

Weiss ,

Torralba and

Fergus , Spectral hashing, NIPS, 2008.

44.

Hong ,

Yang ,

Wang and

Hua , Learning visual semantic relationships for efficient visual retrieval, in IEEE Transactions on Big Data 1(4) (2015), 152–161.

45.

Redis Replication https://redis.io/topics/replication/

46.

Karami ,

Prasad and

M.S.

Shehata , Image Matching Using SIFT, SURF, BRIEF and ORB: Performance Comparison for Distorted Images, CoRR, abs/1710.02726, (2017).

47.

OpenCV Tutorials. (Accessed on 2019/20/02) https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_feature2d/pysurf_intro/py_surf_intro.html#surf

Joint index and cache technique for improving the effectiveness of a similar image search in big data framework

Abstract

Keywords

1 Introduction

2 Related work

3.1 Cache methodology

4.1 Offline component

4.1.3 Grouping data

4.2 Online component

4.2.2 Mapper-reducer retrieval

5 Experiments

5.1 Environment setup

Table 3 System configuration Configuration Platform 64bit CPU Intel (R) Core (R) Xeon E31220 3.1GHz 1 cores Storage 31GB for Name Node, 63GB for DataNode Memory 2GB for Name Node, 4GB for DataNode OS Ubuntu 14.04 Software Apache Hadoop version 2.6.0

Table 4 Image dataset description Image dataset Number of images Image resolution Average size Holidays1-2 1491 2448*3264 2.7GB UCID2 1338 512*384 753MB NUS-WIDE 269.648 240*180 1.2GB MIRFLICKR25000 25000 500*500 2.9GB Synthetic 1039 1080*720 531MB

5.4.1 Experiment I: Evaluate the accuracy ofthe proposed framework

Footnotes

Acknowledgment

References

Table 3
System configuration

Configuration

Platform 64bit

CPU Intel (R) Core (R) Xeon E31220 3.1GHz 1 cores

Storage 31GB for Name Node, 63GB for DataNode

Memory 2GB for Name Node, 4GB for DataNode

OS Ubuntu 14.04

Software Apache Hadoop version 2.6.0

Table 4
Image dataset description

Image dataset Number of images Image resolution Average size

Holidays1-2 1491 24483264 2.7GB

UCID2 1338 512384 753MB

NUS-WIDE 269.648 240180 1.2GB

MIRFLICKR25000 25000 500500 2.9GB

Synthetic 1039 1080*720 531MB