Abstract
Similarity search for content-based retrieval - a sustained problem; many applications endures. Most of the similarity measures intend focusing the least possible set of elements to find an answer. In the literature, most work is based on splitting the target data set into subsets using balls. However, in the era of big data, where efficient indexing is of vital importance, the subspace volumes grow exponentially, which could degenerate the index. This problem arises due to inherent insufficiency of space partitioning interlaced with the overlap factor among the regions. This affects the search algorithms thereby rendering these methods ineffective as it gets hard to store, manage and analyze the aforementioned quantities. A good topology should avoid biased allocation of objects for separable sets and should not influence the structure of the index. We put-forward a novel technique for indexing; IMB-tree, which limits the volume space, excludes the empty sets; the separable partitions, does not contain objects and creates eXtended regions that will be inserted into a new index named eXtended index, implemented in a P2P environment. These can reunite all objects in one of the subsets-partitions; either in a separable set or in the exclusion set, keeping the others empty. We also discussed the efficiency of construction and search algorithms, as well as the quality of the index. The experimental results show interesting performances.
Introduction
Efficient similarity search is an increasingly important area in computer science. The goal in similarity search is to find objects similar to a specified query. Progresses to improve the Indexing techniques are continued in order to cater the searches on large collections of data, where the processes gets more difficult. At the same time, it is hard to compare these techniques [9, 25] as various factors; data types, computing machine quality etc., can influence their effectiveness. We intend to reduce the distance calculations for answering queries by finding a technical and effective indexing similarity. Concurrently, the objects needed to be indexed; are generally more complex than mere vectors like graphs and sets or cannot be concatenated simply and meaningfully to form larger vector like keywords and color histograms [3–6, 24]. Therefore, it results in the partly shift of the focus of indexing from multidimensional spaces to metric spaces. Formally, a metric space is defined for elements’ family; comparable through a given distance. The dissimilarity between two elements of a given database can be measured through distance function in such a manner that smaller distances may correspond to more similar elements. Let
Let
Then
We have previously [28] focused on indexing through tree structure. It relies on the successive division of space with spheres. Furthermore, as with drastic growth in data, these balls’ outer regions also grow swiftly as compared to the rests. Likewise, the subspaces’ volumes grow briskly that could result in index degeneration. The inherent insufficiencies of space partitioning interlacing with the intra-regions overlap factor cause this problem. The research is prone to degenerate into a complete analysis of the data set. In compact clusters, an efficient structure relies on better grouping of similar objects. The main aim is to reduce the balls’ outer surface, to limit the intra-subspaces concavities and inserts them into eXtended index; a new index in the form of smaller separable partitions implemented in P2P environment. The biased object’s allocation for separable sets should be avoided by a good topology and not be influenced in the index’s structure. In this paper, we have presented our novel indexing structure that mixes two principles into a single, pivot-filtering and partitioning system. We intend to limit the sub-spaces’ concavities to reduce the accessed data. We have proposed a control over these regions to provide an improved partitioning. This can be achieved through a multiple index’s creation that groups the separable partitions; eXtended index, in a P2P environment. The parallelism may and must be part of a solution however cannot be the full solution. To best of our knowledge, none of indexing techniques could achieve a logarithmic search time and the logarithm response time can be achieved with parallel implementation.Sequentially, a general sort is O (n . Log (n)); where n is the number of objects. Parallelly, a general sort is implemented in time O (log (n)) and O (n) on the surface, namely the number of processors. It is quite impossible to use O (n) processors along with large databases having multimillions of objects (theoretical solution clearly marked on the surface).
The paper is organized as follows: Section 2 presents briefly the taxonomy of indexing techniques. Section 3 discusses the pros and cons of the IMB-tree. Further in Sections 3.1 and 3.2 we put-forward the newly proposed technique, presents its main characteristics and the corresponding algorithms. Then, Section 4 analyzes the experimental evaluations. Finally, Section 5 concludes the paper and discuses research extendable future directions.
Background
The first class based on the two partitioning techniques, does not impose space-partitioning. We consider basically the M-tree family that induces a balanced index allowing incremental updates. But unluckily, it undergoes the overlapping problem increasing the distance calculations to answer a query. For objects’ reorganization in compact clusters, Almeida [16] has put-forwarded a novel structure that worked only for Divisive-Agglomerative Hierarchical Clustering or DAHC-tree; an approximate search. It is built by grouping and dividing the data set into compact clusters. In [21], the authors have presented an extension of Slim-Tree named Slim*-tree, exploiting the best properties from ball and the BST as a hash function to search inside a bucket file. The problem is still unresolved and objects’ reinsertion proves to be costly at large. In Tree [25], the authors have proposed a novel dynamic indexing and retrieval technique based on clustering. The technique has updated structure with constant data insertion. The Gaussian Mixture Models are used to fit the nodes in CD-Tree. To the best of our knowledge, this problem has not yet been solved completely due to slow updating of construction phase and its high costs at large.
The second class relies on the partitioning of the space. A reasonable amount of longitudinal studies [7, 19] exist for this. Two sub-approaches are generally discussed. The first one utilizes ball partitioning, like VP-tree [11, 26]. In this sub-approach, the pivots’ selection plays a vital role in index structure. That was exactly the reason behind Yianilos proposal of VP-tree [26] that was about finding the median element inside a set of objects. The VP-tree is generalized to mVP-tree, the nodes are divided into quantiles. This principle of partitioning removes the overlapping problem among shapes. Besides all the benefits, there arises a problem in the cases; when the demand point gets closer to border between two regions as the need to visit all the neighboring regions results in index less efficiency. In the near past, some techniques have been proposed for the second class. In [12] the authors have presented a combination of two trees; kd-tree and ball-tree to improve searching time; to take advantage of both information. But, it proves to be less-efficient when dimension exceeds 10. In NOBH-trees [22], the data in this technique is fully indexed only in the main memory. It combines Voronoi shapes with balls to partition the space. Other techniques [2] were put-forwarded in recent years that focused on indexing large-scale data, that respond to the approximate queries, and not the exact ones. Other, who tries to compress the index [20]. Besides all the benefits, there exists a problem in this technique in complexity of enclosing forms that results in an increase in insertion and search operation costs. A technique based on users’ queries called I-Clusters [15] is extended from the List of Clusters. When a user executes a query, the result is a subset containing close objects. The method creates new clusters and inserts them into the index structure, based on all the results. This method has two main advantages. The first one is the cheapness of the construction and of the objects’ insertion in dynamic datasets. The second advantage is the improvement of searching cost after the queries execution. Although, the effectiveness of this method is demonstrated via data sets with low dimensions. Given that the search by the nearest neighbor of large databases in large spaces is inherently difficult because of the curse of dimensionality. Approximate nearest neighbor search is needed to practically solve the problem [1, 23] have designed an efficient indexing scheme, HD-Index, based on a new RDB-tree index structure.
The proposed IMB-tree
In this paper we propose a new access method, called IMB-tree (the Intersection Metric Bucketed trees) implemented on P2P system. IMB-tree is a new indexing structure that combines two principles into a single system, pivot-filtering and partitioning, in order to minimize the amount of accessed data. The Fig. 1 illustrates the development of the tree. At each stage of the recursive process of constructing the tree, two pivots are chosen from a subset of elements cmax, they are chosen as the two objects furthest apart from each other. Regions I, II and III collapse into level 2, the node eXt collapse into the same level. The elements are distributed according to the partitions that they belong to.

IMB-tree on peer to peer system.
In compact clusters, an efficient structure relies on better grouping of similar objects. The main aim is to reduce the balls’ outer surface, restrict the sub-spaces concavities rather insert as eXtended index; smaller separable partitions, into new index implemented in a P2P environment. The biased object’s allocation for separable sets must be avoided in any good topology and it should not affect the index structure also. This can result in reuniting all the objects into one of the partitions. It keeps the others empty as either separable set or exclusion set. This structure has main advantage that it is simply identifiable for a pattern that if ay element would belong to it and ensures no nodes overlapping in same tree levels. Furthermore, the drastically increasing large spaces’ volumes advocates in favor of these techniques. Otherwise, it would minimize or restrict volumes or control their occupancy at-least.
At first, a leaf node L may contain an indexed objects’ subsets:
(p1, p2) are two distinct objects, i.e., with d (p1, p2) >0, called pivots; r = d (p1, p2) helps to defines two balls, B1 (p1, r) and B2 (p2, r), centered on p1 and p2 respectively and having a common radius value, large enough for the two balls to have a non-empty intersection; (r1, r2) are the distances to the farthest object in the sub-tree rooted at that node N with respect to p1 and p2 respectively, i.e., r
i
= max {d (p
i
, o) , ∀ o ∈ N} for i = 1, 2 where the set notation o ∈ N is abusively used for the union of the leaf extensions that are rooted at N; (N1, N2, N3) are three sub-trees, such that: N1 = {o ∈ N : d (p1, o) < r ∧ d (p2, o) < r} for the intersection; N2 = {o ∈ N : d (p1, o) < r ∧ d (p2, o) ≥ r} for the partial ball centred on p1; N3 = {o ∈ N : d (p1, o) ≤ r ∧ d (p2, o) < r} for the partial ball centred on p2; N
eXt
= {o ∈ N : d (p1, o) ≥ r ∧ d (p2, o) ≥ r}; eXt
i
set of intervals where,
Accordingly, we can limit extended regions’ size. The brighter region and the union of the exclusion sets represent The exclusion set.
The IMB-tree is an hierarchical directory. It is based on the partitioning of the space using spheres. Creating index extensions in a P2P system, limits the volume of the outer regions of the spheres, by inserting them into another peer. Building an IMB-tree is realized incrementally. The insertion is done in a top-down way. Algorithm 1 describes formally the process. When the cardinal limit is reached, an inner node and three new leaves N1, N2, N3, and a pointer to another peer containing all objects in the outer regions of the spheres according to the conditions replace a leaf. In order to split the objects set, two distinct pivots have to be chosen. The selection of these pivots plays an important role. The goal is to balance, as much as possible, the tree. In this work, we decided to choose two objects as far as possible from each other, several heuristics (such as the choice of the pivots that are far) have been proposed to improve the pruning power.
Recalling that the role of the extended index is to create compact regions in the outer parts of the spheres, in order to eliminate the maximum of the objects during the search algorithm. The elimination is feasible if we have some information about the distances from a pivot to other objects and from the query object to the pivot. We keep in each case the lower and upper limits of on the distance.
We have considered putting in place strategies to try to balance the tree, such as choosing two elements furthest apart from each other. However, we are careful not to use a function of more than linear complexity, otherwise the algorithm will exceed a complexity in O (n . log n) which is the one it has in this version. Besides, the tree tends to be rather balanced, hence inserting a new object is a logarithmic operation, in amortized cost. This algorithm 1 is implemented in the order of balancing network peer loads.
Insertion in IMB-tree
Insertion in IMB-tree
Search kNN in IMB-tree
The algorithm 2, which formally describes the search kNN in a IMB-tree, is also quite complex. The searches are made from balls while the space has been partitioned. The search is done by calculating the distance between the query point and the two pivots, while descending into the tree. Not counting the case of the empty tree, we can meet four cases when passing through a tree node.
Note that this algorithm is the same on all the stations of the network. It is on this logical network that the query q is broadcast. Once a peer machine receives a query q with the radius of the query r q initialized to infinity, the query peer establish direct connections with the desired peers, and a local search algorithm starts. The query radius r q plays the key role in the search optimization (the minimum possible is a maximum pruning). Every time the indexes are browsed, the value of the query radius r q decreases, which actually corresponds to the distance to the k e object in the ordered list A. The leaf nodes contain a subset of the indexed data with a maximum cardinal equal to cmax. At the leaf level the procedure is quite simple. In order to find the k closest neighbors of a leaf, just sort them according to their increasing distances to the q request object. Then we return at most the first k objects already sorted. Note, again, that this step does not really require sorting, but only a sequence of mergers. The complexity is "constant", that is to say in: O (2 . k) rather than: O (4 . k . log 4k). Relying on the decreases expressed by the computations of r q is quite insufficient. It is necessary to estimate an upper-bound to the forthcoming k th distance. In this way, the first call on the root node could be initialized with a much more satisfactory value than +∞, the best estimation leading to a perfect search.
In the worst case, a global calculation occurs on all nodes; this is a complete parallel search. Let us consider the case
Experiments and Comparison
In this section we provide experimental results on the performance of IMB-tree on real data sets, in order to test and compare its effectiveness. We used three datasets. We started with the cities of France, which have a low dimensional. After, we turned to the complex objects, a good example is the multimedia descriptors, we used a subset of the the MPEG-7 Dominant Color Descriptor (DCD) of 64 dimension from the COPhIR dataset. We also used KDDCup (2008), which is descriptor of colour image histograms of 32 dimensions. We run the experiments on a 32 workstations computers with the configuration Intel(R) Xeon(R) CPUs, and 16 GB of main memory. All index files were stored on a network partition. We arranged the size of each tree node to be equal to the size of a disk page. We compared ourselves to the NOBH-tree [22], Onion-tree [8], Slim-tree [21], IM-tree [28] and GH*-tree [27]. In order to obtain a fair comparison, and in the absence of parallel versions for these techniques, we have implemented under our framework. This allows a fair comparison, all the versions being implemented, somewhat optimized, and instrumented with a largely common code. Hence, their respective merits depend mainly on the way they partition the space. We recall that the aim of this work is to adapt and avoid the degeneration of the index with large amount of data. The quality of the structure is very important. Unlike the Slim and Onion-trees, our proposal is a bucketed-tree. Therefore, the semi-balancing algorithm is replaced by our leaf splitting strategy.
In general, the structure ensures a balanced index, so the problem of degeneration does not arise in this case. We notice that we have an IMB structure with a minimum number of levels, by creating an eXtended regions, which will be inserted into a new index named eXtended index. The cost of the construction algorithm lies on the choice of pivots; comparison with the pivots for each internal node and the insertion of the objects in the corresponding region. Figure 2-a, show the resource requirements for building the indexes. We see that our proposal is the most effective, with a difference between the three collections which explains the curse of dimensionality. As a comparison, our proposal is 30% more efficient than NOBH-tree in the three collections, and more than 40% compared to the IM-tree. The difference from Onion and slim-trees is easily explainable. The reason is the absence of the respective semi-balancing algorithm and keep-small that require a number of additional operations. In slim-tree, slim-down algorithm also has a significant cost which was noted by its own authors. Our approach, IMB-tree is simpler in the insertion of new objects, (which was one of the initial objectives with respect to the complexity O (n log(n)) reasonable) and provides a competitive incremental index.

Performance statistics.
the average number of calculated distances the average number of comparisons
and on the other hand, we calculated the complexity of the search algorithms.
Figure 2-b shows a variety of parallel algorithm performance to respond to kNN queries. We compared IMB-tree with NOBH*-tree, IM*-tree, Onion*-tree and Slim*-tree, using different values of the parameter k. The results are similar, we observed that if we increase the value of k, the performances decreases but with a gap between less than 1% and to less than 2% when k = 50. So the value of k has no major influence on the performance of the search algorithm, for the three collections. We noted that Slim* and Onion* are the most expensive. And there is an important advantage between NOHB*-tree and the IMB-tree, which reflects the interests of the Extended-index, that has allowed us to keep the aberrant objects, and will permit the quick access to these objects, in order to consolidates the concept: proper construction of the index improves the performance of the search algorithms.
After, we evaluated our system performance by resolving a query. Figure 3-a presents a variety of parallel algorithm performance that responds to kNN queries. For this task we varied the number of machines (2, 4, 6, 8, 12, 15 and 32). We calculated the average time to respond to a query (2NN, 10NN, 50NN and 100NN). We observed a logical breakdown of CPU time beside the number of machine. We also noticed a logical increase compared to the complexity of the query while increasing the parameter k as well the intrinsic dimension, we found that this new approach is able to index up to twenty million objects distributed over 32 clusters. We recall that the choice of destination clusters between machines during the construction of the index was done in a way that the distribution of objects was almost balanced on all machines.

Performance statistics.
Next, the performance gain measured in disk the number of page accesses and the number of messages received by the peers. Clearly, Fig. 3-b shows that the IMB-tree has much lower research cost. And it indicates that IMB-tree is scalable and it is not sensitive to the data distribution.
In large collections -twenty million objects Dominant colors-, the gap increases between our structure and NOHB-tree, our version is accompanied by an algorithm of the estimation of the radius of the query r q , that plays a role in reducing the number of visited nodes and keeps the logarithmic complexity.
Finally, The cmax parameter was chosen either as a the logarithm in base 2, or the square root of the size of the collection. We note that performances tend to degrade as the size of the leaves decreases, but this degradation is less important for our proposal than for Slim*-tree, NOBH*-tree, IM*-tree and M*-tree. Also, a square root size of the leaves seems to be the better compromise for the candidates.
In summary, our index is better index structure than the IM-tree and NOBH because it saves many block accesses and distance computations when executing queries. An interesting property is the very fast response time to execute the exact match queries.
The P2P paradigm has become very popular for storing and sharing data. IMB-tree is a new access method able to index dynamic data sets. At first, we have clarified some methods of indexing in both the spaces; the multidimensional and the metric. The access methods of Metric space generally divides the data-space into overlapping sub-spaces. In this regard, the data structures should consider both node overlapping and height-balancing. Due to exponential growth of current data, the outer regions get very large as compared to the rest of the regions that could degenerate the index. This subsequently reflects on the search algorithm. Furthermore, we proposed a novel approach to perform searching for approximate similarity in metric spaces that is based on the space-partitioning with the spheres. A IMB-tree reduces the spheres’ external volume by making the eXtended regions. The eXtended regions permit empty spaces’ exclusions that do not have objects.The eXtended regions restrict sub-spaces’ concavities and remove some objects in this regard without needing any computation of their relative distances to a query object and to boost the similarity execution in the nearest neighbor queries. IMB-tree is a peer-to-peer system that advocates the metric spaces based similarity search, to index data in large amount. Concludingly, we have elaborated the search algorithm’s efficiency and our index’s quality by comparing it with the latest techniques on a real data in this regard. The extensive results of these experiments have acknowledged that the IMB-tree has outperformed for kNN queries in terms of best performance. Our work is extendable for introduction of newer and better alternatives considering and focusing the key ideas. In near future we aim to introduce a useful approach to deal the complex data in large qualities. This will help to combine multiple indexes by using various indexes rather than new index creation. For instance, we can say, one could have independent index on one hand, extract the descriptors from media content such as color histograms, and, on the other hand, descriptions semantic minimum of simple keywords.
