Abstract
M-Tree, Slim-Tree, DF-Tree, and Omni-Tree are some of the popular dynamic structures which can grow incrementally by splitting overflowed nodes, and adding new levels to the tree very much like the B-tree variants. Unfortunately, they have been shown to perform very poorly compared to flat structures such as AESA, LAESA, Spaghettis, and Kvp that use a fixed set of global pivots. HKvp index structure is an extension of Kvp allowing the elimination of pivots as well as the database objects. The number of pivots can be easily increased to provide more selectivity and query performance. However, there is an optimum number of pivots for a given query radius, and using too many pivots increases the costs of queries and index initialization. In this paper, a new set of pivot elimination mechanisms is proposed to determine the right number of pivots for different query radii. The suggested pivot elimination schemes perform significant cost reduction in terms of number of distance computations, and they estimate the drop rate value for HKvp on query time.
Introduction
Similarity search is an important process to manage unstructured data. Handling objects in audio or image databases [1], fingerprint detection [2], retrieving words that are similar to a given inquiry [3], search engines [4, 5], pattern recognition algorithms [6], retrieval of DNA sequences in biology applications [7] are example usage areas of similarity search methods to deal with high computational costs of data processing. In large dimensions of data, it becomes an important operation to calculate distance between two objects. Most of the time, a user expects more than one objects to match with different intentions. Problems like duplicate data and non-homogenous structure are to be handled. Similarity search has wide range of applications for spatial data. It is especially used in recommender systems [8], online shopping [9], and web search [10]. Suppose that the traffic load around the streets of a region of the city is analyzed and showed at the same time in a number of screens. Since the number of screens are finite and less than a number, a similarity search should be employed to display the loads of most critical streets.
The Approximating and Eliminating Search Algorithm (AESA) [11, 12] holds an n × n matrix storing the distances of n database objects to each other. Every object is a pivot in AESA, and n × (n - 1)/2 distances are computed at construction time. At query time, a subset of the objects are processed to eliminate the rest of the objects. AESA shows near optimal performance at the expense of high space and construction complexity. Because of the quadratic space and construction costs, it is not very feasible for large datasets. LAESA [13] solves the high space requirement problem by only using m of the n database objects as pivots. Thus the distance matrix is of size n × m rather than n × (n - 1)/2. LAESA attempts to appoint pivots as far from each other as possible. The rest of the LAESA algorithm is similar to AESA. The algorithm continues until a sufficient number of non-eliminated objects are obtained. Spaghettis [14] improves LAESA by using sorted arrays to store distances of objects to each pivot respectively. A pivot is more effective for objects that are close to or distant from it. Kvp [15] keeps only the distances to most promising pivots. Therefore, it stores a subset of pivot distances unlike other pivot based methods. It uses less space than other vantage point methods while offering similar query results with a reduction in CPU overhead and space.
Similarity search in metric spaces has focused on main memory structures and disk structures. Among the main memory structures vp-tree [16] requires a small memory and can be constructed efficiently. However, it is poor in query performance. mvp-tree [17] and GNAT [18] improve the performance at an expense of space and construction time. Among the disk structures M-Tree [19] and Slim-Tree [20] support dynamically selection of pivots while maintaining insertion and deletion operations. They keep less precise data than GNAT [18] resulting in a poor performance. Tree structures allow as many pivots as the height of the tree causing them to work inefficiently in difficult structures. They are not so satisfactory when greater effort for pruning is needed. Omni [21] selects a set of objects as foci, and it is adaptable to other methods. DF-Tree [22] is also a tree structure exploiting Omni-HF Algorithm [21] which allows insertion and deletion operations while maintaining a dynamic structure. Moreover, it defines a new measure of prunability to decrease the number of distance computations. Currently, it seems to give the best performance among the centralized disk based index structures.
Global pivot based methods perform very well in terms of number of distance computations. They overperform tree-based structures because the number of total pivots are not limited by unrelated parameters like branching factor. However, in cases where there are a lot of pivots, these structures may spend too much time in computing distances to all pivots. The user of such structures have the option of calibrating a parameter which is called drop rate. The problem is that each query radius has a different optimal value of this parameter.
In this paper, a detailed comparison of the pivot elimination schemes to automatically adjust the drop rate parameter is proposed. The organization of the rest of the paper is as follows: Section 2 gives details on the basic principles behind the HKvp. In Section 3, a detailed overview of the pivot elimination strategies and their mathematical foundations are presented. The experimental results and detailed analysis of the algorithm parameters are discussed in Section 4. Finally, Section 5 provides some concluding remarks and further discussion regarding this research.
High performance Kvp and the importance of pivot elimination
The algorithms used for similarity search stand on an assumption that there is a large number of objects in the database, and the number of pivots is limited when compared to the number of database objects. Global pivot based approaches are employed on these databases while exploiting a very costly distancefunction.
High performance Kvp (HKvp) [23] is an extension of the Kvp structure for the unusual cases of applications using smaller database sizes, or running on larger database sizes with higher number of pivots. In general, it can be stated that the number of distance computations may be dominated by the number of pivots when the ratio of the pivot number to database size is higher.
The cost of a query can be represented by Equation (1) for a generic pivot based structure. Pivot based methods first calculate the distances between the pivots and the query object. The probability of an object not to be discarded by one of the k pivots is represented by fc. The distance function should be calculated if the query object is not eliminated by any of the pivots. Therefore, the probability of an object not to be eliminated by any of the pivots is fc k . In other words, n × fc k distance computations are included in the cost function.
HKvp needs to calibrate the drop rate value to reach an optimal number of pivots k for a given query object and radius. The bottleneck for the traditional pivot based approaches including Kvp is that they fail to find the optimum query result, whenever the solution can be reached in fewer distance calculations than pivot number. As a result, the cost calculation n × fc k is considered to be negligible when compared to the value k in the first part of the Equation (1).
Easier queries involving lower dimensions or lower query radii require fewer pivots than more difficult queries. A pivot based method may perform worse even though more effort is spent in construction because it has more pivots to process than generally needed. HKvp performs significantly better in these kinds of queries. The major drawback of current HKvp is that it works with a drop rate parameter from user.
HKvp has the flexibility of using more pivots unlike Kvp since it can eliminate pivots like objects using a parameter called drop rate [23]. However, current algorithm takes the drop rate as a parameter which causes the algorithm to perform less effectively. The drop rate optimization should be performed to select the right number of pivots at query time.
Rather than requiring an extra parameter, HKvp may be improved by using cost formula of Equation (1). Assume that there are p pivots used in the first step, while there are k pivots overall. Then, there are k - p pivots left to be considered which are eliminated in the first step. If the drop rate is represented by α, the number of pivot distances calculated is expressed by p + (1 - α) × (k - p).
Using the necessary parameters discussed in cost Equation (1), the cost formula takes the form as Equation (2).
In this section, several drop rate optimization methods that differ in the way they estimate the value of fc are proposed. These methods work dynamically for each query determining the correct drop rate value while the query is being processed.
Pivot cut approximation based on pivot performance (PCAPP) [24] has an initial phase to calibrate the drop rate value. The algorithm detects the drop rate by estimating a failure to cut probability. The database consists of P objects, and the first phase pivots are represented by PP, whereas the rest of the pivots are represented by PR. Assume that there are k objects selected from the database of size n for the calibration of the first phase, and the number of eliminated objects is k’, where PP ∈ P and PR ∈ P. If the cardinality of the first phase pivots is p, then the failure to cut probability fc can be calculated as . Equation (3) shows the general formulation of the probability calculation.
Pivot cut approximation based on general distribution (PCAGD) [24] tries to model the distance computations using an approximation to the general distribution of the data. It selects a ratio of the number of distance computations to model the database. The pivots of the first phase PP can also be used for distribution modeling. An array of m distribution intervals is maintained, and the existance of the query object in the center of an interval is represented by precision*i + precision/2 for each i ∈ m. The value of the failure to cut probability is expressed in Equation (4), where fc represents the failure to cut probability, and weighted sum of the value F (d
i
+ r) - F (d
i
- r) is computed using the failure to cut value of a pivot p
i
at a distance d
i
to the query object q. The value f (d
i
) represents the probability of an object to be at a location as far as d
i
to the query object q.
Pivot cut approximation based on inter pivot distances (PCAIPD) has two different implementations PCAIPD-Avg and PCAIPD-Root. It is a heuristic algorithm, and it only uses the cost formula of Equation (2)with the set of pivots to process after first step of the HKvp algorithm. PCAIPD-Avg and PCAIPD-Root optimize the drop rate by looking at the elimination power of pivots chosen in the first phase. PCAIPD-Avg averages the sum of probabilities of a pivot failing to eliminate another pivot from the rest of the pivot set. Assume that there are k pivots. First pivot eliminates x1 pivots from these k pivots, then there are k - x1 pivots left. The second pivot eliminates x2 pivots from pivot set, and k - x1 - x2 is left after the second pivot is processed. The process continues until the number of pivots to process is 0. The failure to cut probability of the first pivot is where the failure to cut probability of the second pivot is . For the ith pivot this fraction is equal to Assuming that p pivots are selected in the first phase, then the failure to cut probability for PCAIPD-Avg is expressed by Equation (5) by taking the average failure to cut values.
The same failure to cut probabilities apply for PCAIPD-Root. However, this time the geometric mean is calculated rather than the arithmetic mean of the fc values of each pivot. In other words, the p
th
root of multiplication of the failure to cut probabilities is taken which is equal to × ×... × . Equation (6) is found using this multiplication for the failure to cut probability.
Pivot cut approximation based on query pivot distances (PCAQPD) operates similar to PCAGD. PCAGD assumes that pivots come from the general distribution, and weighs the cut probabilities over the whole range of distances. PCAPQPD uses the actual set of pivots PP ∈ P that are processed in the first phase to determine the average fc that will be in effect during the query processing. Rather than being a static calculation as PCAGD, PCAPQPD estimates fc for each query separately. Equation (7) shows the failure to cut probability for PCAQPD.
The number of pivots PP ⊆ P evaluated in first phase of PCAQPD HKvp is adequate to represent the whole behavior of data set.
Pivot elimination methods have been tested with several data collections in different situations. Datasets of random points were used to understand which drop rate optimization technique is better. Among these algorithms PCAPP, PCAQPD, and PCAGD give the best results.
Experiments were run for a data set of 5K objects under varying pivot limits and dimensions to see how much the drop rate detection algorithm effects the costs and pivot selection. Clustered distributions in experiments use 500 clusters and cluster radius of 0.1.
In order to evaluate the correctness of the methods compared, optimal drop rate was found using the steepest descent method. The drop rate was optimized based on the performance of the query after it has been finished. This means that the same query will be executed multiple times to find the optimal drop rate for a particular query. This optimal drop rate is called DR_OPT in the figures. The symbol DR=0 is used to show the fact that all the pivots are used without dropping any.
Some of the drop rate detection methods perform very close to the optimal value. Among these algorithms PCAPP, PCAQPD and PCAGD give the best results. In Figures 1–3, the radius is increased for dimensions of 10, 30, and 50, while fixing the other parameters pivot limit as 200 and number of uniform vectors as 5000. The need for optimizing the drop rate is obvious. PCAGD and PCAPP are the most competitive methods to deal with increasing dimensions, and PCAQPD performs close to these approaches.
For the clustered vectors of the same settings asshown in Figures 4–6, PCAIPD variants and PCAQPD perform poorly for increasing dimension size.PCAIPD generally gives good results except for clustered distributions. In higher dimensions, it is obviously seen that PCAIPD gets worse in performance when compared to current drop rate detection scheme which is called the optimal drop rate. Even though PCAIPD does not give the best results among the drop rate detection algorithms, it is observed to be especially successful for uniform data. PCAIPD does not use any extra information than the algorithm itself. Figures 7–10 show the effect of pivot limit on the performance of the drop rate detection algorithms on uniform vectors of size 5000. Four different pivot limit values are used as 50, 100, 200, and 500 in the experiments while the dimension is 30, and the radius is varying in the range [0.1–0.4]. There is an increase in the performance of the algorithm for larger pivot limit values regardless of the pivot elimination algorithm. However, for smaller pivot limit values PCAGD performs better than other algorithms. Figures 11–14 show the performance of the algorithms for different pivot limits where all the parameters are the same but the data is clustered. PCAGD and PCAPP perform the best for smaller pivot limit values.
PCAPP is very successful for determining the drop rate value. It detects the drop rate while HKvp is running. Therefore, one may expect that it would work better than PCAGD. However, it is not as efficient as PCAGD in some situations. Distribution based behavior of PCAGD may be a result of it. The performance of the PCAGD algorithm improves for increasing number of database samples. It is very successful in detecting a drop rate ratio to minimize the number of distance computations. It performs generally better than the PCAIPD variants. This is because it uses more information than PCAIPD variants. PCAQPD shows similar results as PCAGD since it exploits the actual set of pivots used in the first phase of the algorithm.
Unfortunately, PCAIPD variants are not successful to detect the optimal drop rate. A reason of this result may be that the pivots selected in first phase of the HKvp Algorithm does not represent the overall elimination behavior of all pivots.
Conclusions
In this paper, a number of improvements to the drop rate selection mechanism of the prevailing pivot based HKvp structure was discussed. Several algorithms were handled to detect the drop rate value on the fly. A detailed performance comparison of the strategies PCAIPD-Avg, PCAIPD-Root, PCAGD, PCAPP and PCAQPD was proposed for automatic detection of the drop rate. PCAIPD-Avg uses the arithmetic mean of the failure to cut probabilities, and PCAIPD-Root uses the geometric mean of the failure to cut probabilities. PCAGD and PCAQPD are based on the general distribution of the data. PCAPP is different from these algorithms because it has a calibration phase. It determines the failure to cut probability after this calibration phase. Best results have been obtained with PCAPP, PCAGD and PCAQPD.
Overall, the techniques and concepts presented in this paper provide significant improvements to the understanding of dynamic similarity search algorithms. For future research projects, the possibility of promoting query objects should also be investigated. Once a query is executed, there is already some information about its distances to the existing pivots and some of the database objects. This information can be used to evaluate it as a pivot candidate against the existing pivots. If selected as a pivot, the object distances can be used to significantly decrease the cost of pivot creation.
