Locality-sensitive hashing of permutations for proximity searching

Abstract

Similarity searching is at the core of many applications in artificial intelligence, since it solves problems such as nearest neighbor searching. A common approach to similarity searching consists of mapping the database to a metric space in order to build an index that allows for fast searching. One of the most powerful searching algorithms for high-dimensional data is the permutation-based algorithm (PBA). However, PBA has to collect the permutations most similar to the permutation of a given query. In this paper, we show how to speed up this process by proposing several novel hash functions for Locality Sensitive Hashing (LSH) with PBA. At search time, our technique allows discarding up to 50% of the database to answer the query, with a candidate list obtained in constant time.

Keywords

Nearest neighbor, similarity searching, metric spaces

1 Introduction

Similarity or proximity searching consists of retrieving the objects most similar to a query from the database. This kind of searching is at the core of many applications in artificial intelligence, especially now that most databases include multimedia data. In these databases, the main problem is to solve queries for which exact searching cannot be applied, although there is a way to estimate how similar two objects are. Furthermore, almost all of these databases hold a huge number of elements, which makes solving similarity queries via a sequential scan impractical. Therefore, a solution is to preprocess the database to build an index which then allows answering queries efficiently Chávez and Navarro [2], Samet [12], Zezula et al. [14].

The similarity searching problem can be approached using a metric space, which is a pair $𝕄 = (𝕏, d)$ , where $𝕏$ is the universe of valid objects and $d : 𝕏 \times 𝕏 \to ℝ^{+}$ is a distance function. For all $x, y, z \in 𝕏$ , the distance function d must satisfy the following properties: symmetry d (x, y) = d (y, x), reflexivity d (x, x) = 0, strict positiveness d (x, y) > 0 ↔ x ≠ y, and the triangle inequality d (x, y) ≤ d (x, z) + d (z, y). The database is a finite subset of valid objects $𝕌 \subseteq 𝕏$ , with $n = | 𝕌 |$ .
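As a small illustration of these properties, the sketch below checks them numerically on a finite sample. It assumes Euclidean distance on tuples; the names `euclidean` and `satisfies_metric_axioms` are hypothetical and do not come from the paper.

```python
import itertools
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def satisfies_metric_axioms(points, d, tol=1e-12):
    """Check the four metric properties on every triple of a finite sample."""
    for x, y, z in itertools.product(points, repeat=3):
        if abs(d(x, y) - d(y, x)) > tol:        # symmetry
            return False
        if d(x, x) > tol:                        # reflexivity
            return False
        if x != y and d(x, y) <= 0:              # strict positiveness
            return False
        if d(x, y) > d(x, z) + d(z, y) + tol:    # triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """Not a metric: squaring the distance breaks the triangle inequality."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


sample = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 4.0)]
```

A check of this kind can only refute the properties on the sample; it does not prove that a function is a metric on the whole universe.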

The distance is assumed to be expensive to compute. Hence, it is customary to define the complexity of the search as the number of distance evaluations performed, disregarding other components such as CPU time for side computations.

Basically, there are two kinds of queries: range queries and k-nearest neighbors queries. The first one consists of retrieving the elements of $𝕌$ within a given radius of the query, that is $R (q, r) = \{u \in 𝕌 : d (u, q) \leq r\}$ ; the second one consists of retrieving the k elements of $𝕌$ that are closest to q, that is, a set $N N_{k} (q) \subseteq 𝕌$ with |N N_k (q) | = k such that $\forall v \in N N_{k} (q), w \in 𝕌 \setminus N N_{k} (q)$ , d (v, q) ≤ d (w, q). In case of ties, any k-element set that satisfies the query is chosen.
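Both query types can be stated directly as a sequential scan, which is the baseline that any index tries to beat. The following is a minimal sketch assuming Euclidean distance on tuples; `range_query` and `knn_query` are illustrative names, not part of the paper.

```python
import heapq
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def range_query(database, q, r, d=euclidean):
    """R(q, r): every u in the database with d(u, q) <= r (n distance evaluations)."""
    return [u for u in database if d(u, q) <= r]


def knn_query(database, q, k, d=euclidean):
    """NN_k(q): the k elements closest to q, nearest first; ties broken arbitrarily."""
    return heapq.nsmallest(k, database, key=lambda u: d(u, q))


database = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0)]
q = (0.9, 0.9)
in_range = range_query(database, q, r=1.5)
nearest = knn_query(database, q, k=2)
```

Each call evaluates d once per database element, which is exactly the cost the complexity measure of this paper counts.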

Several algorithms have been proposed to solve range and k-nearest neighbors queries. Many of them are very effective only in low-dimensional spaces, and their performance worsens as the intrinsic dimension increases. This problem is known as the curse of dimensionality Chávez and Navarro [2] (which has been extended from vector spaces Chávez et al. [3]). It is not easy to measure the difficulty of searching in a metric space Navarro et al. [9]; the search costs depend on the shape of the histogram of distances of the dataset. In high-dimensional spaces, the search costs of many of these algorithms grow to such an extent that they end up being similar to comparing the query with every database element (sequential search) Chávez et al. [3]. Only a few algorithms are capable of successfully facing searches on high-dimensional metric spaces.

An effective way to deal with the problem of similarity searches on very large or high-dimensional databases is obtaining an approximate and very efficient answer to the query. That is, we could achieve faster query answers if we admit that the returned set does not contain all relevant elements or contains some that are irrelevant.

In this context, there are two techniques whose performance can be unbeatable: the first is based on Locality Sensitive Hashing (LSH) Gionis et al. [8], the second one is based on the permutation based algorithm (PBA) Chávez et al. [4]. LSH is an effective strategy that has not been used with the permutations of a PBA.

In this article, we propose to combine LSH and PBA by using a new family of hash functions to avoid sequential scanning in PBA. The rest of this paper is organized as follows. In Section 2 we introduce some basic concepts: locality-sensitive hashing and permutation-based algorithm. In Section 3 we describe our proposal and finally, we present some experimental results in Section 4 and conclusions in Section 5.

2 Previous Work

As we mentioned previously, we consider a metric space $𝕄 = (𝕏, d)$ , where d is a distance function which satisfies the metric properties. In order to minimize the number of distance evaluations needed to solve similarity queries, we generally build an index by preprocessing the database $𝕌 \subseteq 𝕏$ ( $| 𝕌 | = n$ ). Then, when a query is posed, we traverse the index to determine a set of candidate objects that presumably contains the relevant ones.

There exist several indexes Chávez et al. [3], Samet [12], Zezula et al. [14], which are usually classified according to the approach they use to partition the dataset. The usual classification considers pivot-based algorithms, compact partition algorithms, and permutation-based algorithms. As pivot-based and compact partition algorithms do not perform very well in high-dimensional spaces, in this work we focus only on permutation-based algorithms.

Since our proposal involves locality-sensitive functions and permutation-based algorithms, we describe their main characteristics below.

2.1 Locality-Sensitive Hashing

Locality Sensitive Hashing was introduced in Gionis et al. [8]. This strategy hashes the elements so that similar elements fall into the same bucket with high probability; that is, the hash functions of LSH maximize collisions between similar objects.

The main idea consists of using a set of hash tables ( $𝕋$ ), each with its own hash function. Each hash table t along with its hash function is known as an instance ( $t \in 𝕋$ ). In order to improve the probability of finding similar elements, we increase the number of instances. In Gionis et al. [8], the authors showed how to map a vector to a string of bits that represents a location in a hash table. Formally, an LSH scheme is a family of functions $f : 𝕌 \to 𝔹$ , each of which maps elements of the metric space to a bucket $b \in 𝔹$ . The family is called (r₁, r₂, s₁, s₂)-sensitive for d if, for any $u, v \in 𝕌$ :

if v ∈ R (u, r₁) then $P_{𝕋} [f (u) = f (v)] \geq s_{1}$

if v ∉ R (u, r₂) then $P_{𝕋} [f (u) = f (v)] \leq s_{2}$

2.2 Permutation-Based Algorithms

The other approach that we use is the Permutation-Based Algorithm (PBA) Chávez et al. [4]. The permutation of an object is a representation that records the order in which this object “sees” the elements of a subset of distinguished objects called permutants. Formally, we select a set of objects of the database $ℙ = \{p_{1}, \dots, p_{m}\}$ as the set of permutants, where $ℙ \subseteq 𝕌$ and $| ℙ | = m$ . For each database element $u \in 𝕌$ we compute its distance to every element of $ℙ$ ; that is, d (p₁, u) , d (p₂, u) , …, d (p_m, u). Then, we sort the permutants by these distances in increasing order. PBA proposes to keep this order and call it the permutation Π_u of object u. This means that the closest permutant is in the first position, the second closest in the second position, and so on. For example, if we consider the set of permutants $ℙ = \{p_{1}, p_{2}, p_{3}, p_{4}, p_{5}, p_{6}\}$ and the permutation of u is Π_u = {2, 4, 6, 3, 5, 1}, then p₂ is the closest permutant to u. Let $Π_{u}^{- 1} (i)$ , where 1 ≤ i ≤ m, denote the position of the permutant p_i in the permutation, while Π_u (j) is the permutant at position j.
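The construction of Π_u and of its inverse can be sketched as follows. This is an illustration under stated assumptions: Euclidean distance on tuples, 1-based permutant identifiers as in the text, and ties broken by the smaller identifier (the text does not specify a tie rule). The function names are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def permutation(u, permutants, d=euclidean):
    """Return Pi_u: 1-based permutant identifiers sorted by increasing distance to u.

    Ties are broken by the smaller identifier, which is an assumption of this sketch.
    """
    order = sorted(range(len(permutants)), key=lambda i: (d(permutants[i], u), i))
    return [i + 1 for i in order]


def inverse(perm):
    """Return Pi^{-1} as a list: inv[i - 1] is the 1-based position of permutant p_i."""
    inv = [0] * len(perm)
    for position, permutant_id in enumerate(perm, start=1):
        inv[permutant_id - 1] = position
    return inv


# One-dimensional toy data: permutants p1 = 10, p2 = 1, p3 = 7 and an object u = 2.
permutants = [(10.0,), (1.0,), (7.0,)]
pi_u = permutation((2.0,), permutants)   # distances 8, 1, 5, so the order is p2, p3, p1
```

Computing one permutation costs m evaluations of d, which is the only distance cost paid per object at indexing time.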

When a query q is given, we compute the distance between q and all the elements in $ℙ$ , and build its permutation (Π_q). The premise of this strategy is that if u and q are exactly equal, then they have exactly the same permutation, and if they are similar, they are expected to have similar permutations. Therefore, in order to find the most similar elements, we look for the most similar permutations.

Two measures are commonly used to evaluate the similarity between permutations:

Spearman Footrule: $S_{F} (Π_{u}, Π_{q}) = \sum_{i = 1}^{m} | Π_{u}^{- 1} (i) - Π_{q}^{- 1} (i) |$ (1)

Spearman Rho: $S_{ρ} (Π_{u}, Π_{q}) = \sqrt{\sum_{i = 1}^{m} (Π_{u}^{- 1} (i) - Π_{q}^{- 1} (i))^{2}}$ (2)
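Equations 1 and 2 translate almost literally into code. The sketch below represents a permutation as a list of 1-based permutant identifiers; the helper names are illustrative only.

```python
import math


def inverse(perm):
    """inv[i - 1] is the 1-based position of permutant p_i in perm."""
    inv = [0] * len(perm)
    for position, permutant_id in enumerate(perm, start=1):
        inv[permutant_id - 1] = position
    return inv


def spearman_footrule(pi_u, pi_q):
    """Equation 1: sum of absolute differences between permutant positions."""
    return sum(abs(a - b) for a, b in zip(inverse(pi_u), inverse(pi_q)))


def spearman_rho(pi_u, pi_q):
    """Equation 2: Euclidean distance between the two inverse permutations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(inverse(pi_u), inverse(pi_q))))


pi_u = [2, 4, 6, 3, 5, 1]
pi_x = [4, 2, 6, 3, 1, 5]   # two adjacent swaps away from pi_u
```

For these two permutations four permutants move by one position each, so the Footrule value is 4 and the Rho value is 2.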

In Chávez et al. [4], the authors proposed, at query time, to compare the permutation of every database element against the query permutation using Spearman Footrule (equation 1), to sort the elements by this value, and then to compare the elements with the most similar permutations against the query using the distance function d. In this way, only a small fraction of the whole database is compared directly with the query.
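The query procedure just described can be sketched as below. This is a simplified reading of the original PBA, not the authors' implementation: the `fraction` parameter, the function names and the toy data are assumptions made for illustration.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def permutation(u, permutants, d=euclidean):
    """1-based permutant identifiers sorted by increasing distance to u."""
    order = sorted(range(len(permutants)), key=lambda i: (d(permutants[i], u), i))
    return [i + 1 for i in order]


def footrule(pi_u, pi_q):
    """Spearman Footrule (equation 1), computed from 0-based positions."""
    pos_q = {p: j for j, p in enumerate(pi_q)}
    return sum(abs(j - pos_q[p]) for j, p in enumerate(pi_u))


def pba_knn(database, perms, permutants, q, k, fraction, d=euclidean):
    """Approximate NN_k(q): rank all permutations by Footrule, then verify a prefix with d."""
    pi_q = permutation(q, permutants, d)
    # Sequential scan over all n stored permutations (no distance evaluations).
    ranked = sorted(range(len(database)), key=lambda i: footrule(perms[i], pi_q))
    budget = max(k, math.ceil(fraction * len(database)))
    reviewed = [database[i] for i in ranked[:budget]]
    return sorted(reviewed, key=lambda u: d(u, q))[:k]


permutants = [(0.0,), (5.0,), (10.0,)]
database = [(1.0,), (4.0,), (6.0,), (9.0,)]
perms = [permutation(u, permutants) for u in database]   # built once, at indexing time
answer = pba_knn(database, perms, permutants, q=(8.5,), k=1, fraction=0.5)
```

The sort over all n permutations is the sequential step that the hash tables proposed in this paper are meant to avoid.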

In the PBA literature, some proposals use auxiliary data structures to avoid examining the whole dataset. For example, Amato and Savino [1] proposed an inverted index (MI-File) to find the list of nearest-neighbor candidates. Esuli [5] proposed using a prefix tree to find the most similar permutations (list of candidates). However, some of these techniques need many permutants in order to reach a high recall (e.g. 512 or 1024).

Tellez and Chavez [13] introduced a strategy to map permutations to bits and then use the well-known LSH. However, their strategy is based on the Spearman Footrule metric (equation 1). Unfortunately, they needed many permutants to get an acceptable recall, e.g. 256 permutants.

The LSH concept was presented in Novak et al. [11] for general metric spaces. They used the well-known M-Index Novak and Batko [10] in order to extend the applicability of the LSH approach. All strategies presented were adaptable to the original LSH.

3 Our Proposal

In this article, we propose a new set of hash functions, which are the core of the Locality Sensitive Hashing algorithm. This set of functions is specifically designed for the permutation-based technique. In order to introduce our proposal, we first define the notation used.

Let us consider a permutation split into δ small subpermutations. For example, if we split Π_q into δ = 3 subpermutations, we can divide it as:

$Π_{q} = \{\underbrace{2, 4}_{Π_{q}^{1}}, \underbrace{6, 3}_{Π_{q}^{2}}, \underbrace{5, 1}_{Π_{q}^{3}}\}$

Let $β (Π_{u}, i) : Π_{u} \to Π_{u}^{i}$ be the function that returns the part of the permutation between positions a_i = ⌊ ((m/δ) × (i - 1)) + 1 ⌋ and b_i = ⌊ (m/δ) × i ⌋, where 1 ≤ i ≤ δ. For example, $β (Π_{q}, 1) = Π_{q}^{1} = \{2, 4\}$ , and so on.
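The splitting function β can be sketched with integer arithmetic, since ⌊(m/δ)(i − 1) + 1⌋ = ⌊m(i − 1)/δ⌋ + 1. The names `split_bounds` and `beta` are hypothetical; positions are 1-based as in the text.

```python
def split_bounds(m, delta, i):
    """Return the 1-based positions (a_i, b_i) of the i-th of delta subpermutations."""
    if not 1 <= i <= delta:
        raise ValueError("i must satisfy 1 <= i <= delta")
    a_i = (m * (i - 1)) // delta + 1   # floor((m / delta) * (i - 1) + 1)
    b_i = (m * i) // delta             # floor((m / delta) * i)
    return a_i, b_i


def beta(perm, delta, i):
    """Return the subpermutation of perm between positions a_i and b_i, inclusive."""
    a_i, b_i = split_bounds(len(perm), delta, i)
    return perm[a_i - 1:b_i]


pi_q = [2, 4, 6, 3, 5, 1]
parts = [beta(pi_q, 3, i) for i in (1, 2, 3)]
```

When δ does not divide m, the floor makes the pieces differ in length by at most one while still covering every position exactly once; for example, m = 8 and δ = 3 gives the position ranges 1 to 2, 3 to 5 and 6 to 8.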

3.1 Locality Sensitive Hashing of Permutations (LSHP)

Let $𝕋$ be a set of hash tables, t an element of $𝕋$ , and $τ = | 𝕋 |$ . Let $g : 𝕌 \to 𝔹^{τ}$ be such that g (Π_u) = (f¹ (Π_u) , …, f^τ (Π_u)), where each fⁱ belongs to the family $𝔽 = \{f : 𝕌 \to 𝔹\}$ . Since we are working with hash tables, we denote by γ the size of each hash table t.

One of the main concepts of hashing is that different objects may go to the same bucket, which is known as a collision. Our proposal makes use of this concept: the functions defined below generate equivalence classes, each of which groups similar objects.

3.1.1 Definitions of Hash Functions

In this section, we present the hash functions proposed for permutations. For each function we consider only the closest and the farthest permutants, because these permutants are more significant for the distance between permutations. In other words, the difference in the position of the i-th permutant in two permutations (i.e., $| Π_{u}^{- 1} (i) - Π_{q}^{- 1} (i) |$ ) can be greater for the closest and farthest permutants. For example, δ = 3 means that subpermutations $Π_{q}^{1}$ and $Π_{q}^{3}$ will be considered.

Let δ = 3, $Π_{u}^{1} = β (Π_{u}, 1)$ and $Π_{u}^{3} = β (Π_{u}, 3)$ : $\begin{matrix} f^{1} & = & c \sum_{i = a_{1}}^{b_{1}} Π_{u}^{1} (i) + \sum_{i = a_{3}}^{b_{3}} Π_{u}^{3} (i) \\ f_{sum}^{1} (Π_{u}) & = & f^{1} \bmod γ \end{matrix}$

Considering ψ = 3 elements at the beginning and at the end of the permutation: $\begin{matrix} f^{2} & = & c \sum_{i = 1}^{ψ} Π_{u} (i) + \sum_{i = m - ψ + 1}^{m} Π_{u} (i) \\ f_{ψ_ψ}^{2} (Π_{u}) & = & f^{2} \bmod γ \end{matrix}$

Let δ = 4, $Π_{u}^{1} = β (Π_{u}, 1)$ ; that is, the part of the permutation between positions a₁ and b₁: $\begin{matrix} f^{3} & = & \sum_{i = a_{1}}^{b_{1}} Π_{u}^{1} (i) \\ f_{ini}^{3} (Π_{u}) & = & f^{3} \bmod γ \end{matrix}$

Considering ψ = 3 elements at the beginning and ω = 4 at the end of the permutation: $\begin{matrix} f^{4} & = & c \sum_{i = 1}^{ψ} Π_{u} (i) + \sum_{i = m - ω + 1}^{m} Π_{u} (i) \\ f_{ψ_ω}^{4} (Π_{u}) & = & f^{4} \bmod γ \end{matrix}$

Considering ω = 4 elements at the beginning and ψ = 3 at the end of the permutation: $\begin{matrix} f^{5} & = & c \sum_{i = 1}^{ω} Π_{u} (i) + \sum_{i = m - ψ + 1}^{m} Π_{u} (i) \\ f_{ω_ψ}^{5} (Π_{u}) & = & f^{5} \bmod γ \end{matrix}$

Let δ = 4, $Π_{u}^{4} = β (Π_{u}, 4)$ ; that is, the part of the permutation between positions a₄ and b₄: $f_{end}^{6} (Π_{u}) = (\sum_{i = a_{4}}^{b_{4}} Π_{u}^{4} (i)) \bmod γ$

Considering ω = 4 elements at the beginning and at the end of the permutation: $\begin{matrix} f^{7} & = & c \sum_{i = 1}^{ω} Π_{u} (i) + \sum_{i = m - ω + 1}^{m} Π_{u} (i) \\ f_{ω_ω}^{7} (Π_{u}) & = & f^{7} \bmod γ \end{matrix}$

We use the sum operation because it is robust under inversion of relatively close permutants which is good for our purpose Chávez et al. [4]. We use ψ = 3, ω = 4 and c = 100, because the first and last elements of the permutations are the most important according to Figueroa and Paredes [6].
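The seven functions can be sketched compactly. The code below is one reading of the definitions, with two assumptions: "ψ (or ω) elements at the end" is taken to mean the last ψ (or ω) positions of the permutation, and the subpermutation bounds use the floor formula of β. The names `head_tail` and `make_hash_functions` are hypothetical.

```python
C, PSI, OMEGA = 100, 3, 4   # constants reported in the text


def beta(perm, delta, i):
    """Subpermutation between positions a_i and b_i (1-based, inclusive)."""
    m = len(perm)
    return perm[(m * (i - 1)) // delta:(m * i) // delta]


def head_tail(perm, head, tail, c=C):
    """c * (sum of the first `head` permutants) + (sum of the last `tail` permutants)."""
    return c * sum(perm[:head]) + sum(perm[-tail:])


def make_hash_functions(gamma, c=C, psi=PSI, omega=OMEGA):
    """Return [f^1, ..., f^7]; each maps a permutation to a bucket in [0, gamma)."""
    return [
        lambda p: (c * sum(beta(p, 3, 1)) + sum(beta(p, 3, 3))) % gamma,  # f_sum
        lambda p: head_tail(p, psi, psi, c) % gamma,                      # f_psi_psi
        lambda p: sum(beta(p, 4, 1)) % gamma,                             # f_ini
        lambda p: head_tail(p, psi, omega, c) % gamma,                    # f_psi_omega
        lambda p: head_tail(p, omega, psi, c) % gamma,                    # f_omega_psi
        lambda p: sum(beta(p, 4, 4)) % gamma,                             # f_end
        lambda p: head_tail(p, omega, omega, c) % gamma,                  # f_omega_omega
    ]


functions = make_hash_functions(gamma=11)
bucket_u = functions[0]([2, 4, 6, 3, 5, 1])
bucket_x = functions[0]([4, 2, 6, 3, 1, 5])
```

Because each function sums the permutant identifiers inside a window, swapping two permutants within the same window leaves the bucket unchanged, which is the robustness property mentioned above.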

At indexing time, each hash table tⁱ, for 1 ≤ i ≤ τ, is built over the whole database and organized in buckets according to its own hash function. This process is described in Algorithms 1 and 2; the add function appends the element to the bucket selected by the hash value. Finally, for each hash table we store the elements of the database as they were distributed, and all the permutations computed during indexing are discarded; that is, we use O (τ × n) space.

Algorithm 1 build-LSHperm

for i = 1 to τ do
    tⁱ ← build-HT(i)
end for

Algorithm 2 build-HT(i)

for all $u \in 𝕌$ do
    v ← fⁱ (Π_u)
    tⁱ [v] ← add (u)
end for
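Algorithms 1 and 2 can be sketched in a few lines. The code assumes that the permutations have already been computed and that objects are stored by integer identifier; `f_sum` and `f_ini` are small instances of the functions defined earlier (with γ = 11 and m = 6), and all names are illustrative.

```python
from collections import defaultdict


def f_sum(perm, gamma=11, c=100):
    """f_sum for delta = 3; this toy version assumes m is divisible by 3."""
    third = len(perm) // 3
    return (c * sum(perm[:third]) + sum(perm[-third:])) % gamma


def f_ini(perm, gamma=11):
    """f_ini for delta = 4: sum of the first floor(m / 4) permutants."""
    return sum(perm[:len(perm) // 4]) % gamma


def build_ht(perms, f):
    """Algorithm 2: place every object identifier in the bucket chosen by f."""
    table = defaultdict(list)
    for obj_id, perm in enumerate(perms):
        table[f(perm)].append(obj_id)
    return table


def build_lsh_perm(perms, hash_functions):
    """Algorithm 1: one hash table per hash function."""
    return [build_ht(perms, f) for f in hash_functions]


perms = [[2, 4, 6, 3, 5, 1], [4, 2, 6, 3, 1, 5], [1, 3, 5, 2, 4, 6]]
tables = build_lsh_perm(perms, [f_sum, f_ini])
stored = sum(len(bucket) for table in tables for bucket in table.values())
```

Each table holds every identifier exactly once, so `stored` equals τ × n, and `perms` can be discarded once the tables are built.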

3.1.2 List of Candidates

Once the index is built, we can answer queries. When a query q and a radius r are given (i.e. we want to solve R (q, r)), we compute the permutation Π_q, evaluate the hash function of every hash table on it, and retrieve the corresponding buckets to obtain the list of candidates. For lack of space, we only consider range queries.

Formally, let $ℂ$ be the set of candidates: $ℂ = \bigcup_{i = 1}^{τ} t^{i} [f^{i} (Π_{q})]$ . Each $u \in ℂ$ is compared directly with the query q; that is, we compute d (u, q) and report those elements whose distance is less than or equal to r.
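The candidate set and the final verification step can be sketched end to end on toy data. The example assumes m = 4 permutants on the real line, so that with δ = 4 the functions f_ini and f_end reduce to the first and last permutant identifier modulo γ; the data, γ = 5 and all function names are assumptions of this sketch.

```python
import math
from collections import defaultdict


def euclidean(x, y):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def permutation(u, permutants, d=euclidean):
    """1-based permutant identifiers sorted by increasing distance to u."""
    order = sorted(range(len(permutants)), key=lambda i: (d(permutants[i], u), i))
    return [i + 1 for i in order]


def f_ini(perm, gamma=5):
    return perm[0] % gamma    # beta(perm, 1) with delta = 4 and m = 4


def f_end(perm, gamma=5):
    return perm[-1] % gamma   # beta(perm, 4) with delta = 4 and m = 4


def build_index(database, permutants, hash_functions):
    """Fill one table per hash function; permutations are not kept afterwards."""
    tables = [defaultdict(list) for _ in hash_functions]
    for obj_id, u in enumerate(database):
        perm = permutation(u, permutants)
        for table, f in zip(tables, hash_functions):
            table[f(perm)].append(obj_id)
    return tables


def candidates(tables, hash_functions, pi_q):
    """C: union over i of t^i[f^i(Pi_q)], as a set of object identifiers."""
    found = set()
    for table, f in zip(tables, hash_functions):
        found.update(table.get(f(pi_q), ()))
    return found


def lshp_range_query(database, permutants, tables, hash_functions, q, r, d=euclidean):
    """Approximate R(q, r): verify only the candidates with the real distance."""
    pi_q = permutation(q, permutants, d)          # m distance evaluations
    cand = candidates(tables, hash_functions, pi_q)
    return sorted(i for i in cand if d(database[i], q) <= r)


permutants = [(0.0,), (4.0,), (8.0,), (12.0,)]
database = [(1.0,), (3.0,), (7.0,), (11.0,), (12.5,)]
functions = [f_ini, f_end]
tables = build_index(database, permutants, functions)
result = lshp_range_query(database, permutants, tables, functions, q=(11.5,), r=1.0)
```

Retrieving the candidates costs one bucket lookup per table and no distance evaluations; only the verification step evaluates d, once per candidate.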

3.1.3 Example

Let Π_u = {2, 4, 6, 3, 5, 1} be a permutation, with δ = 3, c = 100 and γ = 11. Then $f^{1} = (2 + 4) \times 100 + (5 + 1) = 606$ and $f_{sum}^{1} (Π_{u}) = 606 \bmod 11 = 1$ .

If we have another element x whose permutation is Π_x = {4, 2, 6, 3, 1, 5}, then $f_{sum}^{1} (Π_{x}) = 1$ as well. As both permutations Π_u and Π_x produce the same value of $f_{sum}^{1}$ , u and x belong to the same bucket. Therefore, if a query q has a permutation Π_q with $f_{sum}^{1} (Π_{q}) = 1$ , we need to search in the bucket that holds u and x. Again, our proposal takes advantage of collisions to answer similarity queries.

4 Experimental Results

In order to evaluate the performance of our proposal, we selected some real-life databases available from SISAP Metric Library benchmark set Figueroa et al. [7]. For each dataset we randomly chose 500 objects as queries and we index all the other database elements.

We have tested several alternatives for the parameters of our proposal. We show here the results obtained considering τ = 7 different functions and using γ = 57 with permutations size m = 8, 12, 16.

It can be noticed that our technique is just using O (τ × n) integers to store the buckets of hash tables. We need m evaluations of the distance d to obtain the query permutation Π_q, but the list of candidates can be obtained without performing any additional distance evaluation of d.

On the other hand, the original PBA technique uses O (m × n) memory space, with m ≫ τ. Another advantage is that we obtain the candidate list without a sequential scan; in fact, we obtain this list in constant time. Therefore, we do not need an additional index to accelerate this process. For this reason, it is not necessary to compare our technique with MI-File, P-Index or PPP-Index, since these indexes need more than constant time to get the candidate list.

4.1 COLORS database

This database consists of 112,682 color histograms of images, represented as 112-dimensional vectors. We use the Euclidean distance for this metric space.

In Figure 1 we show the search performance of our proposal. Figure 1(a) depicts the experimental results when we retrieve NN₁ (q), comparing different values of τ, i.e. different numbers of hash tables. As can be seen, using only 1 hash table we get 75% recall with 30% of the database in the list of candidates (m = 8). Moreover, when we use 2 hash tables we get up to 93% recall with 55% of the elements in the list of candidates. Figure 1(b) illustrates the performance obtained when we answer NN₂ (q) queries; in this case, we obtain a similar performance. In all plots, Distances represents the number of evaluations of the distance d performed to answer the query, that is, the cost of comparing q directly with all the elements in the list of candidates.
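The two quantities reported in these experiments can be computed per query as sketched below. The paper does not spell out its formulas, so the code assumes the usual definitions: recall is the fraction of the exact answer found in the candidate list, and the cost is the fraction of the database that the candidate list contains. Function names are illustrative.

```python
def recall(true_neighbors, candidate_ids):
    """Fraction of the exact answer that appears in the candidate list."""
    truth = set(true_neighbors)
    if not truth:
        return 1.0   # an empty exact answer is trivially recovered
    return len(truth & set(candidate_ids)) / len(truth)


def candidate_fraction(candidate_ids, n):
    """Fraction of the n database elements that must be compared with the query."""
    return len(set(candidate_ids)) / n


def mean(values):
    """Average of a per-query measure over the query set."""
    values = list(values)
    return sum(values) / len(values)


# Toy run with two queries over a database of n = 10 objects.
exact = [[3, 4], [7]]
found = [{3, 7}, {7, 8, 9}]
avg_recall = mean(recall(t, c) for t, c in zip(exact, found))
```

Averaging over the 500 queries of each dataset would follow the same pattern as `avg_recall`.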

Fig. 1. Performance of our proposal using the COLORS database.

It is remarkable that the performance of our proposal improves as the permutation size decreases, and that the permutations of the database elements are used only at indexing time.

4.2 NASA database

This dataset consists of 40,150 feature vectors in $ℝ^{20}$ . These 20-dimensional vectors were generated from images downloaded from NASA (available at http://www.dimacs.rutgers.edu/Challenges/Sixth/software.html ), and there are no duplicate vectors. For this metric space we also use the Euclidean distance to compare the vectors.

The performance of our technique for searches in the NASA database is shown in Figure 2. In this case, we show the results when we answer range queries with low selectivity. Hence, we illustrate the costs of range searches that retrieve 600 elements (Figure 2(a)) and 700 elements (Figure 2(b)). For this metric space, we get 100% recall using only 6 hash tables, with a list of candidates containing 50% of the database. As can be noticed, the recall of our proposal increases as the number of tables grows. The best recall is obtained when the number of permutants is the lowest value considered (m = 8).

Fig. 2. Performance of our proposal using the NASA database.

4.3 Histogram of Hash Tables

One of the main issues in hashing is that of collisions. Figures 3(a) and 3(b) show how well the hash functions distribute the objects in the hash tables t¹ to t⁶, using the COLORS database and m = 16. Notice that the hash functions f¹ to f⁶ have a uniform distribution across the buckets.
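A histogram of this kind simply counts how many objects each hash function sends to each bucket. The sketch below uses a hypothetical stand-in hash function and synthetic permutations, not the COLORS data, purely to show the computation.

```python
from collections import Counter
from itertools import permutations


def bucket_histogram(perms, f, gamma):
    """Number of objects that f sends to each of the gamma buckets."""
    counts = Counter(f(p) for p in perms)
    return [counts.get(b, 0) for b in range(gamma)]


def f_first(perm, gamma=5):
    """Hypothetical stand-in hash: identifier of the closest permutant, mod gamma."""
    return perm[0] % gamma


all_perms = list(permutations([1, 2, 3, 4]))   # 24 toy permutations, m = 4
histogram = bucket_histogram(all_perms, f_first, gamma=5)
```

Here bucket 0 stays empty because no permutant identifier is a multiple of 5, which illustrates how a poorly matched γ can leave parts of a table unused.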

Fig. 3. COLORS database.

In the case of the NASA database, the behavior of the hash tables t¹ to t⁶ is shown in Figures 4(a) and 4(b).

Fig. 4. NASA database.

5 Conclusions and Future Work

Approximate similarity search is a very useful operation for several real applications because it allows a good tradeoff between answer quality and the time needed to answer the query. In this work, we combine the good aspects of two prominent techniques in the proximity search field: LSH and PBA. Our proposal introduces a new family of hash functions for use in an LSH index. This family can be used in LSH to find the most similar permutations defined with a permutation-based algorithm. As mentioned above, our proposal has an advantage over other techniques (e.g. MI-File, P-Index, and PPP-Index) because we obtain the candidate list in constant time. Moreover, we show empirically that our new alternative achieves a very good search performance. It takes advantage of the information provided by permutations in LSH and avoids the traditional sequential scan over all the permutations (which the original PBA must perform). In this way, we also avoid the need for an index such as the inverted index or the PP-Index.

In this article, we tested several values for ψ, ω, τ and γ. The values shown are those that maximized the performance. Some values for γ were not considered because their use had the effect of clustering the computed bucket numbers on the first locations of the hash tables. In the case of ψ and ω, the most important and significant permutants are the first and sometimes the last ones.

Moreover, it is still possible to improve the search performance of our technique by also storing the permutations of the elements in the buckets. In this way, we would need more memory space for the index, but we could filter the list of candidates by using any permutation distance and keep only the candidates that are deemed promising. We are also interested in designing other hash functions for LSH based on bits.

References

1. Amato and Savino, Approximate similarity search in metric spaces using inverted files, In 3rd International ICST Conference on Scalable Information Systems, INFOSCALE 2008, Vico Equense, Italy, 2008, p. 28. doi: 10.4108/ICST.INFOSCALE2008.3486. URL https://doi.org/10.4108/ICST.INFOSCALE2008.3486.

2. Chávez and Navarro, A probabilistic spell for the curse of dimensionality, In 3rd Workshop on Algorithm Engineering and Experiments (ALENEX’01), volume 2153 of Lecture Notes in Computer Science, 2001, pp. 147–160.

3. Chávez, Navarro, Baeza-Yates and Marroquín, Proximity searching in metric spaces, ACM Computing Surveys 33(3) (2001), 273–321.

4. Chávez, Figueroa and Navarro, Effective proximity retrieval by ordering permutations, IEEE Trans on Pattern Analysis and Machine Intelligence (TPAMI) 30(9) (2009), 1647–1658.

5. Esuli, Use of permutation prefixes for efficient and scalable approximate similarity search, Inf Process Manage 48(5) (2012), 889–902. ISSN 0306-4573. doi: 10.1016/j.ipm.2010.11.011. URL https://doi.org/10.1016/j.ipm.2010.11.011.

6. Figueroa and Paredes, An effective permutant selection heuristic for proximity searching in metric spaces, In Proc 6th Mexican Conf on Pattern Recognition (MCPR’14), LNCS 8495, 2014, pp. 102–111.

7. Figueroa, Navarro and Chávez, Metric spaces library, 2007. Available at http://www.sisap.org/Metric_Space_Library.html.

8. Gionis, Indyk and Motwani, Similarity search in high dimensions via hashing, VLDB’99 Proceedings of the 25th International Conference on Very Large Data Bases, 1999, pp. 518–529.

9. Navarro, Paredes, Reyes and Bustos, An empirical evaluation of intrinsic dimension estimators, Inf Syst 64(C) (2017), 206–218. ISSN 0306-4379. doi: 10.1016/j.is.2016.06.004. URL https://doi.org/10.1016/j.is.2016.06.004.

10. Novak and Batko, Metric index: An efficient and scalable solution for similarity search, In Proceedings of the 2009 Second International Workshop on Similarity Search and Applications, SISAP ’09, Washington, DC, USA, 2009, pp. 65–73. IEEE Computer Society. ISBN 978-0-7695-3765-8. doi: 10.1109/SISAP.2009.26. URL https://doi.org/10.1109/SISAP.2009.26.

11. Novak, Kyselak and Zezula, On locality-sensitive indexing in generic metric spaces, In Proceedings of the Third International Conference on SImilarity Search and APplications, SISAP ’10, New York, NY, USA, 2010, pp. 59–66. ACM. ISBN 978-1-4503-0420-7. doi: 10.1145/1862344.1862354. URL http://doi.acm.org/10.1145/1862344.1862354.

12. Samet, Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005. ISBN 0123694469.

13. Tellez E.S. and Chavez, On locality sensitive hashing in metric spaces, In Proceedings of the Third International Conference on SImilarity Search and APplications, SISAP ’10, New York, NY, USA, 2010, pp. 67–74. ACM. ISBN 978-1-4503-0420-7. doi: 10.1145/1862344.1862355. URL http://doi.acm.org/10.1145/1862344.1862355.

14. Zezula, Amato, Dohnal and Batko, Similarity Search: The Metric Space Approach, volume 32 of Advances in Database Systems, Springer, 2006.