Query-specific signature selection for efficient k-nearest neighbour approximation

Abstract

Finding k-nearest neighbours (k-NN) is one of the most important primitives of many applications such as search engines and recommendation systems. However, its computational cost is extremely high when searching for k-NN points in a huge collection of high-dimensional points. Locality-sensitive hashing (LSH) has been introduced for an efficient k-NN approximation, but none of the existing LSH approaches clearly outperforms others. We propose a novel LSH approach, Signature Selection LSH (S2LSH), which finds approximate k-NN points very efficiently in various datasets. It first constructs a large pool of highly diversified signature regions with various sizes. Given a query point, it dynamically generates a query-specific signature region by merging highly effective signature regions selected from the signature pool. We also suggest S2LSH-M, a variant of S2LSH, which processes multiple queries more efficiently by using query-specific features and optimization techniques. Extensive experiments show the performance superiority of our approaches in diverse settings.

Keywords

locality-sensitive hashing

1. Introduction

Finding k-nearest neighbours is one of the most important tasks performed by various applications such as search engines and recommendation systems. The performance of k-nearest neighbour (k-NN) computation becomes increasingly important when finding k-NN points in a large collection of high-dimensional points representing images (e.g. Google image search) or videos (e.g. YouTube video recommender). Furthermore, there are situations in which k-NN lists need to be updated frequently and promptly in response to diverse user behaviour such as clicks. Therefore, an efficient approach for finding k-NN points is in high demand.

Locality-sensitive hashing (LSH) is an efficient method for approximate k-NN search [1]. Given a collection of data points, LSH functions assign close points to the same hash bucket with high probability. Therefore, points sharing a bucket with a query point can be considered as good candidates for k-NN search. However, the performance of the state-of-the-art LSH approaches such as Collision Counting LSH [2], LSB-tree and forest [3], and Exact Euclidean LSH [1, 4] is heavily affected by the characteristics of a given dataset such as the number of dimensions, the number of points and the feature extraction method used to generate the dataset.

In this paper, we propose a novel LSH approach for a fast k-NN approximation, called Signature Selection LSH (S2LSH), which performs consistently well in various types of datasets. To explain our main idea clearly, we introduce three concepts, a signature, a signature region and a k-NN region of a given point. When an LSH function g projects a point p onto a lower-dimensional space, the projected coordinate of p, g(p), is called the signature of p by g. A signature region of p by g, R_sig(g, p), is defined as the smallest region containing all the points whose signature is g(p). The k-NN region of p, R_knn(p), is defined as the smallest region containing all the exact k-NN points of p.

Notice that the sizes and the shapes of the k-NN regions of queries can be highly diverse. Therefore, even if there exist multiple signature regions, there is no guarantee that a signature region covers the entire k-NN region of a given query q. Some LSH approaches select candidates from the union or the intersection of all the signature regions. However, their performance can be seriously degraded by signature regions containing many non k-NN points of q. Instead, we suggest generating a query-specific signature region by using only highly effective signature regions.

Figure 1 shows a motivating example where some signature regions are more effective in finding k-NN points of q than others. Let us assume that there exist three signature regions of q (the star point), R_sig,1, R_sig,2 and R_sig,3. Now, assuming that k is 3, we need to select five points as k-NN candidates of q. Instead of the union or the intersection of R_sig,1, R_sig,2 and R_sig,3, we generate a query-specific signature region by using the most effective signature regions. R_sig,1 is a large signature region containing five neighbours of q, but four points are relatively far from q. In contrast, R_sig,2 and R_sig,3 are smaller, so the points in them are closer to q. Therefore, the points in R_sig,2 and R_sig,3 are better k-NN candidates of q than those in R_sig,1.

Figure 1.

Finding k-NN candidates (k=3) of a query point (the star point) using three signature regions (spheres), R_sig,1, R_sig,2 and R_sig,3.

Based on this observation, we assume that small signature regions are more effective for k-NN approximation. However, we need large signature regions as well, since the actual k-NN points of q can be far from q if q is in a sparse region or k is large. Therefore, we construct a large pool of highly diversified signature regions with various sizes and shapes, not just small ones, so that we can dynamically generate a query-specific signature region for any given query.

Even though the size of a signature region is a good indicator of its effectiveness, the effectiveness can be estimated more accurately by using query-specific features of the signature region. Let us consider a signature region containing two query points, q_b and q_c. If q_c is near its centre while q_b is near its border just as in R_sig,1, its effectiveness for q_b may not be as high as that for q_c, especially when it only partially covers the points very close to q_b. For example, the maximum distance from a query q to the points in the signature region is a query-specific feature that can be used to estimate the effectiveness of a signature region more accurately; it usually becomes larger as q moves from the centre of a signature region to the border. However, given a large set of data points and a large pool of signature regions, computing the exact values of such features is very expensive and significantly slows down the overall execution of k-NN approximation. Later, we present a variant of S2LSH, S2LSH-M, which exploits query-specific features of a signature region, together with additional optimization techniques, to further improve the efficiency of k-NN approximation, given a large set of queries. When many queries are given, we can approximate the query-specific distance values efficiently using intermediate computation results of k-NN approximation of other queries.

The contributions of this paper are summarized as follows:

We propose a novel k-NN approximation approach, S2LSH, which constructs a large pool of signature regions (Section 3) and finds k-NN candidates by using highly effective signature regions (Section 4).

We suggest S2LSH-M that uses query-specific features of a signature region and optimization techniques in order to further improve the efficiency of k-NN approximation, given multiple query points (Section 5).

Extensive experiments with various types of datasets show that our approaches, S2LSH and S2LSH-M, consistently outperform the state-of-the-art algorithms for k-NN approximation (Section 6).

2. Related works

Three types of k-NN computation tasks are mainly performed in various applications: k-NN search, k-NN graph construction, and partial k-NN graph construction. For a given dataset D and a given query, k-NN search finds k-NN points of the query among the points in D. It is an essential part of Google image search and k-NN classification. Then, let Q denote a set of queries. k-NN graph construction considers all the points in D as queries, that is, Q=D, and constructs a k-NN graph of D in which each point is connected to its k-NN points. It is important in the YouTube video recommendation system and agglomerative clustering. Partial k-NN graph construction is the same as k-NN graph construction except that Q is a subset of D. It can be used for the incremental update of a large k-NN graph.

2.1. k-NN search

Let D be a set of data points in a d-dimensional space. Given a query point q, we define an approximate k-NN search as a task of finding approximate k-nearest neighbours in D. It is challenging especially when the data points in D are high dimensional. For instance, documents and logs are usually represented by a huge number of words or items, and images and videos by a large number of extracted features. Tree-based space partitioning approaches such as kd-tree, quadtree, and R-tree have been proposed to speed up this process, but they suffer from the curse of dimensionality [4].

2.1.1. Locality-sensitive hashing

Locality-sensitive hashing was introduced as one of the most effective techniques for finding approximate k-nearest neighbours in a high-dimensional space [1]. LSH reduces the dimensionality of high-dimensional points by converting them into low-dimensional signatures while preserving their relative distances. Formally, a signature of a point p consists of an ordered list of hash values, each calculated by the corresponding LSH function $h : \mathbb{R}^{d} \to \mathbb{N}$.

The main difference between LSH and conventional hash functions can be explained as follows: let us consider two points, p₁ and p₂, that are good k-NN candidates of each other because of their close proximity to each other. Conventional hash functions map identical points to the same bucket, but different points such as p₁ and p₂ are usually mapped to different buckets. Even if conventional hash functions map different points to the same bucket, we cannot regard them as neighbours because the distances between the points are not considered when determining bucket numbers. In contrast, LSH hashes points to buckets so that close points are mapped to the same buckets with high probability. Therefore, it is highly likely that p₁ and p₂ are located in the same LSH bucket if they are close enough, and thus we can consider points located in the same bucket as k-NN candidates.

Various types of LSH functions have been proposed, and random projection [1, 5] is one of the most popular ones. It projects a high-dimensional point onto a random line and maps it to one of the segments of width r on that line as follows: $h_{\vec{a}, b} (p) = \lfloor (\vec{a} \cdot p + b) / r \rfloor$. Here, p denotes a query point or a data point; $\vec{a}$ is a vector where each component is drawn independently from a Gaussian distribution; r is a constant; b is a constant chosen uniformly from the range [0, r]. By randomly selecting $\vec{a}$ and b H times, we can obtain H hash functions h₁, h₂, …, h_H and the resulting signatures of length H. There are also many popular variants of random projection, such as random hyperplanes for cosine similarity [6] or random-permutation-based MinHash for Jaccard similarity [7]. However, the goal of this paper is not to suggest a better LSH function, but to propose an efficient approach to finding approximate k-NN points based on a given LSH scheme.
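The random projection scheme described above can be sketched in a few lines. This is a minimal illustration, assuming the quotient is rounded down to an integer segment index as in the standard p-stable scheme; the names `make_random_projection` and `signature`, the seed, and the values d=4, r=4.0 and H=8 are hypothetical choices made only for this example.

```python
import math
import random

def make_random_projection(d, r, rng):
    """Draw one hash function h(p) = floor((a . p + b) / r) for d-dimensional points."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]  # each component ~ N(0, 1)
    b = rng.uniform(0.0, r)                      # offset drawn uniformly from [0, r]

    def h(p):
        projection = sum(ai * pi for ai, pi in zip(a, p))
        return math.floor((projection + b) / r)  # index of the width-r segment

    return h

def signature(p, hash_functions):
    """Signature of p: the ordered tuple of its hash values h_1(p), ..., h_H(p)."""
    return tuple(h(p) for h in hash_functions)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
hash_functions = [make_random_projection(d=4, r=4.0, rng=rng) for _ in range(8)]  # H = 8
sig = signature([1.0, 2.0, 3.0, 4.0], hash_functions)
```

Two points whose projections fall into the same segment for a given function collide under that function; a larger r makes segments wider and collisions more likely.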

2.1.2. k-NN approximation using LSH

Once high-dimensional data points in D are converted into signatures, we need a way to find k-NN candidates for a given query q. The simplest way is to calculate the distances between the signature of q and the signatures of data points in D and select k data points closest to q. This approach, however, is not adequate for many applications for the following reasons: (1) if signatures are short (e.g. 100), even the state-of-the-art data-dependent LSH algorithms such as spherical hashing [8] do not achieve the level of accuracy (e.g. MAP@100) higher than 0.15 for large datasets; (2) if signatures are long, it takes much time to calculate the similarities between signatures.

To cope with this problem, Exact Euclidean LSH (E2LSH) is proposed [1, 4]. It constructs a set of compound LSH functions, g₁, g₂, …, g_L, each consisting of an equal number of binary LSH functions. Given a query point q, if q and a data point $p \in D$ have the same compound hash value, that is, g_i(q)=g_i(p), for some i, p is considered as a candidate for the k-nearest neighbours of q. LSB-tree and LSB-forest [3] further compress every low-dimensional signature g(p) into one-dimensional value using a Z-order curve. Given a query point q, they select a point that has a Z-order value with the greatest LLCP (length of longest common prefix) as a candidate for q. Collision Counting LSH (C2LSH) [2] counts the number of collisions between a query point q and a data point $p \in D$ using a set of hash functions, h₁, h₂, …, h_H, and then selects p as a k-NN candidate for q if the collision count is equal to or greater than a given threshold l.
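The difference between the union-style rule of E2LSH and the collision-counting rule of C2LSH can be illustrated with a simplified sketch. It assumes that all signatures and hash values are already computed and stored in plain dictionaries; the function names and the toy values are hypothetical, and the real systems rely on hash tables and further machinery that is omitted here.

```python
def e2lsh_candidates(query_sigs, data_sigs):
    """p is a candidate if g_i(q) == g_i(p) for at least one compound function g_i."""
    return {pid for pid, sigs in data_sigs.items()
            if any(s == t for s, t in zip(sigs, query_sigs))}

def c2lsh_candidates(query_hashes, data_hashes, threshold):
    """p is a candidate if it collides with q under at least `threshold` hash functions."""
    return {pid for pid, hashes in data_hashes.items()
            if sum(a == b for a, b in zip(hashes, query_hashes)) >= threshold}

# Toy compound signatures (L = 2, each of length 2)
compound_q = [(1, 0), (0, 1)]
compound_data = {"a": [(1, 0), (1, 1)], "b": [(0, 0), (1, 1)], "c": [(0, 1), (0, 1)]}
union_candidates = e2lsh_candidates(compound_q, compound_data)

# Toy single hash values (H = 4) with collision threshold l = 3
hashes_q = [3, 7, 2, 5]
hashes_data = {"a": [3, 7, 2, 9], "b": [3, 1, 1, 5], "c": [0, 0, 0, 0]}
counted_candidates = c2lsh_candidates(hashes_q, hashes_data, threshold=3)
```

In this toy data, points "a" and "c" each share one compound signature with the query, while only "a" reaches three collisions.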

C2LSH has been reported to outperform other LSH approaches such as E2LSH, LSB-tree and LSB-forest [2]. However, we consider E2LSH as one of the baselines since we observed that E2LSH significantly outperforms C2LSH on some datasets. In fact, both C2LSH and E2LSH use multiple signature regions. However, since the size and the shape of the actual R_knn(q) may differ significantly across queries, the union of all the corresponding signature regions (E2LSH) or their intersection with a threshold value (C2LSH) does not cover R_knn(q) effectively for some queries.

2.2. k-NN graph construction

Given a set of data points D and a set of query points Q, k-NN graph construction finds k-nearest neighbours in D for each query point in Q. Here, Q is set to D, that is, every point in D is considered as a query. It is one of the primitive operations performed in data mining, information retrieval, recommender systems and machine learning [9–23]. There are three groups of algorithms for k-NN graph construction: LSH-based algorithms, clustering-based algorithms and heuristic or data-distribution based algorithms.

A naive way to construct a k-NN graph is to execute an LSH-based k-NN search algorithm for every point in D. However, k-NN graph construction algorithms [9, 11] are usually more efficient than multiple executions of LSH-based k-NN search algorithms, since they reuse the computation results obtained during k-NN search of other queries. To narrow this gap, an LSH-based algorithm [18], which is a variant of LSB-tree and LSB-forest, exploits the two-hop neighbours of each query and applies random projection to compound LSH functions.

The key intuition behind clustering-based algorithms is that the points in a cluster are highly likely to be k-nearest neighbours of each other. K-means and canopy clustering are known to be slightly faster than brute-force search while maintaining a high level of accuracy [24]. Also, iterative execution of two-means clustering is shown to be effective in constructing a k-NN graph [17]. Recursive Lanczos bisection [10] uses a hyperplane to make clusters; first, it draws a hyperplane that passes through the centroid and splits the set of points in D such that the sum of squared distances from the points $p \in D$ to the hyperplane is maximized. Next, it recursively divides the points in D into two overlapping clusters using the hyperplane. Similarly, random hyperplanes are used to divide the points in D [6]. Note that the clustering algorithms exploit the aforementioned two-hop neighbours heuristic to improve the approximation accuracy.

NN-Descent [9] also exploits the two-hop neighbours heuristic but in a more efficient way: it first randomly selects the k-nearest neighbours for every point in D. Then, it repeatedly improves the k-NN lists by comparing each point against its neighbours’ neighbours. It also uses two techniques, one for avoiding many redundant distance computations without consuming much memory, and the other for speeding up the convergence of the algorithm. Greedy filtering [11] assumes that each point is sparse and thus represented as an array of elements, each consisting of a dimension and a value, and that cosine similarity and the TF-IDF (or BM25) weighting scheme are used. Based on the observation that a dataset often follows a certain data distribution, it prunes out a significant amount of search space.
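The core loop of the two-hop neighbours heuristic can be sketched as follows. This is a deliberately simplified illustration and not the NN-Descent algorithm as published: it starts from random k-NN lists and refines them with neighbours' neighbours, but it omits the redundancy-avoidance and convergence techniques mentioned above. The function name, the parameters and the one-dimensional toy data are assumptions made for this example.

```python
import random

def two_hop_refinement(points, k, dist, max_iterations=10, seed=0):
    """Refine random k-NN lists by repeatedly checking neighbours of neighbours."""
    rng = random.Random(seed)
    n = len(points)

    def closest(i, ids):
        # Ties are broken by point index so that the result is deterministic.
        return sorted(ids, key=lambda j: (dist(points[i], points[j]), j))[:k]

    # Start from k randomly chosen neighbours for every point (requires n > k).
    knn = [closest(i, rng.sample([j for j in range(n) if j != i], k)) for i in range(n)]
    for _ in range(max_iterations):
        updated = False
        for i in range(n):
            two_hop = {w for j in knn[i] for w in knn[j]} - {i}
            improved = closest(i, set(knn[i]) | two_hop)
            if improved != knn[i]:
                knn[i], updated = improved, True
        if not updated:  # no list changed in this pass
            break
    return knn

points = [(float(x),) for x in range(20)]
graph = two_hop_refinement(points, k=3, dist=lambda a, b: abs(a[0] - b[0]))
```

Each pass can only replace a neighbour with a closer one, so the lists never get worse from one iteration to the next.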

3. Signature pool generation

Given a set of data points, D, we construct a highly diversified signature pool of size L. We first define a set of L compound LSH functions, $G$ ={g₁, g₂, …, g_L}. The ith compound LSH function, g_i, maps each point p in D onto the ith signature region of p, R_{sig, i}(p). Given a query point q, the signature pool provides L signature regions of q defined by L compound LSH functions in $G$ , from which we select the most effective signature regions of q. Note that two points q and p are in the same ith signature region if the ith signatures of p and q are the same, that is, g_i(p)=g_i(q).

We aim to construct a large signature pool containing signature regions with various sizes and shapes, since we want to improve the chance of finding a set of effective signature regions for any given query. However, when L and |D| are large, the preprocessing time to generate each g_i separately and precompute the signatures of the points in D using each g_i is extremely long. Instead, we generate a large collection of binary LSH functions, $H$, from which we randomly select a varying number of underlying binary LSH functions to compose each compound LSH function in $G$.

3.1. Generating a set of underlying binary LSH functions

We first generate a set of binary LSH functions, $H$, of size H. S2LSH can use any type of binary LSH function to construct $H$ without significant performance degradation, while existing approaches are heavily affected by the choice of LSH scheme. We mainly use spherical hashing [8] for ease of explanation, but we also compared our approaches with baselines using different LSH schemes such as random hyperplanes [6]. Random hyperplanes [6] is one of the most popular data-independent LSH schemes. It generates binary LSH codes with length equal to the number of hyperplanes, and each LSH code is determined by the cosine distance from a point to the corresponding hyperplane. Data distribution is not considered when drawing hyperplanes. Recently, data-dependent LSH schemes such as spectral hashing [25] and anchor graph hashing [26] have been proposed. Spherical hashing [8] is a very effective data-dependent LSH scheme. It considers the data distribution in D when drawing H hyperspheres with training samples chosen from D so that the hyperspheres are neither too close to nor too far from each other, and each hypersphere contains about half of the training samples. If a point p is in the ith hypersphere, the corresponding hash function h_i maps p to +1 (and −1 otherwise). Figure 4(a) shows an example of spherical hashing: the signatures of A and D are (+1, +1, −1, −1) and (+1, +1, +1, +1) respectively.
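How a binary signature is read off a set of hyperspheres can be sketched as follows. The centres and radii below are made up for illustration; in spherical hashing [8] they are learned from training samples, which this sketch does not attempt. Treating a point exactly on the boundary as inside is also an assumption.

```python
import math

def spherical_signature(p, spheres):
    """Map p to +1 for every hypersphere that contains it and to -1 otherwise.

    `spheres` is a list of (centre, radius) pairs, one per binary LSH function.
    """
    return tuple(+1 if math.dist(p, centre) <= radius else -1
                 for centre, radius in spheres)

# Three hypothetical hyperspheres in a 2-dimensional space
spheres = [((0.0, 0.0), 1.0), ((2.0, 0.0), 1.5), ((5.0, 5.0), 1.0)]
sig_a = spherical_signature((0.2, 0.1), spheres)
sig_b = spherical_signature((1.0, 0.5), spheres)
```

Here the first point lies only in the first hypersphere and the second point lies only in the second one, so their signatures differ in two positions.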

3.2. Generating a set of compound LSH functions

Given the size of a signature pool, L, we generate L compound LSH functions in $G$ using $H$. Let m₁ and m₂ (1≤m₁≤m₂≤H) denote the minimum and maximum number of binary LSH functions in each compound LSH function. To construct the ith compound LSH function g_i, we generate a random integer n_i (m₁≤n_i≤m₂), and then n_i random integers, $r_{1}, \dots, r_{n_{i}}$, with $1 \leq r_{j} \leq H$ for $1 \leq j \leq n_{i}$. g_i is constructed by combining n_i binary LSH functions, $h_{r_{1}}, \dots, h_{r_{n_{i}}}$, in $H$.

For instance, let us assume that H=300, m₁=5, and m₂=15, and spherical hashing is used. After generating 300 hyperspheres, we construct each g_i by combining hyperspheres randomly chosen among them. The number of hyperspheres in each g_i is between m₁ and m₂, and corresponds to the length of the signatures defined by g_i. Notice that we execute spherical hashing only once to generate 300 hyperspheres and derive every g_i from them, instead of performing spherical hashing L times to generate each g_i separately. This significantly reduces the preprocessing time to construct a signature pool without affecting k-NN approximation accuracy. In addition, generating each g_i in $G$ by combining binary LSH functions in $H$ can improve the diversity of a signature pool, especially when a data-dependent LSH scheme is used; for instance, spherical hashing tends to generate similar g_i if it is run repeatedly with a similar number of hyperspheres. This is not surprising since the set of most effective hyperspheres of D does not change significantly if the distribution of points in D remains the same.
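The construction of compound LSH functions from a shared pool of binary functions can be sketched as follows. Each g_i is represented only by the indices of its underlying binary functions, so the binary code of a point is computed once and reused. The function names are hypothetical, indices are 0-based, and drawing distinct indices with `sample` is an assumption, since the text only requires random integers in the valid range.

```python
import random

def gen_compound_lsh_functions(L, H, m1, m2, rng):
    """Return L index lists; the i-th list selects the binary functions of g_i."""
    assert 1 <= m1 <= m2 <= H
    compound = []
    for _ in range(L):
        n_i = rng.randint(m1, m2)                   # signature length of g_i
        compound.append(rng.sample(range(H), n_i))  # indices r_1, ..., r_{n_i}
    return compound

def compound_signature(binary_code, indices):
    """g_i(p): the entries of p's precomputed binary code at the chosen indices."""
    return tuple(binary_code[r] for r in indices)

rng = random.Random(0)
G = gen_compound_lsh_functions(L=150, H=300, m1=5, m2=15, rng=rng)
code_of_p = [rng.choice((-1, +1)) for _ in range(300)]  # stand-in for h_1(p), ..., h_H(p)
first_signature = compound_signature(code_of_p, G[0])
```

Because every g_i only stores indices, adding more compound functions costs no additional evaluations of the underlying binary functions.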

3.3. Signature diagrams and signature regions

The ith signature diagram is defined by the ith compound LSH function g_i in $G$, as shown in Figure 2. In the ith signature diagram, each point q in D is located in a signature region corresponding to its ith signature, g_i(q), denoted by R_sig(g_i, q). It is also referred to as the ith signature region of q, or R_{sig, i}(q) for short. For instance, in the signature diagram 3 of Figure 2, three points, E, F and H, are in a signature region, R_sig,3(E). If E is a query point, four signature regions are defined in the four signature diagrams, containing 0, 1, 2 and 4 neighbours, respectively. Later, we select k-NN candidates of q from the most effective signature regions among L signature regions of q, R_sig,1(q), R_sig,2(q), …, R_{sig, L}(q), in a signature pool.

Figure 2.

A signature pool of size 4 (L=4). The signature diagram i depicts a given dataset partitioned by the ith compound LSH function, g_i. Points in a signature region have the same signature; g₂(E)=g₂(D) and g₃(E)=g₃(F)=g₃(H).

3.4. Signature pool for fast retrieval of points in a signature region

S2LSH finds k-NN candidates by retrieving points in each selected signature region. If a signature pool provides only the compound LSH functions in $G$, we need to calculate the signatures of all the points in D to select the points with a specific signature, which is extremely expensive. For efficiency, we define an additional component of a signature pool, $T$, as a set of L hash tables constructed using the bucket hashing technique that is also used in the E2LSH package.¹

Algorithm 1.

The pseudocode of constructSignaturePool ( $D, H, L, m_{1}, m_{2}$ ) that constructs a signature pool ( $G$ , $T$ ) of a dataset D using four parameters, $H, L, m_{1}, m_{2}$

Input: $D, H, L, m_{1}, m_{2}$
Output: a set of compound LSH functions, $G$, and a set of hash tables, $T$
begin
    $H \leftarrow genBinaryLSHFunctions (H);$    // the left-hand side is the set of H binary LSH functions
    $G \leftarrow genCompoundLSHFunctions (L, H, m_{1}, m_{2});$
    initialize hash tables $t_{1}, \dots, t_{L}$ in $T$;
    for $g_{j} \in G$ do
        for $p_{i} \in D$ do
            $performBucketHashing (p_{i}, g_{j}, t_{j})$
end

Using the example in Figure 2, we explain how we generate $T$ using bucket hashing. We first build four hash tables (L=4 in this example). Then we define four hash functions, {b₁₁, …, b₁₄}, such that the ith hash function b_1i maps the ith signatures of points into specific positions of the ith hash table. For instance, b₁₃ maps the signatures {g₃(A), …, g₃(H)} onto the third hash table. However, since different signatures can be mapped to the same position of a hash table, we need to distinguish different signatures by putting them into different buckets. To do so, we define another set of four hash functions, {b₂₁, …, b₂₄}, such that the ith hash function b_2i maps the ith signatures into the short hash codes that correspond to the bucket IDs in the ith table. For example, assume that the signatures of five points in the signature diagram 3, {g₃(B), g₃(C)} and {g₃(E), g₃(F), g₃(H)}, are mapped to the same position of the third hash table by b₁₃, even though they belong to different signature regions. Using b₂₃, we assign different bucket IDs to the two different sets of points.
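The two-level bucket hashing described above can be sketched for a single hash table as follows. The two hash functions are deliberately simple stand-ins chosen so that the collision in the example is reproducible: `b1` sends both signatures to the same table position, and `b2` assigns them different bucket IDs. The signatures, the table size and all names are hypothetical, and the actual functions used by the E2LSH package are not reproduced here.

```python
from collections import defaultdict

def build_table(signatures, table_size, b1, b2):
    """Store point ids under table[position][bucket_id] for one compound function."""
    table = defaultdict(lambda: defaultdict(list))
    for pid, sig in signatures.items():
        table[b1(sig) % table_size][b2(sig)].append(pid)
    return table

def lookup(table, sig, table_size, b1, b2):
    """Return the ids of all stored points whose signature falls in sig's bucket."""
    return table.get(b1(sig) % table_size, {}).get(b2(sig), [])

def b1(sig):
    return sum(sig)  # toy position hash: different signatures may collide

def b2(sig):
    return int("".join("1" if v > 0 else "0" for v in sig), 2)  # toy bucket id

# Third signatures of five points, forming two different signature regions
third_signatures = {"B": (+1, -1), "C": (+1, -1),
                    "E": (-1, +1), "F": (-1, +1), "H": (-1, +1)}
t3 = build_table(third_signatures, table_size=8, b1=b1, b2=b2)
region_of_e = lookup(t3, third_signatures["E"], 8, b1, b2)
```

Both signatures land at the same position of the table, yet the lookup for E returns only the points that share E's signature.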

Algorithm 1 is the pseudocode of the signature pool generation of S2LSH. The inputs are (1) D (a set of data points), (2) H (the number of binary LSH functions), (3) L (the size of a signature pool), (4) m₁ (the minimum size of g_i in $G$ ) and (5) m₂ (the maximum size of g_i in $G$ ). The output is a signature pool of D represented as $G$ and $T$ . $G$ is a set of L compound LSH functions and $T$ is a set of L bucket hash tables. $genBinaryLSHFunctions (H)$ generates a set of H binary LSH functions, $H$ , using a binary LSH scheme such as spherical hashing or random hyperplanes. Then, $genCompoundLSHFunctions (L, H, m_{1}, m_{2})$ generates a set of compound LSH functions, $G$ , by combining underlying binary LSH functions selected from $H$ . The size of g_i in $G$ is between m₁ and m₂. Once $G$ is generated, we also construct bucket hash tables t_i in $T$ , one for each g_i in $G$ , to improve the performance of the signature selection of S2LSH.
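The overall flow of Algorithm 1 can be sketched end to end as follows, under several simplifying assumptions: the binary LSH functions are passed in as ready-made callables instead of being generated from H, a Python dictionary keyed by the signature replaces the bucket hash tables, and the threshold functions in the example are toy stand-ins for a real binary LSH scheme. All names are hypothetical.

```python
import random
from collections import defaultdict

def construct_signature_pool(D, binary_functions, L, m1, m2, seed=0):
    """Return (G, T): L index lists and L tables mapping a signature to point indices."""
    rng = random.Random(seed)
    H = len(binary_functions)
    # Evaluate every binary function once per point; all g_j reuse these codes.
    codes = [[h(p) for h in binary_functions] for p in D]
    G = [rng.sample(range(H), rng.randint(m1, m2)) for _ in range(L)]
    T = []
    for g in G:
        table = defaultdict(list)
        for i, code in enumerate(codes):
            table[tuple(code[r] for r in g)].append(i)  # bucket = signature region
        T.append(table)
    return G, T

# Toy data: four 1-dimensional points and three threshold-based binary functions
D = [(1.0,), (2.0,), (3.0,), (4.0,)]
binary_functions = [lambda p, c=c: +1 if p[0] <= c else -1 for c in (1.5, 2.5, 3.5)]
G, T = construct_signature_pool(D, binary_functions, L=3, m1=1, m2=2)
```

Every table partitions D, so each point belongs to exactly one signature region per compound function.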

3.5. Parameter settings

S2LSH uses four parameters to control the size and the diversity of a signature pool: H, L, m₁, and m₂. To achieve high k-NN approximation accuracy within a fixed time budget, these parameters need to be set carefully, especially when D is a large-scale high-dimensional dataset. For instance, we set H=1000, L=H/2, m₁=5, and m₂=15 for some experiments.

As H and L grow, the size and the diversity of a signature pool increase, which consequently improves the k-NN approximation accuracy. However, this requires more preprocessing time and space. For our experiments, we tune H mainly by considering the type of the LSH scheme and the characteristics of a given dataset. L is set to H/2, since the k-NN approximation accuracy improves as L grows but the improvement becomes marginal after L exceeds H/2.

We determine m₁ and m₂ based on the cardinality of D. Assuming that each binary LSH function h_i in $H$ is balanced, each h_i partitions D into two equal-sized groups of points, so a signature of length n splits D into up to 2ⁿ signature regions. Assume that |D|=4,000 (<2¹²). The value of m₁ should not be too large; if m₁ is 15, each g_i in $G$ has at least 15 binary LSH functions, resulting in fewer than one point in each signature region on average. At the same time, m₁ and m₂ should not be too small; if m₂ is 1, each signature region contains about 50% of the points in D. When we set m₁ and m₂, we also need to consider the expected execution time of each query and the value of k so that we do not check too many k-NN candidates. Finally, we prefer to set m₁ and m₂ such that the difference between them is as large as possible, since we want to have signature regions of various sizes in the signature pool.
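The arithmetic behind this guideline can be checked with a short calculation. It assumes perfectly balanced and independent binary LSH functions, so that a signature of length n splits D into 2ⁿ equally populated regions; real functions only approximate this. The helper names are hypothetical.

```python
import math

def expected_region_size(num_points, signature_length):
    """Average number of points per signature region under the balance assumption."""
    return num_points / 2 ** signature_length

def longest_useful_signature(num_points):
    """Largest signature length that still leaves at least one point per region on average."""
    return math.floor(math.log2(num_points))

size_for_15 = expected_region_size(4000, 15)  # below one point per region
size_for_5 = expected_region_size(4000, 5)
size_for_1 = expected_region_size(4000, 1)    # half of the dataset
upper_bound = longest_useful_signature(4000)
```

For |D|=4,000, a signature of length 5 leaves 125 points per region on average, and lengths beyond 11 leave fewer than one.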

4. k-NN approximation using S2LSH

Once a signature pool of size L is constructed, we need to select k-NN candidates of q from the signature regions in the signature pool. To compare the effectiveness of the signature regions of q, we define an effectiveness measure, E′.

We first discuss two observations on the effectiveness of a signature region of q. First, the ith signature region of q, R_{sig, i}(q), is effective for the efficient k-NN computation of q if it contains many exact k-NN points of q, that is, |R_sig,i(q) ∩ R_knn(q)| is large. Here, we assume that a signature region can also be viewed as the set of points in it. If |R_sig,i(q)∩R_knn(q)| is k, this means that we can find all the exact k-NN points among the points in R_sig,i(q). Second, when two signature regions contain the same number of exact k-NN points of q, we prefer the smaller one, which contains fewer false positives. Therefore, the effectiveness of the ith signature region of q can be represented as follows:

$$E (R_{sig, i} (q), q) = \frac{| R_{sig, i} (q) \cap R_{knn} (q) |}{| R_{sig, i} (q) \setminus R_{knn} (q) |} \qquad (1)$$
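When the exact k-NN points are known, for instance on a small sample used for analysis, the effectiveness score of equation (1) can be computed directly from two sets of point ids. The sketch below is illustrative only; returning infinity for a region without false positives is an assumption, since the equation leaves that case undefined.

```python
import math

def effectiveness(region, knn):
    """E = |region ∩ knn| / |region \\ knn| for sets of point ids."""
    hits = len(region & knn)              # exact k-NN points covered by the region
    false_positives = len(region - knn)   # non k-NN points in the region
    if false_positives == 0:
        return math.inf if hits > 0 else 0.0  # assumption: no false positives is ideal
    return hits / false_positives

score_mixed = effectiveness({1, 2, 3, 4, 5}, {1, 2, 9})  # 2 hits, 3 false positives
score_clean = effectiveness({1, 2}, {1, 2, 3})           # 2 hits, no false positives
score_miss = effectiveness({7, 8}, {1, 2})               # no hits
```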

However, notice that we can calculate the above effectiveness score only if we know the exact k-NN points of q in R_knn(q) in advance. In fact, even if R_knn(q) is known, it is NP-hard to find the optimal set of signature regions that cover the k-NN points of q with minimum cost. This problem is equivalent to the weighted set cover problem that finds the set of signature regions covering all the k-NN points of q while introducing the minimum number of non k-NN neighbours of q. Therefore, we identify a set of features that are highly correlated with the effectiveness of a signature region in terms of the k-NN approximation accuracy and the computation time, and use them to define a new effectiveness measure E′.

4.1. Effectiveness measure for S2LSH

Since the distance between q and a point in a signature region is bounded by the size of the signature regions, a point in a smaller signature region of q tends to be closer to q. A small signature region may not cover some of the actual k-NN points of q, but they are highly likely to be covered by other small signature regions in our large pool of signature regions. If two signature regions are of similar size, we consider the proximity of q to the centres of the signature regions. If R_sig,i(q) contains q near its border while q is near the centre of R_sig,j(q), R_sig,i(q) is considered less effective since there can be points that are very close to q but located just outside of its border.

In this way, we measure the effectiveness of each signature region to generate a query-specific signature region. We selected six features of a signature region that seem to be highly correlated with the effectiveness of the signature region and the proximity of a query to its centre, and that can be calculated at low computational cost.

Signature length (F_SigLen) – as the length of a binary signature of q grows, the corresponding signature region tends to shrink to a smaller one. The average size of a signature region decreases as more hyperspheres are used as in Figure 3(a). Therefore, a signature region corresponding to a longer signature tends to be smaller, and thus more effective for k-NN approximation. It is also shown that popular LSH schemes achieve a higher mean average precision by using longer binary LSH codes [8].

Cardinality (F_Card) – when two regions are of similar size (e.g. regions with the same signature length), a signature region with fewer points tends to introduce fewer false positives as depicted in Figure 3(b), and is thus more effective for k-NN approximation.

Average radius of hyperspheres that contain (or do not contain) a given query point q (F_AvgRadIn, F_AvgRadOut) – a signature region of q is located in the intersection of the hyperspheres containing q and out of the union of the hyperspheres that do not contain q. Therefore, if the average radius of the hyperspheres containing q is small, it is likely that the corresponding signature region of q is also small. Likewise, if the average radius of the hyperspheres not containing q is large, it is likely that the corresponding signature region of q is small.

Average distance to the centres of the hyperspheres that contain (or do not contain) a given query point q (F_AvgDistIn, F_AvgDistOut) – a hypersphere captures R_knn(q) effectively if q is located near its centre. In contrast, if a hypersphere contains q, but q is near its border, it is likely that the hypersphere covers R_knn(q) only partially. Therefore, if the average distance to the centres is small, q is near the centres of the hyperspheres containing q, and thus the corresponding signature region is likely to cover R_knn(q) effectively. In contrast, if the average distance to the centres of the hyperspheres that do not contain q is small, it is likely that R_knn(q) is just partially covered by the signature region. Since the distances from points in D to the centres of hyperspheres have been calculated to determine the signatures of points during the preprocessing stage of S2LSH, these two features can be computed efficiently.

Figure 3.

Four features highly relevant to the effectiveness of a signature region for the k-NN approximation. (a) and (b) are used to define E′, the effectiveness measure of S2LSH. (c) and (d) are used to define E ″ for S2LSH-M. Black points represent the actual k-NN of the query point, the star point, and grey ones are non k-NN points.

Even if we replace spherical hashing with a different LSH scheme, we can still define these six features by slightly modifying them. For instance, if we use random hyperplanes, we can define F_AvgRadIn by replacing the average radius of hyperspheres with the average angle of hyperplanes. The effectiveness of a signature region was roughly proportional to its signature length (F_SigLen) and the inverse of its cardinality (1/F_Card) in every dataset in Table 1. However, the correlation between the effectiveness of a signature region and the other four features was not clear in some datasets. Therefore, we define a new effectiveness measure for S2LSH, E′, by combining F_SigLen and 1/F_Card, as shown in equation (2). The weights of the two features, F_SigLen and 1/F_Card, leading to the best performance were significantly different across datasets, so S2LSH uses a simple effectiveness measure E′ that considers both features equally important in order to avoid overfitting to a particular dataset. If we need to improve the performance of S2LSH on a specific dataset, we can use a machine-learned effectiveness measure. Let F_SigLen(R_sig,i(q)) and F_Card(R_sig,i(q)) denote the signature length and the number of points of R_sig,i(q), respectively. Then, the effectiveness measure E′ for S2LSH is defined as follows:

E'(R_{sig,i}(q), q) = \begin{cases} 0, & \text{if } F_{Card}(R_{sig,i}(q)) = 1 \\ \dfrac{F_{SigLen}(R_{sig,i}(q))}{F_{Card}(R_{sig,i}(q)) - 1}, & \text{otherwise.} \end{cases}

(2)
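Equation (2) translates directly into a few lines of code. The following is a minimal Python sketch; the function name `effectiveness_e_prime` and its argument names are illustrative and are not taken from the original implementation.

```python
def effectiveness_e_prime(sig_len: int, card: int) -> float:
    """E' of equation (2) for one signature region of a query point q.

    sig_len -- F_SigLen, the signature length of the region
    card    -- F_Card, the number of points in the region (q included)
    """
    if card < 1:
        raise ValueError("a signature region of q contains at least q itself")
    if card == 1:
        # The region holds only q, so it cannot contribute any candidate.
        return 0.0
    # Longer signatures and fewer points both raise the score.
    return sig_len / (card - 1)
```

The `card - 1` in the denominator excludes q itself, so a region that contains q and exactly one other point scores its full signature length.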

Table 1.

The summary of various datasets

Dataset	Number of dimensions	Number of points	Feature description
Corel	14	300,000	Image dataset [27]
NUS-WIDE-CH	64	200,000	Colour histogram from Flickr images [28]
Audio	192	50,000	Marsyas music dataset [29]
NUS-WIDE-BoW	500	100,000	Bag of words based on SIFT [30]
Shape	544	25,000	Shape benchmark [31]
MNIST	784	60,000	Handwritten digits dataset [32]
GIST1M	960	100,000	Image dataset [33]

4.2. Finding k-NN candidates using signature selection

Given a signature pool of D and a query point q, we estimate the effectiveness of each signature region of q using the effectiveness measure E′ to select highly ranked signature regions. For simplicity, we assume that all the points except for q in the selected signature regions are included as k-NN candidates of q. Let $μ$ denote the minimum number of candidate points per query. Among the unselected signature regions of q, we repeatedly select the signature region with the highest effectiveness score until the total number of points in the selected signature regions reaches $μ$. Therefore, as $μ$ increases, the k-NN approximation accuracy of S2LSH tends to improve. However, at the same time, more distance calculations between q and candidate points are required. Since an optimal value of $μ$ is query specific and runtime calculation of $μ$ increases the overall execution time of S2LSH significantly, we suggest determining $μ$ in advance by estimating the average number of candidates required to achieve the target approximation accuracy based on a set of sample queries randomly selected from D.

Algorithm 2.

The pseudocode of findKNN-S2LSH ( $q, G, T, μ, k$ ) that finds k-NN points of $q$ using S2LSH, given a signature pool ( $G$ , $T$ ) and two parameters, $μ, k$

	Input: query point $q$ , signature pool ( $G, T)$ , $μ, k$ Output: k-NN list of q, $N_{q} [1, . . ., k]$
1	begin
2	initialize $E_{q}^{'} [1, \dots, L];$
3	initialize $N_{q} [1, \dots, k];$
4	$numCandidates \leftarrow 0$ ;
5	for $g_{j} \in G$ do
6	calculate $E_{q}^{'} [j]$ , the effectiveness score of the jth signature region of $q$ ;
7	repeat
8	$l \leftarrow getNextMostEffectiveSigRegion (E_{q}^{'})$ ;
9	$C = findCandidateSet (q, g_{l}, t_{l})$ ;
10	$numCandidates \leftarrow numCandidates + \| C \|$ ;
11	for $c_{j} \in C$ do
12	$calculateDistance (q, c_{j});$
13	Update the $k$ -NN list of $q$ , $N_{q};$
14	until $μ \leq numCandidates$

Algorithm 2 describes how S2LSH finds k-NN points of q by selecting highly effective signature regions from a given signature pool, ( $G, T)$ , generated by Algorithm 1.

(1) Initialize $numCandidates$ to 0 (line 4).

(2) Compute E′(R_sig,j(q), q), the effectiveness score of the jth signature region of q, for every $j_{(1 \leq j \leq L)}$ (lines 5–6).

(3) Choose the most effective signature region of q, say the lth signature region, among the unselected ones, and retrieve a set of k-NN candidate points of q using the corresponding bucket hash table (lines 8–10).

(4) Calculate the distance between q and each candidate point c_j and update the k-NN list if necessary (lines 11–13).

(5) If the total number of candidates of q reaches $μ$, the signature selection stage terminates. Otherwise, repeat from (3) (line 14).
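The steps of Algorithm 2 can be sketched in Python as follows. This is an illustration under several assumptions rather than the authors' implementation: each $g_j$ is modelled as a function that maps a point to a tuple with one entry per hypersphere (so the tuple length plays the role of F_SigLen), `build_tables` is a hypothetical stand-in for the signature pool generation of Algorithm 1, distances are Euclidean, ties between equally effective regions are broken by index, and a dictionary is used to avoid recomputing a distance when a candidate appears in several selected regions.

```python
import heapq
import math
from collections import defaultdict


def build_tables(points, G):
    """Hypothetical stand-in for Algorithm 1: one bucket hash table t_j per g_j."""
    T = []
    for g in G:
        t = defaultdict(list)
        for pid, p in enumerate(points):
            t[g(p)].append(pid)
        T.append(t)
    return T


def find_knn_s2lsh(q_id, points, G, T, mu, k):
    """Return the ids of the approximate k-NN points of points[q_id]."""
    q = points[q_id]
    # Lines 5-6: score every signature region of q with E' (equation (2)).
    regions = []
    for j, (g, t) in enumerate(zip(G, T)):
        sig = g(q)
        bucket = t.get(sig, [])
        card = len(bucket)
        score = 0.0 if card <= 1 else len(sig) / (card - 1)
        regions.append((-score, j, bucket))
    regions.sort()  # most effective region first, ties broken by index j

    dist_to = {}  # candidate id -> distance to q
    num_candidates = 0
    # Lines 7-14: take regions in rank order until mu candidates are reached.
    for _, _, bucket in regions:
        candidates = [c for c in bucket if c != q_id]
        num_candidates += len(candidates)
        for c in candidates:
            if c not in dist_to:
                dist_to[c] = math.dist(q, points[c])
        if num_candidates >= mu:
            break
    # The k closest candidates form the approximate k-NN list.
    return heapq.nsmallest(k, dist_to, key=dist_to.get)
```

With a small $μ$ the loop stops after the first region, so the returned list can contain fewer than k points; raising $μ$ pulls in lower-ranked regions at the cost of more distance calculations.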

4.2.1. An example of k-NN approximation using S2LSH

Assume that a signature pool of size 4 is given as in Figure 2, and we want to find the 2-NN of point E, namely D and F. The value of $μ$ is set to 3. Recall that the ith signature region of E is the region containing E in the ith signature diagram, and its signature length corresponds to the number of hyperspheres. We first calculate the effectiveness of the four signature regions of the query point E. The effectiveness score of R_sig,i(E) is as follows: E′(R_sig,1(E), E) is 0 since R_sig,1(E) contains no point other than E. E′(R_sig,2(E), E)=2/1 since D is in the region of E. Similarly, E′(R_sig,3(E), E)=3/2 and E′(R_sig,4(E), E)=2/4.

Now, we rank the four signature regions of E based on their effectiveness scores; R_sig,2(E)>R_sig,3(E)> R_sig,4(E)>R_sig,1(E). Since the most effective signature region of E is R_sig,2(E), we select D in R_sig,2(E) as a candidate. Since $μ$ =3 and D is the only one selected so far, we choose more signature regions. Next, R_sig,3(E) is chosen, and F and H are also selected. Since the set of candidates, {D, F, H}, contains $μ$ points, the signature selection stage terminates. Then, after calculating distances between E and three candidates, we find the 2-NN points of E, D and F. As depicted in Figure 4(b) and (c), E2LSH and C2LSH need to select five candidates to find the 2-NN points of E. Since the pairwise distance calculations between a given query point and candidates almost dominate the overall execution time of LSH-based k-NN methods, an approach requiring fewer candidates to achieve the target k-NN approximation accuracy is preferred.

Figure 4.

k-NN candidate selection by various approaches to find the 2-NN of a query point E (the red circle).

5. Multi-query k-NN approximation using S2LSH-M

Let us consider that a large set of queries, Q $(\subseteq D)$ , is given. We suggest S2LSH-M, a variant of S2LSH that processes multiple queries in Q more efficiently than executing S2LSH |Q| times. We define a new effectiveness measure, E″, for S2LSH-M and use it with E′ to further improve the performance of S2LSH-M. The signature pool generation of S2LSH-M is the same as that of S2LSH.

5.1. Effectiveness measure for S2LSH-M

To measure the effectiveness of a signature region more accurately, we use query-specific features of a signature region. Since calculating exact values of such features during runtime is computationally very expensive, we identify a set of query-specific features of a signature region that we can approximate efficiently using distance values computed during the k-NN approximation of other queries in Q. The following are three query-specific features we consider:

Maximum distance between q and the points in a signature region (F_MaxDist) – if the maximum distance between q and the points in R_sig,i(q), denoted by F_MaxDist(R_sig,i(q), q), is shorter than F_MaxDist(R_sig,j(q), q), it is likely that either the size of R_sig,i(q) is smaller than that of R_sig,j(q) or q is closer to the centre of R_sig,i(q) than to the centre of R_sig,j(q) if the two signature regions are of similar sizes. The distances from q to the points in a signature region R_sig,i(q) are bounded by F_MaxDist(R_sig,i(q), q), so it is certain that there is no point in R_sig,i(q) whose distance to q is greater than F_MaxDist(R_sig,i(q), q), as depicted in Figure 3(c).

Average distance between q and the points in a signature region (F_AvgDist) – if the average distance between q and the points in R_sig,i(q), denoted by F_AvgDist(R_sig,i(q), q), is shorter than F_AvgDist(R_sig,j(q), q), it is likely that either R_sig,i(q) is smaller than R_sig,j(q) or R_sig,j(q) contains more points that are far from q than R_sig,i(q) does, as illustrated in Figure 3(d). Figure 3(d) depicts a situation where, even if the maximum distances of two signature regions are the same, we can choose a more effective signature region using the average distances to q.

Minimum distance between q and the points in a signature region (F_MinDist) – if the minimum distance between q and the points in R_sig,i(q), denoted by F_MinDist(R_sig,i(q), q), is shorter than F_MinDist(R_sig,j(q), q), this means that R_sig,i(q) contains a point that is closer to q than all the points in R_sig,j(q).

Notice that F_MaxDist and F_AvgDist provide information on the distances between q and all the points in the signature region. However, F_MinDist gives information only on the distance between q and the point closest to q, thus the size of the corresponding signature region can be arbitrarily large. In fact, we observed through experiments that the correlation between F_MinDist and the actual effectiveness of a signature region is not as strong as expected. Therefore, we use only F_MaxDist and F_AvgDist to define a new effectiveness measure E″ for S2LSH-M.

Calculating the exact values of F_MaxDist and F_AvgDist of a query for each signature region in a signature pool is computationally very expensive. Recall that the features such as F_SigLen and F_Card used in E′ are not query-specific, so the values of such features are the same for all the points in a signature region, and available as soon as a signature pool is constructed. In contrast, query-specific features such as F_MaxDist and F_AvgDist can be computed after calculating all the pairwise distances from a given q to the points in each signature region. This massive number of distance calculations per query requires a significant amount of computation time because query-specific features need to be computed for every signature region in a signature pool. We can sample a subset of points in a signature region, and estimate such feature values. Although this solution is effective with a small signature pool, it can still slow down the overall execution time significantly when a large signature pool is used.

In S2LSH-M, we approximate values of query-specific features of the current query $q_{c} \in Q$ by reusing distance values between some pairs of points in R_sig,i(q_c). The opportunity is that, before processing q_c, a large number of pairwise distances have already been calculated while processing previous query points, $q_{1}, \dots, q_{c - 1} \in Q$ , to select k-NN points among candidates. Let F′_MaxDist(R_sig,i(q_c), q_c) and F′_AvgDist(R_sig,i(q_c), q_c) denote the approximate values of F_MaxDist(R_sig,i(q_c), q_c) and F_AvgDist(R_sig,i(q_c), q_c) computed by using such distance values available after the k-NN approximation of previous queries. In our implementation, we do not store all the distance values because of memory limitation. Instead, we keep only the minimum amount of values to calculate the maximum and the average distances from each point in a signature region. Then, we define a new effectiveness measure E″ for S2LSH-M as follows:

E^{''} (R_{sig, i} (q), q) = \frac{1}{F'_{MaxDist} (R_{sig, i} (q), q) + F'_{AvgDist} (R_{sig, i} (q), q)}

(3)

5.2. Finding k-NN candidates using signature selection of S2LSH-M

Let us assume that Q={q₁, q₂, …, q_N} and we want to find the k-NN list of q_c after processing (c− 1) previous queries, {q₁, q₂, …, q_c−₁}. To select good k-NN candidates of q_c, S2LSH-M compares the effectiveness of signature regions of q_c using both E′ and E″. It first compares the effectiveness score E′ of each signature region of q_c. Then, it calculates the new effectiveness score E″. The approximate values of F_MaxDist and F_AvgDist of q_c in a signature region can be computed only if distance values between q_c and some points in the signature region are available.

Let Computed(R_sig,i(q_c), q_c) be the set of points p in R_sig,i(q_c) whose distance to q_c, dist(q_c, p), has already been computed and is thus available before finding k-NN candidates of q_c. This means that p is one of the previous queries, that is, p ∈{q₁, q₂, …, q_c-₁}, and q_c was one of the k-NN candidates of p. When Computed(R_sig,i(q_c), q_c) is nonempty, we define F′_MaxDist(R_sig,i(q_c), q_c) and F′_AvgDist(R_sig,i(q_c), q_c) as max(dist(q_c, p)) and avg(dist(q_c, p)) for all p ∈ Computed(R_sig,i(q_c), q_c), respectively. S2LSH-M first calculates E′ of each signature region of q_c in a signature pool, and ranks them by E′. Then, for each signature region R_sig,i(q_c) with |Computed(R_sig,i(q_c), q_c)|>0, it computes E″, reorders them using E″, and selects highly ranked signature regions. Notice that, as |Computed(R_sig,i(q_c), q_c)| grows, F′_MaxDist(R_sig,i(q_c), q_c) and F′_AvgDist(R_sig,i(q_c), q_c) become more accurate.
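The approximate features F′_MaxDist and F′_AvgDist only need a running maximum, a running sum and a count per signature region of a query, which is one way to keep "only the minimum amount of values". The Python sketch below illustrates this bookkeeping and the resulting E″ of equation (3). The class name `RegionStats` and its methods are hypothetical, and returning infinity when every recorded distance is zero is an assumption made here to avoid a division by zero; the original text does not discuss that case.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RegionStats:
    """Running statistics over Computed(R_sig,i(q_c), q_c) for one region."""
    max_dist: float = 0.0  # F'_MaxDist so far
    total: float = 0.0     # sum of recorded distances
    count: int = 0         # |Computed(R_sig,i(q_c), q_c)|

    def update(self, d: float) -> None:
        """Record dist(q_c, p) computed while processing an earlier query p."""
        self.max_dist = max(self.max_dist, d)
        self.total += d
        self.count += 1

    def e_double_prime(self) -> Optional[float]:
        """E'' of equation (3), or None while no distance is available."""
        if self.count == 0:
            return None
        denom = self.max_dist + self.total / self.count
        return float("inf") if denom == 0.0 else 1.0 / denom
```

A region for which `e_double_prime()` returns `None` would be ranked by E′ alone.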

5.3. Optimization techniques

In order to improve the performance of S2LSH-M, we use two optimization techniques: one is for eliminating duplicate distance calculations for better efficiency, and the other is for refining an approximate k-NN graph for better approximation accuracy.

Because of memory limitation, we cannot use the duplicate elimination technique used for S2LSH. For instance, if there are 1M points in D and each distance value occupies 4 bytes, 2 TB of memory is required to store all the pairwise distance values in a |D|*|D| triangular matrix. To reduce memory consumption, recursive Lanczos bisection (RLB) [10] used a hash table to store the distance values in which a distance value from the ith point is allocated to the ith position. However, it still requires about 200 GB of memory even if only 10% of all the distances are computed. In practice, RLB fails in constructing a k-NN graph if |D| is large. For this reason, a duplicate elimination technique has not been applied to k-NN graph construction algorithms such as Zhang’s approach [18] and greedy filtering [11].
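The memory figures above follow from a short calculation, assuming decimal units (1 TB = 10^12 bytes), |D| = 10^6, 4 bytes per distance value, and no hash-table overhead:

```latex
\frac{|D|^2}{2} \times 4\ \text{bytes}
  = \frac{(10^6)^2}{2} \times 4\ \text{bytes}
  = 2 \times 10^{12}\ \text{bytes}
  = 2\ \text{TB},
\qquad
0.1 \times 2\ \text{TB} = 200\ \text{GB}.
```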

We suggest a simple but efficient technique that removes all the duplicate distance calculations while requiring only O(|D|) memory. Like S2LSH, S2LSH-M also uses bucket hashing to construct a signature pool for an efficient retrieval of points in a selected signature region. Let us assume that the points in the lth signature region of q reside in the same bucket in the corresponding bucket hashing table, t_l. For each query point q, we calculate the distance between q and each of its neighbours in the lth signature region of q only if needed. Once distance calculations are finished, we remove q from t_l, to avoid any distance calculation involving q. Also, while calculating the distances between q and each point in the signature region, we update the k-NN lists of query points that might have q as a k-NN candidate.
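The bucket-removal idea can be sketched in Python as follows. The sketch makes several simplifying assumptions that are not part of the original method: every query visits all L bucket hash tables instead of a ranked subset, `build_tables` is a hypothetical stand-in for Algorithm 1, distances are Euclidean, and the list of computed pairs is returned only so that a demo can check that no pair is evaluated twice. The per-query flag array `A` is the only extra state that scales with |D|.

```python
import heapq
import math
from collections import defaultdict


def build_tables(points, G):
    """Hypothetical stand-in for Algorithm 1: one bucket hash table t_j per g_j."""
    T = []
    for g in G:
        t = defaultdict(set)
        for pid, p in enumerate(points):
            t[g(p)].add(pid)
        T.append(t)
    return T


def push_bounded(heap, d, pid, k):
    """Keep the k smallest distances in a max-heap of (-distance, id)."""
    if len(heap) < k:
        heapq.heappush(heap, (-d, pid))
    elif -heap[0][0] > d:
        heapq.heapreplace(heap, (-d, pid))


def knn_with_duplicate_elimination(points, G, T, Q, k):
    """Process the queries in Q in order; note that this mutates the tables in T."""
    in_Q = set(Q)
    heaps = {q: [] for q in Q}
    computed_pairs = []  # kept for the demo only
    for q in Q:
        A = bytearray(len(points))  # A[c] == 1 once dist(q, c) is known for this q
        for g, t in zip(G, T):
            bucket = t[g(points[q])]
            for c in bucket:
                if c == q or A[c]:
                    continue
                d = math.dist(points[q], points[c])
                computed_pairs.append((min(q, c), max(q, c)))
                push_bounded(heaps[q], d, c, k)
                if c in in_Q:
                    # q may also be a k-NN candidate of c, so update c's list now.
                    push_bounded(heaps[c], d, q, k)
                A[c] = 1
            # q is finished with this table; later queries will not meet it here.
            bucket.discard(q)
    knn = {q: [pid for _, pid in sorted((-nd, pid) for nd, pid in heaps[q])]
           for q in Q}
    return knn, computed_pairs
```

Because the distance is pushed into both k-NN lists at the moment it is computed, removing q from the table does not lose information for later queries.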

Algorithm 3.

The pseudocode of findKNN-S2LSH-M( $Q, G, T, μ, k$ ) that finds lists of k-NN points for a given set of queries, $Q$ , using S2LSH-M, given a signature pool ( $G$ , $T$ ) and two parameters, $μ, k$

	Input: a set of query points $Q$ , signature pool $(G, T), μ, k, maxHops$ Output: a set of k-NN lists containing N_qi’s
1	begin
2	for $q_{i} \in Q$ do
3	initialize $E_{q_{i}}^{'} [1, \dots, L]$ and $E_{q_{i}}^{''} [1, \dots, L]$ ;
4	initialize $N_{q_{i}} [1, \dots, k]$ ;
5	for $q_{i} \in Q$ do
6	$numCandidates \leftarrow 0;$
7	initialize $A [1, \dots, \| D \|]$ as $FALSE$ ;
8	for $g_{j} \in G$ do
9	calculate $E_{q_{i}}^{'} [j]$ , the effectiveness score of the jth signature region of $q_{i}$ ;
10	repeat
11	$l \leftarrow getNextMostEffectiveSigRegion (E_{q_{i}}^{'}, E_{q_{i}}^{''});$
12	$C = findCandidateSet (q_{i}, g_{l}, t_{l});$
13	$numCandidates \leftarrow numCandidates + \| C \|$
14	for $c_{j} \in C$ do
15	if $A [c_{j}]$ is $FALSE$ then
16	$calculateDistance (q_{i}, c_{j});$
17	update the $k$ -NN lists of $q_{i}, N_{q_{i}};$
18	if $c_{j} \in Q$ then
19	update $E_{c_{j}}^{''} [l];$
20	update the $k$ -NN lists of $c_{j}, N_{c_{j}};$
21	$A [c_{j}] \leftarrow TRUE$ ;
22	remove $q_{i}$ from $t_{l};$
23	until $μ \leq numCandidates$
24	$performNeighborhoodPropagation (maxHops)$ ;

Our second optimization technique is for improving the accuracy of an approximate k-NN graph using multi-hop neighbourhood propagation. Existing approaches [9, 10, 16, 18] applied a widely used neighbourhood propagation technique based on two-hop neighbours. In S2LSH-M, we modify the neighbourhood propagation such that all the neighbours within maxHops hops are considered, where maxHops≥2. If maxHops is 3, for each point q, we calculate the similarities between the k-NN list of q and the k-NN lists of its two- and three-hop neighbours, and update the k-NN list of q accordingly. Through experiments, we observed that, even when we include three-hop neighbours, only a relatively small number of additional distance calculations is required, while a more accurate k-NN graph is generated. Let us define the scan rate of an algorithm as the number of distance calculations of the algorithm divided by the number of distance calculations performed by the brute-force approach. Then, for instance, when the number of points is 100,000 and we are refining an approximate 10-NN graph (k=10), the scan rate of S2LSH-M increases by at most 0.022 including duplicate calculations. In our experiments, we set the maxHops parameter to 3.
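Multi-hop neighbourhood propagation can be illustrated with the following Python sketch. It is a simplified reading of the technique, not the authors' code: the function name `propagate` is hypothetical, distances are Euclidean, the input graph is treated as a read-only snapshot, and distances to existing one-hop neighbours are recomputed for brevity instead of being reused.

```python
import heapq
import math


def propagate(points, knn, k, max_hops=3):
    """Refine an approximate k-NN graph using neighbours within max_hops hops.

    knn maps a point id to the list of its current approximate neighbours.
    """
    refined = {}
    for q, neighbours in knn.items():
        frontier = set(neighbours)       # points exactly one hop away
        reached = set(neighbours) | {q}  # everything seen so far
        for _ in range(max_hops - 1):
            nxt = set()
            for p in frontier:
                nxt.update(knn.get(p, ()))
            frontier = nxt - reached     # points first reached at this hop
            reached |= frontier
        reached.discard(q)
        # Keep the k closest points among all reachable ones.
        refined[q] = heapq.nsmallest(
            k, reached, key=lambda c: math.dist(points[q], points[c])
        )
    return refined
```

On a toy graph, a third hop can recover a true neighbour that two hops miss, which is the effect the maxHops parameter is meant to capture.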

Algorithm 3 is the pseudocode of the k-NN approximation performed by S2LSH-M. Given a set of queries Q and data points D, S2LSH-M first constructs a signature pool of D as described in Algorithm 1, and then finds k-NN lists of Q as in Algorithm 3. The inputs are the following: (1) Q ( $Q \subseteq D$ ); (2) $G$ and $T$ (a signature pool of D); (3) $μ$ (the minimum number of k-NN candidates per query); (4) k; and (5) maxHops (the maximum number of hops used in the neighbourhood propagation). Algorithm 3 also explains how S2LSH-M applies both of the above optimization techniques to improve the performance of S2LSH-M.

(1) Initialize $numCandidates$ to 0 and prepare a false-initialized array A of size |D| (lines 6–7).

(2) Compute E′(R_sig,j(q_i), q_i), the effectiveness score of the jth signature region of q_i, for every $j_{(1 \leq j \leq L)}$ (lines 8–9).

(3) Choose the most effective signature region of q_i, say the lth signature region, among the unselected ones using E′ and E″, and retrieve a set of k-NN candidate points of q_i using the corresponding bucket hash table (lines 11–13).

(4) Calculate the distance between q_i and each candidate point c_j, dist(q_i, c_j), if A[c_j]=FALSE, and then set A[c_j] to TRUE (lines 15, 16, 21).

(5) Update the k-NN list of q_i, $N_{q_{i}}$ , if dist(q_i, c_j) is shorter than dist(q_i, $N_{q_{i}} [k]$ ) (line 17).

(6) If c_j is a query in Q, update E″(R_sig,l(c_j), c_j) if needed. Also, update the k-NN list of c_j, $N_{c_{j}}$ , if dist(q_i, c_j) is shorter than dist(c_j, $N_{c_{j}}$ [k]) (lines 18–20).

(7) Remove q_i from the lth bucket hash table, t_l (line 22).

(8) If the total number of candidates of q_i reaches $μ$, the signature selection for q_i terminates and the algorithm continues from (1) with a new query in Q. Otherwise, repeat from (3) (line 23).

(9) Perform the neighbourhood propagation technique using the specified $maxHops$ to refine k-NN lists (line 24).

6. Experiments

6.1. Experimental setup

6.1.1. Datasets

In order to show the performance superiority of our approach, we use various types of datasets. Table 1 shows the number of dimensions, the number of points and the description of features of the datasets used for our experiments. The datasets are constructed from various sources such as images, music and texts, using various feature extraction methods. For detailed comparison of algorithms, we mainly use two datasets, NUS-WIDE-CH and NUS-WIDE-BoW.

6.1.2. Three types of k-NN computation tasks

We compare the performance of our approach with the state-of-the-art k-NN computation approaches by executing three different types of k-NN computation tasks: (1) k-NN search (k-NNS); (2) k-NN graph construction (k-NNG); and (3) partial k-NN graph construction (Pk-NNG). We perform k-NN search task to compare S2LSH with the state-of-the-art approximate k-NN search approaches. Then, we construct a k-NN graph or a partial k-NN graph to compare the performance of S2LSH-M with existing approximate k-NN search approaches and k-NN graph construction methods.

6.1.3. Performance evaluation measure

Since there is a tradeoff between the k-NN approximation accuracy and the execution time, we compare the performance of our approaches and the baseline algorithms by measuring the execution time required to generate k-NN results of the target k-NN approximation accuracy. As we choose more candidate points to improve the approximation accuracy, the overall execution time grows since the number of pairwise distance calculations that need to be performed to select k-NN points among the candidates also increases. We performed the experiments on one core of an Intel Xeon CPU X5472 @ 3.00GHz with 16 GB RAM, running Linux Ubuntu 12.02.

For our experiments, we set the target k-NN approximation accuracy to 90%. Application-based evaluations using applications such as agglomerative clustering, dimensionality reduction and recommendation [10, 34] showed that the k-NN approximation accuracy that leads to the best application-specific quality is heavily dependent on various factors such as the characteristics of a dataset, the type of a quality measure and the type of an application. Therefore, we decided to fix our target k-NN accuracy to 90%, assuming that the performance of applications such as search engines and recommendation systems does not degrade significantly when exact k-NN lists are replaced with 90% accurate ones.

The k-NN search time reported in Figures 5 and 6 is the average execution time of 1000 randomly selected queries. For the k-NN graph construction task and the partial k-NN graph construction task, we measure the total elapsed time to process all the query points in Q. For k-NN graph construction, Q is set to D. For the partial k-NN graph construction, we use a set of randomly selected points from D as Q, and |Q| is set to 20% of |D|. k is set to 10. The preprocessing time is measured separately since the signature pool generation is executed just once for all queries. Also, the preprocessing times of S2LSH and S2LSH-M are the same, since they use the same signature pool. The preprocessing time of S2LSH is not significantly longer than those of existing algorithms. For example, S2LSH spends 310 s constructing a signature pool for 50K data points of NUS-WIDE-BoW while it takes 302 s for E2LSH+ to preprocess the same dataset. The k-NN approximation accuracy is defined as the average precision of k-NN lists.
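One plausible reading of "average precision of k-NN lists" is the fraction of true k-NN points recovered for each query, averaged over all queries. The Python sketch below implements that reading; the function name `knn_accuracy` and the dictionary-based input format are assumptions made for illustration.

```python
def knn_accuracy(approx, exact):
    """Mean over queries of |approx list ∩ exact list| / |exact list|.

    approx, exact -- dicts mapping a query id to a list of neighbour ids.
    """
    if not exact:
        raise ValueError("at least one query is required")
    total = 0.0
    for q, true_list in exact.items():
        hits = len(set(approx.get(q, ())) & set(true_list))
        total += hits / len(true_list)
    return total / len(exact)
```

Under this definition, a target accuracy of 90% with k=10 means that, on average, nine of the ten returned points are true nearest neighbours.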

Figure 5.

The average k-NN search time and the size of D (spherical hashing is used).

Figure 6.

The average k-NN search time and the size of D (random hyperplanes are used).

6.1.4. Baseline algorithms

There are two main reasons why our approaches, S2LSH and S2LSH-M, show superior performance: one is the quality of candidates in a query-specific signature region that is dynamically generated using a signature pool, and the other is the optimization techniques we applied to our approaches such as bucket hashing, neighbourhood propagation, eliminating duplicate distance calculations and incremental k-NN updates. In order to precisely analyse the impact of the query-specific signature region of S2LSH and S2LSH-M, we applied the same set of optimization techniques to the baselines if applicable. C2LSH+ and E2LSH+ are the implementations of C2LSH and E2LSH with these optimization techniques. In this paper, we do not compare S2LSH with algorithms such as LSB-tree and LSB-forest, since the performance of C2LSH is better [2]. In Section 6.2.1, we compare S2LSH, the brute-force approach, C2LSH+ and E2LSH+, by measuring the performance of k-NN search. A larger performance gap between our approaches and the baselines can be observed if the aforementioned optimization techniques are not applied to the baselines.

To evaluate the performance of S2LSH-M more rigorously, we add two more algorithms to our baselines, RLB and NN-Descent (NND). Both are state-of-the-art k-NN graph construction algorithms, but they differ in that RLB is hyperplane-based and NND is heuristic-based. Since NND terminates even if the target approximation accuracy is not reached, it sometimes shows very poor approximation accuracy when its heuristics fail on some types of datasets such as the NUS-WIDE-BoW dataset. Therefore, we enhanced NND to NND+ by applying the optimization techniques we applied to S2LSH-M and, more importantly, executing NND iteratively until the minimum approximation accuracy t is reached. However, since NND+ can be executed only if the exact k-NN lists are known in advance, NND+ is not feasible in a real-world scenario. We do not include clustering-based k-NN graph construction algorithms as baselines since their performance is worse than that of NND, or changes significantly depending on input parameters. In Sections 6.2.2 and 6.2.3, we compare S2LSH-M with baselines by measuring the performance of k-NN graph construction and partial k-NN graph construction.

6.2. Performance comparison using three k-NN computation tasks

6.2.1. k-NN search

Figures 5 and 6 show the impact of the cardinality of D (10K–50K), source datasets (NUS-WIDE-CH and NUS-WIDE-BoW) and the LSH scheme used to generate underlying binary LSH functions (spherical hashing and random hyperplanes) on the performance of k-NN search. Three approximate k-NN search algorithms, S2LSH, E2LSH+ and C2LSH+, and a brute-force approach are compared.

In Figure 5, we compare the execution time of k-NN search algorithms, assuming spherical hashing is used. First, notice that S2LSH consistently outperforms the other k-NN search algorithms. Also, the performance superiority of S2LSH is observed in both datasets, NUS-WIDE-CH and NUS-WIDE-BoW, although these datasets are significantly different in terms of the number of dimensions and feature extraction method. In contrast, while E2LSH+ outperforms C2LSH+ in NUS-WIDE-CH, they show a similar performance in NUS-WIDE-BoW. In Figure 6, we perform the same set of experiments after replacing spherical hashing with random hyperplanes. Even though underlying binary LSH functions are generated using a completely different LSH scheme, S2LSH still performs the best on both datasets. C2LSH+ is better than E2LSH+ in NUS-WIDE-CH, which is the opposite of the result obtained with spherical hashing in Figure 5. Also, E2LSH+ performs much better than C2LSH+ in NUS-WIDE-BoW, while they show a similar performance on the same dataset when using spherical hashing in Figure 5.

In summary, even though we change the cardinality of D, source datasets and the LSH scheme used in generating underlying binary LSH functions, S2LSH consistently shows a superior performance, while the performance of other state-of-the-art k-NN search algorithms is highly sensitive to such factors.

6.2.2. k-NN graph construction

Figure 7 shows the performance of k-NN graph construction in two different datasets (NUS-WIDE-CH and NUS-WIDE-BoW). The k-NN graph construction time is the total elapsed time of executing k-NN search for each query in D. The algorithms compared are S2LSH-M, two approximate k-NN search algorithms (E2LSH+ and C2LSH+), two state-of-the-art approximate k-NN graph construction algorithms (RLB and NND+), and the brute-force approach.

Figure 7.

The average k-NN graph construction time and the size of D (spherical hashing is used).

In Figure 7, we measured the k-NN graph construction time of six k-NN computation approaches, assuming that spherical hashing generates the underlying binary LSH functions. First, the k-NN graph construction time of S2LSH-M is much less than that of approximate k-NN search algorithms, E2LSH+ and C2LSH+, which is consistent with the comparison results in Section 6.2.1. In fact, the performance gap between S2LSH-M and other k-NN search algorithms is more evident than in Figure 5, which indicates that the performance of S2LSH-M is improved by using E″ and applying optimization techniques. Second, let us compare S2LSH-M with the state-of-the-art k-NN graph construction algorithms, RLB and NND+. S2LSH-M is comparable with or slightly faster than them even though they are specifically designed for k-NN graph construction, while S2LSH-M is a variant of S2LSH, a k-NN search algorithm. Lastly, we can see that RLB can be extremely inefficient in some datasets such as NUS-WIDE-BoW, sometimes slower than the brute-force exact k-NN approach.

6.2.3. Partial k-NN graph construction

Figure 8 shows the performance of partial k-NN graph construction in two different datasets (NUS-WIDE-CH and NUS-WIDE-BoW). The partial k-NN graph construction time is the total elapsed time of executing k-NN search for every query in Q. The points in Q used for the partial k-NN graph construction in Figure 8 are randomly selected from the specified source dataset, and the size of Q is set to 20% of |D|. Partial k-NN graph construction is useful for incremental update or distributed construction of a large k-NN graph. We compare S2LSH-M with two approximate k-NN search algorithms (E2LSH+ and C2LSH+), two state-of-the-art approximate k-NN graph construction algorithms (RLB and NND+), and the brute-force approach.

Figure 8.

The partial k-NN graph construction time and the size of D. |Q| is 20% of |D| (spherical hashing is used).

In Figure 8, as |D| increases, S2LSH-M outperforms other approaches including approximate k-NN search algorithms (E2LSH+ and C2LSH+) and k-NN graph construction algorithms (NND+ and RLB). Notice that NND+ performs better than E2LSH+ and C2LSH+ in the NUS-WIDE-CH dataset, while it performs the worst in the NUS-WIDE-BoW dataset. This is because the k-NN graph construction algorithms such as RLB and NND+ can be executed only when Q=D. Partial k-NN graph construction is conceptually considered as a task located between k-NN search (|Q|=1) and k-NN graph construction (|Q|=|D|). Therefore, k-NN graph construction algorithms perform worse than k-NN search algorithms when |Q| is small, but their performance improves as |Q| approaches |D|. In the following section, we analyse the impact of the size of Q on the performance of S2LSH-M and the baseline approaches.

6.2.4. Impact of the number of queries in Q

We examine the impact of the size of Q on the average k-NN computation time per query. The set of data points, D, contains 50K data points that are randomly selected from NUS-WIDE-CH or NUS-WIDE-BoW. The query set, Q, contains points randomly selected from D, and the number of queries in Q, |Q|, changes from 10K to 50K, that is, 20% of |D| to 100% of |D|. The average k-NN computation time of C2LSH+ in NUS-WIDE-CH is not included in Figure 9, since C2LSH+ is extremely slow compared with the other three approaches.

Figure 9.

The k-NN computation time per query and the ratio of |Q| to |D|. |D| is 50K (spherical hashing is used).

In Figure 9, we first see that S2LSH-M outperforms the baselines, given Q of any size. The performance of NND+ becomes similar to that of S2LSH-M as |Q|/|D| approaches 100%. However, NND+ is less efficient than E2LSH+ or C2LSH+ when |Q| is small, such as 20% of |D|. Second, the performance of approximate k-NN search algorithms, E2LSH+ and C2LSH+, does not change significantly even though |Q|/|D| changes from 20 to 100%. Recall that they can find k-NN points of each query efficiently, but even when Q contains many query points, they perform k-NN search separately for each query in Q without improving the quality of k-NN candidates by reusing distance values computed to find k-NN points of other queries. Notice that the average k-NN computation time of S2LSH-M monotonically decreases as the size of Q grows, because S2LSH-M further improves the quality of k-NN candidates using such distance values and thus requires fewer candidates to achieve the target k-NN approximation accuracy.

6.3. Performance comparison using various datasets

Table 2 presents the average execution time of k-NN computation algorithms when they perform three types of k-NN computation tasks in various datasets. The k-NN computation algorithms are a brute-force exact k-NN search algorithm (Brute-force), two state-of-the-art approximate k-NN search algorithms (E2LSH+ and C2LSH+), a k-NN graph construction algorithm (NND+) and our approaches (S2LSH and S2LSH-M).

Table 2.

The overall performance comparison using various datasets. Each cell contains the average execution time followed by the corresponding k-NN approximation accuracy in the parentheses. The shortest execution time is in bold.

Dataset	k-NN search				k-NN graph construction			Partial k-NN graph construction
	Brute-force	E2LSH+	C2LSH+	S2LSH	Brute-force	NND+	S2LSH-M	Brute-force	NND+	S2LSH-M
Corel	74ms (1.00)	6ms (0.96)	66ms (0.85)	2ms (0.97)	14215s (1.00)	151s (0.94)	199s (0.91)	5066s (1.00)	145s (0.94)	136s (0.94)
NUS-WIDE-CH	120ms (1.00)	9ms (0.91)	47ms (0.82)	4ms (0.92)	14116s (1.00)	314s (0.95)	204s (0.95)	5007s (1.00)	315s (0.95)	94s (0.90)
Audio	75ms (1.00)	9ms (0.89)	16ms (0.87)	8ms (0.92)	2044s (1.00)	175s (0.93)	82s (0.94)	734s (1.00)	116s (0.90)	54s (0.95)
NUS-WIDE-BoW	384ms (1.00)	169ms (0.88)	187ms (0.90)	103ms (0.91)	19774s (1.00)	7185s (0.90)	6262s (0.92)	7017s (1.00)	6844s (0.90)	2006s (0.90)
Shape	97ms (1.00)	11ms (0.90)	11ms (0.90)	10ms (0.93)	1314s (1.00)	51s (0.93)	48s (0.94)	475s (1.00)	51s (0.93)	40s (0.96)
MNIST	346ms (1.00)	61ms (0.92)	32ms (0.81)	21ms (0.94)	10993s (1.00)	202s (0.91)	189s (0.94)	3878s (1.00)	198s (0.94)	67s (0.97)
GIST1M	726ms (1.00)	162ms (0.92)	134ms (0.82)	75ms (0.92)	36610s (1.00)	3830s (0.91)	3411s (0.91)	13068s (1.00)	3805s (0.91)	1862s (0.92)

First, when performing approximate k-NN search, S2LSH is clearly superior in every dataset: it has the shortest k-NN search time, and its k-NN lists are at least as accurate as those of E2LSH+ and C2LSH+ (E2LSH+ matches its accuracy only on GIST1M). Next, we compare the performance of k-NN graph construction. S2LSH-M outperforms NND+ in every dataset except Corel, since it constructs a k-NN graph that is at least as accurate in less time; on Corel, NND+ is both faster and more accurate (151 vs 199 s, 0.94 vs 0.91). When performing partial k-NN graph construction with Q such that |Q|/|D| = 20%, S2LSH-M performs better than NND+ in every dataset except NUS-WIDE-CH. In the NUS-WIDE-CH dataset, the performance of NND+ and S2LSH-M is not directly comparable, since NND+ constructs a more accurate partial k-NN graph (0.95 vs 0.90) but requires a much longer execution time (315 vs 94 s).
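The notion of "not directly comparable" used above can be made explicit as a dominance check over (execution time, accuracy) pairs. The following Python sketch applies it to the partial k-NN graph construction columns of Table 2. The helper names `dominates` and `compare` are illustrative assumptions and are not part of the evaluation protocol of this paper.

```python
def dominates(a, b):
    """True if result a = (time, accuracy) is no slower and no less accurate
    than result b, and strictly better in at least one of the two."""
    time_a, acc_a = a
    time_b, acc_b = b
    return time_a <= time_b and acc_a >= acc_b and (time_a < time_b or acc_a > acc_b)


# (execution time in seconds, accuracy) for partial k-NN graph construction,
# copied from Table 2.
partial = {
    "Corel":        {"NND+": (145, 0.94),  "S2LSH-M": (136, 0.94)},
    "NUS-WIDE-CH":  {"NND+": (315, 0.95),  "S2LSH-M": (94, 0.90)},
    "Audio":        {"NND+": (116, 0.90),  "S2LSH-M": (54, 0.95)},
    "NUS-WIDE-BoW": {"NND+": (6844, 0.90), "S2LSH-M": (2006, 0.90)},
    "Shape":        {"NND+": (51, 0.93),   "S2LSH-M": (40, 0.96)},
    "MNIST":        {"NND+": (198, 0.94),  "S2LSH-M": (67, 0.97)},
    "GIST1M":       {"NND+": (3805, 0.91), "S2LSH-M": (1862, 0.92)},
}


def compare(results):
    """Return, per dataset, the dominating algorithm or 'incomparable'."""
    verdict = {}
    for dataset, r in results.items():
        if dominates(r["S2LSH-M"], r["NND+"]):
            verdict[dataset] = "S2LSH-M"
        elif dominates(r["NND+"], r["S2LSH-M"]):
            verdict[dataset] = "NND+"
        else:
            verdict[dataset] = "incomparable"
    return verdict


verdict = compare(partial)
for dataset, winner in verdict.items():
    print(f"{dataset}: {winner}")
```

Under this criterion S2LSH-M dominates NND+ on six of the seven datasets, and NUS-WIDE-CH is the only incomparable case, because there each algorithm is better on exactly one of the two measures.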

7. Conclusions

We presented a novel LSH approach, S2LSH, which efficiently performs an approximate k-NN search on various types of datasets. It constructs a large pool of highly diversified signature regions of various sizes and shapes, and selects a subset of highly effective signature regions to generate a query-specific signature region. Then, we proposed S2LSH-M, a variant of S2LSH, which constructs a k-NN graph more efficiently than multiple executions of S2LSH. It exploits query-specific features and uses additional optimization techniques. Extensive experiments show that our approaches, S2LSH and S2LSH-M, consistently outperform the state-of-the-art k-NN computation algorithms in diverse settings.

As future work, we plan to perform a theoretical analysis of the joint relationship between the features of a signature region and its effectiveness for efficient k-NN computation. We also intend to propose a method that dynamically determines important runtime parameters, such as the number of k-NN candidates, for a given query by considering pre-specified information such as k, the target approximation accuracy, the execution time budget and the distance measure with the corresponding LSH scheme. Lastly, we want to develop a parameter-free, self-organizing k-NN approximation framework that suggests good combinations of parameter values by analysing the characteristics of datasets.

Acknowledgements

This paper is an extended version of the work of Park et al. [].

Funding

This work was supported by the 2013 Research Fund of the University of Seoul.

References

1. Indyk, Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In: Proceedings of the ACM annual symposium on theory of computing, 1998, pp. 604–613.

2. Gan, Feng, Fang. Locality-sensitive hashing scheme based on dynamic collision counting. In: Proceedings of the ACM SIGMOD international conference on management of data, 2012, pp. 541–552.

3. Tao, Sheng. Quality and efficiency in high dimensional nearest neighbor search. In: Proceedings of the ACM SIGMOD international conference on management of data, 2009, pp. 563–576.

4. Datar, Immorlica, Indyk. Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of the annual symposium on computational geometry, 2004, pp. 253–262.

5. Gionis, Indyk, Motwani. Similarity search in high dimensions via hashing. In: Proceedings of the international conference on very large data bases, 1999, pp. 518–529.

6. Charikar. Similarity estimation techniques from rounding algorithms. In: Proceedings of the annual ACM symposium on theory of computing, 2002, pp. 380–388.

7. Broder. On the resemblance and containment of documents. In: Proceedings of the compression and complexity of sequences, 1997, pp. 21–29.

8. Heo, Lee. Spherical hashing: Binary code embedding with hyperspheres. IEEE Transactions on Pattern Analysis and Machine Intelligence 2015; 37(11): 2304–2316.

9. Dong, Charikar. Efficient k-nearest neighbor graph construction for generic similarity measures. In: Proceedings of the international conference on World Wide Web, 2011, pp. 577–586.

10. Chen, Fang, Saad. Fast approximate k-NN graph construction for high dimensional data via recursive Lanczos bisection. Journal of Machine Learning Research 2009; 10: 1989–2012.

11. Park, Hwang, Lee. A novel algorithm for scalable k-nearest neighbour graph construction. Journal of Information Science 2016; 42(2): 274–288.

12. Davidson, Liebald, Liu. The YouTube video recommendation system. In: Proceedings of the ACM conference on recommender systems, 2010, pp. 293–296.

13. Kim. A group recommendation system for online communities. International Journal of Information Management 2010; 30(3): 212–219.

14. Franti, Virmajoki, Hautamaki. Fast agglomerative clustering using a k-nearest neighbor graph. IEEE Transactions on Pattern Analysis and Machine Intelligence 2006; 28(11): 1875–1881.

15. Bobadilla, Ortega, Hernando. Recommender systems survey. Knowledge-Based Systems 2013; 46: 109–132.

16. Wang, Zeng. Scalable k-NN graph construction for visual descriptors. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2012, pp. 1106–1113.

17. Chaudhuri, Dasgupta, Kpotufe. Consistent procedures for cluster tree estimation and pruning. IEEE Transactions on Information Theory 2014; 60(12): 7900–7912.

18. Zhang, Huang, Geng. Fast kNN graph construction with locality sensitive hashing. Machine Learning and Knowledge Discovery in Databases 2013; 1: 660–674.

19. Altingovde, Subakan ÖN, Ulusoy. Cluster searching strategies for collaborative recommendation systems. Journal of Information Processing and Management 2013; 49(3): 688–697.

20. Rafeh, Bahrehmand. An adaptive approach to dealing with unstable behaviour of users in collaborative filtering systems. Journal of Information Science 2012; 38(3): 205–221.

21. Hashemi, Hamzeh. SPCF: A stepwise partitioning for collaborative filtering to alleviate sparsity problems. Journal of Information Science 2012; 38(6): 578–592.

22. Onan. Classifier and feature set ensembles for web page classification. Journal of Information Science 2016; 42(2): 150–165.

23. Pong, Kwok, Lau. A comparative study of two automatic document classification methods in a library setting. Journal of Information Science 2008; 34(2): 213–230.

24. McCallum, Nigam, Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In: Proceedings of the ACM international conference on knowledge discovery and data mining, 2000, pp. 169–178.

25. Weiss, Torralba, Fergus. Spectral hashing. Advances in Neural Information Processing Systems 2009; 1: 1753–1760.

26. Liu, Wang, Kumar. Hashing with graphs. In: Proceedings of the international conference on machine learning, 2011, pp. 1–8.

27. Charikar. Image similarity search with compact data structures. In: Proceedings of the ACM international conference on information and knowledge management, 2004, pp. 208–217.

28. Shapiro, Stockman. Computer Vision. Prentice Hall: Englewood Cliffs, NJ, 2001.

29. Tzanetakis, Cook. Marsyas: A framework for audio analysis. Journal of Organised Sound 2000; 4(3): 169–175.

30. Lowe. Object recognition from local scale-invariant features. In: Proceedings of the IEEE international conference on computer vision, 1999, pp. 1150–1157.

31. Kazhdan, Funkhouser, Rusinkiewicz. Rotation invariant spherical harmonic representation of 3D shape descriptors. In: Proceedings of the symposium on geometry processing, 2003, pp. 156–165.

32. LeCun, Bottou, Bengio. Gradient-based learning applied to document recognition. Proceedings of the IEEE 1998; 86(11): 2278–2324.

33. Jegou, Douze, Schmid. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence 2011; 33: 117–128.

34. Park, Jung. Reversed CF: A fast collaborative filtering algorithm using a k-nearest neighbor graph. Expert Systems with Applications 2015; 42: 4022–4028.

35. Park, Hwang, Lee S. A fast k-nearest neighbor search using query-specific signature selection. In: Proceedings of the ACM international conference on information and knowledge management, 2015, pp. 1883–1886.