Subspace k-anonymity algorithm for location-privacy preservation based on locality-sensitive hashing

Abstract

Existing location-privacy-preserving methods primarily focus on solving the problem of location-privacy preservation in the global space. This not only increases the response time of the location service, it also degrades the data quality. In this paper, a k-anonymity algorithm based on locality-sensitive hashing is proposed to solve the problem of location-privacy preservation in the subspace. In the proposed algorithm, higher efficiency and higher quality of service are achieved by applying a bottom-up grid-search method. Further, reasonable division is obtained based on locality-sensitive hashing by retaining position characteristics. The results of experiments conducted to evaluate the proposed algorithm indicate that the proposed algorithm provides a smaller anonymous spatial region, higher data quality, and lower time cost than methods with no subspace.

Keywords

Location-privacy preservation, locality-sensitive hashing, bottom-up grid search, subspace, k-anonymity

1. Introduction

Location-based services (LBSs) [1, 2, 3, 4], such as finding the nearest convenience services, coupon-release services, and real-time location-information services for locating friends and family, have grown rapidly in mobile social network applications in recent years [5]. To obtain their desired location-based services, mobile users must send their exact location information to an LBS provider [6].

However, location-privacy leaks can occur when the servers of these LBS providers are compromised. The location information embedded in an LBS query can be collected by an adversary connected to the LBS server. Sensitive information about service recipients, such as home locations, lifestyles, health conditions, and political and religious associations, can be inferred from the information on the server [7]. Because the use of LBSs can thus create serious privacy threats, users need to consider the risk of privacy disclosure. Further, as LBS applications have become more widespread, location-privacy protection has become increasingly critical and a primary concern for information security.

An effective LBS privacy protection method is k-anonymity [8, 9, 10, 11], which has gained considerable attention in recent years. It requires at least $k$ users to participate in an anonymity set so that no user in the set can be distinguished from the other k–1 users [12]. It obscures the user’s precise coordinates and replaces them with a well-shaped cloaked region [6]. In the past decade, new algorithms based on efficient data structures have been proposed to address location-privacy preservation using k-anonymity algorithms [13]. They include the spatial-temporal k-anonymity [14], non-exposure accurate location k-anonymity [6], and location k-anonymity in indoor spaces [15] algorithms.

Existing location-privacy-protection methods primarily solve the privacy problem in the global view only, which makes complex, large-scale instances difficult to solve. To rectify this issue, this paper proposes a k-anonymity algorithm that operates on a subspace, which is built by a bottom-up grid-cloaking method that reduces the problem complexity.

The main contributions of this work are summarized as follows:

1. Instead of solving the problem in the global space, we propose a subspace concept in which anonymous areas are created from the areas adjacent to the query point. The proposed method more closely approximates a human’s approach to solving the problem in the real world. Furthermore, we provide a common solution for k-anonymous privacy protection research.
2. The subspace is created using a bottom-up grid-cloaking method to narrow the search range and reduce the time complexity. Moreover, it can improve the service quality and reduce the response time.
3. Because the locality-sensitive hash function has a position-sensitive characteristic, points hashed with locality-sensitive hashing (LSH) keep their spatial proximity with high probability. Therefore, the proposed algorithm makes the partition reasonable and improves the data quality.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 presents the background to this study and introduces related studies, along with related concepts and problems. Section 4 describes and analyzes the proposed LSH-based subspace k-anonymity privacy protection algorithm. Section 5 discusses the experiments conducted to evaluate the proposed method and the results obtained. Section 6 concludes this paper and outlines future research.
2. Related work

K-anonymity algorithms are widely used in privacy preservation of location data mining [16] and ubiquitous vehicular ad hoc networks (VANETs) [17], as well as preventing neighborhood attacks in social networks [18], anonymizing data and protecting sensitive labels [19], anonymizing collections of tree-structured data [20], and anonymizing transactions with sensitive items [21]. In addition, they are used for privacy preservation in high-dimensional data sharing [22], supervisory control, data acquisition publishing [23], transaction data anonymization [24], and publication of sensitive transactional data [25].

Furthermore, LSH has been proposed as an efficient technique for similarity joining for high-dimensional and large-scale data [26]. It is widely used in various applications [27]. Personalized locality sensitive hashing (PLSH), for example, has been implemented in a parallel framework to address similarity joins on large-scale data in the approximate nearest-neighbor problem [26]. Tag assignment stream clustering (TASC), an incremental scalable community detection method, has been proposed based on locality-sensitive hashing for social tagging systems [28]. In addition, randomness-based locality-sensitive hashing (RLSH), based on p-stable LSH, has been introduced for the approximate nearest-neighbor problem [27]. Moreover, use of the LSH-based adaptive mean-shift algorithm with bandwidth estimation (LSH-AMS-BE) in an approximate-neighborhood query method has been proposed for computation of high-dimensional data. In this approach, LSH is used to reduce the computational complexity of the adaptive mean-shift algorithm [29].

Interpreting LSH as a means for data clustering can solve the difficult problem of clustering transactional data with very high dimensions [30]. The index-aware nearest-neighbor search method based on LSH can be used to handle large compound databases with several million entries for accelerated similarity searching and clustering of very large compound sets [31]. Meanwhile, the single-linkage method reduces the time complexity by rapidly finding nearby clusters to be connected using LSH to search for the nearest approximate neighbor [32]. MinHash and SimHash use two respective LSH strategies to group high-dimensional data for the design of intelligent systems. This approach has been incorporated in numerous real-world applications [33].

The LSH function can maintain a proximate position in the mapping process [34, 35, 36, 37]. It has therefore been used in the spatial k-anonymity algorithm. Datar et al. [38] proved that the LSH function has a position-sensitive characteristic. Based on hashing, the space position can maintain similarity with high probability, which can enable reasonable space partitioning. Vu et al. [39] designed a spatial k-anonymity privacy protection scheme with an LSH mechanism that exhibited good performance and moderate computational complexity.

Many LSH-based algorithms markedly reduce computational complexity. However, these algorithms are performed directly in the global data space. In this scenario, they have to search all location points for the optimal value, which requires considerable service time before results are returned. In general, the k–1 points needed for k-anonymity are not far from the query point in the real world, so the search need only be confined to the regions adjacent to the query point.

3. Related concepts and problems

3.1 LBS privacy protection model

In this section, we review LBS selection techniques and consider several recently proposed LBS privacy protection methods. A typical LBS system architecture is shown in Fig. 1. The LBS system is mainly composed of four parts: LBS servers, mobile terminal users, positioning system, and communication network. Most basic daily-use information, such as traffic maps, vehicle tracking, and entertainment information, is related to the user’s location. Services can locate the user through the positioning system, and appropriate information and services can be provided through the communication network.

Figure 1.

LBS system architecture.

In this paper, a centralized architecture is adopted that comprises three parts: user, anonymizer, and LBS server. The LBS privacy-preserving query system is shown in Fig. 2. In the model, a trusted third-party anonymous server is established between the user and the location service provider. Firstly, the query information is sent to the anonymous server from the user through the network. The anonymous server then sends the user’s location together with other k–1 locations to the LBS server. Next, the LBS server returns all results to the anonymous server. Finally, the anonymous server returns the relevant information to the user. In this way, the service provider cannot know the user’s information and confidentiality is maintained.

Figure 2.

LBS privacy-preserving query system.

In this model, the identification (ID) of the user who submitted the query is hidden by the anonymous server. The anonymizing spatial region (ASR) is constructed using at least k different users. ASR can be surrounded by a circle, rectangle, etc. The minimum bounding rectangle (MBR) encapsulation method is adopted in this paper.

3.2 Locality-sensitive hashing

LSH is one of the most effective search techniques for high-dimensional data with probabilistic guarantees. Its key concept is to hash data points into one or more hash tables after random projection, and then search within the hash table buckets [30]. LSH projects each point in the space onto a vector L, so that distances in the space are reflected in distances along L. After hashing, points that are near each other in the space remain near each other. However, ensuring that distant points remain far apart is difficult. The following example illustrates the problem.

Figure 3.

Example of LSH partition.

As shown in Fig. 3, suppose that points C and D are very far apart, while their projections C1 and D1 on L1 are near each other. It is then necessary to project onto multiple vectors to form a hash cluster. Points that are originally near each other remain very close in the projection on every vector. A point that is far from the others may be close to them in the projection on one vector; however, on most of the vectors its projection remains far from theirs.
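The effect described for Fig. 3 can be reproduced with a minimal sketch. The coordinates of C and D and the two directions L1 and L2 below are hypothetical values chosen for illustration; they are not taken from the figure.

```python
import math

def project(point, direction):
    """Scalar projection of a 2-D point onto a direction vector."""
    norm = math.hypot(direction[0], direction[1])
    return (point[0] * direction[0] + point[1] * direction[1]) / norm

# Hypothetical coordinates: C and D are far apart in the plane.
C = (0.0, 0.0)
D = (0.0, 10.0)

L1 = (1.0, 0.0)  # on this direction C and D collapse onto the same value
L2 = (0.0, 1.0)  # on this direction their separation is preserved

gap_on_L1 = abs(project(C, L1) - project(D, L1))
gap_on_L2 = abs(project(C, L2) - project(D, L2))
true_distance = math.dist(C, D)

print(gap_on_L1, gap_on_L2, true_distance)
```

A projection onto a unit vector can shrink a distance but never enlarge it, which is why nearby points stay close on every vector, whereas distant points can only be told apart by combining several vectors.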

3.3 Subspace creation

In this paper, the bottom-up grid cloaking method is used to create the subspace. The whole data space is divided into many grids, with the number in each grid representing the number of location points included in the region. The subspace contains only the area around the query point, with its size determined by the threshold. To reduce the number of iterations, the extension principle of the subspace is used to find the grid of the maximum number of locations. The principle is illustrated by an example as follows.

Assume that SN is a two-dimensional array for storing the number of data points in each small grid. The number of points in the grid of the query point is denoted as SN[u][v]. We select the maximum value of the set (SN[u $+$ 1][v], SN[u][v $+$ 1], SN[u – 1][v], SN[u][v – 1]) each time. When the sum of the numbers in the expanded grid satisfies the threshold, SK, of the search space, the extension is stopped.
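As an illustration of the counting array and of one selection step, the following sketch builds SN for a small region and picks the fullest of the four adjacent grids. The function names `count_grid` and `best_neighbour`, the 4 × 4 region, and the sample points are assumptions made for this example only.

```python
def count_grid(points, width, height, m, n):
    """Return an m-by-n array SN where SN[u][v] counts the points in cell (u, v)."""
    sn = [[0] * n for _ in range(m)]
    for x, y in points:
        u = min(int(x / width * m), m - 1)   # clamp points lying on the far edge
        v = min(int(y / height * n), n - 1)
        sn[u][v] += 1
    return sn

def best_neighbour(sn, u, v):
    """Return (count, (u', v')) for the fullest of the four adjacent cells."""
    candidates = []
    for du, dv in ((1, 0), (0, 1), (-1, 0), (0, -1)):
        nu, nv = u + du, v + dv
        if 0 <= nu < len(sn) and 0 <= nv < len(sn[0]):  # skip cells outside the grid
            candidates.append((sn[nu][nv], (nu, nv)))
    return max(candidates)

# Hypothetical data: six points in a 4 x 4 region split into 2 x 2 cells.
sample = [(0.5, 0.5), (1.0, 1.0), (3.0, 0.5), (3.0, 3.0), (3.5, 3.5), (2.5, 3.9)]
SN = count_grid(sample, width=4, height=4, m=2, n=2)
choice = best_neighbour(SN, 0, 0)
print(SN, choice)
```

The boundary check is an addition of this sketch; the text above does not say how grids on the border of the region are handled.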

4. Problem description

When the data scale is large, the time cost of the spatial k-anonymity algorithm based on the LSH partition (SKA_LSHP) remains very large because it solves the problem only in the global space. Therefore, to improve the availability and quality of the data, and to reduce the problem complexity, we present a subspace k-anonymity algorithm that applies a bottom-up grid-search method before the LSH partition (SKA_BGSM_LSHP).

4.1 Algorithm description

The basic idea underlying the SKA_BGSM_LSHP algorithm is described as follows.

Step 1. Initialize the structure array. Read n two-dimensional location points from the dataset and store these points in the structure array.
Step 2. Divide the location region into M $\times$ N grids and calculate the number of points in each small grid. The parameters M and N determine the size of the grid.
Step 3. Locate the grid of the query point and extend it along the grid. The extension principle is as follows. First, the algorithm searches in four different directions to identify the grid with the largest number of points. Then, it connects these grids into a rectangular area. If this extension is in the horizontal direction, then the next expansion is in the vertical direction, and vice versa, until the given threshold conditions are met. In this way, the subspace is constructed.
Step 4. Read all the points in the subspace obtained in Step 3 and store them in a new structure array.
Step 5. Project these points onto each random vector, sort the projected points on each vector into a sorted table, and divide each table into k-size buckets.
Step 6. Take the first point of the first bucket, collect the points that share a bucket with it in each sorted table, and select its k–1 nearest points from them. Next, delete the selected points from the sorted tables. Repeat the operation until the number of points in the sorted table is less than 2 $\times$ k.

The SKA_BGSM_LSHP algorithm is composed of two parts, given as pseudo-code in Algorithms 1 and 2, respectively. Algorithm 1 is a subspace-search algorithm based on the bottom-up grid method. It is abbreviated as CSABGM (n, SK) and is executed first to form a subspace. Next, the sub-spatial k-anonymity algorithm based on LSH partition, SKA_LSHP (Q, u, k), is executed.

First, a subspace of the problem is created by the CSABGM algorithm presented as Algorithm 1.

Algorithm 1 CSABGM ( $n$ , SK)

Creating a subspace algorithm by the bottom-up grid method

Input: Number $n$ , parameters M and N of grid size, threshold value SK

Output: Sub-spatial set $Q$

1: Read $n$ points $P_{1},P_{2},\ldots,P_{n}$ from the dataset;

2: Divide the region containing these points into $M\times N$ small grids;

3: Count the number of points in each grid and store them in a two-dimensional array, $\textit{SN}[M][N]$ ;

4: while (selectednum $<=$ SK) do

5: if (it is the first selected)

6: selectednum $=$ Max( $\textit{SN}[u+1][v]$ , $\textit{SN}[u][v+1]$ , $\textit{SN}[u-1][v]$ , $\textit{SN}[u][v-1])+\textit{SN}[u][v]$ ;

7: else

8: if (time_merger is in the horizontal)

9: {

10: selectednum $=$ Max ( $\textit{SN}[u+1][v]$ , $\textit{SN}[u-1][v])+\textit{SN}[u][v]$ ;

11: Extract all of the selected points to set $Q\{Q1,Q2,\ldots,Qq\}$

12: }

13: else

14: if (time_merger is in the vertical)

15: {

16: selectednum $=$ Max ( $\textit{SN}[u][v+1]$ , $\textit{SN}[u][v-1])+\textit{SN}[u][v]$ ;

17: Extract all of the selected points to set $Q\{Q1,Q2,\ldots,Qq\}$

18: }

19: end if;

20: end if;

21: end if;

22: if ( $|q|$ < 2 $k)$

23: return ( $Q$ );

24: end while;

The CSABGM algorithm first reads all points from the dataset (Line 1). Then, it divides the region of the points into $M\times N$ small grids (Line 2). It then calculates the number of points in each grid and stores them in a two-dimensional array, SN (Line 3). Finally, using the bottom-up selection method, the subspace is created (Lines 4 to 22). The grid-search method narrows the search range, which reduces the number of loop iterations and improves the efficiency. It searches only within a certain range around the query point, which is reasonable in the real world.

Figure 4.
Calculation steps of the bottom-up grid search algorithm.

An example of Algorithm 1 is illustrated as follows. Assume that threshold SK $=$ 15. The expansion process and steps are as shown in Fig. 4. A shaded grid represents the grid of a query point in Fig. 4a. The first determination of the maximum of the four directions is shown in Fig. 4b, SN $=$ 1 $+$ 5. The first maximum value is in the vertical direction.

Thus, the second determination is made in the horizontal direction, as shown in Fig. 4c; SN $=$ 1 $+$ 5 $+$ 2 $+$ 4 $<$ 15. The third determination expands in the vertical direction, as shown in Fig. 4d; SN $=$ 1 $+$ 5 $+$ 2 $+$ 4 $+$ 2 $+$ 3 $>$ 15. At this point, the threshold requirements have been reached to stop searching and expanding.
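The expansion traced above can be sketched as follows. This is an illustrative reading of Algorithm 1 rather than the authors' implementation: each step adds the adjacent strip of grids that yields the largest total, and the axis alternates after the first step. The 4 × 4 layout of counts is hypothetical (the cell arrangement of Fig. 4 cannot be recovered from the text); it was chosen only so that the running totals match the sums 6, 12, and 17 quoted in the example.

```python
VERTICAL, HORIZONTAL = ("up", "down"), ("left", "right")

def grow(rect, side):
    """Extend the rectangle (r0, r1, c0, c1) by one strip on the given side."""
    r0, r1, c0, c1 = rect
    return {"up": (r0 - 1, r1, c0, c1), "down": (r0, r1 + 1, c0, c1),
            "left": (r0, r1, c0 - 1, c1), "right": (r0, r1, c0, c1 + 1)}[side]

def in_grid(sn, rect):
    r0, r1, c0, c1 = rect
    return r0 >= 0 and c0 >= 0 and r1 < len(sn) and c1 < len(sn[0])

def rect_sum(sn, rect):
    r0, r1, c0, c1 = rect
    return sum(sn[r][c] for r in range(r0, r1 + 1) for c in range(c0, c1 + 1))

def candidates(sn, rect, sides):
    """(total, side) for every extension that stays inside the grid."""
    return [(rect_sum(sn, grow(rect, s)), s) for s in sides if in_grid(sn, grow(rect, s))]

def build_subspace(sn, row, col, sk):
    rect, total = (row, row, col, col), sn[row][col]
    history, sides = [total], VERTICAL + HORIZONTAL  # first step: all four directions
    while total <= sk:  # same loop condition as Line 4 of Algorithm 1
        options = candidates(sn, rect, sides) or candidates(sn, rect, VERTICAL + HORIZONTAL)
        if not options:
            break  # the rectangle already covers the whole grid
        total, side = max(options)
        rect = grow(rect, side)
        history.append(total)
        sides = HORIZONTAL if side in VERTICAL else VERTICAL  # alternate the axis
    return rect, history

# Hypothetical counts; the query point lies in the cell at row 2, column 1 (count 1).
SN = [[0, 2, 3, 0],
      [1, 5, 4, 0],
      [1, 1, 2, 0],
      [0, 0, 1, 0]]
rect, history = build_subspace(SN, 2, 1, sk=15)
print(rect, history)
```

The fallback to all four directions when the preferred axis is blocked by the border is an assumption of this sketch.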

The subspace k-anonymity algorithm based on LSH is outlined in Algorithm 2.

Algorithm 2 SKA_LSHP ( $Q, u, k$ )

Sub-spatial $k$ -anonymity algorithm based on LSH

Input: Sub-spatial set $Q,$ query point $u,$ anonymity degree $k$ ;

Output: $k$ -anonymous set;

1: Map all points in set $Q$ onto a random vector $L_{i}(i=1,2,\ldots,m)$ to obtain the sort table $L\{L_{1},\ldots,L_{m}\}$

2: Divide $L_{i}$ into a bucket table, $H_{i}$ , with $k$ -capacity.

$H=\{H_{1},\ldots,H_{m}\}$ , $H_{i}=\{H_{i1},H_{i2},\ldots,H_{it}\}$ , $t=\frac{n}{k}$ ;

3: While (tableelement_numbers $>=$ 2 $k$ and query point $u$ is not in $R)$

4: $R=$ Ø;

5: $P=$ the first element of $H_{1}$ ;

6: $B={\O}$ ;

7: For ( $i=$ 1 to $m$ )

8: Determine bucket $H_{ix},u\in H_{ix}$ ;

9: $b\leftarrow H_{ix}$ ;

10: $B\leftarrow B\cup b$ ;

11: end for

12: end while;

13: Determine $k$ – 1 nearest neighbor points of $u$ from set $B$ to set $R$ ;

14: $R\leftarrow R\cup u$ ;

15: Remove all points of $R$ from $H$ ;

16: If ( $u$ belongs to $R$ )

17: return ( $R$ );

18: Else

19: return ( $H$ );

20: end if

First, all points in the subspace are projected onto a set of vectors, and a sort table is obtained for each vector (Line 1). Then, each table is divided into several buckets according to the value of k (Line 2). In each sort table, the points of the bucket that contains query point u are collected and stored in set B (Lines 3 to 12). The nearest k–1 points are selected from set B and placed in set R (Line 13). Finally, the ASR is returned (Lines 14 to 20). The partition with the LSH function divides the space appropriately. Compared with other algorithms, this algorithm produces a smaller anonymous spatial region. An example of Algorithm 2 is detailed as follows:

Suppose the query point is the fourth point (a4), k is 4, l is 2, and ten points of the subspace are obtained after the grid search. The ten points are: a1(3,6), a2(4,8), a3(5,4), a4(4,6), a5(2,7), a6(2,4), a7(5,5), a8(6,7), a9(7,5), and a10(6,5). Vector l1 is 4x $+$ 3y $=$ 12, and vector l2 is 2x – 3y $=$ 4. The sorted table after the first projection is L[1]: {a5, a2, a1, a4, a8, a6, a7, a10, a3, a9}. The bucket table H[1] is: {(a5, a2, a1, a4), (a8, a6, a7, a10, a3, a9)} $=$ {h[1][1], h[1][2]}. The bucket table H[2] is: {(a6, a5, a1, a3, a4), (a7, a10, a2, a9, a8)} $=$ {h[2][1], h[2][2]}. Because query point a4 is in h[1][1] and h[2][1], B $=$ h[1][1] $\cup$ h[2][1] $=$ {a5, a2, a1, a4, a6, a3}, and the resulting set R is {a5, a1, a2, a4}. The obtained ASR is denoted with orange shading in Fig. 5.

Figure 5.
Example of the k-anonymity algorithm based on LSH.
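The worked example can be checked with a short sketch. The direction vectors (3, -4) and (3, 2) are assumptions: they run along the lines 4x + 3y = 12 and 2x - 3y = 4, with the orientation chosen so that the ascending order matches the printed tables. The leading buckets are taken with the sizes printed in the example (four entries in H[1], five in H[2]) rather than derived from a bucketing rule.

```python
points = {
    "a1": (3, 6), "a2": (4, 8), "a3": (5, 4), "a4": (4, 6), "a5": (2, 7),
    "a6": (2, 4), "a7": (5, 5), "a8": (6, 7), "a9": (7, 5), "a10": (6, 5),
}
# Assumed direction vectors along 4x + 3y = 12 and 2x - 3y = 4.
directions = [(3, -4), (3, 2)]

def sorted_table(direction):
    """Names sorted by projection; the unnormalised dot product gives the same order."""
    dx, dy = direction
    return sorted(points, key=lambda name: points[name][0] * dx + points[name][1] * dy)

tables = [sorted_table(d) for d in directions]

query, k = "a4", 4
# Leading buckets with the sizes printed in the example.
h11, h21 = tables[0][:4], tables[1][:5]
B = list(dict.fromkeys(h11 + h21))  # ordered union of the two buckets

def sq_dist(a, b):
    """Squared Euclidean distance (integers, so ties compare exactly)."""
    return (points[a][0] - points[b][0]) ** 2 + (points[a][1] - points[b][1]) ** 2

neighbours = sorted((p for p in B if p != query), key=lambda p: sq_dist(p, query))[:k - 1]
R = set(neighbours) | {query}
print(h11, h21, sorted(R))
```

Two details are worth noting. Points with equal projections (a6 and a8 on the first vector, a2 and a10 on the second) may appear in either order, so only the leading buckets are compared with the printed tables. In addition, a3 and a5 are equidistant from a4; the stable sort keeps a5, which agrees with the set given in the example.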

4.2 Algorithm rationality analysis

By creating a subspace to generate anonymous regions, the search area, scope, and complexity of the problem are significantly reduced and the efficiency improves. Furthermore, high-dimensional data space is mapped into low-dimensional linear space by hash projection, which also significantly reduces the complexity of the problem.

Because a single hash function may place two distant points in the same bucket, multiple hash functions are used to reduce the probability of misallocation and thereby improve the accuracy and the quality of the data. Each sequence table is divided into buckets comprising k elements, and all elements in the same bucket as the query point are selected as the candidate set from which the k–1 nearest neighbors are chosen. Because LSH has the characteristic of local location preservation, it can guarantee the correctness of candidate sets. Further, because the size of the candidate sets is generally small, this method can further reduce the search scope and improve the efficiency of the algorithm.

4.2.1 Inference

The probability that all the nearest neighbors of query point Q fall into the hash bucket is positively correlated with the number of functions $L$ in the hash function cluster.

4.2.2 Proof

Let $l$ be a hash function, and let $A$ be the random event that a single nearest neighbor of $Q$ falls into the same hash bucket as $Q$ , which occurs with probability $P$ .

$\displaystyle\textit{Probability}(A)=P$ (1)

Suppose there are $K$ elements in each hash bucket. Assuming that $B$ is a random event in which all the closest K–1 elements of $Q$ fall in the hash bucket, then the probability of $B$ is given as follows:

$\displaystyle\textit{Probability}(B)=P^{K-1}$ (2)

$C$ is the event that the nearest neighbors do not all fall into the hash bucket. $C$ and $B$ constitute a complete event group:

$\displaystyle\textit{Probability}(B)+\textit{Probability}(C)=1$ (3)

Thus, the probability of occurrence of $C$ is as follows:

$\displaystyle\textit{Probability}(C)=1-P^{K-1}$ (4)

Suppose there are $L$ hash functions, $\textit{Set}_{\textit{Hash}}=(l_{1},l_{2},l_{3},\ldots,l_{L})$ .

Let $D$ be the random event that, for a given hash function $l_{i}$ , the nearest neighbor of $Q$ falls into the hash bucket:

$\displaystyle\textit{Probability}(D)\geqslant P$ (5)

Let $E$ be the random event that, under every one of the $L$ hash functions, the nearest neighbors do not all fall into the hash bucket:

$\displaystyle\textit{Probability}(E)\leqslant(1-P^{K-1})^{L}$ (6)

Let $F$ be the random event that the k–1 nearest neighbors of $Q$ can be found with at least one of the hash functions. Then $F$ and $E$ constitute a complete event group:

$\displaystyle\textit{Probability}(E)+\textit{Probability}(F)=1$ (7)

Thus, the probability of occurrence of $F$ is as follows:

$\displaystyle\textit{Probability}(F)\geqslant 1-(1-P^{K-1})^{L}$ (8)

Therefore, the probability of finding the nearest neighbor of $Q$ with $L$ hash functions increases with increasing $L$ .
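To make the trend concrete, the lower bound of Eq. (8) can be evaluated numerically. The values P = 0.8 and K = 4 below are hypothetical, and the function name `lower_bound` is introduced for this sketch only. The bound treats the L hash functions as independent, as Eq. (6) implicitly does.

```python
def lower_bound(p, k, num_hashes):
    """Right-hand side of Eq. (8): 1 - (1 - p**(k - 1))**L."""
    return 1.0 - (1.0 - p ** (k - 1)) ** num_hashes

# Hypothetical values: P = 0.8, K = 4, and L = 1..5 hash functions.
bounds = [lower_bound(0.8, 4, L) for L in range(1, 6)]
for L, b in enumerate(bounds, start=1):
    print(L, round(b, 4))
```

With these values the bound starts at 0.512 for a single hash function and rises with every additional one, which is the monotonic behavior claimed in the inference.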

4.3 Algorithm complexity analysis

The time complexity analysis is presented as follows. In Algorithm 1, Line 1, the time complexity for reading $n$ points is $O(n)$ . For the expansion step beginning at the fifth line, assume that every grid contains the same number of points, $p$ , and that SK is the threshold; the number of cells in the subspace is then $\mu=\frac{\textit{SK}}{p}$ . Assume further that $M$ is equal to $N$ . Extending from one grid to $M\times N$ grids adds M–1 rows and N–1 columns. The number of iterations is therefore $q=2\times\left(\sqrt{\frac{\textit{SK}}{p}}-1\right)$ , and the time complexity of the expansion is O $\left(2\times\left(\sqrt{\frac{\textit{SK}}{p}}-1\right)\right)$ . Because this term is dominated by the cost of reading the points, the time complexity of the CSABGM ( $n$ , SK) algorithm is O( $n$ ).
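A quick numerical check of the iteration count, under the same simplifying assumptions (every grid holds p points and the subspace is square). The values SK = 100 and p = 4 are hypothetical, and `expansion_iterations` is a name introduced for this sketch.

```python
import math

def expansion_iterations(sk, p):
    """q = 2 * (sqrt(SK / p) - 1) for a square subspace with p points per grid."""
    return 2 * (math.sqrt(sk / p) - 1)

# Hypothetical values: SK = 100 and p = 4 give 25 cells, i.e. a 5 x 5 subspace.
q = expansion_iterations(100, 4)
print(q)
```

Growing from one grid to a 5 × 5 block adds four rows and four columns, which agrees with q = 8.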

In Algorithm 2, Lines 3 to 11, for each array $L_{i}$ in sort table $L$ , there are S points that must be traversed and sorted. The time complexity is O( $L\times S^{2}$ ). The “while” loop requires $\left(\frac{S}{K}-1\right)$ iterations. Between Lines 7 and 10, each element must be assigned from the sorted table to a bucket; the worst-case time complexity is O( $l\times k\log(l\times k)$ ), and the overall time complexity of the loop is O $\left(\frac{S}{K}\times l\times k\log(l\times k)\right)$ . Thus, the time complexity of Algorithm 2 is O( $n^{2}$ ).

5. Experimental classification results and analysis

We evaluated the quality of service and the response time with different values of $l$ , $n$ , and $k$ . The main inputs of the program were the query point $u$ , the anonymity degree $k$ , and the number of random vectors $l$ . The experiments were conducted on two datasets: dataset1 and dataset2. Dataset1 was randomly generated by a program; dataset2 was a real dataset.

Figure 6.

Influence of $k$ on the ASR percentage.

5.1 Evaluation criteria

The evaluation primarily focused on the ASR percentages and execution times. Calculation of the ASR percentage is shown in Eq. (9).

$\displaystyle R_{\text{asr}}=\frac{S_{\text{asr}}}{S_{0}}\times 100\%$ (9)

where $S_{\text{asr}}$ represents the area of the minimum bounding rectangle of the anonymous group of $k$ points output by the algorithm, and $S_{0}$ represents the area of the entire region. The ratio of the two areas is the ASR percentage. Service quality is determined primarily by the ASR percentage: the smaller the ASR percentage, the better the quality of service. The LBS efficiency is determined primarily by the algorithm execution time.
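Eq. (9) can be illustrated with a minimal sketch that computes the minimum bounding rectangle of an anonymous group. The group used here is the four-point set {a5, a1, a2, a4} from the example in Section 4.1; the 10 × 10 region and the function name `asr_percentage` are assumptions made for illustration.

```python
def asr_percentage(group, region_width, region_height):
    """ASR percentage of Eq. (9): MBR area of the group over the region area."""
    xs = [x for x, _ in group]
    ys = [y for _, y in group]
    mbr_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    return 100.0 * mbr_area / (region_width * region_height)

# a5, a1, a2, a4 from the Section 4.1 example; the 10 x 10 region is hypothetical.
group = [(2, 7), (3, 6), (4, 8), (4, 6)]
ratio = asr_percentage(group, 10, 10)
print(ratio)
```

The bounding rectangle spans x from 2 to 4 and y from 6 to 8, so its area is 4 and the ASR percentage is 4% of the assumed region.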

5.2 Experimental results on dataset1

The simulation experiments on dataset1 were coded in VC++ 6.0. The main parameters of dataset1 were as follows. The size of the region was 1,000 $\times$ 1,000, the number of points N was 1,000, the grid size was 20, and the number of position points in each grid was randomly generated. Threshold SK was N/50. Each experiment was conducted ten times and was divided into two parts. The first part examined the algorithm under different parameters; the results are shown in Figs 6 to 9. The second part compared the proposed SKA_BGSM_LSHP algorithm with the SKA_LSHP algorithm.

All the curves in Fig. 6 show an upward trend, which indicates that the $k$ value is the most influential factor for the ASR size. Figure 6a shows that a smaller ASR size is obtained with a large $l$ value; the division is more accurate and reasonable when more hash functions are used. Figure 6b shows that the larger the value of n, the smaller the ASR size that can be obtained. The results illustrate that the algorithm is effective for anonymity and data quality, and that its performance improves as the number of data points or hash functions increases.

Figure 7 shows that the ASR percentage decreases with the increase of the $n$ value. In Fig. 8, the two curves are in essence the same. The ASR percentage shows a decreasing trend with the increase of $l$ . The results indicate that multiple hash functions can lead to better partitioning.

Figure 7.

Influence of n on the ASR percentage.

Figure 8.

Influence of l on the ASR percentage ( $n=$ 1,000).

Figure 9.

Influence of $k$ on execution time.

Figure 10.

Comparison of execution time of the two algorithms.

Figure 11.

Comparison of ASR percentages of the two algorithms.

Figure 12.

ASR percentage of SKA_BGSM_LSHP with different parameters.

Figure 13.

Execution time of SKA_BGSM_LSHP with different parameters.

In Fig. 9a, when the dataset size is 1,000, the curves show no regular trend. In other words, there is no dependency between the execution time and the $k$ value of the algorithm. In general, the execution time is mainly affected by $l$ . In Fig. 9b, when the dataset size is 1,000, a straight line is observed, which shows that the $k$ value only slightly affects the program execution time. When the dataset size increases to 5,000, the curve shows a significant downward trend; that is, the running time decreases with increasing $k$ value. This indicates that the advantage of the algorithm is more pronounced for larger numbers of points. As shown in Fig. 9, the execution time is mainly influenced by $l$ and n and only slightly influenced by $k$ .

To verify the effectiveness of the proposed algorithm, it was compared with the algorithm based on LSH partitioning only. The experimental results are shown in Figs 10 and 11.

In Fig. 10, the ASR percentage curves of the two algorithms follow the same trend: both increase with increasing $k$ . However, the proposed SKA_BGSM_LSHP algorithm has a smaller anonymous query region. It is more accurate than the compared SKA_LSHP algorithm, which results in a higher LBS quality.

In Fig. 11, the trends of the respective curves of the two algorithms do not change when the data size increases. The ASR percentage increases with increasing $k$ . The distance between the two curves also increases, which indicates that the advantage of the proposed algorithm over the compared algorithm in terms of ASR size and service quality becomes more pronounced.

As shown by the above results, compared with the SKA_LSHP algorithm, the proposed algorithm has a smaller ASR and lower time cost. The results demonstrate that the proposed SKA_BGSM_LSHP algorithm is effective and promising. Furthermore, it promotes reasonable partitioning and significantly improves the data availability.

5.3 Experimental results on dataset2

We employed dataset2 to test the performance of the proposed algorithm on large datasets. Our simulation was implemented in Java and executed on the Microsoft Windows 8 operating system running on a 2.40-GHz Intel (R) Core (TM) i5 4528U PC with a 500 GB hard drive. Dataset2 comprised approximately 2.7 million geographic location records of users of a social networking service (http://www.datatang.com/data/43896). The data were crawled from a domestic location-service social network that enables users to sign in and records the geographical location of each user visit. Duplicate data were removed in the experiment. The latitude and longitude attributes were selected as the location coordinates. In this experiment, the performance of the algorithm on massive data was evaluated by changing the parameters $n$ , $M$ , $N$ , and SK.

Figure 14.

Influence of n on IO execution time of SKA_BGSM_LSHP.

Figure 15.

Comparison of ASR percentage of the two algorithms with different parameters.

First, the performance of SKA_BGSM_LSHP was evaluated; the experimental results are shown in Figs 12–14.

Figure 12 depicts the influence diagram of $k$ and $l$ on the ASR percentage. From the figure, it is evident that the value of $k$ has a significant impact on the ASR percentage, whereas the impact of the value of $l$ is small. Moreover, the ASR percentage increases with the $k$ value.

Figure 13 shows how the program execution time changes with the number of data points. Figure 13a to d present the execution times for 1,000, 10,000, 100,000, and 200,000 data points, respectively. As shown in Fig. 13, the execution time of the program increases as the number of data points increases. The algorithm remains stable when the number of points increases to 200,000, which indicates that it adapts well to a large number of data items.

In Fig. 14, by increasing the dataset size from 7,500 to 25,000, the IO execution time under SKA_BGSM_LSHP is approximately linear, showing a good performance.

In addition, we compared the proposed SKA_BGSM_LSHP algorithm and the SKA_LSHP algorithm on dataset2. The comparative results are shown in Figs 15–17.

From Fig. 15, it is clear that the SKA_BGSM_LSHP algorithm is superior according to the ASR percentage, which indicates that the SKA_BGSM_LSHP algorithm promotes high data quality.

Figure 16.

Comparison of execution time of the two algorithms with different parameters.

In Fig. 16, as the dataset size increases from 5,000 to 10,000, the curve of the SKA_LSHP algorithm rises significantly, whereas the curve of the SKA_BGSM_LSHP algorithm rises only slightly. This indicates that SKA_BGSM_LSHP has better efficiency and stability than SKA_LSHP, resulting in a lower time cost and higher LBS quality.

Figure 17.

Comparison of memory usage of the two algorithms with different parameters.

In Fig. 17, as the dataset size increases from 6,500 to 20,000, the memory usage of SKA_BGSM_LSHP also remains lower than that of SKA_LSHP.

In summary, the experimental results show that the proposed SKA_BGSM_LSHP algorithm is superior to the SKA_LSHP algorithm in terms of data quality and efficiency.

6. Conclusions

In this paper, we described k-anonymity privacy protection requirements and proposed a subspace algorithm that addresses the complex problem of privacy protection for LBSs from a novel perspective. In the proposed algorithm, the subspace is created via a bottom-up method to narrow the search range and reduce the time complexity. Working within a subspace rather than the global view more closely resembles how a human would approach the problem. Furthermore, LSH yields a reasonable partition, which produces high-quality data.
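To make the idea of narrowing the search range from the bottom up more concrete, the following is a minimal sketch of a generic bottom-up grid search for a k-anonymous cell. It is not the SKA_BGSM_LSHP procedure itself and omits the LSH-based partition entirely. The class name `BottomUpGridSketch`, the normalization of coordinates to the unit square, and the halving of the grid resolution at each level are assumptions made purely for illustration.

```java
import java.util.List;

/** Illustrative bottom-up grid cloaking; not the authors' exact procedure. */
final class BottomUpGridSketch {

    /**
     * Finds the smallest grid cell that contains the query point (qx, qy)
     * and at least k points. Level `depth` is the finest grid
     * (2^depth x 2^depth cells); level 0 is the whole space.
     * Coordinates are assumed to be normalized to [0, 1).
     *
     * @return {level, cellX, cellY}, or null if fewer than k points exist
     */
    static int[] cloak(List<double[]> points, double qx, double qy, int k, int depth) {
        // Start at the finest level and move up only while the cell is too sparse.
        for (int level = depth; level >= 0; level--) {
            int cells = 1 << level;
            int cx = cellIndex(qx, cells);
            int cy = cellIndex(qy, cells);
            int count = 0;
            for (double[] p : points) {
                if (cellIndex(p[0], cells) == cx && cellIndex(p[1], cells) == cy) {
                    count++;
                }
            }
            if (count >= k) {
                return new int[] {level, cx, cy};
            }
        }
        return null; // even the whole space holds fewer than k points
    }

    /** Maps a normalized coordinate to a cell index, clamped to the grid. */
    private static int cellIndex(double coordinate, int cells) {
        int index = (int) Math.floor(coordinate * cells);
        return Math.min(Math.max(index, 0), cells - 1);
    }
}
```

Because the search stops at the first level whose cell already holds k points, the returned region is never larger than necessary for the chosen grid, which is the intuition behind a smaller anonymous spatial region. A practical implementation would likely precompute per-cell counts instead of rescanning all points at every level.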

Both the theoretical analysis and the experimental results showed that, compared with the SKA_LSHP algorithm, the proposed SKA_BGSM_LSHP algorithm reduces the anonymous region, reduces the communication and computation times, and improves the service quality. In the future, we intend to use different granularities in the subspace and to explore further optimization of the partition.

Acknowledgments

The authors would like to thank the reviewers for their useful comments and suggestions for improving this paper. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61672039, 61772034 and 61602009), the Key Project of Academic Support for the Top Talents in Anhui Universities (Grant No. gxbjZD2016011), and the Novation Foundation of Anhui Normal University (Grant No. 2017XJJ93).
