Cluster-Indistinguishability: A practical differential privacy mechanism for trajectory clustering

Abstract

An important method of spatial-temporal data mining, trajectory clustering can mine valuable information in trajectories. However, cluster results without special sanitization pose serious threats to individual location privacy. Existing privacy preserving mechanisms for trajectory clustering still contend with the problems of narrow applicability, low-level utility, and difficulty in being applied to real scenarios. In this paper, we therefore propose a differential privacy preserving mechanism, Cluster-Indistinguishability, to support trajectory clustering. Firstly, a general model of typical trajectory clustering algorithms is given, and the definition of differential privacy is introduced according to the model. Then, we derive the probability density function of two-dimensional Laplace noise, which satisfies the above definition. Finally, we transform the noise from a Cartesian coordinate system to a Polar coordinate system to efficiently apply it in real scenarios. Experimental results show that Cluster-Indistinguishability has general applicability and better performance compared to existing methods.

Keywords

Data mining, trajectory clustering, privacy preserving, differential privacy

1. Introduction

Nowadays, with the popularity of mobile devices and equipment supporting GPS positioning, trajectory clustering, which aggregates similar locations, has become prevalent in many applications, such as population density analysis, advertisement recommendation, and travel route planning [1]. Trajectory clustering provides results as a quasi-circular area, which includes the total number of points in the cluster and its center. It has thus attracted considerable attention.

While trajectory clustering provides significant benefits, it poses serious threats to individual privacy [2]. The cluster results, namely the total number of points and the center of the cluster, may reveal a person’s precise position. A diagram of the location privacy disclosure model is shown in Fig. 1.

Figure 1.

Diagram of a location privacy disclosure model.

As illustrated in Fig. 1, let us consider a malicious attacker who does not know the precise location data of Amy. This attacker can nevertheless infer her position from the cluster results returned by several queries. The first query reveals the total number of points $n$ and the center $c$ of the cluster; the other queries reveal those of the adjacent areas, $n^{\prime}$ and $c^{\prime}$. If the total numbers of points in two adjacent areas differ by only one, and if Amy is located in the cluster, then Amy’s precise position can be inferred by Eq. (1).

$\displaystyle p_{\textit{Amy}}=\left|{nc-n^{\prime}c^{\prime}}\right|$ (1)
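The attack in Eq. (1) can be illustrated with a minimal sketch. It assumes two-dimensional points with non-negative coordinates and two published cluster results that differ only in Amy's point; the function names `cluster_stats` and `infer_point` and the sample coordinates are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cluster_stats(points: List[Point]) -> Tuple[int, Point]:
    """Return the published cluster result: point count n and centre c."""
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    return n, (cx, cy)


def infer_point(n: int, c: Point, n_adj: int, c_adj: Point) -> Point:
    """Recover the single extra point from two results, following Eq. (1)."""
    return (abs(n * c[0] - n_adj * c_adj[0]),
            abs(n * c[1] - n_adj * c_adj[1]))


# Hypothetical data: three other users plus Amy.
others = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]
amy = (4.0, 1.0)

n, c = cluster_stats(others + [amy])   # result that includes Amy
n_adj, c_adj = cluster_stats(others)   # adjacent result without Amy
recovered = infer_point(n, c, n_adj, c_adj)
```

In this sketch `recovered` coincides with Amy's coordinates even though the attacker never saw an individual point, only the two pairs $(n, c)$ and $(n^{\prime}, c^{\prime})$.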

The above attack model illustrates that the total number of points in a cluster and its center can disclose an individual’s location privacy. To address this problem, several methods have been proposed. Most early solutions [3, 4, 5] rely on an anonymity/generalization model (e.g. $k$ -anonymity [6]). An anonymity/generalization model uses a generalized value with quasi-identifier attributes (coarse-grained value) instead of a specified value (fine-grained value). Unfortunately, these schemes have been found to be vulnerable to many types of privacy attacks, as well as to have low-level data utility. Therefore, a method of extracting beneficial knowledge from trajectory clustering without compromising individual privacy is the motivation of our paper.

In order to solve the problems with which the anonymity/generalization model contends, differential privacy [7] was recently introduced to preserve the privacy of statistical databases. This privacy model makes no assumptions on adversaries’ background knowledge; moreover, it can guarantee that the released data will not breach an individual’s privacy. On the study of differential privacy for clustering, existing approaches mainly focus on the schemes for a certain clustering algorithm (e.g. $k$ -means [8, 9, 10] and DBSCAN [11]). To address multidimensional data, independent noise distribution is added to each dimension to achieve differential privacy. The closest method is the Geo-Indistinguishability mechanism proposed by Andres [12]. By defining the indistinguishability between the original and published positions, malicious attackers cannot distinguish the original and perturbed positions.

Compared with the anonymity/generalization model, the above methods achieve a better balance between security and utility; nevertheless, they suffer from the following problems.

They are designed for only a certain clustering algorithm. Furthermore, they lack robustness to different clustering algorithms and are not practical for trajectory clustering.

They can guarantee the complete security of a single position; however, when applied to privacy preserving of trajectory clustering, they ignore the restrictive conditions of the cluster boundary, which will cause the problem of security reduction.

In general, different dimensions of multidimensional data are correlated, and independently adding noise in existing methods will lead to a decrease of data utility.

In this paper, we aim to solve these challenges and propose a practical differential privacy preserving mechanism for trajectory clustering. We refer to our approach as Cluster-Indistinguishability. Firstly, we define differential privacy for trajectory clustering under the constraint condition of the cluster boundary to support most typical clustering algorithms. Then, we derive the joint probability density function (JPDF) of noise distribution according to the definition. Finally, we transform the noise from a Cartesian coordinate system to a Polar coordinate system to efficiently apply it in real scenarios. Our contributions are threefold:

We provide the general model of typical clustering algorithms and differential privacy definition under the constraint condition of the cluster boundary to provide complete security for trajectory clustering.

Instead of independent distribution noise, we derive the JPDF of noise to achieve better utility. In addition, to improve the practicality of Cluster-Indistinguishability, we perform it in a Polar coordinate system.

Experiments demonstrate that our approach maintains a higher utility and is scalable to most typical trajectory clustering algorithms.

The remainder of this paper is organized as follows. In Section 2, we summarize related work on privacy preserving trajectory clustering. We then describe the limitations of existing methods in Section 3 and briefly introduce the notations adopted in this work. Our approaches and experiments are described in Sections 4 and 5, respectively, followed by our conclusions and discussion of future work in Section 6.

2. Related work

Existing privacy preserving mechanisms for clustering can be classified into two types. One type is anonymity/generalization-based. It generalizes trajectories to achieve privacy preservation. A typical anonymity-based algorithm is $k$ -anonymity. The other type is differential privacy, which is essentially a kind of noise-perturbed mechanism. It guarantees privacy preservation by adding noise to the original positions.

2.1 Trajectory anonymity

The first type takes advantage of anonymity-based privacy preserving models, such as $k$-anonymity and its variants. These models usually cluster records in the database into disjoint groups according to some privacy constraints. They then calculate and publish certain statistics for each group. For instance, Abul et al. [13] proposed the ($k,\delta$)-anonymity model, which is generalized from $k$-anonymity. Its basic idea is to modify the original trajectories based on clustering and space translation so that at least $k$ different trajectories co-exist in a cylinder with a fixed radius. It then publishes the cylinders instead of the trajectories to protect the privacy of moving objects. Yarovoy et al. [4] regarded timestamps as quasi-identifiers. Thus, they proposed a $k$-anonymity algorithm that is suitable for a moving object database, namely extreme-union and symmetric anonymization, to create an anonymity group. Accordingly, $k$-anonymity can be satisfied while simultaneously ensuring very low information loss. Monreale et al. [5] first partitioned a geographical region according to the features of a trajectory dataset and then generalized the trajectory points in every partitioned region to achieve $k$-anonymity. The performance of data utility is also illustrated by several clustering algorithms.

Despite the above work, studies [15, 16, 17] have demonstrated that anonymity-based privacy models are vulnerable to numerous types of attacks. As a result, they are deemed unable to provide sufficient privacy protection in trajectory clustering. Moreover, because they generalize data with similar properties into a single category, the utility of the published data is weak.

2.2 Differential privacy

Owing to the drawbacks of anonymity-based privacy models, differential privacy, a type of randomization-based privacy model, has recently been employed for privacy preservation. Blum et al. [8] proposed a differential privacy mechanism, P$k$-means, which combines sample and aggregate methods, to release $k$-means cluster centers. The metric measurement of cluster sensitivity and the lower bound of the cluster error are given in P$k$-means. In addition, Dwork [18] presented two allocation methods of privacy budget $\varepsilon$ in $k$-means clustering: uniform and exponential decay allocation. Li et al. [10] proposed the IDP $k$-means mechanism, which can improve the utility of differential privacy for $k$-means. Wu et al. [11] proposed the DP-DBSCAN mechanism to support DBSCAN by adding independent Laplace distribution noise to each dimension of the original data. To address privacy disclosure issues in the LBS system, Andres et al. [12] generated an approximate location with a formal privacy guarantee approach called Geo-Indistinguishability. For a single point, it can guarantee differential privacy within a protective radius. However, conventional differential privacy clustering methods mainly focus on a certain clustering algorithm and lack practicality. Independently adding noise to each dimension also leads to low-level utility. To the best of our knowledge, no effective solution exists to solve privacy leaking problems in trajectory clustering.

3. Problem statement

In this section, we first present the model of typical trajectory clustering algorithms. Then, we review the theory of differential privacy and state the privacy-preserving problem in trajectory clustering from the perspective of the probability distribution model. Finally, we present an overview of our framework.

3.1 Model of trajectory clustering

Taking common clustering algorithms as an example (such as $k$ -means, DBSCAN, $k$ -median, etc.), the cluster result includes center $c$ , the total number of points, and a quasi-circle cluster area. In the $k$ -means algorithm, center $c$ is the mean of each point in the cluster. In other clustering algorithms, center $c$ will eventually approach the mean of points along with the process of clustering. Thus, if we denote $T=\left\{{x_{1},x_{2},\cdots,x_{n}}\right\}$ as the set of original clustering points, the general model of trajectory clustering can be expressed as:

$\displaystyle\left\{{\begin{array}{l}n=\textit{Clu}\left(T\right)\\ c=\frac{1}{n}\sum\nolimits_{i=1}^{n}{p_{i}}\\ \end{array}}\right.$ (2)

where Clu and Cor represent clustering and centering algorithms, respectively.

Equation (2) gives the mathematical representation of the trajectory clustering model. $T$ is the set of points to be clustered, Clu denotes the clustering algorithm, and $n$ is the number of points in the cluster obtained by applying Clu to $T$. Figure 1 illustrates a motivating trajectory clustering attack model. Furthermore, we now summarize in Table 1 the notations used in this paper.

Table 1

Summary of notations

Notation	Definition
$M$	Perturbed algorithm
$p_{i}$	A single point
Cen	Algorithm to seek centroid
Clu	Clustering algorithm
Cor	Algorithm to seek core point
$T$	Set of original clustering points
$c$	Centroid
$p_{\textit{core}}$	Core point
$r$	Cluster radius
$\varepsilon$	Privacy preserving intensity
$d$	Euclidean distance

3.2 Differential privacy

Differential privacy is a currently recognized preservation model that can guarantee stricter security. It is essentially a kind of noise-perturbed mechanism. By adding noise to the raw data or statistical results, differential privacy can guarantee that the value changing of a single record has a minimal effect on the statistical output results. Thus, differential privacy can not only preserve privacy of sensitive data, but also support data mining on statistical results. Its formal definition is as follows:

Definition 1. $\varepsilon$-Differential Privacy. Let $D$ and $D^{\prime}$ be two neighboring datasets, which have the same cardinality but differ in only one record. A randomized algorithm, $M$, gives $\varepsilon$-differential privacy if, for any pair of $D$ and $D^{\prime}$ and every set of outcomes, $S$, it satisfies:

$\displaystyle P\left[{M\left(D\right)\in S}\right]\leqslant\exp\left(\varepsilon\right)\times P\left[{M\left(D^{\prime}\right)\in S}\right]$ (3)

where $S\subseteq\textit{Range}\left(M\right)$, $\textit{Range}\left(M\right)$ is the value range of $M$, and $P\left[\cdot\right]$ and $\varepsilon$ denote the probability distribution and the privacy budget parameter, respectively.

A smaller $\varepsilon$ means better privacy. Figure 2 depicts the output probability distribution of randomized algorithm $M$ satisfying $\varepsilon$ -differential privacy on $D$ and $D^{\prime}$ . $f\left(\cdot\right)$ is the statistical output function.

Figure 2.

Output probability distribution of random algorithm $M$ on $D$ and $D^{\prime}$ .

Preserving intensity $\varepsilon$ is mainly restricted by randomized algorithm $M$ . In practical applications, $M$ is generally realized by a Laplace mechanism. The definition of the Laplace mechanism is the following:

Definition 2. Laplace Mechanism. Noise sequence $Y\sim\textit{Lap}\left(\lambda\right)$ , which obeys the Laplace distribution, can make randomized algorithm $M\left(D\right)=f\left(D\right)+Y$ satisfy $\varepsilon$ -differential privacy. $\lambda$ is the scale parameter of the Laplace distribution, and the PDF of Laplace is:

$\displaystyle\rho\left(x\right)=\frac{1}{2\lambda}\exp\left({-\frac{|x|}{\lambda}}\right)$ (4)

Scale parameter $\lambda$ is determined by sensitivity function $\Delta f$ and privacy preserving intensity $\varepsilon$ :

$\displaystyle\lambda=\frac{\Delta f}{\varepsilon}$ (5)

where $\Delta f$ is the maximum effect that a single record has on the statistical output function:

$\displaystyle\Delta f=\mathop{\max}\limits_{D^{\prime}}\left\|{f\left(D\right)-f\left(D^{\prime}\right)}\right\|_{1}$ (6)

As an example, consider a dataset whose sensitivity of a query is 1. According to differential privacy, adding to the true answer noise distributed according to $\textit{Lap}\left({1/\varepsilon}\right)$ suffices to ensure $\varepsilon$ -differential privacy.
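The Laplace mechanism of Definition 2 can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the sampler draws $\textit{Lap}\left(\lambda\right)$ noise as the difference of two independent exponential variables with mean $\lambda$, which is one standard way to obtain a Laplace variable and is not prescribed by the text.

```python
import math
import random


def laplace_pdf(x: float, scale: float) -> float:
    """Density of Lap(scale) at x, as in Eq. (4)."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw one Lap(scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(true_answer: float, sensitivity: float,
                      epsilon: float, rng: random.Random) -> float:
    """Return f(D) + Y with Y ~ Lap(sensitivity / epsilon), as in Eq. (5)."""
    return true_answer + sample_laplace(sensitivity / epsilon, rng)


rng = random.Random(0)
epsilon, sensitivity = 0.5, 1.0          # illustrative values only
scale = sensitivity / epsilon

# Empirical check: the mean absolute value of Lap(scale) equals scale.
samples = [sample_laplace(scale, rng) for _ in range(20000)]
mean_abs = sum(abs(s) for s in samples) / len(samples)

# Density ratio at output 3.2 for two true answers that differ by the
# sensitivity (10 versus 11); Eq. (3) bounds it by exp(epsilon).
ratio = laplace_pdf(3.2 - 10.0, scale) / laplace_pdf(3.2 - 11.0, scale)
noisy_count = laplace_mechanism(10.0, sensitivity, epsilon, rng)
```

The ratio of the two densities never exceeds $\exp\left(\varepsilon\right)$, which is the pointwise form of the inequality in Eq. (3).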

3.3 Problem statement

As the closest method in differential privacy preservation for location data, Geo-Indistinguishability performs well in privacy preservation for a single point. However, in trajectory clustering applications, Geo-indistinguishability is subject to the constraint of the cluster radius. To intuitively explain the limitations of anonymity-based algorithms and Geo-indistinguishability, we elaborate these two mechanisms from the aspect of a probability statistics model and then state the problem.

3.3.1 Anonymity

We take the classic $k$-anonymity algorithm as an example. It requires that each released record be indistinguishable within a group of at least $k$ records. Thus, $k$-anonymity generalizes $k$ records and expresses them in a unified identifier so that the attacker cannot determine the specific individual information. It guarantees that $k$ records are indistinguishable. The mathematical model of $k$-anonymity is shown in Fig. 3. For a continuous data sequence, we denote $P_{f\left(T\right)}$ and $P_{f^{\prime}\left(T\right)}$ as the probability distributions of the original output and the generalized output, respectively. Then, $P_{f^{\prime}\left(T\right)}$ satisfies the following conditions:

a. $\forall k\in R^{+}$ , there must be a corresponding generalized output $f^{\prime}\left(T\right)$ ; b. the value of $P_{f^{\prime}\left(T\right)}$ is discontinuous between adjacent generalization intervals.

Figure 3.

Probability distribution of $k$ -anonymity.

On account of the generalization of $k$ records, condition (a) represents the equality of statistical output probability distribution within $k$ records. However, since the value of $P_{f^{\prime}\left(T\right)}$ is discontinuous among different generalization intervals, an attacker can utilize the mean between neighboring intervals to launch an attack. Moreover, the utility of the published data processed by $k$ -anonymity is low-level.

3.3.2 Geo-Indistinguishability

To overcome the deficiencies of $k$-anonymity, Geo-Indistinguishability adds Laplace noise to the original data so that the maximum difference between the probability distributions of the original output, $P_{f\left(T\right)}$, and the perturbed output, $P_{f^{\prime}_{\textit{Geo}}\left(T\right)}$, satisfies:

$\displaystyle\frac{P_{f^{\prime}_{\textit{Geo}}\left(T\right)}}{P_{f\left(T\right)}}\leqslant e^{\varepsilon d}$ (7)

As shown in Fig. 4, compared with $k$ -anonymity, the Laplace noise added in the original data is small. Therefore, Geo-Indistinguishability performs better in terms of data utility than $k$ -anonymity. However, because of the cluster radius constraint, Geo-Indistinguishability provides insufficient security.

Figure 4.

Probability distribution of Geo-Indistinguishability.
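The bound in Eq. (7) can be checked numerically with a short sketch. It assumes the planar Laplace density $\frac{\varepsilon^{2}}{2\pi}e^{-\varepsilon d}$ used by Geo-Indistinguishability [12], which is not written out in this section; the function names and the sample coordinates are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def dist(a: Point, b: Point) -> float:
    """Euclidean distance d between two points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def planar_laplace_pdf(center: Point, z: Point, epsilon: float) -> float:
    """Density of reporting z when the true position is `center`."""
    return epsilon ** 2 / (2.0 * math.pi) * math.exp(-epsilon * dist(center, z))


epsilon = 0.8
x, x_alt = (0.0, 0.0), (1.5, -2.0)          # two candidate true positions
bound = math.exp(epsilon * dist(x, x_alt))  # right-hand side of Eq. (7)

# Density ratios at a few arbitrary reported positions.
outputs = [(0.3, 0.4), (-2.0, 5.0), (1.5, -2.0), (10.0, 10.0)]
ratios = [planar_laplace_pdf(x, z, epsilon) / planar_laplace_pdf(x_alt, z, epsilon)
          for z in outputs]
```

Every ratio stays within $\left[e^{-\varepsilon d},e^{\varepsilon d}\right]$ by the triangle inequality, so the two candidate positions remain indistinguishable up to the factor $e^{\varepsilon d}$.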

3.3.3 Differential privacy in trajectory clustering

The probability distribution model of Cluster-Indistinguishability is shown in Fig. 5. Let us observe the probability distribution between original output $P_{f\left(T\right)}$ and perturbed output $P_{f^{\prime}_{\textit{Clu}}\left(T\right)}$ under the constraint of cluster radius $r$ . If the maximum difference of the probability distribution between them is $e^{\varepsilon d}$ , then Cluster-Indistinguishability can guarantee that cluster results $n c$ and $n^{\prime}c^{\prime}$ are indistinguishable to the attacker. The mathematical form is shown in Eq. (8):

$\displaystyle\frac{P\left[{{f}^{\prime}_{\textit{Clu}}\left(T\right)|\left({{n}^{\prime}{c}^{\prime},r}\right)}\right]}{P\left[{f\left(T\right)|\left({nc,r}\right)}\right]}\leqslant e^{\varepsilon d}$ (8)

Figure 5.

Probability distribution of Cluster-Indistinguishability.

Noise added in the original dataset $T$ , which satisfies Eq. (8), can guarantee that cluster results $n c$ and $n^{\prime}c^{\prime}$ are indistinguishable to the attacker.

3.4 Framework overview

We elaborate the framework of this study in two aspects: a privacy preserving framework, and a cluster indistinguishability procedure. Figure 6 illustrates the framework for trajectory clustering privacy preservation. It is composed of three participants: Data Owner, LBSP, and Data Analyzer. They play different roles in trajectory mining and are defined below.

Figure 6.

Framework of privacy preservation.

Data Owner: A user, whose mobile device supports positioning, sends his/her location dataset $T$ to LBSP to acquire LBS.

LBSP: It clusters $T$ and obtains the parameters of the original results. Then, LBSP utilizes Cluster-Indistinguishability according to the parameters to produce noise to obtain perturbed location dataset $T^{\prime}$ . Finally, LBSP clusters $T^{\prime}$ and provides cluster results to Data Analyzer.

Data Analyzer: The third party requesting statistical results and conducting clustering analysis.

Data Owner uploads its location data to LBSP to obtain better location-based services. LBSP collects trajectories from various Data Owners and sends trajectory clustering results to Data Analyzer. After obtaining the clustering results, Data Analyzer can mine useful information, such as commercial hot spots, and returns the mining results to LBSP. Based on the mining results, LBSP can improve some business decisions (e.g., advertisement recommending).

Nonetheless, if LBSP directly sends Data Owner’s original trajectories to Data Analyzer, a malicious Data Analyzer may violate Data Owner’s location privacy. Thus, before sending trajectories to Data Analyzer, LBSP should randomize the real trajectories and provide the processed location data to Data Analyzer.

In the process, the domain expert plays the role of designing a randomized algorithm to preserve Data Owner’s location privacy. The procedure of Cluster-Indistinguishability is shown in Fig. 7.

Figure 7.

Procedure of Cluster-Indistinguishability.

Firstly, we describe the model of typical clustering algorithms. Secondly, we define differential privacy for trajectory clustering. Then, we derive the PDF of noise in the Cartesian coordinate system. Finally, we transform the form of noise from a Cartesian coordinate system to a Polar coordinate system to practically apply noise.

4. Methodology

In this section, we present the design of our methodology. We first give the definition of Cluster-Indistinguishability, which can guarantee differential privacy for trajectory clustering. We then conduct a noise design according to the definition and apply it in a Polar coordinate system.

4.1 Cluster-Indistinguishability

We demonstrate in the attack model (Fig. 1) that the privacy of $n c$ should be protected from being disclosed in cluster results. On one hand, we need to protect the precise location data of every point in the cluster. Therefore, after applying randomized algorithm $M$ to all points in the cluster, the precise value of every point should be perturbed to prevent attackers from distinguishing whether a single point is in a certain cluster. On the other hand, in terms of data utility, the result of clustering the perturbed dataset should remain invariant. These two requirements are described in Assumption 1.

Assumption 1. For all $p_{i}\in\left({R^{+}}\right)^{d}$, there exists a randomized algorithm, $M$, such that both of the following conditions hold:

a. $M\left({p_{i}}\right)\notin T$ ; b. $\frac{\textit{Clu}_{\textit{num}}\left[{M\left(T\right)}\right]}{\textit{Clu}_{\textit{num}}\left[{M\left(T^{\prime}\right)}\right]}=\frac{\textit{Cor}\left({\textit{Clu}\left[{M\left(T\right)}\right]}\right)}{\textit{Cor}\left({\textit{Clu}\left[{M\left(T^{\prime}\right)}\right]}\right)}$

Condition (a) indicates that the processed positions $M\left({p_{i}}\right)$ do not belong to the original location set $T$, while condition (b) indicates that the cluster results, including the number of points and the center, remain unchanged. Assumption 1 gives the conditions that a complete randomized algorithm should satisfy.

However, owing to the existence of a large amount of auxiliary information, Dwork [18] proved that a complete privacy preserving mechanism does not exist. Geo-Indistinguishability achieves a good balance in security and utility of a single point and is suitable for applications of location-based query services. Intuitively, however, it cannot be applied in privacy preserving for trajectory clustering because of the cluster radius constraint. The number of points and the cluster center are two core elements for obtaining the cluster results; nevertheless, they tend to leak privacy. Thus, we discuss the effect that $n$ and $c$ have on cluster results.

In order to solve the privacy issue of trajectory clustering, differential privacy adds noise to the original positions. Therefore, the value of the number of points $n$ and the center of cluster $c$ will change because of the perturbed noise. If randomized algorithm $M$ can guarantee that $n c$ does not change after applying $M$ , then $M$ can guarantee differential privacy for trajectory clustering. We give the definition of Cluster-Indistinguishability for trajectory clustering as follows.

Definition 3. Cluster-Indistinguishability

$\displaystyle\frac{P\left[{\left({\textit{Clu}_{\textit{num}}\left[{M\left(T\right)}\right]}\right)\ast\textit{Cen}\left({\textit{Clu}\left[{M\left(T\right)}\right]}\right)\in S}\right]}{P\left[{\left({\textit{Clu}_{\textit{num}}\left[{M\left({T}^{\prime}\right)}\right]}\right)\ast\textit{Cen}\left({\textit{Clu}\left[{M\left({T}^{\prime}\right)}\right]}\right)\in S}\right]}\leqslant e^{\varepsilon}$ (9)

where $S$ is the output result of $n c$ and $P$ is the probability distribution function.

We limit the maximum effect that a single position has on cluster results to $\exp\left(\varepsilon\right)$. An algorithm $M$ satisfying Cluster-Indistinguishability can guarantee that an attacker cannot acquire an individual’s precise position by observing the cluster results. Furthermore, $M$ maintains the invariance of the cluster results.

4.2 Noise design

The definition of Cluster-Indistinguishability is given in Section 4.1. In this section, we illustrate the noise design procedure. Noise in differential privacy is a Laplace distribution, and the scale $\lambda$ of Laplace is determined by Eq. (5) [19], where $\Delta f$ is the sensitivity function representing the effect that a single record has on the query result. The value of $\Delta f$ is determined by the query function, $Q$ .

Because the third party obtains the number and centroid of cluster $n$ and $c$ from LBSP, we denote the query functions requested by the third party as $Q_{\textit{num}}$ , $Q_{c}$ , respectively. Next, the query function requested by the third party is:

$\displaystyle Q=Q_{\textit{num}}\ast Q_{c}$ (10)

Then, the sensitivity function is:

$\displaystyle\Delta f=\Delta f_{\textit{num}}\ast\Delta f_{c}$ (11)

The total numbers of points in $T$ and $T^{\prime}$ differ by only one; therefore, $\Delta f_{\textit{num}}=1$, and then $\Delta f=\Delta f_{c}$. The value of $\Delta f$ is determined by Property 1.

Property 1. The sensitivity function $\Delta f$ of one cluster is:

$\displaystyle\Delta f=\frac{1}{n}\left({p_{\max}-c^{\prime}}\right)$ (12)

where $p_{\max}$ is the point with the maximum coordinate value in the cluster.

Proof. We denote $\textit{sum}=nc$ as the sum of points, then

$\displaystyle\Delta f=\Delta f_{c}=\frac{\textit{sum}}{n}-\frac{\textit{sum}^{\prime}}{n-1}=\frac{\left({n-1}\right)\left({\textit{sum}-\textit{sum}^{\prime}}\right)-\textit{sum}^{\prime}}{n\left({n-1}\right)}=\frac{1}{n}p_{\max}-\frac{1}{n\left({n-1}\right)}\textit{sum}^{\prime}=\frac{1}{n}p_{\max}-\frac{1}{n}c^{\prime}=\frac{1}{n}\left({p_{\max}-c^{\prime}}\right)$
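The identity in Property 1 can be verified numerically. The sketch below assumes a one-dimensional cluster in which the removed record is the largest point $p_{\max}$, so that $\textit{sum}-\textit{sum}^{\prime}=p_{\max}$; the function name and the sample values are hypothetical.

```python
from typing import List, Tuple


def centroid_shift(points: List[float]) -> Tuple[float, float]:
    """Return (c - c', (p_max - c') / n) for a 1-D cluster.

    c is the centre of all n points; c' is the centre after the largest
    point p_max has been removed.
    """
    n = len(points)
    p_max = max(points)
    rest = list(points)
    rest.remove(p_max)                     # drop one occurrence of p_max
    c = sum(points) / n
    c_prime = sum(rest) / (n - 1)
    direct = c - c_prime                   # left-hand side of the proof
    closed_form = (p_max - c_prime) / n    # right-hand side, Eq. (12)
    return direct, closed_form


direct, closed_form = centroid_shift([2.0, 3.0, 7.0, 12.0])
```

For these values $c=6$ and $c^{\prime}=4$, so both expressions equal $2$.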

4.3 Noise implementation

We can obtain the PDF of one-dimensional noise through the above analysis. However, location data is two-dimensional; thus, we should consider the PDF of the Laplace distribution in two-dimensional space.

$\displaystyle\rho_{f}\left(r^{\prime}\right)=\frac{n^{2}\varepsilon^{2}}{2\pi\left({p_{\max}-c^{\prime}}\right)^{2}}re^{-\frac{n\varepsilon r}{p_{\max}-c^{\prime}}}$ (13)

where $r$ is the radius of the cluster in trajectory clustering, and $r^{\prime}$ is the radius of noise.

The two-dimensional form of the Laplace distribution in a Cartesian coordinate system is given in Eq. (13). However, it is not practical to apply it in real scenarios. Therefore, we transform the noise from a Cartesian to a Polar coordinate system. The JPDF $\rho_{f\left({r^{\prime},\theta}\right)}$ of radius $r^{\prime}$ and angle $\theta$ in the Polar coordinate system is:

$\displaystyle\rho_{f\left({r^{\prime},\theta}\right)}=\frac{n^{2}\varepsilon^{2}}{2\pi\left({p_{\max}-c^{\prime}}\right)^{2}}re^{-\frac{n\varepsilon r}{p_{\max}-c^{\prime}}}$ (14)

Then, the marginal PDFs of $r^{\prime}$ and $\theta$ are:

$\displaystyle\rho_{f\left(r^{\prime}\right)}=\int_{0}^{2\pi}{\rho_{f\left({r^{\prime},\theta}\right)}}d\theta=\frac{n^{2}\varepsilon^{2}}{\left({p_{\max}-c^{\prime}}\right)^{2}}re^{-\frac{n\varepsilon r}{p_{\max}-c^{\prime}}}$ (15)

$\displaystyle\rho_{f\left(\theta\right)}=\int_{0}^{\infty}{\rho_{f\left({r^{\prime},\theta}\right)}}dr=\frac{1}{2\pi}$ (16)

The PDF of $r^{\prime}$ is given in Eq. (15), and by Eq. (16) $\theta$ is uniformly distributed on the interval $[0,2\pi)$. Then, the perturbed coordinate is calculated by:

$\displaystyle\left\{{\begin{array}{l}{x}^{\prime}=x+{r}^{\prime}\cos\theta\\ {y}^{\prime}=y+{r}^{\prime}\sin\theta\\ \end{array}}\right.$ (17)
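One possible way to draw the noise in polar form is sketched below. It rests on an assumption: the $r$ on the right-hand side of Eq. (15) is read as the noise radius $r^{\prime}$, so that the radial density has the form $a^{2}r^{\prime}e^{-ar^{\prime}}$ with $a=n\varepsilon/\left({p_{\max}-c^{\prime}}\right)$, which is a Gamma density with shape 2 and scale $1/a$. The function names and the numeric values are hypothetical.

```python
import math
import random
from typing import Tuple


def sample_polar_noise(rate: float, rng: random.Random) -> Tuple[float, float]:
    """Draw (r', theta).

    r' has density rate**2 * r * exp(-rate * r), i.e. Gamma(shape=2,
    scale=1/rate); theta is uniform on [0, 2*pi), as in Eq. (16).
    """
    radius = rng.gammavariate(2.0, 1.0 / rate)
    theta = 2.0 * math.pi * rng.random()
    return radius, theta


def perturb(x: float, y: float, rate: float,
            rng: random.Random) -> Tuple[float, float]:
    """Apply Eq. (17) to one position."""
    radius, theta = sample_polar_noise(rate, rng)
    return x + radius * math.cos(theta), y + radius * math.sin(theta)


rng = random.Random(42)
rate = 2.0   # stands in for n * epsilon / (p_max - c'); illustrative value

# Empirical check: Gamma(shape=2, scale=1/rate) has mean 2 / rate.
radii = [sample_polar_noise(rate, rng)[0] for _ in range(20000)]
mean_radius = sum(radii) / len(radii)

x_new, y_new = perturb(116.40, 39.90, rate, rng)
```

Sampling the radius and the angle separately is valid here because the JPDF in Eq. (14) does not depend on $\theta$, so the two variables are independent.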

The procedure of algorithm $M$ is elaborated in Algorithm 1.

Algorithm 1 $\left({n^{\prime},c^{\prime}}\right)=\textit{Cluster-Indistinguishability}\left(T\right)$
Input:
Original trajectory dataset $T$
Output:
Perturbed results of cluster $n^{\prime},c^{\prime}$
1: Calculate the number $n$ and center $c$ of points in the cluster by LBSP
2: Calculate the noise distribution $Z$ according to Eqs (12) and (13)
3: Transform $Z$ to a Polar coordinate system and obtain $r^{\prime}$ , $\theta$ according to Eq. (14)
4: Add noise to every position based on $r^{\prime}$ , $\theta$ and obtain perturbed points set $T^{\prime}$
5: Calculate the number $n^{\prime}$ and center $c^{\prime}$ of points in $T^{\prime}$
6: Return $n^{\prime}$ and $c^{\prime}$
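The steps of Algorithm 1 can be sketched end to end as follows. This is a simplified illustration, not the reference implementation: the clustering step Clu is omitted and the input is assumed to already form a single cluster, $p_{\max}$ is assumed to be the point farthest from the center, $p_{\max}-c^{\prime}$ is measured as a Euclidean distance, and the radius is drawn from a Gamma distribution with shape 2 (reading the $r$ in Eq. (15) as the noise radius). All function names are hypothetical.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, float]


def centroid(points: List[Point]) -> Point:
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def cluster_indistinguishability(points: List[Point], epsilon: float,
                                 rng: random.Random) -> Tuple[int, Point]:
    # Step 1: number of points n and centre c of the original cluster.
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points")
    c = centroid(points)
    # Step 2: sensitivity per Eq. (12). Assumption: p_max is the point
    # farthest from c, and c' is the centre of the remaining points.
    i_max = max(range(n), key=lambda i: math.dist(points[i], c))
    c_prime = centroid(points[:i_max] + points[i_max + 1:])
    spread = math.dist(points[i_max], c_prime)
    if spread == 0.0:
        raise ValueError("all points coincide")
    rate = n * epsilon / spread
    # Steps 3-4: draw (r', theta) for every point and apply Eq. (17).
    perturbed = []
    for x, y in points:
        radius = rng.gammavariate(2.0, 1.0 / rate)
        theta = 2.0 * math.pi * rng.random()
        perturbed.append((x + radius * math.cos(theta),
                          y + radius * math.sin(theta)))
    # Steps 5-6: recompute and return the perturbed results n', c'.
    return len(perturbed), centroid(perturbed)


cluster = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (1.5, 1.5), (4.0, 3.0)]
n_out, c_out = cluster_indistinguishability(cluster, epsilon=1.0,
                                            rng=random.Random(7))
```

Because the sketch does not re-cluster the perturbed points, $n^{\prime}$ always equals $n$ here; in the full procedure the perturbed set $T^{\prime}$ would be clustered again by LBSP.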

5. Experimental evaluation

In this section, we report our empirical evaluation conducted to assess the performance of our methods. In terms of privacy evaluation, we theoretically analyze the security of our algorithm and verify its validity by experiments. For a utility evaluation, we conducted qualitative and quantitative analyses to evaluate the effect of our algorithm on a single point and cluster results.

5.1 Experiment setup

We conducted our experiments on three real-world trajectory datasets. The experiments were performed on an Intel Core 2 Quad 3.06-GHz Windows 7 machine equipped with 8 GB of main memory. Each experiment was run 1,000 times.

Geolife trajectory dataset: Owing to the Geolife project [20, 21], we obtained the published real trajectories of volunteers. This global positioning system (GPS) trajectory dataset was collected in the (Microsoft Research Asia) Geolife project by 182 users over five years (from April 2007 to August 2012). A GPS trajectory of this dataset is represented by a sequence of time-stamped points, each of which contains the latitude, longitude, and altitude coordinates. This dataset contains 17,621 trajectories with a total distance of 1,292,951 km and a total duration of 50,176 h.

T-Drive taxi trajectories: A set of GPS trajectories was recorded by 8,602 taxi cabs in Beijing, China, in May 2009. The trajectories covered the region of Beijing within the bounding box (39.788N, 116.148E) and (40.093N, 116.612E), approximately 34 km $\times$ 40 km. The raw sampling rate of these trajectories ranged from 30 s to 5 min. The dataset consisted of approximately 4.3 million trips with passengers. Each trip was linearly interpolated into a sequence of location points at 30-s sampling intervals.

Check-in dataset: The dataset consisted of the check-in data generated by over 49,000 users in New York City and 31,000 users in Los Angeles as well as the social structure of the users. Each check-in included a venue ID, the category of the venue, a time stamp, and a user ID.

5.2 Privacy evaluation

We first theoretically analyze the privacy preserving degree of Cluster-Indistinguishability. $M$ satisfying Property 2 can guarantee $\left({n-1}\right)\varepsilon$-differential privacy.

Property 2. If noise distribution $Z$ added to the raw location data satisfies the Laplace distribution:

$\displaystyle Z\sim\textit{Lap}\left[{\frac{p_{\max}-c^{\prime}}{n\varepsilon}}\right]$

then $M$ satisfies $\left({n-1}\right)\varepsilon$-differential privacy.

Proof. If $h_{f}(T)$ denotes the PDF of original dataset $T$ , and if $p_{f}(Z)$ denotes the PDF of noise distribution $Z$ , then

$\displaystyle\frac{\Pr\left[{\left({\textit{Clu}_{\textit{num}}\left[{M\left(T\right)}\right]}\right)\ast\textit{Cen}\left({\textit{Clu}\left[{M\left(T\right)}\right]}\right)\in S}\right]}{\Pr\left[{\left({\textit{Clu}_{\textit{num}}\left[{M\left({T^{\prime}}\right)}\right]}\right)\ast\textit{Cen}\left({\textit{Clu}\left[{M\left({T^{\prime}}\right)}\right]}\right)\in S}\right]}=\frac{\Pr\left[{\textit{sum}+z=s}\right]}{\Pr\left[{\textit{sum}^{\prime}+z=s}\right]}=\frac{p_{f}\left[{s-h_{f}\left({\textit{sum}}\right)}\right]}{p_{f}\left[{s-h_{f}\left({\textit{sum}^{\prime}}\right)}\right]}$

Figure 8.

Probability distribution of different methods.

where $z\in Z$ and $s\in S$. Since $Z$ follows the Laplace distribution, we have

$\displaystyle\begin{aligned}\frac{p_{f}\left[{s-h_{f}\left({\textit{sum}}\right)}\right]}{p_{f}\left[{s-h_{f}\left({\textit{sum}^{\prime}}\right)}\right]}&\leqslant e^{\frac{n\varepsilon}{p_{\max}-c^{\prime}}\cdot\left|{\textit{sum}-\textit{sum}^{\prime}}\right|}\\ &=e^{\frac{n\varepsilon}{p_{\max}-c^{\prime}}\cdot\left|{nc-\left({n-1}\right)c^{\prime}}\right|}\\ &\leqslant e^{\frac{n\varepsilon}{p_{\max}-c^{\prime}}\cdot\left({n-1}\right)\left|{c-c^{\prime}}\right|}\\ &=e^{\frac{n\varepsilon}{p_{\max}-c^{\prime}}\cdot\left({n-1}\right)f}\\ &=e^{\left({n-1}\right)\varepsilon}\end{aligned}$

To achieve $\varepsilon$-differential privacy for an individual’s sensitive information in a trajectory cluster, the noise distribution $Z$ should be:

$\displaystyle Z\sim\textit{Lap}\left[{\frac{\left({n-1}\right)\left({p_{\max}-c^{\prime}}\right)}{n\varepsilon}}\right]$ (18)
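The effect of rescaling the noise in Eq. (18) can be checked numerically. The sketch below is a one-dimensional illustration only, not the two-dimensional mechanism of the paper. It treats $p_{\max}-c^{\prime}$ as a scalar distance named `spread`, and it reads the last step of the proof as implying a worst-case shift of $\left({n-1}\right)f$ with $f=\left({p_{\max}-c^{\prime}}\right)/n$. The function names are hypothetical.

```python
import random


def laplace_scale(n, spread, epsilon):
    """Scale b of Eq. (18): b = (n - 1) * spread / (n * epsilon).

    spread stands in for the scalar distance p_max - c' (an assumption).
    """
    if n < 2 or spread <= 0 or epsilon <= 0:
        raise ValueError("need n >= 2, spread > 0 and epsilon > 0")
    return (n - 1) * spread / (n * epsilon)


def privacy_loss_bound(n, spread, scale):
    """Exponent of the density-ratio bound for Laplace noise with this scale.

    Uses the worst-case shift (n - 1) * f from the proof, with f = spread / n
    (our reading of the proof's last step, labeled here as an assumption).
    """
    return (n - 1) * (spread / n) / scale


def sample_laplace(scale, rng=random):
    """Draw zero-mean Laplace noise as the difference of two Exp(1) draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
```

With `n = 10`, `spread = 600` and `epsilon = 1`, the scale of Property 2 (`600 / 10 = 60`) gives a bound of 9, that is $\left({n-1}\right)\varepsilon$, whereas the scale of Eq. (18) (`540`) brings the bound back to 1.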

To verify the privacy-preserving performance of Cluster-Indistinguishability, we analyzed the probability distribution of the cluster result $nc$ (the number of points multiplied by the cluster center), which an adversary can use to launch an attack. We compiled this distribution for each of the clustering privacy-preserving methods. The results are shown in Fig. 8.

Figure 8 shows the probability distribution of the different algorithms. A distribution that is close to the original data distribution indicates a high degree of privacy preservation. Since $k$-anonymity generalizes points in a certain area, its probability distribution remains invariant in some intervals. P$k$-means and IDP$k$-means have similar distributions because they are designed for the same clustering algorithm, $k$-means. Geo-Indistinguishability performs best among the existing algorithms; however, the proposed algorithm is closer to the original data distribution. The quantitative comparison results are shown in Table 2. The privacy-preserving degree $\varepsilon$ is set to 1, and the corresponding protection radius, $r$, is 600 m.

Table 2

$\varepsilon$ of different methods

Method	Geolife	T-Drive	Check-in
$k$-anonymity	2.9518	3.3581	1.0425
P$k$-means	0.9836	1.2584	0.7459
IDP$k$-means	0.9625	1.1429	0.7286
DP-DBSCAN	0.8162	0.9362	0.6240
Geo-Indistinguishability	0.5936	0.6423	0.5014
Cluster-Indistinguishability	0.5401	0.5907	0.4207

Figure 9.

Cluster results.

Figure 10.

Total number of clusters.

Figure 11.

Average relative error of noisy position.

Figure 12.

Homogeneity of clusters.

Figure 13.

Completeness of clusters.

Figure 14.

$V$-measure of clusters.

From the results in Table 2, we conclude that Cluster-Indistinguishability has the smallest $\varepsilon$ and therefore provides the strongest privacy protection among these methods. The value of $\varepsilon$ for $k$-anonymity is 2.9518, 3.3581, and 1.0425 on Geolife, T-Drive, and Check-in, respectively, which is far beyond our setting. Therefore, $k$-anonymity performs worse in trajectory clustering than it does on one-dimensional datasets. The experimental results show that the measured $\varepsilon$ of Cluster-Indistinguishability is roughly half of the setting, which means that we can guarantee effective privacy preservation.

5.3 Utility evaluation

To evaluate the effect of Cluster-Indistinguishability on the cluster results, we analyze them both qualitatively and quantitatively.

5.3.1 Qualitative analysis

We chose the trajectories of 60 users in Geolife as test objects. The cluster result of the original location data is shown in Fig. 9a. We then set $\varepsilon$ to 0.5, which corresponds to a protection radius of 300 m.

The cluster result of the noisy position is shown in Fig. 9b. Comparing these two figures, we can conclude that our algorithm has a minimal effect on the cluster results.

5.3.2 Quantitative analysis

In terms of quantification, we illustrate the performance of Cluster-Indistinguishability on the number of clusters and the error of the individual position. We also comprehensively evaluate the cluster results.

Total number of clusters

To evaluate the effect of Cluster-Indistinguishability on the number of clusters, we collect statistics of the total number of clusters under different values of $\varepsilon$. To validate our algorithm, we compare Cluster-Indistinguishability with existing methods. The experimental results are shown in Fig. 10.

Figure 10 shows how the total number of clusters changes as the degree of privacy preservation decreases. Cluster-Indistinguishability achieves the best results, with cluster counts nearest to those of the original data. Owing to its spatial generalization, $k$-anonymity performs poorly on the total number of clusters. Cluster-Indistinguishability performs approximately 20% better than the best existing algorithm, Geo-Indistinguishability, because the noise in Geo-Indistinguishability is not limited by the cluster radius.

Error of the individual’s position

We proved the privacy guarantee of Cluster-Indistinguishability in Section 5.2. In this section, we discuss the effect of Cluster-Indistinguishability on an individual’s real position. We use the average relative error to evaluate the error of the individual position. The average relative error of the user’s position in a cluster is defined as follows:

$\displaystyle\textit{Error}=\frac{1}{n}\sum\limits_{i=1}^{n}{\frac{d\left({p^{\prime}_{i},p_{i}}\right)}{d\left({p_{i},c}\right)}}$ (19)

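The above definition measures the average offset caused by the noise added to the original positions, relative to each point’s distance from the cluster center $c$. Here, $p_{i}$ is the original position and $p^{\prime}_{i}$ is the corresponding noisy position.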
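Equation (19) translates directly into a few lines of code. The following is a minimal sketch that assumes planar coordinates and takes $d$ to be the Euclidean distance; the function name `average_relative_error` and the argument layout are chosen for illustration only.

```python
import math


def average_relative_error(original, noisy, center):
    """Average relative error of Eq. (19).

    original: list of original positions p_i as (x, y) tuples
    noisy:    list of noisy positions p'_i, in the same order
    center:   cluster center c as an (x, y) tuple
    Assumes planar coordinates and Euclidean distance for d (an assumption).
    """
    if not original or len(original) != len(noisy):
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for p, p_noisy in zip(original, noisy):
        base = math.dist(p, center)
        if base == 0.0:
            # Eq. (19) is undefined for a point located exactly at the center.
            raise ValueError("a point coincides with the cluster center")
        total += math.dist(p_noisy, p) / base
    return total / len(original)
```

For instance, if two points lie 5 and 10 units from the center and are displaced by 1 and 2 units, respectively, each contributes a relative error of 0.2, so the average is 0.2.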

To verify the effectiveness of Cluster-Indistinguishability, we compare Cluster-Indistinguishability with existing privacy preserving methods for clustering. The results are shown in Fig. 11.

From Fig. 11, we observe that the degree of privacy preservation gradually weakens as $\varepsilon$ increases: the noise added to the original dataset becomes smaller, and the relative error of all algorithms shows a downward trend. Furthermore, the error of P$k$-means and IDP$k$-means is more than 40%, while that of DP-DBSCAN is below 30%. This is because DBSCAN is a density-based spatial method and can provide more accurate clustering results than $k$-means. Since the noise in Cluster-Indistinguishability is derived from its JPDF, it performs better than DP-DBSCAN, whose noise distribution is independent. The average relative error of Cluster-Indistinguishability is approximately 15%, which can support LBS.

Performance of clusters

Compared with the $F$-measure, the evaluation criteria in [22] more accurately reflect the correlation and similarity between the original and perturbed clusters. We therefore use them to evaluate the results of our experiments. They comprise three criteria: homogeneity, completeness, and $V$-measure.

Assume that a trajectory dataset is partitioned in two ways: into a set of classes, $C=\left\{{c_{i}|i=1,\cdots,n}\right\}$, and into a set of clusters, $K=\left\{{k_{i}|i=1,\cdots,m}\right\}$. Then, homogeneity means that each cluster contains only members of a single class:

$\displaystyle\textit{homogeneity}=\left\{{\begin{array}[]{ll}1&\textit{if}\quad H\left({C,K}\right)=0\\ 1-\frac{H\left({C\mid K}\right)}{H\left(C\right)}&\textit{else}\\ \end{array}}\right.$ (20)

where $H\left(\cdot\right)$ is entropy and $H\left({\cdot\mid\cdot}\right)$ is conditional entropy. Completeness means that all members of a given class are assigned to the same cluster. Following [22], it is normalized by the entropy of the clusters, $H\left(K\right)$:

$\displaystyle\textit{completeness}=\left\{{\begin{array}[]{ll}1&\textit{if}\quad H\left({K,C}\right)=0\\ 1-\frac{H\left({K\mid C}\right)}{H\left(K\right)}&\textit{else}\\ \end{array}}\right.$ (21)

$V$-measure is the harmonic mean of homogeneity and completeness:

$\displaystyle\textit{V-measure}=2\ast\frac{\textit{homogeneity}\ast\textit{completeness}}{\textit{homogeneity}+\textit{completeness}}$ (22)
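The three criteria can be computed from two label sequences with a short script. The sketch below is an illustration under stated assumptions, not the evaluation code used in the experiments. It follows the definitions of Rosenberg and Hirschberg [22], in which completeness is normalized by the cluster entropy $H(K)$, and it returns 1 whenever the normalizing entropy is zero, which also covers the degenerate case where the joint entropy is zero. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(X) of a label sequence (natural logarithm)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(a, b):
    """Conditional entropy H(A | B) of two aligned label sequences."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum((c / n) * math.log(c / b_counts[bv]) for (_, bv), c in joint.items())


def v_measure(classes, clusters):
    """Return (homogeneity, completeness, V-measure) as in Eqs. (20)-(22)."""
    h_c, h_k = entropy(classes), entropy(clusters)
    # Guard on the normalizer: a zero entropy means the score is defined as 1.
    homogeneity = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    completeness = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    total = homogeneity + completeness
    v = 0.0 if total == 0 else 2.0 * homogeneity * completeness / total
    return homogeneity, completeness, v
```

As a usage example, splitting two classes of two points each into four singleton clusters keeps every cluster pure (homogeneity 1) but scatters each class over two clusters (completeness 0.5), giving a $V$-measure of about 0.67.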

Figure 12 shows the homogeneity of clusters under different algorithms. Based on the figure, we conclude that the homogeneity of Cluster-Indistinguishability is better than that of the other algorithms and stabilizes at approximately 90%.

Figure 13 shows the completeness of the clusters. It is evident that P$k$-means and IDP$k$-means perform poorly on completeness. The completeness of DP-DBSCAN is stable at approximately 40%, while that of Cluster-Indistinguishability is 70%. The improvement in cluster completeness demonstrates that Cluster-Indistinguishability has a minimal effect on trajectory integrity.

The $V$-measure in Fig. 14 gives a comprehensive view of the cluster results. From the comparison, it is evident that Cluster-Indistinguishability improves on DP-DBSCAN, which has the best performance among existing methods, by approximately 40%.

The quantitative and qualitative experimental analyses show that the average relative error of Cluster-Indistinguishability is approximately 15%, and that the performance of the cluster results improves by approximately 40% compared with existing privacy preserving methods for clustering. Thus, Cluster-Indistinguishability can better support LBS and trajectory clustering applications.

6. Conclusions and future work

To address problems of narrow applicability, low-level utility, and difficulty in real-world applications in existing privacy preserving schemes, we herein proposed a practical differential privacy mechanism for trajectory clustering: Cluster-Indistinguishability. Firstly, we provided the general model of typical trajectory clustering algorithms and the definition of differential privacy for trajectory clustering. Then, we presented our noise design according to the definition, and derived the two-dimensional JPDF of noise. Finally, noise was transformed from a Cartesian coordinate system to a Polar coordinate system to efficiently apply it.

An experimental evaluation using three real-world datasets showed that Cluster-Indistinguishability can be applied to most typical trajectory clustering algorithms. The average error of a single point was approximately 15%, and the performance of cluster results improved by approximately 40% compared with existing methods. Therefore, Cluster-Indistinguishability can better support LBS and trajectory clustering applications. Our future work will primarily focus on differential privacy preserving mechanisms to support continuous location releasing and multiple trajectory mining tasks.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (41671443, 41371402) and the Applied Basic Research Program of Wuhan (2016010101010024). The authors are grateful to the anonymous reviewers for their constructive comments and suggested improvements.

References

1. Zheng and Xie, Learning travel recommendations from user-generated GPS traces, ACM Transactions on Intelligent Systems and Technology (TIST) 2(1) (2011), 389–396.

2. Hua, Gao and Zhong, Differentially private publication of general time-serial trajectory data, in: IEEE Conference on Computer Communications (INFOCOM), Hong Kong, China, 2015, pp. 1435–1444.

3. Andrienko and Giannotti, Movement data anonymity through generalization, in: Proceedings of the 2nd SIGSPATIAL ACM GIS 2009 International Workshop on Security and Privacy in GIS and LBS (SIGSPATIAL), Seattle, USA, 2009, pp. 1966–1974.

4. Yarovoy, Bonchi and Lakshmanan L.V.S., Anonymizing moving objects: How to hide a MOB in a crowd? in: Proceedings of the 12th International Conference on Extending Database Technology (EDBT), Saint-Petersburg, Russia, 2009, pp. 2560–2565.

5. Monreale, Andrienko and Andrienko, Movement data anonymity through generalization, Transactions on Data Privacy 3(2) (2010), 91–121.

6. Sweeney, k-anonymity: A model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5) (2013), 557–570.

7. Dwork, Differential Privacy, in: Proceedings of the 33rd International Conference on Automata, Languages and Programming (ICALP), part II 26(2) (2006), pp. 1–12.

8. Blum, Dwork and McSherry, Practical privacy: the SuLQ framework, in: Proceedings of the 24th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, New York, USA, 2005, pp. 1373–1386.

9. Dwork, Naor and Pitassi, Pan-private streaming algorithms, in: Proceedings of the 1st Symposium on Innovations in Computer Science, New York, USA, 2012, pp. 1074–1085.

10. Hao Z.F. and Wen, Research on Differential Privacy Preserving k-means Clustering, Computer Science 10(3) (2013), 287–290.

11. and Huang H.K., A DP-DBScan clustering algorithm based on differential privacy preserving, Computer Engineering & Science 37(4) (2015), 830–834.

12. Andres, Bordenabe and Chatzikokolakis, Geo-Indistinguishability: Differential Privacy for Location-Based Systems, in: Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security (CCS), New York, USA, 2013, pp. 1074–1085.

13. Abul, Bonchi and Nanni, Anonymization of moving objects databases by clustering and perturbation, Information Systems 35(8) (2010), 884–910.

14. Monreale, Andrienko and Giannotti, Movement data anonymity through generalization, Transactions on Data Privacy 3(2) (2010), 91–121.

15. S.T. and Ng K.Y., Privacy-aware location data publishing, ACM Transactions on Database Systems 35(3) (2012), 53–56.

16. Chen, Fung B.C.M., Mohammed and Desai B.C., Privacy preserving trajectory data publishing by local suppression, Information Sciences 231(9) (2013), 83–97.

17. Domingo-Ferrer and Soria-Comas, From t-closeness to differential privacy and vice versa in data anonymization, Knowledge-Based Systems, Stevens Point, WI, 2015, 319–324.

18. Dwork, A firm foundation for private data analysis, Communications of the ACM 54(1) (2010), 86–95.

19. Dwork, McSherry and Nissim, Calibrating Noise to Sensitivity in Private Data Analysis, Springer Berlin Heidelberg 3876(3) (2006), 265–284.

20. Zheng, Xie and Ma W.Y., GeoLife: A Collaborative Social Networking Service among User, location and trajectory, IEEE Data Engineering Bulletin 33(2) (2010), 32–40.

21. Zheng, Chen Y.K., Xie and Ma W.Y., Understanding Mobility Based on GPS Data, in: Proceedings of ACM Conference on Ubiquitous Computing (UbiComp), Seoul, Korea, 2008, pp. 1966–1974.

22. Rosenberg and Hirschberg, V-Measure: A conditional entropy-based external cluster evaluation measure, in: Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP), Prague, Czech, 2007, pp. 1435–1444.