TPPG: Privacy-preserving trajectory data publication based on 3D-Grid partition

Abstract

Privacy preservation is receiving increasing attention in trajectory data publication. In this paper, we study the challenges of anonymizing trajectory data for publication. Most existing anonymization methods directly delete the trajectories or locations that violate specific constraints, which is likely to cause a large loss of information. To address this problem, this paper proposes a trajectory privacy preservation method based on 3D-Grid partition that reduces information loss during trajectory anonymization. The method first divides the trajectory region into several spatio-temporal units (denoted as 3D-cells), and then conducts location exchange or suppression within each unit. Within each 3D-cell, the method exchanges locations among trajectories or removes a small number of locations from sub-trajectories that do not meet the conditions, rather than removing whole trajectories. Our method considers three scenarios of trajectory distribution and measures trajectory similarity based on time, orientation, spatial location and other trajectory features. After the related anonymous sub-trajectories are reconstructed, an anonymized trajectory dataset is obtained. Theoretical analysis and experimental results show that, compared to other methods, the proposed algorithm effectively preserves trajectory data privacy and improves the anonymization results in terms of accuracy and availability.

Keywords

Privacy preservation; trajectory data partition; 3D-cell; trajectory similarity measurement; trajectory anonymization; trajectory reconstruction

1. Introduction

With the rapid development of mobile smart terminals, positioning and storage technology, it is possible to collect and store location and trajectory data of a large number of moving objects. These trajectories contain abundant temporal and spatial information. Collecting, analyzing and mining trajectories can support various applications related to moving objects [1, 2, 3, 4]. Examples include location-based services, traffic monitoring, urban and road planning, user behavior analysis and travel recommendations [5, 6, 7].

Trajectory data represents the routes of moving objects. Releasing large amounts of trajectory data inevitably threatens the privacy and security of users [8]. For example, by analyzing trajectory data together with other relevant background information, an attacker can easily obtain a user's name, gender, workplace, home address, hobbies, behavior patterns, social habits and other private information, thereby harming the vital interests of users [9]. Some results have been achieved for the privacy preservation of a single location at a single time [9, 10]. However, trajectory privacy preservation for continuous location information remains to be further studied. With increasing concern about the preservation of personal private information, privacy preservation in trajectory data release has gradually become one of the research hotspots in the data mining field [11, 12, 13]. When publishing trajectory data, data publishers should ensure that the anonymized trajectory data does not reveal personal private information while maintaining high availability for accurate analysis. Therefore, how to effectively preserve the trajectory privacy of moving objects without destroying data usability has become an urgent problem in trajectory data publication.

Because generalization can achieve a good balance between individual privacy preservation and trajectory data availability, the generalization-based trajectory $k$-anonymity model is widely used. Most existing anonymity methods directly delete the trajectories or locations that violate specific constraints [5, 14], which is likely to cause a large loss of information. To address this problem, this paper designs an algorithm based on 3D-Grid partition that ensures the privacy preservation of users and effectively improves data availability. The algorithm, denoted as TPPG, implements location exchange based on the similarity between sub-trajectories within each spatio-temporal unit (denoted as a 3D-cell).

To sum up, this paper makes the following contributions:

1. Effective 3D-Grid partition of trajectory data. A novel and effective trajectory partition method that retains the potential features of trajectory data is presented, in order to benefit trajectory similarity evaluation and trajectory anonymization. The method is based on the spatio-temporal features of trajectories, and a new concept, the 3D-cell, is proposed.
2. Trajectory similarity measurement based on different scenarios. Three scenarios of trajectory distribution are analyzed and a comprehensive trajectory distance calculation method is proposed, which is used to measure the similarity between any two trajectories.
3. Location exchange and trajectory reconstruction. A privacy preservation method for trajectory data publication, denoted as TPPG, is proposed. It is based on the 3D-Grid partition of trajectory data and on similarity measurement between sub-trajectories within each 3D-cell.
4. Experimental study. The proposed algorithm is tested on both synthetic and real-life trajectory datasets. Theoretical analysis and experimental results show that our algorithm effectively preserves trajectory data privacy and improves the accuracy and availability of the anonymized trajectory dataset.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the preliminary concepts of an enhanced trajectory model. Section 4 introduces the trajectory privacy preservation model and its measurement. Section 5 presents the trajectory privacy preservation algorithm. Experimental results and analysis are presented in Section 6. Section 7 concludes the paper and provides future research directions.

2. Related work

In recent years, trajectory privacy preservation has received more and more attention in academia [14, 15, 16, 17, 18]. Many achievements have been made on the privacy preservation of traditional relational data [19, 20, 21], but they cannot be directly applied to trajectory data, because trajectory data contains rich spatio-temporal information and is significantly different from relational data. In the case of trajectory privacy preservation, any location or time can be used as a quasi-identifier.

Research on trajectory privacy preservation mainly addresses privacy-related issues in two applications. One is privacy preservation of trajectory data publication. The other is trajectory privacy preservation in LBS (Location Based Service) [22]. This paper studies the former, i.e., privacy-preserving trajectory data publication. In trajectory data publication, privacy of moving objects may be compromised because of two aspects: the disclosure of sensitive or frequently accessed locations in trajectories, and the disclosure caused by the linkage between trajectories and external knowledge.

Privacy-preserving trajectory publication methods can be classified into three categories.

2.1 Perturbation-based method

This method adds false trajectories to the original dataset, or replaces real trajectories with false data, while ensuring that the perturbed data is not seriously distorted. The method is simple, but it generates a large amount of data and results in lower data availability.

Hoh et al. [23] proposed a novel time-to-confusion metric and a disclosure control algorithm to deal with trajectory privacy preservation problems. Mano et al. [24] proposed a method that replaces user name with a pseudonym to achieve privacy-preserving trajectory data publication.

2.2 Suppression-based method

This method releases trajectory data selectively. Sensitive data and frequent locations are suppressed or transformed before publication. It is a simple method, but it causes large information loss and low data availability.

Terrovitis and Mamoulis [25] proposed a trajectory privacy preservation method based on data suppression. Trajectory privacy is preserved by not releasing sensitive locations or frequently accessed locations. Chen et al. [11] performed trajectory privacy preservation based on the $(K,C)_{L}$-privacy model, but considerable information loss is generated by the process of global and local suppression. If information is excessively suppressed, a large amount of information is lost and the availability of the real trajectory data is greatly reduced. Zhao et al. [8] proposed two solutions for choosing reasonable suppression locations. The first is to suppress the whole sensitive trajectory or to add false data to the trajectory dataset. The second is to adopt a specific algorithm to realize local suppression. Terrovitis et al. [26] proposed four intuitive techniques based on combinations of location suppression and trajectory splitting for privacy-preserving publication of trajectory data. The above methods can cause significant data loss and reduce data availability.

2.3 Generalization-based method

This method extracts locations from trajectories, groups them into spatial clusters and uses the centroids of the clusters as generating points for anonymous region.

Trajectory $k$-anonymity, the most popular trajectory privacy-preserving method, belongs to this category. It is derived from the $k$-anonymity model proposed by Sweeney [19] for privacy preservation of traditional relational data. The trajectory $k$-anonymity model requires a set of original trajectories to be converted into a set of anonymized trajectories, each of which cannot be distinguished from at least $k-1$ other anonymized trajectories. In the anonymized trajectory dataset, the probability of any trajectory or location being accurately identified is reduced to at most $1/k$. However, it generates considerable information loss and requires a lot of computing time.

Abul et al. [27] proposed the NWA (never walk alone) method, which divides a trajectory dataset into disjoint subsets based on ($k$, $\delta$)-anonymity, where the threshold $\delta$ is the radius of a cylinder. In this method, the Euclidean distance metric is used and time must be synchronized across trajectories. Nergiz et al. [28] proposed a trajectory anonymization algorithm based on generalization. First, locations are grouped into different clusters. Second, the locations within each cluster are generalized into a minimum anonymous rectangle. Third, the anonymized trajectory data is reconstructed and published. To avoid the large number of outliers caused by the Euclidean distance calculation, Abul et al. [29] further proposed the W4M (wait for me) method, which uses the edit distance EDR to measure the similarity between trajectories and implements anonymization within each cluster of trajectories. However, many clusters contain very few trajectories. If a cluster contains fewer than $k$ trajectories, trajectory $k$-anonymization cannot be realized. Monreale et al. [30] generalized the original trajectory $T$ into a trajectory $g(T)$ without any time information, so that each cluster has at least $k-1$ other trajectories. The availability is measured based on the clustering results. Its main drawback is that the time information is ignored. Huo et al. [31] achieved privacy preservation of historical trajectory data based on a graph partition method. The trajectory $k$-anonymity problem is turned into a graph partition problem, and the information loss of the anonymization process is reduced by minimizing the cost of the partition. To further reduce trajectory information loss, they proposed the YCWA (you can walk alone) method [32], which extracts important stay points from trajectory data and anonymizes those stay points to protect user privacy.
To address the privacy disclosure problem in various social applications, the same research team [33] also proposed a trajectory privacy-preserving method, PrivateCheckIn, which establishes a $k$-anonymity prefix tree for the cached sign-in sequence and obtains a $k$-anonymous sign-in sequence. The method realizes trajectory $k$-anonymity, but cannot prevent attacks based on background knowledge. Considering such attacks, Domingo-Ferrer and Trujillo-Rasua [5] proposed two heuristic trajectory anonymization methods to address trajectory privacy disclosure caused by sensitive locations. One is a trajectory micro-aggregation method based on trajectory distance calculation and position alignment. The other is based only on position alignment, and its target is not $k$-anonymization but $k$-diversification. Reachability is considered in this method. However, both methods have shortcomings in the measurement of trajectory similarity, since they do not consider the shape and direction features of trajectories. To address this problem, Wang et al. [14] improved the calculation of the distance between trajectories on the basis of [5] and introduced the trajectory shape distance. However, in order to achieve trajectory $k$-anonymity and location $k$-anonymity based on clustering, these two methods lose a lot of information, including locations, trajectories and spatio-temporal features.

The ability of trajectory privacy preservation and the availability of trajectory data are mutually restricted. The above methods have achieved good results in preserving users’ privacy. However, on the one hand, these methods may suppress data according to the frequency of access, disturb data according to time, or exchange user identifiers with pseudonyms, without considering the information contained in trajectory itself, so the information loss is large after the anonymizing process. On the other hand, most of the methods are based on the processing of the whole trajectory, ignoring the possibility of high similarity between sub-trajectories. As a result, the published anonymized trajectory dataset with high data loss will decrease the quality of trajectory data mining.

In this paper, we partition each trajectory into several units and propose the TPPG algorithm. The method considers the similarity between different sub-trajectories and conducts anonymization within each unit, which improves the usability of trajectory data while protecting trajectory privacy.

3. Preliminary concept and problem definition

To address the privacy-preserving issue of trajectory data publication, it is necessary to determine the representation of trajectories, the distance calculation between trajectories and the trajectory privacy requirements. This section introduces an enhanced trajectory model, a series of formulas for calculating trajectory distance, and related concepts of trajectory privacy preservation. To facilitate further description, the specific problem definitions and terms used in our study are listed below.

Definition 1. (Trajectory): A trajectory refers to an ordered set of time-stamped locations, denoted as $T$ .

$\displaystyle T=\{\textit{Tid},(t_{1},x_{1},y_{1}),(t_{2},x_{2},y_{2}),\cdots,(t_{n},x_{n},y_{n})\}$ (1)

where $t_{r}<t_{r+1}$ for all $1\leqslant r<n$ , $t_{r}$ is a timestamp, $(x_{r},y_{r})$ is a location in ${\rm R}^{2}$ , $(t_{r},x_{r},y_{r})$ is a time-stamped location which means that a moving object is at location $(x_{r},y_{r})$ at time $t_{r}$ , Tid is the identification of a trajectory, representing a moving object or individual user, and $n$ is the number of sampled points in the trajectory.

Definition 2. ($pt$%-contemporary trajectories) [5]: Consider two trajectories $T_{i}=\{i,(t_{1}^{i},x_{1}^{i},y_{1}^{i}),\cdots,(t_{n}^{i},x_{n}^{i},y_{n}^{i})\}$ and $T_{j}=\{j,(t_{1}^{j},x_{1}^{j},y_{1}^{j}),\cdots,(t_{m}^{j},x_{m}^{j},y_{m}^{j})\}$. Suppose that $t_{n}^{i}\neq t_{1}^{i}$ and $t_{m}^{j}\neq t_{1}^{j}$. Their $pt$%-contemporary value $pt$ is defined as:

$\displaystyle pt=100*\min\left(\frac{\Delta t}{t_{n}^{i}-t_{1}^{i}},\frac{\Delta t}{t_{m}^{j}-t_{1}^{j}}\right)$ (2)

where $\Delta t$ is calculated as:

$\displaystyle\Delta t=\max\left(\min(t_{n}^{i},t_{m}^{j})-\max(t_{1}^{i},t_{1}^{j}),0\right)$ (3)

If $pt=0$, the two trajectories are not contemporary. $pt=100$ if and only if they start at the same time and end at the same time. Denote the overlap time of two trajectories $T_{i}$ and $T_{j}$ as $ol(T_{i},T_{j})$, which starts at $\max(t_{1}^{i},t_{1}^{j})$ and ends at $\min(t_{n}^{i},t_{m}^{j})$. Hence, $ol(T_{i},T_{j})=\{\max(t_{1}^{i},t_{1}^{j}),\ldots,\min(t_{n}^{i},t_{m}^{j})\}$.

If $t_{n}^{i}=t_{1}^{i}$ or $t_{m}^{j}=t_{1}^{j}$ (that is, there is only one location in $T_{i}$ or $T_{j}$), then $pt$ is set to $-2$ in order to distinguish this case.
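Definition 2 can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a trajectory is stored as a time-ordered list of `(t, x, y)` tuples without the identifier; the function name `pt_contemporary` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (t, x, y); representation assumed for this sketch


def pt_contemporary(ti: List[Point], tj: List[Point]) -> float:
    """Return the pt value of Eqs. (2)-(3), or -2 if a trajectory has zero duration."""
    dur_i = ti[-1][0] - ti[0][0]
    dur_j = tj[-1][0] - tj[0][0]
    if dur_i == 0 or dur_j == 0:
        # Single-location case described at the end of Definition 2.
        return -2.0
    # Eq. (3): length of the common time interval, clipped at zero.
    delta_t = max(min(ti[-1][0], tj[-1][0]) - max(ti[0][0], tj[0][0]), 0.0)
    # Eq. (2): overlap as a percentage of the longer-lasting trajectory.
    return 100.0 * min(delta_t / dur_i, delta_t / dur_j)


a = [(0, 0.0, 0.0), (10, 1.0, 1.0)]
b = [(5, 0.0, 0.0), (15, 1.0, 1.0)]
c = [(20, 0.0, 0.0), (30, 1.0, 1.0)]
print(pt_contemporary(a, b))  # overlap of 5 time units out of 10
print(pt_contemporary(a, c))  # disjoint time spans
```

Here `a` and `b` share half of their duration, while `a` and `c` do not overlap at all and are therefore not contemporary.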

Definition 3. (Trajectory unit segment): Consider a trajectory $T$ . A trajectory unit segment is defined as the line segment between any pair of adjacent time-stamped locations $((t_{r},x_{r},y_{r}),(t_{r+1},x_{r+1},y_{r+1}))$ in the trajectory $T$ , where $1\leqslant r<n$ .

Definition 4. (The r-th segment vector): The $r$ -th segment vector, denoted as $\textit{seg}_{r}^{i}$ , refers to the directed path of a moving object in the $r$ -th trajectory unit segment of the $i$ -th trajectory $T_{i}$ . It is defined as:

$\displaystyle\textit{seg}_{r}^{i}=(t_{r+1}^{i}-t_{r}^{i},x_{r+1}^{i}-x_{r}^{i},y_{r+1}^{i}-y_{r}^{i})$ (4)

where $(t_{r}^{i},x_{r}^{i},y_{r}^{i})$ and $(t_{r+1}^{i},x_{r+1}^{i},y_{r+1}^{i})$ are respectively the $r$ -th and ( $r+1$ )-th locations of $T_{i}$ , $1\leqslant r<n$ .

Hereinafter, $TS=\{T_{1},T_{2},\cdots,T_{p}\}$ is a set consisting of $p$ trajectories, $\textit{loc}_{r}^{i}$ is the $r$ -th time-stamped location of the $i$ -th trajectory.

Definition 5. (Trajectory orientation distance): $\forall T_{i},T_{j}\in TS$ , the trajectory orientation distance, denoted as $\textit{dist}_{o}(T_{i},T_{j})$ , refers to the distance between the two trajectories calculated based on the angles between two segment vectors. If $T_{i}$ and $T_{j}$ are $p t$ %-contemporary trajectories with $pt>$ 0,

$\displaystyle\textit{dist}_{o}(T_{i},T_{j})=\frac{1}{pt*(|ol(T_{i},T_{j})|-1)}\sum\limits_{r=st_{ij}}^{et_{ij}-1}\arccos\left(\frac{\textit{seg}_{r}^{i}\cdot\textit{seg}_{r}^{j}}{|\textit{seg}_{r}^{i}|*|\textit{seg}_{r}^{j}|}\right)$ (5)

where “ $\cdot$ ” is the dot product operator, “arccos” function is used to calculate the angle between two vectors, $p t$ is calculated by Eq. (2), $st_{ij}$ and $et_{ij}$ are respectively the start time and end time of $ol\left({T_{i},T_{j}}\right)$ , which are defined in Definition 2, $\textit{seg}_{r}^{i}$ and $\textit{seg}_{r}^{j}$ are calculated by Eq. (4).
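The orientation distance can be sketched in Python under some stated assumptions: both trajectories are lists of `(t, x, y)` tuples sampled at the same timestamps inside their overlap, $|ol(T_{i},T_{j})|$ is taken to be the number of shared timestamps, and the denominator of Eq. (5) is read as the product of $pt$ and $|ol(T_{i},T_{j})|-1$. The helper names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # (t, x, y); representation assumed for this sketch


def segment_vector(p: Point, q: Point) -> Point:
    """Eq. (4): component-wise difference of two consecutive time-stamped locations."""
    return (q[0] - p[0], q[1] - p[1], q[2] - p[2])


def angle_between(u: Sequence[float], v: Sequence[float]) -> float:
    """Angle in radians between two vectors, clamped against rounding error."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def orientation_distance(ti: List[Point], tj: List[Point], pt: float) -> float:
    """Eq. (5), assuming both trajectories share their timestamps in the overlap."""
    by_time_i = {p[0]: p for p in ti}
    by_time_j = {p[0]: p for p in tj}
    shared = sorted(by_time_i.keys() & by_time_j.keys())
    if pt <= 0 or len(shared) < 2:
        raise ValueError("trajectories must be contemporary with at least one shared segment")
    total = 0.0
    for a, b in zip(shared, shared[1:]):
        seg_i = segment_vector(by_time_i[a], by_time_i[b])
        seg_j = segment_vector(by_time_j[a], by_time_j[b])
        total += angle_between(seg_i, seg_j)
    return total / (pt * (len(shared) - 1))


east = [(0, 0.0, 0.0), (1, 1.0, 0.0)]
north = [(0, 0.0, 0.0), (1, 0.0, 1.0)]
print(orientation_distance(east, north, pt=100.0))
```

Because the time difference is part of each segment vector and timestamps are strictly increasing, the vector norms are never zero, so the division is safe. Two trajectories heading in the same direction at the same speed give a distance close to zero.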

Definition 6. (Trajectory location distance): We call $\textit{triple}_{r}^{i}=(t_{r}^{i},x_{r}^{i},y_{r}^{i})$ the $r$ -th time-stamped location of trajectory $T_{i}$ , $1\leqslant r\leqslant n$ . $\forall T_{i},T_{j}\in TS$ , the trajectory location distance, denoted as $\textit{dist}_{l}(T_{i},T_{j})$ , refers to the distance between the two trajectories calculated based on the time-stamped locations. If $T_{i}$ and $T_{j}$ are $p t$ %-contemporary trajectories with $pt>$ 0,

$\displaystyle\textit{dist}_{l}(T_{i},T_{j})=\frac{1}{pt*(|ol(T_{i},T_{j})|-1)}\sum\limits_{r=st_{ij}}^{et_{ij}-1}\sqrt{\sigma_{r}}$ (6)

where $\sigma_{r}$ represents the sum of the areas of the two triangles formed by the four time-stamped locations $\textit{triple}_{r}^{i}$, $\textit{triple}_{r}^{j}$, $\textit{triple}_{r+1}^{i}$ and $\textit{triple}_{r+1}^{j}$. It is calculated by the following equations, with $1\leqslant r,s\leqslant n$ and $1\leqslant i,j\leqslant p$.

$\displaystyle dt(\textit{triple}_{r}^{i},\textit{triple}_{s}^{i})=\sqrt{(x_{r}^{i}-x_{s}^{i})^{2}+(y_{r}^{i}-y_{s}^{i})^{2}}$ (7)

$\displaystyle\alpha^{i}=dt(\textit{triple}_{r}^{i},\textit{triple}_{r+1}^{i}),\quad\beta_{r}=dt(\textit{triple}_{r}^{i},\textit{triple}_{r}^{j}),\quad\gamma=dt(\textit{triple}_{r+1}^{i},\textit{triple}_{r}^{j})$ (8)

$\displaystyle\mu_{1}=\frac{\alpha^{i}+\beta_{r}+\gamma}{2},\quad\mu_{2}=\frac{\alpha^{j}+\beta_{r+1}+\gamma}{2}$ (9)

$\displaystyle\sigma_{r1}=\sqrt{|\mu_{1}(\mu_{1}-\alpha^{i})(\mu_{1}-\beta_{r})(\mu_{1}-\gamma)|}$ (10)

$\displaystyle\sigma_{r2}=\sqrt{|\mu_{2}(\mu_{2}-\alpha^{j})(\mu_{2}-\beta_{r+1})(\mu_{2}-\gamma)|}$ (11)

$\displaystyle\sigma_{r}=\sigma_{r1}+\sigma_{r2}$ (12)
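Equations (7) to (12) apply Heron's formula to the two triangles that share the diagonal $\gamma$. The following Python sketch computes $\sigma_{r}$ for one pair of segments. It assumes that $\alpha^{j}$ and $\beta_{r+1}$ are defined analogously to $\alpha^{i}$ and $\beta_{r}$ (the length of the $j$-th trajectory's segment and the distance between the two trajectories at index $r+1$), which the text does not state explicitly; the function names are hypothetical.

```python
import math
from typing import Tuple

Point = Tuple[float, float, float]  # (t, x, y); representation assumed for this sketch


def dt(p: Point, q: Point) -> float:
    """Eq. (7): Euclidean distance between the spatial parts of two triples."""
    return math.hypot(p[1] - q[1], p[2] - q[2])


def heron_area(a: float, b: float, c: float) -> float:
    """Triangle area from side lengths; abs() guards against tiny negative products."""
    mu = (a + b + c) / 2.0
    return math.sqrt(abs(mu * (mu - a) * (mu - b) * (mu - c)))


def sigma_r(pi_r: Point, pi_next: Point, pj_r: Point, pj_next: Point) -> float:
    """Eq. (12): summed area of the two triangles spanned by the four triples."""
    alpha_i = dt(pi_r, pi_next)       # segment of trajectory i
    alpha_j = dt(pj_r, pj_next)       # segment of trajectory j (assumed analogue)
    beta_r = dt(pi_r, pj_r)           # gap between the trajectories at index r
    beta_next = dt(pi_next, pj_next)  # gap at index r+1 (assumed analogue)
    gamma = dt(pi_next, pj_r)         # shared diagonal
    return heron_area(alpha_i, beta_r, gamma) + heron_area(alpha_j, beta_next, gamma)


# Two parallel unit segments one unit apart enclose a unit square.
area = sigma_r((0, 0.0, 0.0), (1, 1.0, 0.0), (0, 0.0, 1.0), (1, 1.0, 1.0))
print(area)
```

Eq. (6) then sums $\sqrt{\sigma_{r}}$ over the overlapping segments and applies the same normalization as Eq. (5).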

Definition 7. (Trajectory distance): $\forall T_{i},T_{j}\in TS$ , the trajectory distance, denoted as $\textit{dist}({T_{i},T_{j}})$ , refers to the distance between the two trajectories. If $T_{i}$ and $T_{j}$ are $p t$ %-contemporary trajectories with $pt>$ 0, we define

$\displaystyle\textit{dist}(T_{i},T_{j})=\eta*\textit{dist}^{\prime}_{o}(T_{i},T_{j})+(1-\eta)*\textit{dist}^{\prime}_{l}(T_{i},T_{j})$ (13)

where $\eta\in[0,1]$ is the weight of the trajectory orientation distance, and $\textit{dist}^{\prime}_{o}(T_{i},T_{j})$ and $\textit{dist}^{\prime}_{l}(T_{i},T_{j})$ are the standardized values of $\textit{dist}_{o}(T_{i},T_{j})$ and $\textit{dist}_{l}(T_{i},T_{j})$, respectively.

If $pt=$ 0, i.e. $T_{i}$ and $T_{j}$ are not contemporary, but there is at least one trajectory $T_{\textit{ijk}}\in\tau\subseteq TS$ , such that both ( $T_{i}$ , $T_{\textit{ijk}}$ ) and ( $T_{j}$ , $T_{\textit{ijk}}$ ) are $p t$ %-contemporary with $pt>$ 0, then

$\displaystyle\textit{dist}(T_{i},T_{j})=\min_{T_{\textit{ijk}}\in\tau}\left(\textit{dist}(T_{i},T_{\textit{ijk}})+\textit{dist}(T_{j},T_{\textit{ijk}})\right).$ (14)

Here $\tau$ is the set of trajectories that are contemporary with both $T_{i}$ and $T_{j}$. If no such trajectory exists, $\textit{dist}(T_{i},T_{j})$ is not defined.
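The two cases of Definition 7 can be sketched as follows. The sketch assumes that the orientation and location distances have already been standardized (the standardization procedure is not specified here) and that the distances of all contemporary pairs are stored in a dictionary keyed by unordered pairs of trajectory identifiers. The names `combined_distance`, `trajectory_distance` and `direct` are hypothetical.

```python
from typing import Dict, FrozenSet, Hashable, Iterable, Optional


def combined_distance(dist_o_std: float, dist_l_std: float, eta: float = 0.5) -> float:
    """Eq. (13): weighted sum of the standardized orientation and location distances."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return eta * dist_o_std + (1.0 - eta) * dist_l_std


def trajectory_distance(
    i: Hashable,
    j: Hashable,
    direct: Dict[FrozenSet, float],
    ids: Iterable[Hashable],
) -> Optional[float]:
    """Return dist(T_i, T_j); `direct` holds Eq. (13) values for pairs with pt > 0."""
    pair = frozenset((i, j))
    if pair in direct:
        return direct[pair]
    # Eq. (14): bridge through every trajectory contemporary with both i and j.
    bridged = [
        direct[frozenset((i, k))] + direct[frozenset((j, k))]
        for k in ids
        if k != i and k != j
        and frozenset((i, k)) in direct
        and frozenset((j, k)) in direct
    ]
    # None marks the undefined case (no bridging trajectory exists).
    return min(bridged) if bridged else None


direct = {
    frozenset((1, 3)): 0.2,
    frozenset((2, 3)): 0.3,
    frozenset((1, 4)): 0.1,
    frozenset((2, 4)): 0.5,
}
print(trajectory_distance(1, 2, direct, [1, 2, 3, 4]))  # bridged via trajectory 3
```

In this example trajectories 1 and 2 are not contemporary, so their distance is the smaller of the two bridged sums, which is obtained through trajectory 3.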

In this paper, trajectory distance is calculated based on trajectory orientation distance and trajectory location distance in order to improve the accuracy of trajectory similarity measurement and to reduce the loss of the trajectory inherent information. The larger the trajectory distance is, the smaller the similarity between trajectories is, and vice versa.

Definition 8. (3D-cell): Consider an integer $G$. The three-dimensional (3D) spatio-temporal space is partitioned into $G\times G\times G$ sub-parts, each of which is itself three-dimensional. We define each of the $G\times G\times G$ sub-parts as a 3D-cell.
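As an illustration of Definition 8, the sketch below maps time-stamped locations to 3D-cells. It assumes a uniform grid over the bounding box of the data in the $(t, x, y)$ space; the definition only fixes the number of cells, so the equal-width split and the names `cell_index` and `partition` are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # (t, x, y); representation assumed for this sketch
Cell = Tuple[int, int, int]


def cell_index(point: Point, lower: Point, upper: Point, g: int) -> Cell:
    """Index of the 3D-cell containing `point` in a uniform g x g x g grid."""
    idx = []
    for value, lo, hi in zip(point, lower, upper):
        if not lo <= value <= hi:
            raise ValueError("point lies outside the partitioned region")
        if hi == lo:
            k = 0  # degenerate axis: everything falls into the first slice
        else:
            k = int((value - lo) / ((hi - lo) / g))
        idx.append(min(k, g - 1))  # the upper boundary belongs to the last slice
    return (idx[0], idx[1], idx[2])


def partition(points: List[Point], lower: Point, upper: Point, g: int) -> Dict[Cell, List[Point]]:
    """Group the time-stamped locations of one trajectory by 3D-cell."""
    cells: Dict[Cell, List[Point]] = defaultdict(list)
    for p in points:
        cells[cell_index(p, lower, upper, g)].append(p)
    return dict(cells)


trajectory = [(0.0, 0.0, 0.0), (3.0, 5.0, 9.9), (10.0, 10.0, 10.0)]
cells = partition(trajectory, (0.0, 0.0, 0.0), (10.0, 10.0, 10.0), g=5)
print(sorted(cells))
```

The locations of a trajectory that fall into the same cell form the sub-trajectory on which exchange or suppression would operate.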

Definition 9. (Individual privacy): Individual privacy refers to the ability of an individual to seclude himself or herself, or information about himself or herself, and thereby to express himself or herself selectively. It covers content that the individual is unwilling to make public, such as personal interests, home address and income status.

Definition 10. (Trajectory privacy): Trajectory privacy is a special kind of individual privacy. It refers to sensitive location information, or other private information that can be derived from trajectory data using background knowledge. Therefore, trajectory privacy preservation must ensure that a user’s sensitive location/trajectory information is not leaked, and must prevent attackers from identifying a user’s personal information from his/her trajectory data.

4. Trajectory privacy preservation model and measurement

In this paper, we propose a method that allows anonymizing trajectories in order to prevent attackers from using their background knowledge to infer locations unknown to them.

4.1 Requirements of trajectory privacy preservation model

In generalization-based privacy preservation, there are two main models: the $k$-anonymity model and the Mix-zone model. The $k$-anonymity model was first proposed by Sweeney [19] for the privacy preservation of traditional relational data. In the $k$-anonymity model, attackers cannot determine the private information of any record, because at least $k-1$ other records have the same quasi-identifier. Gruteser and Grunwald [34] were the first to apply the $k$-anonymity idea to location privacy preservation. In their method, at any given time, attackers cannot distinguish the real location of a mobile user from the locations of $k-1$ other users. Therefore, the user’s location privacy is preserved.

Trajectory data contains a large amount of spatio-temporal information. However, the essence of location privacy preservation method is spatial anonymity [34], which mainly uses transformed location coordinates to protect user location privacy [9]. It can not be directly used for trajectory privacy protection because it ignores the spatiotemporal correlation characteristics of trajectory data. Therefore, it is necessary to design a specific algorithm to reduce the possibility of trajectory privacy disclosure.

Note that trajectory privacy preservation technology must take both personal privacy security and data availability into account. That is to say, a privacy-preserving trajectory data publication method which can preserve users’ personal privacy while maintaining as high data availability as possible when trajectory data is published is required. Generally speaking, the higher the degree of privacy preservation, the worse the data availability. Therefore, the evaluation of a trajectory privacy preservation method must be conducted in terms of the ability to preserve trajectory privacy and the quality of published trajectory data.

This paper mainly focuses on a variant of the trajectory $k$-anonymity model. A novel anonymity model is realized based on 3D-Grid partition. The trajectory privacy preservation model is built on a trajectory attack model, which uses a unified metric to describe the effect of an attack on trajectory data privacy.

4.2 Trajectory attack model

At present, most of trajectory privacy preservation technologies are designed based on the specific environment. When trajectory data is released, the disclosure risk of individual privacy is related to the background knowledge mastered by the attacker.

In privacy-preserving trajectory data publication, different privacy levels can be provided based on different assumptions about the original trajectory dataset, the anonymous trajectory dataset and the adversary's capabilities. An important feature of trajectory data is that any location may be sensitive, so for any location, knowing that a particular user has visited it may be useful to an attacker. Most background information is related to semantic locations, because locations and sub-trajectories can easily be exposed in real life. Therefore, following the attack model proposed in [5], we assume that the attacker has the following background knowledge:

BK1: The attacker can access any location in the anonymous trajectory dataset $TS^{*}$ and knows that each location belongs to the original trajectory dataset $TS$.

BK2: The attacker holds a sub-trajectory $S$ of the original trajectory $T$, that is, the attacker knows some consecutive locations visited by someone. The attacker also knows that $T^{*}\in TS^{*}$, where $T^{*}$ is the anonymized trajectory of $T$.

Note that, in an important difference from previous adversarial models, the attacker does not know the method we use to transform $TS$ into $TS^{*}$, let alone the parameters used in it.

According to the above background knowledge, we define the attack model as follows:

AC1: The attacker can determine that $T^{*}$ is anonymized from the original trajectory $T$ ($T\in TS$).

AC2: The attacker can determine the real trajectory $T$ based on $S$.

AC3: The attacker can determine the other locations of $T$ based on $S$.

If one of the above attack cases AC1–AC3 is met, we conclude that the user’s privacy is revealed.

4.3 Trajectory privacy preservation measurement

Trajectory privacy preservation degree is generally reflected by the disclosure risk of trajectory privacy. The smaller the risk of disclosure, the higher the degree of privacy preservation. The disclosure risk refers to the probability of trajectory privacy disclosure in certain circumstances. The disclosure risk of privacy is associated with the background knowledge mastered by the attacker. The more background knowledge the attacker has, the higher the disclosure risk of privacy is. The privacy preservation performance considered in trajectory anonymization is as follows [5].

4.3.1 Entropy metric

In information entropy theory, the term $-P(x_{i})\log P(x_{i})$ measures the information contributed by an outcome $x_{i}$, where $P(x_{i})$ is the probability of that outcome.

In our trajectory privacy preservation measurement, the idea of information entropy is referenced. $H(SS,TS^{*},TS)$ is the entropy that $T S$ is identified by attacker, $P(T_{i}^{*}\rightarrow T_{i})$ is the probability that the $i$ -th trajectory is identified by the attacker. $H(SS,TS^{*},TS)$ is calculated as follows:

$\displaystyle H(SS,TS^{*},TS)=-\sum_{i=1}^{p^{*}}P(T_{i}^{*}\rightarrow T_{i})\log_{2}P(T_{i}^{*}\rightarrow T_{i})$ (15)

where $p^{*}$ is the size of anonymized set, $S S$ is the set of sub-trajectories mastered by the attacker.
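Eq. (15) can be evaluated directly once the identification probabilities are available. The following Python sketch assumes they are supplied as a plain list (how the attacker's probabilities are estimated is outside this sketch) and uses the usual convention that a zero probability contributes nothing to the sum; the function name is hypothetical.

```python
import math
from typing import Iterable


def identification_entropy(probabilities: Iterable[float]) -> float:
    """Eq. (15): -sum of P * log2(P) over the anonymized trajectories."""
    total = 0.0
    for p in probabilities:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must lie in [0, 1]")
        if p > 0.0:  # convention assumed here: 0 * log2(0) is treated as 0
            total += p * math.log2(p)
    return -total


# Four anonymized trajectories, each identified with probability 1/4.
print(identification_entropy([0.25, 0.25, 0.25, 0.25]))
```

A trajectory that is identified with certainty contributes zero to the sum, so the value falls as the attacker becomes more certain about individual trajectories.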

4.3.2 Location and trajectory preservation degree

Consider the original trajectory $T$ of a user $U$. We investigate the disclosure probability of location privacy, $Pr(\text{loc}\in T|S,T^{*})$, and the disclosure probability of trajectory privacy, $Pr(T^{*}\rightarrow T|S,T^{*})$, given the attacker’s background knowledge $S$ and the anonymously published trajectory data $T^{*}$. The smaller the probabilities of these two attacks, the less private information is disclosed and the higher the privacy preservation degree of the algorithm. This requires calculating the probability of inferring the real location from the fake location, and evaluating, from the attacker's perspective, how accurately the real identity and geographical location of the mobile user can be recovered. $Pr(\text{loc}\in T|S,T^{*})\leqslant 1/l$ indicates that the algorithm achieves $l$-diversity of any location. $Pr(T^{*}\rightarrow T|S,T^{*})\leqslant 1/k$ indicates that the algorithm achieves $k$-anonymity of any trajectory.

4.4 Trajectory availability measurement

Privacy-preserving methods that incur data loss introduce errors between the dataset before and after anonymization, but these errors should have as little impact as possible on the results of subsequent mining and queries. Data quality is primarily measured by accuracy, completeness and consistency [35]. Estimating the availability of an anonymized trajectory dataset is challenging, since it highly depends on the intended use [26]. Unlike traditional data, trajectory data includes availability features such as the number of locations, the number of trajectories and spatio-temporal information. In order to accurately measure the availability of anonymous trajectories, the following metrics are used in our experiments [5, 14]. These metrics reflect trajectory changes and measure the effect of related queries after trajectory anonymization.

4.4.1 Average location loss

Ideally, a privacy preservation algorithm should not replace original locations with fake or inaccurate ones. That is, locations in the anonymized trajectories should be locations contained in the original trajectories, without any generalization or accuracy loss. Therefore, location loss is adopted as a metric to evaluate the trajectory privacy preservation method. Location loss, denoted as $LL(T)$, is the ratio of the number of timestamps at which the locations of $T$ and $T^{*}$ differ to the number of original locations in $T$. It is calculated as:

$\displaystyle LL(T)=\frac{\sum\nolimits_{t\in ts}f((x_{t},y_{t}),(x_{t}^{*},y_{t}^{*}))}{NL(T)}$ (16)

where $T$ is an original trajectory, $t s$ is the set of all the timestamps in $T$ , $NL(T)$ is the number of locations in $T$ , $T^{*}$ is the anonymized trajectory of $T$ , $(x_{t}^{*},y_{t}^{*})$ and $(x_{t},y_{t})$ are respectively the anonymized location and the original location at time $t$ . $f({({x_{t},y_{t}}),({x_{t}^{*},y_{t}^{*}})})$ is a function with value of 0 if $({x_{t},y_{t}})=({x_{t}^{*},y_{t}^{*}})$ while with value of 1 if $({x_{t},y_{t}})\neq({x_{t}^{*},y_{t}^{*}})$ .

The average location loss of the trajectory dataset $T S$ , denoted as avgLL, is calculated as:

$\displaystyle\textit{avgLL}=\frac{1}{|TS|}\sum\limits_{T\in TS}LL(T)\times 100\%$ (17)

where $|TS|$ is the number of trajectories in $T S$ .
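Eqs (16) and (17) can be illustrated with a minimal sketch. The dictionary layout `{timestamp: (x, y)}` and the function names are assumptions made for illustration only; a location that was suppressed in $T^{*}$ is treated here as different from the original one.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
# Assumed layout: a trajectory maps each timestamp to its (x, y) location.
Trajectory = Dict[int, Point]


def location_loss(original: Trajectory, anonymized: Trajectory) -> float:
    """LL(T) of Eq. (16): share of timestamps whose location changed."""
    if not original:
        raise ValueError("original trajectory must contain at least one location")
    # f(...) is 1 when the location differs (or is missing) and 0 otherwise.
    changed = sum(1 for t, loc in original.items() if anonymized.get(t) != loc)
    return changed / len(original)


def average_location_loss(pairs: List[Tuple[Trajectory, Trajectory]]) -> float:
    """avgLL of Eq. (17), returned as a percentage."""
    if not pairs:
        raise ValueError("the dataset must contain at least one trajectory")
    return 100.0 * sum(location_loss(t, t_star) for t, t_star in pairs) / len(pairs)
```

For a trajectory with four locations of which one is replaced and one is suppressed, `location_loss` returns 0.5.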

4.4.2 Average locations appearance ratio [26]

This metric measures the change in the number of appearances of locations between the original and the anonymized datasets. For each location $\ell$ , we define the locations appearance ratio ${\cal R}_{\ell}$ as the ratio of the number of appearances of $\ell$ in the anonymized trajectory dataset $TS^{*}$ to the number of appearances of $\ell$ in the original counterpart $T S$ . The values of ${\cal R}_{\ell}$ lie in [0, 1]: a value of 0 means that all appearances of $\ell$ are suppressed, while a value of 1 means that no appearance of $\ell$ is suppressed in the anonymized dataset. The average locations appearance ratio, denoted as $\textit{avg}{\cal R}$ , is defined as:

$\displaystyle\textit{avg}{\cal R}=\frac{1}{|\ell s|}\sum\limits_{\ell\in\ell s}{\cal R}_{\ell}\times 100\%$ (18)

where $\ell s$ is the set of all the distinct locations in $T S$ . The higher the value of $\textit{avg}{\cal R}$ , the higher the data utility.
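As a hedged illustration of Eq. (18), the sketch below counts appearances with `collections.Counter`. Representing each trajectory as a list of `(x, y)` tuples is an assumption, as is the function name.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Point = Tuple[float, float]


def average_appearance_ratio(
    original: Iterable[List[Point]], anonymized: Iterable[List[Point]]
) -> float:
    """avgR of Eq. (18), returned as a percentage."""
    orig_counts = Counter(loc for traj in original for loc in traj)
    anon_counts = Counter(loc for traj in anonymized for loc in traj)
    if not orig_counts:
        raise ValueError("the original dataset contains no locations")
    # R_l: appearances in the anonymized dataset over appearances in the original.
    # A Counter returns 0 for a location that no longer appears.
    ratios = [anon_counts[loc] / n for loc, n in orig_counts.items()]
    return 100.0 * sum(ratios) / len(ratios)
```

Because the method only exchanges or suppresses locations, the count of a location in the anonymized dataset should not exceed its original count, so each ratio stays within [0, 1].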

4.4.3 Trajectory loss

Trajectory loss, denoted as $T L$ , refers to the ratio of the number of deleted trajectories to the number of original trajectories. It is calculated as:

$\displaystyle TL=\frac{|{TS}|-|{TS^{*}}|}{|{TS}|}\times 100\%$ (19)

where $|{TS}|$ and $|{TS^{*}}|$ are respectively the number of trajectories in original trajectory dataset $T S$ and that in anonymous trajectory dataset $TS^{*}$ .

4.4.4 Spatio-temporal information loss of trajectory

Spatio-temporal information loss refers to the errors revealed by comparing the anonymous trajectory with the original trajectory. By comparing the original trajectory dataset $T S$ with the anonymous dataset $TS^{*}$ , we compute the spatio-temporal information loss to obtain the degree of information distortion. The more the information is distorted, the less available the anonymous trajectory dataset is.

After a series of anonymizing operations, the original trajectory dataset is transformed to an anonymous one. In the anonymizing process, some information is missing or incorrect. The spatio-temporal information loss at single timestamp can be calculated as:

$\displaystyle IL_{t}(T,T^{*})=\left\{\begin{array}{cl}\sqrt{(x_{t}^{*}-x_{t})^{2}+(y_{t}^{*}-y_{t})^{2}},&\text{if }(x_{t}^{*},y_{t}^{*})\text{ exists at }t\\ \Omega,&\text{otherwise}\end{array}\right.$ (20)

where $t s$ , $(x_{t}^{*},y_{t}^{*})$ and $(x_{t},y_{t})$ are the same as the explanation provided in Eq. (16), $\Omega$ is a penalty parameter. Then, the spatio-temporal information loss of single trajectory is calculated as:

$\displaystyle IL({T,T^{*}})=\mathop{\sum}\limits_{t\in ts}IL_{t}({T,T^{*}})$ (21)

Therefore, the spatio-temporal information loss of the total trajectory dataset is calculated as:

$\displaystyle\textit{TIL}({T,T^{*}})=\mathop{\sum}\limits_{T\in TS}IL({T,T^{*}})$ (22)
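Eqs (20)–(22) can be sketched as follows. The `{timestamp: (x, y)}` layout and the function names are illustrative assumptions; the value of the penalty $\Omega$ is left to the caller because it is an experimental parameter.

```python
import math
from typing import Dict, List, Tuple

# Assumed layout: a trajectory maps each timestamp to its (x, y) location.
Trajectory = Dict[int, Tuple[float, float]]


def information_loss(original: Trajectory, anonymized: Trajectory, omega: float) -> float:
    """IL(T, T*) of Eq. (21): sum of the per-timestamp losses of Eq. (20)."""
    total = 0.0
    for t, (x, y) in original.items():
        if t in anonymized:
            x_star, y_star = anonymized[t]
            total += math.hypot(x_star - x, y_star - y)  # Euclidean displacement
        else:
            total += omega  # the location was suppressed: apply the penalty
    return total


def total_information_loss(
    pairs: List[Tuple[Trajectory, Trajectory]], omega: float
) -> float:
    """TIL of Eq. (22): sum of IL over all trajectories of the dataset."""
    return sum(information_loss(t, t_star, omega) for t, t_star in pairs)
```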

4.5 Accuracy ratio of AOI query

The accuracy ratio of AOI (Area Of Interest) query, denoted as ARAOI, estimates the proportion of AOIs that are correctly retrieved from the anonymized dataset using the same retrieval mechanism as for the original dataset, where an AOI is an area in which the point density is higher than a specified threshold. AOIs are a statistical result useful for many applications, including personalized recommendation and path planning. A higher ARAOI score indicates that AOIs are identified more accurately from the anonymized data. To measure ARAOI, we compare the results retrieved from the anonymized trajectory dataset with those retrieved from the original dataset. The calculation method is as follows:

$\displaystyle\textit{ARAOI}=\frac{1}{\mathbb{Q}}\sum\limits_{i=1}^{\mathbb{Q}}\textit{eqn}(\textit{AOI}_{i},\textit{AOI}_{i}^{*})\times 100\%$ (23)

where ${\mathbb{Q}}$ is the number of AOIs needed to be retrieved, it is a parameter set by experiments, $\textit{AOI}_{i}$ refers to the $i$ -th AOI in the sorted set of AOIs retrieved from the original dataset, $\textit{AOI}_{i}^{*}$ refers to the $i$ -th AOI in the sorted set of AOIs retrieved from the anonymized dataset, $\textit{eqn}({\textit{AOI}_{i},\textit{AOI}_{i}^{*}})$ is a function with value of 1 if $\textit{AOI}_{i}$ is equal to $\textit{AOI}_{i}^{*}$ while with value of 0 if $\textit{AOI}_{i}$ is not equal to $\textit{AOI}_{i}^{*}$ .
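A minimal sketch of Eq. (23) is given below. It assumes that the AOIs have already been retrieved and sorted by some mechanism that is not shown here, and that each AOI is represented by a hashable identifier (for example a grid-cell label); both are assumptions for illustration.

```python
from typing import Hashable, Sequence


def araoi(
    original_aois: Sequence[Hashable],
    anonymized_aois: Sequence[Hashable],
    q: int,
) -> float:
    """ARAOI of Eq. (23): percentage of the top-q ranks holding the same AOI."""
    if q <= 0:
        raise ValueError("q must be a positive integer")
    matches = 0
    for i in range(q):
        # eqn(...) is 1 only when both rankings have the same AOI at rank i.
        if i < len(original_aois) and i < len(anonymized_aois):
            if original_aois[i] == anonymized_aois[i]:
                matches += 1
    return 100.0 * matches / q
```

If one ranking is shorter than `q`, the missing ranks are counted as mismatches in this sketch.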

5. Trajectory privacy preservation algorithm

5.1 Algorithm idea

Privacy preservation algorithms must balance data security and data availability, and trajectory privacy preservation algorithms are no exception.

In order to reduce information loss and to improve data availability in the trajectory anonymization process, we propose a trajectory privacy preservation method based on 3D-Grid partition, denoted as TPPG. In this method, we take both temporal and spatial factors into consideration. The calculation of trajectory distance is based on the rate and time interval of contemporary trajectories and on their spatial features.

TPPG is developed along three main phases to achieve trajectory privacy preservation. The block diagram of TPPG algorithm is shown in Fig. 1.

5.1.1 Trajectory pre-processing

(a) General processing of dataset. On the one hand, the trajectory dataset $T S$ is pre-processed so that each $T_{i}\in TS$ is in the form $T_{i}=\{i,(t_{1}^{i},x_{1}^{i},y_{1}^{i}),\cdots,(t_{n}^{i},x_{n}^{i},y_{n}^{i})\}$ . In the following algorithms, both trajectories and sub-trajectories conform to this form. On the other hand, uniform interpolation is applied so that the sampling time of each trajectory is continuous. (b) Trajectory data partition. First, the 3D-space is divided into $G\times G\times G$ 3D-cells, where $G$ is a partition parameter. Second, all the original trajectories are distributed in these 3D-cells. There may be zero or more sub-trajectories in each 3D-cell. That is to say, there may be some time-stamped locations in specified 3D-cells.

5.1.2 Trajectory similarity measurement

We need to enhance the data availability while preserving the privacy of published trajectory data. The distances between sub-trajectories within each 3D-cell are calculated to measure the similarity between them, which is a guidance for trajectory anonymization.

5.1.3 Trajectory anonymization

(a) Spatial locations conversion. To achieve the trajectory anonymization, some locations of sub-trajectories in the same 3D-cell are exchanged based on spatio-temporal attributes and trajectory similarity. (b) Trajectory reconstruction. It is necessary to reconstruct the sub-trajectories from different 3D-cells. After reconstruction of all the new sub-trajectories belonging to the same trajectory, an anonymous trajectory dataset based on real data is finally obtained.

Part (a) of the trajectory pre-processing phase is pre-experimental preparation, which is not difficult to achieve. Therefore, the next section focuses on Steps 5.1.1 (b) to 5.1.3 of the algorithm.

5.2 Algorithm description

5.2.1 Trajectory data partition

In order to improve the efficiency of trajectory anonymity algorithm, and to reduce the information loss generated by the anonymous trajectory, we firstly partition trajectory dataset into $G\times G\times G$ 3D-cells. Figure 2 shows the trajectory partition effect of synthetic dataset in 2D-plane, where $G$ is assigned with 10. This synthetic dataset is generated by the Brinkhoff generator [36]. A detailed explanation of the dataset is provided in Section 6.1.2. For ease of understanding, we use Cartesian coordinate system to realize the partition.

Figure 1.

The block diagram of TPPG algorithm.

Figure 2.

Trajectory partition effect of synthetic dataset in 2D-plane.

As described in Definition 1, a trajectory is an ordered set of time-stamped locations. It is formalized as $T=\{\textit{Tid},(t_{1},x_{1},y_{1}),(t_{2},x_{2},y_{2}),\cdots,(t_{n},x_{n},y_{n})\}$ . Each time-stamped location of $T$ has three attributes: time, $x$ -coordinate and $y$ -coordinate. In Fig. 2, the abscissa and ordinate of each location respectively represent its $x$ -coordinate and $y$ -coordinate. That is to say, the time dimension of location data is ignored in this figure. Here we do not convert the 3D data to 2D data; we only demonstrate the spatial distribution of trajectories based on the ( $x$ , $y$ ) coordinates of locations. As can be seen from Fig. 2, some grids do not contain any locations and some grids gather a lot of locations. This can reflect the visiting rate of each area.

The trajectory data partition of original dataset is implemented by Algorithm 1, denoted as SPTD. Given a dataset of $p$ trajectories $T S$ and the partition parameter $G$ , the algorithm applies to each location of trajectories in $T S$ . We use two 3D datasets to respectively store the set of locations and the set of sub-trajectories distributed in each 3D-cell. The notations used in the SPTD algorithm are listed in Table 1.

Table 1

Notations used in the SPTD algorithm

Notation	Description
$T S$	A dataset of $p$ trajectories
$G$	The parameter used to guide partitioning
$x{\_}\textit{left}$	The minimum $x$ value of all time-stamped locations in $T S$
$x{\_}\textit{right}$	The maximum $x$ value of all time-stamped locations in $T S$
$y{\_}\textit{top}$	The minimum $y$ value of all time-stamped locations in $T S$
$y{\_}\textit{bottom}$	The maximum $y$ value of all time-stamped locations in $T S$
$t{\_}\min$	The minimum $t$ value of all time-stamped locations in $T S$
$t{\_}\max$	The maximum $t$ value of all time-stamped locations in $T S$
$ln_{i}$	The number of time-stamped locations in the $i$ -th trajectory $T_{i}$
$\textit{tloc}_{r}^{i}$	The $t$ value of $\textit{loc}_{r}^{i}$
$\textit{xloc}_{r}^{i}$	The $x$ -coordinate of $\textit{loc}_{r}^{i}$
$\textit{yloc}_{r}^{i}$	The $y$ -coordinate of $\textit{loc}_{r}^{i}$
Cells	A 3D dataset with size $G\times G\times G$ , used to store the locations within each 3D-cell
TS3d	A 3D dataset with size $G\times G\times G$ , used to store the sub-trajectories within each 3D-cell

The pseudo code of SPTD algorithm is given as follows:

Algorithm 1: SPTD: Spatio-temporal Partition of Trajectory Dataset
Input: $TS=\{T_{1},T_{2},\cdots,T_{p}\}$ , $G$
Output: Cells, TS3d
1. Initialize 3D dataset Cells with size $G\times G\times G$ ;
2. Calculate $x{\_}\textit{left}$ , $x{\_}\textit{right}$ , $y{\_}\textit{top}$ , $y{\_}\textit{bottom}$ , $t{\_}\min$ , $t{\_}\max$ ;
3. height $\leftarrow$ ( $t{\_}\max-t{\_}\min$ )/ $G$ ;
4. length $\leftarrow$ ( $x{\_}\textit{right}-x{\_}\textit{left}$ )/ $G$ ;
5. width $\leftarrow$ ( $y{\_}\textit{bottom}-y{\_}\textit{top}$ )/ $G$ ;
6. for $i\leftarrow$ 1 to $p$ do
7. $ln_{i}\leftarrow$ the number of time-stamped locations in the $i$ -th trajectory $T_{i}$ ;
8. for $r\leftarrow$ 1 to $ln_{i}$ do
9. //Put $\textit{loc}_{r}^{i}$ into the specified cell based on the $x$ and $y$ coordinates of $\textit{loc}_{r}^{i}$ ;
10. $Rt\leftarrow\lceil(\textit{tloc}_{r}^{i}-t{\_}\min)/\textit{height}\rceil$ ;
11. $Rx\leftarrow\lceil(\textit{xloc}_{r}^{i}-x{\_}\textit{left})/\textit{length}\rceil$ ;
12. $Ry\leftarrow\lceil(\textit{yloc}_{r}^{i}-y{\_}\textit{top})/\textit{width}\rceil$ ;
13. Process the cases about boundary locations;
14. $\textit{Cells}\{Rt,Rx,Ry\}\leftarrow\textit{Cells}\{Rt,Rx,Ry\}\cup(i,\textit{tloc}_{r}^{i},\textit{xloc}_{r}^{i},\textit{yloc}_{r}^{i})$ ;
15. end for
16. end for
17. TS3d $\leftarrow$ MergeTr (Cells);
18. return Cells, TS3d;

Lines 1–16 are designed to get the collection of locations for each 3D-cell by comparing spatio-temporal features of each location with the upper and lower bounds of each 3D-cell. Here the spatio-temporal partition of trajectories is implemented based on the timestamps and ( $x$ , $y$ ) coordinates of locations. Line 17 is used to extract sub-trajectories distributed in each 3D-cell.
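The partition step (Lines 1–16 of Algorithm 1) can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the input layout `{trajectory id: [(t, x, y), ...]}`, the function names, and the boundary rule (clamping every index to the range 1..G, which stands in for Line 13) are all assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Location = Tuple[float, float, float]          # (t, x, y)
CellKey = Tuple[int, int, int]                 # (Rt, Rx, Ry), each in 1..G
Record = Tuple[int, float, float, float]       # (trajectory id, t, x, y)


def cell_index(value: float, lower: float, step: float, g: int) -> int:
    """Ceiling-based index as in Lines 10-12, clamped to 1..g (assumed boundary rule)."""
    if step == 0:
        return 1  # degenerate axis: every value falls into the first cell
    return min(max(math.ceil((value - lower) / step), 1), g)


def sptd(ts: Dict[int, List[Location]], g: int) -> Dict[CellKey, List[Record]]:
    """Distribute every time-stamped location of ts into a G x G x G grid."""
    locs = [loc for traj in ts.values() for loc in traj]
    if not locs:
        return {}
    t_min, t_max = min(l[0] for l in locs), max(l[0] for l in locs)
    x_left, x_right = min(l[1] for l in locs), max(l[1] for l in locs)
    y_top, y_bottom = min(l[2] for l in locs), max(l[2] for l in locs)
    height = (t_max - t_min) / g
    length = (x_right - x_left) / g
    width = (y_bottom - y_top) / g

    cells: Dict[CellKey, List[Record]] = defaultdict(list)
    for tid, traj in ts.items():
        for t, x, y in traj:
            key = (
                cell_index(t, t_min, height, g),
                cell_index(x, x_left, length, g),
                cell_index(y, y_top, width, g),
            )
            cells[key].append((tid, t, x, y))
    return dict(cells)
```

Only non-empty cells are stored in the returned dictionary, which avoids allocating all $G^{3}$ cells when the data is sparse; this is a design choice of the sketch, not of the paper.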

After SPTD algorithm is executed, we distribute the locations into different 3D-cells. The 3D dataset Cells stores the information of such locations. In Line 17 of Algorithm 1, MergeTr is a function used to merge all locations belonging to the same trajectory into a sub-trajectory. The 3D dataset TS3d is used to store $G\times G\times G$ datasets consisting of several sub-trajectories. It is worth noting that all the sub-trajectories conform to the trajectory model presented in Definition 1. The pseudo code of MergeTr function is given as follows:

Function 1: [TS3d] $=$ MergeTr ( $R g$ )
1. $[Gt,Gx,Gy]\leftarrow$ size ( $R g$ );
2. Initialize 3D dataset TS3d with size $Gt\times Gx\times Gy$ ;
3. for each Rgcell in $R g$ do
4. $(ti,xi,yi)\leftarrow$ the 3-dimensional coordinates of Rgcell;
5. if !isempty (Rgcell) then
6. [crow, ccol] $\leftarrow$ size (Rgcell);
7. $di\leftarrow$ 1;
8. tdrow $\leftarrow$ Rgcell ( $d i$ , :);
9. tracno $\leftarrow$ tdrow (1, 1);
10. for $ci\leftarrow$ 2 to crow do
11. if Rgcell ( $c i$ , 1) $==$ tracno
12. tdrow $\leftarrow$ [tdrow, Rgcell ( $c i$ , 2: ccol)];
13. else
14. TS3d ${\{}ti,xi,yi{\}}$ ( $d i$ , 1: length (tdrow)) $\leftarrow$ tdrow;
15. tracno $\leftarrow$ Rgcell ( $c i$ , 1);
16. tdrow $\leftarrow$ Rgcell ( $c i$ , :);
17. $di\leftarrow di+1$ ;
18. end if
19. end for
20. TS3d ${\{}ti,xi,yi{\}}$ ( $d i$ , 1: length (tdrow)) $\leftarrow$ tdrow;
21. end if
22. end for
23. return TS3d;

Line 12 of Function 1 is repeated to obtain tdrow. For each iteration, a location ( $t$ , $x$ , $y$ ) is added to the same trajectory with same id. The variable tracno is used to record trajectory id (Lines 9 and 15). Lines 14 and 20 get each sub-trajectory in the same 3D-cell. Therefore, in each 3D-cell of TS3d, the sub-trajectories conform to the trajectory model presented in Definition 1.
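The merging idea of Function 1 can be sketched for a single 3D-cell as follows. Function 1 appends consecutive rows that share a trajectory id; the sketch below swaps in dictionary grouping instead, so it does not rely on rows of the same trajectory being adjacent. The record layout `(id, t, x, y)` and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[int, float, float, float]   # (trajectory id, t, x, y)
Location = Tuple[float, float, float]      # (t, x, y)


def merge_cell(cell_locations: List[Record]) -> Dict[int, List[Location]]:
    """Group the locations of one 3D-cell into sub-trajectories, one per id."""
    grouped: Dict[int, List[Location]] = defaultdict(list)
    for tid, t, x, y in cell_locations:
        grouped[tid].append((t, x, y))
    # Sort each sub-trajectory by time so that it remains an ordered
    # sequence of time-stamped locations.
    return {tid: sorted(locs) for tid, locs in grouped.items()}
```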

5.2.2 The similarity measurement between two trajectories

Locations exchange is a major method for achieving privacy preservation. Random anonymization and exchange of trajectories or locations will degrade the data quality of anonymous trajectories. To ensure high data availability, we measure the similarity between any pair of trajectories, which serves as a basis for trajectory anonymization. Therefore, locations exchange is implemented based on the distance calculation between trajectories. The data availability is greatly improved by locations exchange between nearest neighbors.

Reference [5] concludes that not all similarity measurements between two trajectories are suitable for trajectory anonymization purposes. Many distance measures for trajectories or time series have been proposed in the past, such as the Euclidean distance, the Hausdorff distance, DTW, LCSS, EDR, etc. Most of them are ill-suited to compare trajectories for anonymization purposes, because anonymization requires not just similarity of trajectory shape, but also spatial and temporal closeness of trajectories. To address this issue, Domingo-Ferrer and Trujillo-Rasua [5] proposed a trajectory similarity calculation method which considers the time and space dimensions. As described in Section 2, this method ignores other factors such as trajectory orientation, moving speed and continuity. Similarly, Wang et al. [14] took trajectory shape into account, but did not consider the continuity of the trajectory. Therefore, based on the execution results of the SPTD algorithm, we propose a new distance calculation method to measure the similarity between two trajectories. Our method is designed for trajectory anonymization and takes full account of the above factors when measuring the trajectory distance.

There are three main scenarios of the spatio-temporal relationship between two trajectories, which are shown in Fig. 3. The distance calculation depends on the specific scenario.

Figure 3.

Three main scenarios of the relationship between two trajectories.

1) Only one location in each trajectory

The distance is calculated based on the Euclidean distance. For the two scenarios (a) and (b) in Fig. 3, the calculation equation is as follows:

$\displaystyle\textit{dist}_{o}(T_{i},T_{j})=\textit{dist}_{l}(T_{i},T_{j})=\left\{\begin{array}{ll}\sqrt{(x_{t_{i}}^{i}-x_{t_{j}}^{j})^{2}+(y_{t_{i}}^{i}-y_{t_{j}}^{j})^{2}},&\text{(a) }t_{i}=t_{j}\\ \varepsilon+\sqrt{(x_{t_{i}}^{i}-x_{t_{j}}^{j})^{2}+(y_{t_{i}}^{i}-y_{t_{j}}^{j})^{2}},&\text{(b) }t_{i}\neq t_{j}\end{array}\right.$ (24)

where $(t_{i},x_{t_{i}}^{i},y_{t_{i}}^{i})$ and $(t_{j},x_{t_{j}}^{j},y_{t_{j}}^{j})$ are respectively the unique locations of $T_{i}$ and $T_{j}$ , $\varepsilon$ is a distance adjustment parameter used to measure the difference between two locations with different time values, in our experiments, $\varepsilon$ is assigned with $|t_{i}-t_{j}|$ . In Fig. 3a, the two locations are at time $t_{2}$ , that is, $t_{i}=t_{j}=t_{2}$ . In Fig. 3b, $t_{i}=t_{8}$ and $t_{j}=t_{10}$ .
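Eq. (24) can be written compactly once $\varepsilon$ is set to $|t_{i}-t_{j}|$ as in the experiments, because $\varepsilon$ then vanishes in case (a). The sketch below relies on that observation; the tuple layout `(t, x, y)` and the function name are assumptions.

```python
import math
from typing import Tuple

Location = Tuple[float, float, float]  # (t, x, y)


def single_location_distance(loc_i: Location, loc_j: Location) -> float:
    """Eq. (24) with epsilon = |t_i - t_j|, the value used in the experiments."""
    ti, xi, yi = loc_i
    tj, xj, yj = loc_j
    epsilon = abs(ti - tj)  # equals 0 in case (a), where t_i = t_j
    return epsilon + math.hypot(xi - xj, yi - yj)
```

With two locations that are 5 units apart in space, the distance is 5 when the timestamps coincide and 7 when they differ by 2.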

2) Only one location in one trajectory, and at least two locations in another one

The distance is calculated based on the area of triangle. Without loss of generality, we assume $T_{i}$ contains only one location $(t_{i},x_{t_{i}}^{i},y_{t_{i}}^{i})$ . For the two scenarios (c) and (d) in Fig. 3, the calculation equation is as follows:

$\displaystyle\textit{dist}_{o}(T_{i},T_{j})=\textit{dist}_{l}(T_{i},T_{j})=\left\{\begin{array}{ll}\frac{1}{|et-st|}\sum\limits_{t_{r}=st}^{et-1}\sqrt{\sigma_{t_{r}}},&\text{(c) }st<t_{i}<et\\ \delta+\frac{1}{|et-st|}\sum\limits_{t_{r}=st}^{et-1}\sqrt{\sigma_{t_{r}}},&\text{(d) }t_{i}<st\text{ or }t_{i}>et\end{array}\right.$ (25)

where $s t$ and $e t$ are respectively the start time and the end time of $T_{j}$ , $\sigma_{t_{r}}$ is the area of triangle consisting of the three locations $(t_{i},x_{t_{i}}^{i},y_{t_{i}}^{i})$ , $(st,x_{t_{r}}^{j},y_{t_{r}}^{j})$ and $(et,x_{t_{r}+1}^{j},y_{t_{r}+1}^{j})$ , $\delta$ is a distance adjustment parameter used to measure the difference between two scenarios (c) and (d). As shown in Fig. 3c, $st<t_{i}<et$ ( $t_{i}=t_{2}$ , $st=t_{1}$ , $et=t_{4}$ ). In Fig. 3d, $t_{i}<st$ ( $t_{i}=t_{7}$ , $st=t_{8}$ ).

As is presented in Definition 2, the $p t$ value is assigned with $-$ 2 for the above two scenarios.
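The triangle-based distance of Eq. (25) can be sketched as follows, under several assumptions that go beyond the text: $T_{j}$ is given as a time-ordered list of `(t, x, y)` tuples whose consecutive entries play the role of the locations at $t_{r}$ and $t_{r}+1$, each triangle is formed by the single location of $T_{i}$ and one such consecutive pair, the boundary cases $t_{i}=st$ and $t_{i}=et$ are treated like case (c), and $\delta$ is supplied by the caller.

```python
import math
from typing import List, Tuple

Location = Tuple[float, float, float]  # (t, x, y)
Point = Tuple[float, float]


def triangle_area(p: Point, a: Point, b: Point) -> float:
    """Area of the triangle p-a-b from the 2D cross product."""
    return abs((a[0] - p[0]) * (b[1] - p[1]) - (b[0] - p[0]) * (a[1] - p[1])) / 2.0


def point_to_trajectory_distance(
    loc_i: Location, traj_j: List[Location], delta: float
) -> float:
    """Sketch of Eq. (25): averaged square roots of triangle areas, plus delta in case (d)."""
    if len(traj_j) < 2:
        raise ValueError("traj_j must contain at least two locations")
    ti, xi, yi = loc_i
    st, et = traj_j[0][0], traj_j[-1][0]
    if et == st:
        raise ValueError("traj_j must span a non-zero time interval")
    total = sum(
        math.sqrt(triangle_area((xi, yi), (a[1], a[2]), (b[1], b[2])))
        for a, b in zip(traj_j, traj_j[1:])
    )
    base = total / abs(et - st)
    # Case (c): t_i lies within [st, et]; case (d): it lies outside.
    return base if st <= ti <= et else delta + base
```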

3) At least two locations in each trajectory

This is the general scenario. As shown in (e)–(f) of Fig. 3, there are at least two locations in both $T_{i}$ and $T_{j}$ . Scenarios (e) and (f) differ as follows. For the former, $pt>$ 0, which means that there are common periods ( $t_{2}\sim t_{3}$ ) between $T_{i}$ and $T_{j}$ . The orientation distance $\textit{dist}_{o}(T_{i},T_{j})$ is calculated using Eq. (5), and the location distance $\textit{dist}_{l}(T_{i},T_{j})$ is calculated using Eqs (6)–(12). The trajectory distance $\textit{dist}(T_{i},T_{j})$ is calculated based on Eq. (13). For the latter, $pt=$ 0, which means that $T_{i}$ and $T_{j}$ are not contemporary. The trajectory distance $\textit{dist}(T_{i},T_{j})$ is calculated using Eq. (14).

The proposed method gets all the distances between each pair of sub-trajectories within each 3D-cell. This work is implemented by Algorithm 2, denoted as GD. The input of the algorithm includes a 3D dataset TS3d obtained from the execution results of Algorithm 1, a weight parameter $\propto$ , and the partition parameter $G$ used in the trajectory partition step. The output of the algorithm is a 3D dataset with size $G\times G\times G$ , used to store the distances between each pair of sub-trajectories within any 3D-cell. The notations used in the GD algorithm are listed in Table 2.

Table 2

Notations used in the GD algorithm

Notation	Description
$\propto$	The weight of trajectory orientation distance
$\varepsilon$	A parameter of distance adjustment
$\|\textit{STS}\|$	The number of sub-trajectories in dataset STS
time	A 3D dataset with size $G\times G\times G$ , which consists of the start and end time of each sub-trajectory within any 3D-cell
$p t$	A 3D dataset with size $G\times G\times G$ , which consists of the $p t$ values for each pair of sub-trajectories within any 3D-cell
dso	A 3D dataset with size $G\times G\times G$ , which consists of the trajectory orientation distance for each pair of sub-trajectories within any 3D-cell
dsl	A 3D dataset with size $G\times G\times G$ , which consists of the trajectory location distance for each pair of sub-trajectories within any 3D-cell
Dists	A 3D dataset with size $G\times G\times G$ , used to store the distances between each pair of sub-trajectories within any 3D-cell

The pseudo code of GD algorithm is given as follows.

Algorithm 2: GD: Get Distances within each 3D-cell
Input: $G$ , TS3d, $\propto$
Output: Dists
1. for $i\leftarrow$ 1 to $G$ do
2. for $j\leftarrow$ 1 to $G$ do
3. for $k\leftarrow$ 1 to $G$ do
4. STS $\leftarrow$ TS3d ( $i$ , $j$ , $k$ ); //A dataset of sub-trajectories
5. if !isempty (STS) then
6. [time ( $i$ , $j$ , $k$ ), $p t$ ( $i$ , $j$ , $k$ ), dso ( $i$ , $j$ , $k$ ), dsl ( $i$ , $j$ , $k$ )] $\leftarrow$ ComDist (STS);
7. Normalize the distances dso ( $i$ , $j$ , $k$ ) and dsl ( $i$ , $j$ , $k$ );
8. Dists ( $i$ , $j$ , $k$ ) $\leftarrow\propto*\,\textit{dso}(i,j,k)+(1-\propto)*\textit{dsl}(i,j,k)$ ;
9. end if
10. end for
11. end for
12. end for
13. return Dists;

In Line 6 of Algorithm 2, ComDist is a function used to calculate the distances between each pair of trajectories. Lines 7 and 8 achieve a combination of two weighted distances according to Eq. (13). The pseudo code of ComDist function is given as follows.

Function 2: $[\textit{time},pt,\textit{dso},\textit{dsl}]=$ ComDist (STS)
1. Initialize time, $p t$ , dso, dsl;
2. for $i\leftarrow$ 1 to $\|\textit{STS}\|$ do
3. Calculate the start time start ${}_{i}$ and the end time end ${}_{i}$ of the $i$ -th trajectory $ST_{i}$ ;
4. time ( $i$ ) $\leftarrow$ ( $i$ , start ${}_{i}$ , end ${}_{i}$ );
5. end for
6. for $i\leftarrow$ 1 to $\|\textit{STS}\|-1$ do
7. for $j\leftarrow i+1$ to $\|\textit{STS}\|$ do
8. Calculate $p t$ ( $i$ , $j$ ) using Eqs (2) and (3);
9. $p t$ ( $j$ , $i$ ) $\leftarrow pt$ ( $i$ , $j$ ); //The contemporary rate between two trajectories $ST_{i}$ and $ST_{j}$ ;
10. end for
11. end for
12. for $i\leftarrow$ 1 to $\|\textit{STS}\|-1$ do
13. for $j\leftarrow i+1$ to $\|\textit{STS}\|$ do
14. if $p t$ ( $i$ , $j$ ) $>$ 0 then
15. Calculate dso ( $i$ , $j$ ) using Eq. (5);
16. Calculate dsl ( $i$ , $j$ ) using Eqs (6)–(12);
17. elseif $p t$ ( $i$ , $j$ ) $==-$ 2 then
18. if both $ST_{i}$ and $ST_{j}$ contain only one location then //See (a) or (b) in Fig. 3
19. Calculate dso ( $i$ , $j$ ) and dsl ( $i$ , $j$ ) using Eq. (24);
20. elseif $ST_{i}$ or $ST_{j}$ contains only one location then //See (c) or (d) in Fig. 3
21. Calculate dso ( $i$ , $j$ ) and dsl ( $i$ , $j$ ) using Eq. (25);
22. end if
23. elseif $p t$ ( $i$ , $j$ ) $==$ 0 then
24. Respectively calculate dso ( $i$ , $j$ ) and dsl ( $i$ , $j$ ) using Eq. (14);
25. end if
26. dso ( $j$ , $i$ ) $\leftarrow$ dso ( $i$ , $j$ );
27. dsl ( $j$ , $i$ ) $\leftarrow$ dsl ( $i$ , $j$ );
28. end for
29. end for
30. return time, $p t$ , dso, dsl;

Lines 2–11 of Function 2 calculate the two datasets time and $p t$ used in Eqs (5) and (6). Lines 14–16 handle the trajectory distance calculation for the scenario presented in Fig. 3e. Lines 17–22 handle the scenarios presented in Fig. 3a–d. Lines 23 and 24 handle the scenario presented in Fig. 3f. Lines 26 and 27 make the distance matrices symmetric for each 3D-cell of TS3d.

The proposed privacy-preserving trajectory data publication method is based on 3D-grid partition, so both the temporal and spatial attributes of trajectories need to be taken into consideration. When calculating the distance between two trajectories, most existing methods calculate spatial distance based on the center point and the length of each sub-trajectory [37, 38]. They ignore the continuity of trajectory. In our similarity measurement method, three main scenarios of spatio-temporal relationship between two trajectories are all considered, where the trajectory distance is calculated based on spatiotemporal attributes. Trajectory orientation distance and location distance are integrated in calculation of trajectory distance. In particular, the location distance based on area calculation is used to address the continuity problem of trajectory. Therefore, our distance measurement method considers the comprehensive impact of the internal and external characteristics of trajectory itself on the similarity between trajectories.
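The integration of orientation distance and location distance in Lines 7 and 8 of Algorithm 2 can be sketched for one 3D-cell as follows. The normalization scheme is not specified in this section, so dividing each matrix by its maximum entry is an assumption, as are the function names and the list-of-lists matrix layout.

```python
from typing import List

Matrix = List[List[float]]


def normalize(m: Matrix) -> Matrix:
    """Scale a distance matrix into [0, 1] by its maximum entry (assumed scheme)."""
    peak = max((v for row in m for v in row), default=0.0)
    if peak == 0:
        return [[0.0 for _ in row] for row in m]
    return [[v / peak for v in row] for row in m]


def combine_distances(dso: Matrix, dsl: Matrix, weight: float) -> Matrix:
    """Line 8 of Algorithm 2: weight * dso + (1 - weight) * dsl after normalization."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    n_dso, n_dsl = normalize(dso), normalize(dsl)
    return [
        [weight * o + (1.0 - weight) * l for o, l in zip(row_o, row_l)]
        for row_o, row_l in zip(n_dso, n_dsl)
    ]
```

Normalizing both matrices before combining them keeps the weight meaningful even when orientation and location distances are measured on very different scales.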

5.2.3 Spatial locations conversion and trajectory reconstruction

When the original dataset contains a large number of locations or trajectories, regional division and sub-regional anonymization can effectively reduce the amount of calculation in the process of trajectory similarity measurement and anonymization.

Based on the 3D datasets TS3d and Dists, Algorithm 3, denoted as TPPG, performs locations exchange to implement trajectory anonymization. Starting from the first trajectory in each 3D-cell of TS3d, the algorithm exchanges the corresponding time and locations between two nearest trajectory neighbors when the constraints $\theta_{t}$ and $\theta_{d}$ are met. The output of this algorithm is the anonymous trajectory dataset. The notations used in the TPPG algorithm are listed in Table 3.

Table 3
Notations used in the TPPG algorithm

Notation	Description
$\theta_{d}$	The distance threshold
$\theta_{t}$	The time threshold
TSanony	The anonymous trajectory dataset for $T S$
TS3danony	A 3D anonymous dataset for TS3d
loc.id	The trajectory index where the current location is located
loc.t	The time value of the current location
loc.x	The $x$ value of the current location
loc.y	The $y$ value of the current location

The pseudo code of TPPG algorithm is given as follows:

Algorithm 3: TPPG: Trajectory Privacy Preservation Based on 3D-Grid Partition
Input: TS3d, Dists, $\theta_{d}$ , $\theta_{t}$
Output: TSanony
1. TS3danony $\leftarrow$ LocEx (TS3d, Dists, $\theta_{d}$ , $\theta_{t}$ );
2. Find the minimum and maximum indexes of locations in TS3danony;
3. Initialize the anonymous dataset TSanony;
4. for each TS3danonycell in TS3danony do
5. if !isempty (TS3danonycell) then
6. [drow, dcol] $\leftarrow$ size (TS3danonycell);
7. for each loc in TS3danonycell do
8. pos $\leftarrow\textit{loc.t}*3+2$ ;
9. TSanony (loc.id, pos: $\textit{pos}+2$ ) $\leftarrow$ (loc.t, loc.x, loc.y);
10. end for
11. end if
12. end for
13. TSanony $\leftarrow$ Remove ids of all the trajectories from TSanony;
14. return TSanony;

In Line 1 of Algorithm 3, LocEx is a function used to swap the locations between two closest trajectories. Lines 2–12 are the reconstruction process of the anonymous trajectories. Because each step before this algorithm is conducted on the sub-trajectories in 3D-cells, the final anonymous trajectory dataset must be obtained through reconstruction. Before Line 13 is executed, the dataset TSanony has the same form as the original dataset $T S$ . Line 13 is used to hide the trajectory identifiers, that is, each anonymous trajectory can be represented as $T_{i}^{*}=\{(t_{1^{*}}^{i},x_{1^{*}}^{i},y_{1^{*}}^{i}),\cdots,(t_{n^{*}}^{i},x_{n^{*}}^{i},y_{n^{*}}^{i})\}$ . The pseudo code of LocEx function is given as follows:

Function 3: [TS3danony] $=$ LocEx (TS3d, Dists, $\theta_{d}$ , $\theta_{t}$ )
1. [ $t n$ , $x n$ , $y n$ ] $\leftarrow$ size (TS3d);
2. Initialize TS3danony with the size of $tn\times xn\times yn$ ;
3. for each TS3dcell in TS3d do
4. ( $t i$ , $x i$ , $y i$ ) $\leftarrow$ the 3-dimensional coordinates of TS3dcell;
5. if isempty (TS3dcell) then continue; end if
6. [row, col] $\leftarrow$ size (TS3dcell);
7. swaptrflag $\leftarrow$ zeros (row); // Initialize swaptrflag of each trajectory in TS3dcell to 0;
8. swaplocflag $\leftarrow$ zeros (row, col); // Initialize swaplocflag of each location in TS3dcell to 0;
9. for tdsi $\leftarrow$ 1 to row do
10. if swaptrflag (tdsi) $==$ 0 then
11. [value, pos] $\leftarrow$ sort (Dists { $t i$ , $x i$ , $y i$ } (tdsi, :));
12. for $vi\leftarrow$ 1 to length (value) do
13. $tp\leftarrow$ pos ( $v i$ );
14. if value ( $v i$ ) $\leqslant\theta_{d}$ && swaptrflag ( $t p$ ) $==$ 0 then
15. for $si\leftarrow$ 2 to col do //check each timestamp
16. for $sj\leftarrow$ 2 to col do
17. if swaplocflag ( $t p$ , $s j$ ) $==$ 0 && abs (TS3dcell (tdsi, $s i$ )-TS3dcell ( $t p$ , $s j$ )) $\leqslant\theta_{t}$ then
18. TS3dcell (tdsi, $s i$ : $si+2$ ) $\leftrightarrow$ TS3dcell ( $t p$ , $s j$ : $sj+2$ );
19. swaptrflag (tdsi) $\leftarrow$ 1;
20. swaptrflag ( $t p$ ) $\leftarrow$ 1;
21. swaplocflag ( $t p$ , $s j$ ) $\leftarrow$ 1;
22. break;
23. end if
24. $sj\leftarrow sj+3$ ;
25. end for
26. $si\leftarrow si+3$ ;
27. end for
28. end if
29. if swaptrflag (tdsi) $==$ 1 then break; end if
30. end for
31. end if
32. end for
33. Untras $\leftarrow$ trajectories with swaptrflag being 0;
34. for each untra in Untras do
35. unlocs $\leftarrow$ all the locations in untra;
36. if $\|\textit{unlocs}\|<$ 3 then // $\|\textit{unlocs}\|$ is the number of the elements in unlocs
37. Suppress the unlocs;
38. end if
39. end for
40. TS3danony { $t i$ , $x i$ , $y i$ } $\leftarrow$ TS3dcell;
41. end for
42. return TS3danony;

Line 7 of Function 3 sets an exchange sign swaptrflag for each sub-trajectory, whose value is 0 before being exchanged and 1 after being exchanged. Line 8 of Function 3 sets an exchange sign swaplocflag for each location of each sub-trajectory, whose value is 0 before being exchanged and 1 after being exchanged. Lines 9–41 are the specific steps of trajectory anonymization, and Lines 33–39 are designed for processing the case that locations are sparsely distributed. Lines 10, 14 and 17 serve as the constraints on locations exchange.
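The exchange logic inside one 3D-cell can be sketched in simplified form as follows. This is not a line-by-line port of Function 3: sub-trajectories are stored in a dictionary `{id: [(t, x, y), ...]}`, pairwise distances in a dictionary keyed by `(smaller id, larger id)`, and suppression removes the whole unswapped sub-trajectory when it has fewer than 3 locations. These layouts and names are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Location = Tuple[float, float, float]  # (t, x, y)


def exchange_in_cell(
    subs: Dict[int, List[Location]],
    dists: Dict[Tuple[int, int], float],
    theta_d: float,
    theta_t: float,
) -> Dict[int, List[Location]]:
    """Swap locations between nearest sub-trajectories of one 3D-cell."""
    result = {tid: list(locs) for tid, locs in subs.items()}
    swapped: Set[int] = set()                       # plays the role of swaptrflag

    for a in sorted(result):
        if a in swapped:
            continue
        # Candidate partners ordered by increasing distance (Line 11).
        candidates = sorted(
            (dists[(a, b) if a < b else (b, a)], b) for b in result if b != a
        )
        for d, b in candidates:
            if d > theta_d:
                break                               # remaining partners are farther
            if b in swapped:
                continue
            used: Set[int] = set()                  # plays the role of swaplocflag
            for i, loc_a in enumerate(result[a]):
                for j, loc_b in enumerate(result[b]):
                    if j not in used and abs(loc_a[0] - loc_b[0]) <= theta_t:
                        # Exchange the whole time-stamped location (Line 18).
                        result[a][i], result[b][j] = loc_b, loc_a
                        used.add(j)
                        break
            if used:
                swapped.update((a, b))
                break

    # Sparse case (Lines 33-39): suppress short sub-trajectories left unswapped.
    for tid in list(result):
        if tid not in swapped and len(result[tid]) < 3:
            del result[tid]
    return result
```

In a cell with two close sub-trajectories and one isolated single location, the two close sub-trajectories exchange their locations and the isolated location is suppressed.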

5.3 Algorithm analysis

5.3.1 Trajectory privacy preservation capability

Consider a sub-trajectory $S$ of $T(T\in TS)$ . Let $S$ be the adversary’s knowledge of a target original trajectory $T$ , $\textit{loc}_{1},\textit{loc}_{2},\cdots,\textit{loc}_{|s|}$ be all the locations in $S$ , $S^{*}$ be the anonymized counterpart of $S$ , and $T^{*}$ be the anonymized trajectory of $T$ .

The TPPG algorithm performs 3-dimensional grid partition on the original trajectory dataset, so each trajectory may be divided among different 3D-cells. Therefore, $S$ may fall into a single 3D-cell or into several different 3D-cells. In the former case, assuming the number of sub-trajectories within this 3D-cell is $N_{st}$ , each location in $S$ will be exchanged with a location meeting the constraints within the 3D-cell. The probability of $\textit{loc}_{i}(1\leqslant i\leqslant|s|)$ being exchanged is 1/ $N_{st}$ . In the latter case, we assume the number of 3D-cells including $S$ is $CN_{S}$ , and the number of sub-trajectories within the $r$ -th $(1\leqslant r\leqslant|{CN_{S}}|)$ 3D-cell is $N_{st}^{r}$ . Without loss of generality, we assume $N_{st}^{r}$ is the minimum one for the $|{CN_{S}}|$ 3D-cells. The probability that the locations in the reconstructed anonymous trajectory corresponding to $S$ can be related to $T$ is $1/(N_{st}^{r}CN_{S})$ .

In summary, even if the attacker grasps the background knowledge BK1 and BK2 defined in Section 4.2, the TPPG algorithm satisfies trajectory privacy preservation requirement according to the attack model defined in Section 4.2. In particular, there are three cases:

1. AC1: After being partitioned, $T$ has several parts belonging to different 3D-cells. Let $PN_{T}$ be the number of parts in $T$, $LN_{T}$ be the number of locations in $T$, $\textit{SubT}_{l}$ be the $l$-th sub-trajectory of $T$, and $LN_{T}^{l}$ be the number of locations in the $l$-th sub-trajectory of $T$ $(1\leqslant l\leqslant PN_{T})$. Trajectory anonymization is performed within each 3D-cell, and each sub-trajectory of $T$ is anonymized through location exchange. The anonymized trajectory $T^{*}$ is generated by reconstructing all anonymized sub-trajectories. In every 3D-cell to which $T$ belongs, each location is exchanged with one in another sub-trajectory when the constraints are met. Therefore, the probability of inferring that $T^{*}$ is anonymized from the original trajectory $T$ is $1/((LN_{T}^{l})^{2}PN_{T})$.
2. AC2: Each trajectory has been partitioned into different 3D-cells, in which the locations of each sub-trajectory are exchanged with those of other trajectories. The anonymous trajectory is generated after the anonymized locations are reconstructed. As described above, the probability that the locations in the reconstructed anonymous trajectory corresponding to $S$ can be linked to $T$ is $1/(N_{st}^{r}CN_{S})$.
3. AC3: Each location of $S$ has been exchanged with a location of another trajectory when the constraints are met. Thus, given $S$, an attacker cannot determine the other locations of $T$ with probability higher than $1/(LN_{T})^{2}$.
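
To make the bounds in this subsection concrete, the following minimal sketch evaluates them with exact fractions. The function names and the numeric inputs are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
from fractions import Fraction


def prob_single_cell(n_st: int) -> Fraction:
    """1/N_st: S lies in one 3D-cell holding n_st sub-trajectories."""
    return Fraction(1, n_st)


def prob_multi_cell(n_st_min: int, cn_s: int) -> Fraction:
    """1/(N_st^r * CN_S): S spans cn_s 3D-cells; n_st_min is the smallest
    sub-trajectory count among those cells (also the AC2 bound)."""
    return Fraction(1, n_st_min * cn_s)


def prob_ac1(ln_t_l: int, pn_t: int) -> Fraction:
    """AC1 bound 1/((LN_T^l)^2 * PN_T)."""
    return Fraction(1, ln_t_l ** 2 * pn_t)


def prob_ac3(ln_t: int) -> Fraction:
    """AC3 bound 1/(LN_T)^2."""
    return Fraction(1, ln_t ** 2)


# Hypothetical inputs, for illustration only.
print(prob_single_cell(8))    # 1/8
print(prob_multi_cell(5, 3))  # 1/15
print(prob_ac1(4, 3))         # 1/48
print(prob_ac3(12))           # 1/144
```

Under these formulas, every bound shrinks as the number of sub-trajectories per 3D-cell or the number of locations per trajectory grows.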

5.3.2 Complexity analysis

The time complexity of Algorithm 1 depends on the following two parts: (a) the time for computing the 3D dataset Cells, whose time complexity is $O(p*ln_{\textit{avg}})$, where $p$ is the number of trajectories in $TS$, i.e. $p=|TS|$, and $ln_{\textit{avg}}$ is the average length of all trajectories; (b) the time to execute the function MergeTr, which is called once by Algorithm 1. From Lines 3–22 in Function 1, we know that the locations in each 3D-cell of Cells are merged into a sub-trajectory. Hence, the number of calculations for this part is equal to the number of all the locations in $TS$. The time complexity of this part is also $O(p*ln_{\textit{avg}})$. So the total approximate time complexity of Algorithm 1 is $O(p*ln_{\textit{avg}})$. The space complexity of Algorithm 1 is $O(\text{max}(G^{3}*\textit{cln}_{\text{max}},G^{3}*\textit{ctn}_{\text{max}}))$, which is mainly due to storing the 3D datasets Cells and TS3d, where $\textit{cln}_{\text{max}}$ is the maximum number of locations in all the 3D-cells of Cells, and $\textit{ctn}_{\text{max}}$ is the maximum number of sub-trajectories in all the 3D-cells of TS3d.
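
The linear cost of the partition step can be illustrated with a minimal sketch that visits each location exactly once. It assumes that every dimension (x, y, t) is split uniformly into $G$ intervals over the data range; the names `cell_index` and `partition` are hypothetical and this is not the paper's Algorithm 1 or Function MergeTr.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Location = Tuple[float, float, float]   # (x, y, t)
CellIndex = Tuple[int, int, int]        # (ix, iy, it)
Bounds = List[Tuple[float, float]]      # (min, max) per dimension


def cell_index(loc: Location, bounds: Bounds, g: int) -> CellIndex:
    """Map a location to its 3D-cell, assuming g uniform intervals per axis."""
    index = []
    for value, (low, high) in zip(loc, bounds):
        if high == low:                  # degenerate axis: single interval
            index.append(0)
            continue
        k = int((value - low) / (high - low) * g)
        index.append(min(max(k, 0), g - 1))  # clamp the upper boundary
    return (index[0], index[1], index[2])


def partition(trajectories: Dict[str, List[Location]],
              g: int) -> Dict[CellIndex, Dict[str, List[Location]]]:
    """Group locations by 3D-cell and trajectory id in one pass."""
    all_locs = [loc for locs in trajectories.values() for loc in locs]
    bounds = [(min(l[d] for l in all_locs), max(l[d] for l in all_locs))
              for d in range(3)]
    cells = defaultdict(lambda: defaultdict(list))
    for tid, locs in trajectories.items():
        for loc in locs:                 # each location is visited once
            cells[cell_index(loc, bounds, g)][tid].append(loc)
    return {cell: dict(subs) for cell, subs in cells.items()}


# Toy data, for illustration only.
toy = {
    "a": [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (9.0, 9.0, 9.0)],
    "b": [(0.5, 0.5, 0.5), (10.0, 10.0, 10.0)],
}
cells = partition(toy, g=2)
```

In this sketch the locations of one trajectory that fall into the same 3D-cell form its sub-trajectory in that cell, so the grouping and the merge are done in the same pass.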

Algorithm 2 calls the function ComDist once for each 3D-cell of TS3d, that is, $G^{3}$ times. In the worst case, the time complexity of Function 2 depends on the following three parts: (a) Lines 2–5: the time for calculating the matrix time, whose time complexity is $O(\textit{ctn}_{\text{max}})$. (b) Lines 6–11: the time for calculating the matrix $pt$, whose time complexity is $O(\textit{ctn}_{\text{max}}^{2})$. (c) Lines 12–29: the time for calculating the matrices dso and dsl, whose time complexity is also $O(\textit{ctn}_{\text{max}}^{2})$. So the total time complexity of Function 2 is $O(\textit{ctn}_{\text{max}})+O(\textit{ctn}_{\text{max}}^{2})$, that is, $O(\textit{ctn}_{\text{max}}^{2})$ overall. Therefore, the time complexity of Algorithm 2 is $O(G^{3}*\textit{ctn}_{\text{max}}^{2})$. The space complexity of Algorithm 2 is $O(G^{3}*\textit{ctn}_{\text{max}}^{2})$, which is mainly due to storing the 3D dataset Dists.
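
The quadratic term comes from filling one pairwise matrix per 3D-cell. The sketch below shows this pattern for a single cell. The distance used here (Euclidean distance between spatial centroids) is only a placeholder assumption and is not the paper's similarity measure; `pairwise_matrix` and `centroid_distance` are hypothetical names.

```python
import math
from typing import Callable, List, Sequence, Tuple

Location = Tuple[float, float, float]   # (x, y, t)
SubTrajectory = Sequence[Location]


def centroid_distance(a: SubTrajectory, b: SubTrajectory) -> float:
    """Placeholder distance: Euclidean distance between spatial centroids."""
    ax = sum(p[0] for p in a) / len(a)
    ay = sum(p[1] for p in a) / len(a)
    bx = sum(p[0] for p in b) / len(b)
    by = sum(p[1] for p in b) / len(b)
    return math.hypot(ax - bx, ay - by)


def pairwise_matrix(
    subs: Sequence[SubTrajectory],
    dist: Callable[[SubTrajectory, SubTrajectory], float] = centroid_distance,
) -> List[List[float]]:
    """Fill a symmetric n x n matrix; n(n-1)/2 distance evaluations."""
    n = len(subs)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = dist(subs[i], subs[j])
    return matrix


# Toy sub-trajectories in one 3D-cell, for illustration only.
subs = [
    [(0.0, 0.0, 0.0)],
    [(3.0, 4.0, 1.0)],
    [(0.0, 0.0, 5.0), (6.0, 8.0, 6.0)],
]
m = pairwise_matrix(subs)
```

Repeating this for each of the $G^{3}$ cells, with at most $\textit{ctn}_{\text{max}}$ sub-trajectories per cell, gives the $O(G^{3}*\textit{ctn}_{\text{max}}^{2})$ bound stated above.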

Regarding Algorithm 3, its inputs TS3d and Dists are precomputed using Algorithms 1–2. This is reasonable because those datasets need to be computed only once, while the TPPG method may need to be run several times (e.g. with different parameters $\theta_{d}$ or $\theta_{t}$).

The time complexity of Algorithm 3 depends on the following two parts: (a) the time to execute the function LocEx, which is called once by Algorithm 3. From Lines 3–41 in Function 3, we know that each sub-trajectory in each 3D-cell of TS3d is scanned to process location exchange. The time complexity of Lines 3–32 is $O(G^{3}*\textit{ctn}_{\text{max}}*\textit{cln}_{\text{max}}^{2})$, while the time complexity of Lines 34–39 is $O(G^{3}*|\textit{Untras}|)$, where $|\textit{Untras}|\ll\textit{ctn}_{\text{max}}$. Hence, in the worst case, the time complexity of this part is also $O(G^{3}*\textit{ctn}_{\text{max}}*\textit{cln}_{\text{max}}^{2})$; (b) the time for reconstructing the anonymized trajectories, whose time complexity is $O(G^{3}*\textit{cln}_{\text{max}})$. So the total approximate time complexity of Algorithm 3 is $O(G^{3}*\textit{ctn}_{\text{max}}*\textit{cln}_{\text{max}}^{2})$. The space complexity of Algorithm 3 is identical to that of Algorithm 2.

In summary, the total time complexity of the proposed method is $O(p*ln_{\textit{avg}})+O(G^{3}*\textit{ctn}_{\text{max}}^{2})+O(G^{3}*\textit{ctn}_{\text{max}}*\textit{cln}_{\text{max}}^{2})$. Because $p*ln_{\textit{avg}}<G^{3}*\textit{cln}_{\text{max}}$ and $\textit{ctn}_{\text{max}}<\textit{cln}_{\text{max}}$, the worst-case total time complexity of the proposed method is $O(G^{3}*\textit{ctn}_{\text{max}}*\textit{cln}_{\text{max}}^{2})$. In fact, $\textit{ctn}_{\text{max}}$ and $\textit{cln}_{\text{max}}$ are the numbers of sub-trajectories and locations distributed in a single 3D-cell, so $\textit{ctn}_{\text{max}}\ll p$ and $\textit{cln}_{\text{max}}\approx(p*ln_{\textit{avg}})/G^{3}$. In addition, $G$ is set to 10 in our experiments, so $G^{3}=1000$ is not large. For the experimental dataset SynDS, $G^{3}<|\textit{SynDS}|$. Therefore, the time complexity of our method is reasonable.
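
As a rough numerical illustration of which term dominates, the sketch below evaluates the three terms of the total cost. The values of $p$, $ln_{\textit{avg}}$ and $G$ follow the SynDS setting described in Section 6, whereas the values of `ctn_max` and `cln_max` are hypothetical assumptions, as is the helper name `tppg_cost_terms`.

```python
def tppg_cost_terms(p: int, ln_avg: float, g: int,
                    ctn_max: int, cln_max: int) -> dict:
    """Evaluate the three terms of the total time complexity."""
    n_cells = g ** 3
    return {
        "partition": p * ln_avg,                      # Algorithm 1
        "distances": n_cells * ctn_max ** 2,          # Algorithm 2
        "exchange": n_cells * ctn_max * cln_max ** 2, # Algorithm 3
    }


# p, ln_avg and g follow SynDS; ctn_max and cln_max are assumed values.
terms = tppg_cost_terms(p=1005, ln_avg=45.5, g=10, ctn_max=20, cln_max=46)
dominant = max(terms, key=terms.get)
print(dominant)  # exchange
```

With these assumed per-cell maxima, the location exchange term is the largest by about two orders of magnitude, which agrees with the worst-case bound stated above.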

6. Experimental results and analysis

In this section, we perform a set of experiments to evaluate the performance of the proposed algorithm in terms of data utility.

6.1 Experimental environment and dataset

6.1.1 Experimental environment

The experiments are conducted with Matlab 8.3 on a PC with an Intel(R) Core(TM) 2 Duo CPU at 3.7 GHz and 8 GB of RAM. The operating system is Microsoft Windows 7. Two datasets are used to compare TPPG with the prior work GC_DM [14] and MDAV [5], where the SwapLocations method is used in the anonymization process.

According to Algorithms 1–3, for each dataset, our experimental process is specifically arranged as follows:

1. Pre-processing the experimental dataset based on our proposed trajectory model.
2. Calculating the 3D datasets Cells and TS3d by implementing Algorithm 1, and the 3D distance dataset Dists based on Algorithm 2.
3. Completing the trajectory anonymization using Algorithm 3.
4. Comparing the experimental results based on various evaluation metrics.

6.1.2 Dataset

As described in Section 6.1.1, in order to illustrate the superiority of our algorithm, we carry out a series of comparisons with two other algorithms, GC_DM and MDAV, which were verified on the following two datasets. Therefore, we also choose these datasets for the comparison experiments. Details are as follows.

1) Synthetic dataset

The Brinkhoff generator [36, 39] is used in the experiments to generate 1005 synthetic trajectories, containing 45727 locations in the city of Oldenburg, Germany. For ease of description, this synthetic dataset is denoted as SynDS. The parameters are set as follows: the number of timestamps is 101, the number of moving object classes is 6, the number of external object classes is 1, the number of moving objects generated per timestamp is 10, the number of external objects generated per timestamp is 1, the speed of the moving objects is 250, and the value of “report probability (0–1000)” is 1000, which means that a moving object is reported at every timestamp while it is moving. There are at most 101 locations in each trajectory, and 45.5 locations per trajectory on average. The total area is 23.57 km $\times$ 26.92 km.

2) Real-life dataset

We also use the dataset of cab trajectories collected in San Francisco in the United States [40, 41] as the real-life dataset in our experiments, which is denoted as RealDS. It contains GPS coordinates of approximately 500 cabs collected in the San Francisco Bay Area during May 2008. The locations in this dataset are very fine-grained because the average time interval between two consecutive locations is less than 10 seconds [41]. The format of each mobility trajectory file is as follows: each line contains latitude, longitude, occupancy and time, where the occupancy is ignored in our experiments. Since the trajectory of a cab during an entire month can hardly be considered a single trajectory, we use the method in [5] to pre-process this dataset. In particular, the trajectory data between May 25 at 12:04 and May 26 at 12:04 is extracted because this period had the highest concentration of locations in the dataset [5]. After a trajectory filtering and location interpolation process, we obtain 480 trajectories with 244 locations per trajectory on average.
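
Reading one line of a mobility trajectory file can be sketched as follows. The sketch assumes that the four fields are separated by whitespace and that the time is an integer UNIX timestamp; both are assumptions about the file layout, and the sample lines, the type `Fix` and the function names are hypothetical.

```python
from typing import List, NamedTuple


class Fix(NamedTuple):
    lat: float
    lon: float
    t: int          # assumed to be a UNIX timestamp in seconds


def parse_line(line: str) -> Fix:
    """Parse 'latitude longitude occupancy time'; occupancy is ignored."""
    lat, lon, _occupancy, t = line.split()
    return Fix(float(lat), float(lon), int(t))


def in_window(fix: Fix, start: int, end: int) -> bool:
    """Keep only fixes inside the extracted time window [start, end]."""
    return start <= fix.t <= end


# Illustrative lines only, not quoted from the dataset.
sample: List[str] = [
    "37.75134 -122.39488 0 1213084687",
    "37.75136 -122.39527 1 1213084659",
]
fixes = sorted((parse_line(s) for s in sample), key=lambda f: f.t)
kept = [f for f in fixes if in_window(f, 1213084660, 1213084700)]
```

Sorting by time before any further processing keeps each trajectory in chronological order regardless of the order of lines in the file.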

As mentioned above, when the dataset SynDS is generated, a location is reported at every timestamp in each trajectory. We use these two datasets mainly for comparing our method with the GC_DM and MDAV algorithms. Therefore, the same pre-processing method is applied to the dataset RealDS to avoid missing values. Following the method of [5], a missing location at a certain timestamp is interpolated from its two timewise-neighbouring locations, so there is no missing data in these two datasets. Noise is not considered in TPPG or in the other two algorithms. This issue will be addressed in our future research.
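
The interpolation of a missing location from its two timewise-neighbouring locations can be sketched as below. Linear interpolation in time is assumed here for illustration; the exact procedure is the one given in [5], and the function name `interpolate` is hypothetical.

```python
from typing import Tuple

Location = Tuple[float, float, float]   # (x, y, t)


def interpolate(prev: Location, nxt: Location, t: float) -> Location:
    """Linearly interpolate the position at time t, with prev.t < t < nxt.t."""
    (x0, y0, t0), (x1, y1, t1) = prev, nxt
    if not t0 < t < t1:
        raise ValueError("t must lie strictly between the two neighbours")
    w = (t - t0) / (t1 - t0)            # fraction of the interval elapsed
    return (x0 + w * (x1 - x0), y0 + w * (y1 - y0), t)


# Toy values, for illustration only.
print(interpolate((0.0, 0.0, 0.0), (10.0, 20.0, 4.0), 1.0))  # (2.5, 5.0, 1.0)
```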

6.2 Experimental results

In this section, a set of comparative experiments is conducted to evaluate the utility of the three algorithms. In our proposed method TPPG, the partition parameter $G$ is set to 10, and the weight of the trajectory orientation distance $\propto$ is set to 0.5.

To illustrate the results of the trajectory availability measurement clearly, we vary the corresponding parameters of the algorithms. For TPPG, we vary the parameters $\theta_{d}$ and $\theta_{t}$: $\theta_{d}$ ranges from the 0.5th to the 4th percentile of the sorted set of distances, and $\theta_{t}$ ranges from 1 to 8. For GC_DM and MDAV, we vary the parameters $R^{t}$ and $R^{s}$: $R^{t}$ ranges from 20 to 100, and $R^{s}$ ranges from $10^{3}$ to $10^{4}$.
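
Choosing $\theta_{d}$ as a percentile of the sorted set of distances can be sketched as follows. The nearest-rank definition of a percentile is assumed here, since the exact definition is not specified in this section, and the function name `percentile_threshold` and the toy distances are hypothetical.

```python
import math
from typing import Sequence


def percentile_threshold(distances: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0 < q <= 100) by the nearest-rank rule."""
    if not distances:
        raise ValueError("distances must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(distances)
    rank = math.ceil(q * len(ordered) / 100)   # 1-based rank
    return ordered[max(rank, 1) - 1]


# Toy distances 1.0 .. 200.0 in reverse order, for illustration only.
distances = [float(i) for i in range(200, 0, -1)]
theta_low = percentile_threshold(distances, 0.5)   # 1.0
theta_high = percentile_threshold(distances, 4)    # 8.0
```

A small percentile keeps only the closest pairs of sub-trajectories eligible for exchange, so increasing the percentile admits more candidate pairs.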

6.2.1 Average location loss

Based on Eqs (16) and (17), the avgLL values of the TPPG algorithm on SynDS and RealDS are calculated and shown in Fig. 4. In Fig. 4a, $\theta_{d}$ is varied and $\theta_{t}$ is set to 2. In Fig. 4b, $\theta_{t}$ is varied and $\theta_{d}$ is set to $0.1\times10^{-3}$.

Figure 4.

The avgLL values of TPPG algorithm run on SynDS and RealDS.

Figure 5.

The avgLL values of GC_DM and MDAV algorithms run on SynDS and RealDS.

Figure 6.

The $\textit{avg}{\cal R}$ values of TPPG algorithm run on SynDS and RealDS.

Note that, for clarity, only ordinals are shown on the horizontal axis in Fig. 4a. For example, the tick on the $X$-axis marked “1th” refers to the 1st percentile of the sorted set of distances. Later figures use the same representation.

As is shown in Fig. 4a, as $\theta_{d}$ varies, the avgLL values range from 10% to 38% for SynDS and from 8% to 11% for RealDS. In Fig. 4b, as $\theta_{t}$ varies, the avgLL values remain at approximately 21.8% for SynDS and range from 13% to 14% for RealDS. The general trend is that the larger the value of $\theta_{d}$, the higher the average location loss: $\theta_{d}$ is the distance threshold, so as it grows, more sub-trajectories satisfy the exchange conditions and the average location loss increases. The same reasoning applies to the relation between avgLL and the time threshold $\theta_{t}$. In addition, the large difference between the results on the two datasets is due to the size of the datasets: the length of SynDS is about twice that of RealDS.

In the GC_DM and MDAV algorithms, the parameters $R^{t}$ and $R^{s}$ are used to implement trajectory $k$-anonymity. The results of these two algorithms are shown in Fig. 5. In Fig. 5a, $R^{t}$ is varied and $R^{s}$ is set to $10^{4}$. In Fig. 5b, $R^{s}$ is varied and $R^{t}$ is set to 100. The other parameters are set following [5, 14]: $k=20$, $\textit{Max-Trash}=10$, $\textit{maxradius}=178.95$ for SynDS and $\textit{maxradius}=0.1$ for RealDS. The same parameter settings are used in the later experiments and, to avoid repetition, are not described again.

As is shown in Fig. 5a, as $R^{t}$ varies, the avgLL values range from 98.7% to 99.9% for SynDS and from 99.8% to 100% for RealDS. In Fig. 5b, as $R^{s}$ varies, the avgLL values range from 98.8% to 100% for SynDS and remain at approximately 99.9% for RealDS. Over the results obtained under all parameter settings, the average avgLL values of the GC_DM algorithm on SynDS and RealDS are 99.17% and 99.93%, respectively. The average avgLL values of the MDAV algorithm on SynDS and RealDS are 99.26% and 99.94%, respectively.

One can observe from Figs 4 and 5 that the average location loss caused by the TPPG algorithm is much smaller than that caused by the GC_DM and MDAV algorithms. This indicates that the utility of the data anonymized by our algorithm is much higher than that of the other two algorithms in terms of the avgLL metric. The main reason is that TPPG uses grid partitioning to divide the trajectories, so the number of sub-trajectories contained in each 3D-cell is much smaller than the number of trajectories in the whole area. The GC_DM and MDAV algorithms, in contrast, implement anonymity over the entire trajectory dataset, so the changes to the locations of each trajectory are more complicated and more extensive.

6.2.2 Average locations appearance ratio

Based on Eq. (18), the $\textit{avg}{\cal R}$ values of TPPG algorithm run on SynDS and RealDS are calculated and shown in Fig. 6. The setting of parameters $\theta_{d}$ and $\theta_{t}$ is the same as the description provided in Section 6.2.1, and the same below.

As is shown in Fig. 6a, with the change of $\theta_{d}$ , the $\textit{avg}{\cal R}$ values range from 97% to 97.5% for SynDS and range from 94.5% to 96.1% for RealDS. In Fig. 6b, with the change of $\theta_{t}$ , the $\textit{avg}{\cal R}$ values approximately remain at 97.33% for SynDS and remain at 97.05% for RealDS. The general trend is, the larger the value of $\theta_{d}$ , the higher the average locations appearance ratio.

The $\textit{avg}{\cal R}$ results of GC_DM and MDAV algorithms are shown in Fig. 7.

Figure 7.

The $\textit{avg}{\cal R}$ values of GC_DM and MDAV algorithms run on SynDS and RealDS.

As is shown in Fig. 7a, as $R^{t}$ varies, the $\textit{avg}{\cal R}$ values are not greater than 24% for SynDS and not greater than 2% for RealDS. In Fig. 7b, as $R^{s}$ varies, the range of $\textit{avg}{\cal R}$ values is similar to that in Fig. 7a. Over the results obtained under all parameter settings, the average $\textit{avg}{\cal R}$ values of the GC_DM algorithm on SynDS and RealDS are 16.24% and 0.78%, respectively. The average $\textit{avg}{\cal R}$ values of the MDAV algorithm on SynDS and RealDS are 13.39% and 0.64%, respectively.

One can observe from Figs 6 and 7 that the average locations appearance ratio of the TPPG algorithm is much higher than that of the GC_DM and MDAV algorithms. This indicates that the utility of the data anonymized by our algorithm is much higher than that of the other two algorithms in terms of the $\textit{avg}{\cal R}$ metric. Many locations are deleted in the anonymization process of the GC_DM and MDAV algorithms: a location is deleted if the number of its neighbors satisfying the threshold conditions is less than $k$.

6.2.3 Trajectory loss

Based on Eq. (19), the $TL$ values of the TPPG algorithm on SynDS and RealDS are calculated and shown in Fig. 8.

Figure 8.

The $TL$ values of TPPG algorithm run on SynDS and RealDS.

Figure 9.

The $TL$ values of GC_DM and MDAV algorithms run on SynDS and RealDS.

As is shown in Fig. 8a, as $\theta_{d}$ varies, the $TL$ values range from 1.9% to 2.2% for SynDS and remain at approximately 0.2% for RealDS. In Fig. 8b, as $\theta_{t}$ varies, the $TL$ values remain at approximately 2% for SynDS and at 0.2% for RealDS.

The $TL$ results of the GC_DM and MDAV algorithms are shown in Fig. 9.

As is shown in Fig. 9a, as $R^{t}$ varies, the $TL$ values range from 55% to 95% for SynDS and from 70% to 100% for RealDS. In Fig. 9b, as $R^{s}$ varies, the $TL$ values range from 55% to 100% for SynDS and from 70% to 88% for RealDS. Over the results obtained under all parameter settings, the average $TL$ values of the GC_DM algorithm on SynDS and RealDS are 67.31% and 91.13%, respectively. The average $TL$ values of the MDAV algorithm on SynDS and RealDS are 73.57% and 79.01%, respectively.

One can observe from Figs 8 and 9 that the trajectory loss of the TPPG algorithm is much lower than that of the GC_DM and MDAV algorithms. This indicates that the utility of the data anonymized by our algorithm is much higher than that of the other two algorithms in terms of the $TL$ metric. The main reason is that, in the anonymization process of the GC_DM and MDAV algorithms, many trajectories are deleted because the number of their neighbors satisfying the threshold conditions is less than $k$.

6.2.4 Spatio-temporal information loss of trajectory

Based on Eqs (20)–(22), the TIL values of TPPG algorithm run on SynDS and RealDS are calculated and shown in Fig. 10.

Figure 10.

The TIL values of TPPG algorithm run on SynDS and RealDS.

As is shown in Fig. 10a, with the change of $\theta_{d}$, the TIL values range from $0.5\times10^{6}$ to $6.5\times10^{6}$ for SynDS and range from $0.5\times10^{6}$ to $1\times10^{6}$ for RealDS. In Fig. 10b, with the change of $\theta_{t}$, the TIL values approximately remain at $2.4\times10^{6}$ for SynDS and remain at $0.4\times10^{6}$ for RealDS.

The TIL results of GC_DM and MDAV algorithms are shown in Fig. 11.

Figure 11.

The TIL values of GC_DM and MDAV algorithms run on SynDS and RealDS.

As is shown in Fig. 11a, as $R^{t}$ varies, the TIL values range from $0.1\times10^{7}$ to $2.8\times10^{7}$ for SynDS and remain at approximately $1.5\times10^{7}$ for RealDS. In Fig. 11b, as $R^{s}$ varies, the range of TIL values is similar to that in Fig. 11a. Over the results obtained under all parameter settings, the average TIL values of the GC_DM algorithm on SynDS and RealDS are $1.67\times10^{7}$ and $1.49\times10^{7}$, respectively. The average TIL values of the MDAV algorithm on SynDS and RealDS are $1.18\times10^{7}$ and $1.49\times10^{7}$, respectively.

One can observe from Figs 10 and 11 that the spatio-temporal information loss of the TPPG algorithm is less than that of the GC_DM and MDAV algorithms. This indicates that the utility of the data anonymized by our algorithm is higher than that of the other two algorithms in terms of the TIL metric. The main reason is that many locations are randomly exchanged in the anonymization process of the GC_DM and MDAV algorithms.

6.2.5 Accuracy ratio of AOI query

We set $Q=10$ in the measurement of ARAOI. Based on Eq. (23), the ARAOI values of TPPG algorithm run on SynDS and RealDS are calculated and shown in Fig. 12.

Figure 12.

The ARAOI values of TPPG algorithm run on SynDS and RealDS.

As is shown in Fig. 12a, as $\theta_{d}$ varies, the ARAOI values remain at 100% for SynDS and at 90% for RealDS. In Fig. 12b, as $\theta_{t}$ varies, the ARAOI values are the same as those in Fig. 12a.

The ARAOI results of GC_DM and MDAV algorithms are shown in Fig. 13.

Figure 13.

The ARAOI values of GC_DM and MDAV algorithms run on SynDS and RealDS.

As is shown in Fig. 13a, as $R^{t}$ varies, the ARAOI values are not greater than 40% for SynDS and not greater than 30% for RealDS. In Fig. 13b, as $R^{s}$ varies, the ARAOI values are not greater than 50% for SynDS and not greater than 30% for RealDS. Over the results obtained under all parameter settings, the average ARAOI values of the GC_DM algorithm on SynDS and RealDS are 33.82% and 6.50%, respectively. The average ARAOI values of the MDAV algorithm on SynDS and RealDS are 25.27% and 24%, respectively.

One can observe from Figs 12 and 13 that the accuracy ratio of AOI query of the TPPG algorithm is much higher than that of the GC_DM and MDAV algorithms. This indicates that the utility of the data anonymized by our algorithm is much higher than that of the other two algorithms in terms of the ARAOI metric.

To sum up, compared with the other two algorithms, TPPG algorithm achieves higher data utility in the process of trajectory anonymization. It shows that TPPG algorithm effectively balances the availability of published trajectory data and the ability of preserving privacy.

7. Conclusion

In this paper, we propose TPPG, a novel privacy-preserving trajectory data publication algorithm based on 3D-Grid partition. The trajectory region is first divided into several 3D-cells, and then location exchange or suppression is conducted in each 3D-cell. Based on the trajectory data partition, within each 3D-cell, the proposed method exchanges locations among trajectories or removes only a few locations of those sub-trajectories that do not meet the conditions, rather than removing whole trajectories. To obtain a more accurate distance between two trajectories, our method considers three scenarios of trajectory distribution and measures trajectory similarity based on time, orientation, spatial locations and other features of the trajectories. After the related anonymized sub-trajectories are reconstructed, the anonymous trajectory dataset is obtained. Processing in sub-regions speeds up the trajectory anonymization process and improves data availability. Theoretical analysis and experimental results on a real-life cab trajectory dataset and a synthetic trajectory dataset show that the proposed approach is suitable for privacy-preserving trajectory data publication. In future research, we plan to integrate the features of the stay points of each trajectory into our method in order to reflect reality more accurately in trajectory privacy preservation.

Acknowledgments

The authors would like to thank the reviewers for their useful comments and suggestions for this paper. This work was supported by the National Natural Science Foundation of China (61702010, 61672039), the Natural Science Foundation of Anhui Province (1508085QF134), the University Natural Science Research Program of Anhui Province (KJ2017A327), and the Science and Technology Project of Wuhu City (2016cxy04).

Conflict of interest

The authors claim that no conflict of interest exists in the submission of this manuscript, and the manuscript is approved by all co-authors for publication. None of the material in the paper has been published or is under consideration for publication elsewhere.

References

1. Tian, Maybank and Zhang, An incremental DPMM-based method for trajectory clustering, modeling, and retrieval, IEEE Transactions on Pattern Analysis and Machine Intelligence 35(5) (2013), 1051–1065.

2. Luo, Tan, Chen and Ni L.M., Finding time period-based most frequent path in big trajectory data, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2013, pp. 713–724.

3. Zheng, Trajectory data mining: An overview, ACM Transactions on Intelligent Systems and Technology (TIST) 6(3) (2015), 29.

4. Chen and Chen, The discovery of personally semantic places based on trajectory data mining, Neurocomputing 173 (2016), 1142–1153.

5. Domingo-Ferrer and Trujillo-Rasua, Microaggregation- and permutation-based anonymization of movement data, Information Sciences 208 (2012), 55–80.

6. Cai, Wang, Chen and Jiang, Trajectory-based anomalous behaviour detection for intelligent traffic surveillance, IET Intelligent Transport Systems 9(8) (2015), 810–816.

7. Zheng, Yuan N.J., Shang and Zhou, Online discovery of gathering patterns over trajectories, IEEE Transactions on Knowledge and Data Engineering 26(8) (2014), 1974–1988.

8. Zhao, Zhang and Ma, A trajectory privacy protection approach via trajectory frequency suppression, Chinese Journal of Computers 37(10) (2014), 2096–2106.

9. Puttaswamy K.P.N., Wang, Steinbauer, Agrawal, El Abbadi, Kruegel and Zhao B.Y., Preserving location privacy in geosocial applications, IEEE Transactions on Mobile Computing 13(1) (2014), 159–173.

10. Song and Park, A privacy-preserving location-based system for continuous spatial queries, Mobile Information Systems 2016(1) (2016), 1–9.

11. Chen, Fung B.C.M., Mohammed, Desai B.C. and Wang, Privacy-preserving trajectory data publishing by local suppression, Information Sciences 231 (2013), 83–97.

12. Ghasemi Komishani, Abadi and Deldar, PPTD: Preserving personalized privacy in trajectory data publishing by sensitive attribute generalization and trajectory local suppression, Knowledge-Based Systems 94 (2016), 43–59.

13. Xin, Xie Z.Q. and Yang, The privacy preserving method for dynamic trajectory releasing based on adaptive clustering, Information Sciences 378 (2017), 131–143.

14. Wang, Yang and Zhang, Privacy preserving algorithm based on trajectory location and shape similarity, Journal on Communications 36(2) (2015), 1–14.

15. Chen, Acs and Castelluccia, Differentially private sequential data publication via variable-length n-grams, in: Proceedings of the ACM Conference on Computer and Communications Security, 2012, pp. 638–649.

16. C.Y.T., Yau D.K.Y., Yip N.K. and Rao N.S.V., Privacy vulnerability of published anonymous mobility traces, IEEE/ACM Transactions on Networking 21 (2013), 720–733.

17. Cicek A.E., Nergiz M.E. and Saygin, Ensuring location diversity in privacy-preserving spatio-temporal data publishing, The VLDB Journal 23(4) (2014), 609–625.

18. Al-Hussaeni, Fung B.C.M. and Cheung W.K., Privacy-preserving trajectory stream publishing, Data & Knowledge Engineering 94 (2014), 89–109.

19. Sweeney, k-anonymity: A model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5) (2002), 557–570.

20. Machanavajjhala, Kifer, Gehrke and Venkitasubramaniam, l-diversity: Privacy beyond k-anonymity, ACM Transactions on Knowledge Discovery from Data 1(1) (2007), 3.

21. Tang, Cui, Ren, Liu and Buyya, Ensuring security and privacy preservation for cloud data services, ACM Computing Surveys (CSUR) 49(1) (2016), 13.

22. Feng, Liu and Zhang, A mobile terminal based trajectory preserving strategy for continuous querying LBS users, in: Proceedings of the IEEE International Conference on Distributed Computing in Sensor Systems (DCOSS), 2012, pp. 92–98.

23. Hoh, Gruteser, Xiong and Alrabady, Achieving guaranteed anonymity in GPS traces via uncertainty-aware path cloaking, IEEE Transactions on Mobile Computing 9(8) (2010), 1089–1107.

24. Mano, Minami and Maruyama, Pseudonym exchange for privacy-preserving publishing of trajectory data set, in: Proceedings of the IEEE 3rd Global Conference on Consumer Electronics (GCCE), 2014, pp. 691–695.

25. Terrovitis and Mamoulis, Privacy preservation in the publication of trajectories, in: Proceedings of the IEEE International Conference on Mobile Data Management, 2008, pp. 65–72.

26. Terrovitis, Poulis, Mamoulis and Skiadopoulos, Local suppression and splitting techniques for privacy preserving publication of trajectories, IEEE Transactions on Knowledge & Data Engineering 29(7) (2017), 1466–1479.

27. Abul, Bonchi and Nanni, Never walk alone: Uncertainty for anonymity in moving objects databases, in: Proceedings of the IEEE 24th International Conference on Data Engineering, 2008, pp. 376–385.

28. Nergiz M.E., Atzori, Saygin and Guc, Towards trajectory anonymization: A generalization-based approach, Transactions on Data Privacy 2(1) (2009), 47–75.

29. Abul, Bonchi and Nanni, Anonymization of moving objects databases by clustering and perturbation, Information Systems 35(8) (2010), 884–910.

30. Monreale, Andrienko, Giannotti, Pedreschi, Rinzivillo and Wrobel, Movement data anonymity through generalization, Transactions on Data Privacy 3(2) (2010), 91–121.

31. Huo and Meng, History trajectory privacy-preserving through graph partition, in: Proceedings of the 1st International Workshop on Mobile Location-Based Service (MLBS), 2011, pp. 71–78.

32. Huo, Meng and Huang, You can walk alone: Trajectory privacy-preserving through significant stays protection, in: Proceedings of the International Conference on Database Systems for Advanced Applications, 2012, pp. 351–366.

33. Huo, Meng and Huang, PrivateCheckIn: Trajectory privacy-preserving for check-in services in MSNS, Chinese Journal of Computers 36(4) (2013), 716–726.

34. Gruteser and Grunwald, Anonymous usage of location-based services through spatial and temporal cloaking, in: Proceedings of the 1st International Conference on Mobile Systems, Applications and Services, 2003, pp. 31–42.

35. Kiran and Kavya N.P., A survey on methods, attacks and metric for privacy preserving data publishing, International Journal of Computer Applications 53(18) (2012), 20–28.

36. Brinkhoff, Generating traffic data, Bulletin of the Technical Committee on Data Engineering 26(2) (2003), 19–25.

37. Lee J.-G., Han and Whang K.-Y., Trajectory clustering: A partition-and-group framework, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2007, pp. 593–604.

38. Sanchez, Aye Z.M.M., Rubinstein B.I.P. and Ramamohanarao, Fast trajectory clustering using hashing methods, in: Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), 2016, pp. 3689–3696.

39. Brinkhoff, A framework for generating network-based moving objects, GeoInformatica 6(2) (2002), 153–180.

40. Piorkowski, Sarafijanovic-Djukic and Grossglauser, CRAWDAD dataset epfl/mobility (v. 2009-02-24), doi: 10.15783/C7J010.

41. Piorkowski, Sarafijanovic-Djukic and Grossglauser, A parsimonious model of mobile partitioned networks with clustering, in: Proceedings of the 1st International Conference on Communication Systems and Networks, 2009, pp. 1–10.
