Hierarchical interpolation point anonymity for trajectory privacy protection

Abstract

Traditional trajectory privacy protection algorithms treat the task as a single-layer problem. In keeping with the way humans typically solve complex problems hierarchically, we propose a two-level hierarchical granularity model for this problem. The first level of the proposed model is a coarse-grained layer, in which the original dataset is divided into groups. The second level is a fine-grained layer, in which problems are solved within each group rather than on the original dataset, which reduces complexity and computation while improving efficiency. On the basis of this hierarchical model, we propose an interpolation-based trajectory anonymity algorithm for privacy protection with temporal and spatial granularity constraints. In addition, we propose the interpolation-based modified Hausdorff distance on adjacent segment (IMHD_AS) as the trajectory similarity criterion for clustering within each group; it provides a smaller clustering area and better data utility than the traditional Euclidean distance. Further, we theoretically prove that the proposed algorithm outperforms the traditional algorithm in terms of data distortion and anonymity cost, and we verify its efficacy experimentally. Compared with the classic anonymity algorithm, the maximum information loss and the anonymity cost are reduced by up to 21.04% and 28.32%, respectively.

Keywords

Hausdorff distance; interpolation points; hierarchical granularity; anonymity; trajectory privacy protection

1. Introduction

The proliferation of GPS applications and mobile devices has led to the accumulation of a large quantity of trajectory data that include valuable information for intelligent transportation, route planning, city computing, and related applications [1]. Trajectory data collected by mobile positioning techniques and location-aware devices include substantial amounts of sensitive spatio-temporal and semantic information that supports many applications through data analysis and mining [2]. Users’ activity trajectories, which may reveal their private information, such as home and workplace details, require proper protection [3]. For instance, a user’s trajectory could expose the user’s interest in places and behaviors in time via inference and link attacks [4]. Therefore, the preservation of individual privacy when data are published is attracting increasing attention [5].

The improvement of data utility is the main objective of spatial and temporal trajectory publication. However, guaranteeing data utility and protecting the privacy of users are conflicting objectives that require optimization of trade-offs. Moreover, some optimization algorithms involve a large number of calculations, leading to extremely low efficiency.

In this paper, we propose a two-level model that addresses the above problem from the new perspective of a hierarchical approach with multiple levels of granularity. In the first layer of the proposed model, the original dataset is divided into groups to reduce complexity and computation while improving efficiency. The second layer solves the problem within each group via temporal and spatial granularity constraints. In this two-layer model, we first employ the Hausdorff distance with interpolation points to measure the similarity of trajectories and generate $k$-anonymous sets. Then, the $k$-anonymous sets are perturbed on the basis of the interpolation points under temporal and spatial constraints to achieve cooperative anonymity. In this way, both trajectory privacy protection and data availability are realized.

The main contributions of this paper can be summarized as follows.

1. To address the high computation cost of data availability optimization algorithms, we propose a two-layer hierarchical granularity model that solves complex problems at different levels, which reduces complexity and computation while improving efficiency. This approach is aligned with the way human beings characteristically think. First, we generate different equivalence classes, i.e., subsets or groups; this corresponds to the coarse granularity layer. Then, we solve the problems on the subsets instead of on the global dataset, corresponding to the fine granularity layer.
2. To the best of our knowledge, this is the first time that the Hausdorff distance has been employed for trajectory privacy protection to improve data utility. Using temporal granularity constraints, we propose the interpolation-based modified Hausdorff distance on adjacent segment (IMHD_AS), which provides a smaller clustering area than the traditional Euclidean distance, as the trajectory similarity criterion for clustering. The related proof is presented on the basis of theoretical analysis and experimental results.
3. We use the interpolation points instead of the original sampling points to perturb the trajectory. Because of their uncertainty, the sampling points form an uncertainty region of the trajectory, which can be used for cooperative anonymity. After perturbation, each trajectory in the anonymous set lies in the uncertainty region and cannot be distinguished; thus, cooperative anonymity is achieved, and user privacy is protected. Moreover, because the perturbation based on interpolation points only needs to change a small number of location points, most of the original location points remain unchanged. Furthermore, IMHD_AS yields a shorter moving distance than the traditional Euclidean distance. Therefore, the loss of information can be reduced.
4. It is theoretically proven that the proposed algorithm outperforms the traditional algorithm in terms of data distortion and anonymity cost. The superiority of the algorithm over the classic trajectory k-anonymity algorithm is experimentally verified using two experimental datasets.

The remainder of this paper is organized as follows. Section 2 reviews related work. Section 3 provides background knowledge and an analysis of the problem. Section 4 introduces the concept underlying the proposed algorithm and theoretically proves the effectiveness of the algorithm. Section 5 describes the evaluation criteria adopted in the algorithm. Section 6 outlines the experimental verification of the effectiveness of the algorithm on two datasets in terms of information loss, anonymity cost, and time efficiency. Finally, Section 7 summarizes the study and briefly explores future research directions.

2. Related work

In this section, we review related work on trajectory privacy protection and the Hausdorff distance.

2.1 Trajectory privacy protection

To address the problem that directly hiding an identifier does not achieve privacy protection, Sweeney [6] first proposed the k-anonymity privacy protection model. This model is mainly applicable to relational data tables. The basic assumption is that each tuple in the table corresponds to a unique individual and that the attributes are partitioned into quasi-identifiers and sensitive attributes. The main idea is to generalize $k$ records into an anonymous group that shares the same quasi-identifier values so that the probability that an attacker can identify one tuple is reduced to $1/k$. Gruteser and Grunwald [7] first applied k-anonymity to location services and proposed the concept of location k-anonymity, which requires the user to send a spatio-temporal location that is indistinguishable from that of the other $k-1$ users. Existing trajectory privacy protection methods are classified into the following three main categories.

The first type is the dummy trajectory. The purpose of such methods is to interfere with the original data by adding virtual trajectories while ensuring that the statistical properties of the disturbed trajectory set are not significantly changed [8]. You et al. [9] proposed a virtual trajectory generation method that generates a virtual trajectory according to the following principles: (1) the mode of motion of the virtual trajectory is similar to that of the true trajectory; (2) there are as many intersections as possible. Virtual trajectories can be generated using either a random mode or a transformation mode. Lei et al. [10] used a rotation scheme to generate user trajectories that are difficult to distinguish from virtual ones. Gao et al. [11] proposed a way of protecting the privacy of users’ location and trajectory information with high-quality services in participatory sensing applications, and they achieved a trade-off between protection of location and trajectory privacy and quality of service by using a dummy trajectory. Hara et al. [12] took physical constraints into account so that the dummy trajectories generated could be applied to the real environment. They also considered the ability to track user locations to avoid accidental leakage of user privacy. Xiao et al. [13] proposed a centralized privacy-preserving location-sharing system that integrates social network servers and location-based servers into a location-storing social network server (LSSNS), and they mapped dummy and real trajectories using a dedicated mapping protocol between LSSNS and cellular towers (CTs) to protect the trajectory privacy of social networks. Sun et al. [14] proposed a new virtual location privacy protection algorithm that considers the computing costs and the various privacy requirements of different users; compared to the previous algorithm, the computational cost and efficiency are improved while the same privacy level is achieved. 
In general, the shape and semantic properties of the dummy trajectory should not deviate significantly from those of the original trajectory because severe distortion may allow an attacker to easily deduce the real trajectory of the user [15]. Because previous anonymity methods cannot protect privacy under both single and continuous queries, Liao et al. [16] used a k-anonymity mechanism based on a sliding window to choose $k-1$ virtual positions in a single query and $k-1$ virtual trajectories in a continuous query. Such methods are simple but not very effective. First, a dummy trajectory may pass through existing obstacles, and an attacker can easily discard such an apparently unreasonable trajectory. Second, the costs of calculating and storing dummy trajectories are high. Finally, trajectory data availability is poor because false trajectories are published, which affects the quality of queries or applications based on such data.

The second type is the mixed-zone method, which ignores points that have sensitive properties or are frequently accessed during a trajectory release; such methods only release non-sensitive sampling points [17]. This can be accomplished by suppressing location updates or using pseudonyms to anonymize trajectories when a user enters a mixed area. Arain et al. [18] proposed an effective privacy protection protocol that provides dynamic pseudonyms to users to improve the security of road networks. Gao et al. [19] proposed a trajectory privacy protection framework to improve the time-dependent theoretical mixed-zone model from the perspective of graph theory and reduce information loss. Typical mixed areas are generally set at road intersections [20]. Liu et al. [21] considered the deployment of multiple mixed areas as a cost-constraint optimization problem. The influence of traffic density was considered to improve the protection efficiency of the system. Terrovitis and Mamoulis [22] hypothesized that different opponents have trajectory information of different moving objects, which indicates that an inhibition method can reduce the probability of disclosure of real trajectories. Their method iteratively inhibits some trajectory segments until the privacy leakage probability is less than a given threshold. Although the published trajectory does not include original sensitive information, such an approach can lead to serious distortion of the trajectory data.

The third type is trajectory k-anonymity. Such methods can ensure that the published data are real, and they achieve a certain degree of balance between privacy protection and data utility. The $(k,\delta)$-anonymity privacy protection model proposed by Abul et al. [23] is based on the uncertainty of the trajectory information, where $\delta$ represents the uncertainty threshold of the trajectory. They designed the “never walk alone” (NWA) algorithm on the basis of this model to realize trajectory k-anonymity. However, the NWA algorithm uses the Euclidean distance to calculate the similarity between trajectories, resulting in high data distortion. They subsequently proposed the “wait for me” (W4M) method [24] based on the NWA algorithm, which measures trajectory similarity with the Edit Distance on Real sequence (EDR). It does not need to preprocess the trajectory information and it retains more trajectory information; however, the EDR involves recursive evaluation, resulting in high time complexity. Nergiz et al. [25] proposed a trajectory k-anonymity grouping algorithm based on condensation. In this method, k-anonymity is enforced by clustering trajectories with the log-cost distance metric, and the trajectory is then reconstructed by randomly selecting location samples from the anonymous region. Trujillo-Rasua and Domingo-Ferrer [26] showed that the above anonymity methods cannot effectively hide the original trajectory; they used micro-aggregation to group the trajectories and then replaced the locations within each cluster to solve the problem. Gao et al. [27] argued that existing methods ignore trajectory similarity and direction, which have a significant impact on privacy. Therefore, they used trajectory angles to evaluate trajectory similarity and direction, and they built anonymous regions based on trajectory distance.
However, because angular distance mainly focuses on the direction of the trajectory and is therefore best suited to trajectories of similar shape, it cannot satisfy the general need for a distance measure between geographic locations. Huo et al. [28] transformed the k-anonymity problem in NWA into a graph partitioning problem to reduce the cost of anonymity; thus, they could control the size of each anonymous group and the handling of trajectory outliers. Trajectory alignment results in a certain amount of data loss. To solve this problem, Xin et al. [29] proposed a dynamic trajectory release method based on adaptive clustering, which uses the Gibbs sampling clustering method to detect representative regions.

Most existing trajectory privacy protection algorithms analyze and solve the problem on a single level and do not consider the possibility of multiple levels. In the real world, however, complex problems are typically solved using a hierarchy of levels. Therefore, we take a novel perspective and propose a two-layer hierarchical granularity model.

2.2 Hausdorff distance

Trajectory similarity is currently used in many applications, such as clustering analysis and pattern mining based on trajectory sequences. The main measure used is the Euclidean distance [23, 28]; other measures include edit distance [24, 30], logarithmic distance [25], and angle distance [27]. The computation of trajectory similarity is also needed in trajectory privacy protection. We have not encountered any studies on the use of the Hausdorff distance for trajectory privacy protection. Table 1 compares various measures.

Table 1
Distance measures and their characteristics

Distance measures	Accuracy match	Preprocessing	Noise sensitivity	Time complexity
Euclidean distance	Yes	Yes	Yes	Low
Hausdorff distance	Yes	Yes	Yes	Low
Angle distance	Yes	Yes	Yes	Low
Edit distance	No	No	No	High

The Hausdorff distance is used to measure the distance between two given sets of points in Euclidean space. It is employed mainly for point set matching in computer image recognition applications to judge the similarity of image data. It has also been successfully employed in many other fields. For example, in the field of mathematics, Ahn and Hoffmann [31] used the Hausdorff distance to obtain an elliptical approximation of a high-precision polynomial curve, thereby producing a more reasonable offset than other methods. In the field of image recognition, medical imaging in particular, Cui et al. [32] proposed a topological graph model and used the Hausdorff distance to solve the problem of identifying the boundary of an object in anatomical imaging under low contrast or overlapping distributions of adjacent tissues. Qian and Yang [33] used the Hausdorff distance to evaluate a new comprehensive framework for the segmentation of atherosclerotic carotid plaques in ultrasound images. Cheimariotis et al. [34] proposed a segmentation method for the automatic detection of the cavity boundary in the entire cavity of optical coherence tomography images, and the validity of the method was verified using the Hausdorff distance. In the field of industrial manufacturing, Li et al. [35] simplified the measurement model of complex free-form surface components with the directed Hausdorff distance, which can be used for quality detection in the precision manufacturing of parts.

Huttenlocher et al. [36] proved that the Hausdorff distance is more tolerant of location perturbations than other related methods. However, because it is highly sensitive to the presence of outliers, Dubuisson and Jain [37] proposed an arithmetic averaging method based on the Hausdorff distance to increase the robustness of the distance against outliers or noise.

The trajectory information of a moving object consists of a finite series of discrete position-updating points. A trajectory dataset is similar to an image in computer graphics and pattern recognition. The computation of trajectory similarity can be regarded as a multidimensional data matching problem, and the distance between trajectories can be regarded as the distance between two location subsets. Inspired by this, we propose trajectory privacy protection based on the Hausdorff distance.

In trajectory point matching, Shao et al. [38] suggested that interpolation points should be used instead of trajectory sampling points to improve the robustness of the Hausdorff distance against the location update strategy. However, the points in a graphic image are not characterized as a time sequence, and as the Hausdorff distance does not involve the attributes of a timestamp, it cannot be used directly in trajectory anonymity. In this paper, we use the trajectory timestamp as a constraint and limit the interpolation points to the corresponding adjacent timestamps, and we propose IMHD_AS as the metric for ascertaining trajectory similarity.

3. Background knowledge and problem description

3.1 Background knowledge

Definition 1 (Trajectory). A trajectory $Tr$ is a sequence of timestamped triples $(t_{i},x_{i},y_{i})$:

$\displaystyle Tr=\{(t_{1},x_{1},y_{1}),(t_{2},x_{2},y_{2}),\dots,(t_{n},x_{n},y_{n})\}$ (1)

where $x_{i},y_{i}$ are the coordinates of the trajectory at timestamp $t_{i}$ $(1\leqslant i\leqslant n)$.

A trajectory can also be represented as a continuous series of line segments $\overline{p_{j-1}p_{j}}$:

$\displaystyle Tr=\left(\overline{p_{1}p_{2}},\overline{p_{2}p_{3}},\dots,\overline{p_{j-1}p_{j}},\dots,\overline{p_{\textit{len}_{Tr}-1}p_{\textit{len}_{Tr}}}\right)$ (2)

where $p_{i}$ represents a sampling point in the trajectory $Tr$, $\textit{len}_{Tr}$ is the length of $Tr$, and $\overline{p_{j-1}p_{j}}$ is the line segment between two adjacent sampling points, which approximates the real trajectory. As the sampling interval approaches 0, the trajectory approaches the real motion path. However, the higher the sampling frequency, the higher the cost of storage and trajectory analysis.
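The two views of a trajectory in Eqs. (1) and (2) can be illustrated with a minimal Python sketch. The type alias `Sample` and the helper `segments` are hypothetical names introduced here for illustration only; they do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical alias: one timestamped sample (t, x, y), as in Eq. (1).
Sample = Tuple[float, float, float]
Point = Tuple[float, float]

def segments(tr: List[Sample]) -> List[Tuple[Point, Point]]:
    """Return the consecutive segments (p_{j-1}, p_j) of Eq. (2)."""
    pts = [(x, y) for _, x, y in tr]
    return list(zip(pts[:-1], pts[1:]))

# A toy trajectory with three samples yields two segments.
tr = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 1.0)]
segs = segments(tr)
print(segs)
```

A trajectory with $n$ sampling points always yields $n-1$ segments in this representation.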

Definition 2 (Hausdorff distance, HD) [39]. Given two point sets A and B,

$\displaystyle\left\{\begin{array}{l}A=\{a_{1},a_{2},\dots,a_{i},\dots,a_{m}\}\\ B=\{b_{1},b_{2},\dots,b_{j},\dots,b_{n}\}\end{array}\right.$ (3)

The Hausdorff distance between A and B is defined as follows.

$\displaystyle\left\{\begin{array}{l}H(A,B)=\max(h(A,B),h(B,A))\\ h(A,B)=\max_{a_{i}\in A}\min_{b_{j}\in B}\text{dist}(a_{i},b_{j})\\ h(B,A)=\max_{b_{j}\in B}\min_{a_{i}\in A}\text{dist}(a_{i},b_{j})\end{array}\right.$ (4)

where $\text{dist}(a_{i},b_{j})$ is the Euclidean distance between two points.
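As a minimal sketch of Definition 2, the directed and symmetric Hausdorff distances can be computed by brute force over two small point sets. The function names below are illustrative assumptions, and the quadratic-time loop is chosen for clarity rather than efficiency.

```python
import math

def directed_hausdorff(A, B):
    """h(A, B): the largest nearest-neighbour distance from a point of A to B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """H(A, B): the larger of the two directed distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (4.0, 0.0)]
h_ab = directed_hausdorff(A, B)  # nearest neighbours: 0 and 1
h_ba = directed_hausdorff(B, A)  # nearest neighbours: 0 and 3
print(h_ab, h_ba, hausdorff(A, B))
```

The example shows that the directed distance is not symmetric, which is why the definition takes the maximum of both directions.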

Because the distance between the point sets is calculated from the maximum and minimum values of the original Hausdorff distance, it will be affected by the presence of outliers. To improve the robustness of the Hausdorff distance against outliers and noise, Dubuisson and Jain [37] used an averaged value to reduce the impact of outliers and proposed a modified Hausdorff distance.

Definition 3 (Modified Hausdorff distance, MHD).

$\displaystyle h(T_{i},T_{j})=\frac{1}{n_{i}}\sum_{p_{a}\in T_{i}}\min_{p_{b}\in T_{j}}\text{dist}(p_{a},p_{b})$ (5)

where $n_{i}$ is the number of sampling points in $T_{i}$.
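The effect of averaging in Definition 3 can be seen in a short sketch that compares the directed Hausdorff distance with the directed form of Eq. (5) on a trajectory containing one outlier. The function names and the toy data are assumptions made for illustration.

```python
import math

def directed_hausdorff(Ti, Tj):
    """Directed Hausdorff distance: maximum of the nearest-neighbour distances."""
    return max(min(math.dist(a, b) for b in Tj) for a in Ti)

def mhd_directed(Ti, Tj):
    """Directed form of Eq. (5): mean of the nearest-neighbour distances."""
    return sum(min(math.dist(a, b) for b in Tj) for a in Ti) / len(Ti)

# Ti matches Tj except for a single outlier at x = 10.
Ti = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
Tj = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
hd = directed_hausdorff(Ti, Tj)
mhd = mhd_directed(Ti, Tj)
print(hd, mhd)
```

The single outlier fully determines the max-based value, whereas its influence on the averaged value is divided by the number of sampling points.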

On the assumption of equal time sampling intervals, Shao et al. [38] proposed an improved interpolation distance calculation method based on the modified Hausdorff distance, which overcomes the problem of the inconsistency of the sampling interval in the trajectory space and is robust against changes in the position updating strategy.

Definition 4 (Interpolation-based modified Hausdorff distance, IMHD).

$\displaystyle h(T_{i},T_{j})=\frac{1}{n_{i}}\sum_{p_{a}\in T_{i}}\min_{\overline{p_{b-1}p_{b}}\in T_{j}}\text{dist}(p_{a},\overline{p_{b-1}p_{b}})$ (6)
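A minimal sketch of Eq. (6) follows. It assumes that the distance from a point to a segment is the distance to the closest point on that segment (the foot of the perpendicular when it falls inside the segment, otherwise the nearer end point); the helper names are hypothetical.

```python
import math

def point_segment_dist(p, a, b):
    """Distance from point p to the closest point on segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    denom = dx * dx + dy * dy
    if denom == 0.0:  # degenerate segment: a and b coincide
        return math.dist(p, a)
    # Projection parameter of p onto the line through a and b, clamped to [0, 1].
    u = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / denom
    u = max(0.0, min(1.0, u))
    return math.dist(p, (a[0] + u * dx, a[1] + u * dy))

def imhd_directed(Ti, Tj):
    """Directed IMHD of Eq. (6): mean over Ti of the distance to the nearest segment of Tj."""
    segs = list(zip(Tj[:-1], Tj[1:]))
    return sum(min(point_segment_dist(p, a, b) for a, b in segs) for p in Ti) / len(Ti)

Ti = [(1.0, 1.0), (3.0, 1.0)]
Tj = [(0.0, 0.0), (4.0, 0.0)]  # sparsely sampled straight line
value = imhd_directed(Ti, Tj)
print(value)
```

In this example the two trajectories run in parallel at distance 1, and the interpolated measure recovers that value even though `Tj` is sampled only at its end points.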

Shao et al. [38] proposed the use of interpolation points instead of trajectory sampling points for trajectory point matching to improve the robustness of the interpolation Hausdorff distance against the location update strategy. However, points in a graphic image are not characterized by a time sequence, and the Hausdorff distance does not involve the attributes of a timestamp. Therefore, it cannot be used directly in trajectory anonymity. In this paper, we use the trajectory timestamp as the constraint and limit the interpolation points to the corresponding adjacent timestamps. IMHD_AS is proposed as the criterion for ascertaining trajectory similarity.

On the basis of Definition 4, we propose IMHD_AS under a time constraint as follows.

Definition 5 (Interpolation-based modified Hausdorff distance on adjacent segment, IMHD_AS).

$\displaystyle\left\{\begin{array}{l}\text{dist}(p_{a},p_{b})=\min(\text{dist}(p_{a},\overline{p_{b-1}p_{b}}),\text{dist}(p_{a},\overline{p_{b}p_{b+1}}))\\ \text{dist}(p_{a},\overline{p_{b-1}p_{b}})=\min_{\odot\in\overline{p_{b-1}p_{b}}}\text{dist}(p_{a},\odot)\\ \text{dist}(p_{a},\overline{p_{b}p_{b+1}})=\min_{\odot\in\overline{p_{b}p_{b+1}}}\text{dist}(p_{a},\odot)\end{array}\right.$ (7)

where $\text{dist}(p_{a},p_{b})$ is the distance between $p_{a}$ and $p_{b}$, and $\odot$ is the interpolation point on the segment $\overline{p_{b-1}p_{b}}$ (or $\overline{p_{b}p_{b+1}}$) that minimizes $\text{dist}(p_{a},\odot)$. In general, $\odot$ is the foot of the perpendicular from the sampling point $p_{a}$ to the line segment.

$\displaystyle\text{dist}({Tr}_{a},{Tr}_{b})=\frac{1}{t}\sum_{p_{a}\in{Tr}_{a}}\text{dist}(p_{a},p_{b})$ (8)

where $\text{dist}({Tr}_{a},{Tr}_{b})$ is the IMHD_AS between ${Tr}_{a}$ and ${Tr}_{b}$ , and $t$ is the number of sampling points.
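Definition 5 can be sketched as follows, under two assumptions that the text does not state explicitly: both trajectories are sampled at the same timestamps, so that $p_{b}$ is the point of ${Tr}_{b}$ with the same index as $p_{a}$, and at the first and last timestamps only the single existing adjacent segment is used. All function names are hypothetical.

```python
import math

def point_segment_dist(p, a, b):
    """Distance from p to the closest point on segment ab."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    denom = dx * dx + dy * dy
    if denom == 0.0:
        return math.dist(p, a)
    u = max(0.0, min(1.0, ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / denom))
    return math.dist(p, (a[0] + u * dx, a[1] + u * dy))

def imhd_as(tr_a, tr_b):
    """Eqs. (7)-(8), assuming tr_a and tr_b share timestamps index by index."""
    n = len(tr_a)
    if n != len(tr_b):
        raise ValueError("trajectories must have the same number of samples")
    total = 0.0
    for i, p in enumerate(tr_a):
        cands = []
        if i > 0:
            cands.append(point_segment_dist(p, tr_b[i - 1], tr_b[i]))
        if i < n - 1:
            cands.append(point_segment_dist(p, tr_b[i], tr_b[i + 1]))
        if not cands:  # single-sample trajectories: fall back to point distance
            cands.append(math.dist(p, tr_b[i]))
        total += min(cands)
    return total / n

def mean_euclidean(tr_a, tr_b):
    """Mean distance between samples with the same index, for comparison."""
    return sum(math.dist(p, q) for p, q in zip(tr_a, tr_b)) / len(tr_a)

tr_a = [(0.0, 1.0), (1.5, 1.0), (4.0, 1.0)]
tr_b = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0)]
d_as = imhd_as(tr_a, tr_b)
d_eu = mean_euclidean(tr_a, tr_b)
print(d_as, d_eu)
```

Because $p_{b}$ is an end point of both adjacent segments, the distance from $p_{a}$ to either segment can never exceed $\text{dist}(p_{a},p_{b})$, so in this sketch `imhd_as` is at most `mean_euclidean` for any input.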

Inference 1. IMHD_AS is always less than or equal to the Euclidean distance between two trajectories.

Proof. The Euclidean distance between trajectories $Tr_{1}$ and $Tr_{2}$ is shown in Fig. 1a. The IMHD_AS between $Tr_{1}$ and $Tr_{2}$ is shown in Fig. 1b.

Figure 1.

Comparison of Euclidean distance and IMHD_AS.

As shown in Fig. 1b, at time t1, no perpendicular can be dropped from the sampling point onto either of the timestamp-adjacent segments of the other trajectory, so no interpolation point is obtained. In this scenario, the IMHD_AS between the trajectory sampling points equals the Euclidean distance.

As shown in Fig. 1b, at time t3, a perpendicular can be dropped from the sampling point onto only one of the adjacent segments, so one interpolation point is obtained. In this scenario, the IMHD_AS is the Euclidean distance from the sampling point to the interpolation point. Because the three points constitute a right triangle, the hypotenuse is longer than either of the other two sides. Hence, the IMHD_AS between the sampling points is less than the Euclidean distance.

As shown in Fig. 1b, at time t2, perpendiculars can be dropped from the sampling point onto both timestamp-adjacent segments, so two interpolation points are obtained. In this scenario, the IMHD_AS takes the shorter of the two distances from the sampling point to the interpolation points. Again, the IMHD_AS between the sampling points is less than the Euclidean distance.

The IMHD_AS is the mean of these point-wise distances, and in each of the three cases the point-wise value is less than or equal to the Euclidean distance. Therefore, the IMHD_AS from an anonymous trajectory to the central trajectory is always less than or equal to the Euclidean distance. The two values are equal when no sampling point admits a perpendicular onto the timestamp-adjacent segments of the central trajectory.

The proof is complete.

Definition 6 (Uncertainty of trajectory sampling point). Let $\delta$ be the uncertainty threshold, $p_{\text{real}}$ be the real location of the trajectory, and $p$ be the sampling point. Then

$\displaystyle\text{dist}(p_{\text{real}},p)\leqslant\delta$ (9)

Because positioning technology is imprecise in practice, the circular region centered at the sampling point $p$ with radius $\delta$ is the uncertainty area of the trajectory sampling point. $p_{\text{real}}$ can lie at any location in the uncertainty region, as shown in Fig. 2.

Figure 2.

Uncertainty area of the trajectory sampling point $p$ .

Abul et al. [23] proposed the (k, $\delta$ )-anonymity model and NWA algorithm based on uncertainty characteristics. Uncertainty leads to indistinguishability. Therefore, if all the points in each of two trajectories are located in the uncertainty region of the other, the two trajectories can cooperate anonymously.

Definition 7 (Trajectory cooperative anonymity, co-localization). Consider trajectories $Tr$ and $Tr^{\prime}$:

$\displaystyle Tr=\{(t_{1},x_{1},y_{1}),(t_{2},x_{2},y_{2}),\dots,(t_{n},x_{n},y_{n})\}$

$\displaystyle Tr^{\prime}=\{(t_{1},x^{\prime}_{1},y^{\prime}_{1}),(t_{2},x^{\prime}_{2},y^{\prime}_{2}),\dots,(t_{n},x^{\prime}_{n},y^{\prime}_{n})\}$

Assume that

$\displaystyle\sqrt{{(x_{i}-x^{\prime}_{i})}^{2}+{(y_{i}-y^{\prime}_{i})}^{2}}\leqslant\delta\quad(1\leqslant i\leqslant n)$ (10)

Then, every sampling point on $Tr^{\prime}$ is in the uncertainty range of $Tr$, and the two trajectories satisfy cooperative anonymity (co-localization), which is denoted as $\text{Coloc}(Tr,Tr^{\prime})$.
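The co-localization condition of Eq. (10) translates directly into a predicate. The sketch below assumes trajectories are lists of `(t, x, y)` triples with identical timestamps; the function name `coloc` mirrors the notation $\text{Coloc}$ but is otherwise an illustrative choice.

```python
import math

def coloc(tr, tr_prime, delta):
    """True if the trajectories share timestamps and every pair of
    same-time samples is within delta of each other, as in Eq. (10)."""
    if len(tr) != len(tr_prime):
        return False
    return all(
        t1 == t2 and math.hypot(x1 - x2, y1 - y2) <= delta
        for (t1, x1, y1), (t2, x2, y2) in zip(tr, tr_prime)
    )

tr = [(1, 0.0, 0.0), (2, 1.0, 0.0), (3, 2.0, 0.0)]
near = [(1, 0.0, 0.5), (2, 1.0, 0.5), (3, 2.0, 0.5)]
far = [(1, 0.0, 0.5), (2, 1.0, 3.0), (3, 2.0, 0.5)]
print(coloc(tr, near, 1.0), coloc(tr, far, 1.0))
```

A single sample outside the radius $\delta$ is enough to break co-localization, as the second call shows.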

Definition 8 (Trajectory (k, $\delta$ )-anonymity). Assume that $\delta$ is the uncertainty threshold and the number of trajectories in the anonymous group is greater than or equal to k. If any two trajectories in the group satisfy cooperative anonymity, the group has trajectory (k, $\delta$ )-anonymity.

In Fig. 3, $Tr_{1}$, $Tr_{2}$, and $Tr_{3}$ satisfy (3, $\delta$ )-anonymity at time t1. However, at time t2, the sampling point on $Tr_{2}$ is not in the uncertainty region of radius $\delta$. Therefore, the group does not satisfy (3, $\delta$ )-anonymity at that time; it satisfies only (2, $\delta$ )-anonymity. The cases for times t3 and t4 are similar.
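Definition 8 can be checked mechanically by testing every pair in a group. The following sketch assumes the same `(t, x, y)` representation as above and uses hypothetical function names; the toy data loosely mimic the situation described for Fig. 3, in which one trajectory drifts out of the uncertainty region.

```python
import math
from itertools import combinations

def coloc(tr_a, tr_b, delta):
    """Pairwise co-localization test of Definition 7."""
    return len(tr_a) == len(tr_b) and all(
        t1 == t2 and math.hypot(x1 - x2, y1 - y2) <= delta
        for (t1, x1, y1), (t2, x2, y2) in zip(tr_a, tr_b)
    )

def is_k_delta_anonymous(group, k, delta):
    """True if the group has at least k trajectories and every pair is co-localized."""
    return len(group) >= k and all(
        coloc(a, b, delta) for a, b in combinations(group, 2)
    )

tr1 = [(1, 0.0, 0.0), (2, 1.0, 0.0)]
tr2 = [(1, 0.0, 0.5), (2, 1.0, 3.0)]  # leaves the uncertainty region at t = 2
tr3 = [(1, 0.5, 0.0), (2, 1.5, 0.0)]
print(is_k_delta_anonymous([tr1, tr2, tr3], 3, 1.0))
print(is_k_delta_anonymous([tr1, tr3], 2, 1.0))
```

The pairwise test costs a number of comparisons that grows quadratically with the group size, which is one reason to keep groups small.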

Figure 3.

Clustering not satisfying (3, $\delta$ )-anonymity.

3.2 Problem description

Traditionally, there are two main ways to publish a trajectory-anonymous set. One is to generalize the trajectories in the anonymous group and publish only the feature trajectories, and the other is to generalize the location points for each moment in the anonymous group and publish the anonymous areas after generalization, as shown in Fig. 4a and b, respectively. These two methods have achieved good results, but they need to change all the location points of the original trajectory to achieve privacy protection, which results in significant information loss and adversely affects the availability of data.

Figure 4.

Two methods for publishing trajectory-anonymous sets.

In contrast to the traditional methods of feature release and generalization, the (k, $\delta$ )-anonymity model first selects the central trajectory of the anonymous group. If the location point in the anonymous group is not in the uncertainty area of its central trajectory, it disturbs the location point. Finally, each location point in the anonymous group is moved to the uncertainty area to achieve cooperative anonymity, and privacy protection is achieved. This is illustrated in Fig. 5. The gray sampling points are the locations of the points after being moved from the locations of the corresponding sampling points. After being moved, all trajectories satisfy cooperative anonymity; hence, the trajectory group has (3, $\delta$ )-anonymity.
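One simple way to move a location point into the uncertainty area of the central trajectory is to slide it along the straight line toward the corresponding central point until it reaches the boundary of the disk of radius $\delta$. The text does not specify where inside the area the moved point should land, so the boundary placement below is an assumption, and `move_into_disk` is a hypothetical name.

```python
import math

def move_into_disk(p, c, delta):
    """Return p unchanged if it lies within delta of c; otherwise return the
    point on the segment from c to p at distance exactly delta from c."""
    d = math.dist(p, c)
    if d <= delta:
        return p
    s = delta / d
    return (c[0] + (p[0] - c[0]) * s, c[1] + (p[1] - c[1]) * s)

center_point = (0.0, 0.0)
outside = (5.0, 0.0)
inside = (0.5, 0.5)
moved = move_into_disk(outside, center_point, 2.0)
kept = move_into_disk(inside, center_point, 2.0)
print(moved, math.dist(outside, moved), kept)
```

With this placement, a point at distance $d>\delta$ from the central point is moved by $d-\delta$, and points already inside the area are left untouched.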

Figure 5.

Process of disturbance from corresponding sampling points.

Figure 6.

Hausdorff distance calculated using interpolated points in adjacent trajectories.

Figure 7.

Hausdorff distance calculated using interpolated points without time constraints.

Figure 8.

Process of perturbation by interpolation points.

The trajectory distance calculated by the interpolation of points with a time constraint is more consistent with the real trajectory state. Moreover, because time constraints are limited to adjacent trajectory segments, it is not necessary to find interpolation points on all segments of the trajectory, which reduces computation and improves efficiency significantly.

For example, two trajectories $Tr_{1}$ and $Tr_{2}$ with opposite directions of motion are shown in Figs 6 and 7. At times t1 and t4, the distance between them is large. However, as shown in Fig. 6, because there is no time constraint, the Hausdorff distance calculated from the interpolation point (the dotted line) is the smallest, which is not consistent with the real state. As shown in Fig. 7, owing to time constraints, the interpolation points are restricted to adjacent trajectory segments. If there are no interpolation points on the adjacent trajectory segments, the end points of the adjacent trajectory segments are used instead. Therefore, the calculated Hausdorff distance will be very large, which is consistent with the real state.

On the basis of the IMHD_AS and (k, $\delta$ )-anonymity model, we propose a method of perturbation based on interpolation points, as shown in Fig. 8. We use interpolation points instead of trajectory sampling points for perturbation, thereby achieving lower data distortion, lower anonymity cost, and higher data availability.

Inference 2. Using anonymous interpolation points instead of trajectory sampling points can reduce the anonymity cost.

Proof. The anonymity costs of using IMHD_AS and Euclidean distance are as follows:

$\displaystyle\text{Translation}(\text{IMHD\_AS})=\textit{Euclidean}(\textit{Trp}\_\odot_{i},{Tr}_{i})-\delta$ (11)

$\displaystyle\text{Translation}(\textit{Eurp})=\textit{Euclidean}(\textit{Trp}_{i},{Tr}_{i})-\delta$ (12)

where $\textit{Trp}\_\odot_{i}$ is the interpolation point on the central trajectory obtained by clustering, ${\textit{Trp}}_{i}$ is a sampling point on the central trajectory, and ${Tr}_{i}$ is a sampling point on the anonymous trajectory.

$\displaystyle\textit{Euclidean}(p_{1},p_{2})=\sqrt{{(x_{1}-x_{2})}^{2}+{(y_{1}-y_{2})}^{2}}$ (13)

where $p_{1}$ is ( $x_{1},y_{1}$ ) and $p_{2}$ is ( $x_{2},y_{2}$ ).

According to Inference 1, the IMHD_AS from the anonymous trajectory to the center trajectory is always less than or equal to the Euclidean distance between the same trajectories.

$\displaystyle\text{IMHD\_AS}(\textit{Trp}_{i},{Tr}_{i})=\textit{Euclidean}(\textit{Trp}\_\odot_{i},{Tr}_{i})\leqslant\textit{Eurp}(\textit{Trp}_{i},{Tr}_{i})$ (14)

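Because the trajectory uncertainty threshold $\delta$ is fixed, subtracting it from both sides of Eq. (14) gives $\text{Translation}(\text{IMHD\_AS})\leqslant\text{Translation}(\textit{Eurp})$ . That is, anonymizing with the interpolation point instead of the trajectory sampling point reduces the anonymity cost of the trajectory point.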

The proof is complete.
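
As a small illustrative check of Inference 2, consider the following hypothetical values, which are chosen for this example only and are not taken from the experiments. Suppose that a segment of the central trajectory runs from $(0,0)$ to $(10,0)$ , that the sample point to be anonymized is ${Tr}_{i}=(5,4)$ , that the corresponding central sample point is $\textit{Trp}_{i}=(0,0)$ , and that $\delta=1$ . The perpendicular foot on the segment gives the interpolation point $\textit{Trp}\_\odot_{i}=(5,0)$ , so that

$\displaystyle\text{Translation}(\text{IMHD\_AS})=\sqrt{(5-5)^{2}+(0-4)^{2}}-1=3$

$\displaystyle\text{Translation}(\textit{Eurp})=\sqrt{(0-5)^{2}+(0-4)^{2}}-1=\sqrt{41}-1\approx 5.40$

In this instance, the interpolation point requires a shorter move than the sampling point, in line with Eq. (14).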

4. Hierarchical interpolation point anonymity for trajectory privacy protection

The method proposed in this paper has three main steps. The first step is trajectory preprocessing, by which the trajectory dataset is divided into multiple equivalent classes according to the timestamps. In the second step, each equivalent class is clustered to form a number of anonymous groups. Finally, the trajectories of each anonymous group are perturbed to satisfy the interpolation (k, $\delta$ )-anonymity. This section introduces the proposed method with a summary of the algorithm and then introduces each step in turn. The proof of the algorithm-related inference is also presented.

Algorithm 1 Trajectory anonymity algorithm based on IMHD_AS
Input: Original trajectory set $Ts$ , $k,\delta$ ;
Output: Anonymity trajectory set $Ts^{\prime}$ ;
1. $\textit{Groups}{\leftarrow}\text{preprocess}(Ts)$ ;
2. $\textit{clusteredTs}{\leftarrow}\text{cluster}(\textit{Groups},k)$ ;
3. $Ts^{\prime}{\leftarrow}\text{anonymity}(\textit{clusteredTs},\delta)$ ;
4. Return $Ts^{\prime}$ ;

4.1 Framework of the proposed algorithm

Figure 9 and Algorithm 1 show the framework of the proposed algorithm. The input is the original trajectory dataset $T s$ , the privacy protection parameter $k$ , and the trajectory uncertainty threshold $\delta$ . The output is an anonymous trajectory dataset $Ts^{\prime}$ that satisfies the interpolation (k, $\delta$ )-anonymity. The first step is data preprocessing to divide $T s$ into several groups. The second step is clustering of each group to form many equivalent classes according to the IMHD_AS measure. The number of trajectories in each equivalent class is not less than $k$ . In the third step, each class is perturbed so that it satisfies the interpolation (k, $\delta$ )-anonymity.

Figure 9.

Framework of the proposed algorithm.

4.2 Trajectory preprocessing algorithm

In this paper, a preprocessing method for grouping is proposed. Like the NWA method [23], our method applies modular operations to the timestamps at a given granularity, which regularizes inconsistent trajectory sampling times and divides the original dataset into different groups. Compared with forming equivalence classes over whole trajectories, this approach retains a large number of trajectories, which improves the anonymity of the trajectories significantly.

Unlike NWA, however, our method uses multiple granularities to carry out modular operations. The appropriate granularity size is selected according to trajectory and anonymity data quality; we try to maintain the trajectory quality and reduce the quantity of data to be suppressed.

Algorithm 2 shows the preprocessing algorithm. For each trajectory, the beginning and ending timestamps are recorded as $[t_{b},t_{e}]$ . Scanning forward from the beginning of the trajectory, we determine the first timestamp $t$ for which $t$ mod $Pi$ is zero, where $Pi$ is the input granularity. Scanning backward from the end of the trajectory, we likewise determine the first timestamp $t$ for which $t$ mod $Pi$ is zero. These timestamps are denoted as $i$ and $j$ , respectively, and they delimit the equivalence class of the trajectory. Any point whose timestamp is not in the interval [ $i, j$ ] is removed, and trajectories having the same $i$ and $j$ values are placed into the same equivalence class. At the end, the trajectories in each equivalence class have the same starting time and ending time.

Algorithm 2 Trajectory preprocessing
Input: Original trajectory set $Ts$ , granularity $Pi$ ;
Output: Preprocessed trajectory set Groups;
1. For each trajectory $Tr\in Ts$ do
2. $[t_{b},t_{e}]$ is the time span of trajectory $Tr$
3. $i\leftarrow\min\{t\mid t\geqslant t_{b}\wedge t\bmod Pi=0\}$
4. $j\leftarrow\max\{t\mid t\leqslant t_{e}\wedge t\bmod Pi=0\}$
5. If $i\leqslant j$
6. $Tr^{\prime}\leftarrow Tr[i,j]$
7. Insert $Tr^{\prime}$ into $D_{[i,j]}$
8. $\textit{Groups}\leftarrow\cup D_{[i,j]}$
9. Return Groups
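
The steps of Algorithm 2 can be illustrated with a minimal Python sketch. It is not the authors' Java implementation. It assumes that each trajectory is a list of (t, x, y) tuples sorted by integer timestamp and that $i$ and $j$ range over integer multiples of the granularity; the function name `preprocess` and the dictionary layout are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def preprocess(trajectories, pi):
    """Group trajectories by their trimmed time span [i, j].

    Assumption (not from the paper): a trajectory is a list of (t, x, y)
    tuples sorted by integer timestamp t, and pi is a positive integer.
    """
    groups = defaultdict(list)
    for tr in trajectories:
        if not tr:
            continue
        t_b, t_e = tr[0][0], tr[-1][0]
        # Smallest multiple of pi that is >= t_b (ceiling division).
        i = -(-t_b // pi) * pi
        # Largest multiple of pi that is <= t_e (floor division).
        j = (t_e // pi) * pi
        if i <= j:
            # Keep only the points whose timestamps fall inside [i, j].
            trimmed = [p for p in tr if i <= p[0] <= j]
            if trimmed:
                groups[(i, j)].append(trimmed)
    return dict(groups)
```

Trajectories that contain no multiple of the granularity inside their time span are discarded in this sketch, which corresponds to the data loss reported for large values of $Pi$ .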

4.3 Clustering algorithm

Algorithm 3 shows the clustering process. Its inputs are the preprocessed trajectory set Groups and the privacy protection degree $k$ . Its output, clusteredTs, is the clustered trajectory set. For each group, a center trajectory is first selected; then, the IMHD_AS from each of the other trajectories to the center trajectory is calculated, and the nearest $k-1$ trajectories are selected to form a cluster containing $k$ trajectories. Subsequently, the remaining trajectory farthest from the current center is taken as the next center, and the previous step is repeated.

Algorithm 3 Trajectory Cluster
Input: Preprocessed trajectory set Groups, $k$ ;
Output: Clustered trajectory set clusteredTs;
1. $\textit{Set}(\textit{max}\_\textit{radius})$ ;
2. $\text{unclustered }{\leftarrow}\,{\emptyset}$ ;
3. While $\textit{Groups }{\neq}\,{\emptyset}$ do
4. $\textit{Initialize}(\textit{clustered})$ ;
5. If $\textit{num}(\textit{Group}\subset\textit{Groups})\geqslant k$
6. $\textit{Initialize}(\textit{active}),\textit{active}\leftarrow\textit{Group}$ ;
7. $\textit{Initialize}(\textit{pivot}),\textit{pivot}\leftarrow\textit{random}(Tr)$ ;
8. While $\textit{active}\neq\emptyset$ do
9. For each $Tr$ in active do
10. $\textit{dis}\leftarrow\textit{countIMHDT}(Tr,\textit{pivot})$ ;
11. $\textit{tr\_pivot}\leftarrow\textit{maxdis}(Tr)$ ;
12. For each $Tr$ in active do
13. $\textit{dis}\leftarrow\textit{countIMHDT}(Tr,\textit{tr\_pivot})$ ;
14. $\textit{Initialize}(\textit{anonymity})$ ;
15. $\textit{anonymity}\leftarrow\{\textit{nearest }k-1\textit{ point by dis}\}$ ;
16. If $\textit{dis}(\textit{furthest point in anonymity})\leqslant\max\_\textit{radius}$
17. Remove anonymity in active;
18. Add tr_pivot in pivot;
19. Add anonymity in clustered
20. Else
21. Remove anonymity in active;
22. Add anonymity in unclustered;
23. Return clusteredTs;
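
The greedy procedure in Algorithm 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it processes a single group, takes the trajectory distance as a caller-supplied function `dist` (IMHD_AS in the paper), and sends a rejected pivot to the unclustered set together with its neighbours. The names `greedy_cluster` and `seed` are hypothetical.

```python
import random


def greedy_cluster(group, k, max_radius, dist, seed=0):
    """Greedy clustering of one group into clusters of k trajectories.

    Returns (clustered, unclustered) as lists of indices into `group`.
    Assumption: dist(a, b) is any trajectory distance supplied by the caller.
    """
    rng = random.Random(seed)
    active = list(range(len(group)))
    clustered, unclustered = [], []
    if len(active) < k:
        return clustered, active
    pivot = rng.choice(active)  # initial random pivot
    while len(active) >= k:
        # Next pivot: the active trajectory farthest from the previous pivot.
        pivot = max(active, key=lambda a: dist(group[a], group[pivot]))
        others = sorted((a for a in active if a != pivot),
                        key=lambda a: dist(group[a], group[pivot]))
        members = others[:k - 1]  # the k-1 nearest trajectories
        radius = dist(group[members[-1]], group[pivot]) if members else 0.0
        chosen = [pivot] + members
        if radius <= max_radius:
            clustered.append(chosen)
        else:
            unclustered.extend(chosen)  # simplification: pivot is rejected too
        active = [a for a in active if a not in chosen]
    unclustered.extend(active)  # fewer than k trajectories are left over
    return clustered, unclustered
```

Passing the distance as a parameter keeps the sketch independent of the metric, so the same routine can be run with either the Euclidean distance or IMHD_AS when comparing generalization radii.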

The clustering process divides the trajectories according to their similarity into clustered sets, and the number of trajectories in each set must be no less than $k$ . How to measure the similarity of two trajectories is therefore a key issue. The classic metric function is the Euclidean distance, which is calculated as the arithmetic mean of the Euclidean distances between the sampling points at each common timestamp. We use the proposed IMHD_AS as the trajectory metric function. By Inference 1, IMHD_AS is less than or equal to the Euclidean distance under the same circumstances. Therefore, clustering with IMHD_AS yields a smaller generalization radius and a smaller generalization area than clustering with the Euclidean distance, which reduces the trajectory data distortion caused by generalization.

4.3.1 IMHD_AS calculation

Algorithm 4 shows the IMHD_AS calculation. Its inputs are trajectories $Tr1$ and $Tr2$ . First, we calculate the shortest distance between each trajectory sampling point ${Tr1\_\textit{node}}_{t=t_{i}}$ and the following adjacent trajectory segment $\overline{{Tr2\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{i+1}}}$ . Then, we calculate the shortest distance between ${Tr1\_\textit{node}}_{t=t_{i}}$ and the preceding adjacent trajectory segment $\overline{{Tr2\_\textit{node}}_{t=t_{i-1}},{Tr2\_\textit{node}}_{t=t_{i}}}$ . Finally, the minimum of the two values is taken as the IMHD_AS for the point, and the average value over all of the points is taken as the IMHD_AS between the trajectories.

Algorithm 4 countIMHDT
Input: Trajectories $Tr1$ , $Tr2$ ;
Output: dis between trajectories;
1. $\textit{dis}\leftarrow 0$ ;
2. For each ${Tr1\_\textit{node}}_{t=t_{i}}$ do
3. $d1=\text{countNodeDis}({Tr1\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{i+1}})$ ;
4. $d2=\text{countNodeDis}({Tr1\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{i-1}},{Tr2\_\textit{node}}_{t=t_{i}})$ ;
5. $\textit{dis}+=\min(d1,d2)$ ;
6. Return dis/len ( $Tr1$ );

4.3.2 Shortest-distance calculation

Algorithm 5 shows the calculation of the shortest distance from the sampling point to its adjacent trajectory segment. Its input is the sampling point ${Tr1\_\textit{node}}_{t=t_{i}}$ and the trajectory segment made up of points ${Tr2\_\textit{node}}_{t=t_{i}}$ and ${Tr2\_\textit{node}}_{t=t_{j}}$ . First, it is determined whether there is an interpolation point on the trajectory segment such that the connection between the sampling point and the interpolation point is perpendicular to the trajectory segment. If there is such an interpolation point, then the Euclidean distance from the sampling point to the interpolation point is returned. Otherwise, the minimum distance from the trajectory sampling point ${Tr1\_\textit{node}}_{t=t_{i}}$ to the two ends will be returned.

Algorithm 5 countNodeDis
Input: Trajectory nodes ${Tr1\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{j}}$ ;
Output: dis between trajectory nodes;
1. If $\exists\odot\in\overline{{Tr2\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{j}}}$
2. such that $\overline{{Tr1\_\textit{node}}_{t=t_{i}},\odot}\bot\overline{{Tr2\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{j}}}$
3. Return $\textit{Euclidean}({Tr1\_\textit{node}}_{t=t_{i}},\odot)$ ;
4. Else
5. Return $\min(\textit{Euclidean}({Tr1\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{i}}),\textit{Euclidean}({Tr1\_\textit{node}}_{t=t_{i}},{Tr2\_\textit{node}}_{t=t_{j}}))$ ;
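
Algorithms 4 and 5 can be sketched together in Python as shown below. This is a minimal illustration rather than the authors' implementation. It assumes that the two trajectories are lists of (x, y) points aligned on the same timestamps, as is the case after preprocessing, and it handles the first and last points, where only one adjacent segment exists, by using that segment alone. The boundary handling and the function names are assumptions made for this example.

```python
import math


def count_node_dis(p, a, b):
    """Shortest distance from point p to segment ab (Algorithm 5 sketch)."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq > 0:
        # Position of the perpendicular foot along ab (0 at a, 1 at b).
        s = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
        if 0.0 <= s <= 1.0:
            foot = (a[0] + s * dx, a[1] + s * dy)  # the interpolation point
            return math.dist(p, foot)
    # No interpolation point on the segment: use the nearer end point.
    return min(math.dist(p, a), math.dist(p, b))


def count_imhd(tr1, tr2):
    """Average adjacent-segment distance from tr1 to tr2 (Algorithm 4 sketch).

    Assumption: tr1 and tr2 have equal length and share timestamps.
    """
    n = len(tr1)
    total = 0.0
    for i in range(n):
        candidates = []
        if i + 1 < n:  # following adjacent segment
            candidates.append(count_node_dis(tr1[i], tr2[i], tr2[i + 1]))
        if i >= 1:  # preceding adjacent segment
            candidates.append(count_node_dis(tr1[i], tr2[i - 1], tr2[i]))
        if not candidates:  # single-point trajectories
            candidates.append(math.dist(tr1[i], tr2[i]))
        total += min(candidates)
    return total / n
```

Because the end point at time $t_{i}$ belongs to both adjacent segments, each per-point value in this sketch is never larger than the point-to-point Euclidean distance at the same timestamp, which agrees with Inference 1.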

4.4 Data perturbation algorithm

Algorithm 6 shows the cooperative anonymity algorithm. The inputs of the algorithm are the clustered trajectory set obtained by the above clustering algorithm and the uncertainty threshold $\delta$ . The output is the set of trajectories that satisfy the interpolation (k, $\delta$ )-anonymity. The algorithm traverses each data sampling point in each trajectory. If the sampling point is not in the uncertainty area of the corresponding center trajectory, then the point is moved toward the center trajectory until it satisfies the condition. At the end of the traversal, the anonymity process is complete.

Algorithm 6 Trajectory Cooperative Anonymity
Input: Clustered trajectory set clusteredTs, Uncertain threshold $\delta$ ;
Output: Anonymity trajectory set $Ts^{\prime}$ ;
1. For each trajectory $Tr\in\textit{clusteredTs}$ do
2. For each trajectory point $Tr\_\textit{node}$ in $T r$ do
3. If $\textit{countNodeDis}(Tr\_\textit{node},\textit{pivot\_node})>\delta$
4. Move $Tr\_\textit{node}$ toward pivot_node until $\textit{countNodeDis}\leqslant\delta$
5. Return $Ts^{\prime}$ ;
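
The perturbation step of Algorithm 6 can be sketched in Python as follows. This is an illustrative sketch under assumptions, not the authors' code: each point is moved in a straight line toward the nearest point on the adjacent segments of the pivot (center) trajectory and stops as soon as its distance equals $\delta$ . The straight-line movement rule and the names `nearest_on_segment` and `perturb_trajectory` are assumptions introduced for this example.

```python
import math


def nearest_on_segment(p, a, b):
    """Closest point to p on segment ab: the perpendicular foot if it lies
    on the segment, otherwise the nearer end point."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        return a
    s = ((p[0] - a[0]) * dx + (p[1] - a[1]) * dy) / seg_len_sq
    s = max(0.0, min(1.0, s))  # clamp the foot onto the segment
    return (a[0] + s * dx, a[1] + s * dy)


def perturb_trajectory(tr, pivot, delta):
    """Move every point of tr to within delta of the pivot trajectory.

    Returns (moved_points, total_distance_moved).
    Assumption: tr and pivot have equal length and share timestamps.
    """
    n = len(tr)
    moved, cost = [], 0.0
    for i, p in enumerate(tr):
        candidates = []
        if i + 1 < n:
            candidates.append(nearest_on_segment(p, pivot[i], pivot[i + 1]))
        if i >= 1:
            candidates.append(nearest_on_segment(p, pivot[i - 1], pivot[i]))
        if not candidates:
            candidates.append(pivot[i])
        target = min(candidates, key=lambda q: math.dist(p, q))
        d = math.dist(p, target)
        if d > delta:
            # Move along the line to the target until the distance is delta.
            scale = (d - delta) / d
            p = (p[0] + (target[0] - p[0]) * scale,
                 p[1] + (target[1] - p[1]) * scale)
            cost += d - delta
        moved.append(p)
    return moved, cost
```

Points that already lie inside the uncertainty area are returned unchanged, so only the points outside it contribute to the total distance moved.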

5. Evaluation metrics

5.1 Information loss

Information loss is defined as follows.

$\displaystyle\textit{InfoLoss}=\frac{1}{\text{len}(\textit{clusteredTs})}\sum^{\text{len}(\textit{clusteredTs})}_{i=1}{\frac{\text{ClusterArea}(\textit{Group}_{i})}{\textit{MaxArea}}}$ (15)

where

$\displaystyle\text{ClusterArea}(\textit{Group}_{i})=\pi\times\max_{Tr\in\textit{Group}_{i}}{(\text{IMHD\_AS}(Tr,\textit{Pivot}))}^{2}$ (16)

where $\text{len}(\textit{clusteredTs})$ is the number of clusters produced by the clustering step, $\text{ClusterArea}(\textit{Group}_{i})$ is the generalization area of the cluster $\textit{Group}_{i}$ , and MaxArea is the total area of the region covered by the trajectories. Pivot represents the central trajectory of $\textit{Group}_{i}$ , which is generated by the clustering process, and $Tr$ is one of the other $k-1$ trajectories in $\textit{Group}_{i}$ . The smaller the generalization area, the lower is the information loss. Therefore, reducing the generalization area reduces the distortion of data in the clustering process.
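
Equations (15) and (16) can be evaluated with a short Python sketch such as the one below. It assumes that the largest distance from a member trajectory to the pivot has already been computed for every cluster; the function name `info_loss` and the argument layout are hypothetical.

```python
import math


def info_loss(cluster_radii, max_area):
    """Mean ratio of cluster generalization area to the total area.

    cluster_radii: for each cluster, the largest distance from a member
    trajectory to the pivot (assumed to be precomputed by the caller).
    """
    areas = [math.pi * r * r for r in cluster_radii]  # Eq. (16)
    return sum(a / max_area for a in areas) / len(areas)  # Eq. (15)
```

For example, two clusters with radii 1 and 2 in a region of area 100 give $(\pi+4\pi)/(2\times 100)\approx 0.0785$ .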

5.2 Anonymity cost

$\displaystyle\text{TranslationNode}=\left\{\begin{array}[]{ll}\text{dis}({Tr}_{\textit{node}},{Tr}_{\textit{pivot}})-\delta&\text{if }\text{dis}({Tr}_{\textit{node}},{Tr}_{\textit{pivot}})>\delta\\ 0&\textit{else}\\ \end{array}\right.$ (17)

$\displaystyle\textit{Translation}=\sum^{\text{len}(Tr)}_{j=1}\textit{TranslationNode}$ (18)

TranslationNode is the distance that a trajectory sampling point moves, and Translation is the total distance moved by all of the points. The higher the value of Translation, the farther the points must be moved and the higher is the anonymity cost.
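
Equations (17) and (18) translate directly into the following Python sketch, which assumes that the distance from each sampling point to the pivot trajectory has already been computed. The function names are hypothetical.

```python
def translation_node(dis, delta):
    """Distance a single point must move, Eq. (17)."""
    return dis - delta if dis > delta else 0.0


def translation(distances, delta):
    """Total distance moved by all points of a trajectory, Eq. (18)."""
    return sum(translation_node(d, delta) for d in distances)
```

With point distances 5, 1, and 3, the total is 6 for $\delta=1$ , 4 for $\delta=2$ , and 2 for $\delta=3$ , which illustrates why the anonymity cost cannot increase when the uncertainty threshold grows.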

6. Experimental results and analysis

6.1 Experimental environment

The experimental data were generated with Thomas Brinkhoff's generator [40], which simulates trajectory information for mobile objects on the basis of road network data. The OLDEN dataset is based on the road network in Oldenburg, Germany, and SANFR is based on the road network in San Francisco, USA. Table 2 shows the statistical information for the datasets.

Table 2
Statistical information of the datasets

Dataset	$|D|$	$|\textit{Point}|$	MaxTimeStamp	MinTimeStamp	Width	Height
OLDEN	11000	203397	20	0	23572	26915
SANFR	12005	145237	30	0	1813251	1419550

$|D|$ is the number of trajectories in the dataset. $|\textit{Point}|$ is the total number of sampling points. MaxTimeStamp is the maximum of the trajectory sampling interval, and MinTimeStamp is the minimum of the interval. Width and Height are the width and length of the data area, respectively.

The algorithm was implemented in Java and executed in the Eclipse integrated development environment on a system with an Intel® Core™ i5-4200U CPU (1.6 GHz, 2.3 GHz), 4 GB RAM, and 64-bit Windows 7 OS.

6.2 Granularity selection analysis

The first process of the experiment was the data preprocessing, which was designed to divide the original trajectory dataset into a set of trajectory equivalence classes with consistent timestamps. Through preprocessing, we can select the proper granularity value of $P i$ for the group dividing. The results are presented below.

In Tables 3 and 4, $P i$ is the parameter that controls the number of groups, $|D|$ is the number of trajectories retained after preprocessing, and $|\textit{Point}|$ is the total number of sampling points remaining. The higher the values of $|D|$ and $|\textit{Point}|$ , the smaller is the quantity of data discarded. Time is the time cost of preprocessing; lower times indicate higher efficiency. EcsNum is the number of trajectory equivalence classes generated after preprocessing. $|D|_{\text{per}}$ is the average number of trajectories per equivalence class. The smaller the number of equivalence classes and the greater the average number of trajectories, the higher is the quality of k-anonymity.

Table 3
Preprocessing results as influenced by $P i$ (OLDEN dataset)

$Pi$	$|D|$	$|\textit{Point}|$	Time (s)	EcsNum	$|D|_{\text{per}}$
9	9538	161433	0.797	3	3179
8	9640	146440	0.802	3	3213
7	9723	130739	0.753	3	3241
6	9800	164538	0.825	5	1960
5	9884	181480	0.955	6	1647
4	9941	182508	1.116	7	1420
3	9984	167286	0.755	8	1248
2	10006	184492	1.534	12	833

Table 4

Preprocessing results as influenced by $P i$ (SANFR dataset)

$Pi$	$|D|$	$|\textit{Point}|$	Time (s)	EcsNum	$|D|_{\text{per}}$
9	5131	63513	0.437	5	1026
8	4725	52472	0.440	5	945
7	6582	80388	0.455	8	822
6	7727	98478	0.614	13	594
5	8112	103870	0.500	18	450
4	8563	98996	0.580	24	356
3	8942	116985	0.609	47	190
2	9330	123530	0.677	105	88

If $Pi$ is too large, too many trajectories will be discarded, and the information loss will be large. As shown in Table 3, on the OLDEN dataset, when $Pi=9$ , only 9538 of the 11000 original trajectories remain. However, if $Pi$ is too small, the number of equivalence classes will be excessive, which will affect the data quality. As shown in Table 4, on the SANFR dataset, when $Pi=2$ , the number of equivalence classes increases to 105.

Therefore, according to the experimental results of the preprocessing, we jointly consider the number of discarded trajectories and the number of equivalence classes and select the value of $Pi$ that gives the best data quality. On the basis of Tables 3 and 4, $Pi=5$ offers the most suitable trade-off, so in the experiments, parameter $Pi$ was set to 5.

We analyze and verify the performance of the algorithm based on data distortion, anonymity cost, and execution time below.

Table 5

Comparison of information loss (InfoLoss) for the two algorithms as influenced by privacy protection parameter $k$ (OLDEN dataset)

$k$	InfoLoss (NWA)	InfoLoss (IMHD_AS)	Ratio	Reduction (%)
10	0.007302	0.005766	0.789646672	21.04
20	0.010025	0.008278	0.825735661	17.43
30	0.013526	0.011201	0.828108827	17.20
40	0.017177	0.013992	0.814577633	18.54
50	0.020334	0.016970	0.834562801	16.54

Table 6

Comparison of information loss (InfoLoss) for the two algorithms as influenced by privacy protection parameter $k$ (SANFR dataset)

$k$	InfoLoss (NWA)	InfoLoss (IMHD_AS)	Ratio	Reduction (%)
5	0.009993	0.00774	0.774542180	22.55
10	0.022564	0.018971	0.840764049	15.93
15	0.032288	0.02691	0.833436571	16.65
20	0.039866	0.036073	0.904856268	9.51
25	0.047135	0.041926	0.889487642	11.05

Figure 10.

Influence of $k$ on information loss (InfoLoss).

6.3 Data distortion analysis

Figure 10 shows that as $k$ increases, the loss of generalization information becomes more serious, and data distortion increases. This is because $k$ represents the number of anonymous trajectories in each cluster; the larger the value of $k$ , the lower is the probability of each trajectory's being identified and the higher is the degree of privacy protection. However, an increase in $k$ also causes the generalization area of each cluster to increase, which makes the data distortion more serious and lowers the availability of the data. Our method achieves lower data distortion than NWA because the anonymous group area $\text{ClusterArea}(\textit{Group}_{i})$ based on the IMHD_AS is smaller than that based on the Euclidean distance. As shown in Tables 5 and 6, the proposed algorithm reduced the loss of information by up to 21.04% and 22.55% on the OLDEN and SANFR datasets, respectively.

As shown in Figs 11 and 12, a change in $P i$ has a slight effect on information loss because $P i$ has a slight effect on the area of generalization. However, $P i$ affects the number of trajectories discarded in preprocessing.

Figure 11.

Influence of $P i$ on information loss (OLDEN).

Figure 12.

Influence of $P i$ on information loss (SANFR).

6.4 Anonymity cost analysis

From Figs 13 and 14, we can see that on both the SANFR dataset and the OLDEN dataset, translation decreases as $\delta$ increases. This is because $\delta$ is the trajectory uncertainty threshold: the greater its value, the larger is the uncertainty area of the trajectory, the fewer are the points in the cluster that lie outside the anonymity range, and the shorter is the distance that each sampling point must move. The anonymity cost of the trajectory sampling points is therefore lower, and the translation value decreases. It can also be seen that, under the same conditions, the translation value with our method is less than that with the NWA algorithm. This is because the anonymity cost based on interpolation points is less than that based on the original sampling points.

Figure 13.

Influence of trajectory uncertainty threshold $\delta$ on translation (OLDEN).

Figure 14.

Influence of trajectory uncertainty threshold $\delta$ on translation (SANFR).

As shown in Fig. 15, translation increases with $k$ . This is because the higher the anonymity level $k$ , the greater is the number of trajectories in an anonymous cluster and the greater is the total number of perturbations to be executed, which increases the anonymity cost. Our algorithm outperforms the NWA algorithm because the distances that the location points need to move are shorter with our algorithm than with NWA. From Tables 7 and 8, we can see that the proposed algorithm reduced the anonymity cost by up to 28.32% and 19.76% on the OLDEN and SANFR datasets, respectively.

Table 7

Comparison of translation for the two algorithms as influenced by trajectory uncertainty threshold $\delta$ (OLDEN)

$\delta$	Translation (NWA)	Translation (IMHD_AS)	Ratio (%)	Reduction (%)
500	5705072	4573439	80.16	19.84
1000	2679774	1996459	74.50	25.50
1500	1092297	803461	73.56	26.44
2000	477972	347568	72.72	27.28
2500	194065	139108	71.68	28.32

Table 8

Comparison of translation for the two algorithms as influenced by trajectory uncertainty threshold $\delta$ (SANFR)

$\delta$	Translation (NWA)	Translation (IMHD_AS)	Ratio (%)	Reduction (%)
30000	$5.20\times 10^{8}$	$4.25\times 10^{8}$	81.73	18.27
35000	$4.88\times 10^{8}$	$3.96\times 10^{8}$	81.15	18.80
40000	$4.57\times 10^{8}$	$3.69\times 10^{8}$	80.74	19.21
45000	$4.27\times 10^{8}$	$3.44\times 10^{8}$	80.56	19.51
50000	$3.99\times 10^{8}$	$3.20\times 10^{8}$	80.20	19.76

Figure 15.

Influence of $k$ on translation.

Figure 16.

Influence of $k$ and $\delta$ on translation (OLDEN).

Figure 17.

Influence of $k$ and $\delta$ on translation (SANFR).

Figure 18.

Influence of $k$ on execution time.

As shown in Figs 16 and 17, translation decreases as $k$ decreases. The smaller the $k$ value, the fewer trajectories there are in each cluster, the fewer trajectories need to be moved, and the lower is the anonymity cost. The translation values with our algorithm are smaller than those with NWA. Our algorithm requires shorter movement distances, which means that the availability of the anonymous data is higher.

6.5 Execution time analysis

As shown in Fig. 18, the execution time for IMHD_AS is longer than that for NWA on both the OLDEN and SANFR datasets. This is because the calculation of IMHD_AS needs to determine the interpolation points, whereas the Euclidean distance is calculated directly, which results in more computation and a longer running time. However, in the preprocessing, we generate equivalence classes and multiple groups by partitioning, and solving the problem within each group simplifies it and reduces the computation required. As a result, time efficiency is not significantly reduced when data availability is optimized, and it is feasible to improve the data utility of the proposed algorithm at the expense of some execution time. In general, both algorithms have good time efficiency.

7. Conclusions and future work

In this paper, we proposed a novel algorithm for trajectory privacy protection that considers the problem from the perspective of hierarchical granularity and interpolation points. By using interpolation points, we improve the data utility while maintaining privacy protection. Use of the hierarchical granularity model can reduce the high computation cost of data utility optimization and thus improve efficiency. We also proposed the temporal granularity constraint of IMHD_AS to guarantee high data quality and the spatial granularity constraint of the uncertainty threshold $\delta$ to minimize the change in original locations and reduce information loss. We performed experimental analysis to determine the most appropriate value for parameter $P i$ by comprehensively considering the number of discarded trajectories and the number of equivalent classes generated. In the future, we plan to develop a dynamic and adaptive model for trajectories in realistic scenarios and study the characteristics of their motion irregularity according to granularity.

Acknowledgments

The authors would like to thank the anonymous reviewers for their invaluable comments and suggestions. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61672039, 61972439, 61602009 and 61702010), the Natural Science Foundation of Colleges and Universities in Anhui Province (Grant No. KJ2019A0481), and the Natural Science Foundation of Anhui Province (Grant Nos. 1808085MF172 and 1708085MF156).

References

1. Sui and Yang, A privacy-preserving compression storage method for large trajectory data in road network, Journal of Grid Computing 16(2) (2018), 229–245.
2. Dai, Shao, Wei, Zhang and Shen H.T., Personalized semantic trajectory privacy preservation through trajectory reconstruction, World Wide Web-Internet and Web Information Systems 21(4) (2018), 875–914.
3. Xiao, Yang J.J., Huang, Ponnambalam and Goh R.S.M., QLDS: a novel design scheme for trajectory privacy protection with utility guarantee in participatory sensing, IEEE Transactions on Mobile Computing 17(6) (2018), 1397–1410.
4. Hasan A.S.M.T., Chen and Jiang, An effective privacy architecture to preserve user trajectories in reward-based LBS applications, ISPRS International Journal of Geo-Information 7(2) (2018), 53.
5. Dong and Pi, Novel privacy-preserving algorithm based on frequent path for trajectory data publishing, Knowledge-Based Systems 148 (2018), 55–65.
6. Sweeney, k-Anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5) (2002), 557–570.
7. Gruteser and Grunwald, Anonymous usage of location-based services through spatial and temporal cloaking, in: International Conference on Mobile Systems, Applications, and Services, 2003, pp. 31–42.
8. Jensen C.S. and Man L.Y., PAD: privacy-area aware, dummy-based location privacy in mobile services, in: ACM International Workshop on Data Engineering for Wireless and Mobile Access, 2008, pp. 16–23.
9. You T.H., Peng W.C. and Lee W.C., Protecting moving trajectories with dummies, in: International Conference on Mobile Data Management, 2007, pp. 278–282.
10. Lei P.R., Peng W.C., I.J. and Chang C.P., Dummy-based schemes for protecting movement trajectories, Journal of Information Science and Engineering 28(2) (2012), 335–350.
11. Gao, Shi and Zhan, Towards location and trajectory privacy protection in participatory sensing, in: International Conference on Mobile Computing, Applications, and Services, 95, 2011, pp. 381–386.
12. Hara, Suzuki, Iwata, Arase and Xie, Dummy-based user location anonymization under real-world constraints, IEEE Access 4 (2016), 673–687.
13. Xiao, Chen, Sangaiah A.K. and Jiang, CenLocShare: a centralized privacy-preserving location-sharing system for mobile online social networks, Future Generation Computer Systems 86 (2018), 863–872.
14. Sun, Chang, Ramachandran, Sun and Li, Efficient location privacy algorithm for Internet of Things (IoT) services and applications, Journal of Network and Computer Applications 89(C) (2016), 3–13.
15. Luper, Cameron, Miller and Arabnia H.R., Spatial and temporal target association through semantic analysis and GPS data mining, in: International Conference on Information and Knowledge Engineering, 2007, pp. 251–257.
16. Liao, Sun and Chang, The framework and algorithm for preserving user trajectory while using location-based services in IoT-cloud systems, Cluster Computing 20(2) (2017), 1–15.
17. Gidófalvi, Huang and Pedersen T.B., Privacy-preserving trajectory collection, in: ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, ACM-GIS, 2008, p. 46.
18. Arain Q.A., Deng, Memon, Arain and Shaikh F.K., Privacy preserving dynamic pseudonym-based multiple mix-zones authentication protocol over road networks, Wireless Personal Communications An International Journal 95(2) (2017), 505–521.
19. Gao, Shi, Zhan and Sun, TrPF: a trajectory privacy-preserving framework for participatory sensing, IEEE Transactions on Information Forensics and Security 8(6) (2013), 874–887.
20. Palanisamy and Liu, MobiMix: protecting location privacy with mix-zones over road networks, in: IEEE International Conference on Data Engineering, 2011, pp. 494–505.
21. Liu, Zhao, Pan and Yue, Traffic-aware multiple mix zone placement for protecting location privacy, IEEE INFOCOM 131(5) (2012), 972–980.
22. Terrovitis and Mamoulis, Privacy preservation in the publication of trajectories, in: International Conference on Mobile Data Management, 2008, pp. 65–72.
23. Abul, Bonchi and Nanni, Never walk alone: uncertainty for anonymity in moving objects databases, in: IEEE International Conference on Data Engineering, 2008, pp. 376–385.
24. Abul, Bonchi and Nanni, Anonymization of moving objects databases by clustering and perturbation, Information Systems 35(8) (2010), 884–910.
25. Nergiz M.E., Atzori and Saygin, Towards trajectory anonymization: a generalization-based approach, Transactions on Data Privacy 2(1) (2009), 52–61.
26. Trujillo-Rasua and Domingo-Ferrer, On the privacy offered by (k, delta)-anonymity, Information Systems 38(4) (2013), 491–494.
27. Gao, Sun and Li, Balancing trajectory privacy and data utility using a personalized anonymization model, Journal of Network and Computer Applications 38(1) (2014), 125–134.
28. Huo, Huang and Meng, History trajectory privacy-preserving through graph partition, in: International Workshop on Mobile Location-Based Service, 2011, pp. 71–78.
29. Xin, Xie Z.Q. and Yang, The privacy preserving method for dynamic trajectory releasing based on adaptive clustering, Information Sciences 378 (2017), 131–143.
30. Chen and Oria, Robust and fast similarity search for moving object trajectories, in: ACM SIGMOD International Conference on Management of Data, 2005, pp. 491–502.
31. Ahn Y.J. and Hoffmann, Sequence of GnGn LN polynomial curves approximating circular arcs, Journal of Computational and Applied Mathematics 341 (2018), 117–126.
32. Cui, Wang, Zhou, Gong and Eberl, A topo-graph model for indistinct target boundary definition from anatomical images, Computer Methods & Programs in Biomedicine 159 (2018), 211–222.
33. Qian and Yang, An integrated method for atherosclerotic carotid plaque segmentation in ultrasound image, Computer Methods and Programs in Biomedicine 153 (2018), 19–32.
34. Cheimariotis G.A., Chatzizisis Y.S., Koutkias V.G., Toutouzas and Giannopoulos, ARCOCT: automatic detection of lumen border in intravascular OCT images, Computer Methods and Programs in Biomedicine 151 (2017), 21–32.
35. Pan, Gao and Li, Differential evolution algorithm-based range image registration for free-form surface parts quality inspection, Swarm and Evolutionary Computation 36 (2017), 106–123.
36. Huttenlocher D.P., Klanderman G.A. and Rucklidge W.J., Comparing images using the Hausdorff distance, IEEE Transactions on Pattern Analysis and Machine Intelligence 15(9) (1993), 850–863.
37. Dubuisson M.P. and Jain A.K., A modified Hausdorff distance for object matching, in: Proceedings of 12th International Conference on Pattern Recognition, 1, 1994, pp. 566–568.
38. Shao, Cai and Gu, A modified Hausdorff distance based algorithm for 2-dimensional spatial trajectory matching, in: International Conference on Computer Science and Education, 2010, pp. 166–172.
39. Huttenlocher D.P. and Rucklidge W.J., A multi-resolution technique for comparing images using the Hausdorff distance, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1992, pp. 654–656.
40. Brinkhoff, Generating traffic data, Bulletin of the Technical Committee on Data Engineering IEEE Computer Society 26(2) (2003), 19–25.
