Trajectory personalization privacy preservation method based on multi-sensitivity attribute generalization and local suppression

Abstract

Fast-developing mobile location-aware services generate an enormous volume of trajectory data while adding value to people’s lives. However, trajectory data contains not only location information, but also sensitive personal information. If the original trajectory data is published directly, it could result in serious privacy leaks. Most of the existing privacy-preserving trajectory publishing methods only protect the location information or set the same privacy preservation level for all moving objects. To meet users’ personalized privacy requirements and ensure the utility of trajectory location and sensitive information, we propose a trajectory personalized privacy preservation method based on multi-sensitivity attribute generalization and local suppression. First, we set different security levels for each trajectory by calculating the correlation between sensitive attributes to establish a sensitive attribute classification tree. Second, we generalize the sensitive attributes based on the privacy preservation level of each trajectory; trajectory data still at risk of privacy leakage after generalization is locally suppressed. Finally, an anonymized trajectory dataset is generated. Experimental results on real datasets demonstrate that our method can improve data availability while preserving privacy.

Keywords

Trajectory data publishing, privacy preservation, sensitive attribute generalization, trajectory local suppression, correlation

1. Introduction

With the popularization of mobile devices and the rapid development of positioning technology, researchers can easily obtain a large volume of trajectory data such as social interaction, check-ins, and travel. The collected trajectory data itself can serve myriad practical applications, for example, to provide more reasonable path planning references for urban traffic planning departments; to avoid traffic congestion and reduce traffic accidents; to provide decision support for the construction planning of governments and commercial organizations; or to provide the data for location-based advertising [1, 2]. Various applications use trajectory data to facilitate travel, bringing great convenience to the lives of users [3]. However, while trajectory data brings great benefits, it also poses potential threats. Trajectory data contains not only spatiotemporal information, but also sensitive personal information, including a user’s physical condition, occupation, and home address. If the original trajectory data is published directly, it could cause serious privacy problems. In particular, if an attacker has some trajectory data as background knowledge, all private information could be obtained by matching it with the published trajectory data. Consequently, trajectory privacy preservation has become a research hotspot in the field of trajectory data mining.

Presently, trajectory privacy preservation can be divided into two categories. In offline trajectory privacy preservation [4, 5], organizations collect user trajectory data, store it in a database, and analyze and mine it for useful information to feed back to users, which requires privacy preservation of the trajectory data prior to its publication. In online trajectory privacy preservation [6, 7, 8], location-based services provide related services for moving objects by locating them in real time; however, the real-time trajectory data must be provided to a service provider, which also carries the risk of privacy leakage. In this study, we examine the former, offline trajectory privacy preservation, as shown in Fig. 1.

Figure 1.

Privacy-preserving trajectory data publishing structure in offline mode.

A trajectory consists of a series of ordered location points, each recording a position and the time at which it was passed. In addition, the user’s sensitive attributes are published along with the trajectory data, so the privacy of both the trajectory locations and the sensitive attributes must be protected. The purpose of this study is to protect the user’s sensitive attributes while protecting their trajectory information.

If the trajectory data were published together with sensitive attributes, attackers could infer the sensitive attributes through the trajectory data, with more serious implications. Table 1 shows the original trajectory data without any preservation measures. It contains three types of attributes, namely, explicit identifiers, quasi-identifiers, and sensitive attributes. Explicit identifiers, such as names, are typically used to uniquely identify a user and should be removed from the table before publishing. Generally, a user cannot be identified by a single quasi-identifier, but several quasi-identifiers can be combined to locate a specific user. The quasi-identifier we focus on here is trajectory information, which comprises a set of spatiotemporal trajectory points, with each point representing a location and a corresponding timestamp. Sensitive attributes are private information that users do not want to disclose, such as diseases and occupations, as presented in Table 1.

Table 1

Original trajectory data

Id	Name	Trajectory	Disease	Job
1	Alice	$d_{3}\to c_{4}\to e_{5}\to f_{8}$	Diabetes	Lawyer
2	Bob	$d_{1}\to e_{2}\to d_{5}\to c_{6}\to f_{8}$	SARS	Doctor
3	Caesar	$c_{2}\to d_{6}\to f_{8}$	Flu	Student
4	Daniel	$c_{3}\to d_{7}\to d_{8}$	HIV	Doctor
5	Eden	$d_{1}\to d_{5}\to e_{6}\to f_{8}$	SARS	Student
6	Freeman	$d_{3}\to c_{4}\to a_{7}$	Pancreatitis	Lawyer

Directly publishing original trajectory data could cause serious privacy leakage problems. There are three types of privacy attacks that could typically be used against published trajectory data should an attacker have sufficient background knowledge as follows:

Identity link attack. In the trajectory database, some trajectories may appear relatively few times, with essentially no other moving objects matching them. In this case, an attacker with sufficient background knowledge could uniquely identify a user’s data record and obtain their sensitive attributes. For example, if an attacker knows that Freeman’s data are in the trajectory database and that he has visited $a_{7}$, the attacker could determine that the sixth record in Table 1 belongs to Freeman and then establish that he suffers from pancreatitis and is a lawyer, because only the sixth trajectory contains $a_{7}$.

Attribute link attack. Attackers could infer sensitive attributes of users based on their background knowledge. For example, if an attacker knows that Bob visited location $d$ at timestamps 1 and 5, then based on the published trajectory dataset, the attacker could infer that Bob’s record is either the second or the fifth record in Table 1. These two records share the same disease (SARS), so the attacker learns that Bob has SARS without needing to tell the two records apart.

Similarity attack. Some sensitive attribute values differ but have similar meanings, and such values frequently appear in a trajectory database. Even if an attacker cannot uniquely identify a user’s record, they could infer the category of the user’s sensitive attribute. For example, if an attacker knows the location point $f_{8}$ from background knowledge, they could infer (with 75% confidence) that the user suffers from a disease related to lung infection, because the diseases of the four trajectory records that include the location point $f_{8}$ are diabetes, SARS, flu, and SARS, and the last three are lung infections.
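The three attacks can be reproduced on Table 1 with a few lines of Python. This is only an illustrative sketch: the record layout, the helper name `matching_records`, and the set `LUNG_INFECTIONS` are assumptions introduced here and are not part of the proposed method.

```python
# Table 1 as (id, trajectory, disease, job); a point is (location, timestamp).
TABLE_1 = [
    (1, [("d", 3), ("c", 4), ("e", 5), ("f", 8)], "Diabetes", "Lawyer"),
    (2, [("d", 1), ("e", 2), ("d", 5), ("c", 6), ("f", 8)], "SARS", "Doctor"),
    (3, [("c", 2), ("d", 6), ("f", 8)], "Flu", "Student"),
    (4, [("c", 3), ("d", 7), ("d", 8)], "HIV", "Doctor"),
    (5, [("d", 1), ("d", 5), ("e", 6), ("f", 8)], "SARS", "Student"),
    (6, [("d", 3), ("c", 4), ("a", 7)], "Pancreatitis", "Lawyer"),
]

# Assumed grouping, taken from the similarity-attack example in the text.
LUNG_INFECTIONS = {"SARS", "Flu"}


def matching_records(known_points):
    """Return the records whose trajectory contains every point the attacker knows."""
    known = set(known_points)
    return [rec for rec in TABLE_1 if known <= set(rec[1])]


# Identity link attack: a_7 occurs in exactly one record.
identity = matching_records([("a", 7)])

# Attribute link attack: d_1 and d_5 match two records that share one disease.
attribute = matching_records([("d", 1), ("d", 5)])
diseases = {rec[2] for rec in attribute}

# Similarity attack: share of f_8 records whose disease is a lung infection.
similar = matching_records([("f", 8)])
confidence = sum(rec[2] in LUNG_INFECTIONS for rec in similar) / len(similar)

print([rec[0] for rec in identity], diseases, confidence)
```

In each case the attacker needs only a handful of location points, which is why removing the Name column alone does not protect the records.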

In research related to trajectory sensitive attribute preservation methods, most researchers study the preservation methods for a single sensitive attribute and cannot protect trajectory datasets containing multiple sensitive attributes. However, the trajectory datasets published often have multiple sensitive attributes, so many existing methods cannot meet the privacy requirements of users.

To this end, we propose a trajectory personalization privacy preservation method based on multi-sensitivity attribute generalization and local suppression. This method achieves anonymity requirements by analyzing the correlation between sensitive attributes, assigning privacy preservation levels, generalizing multiple sensitive attributes, and adopting local suppression of trajectories. It achieves a good balance between data availability and privacy preservation and meets users’ personalized privacy preservation requirements.

The main contributions of this paper can be summarized as follows:

A personalized privacy preservation method in trajectory data publishing is proposed, which can achieve different degrees of privacy preservation based on different users’ needs and avoid the waste of resources.

We propose a method to set the privacy preservation level based on the correlation between sensitive attributes, which combines the generalization of sensitive attributes and local suppression of trajectories to preserve the utility of anonymous datasets as much as possible.

We define a balance suppression coefficient for achieving the balance between privacy preservation levels and trajectory suppression, which effectively avoids deleting too many location points and ensures a reasonable trajectory retention ratio while satisfying privacy preservation requirements.

The experimental results on a real trajectory dataset showed that the proposed algorithm could achieve a good balance between data availability and privacy preservation.

The rest of this paper is organized as follows: Section 2 reviews the related work. Section 3 presents the problem definition. Section 4 describes the methodology of this study. Section 5 outlines the experimental results, and Section 6 summarizes the work and future research directions.

2. Related work

In the face of enormous volumes of trajectory data, the problem of privacy leakage has become increasingly serious. Traditional (GPS-type) trajectory data only contains latitude and longitude locations and time information, while high-dimensional (semantic) trajectory data contains much semantic information, including time, semantic location, and sensitive attributes. Consequently, the preservation of trajectory data can be divided into two types, namely, traditional trajectory privacy preservation and semantic trajectory privacy preservation. The former refers to the adoption of particular measures regarding the GPS locations of the original trajectories to reduce the probability of an attacker correctly identifying a user’s trajectory and achieve the required anonymity; the latter refers to the consideration of the relationship between the semantic location information and sensitive attributes, while protecting the semantic locations and the sensitive attributes of the trajectory.

2.1 Traditional trajectory privacy preservation methods

Fake trajectory method. This method adds particular fake trajectories to the original trajectory dataset to achieve the purpose of protecting privacy information. The fake trajectory method is simple, but it can be difficult to meet privacy preservation requirements and ensure data utility after adding the fake trajectories. Many scholars have researched this problem with some success. Lei et al. [9] proposed a privacy preservation method for fake trajectories based on spatiotemporal correlation that analyzed the spatiotemporal correlation between adjacent positions in a single trajectory and the similarity between trajectories, and it could effectively blur the lines between fake trajectories and real trajectories. Dong et al. [10] proposed a trajectory privacy preservation method based on frequent path mining. It first removed uncommon trajectories from the trajectory dataset, before proposing a new method to find frequent paths, and then selected representative trajectories to represent all trajectories within a group, adding virtual trajectories to satisfy $k$ -anonymity. Wang et al. [11] introduced a fake location selection algorithm, which ensured semantic differences, similar query probabilities, and geographical dispersion in fake locations, and it could effectively prevent attackers from identifying fake trajectories by using their background knowledge. However, there were still deficiencies in meeting personalized privacy preservation requirements. The most common disadvantage was that it could be difficult to achieve a good balance between privacy preservation efficiency and data availability.

Suppression method. This method suppresses or deletes some sensitive locations or those visited frequently by moving objects before the trajectory data are published. Gruteser et al. [12] first proposed dividing the trajectory data into sensitive and non-sensitive areas. Only when a moving object was in a sensitive area would the corresponding trajectory be suppressed before being published. Mohammed et al. [13] proposed a high-dimensional sparse trajectory privacy preservation method, which used a global suppression method to suppress trajectory data. It ensured that each trajectory was indistinguishable from $k-1$ other trajectories, satisfying the $k$-anonymity principle, so that attackers could infer a sensitive location only with a probability less than a set threshold. However, due to the global suppression method, the data availability was reduced. Based on the work of Mohammed et al., Chen et al. [14] adopted local suppression instead of global suppression to obtain an LKC-privacy preservation scheme that addresses the low availability problem. Terrovitis et al. [15] proposed four trajectory suppression methods, including global suppression, local suppression, trajectory splitting, and hybrid methods. Among them, trajectory splitting was a newly proposed trajectory suppression method, which split the trajectory into several sub-trajectories and applied global and local suppression to the sub-trajectories, improving data availability. Zhao et al. [16] proposed a method that combined the fake trajectory method with the suppression method. Based on the frequency of trajectories appearing in a dataset, they chose to add fake trajectories to the original trajectory dataset or locally suppress the trajectory data. Wang et al. [17] proposed a trajectory privacy preservation algorithm based on information entropy suppression and designed a cost function to eliminate the violation sequence in the trajectory by calculating the entropy value of a position point to suppress the trajectory. Chen et al. [18] proposed a trajectory privacy preservation method based on single-point revenue, which calculated the revenue of using the fake trajectory method or the suppression method for a location point, respectively, before selecting the method with the higher revenue to process the trajectory, which could effectively protect the data. The suppression method is simple and effective and can deal with situations where an attacker has some background knowledge, safeguarding trajectory privacy while ensuring the high availability of the trajectory data. The key to using a suppression method is to suppress the location information reasonably. If too many locations are suppressed, data availability is low and users’ needs are not met; if too few locations are suppressed, the degree of privacy preservation is low, and privacy leakage could easily occur.

Generalization method. Compared with the aforementioned two methods, the generalization method can achieve a better balance between data availability and data preservation efficiency. The current mainstream trajectory privacy preservation technology based on the generalization method is $k$-anonymity, which was first proposed by Sweeney et al. [19] for the privacy preservation of relational databases. Subsequently, scholars have proposed many trajectory generalization methods based on $k$-anonymity. The main idea is to generalize the attributes of a single trajectory, so that an attacker cannot distinguish a trajectory from $k-1$ other trajectories without background knowledge, and the probability of identifying a specific trajectory does not exceed $1/k$. Abul et al. [20] proposed a ($k$, $\delta$)-anonymity model to address the problem that devices, such as positioning systems, could not precisely locate their position. It calculates the similarity between two trajectories based on the Euclidean distance function and obtains a $k$-anonymous trajectory dataset using a greedy clustering algorithm. It requires that the start time and end time of the two trajectories are the same and that their corresponding locations are the same. However, such trajectory data is rare. The above methods also ignore the limitations of road networks. Consequently, Gurung et al. [21] adopted clustering trees to accelerate the clustering process for the setting in which an attacker uses public road network information as background knowledge. The proposed method satisfied not only the $k$-anonymity requirement, but also the road network limitations. Because one-time anonymity may not be able to meet user needs, Jia et al. [22] proposed a trajectory privacy preservation algorithm based on quadratic anonymity. First, they $k$-anonymized the synchronized trajectory based on a starting time. Then, they identified the sensitive area and performed secondary anonymization on data from within it to ensure the data validity. The generalization method has advantages in data availability; however, anonymity can be expensive, and the risk of privacy leakage remains.

2.2 Semantic trajectory privacy preservation

Trajectory data includes not only single location information, but also semantic information and sensitive attributes such as gender, home address, and occupation. Simply protecting location information could cause privacy leakage. Most privacy preservation principles constrain the disclosure of sensitive information by removing identifying attributes, such as name and id, from the trajectory database before the data are published. However, such methods cannot resist similarity link, identity link, or attribute link attacks, particularly when an attacker’s background knowledge could include partial trajectories. To solve the problem that $k$-anonymity cannot resist homogeneity and background knowledge attacks, Machanavajjhala et al. [23] proposed an $l$-diversity anonymity model, which divided the original data into different equivalent blocks, with each block containing at least $l$ different sensitive attributes, so that the probability of an attacker identifying the real data was at most $1/l$. This not only satisfied the requirement of $k$-anonymity, but was also able to resist homogeneity and background knowledge attacks to some extent. Subsequently, Li et al. [24] pointed out the defects of the $l$-diversity anonymity model and discovered a correlation between the distribution of sensitive attributes in the equivalent block and those in the original data, which could lead to the leakage of sensitive attributes. Thus, they proposed the $t$-closeness method. Here, the relationship between quasi-identifiers and sensitive attributes was initially suppressed, so that the distribution of sensitive attributes in each equivalent block was similar, thereby reducing the probability of an attacker identifying the sensitive attributes of real moving objects through the distribution of their sensitive attributes. Yao et al. [25] proposed the ($l$, $\alpha$, $\beta$) anonymous model, which ensured that the number of sensitive attributes in each equivalent block was not less than $l$ and that the probability of determining each sensitive attribute did not exceed $\alpha$, ensuring that the probability of an attacker obtaining similar sensitive attributes was not more than $\beta$. Although the $k$-anonymity, $l$-diversity, and $t$-closeness methods met the privacy preservation requirements to a certain extent, auxiliary elements needed to be added, which increased their computational overhead and run-time. Komishani et al. [26] proposed a method based on sensitive attribute generalization and trajectory local suppression. However, it could only solve the problem of a single sensitive attribute and could not protect trajectory data carrying multiple sensitive attributes. Moreover, under normal circumstances, the published semantic trajectory dataset usually contains multiple sensitive attributes. Consequently, Jia et al. [27] proposed the ($l$, $m$, $d$) anonymous model to resist multi-sensitivity attribute similarity attacks, using a semantic hierarchy tree to analyze and calculate the semantic similarity between sensitive attributes, ensuring that there were at least $l$ sensitive attributes in each equivalence class that could satisfy the $l$-diversity principle. The above methods did not consider that different moving objects have different privacy preservation requirements or that published trajectory data usually have multiple sensitive attributes. Thus, they could not meet the needs of users.

In conclusion, published trajectory data often contain semantic location and sensitive attribute information. Simply protecting GPS location information is not sufficient. Moreover, each user has different privacy preservation requirements. Most of the existing methods do not consider the different privacy requirements of different users and cannot resist all the linking attacks, perhaps just one or two of them at best. Consequently, this study considers the correlation between multiple sensitive attributes and proposes a trajectory personalized privacy preservation method based on multi-sensitive attribute generalization and local suppression, which can meet a user’s personalized privacy preservation requirements and resist all linking attacks.

3. Theory: Problem description and related definitions

3.1 Trajectory model

Definition 1 (Trajectory): A trajectory refers to a set of ordered location points, denoted as $L_{i}$ , and expressed as follows:

$\displaystyle L_{i}=(\textit{loc}_{1}^{i},t_{1}^{i})\to(\textit{loc}_{2}^{i},t_{2}^{i})\to\ldots\to(\textit{loc}_{n}^{i},t_{n}^{i}),$ (1)

where $t_{j}^{i}$ represents time, and $\textit{loc}_{j}^{i}$ represents the location passed at time $t_{j}^{i}$ . $({\textit{loc}_{j}^{i},t_{j}^{i}})$ is called the location point, denoted as $p_{j}^{i}({1\leqslant j\leqslant n})$ . The length of the trajectory is $|{L_{i}}|=n$ . The sequence consisting of the first $k$ location points of $L_{i}$ is denoted as $L_{i}^{k}$ .

Definition 2 (Trajectory dataset): The set of trajectories is called a trajectory dataset, denoted as $T=\cup_{1\leqslant i\leqslant|T|}L_{i}$ . $|T|$ is the number of trajectories in $T$ .

Table 1 shows a trajectory dataset $T$ composed of six trajectories. In the first trajectory $L_{1}$ , $d, c, e,$ and $f$ represent the locations, and 3, 4, 5, and 8 represent the time passing through the corresponding locations, respectively. The length of the trajectory $|{L_{1}}|$ is 4.

Definition 3 (Sub-trajectory): Suppose two trajectories $L_{i}=(\textit{loc}_{1}^{i},t_{1}^{i})\to(\textit{loc}_{2}^{i},t_{2}^{i})\to\ldots\to(\textit{loc}_{n}^{i},t_{n}^{i})$ and $L_{j}=(\textit{loc}_{1}^{j},t_{1}^{j})\to(\textit{loc}_{2}^{j},t_{2}^{j})\to\ldots\to(\textit{loc}_{m}^{j},t_{m}^{j})$. If $n>m$ and there is a mapping $f$ such that $(\textit{loc}_{1}^{j},t_{1}^{j})=(\textit{loc}_{f(1)}^{i},t_{f(1)}^{i}),\ldots,(\textit{loc}_{m}^{j},t_{m}^{j})=(\textit{loc}_{f(m)}^{i},t_{f(m)}^{i})$ with $1\leqslant f(1)<f(2)<\ldots<f(m)\leqslant n$, then $L_{j}$ is termed a sub-trajectory of $L_{i}$.
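Definition 3 amounts to an order-preserving subsequence test. A minimal Python sketch, assuming a trajectory is stored as a list of (location, timestamp) tuples (the representation and the function name `is_sub_trajectory` are illustrative choices, not the paper's notation):

```python
def is_sub_trajectory(l_j, l_i):
    """Return True if l_j is a strictly shorter, order-preserving subsequence of l_i."""
    if len(l_j) >= len(l_i):
        return False  # Definition 3 requires n > m
    remaining = iter(l_i)
    # `point in remaining` advances the iterator past the match, so every
    # later point of l_j must be found further along l_i (f is increasing).
    return all(point in remaining for point in l_j)


L1 = [("d", 3), ("c", 4), ("e", 5), ("f", 8)]
print(is_sub_trajectory([("d", 3), ("e", 5)], L1))
print(is_sub_trajectory([("e", 5), ("d", 3)], L1))
```

The first call finds $d_{3}$ and then $e_{5}$ later in $L_{1}$; the second fails because $d_{3}$ does not occur after $e_{5}$.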

Definition 4 (Connectable trajectory): Suppose two trajectories $L_{i}=(\textit{loc}_{1}^{i},t_{1}^{i})\to(\textit{loc}_{2}^{i},t_{2}^{i})\to\ldots\to(\textit{loc}_{n}^{i},t_{n}^{i})$ and $L_{j}=(\textit{loc}_{1}^{j},t_{1}^{j})\to(\textit{loc}_{2}^{j},t_{2}^{j})\to\ldots\to(\textit{loc}_{n}^{j},t_{n}^{j})$, then if $L_{i}^{n-1}=L_{j}^{n-1}$ and $t_{n}^{i}<t_{n}^{j}$, $L_{i}$ and $L_{j}$ are connectable, and the connected trajectory, denoted as $L_{i}\ast L_{j}$, can be expressed as follows:

$\displaystyle L_{i}\ast L_{j}=(\textit{loc}_{1}^{i},t_{1}^{i})\to(\textit{loc}_{2}^{i},t_{2}^{i})\to\ldots\to(\textit{loc}_{n}^{i},t_{n}^{i})\to(\textit{loc}_{n}^{j},t_{n}^{j}).$ (2)
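The connection operation of Definition 4 can be sketched in Python as follows, again assuming trajectories are lists of (location, timestamp) tuples; the function name `connect` and the choice to return `None` for non-connectable pairs are assumptions made for illustration.

```python
def connect(l_i, l_j):
    """Return L_i * L_j per Eq. (2), or None if the pair is not connectable."""
    if not l_i or len(l_i) != len(l_j):
        return None  # Definition 4 is stated for two trajectories of length n
    same_prefix = l_i[:-1] == l_j[:-1]      # L_i^{n-1} = L_j^{n-1}
    earlier_end = l_i[-1][1] < l_j[-1][1]   # t_n^i < t_n^j
    if same_prefix and earlier_end:
        return l_i + [l_j[-1]]              # append the last point of L_j
    return None


a = [("d", 3), ("c", 4), ("e", 5)]
b = [("d", 3), ("c", 4), ("f", 8)]
print(connect(a, b))
print(connect(b, a))
```

Here `connect(a, b)` yields the length-4 trajectory $d_{3}\to c_{4}\to e_{5}\to f_{8}$, while `connect(b, a)` is rejected because the timestamp condition fails.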

Definition 5 (Sensitive attribute set): Consider a trajectory $L_{i}$ . Its sensitive attribute set can be defined as the set of personal privacy information (i.e., sensitive attributes) in $L_{i}$ , denoted as $S(L_{i})$ :

$\displaystyle S(L_{i})=\{{S_{L_{i}}^{1},S_{L_{i}}^{2},\cdots,S_{L_{i}}^{t}}\},$ (3)

where $S_{L_{i}}^{j}({1\leqslant j\leqslant t})$ is the sensitive attribute value, and $t$ is the number of sensitive attributes.

The trajectory dataset studied here contains two sensitive attributes: disease and occupation. For example, for the second trajectory $L_{2}$ in Table 1, $S(L_{2})=$ {“SARS”, “Doctor”}, $S_{L_{2}}^{1}=$ “SARS”, and $S_{L_{2}}^{2}=$ “Doctor”. Different sensitive attributes belong to different categories. In the following, a classification tree is established for disease, and diseases are generalized based on this tree to achieve privacy preservation; for occupation, no further classification is performed, and a unified generalization value is used to achieve privacy preservation.

Definition 6 (Sensitive attribute classification tree) [26]: Let $\Lambda$ be the set of similar sensitive attributes in the trajectory set $T$. The sensitive attribute classification tree corresponding to $\Lambda$ is a two-tuple, defined as $\textit{ST}=<N,h>$, where $N$ represents the set of all nodes in ST, denoted as $N=\{{n_{1},n_{2},\cdots,n_{|N|}}\}$, which contains all the sensitive attributes and their generalization categories in $\Lambda$, and $h$ is the height of the classification tree ST.

There are two types of nodes in ST: leaf nodes, which record sensitive attribute values, and internal nodes, which represent possible attribute values after generalization of the sensitive attributes. Each node can be denoted as $\textit{Node}=<\textit{val},\textit{ht}>$, where the val field stores the sensitive attribute value or the generalized attribute value, and the ht field stores the height of the node. We define the height of each leaf node to be 0; the height of the root node represents the height of the classification tree. Each node has a subset: the subset of a leaf node contains only the node itself, and the subset of an internal node contains all the leaf nodes in the subtree rooted at it. The subset of node $n_{i}$ can be denoted as [$n_{i}$], $n_{i}\in N$.

Using the disease-sensitive attribute as an example, we use the classification tree shown in Fig. 2 to classify a user’s disease. The leaf nodes (19 in total) represent the values of sensitive attributes; the parent node of each node represents the previous category to which they belong. The name of each node is unique. Particularly, the classification tree can be expanded as user disease types increase in the dataset.

For example, Flu and SARS are both pulmonary infections; pulmonary infections are pulmonary diseases, and pulmonary diseases are diseases. The classification tree height shown in Fig. 2 is ST.h $=$ 3, [Flu] $=$ {“Flu”}, [Lung Infection] $=$ {“Flu”, “Cold”, “SARS”}.

Figure 2.

Sensitive attribute classification tree.

Definition 7 (Containing node): Given a sensitive attribute classification tree $\textit{ST}=<N,h>$ , $n_{i}$ is the containing node of $n_{j}$ , iff $[n_{j}]\subset[n_{i}]$ , $n_{i},n_{j}\in N$ .
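Definitions 6 and 7 can be illustrated on the one branch of Fig. 2 that the text spells out. The Python sketch below is a fragment only: the node labels "Pulmonary Disease" and "Disease" are assumed names for the height-2 and root nodes, the remaining branches of the 19-leaf tree are omitted, and the helper names are hypothetical.

```python
# Child -> parent links for the branch described in the text (assumed labels).
PARENT = {
    "Flu": "Lung Infection",
    "Cold": "Lung Infection",
    "SARS": "Lung Infection",
    "Lung Infection": "Pulmonary Disease",
    "Pulmonary Disease": "Disease",
}


def children(node):
    """Direct children of a node in the fragment."""
    return [child for child, parent in PARENT.items() if parent == node]


def height(node):
    """Leaves have height 0; an internal node is one above its tallest child."""
    kids = children(node)
    return 0 if not kids else 1 + max(height(k) for k in kids)


def subset(node):
    """[node]: the leaf itself, or all leaves in the subtree rooted at node."""
    kids = children(node)
    if not kids:
        return {node}
    return set().union(*(subset(k) for k in kids))


def is_containing_node(n_i, n_j):
    """Definition 7: n_i contains n_j iff [n_j] is a proper subset of [n_i]."""
    return subset(n_j) < subset(n_i)


print(height("Disease"), sorted(subset("Lung Infection")))
print(is_containing_node("Lung Infection", "Flu"))
```

Because the other branches are missing, "Pulmonary Disease" and "Lung Infection" have the same leaf subset in this fragment, so the containing-node test between those two would only become true on the full tree.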

3.2 Privacy model

Definition 8 (Privacy preservation level): Given a trajectory dataset $T$ , the privacy preservation level of each trajectory $L_{i}$ in $T$ refers to the strength of its privacy preservation, which can be expressed as the generalization strength of sensitive attributes and the number of suppressed locations.

Different moving objects have different privacy requirements. As presented in Table 2, we provide users with four levels of privacy preservation, namely None, Low, Middle, and High. Each privacy preservation level corresponds to a value, that is, “None” corresponds to 0, “Low” corresponds to 1, “Middle” corresponds to 2, and “High” corresponds to 3. For example, the privacy preservation level of trajectory $L_{1}$ is “Low”, which is denoted as $\textit{pl}(L_{1})=1$ .

Table 2

Trajectory data with privacy preservation level

Id	Level	Trajectory	Disease	Job
1	Low	$d_{3}\to c_{4}\to e_{5}\to f_{8}$	Diabetes	Lawyer
2	Middle	$d_{1}\to e_{2}\to d_{5}\to c_{6}\to f_{8}$	SARS	Doctor
3	None	$c_{2}\to d_{6}\to f_{8}$	Flu	Student
4	High	$c_{3}\to d_{7}\to d_{8}$	HIV	Doctor
5	Low	$d_{1}\to d_{5}\to e_{6}\to f_{8}$	SARS	Student
6	High	$d_{3}\to c_{4}\to a_{7}$	Pancreatitis	Lawyer

Definition 9 (Mastered location) [18]: Mastered location refers to the set of all location information that an attacker can track, that is, the background knowledge they have. Consider a set of attackers $\textit{Advs}=\{{\textit{Adv}_{1},\textit{Adv}_{2},\ldots,\textit{Adv}_{m}}\}$, where $m$ is the number of attackers. The mastered location of an attacker $\textit{Adv}_{i}({1\leqslant i\leqslant m})$ can be denoted as $\psi^{\textit{Adv}_{i}}$:

$\displaystyle\psi^{\textit{Adv}_{i}}=\{p_{a_{1}}^{i},p_{a_{2}}^{i},\ldots,p_{a_{u}}^{i}\},\quad u\leqslant\xi,$ (4)

where $\xi$ represents the maximum number of location points that the attacker has mastered.

Using the mastered location, $\textit{Adv}_{i}$ can identify a partial trajectory set $T({\psi^{\textit{Adv}_{i}}})$ , which may pose a privacy threat to the user.

$\displaystyle T(\psi^{\textit{Adv}_{i}})=\{L_{i}\,|\,L_{i}\in T,\psi^{\textit{Adv}_{i}}\subseteq L_{i}\}.$ (5)

As presented in Table 2, assuming that the attacker masters the location $\psi^{\textit{Adv}_{i}}={\{}d_{3},c_{4}{\}}$ , then $T({\psi^{\textit{Adv}_{i}}})={\{}L_{1},L_{6}{\}}$ , so the trajectories $L_{1}$ and $L_{6}$ have the risk of privacy leakage.

Definition 10 (Sibling node set): Given a sensitive attribute classification tree ST and assuming that there is an attacker $\textit{Adv}_{j}$ , we can define the nodes in ST corresponding to the sensitive attribute values of the trajectory in $T({\psi^{\textit{Adv}_{j}}})$ to be sibling nodes of each other, the set of these nodes being denoted as ${\zeta}^{\textit{Adv}_{j}}$ .

Definition 11 (Guard node): Given a trajectory $L_{i}$ and a sensitive attribute classification tree ST, node $n_{i}$ is called the guard node of $L_{i}$ , if $S_{L_{i}}^{1}\in[n_{i}]$ and $pl({L_{i}})=n_{i}.ht$ , denoted as $\lambda_{L_{i}}$ .
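Combining the level values of Definition 8 with Definition 11, the guard node can be located by walking up from the leaf that holds the disease value until the node height reaches $\textit{pl}(L_{i})$. The Python sketch below assumes the same single branch of Fig. 2 named in the text, with "Pulmonary Disease" and "Disease" as assumed labels for the height-2 and root nodes; `guard_node` is a hypothetical helper, not the paper's algorithm.

```python
LEVELS = {"None": 0, "Low": 1, "Middle": 2, "High": 3}

# One branch of the classification tree (labels above height 1 are assumed).
PARENT = {
    "Flu": "Lung Infection",
    "Cold": "Lung Infection",
    "SARS": "Lung Infection",
    "Lung Infection": "Pulmonary Disease",
    "Pulmonary Disease": "Disease",
}
HEIGHT = {
    "Flu": 0, "Cold": 0, "SARS": 0,
    "Lung Infection": 1, "Pulmonary Disease": 2, "Disease": 3,
}


def guard_node(disease, level):
    """Return the ancestor of `disease` (or the leaf itself) whose height equals pl."""
    target = LEVELS[level]
    node = disease
    # Climb until the height reaches the requested level. In an unbalanced
    # tree this stops at the first node at or above that height (assumption).
    while HEIGHT[node] < target:
        node = PARENT[node]
    return node


# Trajectories 2, 5 and 3 of Table 2.
print(guard_node("SARS", "Middle"))
print(guard_node("SARS", "Low"))
print(guard_node("Flu", "None"))
```

A trajectory with level "None" keeps its leaf value as the guard node, which corresponds to publishing the disease without generalization.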

Definition 12 (Generalization strength): Given a trajectory $L_{i}$ and a sensitive attribute classification tree ST, the generalization strength can be expressed as the generalizable depth, denoted as ${\Delta}({L_{i}})$ :

$\displaystyle\Delta(L_{i})=\left\{\begin{array}{ll}\zeta_{i}^{\textit{Adv}_{j}}.ht-\lambda_{L_{i}}.ht,&\zeta_{i}^{\textit{Adv}_{j}}.ht>\lambda_{L_{i}}.ht\\ 0,&\text{otherwise}\end{array}\right.$ (6)

Definition 13 (Leakage probability of trajectory): Given a trajectory dataset $T$ and assuming the trajectory $L_{i}\in T$, the attacker $\textit{Adv}_{i}$ can obtain a partial trajectory set $T(\psi^{\textit{Adv}_{i}})$ based on his background knowledge $\psi^{\textit{Adv}_{i}}$. We can obtain the guard node $\lambda_{L_{i}}$ of $L_{i}$ from the set privacy preservation level, and for any $L_{k}\in T(\psi^{\textit{Adv}_{i}})$ $(1\leqslant k\leqslant|T(\psi^{\textit{Adv}_{i}})|)$, the probability that the current guard node may leak can be expressed as follows:

$\displaystyle P(\lambda_{L_{i}}|\zeta_{k}^{\textit{Adv}_{i}})=\frac{\left|\left[\lambda_{L_{i}}\right]\cap\left[\zeta_{k}^{\textit{Adv}_{i}}\right]\right|}{\left|\left[\zeta_{k}^{\textit{Adv}_{i}}\right]\right|}.$ (7)

The leakage probability of trajectory $L_{i}$ can then be defined as follows:

$\displaystyle P({L_{i}{|}\psi^{\textit{Adv}_{i}}})=\frac{1}{|T(\psi^{\textit{Adv}_{i}})|}\sum\limits_{L_{k}\in T({\psi^{\textit{Adv}_{i}}})}P(\lambda_{L_{i}}|{\zeta}_{k}^{\textit{Adv}_{i}}),$ (8)

where $\lambda_{L_{i}}$ is the guard node, and ${\zeta}_{k}^{\textit{Adv}_{i}}$ is the node in the set of sibling nodes.
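
To make Eqs (7) and (8) concrete, the following Python sketch computes the leakage probability of a guard node on a small classification tree. It is an illustration under stated assumptions, not the implementation used in this study: the tree, the leaf values “Pneumonia” and “Gout”, the node “Metabolic Disease”, and the function names are hypothetical, and $[n]$ is assumed to denote the set of leaf values covered by node $n$.

```python
from typing import Dict, List, Set

# Hypothetical classification tree (parent -> children); NOT the paper's tree.
TREE: Dict[str, List[str]] = {
    "Any Illness": ["Lung Infection", "Metabolic Disease"],
    "Lung Infection": ["Flu", "SARS", "Pneumonia"],
    "Metabolic Disease": ["High Blood Sugar", "Gout"],
}


def leaves(node: str) -> Set[str]:
    """Assumed meaning of [node]: the leaf values covered by `node`."""
    children = TREE.get(node, [])
    if not children:
        return {node}
    covered: Set[str] = set()
    for child in children:
        covered |= leaves(child)
    return covered


def guard_leak(guard: str, sibling: str) -> float:
    """Eq. (7): share of the sibling node's leaves that fall under the guard node."""
    sibling_leaves = leaves(sibling)
    return len(leaves(guard) & sibling_leaves) / len(sibling_leaves)


def trajectory_leak(guard: str, siblings: List[str]) -> float:
    """Eq. (8): average of Eq. (7) over the trajectories matched by the attacker."""
    return sum(guard_leak(guard, s) for s in siblings) / len(siblings)


# Published disease values of the trajectories matched by the attacker.
matched_values = ["High Blood Sugar", "Any Illness", "Flu", "Lung Infection"]
risk = trajectory_leak("Flu", matched_values)
print(round(risk, 3))
```

With this toy tree the four terms of Eq. (8) are 0, 1/5, 1, and 1/3, so the leakage probability of the guard node “Flu” is approximately 0.383. A larger tree lowers the contribution of coarse nodes such as “Any Illness”.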

4. Methodology: Trajectory personalization privacy preservation algorithm

4.1 Algorithm idea

Because attackers have partial location information as background knowledge, simply removing explicit identifiers cannot effectively protect user privacy. We propose the personalized trajectory privacy preservation algorithm based on multi-sensitive attribute generalization and local suppression (PTPPMGLS). If the privacy leakage probability of each trajectory is less than or equal to the privacy parameter $P_{b}$ , it can be guaranteed that an attacker cannot identify the user location and sensitive attribute information with a probability higher than $P_{b}$ .

Before applying the PTPPMGLS method, the data must be pre-processed. Pre-processing considers missing values and trajectory length. A trajectory with missing data is treated as safe by default and needs no protection: our purpose is to protect sensitive attributes together with trajectory information, so a record in which either is missing is considered safe. In addition, trajectories with too few locations are deleted, because an attacker can usually match them to users with high probability.

The PTPPMGLS algorithm can prevent different degrees of privacy attacks while ensuring data availability. It proceeds in three steps:

Setting the privacy preservation level. Use the Apriori algorithm [28] to obtain the correlation between the sensitive attributes of each trajectory, and set the privacy preservation level for the trajectory accordingly;

Generalization of the sensitive attributes. For the trajectory data for which the privacy preservation level has been set, generalize the sensitive attributes based on the correlation between them and the constructed classification tree;

Trajectory local suppression. The generalized trajectory data can be protected further: the remaining trajectories that are still at risk of privacy leakage are subjected to location point suppression.

Section 4.2 introduces the setting method of the privacy preservation level. Sections 4.3 and 4.4 implement multi-sensitivity attribute generalization and trajectory local suppression, respectively. The algorithm flow diagram is shown in Fig. 3.

Figure 3.

Algorithm flow diagram.

4.2 Privacy preservation level setting

Different users have different requirements for the degree of privacy preservation. It is not necessary to adopt the same privacy preservation level for each user, otherwise it could waste resources. Consequently, we use the Apriori algorithm to calculate the correlation between sensitive attributes and set the privacy preservation level according to the correlation.

Definition 14 (Correlation): Correlation is an information measure of the association between two sets of events. For the two sensitive attributes $S_{L_{i}}^{1}$ and $S_{L_{i}}^{2}$ of trajectory $L_{i}$ in trajectory dataset $T$ , it is the probability that one sensitive attribute can be inferred once the other is known. The correlation between sensitive attributes $S_{L_{i}}^{1}$ and $S_{L_{i}}^{2}$ can be calculated as follows:

$\displaystyle Co({S_{L_{i}}^{1},S_{L_{i}}^{2}})=P({S_{L_{i}}^{2}{|}S_{L_{i}}^{1}})=\frac{\textit{Sup}({S_{L_{i}}^{1},S_{L_{i}}^{2}})}{\textit{Sup}({S_{L_{i}}^{1}})},$ (9)

where $\textit{Sup}()$ is the support degree of the sensitive attribute; $\textit{Sup}({S_{L_{i}}^{1}})$ and $\textit{Sup}({S_{L_{i}}^{1},S_{L_{i}}^{2}})$ refer to the frequency of occurrence of “ $S_{L_{i}}^{1}$ ” and “ $S_{L_{i}}^{1},S_{L_{i}}^{2}$ ” in the trajectory dataset, respectively. $Co({S_{L_{i}}^{1},S_{L_{i}}^{2}})$ represents the correlation between the sensitive attributes of the trajectory $L_{i}$ , which is abbreviated as $L_{i}.Co$ hereinafter.

Based on the correlation of each trajectory, its privacy preservation level can be set as follows:

$\displaystyle pl({L_{i}})=\left\{\begin{array}{ll}0,&L_{i}.Co\in[{0,0.333})\\ 1,&L_{i}.Co\in[{0.333,0.667})\\ 2,&L_{i}.Co\in[{0.667,1})\\ 3,&L_{i}.Co=1\end{array}\right.$ (10)
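
Eqs (9) and (10) can be illustrated with a short Python sketch. It counts the supports directly instead of running a full Apriori implementation, which is sufficient for a pair of attributes; the records, the job value “Driver”, and the function names are hypothetical and serve only as an example.

```python
from collections import Counter
from typing import List, Tuple

Record = Tuple[str, str]  # (first sensitive attribute, second sensitive attribute)


def correlation(records: List[Record], s1: str, s2: str) -> float:
    """Eq. (9): Sup(s1, s2) / Sup(s1), with supports counted over the records."""
    pair_support = Counter(records)
    single_support = Counter(first for first, _ in records)
    if single_support[s1] == 0:
        return 0.0
    return pair_support[(s1, s2)] / single_support[s1]


def privacy_level(co: float) -> int:
    """Eq. (10): map a correlation in [0, 1] to a privacy preservation level 0..3."""
    if co >= 1.0:
        return 3
    if co >= 0.667:
        return 2
    if co >= 0.333:
        return 1
    return 0


# Hypothetical (disease, job) pairs, one per trajectory.
records: List[Record] = [
    ("HIV", "Driver"),
    ("Flu", "Student"), ("Flu", "Student"), ("Flu", "Student"),
    ("Flu", "Lawyer"),
    ("SARS", "Doctor"), ("SARS", "Student"),
]
levels = [privacy_level(correlation(records, d, j)) for d, j in records]
print(levels)
```

In this example “HIV” always co-occurs with “Driver”, so that trajectory receives level 3, whereas (“Flu”, “Lawyer”) has a correlation of 0.25 and receives level 0.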

4.3 Sensitive attribute generalization

The sensitive attributes can be generalized based on the maximum generalization strength ( $\theta$ ) and the maximum generalization boundary ( $\textit{MAX}^{c}$ ) given by the user. Although it could lead to the publishing of ambiguous sensitive attributes, trajectory information can be retained as much as possible.

The pseudo-code of the sensitive attribute preservation (SAP) algorithm is shown in Algorithm 1. The SAP algorithm first builds a disease-sensitive attribute classification tree for the trajectory dataset $T$ (Line 1). It then uses the Apriori algorithm to obtain the correlation between sensitive attributes and set the privacy preservation level for each trajectory $L_{k}\in T$ (Lines 2–5). For each attack sequence $\psi^{\textit{Adv}_{i}}\in B^{s}$ , it determines the trajectory with the highest probability of privacy leakage under this attack sequence, with the set of these trajectories being denoted as TD, and then finds all trajectories with a privacy leakage probability greater than a given threshold $P_{b}$ in TD, denoted as TC (Lines 6–9). If the TC is not empty, it uses the multi-sensitive attribute generalization (MSAG) algorithm to generalize the sensitive attributes of the trajectory data in TC (Lines 10–12).

Algorithm 1: SAP
Input:
$T$ : Original trajectory dataset
$\theta$ : Maximum generalization strength
$P_{b}$ : Privacy breach threshold
$B^{s}$ : A set of background knowledge mastered by all attackers in Advs
Output:
$T_{g}$ : Generalized trajectory dataset
1: Build generalization hierarchy tree ST for illness attributes in $T$
2: for each $L_{k}\in T$ do
3: Calculate $L_{k}.Co$ based on Eq. (9)
4: Set $pl({L_{k}})$ based on $L_{k}.Co$
5: end for
6: for each $\psi^{\textit{Adv}_{i}}\in B^{s}$ do
7: $TD\leftarrow\{L_{k}\mid L_{k}\in T({\psi^{\textit{Adv}_{i}}})$ , $\textit{arg max}_{L_{k}}P({L_{k}|\psi^{\textit{Adv}_{i}}})\}$
8: $TC\leftarrow\{L_{k}\mid L_{k}\in TD,P(L_{k}|\psi^{\textit{Adv}_{i}})>P_{b}\}$
9: $T\leftarrow T-TC$
10: if $TC\neq\emptyset$ then
11: $T\leftarrow T\mathop{\cup}\nolimits\textit{MSAG}({\psi^{\textit{Adv}_{i}},TC})$
12: end if
13: end for
14: $T_{g}\leftarrow T$
15: return $T_{g}$

Using the trajectory dataset shown in Table 2 as an example, the generalized trajectory information is summarized in Table 3. Given $P_{b}=$ 0.5, $\theta=$ 2, $\textit{MAX}^{c}=$ 0.66, and $pl({L_{4}})=$ 3, running the SAP algorithm generalizes the disease attribute of $L_{4}$ from “HIV” to “Any Illness” and blurs its Job attribute to “*”.

Assume that the background knowledge mastered by the attacker is $\psi^{\textit{Adv}_{i}}={\{}f_{8}{\}}$ . According to Table 3, $T({\psi^{\textit{Adv}_{i}}})={\{}L_{1},L_{2},L_{3},L_{5}{\}}$ , and therefore ${\zeta}_{1}^{\textit{Adv}_{i}}=$ “High Blood Sugar”, ${\zeta}_{2}^{\textit{Adv}_{i}}=$ “Any Illness”, ${\zeta}_{3}^{\textit{Adv}_{i}}=$ “Flu”, ${\zeta}_{4}^{\textit{Adv}_{i}}=$ “Lung Infection”. Based on Definition 11, the guard node $\lambda_{L_{3}}$ of $L_{3}$ is the “Flu” node. Using Eqs (7) and (8), we obtain $P(\lambda_{L_{3}}|{\zeta}_{1}^{\textit{Adv}_{i}})=$ 0, $P(\lambda_{L_{3}}|{\zeta}_{2}^{\textit{Adv}_{i}})=$ 0.05, $P(\lambda_{L_{3}}|\zeta_{3}^{\textit{Adv}_{i}})=$ 1, $P(\lambda_{L_{3}}|\zeta_{4}^{\textit{Adv}_{i}})=$ 0.33. Finally, $P({L_{3}{|}\psi^{\textit{Adv}_{i}}})=\frac{1}{4}({0+0.05+1+0.33})=$ 0.345.

Table 3
Trajectories after generalization of sensitive attributes

Id	Level	Trajectory	Disease	Job
1	Low	$d_{3}\to c_{4}\to e_{5}\to f_{8}$	High Blood Sugar	Lawyer
2	Middle	$d_{1}\to e_{2}\to d_{5}\to c_{6}\to f_{8}$	Any Illness	Doctor
3	None	$c_{2}\to d_{6}\to f_{8}$	Flu	Student
4	High	$c_{3}\to d_{7}\to d_{8}$	Any Illness	*
5	Low	$d_{1}\to d_{5}\to e_{6}\to f_{8}$	Lung Infection	Student
6	High	$d_{3}\to c_{4}\to a_{7}$	Any Illness	*

Algorithm 2 shows the pseudo-code of the MSAG algorithm. First, it initializes $S^{c}$ (Line 1). For each trajectory $L_{j}\in\textit{TC}$ with a risk of privacy leakage, if $L_{j}.Co>\textit{MAX}^{c}$ , it generalizes the occupation attribute to “*” (Lines 2–5). For each trajectory in TC, if its privacy leakage probability is equal to 1, it generalizes the disease attribute to its parent node; if its privacy leakage probability does not exceed the given threshold, it adds this trajectory to $S^{c}$ . It then deletes the trajectories in $S^{c}$ from TC (Lines 6–12). If TC is not empty, it checks whether each remaining trajectory in TC satisfies ${\Delta}({L_{j}})\geqslant\theta$ , or ${\zeta}_{j}^{\textit{Adv}_{i}}.ht=\textit{ST.h}$ , or $P({L_{j}{|}\psi^{\textit{Adv}_{i}}})\leqslant P_{b}$ ; if it does, the trajectory is added to $S^{c}$ and removed from TC; otherwise, sensitive attribute generalization is performed (Lines 13–20). Finally, the trajectory set $S^{c}$ after multi-sensitivity attribute generalization is obtained (Line 21).

Algorithm 2: MSAG
Input:
$\psi^{\textit{Adv}}$ : Background knowledge
$TC$ : Risky trajectory dataset
Output:
$S^{c}$ : Generalized trajectory dataset
1: $S^{c}\leftarrow\emptyset$
2: for each $L_{j}\in TC$ do
3: if $L_{j}.Co>\textit{MAX}^{c}$ then // $\textit{MAX}^{c}$ is maximum generalization boundary threshold
4: Set the occupation attribute of $L_{j}$ to ‘*’
5: end if
6: if ( $P({L_{j}|\psi^{\textit{Adv}}})=$ 1) then
7: Generalize the disease attribute of $L_{j}$ to its parent node in the classification tree
8: else if $P({L_{j}|\psi^{\textit{Adv}}})\leqslant P_{b}$ then // $P_{b}$ is privacy breach threshold
9: $S^{c}\leftarrow S^{c}\mathop{\cup}\nolimits\{{L_{j}}\}$
10: end if
11: end for
12: $TC\leftarrow TC-S^{c}$
13: for each $L_{j}\in TC$ do
// $\theta$ is maximum generalization strength threshold
14: if ( ${\Delta}({L_{j}})\geqslant\theta$ or ${\zeta}_{j}^{\textit{Adv}}.ht=\textit{ST.h}$ or $P({L_{j}|\psi^{\textit{Adv}}})\leqslant P_{b}$ ) then
15: $S^{c}\leftarrow S^{c}\mathop{\cup}\nolimits\{{L_{j}}\}$
16: $TC\leftarrow TC-\{{L_{j}}\}$
17: else
18: Generalize the disease attribute of $L_{j}$ to its parent node in the classification tree
19: end if
20: end for
21: return $S^{c}$
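
The stop test in Line 14 of Algorithm 2 can be sketched in Python as follows. This is one possible reading rather than the authors' implementation: it repeats the parent-node generalization until one of the three stop conditions holds, whereas the pseudo-code shows a single pass. The node “Respiratory Disease”, the height table, and the `leak` callback (standing in for Eq. (8)) are hypothetical, and node heights are assumed to be measured from the leaves.

```python
from typing import Callable, Dict

# Hypothetical branch of the classification tree (child -> parent) with the
# height of each node above the leaves; NOT the paper's tree.
PARENT: Dict[str, str] = {
    "Flu": "Lung Infection",
    "Lung Infection": "Respiratory Disease",
    "Respiratory Disease": "Any Illness",
}
HEIGHT: Dict[str, int] = {
    "Flu": 0,
    "Lung Infection": 1,
    "Respiratory Disease": 2,
    "Any Illness": 3,
}
TREE_HEIGHT = 3  # ST.h, the height of the root


def generalize(value: str, guard: str, theta: int, p_b: float,
               leak: Callable[[str], float]) -> str:
    """Climb from `value` toward the root until a stop condition holds.

    `leak(value)` is an assumed callback returning the leakage probability
    of the trajectory when `value` is published.
    """
    while True:
        delta = max(HEIGHT[value] - HEIGHT[guard], 0)  # Eq. (6)
        at_root = HEIGHT[value] == TREE_HEIGHT
        if delta >= theta or at_root or leak(value) <= p_b:
            return value
        value = PARENT[value]


# Leakage stays at 1 regardless of the published value (worst case).
always_risky = lambda value: 1.0
print(generalize("Flu", "Flu", theta=2, p_b=0.5, leak=always_risky))
```

With the guard node at the leaf and $\theta=$ 2, the value is generalized by at most two levels. With the guard node at the root, the generalization strength stays 0 and the value can climb to “Any Illness”, which matches the treatment of $L_{4}$ in Table 3.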

4.4 Trajectory local suppression

After generalization of sensitive attributes, there may still be trajectories at risk of privacy leakage in the trajectory dataset, so further preservation measures must be taken. We use the trajectory local suppression (TLS) algorithm to suppress the publishing of location points, processing the remaining trajectories with privacy leakage risks until the privacy requirement is met.

Although suppressing the location points in the trajectory can effectively reduce the risk of privacy leakage, it also greatly increases information loss. Consequently, we set a balanced suppression coefficient, which can comprehensively consider the risk of information loss and privacy leakage based on the privacy preservation level of the trajectory and achieve a better balance.

Definition 15 (Dangerous background knowledge): Given a trajectory dataset $T$ , let $\psi^{\textit{Adv}_{i}}$ be the background knowledge mastered by an attacker; if there is a trajectory $L_{j}\in T$ such that $P({L_{j}{|}\psi^{\textit{Adv}_{i}}})>P_{b}$ , then $\psi^{\textit{Adv}_{i}}$ is called dangerous background knowledge for $T$ .

Definition 16 (Balance suppression coefficient): Given a trajectory dataset $T$ , $T_{\textit{risk}}$ is the set of dangerous background knowledge, $\psi^{\textit{Adv}_{i}}\in T_{\textit{risk}}$ is the dangerous background knowledge, and the location point $p_{k}^{i}\in\psi^{\textit{Adv}_{i}}$ ; we define the location balance suppression coefficient to select the locations to be suppressed, denoted as ${\omega}({p_{k}^{i},\psi^{\textit{Adv}_{i}},T_{\textit{risk}}})$ , expressed as follows:

$\displaystyle{\omega}(p_{k}^{i},\psi^{\textit{Adv}_{i}},T_{\textit{risk}})=\frac{|{T_{\textit{risk}}({p_{k}^{i}})}|}{|{T({\psi^{\textit{Adv}_{i}}})}|}\sum\limits_{L_{i}\in T({\psi^{\textit{Adv}_{i}}})}pl({L_{i}}),$ (11)

where $T_{\textit{risk}}({p_{k}^{i}})$ represents the background knowledge in $T_{\textit{risk}}$ that contains the location $p_{k}^{i}$ , $T({\psi^{\textit{Adv}_{i}}})$ is as shown in Eq. (5), and $pl(L_{i})$ represents the privacy preservation level of the trajectory $L_{i}$ . We also define the background knowledge balance suppression coefficient, used to select the background knowledge containing the location with the largest location balance suppression coefficient, denoted as $\gamma({\psi^{\textit{Adv}_{i}},T_{\textit{risk}}})$ , expressed as follows:

$\displaystyle\gamma({\psi^{\textit{Adv}_{i}},T_{\textit{risk}}})=\max\limits_{p_{k}^{i}\in\psi^{\textit{Adv}_{i}}}\omega(p_{k}^{i},\psi^{\textit{Adv}_{i}},T_{\textit{risk}}).$ (12)
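
As an illustration of Eqs (11) and (12), the Python sketch below evaluates both coefficients on the trajectories of Table 3. Several points are assumptions made for the example: the dangerous background knowledge set `T_RISK` is hypothetical, the privacy levels are encoded as None = 0, Low = 1, Middle = 2, High = 3, and $T(\psi)$ is taken to be the set of trajectories that contain every location in $\psi$ .

```python
from typing import Dict, FrozenSet, List

Knowledge = FrozenSet[str]

# Trajectories and privacy levels from Table 3 (numeric encoding assumed).
TRAJS: Dict[str, List[str]] = {
    "L1": ["d3", "c4", "e5", "f8"],
    "L2": ["d1", "e2", "d5", "c6", "f8"],
    "L3": ["c2", "d6", "f8"],
    "L4": ["c3", "d7", "d8"],
    "L5": ["d1", "d5", "e6", "f8"],
    "L6": ["d3", "c4", "a7"],
}
PL: Dict[str, int] = {"L1": 1, "L2": 2, "L3": 0, "L4": 3, "L5": 1, "L6": 3}

# Hypothetical dangerous background knowledge set T_risk.
T_RISK: List[Knowledge] = [
    frozenset({"d3", "c4"}),
    frozenset({"c4", "a7"}),
    frozenset({"d1", "f8"}),
]


def matching(psi: Knowledge) -> List[str]:
    """T(psi): ids of the trajectories containing every location in psi."""
    return [tid for tid, points in TRAJS.items() if psi <= set(points)]


def omega(p: str, psi: Knowledge) -> float:
    """Eq. (11): location balance suppression coefficient."""
    matched = matching(psi)
    if not matched:
        return 0.0
    containing = sum(1 for knowledge in T_RISK if p in knowledge)
    return containing / len(matched) * sum(PL[tid] for tid in matched)


def gamma(psi: Knowledge) -> float:
    """Eq. (12): largest location coefficient within psi."""
    return max(omega(p, psi) for p in psi)


# Background knowledge that would be processed first.
psi_max = max(T_RISK, key=gamma)
print(sorted(psi_max), gamma(psi_max))
```

Under these assumptions the background knowledge {c4, a7} has the largest coefficient, because it matches only the high-level trajectory $L_{6}$ and the location $c_{4}$ appears in two pieces of dangerous background knowledge.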

Algorithm 3: TLS
Input:
$T_{g}$ : Generalized trajectory dataset
$B^{s}$ : A set of background knowledge mastered by all attackers in Advs
Output:
$T_{a}$ : Anonymized trajectory dataset
1: $T_{\textit{risk}}\leftarrow\emptyset$
2: for each $\psi^{\textit{Adv}_{i}}\in B^{s}$ do
3: $TD\leftarrow\{L_{k}\mid L_{k}\in T_{g}({\psi^{\textit{Adv}_{i}}})$ , $\textit{arg max}_{L_{k}}P({L_{k}|\psi^{\textit{Adv}_{i}}})\}$
4: $TC\leftarrow\{L_{k}\mid L_{k}\in TD,P(L_{k}|\psi^{\textit{Adv}_{i}})>P_{b}\}$
5: if $TC\neq\emptyset$ then
6: $T_{\textit{risk}}\leftarrow T_{\textit{risk}}\cup\{{\psi^{\textit{Adv}_{i}}}\}$
7: end if
8: end for
9: while $T_{\textit{risk}}\neq\emptyset$ do
// $\psi_{\max}$ is the background knowledge with maximum balanced suppression coefficient
10: $\psi_{\max}\leftarrow\textit{arg max}_{\psi^{\textit{Adv}_{i}}\in T_{\textit{risk}}}\gamma({\psi^{\textit{Adv}_{i}},T_{\textit{risk}}})$
11: $C_{\textit{risk}}\leftarrow\{L_{k}\mid L_{k}\in T_{g}({\psi_{\max}}),P(L_{k}|\psi_{\max})>P_{b}\}$
12: $p_{k}^{i}\leftarrow\textit{arg max}_{p_{k}^{i}\in\psi_{\max}}{\omega}(p_{k}^{i},\psi_{\max},T_{\textit{risk}})$
13: $D_{\textit{risk}}\leftarrow\emptyset$
14: while $C_{\textit{risk}}\neq\emptyset$ do
15: $L_{\max}\leftarrow\textit{arg max}_{L_{i}\in C_{\textit{risk}}}pl({L_{i}})$
16: $D_{\textit{risk}}\leftarrow D_{\textit{risk}}\mathop{\cup}\nolimits\{{L_{\max}}\}$
17: $T_{g}\leftarrow T_{g}-\{{L_{\max}}\}$
18: $L_{\max}\leftarrow L_{\max}-\{{p_{k}^{i}}\}$
19: $T_{g}\leftarrow T_{g}\mathop{\cup}\nolimits\{{L_{\max}}\}$
20: $C_{\textit{risk}}\leftarrow\{L_{k}\mid L_{k}\in T_{g}({\psi_{\max}})$ , $P(L_{k}|\psi_{\max})>P_{b}\}$
21: end while
22: for each $L_{i}\in D_{\textit{risk}}$ do
23: $T_{\textit{risk}}\leftarrow\text{RDBK}({T_{g},L_{i},p_{k}^{i}})$
24: end for
25: $T_{\textit{risk}}\leftarrow T_{\textit{risk}}-\{{\psi_{\max}}\}$
26: end while
27: $T_{a}\leftarrow T_{g}$
28: return $T_{a}$

The pseudo-code of the TLS algorithm is shown in Algorithm 3. This algorithm processes the trajectory dataset that is still at risk of privacy leakage after SAP processing. First, it initializes the dangerous background knowledge set $T_{\textit{risk}}$ (Line 1). For each attack sequence in $B^{s}$ , it establishes the trajectory set TD with the largest privacy leakage probability and finds in TD the trajectory set TC with a privacy leakage probability greater than $P_{b}$ . If TC is not empty, it adds this attack sequence to $T_{\textit{risk}}$ (Lines 2–8). While $T_{\textit{risk}}$ is not empty, it determines the attack sequence $\psi_{\max}$ with the largest background knowledge balance suppression coefficient, the set $C_{\textit{risk}}$ of trajectories in $T_{g}$ that contain $\psi_{\max}$ and have a privacy leakage probability greater than $P_{b}$ , and the location $p_{k}^{i}$ with the largest location balance suppression coefficient; it also initializes $D_{\textit{risk}}$ to be empty (Lines 9–13). It then processes the trajectories in $C_{\textit{risk}}$ : it adds the trajectory with the highest privacy preservation level to $D_{\textit{risk}}$ , removes this trajectory from $T_{g}$ , deletes the location $p_{k}^{i}$ from it, and adds it back to $T_{g}$ , before recalculating the trajectory set $C_{\textit{risk}}$ with leakage probability greater than $P_{b}$ (Lines 14–21). After some location points are suppressed, the privacy leakage probability of the current trajectory no longer exceeds $P_{b}$ , but the suppression may push the privacy leakage probability of other trajectories, which was originally below $P_{b}$ , above $P_{b}$ .

Running the TLS algorithm again on the whole dataset could be costly. Consequently, we propose the re-identify dangerous background knowledge (RDBK) algorithm, which regenerates the dangerous background knowledge set $T_{\textit{risk}}$ at minimum cost. Finally, TLS deletes $\psi_{\max}$ from $T_{\textit{risk}}$ and enters the next iteration (Lines 22–26).

The anonymous trajectories are shown in Table 4. It can be seen that $d_{1}$ and $f_{8}$ are deleted from trajectory $L_{2}$ , and $d_{3}$ and $c_{4}$ are deleted from trajectory $L_{6}$ .

Table 4

Trajectories after local suppression

Id	Level	Trajectory	Disease	Job
1	Low	$d_{3}\to c_{4}\to e_{5}\to f_{8}$	High Blood Sugar	Lawyer
2	Middle	$e_{2}\to d_{5}\to c_{6}$	Any Illness	Doctor
3	None	$c_{2}\to d_{6}\to f_{8}$	Flu	Student
4	High	$c_{3}\to d_{7}\to d_{8}$	Any Illness	*
5	Low	$d_{1}\to d_{5}\to e_{6}\to f_{8}$	Lung Infection	Student
6	High	$a_{7}$	Any Illness	*

The pseudo-code of the RDBK algorithm is shown in Algorithm 4. It first initializes the dangerous background knowledge set $T_{\textit{risk}}$ and the attacker’s background knowledge set $B^{s}$ (Lines 1–2). While the loop counter $j$ does not exceed the attacker’s maximum background knowledge length $\xi$ and $B^{s}$ is not empty, it calculates the privacy leakage probability of the trajectory under each attack sequence in $B^{s}$ . If the probability is greater than $P_{b}$ , the current attack sequence is still dangerous and is added to $T_{\textit{risk}}$ again (Lines 5–9). Then $B^{s}$ is replaced by all attack sequences of length $j+1$ that include the location $p_{k}^{i}$ (Lines 10–12). Finally, it returns $T_{\textit{risk}}$ (Line 13).

Algorithm 4: RDBK
Input:
$T_{g}$ : Generalized trajectory dataset
$L_{i}$ : Risk trajectory
$p_{k}^{i}$ : Location point
Output:
$T_{\textit{risk}}$ : a set of background knowledge on leakage risk
1: $T_{\textit{risk}}\leftarrow\emptyset$ , $B^{s}\leftarrow\emptyset$
2: $B^{s}\leftarrow\{{p_{k}^{i}}\}$
3: $j\leftarrow 1$
// $\xi$ is the maximum length of background knowledge
4: while $j\leqslant\xi$ and $B^{s}\neq\emptyset$ do
5: for each $\psi^{\textit{Adv}_{i}}\in B^{s}$ do
6: if $P(L_{i}|\psi^{\textit{Adv}_{i}})>P_{b}$ then
7: $T_{\textit{risk}}\leftarrow T_{\textit{risk}}\cup\{{\psi^{\textit{Adv}_{i}}}\}$
8: end if
9: end for
10: $B^{s}\leftarrow\{\psi^{\textit{Adv}_{m}}\mid p_{k}^{i}\in\psi^{\textit{Adv}_{m}},|{\psi^{\textit{Adv}_{m}}}|=j+1\}$
11: $j\leftarrow j+1$
12: end while
13: return $T_{\textit{risk}}$

4.5 Algorithm analysis

This section shows that the PTPPMGLS algorithm not only ensures that the anonymized dataset contains no trajectories at risk of privacy leakage but can also resist three types of attack (identity link, attribute link, and similarity attacks), and then analyzes the time complexity of the algorithm.

4.5.1 Security analysis

Theorem 1. There are no trajectories at risk of privacy leakage in the anonymized trajectory dataset.

Proof. Let $T$ be the original trajectory dataset; after sensitive attribute generalization and trajectory local suppression, an anonymous trajectory dataset $T_{a}$ is formed. For any background knowledge $\psi^{\textit{Adv}_{i}}$ held by the attacker, according to Eq. (8), there is no trajectory $L_{i}^{\ast}\in T_{a}$ with $P(L_{i}^{\ast}|\psi^{\textit{Adv}_{i}})>P_{b}$ . Consequently, there are no trajectories in $T_{a}$ at risk of privacy leakage.

Theorem 2. The anonymized trajectory dataset can resist identity link, attribute link, and similarity attacks.

Proof. The trajectory data anonymized using the PTPPMGLS algorithm can resist identity link, attribute link, and similarity attacks. The details are as follows:

Analyze the trajectory dataset, find the trajectories $L_{i}$ that contain low-frequency location points, and generalize their sensitive attributes. After the generalization of sensitive attributes, even if the attacker can uniquely determine the trajectory $L_{i}$ of a user in the published trajectory dataset $T_{a}$ based on their background knowledge, they cannot accurately learn the user’s disease and occupation; only the obfuscated sensitive information in the anonymous trajectory $L_{i}^{\ast}$ can be obtained. Therefore, it can resist an identity link attack;

Perform sensitive attribute generalization and local suppression on trajectories whose privacy leakage probabilities are greater than the $P_{b}$ . Even if the attacker knows that a user visited a specific place at a specific time and obtains the determined user trajectory, they cannot determine the sensitive attribute information of these users. Therefore, it can resist an attribute link attack;

Assign different privacy preservation levels to different users and generalize the sensitive attributes of users to different degrees in the anonymous trajectory dataset, with users with similar trajectory characteristics having different sensitive attribute values such as disease. Even if an attacker has the background knowledge that can infer similar users, it is impossible to determine sensitive attribute values such as disease. Therefore, it can resist similarity attacks.

For example, Table 1 shows the original trajectory dataset, and Table 4 shows the anonymized trajectory dataset. If an attacker has the location point $a_{7}$ as background knowledge, that is, they know that a user has visited the location $a_{7}$ , from Table 1, they can uniquely determine that the user’s trajectory is Record 6 and accurately determine the user’s disease and occupation; however, from Table 4, only the fuzzy information of the user can be obtained. Thus, it can resist the identity link attack. If the attacker knows that the user visited location $d$ at timestamps 1 and 5, they can determine that the user has SARS from Table 1, but it is only known that the user has a lung disease from Table 4. Thus, it can resist the attribute link attack. If the attacker has the location $f_{8}$ as background knowledge, from Table 1, they can infer with 75% confidence that the user has a lung infection. From Table 4, the user with the location point $f_{8}$ has different diseases. The attacker cannot obtain user information, so the PTPPMGLS algorithm can resist similarity attacks.

4.5.2 Time complexity analysis

The PTPPMGLS algorithm consists of three steps as follows:

Setting the privacy preservation level. The Apriori algorithm is used to calculate the correlation between attributes, and the privacy level is set based on the correlation. The time complexity of the Apriori algorithm is $O(|T|^{4})$ .

Generalization of sensitive attributes. The worst time complexity of the SAP algorithm is $O(\theta\ast|T|^{2})$ , where $\theta$ is the maximum generalization strength and is a constant. Thus, the time complexity of the SAP algorithm is $O(|T|^{2})$ .

Trajectory local suppression. Eliminate the location points based on the balance suppression coefficient. In the worst case, the time complexity of the RDBK algorithm is $O(n^{\xi}\ast|T|)$ , so the time complexity of the TLS algorithm is $O(n^{2\xi}\ast|T|^{2})$ .

5. Results and discussion: Experimental analysis

5.1 Experimental environment and datasets

The experiments were run on an AMD Ryzen 7 4800H processor at 2.9 GHz under Windows 10, and the algorithms were implemented in Visual Studio 2019. The experimental dataset was City80k [26], from which 10,000 trajectories were randomly selected. The format of the dataset is presented in Table 1.

5.2 Evaluation indicators

In this study, the sensitive attribute retention ratio and the trajectory information loss ratio were used to measure the utility of trajectory data, and the disclosure risk was used to judge the privacy preservation degree of trajectory data.

5.2.1 Information loss

Definition 17 (Sensitive attribute retention ratio): Given the anonymized trajectory dataset $T_{a}$ and the original trajectory dataset $T$ , we can compare the retention of sensitive attributes in the trajectory dataset before and after anonymization, denoting the sensitive attribute retention ratio as SR, expressed as follows:

$\displaystyle\textit{SR}=\frac{|{T_{a}^{sa}}|}{|{T^{sa}}|},$ (13)

where $|{T^{sa}}|$ is the number of sensitive attributes in the original trajectory dataset, and $|{T_{a}^{sa}}|$ is the number of sensitive attributes in the anonymous trajectory dataset.

Definition 18 (Trajectory information loss ratio): Given the anonymized trajectory dataset $T_{a}$ and the original trajectory dataset $T$ , we can compare the changes of location points in the trajectory dataset before and after anonymization, with the trajectory information loss ratio being recorded as TR and expressed as follows:

$\displaystyle\textit{TR}=\frac{|{T^{ls}}|-|{T_{a}^{ls}}|}{|{T^{ls}}|},$ (14)

where $|{T^{ls}}|$ is the number of locations in the original trajectory dataset, and $|{T_{a}^{ls}}|$ is the number of locations in the anonymous trajectory dataset.
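
The two ratios in Eqs (13) and (14) can be computed as in the following Python sketch. The trajectories are taken from Tables 3 and 4 (generalization does not change locations, so Table 3 is used as the state before suppression). Two points are assumptions for illustration: a sensitive attribute is counted as retained when it is published unchanged, and the original job values “Driver” and “Cook” are hypothetical.

```python
from typing import List


def retention_ratio(original: List[str], anonymized: List[str]) -> float:
    """Eq. (13), assuming a value is retained when it is published unchanged."""
    kept = sum(1 for before, after in zip(original, anonymized) if before == after)
    return kept / len(original)


def loss_ratio(original: List[List[str]], anonymized: List[List[str]]) -> float:
    """Eq. (14): share of location points removed by anonymization."""
    before = sum(len(trajectory) for trajectory in original)
    after = sum(len(trajectory) for trajectory in anonymized)
    return (before - after) / before


# Locations before suppression (Table 3) and after suppression (Table 4).
before_suppression = [
    ["d3", "c4", "e5", "f8"], ["d1", "e2", "d5", "c6", "f8"], ["c2", "d6", "f8"],
    ["c3", "d7", "d8"], ["d1", "d5", "e6", "f8"], ["d3", "c4", "a7"],
]
after_suppression = [
    ["d3", "c4", "e5", "f8"], ["e2", "d5", "c6"], ["c2", "d6", "f8"],
    ["c3", "d7", "d8"], ["d1", "d5", "e6", "f8"], ["a7"],
]
tr = loss_ratio(before_suppression, after_suppression)

# Job column: hypothetical original values versus the published values of Table 4.
jobs_original = ["Lawyer", "Doctor", "Student", "Driver", "Student", "Cook"]
jobs_published = ["Lawyer", "Doctor", "Student", "*", "Student", "*"]
sr = retention_ratio(jobs_original, jobs_published)
print(round(tr, 3), round(sr, 3))
```

For the example tables, 4 of the 22 location points are suppressed, so TR is approximately 0.182.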

5.2.2 Disclosure risk

Definition 19 (Disclosure risk): Given an anonymous trajectory dataset $T_{a}$ , the disclosure risk refers to the probability of trajectory privacy leakage under different lengths of background knowledge. The disclosure risk of the $i^{\text{th}}$ anonymous trajectory $L_{i}^{\ast}$ , denoted as $\textit{DR}_{i}$ , is expressed as follows:

$\displaystyle\textit{DR}_{i}=\frac{1}{|{B^{s}}|}\sum\limits_{\psi^{\textit{Adv}_{i}}\in B^{s}}P(T_{a}({L_{i}^{\ast}})|\psi^{\textit{Adv}_{i}}),$ (15)

where $|{B^{s}}|$ is the number of pieces of background knowledge in $B^{s}$ , and $P(T_{a}({L_{i}^{\ast}})|\psi^{\textit{Adv}_{i}})$ is the leakage risk of the trajectory $L_{i}^{\ast}$ after anonymization, which can be obtained from Eq. (8).
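
Eq. (15) and the dataset-level average reported in Section 5.3.3 amount to two nested means, as the short Python sketch below shows. The leakage probabilities are placeholders chosen for illustration (only the value 0.345 for $\{f_{8}\}$ comes from the worked example in Section 4.3), and the function names are hypothetical.

```python
from statistics import mean
from typing import Dict, FrozenSet, List

Knowledge = FrozenSet[str]


def disclosure_risk(leak_by_knowledge: Dict[Knowledge, float]) -> float:
    """Eq. (15): mean leakage probability of one trajectory over B^s."""
    return mean(list(leak_by_knowledge.values()))


def average_disclosure_risk(per_trajectory: List[Dict[Knowledge, float]]) -> float:
    """Mean of DR_i over all anonymized trajectories."""
    return mean([disclosure_risk(leaks) for leaks in per_trajectory])


# Placeholder leakage probabilities under two pieces of background knowledge.
f8 = frozenset({"f8"})
d6 = frozenset({"d6"})
leaks = [
    {f8: 0.345, d6: 0.5},   # first trajectory
    {f8: 0.25, d6: 0.0},    # second trajectory
]
dr_first = disclosure_risk(leaks[0])
dr_avg = average_disclosure_risk(leaks)
print(round(dr_first, 4), round(dr_avg, 4))
```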

5.3 Experimental results

5.3.1 Sensitive attribute retention ratio analysis

With the maximum length of the attacker’s background knowledge set to 2 and the maximum generalization boundary $\textit{MAX}^{c}=$ 0.66, the balance between sensitive attribute preservation and trajectory information privacy security can be verified by adjusting the privacy threshold $P_{b}$ (in the range [0.2, 0.5]) and the maximum generalization strength $\theta$ .

Figure 4 shows the sensitive attribute retention ratio under four privacy preservation levels. Figure 4a and b are the results of the sensitive attribute retention ratio showing changing values of $P_{b}$ under different values of $\theta$ . With increasing levels of privacy preservation, the sensitive attribute retention ratio of the trajectory dataset becomes smaller and smaller. When the trajectory data are not protected, the sensitive attribute retention ratio is 1; when a high level of privacy preservation is applied to all trajectory data, the sensitive attribute retention ratio is 0.1–0.3.

As $P_{b}$ increases, the sensitive attribute retention ratio also increases, with high preservation having the greatest impact on it. When $P_{b}$ is set to 0.5, the sensitive attribute retention ratio is only approximately 0.3. Comparing Fig. 4a and b, it can be seen that the changing maximum generalization strength has little effect on the sensitive attribute retention ratio.

Figure 4.

Sensitive attribute retention ratios under different privacy preservation levels.

Figure 5 shows a comparison between the PTPPMGLS and PPTD algorithms [26] in terms of sensitive attribute retention. With $\theta=$ 1 and $P_{b}$ in the range of 0.2–0.5, the sensitive attribute retention ratio of the PPTD algorithm is only 0.4–0.6, whereas that of the PTPPMGLS algorithm is 0.7–0.9. The PTPPMGLS algorithm therefore has a clear advantage in the sensitive attribute retention ratio: across the tested privacy thresholds, our method retains more sensitive attributes.

Figure 5.

Sensitive attribute retention ratio for different methods.

Figure 6.

Trajectory information loss ratios under different privacy preservation levels.

5.3.2 Trajectory information loss ratio analysis

Trajectory information loss is a valid measure of the utility of an anonymous trajectory dataset. Experiments were conducted under the four privacy preservation levels proposed in this study and compared with the PPTD and KCL-Local methods [29] in terms of trajectory information loss.

With the maximum background knowledge length of the attacker set to 2, Fig. 6 shows how the trajectory information loss ratio changes as the privacy threshold $P_{b}$ changes. In the cases of the middle and high preservation levels, the trajectory information loss ratio is close to 1 when $P_{b}$ is low. Compared with these two preservation levels, the low preservation level has a small trajectory information loss ratio when $P_{b}$ is low. When $P_{b}$ is close to 0.3, there is almost no loss of trajectory information. Figure 6a and b show the trajectory information loss ratio under different maximum generalization strengths. Compared with $\theta=$ 1, the trajectory information loss is smaller when $\theta=$ 2.

Since the KCL-Local algorithm adopts the $k$ -anonymity method to prevent identity link attacks, the PTPPMGLS algorithm cannot be directly compared with it. A variant of the PTPPMGLS algorithm (KCL-PTPPMGLS) and a variant of the PPTD method (KCL-PPTD) are therefore proposed. $C$ is equivalent to $P_{b}$ in the PTPPMGLS algorithm, and $L$ is equivalent to the maximum number of attackers. Figure 7 shows a comparison of the three methods in terms of trajectory information loss, with fixed $K=$ 30 and $L=$ 3. When $C$ is small, the trajectory information loss of the KCL-Local and KCL-PPTD algorithms is higher, whereas our method has lower trajectory information loss and better utility.

Figure 7.

Trajectory information loss ratio for different methods.

5.3.3 Disclosure risk analysis

We use disclosure risk to measure the privacy preservation ability of the trajectory dataset. The average disclosure risk of the entire trajectory dataset can be given by adjusting $P_{b}$ for experimental validation under the four privacy preservation levels proposed in this study.

Setting the attacker’s background knowledge to a maximum of 2, Fig. 8 shows how the disclosure risk changes as $P_{b}$ changes. As $P_{b}$ decreases, the average disclosure risk decreases gradually, due to more sensitive attributes being generalized and more location points being eliminated. With increasing $P_{b}$ , the average disclosure risks of the four privacy preservation levels are maintained at approximately 0.2, meeting the needs of privacy preservation. Meanwhile, it can be seen from Fig. 8a and b that the disclosure risk minimally changes for different maximum generalization strengths. Consequently, it can be concluded that the maximum generalization strength has almost no effect on the disclosure risk.

Figure 8.

Average disclosure risk under different privacy preservation levels.

Figure 9 compares the PTPPMGLS and PPTD algorithms in terms of average disclosure risk, with $\theta=$ 1 and $P_{b}$ in the range of 0.2–0.5. The PPTD algorithm achieves a lower average disclosure risk when $P_{b}$ is low, but as $P_{b}$ increases, the PTPPMGLS algorithm performs similarly to the PPTD algorithm.

Figure 9.

Average disclosure risk of different methods.

While ensuring data security, the PTPPMGLS method achieved better data utility and sensitive attribute retention, with a lower information loss ratio. It can therefore meet the needs of users and effectively balance security and usability.

6. Conclusions

In this study, a trajectory personalized privacy preservation method based on multi-sensitivity attribute generalization and local suppression was proposed to protect users from attacks when attackers have partial location information as background knowledge. First, the privacy preservation level was set for each trajectory based on the correlation of sensitive attributes. Then, sensitive attribute generalization was performed for trajectory data at risk of privacy disclosure. Finally, location point suppression was performed for trajectory data. The algorithm improved the degree of privacy preservation during trajectory publishing to a certain extent and could protect users’ privacy more effectively when the given threshold was high. However, the analysis of the correlation between sensitive attributes resulted in high time complexity and a lengthy running time, and only correlations between two sensitive attributes were considered. Additionally, the differential privacy preservation method currently under study does not depend on the background knowledge possessed by the attacker [30]. A quantitative analysis of the risk of privacy leakage still needs to be adopted in the method proposed in this study. In future work, we will conduct research on differential privacy techniques to improve the usability of the proposed method. In addition, more correlations between sensitive attributes will be established, so that the proposed method can be applied to more realistic scenarios.

Footnotes

Acknowledgments

The authors would like to thank the reviewers for their useful comments and suggestions for this paper. This work was supported by the National Natural Science Foundation of China (Grant Nos. 61702010, 61972439), the Anhui Provincial Natural Science Foundation of China (Grant Nos. 2208085MF164, 2108085MF214), the University Natural Science Research Program of Anhui Province (Grant No. KJ2021A0125), and the Key Program in the Youth Elite Support Plan in Universities of Anhui Province (Grant No. gxyqZD2020004).

Ethical statements

The authors declare that no conflict of interest exists in the submission of this manuscript, and that the manuscript has been approved by all co-authors for publication. None of the material in the paper has been published or is under consideration for publication elsewhere.

References

1. Chen, Ding and Lu, A decentralized trust management system for intelligent transportation environments, IEEE Transactions on Intelligent Transportation Systems 23(1) (2020), 558–571.

2. Parmar and Rao U.P., Towards privacy-preserving dummy generation in location-based services, Procedia Computer Science 171 (2020), 1323–1326.

3. Qiu, Squicciarini A.C., Pang, Wang and Wu, Location privacy preservation in vehicle-based spatial crowdsourcing via geo-indistinguishability, IEEE Transactions on Mobile Computing 1 (2020), 1–14.

4. Zhao and Jin, Protecting trajectory from semantic attack considering k-anonymity, l-diversity, and t-closeness, IEEE Transactions on Network and Service Management 16(1) (2019), 264–278.

5. Yang, Gao, Wang, Zheng and Guo, A semantic k-anonymity privacy preservation method for publishing sparse location data, in: Proceedings of the 7th IEEE International Conference on Advanced Cloud and Big Data, 2019, pp. 216–222.

6. Zhang, Mao, Choo K.K.R., Peng and Wang, A trajectory privacy-preserving scheme based on a dual-k mechanism for continuous location-based services, Information Sciences 527 (2020), 406–419.

7. Cao, Xiao, Xiong, Bai and Yoshikawa, Protecting spatiotemporal event privacy in continuous location-based services, IEEE Transactions on Knowledge and Data Engineering 33(8) (2019), 3141–3154.

8. Zhang, Qian, Ding and Shaham, Location privacy preservation based on continuous queries for location-based services, in: Proceedings of the IEEE Conference on Computer Communications Workshops, 2019, pp. 1–6.

9. Lei, Liu, Pei and Li, A privacy preservation scheme for fake trajectories based on spatiotemporal correlation in trajectory publishing, Journal of Communications 37(12) (2016), 156–164.

10. Dong and Pi, Novel privacy-preserving algorithm based on frequent path for trajectory data publishing, Knowledge-Based Systems 148 (2018), 55–65.

11. Wang and Li, Fake location selection algorithm based on location semantics and query probability, Journal of Communications 41(3) (2020), 53–61.

12. Gruteser and Liu, Protecting privacy in continuous location-tracking applications, IEEE Security and Privacy 2(2) (2004), 28–34.

13. Mohammed, Fung B.C.M. and Debbabi, Walking in the crowd: Anonymizing trajectory data for pattern analysis, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009, pp. 1441–1444.

14. Chen, Fung B.C.M., Mohammed, Desai B.C. and Wang, Privacy-preserving trajectory data publishing by local suppression, Information Sciences 231 (2013), 83–97.

15. Terrovitis, Poulis, Mamoulis and Skiadopoulos, Local suppression and splitting techniques for privacy preserving publication of trajectories, IEEE Transactions on Knowledge and Data Engineering 29(7) (2017), 1466–1479.

16. Zhao, Zhang and Ma, Trajectory privacy preservation method based on trajectory frequency suppression, Chinese Journal of Computers 37(10) (2014), 2096–2106.

17. Wang, Luo, Liu and Chen, Trajectory privacy-preserving method based on information entropy suppression, Journal of Computer Applications 38(11) (2018), 3252–3257.

18. Chen, Lin and Luo, A trajectory privacy-preserving method based on single point gain, Acta Electronica Sinica 48(1) (2020), 143–152.

19. Sweeney, K-anonymity: A model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10(5) (2002), 557–570.

20. Abul, Bonchi and Nanni, Never walk alone: Uncertainty for anonymity in moving objects databases, in: Proceedings of the 24th IEEE International Conference on Data Engineering, 2008, pp. 376–385.

21. Gurung, Lin, Jiang, Hurson and Zhang, Traffic information publication with privacy preservation, ACM Transactions on Intelligent Systems and Technology 5(3) (2014), 1–26.

22. Jia, Qin, Chen and Ma, Trajectory anonymity based on quadratic anonymity, in: Proceedings of the 3rd IEEE International Conference on Electronic Information Technology and Computer Engineering, 2019, pp. 485–492.

23. Machanavajjhala, Kifer, Gehrke and Venkitasubramaniam, L-diversity: Privacy beyond k-anonymity, in: Proceedings of the 22nd ACM International Conference on Data Engineering, 2006, pp. 3–8.

24. Li, Li and Venkatasubramanian, T-closeness: Privacy beyond k-anonymity and l-diversity, in: Proceedings of the 23rd IEEE International Conference on Data Engineering, 2007, pp. 106–115.

25. Yao, Chen and Wu, Sensitive attribute privacy preservation of trajectory data publishing based on l-diversity, Distributed and Parallel Databases 39(3) (2021), 785–811.

26. Komishani E.G., Abadi and Deldar, PPTD: Preserving personalized privacy in trajectory data publishing by sensitive attribute generalization and trajectory local suppression, Knowledge-Based Systems 94 (2016), 43–59.

27. Jia and Chen, (l, m, d)-Anonymity: A resisting similarity attack model for multiple sensitive attributes, in: Proceedings of the 2nd IEEE Conference on Information Technology, Networking, Electronic and Automation Control, 2017, pp. 756–760.

28. Cong, Research on data association rules mining method based on improved Apriori algorithm, in: Proceedings of the 1st IEEE International Conference on Big Data, Artificial Intelligence and Software Engineering, 2020, pp. 373–376.

29. Chen, Fung B.C.M., Mohammed, Desai B.C. and Wang, Privacy-preserving trajectory data publishing by local suppression, Information Sciences 231 (2013), 83–97.

30. Yang, Wang, Song and Ma, Local trajectory privacy protection in 5G enabled industrial intelligent logistics, IEEE Transactions on Industrial Informatics 18(4) (2021), 2868–2876.
