Probabilistic optimal projection partition KD-Tree k -anonymity for data publishing privacy protection

Abstract

Data must often be released to decision makers and researchers, but because they contain sensitive personal information, privacy protection should be applied first. The $k$-anonymity algorithm is an important privacy protection algorithm, and partitioning is one of its key methods. To reduce the computational complexity and improve the speed of existing privacy-preserving algorithms for high-dimensional data publishing, a probabilistic optimal projection partition $k$-dimensional (KD)-tree $k$-anonymity algorithm is proposed. First, some attribute dimensions are probabilistically selected from the global domain. Then, for these dimensions, the partition coefficient is calculated and the optimal partition point is determined. Furthermore, an improved KD-tree structure is introduced in which a node is a collection rather than a data point. Each node of the proposed KD tree is divided into left and right child nodes by the hyperplane that passes through the partition point and is perpendicular to the optimal dimension. The proposed algorithm is validated by a theoretical analysis and comparison experiments. The results show that the proposed algorithm reduces the average generalization range by 11% to 22% compared to traditional $k$-anonymity, which yields better partitions and better dataset availability. Moreover, the runtime is reduced by 8% to 32% compared to globally optimal projection partitioning $k$-anonymity.

Keywords

Data publishing, privacy protection, $k$-anonymity, KD-tree, probabilistic partitioning

1. Introduction

The large amounts of data generated every day by different organizations for various purposes have catalyzed research opportunities related to data science [1]. The sharing of data between organizations has also been found to be beneficial for business growth [2]. However, publishing raw data may raise security concerns among the users or actors who provide the sensitive information contained in the raw data [1]. Privacy-preserving data mining is an active area in data mining [2], and topics relating to data security have therefore received much attention [3]. Moreover, privacy-preserving data publishing has received a considerable amount of attention in research communities, and the privacy-preserving publishing of microdata has been studied extensively in recent years [4]. The $k$-anonymization algorithm is a promising privacy protection mechanism in data publishing that generalizes the identifier attributes of every record so that the record cannot be distinguished from at least $k-1$ others in the dataset [5]. In other words, $k$-anonymity defines a clustering with a minimum of $k$ tuples in each group. Many anonymization approaches have been proposed for different data publishing scenarios [6], such as data within a distributed environment or the Internet of Things [7], incomplete microdata [8], and data from multiple data sets [9].

Recently, data utility has become a significant performance measure for data anonymization. Moreover, efficiency is also very important for real-time processing. Therefore, the main purpose of this paper is to reduce the runtime of the algorithm to improve its efficiency. Another aim is to reduce the range of data generalization to improve data quality and maximize data availability after anonymization. The results of experiments presented in this paper also demonstrate the ability of the proposed algorithm to protect data privacy in large-scale high-dimensional data mining because of its high efficiency, sufficient security, and high data utility.

The main contributions of this paper are as follows. i) Considering global and local optimization, we propose a probabilistic search for finding the optimal dimensions from a probabilistic selection of some dimensions instead of all dimensions. Treating a locally optimal dimension as the globally optimal dimension and then looking for the optimal partition point in that dimension can greatly speed up the algorithm. ii) A dual-interval method is proposed for finding the optimal partition point. By increasing the scope of the search, it can determine a more reasonable division, improve the data quality, and hence compensate for the probability that a non-globally optimal dimension has been chosen, which increases the final data quality and availability. iii) A newly designed KD tree is used to store the partitioned data space. In contrast to traditional KD tree nodes, which are data points, all the nodes of the KD tree in this paper are sets, and non-leaf nodes correspond to anonymous sets that require division. A leaf node is a final $k$ -anonymous collection. iv) A $k$ -anonymous partitioning algorithm based on an extended KD tree for probabilistic optimization is proposed. In contrast to traditional $k$ -anonymous algorithms, the data quality is improved by sacrificing speed. This algorithm takes into account both the availability of the data after anonymization and the speed of the algorithm. The correctness of the algorithm is proved theoretically and its validity is verified experimentally.

This paper is structured as follows. Section 2 introduces related work. Section 3 presents the basic issues related to the current research. Section 4 introduces the basic concepts used in this article. Section 5 presents the $k$ -anonymous privacy protection algorithm that uses a KD tree based on a probabilistic optimal partition. It also gives a theoretical proof of the algorithm as well as a time and space complexity analysis. Section 6 describes the experiments and results, and Section 7 concludes the paper.

2. Related work

Sweeney [10, 11, 12] first applied the $k$ -anonymity algorithm to privacy protection. Since then, numerous anonymity models such as l-diversity [13], t-closeness [14, 15], and ( $\alpha$ , $k$ )-anonymity have been utilized to achieve $k$ -anonymity [16]. However, generalization and suppression may lead to substantial information loss and decrease data utility. Recent work has focused on proposing different anonymity algorithms for various data publishing scenarios to simultaneously satisfy privacy requirements and maintain data utility [16].

Gong et al. assumed that one individual has multiple records and proposed the 1:M-generalization algorithm, which addresses disclosure risks in 1:M data publishing, and is based on their proposed (k, l)-diversity model [17]. Ni et al. devised an (s, $\varepsilon$ )-anonymity location privacy model by setting parameters s and $\varepsilon$ (instead of k), which are the minimum inferred region (location protection strength) and candidate answer region (the scale of the intermediate results), respectively [18]. To reduce the correlation loss and improve data utility, Zhu et al. generated a refining partition and anonymization based on an optimization model [19]. Matatov et al. used a genetic algorithm to search for optimal feature set partitioning and proposed a decomposition algorithm for $k$ -anonymity that partitions the original dataset into several projections that each adhere to $k$ -anonymity [20]. Kumar and Minz proposed an optimal feature set partitioning approach for high-dimensional data classification and maximized the performance of the classifier using ensemble learning [21]. Sarrafi Aghdam and Sonehara proposed a bottom up greedy algorithm based on the similarity-based clustering model for $k$ -anonymization that measures similarity and calculates distances between tuples of numerical and categorical attributes without hierarchical taxonomies. It substantially improves data utility and reduces information loss significantly [22]. Gkoulalas-Divanis et al. presented a survey of algorithms for publishing structured patient data in a privacy-preserving way [23].

Data privacy and data utility are naturally opposing concepts [24]. The utility of data is essential for researchers and analysts during data publishing and data mining, whereas discussions about privacy protection in these fields focus on strong guarantees that avoid information disclosure and protect individuals’ privacy.

Partitioning is a common method in $k$ -anonymous mechanisms. Of them, 2-way division divides a large temporary anonymous component into two smaller anonymous groups. The Mondrian algorithm [25] was the first 2-way partition method, and it uses a balanced partitioning strategy. In essence, it is a locally greedy strategy, and at each iteration, the dataset is divided into two smaller groups with the same capacity, which leads to local but not global fairness. Wu et al. [26] proposed the integer partition $k$ -anonymity algorithm, which uses a rounding function. The multidimensional anonymous area, which is a temporary anonymous group, is divided into two smaller areas of anonymity. This not only avoids the reduction in potential anonymous groups and increase in the number of anonymous data that occur in the conventional 2-way partitioning method, but also improves the availability of published datasets.

Wang et al. [27] proposed a $k$ -anonymity algorithm based on projection region density partitioning that optimizes the rounded partitioning function and attribute dimension selection. By improving the projection area density, temporary anonymous groups are reasonably divided. After the division, little information is lost and data availability is high. It was also proved theoretically that the anonymous group size generated by the algorithm is less than $2k$ in the worst case, and the average size of the anonymous group will approach $k$ if the released data table is large enough. In the algorithm, the projective dimension is selected randomly and the partitioning is not optimal. Therefore, the dimension selection is locally optimal and the global optimum is not available.

To address these problems, Wang et al. [28] adopted an approach that finds the optimal projection dimension before finding the division points, which gives a global perspective to the selection of the division dimensions and partition points. The algorithm studies the choice of dimensions, calculates the dispersion of all dimensions, finds the optimal dimension, and achieves the optimal division in the global domain. Moreover, the traditional k-dimensional (KD) tree structure [29, 30, 31] was redesigned for data storage, and an optimal projection partitioning method based on it was proposed. The quality of the data after anonymization can be increased by 10% to 22%. However, searching high-dimensional data is very time-consuming, especially in large datasets, and the generalization process is slow. Therefore, it is still necessary to improve the efficiency of the algorithm while ensuring the quality and availability of data. Hence, in this paper, a probabilistic locally optimal dimension is used to replace the globally optimal dimension to improve time efficiency. To improve the quality of the data after anonymization, a dual-interval method is adopted to extend the search range of the optimal partition point.

3. Problem description

3.1 Data attributes

The data in a data table for publication has the following three main attributes.

1) Quasi identifier (QI) attributes: An attacker cannot directly identify a set of data through the QI attributes, but can indirectly determine a tuple in the data table by means of external links or related background knowledge.
2) Privacy attributes (PAs): These are the data attributes that should be protected by $k$-anonymization because they contain private information.
3) Irrelevant attributes: These attributes are any attributes other than QI attributes and PAs in the data table (e.g., identifying attributes). They can be removed during anonymization.

3.2 Problem analysis

Assume that the data table to be published is $D(m+n)$, described as $D(Q_{1},Q_{2},\ldots,Q_{m},P_{1},P_{2},\ldots,P_{n})$, where $m$ is the number of QI attributes and $n$ is the number of PAs. Moreover, $\prod_{\text{QI}}(D)$ is the projection of $D(m+n)$ onto the QI attribute set. If $D(m+n)$ is mapped to a multidimensional space, each record corresponds to a point in that space. Therefore, the $k$-generalization problem can be transformed into a set partition in multidimensional space, where each generalization set is equivalent to a subregion. Examples are given as follows.

Table 1
Original data set

A1	A2	A3	A4	A5	A6	A7
Tue	5	24	6	19	6	1
Fam	6	19	5	13	8	5
Cmce	5	16	11	15	16	7
Fame	4	9	15	14	21	8
Lee	3	7	6	7	9	11
Ami	9	18	9	20	13	13
Lucy	8	14	28	15	22	16
Blues	10	9	9	26	10	20
Aima	9	7	10	9	19	24
Joye	12	15	5	25	4	27
Judd	19	23	35	26	25	30
Sharpe	20	17	42	15	43	32
Mogan	21	21	42	25	32	37
Swan	24	13	35	13	28	39
Sweet	25	10	38	20	26	42
Luind	21	6	42	13	41	44
Lydia	28	27	37	8	30	50

Suppose there are 17 records in Table 1, and each record has seven attributes, where attribute A1 is a PA, and A2 to A7 are QI attributes. If the data in Table 1 are projected onto three selected dimensions, the projection points are distributed in three-dimensional space. In Fig. 1, these data are projected onto dimensions A2, A3, and A4 as well as onto A5, A6, and A7. It can be seen from Fig. 1 that the distribution density of projection points is different.

Figure 1.

Partition results for Table 1 using different dimensions.

In Fig. 1, the aggregation and degree of dispersion of the blue and green dots differ. The choice of dimensions in the partition influences the quality of data after anonymity. The more reasonable the partition is, the higher the quality of data will be after anonymity. Therefore, how to choose the dimensions for the partitions is a question worth exploring.

Further analysis of Table 1 shows that if we want to choose the best division dimension AX, we need to consider all the QI attribute dimensions from A2 to A7. When there are more attributes, it is very costly to determine the dispersion of all attributes. In fact, if we only examine the three dimensions A2, A3, and A4, we can still determine the optimal dimension AX, so a local judgment can achieve the same result as global optimization. In the real world, when people analyze problems, they usually make judgments and decisions by examining a part of the problem. Therefore, in dimension selection, we can select some attributes, analyze their degree of dispersion, and treat the best of them as the “globally optimal” dimension. This probabilistic search for optimal dimensions fits the way people think. Therefore, in this paper, we propose to select some attribute dimensions according to a certain probability to improve the data quality and efficiency of the algorithm.
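The local-versus-global argument can be checked directly on Table 1. The sketch below is only illustrative: it assumes that dispersion is measured by the population variance (the discrete degree defined in Eq. (4)), and the names `TABLE1`, `dispersion`, and `most_dispersed` are hypothetical helpers that do not come from the paper.

```python
# QI columns of Table 1 (A1, the name column, is the privacy attribute).
TABLE1 = {
    "A2": [5, 6, 5, 4, 3, 9, 8, 10, 9, 12, 19, 20, 21, 24, 25, 21, 28],
    "A3": [24, 19, 16, 9, 7, 18, 14, 9, 7, 15, 23, 17, 21, 13, 10, 6, 27],
    "A4": [6, 5, 11, 15, 6, 9, 28, 9, 10, 5, 35, 42, 42, 35, 38, 42, 37],
    "A5": [19, 13, 15, 14, 7, 20, 15, 26, 9, 25, 26, 15, 25, 13, 20, 13, 8],
    "A6": [6, 8, 16, 21, 9, 13, 22, 10, 19, 4, 25, 43, 32, 28, 26, 41, 30],
    "A7": [1, 5, 7, 8, 11, 13, 16, 20, 24, 27, 30, 32, 37, 39, 42, 44, 50],
}


def dispersion(values):
    """Population variance of one attribute column (assumed dispersion measure)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def most_dispersed(table, candidates):
    """Return the candidate attribute whose column has the largest variance."""
    return max(candidates, key=lambda name: dispersion(table[name]))


# Global search over all QI attributes versus a local search over three of them.
global_best = most_dispersed(TABLE1, list(TABLE1))
local_best = most_dispersed(TABLE1, ["A2", "A3", "A4"])
print(global_best, local_best)
```

Under these assumptions, the search restricted to A2, A3, and A4 returns the same attribute as the search over A2 to A7, while evaluating half as many columns.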

In high-dimensional space partitioning, not only will the selection of projective dimensions affect the division results, but the selection of partition points will also affect the projection results. Based on previous work, this paper proposes a method for finding an optimal dimension by probabilistic selection and searches for the dividing point using a double interval to divide the high-dimensional space.

Figure 2.

Partition results for Table 1 obtained using different methods.

An example is used to illustrate the selection of the partition point in this paper. Suppose that T $=$ {t1(5,24), t2(5,19), t3(5,16), t4(4,6), t5(3,5), t6(9,18), t7(6.5,12), t8(11,9), t9(10,7), t10(12,15), t11(18,23), t12(19.5,17), t13(21,21), t14(24,13), t15(25,10), t16(21,6), t17(24,20)}. First, the data in T are projected onto the plane spanned by its two attributes. To divide the plane region, the proposed probabilistic optimal projection partitioning, equilibrium 2-way partitioning [25], rounded partitioning [26], and optimal projection partitioning [28] are compared. The result is shown in Fig. 2. The detailed steps of these methods are shown in Fig. 3, where the rectangular dashed frame represents the area where the potential partition points are located.

Figure 3.

Partition discrimination of Table 1 using different methods.

The rounded partition is the optimization of the traditional equilibrium 2-way partition. It obtains more reasonable divisions and the number of anonymous groups is lower. The rounded partition and equilibrium 2-way partition are both directly located at a partition point and have no search interval. Optimal projection partitioning searches for the optimal partition points from a single interval, and the proposed partitioning uses two intervals to find the optimal partition point.

4. Background knowledge

4.1 Related definitions

The following definitions are used in this paper.

1) Temporary equivalent cluster Tec. The anonymous dataset can be considered as an equivalent cluster, and each division splits a large temporary equivalent cluster into two smaller temporary equivalent clusters.
2) Equivalent cluster center Ecc. Suppose that an anonymous equivalent cluster contains $n$ data points, and each data point has $m$ attributes. Then, the cluster is denoted as

$\displaystyle\left[\begin{array}{ccc}a_{11}&\cdots&a_{1m}\\ \vdots&\ddots&\vdots\\ a_{n1}&\cdots&a_{nm}\\ \end{array}\right],$

and the mean of any attribute $j$, denoted as $\textit{EQI}_{j}$, is calculated as shown in Eq. (1).

$\displaystyle\textit{EQI}_{j}=\frac{1}{n}\sum_{i=1}^{n}a_{ij}\quad(j=1,2,\ldots,m)$ (1)

Here, $a_{ij}$ is the value of the $j$th attribute of the $i$th record. The means of all $m$ attributes are $\textit{EQI}_{1},\textit{EQI}_{2},\ldots,\textit{EQI}_{m}$, which correspond to the $m$-dimensional data point $P(\textit{EQI}_{1},\textit{EQI}_{2},\ldots,\textit{EQI}_{m})$, the Ecc.
3) Equivalence set range Esr. Each item of data in the equivalent cluster contains $m$ identifier attributes. If a data item is mapped to $m$-dimensional space, it is represented as a point. The region in $m$-dimensional space that covers all data points in the equivalent cluster is called the anonymous equivalent cluster domain, whose range is Esr, as shown in Eq. (2).

$\displaystyle\textit{Esr}=\prod_{i=1}^{m}\left(\max(a_{1i},a_{2i},\ldots,a_{ni})-\min(a_{1i},a_{2i},\ldots,a_{ni})\right)$ (2)

Smaller values of Esr indicate less data information loss.
4) Average data range Adr. This range represents the average size of the surrounding area of each data point. Suppose anonymous equivalent cluster $X$ has $a$ data points. Then, its average data range is calculated as shown in Eq. (3).

$\displaystyle\textit{Adr}_{X}=\frac{\textit{Esr}_{X}}{a}$ (3)

Here, $\textit{Esr}_{X}$ is the equivalence set range of cluster $X$. Smaller values of Adr indicate less data information loss.
5) $w$-class node. If a node in the KD tree contains $w$ data points in the equivalent cluster, the tree node is called a $w$-class node.
6) Discrete degree. The discrete degree of the projection of the data onto the $j$th attribute is $\textit{Agree}_{j}$. Suppose that a node at level $H$ contains $h$ data points and each data point has $m$ attributes, denoted as

$\displaystyle\left[\begin{array}{ccc}a_{11}&\cdots&a_{1m}\\ \vdots&\ddots&\vdots\\ a_{h1}&\cdots&a_{hm}\\ \end{array}\right].$

The formula for $\textit{Agree}_{j}$ is shown in Eq. (4).

$\displaystyle\textit{Agree}_{j}=\frac{1}{h}\sum_{i=1}^{h}(a_{ij}-\textit{EQI}_{j})^{2}$ (4)

Larger values of $\textit{Agree}_{j}$ indicate more scattered projection points on the $j$th attribute of the data. When a highly dispersed dimension is used, the partition does not destroy the aggregation of the original data, the created KD tree is more balanced, the data distribution is more uniform, and the partition is more reasonable.
7) Partition metric Pm. This metric measures the quality of a partition. A temporary anonymous equivalent cluster $X$ is divided into two smaller temporary anonymous equivalent clusters $X1$ and $X2$, and Pm is calculated using Eq. (5).

$\displaystyle\textit{Pm}=\frac{\textit{Adr}_{X1}+\textit{Adr}_{X2}}{2\,\textit{Adr}_{X}}$ (5)

Here, $\textit{Adr}_{X}$, $\textit{Adr}_{X1}$, and $\textit{Adr}_{X2}$ are the average data ranges of temporary anonymous equivalent clusters $X$, $X1$, and $X2$, respectively. Smaller values of $\textit{Adr}_{X1}$ and $\textit{Adr}_{X2}$ give a smaller Pm, which indicates less information loss and therefore a better partition.
8) Data set range Dsr. Each record $a$ in dataset $S$ is generalized into $a^{\ast}$ by the $k$-anonymity algorithm. Each attribute is generalized as an interval from $\beta$ to $\alpha$, and the generalization range is denoted as Dsr.

Suppose record $a$ is $a(x_{1},x_{2},\ldots,x_{m})$ and $a^{\ast}$ is the generalized value of $a$, written as $([x_{1b},x_{1t}],[x_{2b},x_{2t}],\ldots,[x_{mb},x_{mt}])$. Here, $\alpha$, $\beta$, and Dsr are defined in Eqs (6)–(8), respectively.

$\displaystyle\alpha=\textit{Sup}(a^{\ast}([x_{1b},x_{1t}],[x_{2b},x_{2t}],\ldots,[x_{mb},x_{mt}]))=(x_{1t},x_{2t},\ldots,x_{mt})$ (6)

$\displaystyle\beta=\textit{Inf}(a^{\ast}([x_{1b},x_{1t}],[x_{2b},x_{2t}],\ldots,[x_{mb},x_{mt}]))=(x_{1b},x_{2b},\ldots,x_{mb})$ (7)

$\displaystyle\textit{Dsr}=\alpha-\beta=((x_{1t}-x_{1b}),(x_{2t}-x_{2b}),\ldots,(x_{mt}-x_{mb}))$ (8)
9) Average dataset range Adsr. Suppose there are $n$ points in the dataset. After generalization, the average generalization range in each dimension is Adsr, as shown in Eq. (9).

$\displaystyle\textit{Adsr}=\left(\frac{\sum_{i=1}^{n}\textit{Dsr}_{i1}}{n},\frac{\sum_{i=1}^{n}\textit{Dsr}_{i2}}{n},\ldots,\frac{\sum_{i=1}^{n}\textit{Dsr}_{im}}{n}\right)$ (9)
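To make the definitions above concrete, the following sketch evaluates Eqs (1)–(5) and (9) on a small hand-made cluster. It is an illustration rather than the authors' implementation: the function names and the four sample points are assumptions, a cluster is represented as a list of equal-length tuples, and a generalized record is represented as a list of (low, high) intervals.

```python
from math import prod


def eqi(cluster):
    """Eq. (1): per-attribute means, i.e. the equivalent cluster center."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]


def esr(cluster):
    """Eq. (2): product of the per-attribute ranges of the cluster."""
    return prod(max(col) - min(col) for col in zip(*cluster))


def adr(cluster):
    """Eq. (3): equivalence set range divided by the number of points."""
    return esr(cluster) / len(cluster)


def agree(cluster, j):
    """Eq. (4): discrete degree (population variance) of attribute j."""
    column = [row[j] for row in cluster]
    mean = sum(column) / len(column)
    return sum((v - mean) ** 2 for v in column) / len(column)


def pm(cluster, left, right):
    """Eq. (5): partition metric; assumes adr(cluster) is nonzero."""
    return (adr(left) + adr(right)) / (2 * adr(cluster))


def adsr(generalized):
    """Eqs (8)-(9): mean interval width per attribute over all records."""
    n = len(generalized)
    widths = [[high - low for low, high in record] for record in generalized]
    return [sum(col) / n for col in zip(*widths)]


# Hypothetical cluster X and one possible split into X1 and X2.
X = [(0, 0), (2, 2), (8, 8), (10, 10)]
X1, X2 = X[:2], X[2:]
print(eqi(X), esr(X), adr(X), agree(X, 0), pm(X, X1, X2))
```

For this sample, splitting X between (2, 2) and (8, 8) shrinks the average data range of each half well below that of X, so Pm is far smaller than 1, which is the situation the definition of Pm describes as a good partition.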

4.2 Proposed KD tree

The KD tree is often used to search for key data in range and nearest neighbor searches. It is a data structure that divides data points in high-dimensional space.

In this study, the traditional KD tree is improved. In contrast to traditional KD tree nodes, which are points, each node of the proposed KD tree is a set of the $m$-dimensional data points that are to be divided. When we partition the high-dimensional space, we try to place similar points in the same subtree of the KD tree. By calculating the variance of the data points in each dimension, we can choose the dimension with the greatest variance, that is, the greatest data fluctuation. A higher degree of dispersion increases the probability that the data points do not belong to the same region along that dimension, which makes the dimension more suitable for division. Every non-leaf node can be regarded as a hyperplane: the left subtree of the node corresponds to the left side of the hyperplane, and the right subtree corresponds to the right side.

Figure 4.

Structure of the proposed KD tree.

As shown in Fig. 4, the root node of the tree is S0, which is a collection of all data points. $S0=S1\cup S2$ . Similarly, $S1=S3\cup S4$ and $S2=S5\cup S6$ . The final leaf node is the set that meets the required level of $k$ -anonymity. Each leaf node corresponds to an anonymous cluster and all leaf nodes are collections of anonymous clusters.
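A minimal sketch of such a set-valued node is given below. The class name `SetNode`, its fields, and the sample points are hypothetical; the only property taken from the text is that every node holds a set of points and that a split distributes exactly those points between the two children.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, ...]


@dataclass
class SetNode:
    """KD-tree node that stores a whole set of points instead of one point."""
    points: List[Point]
    split_dim: Optional[int] = None
    split_value: Optional[float] = None
    left: Optional["SetNode"] = None
    right: Optional["SetNode"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

    def split(self, dim: int, index: int) -> None:
        """Cut the set with a hyperplane perpendicular to `dim`.

        The first `index` points (ordered along `dim`) go to the left child
        and the remaining points go to the right child.
        """
        ordered = sorted(self.points, key=lambda p: p[dim])
        self.split_dim = dim
        self.split_value = ordered[index][dim]
        self.left = SetNode(ordered[:index])
        self.right = SetNode(ordered[index:])

    def leaves(self) -> List["SetNode"]:
        """Collect the leaf sets, which play the role of anonymous clusters."""
        if self.is_leaf():
            return [self]
        return self.left.leaves() + self.right.leaves()


# S0 is split into S1 and S2, and S1 is split again, as in Fig. 4.
root = SetNode([(1, 9), (2, 3), (5, 4), (7, 8), (8, 1), (9, 6)])
root.split(dim=0, index=3)
root.left.split(dim=1, index=1)
print([leaf.points for leaf in root.leaves()])
```

Because each child receives a slice of the parent's ordered points, the union of the children always equals the parent set, mirroring $S0=S1\cup S2$.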

5. Probabilistic optimal projection partition KD-tree k-anonymity algorithm

This paper presents a probabilistic optimal projection partition $k$-anonymity algorithm that uses the improved KD tree data structure. The algorithm first selects a subset of attributes from all dimensions using a probabilistic method and calculates the dispersion degree of each selected attribute. The dimension with the highest level of dispersion is chosen as the optimal dimension. Then, using the partition coefficient pm, the optimal partition point is determined.

5.1 Algorithm

The main steps of the algorithm are summarized as follows.

Step 1: The original data are read into memory and saved in a two-dimensional array. The root of the KD tree and the partitioned node queue are created. The root node is added to the queue. Each node in the queue is inspected to determine whether it satisfies the $k$-anonymity condition. If the node does not satisfy this condition, it needs to be divided; otherwise, it is removed from the queue.
Step 2: When the partition queue is not empty, the head node of the queue is fetched. Suppose the head node is an $M$-class node, where $\alpha$ and $\beta$ satisfy $\alpha\cdot k+\beta=M$ (see Algorithm 4). We traverse each selected attribute, calculating its dispersion, and take the dimension with the maximum dispersion dmax as the optimal dimension. We find the upper limit $p_{max}=M-\lfloor\alpha/2\rfloor\cdot k$ and lower limit $p_{min}=\lfloor\alpha/2\rfloor\cdot k$ and traverse each potential partition point in the region [$p_{min}$, $p_{max}$]. The partition coefficient of each potential partition point is calculated and the optimal partition point $P$ is obtained.
Step 3: At point $P$, the current node is divided into two smaller anonymous clusters nl and nr, which form the left and right subtrees, respectively. Nodes nl and nr are added to the queue and divided in the same way until the queue is empty, at which point all leaf nodes satisfy the $k$-anonymity condition and the KD tree is completely constructed.
Step 4: Each leaf node is traversed. We obtain the equivalent cluster in each leaf node and anonymize each cluster. An anonymous table $T^{*}(m+n)$ is generated.
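Step 4 can be illustrated with a short sketch. It assumes that anonymizing a cluster means replacing every QI value by the [min, max] interval of that attribute within the cluster, in line with the definition of Dsr; the function names and the two sample clusters (taken from the first points of the example set T) are illustrative assumptions.

```python
def generalize_cluster(cluster):
    """Replace each record by the per-attribute (min, max) bounds of its cluster."""
    bounds = [(min(col), max(col)) for col in zip(*cluster)]
    return [list(bounds) for _ in cluster]


def anonymize(leaf_clusters):
    """Build the anonymous table by generalizing every leaf cluster in turn."""
    table = []
    for cluster in leaf_clusters:
        table.extend(generalize_cluster(cluster))
    return table


def dsr(record):
    """Eq. (8): width of every generalized interval of one record."""
    return [high - low for low, high in record]


# Two hypothetical leaf clusters, each holding k = 3 records.
leaves = [
    [(5, 24), (5, 19), (5, 16)],
    [(4, 6), (3, 5), (6.5, 12)],
]
anonymous_table = anonymize(leaves)
for record in anonymous_table:
    print(record, dsr(record))
```

All records of one leaf receive identical intervals, so each record is indistinguishable from the other members of its cluster on the QI attributes.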

Figure 5.
Links between the four proposed algorithms.

5.2 Algorithm pseudocode

Algorithm 1 lists the pseudocode of the proposed algorithm described in the previous section. Algorithms 2 and 3 list the pseudocode for the functions used by Algorithm 1 to find the optimal projection dimension. Moreover, Algorithm 4 lists the pseudocode for the function in Algorithm 1 that finds the optimal partition point on the projection. The links between the four proposed algorithms are shown in Fig. 5.

Algorithm 1 Probabilistic optimal projection partition KD-tree $k$-anonymity
Input: Dataset to be generalized $T(m+n)$, $k$;
Output: Generalized dataset $T^{*}(m+n)$;
1. $\textit{originalData}[m][n]\leftarrow T(m+n)$; $\textit{root}=\Phi$;
2. Init(root);
3. Create_KD_Tree(root);
4. Create nodeQueue;
5. nodeQueue.push(root);
6. WHILE (nodeQueue is not empty)
7. {
8.   node $=$ nodeQueue.pop();
9.   IF (node.count $>$ 2k)
10.   {
11.     GetProbabilisticOptimalDimension(node);
12.     int tmpPivot $=$ GetProbabilisticPivot(node);
13.     Divide(node, tmpPivot);
14.     node $\to$ leftChild $\to$ parent $=$ node;
15.     node $\to$ rightChild $\to$ parent $=$ node;
16.     nodeQueue.push(node $\to$ leftChild);
17.     nodeQueue.push(node $\to$ rightChild);
18.   }
19. }
20. Traverse leaf nodes;
21. Get anonymous groups;
22. Get anonymous table $T^{*}(m+n)$;

First, Algorithm 1 reads the original data into a two-dimensional array (line 1), initializes the root node of the KD tree (line 2), creates the KD tree with the root (line 3), and creates a queue of nodes to be processed (line 4). Then, it adds the root node to the queue (line 5). While nodeQueue is not empty, the elements in the queue are processed one by one (line 6). The head element of nodeQueue is taken (line 8) and checked against the anonymity requirement (line 9). If the node contains more than $2k$ elements, it is not yet a final anonymous group and needs further division. For this division, it is necessary to find the locally optimal dimension (line 11) and then the optimal partition point (line 12). The algorithm uses the hyperplane that is perpendicular to the optimal partition dimension and passes through the optimal partition point to divide the data space of the node into two parts (line 13). The two parts become the left and right children of the node in the KD tree (lines 14 to 17). These steps are repeated until the queue is empty. Eventually, each leaf node in the KD tree corresponds to an anonymous group (lines 20 to 21), and all leaf nodes together correspond to the collection of all anonymous groups (line 22).
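The control flow of Algorithm 1 can be sketched as follows, with children placed on the queue as described in Step 3. This is a simplified illustration, not the authors' code: `widest_dimension` and `middle_pivot` are hypothetical stand-ins for the dimension and pivot searches (they use all dimensions and a balanced cut), and the sample data are the 17 points of the example set T.

```python
from collections import deque


class Node:
    """Set-valued KD-tree node."""
    def __init__(self, points):
        self.points = points
        self.left = None
        self.right = None


def variance(points, dim):
    values = [p[dim] for p in points]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def widest_dimension(points):
    """Stand-in for the dimension search: largest variance over all dimensions."""
    return max(range(len(points[0])), key=lambda d: variance(points, d))


def middle_pivot(ordered, k):
    """Stand-in for the pivot search: a balanced cut in the middle."""
    return len(ordered) // 2


def build_tree(points, k, choose_dim=widest_dimension, choose_pivot=middle_pivot):
    root = Node(list(points))
    queue = deque([root])
    while queue:                                  # Algorithm 1, line 6
        node = queue.popleft()                    # line 8
        if len(node.points) > 2 * k:              # line 9
            dim = choose_dim(node.points)         # line 11
            ordered = sorted(node.points, key=lambda p: p[dim])
            pivot = choose_pivot(ordered, k)      # line 12
            node.left = Node(ordered[:pivot])     # line 13
            node.right = Node(ordered[pivot:])
            queue.append(node.left)
            queue.append(node.right)
    return root


def leaf_clusters(node):
    """Lines 20 to 22: gather the point sets stored in the leaves."""
    if node.left is None:
        return [node.points]
    return leaf_clusters(node.left) + leaf_clusters(node.right)


T = [(5, 24), (5, 19), (5, 16), (4, 6), (3, 5), (9, 18), (6.5, 12), (11, 9),
     (10, 7), (12, 15), (18, 23), (19.5, 17), (21, 21), (24, 13), (25, 10),
     (21, 6), (24, 20)]
clusters = leaf_clusters(build_tree(T, k=3))
print([len(c) for c in clusters])
```

With the balanced stand-in pivot, a node of more than $2k$ points always yields two children of at least $k$ points, so every leaf holds between $k$ and $2k$ records.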

In Algorithm 2, line 5 obtains the dispersion (variance) of each selected dimension. The pseudocode for this step is listed in Algorithm 3.

Algorithm 2 GetProbabilisticOptimalDimension(node)
Input: node;
Output: Probabilistic optimal dimension;
1. Set maxDimension $=$ 0; maxVariance $=$ 0;
2. Randomly select some dimensions by probabilistic method;
3. FOR (each selectedDimension$_{i}$ in node)
4. {
5.   set var $=$ getVariance(node, selectedDimension$_{i}$);
6.   IF (var $>$ maxVariance)
7.   {
8.     maxVariance $=$ var;
9.     maxDimension $=$ selectedDimension$_{i}$;
10.   }
11. }
12. probabilisticOptimalDimension $=$ maxDimension;
13. Get probabilistic optimal dimension;

Algorithm 3 getVariance(node, selectedDimension$_{i}$)
Input: node, selectedDimension$_{i}$;
Output: Variance of selectedDimension$_{i}$;
1. Set sum $=$ 0; h $=$ node.count;
2. FOR (each data point $a$ in node)
3. {
4.   sum $=$ sum $+$ $a$[selectedDimension$_{i}$];
5. }
6. Set mean $=$ sum $/$ h; var $=$ 0;
7. FOR (each data point $a$ in node)
8. {
9.   var $=$ var $+$ ($a$[selectedDimension$_{i}$] $-$ mean)$^{2}$;
10. }
11. var $=$ var $/$ h;
12. return var;

In Algorithm 2, the maximum variance and its dimension are initialized (line 1), and a certain number of attributes are randomly selected from all dimensions to form the candidate dimension set (line 2). The algorithm then traverses each selected attribute (lines 3 to 11), calculates the variance of the projection onto it, and keeps the dimension with the largest variance (lines 5 to 10), which is returned as the optimal dimension (lines 12 to 13). The node is then divided along the hyperplane perpendicular to this dimension. This method of local optimization ensures division in the locally optimal dimension, maximizing the aggregation of the original data and reducing the loss of information, thus improving the quality of the anonymous data. In addition, the KD tree generated by this division is more balanced and the data distribution is more uniform. Moreover, compared with global dimension optimization, the probabilistic dimension optimization method improves the speed of the algorithm at the expense of a small loss of data quality. Thus, a better balance between data availability and calculation speed can be achieved.
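The dimension search of Algorithms 2 and 3 can be sketched as below. The paper does not state how the candidate dimensions are drawn, so this sketch assumes a uniform sample without replacement of a fixed size; `sample_size`, the `rng` parameter, and the four sample points are illustrative assumptions.

```python
import random


def get_variance(points, dim):
    """Population variance of the points projected onto dimension `dim` (Eq. (4))."""
    values = [p[dim] for p in points]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def probabilistic_optimal_dimension(points, sample_size, rng=random):
    """Pick the most dispersed dimension among a random subset of dimensions.

    Assumption: candidates are drawn uniformly without replacement.
    Returns the chosen dimension together with the candidate set.
    """
    candidates = rng.sample(range(len(points[0])), sample_size)
    best_dim, best_var = candidates[0], -1.0
    for dim in candidates:
        var = get_variance(points, dim)
        if var > best_var:
            best_dim, best_var = dim, var
    return best_dim, candidates


points = [(1, 10, 0, 5), (2, 20, 0, 5), (3, 30, 1, 5), (4, 40, 1, 6)]
dim, candidates = probabilistic_optimal_dimension(points, 2, random.Random(0))
print(dim, candidates)
```

Only `sample_size` variances are computed per node instead of one per dimension, which is where the speed-up over a global search comes from; when `sample_size` equals the number of dimensions, the function reduces to the global search.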

In Algorithm 4, the projection points are sorted along the selected dimension (line 1), and the node's count is decomposed as $\alpha\cdot k+\beta$ (line 2). Then, the lower bounds bbv1 and bbv2 and the upper bounds tbv1 and tbv2 of the two intervals are calculated (lines 3 to 6), and the two intervals are merged into the search interval for the partition point (line 7). The partition coefficient set pmset is then initialized (line 8). For each point in the interval [bbv, tbv], the partition coefficient $pm_{i}$ is calculated, and $pm_{i}$ and the corresponding partition point $i$ are stored in pmset (lines 9 to 16). Finally, by traversing pmset, the partition point with the minimum partition coefficient, which by the definition of Pm indicates the least information loss, is used as the optimal partition point (lines 17 to 18).

Algorithm 4 GetPivot(root)
Input: root;
Output: Optimal Pivot;
1. Sort node $\to$ data by node $\to$ selectedDimension;
2. Calculate $\alpha$, $\beta$, let $\alpha\cdot k+\beta=$ node’s count;
3. set bottom boundary value bbv1 $=\lfloor\frac{\alpha}{2}\rfloor$;
4. set top boundary value tbv1 $=\lfloor\frac{\alpha}{2}\rfloor+\beta$;
5. set bottom boundary value bbv2 $=$ node $\to$ datanum $-\lfloor\frac{\alpha}{2}\rfloor-\beta$;
6. set top boundary value tbv2 $=$ node $\to$ datanum $-\lfloor\frac{\alpha}{2}\rfloor$;
7. [bbv, tbv] $=$ [bbv1, tbv1] $\cup$ [bbv2, tbv2];
8. set partition metric set pmset;
9. FOR (each candidate pivot $p_{i}$ between bbv and tbv)
10. {
11. calculate original data range adr;
12. calculate below $p_{i}$ data range bpdr;
13. calculate above $p_{i}$ data range apdr;
14. set partition metric $pm_{i}=\frac{(bpdr/i)+(apdr/[(\alpha\cdot k+\beta)-i])}{adr/(\alpha k+\beta)}$;
15. push $\left({pm_{i},i}\right)$ into pmset;
16. }
17. find optimal pivot in pmset;
18. get optimal pivot;
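Lines 1 to 16 of Algorithm 4 can be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the pseudocode: the boundary values are scaled by $k$ (i.e., $\lfloor\frac{\alpha}{2}\rfloor\cdot k$ rather than $\lfloor\frac{\alpha}{2}\rfloor$) so that the resulting cluster sizes agree with those used in Theorem 1; the data ranges adr, bpdr, and apdr are measured only along the selected dimension; and a pivot $i$ means that the $i$ smallest records go to the lower part. The function names are hypothetical.

```python
def candidate_pivots(n, k):
    """Candidate sizes i of the lower part for a node holding n records."""
    alpha, beta = divmod(n, k)          # n = alpha * k + beta
    if alpha < 2:
        raise ValueError("a node with fewer than 2k records is not split")
    half = (alpha // 2) * k             # ASSUMPTION: boundaries scaled by k
    first = range(half, half + beta + 1)            # [bbv1, tbv1]
    second = range(n - half - beta, n - half + 1)   # [bbv2, tbv2]
    return sorted(set(first) | set(second))


def partition_metrics(values, k):
    """Return pmset as a list of (pm_i, i) pairs for one selected dimension."""
    data = sorted(values)               # line 1: sort by selected dimension
    n = len(data)
    adr = data[-1] - data[0]            # original data range
    if adr == 0:
        return []                       # all values equal: no informative pivot
    pmset = []
    for i in candidate_pivots(n, k):
        bpdr = data[i - 1] - data[0]    # range of the i records below the pivot
        apdr = data[-1] - data[i]       # range of the n - i records above it
        pm = (bpdr / i + apdr / (n - i)) / (adr / n)
        pmset.append((pm, i))
    return pmset
```

The final selection over pmset (lines 17 to 18) is left to the caller, since it depends on the definition of the partition coefficient given earlier in the paper.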

5.3 Proof of the algorithm

Theorem 1 Assume table $T(m+n)$ and anonymous degree $k$. If $\left|{T\left({m+n}\right)}\right|=\alpha k+\beta$, then the proposed algorithm generates $\alpha$ anonymous equivalent clusters using the improved KD tree. Moreover, each cluster satisfies the $k$-anonymity model.

Proof:

1) When $\alpha=1$, $\beta$ may be any natural number with $\beta<k$, so $\left|{T\left({m+n}\right)}\right|=\alpha k+\beta<2k$. Hence, the dataset already satisfies the $k$-anonymity model and the original dataset does not need to be divided. The theorem is proved for this case.
2) When $\alpha\geqslant 2$ and $\beta$ equals any natural number, table $T(m+n)$ is divided into two smaller temporary equivalent clusters according to the proposed algorithm. The sizes of the two clusters are $\lfloor\frac{\alpha}{2}\rfloor\cdot k+\beta_{1}$ and $\lceil\frac{\alpha}{2}\rceil\cdot k+\beta_{2}$, respectively, where $\beta_{1}\geqslant 0$, $\beta_{2}\geqslant 0$, $\beta=\beta_{1}+\beta_{2}$, and the coefficients of the two clusters sum to $\alpha$. Similarly, any temporary anonymous equivalent cluster $X$ with $\left|X\right|=\alpha k+\beta$ and $\alpha\geqslant 2$ is divided by the proposed partition algorithm into two smaller temporary anonymous equivalent clusters that conform to the conditions $\alpha=\alpha_{1}+\alpha_{2}$ and $\beta=\beta_{1}+\beta_{2}$. Recursive splitting can therefore be performed on every temporary equivalent cluster satisfying these conditions until each cluster has $\alpha=1$. Because the coefficients always sum to the original $\alpha$, exactly $\alpha$ such clusters are obtained, each of which is $k$-anonymous by case 1). Hence, the theorem is proved.

Theorem 2 Assume table $T\left({m+n}\right)$ and anonymous degree $k$, with $\left|{T\left({m+n}\right)}\right|=\alpha k+\beta$. When $\left|{T\left({m+n}\right)}\right|\geqslant 2$ and $k>3$, the size of the anonymous equivalent cluster in each leaf node of the KD tree generated by the proposed algorithm is less than $2k$. As $\left|{T\left({m+n}\right)}\right|\to\infty$, the average size of the leaf nodes in the KD tree tends to $M$, where $M=k$.

Proof: By Theorem 1, we can see that when the data is processed by this algorithm, the data after anonymization consist of $\alpha$ anonymous equivalent clusters conforming to the $k$ -anonymous model. For any anonymous equivalent cluster $X$ that is divided into two smaller clusters of anonymous equivalence, the smaller clusters satisfy the following formula.

$\left\{\begin{array}{l}X_{1}=\lfloor\frac{\alpha}{2}\rfloor\cdot k+\beta_{1}\\ X_{2}=\lceil\frac{\alpha}{2}\rceil\cdot k+\beta_{2}\end{array}\right.$

In the worst case, the maximum size of the anonymous equivalence cluster of the leaf node in the KD tree satisfies $\left({\alpha k+\beta}\right)-\left({\alpha-1}\right)\cdot k<2k$ .

The average size of the anonymous equivalent clusters over all leaf nodes is $\frac{\left|{T\left({m+n}\right)}\right|}{\alpha}=k+\frac{\beta}{\alpha}$. As $\left|{T\left({m+n}\right)}\right|\to\infty$, we have $\alpha k+\beta\to\infty$ and therefore $\alpha\to\infty$; because $\beta<k$, it follows that $k+\frac{\beta}{\alpha}\to k$. Hence, the theorem is proved.

In summary, under nontrivial conditions, a $k$ -anonymous algorithm based on the improved KD tree can obtain an upper bound of the data size of an anonymous equivalent cluster that is less than 2 $k$ . For large data, the average data size of anonymous equivalent clusters approaches $k$ , which reduces the information loss when it is anonymized.
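The size bounds in Theorems 1 and 2 can be checked numerically with a small Python sketch. It only tracks cluster sizes, not the data, and it assumes a fixed split of the remainder ($\beta_{1}=\lfloor\beta/2\rfloor$); in the proposed algorithm $\beta_{1}$ is instead determined by the chosen partition point. The function name leaf_sizes is hypothetical.

```python
def leaf_sizes(n, k):
    """Recursively split a cluster of n records and return the leaf sizes.

    A cluster with n = alpha * k + beta records is split into parts of size
    floor(alpha / 2) * k + beta1 and ceil(alpha / 2) * k + beta2.
    """
    if n < k:
        raise ValueError("a cluster needs at least k records")
    alpha, beta = divmod(n, k)
    if alpha < 2:
        return [n]                      # fewer than 2k records: leaf node
    beta1 = beta // 2                   # ASSUMPTION: illustrative choice only
    left = (alpha // 2) * k + beta1
    return leaf_sizes(left, k) + leaf_sizes(n - left, k)


sizes = leaf_sizes(2000, 7)             # 2000 = 285 * 7 + 5
print(len(sizes), min(sizes), max(sizes), sum(sizes) / len(sizes))
```

For 2,000 records and $k=7$, the sketch yields $\alpha=285$ leaves, each holding at least $k$ and fewer than $2k$ records, with an average size of $k+\frac{\beta}{\alpha}$.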

Theorem 3 Assume table $T(m+n)$ and anonymous degree $k$ . If $|T(m+n)|=\alpha k+\beta$ , a smaller anonymous equivalent cluster domain generated by the algorithm decreases the information loss caused by anonymity and increases data quality.

Proof: If an anonymous equivalent cluster $X$ satisfies the $k$-anonymity model, the number of records in the cluster must be at least $k$. By Theorem 2, the size $k_{0}$ of each anonymous equivalent cluster generated by the proposed algorithm satisfies $k\leqslant k_{0}<2k$.

Let the size $k_{0}$ of an anonymous equivalent cluster be fixed. A decrease in the size of the equivalent cluster domain then decreases the spatial distance between the data points. Denser data points and a smaller overall projection area of the data in the equivalent cluster decrease the information loss after anonymization. Hence, the theorem is proved.

5.4 Time complexity analysis

Suppose that $T(m+n)$ is the given table of the dataset, $k$ is the anonymous degree, $|T(m+n)|=\alpha k+\beta$, and $m$ is the number of QI attributes. The total number of data points is $m\times n$. The time complexity of the proposed algorithm consists of the time complexity of generating the KD tree and that of generating the anonymous table.

First, the time complexity of creating the KD tree is analyzed. Suppose that a node at the $i$th level of the KD tree is a temporary anonymous equivalent cluster to be divided and that the size of the node set is $N$, where $N\approx\frac{\alpha}{2^{i}}\cdot k+\beta$. When splitting the node, we need to traverse the $m$ dimensions and calculate the dimension aggregation metric for each one. The time complexity of selecting the optimal dimension at the $i$th level is $O\left(N\right)$. When selecting a partition point on that dimension, the projection points on the dimension must be sorted in ascending order, and the time complexity of the sorting operation is less than $O\left({N\times\log_{2}N}\right)$. The partition coefficient pm of each potential division point is then calculated, with a time complexity of less than $O\left({k\times N}\right)$. The time complexity of the division itself is $O\left(1\right)$. Therefore, the time complexity of generating the $i$th layer node of the KD tree is $O\left({N+N\times\log_{2}N+k\times N+N\times\log_{2}N}\right)$. Moreover, the height of the KD tree is $h\approx\log_{2}\frac{N}{k}$, which is at least one; the maximum value of $k$ is $n$, and the maximum value of $h$ is $\log_{2}N$. Therefore, the complexity of generating the KD tree is less than $O\left({\left({N+N\times\log_{2}N+k\times N+N\times\log_{2}N}\right)\times\log_{2}N}\right)=O\left({N\times N\times\log_{2}N}\right)$.

After the KD tree has been created, the anonymous table must be generated. The algorithm needs to traverse all leaf nodes, and the time complexity of this operation is $O\left(N\right)$.

In summary, the time complexity of the proposed algorithm is $O\left({N\times N\times\log_{2}N}\right)$.

5.5 Algorithm space complexity analysis

The memory consumption of the algorithm is dominated by the queues created for partitioning, which require $O\left({n\times\log_{2}n}\right)$ extra memory space. Because a recursive algorithm is used and the maximum height of the recursion tree is $\log_{2}n$, the space complexity of the algorithm is $O\left({n\times\log_{2}n\times\log_{2}n}\right)$.

6. Experimental results and analysis

In this section, we analyze the performance of the probabilistic optimal projection partition $k$-anonymity algorithm based on the proposed KD tree. The algorithm was implemented using Visual Studio Ultimate 2013 on a computer equipped with an Intel Core i5-3230M 2.60 GHz CPU, 4 GB of memory, and the Windows 7 (64-bit) operating system.

6.1 Experimental data

The experiment used the UCI multifeature digit dataset called mfeat-fac.¹ It contains 2,000 records, each of which has 107 characteristic attributes. To test the performance of the algorithm, we randomly selected 20 attributes (mfeat-fac_20) to create the low-dimensional dataset and 107 attributes (mfeat-fac_107) to create the high-dimensional dataset from mfeat-fac. The proposed partitioning algorithm was compared with the optimal projection $k$-anonymity algorithm for privacy protection based on the KD tree (KA-PPKT) [28] and the flexible partitioning algorithm [10, 11].

¹ http://archive.ics.uci.edu/ml/machine-learning-databases/.

6.2 Evaluation metrics

The following metrics were used to evaluate the results:

1) Data quality (Dq), which is the logarithm of Adr, given in Definition 4, and is used because the value of Adr is large when the dimensions of the attributes are large. It is calculated using Eq. (10).

$\displaystyle\textit{Dq}=\log_{10}\textit{Adr}$ (10)

2) The execution time of the algorithm over the dataset.
3) Information loss, which is mainly measured by the data generalization scope and the upper and lower bounds of the data.

Figure 6.
Data quality comparison of three algorithms on mfeat-fac_20 for (a) $p=$ 0.25, (b) $p=$ 0.5, and (c) $p=$ 0.75.

Figure 7.
Data quality comparison of three algorithms on mfeat-fac_107 for (a) $p=$ 0.25, (b) $p=$ 0.5, and (c) $p=$ 0.75.

6.3 Data quality

Figures 6 and 7 compare the data quality obtained by the algorithms for probabilistic partitioning (proposed method), KA-PPKT, and flexible partitioning for different datasets. Figure 6a–c compare the results for mfeat-fac_20 and Fig. 7a–c compare the results for mfeat-fac_107. Overall, the data quality of the proposed probabilistic partitioning algorithm is better than that of the flexible partitioning algorithm and close to that of the KA-PPKT algorithm. Moreover, the proposed algorithm is very stable on the high-dimensional dataset mfeat-fac_107. This is because the algorithm selects the locally optimal dimension among the candidate dimensions. Compared with random partitioning, locally optimized dimension selection helps to improve the quality of partitioning. Relative to the KA-PPKT algorithm with its globally optimal partitioning, the random choice of the dimensions to be considered reduces the data quality; however, the algorithm adopts dual-interval partition point selection to make up for this disadvantage and, in practice, obtains a data quality similar to that of the optimal partition.

Table 2
Execution time on mfeat-fac_20 (s)

$k$	3	5	7	9	11	13
KA-PPKT	0.5120	0.3566	0.5380	0.3860	0.6860	0.7680
Probabilistic partitioning, $p=$ 0.25	0.4520	0.3400	0.4100	0.3180	0.4880	0.4960
Probabilistic partitioning, $p=$ 0.50	0.4586	0.3400	0.5022	0.3524	0.6426	0.7148
Probabilistic partitioning, $p=$ 0.75	0.3564	0.2604	0.3643	0.26635	0.4569	0.49795

6.4 Execution time

Table 2 shows the running times of the experiments on mfeat-fac_20, and Table 3 shows the ratio of the running time of the proposed algorithm to that of the KA-PPKT algorithm. Table 3 shows that when $p=$ 0.75, the average ratio over the different $k$ values is only 0.684649, i.e., the time efficiency is improved by 32%. Similarly, when $p$ is equal to 0.25 and 0.50, the time efficiency of the proposed algorithm relative to the KA-PPKT algorithm is improved by 21% and 8%, respectively.

Table 3
Ratio of the time efficiency of the probabilistic partitioning algorithm to KA-PPKT algorithm on mfeat-fac_20

$k$	3	5	7	9	11	13	Avg.
$p=$ 0.25	0.882813	0.953449	0.762082	0.823834	0.71137	0.645833	0.796564
$p=$ 0.50	0.895703	0.953449	0.933457	0.912953	0.936735	0.930729	0.927171
$p=$ 0.75	0.696094	0.73023	0.677138	0.690026	0.666035	0.648372	0.684649
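The ratios in Table 3 follow directly from the running times in Table 2. The short Python sketch below recomputes them; the variable and function names are illustrative, and the input values are copied from Table 2.

```python
# Running times in seconds from Table 2, for k = 3, 5, 7, 9, 11, 13.
ka_ppkt = [0.5120, 0.3566, 0.5380, 0.3860, 0.6860, 0.7680]
probabilistic = {
    0.25: [0.4520, 0.3400, 0.4100, 0.3180, 0.4880, 0.4960],
    0.50: [0.4586, 0.3400, 0.5022, 0.3524, 0.6426, 0.7148],
    0.75: [0.3564, 0.2604, 0.3643, 0.26635, 0.4569, 0.49795],
}


def time_ratios(times, baseline):
    """Per-k ratio of the proposed algorithm's runtime to the baseline's."""
    return [t / b for t, b in zip(times, baseline)]


averages = {}
for p, times in probabilistic.items():
    ratios = time_ratios(times, ka_ppkt)
    averages[p] = sum(ratios) / len(ratios)
    # The improvement is one minus the average ratio.
    print(f"p={p:.2f}  avg ratio={averages[p]:.6f}  "
          f"improvement={1 - averages[p]:.1%}")
```

The average ratios agree with the last column of Table 3 to six decimal places.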

Figure 8 shows the runtime results of the probabilistic partitioning algorithm for different probability values. Figure 8a shows how the runtime of the algorithm on the mfeat-fac_20 dataset changes with $k$. Because the globally optimal dimension projection $k$-anonymity algorithm KA-PPKT also spends time on the candidate dimensions that are not selected by the probabilistic partitioning algorithm, its runtime is higher than that of the proposed algorithm for probabilities of 0.25, 0.50, and 0.75. Figure 8b shows the running time for various $k$ values for the probabilistic partitioning algorithm on the mfeat-fac_107 dataset. When partitioning a high-dimensional dataset, the running time was significantly lower when probabilistic partitioning was used. The probabilistic method proposed in this paper not only improves the time efficiency but is also stable on high-dimensional datasets.

Figure 8.

Execution time comparison of the proposed algorithm on (a) mfeat-fac_20 and (b) mfeat-fac_107.

Figure 8 further shows that the proposed probabilistic partitioning algorithm has a better overall running time than the algorithm that does not use probabilistic partitioning. There are some fluctuations in the running time, mainly caused by dual-interval partitioning: when many points fall within the two intervals to be checked, determining the partition coefficient of each candidate point takes additional time. However, on the whole, the speed of the algorithm is substantially improved.

Figure 9.

Data range obtained by the three algorithms in each dimension on mfeat-fac_20 for (a) $p=$ 0.25, (b) $p=$ 0.50, and (c) $p=$ 0.75.

Figure 10.

Data range obtained by the three algorithms in each dimension on mfeat-fac_107 for (a) $p=$ 0.25, (b) $p=$ 0.50, and (c) $p=$ 0.75.

Figure 11.

Comparison of the data range ratio for the probabilistic partitioning and flexible partitioning algorithms in each dimension on mfeat-fac_20 ( $p=$ 0.50) for (a) $k=$ 3, (b) $k=$ 5, (c) $k=$ 7, (d) $k=$ 9, (e) $k=$ 11, and (f) $k=$ 13.

6.5 Information loss

6.5.1 Data range

Figure 9 compares the data generalization range of the three algorithms when the probabilities are 0.25, 0.50, and 0.75 on the mfeat-fac_20 dataset. The experimental results show that the data range obtained by the proposed algorithm is better than that obtained by the flexible partitioning algorithm and the data range of the KA-PPKT algorithm is almost the same as that of the full probabilistic algorithm. To further verify the performance of the algorithm, the experiment was repeated on the high-dimensional dataset mfeat-fac_107 and the experimental results are shown in Fig. 10. Figures 9 and 10 both show that the proposed algorithm performs well and is stable on both datasets, especially the high-dimensional dataset. This is because we consider the influence of local probability on the result of spatial division and extend the division range to improve its quality. Therefore, the data quality is close to that obtained by the globally optimal dimension selection method. However, the proposed algorithm is more efficient.

The ratio of the data range Rdr is the ratio of the data point generalization of the two algorithms and is calculated using Eq. (11).

$\displaystyle\textit{Rdr}=\frac{Dr_{\textit{probability partition}}}{Dr_{\textit{Flexible partition}}}$ (11)

If Rdr $>1$, then the generalization scope of the flexible partitioning algorithm is better. Otherwise, if Rdr $<1$, then the generalization scope of the probabilistic partitioning algorithm is better. Figures 11 and 12 compare the Rdr of the probabilistic partitioning algorithm and the flexible partitioning algorithm in each dimension for different $k$ values on mfeat-fac_20 and mfeat-fac_107, respectively. Figure 11 shows that the Rdr values on the 20-attribute-dimension data are less than one, indicating that the probabilistic partitioning algorithm has a smaller generalization scope and less information loss than the flexible partitioning algorithm. Figure 12 shows that the performance of the probabilistic partitioning algorithm is still stable in 107 dimensions and is significantly better than that of the flexible partitioning algorithm.

Figure 12.

Comparison of the data range ratio for the probabilistic partitioning and flexible partitioning algorithms in each dimension on mfeat-fac_107 ( $p=$ 0.50) for (a) $k=$ 3, (b) $k=$ 5, (c) $k=$ 7, (d) $k=$ 9, (e) $k=$ 11, and (f) $k=$ 13.

Figure 11 shows that in 20 dimensions, the Rdr of the probabilistic partitioning algorithm relative to the flexible partitioning algorithm is in the vicinity of 0.8, indicating that the generalization range of the probabilistic partitioning algorithm is about 80% of that of the flexible partitioning algorithm. In contrast, the Rdr of the probabilistic partitioning algorithm relative to the KA-PPKT algorithm is close to one, indicating that these two algorithms have approximately the same information loss. Figure 12 shows that the performance of the algorithm is still stable in 107 dimensions: its data quality is superior to that of the flexible partitioning algorithm and comparable to that of the KA-PPKT algorithm. Figure 12 also shows that the data quality improves slightly on the high-dimensional data, which indicates that this algorithm is superior for data in 107 dimensions.

The ratio of the average data range Radr is the ratio of the average data point generalization ranges of the two algorithms, as calculated using Eq. (12).

$\displaystyle\textit{Radr}=\frac{Adr_{\textit{probability partition}}}{Adr_{\textit{Flexible partition}}}$ (12)

If Radr $>$ 1, then the generalization scope of the flexible partitioning algorithm is better. Otherwise, if Radr $<$ 1, then the generalization scope of the probabilistic partitioning algorithm is better.
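Equations (11) and (12) can be illustrated with a brief Python sketch. It assumes that each algorithm's result is summarized by one data range per dimension and that Adr is the mean of these per-dimension ranges; Definition 4 gives the authoritative definition. The input values below are hypothetical and serve only to show the calculation.

```python
from statistics import mean


def rdr(dr_probabilistic, dr_flexible):
    """Eq. (11): per-dimension ratio of the data ranges of the two algorithms."""
    return [a / b for a, b in zip(dr_probabilistic, dr_flexible)]


def radr(dr_probabilistic, dr_flexible):
    """Eq. (12): ratio of the average data ranges (ASSUMPTION: Adr is a mean)."""
    return mean(dr_probabilistic) / mean(dr_flexible)


# Hypothetical per-dimension data ranges for three dimensions.
dr_prob = [8.0, 4.0, 6.0]
dr_flex = [10.0, 5.0, 8.0]

print(rdr(dr_prob, dr_flex))    # values below one favor probabilistic partitioning
print(radr(dr_prob, dr_flex))
```

A ratio below one in every dimension, as in this illustration, corresponds to the situation reported for Figs 11 and 12.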

Figure 13.

Comparison of the average data range ratio for the three algorithms in each dimension on mfeat-fac_20 for (a) $p=$ 0.25, (b) $p=$ 0.5, and (c) $p=$ 0.75.

Figure 14.

Comparison of the average data range ratio for the three algorithms on mfeat-fac_107 for (a) $p=$ 0.25, (b) $p=$ 0.50, and (c) $p=$ 0.75.

Figures 13 and 14 show the Radr in each dimension as the $k$ value changes. Figure 13 shows that for mfeat-fac_20, the Radr values of the two algorithms for all dimensions are within the interval [0.78, 0.88]. Figure 14 shows that for mfeat-fac_107, the Radr values of the two algorithms for all dimensions are within the interval [0.81, 0.89]. These values are all less than one. Therefore, the generalization range of the proposed algorithm is only 78% to 89% of that of the flexible partitioning algorithm, an improvement of 11% to 22%. Moreover, the Radr of the proposed algorithm relative to the KA-PPKT algorithm approaches one, so the method of probabilistic dimension optimization obtains a data quality level similar to that of the global optimization method. The proposed method, which expands the potential range of optimal partition points, improves data quality and can maintain the characteristics of data aggregation. In addition, the divided data point domain is smaller, the anonymized data is closer to the real data, and less information is lost.

6.5.2 Upper and lower boundaries

Assume that the upper boundaries of the generalized data points obtained by the probabilistic partitioning and flexible partitioning algorithms are denoted by $X_{\textit{probability partition}}$ and $X_{\textit{Flexible partition}}$, respectively. Then, the ratio of the upper boundary Rubr is given by Eq. (13).

$\displaystyle\textit{Rubr}=\frac{X_{\textit{probability partition}}}{X_{\textit{Flexible partition}}}$ (13)

Figure 15.

Average ratio of the (a) lower boundary and (b) upper boundary of the probabilistic partitioning algorithm and flexible partitioning algorithm on mfeat-fac_20 ( $k=$ 5).

Figure 16.

Average ratio of the (a) lower boundary and (b) upper boundary of the probabilistic partitioning algorithm and flexible partitioning algorithm on mfeat-fac_107 ( $k=$ 3).

If Rubr $<$ 1, then the upper boundary of the probabilistic partitioning algorithm is smaller and better than that of the flexible partitioning algorithm. Otherwise, it is worse.

Moreover, assume that the lower boundaries of the generalized data points obtained by the probabilistic partitioning and flexible partitioning algorithms are denoted by $X_{\textit{probability partition}}$ and $X_{\textit{Flexible partition}}$, respectively. Then, the ratio of the lower boundary Rlbr is given by Eq. (14).

$\displaystyle\textit{Rlbr}=\frac{X_{\textit{probability partition}}}{X_{\textit{Flexible partition}}}$ (14)

If Rlbr $>$ 1, then the lower boundary of the probabilistic partitioning algorithm is larger and better than that of the flexible partitioning algorithm. Otherwise, it is worse.

Figures 15 and 16 respectively show the upper and lower data bounds for each dimension after the probabilistic partitioning and flexible partitioning algorithms have anonymized the data of mfeat-fac_20 and mfeat-fac_107. The ratios in Figs 15a and 16a are mostly less than one, indicating that the algorithm in this paper has a smaller upper bound. The ratios in Figs 15b and 16b are mostly greater than one, indicating that this algorithm has a larger lower bound. The smaller upper bound and larger lower bound indicate that the anonymized data has a smaller scope and the quality of the data is higher.

7. Conclusion

This paper presents an improved KD tree probabilistic $k$-anonymity privacy protection method for high-dimensional data publishing. We adopt a probabilistic method to select a subset of the attributes for determining the locally optimal dimension. This is similar to the approach of humans in real-world situations. In fact, when learning something, human beings often summarize the general principles using individual examples, and use their existing theoretical knowledge to analyze and summarize the characteristics of these examples. This helps them understand the laws governing the topic.

The speed of the algorithm is improved by probabilistically selecting some of the dimensions and searching only these for the best dimension. The method also increases data quality by enlarging the search interval in which the best partition point is sought. The data quality of this algorithm is similar to that of the globally optimal partitioning method KA-PPKT, but the time efficiency is improved by 8% to 32%. The proposed method preserves data quality, loses little information, and is fast to compute. Moreover, the algorithm shows good stability at high dimensions and for large datasets.

In the future, we will consider simulation studies using artificial data with changeable parameters in order to gain a deeper insight into the behavior of the algorithm. We will also deeply explore the effects of noise and missing values in the data sets. Finally, we plan to study how to dynamically find the optimal probability value based on the relationship between data quality, execution time, and privacy protection.

Acknowledgments

The authors would like to thank the reviewers for their useful comments and suggestions for this paper. This work was supported by the Key Project of Academic Support for the Top Talents in Anhui Universities (Grant No. gxbjZD2016011), National Natural Science Foundation of China (Grants No. 61370050, No. 61672039, and No. 61702010) and the Novation Foundation of Anhui Normal University (Grant No. 2017XJJ93).

References

1. Chakraborty and Tripathy B.K., Privacy preserving anonymization of social networks using eigenvector centrality approach, Intell. Data Anal 20 (2016), 543–560.
2. Rajalaxmi R.R. and Natarajan A.M., Effective sanitization approaches to hide sensitive utility and frequent itemsets, Intell. Data Anal 16 (2012), 933–951.
3. Lin C.Y., Kao Y.H., Lee W.B. and Chen R.C., An efficient reversible privacy-preserving data mining technology over data streams, Springer Plus 5 (2016), 1047–1057.
4. T.C., N.H., Zhang and Molloy, Slicing: A new approach for privacy preserving data publishing, IEEE Transactions on Knowledge and Data Engineering 24 (2012), 561–574.
5. Fung B.C.M., Wang L.Y. and Hung P.C.K., Privacy-preserving data publishing for cluster analysis, Data & Knowledge Engineering 68 (2009), 552–575.
6. Fung B.C.M., Wang, Chen and Yu P.S., Privacy-preserving data publishing: a survey of recent developments, ACM Computing Surveys 42 (2010), 53.
7. Nayahi J.J.V. and Kavitha, Privacy and utility preserving data clustering for data anonymization and distribution on Hadoop, Future Generation Computer Systems-the International Journal of Escience 74 (2017), 393–408.
8. Gong, Yang, Chen and Luo, A framework for utility enhanced incomplete microdata anonymization, Cluster Computing-the Journal of Networks Software Tools and Applications 20 (2017), 1749–1764.
9. et al., A hybrid approach to prevent composition attacks for independent data releases, Information Sciences 367 (2016), 324–336.
10. Sweeney, Achieving k-anonymity privacy protection using generalization and suppression, International Journal of Uncertainty Fuzziness and Knowledge-Based Systems 10 (2002), 571–588.
11. Sweeney, k-anonymity: A model for protecting privacy, International Journal of Uncertainty Fuzziness and Knowledge-Based Systems 10 (2002), 557–570.
12. Newton E.M., Sweeney and Malin, Preserving privacy by de-identifying face images, IEEE Transactions on Knowledge and Data Engineering 17 (2005), 232–243.
13. Zhou and Pei, The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks, Knowledge and Information Systems 28 (2011), 47–77.
14. Rebollo-Monedero, Forne and Domingo-Ferrer, From t-closeness-like privacy to postrandomization via information theory, IEEE Transactions on Knowledge and Data Engineering 22 (2010), 1623–1636.
15. N.H., T.C. and Venkatasubramanian, Closeness: A new privacy measure for data publishing, IEEE Transactions on Knowledge and Data Engineering 22 (2010), 943–956.
16. Tang and Tian, A survey of privacy preserving data publishing using generalization and suppression, Applied Mathematics & Information Sciences 8 (2014), 1103–1116.
17. Gong Q.Y., Luo J.Z., Yang W.W. and Li X.B., Anonymizing 1:M microdata with high utility, Knowledge-Based Systems 115 (2017), 15–26.
18. W.W., M.Z. and Chen, Location privacy-preserving k nearest neighbor query under user’s preference, Knowledge-Based Systems 103 (2016), 19–27.
19. Zhu, Tian and Xie, Anonymization on Refining Partition: Same Privacy, More Utility, Computer Science and Information Systems 12 (2015), 1193–1216.
20. Matatov, Rokach and Maimon, Privacy-preserving data mining: A feature set partitioning approach, Information Sciences 180 (2010), 2696–2720.
21. Kumar and Minz, Multi-view ensemble learning: an optimal feature set partitioning for high-dimensional data classification, Knowledge and Information Systems 49 (2016), 1–59.
22. Sarrafi Aghdam and Sonehara, Achieving high data utility k-anonymization using similarity-based clustering model, IEICE Transactions on Information and Systems E99D (2016), 2069–2078.
23. Gkoulalas-Divanis, Loukides and Sun, Publishing data from electronic health records while preserving privacy: A survey of algorithms, Journal of Biomedical Informatics 50 (2014), 4–19.
24. Park and Shim, Approximate algorithms with generalizing attribute values for k-anonymity, Information Systems 35 (2010), 933–955.
25. Lefevre, Dewitt D.J. and Ramakrishnan, Mondrian Multidimensional K-Anonymity, in International Conference on Data Engineering, 2006, 25–25.
26. Y.J., Tang Q.M. and Ni W.W., Algorithm for k-anonymity based on rounded partition function, Journal of Software 23 (2012), 2138–2148.
27. Wang, Yang, Zhang J.P. and Lv, Algorithm for k-anonymity based on projection area density partition, Journal on Communications 36 (2015), 125–134.
28. Wang X.H., Luo Y.L. and Jiang Y.F., An optimal projection partition k-anonymity algorithm of KD-tree, Journal of Nanjing University 52 (2016), 1050–1064.
29. Herranz, Nin and Sole, Kd-trees and the real disclosure risks of large statistical databases, Inf. Fusion 13 (2012), 260–273.
30. Popov, Guenther, Seidel H.-P. and Slusallek, Stackless KD-Tree traversal for high performance GPU ray tracing, Comput. Graph. Forum 26 (2007), 415–424.
31. Redmond S.J. and Heneghan, A method for initialising the K-means clustering algorithm using kd-trees, Pattern Recognit. Lett. 28 (2007), 965–973.
