Abstract
Knowledge discovery on social network data can benefit general public, since these data contain latent social trends and valuable information. Recent research finds that preserving data privacy plays a vital role in knowledge discovery. Therefore, social network data need to be anonymized to preserve users’ identity before the data can be released for research purposes. In this paper, we model social network data as directed graphs with signed edge weights; formally define privacy, attack models for the anonymization problem. Based on our analysis, we develop a graph anonymization approach. The other main contribution is our graph clustering algorithm which can effectively group similar graph nodes into clusters with minimum cluster size constraints. Finally, we carry out a series of experiments to evaluate the effectiveness and utility of our approach on anonymizing social network data.
Introduction
We consider the problem of publishing social network data, while preserving the privacy of individuals associated to them. Consider a dataset D, which stores sensitive information of individual participants about the social relationship. We observe that with some background knowledge about a user’s information of relations, an adversary may be able to uniquely identify and consequently learn additional information about the user. For example, assume John has social relations with Alice and Albert. Assume that the adversary knows his co-worker John dislikes Albert a lot. John would not like others find out he likes Alice. However by matching with the released dataset in Fig. 1, the adversary can identify that the ID 2 corresponds to John. Consequently, the adversary learns additional social relations that John has and his options to his co-workers.
This example emphasizes the need to transform the original social network dataset D to a dataset D′ before publication so as to avoid the association of specific records to particular persons. Since publishing anonymized data for research purpose is an inevitable trend and has the potential to provide general public with great benefits, we take on the task of developing an efficient, utility-preserving approach for anonymizing social network data.
Such privacy risk is very similar with the problem which has been studied on bipartite graph (e.g. Netflix prize 1 ).Narayanan et al. demonstrate that the adversary who knows only a little bit about an individual subscriber can identify this subscriber in the Netflix dataset [1] As a result, 84% of subscribers can be uniquely identified if the adversary knows 6 out of 8 movies outside the top 500 most popular movies. In social network data, unpopular users are rarely rated by other users. Thus, by rating an unpopular user, the voter distinguishes himself/herself from the crowd. This example shows that removing the attribute of username is not sufficient to protect users from the attack; with certain background knowledge, the adversary is still able to obtain sensitive information. In additon, such background knowledge can be easily obtained from personal blogs and other related websits.
In this paper, we propose such a k-anonymity model for social network data. Assuming that the knowledge of an adversary is any set of social relations, we want to prevent him/her from distinguishing the specific user from a set of k users in the dataset. Equivalently, for any set of social relations in the dataset D, there should be at least k users which contain this set in the released dataset D′.
We propose three types of algorithms in this direction. Our first algorithmic type is represented by the Immediate Merge-Split (IMS) for clustering individuals in a social network into groups that satisfy our k-anonymity model. IMS has very high computational cost and does not operate under limited memory. Besides, finding similar individuals through clustering is challenging - even individuals with similar relations have relatively small conflicts. We introduce a second type of algorithms, Iterative Relaxed Balance (IRB). IRB eliminates the conflicts and iteratively partitions similar individuals into groups. However, our experimental evaluation shows that IRB incurs much more information loss than IMS does. Our third type of algorithms, Author Anonymization (AA) is a more carefully designed heuristic, which anonymizes the dataset in the presence of limited memory. The basic idea is that the dataset is partitioned by IRB and the results are processed by Merge- Split clustering.
The rest of the paper is organized as follows. Section 2 provides a brief background of related work. Our privacy model is defined in Section 3. In Section 4, we present our anonymization method. Complexity and security analysis are given in Section 5. The experimental results are described and analyzed in Section 6. Section 7 summarizes the paper.
Related work
Anonymity for social network data has received considerable attention due to the need of several organizations to publish data without revealing the individual’s identity. Even if identifying attributes (e.g. username) are deleted, an adversary may associate datasets with specific individuals using some unique patterns (e.g. node degree, edge weights to specific nodes), which is called quasi-identifiers (QI). The adversary who uses certain background knowledge of node degree to re-identify the nodes/edges in the published graph is called "Degree Attack". There are two practical way proposed to publish a privacy preserving graph: Edge Modification (EM) [2, 5] and Anonymization Grouping (AG) [6, 11]. EM is to include or exclude some QI values to make anonymized graphs satisfy certain privacy requirements. AG is to cluster (or partition) “similar” nodes together to form anonymization groups, and then the edges between nodes are generalized as the edges between anonymization groups. We call the technique that replaces the actural QI values with more generalized ones as Graph Generalization.
Most of the EM models differ from the definitions of background knowledge. Liu and Terzi [2] propose k-degree-anonymity model on social network graph, that is for any node v, there exists at least k - 1 other nodes in the graph have the same degree as node v. Zou et al. [4] introduce K-Automorphism-anonymity model, that is for any node v, there exists at least k - 1 other nodes do not have any structure difference with node v. Cheng et al.[5] design a k-isomorphism-anonymity model which can not only protect the nodes but also edges between thems. Spectrum Preserving Approach [3] is a quite different model which randomly selects and changes the edges in the graph. The final goal of any EM model is to generate a released graph with as less edge change as possible.
For most of the methods, which are based on AG models, generate groups containing at least k members, the probability to re-identify individual in social network graphs can be bounded as no more than 1/k. Hay et al. [6] propose a heuristic clustering algorithm to group similar nodes and then publish a generalized graph. In the grouping process, they employ simulated annealing to do a random exploration of the search space for a good set of clusters where the branching factor may be exponential. Zhou and Pei [8] employ a greedy approach to group and anonymize similar nodes. Although they provide optimizations that makes the computation feasible in practice, their algorithm runs in exponential time indeed, which is the consequence of frequent checks for isomorphisms between subgraphs [10]. In their literature [11], that model is extended to support l-diversity. Thompson and Yao [10] study another specific anonymization problem and propose (k, i)-anonymity. Assuming that the maximum knowledge of an adversary is the degree of a target node v and the degrees of all of v’s i-hop neighbors, they try to prevent the adversary from distinguishing the target node from a set of k publishednodes.
Our privacy model share a similar spirit to the neighborhood anonymization approach [8, 11]. In comparison, our anonymization method is more lightweight, because our anonymization contains two phases: grouping and anonymizing. We group graph nodes based on similarity metrics first and then anonymize within each group. Additionally, our edge insertion and deletion in the anonymizing phase are performed strategically to match needy nodes and avoid unnecessary modifications.
The majority researches focus on anonymizing graph topology information against degree-based attack. We not only study how to anonymize social network graph to combat degree-based attack, but also analyze another privacy problem called weight-based attack, which is beyond their study. Yuan et al. [12] treat graph edges as two possible weights: "positive" and "negative", and propose an anonymizing method based on node/edge additions. Hao et al. [13] extend the type of edge weights to several categorical attributes. Our work is quite different compared to the problems in the literature of [12, 13], since there is no fixed number of attributes for edge weights. Liu and Yang [14] prove that optimal k-anonymity against weight-based attack with minimal information loss is NP-hard. For the latter, they propose an approximate algorithm that minimizes the number of fake vertices added; the approximation bound is O (n2), where n is the number of vertices in graph G. The main difference of our work is that we consider edge recoding instead of vertex recoding, which in general results in less information loss [8, 11]. In vertex recoding, their recoding methods are allowed to add, modify and delete arbitrary number of vertices (including their edges). In edge recoding, on the other hand, any operation about vertex is disallowed.
Attack and privacy models
In this section, we describe the attack and privacy models that we consider in our anonymization method. In particular, we define the type of prior knowledge that adversaries are allowed to gain.
Social Network Graph G (V, E) is a directed graph having a set of verteices denoted by V and a set of weighted edges denoted by E. A node v ∈ V represents an individual in the social network. An edge (v o , v e , σ) ∈ E signifies a social relationship between the two individuals represented by nodes v o and v e . σ → {p, n} is a sign function. Edges with the sign p have positive weight while other edges have the sign n have negative weight, especially in some social network graph with signed weights, σ → {+1, - 1}. We denote the degree of a node by d (v) which means the number of edges starting with v. Let n = |V| and m = |E| be the total number of nodes and edges in G respectively. Immediate neighbors of a node v are denoted by N (v).
Attack models
By collecting information from external databases, an adversary might have prior knowledge about an individual and his/her social relationship. The adversary can try to uniquely identify the user by matching the prior knowledge with the released dataset. Degree-based attack is defined as follows,
An degree-based attack on a social network graph G is one in which the adversary has prior knowledge of the degree d (v) (consisting of the degree of positive edges d (v+) and that of negative edges d (v-)) of a target node v and the degrees of all of v’s immediate neighbors. That is, the adversary may know {d (u) |u ∈ N (v)}. Given the anonymized social network graph G′, the adversary’s goal is to successfully re-identify target node v.
Figure 2(a) shows an example of the adversary knowledge for a degree-based attack. The adversary knows that John likes Alice while dislikes Albert. That means . By matching this knowledge to the released data in Fig. 1, the adversary uniquely identifies John as user 2, since John is the only user who has negative degree.
In addition to the degree-based attack that is based on degree knowledge of vertices, the adversary may also know the weights of edges. Such additional adversary knowledge enables the weight-based attack.
An weight-based attack on a social network graph G is one in which the adversary has prior knowledge of the weights σ of edge (v o , v e , σ) which is oriented from a target node v o . Given the anonymized social network graph G′, the adversary’s goal is to successfully re-identify target node v o .
Figure 2(b) shows an example of adversary knowledge for a weight-based attack. The adversary knows the exact ratings upon the opinions which can be denoted as (v{John}, v{Alice}, 3). By matching this knowledge to the released data in Fig. 1, the adversary uniquely identifies Alice as user 1, since other users who have positive opinion on Alice gave lowerratings.
The weight-based attack is a stronger model than the degree-based attack. Thus by giving privacy guarantees against the weight-based attack, we aim to protect against the degree-based attack as well. In the following, we mainly focus on the weight-based attack.
Privacy model
In practice, it is hard to predict the amount of background knowledge that an adversary has gained. Therefore we try to provide protection against the strongest assumption that we have considered, the weight-based attack. To achieve this privacy goal, we adapt the definition of k-anonymity[15] to our model. The conventional k-anonymity model defines quasi-identifier attributes (publicly available information that may be used to identify individuals) and sensitive attributes (private information known only by the individual). These two sets of values are assumed to not be overlapped. In our problem, the vertices, edges, and weights in the released social network data are both quasi-identifiers and sensitive values. To address this problem, we define k-anonymity against weight-based attack asfollows,
Given a social network graph G (V, E), let G′ (V′, E′) be the anonymized graph. We say G′ satisfies k-anonymity against weight-based attack if for every vertex v ∈ V, there are at least k - 1 other vertex v1, …, vk-1 ∈ V such that is isomorphic to . We say that {v} ∪ {v j } is the anonymization group of v.
Intuitively, definition 4 requires that for each user vertex v o , there are at least k - 1 other user vertices that have identical relational graphs to v e in terms of both degrees and weights. Thus the k-anonymity model is effective for defending against the degree-based attack and weight-based attack. In this paper, we present an algorithm for generating a graph G′ that is k-anonymous against the given adversary, but also preserves the utility of the original data.
Anonymization Methods
To protect sensitive information about users in social network data, data sharers should anonymize social network data. For an effective anonymization method, we delineate two goals:
Therefore we propose a novel anonymization method that achieves these goals.
Some previous anonymization techniques [16, 10] take a greedy approach to anonymization, where anonymization groups (see definition 4) are chosen randomly. We design a cluster-based approach that first lay V in social network datasets into groups of similar vertices, and then anonymizes vertices within each group. Our key innovation in the problem setting is that social relations in social network data are signed in the sense of have ties between vertices that can be positive or negative. To anonymize signed social relations, it requires social network data must be "balanced"(introduced in this section). However for most observed signed structure for social network data, the assumption does not hold. We apply an effective method, namely Relaxed Balance, for establishing the partition structure of a signed relation for groups that is as close to exact balance as is possible.
In this section, we outline the general anonymization procedure and provide details of our implementation. The procedure consists of three major steps: (1) Relaxed Balance the data to generate "balanced" partitions of vertices, (2) construct anonymization groups based on our clustering algorithm, and (3) use the graph generalization to anonymize social network data.
Relaxed balance for social network data
Here we introduce the structural balance theory which is due to Cartwright and Harary[17]:
Davis observed that human groups often split into more than two mutually hostile subsets and provided a generalization of Theorem 1. The second structure theorem, due to Davis [18], is as follows,
Consider the balanced social network graph which is described as a matrix in Table 1(a). The structural balance theory partitions the vertices into two subsets. The partition of the vertices is {a - c} and {d - f} with all of the positive edges contained within the subsets and all of the negatives edges linking vertices in distinct subsets. A positive block has only positive or null edges and a negative block has only negative or null edges.
However most social network data are imbalanced; that is, the structural balance theory cannot delineate the known partition. This partition shown in Table 1(b) is inconsistencies with structural balance theorem. The inconsistencies are bolded and the blocks are not identified.
Relaxed Balance(RB) is a proper solution for imbalanced graph and capable to delineate structures as close to the ideal partitions which has been effectively used in the domain of social analysis [19, 20]. Given a social network graph G and partition paramether n, RB outputs the graph G′ with n partitions. Note that although we chose Relaxed Balance for our implementation, partitioning in a top-down way is a general approach that supports any method of imputation.
As a result of RB phase, all vertices have been partitioned into certain sets ({positive, negative, positive, negative, …}), eliminating the imbalanced structural problem. Besides, vertices which contain similar edges will be placed in the same set. Thus, we have high chances of clustering with common rating preferences. This way, fewer clustering will be needed in order to form anonymization groups. In addition, this procedure does not affect the original data, which may still be used to construct the final released anonymized dataset. Yet this phase significantly improves the efficiency of clustering that is based on user similarities which is explained next.
Forming anonymization groups
Our anonymization strategy is to partition users into anonymization groups of size no less than k so that an adversary cannot distinguish individuals within a group. We employ a clustering algorithm which guarantees a minimum cluster size. As a result of the procedure, each cluster has no less than k users with similar preferences.
Definitions
The distance metrics that we use to measure similarity of two graph vertices are defined next.
Given two graph vertices, u and v, we define the similarity between them based on the intuition that two vertices are more similar if they share more common neighbors. The weight of common neighbors can be represented as vectors (u1, …, u n ) and (v1, …, v n ) where n is the size of their common neighbors. Therefore the similarity between vertex u and v is
It is straightforward to verify that sim (u, v) =1 only when {∀ i ∈ n|u i = v i }, and sim (u, v) = -1 only when {∀ i ∈ n|u i = - v i }.
Consider a cluster C of vertices {v1, …, v m }, each v i assigned with a weight vector {w(i,1), …, w(i,n)} to all vertices. Then the virtual center c of cluster C is a weight vector such that .
Our clustering algorithm differs from existing clustering anonymization work (eg., [21]) by computing and utilizing pair-wise similarity values for clustering weighted social graphs. Our clustering algorithm, Merge-Split clustering, works as follows,
The set of vertices C generated by RB step
The parameter of anonymization k
The cluster of vertices C′
1: Initialize vertices to be in the clusters generated by Relaxed Balance;
2: Compute all pair-wise distances between clustercenters (see Definition 5).
1:
2: Select an undersized cluster c whose distanceto its nearest cluster is the shortest, and mergecluster c with its nearest cluster.
3: If the combined cluster c′ of size is no less than 2k, split it into several clusters each of size ≥ k.(see Algorithm 2)
4: Update all relevant cluster distances
5: end while
6: return C′
The proof of theorem 3 is given in the appendix.
Graph Generalization
To defend against both the degree-based attack and weight-based attack, our final step is to generalize the users in each generalization group. Our approach, graph generalization, consists of adding fake edges and weights so that all users within an anonymization group are connected to the same set of vertices with the same weights. Formally, the graph generalization of anonymization group corresponding to cluster C′ of vertices {v1, …, v s } has the two steps. Firstly, we compute cluster centers and add fake edges to create a complete subgraph GC′. Finally, all vertices in each cluster are assigned the weights of cluster centers.
Analysis
In this section, we analyze the complexity and security of our proposed algorithms.
Computational complexity
et n be the number of vertices in the social network graph, k be the minimum allowable cluster size. Let m be the number of clusters in the current partition. Note that, m ≤ n/k. Let μ be the similarity metric being used. A cluster is small if it contains less than k vertices. A cluster is large if it consists of≥2k vertices. We denote by the maximum time to calculate the similarity between any two vertices in the graph G using similarity metric μ. We denote by β G the maximum time it takes to calculate the center of a cluster in G.
We maintain two dynamic data structures. The first is the inter-cluster similarity table which records the similarity between each pair of clusters, and is thus an m × m matrix. The second is the nearest cluster heap list, which for each cluster c stores a heap containing all other clusters, prioritized by their similarty from c which is used in the merge-split algorithm to find the nearest cluster to c. At the begining of the algorithm, we initialize these data structures, which takes and O (m2) time, respectively.
Choosing which cluster to merge by iterating through the current clusters and finding the small cluster with the smallest value at the top of its heap takes O (m) time. After merging those two clusters, which takes O (k) time, we calculate the new center and then update the tables. Calculating the new center takes O (β G ) time. In the inter-cluster similarity table, we nullify the distances for the rows and columns corresponding to the merged clusters and replace it with newly calculated similarities from all clusters to the new merged cluster. This takes time . For each modified entry, we update that value in the corresponding heap, which takes O (m · log m) time, totally. Finally, we create a new heap for the new merged cluster in time O (m). Since finding the two vertices in the cluster of the least similartiy requires calculating all pairwise similarities which takes time. The total run-time is .
We bound the number of iterations for the merge-split algorithm by keeping track of the number of small clusters. During every iteration, the number of small clusters decreases by at least one when the merge procedure is performed. Since splitting only results in clusters of size no less than k, no new small clusters are created. Therefore the algorithm terminates in at most n iterations. According to application-specific constants, we assume the value of parameters as , β G = O (m), and k < n. Therefore we conclude the total run time of the merge-split algorithm is O (n2 · log n).
Privacy analysis
We analyze the guarantee that our method provides both node re-identification privacy and link existence privacy defined in Section 3.1.
As shown in the algorithm of Graph Generalization, fake edges are added among user vertices which prevents the adversary from explicitly determining which edges exist in the original dataset (or furthermore their weights). Assume that all weights between user1 and user2 are independent, both the existence and the value of the weights. Then the confidence with which the adversary can learn the existence of a link is at most 1/k. This claim is stated concisely as follows.
Suppose an adversary has prior knowledge that the probability a user has opinion on user i is p. Obtaining this knowledge is often feasible in practice by learning aggregate information about the dataset. Even if p < 1/k, we claim that with this additional prior knowledge, the adversary cannot significantly improve his confidence that a user has opinion on user i.
Proofs can be found in the Apprendix. Not that the most confidence gain occurs when p = 1/k, at which point the adversary has a 2/k confidence probability. Therefore assuming that p ≤ 1/k, the maximum confidence in predicting the existence of a link is bounded by 2/k. When social network data is typically sparse as p → 0, it is clear that the adversary confidence tends to be 1/k.
Experiments
We run our experiments on Intel Core 2 CPU 2.40 GHz machine with 4 GB memory, and running Ubuntu Linux 12.04. We implement all of our algorithms in C++. Specifically, we want to evaluate the effectiveness and efficiency of anonymization on the social network data, and the amount of information loss incurred by our anonymization algorithms. In this section, we describe our experimental design and results.
Datasets
We first evaluate our techniques against degree-based attack. To have a realistic setting for our experimental evaluation, we use two real world datasets: Enron, arXiv. Dataset Enron 2 is a transaction log from email communications of a company. Each edge represents the user communications posted by a customer. Dataset arXiv 3 is a collaboration network on general relativity and quantum cosmology from Stanford university.
For the evaluation of our methods for anonymizing edge weights, we also use the R-MAT generator to generate synthetic datasets. R-MAT can generate graphs with power law vertex degree distribution and small-world characteristic, which are the most important properties for many real-world social networks [8, 25]. The characteristics of each dataset are detailed in Table 2. Note that, r represents ratings made by individuals. We modify the generator of edge weights which was set in the range [0, 100] as default.
Experimental setup
We evaluate our algorithms with respect to three performance factors: (a) execution time, (b) memory requirement, (c) information loss. We do not measure the I/O cost explicitly, since all algorithms are CPU-bounded for the high complexity of grouping similar nodes into clusters with a minimum size constraint. The memory requirements are dominated by the two dynamic data structures introduced in Section 5.1, so we report the memory usage in terms of clusters generated.
All privacy-preserving transformations cause information loss, which must be minimized in order to keep the capability to extract meaningful information in the released data. A variety of information loss metrics have been proposed. The Degree Centrality (DC) [26] provides the easiest way of measuring the cardinality of the anonymized groups. Nodes with a high DC value imply more popularity and leadership [26]. Watts and Strogatz [27] propose several accurate metrics, including Shortest Path Distribution and Infectiousness.
In order to show the utility of the anonymous graph after degree and edge weight anonymization, Yuan et al. [12] try to measure the distance between the distributions of edge weights in the original graph and the anonymous graph. They compute Earth Mover Distance which are often used to measure the similarity of multi-dimensional objects, while Hao et al. [13] employ Manhattan Distance and Cosine Similarity. We introduce an additional (more intuitive) information loss measure, which reflects the impact of anonymization to the result of knowledge discovery on the anonymized graph, Normalized Distance Penalty (NDP). In the experimental section of the paper, unless otherwise stated, we will use the generic measures when referring to information loss of degree anonymization. NDP is only used in our experimental evaluation to access the quality of our anonymization techniques against weight-based attack.
Let v be a vertex in social graph G,
The information loss of particular anonymization ranges from 0 to 1 and can be easily measured.
We investigate how our algorithms behave in comparison with two alternative methods: (a) the iterative method using relaxed balance, (b) the immediate method using our merge-split algorithm. The iterative ralaxed balance method in our experiments can be described as follows: (1) consider an optimal partition with m clusters(m > 0), (2) partition graph nodes into m clusters, (3) set m = m + 1 and repeat step 2 until any cluster of size violates the minimum constraint k. The immediate merge-split method applies our merge-split algorithm directly on the social network data.
We first evaluate the information loss of the anonymous graph by calculating several popular metrics. Degree Centrality shows the distribution of degree frequencies. The global metric Shortest Path Distribution measures the length of the shortest paths between each two graph vertices. The functional metric infectiousness measures the fraction of graph nodes that are infected by a disease which starts to spread from a random vertex. The infection will not stop until the disease die out or all graph nodes are infected.
We analyze the genetic metrics in detail. Figures 3(b) and 4(b) show that the anonymous graph has larger deviations from the original graph when the number of graph nodes is relatively small. This is because our method modify edges weights in small graphs more frequently to satisfy k-anonymity, which may greatly change the graph topology. Figures 3(c) and 4(c) show that the anonymous graphs have more short paths than the original graphs. It is expected because our method creates new edges and weights. Considering infectiousness in figs. 3(d) and 4(d), the anonymous graphs share very similar trend with the original graphs. Most of the social network graphs (following power-law distributions) are very sparse. That a small part of edges are infected will not result in dramatic infections in a social network graph.
Figures 5 and 6 show how the computational cost and memory requirements of the algorithms scale with the increase of the parameter k. The fig. 3 show that the IMS and AA have identical performance on time, which is greatly affected by the value of k. This is due to the fact that the performance of this method rely on the size of dynamic structures used in clustering procedures. On the other hand, IRB is initially much faster compared to the other methods and converges to their cost as k increases. This is expected, because as k increases clusters with less than i vertices for i < k become infrequent. As a result, IMS is not able to prune the space of clusters with i vertices to be included in the clustering procedure at the pre-processing procedure(Relexed Balance). This is also evident from the memory requirements of IMS. As we see in fig. 6, IMS have a realistic problem on space allocation at the initial stage. Note that, IRB is not plotted since IRB runs without the storage structure. Figure 7 shows the information loss incurred by the three methods in this experiment. Note that the IMS and AA have achieve the same loss, which indicates the ability of Merge-Split algorithm in finding the optimal solution. This behavior is a result of the limited vertices numbers, which leads to a relatively small solution space. In this space, the IMS and AA manage to find the same, optimal solution.
Overall, the most efficient method of the three is AA. It requires significantly less memory and run time than the other algorithms and incurs an information loss similar to the information loss of IMS. The superiority of AA lies in the fact that we have better control over the partitioning procedure of anonymization space, which is affected mainly by the vertex combinations considered at the anonymization. This number is significantly smaller at the clustering procedure of AA, which is well-balanced according to this factor, compared to the IMS and of course the full space partitioned by IRB. However in a few cases, especially when k is small, AA incurs more information loss than IMS. The reason for this observation is that RB produce clusters by partitioning vertices into two parts(like positive and negative), whereas IMS always produce clusters of size between k and 2k. As the cluster size increases, the similarity metric of cluster centors decrease. On the other hand, larger clusters are likely to provide better privacy protections as the crowd of similar individuals gets larger. Thus, these sets of experiments show a trade-off between clustering performance and data utility. In the cases when k increases, a few large clusters begin to dominate the overall information loss. In general, AA performs well for different cluster sizes and graph characteristics.
Conclusion
In this paper, we studied the privacy preserving publishing of social network data. We defined the novel concept of k-anonymity for such data and analyzed possible solutions. Based on our analysis, we developed a graph anonymization approach. We developed a graph clustering algorithm which is not practical for large, realistic datasets. In view of this, our proposed approach employed a relaxed-balance step to lower computational cost and space requirements. We showed that the AA algorithm could effective group similar vertices on large social network graph to protect individuals’ privacy while minimizing information loss.
For future work, we decide to study a different approach to protect social network data from social relationship attack. Instead of averaging weights of edges in the graph generalization, we can try to permute the weights for each vertex within each anonymization group [28]. Another possible avenue of research is to investigate the use of random perturbation with our proposed method to guarantee differential privacy [29]. These ideas may be better to serve some datasets, but it remains to be seen how these approaches will affect NDP or other utility measures. In addition, we will conduct experiments to investigate the effectiveness in achieving other utility goals.
Acknowledgments
We would like to thank the anonymous referees for providing us with constructive comments and suggestions. This work was partially supported by the National Key Technology R&D Program Foundation of China (No. 2013BAH44F01), the Joint Funds of the Ministry of Education of China and China Mobile (No. MCM20122081), and the Key Technology Foundation of Xiamen Municipal (No. 3502Z20133011).
Appendix
A. Proofs of Theorem 3
Let N be the number of undersized clusters. The merge-split algorithm terminates when there is no undersized clusters (N = 0). At each iteration, three cases can happen and N decreases in all three cases. If two undersized clusters are merged into one undersized cluster, then N = N - 1. If two undersized clusters are merged into one cluster whose size is no less than k, then N = N - 2. If one undersized cluster and one cluster whose size is not less than k is merged, then N = N - 1.
Thus, each iteration strictly reduces the number of undersized clusters and the algorithm converges in a finite number of iterations.
B. Proofs of Theorem 6
Suppose an adversary is able to identify user o in a certain anonymization group. Based on the existence of a link to user e, the adversary would like to infer the identity of user e. However the existence of this link in the anonymized graph implies at least one user has rated user o which can be devided into two cases asfollows.
Let Pr (u, o) be the probability that user u has rated user o. if user o and e belong to the same anonymization group whose size is p, Pr (u, o) =1/p. Since p ≥ k, Pr (u, o) ≤1/k. if user o belongs to another anonymization group, we can make the same conclusion as case 1.
Thus, the adversary can only infer that user u has rated user o with probability at most 1/k.
Let Pr (A, o) be the unconditional probability that users in anonymization group A has rated user o. Given the probability Pr (A, o), we wish to prove .
Due to Bayes’ rule, Pr ((u, o) | (A, o)) = Pr ((A, o) | (u, o)) ·Pr (u, o)/Pr (A, o). Note that Pr ((A, o) | (u, o)) = 1. Since each user rates users independently with the probability p,
Finally, we can get the following results:
