The increasing population of online communication and telecommunication has interested scholars and researchers considering their social networks. These social networks datasets play an exceptionally important role in the research of data mining. However, large amounts of social network data are produced by using social networking applications. And these data inevitably contain a large amount of personal privacy information. Therefore, in order to avoid disclosure of privacy, the data holders need adopt privacy protection before these data are released. Furthermore, most current methods of privacy protection are based on the simple graph only. The weight values on the edges represent the tightness between the nodes. The algorithm based on weights in privacy protection field is still relatively rare. In real social networks, the weight can indicate tightness between two individuals of social relations. The weight may be as attackersâ background knowledge to re-identify the target individual and lead to loss of privacy. In this paper, we consider protecting the weighted social networks from weight-based attacks and propose a method based on the weighted social networks, named -weighted generalization anonymity (KWGA). And This method combines -anonymous with generalization method to ensure the security of the social network data when it is published. In order to ensure the higher validity of privacy protection, this paper introduces a concept of the weight difference to reduce the modification for weight graph. Finally, we firstly use real dataset C-DBLP to verify the validity of our method perform much better than the Fast -degree anonymity (FKDA) in average clustering coefficient (ACC), global clustering coefficient (GCC), average path length (APL) and rate of edges change four aspects. Furthermore, we also use two common datasets in the research of weighted graphs to verify our algorithm.
Social network is an interconnected social roles set. And the social network is based on “network” (mutual connections between nodes) instead of “groups” (clear boundaries and order) forms of social organization. This analytical perspective began from Western sociology in the 1960s. In the book of “Networked: The New Social Operating System” published in 2012 [1], the social network revolution was seemed as the one of third revolutions for affecting human society in the new period. The increasing popularity of the social network data has sparked a fertile research area in information extraction and data mining [2, 3, 4, 5], which applies various fields such as sociology, marketing, biomedicine and counterterrorism. As a consequence, these data should be published for research. However, social networks data always contain individuals’ sensitive information [6, 7, 8]. This information may be disclosed through some background knowledge.
At present, most of the privacy protection studies about data publishing in the social networks are mainly for simple graphs (i.e. undirected, unweighted and loop less graphs), which consider graph structure as adversaries’ background knowledge [9, 10, 11, 12, 13]. However, weighted graphs are more common in real world than simple graphs. The weight values on the edges represent the tightness between the nodes. And there are few works concentrating on edge weighted networks [14, 15, 16, 17]. The weighted network can be used for analyzing the formation of communities in the network [34], targeted marketing and advertising [35], in addition to the traditional applications on weighted graphs such as shortest path length and -Nearest Neighbors (kNN) and so on. Depending on the application, the edge weight information can semantically represent degree of friendship, trustworthiness, behavior and so on [14] and there exists numerous weight related information that can be used to attack a victim’s privacy.
In a real social network NetSci [32], 6.48% of nodes could be uniquely identified based on weight sets information, therefore, Li and Shen consider a histogram attack proposed in [15]. When the weight network data is published, the weight value may be used as background knowledge of the attackers, thus lead to the disclosure of privacy. An example is shown in Fig. 1, the nodes represent social individuals, the edges represent relationships between a pair of nodes and the weights represent the closeness between nodes.
Examples of degree anonymity in weight graph.
Figure 1 shows two -degree anonymity graphs according to the model proposed in paper [18], Fig. 1a is a 3-degree anonymous graph that there are three nodes with the same degree and Fig. 1b is a 2-degree anonymous graph that there are two nodes in each group, {, } and {, } with the same degree. However, the edge weights become some extra information for the nodes. For example, the node A in Fig. 1a is an unique node with weight set [1, 1], which causes an identity disclosure problem. Conversely, the nodes in Fig. 1b are not attacked by the degree and weight values, because all nodes have the same weight set with other node.
In this paper, we focus on protecting sensitive identities of individuals in a graph from background knowledge attacks. Our goal is to protect the privacy information of a certain author’s from being re-identified by an attacker, and to find this sensitive information and make some anonymous operations. In order to complete this goal, we propose a -weighted generalization anonymity (KWGA) method to protect the edge weights, and to change the structure and generalized edge weights of the original social networks as little as possible.
To summarize, we made the following contributions:
C-DBLP dataset processed by adding weights. The original C-DBLP data contains only the information of authors and co-authorship information between a pair of authors. We obtained the additional information by crawling from web pages. Furthermore, we built the data compiled model and calculated the right edge weight values for the social network;
We propose -weighted generalization anonymity (KWGA) model for privacy protection in weighted social networks based on the -degree anonymity. We want to anti attacks through this model by the weight values. Two of the key points in this problem are as follows: 1) creating the weight anonymous group, we first propose a concept of the weight difference for selecting boundary nodes for anonymous group; 2) designing the edge weight anonymous model within groups.
We apply our algorithm on real Chinese dataset C-DBLP to show the effectiveness of our method in the real world weighted social network. We compare our algorithm with FKDA [18] in C-DBLP dataset which we compiled. And we compare our KWGA with FKDA on other two common real social network datasets NetSci and Hep-Th which compiled by Mark Newman [19, 20, 21]. The experiments results show our KWGA is better than FKDA.
The rest of this article is organized as follows. Section 2 is a summary of the related works. In Section 3, we present some problems of definition and standardization of our anonymous method -weighted generalization of anonymity. Our algorithm based on the weight is described in detail in Section 4. Section 5 shows the progress of dataset compiled. The results and analysis of experiments are displayed in Section 6. Finally, Section 7 concludes this paper.
Related works
In recent years, the social network data brings great convenience for sharing information between various organization and scientific researches. How to ensure personal privacy information not revealed and published data with a high utility has attracted many scholars. Most of the previous studies about privacy protection for social network can model social network data on undirected and simple graphs only [9, 10, 11, 12, 13]. And a small number of researchers are modeling the data as bipartite graphs [22] or directed graphs [23]. Backstrom et al. [9] firstly proposed the privacy protection based on social network. In the studies of social network structure, the best one scientist is Mark Newman. He proposed and analyzed weight structure features of the social network [19, 20, 21].
Most of the existing methods to protect privacy of social network have focused on graph anonymous, which considers the topological structure. Especially, these methods mainly abstract the data into undirected and unweighted simple graphs. This type of protection methods mainly focus on preventing the structure of origin graphs as attackers’ background knowledge to re-identify the target. This kind of privacy protection of social networks can be divided into two categories. One is based on clustering methods, and the other is based on modifying the origin graphs.
Privacy protection based on clustering
Hay et al. [10] gave a special type of attack, node re-identifies attack. They proposed an anonymous method based on node clustering to prevent this attack. They clustered some nodes which are similar to an anonymous group and then aggregated these nodes as a super node. This method mimicked the privacy protection K-anonymity model of relational data by Sweeney [11]. Zheleva and Getoor [12] proposed a clustering method based on edges. This method is carried out from the perspective of sensitive edges researches. Campan and Truta [13] proposed an anonymous method which combined node clustering and edge clustering.
Privacy protection based on graph modified
Hay et al. [24] proposed to remove some existing edges in the original social network graph, and then add some other edges to the original graph. The two steps could satisfy the nodes in the published graph are not the only one, which made it impossible for an attacker to re-identify the target node. Zhou and Pei [17] introduced the idea of greedy algorithm to hide target node in social network graph. Zou et al. prevented the target individuals are re-identified by attackers who uses structure information through adding edges, deleting edges to achieve -Isomorphism anonymous [25] and -Automorphism anonymous [26] and -Symmetry anonymous [27]. Ma et al. [28] proposed a method, KDVEM, to protect the privacy by vertex and edge modification.
For the traditional privacy protection methods, most of them are relied on unweighted graphs. However, the weighted social networks have attracted the attention of scholars in recently studies [14, 15, 16, 17, 23, 24]. Das et al. [14] use a linear programming model to modify the weight value to satisfy anonymous for social networks. Li and Shen [15] put forward a -histogram anonymization model by modifying weight values in a nodes group as a same value for privacy protection in social network. Liu et al. [16] define a threshold value and maintain the range of disturbance within this threshold. The aim is to protect privacy information as well as the shortest path length in the process of anonymous for social network. Zhou and Pei [17] propose an anonymous method to against neighborhood attacks based on weights which are the strength of relationship between a pair of friends. Chen et al. [30] consider protecting sensitive labels in weighted social networks. In their algorithm, they combined -diversity for sensitive labels and -anonymity for weight values for social networks. However, these methods still have some problems. For example, these method all changed directly the weight value in the weight set and the research for Chinese collaboration network is not yet been proposed. In this paper, we propose a new method to anonymize the original social networks graph. In our method, we propose a new grouping strategy based on weight difference and edges addition. And then we propose a generalization method to anonymize the node in each group based on degree and weight set.
Basic problem definitions
In this paper, we abstract the collaboration network as an undirected weighted graph , where represents a nodes set, co-author set, is the number of nodes; represents a collection of edges between nodes in , that is co-authorship between the two authors, is the number of edges; represents a weight set of the edges in , represents weight value of , that is intensity of co-authorship between a pair of authors.
Definition 1. (Edge set) For each node , we define the edge set for node as a set of the all edges for node , denoted by .
For instance, in Fig. 1a, the edge set for node , , respectively are , , .
Definition 2. (Degree sequence) For each node , we define the degree sequence as a descending sort order degree set by the value of degrees.
For instance, in Fig. 1b, for node , the degree sequence is .
Definition 3. (Weight set) For each node , we define the weight set for node as a set of the weights on all edges connecting node , denoted by . We obtain a sequence of weights by descending sort order this set by the value of weights, denoted by where represents the weight set of .
In the real world cases, the weight values for different edges may be the same, therefore, we can use the complete weight sets to represent the weight for a node in graph . In this paper, we obtain the weights for every edge by computing the strength of a pair of coauthors in papers.
For instance, the weight sets for the nodes in Fig. 1b, based on the Definition 3, we get , , , .
Definition 4. (-degree anonymity) A graph is -degree anonymity if for every node in , there exist at least other nodes in this graph which have the same degree with .
Definition 5. (-weighted generalization anonymity) A graph is -weighted generalization anonymity if for every node in , there exist at least other nodes in this graph which have the same generalization weight set with .
Definition 6. (Weight difference) For two node and in graph , the weight difference
where is a node in the current degree and is a node in the with the next sequence degree, the - weight in the weight set of nodes . We calculate the weight difference for several nodes with the same degree and select the node which has the smallest weight difference to join the current group, then, the remaining node put into the next group. Our objective is to transform the original graph to a graph which satisfies -weighted generalization anonymity, and the utility of the anonymous graph is good.
Problem definition: (graph anonymization) Given a weighted graph , and an integer , we need find weight generalizations that transform to a -weighted anonymity graph .
-Weight generalization anonymity
In this section, we introduce our anonymous method in detail. Based on the previous part of definition model and -weighted generalization anonymity, we propose an anonymous algorithm to solve weight leakage problem and give explanations of how to protect the original collaborate networks. Based on the -degree anonymity, we combined the weight value with it. We firstly use a weight difference to decide the node which should be added into current group. And we learn generalization method from Sweeney [11], because that the weight value can be as a special attribution and the values can be change as a mathematic interval. Given an original collaboration graph , nodes in represent authors; edges in are undirected and represent co-authorships between a pair of authors; the weight values on the edges represent the tightness between the co-authors. The goal of the algorithm is to find an anonymous collaboration network graph to satisfy the requirement of -weighted generalization anonymity. Finally, we hide all authors in the graph and make it impossible for an attacker to re-identify the target individual.
We define our algorithm as two step strategy approach. The first step is to modify the original graph as little as possible and then achieve some potential anonymous groups to satisfy -anonymity. The second step is to generalize the weight value in each anonymous group. We will explain our algorithm in detail use Fig. 2. Figure 2 can represent a small social network, such as a friendship network with six members. If two members in this network have mutual friends, they will connect to each other. For every node, the degree represents the number of friends and the weights represent the number of mutual friends between a pair of members. Figure 2a is an original graph from a social network. Figure 2b is a graph which satisfies 2-weighted generalization anonymity.
Examples for explaining our algorithm.
Create anonymous groups
How to build utility anonymous groups is the first problem to achieve the ultimate anonymous graph. At first, we should introduce a method of creating anonymous groups which combines node degree with edge weights. We group all nodes according to the node degree based on the ideas of clustering, and confirm border elements within the group according to weight differences. According to the above methods, the node in same anonymous group will have the highest similarity. In this section, we put forward a concept called weight difference to select suitable nodes for anonymous groups.
For all nodes in each graph, all nodes are sorted in descending order, access node sequences, represents the degree of node . For each node, we sort the weight values in descending order for every weight set, then, each node will have a weight set sequence, . In order to make every group has the basic conditions to satisfy anonymity, we will guarantee each group having at least nodes according to the parameter . First of all, we obtain the number of groups based on total number of nodes , and anonymous parameter . Each group has the same requirements for the number of nodes. Therefore, the node that has less degree than the other, they have to be modified by adding some new edges. For example, the maximum degree in the group is , , is the number of nodes in the anonymous group , and the node requires new edges. In Fig. 2a, we can obtain the degree sequence, {4, 3, 3, 2, 2, 2}, we give is 2. Therefore, we will divide them into 3 groups, 6/2 3.
In the group division process, there may be fewer nodes needed than the number of nodes which have the same degree in the sequence with next degree. In Fig. 2a, for the , B with degree 4 must be in it and the group need other node with degree 3, but there are two nodes with degree 3. At this point, we use Eq. (2) to select the smallest weight difference boundary node.
is a node in the current degree sequence and is a node with the next degree sequence, represents the -th weight value in the weight set of nodes .
We calculate the weight difference for several nodes with the same degree and select the node which has the smallest weight difference based on Eq. (1) to join the current group, then, the remaining node put into the next group. For Fig. 2a, the first node is , we will select one of node and node based on the weight difference formula.
Thus we select node into group . In the same way, the all anonymous groups are shown as follows, , , . The algorithm for creating anonymous groups is shown in Algorithm 1.
[!h] Generate anonymous groupsInput: An original graph and an integer parameter ;
Output: All anonymization groups ; [1] get a degree set D from the original graph, then sorted all degrees in this set, ; get the number of anonymization groups, , and the group set ,
=1 to do the rest number of the node required in the current the number of nodes with the next degree; compute the weight distance between the last node in current based on the ; select the smallest distance, then add the corresponding node into current ; return set.
Weight generalized
Owing to the uncertainty increased can improve the effectiveness of privacy protection in anonymous social networks. Simultaneously, we consider the generalization of value to make the weight value uncertainty to protect the individual privacy. Generalizing the weight value is the second step in whole anonymous method. The goal of this step is to achieve the final anonymous social network by weight generalization in all anonymous groups. We will consider two general situations in this step, one is generalizing by finding candidate nodes in the same group, and the other is generalizing by finding candidate nodes from other groups.
The first type is a generalization in one group. If the degrees of nodes in an anonymous group are the same, we will give a priority this method. In the group which need anonymous, we arrange a mark value to every weight. Then, each weight value in weight set will match up to the first node in corresponding group. If the values are the same, that is to say these weights match succeed and these two weights will be marked, . This matching progress continues until all weight values of node matching is done. Then, the weight value do not need to generalize which has . And conversely, the weights which have need to change their weights. We will select the nodes which have the minimum gap between a pair of weight value in weight set of the max degree node to select generalized value. Based on this two values, we get a generalization interval to represent the two original edge weights, and give mark to this two weights. And finally we add Marks for this two nodes, that is to say they have completed generalization. For instance, for the third group, , their weight are 2, so this group belong to the first type. , , for each edges weights, weight values match succeed and set . In the rest weights set, the maximum weight is 3 and the minimum weight is 1. We get a generalization interval . Finally, the two edge weights change to , .
The second type generalization is finding candidate nodes from other groups. In a potential anonymous group, if all nodes do not exactly have the same degree, we will add the degree for the nodes who have smaller degree to satisfy -anonymity. We think that if the solution contains both adding and removing edges, the distance from original network will be larger. In our algorithm, we consider edges and weights, removing may cause same real weights to disappear. Therefore, we selected adding edges. we will add edges for node to reach up to , , is the number of nodes in the anonymous group . In Fig. 2a, for the first grope, , the degree of node is large than . At this point, the nodes in group which are not reaching the maximum degree will seek for the candidate nodes. Our algorithm chooses candidate nodes in reverse order that give a priority to select the node which has smallest degree in the group. Furthermore, we consider the smaller nodes in the same group firstly when choosing candidate nodes, namely the same group priority strategies. If the other nodes in the same group reach up to the maximum degree or they are already neighbor, we will consider these nodes in the next group. As well as the preceding, we find the node which has the smallest degree to be a candidate node. We add a new edge to link the candidate node. Then, we deal with this new group based on the first type generalization method. According to weights matching, we arrange the lack weight for new edge. If all old edges for the node which need add degree can match succeed with the node which has the max degree, we will arrange the weight with for new edge. And if not only one old edge with , we will compare the every weight this two nodes and select a minimum gap weight for every weight, then arrange the rest weight to the new edge. For , node requires to add one new edge for reaching the degree with node . In , node has smaller degree than node , therefore, we check the existence of edge between and . The result is that the edge is not existence. Then, we link and , and give a weight value to it based on the node . Finally, we get a 2-weighted generalization graph as Fig. 2b.
The worst situation is that the nodes still cannot reach up to when the preceding step completed. We have to add new node to be as candidate node and repeat the above steps. The generalization algorithm is shown in Algorithm 2.
Weight Generalized Input: All alternative anonymous groups GP in the previous step; Output:-weighted generalization anonymous ;[1] each group in GP do the degree of all nodes in GP are the same do , and all weights set ; to the number of nodes in the current GP do matching with , set , generalized each weight with , , ,
find candidate nodes in current GP, the node whose degree is smallest and this two nodes have no connection; Or find candidate nodes in the other ; Or add new nodes and connect with them; arrange the rest weights which do not match for the new edges; return k-weighted generalization anonymous .
Dataset processing
The C-DBLP data is a Chinese collaboration network, and there are four sub sets based on it, computer science, physics, law and economic. These datasets only contain the connection information for edges and nodes information. This C-DBLP dataset only can reflect that the relationship exist between authors. For collaboration networks, the strength of relationship between a pair of co-authors is also important. This closeness relationship plays an important role for privacy protection and community discovery in collaboration networks. Thus, the edge weight can be as an evaluation for the strength of co-authorship. At the present, weighted collaboration network datasets are most from Mark Newman [19, 20, 21]. He added weight to each edge in the original dataset based on his algorithm, such as HepTh, Cond-Mat, Actro and so on. On his personal home page, he posted on a number of compiled weights collaboration network datasets. In this paper, we use one dataset from us and the other two from Newman.
C-DBLP is a database system for the publications developed by Chinese authors (http://c-dblp.cn), such as authoritative journals, conference papers and so on. The C-DBLP system developed similar to DBLP system, which is an author database system based on the English literature in the field of computer. According to this system, researchers generated two datasets, one for authors and the other for references [31]. And the C-DBLP dataset was used to detect structural-connected communities in [33]. We obtain C-DBLP weight dataset based on C-DBLP data through a 4-step method.
Basic datasets split
Original datasets contains three pieces of information, Nodes, Edges and Triangles. We extract Nodes and Edges information from original dataset to use.
Author’s information crawls
Based on the original data, we cannot be informed of the number of collaborators in per paper, therefore, we must get every author’s information on the basis of author’s name from the ScholarSpace (C-DBLP) Website. We crawl the authors’ information from the ScholarSpace (C-DBLP) Website according to authors’ name.
Get collaborators list
When calculating edges weights, we need to carry out collaborators’ name matching in the collaborator files, thus we must know what collaborators each author has.
Edge weights calculate
We compute the weight value by the method of Mark Newman [19, 20, 21]. In collaboration networks, researchers usually use the amount of collaborative papers between a pair of authors as their connection weight. If a pair of authors collaborate one paper, the weight of an edge will be assigned straightly to 1, which is not very realistic for the real co-authorship intensity. The reason is that the authors in per paper are not only these two authors, but most of the time another authors are also in the papers. The traditional method does not consider the tendency for collaborators in large group of authors not to know one another, or to know one another less well.
For this case, Mark Newman put forward a method to evaluate collaborative ties [19, 20, 21]. We apply it to the C-DBLP dataset. We assumed on the writing of a Chinese paper that has n authors, namely, one author has coauthors on the paper. Then we assume that the familiarity between this author and every collaborator is the same, therefore, we assign the weights of 1 to collaborators. As though only one collaborate, every co-author has on average, but it is clear that this is an approximate value. However, it is more practical than the other method to calculate weights.
In reality, most authors have not only one paper and not only one coauthor. Therefore, it is more likely that two authors know each other while they collaborated on many papers better than those who have few collaborative papers. The weight between authors is Eq. (3):
If author is the author of paper , then is 1, otherwise it is 0, is the number of coauthors of paper and we exclude from our datasets all single-author paper. N is the number of papers in this dataset. is the edges weight which represents the strength of collaboration between author and .
The next problem is how to get the appropriate data for calculation. We obtain the statistical number of authors on per paper based on collaborator file and author’s paper list file and take them into Eq. (2) for calculation, then stored the result to the corresponding weight file.
Data format changes
In the present study, the common format of weight and label datasets for social networks is GML format. We get this GML format dataset by reading the basic node information in the underlying data and weight file.
Experiments
In this section, we conduct several experiments to evaluate our algorithm.
Datasets
We use the real collaboration network datasets in C-DBLP to evaluate our algorithm. And we expend our experiments to two common datasets which compiled by Mark Newman to verify us performance. These data contain NetSci and Hep-Th.
The first dataset is C-DBLP, and the original data contains four sub sets which represent four different fields, computer science, physics, law and economics. And every sub set has the co-authorship information about 1,000 scholars. When an author do not have any co-author, he will be an noise node. Fortunately, there is not noise node for four sub set in C-DBLP dataset. The properties of data are shown in Table 1.
Dataset properties
Computer
Physics
Law
Economics
Nodes in largest connected component subgraphs
1000
1000
1000
1000
Edges in largest connected component subgraphs
2443
4051
2423
1435
Average clustering coefficient
0.444
0.585
0.363
0.297
Diameter
13
14
20
28
Average path length
5.142
5.230
6.270
11.039
The second dataset is NetSci. This data contains a collaboration network of scientists working on network theory and experiment, as compiled by Mark Newman [19, 20, 21]. The version given here contains all components of the network which published in [32]. This data includes a total of 1589 scientists and 2742 edges.
The third dataset is Hep-Th. This data contains the collaboration network of scientists posting preprints on the high-energy theory archive at www.arxiv.org, 1995–1999, as compiled by Mark Newman [19, 20, 21]. The network is weighted, with weights assigned as described in the original papers. This dataset contains 8361 nodes and 15751 edges.
Evaluation measures
The evaluation criterions of my algorithm are mainly based on the structure characteristics of the social networks. The evaluation criterions in the field of unsupervised learning do not apply to our algorithm. Because the published social networks play an important role for the study of the overall structure of the social network and local structure characteristic, such as the analysis of the social networks center, the social network interest center and so on. In friendship networks, we may need to analyze the closeness relationship between friends. Therefore, we firstly need to analyze the node degree distribution in the social networks that we can see the characteristics of the relationship between the friends in the networks. In the current research for social networks, the mainly indicators include clustering coefficient, average path length, social networks center, community structure and so on. In this paper, we choose average clustering coefficient, global clustering coefficient, average path length and the rate of edges change as the evaluation criterion for my algorithm.
The concept of clustering coefficient which represents the aggregation of the node is originally from graph theory. Watts and Strogatz put forward the average clustering coefficient which is the average value of the local clustering coefficient for all nodes in the graph. In the real social networks graphs, especially in some particular social networks, the nodes which always tend to have the closer relationship for the structure, then they will have a larger clustering coefficient. The concept of “small world” was proposed from the clustering coefficient. It refers that the average shortest path length is close to the corresponding random graph, but the average clustering coefficient is significantly larger than the random graph. This phenomenon reflects the close of relationship between some nodes in social network. We calculate the average clustering coefficient based on the Eq. (4), as follows.
Where represents the number of edges which contain node , that is the number of the close triples which contain node , represents the number of neighbors with , that is the node degree of .
Global clustering coefficient is the rate of the close triples in all triples and the range of it between 0 and 1. When there is no triangle in graphs, the global clustering coefficient is 0. It is said that this social network does not have two individuals know each other. When the graph is a complete graph, the global clustering coefficient is 1. It said that each two individuals in this graph all know each other. In undirected graph, the global clustering coefficient is defined as shown in the Eq. (5).
Where represents the number of triangles and represents the number of connected triples of nodes.
The average path length is the shortest distance from one node to other nodes in the graph. And the average shortest path length in social networks is the average value of the shortest path length for each reachable node in the social networks. In undirected graph, the average path length is defined as shown in the Eq. (6).
Where represents the set of reachable nodes in social networks graph; nodes and represent a pair of reachable nodes; represents the average shortest path length between node and node .
Utility
We compare the utilities of our algorithm -weight generalization anonymity (KWGA) with the Fast -degree anonymity (FKDA) [18] in our experiments. In this paper, we use C-DBLP, NetSci and Hep-Th datasets to verify the utility of my algorithm.
For the C-DBLP dataset, the result of experiments is as follows. If a graph is -degree anonymity, for each node in this graph, there exist or more nodes that have the same degree. And similarly, if a graph is -weight generalization anonymity, for each node in this graph, there exist at least nodes that have the same weight set. The results of K-weighted generalization anonymity and the Fast -degree anonymity based on ACC, GCC and APL are shown from Figs 3–6.
The evaluating indicators contain average clustering coefficient, global clustering coefficient and average path length. The baseline represents the measure of the original data. The results are closer to the baseline, the utility is better. From all result we can see, the most results in -weighted generalization anonymity are better.
ACC, GCC, APL for different in computer dataset.
ACC, GCC, APL for different in physics dataset.
ACC, GCC, APL for different in law dataset.
Figures 3a,4a,5a and 6a list the average clustering coefficient of the anonymized and original graphs with different on four graphs. From these figures, we can observe that our approach is good in overall performance, but some perturbation appeared, such as in economics. But in computer data, our approach performs better only as the is larger than 40. This is because the distribution of degrees in computer graph is large and decentralized. When the is small, we need add more edges to original graph based on our approach.
Figures 3b,4b,5b and6b show the global clustering coefficient of the anonymized and original graphs with different . The GCC value of the anonymized graph tends to become smaller on these four datasets. This is because the transitivity of the original graphs is relatively higher. When adding new edges to original graph for anonymous, the triangles form slower than connected triples. Our approach performs better than the FKDA in computer with all varying . While in physics, law and economics, our approach only performs better as the is comparatively small. The reason is that these datasets are small and the number of nodes who has large degree is small.
Figures 3c,4c,5c and 6c detail the average path length between a pair of nodes of the anonymized and original graphs with different . After anonymization, all figures show the trend that APL value becomes shorter than original graph. This is because new edges are added to the original graph. As expected, the APL value decreases since new edges are increased. The result shows that our approach performs better than , except law graph. From our perspective, the law dataset has a large range of degree distribution and little number of nodes who have large degree.
ACC, GCC, APL for different in economics dataset.
ACC, GCC, APL for different in NetSci dataset.
ACC, GCC, APL for different in Hep-Th dataset.
For the dataset NetSci and Hep-Th, the results of experiment are shown as follows from Figs 7 and 8. For these two datasets, we can see our algorithm shows better than the -degree anonymity in ACC, GCC and APL. In all figures, we can know that the results of our algorithm are closer to the original graph, that is to say, our algorithm can protect original social networks with fewer changes. Based on the Figs 7 and 8, our method is more suitable for large graph.
Obviously, in some cases, the performance of our algorithm is worse than FKDA, but the gap is very small. In view of some differences in evaluation standard is too small, we counted the differences between each value and the original value, then we obtain a sum of ten difference values for ten values, the result is shown in Table 2.
In Table 2, the smaller values are bold, that is to say the results are better. We can see that in all differences values, only two results for our algorithm performed worse than the FKDA algorithm, the GCC values of physics and law. These results are mainly because that we create anonymous groups according to and FKDA based on the node degrees. And the increase of new edges based on our algorithm with the increase of , the open triples increase faster than the close triples.
The finally experiment compares the rate of edges change between KWGA and FKDA. Due to two methods are all by adding edges to protect privacy for social networks, the rate of edges change can be an evaluation criterion. The results are shown as follows in Fig. 9.
From the figures in Fig. 9, we can see the rate of edges change for our method is much smaller than the FKDA. Generally speaking, when increases from 5 to 50, the curves for KWGA and FKDA are increases, too. From our perspective, the bigger is, the more edges need to add, especially, the first anonymous group requires more edges. The reason is that more nodes need add edges to reach the max degree in group.
The different values in different datasets
Computer
Physics
Law
Economics
NetSci
HepTh
ACC
KWGA
0.1131
0.3050
0.2707
0.0591
0.2645
0.0188
FKDA
0.1187
0.3286
0.4065
0.1055
0.3459
0.0614
GCC
KWGA
0.1814
0.8775
1.1683
0.5747
1.6315
0.2861
FKDA
0.2468
0.6692
0.4899
0.5946
1.7396
0.3567
APL
KWGA
3.5568
6.9093
9.6350
25.7243
10.9357
4.9775
FKDA
5.2890
8.2650
12.3473
30.9016
13.0840
6.5660
Rate of edges change for different in C-DBLP, NetSci and Hep-Th datasets.
Our algorithm considers the edges weight values and weight difference, but the FKDA do not do this. Therefore, our algorithm, KWGA, performs much better than the FKDA in the rate of edges change. In Fig. 9b, c and e, the rates of edges change in some conditions are more than 50 percent. And in Fig. 9c, when is 50, the rate is over 100 percent. That is to say, the FKDA algorithm has to add more than one time nodes in the original graph.
Conclusion and future works
Most existing privacy protection methods for social networks are based on simple graphs only. They do not consider the edge weight. However, depending on the network structure and properties of social networks, the edge weights represent the tightness between the two nodes. In our view, edge weights may be deemed be the attacker’s background knowledge to re-identify the target individuals. Certainly, more and more researchers begin to focus on the weight of the edges and apply it to privacy protection. In this paper, we propose a -weighted generalization anonymity model based on edge weights for privacy protection in social networks. And we give a conception of weight difference to select a boundary node for each anonymous group. Out method KWGA bases on two steps to complete protection for social network. One is that we create potential anonymous group, and the other is that each group requires anonymous by add edges and weight generalization. Finally, we verify the validity of our algorithm through three real datasets. One of the datasets is C-DBLP which compiled by us according to the real data and two common weighted datasets which complied by Mark Newman. Furthermore, we consider four evaluation criterions to verify the utility of our algorithm, clustering coefficient, global clustering coefficient, average path length and the rate of edges change. From the result of our experiment, we can see that our algorithm has a highly utility with fewer changes. In the future works, we will consider the privacy protection of multi sensitive properties on the basis of weighted social networks.
Footnotes
Acknowledgments
This work was supported in part by National Science Foundation of China (No. 61572259), supported by the National Social Science Foundation of China (No.16ZDA054), Special Public Sector Research Program of China (No. GYHY201506080), and was also supported by PAPD
References
1.
RainieL. and WellmanB., Networked: The in New Social Operating System, 2012.
2.
GuB.SunX. and ShengV.S., Structural Minimax Probability Machine, IEEE Transactions on Neural Networks and Learning Systems (2016). DOI: 10.1109/TNNLS.2016.2544779
3.
GuB.ShengV.S.TayK.Y.RomanoW. and LiS., Incremental Support Vector Learning for Ordinal Regression, IEEE Transactions on Neural Networks and Learning Systems26(7) (2015), 1403–1416.
4.
GuB. and ShengV.S., A Robust Regularization Path Algorithm for ν-Support Vector Classification, IEEE Transactions on Neural Networks and Learning Systems (2016). DOI: 10.1109/TNNLS.2016.2527796
5.
XueY.JiangJ.ZhaoB. and MaT., A self-adaptive articial bee colony algorithm based on global best for global optimization, Soft Computing (2017). DOI: 10.1007/s00500-017-2547-1
6.
RongH.MaT.H.TangM.L. and CaoJ., A Novel Subgraph K+ -Isomorphism Method in Social Network Based on Graph Similarity Detection, Soft Computing (2017). DOI: 10.1007/s00500-017-2513-y
7.
MaT.WangY.TangM. et al., LED: A Fast Overlapping Communities Detection Algorithm Based on Structural Clustering, Neurocomputing207 (2016), 488–500.
8.
LvY.MaT.TangM. et al., An efficient and scalable density-based clustering algorithm for datasets with complex structures, Neurocomputing171 (2016), 9–22.
9.
BackstromL.HuttenlocherD.KleinbergJ. and LanX., Group formation in large social networks: membership, growth, and evalution, in KDD’06: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006, pp. 44–54.
10.
HayM. et al., Anonymized social networks, hidden patterns and structural steganography, in International World Wide Web Conference (WWW 2007), 2007, pp. 181–190.
11.
SweeneyL., Achieving k-anonymity privacy protection using generalization and suppression, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems10(5) (2002), 571–588.
12.
ZhelevaE. and GetoorL., Preserving the privacy of sensitive relationships in graph data, Proceedings of the 1st ACM SIGKDD International Conference on Privacy, Security, and Trust in KDD, 2008.
13.
CampanA. and TrutaT.M., A clustering approach for data and structural anonymity in social networks, Proceedings of the 2nd ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD, 2008.
14.
DasS.EgeciogluO. and AbbadiA.E., Anonymizing weighted social network graphs, The 26th International Conference on Data Engineering, ICDE 2010, 2010.
15.
LiY. and ShenH., Anonymizing graphs against weight-based attacks, International Conference on Data Mining Workshops (ICDMW 2010), 2010, pp. 491–498.
16.
LiuL.WangJ.LiuJ. and ZhangJ., Privacy preservation in social networks with sensitive edge weights, 2009 SIAM International Conference on Data Mining (SDM 2009), Sparks, Nevada, 2009, pp. 954–965.
17.
ZhouB. and PeiJ., Preserving Privacy in Social Networks Against Neighborhood Attacks, Proceedings of IEEE 24th International Conference on Data Mining (ICDE ’,08), 2008, pp. 506–515.
18.
LuX.SongY. and BressanS., Fast identity anonymization on graphs, Database and expert systems applications (2012), pp. 281–295.
19.
NewmanM.E.J., The structure of scientific collaboration networks, Working Papers98(2) (2001), 404–409.
20.
NewmanM.E.J., Scientific collaboration networks: I. Network construction and fundamental results, Physical Review E64(2) (2001), 016131.
21.
NewmanM.E.J., Scientific collaboration networks: II. Shortest paths, weighted networks, and centrality, Physical Review E64(1 Pt 2) (2001), 132–158.
22.
LiH.L. and BiaoC., Weighted Social Networks Anonymizing Publication, Recent Advances in Computer Science and Information Engineering, Lecture Notes in Electrical Engineering124 (2012), 413–421.
23.
WangS.L. et al., Shortest path anonymization on weighted graph, International Journal of Software Engineering and Knowledge Engineering23(1) (2013), 65–79.
24.
HayM.MiklauG.JensenD. and TowsleyD., Resisting structural re-identification in anonymized social networks, The Proceedings of the 34th Int’l Conference. on Very Large Databases, 2008, pp. 102–114.
25.
ChengJ.FuW.C. and LiuJ., K-isomorphism: privacy preserving network publication against structural attacks, Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010.
26.
ZouL.ChenL. and OzsuM.T., K-automorphism: a general framework for privacy preserving network publication, Proc. VLDB Endow2(1) (2009), 946–957.
27.
WuW. et al., k-symmetry model for identity anonymization in social networks, Proceedings of the 13th International Conference on Extending Database Technology, 2010.
28.
MaT.H.ZhangY.L. et al., KDVEM: a k-degree anonymity with Vertex and Edge Modification algorithm, Computing97(12) (2015), 1165–1184.
29.
TianQ. and ChenS., Cross-Heterogeneous-Database Age Estimation Through Correlation Representation Learning, Neurocomputing238 (2017), 286–295.
30.
ChenK.ZhangH.Y.WangB. and YangX.C., Protecting Sensitive Labels in Weighted Social Networks, Web Information System and Application Conference (WISA), 2013 10th, 2013, pp. 221–226.
31.
BiryukovM. and DongC., Analysis of computer science communities based on dblp, Research and Advanced Technology for Digital Libraries (2010), pp. 228–235.
32.
NewmanM.E.J., Finding community structure in networks using the eigenvectors of matrices, Physical Review E Statistical Nonlinear and Soft Matter Physics74(3) (2006), 92–100.
33.
MaT.H. et al., Detect structural-connected communities based on BSCHEF in C-DBLP, Concurrency Computat: Pract. Exper.28(2) (2016), 311–330.
34.
KumpulaJ.M.OnnelaJ.P.SaramakiJ.KaskiK. and KerteszJ., Emergence of communities in weighted networks, Physical Review Letters99(22) (2007), 228701-1–228701-4.
35.
HillS.ProvostF. and VolinskyC., Network-based marketing: Identifying likely adopters via consumer networks, Statistical Science22(2) (2006), 256–275.