Abstract
Most existing methods use only network topology information, which neglect much other important information such as the network background information which could be useful to uncover the network communities. In this paper, we proposed a fast semi-supervised algorithm to uncover the fuzzy network communities using label propagation technology, which incorporating the prior information to facilitate the community detection process. Specially, we know the true community membership of a certain percentage nodes in advance. Firstly, the node graph is transformed to line graph, and the weight of line graph are defined. Then, the proposed algorithm is applied in the line graph to propagate the label of certain labeled nodes to the whole network until convergence, and the final label of a given node is its community id. It worth mentioning that our algorithm is very fast and with almost linear time complexity. Extensive simulations using both synthetic and real-world benchmark networks are performed to verify the algorithmic performance.
Introduction
Many complex networks, such as Internet, human social network and protein interaction network, exhibit community structure, which means the existence of groups of densely connected nodes that sparsely connect with the rest of the graph. Detecting such communities in graphs can provide a useful coarse-grained representation for network analysis, and will help in understanding the topology and functions of the real systems [1–3]. The most famous quality function is the Modularity scores proposed by Newman et al. [21, 22], which prefer those partitions containing communities with an internal edge density larger than expected in a given graph model. Label propagation(LP) algorithm is also a famous method for community partition [8, 31]. Initially, the algorithm starts with assigning a community label to each node, randomly. Then, each node iterates its label by replacing it by the label most used by its neighbors. The other well-known optimization approaches used in community detection problem include simulated annealing (SA) [26], external optimization (DA) [14, 25], Bayesian inference [4, 19] and model selection analysis [15].
In many real networks, nodes may simultaneously participate in multiple communities [17–28]. In this paper, we proposed a new semi-supervised algorithm to uncover the network communities using label propagation technology, which incorporating the prior information to facilitate the community detection process. Specifically, two different type background information exist in the network, i.e. the label of individuals and practical constraint of correlation. Here we make use of the label of nodes in the network as prior knowledge, i.e., we know the true community membership of a certain percentage nodes in advance. Firstly, the node graph is transformed to line graph, and the weight of line graph are defined. Then, the proposed algorithm is applied in the line graph to propagate the label(community id) of certain labeled nodes to the whole network until convergence. The value in the label matrix is its community id of a given node. It worth mentioning that our algorithm is very fast and with almost linear time complexity. Extensive simulations using both synthetic and real-world benchmark networks are performed to verify the algorithmic performance.
Link communities
In real networks, nodes may participate in more than one modules, which can be viewed as “overlapping nodes” that exist pervasively in the community structure. In this paper, rather than regarding communities as assemble of nodes simply, we reveal and analyze communities as groups of links, which successfully coincide with the organizing principles of overlapping communities. Different from the existing hard partition methods, link communities [29–34] focus on revealing the overlapping organization using topological information, naturally. Even though each link is assigned only a single membership, we can use link communities to capture multiple relationships between nodes, since multiple nodes may belong to several different communities together, simultaneously. A representative example is the heterogeneous cliques network, which shown in Fig. 1. In this network, each community is a single clique which marked by different colors and two adjacent communities are overlapped by only one node, which highlighted by red color.
To reveal the emerged link communities, we first transform the original unweighted node-node network N, to an equivalent weighted link-link line graph L. For a given network N, each node of its corresponding line graph L represents an link of N and two nodes of L are adjacent when their corresponding links share a common endpoint in N. A simple example is shown in Fig. 2. Different from traditional unweighted node graph, the line graph is weighted and edge weight is calculated by the following function:
From the practical point of view, the overlapping node in any communities can be naturally detected by partitioning the links into communities in line graphs using any hard partition method, because the links connected to a node could belong to different link communities and consequently the node could be assigned to multiple communities of links.
Let L = {V, E} denotes a weighted line graph, where V = is the set of nodes(links in node graph N) in L, where n represents the size of L. For nodes set V, it contains labeled nodes {(v1, y1) , . . . , (v l , y l )} and unlabeled nodes vl+1, . . . , v n . We denote the label set as Y l = {y1, . . . , y l }, where y i ∈ K (i = 1, . . . , l) and K is the set of label values. The label set is associates with the labeled nodes set, which indicate the community id these nodes belong to. In this paper, we use the label propagation method to percolate the set of labels Y l to all unlabeled nodes, which inferring the missing labels (community id) and find the community membership.
The label propagation process is executed by updating the label of each node based on the labels of its adjacent neighbors, iteratively. To illustrate the procedure more clearly, let’s first consider a simple two community situation, i.e. the label value set K = {1, - 1}. We denote as the label of node i, and the label of unlabeled node is updated by the following equation:
The proposed model can also be extended to identify more than two communities. Suppose the network contains k communities, we let denotes a specific partition of line graph which labels node i as . Thus F can also be viewed as a label assignment function which applied to every node. We initially set F0 = Y, where Y
iθ
= 1 if node i is labeled as θ, and Y
iθ
= 0 otherwise, and for unlabeled nodes Y
uθ
= 0 (1 < θ < k). Then, the dynamical iteration function of labels can be rewritten as
To illustrate the proposed method more clearly, a small artificial network with 10 nodes is employed, which shown in Fig. 3. One can observe that two communities exist, which contain nodes {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10}, respectively. We set node 5 and node 10 as the initial labeled nodes, and the label assignment can dynamical spread to the whole network. Finally, the community membership is shown in Fig. 3, and different communities are highlighted by different colors.
We can also prove the convergence of the algorithm in the follows. Based on the initial assignment that F0 = Y, Equation (4) can be rewritten as
Since w ij ≥ 0 and ∑ i w ij = 1, according to the Perron-Frobenius theorem [16], the spectral radius of weight matrix W satisfies ρ (W) ≤1. In addition, 0 ≤ α ≤ 1, thus
Generally, we can conclude and exhibit the main procedure of the proposed method in Algorithm 1.
1. Transform the node graph N to an equivalent line graph L;
2. Calculate the weight matrix W using Equation (1);
3. Iterate Equation (4) until convergence;
4. Calculate the element of label matrix via .
In Algorithm 1, the computational complexity mainly contains two parts: calculating the link weight matrix in line graph by Equation (1), and iterating Equation (4) until convergence. For the first part, we transform to the line graph and calculating the weight matrix needs O (m) time, where m represents the size of line graph(number of nodes in line graph). For the second part, each iteration costs O (m) time. If we assume that it needs t iterations to convergence, then the second procedure requires O (tm) time. Since the computational complexity of the proposed model depends on the highest part of these two procedures, the overall cost time is O (tm). So, Algorithm 1 is very fast and nearly linear, especially for some well-connected networks as the iteration number t is smaller.
The proposed algorithm exhibits an excellent performance on both artificial benchmark networks and real-world data networks. Results show the proposed algorithm obtains quite good results for all cases.
In Fig. 4(a), the result of our algorithm in Girvan-Newman(GN) benchmark networks [21, 22] is presented. The GN benchmark has been wildly used to compare the performance between different community partition algorithms. Therefore, for each node, the expected intra-community degree and the expected inter-community degree is Z in and Z out respectively and Z in + Z out = 16 on average. As can be seen, when Z out ≤ 5.5 the community structure of the network is almost completely identified (Normalized Mutual Information ≥0.99).
We have also tested our algorithm for LFR benchmark networks [3] and compared its performance with other methods. In the LFR benchmark, the value of mixing parameter (μ) varies within the interval ⌊0, 1⌋, which determines the degree of the fuzziness of the communities in the LFR graph and the larger the μ, the more fuzzy the communities. One can note that the GN benchmark is a special case of the LFR benchmark. The outcome is shown in Fig. 4(b). As can be seen, the results are fabulous almost in all cases and when μ ≤ 0.55, the value of Normalized Mutual Information ≥0.9.
These three communities are all reasonable modules listed in Ref. [11] and the elements of each are all have same meaning. Among these elements, {Musician, Intelligence} are uncovered as overlapping nodes between communities 1 and 2, and {Using, Tool, Mechanic} are the overlapping nodes between communities 3 and 4.
Conclusion
In this paper, we proposed a near linear semi-supervised algorithm to uncover the fuzzy communities using label propagation technology, which incorporating the prior information to facilitate the community detection process. Extensive simulations using both synthetic and real-world benchmark networks are performed to verify the algorithmic performance.
Footnotes
Acknowledgments
The authors are separately supported by NSFC grants 71401194, 71401188, 91324203, 11131009 and Young Elite Teacher Project of Central University of Finance and Economics under Grants QYP1603.
