Detecting fuzzy network communities based on semi-supervised label propagation

Abstract

Most existing community detection methods use only network topology and neglect other important information, such as background information about the network, that could help uncover its communities. In this paper, we propose a fast semi-supervised algorithm that uncovers fuzzy network communities using label propagation and incorporates prior information to facilitate the community detection process. Specifically, we assume that the true community membership of a certain percentage of nodes is known in advance. First, the node graph is transformed into a line graph, and the weights of the line graph are defined. Then, the proposed algorithm is applied to the line graph to propagate the labels of the labeled nodes to the whole network until convergence; the final label of a given node is its community id. It is worth mentioning that our algorithm is fast, with almost linear time complexity. Extensive simulations on both synthetic and real-world benchmark networks are performed to verify the algorithmic performance.

Keywords

Line graph, link community, label propagation, semi-supervised, linear time

1 Introduction

Many complex networks, such as the Internet, human social networks and protein interaction networks, exhibit community structure, that is, groups of densely connected nodes that are only sparsely connected with the rest of the graph. Detecting such communities can provide a useful coarse-grained representation for network analysis and helps in understanding the topology and functions of real systems [1–3]. The most famous quality function is the Modularity score proposed by Newman et al. [21, 22], which favors partitions whose communities have an internal edge density larger than expected in a given graph model. The label propagation (LP) algorithm is another well-known method for community partition [8, 31]. The algorithm starts by randomly assigning a community label to each node. Then, each node iteratively replaces its label with the label most used by its neighbors. Other well-known optimization approaches to the community detection problem include simulated annealing (SA) [26], extremal optimization (EO) [14, 25], Bayesian inference [4, 19] and model selection analysis [15].

In many real networks, nodes may simultaneously participate in multiple communities [17–28]. In this paper, we propose a new semi-supervised algorithm that uncovers network communities using label propagation and incorporates prior information to facilitate the community detection process. Specifically, two different types of background information may exist in a network, i.e., the labels of individuals and practical constraints on correlation. Here we use the labels of nodes as prior knowledge, i.e., we know the true community membership of a certain percentage of nodes in advance. First, the node graph is transformed into a line graph, and the weights of the line graph are defined. Then, the proposed algorithm is applied to the line graph to propagate the labels (community ids) of the labeled nodes to the whole network until convergence. The community id of a given node is then read from the resulting label matrix. It is worth mentioning that our algorithm is fast, with almost linear time complexity. Extensive simulations on both synthetic and real-world benchmark networks are performed to verify the algorithmic performance.

2 Link communities

In real networks, nodes may participate in more than one module; such “overlapping nodes” exist pervasively in community structure. In this paper, rather than regarding communities simply as assemblies of nodes, we reveal and analyze communities as groups of links, which coincides with the organizing principles of overlapping communities. Unlike the existing hard partition methods, link communities [29–34] naturally reveal the overlapping organization using topological information. Even though each link is assigned only a single membership, link communities capture multiple relationships between nodes, since a node whose links fall in different link communities belongs to several communities simultaneously. A representative example is the heterogeneous clique network shown in Fig. 1. In this network, each community is a single clique marked by a different color, and two adjacent communities overlap in only one node, which is highlighted in red.

To reveal the link communities, we first transform the original unweighted node-node network N into an equivalent weighted link-link line graph L. For a given network N, each node of the corresponding line graph L represents a link of N, and two nodes of L are adjacent when their corresponding links share a common endpoint in N. A simple example is shown in Fig. 2. Unlike the traditional unweighted node graph, the line graph is weighted, and the weight of each of its edges is calculated as $w (l_{ik}, l_{jk}) = \frac{a_{i} \cdot a_{j}}{| a_{i} |^{2} + | a_{j} |^{2} - a_{i} \cdot a_{j}},$ (1) where a_i is the i-th row of the adjacency matrix A = {a_i} of the node network N; the numerator a_i · a_j is the number of common inclusive neighbors of nodes i and j, and the denominator is the number of all inclusive neighbors of nodes i and j. One can observe that the value of w (l_ik, l_jk) lies in [0, 1], and the larger the weight w, the more common neighbors are shared by the endpoints i and j. The weight w defined in Equation (1) is a general definition that can be easily extended to directed, signed or mixed networks.
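As a rough illustration of this construction, the following Python sketch builds the line graph of a small undirected network and evaluates Equation (1) for every pair of links that share an endpoint. It assumes that a_i encodes the inclusive neighbor set of node i (its neighbors plus i itself), so that Equation (1) reduces to a ratio of set sizes. The function name `line_graph_weights` and the toy edge list are illustrative and do not come from the original method description.

```python
from itertools import combinations


def line_graph_weights(edges):
    """Return {(link_a, link_b): weight} for links sharing an endpoint.

    Weights follow Equation (1), evaluated on inclusive neighbour sets
    (assumption: a node counts as its own neighbour).
    """
    # Inclusive neighbour set of every node.
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, {u}).add(v)
        nbrs.setdefault(v, {v}).add(u)

    # Links incident to every node; each link is stored as a sorted tuple.
    incident = {}
    for link in (tuple(sorted(e)) for e in edges):
        for node in link:
            incident.setdefault(node, []).append(link)

    weights = {}
    for k, links in incident.items():
        # Two links l_ik and l_jk meeting at the shared endpoint k.
        for la, lb in combinations(sorted(links), 2):
            i = la[0] if la[1] == k else la[1]
            j = lb[0] if lb[1] == k else lb[1]
            common = len(nbrs[i] & nbrs[j])                   # a_i . a_j
            total = len(nbrs[i]) + len(nbrs[j]) - common      # |a_i|^2 + |a_j|^2 - a_i . a_j
            weights[(la, lb)] = common / total
    return weights


# Toy network: a triangle 1-2-3 with a pendant node 4 attached to node 3.
toy_weights = line_graph_weights([(1, 2), (1, 3), (2, 3), (3, 4)])
for pair, w in sorted(toy_weights.items()):
    print(pair, round(w, 3))
```

In this toy network, the two triangle links meeting at node 3 receive the largest weight, while pairs involving the pendant link receive the smallest, which matches the intuition that links inside a dense group are more similar.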

From a practical point of view, the overlapping nodes of any communities can be naturally detected by partitioning the links of the line graph into communities with any hard partition method, because the links connected to a node can belong to different link communities, and consequently the node is assigned to multiple link communities.
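The mapping from a hard partition of links back to overlapping node memberships can be sketched in a few lines of Python. The helper names `node_memberships` and `overlapping_nodes` and the community ids "A" and "B" are hypothetical, and the link partition is given by hand rather than computed.

```python
def node_memberships(link_labels):
    """Map each node to the set of communities of its incident links.

    link_labels: {(u, v): community_id}, a hard partition of the links.
    """
    members = {}
    for (u, v), community in link_labels.items():
        members.setdefault(u, set()).add(community)
        members.setdefault(v, set()).add(community)
    return members


def overlapping_nodes(link_labels):
    """Nodes whose incident links fall into more than one link community."""
    memberships = node_memberships(link_labels)
    return sorted(node for node, comms in memberships.items() if len(comms) > 1)


# Two triangles that share node 3, each triangle forming one link community.
link_partition = {
    (1, 2): "A", (1, 3): "A", (2, 3): "A",
    (3, 4): "B", (3, 5): "B", (4, 5): "B",
}
print(overlapping_nodes(link_partition))
```

Every link keeps a single label, yet node 3 ends up in both communities because its links are split between them.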

3 The algorithm

Let L = {V, E} denote a weighted line graph, where V = {v₁, . . . , v_n} is the set of nodes (links in the node graph N) of L and n is the size of L. The node set V contains labeled nodes {(v₁, y₁) , . . . , (v_l, y_l)} and unlabeled nodes v_l+1, . . . , v_n. We denote the label set as Y_l = {y₁, . . . , y_l}, where y_i ∈ K (i = 1, . . . , l) and K is the set of label values. The label set is associated with the set of labeled nodes and indicates the community ids these nodes belong to. In this paper, we use label propagation to spread the labels Y_l to all unlabeled nodes, thereby inferring the missing labels (community ids) and finding the community membership.

The label propagation process iteratively updates the label of each node based on the labels of its adjacent neighbors. To illustrate the procedure more clearly, let us first consider a simple two-community situation, i.e., the label value set K = {1, - 1}. We denote by $f_{i}^{t}$ the label of node i at iteration t, and the label of an unlabeled node is updated by the following equation: $f_{i}^{t + 1} = α \sum_{j} w_{ij} f_{j}^{t} + (1 - α) y_{i},$ (2) where t is the iteration number, w_ij is the weight between nodes i and j calculated by Equation (1), and 0 < α < 1 is a tunable parameter that adjusts the fraction of label information that node i absorbs from its adjacent neighbors. Let y = {y₁, y₂, . . . , y_n} ^T, with y_i ∈ K (i ≤ l) and y_j = 0 (l < j ≤ n). The vector $f^{t} = (f_{1}^{t}, f_{2}^{t}, . . ., f_{n}^{t})^{T}$ collects the labels at the t-th iteration, with f⁰ = y. Then we can rewrite the update of $f_{i}^{t}$ in matrix form as $f^{t + 1} = α W f^{t} + (1 - α) y .$ (3)
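The two-community update can be illustrated with a minimal Python sketch, under some assumptions: the weight matrix is a hand-written, row-normalized matrix for a four-node path rather than one computed from Equation (1), the update is applied to all nodes as in the matrix form (3), and the function name `propagate_binary`, the tolerance and the iteration cap are illustrative choices.

```python
def propagate_binary(W, y, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate f <- alpha * W f + (1 - alpha) * y, starting from f = y."""
    n = len(y)
    f = list(y)
    for _ in range(max_iter):
        new_f = [
            alpha * sum(W[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
            for i in range(n)
        ]
        change = max(abs(a - b) for a, b in zip(new_f, f))
        f = new_f
        if change < tol:  # labels have stopped moving
            break
    return f


# Path 0 - 1 - 2 - 3 with row-normalized weights (assumed, for illustration).
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
y = [1, 0, 0, -1]  # node 0 labeled +1, node 3 labeled -1, the rest unlabeled
f = propagate_binary(W, y)
print([1 if value > 0 else -1 for value in f])
```

The sign of each converged entry gives the community: the two nodes nearer the +1 seed end up positive and the two nearer the -1 seed end up negative.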

The proposed model can also be extended to identify more than two communities. Suppose the network contains k communities. We let $F = [F_{1}^{T}, F_{2}^{T}, . . ., F_{n}^{T}]^{T}$ denote a specific partition of the line graph, which labels node i as $y_{i} = \arg\max_{j \le k} F_{ij}$ . Thus F can also be viewed as a label assignment function applied to every node. We initially set F⁰ = Y, where Y_iθ = 1 if node i is labeled as θ and Y_iθ = 0 otherwise, and for unlabeled nodes Y_uθ = 0 (1 ≤ θ ≤ k). Then, the iteration of the labels can be rewritten as $F^{t + 1} = α W F^{t} + (1 - α) Y .$ (4)
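A minimal Python sketch of the multi-community iteration in Equation (4), followed by the arg max assignment, is given below. The six-node weighted graph, the row normalization step and the helper names `row_normalize`, `propagate` and `hard_labels` are assumptions made for illustration; in the proposed method W would come from Equation (1) on the line graph.

```python
def row_normalize(W):
    """Scale each row of W to sum to one (rows of zeros are left as zeros)."""
    normalized = []
    for row in W:
        total = sum(row)
        normalized.append([x / total if total else 0.0 for x in row])
    return normalized


def propagate(W, Y, alpha=0.5, tol=1e-9, max_iter=1000):
    """Iterate F <- alpha * W F + (1 - alpha) * Y, starting from F = Y."""
    n, k = len(Y), len(Y[0])
    F = [row[:] for row in Y]
    for _ in range(max_iter):
        new_F = [
            [alpha * sum(W[i][j] * F[j][c] for j in range(n)) + (1 - alpha) * Y[i][c]
             for c in range(k)]
            for i in range(n)
        ]
        change = max(abs(new_F[i][c] - F[i][c]) for i in range(n) for c in range(k))
        F = new_F
        if change < tol:
            break
    return F


def hard_labels(F):
    """Assign each node the column index with the largest entry in its row."""
    return [max(range(len(row)), key=row.__getitem__) for row in F]


# Three tightly linked pairs (0,1), (2,3), (4,5) joined by weak links.
raw = [[0.0] * 6 for _ in range(6)]
for u, v, w in [(0, 1, 1.0), (2, 3, 1.0), (4, 5, 1.0), (1, 2, 0.1), (3, 4, 0.1)]:
    raw[u][v] = raw[v][u] = w

# One labeled node per community: node 0 -> 0, node 2 -> 1, node 5 -> 2.
Y = [[0.0] * 3 for _ in range(6)]
Y[0][0] = Y[2][1] = Y[5][2] = 1.0

print(hard_labels(propagate(row_normalize(raw), Y)))
```

Each unlabeled node receives the label of the seed it is most strongly connected to, so the three pairs are recovered as three communities.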

To illustrate the proposed method more clearly, we use a small artificial network with 10 nodes, shown in Fig. 3. One can observe that two communities exist, containing nodes {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10}, respectively. We set node 5 and node 10 as the initially labeled nodes, and the label assignment dynamically spreads to the whole network. The final community membership is shown in Fig. 3, where different communities are highlighted by different colors.

The convergence of the algorithm can be proved as follows. With the initial assignment F⁰ = Y, Equation (4) can be unrolled as $F^{t} = (α W)^{t} Y + (1 - α) \sum_{i = 0}^{t - 1} (α W)^{i} Y .$ (5)
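Unrolling Equation (4) from F⁰ = Y for the first two iterations shows how the geometric sum over powers of αW arises; the general step then follows by induction on t. This is a worked expansion for the reader's convenience, using only the symbols already defined above.

```latex
\begin{aligned}
F^{1} &= \alpha W Y + (1-\alpha)\, Y, \\
F^{2} &= \alpha W F^{1} + (1-\alpha)\, Y
       = (\alpha W)^{2} Y + (1-\alpha)\,(I + \alpha W)\, Y, \\
F^{t} &= (\alpha W)^{t} Y + (1-\alpha) \sum_{i=0}^{t-1} (\alpha W)^{i}\, Y .
\end{aligned}
```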

Since w_ij ≥ 0 and ∑_i w_ij = 1, according to the Perron-Frobenius theorem [16], the spectral radius of the weight matrix W satisfies ρ (W) ≤ 1. In addition, 0 < α < 1, so ρ (αW) < 1 and thus

$\begin{matrix} \lim_{t \to \infty} (α W)^{t} = 0, \\ \lim_{t \to \infty} \sum_{i = 0}^{t - 1} (α W)^{i} = (I - α W)^{- 1}, \end{matrix}$ (6) where I is the identity matrix. One can observe that the iteration in Equation (4) converges to the following form: $F^{*} = \lim_{t \to \infty} F^{t} = (1 - α) (I - α W)^{- 1} Y .$ (7)
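The limit implied by Equation (6), namely (1 - α)(I - αW)⁻¹Y, can be checked numerically against the iteration of Equation (4). The Python sketch below does this for a single label column on a three-node example; the weight matrix, the small Gaussian elimination helper `solve` and the fixed count of 200 iterations are illustrative assumptions.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


alpha = 0.5
W = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]
y = [1.0, 0.0, -1.0]
n = len(y)

# Closed form: solve (I - alpha W) f = (1 - alpha) y.
A = [[(1.0 if i == j else 0.0) - alpha * W[i][j] for j in range(n)] for i in range(n)]
closed_form = solve(A, [(1 - alpha) * v for v in y])

# Iteration: f <- alpha W f + (1 - alpha) y, starting from f = y.
f = y[:]
for _ in range(200):
    f = [alpha * sum(W[i][j] * f[j] for j in range(n)) + (1 - alpha) * y[i]
         for i in range(n)]

print(closed_form)
print(f)
```

Both computations give the same vector up to rounding, which is the behavior the convergence argument predicts for 0 < α < 1.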

The main procedure of the proposed method is summarized in Algorithm 1.

Algorithm 1 The algorithm based on semi-supervised label propagation

Input: Node graph N, label matrix Y, the parameter α

Output: The community membership matrix X

1. Transform the node graph N to an equivalent line graph L;

2. Calculate the weight matrix W using Equation (1);

3. Iterate Equation (4) until convergence;

4. Assign the label of each node i via $y_{i} = \arg\max_{j \le k} F_{ij}$ .

In Algorithm 1, the computational cost consists of two main parts: calculating the link weight matrix of the line graph by Equation (1), and iterating Equation (4) until convergence. For the first part, transforming the node graph into the line graph and calculating the weight matrix takes O (m) time, where m is the size of the line graph (the number of nodes in the line graph). For the second part, each iteration costs O (m) time. If t iterations are needed for convergence, the second part requires O (tm) time. Since the overall complexity is dominated by the more expensive of the two parts, the total running time is O (tm). Algorithm 1 is therefore nearly linear in m, especially for well-connected networks, for which the number of iterations t is smaller.

4 Experiments

We evaluate the proposed algorithm on both artificial benchmark networks and real-world networks. The results show that it performs well in all the cases considered.

Artificial networks. We have tested our algorithm on both GN and LFR benchmark networks [2]. Performance is measured by the Normalized Mutual Information (NMI), which has been extensively studied in information theory [3]. This measure takes values in the interval [0, 1] and indicates how much information the true and the detected communities share. In particular, when the two partitions are identical, the NMI equals 1; the smaller the overlap between the true communities and the detected ones, the smaller the NMI value. The parameter α = 0.5 is used in all cases.
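For reference, a minimal Python sketch of the NMI between two hard partitions is shown below. It assumes the common normalization 2 I(A; B) / (H(A) + H(B)) and non-overlapping labels; the variant used for overlapping communities in the experiments may differ, and the function name `nmi` is illustrative.

```python
from collections import Counter
from math import log


def nmi(labels_true, labels_pred):
    """Normalized mutual information, 2 I(A;B) / (H(A) + H(B)), of two partitions."""
    n = len(labels_true)
    count_a = Counter(labels_true)
    count_b = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))

    entropy_a = -sum(c / n * log(c / n) for c in count_a.values())
    entropy_b = -sum(c / n * log(c / n) for c in count_b.values())
    mutual_info = sum(
        c / n * log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    if entropy_a + entropy_b == 0:
        return 1.0  # both partitions put every node in a single community
    return 2 * mutual_info / (entropy_a + entropy_b)


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same grouping, different label names
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # groupings that share no information
```

The measure depends only on how nodes are grouped, not on the label names, so a relabeled copy of the true partition still scores 1.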

In Fig. 4(a), we present the results of our algorithm on Girvan-Newman (GN) benchmark networks [21, 22]. The GN benchmark has been widely used to compare the performance of different community partition algorithms. In this benchmark, the expected intra-community degree and the expected inter-community degree of each node are Z_in and Z_out, respectively, with Z_in + Z_out = 16 on average. As can be seen, when Z_out ≤ 5.5 the community structure of the network is almost completely identified (NMI ≥ 0.99).
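A GN-style benchmark graph can be generated with a short Python sketch. It assumes the standard configuration of this benchmark, 128 nodes in four groups of 32, which is not restated in the text above; the function name `gn_benchmark`, its parameters and the random seed are illustrative.

```python
import random


def gn_benchmark(z_out, z_total=16, groups=4, group_size=32, seed=0):
    """Random graph with planted groups and expected degrees Z_in + Z_out = z_total.

    Returns (edges, membership), where membership[v] is the planted group of v.
    """
    rng = random.Random(seed)
    n = groups * group_size
    membership = [v // group_size for v in range(n)]
    # Link probabilities chosen so that the expected intra- and inter-group
    # degrees of every node are Z_in = z_total - z_out and Z_out = z_out.
    p_in = (z_total - z_out) / (group_size - 1)
    p_out = z_out / (n - group_size)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if membership[u] == membership[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, membership


edges, membership = gn_benchmark(z_out=4)
print(len(edges), 2 * len(edges) / len(membership))  # edge count, average degree
```

Increasing `z_out` while keeping the total fixed blurs the planted groups, which is the axis along which Fig. 4(a) is reported.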

We have also tested our algorithm on LFR benchmark networks [3] and compared its performance with other methods. In the LFR benchmark, the mixing parameter μ varies within the interval [0, 1] and determines the degree of fuzziness of the communities in the LFR graph: the larger μ, the fuzzier the communities. Note that the GN benchmark is a special case of the LFR benchmark. The outcome is shown in Fig. 4(b). As can be seen, the algorithm performs well in almost all cases, and when μ ≤ 0.55 the NMI is at least 0.9.

The Zachary karate network. Next, we apply our algorithm to the well-known Zachary karate club network [32]. In the 1970s, Zachary constructed a network representing the social ties between members of a karate club. We set α = 0.85 and find that the partition detected by the proposed algorithm exactly matches the original partition [21, 22], as shown in Fig. 5(a). In this case, node 3 is found to be an overlapping node. Indeed, node 3 lies on the border between the two communities, so it is understandable that it is an ambiguous case. Furthermore, compared with the optimal situation, when α is reduced to 0.85, three communities are detected, which reveals another scale of membership, as shown in Fig. 5(b). Four overlapping nodes, namely nodes 3, 10, 20 and 28, are marked by a dashed curve. These members are on friendly terms with more than one club at the same time, so they are overlapping nodes in this situation. In conclusion, our algorithm can partition the network at different scales of α, which reflects the multi-scale property of real networks.

Performance on more real-world networks. Furthermore, we evaluate our model on more widely used real-world data, listed in Table 1. Seven widely used networks are employed, and their corresponding references are given. For the sake of comparison, the best published Modularity Q values, obtained on these networks by a large number of partition methods, are provided. As can be seen in Table 1, the results of our algorithm are very close to the best published values, at a very low computational cost.
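As a reminder of how such Q values are obtained, the Python sketch below computes the Newman-Girvan modularity of a hard partition of an unweighted, undirected graph. This is the standard definition and is given only as an assumption about the reported quantity; the function name `modularity` and the toy graph are illustrative, and overlapping variants of Q are not covered.

```python
def modularity(edges, community):
    """Newman-Girvan modularity Q = sum_c [ e_c / m - (d_c / (2 m))^2 ].

    edges: list of undirected edges (u, v); community: {node: community_id}.
    e_c is the number of edges inside community c, d_c its total degree.
    """
    m = len(edges)
    intra = {}
    degree_sum = {}
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Two triangles joined by a single edge, split into the two triangles.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_partition = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(toy_edges, toy_partition))
```

For this toy graph the natural split into the two triangles gives Q = 5/14, while placing all nodes in one community gives Q = 0.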

A large-scale semantic network. Finally, we apply the proposed model to a weighted semantic network that contains 7207 phrases and 31784 edges [11]. The edge weights are calculated from phrase co-occurrences. To visualize the community partition, our algorithm outputs a transformed adjacency matrix (in which the vertices within the same community are arranged together) with a hierarchical community structure. The output matrix is shown in Fig. 6(a), and the distribution of community sizes is shown in Fig. 6(b). In total, 367 communities are detected; the maximum community size is 339, the minimum size is 4, and the average size is 15.57. One can see an approximate power-law phenomenon, that is, most communities are small and only a few are large. Among them, we have selected three interesting communities, listed as follows:

Community 1 = {Scientist, Inventor, Genius, Gifted, Brilliant, Intelligent, Smart, Science, Intelligence, Musician};

Community 2 = {Sax, Piano, Violin, Cornet, Tuba, Timpani, Cello, Band, Fiddle};

Community 3 = {Boxing, Boat, Race, Elevator, Ascent, Staircase, Stairwell, Climb, bobsleigh, Resting};

These three communities are all reasonable modules listed in Ref. [11], and the elements of each community share a similar meaning. Among these elements, {Musician, Intelligence} are uncovered as overlapping nodes between communities 1 and 2, and {Using, Tool, Mechanic} are the overlapping nodes between communities 3 and 4.

5 Conclusion

In this paper, we proposed a near-linear semi-supervised algorithm that uncovers fuzzy communities using label propagation and incorporates prior information to facilitate the community detection process. Extensive simulations on both synthetic and real-world benchmark networks were performed to verify the algorithmic performance.

Acknowledgments

The authors are separately supported by NSFC grants 71401194, 71401188, 91324203 and 11131009, and by the Young Elite Teacher Project of Central University of Finance and Economics under Grant QYP1603.

References

1. Clauset, Newman M.E.J. and Moore, Finding community structure in very large networks, Physical Review E 70(6) (2004), 066111.

2. Lancichinetti and Fortunato, Community detection algorithms: A comparative analysis, Physical Review E 80(5) (2009), 056117.

3. Lancichinetti, Fortunato and Radicchi, Benchmark graphs for testing community detection algorithms, Physical Review E 78(4) (2008), 046110–046115.

4. Yang, Liu J.M. and Liu D.Y., Characterizing and Extracting Multiplex Patterns in Complex Networks, IEEE Transactions on Systems, Man, and Cybernetics, Part B 42(2) (2012), 469–481.

5. Knuth D.E., The Stanford GraphBase: A platform for combinatorial computing, Reading, MA: Addison-Wesley Professional 37 (1993), 592.

6. [...], Liu, Zhang, Jin and Yang, Discovering link communities in complex networks by exploiting link dynamics, Journal of Statistical Mechanics: Theory and Experiment 10 (2012), 10015.

7. [...], Jin, Chen and Zhang, Identification of hybrid node and link communities in complex networks, Scientific Reports 5(8638) (2015), 1–14.

8. Liu, Bai H.Y., Li H.J. and Wang W.J., Semi-supervised community detection using label propagation, International Journal of Modern Physics B 28(29) (2014), 1450208.

9. Lusseau, Schneider, Boisseau O.J., Haase, Slooten and Dawson S.M., The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations, Behavioral Ecology and Sociobiology 54(4) (2003), 396–405.

10. Agarwal and Kempe, Modularity-maximizing graph communities via mathematical programming, The European Physical Journal B 66(3) (2008), 409–418.

11. Palla, Barabási A.L. and Vicsek, Quantifying social group evolution, Nature 446 (2007), 664–667.

12. [...] H.J., Wang, Wu L.Y., Zhang and Zhang X.S., Potts model based on a Markov process computation solves the community structure problem effectively, Physical Review E 86(1) (2012), 016109.

13. [...] H.J. and Zhang X.S., Analysis of stability of community structure across multiple hierarchical levels, Europhysics Letters 103(5) (2013), 58002.

14. Duch and Arenas, Community detection in complex networks using extremal optimization, Physical Review E 72(2) (2005), 027104.

15. Hofman J.M. and Wiggins C.H., Bayesian approach to network modularity, Physical Review Letters 100(25) (2008), 258701.

16. Keener J.P., The Perron-Frobenius theorem and the ranking of football teams, SIAM Review 35(1) (1993), 80–93.

17. [...] and Havens T.C., Quadratic program-based modularity maximization for fuzzy community detection in social networks, IEEE Transactions on Fuzzy Systems 23(5) (2015), 1356–1371.

18. Whang, Gleich and Dhillon, Overlapping community detection using neighborhood-inflated seed expansion, IEEE Transactions on Knowledge and Data Engineering 28(5) (2015), 1272–1284.

19. Hastings M.B., Community detection as an inference problem, Physical Review E 74(3) (2006), 035102.

20. Boguna, Pastor-Satorras, Diaz-Guilera and Arenas, Models of social networks based on social distance attachment, Physical Review E 70(5) (2004), 056122.

21. Newman M.E.J., Fast algorithm for detecting community structure in networks, Physical Review E 69(6) (2004), 066133.

22. Newman M.E.J. and Girvan, Finding and evaluating community structure in networks, Physical Review E 69(2) (2004), 026113.

23. Girvan and Newman M.E.J., Community structure in social and biological networks, Proceedings of the National Academy of Sciences of the United States of America 99(12) (2002), 7821–7826.

24. Gleiser and Danon, List of edges of the network of jazz musicians, Advances in Complex Systems 6 (2003), 565.

25. Ronhovde and Nussinov, Local resolution-limit-free Potts model for community detection, Physical Review E 81(4) (2010), 046114.

26. Guimera and Amaral L.A.N., Functional cartography of complex metabolic networks, Nature 433(7028) (2005), 895–900.

27. Guimera, Danon, Diaz-Guilera, Giralt and Arenas, Self-similar community structure in a network of human interactions, Physical Review E 68(6) (2003), 065103.

28. Bhat S.Y. and Abulaish, HOCTracker: Tracking the evolution of hierarchical and overlapping communities in dynamic social networks, IEEE Transactions on Knowledge and Data Engineering 27(4) (2015), 1019–1013.

29. Evans T.S. and Lambiotte, Line graphs, link partitions, and overlapping communities, Physical Review E 80(1) (2009), 016105.

30. Evans T.S. and Lambiotte, Line graphs of weighted networks for overlapping communities, European Physical Journal B 77(2) (2010), 265.

31. Raghavan U.N., Albert and Kumara, Near linear time algorithm to detect community structures in large-scale networks, Physical Review E 76(3) (2007), 036106.

32. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research 1 (1977), 452–473.

33. Ahn Y.Y., Bagrow J.P. and Lehmann, Link communities reveal multiscale complexity in networks, Nature 466(7307) (2010), 761–764.

34. [...], Zhang X.S., Wang R.S., Liu and Zhang, Discovering link communities in complex networks by an integer programming model and a genetic algorithm, PloS One 8(12) (2013), e83739.