Abstract
Detecting fuzzy network communities in directed network is a classic and very difficult task in the field of complex network analysis, principally for its applications in domains such as social or biological networks analysis. Present techniques rely heavily on network topology, which cannot provide a lot of important information, such as module correlation and hierarchical structure. In this paper, we present a new fuzzy community detection method, which is able to find fuzzy communities in directed line graphs by maximizing likelihood function. Firstly, the directed node graph is transformed to a new type directed line graph, and the direction and weight of line graph are defined. Then, the community unit consists of membership and correlation information is defined in the line graph. Specifically, there are two main contributions of this method: 1) to adequately characterize the community structure, the node and module correlation with different granularity can be calculated; 2) based on the membership and correlation information, we can extract the multiplex patterns between communities, according to different domain requirements. Furthermore, we are able to map the link community configuration to the optimal situation dynamically by maximizing the likelihood function with rigorous mathematical proof. Based on the spectral analysis of the Markovian transition matrix, a mathematical theory is provided to identify the optimum number of network communities, and to analyze the stability of the community structure. Extensive simulations using both synthetic and real-world benchmark networks are performed to verify the algorithmic performance.
Introduction
Recent years have witnessed an increasing interest in detecting communities in large-scale networks of various kinds. Communities are clusters of closely connected nodes within a network. Since massive online social networks have been deeply integrated into our daily life, detecting meaningful communities from them has become a critical task for research and applications of various purposes. Early work in graph partitions [1–3] could be adopted for community detection. However, these methods usually require the knowledge on the number of communities as part of the input, which is unrealistic in community detection problems. Modularity [2, 3], proposed by Newman et al., is the first successful attempt to resolve the drawback specified above. Modularity scores prefer those partitions containing communities with an internal edge density larger than expected in a given graph model. The other well-known optimization approaches used in community detection problem include simulated annealing (SA) [4], external optimization (DA) [5, 10], Potts dynamical [6, 7], Bayesian inference [8, 12] and model selection analysis [9]. To the best of our knowledge, the most competitive approach of this kind is Louvain [11], which can scale to graphs with hundreds of millions of objects.
Most existing community detection algorithms focus on partitioning a network such that each node belongs to only one community exactly, however, in many real social and biological networks, nodes may simultaneously participate in multiple communities [13–15]. In this paper, we present a new fuzzy community detection method, which is able to find fuzzy communities in directed line graphs by maximizing likelihood function. Firstly, the directed node graph is transformed to a new type directed line graph, and the direction and weight of line graph are defined. Then in the line graph, we define the community unit including membership and correlation information. Specifically, there are two main contributions of this method: 1) to adequately characterize the community structure, the node and module correlation with different granularity can be calculated; 2) based on the membership and correlation information, we can extract the multiplex patterns between communities, according to different domain requirements. Furthermore, we are able to map the link community configuration to the optimal situation dynamically by maximizing the likelihood function with rigorous mathematical proof. Based on the spectral analysis of the Markovian transition matrix, a mathematical theory is provided to identify the optimum number of network communities, and to analyze the stability of the community structure. Extensive simulations using both synthetic and real-world benchmark networks are performed to verify the algorithmicperformance.
Line graph and link communities
The fuzzy characteristic exists pervasively in the community structure of many real networks, where each node participates in more than one modules. In this paper, rather than regarding communities as assemble of nodes simply, we reveal and analyze communities as groups of links, which successfully coincide with the organizing principles of overlapping communities. Different from the existing hard partition methods, link communities [16–18] focus on revealing the overlapping organization using topological information, naturally.
In Ref [16, 17], Evans et al. introduced a pioneering model which transformed a network of nodes into a line graph using a random walk, and then detected link communities by applying an existing algorithm for node partition on this line graph. The most popular method for identifying link communities was proposed by Ahn et al. [18], which used a hierarchical clustering on a similarity between links to build a dendrogram of links to form a hierarchy of link structure. In order to derive the most relevant communities, they introduced a density function to determine the best level to cut the tree of links. He et al. [19] proposed a link dynamics based algorithm, called UELC, which using the stochastic process of a link-node-link random walk is employed to unfold an embedded bipartition structure of links in a network. The local mixing properties of the Markov chain underlying the random walk are then utilized to extract two emerged link communities. Li et al. [20] proposed a quantity function for identifying link community, and based on this quantity function, they further formulate the link community partition problem into an integer programming model which able to partition a complex network into overlapping communities. He et al. [21] introduced a mixing approach for identifying complex structures including both node and link communities, called hybrid node-link communities, which using a probabilistic model that accommodates node, link and hybrid community structure.
Even though each link is assigned only a single membership, we can use link communities to capture multiple relationships between nodes, since many nodes may belong to several different communities together, simultaneously. A representative example is the heterogeneous cliques network, which shown in Fig. (1). In this network, each community is a single clique which marked by different shapes and two adjacent communities are overlapped by only one node.

A heterogeneous cliques network, where each link community is a clique. Between two adjacent link communities, two overlapping nodes are highlighted.
To reveal the emerged link communities, we first transform the original unweighted directed node-node network N, to an equivalent weighted directed link-link line graph G. For a given network N, each node of its corresponding line graph G represents a directed link of N and two nodes of G are adjacent when their corresponding links share a common endpoint in N. We denote the adjacent matrix of node graph N as A = {a
i
}. Different from traditional unweighted node graph, the line graph is weighted. However, few researchers has focused on the study of direction and weight calculation of directed line graph. Here, we explore this problem and transform the node-node direction in node graphs to the link-link direction in the corresponding line graphs (Fig. 2). Then, the edge weight is calculated by the following function:

The transformation of node-node direction in node graphs to the link-link direction in the corresponding line graphs.

An toy example that transforming the original network into the corresponding weighted line graph. (a) The original network N contains two link communities highlighted by different shapes. As the joint part, node 3 can be considered as the overlapping node. (b) The transformed line graph L. The width of edges is proportional to its weight, which calculated according to Equation (1). If we cluster the line graph using any hard partition algorithm(weighted), the links belong to different link communities can also be recognized.
From the practical point of view, the overlapping node in any communities can be naturally detected by partitioning the links into communities in line graphs using any hard partition method, because the links connected to a node could belong to different link communities and consequently the node could be assigned to multiple communities of links.
The community unit
For a weighted and directed line graph G = (V, E), denote V (G) and E (G) are the set of nodes and directed edges, respectively. We denote the weight matrix of line graph G as Wn×n, where n represents the size of G (number of nodes). Suppose the network G is divided and separated into L (1 ≤ L ≤ n) modules, the node membership is denoted as Bn×L, where b il = 1 if node i belongs to module l, otherwise b il = 0. Consider each module is inseparable, one can use the average size of modules g = n/L as the “granularity”, which represents the line graph G evolves from the finest to coarsest as g increases from 1 to n.
In our framework, two different kinds of correlations, i.e. node correlation and module correlation, are defined. Correspondingly, two kind of correlation matrix with different resolutions are defined. We denote node forward-correlation matrix as Pn×n and node back-correlation matrix as Qn×n, where p ij and q ij represent the expect probabilities that node i correlates with or to be correlated by node j, respectively. We expect to cluster nodes with similar forward-correlation and back-correlation distributions into the same modules. If L represents the number of modules, we also define the module forward-correlation matrix ΦL×L and module back-correlation matrix ΨL×L, where φ pq and ψ pq represent the expect probabilities that module p correlates with or to be correlated by module q, respectively. It will be shown later that module correlation matrix can be inferred based on node correlation matrix and viceversa.
As it can be expected that the distribution of node correlations within the same modules are homogeneous, we should characterize the correlation distribution for modules instead of nodes. We denote the module matrix B with specific granularity g as B g . Specifically, when g = 1, there is B1 = In×n. It can be expected that cluster L modules into a specific number of communities so that the nodes within modules belonging to the same communities exhibit similar correlation distributions. The module membership is represented by matrix ZL×K (1 ≤ K ≤ L), where z lk = 1 if module l is clustered into community k, otherwise z lk = 0, K is the number of communities. Furthermore, for a specific Z, ΘK×n = (θ kj ) is denoted as the probability that a node out of community k expects to correlate with node j; ΔK×n = (δ kj ) is denoted as the probability that a node out of community k expects to be correlated with node j; Ω (ω1, . . . , ω K ) T = (ω k ) is denoted as the prior probability that a randomly selected node will belong to community k. It can be easily found that P = {p ij } = B g ZΘ and Q = {q ij } = B g ZΔ.
For B
g
, we consider X = (K, Z, Θ, Δ, Ω) as a community unit (CU) of line graph G with respect to B
g
, where Z = {z
ik
} is the module membership matrix that Z
ik
= 1 indicates node i belongs to module k, and otherwise. From the hypothesis space of X, one can expect to find the optimal one to predict the module correlation exactly in terms of B
g
. Based on the posteriori maximum principle, for a given line graph G, we expect to select the optimal X according to B
g
by maximizing the posterior probability. Specifically, there is
The X is used to characterize the membership and correlation behavior of line graph G. Next, in order to compress the correlation behavior in node level to module level, we propose the detailed definition of module correlation, module forward-correlation Φ = {φ
ij
} and module back-correlation Ψ = {ψ
ij
} in terms of Θ, Δ, Z, and Ω:
Then some important properties can bedefined. The forward and back coupling strength between two communities k and s can be represented by Φ
ks
and Ψ
ks
, defined as the module correlation. The “importance” of a given community μ is defined intuitively as the sum of forward and back coupling strength, i.e. Iμ = ∑
s
Φ
μs
+ Ψ
μs
.
Specifically, we can also compress and map the membership matrix Z into module label y, where y (i) = k if the entry (i, k) of B1Z = Z is 1 (node i belongs to module k). For a given combination y, Φ, and Ψ, we can transform the node correlation p
ij
and q
ij
by
As discussed above, we expect to find the optimal X by maximizing the posterior probability of Equation (2). According to Equation (2), as P (X|G, B g ) proportional to the product of P (G|X, B g ) and P (X|B g ), we can maximize P (X|G, B g ) by maximizing P (G|X, B g ) and P (X|B g ), respectively.
First, let’s consider the likelihood P (G|X, B g ). We use the logarithmic transformation L (G|X, B g ) = lnP (G|X, B g ), and there is
Next, we employ the information theory and approximate the prior P (X|B
g
). As 1 ≤ K ≤ L = n/g, a larger number of communities K implies a finer granularity. In the following, we will show that a more complexity of community unit X will be caused by a larger K. One can note that a finer granularity results in more complex models. There is the following mathematical relationship:
Based on the information theory, with prior P (X), ln (1/P (X)) can be used to describe the “minimum description length” of X. We denote
In order to calculate the prior P (X|B
g
), we need to design a compressed coding schema as close to
In the proposed algorithm, maximizing the likelihood function of Equation (5) of multiplex structures is the most computationally costly step, and its running time will dominate the time complexity of the whole computing process. We will analyze its worst time complexity. It is easy to verify that the calculation of θ, δ and ω by Equation (6) will take O (nK) time, and the calculation of γ by Equation (7) will take O (n2K2) time. As mentioned, n is the number of nodes in the whole network, and K is the number of clusters. Therefore, the time complexity of the proposed algorithm is O (tn2K2), where t is the required iterations. Actually, by means of an efficient greedy strategy, the efficiency of the procedure would be greatly improved.
Determine the optimal number of communities
One issue that should be addressed before running the algorithm is the number of communities. In fact, The number of the communities is not known as a priori and it always matters for community detection algorithms. Here, by employing the Markov process and spectral theory, we provide a rigorous proof and deduce the optimal number of community theoretically. Furthermore, some important hidden information, such as stability of community structure, can also be obtained.
In an ergodic Markov process, P
t
is represented as the transition probability matrix of line graph G among nodes after t steps. In order to calculate the matrix P
t
, we use the eigenvalue decomposition of P. The eigenvalue λ
k
, k = {1, ⋯ , n} and its corresponding eigenvectors satisfy:
Assuming that the network A is divided into c disjoint sets(communities) Vμ ⊂ V, μ ∈ {1, 2, ⋯ , c}, forming a partition of V jointly. We use nμ = |Vμ| to represent the number of nodes in each community. To facilitate further discussion, we reformulate the objective function in a matrix form. We use binary vector xμ, which describes the involvement of each node in community μ, to represent the membership vector. Considering the orthogonal vectors x
l
and x
s
, the c community structure can be detected by the maximization of the following function:
On the basis of Rayleigh-Ritz theorem [6, 7], we can calculate the maximum of this problem on condition that columns of Y are the right eigenvectors U = {u1, . . . , u
c
} according with the c largest eigenvalues of
The proposed algorithm shows an excellent performance on both artificial benchmark networks and real-world data networks. In addition, it can be used to detect overlapping communities and corresponding correlation patterns. Results show the proposed algorithm obtains quite good results for all cases. Please refer to Supplementary materials for more details, for example, experiments on artificial networks and “Les Miserables” network [30].
First, according to the partition membership, we bundle the node with identical community membership together and exhibit a transformed adjacent matrix in Fig. 4(a), which illustrate a quite strong hierarchical structure. Specifically, among the total 793 communities, the size of the biggest community is 184, the size of the smallest community is 3, and the average size is 69. Moreover, 1,433 of 6,931 pairs of communities have fuzzy membership with each other, and the 5% communities with largest size account for 25.4% nodes while the other ones are relatively small. Furthermore, in Fig. 4(b), to visualize the community partition membership, we highlight four regions which include 10 overlapping nodes by four circles in a sub-network including eight communities, which is identical with the Refs. [6] and [23]. Experimental results on this large scale real-world networks have demonstrated that our method can successfully discover higher quality communities. From the efficiency validation, the proposed method exhibits good scalability and can efficiently handle networks with millions of edges.

(a) The transformed adjacent matrix which bundles the node with identical community membership together. (b) We pick up a small subnetwork for illustration, which contains 8 communities and 10 overlapping nodes. Different communities are highlighted with different shapes, and overlapping nodes found by our method are enclosed by circles.
This summary of results using our algorithm in real-world networks. For sake of comparison, the best published Modularity Q results are provided
In this paper, we first transform the directed node graph to a new type directed line graph, and the direction and weight of line graph are defined. Then in the line graph, we define the community unit including membership and correlation information. There are two main contributions of this method. 1) To adequately characterize the community structure, the node and module correlation with different granularity can be calculated; 2) based on the membership and correlation information, we can extract the multiplex patterns between communities, according to different domain requirements. Furthermore, we are able to map the link community configuration to the optimal situation dynamically by maximizing the likelihood function with rigorous mathematical proof. Based on the spectral analysis of the Markovian transition matrix, a mathematical theory is provided to identify the optimum number of network communities, and to analyze the stability of the community structure. Extensive simulations using both synthetic and real-world benchmark networks are performed to verify the algorithmic performance.
Acknowledgments
We are grateful to the anonymous reviewers for their valuable suggestions which are very helpful for improving the manuscript. The authors are separately supported by NSFC grants 71401194, 91324203, 11131009 and “121” Youth Development Fund of CUFE grants QBJ1410.
