Abstract
Community detection in graphs is widely used in social and biological networks, and the stochastic block model is a powerful probabilistic tool for describing graphs with community structures. However, in the era of “big data”, traditional inference algorithms for such a model are increasingly limited due to their high time complexity and poor scalability. In this paper, we propose a multi-stage maximum likelihood approach to recover the latent parameters of the stochastic block model, in time linear with respect to the number of edges. We also propose a parallel algorithm based on message passing. Our algorithm can overlap communication and computation, providing speedup without compromising accuracy as the number of processors grows. For example, to process a real-world graph with about 1.3 million nodes and 10 million edges, our algorithm requires about 6 seconds on 64 cores of a contemporary commodity Linux cluster. Experiments demonstrate that the algorithm can produce high quality results on both benchmark and real-world graphs. An example of finding more meaningful communities is illustrated consequently in comparison with a popular modularity maximization algorithm.
Introduction
Community structures, in which nodes belonging to the same community are more densely connected to each other than externally, prevail in real graphs. Finding those structures can be beneficial in many fields, such as finding protein complexes in biological networks and topical, disciplinary groups in collaborative networks, etc. [35, 47, 17, 51, 26, 2, 46].
In this work, we consider non-overlapping community structures. Many community detection algorithms handle such a problem [37, 40, 6, 17, 34, 49, 43, 32]. However, they come along with limitations for large graphs, for example, in handling community heterogeneity [36, 16, 50].
Alternatively, model-based methods can produce more reliable and accurate results when the model assumptions are in accordance with real graphs. Stochastic block models (SBMs) are among the important probabilistic tools describing the connectivity relationship between pairs of nodes [23], and have received considerable attention both in theoretical [7] and application domains [14].
Many algorithms have been proposed to infer the parameters of SBMs, such as Bayesian estimation [22] and nuclear norm minimization [8]. Bayesian estimation defines the prior distributions on latent variables (community labels of nodes) and maximizes the posterior distribution when a graph is given [22]. Nuclear norm minimization solves a relaxed problem to find the community indication matrix.
We can also utilize parallel computers to detect communities on large graphs. Parallel computing has been proposed for decades [19, 3]. It can decrease the execution time of applications by executing them on multiple processors [15].
Parallel computers are classified according to the programming model natural to their memory structure, whether shared, or distributed, or hybrid. In a shared memory model [12], each processor can access the entire system memory, while each processor in a distributed memory model [21] can access only its local memory and requires an interprocessor network to communicate data to and from other local memories. In the hybrid model [39], subsets of processors share memory and these subsets are connected by a communication network. The hardware supporting these programming models need not fully resemble them. The convenient global addressing of shared memory can be implemented over distributed memory hardware and the efficient controlled access of distributed memory over shared memory hardware. However, there is a general association of distributed memory protocols with the systems of the largest memory capacities. Therefore, in our algorithm, we choose the message passing model as the most performance-portable to the hardware resources that will be required for graph applications of the future.
In this work, we propose a multi-stage likelihood maximization algorithm based on SBMs, which has three advantages:
Speed: We devise an algorithm based on coordinate descent and use approximations to simplify computations. Our algorithm runs in linear time with respect to the number of edges.
Scalability: We propose a parallel algorithm based on message-passing, which tends to be the most portable and performance-consistent parallel programming model for a variety of memory structures. We overlap the communication and computation in the algorithm, so that for large graphs, it can achieve significant speedup proportional to the number of processors. To the best of our knowledge, it is the first parallel algorithm for inferring SBMs.
Quality: The algorithm can produce high-quality results. In the initialization, it considers each node as one community and then employs a multi-stage iterative strategy to construct larger communities gradually. It outperforms traditional community detection algorithms empirically as measured by normalized mutual information (NMI) [13] with respect to the ground-truth. Experiments on real-world graphs show that our algorithm can produce more meaningful communities with good quality.
Related work
Many community detection algorithms have been proposed in the past decades. Some algorithms rely on a quality function, for example, modularity. Modularity is the fraction of inter-community connections minus the expected fraction when edges are randomly placed [38]. Newman [37] optimizes the modularity in a hierarchical manner. A typical spectral clustering approach [45] can also detect communities. Li and Schuurmans [33] use an iterative rounding strategy for the spectral method to improve the quality of the results. The Louvain method [6] improves the modularity hierarchically. Nevertheless, they may have some disadvantages. For example, modularity-based algorithms may have resolution limit [16], which means they may be unable to find small communities without modifying their objective function [44]. In some cases, other algorithms can give high quality results when maximizing the modularity, but the time complexity could be high. For example, spectral clustering has a typical time complexity no less than
On the other hand, using SBMs is more reliable under particular assumptions. Theoretically, the threshold of exact community recovery can been identified [1]. In practice, many algorithms have been proposed to infer the parameters of SBMs. Bayesian estimation defines the prior distributions on latent variables (community labels of nodes) and maximizes the posterior distribution when a graph is given [22]. Nuclear norm minimization of a matrix minimizes the sum of its singular values, and their algorithm is verified on graphs containing one thousand nodes [8]. Spectral algorithms also demonstrate the capability of inferencing parameters of SBMs [9]. However due to the high time complexity, the run time of those algorithms typically becomes enormous when the graph size grows to millions or billions, which prohibits the application on big graphs.
Recent years have seen work on parallel implementations for non-SBM community detection algorithm. For example, Riedy et al. proposed an parallel agglomerative approach for community detection that supports both maximizing modularity and minimizing conductance [41]. Bhowmick and Srinivasan used a shared memory interface for parallelizing the Louvain method [4]. Staudt and Meyerhenke developed a parallel framework for multiple algorithms, including label propagation and the Louvain method with refinement [43]. Gregori et al. devised an parallel algorithm for k-clique community detection [20]. Parallel algorithms for spectral clustering are also proposed [24]. Those implementations are mostly on shared-memory systems, which constrain the maximum speedup due to the hardware limit. In addition, none of these implementations can suitably address the problem in SBMs.
In this work, we propose a community shrinking and expanding approach to overcome the limitation of the popular coordinate descent method for stochastic blockmodels. It is also scalable on parallel computers. Our fast and scalable algorithm fills the gap of processing very large graphs using stochastic block models.
Stochastic block model
In this work, we develop a community detection algorithm based on a stochastic block model (SBM). We define
Using the matrices
for
For example, we can define a probability matrix
meaning that nodes inside each community are connected at probability 0.2 and nodes from different communities are connected at probability 0.01. We also define a matrix
indicating the first half and the second half of the nodes are from two different groups. Thus, we can have the probability matrix
In this case, a possible adjacency matrix
One adjacency matrix generated by the model.
Typically in practice, the adjacency matrix
In this section, through a simplified example, we illustrate how our algorithm solves the SBM problem. In particular, we assume the diagonal part of the ground truth
Self-assembling process
We define the community that a subset of nodes indeed belongs to as the true community for those nodes, and the community determined by an algorithm at each iteration as the temporary community.
We initialize
At each iteration, each node will associate with a particular temporary community where it has a tight connection. In this case, the temporary community with more nodes from the same true community will take an advantage because the probability of the intra-community connection is higher than other temporary communities. Due to this bias, the association will result in
for the fraction of nodes in a temporary community and from the same true community. In Eq. (4),
In practice, we generate many more temporary communities than the number of true communities at the initialization, to avoid two or more true communities assembling at the same temporary community.
The gap between
Let
Then we will demonstrate that node
Using an auxiliary variable
where
and Hoeffding’s inequality
for Bernoulli random variables.
By choosing an appropriate
Hence,
Because two factors in Eq. (5) are the same, we have
As the lower bound of the probability
where the expansion is based on the Taylor series.
Therefore, as
Thus, the lower bound of
It is possible that the sizes of the partitions are different, and we analyze the consequent self-assembling process under this assumptions.
First, we define
Then, when temporary communities are initialized to have the same size, nodes from different communities are assigned to temporary communities in equal probability. To have the self-assembling process still work, we shall have the following expression, so that nodes from community 1 will not assemble to temporary community
where
During the self-assembling procedure, similar to Eq. (4), we have the following expression for community
Hence, the convergence rate is still quadratic.
At the end of the iteration, to obtain the correct result, the following expression is expected
which is in accordance with the assembling condition at the initialization in Eq. (10).
For undirected networks,
For a specific model, the likelihood is a function of the model parameters, describing the probability of obtaining the observed data with these parameters. The maximum likelihood estimator is the setting of parameters that maximizes the likelihood function.
As defined in Eq. (1), if only
It is very time consuming to maximize such a likelihood function directly through traditional optimization methods (for example, branch-and-bound) for large graphs in which there are at least
For the sake of speed and scalability, we propose a fast algorithm that updates
Estimation algorithm
Given
subject to
We first describe the alternating updating procedure. When
Taking the derivative of
and setting the derivative to be zero, we have
As the inter-block connection probabilities are small, we use a representative scalar value to replace the off-diagonal entries of
Thus, the total time complexity of updating
Proof..
When
By taking the second derivative of the objective function, we have
Therefore, when
When
The time complexity for updating one node by Eq. (18) is
To reduce the time complexity, we propose a faster method. Let
The time complexity is
For fixed
Therefore, the objective function can be rewritten as
If entries in
Proof..
First we reshape the matrix
Let
If
where the square operation is element-wise. It is positive semi-definite because it is a covariance matrix multiplied with a positive scalar.
Therefore, for any column vector
Hence, the objective function is convex. As obviously the constraints are convex as well, this problem is a convex optimization. ∎
However, in our model,
A community shrinking and expanding approach works as follows. First, randomly initialize the nodes into
This approach reduces the “collision” probability in the initialization significantly, where a “collision” is the situation that two true communities both have most of the nodes in one temporary community. The fraction of the “non-collision” probability after shrinking to the probability without shrinking is:
The lower bound is not sensitive to the choice of
Initializing an extra number of communities may introduce some tiny communities in the final result. The community expanding is devised by considering the likelihood of the inter-community edges. When they are more likely to be in certain communities, the corresponding nodes are merged. In detail, for any two communities
which is out of
where
The time complexity for each merging is
Serial Algorithm (ML-SBM)set community number to
no more merging is possible
In Fig. 2, we use the graph in Fig. 1 as an example to illustrate the process. When two nodes at both ends of an edge belong to the same community, we assign the edge with a specific color corresponding to the community. Otherwise, the edge is in background color (white) and invisible.
Changes of community labels in each iteration.
At the beginning, we shrink the temporary communities as initialization, so at Iteration 1 of Stage 1, there are many small communities (temporary communities) and the edge colors are very diverse. At the end of Stage 1, due to the self assembling process, some temporary communities become a bit larger. After each stage, we expand the communities by merging temporary communities that very likely belong to the same community. Therefore, after the transition of each stage, we can see much larger communities which are slightly refined in the stage. Eventually, the algorithm converges and the two true communities are identified.
In this part, we consider an ideal stochastic block model. We assume that there are
Proof..
When the sizes of communities are equal, each community has
which means that the convergence is quadratic. ∎
Parallelism
The parallelism is designed as follows. Let
[b] Parallel Algorithm (Par-SBM)
set community number to
update
no more merging is possible
We choose a message-passing model which is applicable on a variety of memory structures, whether shared, or distributed, or hybrid. We use non-blocking MPI communications to improve the communication-computation overlap. This has two-fold benefits: if the hardware allows, it utilizes the network bandwidth during computation, and it maintains a convergence rate for
The algorithm uses similar partitioning and communication strategies when computing
In parallel computing, the rows of
where
The serial algorithm
In this section, we compare our algorithm (ML) with other serial algorithms in MATLAB. The competitors include the Louvain method (LV) [6], a Bayesian inference algorithm (BI) [22], spectral clustering (SC) [45], label propagation (LP) [40], and modularity optimization (MM) [37]. We use the LFR benchmark generator [28] to create graphs and run all of them for comparison. Each generated graph is of size 1000, average degree 30, maximum degree 50, exponent of degree distribution
The accuracy is evaluated by normalized mutual information (NMI) [13], which compares the similarity between the computed community structure and the ground-truth one. The larger the NMI is, the more similar the two schemes are. If two schemes are identical, NMI is 1. The results are averaged on five runs.
From Fig. 3a and b we can see that our algorithm is relatively fast and has the best accuracy. When
However, for large graphs, SC is not scalable, and LV is not accurate. Our algorithm is the method of choice for both scalability and accuracy. Figure 3c and d shows the results on graphs similar as the previous experiment but the average degree is 10 and
Comparison to popular algorithms in MATLAB. (a) and (b) represent the variation of average accuracy and run time over different choices of 
In this section, we exploit distributed memory parallelism for larger graphs. When the input graph has more than one million nodes, typical algorithms for stochastic block models are not able to infer the parameters within a reasonable amount of time (for example, 24 hours). Therefore, we compare our algorithm (PS+number of processors) to a popular community detection algorithm, the Louvain method (LV) [6] and a fast graph partitioning algorithm Metis (MT) [25]. The Louvain method implemented in C can also generate level one structures as byproducts containing the finest communities (LVF). Those communities are merged into larger ones in LV. Parallel algorithms are often extended from serial versions [49], but the results are typically invariant. We use several metrics to compare the results, such as NMI, modularity [11], and conductance (the smaller, the better) [30]. The results on those quantities are averaged on five runs.
First, we run our algorithm on benchmark graphs (LFR-1e6) generated by LFR benchmark [28]. The parameters are similar to the serial case except the graph size is increased to one million, and we change two parameters (the mixing parameter
Properties of the test graphs
Properties of the test graphs
Comparison on benchmark data.
Comparison on real-world data
Number of communities versus community size.
We also compare algorithms on real-world graphs: cit-Patents [29] and dblp-Coauthor [31]. Nodes in cit-Patents are patents in U.S. patent dataset and edges represent the citation relationships between the two patents. dblp-Coauthor is a coauthor graph extracted from publications in DBLP database. It is also worth noting that LVF has higher NMI and more communities than LV in benchmark test. It indicates that LV suffers from the resolution limit problem [16] that prefers unrealistically large communities. A similar problem happens on LV for real-world data as it generates many communities containing more than
On the other hand, MT tends to find even-size communities, which is not desirable for community detection either. Table 2 shows the quality of results by different algorithms, where run time is in seconds.
In this section, we analyze the community detection results of a run by our Par-SBM (PS) and the Louvain method [6] using random initialization. For the Louvain method, we pick the Level One structure with finest communities (LVF) and the Level Two structure denoted by LV2. Communities become larger in higher level structures by the Louvain method. For example, we consider community containing author “Michael_Wooldridge,” and use
All the algorithms are able to identify blue nodes in Fig. 6 as a separate community, while our Par-SBM provides the clearest detail.
Subgraph of the dblp-Coauthor network, where colors represent the community labels determined by different algorithms.
Comparison to the state-of-the-art algorithm by nuclear norm minimization. The curves represent the average smallest 
In addition, Par-SBM can find relevant nodes more than LVF does. For example, Par-SBM uniquely includes “H._H._Pham” and “Ali_M._Aseere” into the purple community in Fig. 6a. The corresponding authors have “agent” in the title of all their publications.
In 2012, a nuclear norm minimization algorithm (NN) [8] is proposed to solve such SBM problem. It naturally comes with time complexity
Properties of the test graphs
Properties of the test graphs
Performance evaluation for graphs.
The test graph contains 1000 nodes with 5 communities; over different
Figure 7 compares the accuracy of our algorithm (ML) with NN, which is the best reported in [8]. This figure shows that our algorithm is very competitive to the nuclear norm minimization (NN), and the minimum possible gap between
We use four different graphs. The first graph (LFR-1e6) is generated by LFR benchmark [28] using the same setting as in Fig. 3c and d except that the graph size is much larger. The second graph (Bitcoin) is a Bitcoin transaction network [27], where each node corresponds to one Bitcoin address, and each edge indicates that there is at least one transaction between two addresses. The third graph (soc-LiveJournal1 [30]) is obtained from Stanford Large Network Dataset Collection. The fourth graph (Wikipedia) contains most of the pages and the interlinks on Wikipedia in 2009.
The graph properties are listed in Column “Node Number” and “Edge Number” in Table 3. These graph data are preprocessed and partitioned on each processor so that different processors will have a similar number of nodes and edges.
We run our algorithm on these graphs using different number of processors, but the accuracy is quite stable as expected. For example, when running on LFR-1e6 with ground-truth cluster information, the resulted NMI by our algorithm is in the range of 0.997
Figure 8 demonstrates the run time and the speedup. Our algorithm on all the graphs can achieve a significant speedup as the number of processors grows. The Wikipedia graph has the highest average degree in all the test graphs, and Fig. 8c shows that our algorithm can achieve the biggest speedup of 48x on 128 cores. This phenomenon is in accordance with Eq. (23).
We also compare the speedup performance with a recently proposed parallel community detection algorithm [41, 42] running on shared-memory systems. Their community detection algorithm achieves 5.1x speedup on 16 cores using Intel Xeon X5570 CPU and 9.4x speedup using 32 cores on XMT2 [42] for soc-LiveJournal1 data.
Using the same data set and the same Intel Xeon CPU, our algorithm achieves an 8.2x and 12.8x speedup on 16 and 32 cores respectively, as shown in Fig. 8d. Furthermore, 25.7x speedup can be achieved on 128 cores.
Concluding remarks
In this paper, we have proposed a fast algorithm for clustering nodes in large graphs using stochastic block models. We have adopted alternative updates and coordinate descent methods to infer the latent parameters, and used community shrinking and expanding to improve accuracy. Our algorithm has linear time complexity in the number of edges, and can scale to multi-processor systems.
Compared with other community detection algorithms, our SBM-based method can produce high quality results on benchmark graphs and find interesting communities in the real-world graphs.
This work can boost the application of SBMs on big data and bring new insights by overcoming the accuracy or run time shortages of traditional algorithms. The code is available at
Footnotes
Acknowledgments
Research reported in this publication was supported by King Abdullah University of Science and Technology (KAUST).
