Abstract
In this paper, we have proposed a novel overlapping community detection algorithm based on an ensemble approach with a distributed neighbourhood threshold method (EnDNTM). EnDNTM uses pre-partitioned disjoint communities generated by the ensemble mechanism and then analyzes the neighbourhood distribution of boundary nodes in disjoint communities to detect overlapping communities. It is a form of seed-based global method since boundary nodes are considered as seeds and become the starting point for detecting overlapping communities. A threshold value for each boundary node is used as the minimum influence by the neighbours of a node in order to determine its belongingness to any community. The effectiveness of the EnDNTM algorithm has been demonstrated by testing with five synthetic benchmark datasets and fifteen real-world datasets. The performance of the EnDNTM algorithm was compared with seven overlapping community detection algorithms. The F1-score, normalized mutual information ONMI and extended modularity
Introduction
Discovering communities in networks from different domains like social networks, biological networks, collaboration networks, word-association networks, traffic networks have a deep history and is of tremendous importance [1, 2]. Since a community is composed of nodes representing the members of a community, and edges representing relationships amongst its members, the study of such community structures in real-world networks reveal significantly important properties of these networks. Consequently, the exploration of methods to extract such communities from these networks is of great importance. Community detection (also termed as graph clustering) where a network is represented as a graph, is a process of grouping the most densely connected nodes of a graph, to form a community (cluster). In general, the nodes of the discovered clusters are associated with a high number of edges within the cluster when compared to the rest of the network. Methods for graph clustering can be categorized in two groups: hard (non-overlapping) clustering and soft (overlapping) clustering.
Flow diagram of EnDNTM algorithm.
In hard clustering, the graph network is partitioned into disjoint partitions resulting in crisp clusters where each node (member) can belong to at most one cluster only. Well-known disjoint community detection methods include: graph partitioning [3], hierarchical methods [1], random walk [4], methods based on optimization [5] and greedy approach [6]. Recently, a novel tolerance-based method [7] was proposed to detect disjoint communities.
In contrast to hard clustering, in soft clustering, the clusters are overlapped such that a node can belong to more than one cluster. Many real-world networks inherently display overlapping characteristics. For example, in a Facebook network, a user can be a member of many groups such as a study group, sports team or music club. Similarly, co-author networks, collaboration networks are some examples of real-world networks having an overlapping community structure. There are a plethora of methods for detecting overlapping communities in social networks for both synthetic and real-world datasets starting from [8]. Classical strategies include: local expansion of seed nodes [9, 10], label propagation [11, 12, 13], clique-based [14] and ensemble-based methods [15, 16] to name a few.
In general, overlapping community structures are identified using two different strategies. Direct methods where an overlap structure is assumed and naturally occurring communities are detected (where the links between the communities are examined to detect overlapping communities). In contrast, indirect methods assume a partitioned structure, and then proceed to uncover the overlapping structures. It is the second method that is used in this paper, with a focus on detecting overlapping communities using a collection of disjoint methods as a starting point and selecting the best method using an ensemble mechanism for subsequent processing.
In this paper, we propose a new method based on detecting overlapping communities by i) utilizing disjoint communities produced by ensemble method, and ii) analyzing the neighbourhood distribution of boundary nodes in disjoint communities to detect overlapping clusters. The neighbourhood distribution of boundary nodes was first presented in [17]. Our method is akin to the more recent class of ensemble methods [15] that uses disjoint methods as a starting point for development of overlapping method. Figure 1 shows the high level flow of our proposed method.
The contribution of this paper is a novel graph clustering algorithm EnDNTM to detect overlapping clusters by utilizing disjoint clusters produced by an ensemble and analyzing its neighbourhood distribution. This includes an optimized implementation of TCD (tolerance community detection) method to enable it to work efficiently on large networks. The robustness of the EnDNTM method is demonstrated by testing its performance with various well-known overlapping community detection measures. Empirical evidence of our proposed solution is given by comparing the performance of EnDNTM with 8 popular overlapping methods on 5 benchmark synthetic networks and 15 real-world datasets.
The paper is organized as follows: In Section 2 we discuss research related to our paper. In Section 3, we give basic definitions used in this work. In Section 4, we give details of the EnDNTM algorithm and its time complexity. In Section 5, we give implementation details and describe the datasets used in this work. In Section 6, we discuss the results of experiments performed on various synthetic and real-world networks.
Detecting communities in graph networks is an active research area. In the case of real-world social networks, it has been observed that certain node belongs to multiple communities which draw the attention of researcher to study the nature of overlapping communities. Several algorithms have been proposed to discover overlapping communities. Clique percolation method CPM [14] is a clique based approach in which communities are detected by discovering cliques. Since a node can occur in multiple cliques this property resembles the overlapping nature of node hence in CPM this concept is used as a foundation for detecting overlapping communities. Order statistics local optimization method OSLOM [9] follows the local expansion and optimization strategy. In this, community expansion is performed by comparing its statistical significance with the random graph called global model which is generated using the configuration model. In 2010, the label propagation algorithm (LPA) is extended in order to develop Community overlap propagation algorithm COPRA [12] to produce overlapping communities. Speaker-listener label propagation algorithm SLPA [13] is also based on LPA. It uses speaker-listener mechanism to detect communities. Democratic estimate of the modular organization of a network DEMON [11, 18] is another overlapping communities detection algorithm which uses label propagation at its core. In recent years some more methods for finding overlapping communities has been produced which includes Node-centric overlapping community detection algorithm NECTAR [19] it is a node-centric overlapping community detection algorithm in which the best communities for a given node are found using objective function and further this node is added to these communities to obtain the overlapping communities. It is observed that results obtained from the node-centric method are exceptionally good as compared to result obtained from the community-centric method. In this method, Louvain’s local search heuristic approach is generalized to discover overlapping communities. This algorithm tries to maximize the dynamically chosen objective function (i.e. Weighted overlapping community clustering (WOCC) and Extended modularity (
Meta clustering based disjoint and overlapping community detection MEDOC [16] and Ensemble-based overlapping community detection EnCoD [15] introduced by Chakraborty et al., are ensemble based overlapping community detection methods which leverages the disjoint clusters generated by disjoint methods.
Theoretical framework
Principles of graph theory
The theoretical concepts, which are used to develop this algorithm, will be introduced in this section.
Undirected graph
A graph
Path
A path is composed of a series of nodes
Neighbourhood node
In a graph
where
where
In this section, the basic concepts used for the clustering method described in this work will be explained.
Neighbourhood cluster
Let
In Fig. 2, the neighbourhood cluster(s) NC for the green node belonging to cluster
Distributed neighbourhood threshold
It is the measurement of distribution of neighbours of a node
where
This parameter is introduced with the intuition that for a node to be included in a cluster, one must consider its edges as one of the important characteristics of that node. In addition, the distribution of these characteristics (edges) is analyzed and the minimum threshold
A node
All such nodes that have neighbours outside their home clusters are candidates and are labelled as overlapping candidate nodes.
Overlapping candidate node.
The green node in Fig. 2 has its neighbours in clusters
A overlapping candidate node is classified as overlapping node if its number of neighbours in at least one of its neighbourhood clusters is greater than (or equal to) the distribution threshold
Sample overlapping clusters
To obtain the best disjoint cluster from a group of disjoint clusters is one of the critical tasks of EnDNTM. In order to filter out the best quality disjoint cluster the following evaluator function is defined. This function is referenced from [7] which is composed of two well-known disjoint cluster quality functions: modularity (
The evaluator function is defined as a combination of modularity
where
To measure the quality of generated overlapping clusters, different measures can be used based on the availability of ground-truth values. Measures such as Overlapping Normalized Mutual Information ONMI and F1-score can be used only if the ground-truth communities are available, whereas Extended Modularity
Overlapping normalized mutual information (ONMI)
ONMI [22] is derived from the domain of information theory. It is considered a benchmark to measure the quality of overlapping clusters where the similarity of detected communities are measured using known ground-truth communities. For a given network,
where
F1-Score [23] is another most widely used ground-truth measure to assess cluster quality. In this method the generated cluster is compared with the best matching cluster of ground-truth cover. It is computed as the harmonic mean of Precision and Recall.
The best value obtained for
where Precision is the percentage of nodes in
Recall is the percentage of nodes in
To measure the quality of generated overlapping clusters, Extended Modularity(
In this section, we have utilized Extended Modularity (
In overlapping communities, each node can belong to multiple communities but with different strengths of belonging. An array of belonging factor
where
The value of
EnDNTM algorithm takes crisp partitioned clusters produced from a family of disjoint methods and generates overlapping clusters irrespective of the specific method used to produce the disjoint clusters. The motivation underlying EnDNTM is to take advantage of the best disjoint clusters generated by pre-existing non-overlapping algorithms and analyzing the obtained disjoint clusters to produce overlapping clusters for a given real-world network. Details of the EnDNTM algorithm shown in Fig. 1 are given in the following sections.
Generate non-overlapping clusters
In step 4.1.2, for a given input graph
TCD is computationally bound to work with smaller networks (the biggest network tested is Political Books [26] with 105 nodes and 441 edges). In EnDNTM this limitation of TCD is addressed by optimizing it with multi-processing parallelization to enable it to work efficiently with bigger networks (like Pretty Good Privacy Network [27] with 10680 nodes and 24316 edges). Figure 4 shows the flow of optimized TCD where i) all combinations of parameters (
Flow of optimized TCD
The architecture of EnDNTM is given in Fig. 5. For a given graph, all such disjoint methods with non-deterministic output will be executed 10 times and the best cluster measured by the evaluator function is stored. Disjoint methods with deterministic output are executed only once and the generated clusters are stored. All of the obtained disjoint clusters are further evaluated using an evaluator function in order to obtain the best cluster.
Architecture of EnDNTM
[!ht] Evaluator function
Generate disjoint clusters
clusterList
clusterList
clusterList
clusterList
lock.acquire()
shrDict
shrDict[2][1]
shrDict[2][2]
shrQ.put(shrDict)
lock.release()
Algorithm 4.1.1 describes the evaluator function. It takes a graph
Generate disjoint clusters
Algorithm 4.1.1 represents a generic disjoint community wrapper function responsible for generating disjoint clusters using any disjoint algorithm. Graph
Ensemble method
The ensemble mechanism given in Algorithm 4.1.3, takes a graph
Ensemble method
method
clusterList
algClstr
shrDict
shrQ
shrQ.put(shrDict)
lock
Processes
disjoint method
Processes.append(
p.start()
Process
recvDict
val
clusterList
clusterList
[!h] Find overlapping clusters
clusterId
The generated clusters in this step is used as input along with the input graph
Find overlapping clusters
There are four main steps in this method which are described below and refer to line numbers in Algorithm 4.1.3.
Preprocessing disjoint clusters
Preprocessing of disjoint clusters is done in order to create and store the information required for generating overlapping clusters. Two data structures: Node-Cluster Dictionary
Find overlapping candidate node
In order to find an overlapping candidate node, Node-Neighhbour Dictionary
Calculate distributed neighbourhood threshold
Prior to computing
Filter overlapping node
Using
Generate overlapping clusters
This is the final step of the EnDNTM algorithm. In this step, the actual overlapping clusters are generated shown in lines 31–33. The overlapping node
Each of the resultant clusters are stored in the overlapping clusters list
From a detailed analysis of the EnDNTM Algorithm, it can be observed that it is composed of two modules: first obtaining the disjoint clusters and next finding the overlapping clusters.
Time complexity for disjoint cluster detection
The run time of each disjoint method is given in Table 1 where
Running time of disjoint methods
Running time of disjoint methods
In the EnDNTM algorithm for a graph
Implementation and data sets
This section includes a description of data structures and frameworks used for manipulating the input graph, the development environment used for the coding of the EnDNTM method, followed by a detailed explanation of datasets used in this work.
Data structures used in EnDNTM
Various other data structures used include dictionary, sets, lists and numpy array. Some of them are illustrated in Table 2.
Data structure used in EnDNTM
Data structure used in EnDNTM
For the development of the overlapping method proposed in this work, we used Python programming language version 2.7 and 3.6, Networkx libraries, and Gephi 0.9.2. The hardware configuration was IntelCore i7 CPU 3.6 GHz, 16GB RAM, 500GB SSD with Ubuntu 16.04 and 18.04 operating system.
Datasets used for experiment
Various small, medium and large-sized real-world datasets with good mixture of social, biological, collaboration and word association network were used for experiments. To test the communities generated by EnDNTM method with ground-truth communities, system generated synthetic networks [28] also known as the LFR benchmark network [8] were used. The input data file was pre-processed by removing repeated edges, self loop and converting non-compatible formats into required formats. Table 3 shows the default number of edges before pre-processing and after pre-processing for the real-world dataset along with the number of nodes. A brief description of these datasets is given in following subsections.
Real-world networks
Summary of real-world networks
Summary of real-world networks
In this work, we have used five LFR benchmark networks [28] to evaluate the EnDNTM method on synthetic networks and to examine the accuracy of communities discovered by the EnDNTM with the ground-truth communities generated by LFR. Detailed description of these synthetic networks is given in Table 4 where Bench_30 represents the benchmark network with 30 nodes, Bench_40 represents the benchmark network with 40 nodes and so on,
Summary of synthetic networks, where
is the number of overlapping nodes and
is the average clustering co-efficient
Summary of synthetic networks, where
F1-score for synthetic networks with all disjoint methods
ONMI-score for synthetic networks with all disjoint methods
To examine the performance of EnDNTM, experiments with 15 real world datasets and 5 synthetic benchmark networks were carried out. The
Experiment with various disjoint methods
Since EnDNTM leverages a variety of disjoint methods to produce crisp clusters, it is critical to select best quality crisp clusters in order to generate good quality overlapping clusters. EnDNTM is first executed with each disjoint method separately. For example, EnDNTM (TCD) represents the independent execution of EnDNTM with the input crisp clusters obtained from disjoint TCD (Tolerance Community Detection) method. Similarly, EnDNTM (LN), EnDNTM (GN), EnDNTM (GM), EnDNTM (IM) represent independent execution of EnDNTM with disjoint methods Louvain, Girvan Newman, Greedy Modularity and InfoMap respectively. EnDNTM (Ensemble) represents the execution of EnDNTM with all disjoint methods taken together. Next, the generated overlapping clusters are evaluated using classical measures such as ONMI,
Synthetic networks
Since the ground-truth clusters are only available for large networks like DBLP [39], Amazon [39] etc, in such cases, computer generated benchmark network(s) known as synthetic network(s) are used. In EnDNTM it is observed that it generates consistent result with both version of input file (i.e. repeated and non-repeated edgelist). Detailed analysis of results obtained using
-score for synthetic networks with all disjoint methods
Analysis using F1-Score:
Table 5 gives the F1-Score of EnDNTM algorithm on 5 synthetic datasets and with 5 disjoint methods. The results in BOLD are the best values. EnDNTM represents the F1-Score for the score after the selection of one of the methods. For the dataset Bench_30, it can be observed that EnDNTM with TCD, LN, IM produce the best score, whereas the final F1-Score is obtained by selecting the TCD method. In this case, since TCD, LN, IM all produce the same score, the first one in the list is chosen. For the dataset Bench_50, the final F1-Score is obtained using TCD rather than IM. This is a case where the best disjoint method was not chosen by the EnDNTM algorithm. The same problem shows up with datasets Bench_60 and Bench_60_dense
Analysis using ONMI-Score:
Table 6 gives ONMI-Score for overlapping clusters generated by the EnDNTM algorithm with each disjoint method separately. EnDNTM represents the ONMI-Score for the score after the selection of one of the methods. As mentioned in previous section, in this table also, results in BOLD are the best values. Again, for datasets Bench_50, Bench_60 and Bench_60_dense, the best disjoint method was not chosen by the EnDNTM algorithm.
Analysis using
Table 7 gives
Here, 15 real-world networks from different domains such as social, collaboration, protein, word-association networks were used.
Analysis using
Since the
Cluster analysis
In this section we will analyze the generated overlapping clusters by EnDNTM, and compare it with the ground-truth clusters generated by the LFR benchmark network for Bench_30. The system generated ground-truth clusters by LFR for the synthetic network Bench_30 is given in Fig. 6. It consists of three clusters with two overlapping nodes [12,21].
Clusters detected for the Bench_30 network by EnDNTM is given in Fig. 7. From Fig. 7, it can be seen that EnDNTM is able to discover three clusters which is identical to the ground-truth clusters and four overlapping nodes [12,17,18,21] including two ground-truth overlapping nodes [12,21].
The overlapping nodes classified by other overlapping methods is shown in Table 9 along with number of communities detected.
Ground-truth clusters of the LFR benchmark network (Bench_30).
EnDNTM generated clusters for the LFR benchmark network (Bench_30).
Overlapping nodes for Bench_30 network
F1-score for Synthetic Networks with different algorithms
ONMI-score for synthetic networks with different algorithms
Comparison of EnDNTM using F1-score on synthetic networks.
Comparison of EnDNTM using ONMI-score on synthetic networks.
Comparison of EnDNTM using 
We have compared the performance of the EnDNTM algorithm with seven overlapping community detection algorithms: CPM [14], OSLOM [9], COPRA [12], SLPA [13], Node Perception [40], DEMON [11, 18], CONGO [41] with
-score for real-world networks with different algorithms
Comparison of EnDNTM using 
Comparison of EnDNTM using 
Table 13 gives the results of the performance of EnDNTM for real world networks using the
Since most of the referenced algorithms have a non-unique output for the different runs, for all such algorithms,
The quality of generated overlapping clusters from EnDNTM is greatly affected by the number of disjoint clusters generated by the initial ensemble method. From Eq. (4), it can be observed that
In this work, we have proposed a new overlapping community detection algorithm (EnDNTM) based on utilizing disjoint communities produced by an ensemble method and analyzing the neighbourhood distribution of boundary nodes of discovered disjoint communities to detect overlapping clusters. We have developed an ensemble model consisting various community detection algorithms and utilized an evaluator function for generating the best disjoint communities. The effectiveness of the EnDNTM algorithm has been demonstrated by testing on five benchmarks synthetic networks with ground-truth communities and fifteen real-world datasets and compared with seven overlapping community detection algorithms in terms of an extended modularity
Footnotes
Acknowledgments
Sheela Ramanna’s research has been supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) discovery grant 194376. Rajesh Jaiswal’s research is supported by the UW Graduate Studies Scholarship and Linda and Vana Kirby Scholarship.
