Abstract
Overlapping communities exist in real networks, where the communities represent hierarchical community structures, such as schools and government departments. A non-binary tree allows a vertex to belong to multiple communities to obtain a more realistic overlapping community structure. It is challenging to select appropriate leaf vertices and construct a hierarchical tree that considers a large amount of structural information. In this paper, we propose a non-binary hierarchical tree overlapping community detection based on multi-dimensional similarity. The multi-dimensional similarity fully considers the local structure characteristics between vertices to calculate the similarity between vertices. First, we construct a similarity matrix based on the first and second-order neighbor vertices and select a leaf vertex. Second, we expand the leaf vertex based on the principle of maximum community density and construct a non-binary tree. Finally, we choose the layer with the largest overlapping modularity as the result of community division. Experiments on real-world networks demonstrate that our proposed algorithm is superior to other representative algorithms in terms of the quality of overlapping community detection.
Keywords
Introduction
In recent years, community detection technologies have been used in many networks, such as mobile networks, the world wide web, citation networks, social networks, and biological networks [1]. The detection of the community structure in the network is important for predicting the dynamic evolution of the network, information interaction between vertices, identifying modules, and understanding the dynamic characteristics of the network. In real networks, the network structure often has the following characteristics: (1) A vertex belongs to multiple communities [2], that is, an overlapping community structure. For example, a person can belong to both a basketball club and a football club in a social network. (2) The network often has a hierarchical structure, and small communities form large communities through merging such as schools and government departments. In particular, the hierarchical community detection reveals the higher-order organization and components at each level and how these components interact with one another to form larger components at a higher-order in the hierarchy.
In 2005 years, Palla et al. [3] proposed an algorithm for the detection of overlapping communities, which assumed that communities are composed of fully connected subgraphs. Since then, a large number of algorithms have been used for overlapping community detection. At present, overlapping community detection algorithms are mainly divided into the following two categories: local methods and global methods. Local methods use extension methods to identify communities. Extension methods mainly include clique percolation [3, 4], seed set extension [5, 6, 7], and the label propagation algorithm [8, 9, 10, 11]. The extension algorithm produces good results in the least time, but most of the algorithms are unstable, and the selection of seed vertices and different extension strategies result in different community division results. Moreover, sometimes there are isolated vertices. Global methods include the optimization method [12] and non-negative matrix factorization (NMF) [13, 14]. The optimization algorithm typically needs to set the objective function in advance to evaluate the community division result to seek the optimal community division, which makes the community division falls into the local optimal. NMF needs the number of communities to be input in advance, which is not possible for a network where the number of communities is unknown.
In 2009 years, Lancichinetti et al. [15] proposed hierarchical and overlapping community detection algorithms in complex networks to discover the deep structural information of networks. Most overlapping community detection algorithms mainly obtain the hierarchical structure using the following three methods: (1) merging the initial communities [16, 17], (2) allowing one vertex to belong to multiple communities during the community detection process [15], and (3) applying a community detection algorithm based on the multi-objective evolutionary algorithm (MOEA) [18, 19]. However, because of the randomness of seed vertex selection and label propagation, this type of algorithm is unstable, and each time the algorithm runs, it may obtain differently community division results.
To obtain a stable overlapping community discovery algorithm and use the local information to perform clustering granulation of the community. In this paper, we define multi-dimensional similarity to obtain the initial particle nodes, form the initial community, and construct the non-binary hierarchical tree. Different from the above algorithms, the proposed algorithm can find overlapping and stable community structures in the case of an unknown number of communities. At the same time, the non-binary hierarchical tree can enable us to observe the different levels of community structure and overlapping node ownership. The main contributions of this paper are summarized as follows:
We propose the HTOCD algorithm which selects leaf vertices based on the similarity matrix and seeks the maximum community density principle for community expansion. During the expansion process, the vertices are allowed to belong to multiple communities to construct non-binary trees. In the clustering process, we coarsen the community into a vertex and rebuild the network until there is only one vertex in the network. Then we select the optimal layer using the value of overlapping community modularity, which is used to evaluate the quality of the overlapping community detection algorithm. We propose a multi-dimensional similarity calculation method. This method not only considers the neighbors of a vertex but also considers the common neighbors between vertices. The similarity function of multi-dimensional structure similarity is composed of the first-dimensional structure similarity and the second-dimensional structure similarity, and the similarity matrix solves the problem of network sparsity. We compare the HTOCD algorithm with five basic algorithms in 12 real complex networks. The experimental results demonstrated that the performance of our proposed algorithm was superior to that of the basic algorithms. In terms of overlapping community detection, this means that the proposed algorithm is competitive and promising.
The structure of this paper is organized as follows: We present related work in Section 2. We provide a detailed description of our algorithm in Section 3, and the results of the performed experiments on real-world networks are shown in Section 4. Finally, we present the conclusion in Section 5.
In the last twenty years, researchers began to study the problems in overlapping communities. Currently, These overlapping community detection algorithms are mainly divided into local methods and global methods. Local methods seek global network community optimization through local community optimization, and can be divided into the following three methods: (1) Clique percolation [3, 4] achieves the union of
Lancichinetti et al. [15] proposed hierarchical and overlapping community detection algorithms in complex networks. Simultaneously, the proposed LFM algorithm also adopts the method of constructing a non-binary hierarchical tree to obtain an overlapping hierarchical community structure. However, the random selection of seed vertices makes the algorithm unstable. The EAGLE algorithm [16] takes the maximal cliques in the network as the initial community and builds the nested structure of communities through agglomerative clustering. The ACLPA algorithm [17] constructs overlapping and hierarchical trees by applying label propagation and agglomerative clustering. The algorithm based on the MOEA [19] builds a hierarchical tree by optimizing two objective functions. One objective function is used to divide the network into smaller communities and the other is used to divide the network into larger communities to seek the balance of the two objective functions and obtain the community division results at different granularities and different levels. Zhang et al. [18] proposed an MR-MOEA for overlapping community detection. In the MR-MOEA, a mixed representation scheme consists of candidate overlapping vertex-based representation and non-overlapping vertex-based representation for fast encoding and decoding of the overlapping divisions. As the MR-MOEA is an algorithm based on the MOEA, it can also achieve community structures at different levels and granularities.
Different from the above work, in this paper, we propose a non-binary hierarchical tree overlapping community detection based on multi-dimensional similarity called HTOCD to detect overlapping communities. The multi-dimensional similarity takes into account the first-dimensional structure similarity and the second-dimensional structure similarity to form the similarity matrix, which can well reflect the affinity between vertices in the network. At the same time, the similarity between vertices is used to determine the initial community and community expansion to solve the problem of algorithm instability. We build a non-binary tree hierarchical structure that is smaller than a binary hierarchical tree. Additionally, compared to other overlapping community discovery algorithms, overlapping vertices are obtained in the process of community detection, so that the communities to which the overlapping vertices belong to can be more clearly understood, and there are no isolated vertices. The hierarchical clustering algorithm not only obtains community division results of different dimensions at different levels but also get the same community division results at each run with the same parameters.
The proposed algorithm
In this section, we will present the HTOCD algorithm for overlapping community detection, including the definition and update of the similarity matrix, construction of the hierarchical tree and the overall procedure.
Definitions
We first consider a network of undirected unweighted graphs
The set of vertices in a community is a subset of all vertices in the network, and the set of vertices in a community can neither be equal to the set of all vertices in the network nor an empty set. If the community set meets the following conditions
then the communities formed are overlapping communities. In this paper, we mainly focus on the detection of overlapping communities, where a vertex can belong to one or more communities.
where
We define the value of
Transformation from an adjacency matrix to an HTOCD similarity matrix.
where
where
Vertex
Where
HTOCD algorithm
During the implementation of the HTOCD algorithm, it can be divided into the following two steps: the initial granulation of communities and the expansion of communities to form a hierarchical structure. For the initial community granulation, we first apply the Decompose sub-algorithm to choose two vertices that have the closest similarity in the network based on Eq. (3) to form the initial community, and then use the Grow sub-algorithm to expand the initial community based on Eq. (5). If the vertex to the initial community’s contribution is greater than a certain fraction of the initial community density, that is Eq. (6) is satisfied, then we add the vertex to the community; otherwise, we select the next initialized community to start the next iteration. When the iteration is complete, for all communities formed by the initial granulation, we apply the Merge sub-algorithm to merge small communities into large communities, coarsen each community as a vertex reconstructs the network based on Eq. (7) and start the next iteration until only one vertex remains in the network. At this time, we stop the algorithm. The flow chart of this algorithm is shown in Fig. 2. Update graph G refers to coarsening each community as a vertex to reconstruct the network.
HTOCD algorithm[1] A graph
the community detection result is updated Choose
Decompose (
HTOCD algorithm flow chart.
The main idea of the Decompose part is that we need to obtain the maximum value
Decompose (
Grow
The Grow part needs to sort vertex
Grow (
Merge
The Merge part merges the initial communities. During merging, we select the communities
Merge (
Experiments
In this section, we demonstrate the effectiveness of our HTOCD algorithm by comparing its results with those of five other overlapping community detection algorithms on a real network.
Experimental setting
1) Comparison algorithms: Five representative algorithms were compared with the HTOCD algorithm.
OCDID [11] is an overlapping community detection algorithm based on the dynamic evolution of information, which regards the network as a dynamic system and allows vertices to share information about their neighbors.
MR-MOEA [18] is an overlapping community detection algorithm based on multi-objective evolution. It combines a fast encoding and decoding overlapping community discovery algorithm with a mixed representation of overlapping vertices and non-overlapping vertices.
LFM [15] is an overlapping and hierarchical overlapping community detection algorithm based on seed vertex expansion. It selects any seed vertex and expands it based on the principle of maximum fitness.
LMD [7] is a community detection algorithm based on the network structure. Seed vertices are selected based on the principle of the maximum local degree of the central vertex, and then neighbor vertices are added iteratively to expand the community.
NMF [14] is an overlapping community detection algorithm based on matrix factorization, and proposes a conceptual model of a network community structure, which can naturally generate close overlapping communities.
For a fair comparison, in the MR-MOEA the population size PS was set to 100 and the maximum number of generations gene was set to 100. The threshold
2) Real-world networks: We selected real network datasets with different sizes and characteristics to evaluate the performance of the HTOCD algorithm. The real network datasets are shown in Table 1. These networks are Zachary’s karate club [24], dolphin social network [25], American college football [26], books about U.S. politics [26], Yeast-D2 [27], Y2H [28], jazz musicians network [29], scientist collaboration network [30], Protein [31], Yeast PPI dataset [32], Power grid [33] and blogs network [34]. Note that karate, football, pollbooks, Yeast-D2, and Y2H are networks with a ground truth community structure.
Some properties of the real network data set
Some properties of the real network data set
3) Evaluation metrics: The first evaluation is the overlapping community modularity [35], which can be applied to the scenario in which the community structure is known or unknown. The modularity of overlapping communities is defined as follows:
where
Another evaluation index is generalized normalized mutual information gNMI [15], which is only applicable to the scenario in which the community structure is known, and it can be used to evaluate the similarity between the discovered community and the real community. The value range of gNMI is 0 to 1. If the calculated value of gNMI is 1, then the division result of the community is consistent with that of the real community; and if it is 0, then the division result is completely different from that of the real community. gNMI is defined as follows:
where
Qov changes with the 
In our proposed HTOCD algorithm to obtain the maximum overlapping modularity, we selected different
The values of different network parameters
Values of different network parameters
In this section, we compare the overlapping community modularity of 12 real networks and the gNMI values of six real networks. In the following sections, we provide a more detailed analysis of the experimental results.
Experimental results in terms of
Comparison results of
on the real-world networks
Comparison results of
Experimental results in terms of gNMI: We used gNMI to evaluate the community detection of six networks with real community structures. From Table 4, we conclude that the gNMI value obtained by the HTOCD algorithm had a higher gNMI and average gNMI value compared with the other overlapping community detection algorithms on most datasets. The MR-MOEA is an MOEA, which optimized the density within and between communities so that the community division results had a high gNMI value. Compared with the MR-MOEA, the HTOCD algorithm obtained a better gNMI value. In the dolphin network, we chose the maximum
Comparison results of gNMI on the real-world networks
Figure 4a shows the overlapping community detection results of the HTOCD algorithm in the karate network. Vertex 9 is considered to be the overlapping vertex, which connects the two communities for information exchange. Figure 4b shows the no-binary hierarchical tree of the HTOCD algorithm in the karate network.
HTOCD algorithm detected the overlapping communities and non-binary hierarchical tree in the karate network.
The five algorithms are as follows: the OCDID algorithm was used to prove the effectiveness of our algorithm; the LFM and NMF algorithms also adopted the method of a vertex belonging to multiple communities to obtain overlapping community structure and the two algorithms demonstrated the effectiveness of the similarity matrix and community based on the matrix expansion; the LMD algorithm adopted the method of merging after obtaining the initialized community structure, and it built a binary tree; the above algorithm proved that the HTOCD algorithm obtained a good community structure by constructing a non-binary tree structure, and the MR-MOEA obtained a group of Pareto solutions to obtain different levels of community structure. This algorithm obtained a large NMI value. We mainly used this algorithm to prove the performance of the HTOCD algorithm for a real community structure. Through the analysis of the above experimental results, we can conclude that our proposed HTOCD algorithm performed better than and was competitive with similar algorithms.
In this paper, we proposed the HTOCD algorithm based on multi-dimensional similarity for overlapping community detection. The multi-dimensional similarity between vertices in a network is used to construct a similarity matrix instead of an adjacency matrix, which not only solves the problem of network sparsity but also the local structural information of the network is better considered. A non-binary hierarchical clustering tree replaces the binary hierarchical clustering tree to form communities of different sizes in different layers. Simultaneously, we can obtain the communities of overlapping vertices in the hierarchical tree so that we can understand the deeper network structure. Additionally, experiments on 12 real networks show that our algorithm is stable and achieves the same community division result with consistent parameters.
We have proved the validity of the HTOCD algorithm in community detection and overlapping vertex detection. Our future work will mainly consider how to determine the most influential vertices in the community and how to determine effective overlapping vertices. This is important for practical applications for overlapping communities in rumor suppression, suppression of the spread of worms in a network, and other aspects.
Footnotes
Acknowledgments
This work is supported by the National Natural Science Foundation of China [Grants numbers 61602003,
