A non-binary hierarchical tree overlapping community detection based on multi-dimensional similarity

Abstract

Overlapping communities exist in real networks, where the communities represent hierarchical community structures, such as schools and government departments. A non-binary tree allows a vertex to belong to multiple communities to obtain a more realistic overlapping community structure. It is challenging to select appropriate leaf vertices and construct a hierarchical tree that considers a large amount of structural information. In this paper, we propose a non-binary hierarchical tree overlapping community detection based on multi-dimensional similarity. The multi-dimensional similarity fully considers the local structure characteristics between vertices to calculate the similarity between vertices. First, we construct a similarity matrix based on the first and second-order neighbor vertices and select a leaf vertex. Second, we expand the leaf vertex based on the principle of maximum community density and construct a non-binary tree. Finally, we choose the layer with the largest overlapping modularity as the result of community division. Experiments on real-world networks demonstrate that our proposed algorithm is superior to other representative algorithms in terms of the quality of overlapping community detection.

Keywords

Hierarchical and overlapping community detection, multi-dimensional similarity matrix, non-binary tree

1. Introduction

In recent years, community detection technologies have been used in many networks, such as mobile networks, the World Wide Web, citation networks, social networks, and biological networks [1]. Detecting the community structure of a network is important for predicting the dynamic evolution of the network, analyzing information interaction between vertices, identifying modules, and understanding the dynamic characteristics of the network. Real networks often have the following characteristics: (1) A vertex belongs to multiple communities [2], that is, the network has an overlapping community structure. For example, in a social network a person can belong to both a basketball club and a football club. (2) The network often has a hierarchical structure, in which small communities merge to form large communities, as in schools and government departments. In particular, hierarchical community detection reveals the higher-order organization and the components at each level, and how these components interact with one another to form larger components at a higher level of the hierarchy.

In 2005, Palla et al. [3] proposed an algorithm for the detection of overlapping communities, which assumed that communities are composed of fully connected subgraphs. Since then, a large number of algorithms have been proposed for overlapping community detection. At present, overlapping community detection algorithms are mainly divided into two categories: local methods and global methods. Local methods use extension methods to identify communities. Extension methods mainly include clique percolation [3, 4], seed set extension [5, 6, 7], and the label propagation algorithm [8, 9, 10, 11]. Extension algorithms produce good results in little time, but most of them are unstable: the selection of seed vertices and the choice of extension strategy lead to different community division results. Moreover, they sometimes leave isolated vertices. Global methods include the optimization method [12] and non-negative matrix factorization (NMF) [13, 14]. The optimization algorithm typically needs an objective function to be set in advance to evaluate the community division result and seek the optimal division, which can cause the community division to fall into a local optimum. NMF needs the number of communities to be input in advance, which is not possible for a network where the number of communities is unknown.

In 2009, Lancichinetti et al. [15] proposed hierarchical and overlapping community detection algorithms in complex networks to discover the deep structural information of networks. Most overlapping community detection algorithms obtain the hierarchical structure using one of the following three methods: (1) merging the initial communities [16, 17], (2) allowing one vertex to belong to multiple communities during the community detection process [15], and (3) applying a community detection algorithm based on the multi-objective evolutionary algorithm (MOEA) [18, 19]. However, because of the randomness of seed vertex selection and label propagation, this type of algorithm is unstable, and each run of the algorithm may produce different community division results.

To obtain a stable overlapping community discovery algorithm that uses local information to perform clustering granulation of the community, we define a multi-dimensional similarity to obtain the initial particle nodes, form the initial communities, and construct the non-binary hierarchical tree. Different from the above algorithms, the proposed algorithm can find overlapping and stable community structures when the number of communities is unknown. At the same time, the non-binary hierarchical tree enables us to observe the community structure at different levels and the ownership of overlapping nodes. The main contributions of this paper are summarized as follows:

(1) We propose the HTOCD algorithm, which selects leaf vertices based on the similarity matrix and expands communities according to the principle of maximum community density. During the expansion process, vertices are allowed to belong to multiple communities so that non-binary trees can be constructed. In the clustering process, we coarsen each community into a vertex and rebuild the network until there is only one vertex in the network. Then we select the optimal layer using the value of overlapping community modularity, which is used to evaluate the quality of the overlapping community detection algorithm.
(2) We propose a multi-dimensional similarity calculation method. This method considers not only the neighbors of a vertex but also the common neighbors between vertices. The similarity function of multi-dimensional structure similarity is composed of the first-dimensional structure similarity and the second-dimensional structure similarity, and the similarity matrix solves the problem of network sparsity.
(3) We compare the HTOCD algorithm with five baseline algorithms on 12 real complex networks. The experimental results demonstrate that the performance of our proposed algorithm is superior to that of the baseline algorithms, which indicates that the proposed algorithm is competitive and promising for overlapping community detection.

The rest of this paper is organized as follows: We present related work in Section 2. We provide a detailed description of our algorithm in Section 3, and the results of the experiments performed on real-world networks are shown in Section 4. Finally, we present the conclusion in Section 5.

2. Related work

Over the last twenty years, researchers have studied the problem of overlapping communities. Currently, overlapping community detection algorithms are mainly divided into local methods and global methods. Local methods seek global network community optimization through local community optimization, and can be divided into the following three methods: (1) Clique percolation [3, 4] achieves the union of $k$ large associations in the network by defining $k$ small associations. (2) The label propagation algorithm [8, 9, 10, 11] identifies overlapping communities by using multiple labels and modifying the label propagation strategy. The OCDID algorithm [11] obtains an overlapping community structure through the propagation of information flow in the network structure, but it cannot obtain a hierarchical community structure. (3) In seed set expansion [5, 6, 7], seed vertices are selected and gradually expanded to form a community. The LMD algorithm [7] selects seed vertices through the maximum degree of local vertex centrality, expands the seed vertices, and obtains the community structure by combining communities; the resulting hierarchical tree is binary. Global methods assume the existence of communities and assign the vertices in the network to all communities. There are two main approaches: (1) Optimization needs the objective function to be defined in advance and seeks its optimum in the process of community detection. Examples include genetic algorithms [20], simulated annealing [21], and particle swarm optimization [22]. Similar algorithms also include those based on spectral theory. The optimization algorithm can fall into a local optimum and needs the global information of the network, which is time-consuming. (2) NMF [13, 14]. The NMF algorithm [14] proposes a perceptive model of network community structure, which allows a vertex to belong to multiple communities and forms a non-binary hierarchical tree. However, most algorithms based on NMF require prior knowledge of the number of communities.

Lancichinetti et al. [15] proposed hierarchical and overlapping community detection algorithms in complex networks. Simultaneously, the proposed LFM algorithm also adopts the method of constructing a non-binary hierarchical tree to obtain an overlapping hierarchical community structure. However, the random selection of seed vertices makes the algorithm unstable. The EAGLE algorithm [16] takes the maximal cliques in the network as the initial community and builds the nested structure of communities through agglomerative clustering. The ACLPA algorithm [17] constructs overlapping and hierarchical trees by applying label propagation and agglomerative clustering. The algorithm based on the MOEA [19] builds a hierarchical tree by optimizing two objective functions. One objective function is used to divide the network into smaller communities and the other is used to divide the network into larger communities to seek the balance of the two objective functions and obtain the community division results at different granularities and different levels. Zhang et al. [18] proposed an MR-MOEA for overlapping community detection. In the MR-MOEA, a mixed representation scheme consists of candidate overlapping vertex-based representation and non-overlapping vertex-based representation for fast encoding and decoding of the overlapping divisions. As the MR-MOEA is an algorithm based on the MOEA, it can also achieve community structures at different levels and granularities.

Different from the above work, in this paper we propose a non-binary hierarchical tree overlapping community detection algorithm based on multi-dimensional similarity, called HTOCD, to detect overlapping communities. The multi-dimensional similarity takes into account the first-dimensional structure similarity and the second-dimensional structure similarity to form the similarity matrix, which reflects the affinity between vertices in the network well. At the same time, the similarity between vertices is used to determine the initial communities and to guide community expansion, which solves the problem of algorithm instability. We build a non-binary hierarchical tree, which is smaller than a binary hierarchical tree. Additionally, in contrast to other overlapping community discovery algorithms, overlapping vertices are obtained during the process of community detection, so that the communities to which the overlapping vertices belong can be understood more clearly, and there are no isolated vertices. The hierarchical clustering algorithm not only obtains community division results of different dimensions at different levels but also yields the same community division results on every run with the same parameters.

3. The proposed algorithm

In this section, we will present the HTOCD algorithm for overlapping community detection, including the definition and update of the similarity matrix, construction of the hierarchical tree and the overall procedure.

3.1 Definitions

We first consider an undirected, unweighted network $G=(V,E)$ , where $V$ is the vertex set and $E$ is the edge set. The network can be represented as an adjacency matrix $A$ , whose entry is 1 if the two corresponding vertices are connected by an edge and 0 otherwise. A community is defined as a vertex set $C_{i}=\{V_{1},V_{2},\ldots,V_{n}\}$ , and the main task of community detection is to divide the vertices in the network according to community structure to form the community set $C=\{C_{1},C_{2},\ldots,C_{k}\}$ , where $C_{i}$ meets the following conditions:

$\displaystyle C_{i}\subset V\quad\textit{ and }\quad C_{i}\neq\varnothing,\ C_{i}\neq C_{j},\ \forall i\neq j\quad\textit{ and }\quad i,j\in\{1,2,\ldots,k\},\ \bigcup_{i=1}^{k}C_{i}=V$ (1)

The set of vertices in a community is a subset of all vertices in the network, and the set of vertices in a community can neither be equal to the set of all vertices in the network nor an empty set. If the community set meets the following conditions

$\displaystyle C_{i}\cap C_{j}\neq\varnothing,\ \exists i\neq j\quad\textit{ and }\quad i,j\in\{1,2,\ldots,k\}$ (2)

then the communities formed are overlapping communities. In this paper, we mainly focus on the detection of overlapping communities, where a vertex can belong to one or more communities.
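The conditions in Eqs. (1) and (2) can be checked mechanically. The following is a minimal sketch, assuming communities are given as Python sets of vertex labels; the function names `is_cover` and `overlapping_vertices` are hypothetical and not part of the original algorithm.

```python
from collections import Counter

def is_cover(communities, vertices):
    """Check Eq. (1): non-empty proper subsets of V, pairwise distinct, union equals V."""
    all_vertices = frozenset(vertices)
    sets = [frozenset(c) for c in communities]
    non_empty_proper = all(s and s < all_vertices for s in sets)
    distinct = len(set(sets)) == len(sets)
    covers = frozenset().union(*sets) == all_vertices
    return non_empty_proper and distinct and covers

def overlapping_vertices(communities):
    """Vertices that belong to more than one community, i.e. witnesses of Eq. (2)."""
    counts = Counter(v for c in communities for v in set(c))
    return {v for v, k in counts.items() if k > 1}

communities = [{1, 2, 3}, {3, 4}, {4, 5, 1}]
print(is_cover(communities, range(1, 6)))   # True
print(overlapping_vertices(communities))    # the set containing 1, 3 and 4
```

A non-empty result from `overlapping_vertices` means that the community set is overlapping in the sense of Eq. (2).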

Definition 1 (multi-dimensional similarity): We define the similarity between two vertices of the network as

$\displaystyle\textit{similarity}[i][j]=\frac{\sum_{k\in N_{ij}}\textit{Adj}[i][k]\times\textit{Adj}[k][j]}{|N_{ij}|}+\textit{Adj}[i][j]$ (3)

where $N_{ij}$ is a set of common neighbor vertices, and $|N_{ij}|$ is the number of common neighbor vertices. $\textit{Adj}[i][j]$ is the value of the adjacency matrix. If vertex $i$ and vertex $j$ are connected by edges, then the value is 1; otherwise, it is 0. We define the first half of Eq. (3) as the second-dimensional structural similarity and the second half as the first-dimensional structural similarity.

We define the value of $\textit{similarity}[i][j]$ in the similarity matrix as the weight between vertices $i$ and $j$ in the network, that is, $\textit{similarity}[i][j]=w[i][j]$ . The similarity matrix is symmetric. The transformation of the matrix is shown in Fig. 1. Consider vertex 2 and vertex 4 in Fig. 1a as an example. Vertex 2 has no edge to vertex 4, so the value of the first-dimensional structure similarity is 0. Vertex 2 can reach vertex 4 through the three common neighbors 1, 3, and 5, so the second-dimensional structure similarity is $3/3=1$ . Hence the entry for vertex 2 and vertex 4 in the similarity matrix is 1. The value of the similarity matrix reflects the similarity between vertices: the larger the value, the greater the similarity. The bold lines in Fig. 1a indicate a high similarity between vertices. The similarity function is used to obtain the initial community of two vertices.

Figure 1.

Transformation from an adjacency matrix to an HTOCD similarity matrix.
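The transformation can be illustrated with a minimal sketch of Eq. (3). The graph below is a hypothetical example (it is not the graph of Fig. 1), the function name `similarity_matrix` is an assumption, and setting the diagonal to 0 is a choice made here because the source does not define self-similarity.

```python
def similarity_matrix(adj):
    """Multi-dimensional similarity of Eq. (3) for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: self-similarity is left at 0
            # N_ij: common neighbours of i and j
            common = [k for k in range(n) if adj[i][k] and adj[k][j]]
            second = 0.0
            if common:
                second = sum(adj[i][k] * adj[k][j] for k in common) / len(common)
            # second-dimensional term plus first-dimensional term
            sim[i][j] = second + adj[i][j]
    return sim

# Hypothetical 5-vertex graph; vertex labels 1..5 map to indices 0..4.
edges = [(1, 2), (2, 3), (2, 5), (1, 4), (3, 4), (4, 5)]
adj = [[0] * 5 for _ in range(5)]
for u, v in edges:
    adj[u - 1][v - 1] = adj[v - 1][u - 1] = 1

sim = similarity_matrix(adj)
print(sim[1][3])  # vertices 2 and 4: no edge, three common neighbours -> 1.0
```

For an unweighted graph, the second-dimensional term is 1 whenever two vertices share at least one neighbor, so every entry of the similarity matrix is 0, 1, or 2.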

Definition 2 (community density): Let $G$ be a weighted graph with a weight function $w$ . For a subgraph $C\subset G$ , we define the density of $C$ as

$\displaystyle d(C)=\frac{2\sum_{e\in E(C)}w(e)}{|V(C)|(|V(C)|-1)}$ (4)

where $|V(C)|$ is the number of vertices in the community and $e$ is an edge in the community. The formula is used to evaluate the tightness within a community.
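As a worked illustration of Eq. (4), the sketch below computes the density of a small weighted community. The weight dictionary keyed by unordered vertex pairs and the function name `community_density` are assumptions made for this example; returning 0 for a single vertex is also an assumption, since Eq. (4) is undefined in that case.

```python
from itertools import combinations

def community_density(weights, members):
    """Eq. (4): 2 * (sum of internal edge weights) / (n * (n - 1))."""
    members = list(members)
    n = len(members)
    if n < 2:
        return 0.0  # assumption: a single vertex has zero density
    total = sum(weights.get(frozenset(pair), 0.0) for pair in combinations(members, 2))
    return 2.0 * total / (n * (n - 1))

# Hypothetical weighted triangle: w(ab) = 2, w(bc) = 1, w(ac) = 1.
weights = {
    frozenset(("a", "b")): 2.0,
    frozenset(("b", "c")): 1.0,
    frozenset(("a", "c")): 1.0,
}
print(community_density(weights, ["a", "b", "c"]))  # 2 * 4 / (3 * 2) = 1.333...
```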

Definition 3 (contribution of a vertex to the community): We define the contribution of $v$ to $C$ as

$\displaystyle c(v,C)=\frac{\sum_{u\in V(C)}\alpha\times w(uv)+\textit{Adj}[v][u]}{|V(C)|}$ (5)

where $v\notin V(C)$ and $\alpha$ is a parameter, $0<\alpha<1$ . We use the second-dimensional similarity and the first-dimensional similarity of vertices to evaluate the closeness between vertices and communities.
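A minimal sketch of Eq. (5) follows. It assumes that the sum runs over both terms, that $w$ is the similarity weight of Eq. (3), and that $\alpha=0.7$ ; the matrices are written out by hand for a hypothetical graph consisting of a triangle 0, 1, 2 with a pendant vertex 3 attached to vertex 2. The function name `contribution` is hypothetical.

```python
def contribution(v, community, sim, adj, alpha=0.7):
    """Eq. (5), reading the sum as covering both terms (an assumption)."""
    members = list(community)
    total = sum(alpha * sim[u][v] + adj[v][u] for u in members)
    return total / len(members)

# Hypothetical graph: triangle 0-1-2 plus the edge 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
# Similarity weights of Eq. (3) for this graph, computed by hand.
sim = [
    [0, 2, 2, 1],
    [2, 0, 2, 1],
    [2, 2, 0, 1],
    [1, 1, 1, 0],
]
print(contribution(2, [0, 1], sim, adj))  # about 2.4: vertex 2 is tightly tied to {0, 1}
print(contribution(3, [0, 1], sim, adj))  # about 0.7: vertex 3 only shares a neighbour
```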

Definition 4 (community extensible) [23]: We define the following formula to determine whether to add vertex $v$ to community $C$ :

$\displaystyle c(v,C)>\alpha_{n}d(C)$ (6)

Vertex $v$ is added to $C$ if $c(v,C)>\alpha_{n}d(C)$ , where $n=|V(C)|$ and $\alpha_{n}=1-\frac{1}{2\gamma(n+t)}$ , with $\gamma\geqslant 1$ and $t\geqslant 1$ as user-specified parameters. This condition applies to the expansion of the community.
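The threshold factor $\alpha_{n}$ grows toward 1 as the community grows, so the admission test becomes stricter for larger communities. The sketch below illustrates this with $\gamma=t=1$ ; the function names `alpha_n` and `is_extensible` are hypothetical.

```python
def alpha_n(n, gamma=1.0, t=1.0):
    """Threshold factor of Definition 4: 1 - 1 / (2 * gamma * (n + t))."""
    return 1.0 - 1.0 / (2.0 * gamma * (n + t))

def is_extensible(contrib, density, n, gamma=1.0, t=1.0):
    """Eq. (6): admit the vertex if its contribution exceeds alpha_n * d(C)."""
    return contrib > alpha_n(n, gamma, t) * density

print(alpha_n(2))                   # 1 - 1/6, about 0.833
print(alpha_n(9))                   # 1 - 1/20, about 0.95
print(is_extensible(0.9, 1.0, 2))   # True
print(is_extensible(0.9, 1.0, 9))   # False
```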

Definition 5 (similarity matrix update): We define the similarity between communities as

$\displaystyle W(C_{i},C_{j})=\eta\times\frac{\sum_{e\in E_{ij}}W(e)}{|E_{ij}|}+\frac{\sum_{e\in E_{ij}}\textit{Adj}[v_{i}][v_{j}]}{|C_{i}||C_{j}|}$ (7)

where $\eta$ is a parameter, $0<\eta<1$ , and the set of crossing edges is $E_{ij}=\{v_{i}v_{j}:v_{i}\in C_{i},v_{j}\in C_{j},v_{i}\neq v_{j}\}$ . After obtaining the communities, we regard each community as a vertex and reconstruct the network. The formula is used to calculate the similarity between the vertices of the reconstructed network.
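A minimal sketch of Eq. (7) is given below. It assumes that $E_{ij}$ contains every pair of distinct vertices drawn from the two communities, exactly as the set is written, that $W(e)$ is the similarity weight of Eq. (3), and that $\eta=0.5$ . The function name `community_similarity` and the example matrices (a triangle 0, 1, 2 with a pendant vertex 3) are hypothetical.

```python
def community_similarity(ci, cj, sim, adj, eta=0.5):
    """Eq. (7): eta * (mean weight over crossing pairs) + (link count) / (|Ci| * |Cj|)."""
    pairs = [(u, v) for u in ci for v in cj if u != v]
    if not pairs:
        return 0.0  # assumption: no crossing pairs means zero similarity
    mean_weight = sum(sim[u][v] for u, v in pairs) / len(pairs)
    link_ratio = sum(adj[u][v] for u, v in pairs) / (len(ci) * len(cj))
    return eta * mean_weight + link_ratio

adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
sim = [
    [0, 2, 2, 1],
    [2, 0, 2, 1],
    [2, 2, 0, 1],
    [1, 1, 1, 0],
]
# Crossing pairs: (0,2), (0,3), (1,2), (1,3); mean weight 1.5, two links out of four.
print(community_similarity([0, 1], [2, 3], sim, adj))  # 0.5 * 1.5 + 0.5 = 1.25
```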

3.2 Algorithm description

3.2.1 HTOCD algorithm

The HTOCD algorithm can be divided into the following two steps: the initial granulation of communities and the expansion of communities to form a hierarchical structure. For the initial community granulation, we first apply the Decompose sub-algorithm to choose the two vertices with the highest similarity in the network based on Eq. (3) to form the initial community, and then use the Grow sub-algorithm to expand the initial community based on Eq. (5). If the contribution of a vertex to the initial community is greater than a certain fraction of the initial community density, that is, if Eq. (6) is satisfied, then we add the vertex to the community; otherwise, we select the next initialized community and start the next iteration. When the iteration is complete, we apply the Merge sub-algorithm to all communities formed by the initial granulation to merge small communities into large communities, coarsen each community into a vertex, reconstruct the network based on Eq. (7), and start the next iteration until only one vertex remains in the network. At this point, the algorithm stops. The flow chart of this algorithm is shown in Fig. 2, where "Update graph G" refers to coarsening each community into a vertex to reconstruct the network.

HTOCD algorithm
Input: A graph $G$ ; $W_{o}$ is the maximum vertex similarity.
Output: A hierarchical clustering tree for graph $G$ .
1: while the community detection result is updated do
2:     Choose $W_{o}$ according to some criterion;
3:     Decompose ( $G$ , $W_{o}$ );
4:     Grow ( $G$ , $W_{o}$ );
5:     Merge ( $G$ );
6:     Store the resultant graph to $G$ ;
7: end while
8: Trace the movement of each vertex and generate a non-binary hierarchical tree.

Figure 2.

HTOCD algorithm flow chart.

3.2.2 Decompose

The main idea of the Decompose part is to obtain the maximum value $W_{o}$ of the vertex similarity based on Eq. (3), and then sort the vertex pairs whose similarity is greater than or equal to $\lambda W_{o}$ in descending order of similarity, where $\lambda\in[0,0.6]$ is a non-negative real number. The pair of vertices $V_{i}$ and $V_{j}$ with the largest similarity is selected first; if at least one of $V_{i}$ and $V_{j}$ is not yet part of any community, the pair is initialized as a new community $C$ , and the community is then expanded according to the Grow part of the algorithm. Each vertex that is not added to any community forms a community of its own. However, for large-scale networks this produces too many single-vertex communities. Hence, in the first iteration, we add each such vertex to the community with which it has the largest similarity, where the similarity between a vertex and a community is the sum of the similarities between the vertex and all vertices in the community. In the subsequent iterations, we take the communities generated in the previous round as vertices to reconstruct the network based on Eq. (7) and start a new iteration. Sub-algorithm 2 Decompose summarizes the details of initializing the communities in complex networks.

Decompose ( $C$ , $G$ )
Input: A graph $G$ with threshold function $W_{o}$ ; $\lambda\in[0,0.6]$ is a non-negative real number.
Output: A sequence of clusters of $G$ .
1: Let $E_{0}=\{e\in E(G):w(e)\geqslant\lambda W_{o}\}$ .
2: for each $e=uv\in E_{0}$ in decreasing order of $w(e)$ do
3:     if either $u$ or $v$ is not in any community then
4:         Create a new empty community $C$ and add $u$ , $v$ to it;
5:         Grow ( $C$ , $G$ );
6:     end if
7: end for
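The edge filtering and seeding steps of Decompose can be sketched as follows. This is a simplified illustration, not the full sub-algorithm: the call to Grow is omitted, edge weights are stored in a dictionary keyed by vertex pairs, and the function names `candidate_edges` and `seed_communities` are hypothetical.

```python
def candidate_edges(weights, lam):
    """Edges with w(e) >= lam * W_o, in decreasing order of weight."""
    w_max = max(weights.values())  # W_o: the maximum similarity
    kept = [(edge, w) for edge, w in weights.items() if w >= lam * w_max]
    return sorted(kept, key=lambda item: item[1], reverse=True)

def seed_communities(weights, lam):
    """Create a two-vertex community for each kept edge with an unassigned endpoint."""
    assigned, seeds = set(), []
    for (u, v), _ in candidate_edges(weights, lam):
        if u not in assigned or v not in assigned:
            seeds.append({u, v})
            assigned.update((u, v))
            # The full sub-algorithm would call Grow on the new community here.
    return seeds

# Hypothetical similarity weights on a path 0-1-2-3-4.
weights = {(0, 1): 2.0, (1, 2): 2.0, (2, 3): 1.0, (3, 4): 0.5}
print(seed_communities(weights, lam=0.4))  # [{0, 1}, {1, 2}, {2, 3}]
```

With $\lambda=0.4$ and $W_{o}=2$ , the edge of weight 0.5 falls below the threshold of 0.8, so vertex 4 is left without a seed community in this sketch.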

3.2.3 Grow

The Grow part sorts the vertices $V_{k}$ that have not been added to the community in descending order of their contribution to the community (Eq. (5)). We select the vertex with the largest contribution and add this vertex to community $C$ if Eq. (6) is satisfied. Eventually, we obtain the initial community. Sub-algorithm 3 Grow summarizes the details of how to add a vertex to the community in complex networks.

Grow ( $C$ , $G$ )
Input: A cluster $C$ of graph $G$ .
Output: A grown cluster $C$ of $G$ .
1: while $V(G)-V(C)\neq\varnothing$ do
2:     Select $v\in V(G)-V(C)$ such that $c(v,C)$ is a maximum;
3:     if $c(v,C)>\alpha_{n}d(C)$ then
4:         add $v$ to $C$ , where $n=|V(C)|$ , $\alpha_{n}=1-\frac{1}{2\gamma(n+t)}$ , with $\gamma\geqslant 1$ , $t\geqslant 1$ as user-specified parameters;
5:     else
6:         return;
7:     end if
8: end while
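The Grow loop can be sketched in a few lines. The sketch assumes that the seed community has at least two vertices, that the edge weight $w$ is the similarity of Eq. (3), that the sum in Eq. (5) covers both terms, and that $\alpha=0.7$ and $\gamma=t=1$ . The function name `grow` and the example graph (a triangle 0, 1, 2 with a pendant vertex 3) are hypothetical.

```python
from itertools import combinations

def grow(community, vertices, sim, adj, alpha=0.7, gamma=1.0, t=1.0):
    """Repeatedly add the outside vertex with the largest contribution while Eq. (6) holds."""
    community = set(community)
    while True:
        outside = [v for v in vertices if v not in community]
        if not outside:
            return community
        n = len(community)
        # Eq. (4): density of the current community under the similarity weights.
        internal = sum(sim[u][v] for u, v in combinations(sorted(community), 2))
        density = 2.0 * internal / (n * (n - 1))

        def contrib(v):
            # Eq. (5): average of alpha * w(uv) + Adj[v][u] over the members u.
            return sum(alpha * sim[u][v] + adj[v][u] for u in community) / n

        best = max(outside, key=contrib)
        threshold = (1.0 - 1.0 / (2.0 * gamma * (n + t))) * density
        if contrib(best) > threshold:
            community.add(best)
        else:
            return community

adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
sim = [
    [0, 2, 2, 1],
    [2, 0, 2, 1],
    [2, 2, 0, 1],
    [1, 1, 1, 0],
]
print(grow({0, 1}, range(4), sim, adj))  # {0, 1, 2}
```

In this example vertex 2 is admitted (contribution 2.4 against a threshold of about 1.67), while the pendant vertex 3 is rejected (contribution about 1.03 against a threshold of 1.75).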

3.2.4 Merge

The Merge part merges the initial communities. During merging, we select two communities $C_{i}$ and $C_{j}$ . If the condition $|C_{i}\cap C_{j}|\geqslant\beta\min(|C_{i}|,|C_{j}|)$ is met, then we combine them into a new community and add it to the community set. Finally, we obtain the community division result, where we regard each formed community as a vertex and update the similarity matrix of the network according to Eq. (7) to start the next round of community division. Sub-algorithm 4 Merge summarizes the details of how to merge communities in complex networks.

Merge ( $C$ , $G$ )
Input: A sequence of clusters of $G$ ; the value of $\beta$ is 0.5.
Output: A new weighted graph $G$ with fewer vertices.
1: for any communities $C_{i}$ and $C_{j}$ in $G$ do
2:     if $|C_{i}\cap C_{j}|\geqslant\beta\min(|C_{i}|,|C_{j}|)$ then
3:         Merge $C_{i}$ and $C_{j}$ into a new community (where $\beta$ is a user-specified parameter);
4:     end if
5: end for
6: Contract each community to a vertex; the weight of an edge is defined by $W(C_{i},C_{j})=\eta\times\frac{\sum_{e\in E_{ij}}W(e)}{|E_{ij}|}+\frac{\sum_{e\in E_{ij}}\textit{Adj}[v_{i}][v_{j}]}{|C_{i}||C_{j}|}$ , where $\eta$ is a parameter ( $0<\eta<1$ ) and the set of crossing edges is $E_{ij}=\{v_{i}v_{j}:v_{i}\in C_{i},v_{j}\in C_{j},v_{i}\neq v_{j}\}$ ;
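The merging rule can be sketched as below. Repeating the pairwise test until no pair satisfies the condition is an assumption about how the loop is applied, and the contraction of communities into vertices is omitted. The function name `merge_communities` is hypothetical.

```python
def merge_communities(communities, beta=0.5):
    """Merge pairs with |Ci & Cj| >= beta * min(|Ci|, |Cj|) until no pair qualifies."""
    merged = [set(c) for c in communities]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                overlap = len(merged[i] & merged[j])
                if overlap >= beta * min(len(merged[i]), len(merged[j])):
                    merged[i] |= merged[j]   # absorb Cj into Ci
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since the community list has changed
    return merged

initial = [{0, 1, 2}, {2, 3, 4}, {1, 2, 5}, {6, 7}]
print(merge_communities(initial))  # [{0, 1, 2, 5}, {2, 3, 4}, {6, 7}]
```

In this example the first and third communities share two of their three vertices and are merged, while vertex 2 remains an overlapping vertex between the two surviving communities that contain it.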

4. Experiments

In this section, we demonstrate the effectiveness of our HTOCD algorithm by comparing its results with those of five other overlapping community detection algorithms on real networks.

4.1 Experimental setting

1) Comparison algorithms: Five representative algorithms were compared with the HTOCD algorithm.

OCDID [11] is an overlapping community detection algorithm based on the dynamic evolution of information, which regards the network as a dynamic system and allows vertices to share information about their neighbors.

MR-MOEA [18] is an overlapping community detection algorithm based on multi-objective evolution. It combines a fast encoding and decoding overlapping community discovery algorithm with a mixed representation of overlapping vertices and non-overlapping vertices.

LFM [15] is a hierarchical and overlapping community detection algorithm based on seed vertex expansion. It selects any seed vertex and expands it based on the principle of maximum fitness.

LMD [7] is a community detection algorithm based on the network structure. Seed vertices are selected based on the principle of the maximum local degree of the central vertex, and then neighbor vertices are added iteratively to expand the community.

NMF [14] is an overlapping community detection algorithm based on matrix factorization, and proposes a conceptual model of a network community structure, which can naturally generate close overlapping communities.

For a fair comparison, in the MR-MOEA the population size PS was set to 100 and the maximum number of generations gene was set to 100. The threshold $\lambda$ for controlling the number of candidate overlapping vertices in the MR-MOEA was set to 0.1. The experimental results for all algorithms were obtained by averaging over 20 independent runs. In the HTOCD algorithm, the parameters $\gamma$ , $t$ , $\beta$ , $\alpha$ , and $\eta$ were set to 1, 1, 0.4, 0.7, and 0.5 respectively. From our experimental results, we selected the partition result with the maximum overlapping community modularity in the hierarchical clustering tree.

2) Real-world networks: We selected real network datasets with different sizes and characteristics to evaluate the performance of the HTOCD algorithm. The real network datasets are shown in Table 1. These networks are Zachary’s karate club [24], dolphin social network [25], American college football [26], books about U.S. politics [26], Yeast-D2 [27], Y2H [28], jazz musicians network [29], scientist collaboration network [30], Protein [31], Yeast PPI dataset [32], Power grid [33] and blogs network [34]. Note that karate, football, polbooks, Yeast-D2, and Y2H are networks with a ground truth community structure.

Table 1
Some properties of the real network data set

Real networks	Vertices	Edges	Ave. Degree
Karate	34	78	4.59
Dolphin	62	159	5.13
Football	115	613	10.66
Polbooks	105	441	8.4
Yeast-D2	1443	6993	9.69
Y2H	1966	2705	2.75
Jazz	198	2742	27.70
Netscience	1589	2742	3.45
Protein	1870	2277	2.44
PPI	2456	6265	5.26
Blogs	3984	6803	3.41
Power grid	4941	6594	2.67
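The average degree column follows from the vertex and edge counts: in an undirected network each edge contributes to the degree of two vertices, so the average degree is $2|E|/|V|$ . A short check, with the hypothetical helper name `average_degree`:

```python
def average_degree(num_vertices, num_edges):
    """Average degree of an undirected network: 2 * |E| / |V|."""
    return 2 * num_edges / num_vertices

print(round(average_degree(34, 78), 2))      # Karate: 4.59
print(round(average_degree(198, 2742), 2))   # Jazz: 27.7
print(round(average_degree(4941, 6594), 2))  # Power grid: 2.67
```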

3) Evaluation metrics: The first evaluation is the overlapping community modularity [35], which can be applied to the scenario in which the community structure is known or unknown. The modularity of overlapping communities is defined as follows:

$\displaystyle Q_{ov}=\frac{1}{2m}\sum_{i=1}^{l}\sum_{v\in C_{i},w\in C_{i}}\frac{1}{O_{v}O_{w}}\left[A_{vw}-\frac{K_{v}K_{w}}{2m}\right]$ (8)

where $A$ is the adjacency matrix, $C_{i}$ is the $i$ -th of the $l$ communities, $m$ is the number of edges in the network, $O_{v}$ is the number of communities to which vertex $v$ belongs, and $K_{v}$ is the degree of vertex $v$ . The higher the modularity of the overlapping communities, the better the community structure.
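Eq. (8) can be evaluated directly from the adjacency matrix. The sketch below is a straightforward implementation for small networks, assuming that every vertex listed in a community is an index into the matrix; the function name `overlapping_modularity` and the example graph (two triangles joined by one edge) are hypothetical.

```python
def overlapping_modularity(adj, communities):
    """Overlapping modularity Q_ov of Eq. (8) for a symmetric 0/1 adjacency matrix."""
    n = len(adj)
    degree = [sum(row) for row in adj]          # K_v
    two_m = sum(degree)                         # 2m
    membership = [0] * n                        # O_v
    for community in communities:
        for v in community:
            membership[v] += 1
    q = 0.0
    for community in communities:
        for v in community:
            for w in community:
                expected = degree[v] * degree[w] / two_m
                q += (adj[v][w] - expected) / (membership[v] * membership[w])
    return q / two_m

# Two triangles, {0, 1, 2} and {3, 4, 5}, joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

q = overlapping_modularity(adj, [{0, 1, 2}, {3, 4, 5}])
print(round(q, 4))  # 5/14, i.e. 0.3571
```

Placing all six vertices in a single community gives a value of 0, which is why the layer with the largest $Q_{ov}$ is preferred over the root of the hierarchical tree.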

Another evaluation index is generalized normalized mutual information gNMI [15], which is only applicable to the scenario in which the community structure is known, and it can be used to evaluate the similarity between the discovered community and the real community. The value range of gNMI is 0 to 1. If the calculated value of gNMI is 1, then the division result of the community is consistent with that of the real community; and if it is 0, then the division result is completely different from that of the real community. gNMI is defined as follows:

$\displaystyle\textit{gNMI}=\frac{-2\sum_{i=1}^{C_{A}}\sum_{j=1}^{C_{B}}C_{ij}\log(C_{ij}N/(C_{i.}C_{.j}))}{\sum_{i=1}^{C_{A}}C_{i.}\log(C_{i.}/N)+\sum_{j=1}^{C_{B}}C_{.j}\log(C_{.j}/N)}$ (9)

where $C_{A}(C_{B})$ is the number of communities in division $A(B)$ ; $C$ is the confusion matrix, whose element $C_{ij}$ is the number of vertices shared by community $i$ in division $A$ and community $j$ in division $B$ ; $C_{i.}(C_{.j})$ is the sum of row $i$ (column $j$ ) of the confusion matrix; and $N$ is the number of vertices. The larger the value of gNMI, the closer the community structure detected by the algorithm is to the ground truth. Based on the above discussion, gNMI was used to evaluate the community detection results on the six real networks with a known community structure, and overlapping community modularity was used to evaluate the community detection results on all 12 real networks.
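The confusion-matrix computation can be sketched as follows. The sketch assumes the standard normalization in which the argument of the logarithm in the numerator is $C_{ij}N/(C_{i.}C_{.j})$ , which is what makes two identical divisions score 1, and it returns 1 by convention when both divisions consist of a single community. The function name `nmi_from_confusion` is hypothetical.

```python
import math

def nmi_from_confusion(confusion):
    """Normalized mutual information computed from a confusion matrix of vertex counts."""
    row = [sum(r) for r in confusion]            # C_i.
    col = [sum(c) for c in zip(*confusion)]      # C_.j
    n = sum(row)                                 # N
    numerator = 0.0
    for i, r in enumerate(confusion):
        for j, c_ij in enumerate(r):
            if c_ij > 0:
                numerator += c_ij * math.log(c_ij * n / (row[i] * col[j]))
    denominator = sum(x * math.log(x / n) for x in row if x > 0)
    denominator += sum(x * math.log(x / n) for x in col if x > 0)
    if denominator == 0:
        return 1.0  # assumption: both divisions are a single community
    return -2.0 * numerator / denominator

print(nmi_from_confusion([[3, 0], [0, 3]]))  # identical divisions, approximately 1.0
print(nmi_from_confusion([[2, 2], [2, 2]]))  # independent divisions, 0
```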

4.2 Parameter analysis

Figure 3.

$Q_{ov}$ changes with the $\lambda$ value in different networks.

To obtain the maximum overlapping modularity, the proposed HTOCD algorithm uses different $\lambda$ values for different networks. The change of the $Q_{ov}$ value with $\lambda$ for the different networks is shown in Fig. 3. The curves in Fig. 3 show a flat trend for the protein, power grid, blogs, and Y2H networks, because in these networks the average degree of the vertices was low and the internal connection density of the network was small. Different $\lambda$ values had little effect on the selection of the initial leaf vertices, which resulted in little difference in the community structure of the optimal layer. In the Yeast-D2 and PPI networks, the vertices were closely connected, which made the network community structure sensitive to changes in $\lambda$ , and the curves had a downward trend. In the polbooks and dolphin networks, because the number of real communities was small, the number of initial leaf vertices decreased as $\lambda$ increased, a large overlapping community modularity was obtained, and the curves had an upward trend.

The values of different network parameters $\lambda$ and the number of overlapping vertices in the final community detection are shown in Table 2. We can see that the HTOCD algorithm did not have the problem of excessive overlapping. Simultaneously, under the condition that the parameters of the HTOCD algorithm were determined, the same community division result was obtained for each run, which also proves that the HTOCD algorithm is stable.

Table 2

Values of different network parameters $\lambda$ and the number of overlapping vertices for community detection

Network	Number of vertices	$\lambda$	Number of overlapping vertices in the optimal layer	The average number of overlapping vertices in the hierarchical tree
Karate	34	0.2	1	3.5
Dolphin	62	0.6	1	6
Football	115	0.2	1	7.7
Polbooks	105	0.2	1	16.7
Yeast-D2	1443	0.1	111	174.3
Y2H	1966	0.1	34	48.7
Jazz	198	0.6	5	23.8
Netscience	1589	0.1	64	75
Protein	1870	0.1	7	19.5
PPI	2445	0.1	157	204.3
Blogs	3984	0.1	34	48.6
Power grid	4941	0.1	66	80.5

4.3 Experimental analysis

In this section, we compare the overlapping community modularity of 12 real networks and the gNMI values of six real networks. In the following sections, we provide a more detailed analysis of the experimental results.

Experimental results in terms of $Q_{ov}$ : Table 3 shows the comparison results of the HTOCD algorithm and the five baseline algorithms for overlapping community modularity. We can see that the HTOCD algorithm was superior to the other algorithms on most datasets, which indicates that the HTOCD algorithm produced a better overlapping community detection result. This is largely because the OCDID algorithm only considers neighbor vertices and does not consider the influence of second-dimensional neighbor vertices on the local community structure, whereas the HTOCD algorithm considers the influence of both neighbor vertices and common neighbor vertices on the local community structure when forming the community leaf vertices and the hierarchical tree. The comparison with the LMD algorithm showed that the overlapping community structure obtained by constructing a non-binary hierarchical tree was more consistent with the real community division. Compared with the LFM algorithm and the NMF algorithm, the community expansion of the proposed HTOCD algorithm, which is based on the edge information of neighbor vertices and common neighbor vertices, obtained a better community structure. In addition, the community detection results of the HTOCD algorithm were consistent across runs, and it was not necessary to input the real information of the network in advance.

Table 3
Comparison results of $Q_{ov}$ on the real-world networks

Network	Metric	OCDID	MR-MOEA	LFM	LMD	NMF	HTOCD
Karate	$Q_{ov}$-max	0.351	0.229	0.258	0.216	0.205	0.365
	$Q_{ov}$-avg	0.351	0.223	0.258	0.204	0.205	0.365
	std	0	0.007	0	0.024	0	0
Dolphin	$Q_{ov}$-max	–	0.271	0.28	0.261	0.200	0.471
	$Q_{ov}$-avg	–	0.264	0.28	0.194	0.200	0.471
	std	–	0.011	0	0.102	0	0
Football	$Q_{ov}$-max	0.572	0.306	0.699	0.284	0.303	0.586
	$Q_{ov}$-avg	0.572	0.303	0.699	0.246	0.303	0.586
	std	0	0.005	0	0.090	0	0
Polbooks	$Q_{ov}$-max	0.436	0.267	0.349	0.236	0.259	0.424
	$Q_{ov}$-avg	0.436	0.265	0.349	0.351	0.259	0.424
	std	0	0	0	0.090	0	0
Yeast-D2	$Q_{ov}$-max	–	0.410	–	0.346	0.309	0.641
	$Q_{ov}$-avg	–	0.405	–	0.275	0.309	0.641
	std	–	0.004	–	0.063	0	0
Y2H	$Q_{ov}$-max	–	0.299	–	0.281	0.228	0.530
	$Q_{ov}$-avg	–	0.286	–	0.281	0.228	0.530
	std	–	0.008	–	0	0	0
Jazz	$Q_{ov}$-max	0.237	0.223	0.152	0.146	0.155	0.400
	$Q_{ov}$-avg	0.237	0.221	0.152	0.143	0.155	0.400
	std	0	0.002	0	0.007	0	0
Netscience	$Q_{ov}$-max	–	0.460	0.460	0.396	0.413	0.665
	$Q_{ov}$-avg	–	0.456	0.460	0.395	0.411	0.665
	std	–	0.001	0	0.001	0.006	0
Protein	$Q_{ov}$-max	0.570	0.357	0.171	0.319	–	0.574
	$Q_{ov}$-avg	0.570	0.357	0.171	0.319	–	0.574
	std	0	0	0	0	–	0
PPI	$Q_{ov}$-max	–	0.313	0.169	0.283	0.237	0.595
	$Q_{ov}$-avg	–	0.311	0.169	0.278	0.230	0.595
	std	–	0.006	0	0.004	0.005	0
Blogs	$Q_{ov}$-max	–	0.394	0.508	0.352	0.342	0.472
	$Q_{ov}$-avg	–	0.389	0.508	0.345	0.338	0.472
	std	–	0.011	0	0.012	0.003	0
Powergrid	$Q_{ov}$-max	0.447	0.337	0.103	0.330	–	0.502
	$Q_{ov}$-avg	0.447	0.337	0.103	0.330	–	0.502
	std	0	0	0	0	–	0
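
To make the $Q_{ov}$ values in Table 3 easier to interpret, the following minimal Python sketch computes one common form of overlapping modularity, in which the contribution of each vertex pair is divided by the number of communities that the two vertices belong to (the extended modularity of Shen et al.). This is an assumption made for illustration only: the exact $Q_{ov}$ definition used in our experiments may differ, and the function name `overlapping_modularity` and the toy graph are hypothetical.

```python
from collections import defaultdict


def overlapping_modularity(edges, communities):
    """Extended (EQ-style) modularity of a cover of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs; communities: iterable of vertex collections.
    NOTE: illustrative assumption, not necessarily the Q_ov used in the paper.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)
    m = sum(len(nbrs) for nbrs in adj.values()) / 2  # number of edges
    if m == 0:
        return 0.0

    # membership[v] = number of communities that contain vertex v
    membership = defaultdict(int)
    for community in communities:
        for v in set(community):
            membership[v] += 1

    total = 0.0
    for community in communities:
        members = list(set(community))
        for i in members:
            for j in members:
                a_ij = 1.0 if j in adj[i] else 0.0
                expected = len(adj[i]) * len(adj[j]) / (2 * m)
                total += (a_ij - expected) / (membership[i] * membership[j])
    return total / (2 * m)


# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
q_disjoint = overlapping_modularity(toy_edges, [{0, 1, 2}, {3, 4, 5}])
q_single = overlapping_modularity(toy_edges, [{0, 1, 2, 3, 4, 5}])
print(round(q_disjoint, 3))  # 0.357
```

For a cover without overlapping vertices, this quantity reduces to the standard modularity, which is why the two-triangle partition scores 5/14 and the trivial single-community cover scores 0.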

Experimental results in terms of gNMI: We used gNMI to evaluate community detection on the six networks with real community structures. Table 4 shows that the HTOCD algorithm obtained higher maximum and average gNMI values than the other overlapping community detection algorithms on most datasets. The MR-MOEA is a multi-objective evolutionary algorithm (MOEA) that optimizes the density within and between communities, so its community division results had high gNMI values. Nevertheless, HTOCD obtained a better gNMI value than the MR-MOEA on four of the six networks, tied with it on Karate, and scored lower only on Dolphin. In the dolphin network, we chose the layer with the maximum $Q_{ov}$, and this layer formed four communities. If the community structure of the next layer had been selected instead, there would have been two communities, with a $Q_{ov}$ of 0.379 and a gNMI of 0.888. This illustrates the significance of detecting hierarchical overlapping communities: community structures at different levels and different granularities are obtained, and the HTOCD result at an appropriate level is more consistent with the actual community structure.
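
The layer choice discussed for the dolphin network can be sketched as a simple selection over the layers of the hierarchical tree. The sketch below is a minimal illustration, assuming that each layer has already been scored; the record layout and the name `select_layer` are hypothetical. The scores of the four-community layer are read from the HTOCD entries for Dolphin in Tables 3 and 4, and those of the two-community layer are the values quoted in the text.

```python
def select_layer(layers, key):
    """Return the layer record with the largest value of `key` (first one wins ties)."""
    if not layers:
        raise ValueError("no layers to choose from")
    return max(layers, key=lambda layer: layer[key])


# Two layers of the dolphin hierarchy as reported in this section.
dolphin_layers = [
    {"communities": 4, "q_ov": 0.471, "gnmi": 0.599},
    {"communities": 2, "q_ov": 0.379, "gnmi": 0.888},
]

by_q_ov = select_layer(dolphin_layers, "q_ov")   # criterion used by HTOCD
by_gnmi = select_layer(dolphin_layers, "gnmi")   # needs the real communities
print(by_q_ov["communities"], by_gnmi["communities"])  # 4 2
```

Selecting by $Q_{ov}$ needs no ground truth, which is why it is the criterion used in practice; the gNMI-based choice is shown only to make the difference between the two layers explicit.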

Table 4
Comparison results of gNMI on the real-world networks

Network	Metric	OCDID	MR-MOEA	LFM	LMD	NMF	HTOCD
Karate	gNMI-max	–	1	0.690	0.513	0.837	1
	gNMI-avg	–	1	0.690	0.447	0.837	1
	std	–	0	0	0.104	0	0
Dolphin	gNMI-max	–	1	0.781	0.611	0.907	0.599
	gNMI-avg	–	1	0.781	0.456	0.907	0.599
	std	–	0	0	0.132	0	0
Football	gNMI-max	–	0.803	0.754	0.783	0.793	0.858
	gNMI-avg	–	0.803	0.754	0.762	0.793	0.858
	std	–	0.005	0	0.090	0	0
Polbooks	gNMI-max	–	0.149	0.394	0.137	0.388	0.485
	gNMI-avg	–	0.139	0.394	0.118	0.388	0.485
	std	–	0.014	0	0.017	0	0
Yeast-D2	gNMI-max	–	0.263	–	0.205	0.228	0.290
	gNMI-avg	–	0.258	–	0.201	0.228	0.290
	std	–	0.004	–	0.063	0	0
Y2H	gNMI-max	–	0.118	–	0.018	0.026	0.156
	gNMI-avg	–	0.115	–	0.018	0.026	0.156
	std	–	0.002	–	0	0	0
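
The max, avg, and std rows in Tables 3 and 4 summarize the scores of each algorithm over its runs. The following minimal sketch shows how such a summary could be computed. The scores are hypothetical, the name `summarize_runs` is illustrative, and the use of the population standard deviation is an assumption, since the estimator is not stated here.

```python
from statistics import mean, pstdev


def summarize_runs(scores):
    """Summarize per-run scores as the max / avg / std triple used in the tables."""
    if not scores:
        raise ValueError("at least one run is required")
    return {"max": max(scores), "avg": mean(scores), "std": pstdev(scores)}


# Hypothetical scores: a deterministic algorithm returns the same value in
# every run, so its std is 0; a stochastic algorithm varies between runs.
deterministic = summarize_runs([0.5, 0.5, 0.5])
stochastic = summarize_runs([0.30, 0.32, 0.34])
print({name: round(value, 3) for name, value in stochastic.items()})
```

A std of 0, as reported for HTOCD on every network, therefore means that the max and avg rows coincide.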

Figure 4a shows the overlapping community detection results of the HTOCD algorithm in the karate network. Vertex 9 is identified as the overlapping vertex, which connects the two communities for information exchange. Figure 4b shows the non-binary hierarchical tree constructed by the HTOCD algorithm in the karate network.

Figure 4.
Overlapping communities (a) and non-binary hierarchical tree (b) detected by the HTOCD algorithm in the karate network.

The five comparison algorithms were chosen for the following reasons. The OCDID algorithm was used to verify the effectiveness of our algorithm. The LFM and NMF algorithms also allow a vertex to belong to multiple communities to obtain an overlapping community structure, so they were used to demonstrate the effectiveness of the similarity matrix and of community expansion based on that matrix. The LMD algorithm merges communities after obtaining an initial community structure and builds a binary tree, so the comparison with it shows that the HTOCD algorithm obtains a good community structure by constructing a non-binary tree. The MR-MOEA obtains a set of Pareto solutions that yield community structures at different levels, and it achieves large NMI values, so we mainly used it to assess the performance of the HTOCD algorithm against real community structures. From the analysis of the above experimental results, we conclude that the proposed HTOCD algorithm performed better than, or was competitive with, similar algorithms.

5. Conclusions

In this paper, we proposed the HTOCD algorithm based on multi-dimensional similarity for overlapping community detection. The multi-dimensional similarity between vertices in a network is used to construct a similarity matrix instead of an adjacency matrix, which not only addresses the problem of network sparsity but also takes better account of the local structural information of the network. A non-binary hierarchical clustering tree replaces the binary hierarchical clustering tree to form communities of different sizes in different layers. At the same time, the communities of overlapping vertices can be read from the hierarchical tree, which gives a deeper understanding of the network structure. Additionally, experiments on 12 real networks show that our algorithm is stable: it yields the same community division whenever the parameters are unchanged.

We have demonstrated the validity of the HTOCD algorithm in community detection and overlapping vertex detection. Our future work will mainly consider how to determine the most influential vertices in a community and how to determine effective overlapping vertices. This is important for practical applications of overlapping communities, such as rumor suppression and suppressing the spread of worms in a network.

Acknowledgments

This work was supported by the National Natural Science Foundation of China [Grant numbers 61602003, 61673020, 61876001]; the Provincial Natural Science Foundation of Anhui Province [Grant number 1708085QF156]; and the National Key Research and Development Program of China [Grant number 2017YFB1401903].
