Entropy-based link selection strategy for multidimensional complex networks

Abstract

Constructing a multidimensional network is an important problem in complex network research and is becoming a development trend in fields such as biological gene networks and social networks. A multidimensional network comprises connections and attributes. Community detection across heterogeneous datasets in different dimensions is more difficult than in a single network. Traditional methods for dealing with multidimensional networks are ineffective because they rely on supervised information or apply strategies designed for adjusting the graph structure of a single network. In this paper, we propose a semi-supervised community detection method for multidimensional heterogeneous networks. First, we generate a single network by integrating the multidimensional heterogeneous networks. A robust semi-supervised link adjustment strategy is then iteratively applied to the single network, making full use of dynamic supervised information to add or remove links based on node entropy. Experiments were conducted on five real multidimensional social datasets. The results show that the proposed method can effectively integrate heterogeneous data. The average accuracy and normalized mutual information were 90.50% and 93.99%, respectively, representing improvements of 28.97% and 35.06%, respectively, over existing methods.

Keywords

Data mining, community detection, multidimensional networks, Non-negative Matrix Factorization (NMF)

1. Introduction

Complex networks can be found in a range of fields including molecular physics, biomedical sciences, engineering, and human sciences, and they have recently been extended to community-based research. The term “community” refers to a group of nodes with compact connections inside the group and relatively weak connections to other groups. Such structures are found in various natural settings, such as tissues and organs, the formation of air molecules into a cohesive air mass, and the interest groups of a networked society. Identifying such communities or organizations can help analyze their complex networks, clarify their evolution and constitution, and understand their dynamics. A complex network can be described by a graph with the following features: the network nodes correspond to points in the graph, the connections between nodes are expressed by edges, and the whole graph can be represented by an adjacency matrix. Nodes also have features that can be expressed as an attribute matrix.

The basic concepts are first defined. The single-layer network is denoted by $G(V,E)$, where $V$ is the set of nodes $\{v_{1},v_{2},\ldots,v_{n}\}$, $E$ is the set of edges $\{e_{1},e_{2},\ldots,e_{m}\}$, and all nodes are divided into $K$ communities $\{c_{1},c_{2},\ldots,c_{K}\}$. The adjacency matrix $A\in\mathbb{R}^{n\times n}$ represents the relationships between nodes. If there is an edge between $v_{i}$ and $v_{j}$, $A(i,j)=1$; otherwise $A(i,j)=0$. The attribute matrix $T\in\mathbb{R}^{n\times l}$ describes $n$ nodes with $l$ attributes. Each row holds the weights of one node on the different attributes.

Definition 1: The multidimensional adjacency matrix $A=\{A^{[1]},A^{[2]},\ldots,A^{[M]}\}$ has $\beta=1,2,\ldots,M$ layer networks. $A=\{a_{ij}\}$ is a complex network [1, 22].

$a_{ij}=\begin{cases}1&\text{if }\exists\beta:a^{\beta}_{ij}=1\\ 0&\text{otherwise}\end{cases}$
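Definition 1 can be illustrated with a minimal sketch in plain Python. The function name `aggregate_layers` and the two toy layers are assumptions made for illustration only; they do not come from the original work.

```python
def aggregate_layers(layers):
    """Aggregate M adjacency layers: a_ij = 1 if some layer has an edge (i, j), else 0."""
    n = len(layers[0])
    return [
        [1 if any(layer[i][j] == 1 for layer in layers) else 0 for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical layers on three nodes.
A1 = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
A2 = [[0, 0, 0], [0, 0, 1], [0, 1, 0]]
A = aggregate_layers([A1, A2])
print(A)  # [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
```

The aggregated matrix records only whether a link exists in at least one layer; it does not record in how many layers the link appears.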

Most community studies focus on a single-view network, in which a single network is used to represent the community structure [3, 4, 5]. These algorithms have often yielded successful results. However, in recent years it has been observed that a single-view network is unable to reflect the real physical connections between nodes. Because of the diversity of relationships between different nodes and attributes, a multiview approach must be taken to handle heterogeneous data, for example when analyzing the social relations and transactions between friends, colleagues, or e-pals. Friends may share the same hobbies and express the same views. At the same time, several different relationships may exist between two people, and a single network cannot fully capture the connections between them. Analyzing a complex network topology as a single relationship may wrongly segment the community, and the integration of more heterogeneous information is required to accurately characterize the community structure. The use of multidimensional heterogeneous data can overcome the lack of information, noise, or invalid connections in a single network and allow the community structure to be accurately obtained [6, 7]. When using multidimensional networks, it is necessary to ensure that the nodes inside a community are similar and the nodes of different communities are dissimilar. In this study, we applied semi-supervised information to community detection. A correct community structure can be used to guide data mining and the accurate positioning of user requirements [8].

Traditional community detection methods use single networks and are based on the clustering of the adjacency and attribute matrices. These methods have difficulty in combining different data sources and cannot be extended to multidimensional networks. In recent years, researchers have proposed multisource data fusion algorithms, which fuse relational and attribute data by converting the two information sources into the same form and then using clustering to derive the community structure. However, it is difficult to fuse information of different dimensionalities with such algorithms, and combining different types of data remains a challenge in multidimensional community detection. Matrix factorization has been used successfully in speech signal processing [9], document clustering [10], and image recognition [11], as well as for community detection in single-dimensional networks [12]. For community detection in multidimensional networks, most algorithms transform the attributes into a relational matrix, and the community structure is then obtained through matrix factorization. However, the results have been disappointing. Matrix factorization methods are better suited to a single data source, and multidimensional matrix factorization usually requires many parameters or control constraints, which degrades the results. It is also sensitive to initial conditions, so the choice of initial values affects the optimization. No current algorithm can effectively acquire the community structure of a multidimensional network.

In this analysis, multidimensional information was used to guide community detection and to fuse different networks in order to study the precise community structure. On this basis, we proposed a method for multidimensional community detection using heterogeneous data. First, a unified graph was used to represent the original multidimensional network, and a robust semi-supervised link adjustment strategy was then iteratively applied, adding and deleting links according to the dynamic entropy of the nodes. The method was shown to effectively utilize the relationships between different information sources, and to integrate the overall relationship and attribute information to arrive at the optimal community structure. This provides an effective framework based on the multidimensional networks of attribute and relational data.

2. Previous research

Traditional approaches to community detection generally only consider a single adjacency or attribute matrix. However, a single-layer network is unable to fully reflect the features of nodes, and can produce only a partial outcome. A multidimensional network contains more than one dimension, which means that multiple adjacency matrices and attribute information must be considered. Community detection using multidimensional information can be divided into two categories: attribute characteristics-based and adjacency relationship-based. We analyze the limitations of the existing algorithms in Section 2.3.

2.1 Attribute relation-based community detection

In this method of network representation, the adjacency matrix reflects the network structure, and the node attributes refer to real physical properties [13, 14]. There are two different aspects of supervision information. The general approach is to measure the similarities between properties, converting the attribute matrix into a similarity matrix that has the form of an adjacency matrix. For example, Tang et al. [15] and Bassett et al. [16] used similarity to transform the attributes into an adjacency matrix. They then combined this with the existing network structural information to integrate the heterogeneous information by fusing the decomposition matrix. However, such methods combine only a single attribute source and a single relational source. Ruan et al. [17] used an adjacency relationship and attribute similarity to calculate the strength of the links between two nodes. The link strength was maximized to allow community detection using standard clustering algorithms. Yang et al. [18] used parameter estimation to establish the probability distribution of the adjacency and attribute matrices and to estimate the parameters of the model. Document clustering methods can also be used to find communities. Pei et al. [19] used triple matrix factorization to integrate the adjacency matrix and attribute matrix. The calculation of attribute similarity could be improved by constraining the decomposition.

2.2 Adjacency relation-based community detection

In real networks, the attributes of nodes are difficult to acquire. In many cases, the links are multidimensional, and each type of link reflects a different relationship. Different links refer to different physical entities, and the multidimensional adjacency matrix represents the complexity of the physical world. Tang et al. [20] extracted the attributes of each layer of the adjacency matrix, used the eigenvectors of the largest eigenvalues to compose a global attribute matrix, and then used k-means on the extracted features to derive the community structure. Triple non-negative matrix factorization performs well when applied to adjacency matrices, and several studies have shown that it is suitable for community detection. Pei et al. [19] combined triple non-negative matrix factorization with regularization terms, and applied this not only to the attribute matrix and adjacency matrix, but also directly to the multidimensional adjacency matrix. Liu et al. [21] proposed an unsupervised method named multiview clustering, which extracts the latent structure from each view. They assumed that a node belongs to the same community in every view, which allows the different views to be clustered jointly.

2.3 Existing algorithms

As noted in Sections 2.1 and 2.2, the problems with the existing algorithms are as follows: 1) For multidimensional information, approaches that borrow text clustering methods, calculate similarity matrices, transform the attribute matrix into an equivalent adjacency matrix, or analyze multidimensional relation matrices are still ineffective in detecting communities. 2) Community detection methods based on multidimensional adjacency matrices can only analyze each graph separately to identify a community, and cannot analyze the full multidimensional structure. This fractures the relationships in the network. 3) Existing multidimensional community detection methods are based on unsupervised clustering and the results are not ideal. In addition, no supervision information is available for evaluation.

3. Community detection using a semi-supervised robust link selection strategy

The information sources of a multidimensional network take the following two forms: relational data (linkage relationships) and attribute data (characteristic relationships). The two forms address community detection from different viewpoints. Single-link relational data may have missing information or contain noise. This makes community detection difficult, and the correctness of the identified communities cannot be guaranteed. We therefore combine several link relations and attributes to create a unified graph. A unified graph can help overcome the low accuracy of matrix factorization on multidimensional information, and supervision information can be used to improve the classification accuracy. The research problem is defined and analyzed in Section 3.1, and a unified graph model is built by fusing multidimensional networks in Section 3.2. Finally, a semi-supervised robust link selection (RLS) strategy, which uses supervision information to adjust the network structure, is introduced in Section 3.3.

3.1 Related definitions and analysis

Definition 2: In the community detection matrix $H\in\mathbb{R}^{n\times K}$, the $i^{\rm th}$ row of $H$ represents the membership degrees of node $v_{i}$ in the $K$ communities. $H$ is a matrix of the community characteristics.

Definition 3: The attribute similarity matrix $S\in\mathbb{R}^{n\times n}$ is calculated according to the properties of matrix $T$ through similarity measures. The elements in $S$ are the attribute similarities, which represent the relationships between the attributes.

Definition 4: Non-Negative Matrix Factorization (NMF) [22] is used for a single-dimensional relational matrix. NMF factorizes $A\in\mathbb{R}^{n\times n}$ into the product of two non-negative matrices $W\in\mathbb{R}^{n\times K}$ and $H\in\mathbb{R}^{n\times K}$, so that $A\approx W\times H^{T}$. NMF is similar to clustering, in that the factorization is used to derive the communities of $A$. The objective function of NMF and the iterative update rules are as follows:

$\displaystyle\min\limits_{W,H}{\lVert A-WH^{T}\rVert}^{2}_{F}\quad\text{s.t.}\quad W\geqslant 0,\ H\geqslant 0,$ (1)

$\displaystyle W_{i,k}\leftarrow W_{i,k}\frac{(AH)_{i,k}}{(WH^{T}H)_{i,k}},\quad H_{j,k}\leftarrow H_{j,k}\frac{(A^{T}W)_{j,k}}{(HW^{T}W)_{j,k}}.$ (2)

${\lVert\,\cdot\,\rVert}_{F}$ is the Frobenius norm.
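The multiplicative updates of Eq. (2) can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the experiments: the random initialization, the fixed iteration count, the small constant `eps` added to the denominators to avoid division by zero, and the toy six-node graph are all assumptions.

```python
import random


def matmul(X, Y):
    """Plain-Python product of X (p x q) and Y (q x r)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def frob_error(A, W, H):
    """Squared Frobenius norm of A - W H^T, the objective of Eq. (1)."""
    R = matmul(W, transpose(H))
    return sum((a - r) ** 2 for ra, rr in zip(A, R) for a, r in zip(ra, rr))


def nmf(A, K, iters=200, seed=0, eps=1e-9):
    """Alternate the two multiplicative updates of Eq. (2) for a fixed number of iterations."""
    rng = random.Random(seed)
    n = len(A)
    W = [[rng.random() + 0.1 for _ in range(K)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(K)] for _ in range(n)]
    At = transpose(A)
    for _ in range(iters):
        # W <- W * (A H) / (W H^T H)
        num, den = matmul(A, H), matmul(W, matmul(transpose(H), H))
        W = [[W[i][k] * num[i][k] / (den[i][k] + eps) for k in range(K)] for i in range(n)]
        # H <- H * (A^T W) / (H W^T W)
        num, den = matmul(At, W), matmul(H, matmul(transpose(W), W))
        H = [[H[j][k] * num[j][k] / (den[j][k] + eps) for k in range(K)] for j in range(n)]
    return W, H


# Hypothetical graph: two triangles (nodes 0-2 and 3-5) joined by the edge 2-3.
A = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
W0, H0 = nmf(A, K=2, iters=0)  # the random starting point
W, H = nmf(A, K=2, iters=200)  # same seed, after 200 update rounds
print(frob_error(A, W0, H0), frob_error(A, W, H))
```

Because every factor in the updates is non-negative, $W$ and $H$ stay non-negative without an explicit projection step. A node can then be assigned to the community with the largest entry in its row of $H$.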

Definition 5: Entropy

$\displaystyle E(h_{i})=-\sum\limits_{k=1}^{K}P(h_{i}=k)\log P(h_{i}=k)=-\sum\limits_{k=1}^{K}h_{ik}\log(h_{ik})$ (3)

where $k\in\{1,\ldots,K\}$, $h_{ik}\in H$, and $H$ is the community indicator matrix.
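As a small worked example of Eq. (3), the sketch below computes the entropy of one row of $H$. It assumes that the row is first normalized to sum to one so that its entries can be read as probabilities, that the row has a positive sum, and that terms with $h_{ik}=0$ contribute zero; these conventions are assumptions, since the text does not state them.

```python
import math


def node_entropy(h_row):
    """Entropy of one node's community memberships, following Eq. (3)."""
    total = sum(h_row)  # assumed to be positive
    probs = [h / total for h in h_row]
    # Terms with p = 0 are skipped, using the convention 0 * log 0 = 0.
    return -sum(p * math.log(p) for p in probs if p > 0)


print(node_entropy([1.0, 0.0, 0.0]))  # a node that is certain of its community
print(node_entropy([1.0, 1.0, 1.0, 1.0]))  # a node split evenly over four communities
```

A node that belongs entirely to one community has entropy 0, while a node spread evenly over $K$ communities reaches the maximum $\log K$.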

Definition 6: Multidimensional network community detection is the problem of accurately identifying the community structure $H$, given the multiple network relationships $A$ and the attribute similarity matrix $S$, by analyzing the multidimensional information.

3.2 Constructing a unified graph model by fusing the multidimensional network

The relationships between users in a group comprise not only connections, but also properties or characteristics. A relationship between users is observed simultaneously in several dimensions of the network. For example, in Twitter, messages sent by the user constitute one dimension, forwarded messages constitute another dimension, and the user’s rankings constitute a third dimension. Making use of this multidimensional information is challenging. In 2013, Greene and Cunningham [23] proposed building a unified graph from a multidimensional network, a method that can deal with both relational data and attribute data. The steps are as follows:

1. There are $l$ dimensions of data for $\textit{users}=\{u_{1},u_{2},\ldots,u_{n}\}$. Each dimension contains some or all of the $n$ users. For each dimension $j$ $(1\leqslant j\leqslant l)$, we measure the similarity between $u_{i}$ and the other users, which gives the similarity vector $v_{ij}$.
2. The ranking vector $r_{ij}$, which ranks the $n-1$ other users, is generated from $v_{ij}$. If not all users are present in the $j^{\rm th}$ dimension, each absent node is given the rank $n^{\prime}_{j}+1$, where $n^{\prime}_{j}$ is the number of users in this dimension.
3. The $l$ ranking vectors are arranged as an $(n-1)\times l$ matrix $R_{i}$, and each column is normalized.
4. The SVD of $R_{i}^{T}$ is calculated, and the first left singular vector is extracted and sorted in descending order. The first $k$ elements are then selected as the neighbors of $u_{i}$.
5. After the $k$ nearest neighbors of all $n$ users have been determined, the information is used to construct a unified graph. The edge set is initially empty; if $u_{j}$ is a neighbor of $u_{i}$, an edge is added between $u_{i}$ and $u_{j}$. The unified graph is asymmetric. The number of neighbors $k$ is the only parameter: the smaller $k$ is, the sparser the graph. In the experiments, $k$ was set to the mean of the average degrees of all $l$ dimensions. Suppose that $d=\{d^{[1]},d^{[2]},\ldots,d^{[l]}\}$ are the average degrees of the $l$ dimensions; then $k=\lceil\frac{1}{l}\sum\limits_{q=1}^{l}d^{[q]}\rceil$.
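Step 5 can be sketched as follows. The code assumes that steps 1 to 4 have already produced, for each user, a list of the other users ordered from most to least similar; the function names, the average degrees, and the ranked lists are hypothetical values chosen for illustration.

```python
import math


def neighbor_count(avg_degrees):
    """k = ceiling of the mean of the per-dimension average degrees."""
    return math.ceil(sum(avg_degrees) / len(avg_degrees))


def build_unified_graph(ranked_neighbors, k):
    """Add a directed edge from each user to each of its k top-ranked neighbors."""
    n = len(ranked_neighbors)
    G = [[0] * n for _ in range(n)]  # the edge set starts empty
    for i, ranked in enumerate(ranked_neighbors):
        for j in ranked[:k]:
            G[i][j] = 1
    return G


# Hypothetical average degrees of l = 3 dimensions.
k = neighbor_count([2.4, 1.0, 1.2])

# Hypothetical aggregated rankings for five users (output of steps 1 to 4).
ranked = [
    [1, 2, 3, 4],
    [0, 2, 3, 4],
    [3, 1, 0, 4],
    [2, 4, 0, 1],
    [3, 2, 1, 0],
]
G = build_unified_graph(ranked, k)
print(k, G)
```

In this toy input, user 4 selects user 2 as a neighbor but user 2 does not select user 4, which shows why the resulting graph is asymmetric.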

3.3 Semi-supervised robust link selection strategy (RLS strategy)

Figure 1 shows a flow chart of Algorithm 1, robust link selection with NMF (RLS+UniNMF). The concrete steps are explained in the remainder of this section. Because the unified graph of the multidimensional network is constructed through fusion, it contains false or redundant links and boundary nodes between communities, which makes it difficult to derive an accurate community structure by applying NMF directly. In a real network, not all nodes are labeled, but even a small amount of supervision information can be used to adjust the links and greatly improve the accuracy of community identification. To make better use of supervision information, we introduced a semi-supervised robust link selection strategy (RLS strategy), given in Algorithm 2.

Algorithm 1: Robust link selection with NMF (RLS+UniNMF)

Input: $G$
while $W$ has not converged do
    $[W,H]=\textit{NMF}(G,W,H)$
    $G=\textit{RLS Strategy}(G,H)$
end while

Algorithm 2: Robust Link Selection Strategy (RLS Strategy)

Input: $G$, $H$
Output: $G$
Calculate the hub set of each community
Calculate the boundary set of each community
for each $c\in C$ do
    Estimate whether the hub set of community $c$ is stable
    Find a link $v$ in the stable hub set of community $c$, $v=\langle h_{c},b_{c}\rangle$, $h_{c}\in\textit{Hub}_{c}$, $b_{c}\in\textit{Boundary}_{c}$
    if $\textit{label}(h_{c})==\textit{label}(b_{c})$ then
        $G(h_{c},b_{c})=G(b_{c},h_{c})=1$
    else
        $G(h_{c},b_{c})=G(b_{c},h_{c})=0$
    end if
end for
Select two communities $a$ and $b$ whose hubs are stable, $a\neq b$
$\textit{EHub}_{ab}$ is the set of links between $\textit{Hub}_{a}$ and $\textit{Hub}_{b}$
for each $e\in\textit{EHub}_{ab}$, $e=\langle h_{a},h_{b}\rangle$ do
    if $\textit{label}(h_{a})\neq\textit{label}(h_{b})$ then
        $G(h_{a},h_{b})=G(h_{b},h_{a})=0$
    end if
end for
$\textit{EBoundary}_{ab}$ is the set of links between $\textit{Boundary}_{a}$ and $\textit{Boundary}_{b}$
for each $e\in\textit{EBoundary}_{ab}$, $e=\langle b_{a},b_{b}\rangle$ do
    if $\textit{label}(b_{a})==\textit{label}(b_{b})$ then
        $G(b_{a},b_{b})=G(b_{b},b_{a})=1$
    else
        $G(b_{a},b_{b})=G(b_{b},b_{a})=0$
    end if
end for

Figure 1.

Flow chart of semi-supervised robust link selection strategy for multidimensional networks community detection.

First, information entropy was used to measure how certain it is that a node belongs to its assigned community. By definition, the smaller the entropy, the larger the certainty. The community possibility of a node is defined as the degree to which the node belongs to a community. When the possibilities for all communities are equal, the entropy of the node is maximal. Link entropy is defined as the mean of the entropies of the two endpoints. The nodes of community $i$ are ranked by entropy; a subset of the lowest-entropy nodes is chosen as the $\textit{Hub}_{i}$ set, while a subset of the highest-entropy nodes is chosen as the $\textit{Boundary}_{i}$ set. RLS has two phases for adjusting the adjacency matrix: intra-community and inter-community. The goal is to strengthen the connections inside each community and loosen the connections between communities.
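The selection of hub and boundary sets can be sketched as below. The text does not say how many nodes go into each set, so the `fraction` parameter is a hypothetical choice; assigning each node to the community with its largest membership value, and normalizing each row of $H$ before computing the entropy, are likewise assumptions.

```python
import math


def entropy(row):
    """Entropy of a membership row after normalizing it to sum to one."""
    total = sum(row)
    return -sum((h / total) * math.log(h / total) for h in row if h > 0)


def link_entropy(H, u, v):
    """Link entropy: the mean of the two endpoint entropies."""
    return (entropy(H[u]) + entropy(H[v])) / 2


def hub_and_boundary_sets(H, fraction=0.25):
    """Per community, return the lowest-entropy nodes (hub) and highest-entropy nodes (boundary)."""
    communities = {}
    for node, row in enumerate(H):
        c = max(range(len(row)), key=row.__getitem__)  # community with the largest membership
        communities.setdefault(c, []).append(node)
    hubs, boundaries = {}, {}
    for c, members in communities.items():
        ranked = sorted(members, key=lambda v: entropy(H[v]))
        size = max(1, int(len(ranked) * fraction))  # hypothetical set size
        hubs[c] = ranked[:size]
        boundaries[c] = ranked[-size:]
    return hubs, boundaries


# Hypothetical membership matrix for six nodes and two communities.
H = [[0.9, 0.1], [0.6, 0.4], [0.99, 0.01], [0.2, 0.8], [0.45, 0.55], [0.05, 0.95]]
hubs, boundaries = hub_and_boundary_sets(H)
print(hubs, boundaries)
```

In this toy matrix, nodes 2 and 5 are the most certain members of their communities and become hubs, while nodes 1 and 4 are the least certain and become boundary nodes.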

The community labels and the application area of the strategy are determined by the community centers. If only the single node with minimal entropy is chosen as the community center, a different center may be identified in each iteration. The links between the centers affect the stability of the centers, and thus affect community detection. For robustness and stability, the link selection strategy therefore considers a set of minimum-entropy nodes rather than a single one. The specific process is as follows:

Determine the stability of the community center

For community $i$, we first chose a set of entropy-minimum nodes as $\textit{Hub}_{i}$, then used supervised information to determine the stability of the community center. If at least half of the nodes in $\textit{Hub}_{i}$ belonged to community $i$, the center was considered stable and $i$ was indicated by the majority label of $\textit{Hub}_{i}$; otherwise the community was considered unstable, and the nodes in $i$ were discarded because of the inconsistent division of $\textit{Hub}_{i}$.
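A minimal sketch of the stability test is given below. It assumes that the supervised information is a dictionary from node to known label, and that hub nodes without a known label simply do not vote; neither detail is specified in the text, and the function name is hypothetical.

```python
from collections import Counter


def hub_is_stable(hub_nodes, labels):
    """Return (stable, majority_label) for one hub set.

    The hub is treated as stable when at least half of its nodes carry the same
    known label; `labels` holds only the supervised (labeled) nodes.
    """
    known = [labels[v] for v in hub_nodes if v in labels]
    if not known:
        return False, None
    label, count = Counter(known).most_common(1)[0]
    if 2 * count >= len(hub_nodes):
        return True, label
    return False, None


print(hub_is_stable([0, 1, 2, 3], {0: "a", 1: "a", 2: "b"}))  # (True, 'a')
print(hub_is_stable([0, 1, 2, 3], {0: "a", 1: "b"}))  # (False, None)
```

Counting against the full hub size, rather than only the labeled hub nodes, is the stricter of the two possible readings and is used here as an assumption.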

Adjust the intra-community links

If the center of community $i$ was stable, we adjusted the intra-community links. We selected $C_{in}$ links that connected $\textit{Hub}_{i}$ and the $\textit{Boundary}_{i}$ set.

We then used supervised information to determine whether the endpoints of each of the $C_{in}$ links belonged to the same community. If the endpoints of a link belonged to the same community, the link was kept; otherwise it was deleted [24]. $\textit{Boundary}_{i}$ comprises the most uncertain nodes, and selecting links that contain these nodes can maximize the use of supervised information.
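The intra-community phase can be sketched as follows, under the assumption that only existing hub-boundary links whose two endpoints both have known labels are examined. The adjacency matrix is modified in place; the function name and the toy data are hypothetical.

```python
def adjust_intra_links(G, hub, boundary, labels):
    """Keep hub-boundary links whose endpoints share a known label; delete the others."""
    for h in hub:
        for b in boundary:
            if h == b or not (G[h][b] or G[b][h]):
                continue  # only existing links are examined
            if h not in labels or b not in labels:
                continue  # no supervised information for this link
            value = 1 if labels[h] == labels[b] else 0
            G[h][b] = G[b][h] = value
    return G


# Hypothetical community: hub node 0, boundary nodes 2 and 3, both linked to the hub.
G = [
    [0, 1, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
labels = {0: "a", 2: "a", 3: "b"}
adjust_intra_links(G, hub=[0], boundary=[2, 3], labels=labels)
print(G)
```

The link 0-2 is kept because both endpoints are labeled "a", while the link 0-3 is removed because the labels differ.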

Adjust the inter-community links

Communities $a$ and $b$ are center-stable. The links connecting $a$ and $b$ are selected in two ways: one connects $\textit{Hub}_{a}$ and $\textit{Hub}_{b}$, while the other connects $\textit{Boundary}_{a}$ and $\textit{Boundary}_{b}$.

– Links connecting $\textit{Hub}_{a}$ and $\textit{Hub}_{b}$ were selected. If the endpoints of a link belonged to different communities, it was deleted. If the endpoints belonged to the same community, community detection relied on NMF rather than on supervised information, and the link was retained in case the results turned out to be incorrect.

– Links connecting $\textit{Boundary}_{a}$ and $\textit{Boundary}_{b}$ were selected. If the endpoints of a link belonged to the same community, the link was kept; otherwise, it was deleted.
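The two inter-community rules can be sketched in the same style. As before, the sketch assumes that only existing links with two labeled endpoints are examined and that the adjacency matrix is updated in place; the function name and the toy data are hypothetical.

```python
def adjust_inter_links(G, hub_a, hub_b, boundary_a, boundary_b, labels):
    """Apply the hub-hub and boundary-boundary rules to two center-stable communities."""
    # Hub-hub links: delete when the labels differ, otherwise leave the link untouched.
    for ha in hub_a:
        for hb in hub_b:
            if (G[ha][hb] or G[hb][ha]) and ha in labels and hb in labels:
                if labels[ha] != labels[hb]:
                    G[ha][hb] = G[hb][ha] = 0
    # Boundary-boundary links: keep when the labels agree, delete otherwise.
    for ba in boundary_a:
        for bb in boundary_b:
            if (G[ba][bb] or G[bb][ba]) and ba in labels and bb in labels:
                value = 1 if labels[ba] == labels[bb] else 0
                G[ba][bb] = G[bb][ba] = value
    return G


# Hypothetical setting: node 0 is the hub and node 1 the boundary of community a;
# node 2 is the hub and node 3 the boundary of community b.
G = [
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
labels = {0: "a", 1: "a", 2: "b", 3: "a"}
adjust_inter_links(G, [0], [2], [1], [3], labels)
print(G)
```

Here the hub-hub link 0-2 is deleted because the labels differ, and the boundary-boundary link 1-3 is kept because both endpoints are labeled "a".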

4. Experiments

4.1 Datasets

To test the effectiveness of the proposed algorithm (RLS+UniNMF), we conducted experiments using the following five Twitter datasets [23]:

• Football: A dataset of 248 English Premier League football players and clubs active on Twitter.
• Olympics: A dataset of 464 athletes and organizations that were involved in the London 2012 Summer Olympics.
• Politics-ie: A dataset of Irish politicians and political organizations, assigned to seven disjoint ground truth groupings, according to their affiliation.
• Politics-uk: A dataset of 419 Members of Parliament (MPs) in the United Kingdom. The ground truth consisted of five groupings, corresponding to political parties.
• Rugby: A dataset of 854 international Rugby Union players, clubs, and organizations currently active on Twitter.

The five datasets are summarized in Table 1, which lists the total number of users, follows, mentions, retweets, and user lists, and the number of associated ground truth communities.

Table 1
Details of the datasets

Datasets	Users	Follows	Mentions	Retweets	Lists	Communities
Football	248	3819	3312	1350	7814	20
Olympics	464	10642	9615	3740	4942	28
Politics-ie	348	16856	6318	3019	1047	7
Politics-uk	419	27340	14788	7270	3614	5
Rugby	854	35757	33832	12472	5900	15

4.2 Comparison methods

4.2.1 Principal Modularity Maximization (PMM)

We calculated the modularity matrices $B=\{B^{[1]},B^{[2]},\ldots,B^{[M]}\}$ from the adjacency matrices $A=\{A^{[1]},A^{[2]},\ldots,A^{[M]}\}$. For each $B^{[i]}$, the eigenvectors corresponding to the $m$ largest eigenvalues were extracted as $S^{[i]}$, giving $S=\{S^{[1]},S^{[2]},\ldots,S^{[M]}\}$. Applying singular value decomposition to $S$ gave the reduced feature matrix $\widetilde{U}$, and the community structure was obtained after clustering.

4.2.2 Multi-View NMF (MultiNMF)

A network comprising $n$ nodes with $n_{v}$ dimensions was denoted by $X=\{X^{(1)},X^{(2)},\ldots,X^{(n_{v})}\}$, and each view was factorized as $X^{(v)}\approx U^{(v)}(V^{(v)})^{T}$. The objective function of MultiNMF was as follows:

$\displaystyle\sum\limits_{v=1}^{n_{v}}{\lVert X^{(v)}-U^{(v)}(V^{(v)})^{T}\rVert}^{2}_{F}+\sum\limits_{v=1}^{n_{v}}\lambda_{v}{\lVert V^{(v)}-V^{*}\rVert}^{2}_{F},$

$\displaystyle\text{s.t.}\quad\forall 1\leqslant k\leqslant K,\ {\lVert U_{\cdot,k}^{(v)}\rVert}_{1}=1,\quad U^{(v)},V^{(v)},V^{*}\geqslant 0.$ (4)

The sum of each column in $U^{(v)}$ was constrained by $Q^{(v)}$, and the multidimensional data $U^{(v)}$ was reduced to $V^{*}$. In the experiments, $\lambda_{v}$ was set to 0.01.

4.2.3 UniNMF

First, we produced a unified network $G^{\prime}$ for the multidimensional network $A=\{A^{[1]},A^{[2]},\ldots,A^{[M]}\}$ using the methods outlined in Section 3.2. The community structure was then obtained directly by NMF.

4.3 Evaluation measures

Because the datasets had classification labels and each node belonged to a single community, we were able to evaluate the algorithms using those labels. We used Accuracy (AC) and Normalized Mutual Information (NMI) [25] to confirm the effectiveness of the algorithms. For a network with $n$ nodes, let $\theta_{i}$ be the ground-truth label of node $i$ and $l_{i}$ the label obtained by the algorithm. AC is defined as:

$\displaystyle AC=\frac{\sum_{i=1}^{n}\delta(\theta_{i},map(l_{i}))}{n}.$ (5)

$\delta(x,y)$ is the impulse function: $\delta(x,y)=1$ if $x=y$, and $\delta(x,y)=0$ otherwise. $map(l_{i})$ is the mapping function that matches the obtained labels to the ground-truth labels. The optimal matching was found using the Hungarian algorithm.
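Eq. (5) can be illustrated with the sketch below. To stay short and self-contained, it replaces the Hungarian algorithm with an exhaustive search over all one-to-one label mappings, which gives the same optimum but is only practical for a small number of communities. It also assumes that the algorithm produces no more distinct labels than the ground truth contains. The function name and the toy labels are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(truth, predicted):
    """AC of Eq. (5), with the best one-to-one label mapping found by brute force."""
    true_labels = sorted(set(truth))
    pred_labels = sorted(set(predicted))
    best = 0
    # Try every injective mapping from predicted labels to ground-truth labels.
    for perm in permutations(true_labels, len(pred_labels)):
        mapping = dict(zip(pred_labels, perm))
        hits = sum(1 for t, p in zip(truth, predicted) if mapping[p] == t)
        best = max(best, hits)
    return best / len(truth)


truth = [0, 0, 1, 1, 2, 2]
predicted = [1, 1, 0, 0, 2, 0]
print(clustering_accuracy(truth, predicted))
```

For these toy labels the best mapping sends predicted label 1 to community 0, 0 to 1, and 2 to 2, so five of the six nodes match and AC is 5/6.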

NMI is widely used to measure the similarity of two clusters. Assuming that $A$ and $B$ are two clustering results, the Mutual Information(MI) can be defined as:

$\displaystyle MI(A,B)=\sum\limits_{a_{i}\in A,b_{j}\in B}p(a_{i},b_{j})\cdot\log_{2}\frac{p(a_{i},b_{j})}{p(a_{i})p(b_{j})}.$ (6)

$p(a_{i})$ and $p(b_{j})$ represent the probability that a randomly chosen node belongs to cluster $a_{i}$ of $A$ or cluster $b_{j}$ of $B$, respectively, and $p(a_{i},b_{j})$ is the probability that a randomly chosen node belongs to both $a_{i}$ and $b_{j}$ simultaneously. The entropies are:

$\displaystyle H(A)=-\sum\limits_{a_{i}\in A}p(a_{i})\log_{2}p(a_{i}).$ (7)

$\displaystyle H(B)=-\sum\limits_{b_{j}\in B}p(b_{j})\log_{2}p(b_{j}).$ (8)

If $A$ and $B$ are identical, $MI(A,B)=\max(H(A),H(B))$. If $A$ and $B$ are completely independent, $MI(A,B)=0$. MI is normalized to $[0,1]$, and NMI is obtained as follows:

$\displaystyle NMI=\frac{MI(A,B)}{max(H(A),H(B))}.$ (9)
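Eqs. (6) to (9) translate directly into a few lines of Python. The sketch below estimates all probabilities as relative frequencies over the $n$ nodes. Returning 1.0 when both clusterings consist of a single cluster (so that both entropies are zero) is a convention chosen here, not something stated in the text.

```python
import math
from collections import Counter


def nmi(A, B):
    """Normalized mutual information of two label lists, following Eqs. (6)-(9)."""
    n = len(A)
    pa, pb, pab = Counter(A), Counter(B), Counter(zip(A, B))
    mi = sum(
        (c / n) * math.log2((c / n) / ((pa[a] / n) * (pb[b] / n)))
        for (a, b), c in pab.items()
    )
    h_a = -sum((c / n) * math.log2(c / n) for c in pa.values())
    h_b = -sum((c / n) * math.log2(c / n) for c in pb.values())
    denom = max(h_a, h_b)
    return mi / denom if denom > 0 else 1.0  # convention for two trivial clusterings


print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition with swapped label names
print(nmi([0, 0, 1, 1], [0, 1, 0, 1]))  # statistically independent partitions
```

The first call returns 1 because NMI ignores the label names, and the second returns 0 because knowing one partition says nothing about the other.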

4.4 Results and discussion

Table 2
Experimental results: average NMI (%) of the four algorithms on the five datasets

Algorithm\Datasets	Football	Olympics	Politics-uk	Politics-ie	Rugby
UniNMF	47.15 $\pm$ 2.54	44.89 $\pm$ 4.19	33.09 $\pm$ 0.05	37.98 $\pm$ 0.14	43.74 $\pm$ 4.31
PMM	56.21 $\pm$ 0.02	65.08 $\pm$ 0.02	23.73 $\pm$ 0.15	40.17 $\pm$ 0.17	37.61 $\pm$ 0.25
MultiNMF	62.71 $\pm$ 0.06	74.52 $\pm$ 0.03	43.08 $\pm$ 0.04	56.52 $\pm$ 0.43	57.82 $\pm$ 0.02
RLS+UniNMF	100 $\pm$ 0.0	96.06 $\pm$ 0.35	88.52 $\pm$ 0.65	97.01 $\pm$ 0.63	88.36 $\pm$ 1.03

As Table 2 shows, UniNMF can unify the multidimensional relations and attribute features and detect communities via NMF. Across the five datasets, neither UniNMF nor PMM was consistently better than the other in terms of NMI. MultiNMF was superior to UniNMF and PMM because it was able to take advantage of the multidimensional relationships. When using RLS+UniNMF, the NMI was greatly superior to that of the other three methods, as it was able to make full use of the supervised information when choosing stable centers and adjusting the links in the network. RLS+UniNMF therefore led to a substantial improvement, achieving an NMI of more than 90% on the Football, Olympics, and Politics-ie datasets.

Table 3

Experimental results: average AC (%) of the four algorithms on the five datasets

Algorithm\Datasets	Football	Olympics	Politics-uk	Politics-ie	Rugby
UniNMF	42.63 $\pm$ 3.43	33.27 $\pm$ 3.52	59.00 $\pm$ 0.15	56.30 $\pm$ 0.17	46.60 $\pm$ 3.23
PMM	4.88 $\pm$ 0.17	4.98 $\pm$ 0.02	14.53 $\pm$ 1.37	19.60 $\pm$ 0.99	7.39 $\pm$ 0.25
MultiNMF	58.31 $\pm$ 0.14	66.03 $\pm$ 0.06	59.95 $\pm$ 0.03	66.15 $\pm$ 0.53	57.19 $\pm$ 0.04
RLS+UniNMF	100.0 $\pm$ 0.0	94.83 $\pm$ 1.21	82.76 $\pm$ 0.34	97.37 $\pm$ 0.26	77.52 $\pm$ 1.42

Figure 2.

The community detection result of RLS+UniNMF on football dataset using 10%, 20%, 30%, 36% supervised information, respectively. (a) 10% supervised information, NMI $=$ 89.28%, AC $=$ 88.31%; (b) 20% supervised information, NMI $=$ 94.76%, AC $=$ 95.16%; (c) 30% supervised information, NMI $=$ 95.16%, AC $=$ 98.79%; (d) 36% supervised information, NMI $=$ 100%, AC $=$ 100%.

In terms of accuracy, Table 3 shows that PMM, which is based on modularity maximization, performed far below UniNMF and MultiNMF. When no supervised information was used, MultiNMF performed best, but when a small amount of supervised information was available, the performance of RLS+UniNMF increased significantly. For the Football dataset, when no supervision information was used, the NMI was 47.15% and the AC was 42.63%. By applying the RLS strategy and using a small amount of supervised information, the NMI and AC were greatly improved.

As can be seen from Fig. 2, there were errors in community detection when using 10% supervised information, with an NMI of 89.28% and an AC of 88.31%. This was still much better than the methods using no supervised information. When using 20% supervision information, most communities were correctly divided, with an NMI of 94.76% and an AC of 95.16%. When using 30% supervised information, NMI and AC rose to 98.60% and 98.79% respectively, and most errors in community detection were eliminated. An NMI and accuracy of 100% were achieved when using 33.46% supervision information. Therefore, even a small amount of known information greatly improves the accuracy of RLS+UniNMF.
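The average figures quoted in the abstract and conclusions can be reproduced from Tables 2 and 3 with the short calculation below. Reading the reported improvements as percentage-point differences from the MultiNMF averages is an inference from this arithmetic, not something the text states explicitly.

```python
# Per-dataset means copied from Table 2 (NMI) and Table 3 (AC), in percent.
nmi_scores = {
    "MultiNMF": [62.71, 74.52, 43.08, 56.52, 57.82],
    "RLS+UniNMF": [100.0, 96.06, 88.52, 97.01, 88.36],
}
ac_scores = {
    "MultiNMF": [58.31, 66.03, 59.95, 66.15, 57.19],
    "RLS+UniNMF": [100.0, 94.83, 82.76, 97.37, 77.52],
}


def mean(values):
    return sum(values) / len(values)


avg_ac = mean(ac_scores["RLS+UniNMF"])
avg_nmi = mean(nmi_scores["RLS+UniNMF"])
gain_ac = avg_ac - mean(ac_scores["MultiNMF"])
gain_nmi = avg_nmi - mean(nmi_scores["MultiNMF"])

print(round(avg_ac, 2), round(avg_nmi, 2))  # 90.5 93.99
print(round(gain_ac, 2), round(gain_nmi, 2))  # 28.97 35.06
```

The averages of RLS+UniNMF over the five datasets come out as 90.50% (AC) and 93.99% (NMI), and the gaps to the MultiNMF averages are 28.97 and 35.06 percentage points.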

5. Conclusions

Complex networks are multidimensional, with link relationships and attribute features. In this research we proposed a semi-supervised robust link selection strategy for detecting communities by integrating multidimensional heterogeneous data and using entropy to measure how certain it is that a node belongs to its community. We found a stable hub set for each community, made full use of the supervised information to adjust the links between hub and boundary sets, and iterated to optimize the network structure. Experiments on five real network datasets demonstrated that the proposed RLS+UniNMF was effective when applied to multidimensional data in a heterogeneous environment. By applying the link selection strategy to adjust the network structure, the AC and NMI were increased by 28.97% and 35.06%, respectively, compared with conventional methods. The proposed method can be parallelized to reduce the iteration time, and it may be possible to extend the RLS+UniNMF approach to evolutionary community detection in multidimensional networks. To handle large-scale networks, the efficiency of the strategy needs to be improved in the future. In addition, the application of the proposed method to dynamic networks is still an open problem.

Acknowledgments

This work is partially supported by National Technology Research and Development Program of China (863 Program), Grant No. 2012AA01A510, the National Natural Science Foundation of China, Grant No. 61473149, and the Natural Science Foundation of Jiangsu Province, China, Grant No. BK20140073.
