Abstract
Setting up a multidimensional network is an important problem in complex networks and has become a future development trend in the fields of biological gene networks, social networks and so on. A multidimensional network comprises connections and attributes. Community detection in heterogeneous datasets in different dimensions is more difficult than that in a single network. Traditional methods for dealing with multidimensional networks are ineffective, because of using supervised information or applying strategies for adjusting the graph structure of a single network. In this paper, we propose a semi-supervised community detection method for multidimensional heterogeneous networks. First, we generate a single network by integrating the multidimensional heterogeneous networks. The robust semi-supervised link adjustment strategy is then iteratively applied to the single network to make full use of dynamic supervised information for adding or removing links based on node entropy. Experimental results are obtained by five real multidimensional social datasets. The results show that the proposed method can effectively integrate heterogeneous data. The average accuracy rate and standard mutual information were 90.50% and 93.99%, respectively, representing improvements of 28.97% and 35.06%, respectively, over existing methods.
Keywords
Introduction
Complex networks can be found in a range of fields including molecular physics, biomedical sciences, engineering, and human sciences, and they have recently been extended to community-based research. The term “community” refers to an area with a compact connection between nodes and relatively weak connections between communities. The dynamic characteristics of such systems are found in various natural settings such as the tissues and organs, formation of air molecules into a cohesive air mass, and the interest groups of a networked society. Identifying such communities or organizations can help analyze their complex networks, clarify their evolution and constitution, and understand of their dynamics. A graph with the following features can be used to describe a complex network: the network nodes correspond to points in the graph, the connections between nodes are expressed by edges, and the whole graph can be represented by an adjacency matrix. Nodes also have features that can be expressed as an attribute matrix.
The basic concepts are first defined. The single-layer network is denoted by
Most community studies focus on a single-view network, in which a single network is used to represent the community structure [3, 4, 5]. The algorithms have often yielded successful results. However, over recent years, it has been observed that the single-view network is unable to reflect the real physical connections between nodes. Because of the diversity of relationships between different nodes and attributes, a multiview approach must be taken to handle heterogeneous data, e.g., when analyzing the social relations and transactions between friends, in working relationships or when using e-pal. Friends may share the same hobbies and express the same views. Conversely, different relationships may exist between two people and a single network cannot fully capture the connections between them. Analyzing a complex network topology as a single relationship may wrongly segment the community, and the integration of more heterogeneous information is required to accurately characterize the community structure. The use of multidimensional heterogeneous data can overcome the lack of information, noise, or invalid connections in a single network and allow the community structure to be accurately obtained [6, 7]. When using multidimensional networks, it is necessary to ensure that the internal nodes of the community are similar and the nodes between communities are dissimilar. In this study, we applied semi-supervised information to community detection. A correct community structure can be used to guide data mining and the accurate positioning of user requirements [8].
Traditional community detection methods use single networks and are based on the clustering of the adjacency and attribute matrices. These methods have difficulty in combining different data sources and cannot be extended to multidimensional networks. In recent years, researchers have proposed the multisource data fusion algorithm, which fuses the relationship between data and properties by integrating two different information sources into the same form and then using clustering to derive the community structure. However, it is difficult to fuse information of different dimensionalities with such algorithms, and combining different types of data remains a challenge in multidimensional community detection. The matrix factorization algorithm has been used successfully in speech signal processing [9], document clustering [10], and image recognition [11], as well as for community detection in single-dimensional networks [12]. For community detection in multidimensional networks, most algorithms transform the attributes into a relational matrix, and the community structure is then obtained through matrix factorization. However, the results have been disappointing. Matrix factorization methods are more suitable for single data source and multidimensional matrix factorization usually requires many parameters or control constraints, which degrades the results. It also exhibits sensitivity to initial conditions, so that the setting of the initial value will affect the optimization. No current algorithm can effectively acquire the community structure of a multidimensional network.
In this analysis, multidimensional information was used to guide community detection and to fuse different networks to study the precise community structure. On this basis, we proposed a method for multidimensional community detection using heterogeneous data. First, a unified graph was used to represent the original multidimensional network, and a robust semi-supervision link adjustment strategy was then iteratively applied, increasing and deleting links according to the dynamic entropy of the node. The method was shown to effectively utilize the relationships between different information sources, and to integrate the overall relationship and attribute information to arrive at the optimal community structure. This provides an effective framework based on the multidimensional networks of attribute and relational data.
Previous research
Traditional approaches to community detection generally only consider a single adjacency or attribute matrix. However, a single-layer network is unable to fully reflect the features of nodes, and can produce only a partial outcome. A multidimensional network contains more than one dimension, which means that multiple adjacency matrices and attribute information must be considered. Community detection using multidimensional information can be divided into two categories: attribute characteristics-based and adjacency relationship-based. We analyze the limitations of the existing algorithms in Section 2.3.
Attribute relation-based community detection
In this method of network representation, the adjacency matrix reflects the network structure, and the node attributes refer to real physical properties [13, 14]. There are two different aspects of supervision information. The general approach is to measure the similarities between properties, converting the attribute matrix into a similarity matrix, in the form of an adjacency matrix. For example, Tang et al. [15] and Bassett et al. [16] used similarity to transform the attributes into an adjacency matrix. They then combined this with the existing network structural information to achieve the integration of heterogeneous information by fusing the decomposition matrix. However, such methods combine only a single attribute and relational datum. Ruan et al. [17] used an adjacency relationship and attributed similarity to calculate the strength of the links between two nodes. The strength of probability for link was maximized to allow community detection using standard clustering algorithms. Yang et al. [18] used the parameter estimation method to establish the probability distribution of the adjacency and attribute matrices and to estimate the parameters of the model. The document clustering method can also be used to find communities. Pei et al. [19] used triple matrix factorization to integrate the adjacency matrix and attribute matrix. The calculation of attribute similarity could be increased by constraining decomposition.
Adjacency relation-based community detection
In real networks, the attributes of nodes are difficultly acquired. In many cases, the links are multidimensional, and each link reflects the different relationship. Different links refer to different physical entities and the multidimensional adjacency matrix represents the complexity of the physical world. Tang et al. [20] extracted the attributes of each layer of the adjacency matrix, used the largest eigenvalue vector to compose a global attribute matrix, and then used k-means to extract the features to derive the community structure. Triple non-negative matrix factorization performs well when applied to adjacency matrices, and several studies have shown that it is suitable for community detection. Pei et al. [19] combined triple non-negative matrix factorization with regularization terms, and used this not only in the attribute matrix and adjacency matrix, but also directly in the multidimensional adjacency matrix. Liu et al. [21] proposed an unsupervised method named multiview clustering, which extracts the potential structure from each view. He assumed that one node in different views belonged to the same community, allowing the clustering of different views.
Existing algorithms
As noted in Sections 2.1 and 2.2, the problems with the existing algorithms are as follows: 1) For multidimensional information, approaches using reference text clustering methods, similarity matrix calculation, transformation of the attribute matrix into an equivalent adjacency matrix, and multidimensional relation matrix analysis are still ineffective in the detection of communities. 2) Community detection methods based on multidimensional adjacency matrices can only analyze each graph to identify a community, but cannot analyze the full multidimensional structure. This fracture the relationships in network. 3) Existing multidimensional community detection methods are based on unsupervised clustering and the results are not ideal. In addition, no supervision information is available for evaluation.
Community detection using a semi-supervised robust link selection strategy
The information sources of a multidimensional network take the following two forms: relational data (linkage relationships) and attribute data (characteristics relationships). The two forms address community detection from different viewpoints. Single-link relational data may have missing information or contain noise. This makes community detection difficult, and the correctness of the community identified cannot be guaranteed. We combine several link relations and attributes to create a unified graph. A unified graph can help overcome the low-accuracy problem in the matrix factorization of multidimensional information, and supervision information can be used to improve the classification accuracy. The research problem is defined and analyzed in Section 3, and a unified graph model is built by fusing multidimensional networks. Finally, a semi-supervised robust link selection (RLS) strategy is introduced, which uses supervision information to adjust the network structure.
Related definitions and analysis
The relationships between users in a group comprise not only connections, but also properties or characteristics. A user relationship is simultaneously observed by the multidimensional networks. For example, in Twitter, messages sent by the user constitute one dimension, forwarded messages constitute another dimension, and the user’s rankings constitute a third dimension. Making use of this multidimensional information is challenging. Greene and Cunningham [23] in 2013 proposed using a multidimensional network to build a unified graph, a method that can deal with both relational data and attribute data.
There are The ranking vector The The SVD of After the
Figure 1 shows a flow chart of the Algorithm 1 that is Robust link selection with NMF (RLS+UniNMF). The concrete steps will be explained in the remainder of this paper. The unified graph of the multidimensional network is constructed through the fusion method, making it difficult to derive an accurate community structure by NMF directly, as there exist false or redundant link relationships and boundary nodes between communities. In a real network, it is not the case that all nodes are labeled, so a small amount of supervision information can adjust the links, greatly improving the accuracy of community identification. To make better use of supervision information, we introduced a semi-supervised robust link selection strategy (RLS strategy), see in Algorithm 2.
Robust link selection with NMF (RLS+UniNMF)[1]
Flow chart of semi-supervised robust link selection strategy for multidimensional networks community detection.
First, information entropy was used to measure the level of certainty of a node belonging to its assigned community. By definition, the smaller the entropy, the larger the certainty. The community possibility of a node is defined as the level that the node belonging to a community. When the possibility becomes the same as each other, the entropy of the node is maximal. Link entropy is defined as the mean of two endpoint entropies. The entropy of node
The community labels and the application area of the strategy are determined by the community centers. If only the minimal entropy of the node is chosen as the community center, a different community center will be identified in each iteration. The links between the centers will affect the stability of the centers, and thus affect community detection. For robustness and stability, not only one entropy-minimum node is considered, but a minimum set of nodes in the link selection strategy. The specific process is as follows:
Determine the stability of the community center For community Adjust the intra-community links If the center of community We then determined whether the endpoints of Adjust the inter-community links Communities
Links connecting Links connecting
Datasets
To test the effectiveness of the proposed algorithm (RLS+UniNMF), we conducted experiments using the following five Twitter datasets [23]:
Football: A dataset of 248 English Premier League football players and clubs active on Twitter. Olympics: A dataset of 464 athletes and organizations that were involved in the London 2012 Summer Olympics. Politics-ie: A dataset of Irish politicians and political organizations, assigned to seven disjoint ground truth groupings, according to their affiliation. Politics-uk: A dataset of 419 Members of Parliament (MPs) in the United Kingdom. The ground truth consisted of five groupings, corresponding to political parties. Rugby: A dataset of 854 international Rugby Union players, clubs, and organizations currently active on Twitter.
The five datasets are shown in Table 1. Summary of the five datasets, including total number of users, followers, mentions, retweets, user lists, and the number of associated ground truth communities.
Detail of datasets
Principal Modularity Maximization (PMM)
We calculated the modularity matrix
Multi-View NMF (MultiNMF)
A network
The sum of each column in
First, we produced a unified network
Evaluation measures
Because the datasets had classification labels and nodes belonging to a single community, we were able to evaluate the algorithm using those labels. Here, we used Accuracy (AC) and Normalized Mutual Information (NMI) [25] to confirm the effectiveness of the algorithms. There were
NMI is widely used to measure the similarity of two clusters. Assuming that
If
Experiment results average NMI of 4 algorithms in 5 different datasets
Experiment results average NMI of 4 algorithms in 5 different datasets
From Table 2, UniNMF can unify the multidimensional relations and property features for detecting the community via NMF. In the five datasets, no difference was found between the NMI of UniNMF and PMM. MultiNMF was superior to UniNMF and PMM because it was able to take advantage of the multidimensional relationships. When using RLS+UniNMF, the NMI was greatly superior to that of the other three methods, as it was able to make full use of the supervised on information when choosing stable centers and adjusting the links in the network. RLS+UniNMF therefore led to substantial improvement, achieving an NMI of more than 90% on the football, olympics, and politics-ie datasets.
Experiment results average AC of 4 algorithms in 5 different datasets
The community detection result of RLS+UniNMF on football dataset using 10%, 20%, 30%, 36% supervised information, respectively. (a) 10% supervised information, NMI 
In terms of accuracy, from Table 3 the maximum modularity achieved by PMM was far below that of UniNMF and MultiNMF. When no supervised information was used, MultiNMF performed better, but when a small amount of supervised information was available, the performance of RLS+UniNMF increased significantly. For the football dataset, when no supervision information was used, the NMI was 47.15% and the AC was 42.63%. By selecting the RLS strategy and using a small amount of supervised information, the NMI and AC were greatly improved.
As can be seen from Fig. 2, there were errors in community detection when using 30% supervised information, with an NMI of 89.28% and an AC of 88.31%. This was much better than methods using no supervised information. When using 20% supervision information, most communities were correctly divided, with an NMI of 94.76% and an AC of 95.16%. When using 30% supervised information, NMI and AC rose to 98.60% and 98.79% respectively, and errors in community detection were eliminated. An NMI and accuracy of 100% were achieved when using 33.46% supervision information. Therefore, the algorithm is greatly improved by using little known information. The accuracy of RLS+UniNMF is greatly enhanced with supervised information.
Complex networks are multidimensional, with link relationships and attribute features. In this research we proposed a semi-supervised robust link selection strategy for detecting communities by integrating multi-dimensional heterogeneous data and using the degree of certainty to identify which node belonged to which community, as measured by entropy. We found a stable hub set for each community and made full use of the supervised information to identify links between hubs and boundary sets, and iteration to optimize the network structure. Experiments on five real network datasets demonstrated that the proposed RLS+UniNMF was effective when applied to multidimensional data and a heterogeneous environment. By applying the link selection strategy to adjust the network structure, the AC and NMI were significantly increased by 28.97% and 35.06%, compared with conventional methods. The proposed method can be parallelized to reduce the iteration time and it may be possible to extend the RLS+UniNMF approach to evolutionary community detection in multidimensional networks. In order to solve large-scale networks, efficiency of strategy need to be improved in the future. Besides, the application of the proposed method in dynamic networks is still an open problem.
Footnotes
Acknowledgments
This work is partially supported by National Technology Research and Development Program of China (863 Program), Grant No. 2012AA01A510, the National Natural Science Foundation of China, Grant No. 61473149, and the Natural Science Foundation of Jiangsu Province, China, Grant No. BK20140073.
