Abstract
In social networks, the traditional locally optimized overlapping community detection algorithm has a free-rider problem in community extension, which mainly relies on the structure information of nodes but ignores the node attributes. Therefore, in this paper, we redefine community based on theoretical analysis and propose an overlapping community discovery algorithm based on the local interaction model. By fusing node attributes and structural information, we first proposed an improved density peak fast search method to obtain multiple core nodes in the community. Then, according to the interaction range and interaction mode of the core node, we established a local interaction model of the core node, which converts the interaction strength or the number of common attributes between nodes in the network into the change of the distance between nodes. Finally, according to the proposed improved clustering algorithm, we obtain the community where the core node is located and merge the communities with a high degree of overlap. The experimental results show that compared with other similar community discovery algorithms, the proposed method outperforms the state-of-the-art approaches for community detections.
Introduction
With the development of digital information networks, social networks have become an essential means as a type of complex network structure for understanding and analyzing potential behaviors in people’s daily lives. Based on social networks, we can expand social relationships, get more opportunities for cooperation, and get help from like-minded social friends. A social network is usually a complex system model that simulates various relationships between people in the real world with a graph structure, where nodes are users and edges are various social relationships between users. As an essential characteristic attribute of social networks, the community is described as close connections between nodes in the same community and sparse connections between different communities. It is widely used in public opinion detection, recommendation systems, and other aspects. It has become an essential tool to understand and analyze the structure of social networks.
For a long time in the past, many scholars detected non-overlapping communities mainly based on the structural relationship of nodes [1, 2, 3], that is, the relationship between users. At present, relevant research shows that users may belong to multiple communities due to the complexity and diversity of users’ social relationships in the real world [4, 5, 6]. In social networks, in addition to structural features, user nodes also include attribute features, such as user interests, occupation, and education level, etc. Therefore, how to use the structural characteristics and attribute characteristics of nodes has become a hot topic in current research to obtain overlapping community [7, 8, 9].
At this stage, more and more methods are proposed to detect overlapping communities on the network. However, most of these methods are based on the structural characteristics of nodes, which are mainly divided into clique percolation [8, 9], link partitions [10, 11], and label propagation [12]. These methods can better detect the community structure, but the network’s global structure information must be obtained. When the network information is incomplete, it will be subject to some restrictions. In addition, the local optimization algorithm (LOM) is also an overlapping community discovery algorithm. LOM usually takes the core node of the network as the community center and uses local information to expand the community through optimized functions. Because LOM reveals the structural characteristics of the whole and part of the network, it has more significant advantages in network local feature mining and cost calculation [13]. Besides, LOM does not need to obtain complete network structure information. Therefore, LOM is more suitable to analyze large-scale social networks containing node attribute characteristics [14]. In 2009, Lancichintti et al. [15] proposed a local fitness maximization algorithm (LFM), which randomly selects core nodes, and expands the community where the core nodes are located through the rate of change of the fitness function value. Coscia et al. [16] proposed the DEMON algorithm, which expanded the community union through the neighborhood information of the core node. However, In the discovery community these methods directly affect the community’s quality due to the random selection of core nodes, forming many redundant communities. For this reason, Lee et al. [17] proposed the Greedy Clique Expansion algorithm, which selects the largest clique not less than
To avoid the above-mentioned problems, related scholars propose a non-optimized method to obtain overlapping community structure, which uses the attribute characteristics of nodes. Zhang et al. [21] proposed the Tsoca-SA algorithm that fuses node attributes. They obtain public edges (connecting edges between communities) using non-overlapping community discovery algorithms based on structural information. Then, they obtain the final overlapping community structure, by judging the community to which the public node belongs based on the node attribute information. Li et al. [22] designed an attribute network-oriented spectrum community detection method, which combines node structure and attributes characteristics. However, when the network scale is large, or the network information is incomplete, these algorithms cannot obtain the desired results. Fang etc. [23] proposed an ACQ algorithm, which searches for nodes that share neighbors and common attributes with core nodes. Huang et al. [24] regarded multiple nodes as core nodes. It was required that the nodes within the community must be tightly connected, and the distance between nodes must be less than a specific value. They proposed the LocTAC algorithm. The algorithm first calculates the correlation between the core node and its neighboring nodes. Then, the neighbor nodes with higher correlation are added to the community where the core node is located, and the community structure is adjusted according to the attribute information of the node. However, these methods do not consider the degree of overlap between communities. In other words, communities with large overlaps are not merged, and there is community redundancy. Different from the above algorithm, in this paper, we propose an overlapping community discovery algorithm based on a local interaction model (OCBLIM), which combines the structure and attribute characteristics of nodes and takes advantage of the local optimization algorithm. We search for the core nodes of the network according to the improved algorithm and build a local interaction model based on the core nodes. Then, according to the proposed algorithm, we obtain overlapping communities with tight connections and larger attribute correlations between nodes.
The main contributions of our work are summarized as follows. (1) We propose an overlapping community discovery algorithm based on a local interaction model (OCBLIM). (2) We redefine the community based on theoretical analysis. (3) For any given node, we propose a framework model to obtain the community where the node is located. (4) Compared with other similar algorithms on the real data set, the experimental results show that the community divided by the proposed algorithm is generally better among multiple evaluation indicators.
The rest of the paper is organized as follows: Section 2 introduces the concepts related to this article. Section 3 redefines the community based on theoretical analysis. Section 4 discusses the algorithm of this paper in detail, including the search of core nodes and the establishment and solution of models. Section 5 verifies the feasibility of the algorithm proposed through experimental analysis. The last section concludes the paper and indicates the work to be done next.
Related technologies
Social networks
A social network is a graphical description of the actual relationship of users. The nodes in the network not only have social relationships but also contain attribute characteristics. For example, in a co-author network, the nodes represent the authors of the paper. The edges represent the cooperative relationships between authors. The attributes are the authors’ related research directions. The user information we collect and extract usually exists in an unstructured form. In this paper, to facilitate the understanding and analysis of social networks, we structure user information through an ontology model which can be used as a standard modeling tool because it guarantees the interoperability between data [25]. As shown in Fig. 1: A social network consists of three parts: nodes, domain information and communities. Each node can be described as a set of attributes composed of implicit features and explicit features. The explicit features are the user’s external attributes, which are manifested as the user’s demographic attributes, such as gender, education, and occupation, as well as ratings and search records, etc. The implicit features are the intrinsic attributes of users, which can only be inferred from user behavior data. For example, we use user search records as keywords to analyze user interest characteristics. For different application scenarios, different social networks generate different domain information. For example, the information generated by the Twitter network is mainly related to the user’s daily life dynamics, while the information generated by the co-authoring network is mainly the user’s academic expertise. The domain information contains the user’s relationship type. For example, in the Twitter network, users are friends, and in co-authoring networks, users are cooperative. At the same time, domain information also includes domain experts, who play a relatively important role in the development and change of the network. For example, authoritative authors in the co-authored network may can determine the research direction of related industries. Besides, social networks contain different communities, and each community reflects the same theme. For example, in a co-authored network, the same community means that users have the same research direction.
Social network ontology model.
After the data is structured, we formally define a social network as an undirected graph
Clustering by Fast Search and find of density peaks
Clustering by Fast Search and find of density Peaks (IDP) is a Clustering algorithm published by Rodriguez et al in Science in 2014 [26]. The main idea of the algorithm is that the cluster center point has a large local density, that is, the number of neighboring nodes, and the minimum distance between the centers of different clusters is as large as possible. The algorithm first determines the cluster center point and the number of clusters by constructing a decision graph. Then, the algorithm classifies the non-class center point
Because the algorithm can quickly find the cluster center node and the number of clusters in arbitrary shape data, some scholars apply it to social networks to find the core nodes of the community and determine the number of communities [27, 28]. However, this algorithm is only applicable to the Euclidean distance between nodes, and it does not apply to the search for core nodes in social networks containing attribute characteristics.
Canopy clustering
Canopy clustering algorithm [29] is an unsupervised clustering algorithm based on distance proposed by McCallum et al. The execution steps of the algorithm are described as follows: First, the entire data node set group is taken as a list, and an initial threshold
Since this algorithm is suitable for large-scale and high-latitude data sets, and has a greater advantage over other clustering algorithms in clustering speed, in this paper, we improve the Canopy clustering to obtain the community structure of the network.
Schematic diagram of node distance change.
In real life, when we interact with different types of people, we will adopt different ways of interaction, and the scope of our social interaction is mainly concentrated on the neighbors who have contact with us or strange users who have common interests and hobbies. With time, due to frequent interactions and common attributes, the cohesion of users in the same community will be enhanced through mutual influence, while the cohesion of users between communities will be weakened [30]. Cohesion reflects the degree of interaction between nodes and the number of common characteristics. The stronger the interaction and the greater the number of common features between users, the stronger their cohesion. On the contrary, their cohesion is weaker. In this paper, we use the distance between nodes to describe cohesion. If the distance between nodes is closer, the cohesion is stronger, and they are more likely to be in the same community. Conversely, if the distance between nodes is farther, the cohesion is weaker, and they are less likely to be in the same community. Based on this relationship, we convert the complex social network structure into a simple distance measure between nodes. Since the core node is the center of the community, we propose a local interaction model based on the core node’s interaction range and interaction method. We transform community discovery into finding a collection of nodes that are close to the core node. Afterward, we obtain the overlapping community structure according to the proposed method, avoiding the shortcomings of the local optimization algorithm.
Figure 2 is a schematic diagram of the distance change between nodes in the co-author network. In Fig. 2A, a1, a2, and a3 are the core author nodes, and the dotted line indicates that the authors have a joint research direction or have co-published papers. With time, the cohesion of authors with the same research direction or with a strong degree of interaction is enhanced through mutual influence and moves in the direction of decreasing distance. As shown in Fig. 2B, the non-core nodes in the network move to the core nodes to reduce the distance. Simultaneously, because the non-core nodes are connected by a dotted line, they move to their internal directions to reduce the distance. When the distance is not changing, the final stable community structure can be obtained through the clustering method. As shown in Fig 2C, the co-author network is divided into two community structures. Node P is in two communities simultaneously, which is an overlapping node.
Key technology
Community definition
Most of the existing local optimization algorithms rely on the structural features of nodes to obtain the core nodes of the network. The core nodes are defined as nodes with more neighbors and called the center of the community [31, 32, 33]. However, user information includes more than just social relationships. User nodes with frequent attributes but few social relationships in the community can also be called core nodes. Therefore, in the text, we will redefine the core node from two aspects: node neighbors
Besides, due to the complexity of user information, in the process of expanding the community, the community divided by the traditional LOM that only uses the structural relationship of the nodes is incomplete. Although existing overlapping community discovery algorithms combine node structure information and attribute characteristics, they cannot obtain an ideal community structure when dealing with large-scale or incomplete data. Based on this, we put forward two hypotheses, combining user real-life scenarios.
According to the relationship between users, we propose hypothesis 1. For example, in the Weibo social network, well-known users with high influence are followed by many fans. Due to the transitivity of the interaction, there are also connections between users who interact frequently with the core node
Hypothesis 2 is proposed based on node attributes. For example, in the coauthor network, ordinary authors in the same direction and authoritative authors with a large number of published papers may come from the same community. Due to the mutual influence between attributes, authors in the same research direction help each other, answer each other’s questions, and jointly complete scientific research goals. However, the experience of communicating with authoritative authors is often greater than that of ordinary authors.
Social relationships are manifested as the structural features of nodes, while implicit features and explicit features reflect the attributes of users themselves. Based on two basic hypotheses, we define communities as:
According to Nature 1 and 2, it can be generalized to general conditions. For any given node
In the same community, since the core node is the center of the community, the core node’s neighbors or the attribute feature frequency is greater than the non-core node, and the value of the parameter
According to the above discussion, the community based on the core node division satisfies the close structural connection or the attribute characteristics appear frequently in the community. This effectively reveals the community structure of the network and makes the proposed algorithm universal. The remainder of the article, if there is no obvious emphasis, discusses community
Overlapping community discovery algorithm based on the local interaction model
Compared with other algorithms, the locally optimized overlapping community discovery method does not need to obtain complete network structure and attribute information, so it is used by many scholars to analyze large-scale social networks. However, there is a free-riding problem in optimization and expansion, which affects community discovery quality. Therefore, in this paper, we obtain the core nodes of the network in advance, and establish the local interaction model of the core nodes, by using the advantages of the local optimization method. Based on this, we have developed an overlapping community discovery algorithm based on a local interaction model (OCBLIM). As shown in Fig. 3, the algorithm is mainly divided into three parts. First, we propose an improved density peak fast search algorithm (ICDP) to obtain several core nodes of the network by combining the structure and attribute characteristics of the nodes. After that, we establish a local interaction model of the core node, based on the core node’s interaction range and method. The degree of interaction or the number of shared attributes between non-core nodes and core nodes is converted into changes in distance. Finally, we propose an improved Canopy algorithm (CI) to obtain the community where the core node is located and merge the communities with a high degree of overlap. In addition, in the process of algorithm execution, the domain information and domain experts of different social network applications will affect the selection of core nodes and the establishment of local interaction models. We will analyze in detail in the algorithm.
Flow chart of OCBLIM algorithm.
To make up for the deficiencies of the IDP algorithm, we propose an improved density peak fast search algorithm (ICDP) to find the core nodes of the network. Since nodes contain attribute features, we will redefine related concepts.
Where
When the local density
According to Eq. (7), the nodeset before the node with the largest value of
Based on the IDP algorithm, we combine the structural features and attribute features of the nodes, transform the Euclidean distance metric into the reciprocal of the two-dimensional vector modulus, and propose an improved CIDP algorithm, which is applied in social networks to find the core node-set
After obtaining the core node of the network, we established a local interaction model of the core node, which calculates the degree of influence between the core node and its neighbors, and changes the relative distance between them. Then, we propose an CI algorithm, which quickly obtains overlapping communities of the network based on the distance between nodes and the degree of overlap between communities.
Local interaction model
The local interaction model takes the core node of the network as the center of the community, and determines the degree of change in the distance between the core node and its neighbors based on the core node’s interaction range and interaction method. The closer the distance, the greater the possibility of belonging to the same community. The core node’s interaction range is limited to the set of nodes that have a structural relationship or similar attributes, and the interaction method depends on the attributes of the node itself. In this paper, we use
The node structure is an external feature that represents the relationship between users. In the field of communications, the longer the path taken in the information dissemination, the weaker the accuracy and completeness of the information. Similarly, in a social network, when two user nodes need to establish an interactive relationship, the more nodes they pass, the farther the distance between them.
In real life, users who have more mutual friends will be closer. For example, users A and B are friends of many mutual friends, which indicates that they have many common interests and show close relationships.
Therefore, according to Nature 3 and 4, we use
Where
The attributes of nodes are implicit characteristics, such as occupation, education, age, and so on. Since different social network applications produce different domain information, the importance of attribute characteristics is not equivalent. Attributes similar to domain information may account for a larger proportion, and the attributes possessed by domain experts may play a decisive role. Because the higher the similarity of the attribute features of the nodes, the stronger the homogeneity. Based on this relationship, we propose node homogeneity
Where,
Where
Finally, we get the final update formula of distance
In the previous section, we converted the complex network structure into a simple distance measurement between nodes, by using a local interaction model, which greatly simplified the scale of problem-solving. In this section, we propose a CI algorithm that selects the core node as the initial community, expands the community, and merges the communities with a high degree of overlap based on the threshold. By specifying the initial community, the algorithm avoids the influence of the random selection of the threshold on the number of divided communities.
As shown below, we first set the initial thresholds
Algorithm description
In this paper, we propose the ICDP algorithm to find the core node, combining the structure and attribute characteristics of the node. By constructing the local interaction model of the core nodes, we convert the complex network into the distance between nodes. After that, the overlapping community structure is obtained based on the proposed CI algorithm. Finally, combining the above three points, we propose an overlapping community discovery algorithm based on a local interaction model (OCBLIM). The algorithm is mainly divided into 3 parts. The first part is to obtain the core node-set
Experimental analysis
This section evaluates the proposed algorithm’s efficiency and feasibility through experimental methods, and all experiments use python coding.
Data set
Data set
Data set
As shown in Table 1, We select 5 social networks [5, 37] with attributes to verify the proposed algorithm’s feasibility. Where V represents the number of nodes in the network. E represents the number of edges.
The Facebook data set comes from the social lists of Facebook users who participated in the survey, containing each surveyor and their corresponding circle of friends. On average, there are 22 users in each social circle, and each user has 26 types of attributes, such as residence, birthday, colleagues, political leanings, etc., and these data are all anonymized. The Cora and PubMed data sets are citation networks, where nodes represent paper id, edges represent the relationship between the papers and the application, and the attributes of nodes are keywords that frequently appear in the paper. The Cornell data set comes from the university’s official website, where nodes represent web pages, edges are hyperlinks existing in web pages, and node attributes are keywords frequently appearing in web pages. DBLP is a co-authored network. Nodes and edges represent the cooperative relationship between authors and authors, respectively. Edge weight is the number of jointly published papers, and the attribute of nodes is the keyword with high frequency in the paper.
The selected data sets are derived from public data sets of social network applications. These data sets collect the active status of some users over a period of time and do not represent all users. These data sets are mainly used to evaluate the feasibility of community discovery algorithms. The selected data sets contain the attributes of users. According to users who have similar interests but no social connection may come from the same community, we integrate the attributes of users to improve the accuracy of community division.
We use the following three evaluation methods to verify the feasibility of the proposed algorithm from different perspectives:
Where
Where
Where
Parameter analysis
The parameter
The value of NMI under different parameters.
It can be seen from Fig. 4 that the NMI value of the proposed algorithm has different changing trends in different data sets. In the DBLP and Facebook datasets, because the number of known communities and the degree of overlap between communities is greater, a larger fusion threshold makes the algorithm performs better. On the contrary, in the Cora, PubMed, and Cornell data sets, because the number of known communities and the degree of overlap between communities are small, the smaller the fusion threshold makes the algorithm performance improve. As the parameter
The parameter
The value of parameter 
It can be seen from the figure that in addition to the Core data set, in the other four data sets, when the value of
In this paper, to verify the feasibility of the proposed algorithm, we choose five classic algorithms and the proposed algorithm for experimental comparison. The algorithms are the LFM algorithm [15], the CFLEM algorithm [19], and the LOC algorithm [18] based on node structure characteristics, and the Tsoca-SA algorithm [21] and LocTAC [24] algorithm that merge node attribute characteristics. As a classic local optimization overlapping community discovery algorithm, the LFM algorithm has become the most referenced algorithm for comparison, analysis, and learning by related researchers due to its easy-to-understand principle and simple implementation. The CFLEM algorithm selects unvisited nodes with large clustering coefficients as core nodes, reducing the uncertainty of the results caused by randomly selecting core nodes. The LOC algorithm expands the community by adding the node with the greatest fitness and all k-factions where the node is located, which improves the expansion efficiency but affects the accuracy. The LocTAC algorithm integrates the structural characteristics and attribute characteristics of nodes, and uses local information to obtain the community structure by calculating the correlation between the core node and adjacent nodes, which improves the quality of community division. Unlike LocTAC, Tsoca-SA algorithm obtains overlapping community structure by using global network information. Therefore, this paper selects the above five algorithms as experimental references.
Table 2 shows the value of the evaluation index of each algorithm on different data sets. It can be seen from the table that the community quality obtained by the Tsoca-SA and LocTAC algorithms that integrate node structure and attribute information is significantly better than the LFM, CFLEM, and LOC algorithms based on node structure characteristics. Simultaneously, the proposed algorithm (OCBLIM) is generally better than other algorithms in three evaluation indicators. Since we convert the complex network structure in the social network into a simple distance measurement between nodes through the local interaction model, the community structure is obtained through the traditional clustering method. The free-riding problem of local optimization methods is avoided, and the quality and accuracy of the community are improved.
Performance comparison of different algorithms
Performance comparison of different algorithms
The
Performance comparison of community 
In order to verify the difference between OCBLIM algorithm and the compared algorithm, we conducted T-test analysis on each comparison algorithm and OCBLIM algorithm respectively. Without distinguishing data sets, we conducted T-test analysis for each evaluation indicator. As can be seen from Table 3, the
To verify the given condition of any node
Summary and conclusions
Aiming at the existing problems of the traditional local optimization algorithm, we merge the node attribute characteristics and propose an overlapping community discovery algorithm based on the local interaction model (OCBLIM). We first obtain the core nodes of the network according to the proposed ICDP algorithm. Then, the local interaction model of the core node is established, and the interaction strength or the number of common attributes between the nodes in the network is converted into the change of the distance between the nodes. Finally, we obtain the overlapping community structure of social networks through the proposed CI algorithm. Experimental results on real data sets show that the community structure quality obtained by our proposed algorithm is generally better than other similar algorithms. In future work, to provide users with more humane community services, we propose a general framework for community discovery algorithms, which combines various complex information generated by users in real life.
Footnotes
Acknowledgments
This work was supported by the National Natural Science Foundation of China (61967013). This work was also supported by Northwest Normal University Graduate Research Funding Project (2020).
