Heterogeneous graph community detection method based on K-nearest neighbor graph neural network

Abstract

Traditional community detection models either ignore feature-space information and require a large amount of domain knowledge to define meta-paths manually, or fail to distinguish the importance of different meta-paths. To overcome these limitations, we propose a novel heterogeneous graph community detection method called KGNN_HCD (heterogeneous graph Community Detection method based on $K$-nearest neighbor Graph Neural Network). First, a similarity matrix is generated to construct the topological structure of the $K$-nearest neighbor graph; second, the meta-path information matrix is generated by a meta-path transformation layer (Mp-Trans Layer) with weighted convolution; finally, a graph convolutional network (GCN) is used to learn high-quality node representations, and the $k$-means algorithm is applied to the node embeddings to detect the community structure. We perform extensive experiments on three heterogeneous datasets, ACM, DBLP and IMDB, and compare against 11 community detection methods such as CP-GNN and GTN. The experimental results show that the proposed KGNN_HCD method improves NMI and ARI by 2.54% and 2.56% on the ACM dataset, 2.59% and 1.47% on the DBLP dataset, and 1.22% and 1.67% on the IMDB dataset, respectively. These findings suggest that the proposed KGNN_HCD method is reasonable and effective, and that it can be applied to complex network classification and clustering tasks.

Keywords

Heterogeneous graph, meta-path, K-nearest neighbor graph, graph neural network, community detection

1. Introduction

1.1 Background

Community detection is a fundamental and important research area in network science. It aims to divide the tightly connected nodes of a network into communities, so that nodes within a community are densely connected while connections between communities are relatively sparse [1, 2, 3]. In social networks, community detection helps platform sponsors promote products and place topic recommendations [4, 5]; in biological networks, it reveals the complexity of metabolism and proteins with similar biological functions. In citation networks [6, 7], it determines the importance and interconnectedness of research topics and supports the identification of how research trends evolve.

Many excellent works have emerged for community detection tasks, and these models can be divided into three main categories. Traditional community detection methods [8, 9, 10, 11] depend heavily on the topological information of the network to detect communities; relevant examples are graph partitioning and hierarchical clustering approaches. Graph embedding based approaches [12, 13, 14, 15] learn node representations that are subsequently exploited to carry out the community detection task. In recent years, graph neural networks have been widely used for various tasks on graphs, and important efforts have been made on community detection [16, 17, 18]. These approaches propagate node features from a node to its neighbors and aggregate these messages. The node representations learned by graph neural networks have been shown to achieve state-of-the-art performance for community detection on most datasets.

However, most of the existing community detection methods (e.g., [16]) focus on homogeneous graphs with a single node type and a single edge type. Since real-world networks tend to have multiple node and edge types, such solutions are less effective on these networks. A graph with multiple node and edge types, which is common in the real world, is called a heterogeneous graph. Heterogeneous graphs contain more comprehensive information and richer semantics, so they have been widely used in many data mining tasks, such as social networks, knowledge graphs, and citation networks. For example, a citation co-authorship network has three node types (author, paper and conference), while its edge types include both published (being published) and authored (being authored) relations. Existing community detection approaches on heterogeneous graphs are mainly divided into two types: one detects clusters that contain as many types of node objects as possible, and the other forms clusters from nodes of a specific target type. In this paper, we focus on the second approach, which learns node embeddings with a graph neural network and clusters the nodes so that nodes in the same community are highly similar and strongly correlated.

1.2 Motivation

It is more challenging to detect communities on heterogeneous graphs, and traditional graph neural network models cannot be directly applied to heterogeneous graphs due to their high heterogeneity. A heterogeneous graph contains a kind of composite relationship with rich semantic information, which is called a meta-path. Because of the rich multi-hop semantic information in meta-paths, some nodes without direct edge connections are highly likely to form a community. When performing community detection on heterogeneous graphs, it is therefore important to fully consider the similarity of nodes in the feature space and their higher-order relationships, rather than only the existing topological edge connections. To effectively capture higher-order information between nodes, many innovative studies have been proposed for community detection. However, most of them rely on manually defined meta-paths [19], which must be changed by hand for each dataset. These efforts are highly dependent on the quality of the meta-paths, and different meta-paths chosen by experts can lead to markedly different results for the model. Moreover, the semantic information embedded in each meta-path is different, so it is desirable to distinguish the importance of each meta-path to better fit the research task and learn more effective, higher-quality node representations.

To address these limitations, this paper proposes a $K$ -nearest neighbor graph neural network for heterogeneous graph community detection, called KGNN_HCD for short, which aims to detect communities with the same target type nodes in heterogeneous graphs. The KGNN_HCD method firstly uses the structural information in the feature space, constructs the similarity matrix according to the node feature vector, and generates a $K$ -nearest neighbor graph to enhance the similarity between nodes and increase the possibility of nodes without edges being divided into a community. Secondly, KGNN_HCD generates meta-path transition matrix by Mp-Trans Layer and meta-path transformation matrix by matrix multiplication, adaptively learns meta-paths, and captures higher-order relationships between nodes by fusing meta-path information through GCN. Thirdly, KGNN_HCD uses the weight matrix to allocate attention scores for different heterogeneous node-edge relationships for the purpose of distinguishing the importance of different meta-paths. Finally, the learned node representation is used to conduct $k$ -means operation according to the number of node types to form the community division for specific nodes.

1.3 Main contributions

The main contributions of this paper are:

We construct the K-nearest neighbor graph topology by taking the feature information of nodes into account. Fusing this information enhances the similarity of nodes and raises the probability that nodes that are not directly connected along a meta-path are divided into the same community.

We propose a heterogeneous graph neural network community detection method called KGNN_HCD, which can learn meta-paths end-to-end, capture higher-order relationships, distinguish the importance between different meta-paths and can learn high-quality node representations. This paper also provides interpretability analysis of the Mp-Trans Layer.

Numerous comparison experiments are conducted on three real heterogeneous datasets, ACM, DBLP and IMDB, against many community detection models. The experimental results show that the proposed KGNN_HCD method improves significantly over existing state-of-the-art heterogeneous community detection methods, such as CP-GNN and GTN, in terms of the F1, NMI, ARI and Purity metrics.

2. Related work

2.1 Traditional community detection methods

Traditional community detection methods, which rely heavily on the network structure to explore communities, have attracted a great deal of research attention [20, 21, 22]. The Infomap [21] algorithm simultaneously encodes the communities in the network and the nodes in the communities to generate unique encoded representations of the nodes. A random walk is then performed on the network to obtain the total encoding length, and when the encoding length is the shortest, tightly connected nodes are divided into the same community, which yields the optimal solution. LPA [22] uses the label information of marked nodes to predict the labels of unmarked nodes and then identifies the communities. However, the methods mentioned above work on homogeneous graphs. Other works [23, 24, 25] are designed for heterogeneous graphs. Het-SE and Het-RSE [23] apply $k$-means clustering to the corresponding feature vectors of different node types. AGGMMR [24] is a framework that performs community detection using attribute and topology information through a greedy modularity maximization model. Based on regularization and non-negative matrix factorization, AJNMF [26] jointly considers link information and node content information to construct a heterogeneous network matrix and introduces a regularization function to reduce the impact of noisy data on communities, thereby improving community detection. In [27], the authors use meta-paths to capture the higher-order relationships between nodes and detect the community structure in heterogeneous graphs. The approach in [42] is a spectral community detection method based on the $\alpha$-normalized adjacency matrix; it improves the performance of spectral methods on heterogeneous dense graphs by optimizing the normalization and finding the optimal parameter values. BHCD [43] shifts the research focus towards sparse heterogeneous graphs. It proposes spectral clustering based on the Bethe Hessian matrix and finds that, for the specific parameter value $r=\zeta$, the method is robust to degree heterogeneity. It also predicts the accuracy of community partitioning by studying the informative eigenvectors of the Bethe Hessian matrix.

2.2 Network embedding based community detection methods

Community detection based on network embedding is a network representation learning approach that maps the high-dimensional, sparse vector space associated with nodes to a low-dimensional, dense vector space and enhances the node embeddings based on community characteristics. Many researchers have used network embedding to solve the community detection problem. DeepWalk [12] obtains the co-occurrence relationships between nodes by simulating uniform random walks in the network, and then learns vector representations of the nodes. The sampling strategy in DeepWalk can be regarded as a special case of node2vec with $p=1$ and $q=1$. Node2vec [13] proposes random walks based on DFS and BFS, which mine node representations reflecting homophily and structural similarity, respectively. CDE [14] is an embedding-based method that encodes the inherent community structure into the structural embedding through known community memberships, and then formulates community detection as a matrix factorization optimization problem based on the embeddings of node attributes and community structure. NEC [15] is a learnable network embedding algorithm for community detection in heterogeneous graphs, which jointly learns graph structure-based and clustering-oriented representations and finally adopts $k$-means for community detection. Metapath2vec [28], a node embedding method for Heterogeneous Information Networks (HIN) proposed by Dong et al. in 2017, uses meta-paths to guide random walks on a heterogeneous graph so that the generated node sequences contain rich semantic information. Compared with some previous models, HIN2Vec [29] retains more contextual information: it not only assumes that two nodes with a relationship are related, but also distinguishes the different relationships between nodes by jointly learning relationship vectors.

2.3 Graph neural network-based community detection methods

Graph neural network-based community detection methods are deep learning methods on the graph domain. They can capture the independence of graphs, handle the unordered nature of input graphs, and learn the state embedding of each node's neighbors, so scholars have developed many novel graph neural network-based methods. As one of the most widely used deep learning techniques, graph neural networks are well suited to learning from graph-structured data and to extracting its features and patterns, and they have secured a place in the field of community detection. LGNN [16] exploits the adjacency information of edges in the graph, which gives it powerful feature representation capabilities for homogeneous community detection. HAN [16] was one of the earliest works on heterogeneous graphs and requires predefined meta-paths applicable to each dataset. It uses a hierarchical attention mechanism to capture node-level and semantic-level importance, and applies GAT to assign attention scores to neighbors based on different meta-paths to obtain the final representation. HAN only considers the nodes at both ends of a meta-path; MAGNN [18] improves on it by using several Meta-path Encoders to encode all the information along the meta-paths. CP-GNN [6] proposes the context path based on the meta-path, where two primary nodes have a semantic relationship if they are connected by a context path. CP-GNN maximizes the co-occurrence probability of context neighbors to learn the node embeddings, captures higher-order relationships, and does not need meta-paths to be defined in advance. Gsim [44] proposes a GNN-based correlation measure for heterogeneous graphs, which extends CP-GNN to measure correlation in heterogeneous graphs. It theoretically proves that GNNs can effectively measure correlation and capture the semantics of correlation measurement. GTN [30] is a method for heterogeneous graphs that does not require manual definition of meta-paths. GTN can be regarded as a graph analogue of spatial transformer networks: it explicitly learns a transformation of the input network or features to obtain an effective node representation, which reduces high heterogeneity and enhances community detection results.

3. Preliminaries

To facilitate the subsequent elaboration, some terms and network model definitions related to this work are given here. An illustration of a heterogeneous graph is shown in Fig. 1, and Table 1 summarizes the commonly used notations for quick reference.

Table 1
Notations

Notations	Definitions
${\cal V},{\cal E}$	The set of nodes and the set of edges in a heterogeneous graph
${\cal A},{\cal R}$	The set of node types and the set of edge types in a heterogeneous graph
$\varphi(v),\psi(e)$	Node-type mapping function and edge-type mapping function
$v_{i},e_{ij}$	A node $v_{i}\in{\cal V}$ , an edge $e_{ij}\in{\cal E}$
${\cal G}$	A heterogeneous graph ${\cal G}=({{\cal V},{\cal E},{\cal A},{\cal R}})$
$P$	A meta-path in a heterogeneous graph
$A_{{\cal R}_{i}}$	A heterogeneous adjacency matrix with a node-edge relationship of ${\cal R}_{i}$
$A_{P}$	A heterogeneous adjacency matrix composed of a meta path $P$
${\cal G}_{k}=({{\cal V}_{k},{\cal E}_{k}})$	$K$ -nearest neighbor graph
$K$	The number of most similar neighbors selected
$S$	The similarity matrix
$C_{m}$	The $m$ -th community
$N$	The number of nodes in a heterogeneous graph or constructed $K$ -nearest neighbor graph
$d$	Dimensions of node features
$C N$	The number of multi-head channels
$x_{i},X$	The feature vector of node $v_{i}$ , node feature matrix
$W$	Weight matrix
$H^{(l)}$	The feature representation of layer $l$
$\tilde{D},\tilde{A}$	Degree matrix, the adjacency matrix adding the identity matrix
$\sigma(\cdot)$	Activation function

Definition 3.1. Heterogeneous Graph [31]. Heterogeneous graphs (or HINs) are an abstract modeling language for heterogeneous relational data and many complex systems, as shown in Fig. 1. Unlike a homogeneous graph, a heterogeneous graph has different types of nodes and edges. It is usually defined as ${\cal G}=({{\cal V},{\cal E},{\cal A},{\cal R}})$, where ${\cal V}$ denotes the set of nodes and ${\cal E}$ denotes the set of edges, together with a node-type mapping function $\varphi(v):{\cal V}\to{\cal A}$ and an edge-type mapping function $\psi(e):{\cal E}\to{\cal R}$, such that $|{\cal A}|+|{\cal R}|>2$. Each node $v_{i}\in{\cal V}$ in the graph ${\cal G}$ corresponds to a node type, $\varphi({v_{i}})\in{\cal A}$. Similarly, each edge $e_{ij}\in{\cal E}$ in the graph corresponds to an edge type, $\psi({e_{ij}})\in{\cal R}$.

Figure 1.

An illustration of a heterogeneous graph (DBLP). (a) Three types of nodes (i.e., author, paper, conference). (b) Example of a heterogeneous graph on the DBLP dataset. (c) Three meta-paths used in DBLP, i.e., author-paper-author (APA), author-paper-author-paper-author (APAPA) and author-paper-conference-paper-author (APCPA).

Example 1. As shown in Fig. 1a and b, for the convenience of illustration, a simple heterogeneous citation graph is constructed here using the DBLP dataset. In this graph, there are three types of nodes, namely, Paper (P), Author (A) and Conference (C). There are two types of edges, paper-author (P-A) and paper-conference (P-C).

Definition 3.2. Meta-path [32]. In a heterogeneous graph, a path that connects two heterogeneous nodes via a composite relationship is called a meta-path. Meta-paths can effectively capture semantic information and are a fundamental means of studying heterogeneous graphs. More formally, a meta-path ${\cal P}$ has the form $v_{1}\mathop{\to}\limits^{{\cal R}_{1}}v_{2}\mathop{\to}\limits^{{\cal R}_{2}}v_{3}\mathop{\to}\limits^{{\cal R}_{3}}\ldots\mathop{\to}\limits^{{\cal R}_{l}}v_{l+1}$ (which can also be abbreviated as ${\cal R}_{1}{\cal R}_{2}{\cal R}_{3}\ldots{\cal R}_{l}$), where ${\cal R}={\cal R}_{1}\circ{\cal R}_{2}\circ{\cal R}_{3}\circ\ldots\circ{\cal R}_{l}$ is the composite relationship. The length of the meta-path is the number $l$ of heterogeneous relationships it contains.

Example 2. As shown in Fig. 1c, two authors can be connected by different meta-paths, such as author-paper-author-paper-author (APAPA) and author-paper-conference-paper-author (APCPA). Although only the middle node differs, the semantic information expressed by these two meta-paths is completely different: the former indicates that author 1 and author 4 have each co-authored papers with author 2, while the latter indicates that the two authors have authored papers belonging to the same conference. The richness of the semantic information expressed by meta-paths of different lengths also differs. For example, of the meta-paths author-paper-author (APA) and author-paper-conference-paper-author (APCPA), the latter is richer in semantic information.

Definition 3.3. Heterogeneous Adjacency Matrix [33]. Because a heterogeneous graph has different node types and edge types, its heterogeneity is lost if the traditional adjacency matrix is used to represent it, so a heterogeneous adjacency matrix is constructed instead. Consider a heterogeneous graph with a set of node types ${\cal A}=\{{{\cal A}_{1},{\cal A}_{2},{\cal A}_{3}}\}$ and a set of edge types ${\cal R}=\{{\cal R}_{1},{\cal R}_{2},{\cal R}_{3},{\cal R}_{4}\}$, where ${\cal R}_{1}:{\cal A}_{1}{\cal A}_{2}$, ${\cal R}_{2}:{\cal A}_{2}{\cal A}_{1}$, ${\cal R}_{3}:{\cal A}_{1}{\cal A}_{3}$ and ${\cal R}_{4}:{\cal A}_{3}{\cal A}_{1}$. $A_{{\cal R}_{1}}$ denotes the node-edge relationship matrix between ${\cal A}_{1}$ and ${\cal A}_{2}$. If $v_{i}\in{\cal A}_{1}$ and $v_{j}\in{\cal A}_{2}$ are connected by an edge, the corresponding element is 1; otherwise it is 0, and all entries of $A_{{\cal R}_{1}}$ that correspond to the other three node-edge relationship types are 0. The heterogeneous graph can then be expressed as ${\cal G}=\{A_{{\cal R}_{1}},A_{{\cal R}_{2}},A_{{\cal R}_{3}},A_{{\cal R}_{4}}\}$, $A_{{\cal R}_{i}}\in R^{N\times N}$, where $N$ is the number of nodes in the heterogeneous graph.

When a composite relationship ${\cal R}$ or a sequence of edge types $({\cal R}_{1},{\cal R}_{2},\ldots,{\cal R}_{l})$ is given and no ambiguity arises, the meta-path can also be represented directly by its object types, i.e., ${\cal P}=({v_{1}v_{2}\ldots v_{l+1}})$. The adjacency matrix $A_{{\cal P}}$ of the meta-path ${\cal P}$ is obtained by the following formula.

$\displaystyle A_{\cal P}=A_{{\cal R}_{1}}A_{{\cal R}_{2}}\ldots A_{{\cal R}_{l}}$ (1)
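As a minimal sketch of Eq. (1), the NumPy snippet below builds a toy DBLP-style graph and multiplies the relation matrices along a meta-path. The node indexing, the variable names and the helper `metapath_adjacency` are illustrative assumptions, not part of the original method description.

```python
import numpy as np

# Toy heterogeneous graph (hypothetical): nodes 0-1 are authors, 2-3 are papers,
# node 4 is a conference. All relation matrices share one N x N index space.
N = 5
A_AP = np.zeros((N, N))          # author -> paper
A_AP[0, 2] = A_AP[1, 2] = A_AP[1, 3] = 1
A_PA = A_AP.T                    # paper -> author
A_PC = np.zeros((N, N))          # paper -> conference
A_PC[2, 4] = A_PC[3, 4] = 1
A_CP = A_PC.T                    # conference -> paper


def metapath_adjacency(relations):
    """Multiply relation adjacency matrices in order, as in Eq. (1)."""
    result = np.eye(relations[0].shape[0])
    for A_r in relations:
        result = result @ A_r
    return result


A_APA = metapath_adjacency([A_AP, A_PA])
A_APCPA = metapath_adjacency([A_AP, A_PC, A_CP, A_PA])
```

Entry $(i,j)$ of the product counts the path instances of the meta-path between nodes $i$ and $j$; here authors 0 and 1 share one APA instance and two APCPA instances.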

Definition 3.4. KNN_Graph [34]. The $K$ -nearest Neighbor Graph (KNN_Graph) is a weighted directed graph. ${\cal G}_{k}=({{\cal V}_{k},{\cal E}_{k}})$ , where ${\cal V}_{k}$ represents the set of nodes ${\cal V}_{k}=\{{v_{k1},v_{k2},\ldots,v_{kn}}\}$ and ${\cal E}_{k}$ denotes the set of edges ${\cal E}_{k}=\{{e_{k1},e_{k2},\ldots,e_{km}}\}$ . Unlike the ordinary graph, in the $K$ -nearest neighbor graph, ${\cal E}_{k}$ stores the connected edges of $K$ most similar nodes of $v_{ki}\in{\cal V}_{k}$ under some similarity measure.

Definition 3.5. Community [35]. Given a set of communities ${\cal C}=\{{C_{1},C_{2},\ldots,C_{m}}\}$, each community $C_{m}$ is a division of the original graph that preserves the regional structure and clustering properties. The condition for clustering a node $v_{i}$ into a community $C_{m}$ is that the internal node degree of the community exceeds its external degree.

Definition 3.6. Graph Convolutional Network (GCN) [36]. As a specific graph-based neural network model, GCN constructs a multi-layer graph convolutional network with the following propagation rules.

$\displaystyle H^{({l+1})}=\sigma\left({\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}H^{(l)}W^{(l)}}\right)$ (2)

where $H^{(l)}$ denotes the feature representation of layer $l$, $\tilde{A}=A+I_{N}\in R^{N\times N}$ is the adjacency matrix of the heterogeneous graph with self-connections added, $\tilde{D}_{ii}=\sum\nolimits_{j}{\tilde{A}_{ij}}$ is the degree matrix, and $W^{(l)}\in R^{d\times d}$ is a trainable weight matrix. $\sigma(\cdot)$ represents an activation function, such as $\textit{ReLU}(\cdot)=\max(0,\cdot)$. Since the object of study is a directed graph, the inverse in-degree matrix $\tilde{D}^{-1}$ is used for normalization during the graph convolution operation.
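The propagation rule in Eq. (2) can be sketched in a few lines of NumPy. The function name `gcn_layer` and the `symmetric` flag are assumptions introduced for illustration. The degree is taken as the row sum of $\tilde{A}$, following the definition above, and the `symmetric=False` branch is only one possible reading of the $\tilde{D}^{-1}$ normalization mentioned for directed graphs.

```python
import numpy as np


def gcn_layer(A, H, W, symmetric=True):
    """One GCN propagation step following Eq. (2).

    symmetric=True  uses D^{-1/2} (A + I) D^{-1/2};
    symmetric=False uses D^{-1} (A + I), an assumed reading of the directed variant.
    """
    A_tilde = A + np.eye(A.shape[0])      # add self-connections
    deg = A_tilde.sum(axis=1)             # diagonal of the degree matrix (row sums)
    if symmetric:
        d_inv_sqrt = 1.0 / np.sqrt(deg)
        A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    else:
        A_hat = A_tilde / deg[:, None]
    return np.maximum(A_hat @ H @ W, 0.0)  # ReLU activation


# Tiny example: a 3-node path graph with identity features and weights,
# so the output equals the normalized adjacency matrix itself.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H1 = gcn_layer(A, np.eye(3), np.eye(3))
H1_rw = gcn_layer(A, np.eye(3), np.eye(3), symmetric=False)
```

With identity features and weights, the symmetric variant returns a symmetric matrix, while the $\tilde{D}^{-1}$ variant returns a matrix whose rows sum to one.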

4. Proposed method

The proposed method aims to mine the structural information in the feature space and generate high-quality node embeddings. Node features are used to generate a $K$-nearest neighbor graph that captures the underlying topological information in the feature space. Unlike most graph neural network-based methods, the proposed KGNN_HCD constructs heterogeneous adjacency matrices to find meaningful meta-paths for each dataset and runs a more efficient graph convolutional network to learn more robust node representations. The proposed heterogeneous graph community detection method based on the $K$-nearest neighbor graph neural network is described in detail next, and its architecture is shown in Fig. 2.

Figure 2.

The framework diagram of KGNN_HCD.

In Fig. 2, the input layer of KGNN_HCD consists of a heterogeneous adjacency matrix and a $K$-nearest neighbor graph adjacency matrix, which represent the topological information in the feature space and the structural information of the $K$-nearest neighbor graph, respectively. At the weight allocation layer, weights are assigned to the different heterogeneous node-edge relationships using a weight matrix, which reflects the importance of each relationship type for community detection. The Mp-Trans Layer generates a meta-path transition matrix from the heterogeneous adjacency matrix and the output of the weight allocation layer. The meta-path transition matrix encodes the transformation between nodes in the feature space and the meta-path. The meta-path transformation matrix is then used to map the node representations in the feature space to the meta-path space, which captures the relationships between nodes in the meta-path space and better expresses the semantic information of nodes. In the GCN layer, a graph convolutional network performs graph convolution on the new graph structure to learn more robust node representations. Finally, the node representations learned in the GCN layer are fed into the $k$-means algorithm, which divides them into different communities to detect the community structure of the heterogeneous graph.

Through the above steps, the KGNN_HCD model extracts structural information from the feature space and generates high-quality node embeddings. The model uses the heterogeneous adjacency matrices to construct a meta-path transformation matrix that captures the relationships between nodes, and performs graph convolution in the GCN to learn more robust node representations. Finally, a clustering algorithm is applied to the learned node representations to detect community structures in heterogeneous graphs.
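The final clustering step can be illustrated with a plain Lloyd's $k$-means on an embedding matrix. This is a minimal NumPy sketch under stated assumptions: the function `kmeans`, its random initialization and the toy embeddings `Z` are hypothetical, and the $k$-means variant or library actually used is not specified here.

```python
import numpy as np


def kmeans(Z, n_clusters, n_iter=100, seed=0):
    """Plain Lloyd's k-means on the rows of Z; returns one label per node."""
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every embedding to every center.
        dist = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster became empty.
        new_centers = np.array([
            Z[labels == c].mean(axis=0) if np.any(labels == c) else centers[c]
            for c in range(n_clusters)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Hypothetical 2-D node embeddings forming two well-separated groups.
Z = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels = kmeans(Z, n_clusters=2)
```

In the method described above, `Z` would be the node representations produced by the GCN layer, and the number of clusters would be set according to the dataset.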

4.1 K-nearest neighbor graph information fusion

A good community detection method should extract and fuse the most correlated information. One major obstacle is that methods taking a holistic approach find it difficult to account for the specific connections and differences between individuals in the feature space, which makes it hard to discover the inherent connections between data objects. Therefore, to obtain the structural information in the feature space, a $K$-nearest neighbor graph ${\cal G}_{k}=(A_{k},X)$, which can reveal the underlying structure of the data, is constructed from the feature matrix $X$ of the input heterogeneous graph, where $A_{k}$ denotes the adjacency matrix of the $K$-nearest neighbor graph. The specific process is shown in Fig. 3.

Figure 3.

The construction process of K-nearest neighbor graph based on feature vectors.

In Fig. 3, the node similarity is calculated based on the feature vectors of the nodes, and the top- $K$ nodes that are most similar to the target node are selected based on the similarity matrix to construct the $K$ -nearest neighbor graph adjacency matrix required for the model. The core of the KGNN_HCD method for constructing a $K$ -nearest neighbor graph is to select a similarity measure, according to which the similarity between nodes can be calculated, which in turn constitutes a similarity matrix. There are several mainstream similarity measures, and three common methods are briefly listed here, where $x_{i}$ and $x_{j}$ are the feature vectors of nodes $v_{i}$ and $v_{j}$ .

Cosine Similarity: It measures the similarity by the cosine of the angle between two vectors, which takes values in the range $[-1,1]$ .

$\displaystyle S_{ij}=\frac{x_{i}\cdot x_{j}}{|{x_{i}}||{x_{j}}|}$ (3)

Heat Kernel: Its similarity is calculated as shown in Eq. (4), where $t$ is the time parameter in the heat conduction equation.

$\displaystyle S_{ij}=e^{-\frac{\|{x_{i}-x_{j}}\|^{2}}{t}}$ (4)

Dot Product: It is mainly applied to discrete data, such as bag-of-words, where the calculated similarity is only related to the number of identical words, as shown in Eq. (5).

$\displaystyle S_{ij}=x_{j}^{T}x_{i}$ (5)
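The three measures in Eqs. (3) to (5) can be written as pairwise matrix operations. The sketch below is illustrative only: the function names and the small clipping constant that guards against all-zero feature rows are assumptions.

```python
import numpy as np


def cosine_similarity(X):
    """Eq. (3): pairwise cosine of the angle between feature vectors."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.clip(norms, 1e-12, None)   # guard against all-zero rows
    return X_unit @ X_unit.T


def heat_kernel_similarity(X, t=1.0):
    """Eq. (4): exp(-||x_i - x_j||^2 / t)."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dist, 0.0) / t)


def dot_product_similarity(X):
    """Eq. (5): plain inner product x_j^T x_i."""
    return X @ X.T


# Hypothetical bag-of-words style features for three nodes.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
S_cos = cosine_similarity(X)
S_heat = heat_kernel_similarity(X, t=1.0)
S_dot = dot_product_similarity(X)
```

For these binary features the dot product counts shared words, whereas the cosine similarity additionally normalizes by vector length.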

The cosine similarity is uniformly chosen as the similarity measure to obtain the similarity matrix. After that, the top- $K$ similar nodes of each node are selected as the nearest neighbors to form the connected edges to generate the undirected $K$ -nearest neighbor graph, and then the required $K$ -nearest neighbor graph adjacency matrix $A_{k}$ is obtained.
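A minimal sketch of this construction is given below. The helper `knn_adjacency` is hypothetical, and two details are assumptions because the text does not fix them: a node is excluded from its own neighbor list, and the graph is made undirected by keeping an edge whenever either endpoint selects the other.

```python
import numpy as np


def knn_adjacency(X, k):
    """Build an undirected K-nearest neighbor adjacency matrix from features X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_unit = X / np.clip(norms, 1e-12, None)
    S = X_unit @ X_unit.T                    # cosine similarity matrix, Eq. (3)
    np.fill_diagonal(S, -np.inf)             # assumption: no self-neighbors
    top_k = np.argsort(-S, axis=1)[:, :k]    # indices of the K most similar nodes
    A_k = np.zeros_like(S)
    rows = np.repeat(np.arange(X.shape[0]), k)
    A_k[rows, top_k.ravel()] = 1.0
    return np.maximum(A_k, A_k.T)            # assumption: symmetrize by union


# Hypothetical features: nodes 0 and 1 point one way, nodes 2 and 3 another.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A_k = knn_adjacency(X, k=1)
```

With $K=1$ the example yields exactly two undirected edges, one inside each pair of similar nodes, even though no topological edges were given as input.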

After the above operation, the method takes an individual-level view of the data: it captures the inherent relationships between data objects, maintains the validity of associations during the clustering process, and obtains the underlying structural information in the feature space.

The $K$-nearest neighbor graph also adds, for each node, its $K$ most similar neighbors, which further aggregates information, increases the similarity of nodes, and improves the possibility of node co-occurrence. The final input $\tilde{A}_{k}$ of KGNN_HCD is obtained by concatenating the adjacency matrix $A_{k}$ of the $K$-nearest neighbor graph with the adjacency matrix $\tilde{A}$ to which the identity matrix has been added.

4.2 Meta-path information transformation

The rich semantics of meta-paths is an important feature of heterogeneous information networks. Based on different meta-paths, objects have different connection relationships and different path definitions, which may affect many specific tasks. Previous meta-path-based works are highly dependent on pre-defined meta-paths, and relying on different meta-paths defined manually by domain experts directly affects the performance of the model. The difference is that KGNN_HCD can define meta-paths end-to-end and can generate meta-paths of arbitrary length depending on the characteristics of the dataset and the needs of the task.

The $K$ -nearest neighbor graph adjacency matrix expressing the topology of the feature space has been added to the heterogeneous graph. Afterwards, when aggregating the meta-path information, it should be noted that node-edge relationships under different types play different roles and will show different importance when learning the node embedding of a specific task. Therefore, a weight convolution $W_{\textit{att}}$ needs to be introduced here to constrain the node-edge relationships of different importance, and the normalization operation of the weight convolution is performed by the softmax function to realize the difference of the node-edge relationships matrix expression, thus generating a meta-path transformation matrix with heterogeneous edge information, which is calculated as follows.

$\displaystyle T=F({\tilde{A}_{k};W_{\textit{att}}})=\textit{conv}_{1\times 1}({\tilde{A}_{k};\textit{softmax}({W_{\textit{att}}})})=\sum\limits_{i=1}^{|{\cal R}|}{A_{{\cal R}_{i}}}W_{i}^{(l)}$ (6)

where $F$ denotes the convolutional layer, written as $\textit{conv}_{1\times 1}$ in the formula. $W_{\textit{att}}\in R^{1\times 1\times|{\cal R}|}$ is the parameter of the $1\times 1$ convolution, and $W_{i}^{(l)}=\textit{softmax}({W_{\textit{att}}})$ yields a convex combination of the heterogeneous adjacency matrices. To let the proposed method learn different behaviors with the same mechanism and combine them as knowledge, KGNN_HCD sets the number of output channels of the convolutional layer to $C N$ so that different types of meta-paths are fully considered; this mechanism resembles multi-head attention. In the first meta-path information fusion layer, that is, in the Mp-Trans Layer, two meta-path transition matrices $T_{1}\in R^{N\times N\times C}$ and $T_{2}\in R^{N\times N\times C}$ are calculated. Both transition matrices contain heterogeneous node-edge relationship information (e.g., if the dataset is ACM, they contain the information of the four relationships PA, AP, PS and SP). Multiplying $T_{1}$ and $T_{2}$ carries out the first meta-path information fusion and learns the meta-path information under all effective node-edge relationship combinations of the meta-path of length 3. For numerical stability, KGNN_HCD normalizes the meta-path transition matrix $A^{(l)}$ with the degree matrix of the heterogeneous graph, i.e., $A^{(1)}=\tilde{D}^{-1}T_{1}T_{2}$. Each subsequent meta-path transformation matrix is obtained by multiplying the meta-path transformation matrix of the previous layer with the currently calculated meta-path transition matrix, i.e., Eq. (7).

$\displaystyle A^{(l)}=\tilde{D}^{-1}A^{(l-1)}F(\tilde{A}_{k};W_{\textit{att}}^{(l)})$ (7)
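Eqs. (6) and (7) can be illustrated with a small single-channel sketch. It is not the authors' implementation: the function names (`soft_select`, `row_normalize`, `mp_trans`), the toy graph and the random stand-ins for the learned $W_{\textit{att}}$ are assumptions made for illustration, and the stacked array `adjs` stands in for the candidate adjacency matrices.

```python
import numpy as np

def softmax(w):
    """Numerically stable softmax over a 1-D weight vector."""
    e = np.exp(w - np.max(w))
    return e / e.sum()

def soft_select(adjs, w_att):
    """Eq. (6): convex combination of the candidate adjacency matrices."""
    alpha = softmax(w_att)                       # one weight per edge type
    return np.tensordot(alpha, adjs, axes=1)     # sum_i alpha_i * A_i -> (N, N)

def row_normalize(a):
    """Compute D^{-1} A with D the row-sum degree matrix (zero rows stay zero)."""
    deg = a.sum(axis=1, keepdims=True)
    return np.divide(a, deg, out=np.zeros_like(a), where=deg > 0)

def mp_trans(adjs, w_list):
    """Stack of Mp-Trans Layers for one channel.

    w_list[0] and w_list[1] produce T_1 and T_2 of the first layer;
    every further entry adds one layer following Eq. (7).
    """
    a = row_normalize(soft_select(adjs, w_list[0]) @ soft_select(adjs, w_list[1]))
    for w in w_list[2:]:
        a = row_normalize(a @ soft_select(adjs, w))
    return a

# Toy graph (hypothetical): papers are nodes 0 and 1, the author is node 2.
a_pa = np.zeros((3, 3))
a_pa[[0, 1], 2] = 1.0                            # edge type P-A
a_ap = a_pa.T.copy()                             # edge type A-P
adjs = np.stack([a_pa, a_ap])                    # shape (|R|, N, N)

rng = np.random.default_rng(0)
w_list = [rng.normal(size=2) for _ in range(2)]  # stand-ins for learned W_att
a1 = mp_trans(adjs, w_list)
print(a1)
```

In this toy graph only the combinations PA·AP and AP·PA are non-zero, so after normalization the two papers are linked through their shared author regardless of the particular weights.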

Next, we explain how KGNN_HCD learns meta-paths of arbitrary length. Given a meta-path ${\cal P}$ consisting of a series of composite relationships ${\cal R}_{1}{\cal R}_{2}{\cal R}_{3}\ldots{\cal R}_{l}$ , the meta-path transformation matrix corresponding to this meta-path can be calculated as follows.

$\displaystyle A_{P}=\left(\sum\limits_{{\cal R}_{1}\in{\cal R}}W_{{\cal R}_{1}}^{(1)}A_{{\cal R}_{1}}\right)\left(\sum\limits_{{\cal R}_{2}\in{\cal R}}W_{{\cal R}_{2}}^{(2)}A_{{\cal R}_{2}}\right)\ldots\left(\sum\limits_{{\cal R}_{l}\in{\cal R}}W_{{\cal R}_{l}}^{(l)}A_{{\cal R}_{l}}\right)$ (8)

where $A_{{\cal R}_{i}}$ $(i=1,\ldots,l)$ denotes the heterogeneous adjacency matrix with edge type ${\cal R}_{i}$ , and $W_{{\cal R}_{i}}^{(i)}$ can be regarded as the weight of that adjacency matrix in the $i$-th factor. These weights are what allow KGNN_HCD to distinguish meta-paths of different importance. Expanding the product, $A_{P}$ can be seen as a weighted sum over products of $l$ heterogeneous adjacency matrices, so the proposed method can learn meta-path information of arbitrary length.

These transition adjacency matrices contain the original node-edge relationship information, but the characteristics of the original edges themselves are ignored when the meta-path information is fused, so it is necessary to add the identity matrix $I$ to the heterogeneous adjacency matrices. In this way, KGNN_HCD can learn meta-paths of any length, fusing meta-path information of up to length $l+1$ when multiplying $l$ meta-path transition matrices.
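The expansion behind Eq. (8), and the effect of adding the identity matrix, can be checked numerically. The sketch below is illustrative only: the toy graph and the per-layer weights are hypothetical values, not learned ones. It multiplies two weighted sums and compares the result with the explicit sum over all edge-type sequences, where each sequence is weighted by the product of its per-layer weights.

```python
import itertools
from functools import reduce

import numpy as np

# Toy candidate set (hypothetical): edge types PA and AP plus the identity I.
a_pa = np.zeros((3, 3))
a_pa[[0, 1], 2] = 1.0
cands = {"PA": a_pa, "AP": a_pa.T.copy(), "I": np.eye(3)}

# Hypothetical per-layer weights; each layer is a convex combination.
weights = [
    {"PA": 0.5, "AP": 0.2, "I": 0.3},
    {"PA": 0.1, "AP": 0.6, "I": 0.3},
]

# Left-hand side: product of the weighted sums, as in Eq. (8).
lhs = reduce(np.matmul, [sum(layer[r] * cands[r] for r in cands) for layer in weights])

# Right-hand side: one term per sequence of edge types.
rhs = np.zeros((3, 3))
path_weight = {}
for seq in itertools.product(cands, repeat=len(weights)):
    coeff = float(np.prod([weights[i][r] for i, r in enumerate(seq)]))
    rhs += coeff * reduce(np.matmul, [cands[r] for r in seq])
    path_weight[seq] = coeff

print(np.allclose(lhs, rhs))
print(path_weight[("PA", "AP")], path_weight[("PA", "I")])
```

The sequence `("PA", "I")` keeps the single relation PA in the product, which is how the identity matrix lets shorter meta-paths survive in deeper stacks.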

4.3 Community detection

The main goal of this module is to perform community detection on the learned node representations. The final node representation should contain rich meta-path semantic information, but the variance of the data is high due to the scale-free nature of the heterogeneous graph. To address this challenge, KGNN_HCD extends the information fusion at the meta-path level to multi-head channels and increases the number of output channels $CN$ . This approach takes different types of meta-paths into account and also facilitates the learning of different node embeddings, making the training process more stable.

After $l$ meta-path transformation matrix multiplications, GCN and MLP are applied to each channel of the meta-path transformation matrix, and the final node representation takes the form

$\displaystyle Z=\mathop{||}\limits_{i=1}^{CN}\sigma(\tilde{D}_{i}^{-1}\tilde{A}_{i}^{(l)}XW)$ (9)

where $||$ denotes the concatenation operator that concatenates the node representations of all channels, $CN$ is the number of channels, and $\sigma$ is the activation function. $\tilde{A}_{i}^{(l)}=A_{i}^{(l)}+A_{k}+I$ is the $i$-th channel of the meta-path transformation matrix of layer $l$ , augmented with the self-loop identity matrix $I$ and the $K$ -nearest neighbor graph adjacency matrix $A_{k}$ ; $\tilde{D}_{i}$ is the degree matrix of $A_{i}^{(l)}$ . $X\in R^{N\times d}$ is the feature matrix of the input heterogeneous graph, $N$ is the number of nodes, $d$ is the node feature dimension, $W\in R^{d\times d}$ is the trainable weight matrix of the neural network model, dim is the output embedding dimension of the model, and $Z\in R^{N\times\textit{dim}}$ is the final node embedding.
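A minimal sketch of Eq. (9) follows. Several choices are assumptions rather than details given in the text: ReLU is used for $\sigma$, the normalization uses the row sums of $\tilde{A}_{i}^{(l)}$, the MLP that maps the concatenated output to dimension dim is omitted, and all inputs are random placeholders.

```python
import numpy as np

def gcn_channel(a_meta, a_knn, x, w):
    """One channel of Eq. (9): sigma(D^{-1} (A^{(l)} + A_k + I) X W), with sigma = ReLU."""
    a_tilde = a_meta + a_knn + np.eye(a_meta.shape[0])
    # Row sums are at least 1 for non-negative inputs because of the self-loop.
    deg = a_tilde.sum(axis=1, keepdims=True)
    return np.maximum((a_tilde / deg) @ x @ w, 0.0)

def node_embeddings(a_channels, a_knn, x, w):
    """Concatenate the per-channel outputs along the feature axis."""
    return np.concatenate([gcn_channel(a, a_knn, x, w) for a in a_channels], axis=1)

rng = np.random.default_rng(0)
n, d, cn = 5, 4, 2                                 # nodes, feature dim, channels
a_channels = rng.random((cn, n, n))                # placeholder meta-path matrices
a_knn = (rng.random((n, n)) > 0.5).astype(float)   # placeholder K-NN adjacency
x = rng.normal(size=(n, d))
w = rng.normal(size=(d, d))

z = node_embeddings(a_channels, a_knn, x, w)
print(z.shape)  # (5, 8): one d-dimensional block per channel
```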

After obtaining the node representations, the $k$ -means algorithm is used to detect communities for learned embeddings. The ground-truth community labels are matched with the community labels obtained by KGNN_HCD.
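The clustering and label-matching step can be sketched as follows. This is an illustrative stand-in, not the authors' code: it uses a plain Lloyd's $k$ -means with a deterministic farthest-point initialization (an assumption chosen to keep the example reproducible), synthetic two-dimensional embeddings, and purity as the matching criterion between detected communities and ground-truth labels.

```python
import numpy as np

def kmeans(z, k, n_iter=50):
    """Lloyd's algorithm with deterministic farthest-point initialization."""
    centers = [z[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(z - c, axis=1) for c in centers], axis=0)
        centers.append(z[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every embedding to its nearest center.
        labels = np.argmin(np.linalg.norm(z[:, None] - centers[None], axis=2), axis=1)
        new_centers = np.array([
            z[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

def purity(y_true, y_pred):
    """Match each detected community to its majority ground-truth label."""
    hits = sum(np.bincount(y_true[y_pred == c]).max() for c in np.unique(y_pred))
    return hits / len(y_true)

# Synthetic embeddings: three well-separated groups of 20 nodes each.
rng = np.random.default_rng(0)
true_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
y_true = np.repeat(np.arange(3), 20)
z = true_centers[y_true] + rng.normal(scale=0.5, size=(60, 2))

y_pred = kmeans(z, k=3)
score = purity(y_true, y_pred)
print(score)
```

Cluster indices returned by $k$ -means are arbitrary, which is why a matching step such as the majority vote in `purity` is needed before comparing with ground-truth labels.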

5. Experiments

Three real-world heterogeneous graph datasets are selected for the experiments, namely the heterogeneous citation networks ACM [16] and DBLP [37], and the movie dataset IMDB [18, 38]. Their detailed information, including the number of nodes, number of edges, edge types and meta-paths, is summarized in Table 2.

5.1 Datasets

Table 2
Statistics of the datasets

Dataset	Node type	Nodes	Edge type	Edges	Meta-path	Features	Training set	Validation set	Test set
ACM	Paper (P)	3025	P-A	9936	PAP	1902	600	300	2125
	Author (A)	5835	A-P	9936	PSP
	Subject (S)	56	P-S	3025
			S-P	3025
DBLP	Paper (P)	14328	P-A	19645	APA	334	800	400	2857
	Author (A)	4057	A-P	19645	APAPA
	Conference (C)	20	P-C	14328	APCPA
			C-P	14328
IMDB	Movie (M)	2939	M-A	4661	MAM	1256	300	300	2339
	Actor (A)	5841	A-M	4661	MDM
	Director (D)	2269	M-D	13983
			D-M	13983

ACM: The Association for Computing Machinery (ACM) dataset is a bibliographic information network that extracts papers published in KDD, SIGMOD, SIGCOMM, MobiCOMM and VLDB. The whole dataset includes 3 node types, which are 3025 Paper (P), 5835 Author (A) and 56 Subject (S). Nodes of type P carry one of three labels, i.e., Database, Wireless Communication and Data Mining, which means that nodes of type P can be classified into three categories. The ACM dataset also contains four edge types, which are 9936 P-A and A-P, and 3025 P-S and S-P. Accordingly, the dataset has four heterogeneous adjacency matrices corresponding to the node-edge relationships $A_{P-A}$ , $A_{A-P}$ , $A_{P-S}$ and $A_{S-P}$ . For the ACM dataset, the meta-paths with semantic information are PAP and PSP.

DBLP: The DataBase systems and Logic Programming (DBLP) dataset is a network that reflects the relationship between authors and papers. There are three types of nodes: 14328 Papers (P), 4057 Authors (A) and 20 Conferences (C). Authors are grouped into 4 fields: Database, Data Mining, Machine Learning and Information Retrieval. Each author is labelled with their research field according to the conferences they submitted to. The DBLP dataset contains 4 edge types, which are 19645 P-A and A-P, and 14328 P-C and C-P. The corresponding heterogeneous adjacency matrices are $A_{P-A}$ , $A_{A-P}$ , $A_{P-C}$ and $A_{C-P}$ . For the DBLP dataset, the meta-paths with semantic information are APA, APCPA and APAPA.

IMDB: The Internet Movie DataBase (IMDB) is a website about movies and their related information, recording users' preferences for different movies. This dataset contains 3 node types, which are 2939 Movie (M), 5841 Actor (A) and 2269 Director (D), where nodes of type Movie carry one of three labels, i.e., Action, Comedy and Drama. There are four edge types, i.e., 4661 M-A and A-M and 13983 M-D and D-M, corresponding to the heterogeneous adjacency matrices $A_{M-A}$ , $A_{A-M}$ , $A_{M-D}$ and $A_{D-M}$ . For the IMDB dataset, the meta-paths with semantic information are MAM and MDM.
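To make the link between edge-type adjacency matrices and meta-paths concrete, the sketch below builds the PAP relation from a toy paper-author incidence matrix. The data are hypothetical, and the matrices are written as bipartite blocks for brevity rather than as full $N\times N$ matrices over all node types.

```python
import numpy as np

# Toy incidence (hypothetical): 3 papers x 2 authors.
# Entry (p, a) is 1 when paper p is written by author a.
a_pa = np.array([
    [1, 0],
    [1, 1],
    [0, 1],
])
a_ap = a_pa.T                       # the reverse relation A-P

# Meta-path PAP: entry (i, j) counts the authors shared by papers i and j.
pap_counts = a_pa @ a_ap
pap_adj = (pap_counts > 0).astype(int)
np.fill_diagonal(pap_adj, 0)        # drop trivial self-connections

print(pap_counts)
print(pap_adj)
```

The same product with $A_{P-S}$ and $A_{S-P}$ gives PSP, and chaining more factors gives longer meta-paths such as APCPA.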

5.2 Baselines

The proposed KGNN_HCD method is compared here with the state-of-the-art heterogeneous community detection methods such as CP-GNN and GTN. To fully verify the fairness as well as the validity of the experiments, traditional network embedding algorithms are first considered; these algorithms are originally designed to study homogeneous graphs. So, during the experiments, the heterogeneity of the nodes is totally ignored and the homogeneous operation is performed over the whole heterogeneous graph. Also, for a fair comparison, graph neural network-based approaches have been introduced, most of them designed for heterogeneous network embedding, and they all use meta-paths to capture features of heterogeneous information networks.

5.2.1 Traditional network embedding methods

Infomap [21, 39] is an algorithm based on information encoding theory that encodes paths generated by random walks and uses the encoding length as an objective function for optimal community partitioning. Here, Infomap treats the heterogeneous graph as a homogeneous graph, converting the community detection problem into an information compression problem.

Node2vec [13] is a network embedding method that integrates DFS neighborhoods and BFS neighborhoods and can be seen as an extension of Deepwalk. Node2vec uses a second-order random walk, which can not only travel far to depict the macroscopic characteristics of the network, but also walk locally to retain the community information of nodes.

Metapath2vec [28] (Mp2vec) is a state-of-the-art heterogeneous information network embedding method. It uses meta-path-based random walks to construct the heterogeneous neighborhood of each node, and then uses the Skip-Gram model to complete node embedding. Based on metapath2vec, the authors also propose metapath2vec++ to model both structural and semantic associations in heterogeneous networks.

HIN2vec [29] learns the richness of information in heterogeneous information networks by studying different types of relationships between nodes and network structure. It samples random walks based on meta-paths of a given size and feeds them into a neural network model to learn more meaningful node representations based on the fact that different meta-paths have different semantic information, encoding the rich information embedded in the meta-paths and in the overall network structure.

5.2.2 Graph neural network-based methods

GCN [36] is the state-of-the-art graph convolution method for homogeneous graphs. GCN is a semi-supervised graph convolutional network model that performs convolutional operations in the graph Fourier domain and captures global complex features by aggregating information from neighbors for learning node representations. The GCN is tested here on a homogeneous graph based on meta-paths and the best results from the meta-paths are reported.

GCN_KG [34] is a special GCN. In order to make a better comparison and to explore the advantages of $K$ -nearest neighbor graph, the input of the model abandons the traditional topological graph and uses a sparse $K$ -nearest neighbor graph computed from the feature matrix as the input fed to the GCN, which is represented as GCN_KG.

GAT [40] is a semi-supervised homogeneous graph neural network model. The model performs convolutional operations in the graph spatial domain and introduces an attention mechanism. For each node, it aggregates neighbor representations by means of importance scores learned by node-level attention. As with GCN, all meta-paths are tested under the GAT model and the best performance is reported.

HAN [16] is one of the earliest attempts to address heterogeneous graphs. It converts a heterogeneous information network into multiple homogeneous graphs with given symmetric meta-paths and uses a hierarchical attention mechanism to capture node-level importance and semantic-level importance. The final representation is obtained by applying node-level attention on the corresponding meta-path neighborhood graph via GAT.

CP-GNN [6] utilizes the context path to capture higher-order relationships between nodes and builds a heterogeneous graph neural network model based on the context path. It recursively embeds higher-order relationships between nodes into node embeddings through an attention mechanism to distinguish the importance of different relationships. The embeddings of nodes are better learned by maximizing the expectation of co-occurrence of nodes connected by the context path.

GTN [30] is a method suitable for heterogeneous graphs that automatically finds valuable meta-paths, rather than relying on manual selection as HAN does. GTN considers all possible cases by calculating all possible meta-path-based graphs, on which it performs graph convolutions.

5.3 Implementation details

For the proposed KGNN_HCD method, parameters are randomly initialized and Adam [41] is used to optimize the model. The hyperparameters are then chosen separately: the learning rate is set to 0.005 and the regularization parameter is set to 0.001. The baselines are configured so that each yields its best performance. For the random walk-based models, the window size is set to 5, the walk length is set to 100, the number of walks per node is set to 40 and the number of negative samples is set to 7. For semi-supervised graph neural networks including GCN, GAT and HAN, the same training, validation and test splits are used to ensure fairness. For the KGNN_HCD method, the number of Mp-Trans Layers is set to 3 for the DBLP and IMDB datasets and to 2 for the ACM dataset. In addition, for the $K$ -nearest neighbor graph, the value of $K$ is set to 4. For a fair comparison, the embedding dimension of all the above algorithms is set to 64.

5.4 Performance comparison

The proposed KGNN_HCD method first learns node embeddings and then uses the $k$ -means algorithm on these embeddings for community detection, where $k$ is set as the number of node categories. The real labels and four measures, i.e., F1, NMI, ARI [35] and Purity [6], are used to evaluate the performance of KGNN_HCD in comparison with 11 methods such as Infomap, CP-GNN and so on, and experimental results are shown in Table 3.

Table 3
Performance comparison of different community detection methods

Dataset	Metrics	Infomap	Node2vec	Mp2vec	HIN2vec	GCN	GCN_KG	GAT	HAN	CP-GNN	GTN	KGNN_HCD
ACM	F1	0.5728	0.6995	0.7374	0.7736	0.5458	0.6613	0.6878	0.8012	0.8596	0.9268	0.9536
	NMI	0.1918	0.2667	0.3559	0.4066	0.1873	0.2301	0.2871	0.3991	0.4832	0.5047	0.5301
	ARI	0.1248	0.2483	0.2986	0.3389	0.1335	0.1568	0.1626	0.3218	0.3924	0.4219	0.4575
	Purity	0.5826	0.4389	0.4962	0.6972	0.5901	0.6254	0.6377	0.6874	0.7150	0.7509	0.7756
DBLP	F1	0.3611	0.7579	0.7113	0.3139	0.3217	0.3922	0.8559	0.9023	0.9125	0.9418	0.9553
	NMI	0.0887	0.0634	0.2556	0.0084	0.1255	0.1829	0.6381	0.6247	0.7089	0.7125	0.7384
	ARI	0.0174	0.0493	0.2794	0.0027	0.0894	0.1547	0.5343	0.6578	0.7661	0.7934	0.8081
	Purity	0.4111	0.3884	0.6185	0.3082	0.3576	0.3826	0.7776	0.8562	0.9004	0.9118	0.9276
IMDB	F1	0.3345	0.5556	0.4887	0.4273	0.3688	0.3976	0.3559	0.4867	0.6144	0.6092	0.6281
	NMI	0.0246	0.1029	0.0297	0.0052	0.0923	0.1381	0.0712	0.1329	0.1225	0.1593	0.1715
	ARI	0.0028	0.0471	0.0165	0.0019	0.0870	0.1293	0.0615	0.1353	0.1231	0.1664	0.1831
	Purity	0.4093	0.448	0.4380	0.3948	0.3912	0.4128	0.3748	0.3734	0.4949	0.5381	0.5629

From Table 3, the F1 score of the proposed KGNN_HCD method exceeds that of GTN by 2.68% on the ACM dataset, 1.35% on the DBLP dataset, and 1.89% on the IMDB dataset. On NMI and ARI, KGNN_HCD improves by 2.54% and 2.56% on the ACM dataset, 2.59% and 1.47% on the DBLP dataset, and 1.22% and 1.67% on the IMDB dataset. These results show that by fusing $K$ -nearest neighbor graph information and transforming meta-path information, the structural information of the feature space can be captured and more meaningful meta-paths can be acquired adaptively, so that the learned node representations are better suited to community detection.

All network embedding-based methods outperform the traditional Infomap algorithm, showing their great potential for community detection tasks. Among the network embedding-based methods, the overall performance of HIN2vec is better than that of Node2vec and Metapath2vec, but its advantage is clear only on the ACM dataset, and its results on the DBLP and IMDB datasets are not satisfactory. A possible reason is that the meta-paths HIN2vec detects automatically through random walks may not be suitable for community detection.

GCN, GCN_KG and LGNN have the worst results among the GNN-based baselines. The possible reason is that they were originally proposed for homogeneous graphs and do not consider the complex contextual information in heterogeneous graphs. GCN_KG outperforms GCN in all results, which further demonstrates the need to introduce $K$ -nearest neighbor graphs to obtain feature space information. The superior performance of GAT over GCN and LGNN strongly supports the importance of the attention mechanism. The attention mechanism in GAT can be regarded as a simple way to distinguish between node types and edge types in heterogeneous graphs. Due to the presence of meta-paths, HAN is able to mine complex semantic information explicitly and achieves better results. CP-GNN captures higher-order relationships well and also performs well. The graph transformer network of GTN is able to identify useful connections between nodes where edges do not exist in the original graph, and does not require domain knowledge to learn meta-paths, and its performance is better still. Among the graph neural network-based approaches, HAN, CP-GNN, GTN and KGNN_HCD all perform better than GCN and GAT, which indicates that these heterogeneous graph neural network-based community detection models account well for the heterogeneity of nodes and edges. KGNN_HCD is superior to the other models here, and a key reason is that meta-paths play a crucial role in heterogeneous graph processing. Specifically, we define meta-paths between different types of nodes, calculate the similarity between node pairs, and use them to construct adjacency matrices. The purpose is to capture the community structure in heterogeneous graphs by considering the correlation between different types of nodes. When optimizing node representation learning, we use the constructed adjacency matrix as the input of the graph neural network to better learn node representations and obtain high-quality community structures.

5.5 Parameters sensitivity experiment

The parameter sensitivity experiments are conducted on the three datasets ACM, DBLP and IMDB.

The dimension of the final embedding $Z$ . Firstly, the influence of the dimension of the final embedding $Z$ is tested. In five experiments, the embedding dimension is set to 4, 16, 64, 128 and 1024 respectively, and to control variables, $K$ in the $K$ -nearest neighbor graph is uniformly set to 4. The results are shown in Fig. 4.

Figure 4.

Four metrics of three datasets in different dimensions.

On all three datasets in Fig. 4, the performance first increases and then slowly decreases as the embedding dimension grows. The reason may be that KGNN_HCD needs a suitable dimension to encode the meta-path information: a smaller dimension may not be enough to learn sufficient meaningful information, while a larger dimension may introduce additional redundancy. Based on these experiments, we found that 64 is the most appropriate embedding dimension.

Analysis of $K$ in $K$ -nearest neighbor graphs. To test the effect of the top- $K$ neighborhoods in the $K$ -nearest neighbor graph on the proposed method, the performance of KGNN_HCD with $K$ ranging from 2 to 10 is investigated in Fig. 5.

Figure 5.

Analysis of $K$ in $K$ -nearest neighbor graphs under three datasets.

Fig. 5 shows that for the three datasets ACM, DBLP and IMDB, the accuracy of KGNN_HCD generally first increases and then decreases. This may be because a smaller value of $K$ selects fewer neighbors, so only the nodes most similar to the target node are retained. As $K$ increases, however, some nodes with low similarity to the target node are forced to keep an edge, which increases the complexity of the graph and makes it harder to identify the nodes that are truly similar to the target node. In addition, as the graph becomes denser, the features are more easily over-smoothed, which is one of the reasons the model performance degrades. From Fig. 5, we can see that KGNN_HCD works best when $K$ is set to 4.

The number of Mp-Trans Layers. To explore the effect of the Mp-Trans Layer on KGNN_HCD, the number of layers is set to 2, 3, 4, and 5 respectively, and the NMI and ARI metrics of the model are observed on the three datasets, as shown in Fig. 6.

Taking the DBLP dataset as an example, Fig. 6 shows that the performance of KGNN_HCD decreases steadily as the number of layers increases. This is because the role of the Mp-Trans Layer is to aggregate meta-paths, and for the DBLP dataset the longest effective meta-path has length 5, e.g., APCPA. When the number of Mp-Trans Layers increases further, meta-paths without semantic information are also used by KGNN_HCD, causing additional redundancy and thus affecting performance. Based on these experiments, the overall performance of the KGNN_HCD method is best when the number of Mp-Trans Layers is set to 2 for the ACM dataset and 3 for the DBLP and IMDB datasets.

Figure 6.

NMI and ARI with different numbers of Mp-Trans Layers on three datasets.

5.6 Interpretation of Mp-Trans Layer

The function of the Mp-Trans Layer is to fuse two meta-path information matrices into a meta-path transition matrix that carries importance weights, which further expresses the differences between meta-paths. This section explains in detail how the Mp-Trans Layer distinguishes the importance of each meta-path from the generated meta-path transition matrix. For convenience of illustration, the number of output channels is set to 1 here, and the convex combination of the adjacency matrices of the input heterogeneous graph is defined as $\alpha\cdot A=\sum_{{\cal R}_{i}}^{|{\cal R}|}\alpha_{{\cal R}_{i}}A_{{\cal R}_{i}}$ . Accordingly, the meta-path transition matrix $A^{(l)}$ generated by the $l$-th Mp-Trans Layer can be obtained from the meta-path transition matrix output by the previous layer and the current meta-path transition matrix, as shown below.

$\displaystyle A^{(l)}=(D^{(l-1)})^{-1}A^{(l-1)}\sum\nolimits_{{\cal R}_{i}}^{|{\cal R}|}\alpha_{{\cal R}_{i}}^{(l)}A_{{\cal R}_{i}}=(D^{(l-1)})^{-1}\left((D^{(l-2)})^{-1}A^{(l-2)}\sum\nolimits_{{\cal R}_{i}}^{|{\cal R}|}\alpha_{{\cal R}_{i}}^{(l-1)}A_{{\cal R}_{i}}\right)\sum\nolimits_{{\cal R}_{i}}^{|{\cal R}|}\alpha_{{\cal R}_{i}}^{(l)}A_{{\cal R}_{i}}=(D^{(l-1)})^{-1}\ldots(D^{(1)})^{-1}\left((\alpha^{(0)}A)(\alpha^{(1)}A)(\alpha^{(2)}A)\ldots(\alpha^{(l)}A)\right)=(D^{(l-1)})^{-1}\ldots(D^{(1)})^{-1}\left(\sum\limits_{{\cal R}_{0},{\cal R}_{1},{\cal R}_{2},\ldots,{\cal R}_{l}\in{\cal R}}\alpha_{{\cal R}_{0}}^{(0)}\alpha_{{\cal R}_{1}}^{(1)}\alpha_{{\cal R}_{2}}^{(2)}\ldots\alpha_{{\cal R}_{l}}^{(l)}A_{{\cal R}_{0}}A_{{\cal R}_{1}}A_{{\cal R}_{2}}\ldots A_{{\cal R}_{l}}\right)$ (10)

In Eq. (10), $D^{(1)}$ represents the degree matrix, $A_{{\cal R}_{i}}$ denotes the heterogeneous adjacency matrix whose node-edge relation type is ${\cal R}_{i}\in{\cal R}$ , and $\alpha_{{\cal R}_{i}}$ is the weight of the heterogeneous adjacency matrix $A_{{\cal R}_{i}}$ . As shown above, two meta-path transition matrices are generated at the first Mp-Trans Layer, with allocation coefficients $\alpha^{(0)}=\textit{softmax}(W_{\textit{att}}^{(0)})$ and $\alpha^{(1)}=\textit{softmax}(W_{\textit{att}}^{(1)})$ respectively. Because these first two meta-path transition matrices are multiplied together, meta-path information of length 3, such as P-A-P, is fused here. To fuse the information further, it is only necessary to calculate the weight $\alpha^{(2)}=\textit{softmax}(W_{\textit{att}}^{(2)})$ of the heterogeneous adjacency matrices once more and multiply the result with the meta-path transition matrix obtained before, which fuses meta-path information of length 4. Therefore, a meaningful meta-path ${\cal P}$ composed of a series of composite relations ${\cal R}_{1}{\cal R}_{2}{\cal R}_{3}\ldots{\cal R}_{l}$ can be represented by the corresponding product of heterogeneous adjacency matrices $A_{P}=A_{{\cal R}_{1}}A_{{\cal R}_{2}}\ldots A_{{\cal R}_{l}}$ , and only $l-1$ stacked Mp-Trans Layers are needed to fuse the information of ${\cal P}$ . At the same time, according to Eq. (8), the weighted sum of the weights corresponding to the heterogeneous adjacency matrix of each type, i.e., $\sum\nolimits_{{\cal R}_{0},{\cal R}_{1},{\cal R}_{2},\ldots,{\cal R}_{l}\in{\cal R}}{\alpha_{{\cal R}_{0}}^{(0)}\alpha_{{\cal R}_{1}}^{(1)}\alpha_{{\cal R}_{2}}^{(2)}\ldots\alpha_{{\cal R}_{l}}^{(l)}}$ , is the contribution of the meta-path ${\cal P}$ , that is, the importance of this meta-path.

This explains why KGNN_HCD is able to express the importance of different meta-paths. The weight $\prod\nolimits_{i=0}^{l}{\alpha_{{\cal R}_{i}}^{(i)}}$ of the meta-path $P({{\cal R}_{1}{\cal R}_{2}{\cal R}_{3}\ldots{\cal R}_{l}})$ acts as an attention score that reflects the semantic information and importance of the meta-path in a particular task. Using the ACM and IMDB datasets as examples, Figs 7 and 8 summarize the attention scores of the heterogeneous adjacency matrices under each relation type and the learned attention scores for all meta-paths of length 3.

Figures 7a and b and 8a and b show the attention scores of the adjacency matrices (edge types) from the first and second Mp-Trans Layers for the ACM and IMDB datasets. In the ACM dataset, the attention score of the heterogeneous edge relation PA is the highest in the first meta-path transition layer, and in the second layer the attention score of the heterogeneous edge relation AP is the highest. The importance of meta-path PAP is therefore higher, which is corroborated in Fig. 7c. In the IMDB dataset, the attention scores of the heterogeneous edge relations MA and MD differ little in the first layer, but in the second layer the attention score of DM is clearly higher than that of AM, so the importance of meta-path MDM is higher than that of meta-path MAM, which is confirmed in the subsequent heat map. Meanwhile, in these four graphs the attention weight of the identity matrix is relatively high, as discussed in Section 4.2, so KGNN_HCD is able to adhere to shorter meta-paths even at deeper levels.

Figure 7.

Numerical analysis in Mp-Trans Layer on ACM dataset. (a) Attention scores corresponding to $T_{1}$ . (b) Attention scores corresponding to $T_{2}$ . (c) Correlation of two attention scores.

Figure 8.

Numerical analysis in Mp-Trans Layer on IMDB dataset. (a) Attention scores corresponding to $T_{1}$ . (b) Attention scores corresponding to $T_{2}$ . (c) Correlation of two attention scores.

Figures 7c and 8c visualize the correlation between each pair of heterogeneous relationships in the first and second Mp-Trans Layers for the ACM and IMDB datasets (for a clearer presentation, the values are enlarged 100 times here). The figures show that meta-path PAP is more important than meta-path PSP. For example, in the first Mp-Trans Layer the attention score of PA is 0.1691 and that of PS is 0.1689; in the second Mp-Trans Layer the attention score of AP is 0.1721 and that of SP is 0.1707. According to the weight $\prod\nolimits_{i=0}^{l}{\alpha_{{\cal R}_{i}}^{(i)}}$ defined earlier in this section, the importance of meta-path PAP is $\alpha_{{\cal R}_{PA}}^{(1)}\cdot\alpha_{{\cal R}_{AP}}^{(2)}=0.0291$ , and similarly the importance of meta-path PSP is 0.0288. This shows that in the community detection task the meta-path PAP is more important than the meta-path PSP, which is consistent with Metapath2vec [28]. Similarly, for the IMDB dataset, the attention score of MD is 0.1656 and that of MA is 0.1645 in the first Mp-Trans Layer, and the attention score of DM is 0.1591 and that of AM is 0.1549 in the second Mp-Trans Layer. The importance of meta-path MDM is then $\alpha_{{\cal R}_{MD}}^{(1)}\cdot\alpha_{{\cal R}_{DM}}^{(2)}=0.0263$ and the importance of meta-path MAM is 0.0255, which leads to the conclusion that meta-path MDM is more important than MAM, again consistent with Metapath2vec [28].
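The arithmetic above can be reproduced directly from the attention scores quoted in the text. The snippet below is only a worked check; the dictionary layout and the helper name `importance` are illustrative choices.

```python
# Attention scores quoted in the text for the first and second Mp-Trans Layers.
layer1 = {"PA": 0.1691, "PS": 0.1689, "MD": 0.1656, "MA": 0.1645}
layer2 = {"AP": 0.1721, "SP": 0.1707, "DM": 0.1591, "AM": 0.1549}

def importance(first, second):
    """Importance of a length-3 meta-path: product of its two attention scores."""
    return layer1[first] * layer2[second]

scores = {
    "PAP": importance("PA", "AP"),
    "PSP": importance("PS", "SP"),
    "MDM": importance("MD", "DM"),
    "MAM": importance("MA", "AM"),
}
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
# PAP: 0.0291
# PSP: 0.0288
# MDM: 0.0263
# MAM: 0.0255
```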

5.7 Ablation study

5.7.1 Comparison of similarity measures

Section 4.1 provides a detailed introduction to several mainstream similarity measures. To find the best measure, we conducted experiments using the ACM dataset as an example and present the results in Table 4.

Table 4
Quantitative results of ablation experiments on ACM datasets for similarity measures

Variant	ACM
Metrics	F1	NMI	ARI	Purity
KGNN_HCD_{w/Dot Product}	0.9382	0.5196	0.4464	0.7605
KGNN_HCD_{w/Heat Kernel}	0.9564	0.5271	0.4503	0.7684
KGNN_HCD	0.9536	0.5301	0.4575	0.7756

In Table 4, KGNN_HCD_{w/Dot Product} and KGNN_HCD_{w/Heat Kernel} denote variants that use the dot product and the heat kernel, respectively, to compute similarity when constructing the $K$ -nearest neighbor graph. The results of KGNN_HCD_{w/Dot Product} are lower than those of the other two methods, mainly because it only considers the inner product of the vectors and ignores other node attribute and edge information. The results of KGNN_HCD_{w/Heat Kernel} are slightly worse than those of KGNN_HCD on NMI, ARI and Purity, so the choice between the two makes little difference to the model. However, because the heat kernel has high computational complexity, KGNN_HCD uses cosine similarity to construct the $K$ -nearest neighbor graph.
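A minimal sketch of building a $K$ -nearest neighbor graph from cosine similarity is given below. It is an illustration under assumptions, not the procedure of Section 4.1: the function names and toy features are hypothetical, the resulting graph is left directed (each node points to its $K$ most similar nodes), and whether the paper symmetrizes the graph is not stated here.

```python
import numpy as np

def cosine_similarity(x):
    """Pairwise cosine similarity between the rows of x."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, 1e-12)          # guard against zero vectors
    return unit @ unit.T

def knn_graph(x, k):
    """Adjacency matrix linking every node to its k most similar other nodes."""
    sim = cosine_similarity(x)
    np.fill_diagonal(sim, -np.inf)               # a node is not its own neighbor
    idx = np.argsort(-sim, axis=1)[:, :k]        # top-k columns per row
    adj = np.zeros_like(sim)
    rows = np.repeat(np.arange(x.shape[0]), k)
    adj[rows, idx.ravel()] = 1.0
    return adj

# Toy features (hypothetical): nodes 0 and 1 are similar, as are nodes 2 and 3.
x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
a_knn = knn_graph(x, k=1)
print(a_knn)
```

Replacing `cosine_similarity` with the raw product `x @ x.T` would give the dot-product variant, which, unlike cosine similarity, is sensitive to the norm of the feature vectors.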

5.7.2 Key modules of the model

To verify the validity of the components of KGNN_HCD, further experiments are conducted on different KGNN_HCD variants. The performance results for the three datasets are given in Table 5. KGNN_HCD_w/oKG here indicates the removal of the $K$ -nearest neighbor graph information fusion module as a way to explore the importance of this module in KGNN_HCD. KGNN_HCD_w/oI ignores the role of identity matrix; KGNN_HCD_{w/oI& KG} removes both modules of identity matrix and $K$ -nearest neighbor graph information fusion to compare with KGNN_HCD_w/oKG and KGNN_HCD_w/oI, which can better highlight the significance of these two parts.

Table 5
Quantitative results of ablation studies on ACM, DBLP, and IMDB datasets for community detection

Variant	ACM				DBLP				IMDB
Metrics	F1	NMI	ARI	Purity	F1	NMI	ARI	Purity	F1	NMI	ARI	Purity
KGNN_HCD_{w/oI& KG}	0.8913	0.4907	0.3853	0.7498	0.9021	0.6449	0.7171	0.8374	0.5233	0.1337	0.1493	0.4896
KGNN_HCD_w/oKG	0.9028	0.5080	0.3909	0.7471	0.9111	0.6806	0.7424	0.8791	0.5476	0.1382	0.1519	0.5098
KGNN_HCD_w/oI	0.9252	0.5136	0.4262	0.7589	0.9374	0.7012	0.7740	0.9036	0.5989	0.1579	0.1714	0.5421
KGNN_HCD	0.9536	0.5301	0.4575	0.7756	0.9553	0.7384	0.8018	0.9276	0.6281	0.1715	0.1831	0.5629

As can be seen from Table 5, KGNN_HCD_w/oKG always performs worse than KGNN_HCD on the three datasets, and the gap is significant, which indicates the effectiveness and necessity of $K$ -nearest neighbor graph information fusion. Secondly, to verify the effect of the identity matrix, KGNN_HCD_w/oI is trained and evaluated. KGNN_HCD_w/oI has exactly the same architecture as KGNN_HCD, but its candidate adjacency matrix $A$ does not include the identity matrix. As the table shows, KGNN_HCD_w/oI consistently performs worse than KGNN_HCD. If both modules are removed and KGNN_HCD_{w/oI& KG} is trained directly, the table shows that the result is less effective; restoring either module, as in KGNN_HCD_w/oI or KGNN_HCD_w/oKG, improves the final performance, which further reflects the effectiveness of the identity matrix and the $K$ -nearest neighbor graph information fusion module.

6. Conclusion

In this paper, a heterogeneous graph community detection method based on $K$ -nearest neighbor graph neural network, called KGNN_HCD, is proposed to address existing problems in heterogeneous graph community detection. The method can effectively use the structural information of the feature space to enhance node similarity and improve the possibility of disconnected nodes being assigned to the same community. It also avoids the use of pre-defined meta-paths and is able to learn meta-paths end-to-end. To evaluate the rationality and effectiveness of the KGNN_HCD method, 11 different models, such as CP-GNN and GTN, are compared on three real heterogeneous datasets, i.e., ACM, DBLP and IMDB. The results show that KGNN_HCD outperforms the baselines and that its weighted convolution can distinguish the importance of different meta-paths well. To provide a clear description of the Mp-Trans Layer in KGNN_HCD, an interpretability analysis of the module is conducted. The effect of the hyperparameters on the model is studied by changing the $K$ value of the $K$ -nearest neighbor graph, the embedding dimension and the number of Mp-Trans Layers. To verify the effectiveness of each module, ablation experiments are conducted on the three datasets. Finally, the higher-order relationships between nodes are captured by GCN, and the learned node representations are clustered to detect the community structure in the heterogeneous graphs. The experimental results comprehensively demonstrate the rationality and effectiveness of the proposed KGNN_HCD method.

KGNN_HCD currently considers only symmetric meta-paths for community detection. In the future, we will extend it from symmetric to asymmetric meta-paths in order to capture richer relationships. In addition, we will consider heterogeneous temporal networks, studying community changes, dynamic network pattern recognition and the robustness of community detection along both the spatial and temporal dimensions, so as to capture network communities and their evolution quickly.

Acknowledgments

This work was supported in part by the Chongqing Federation of Social Sciences Key Project (2023NDZD09) and the Postgraduate Innovation Fund of Chongqing University of Technology (gzlcx20223202).

