Abstract
Graph-structured data is ubiquitous in real-world applications, such as social networks, citation networks, and communication networks. Graph neural network (GNN) is the key to process them. In recent years, graph attention networks (GATs) have been proposed for node classification and achieved encouraging performance. It focuses on the content associated on nodes to evaluate the attention weights, and the rich structure information in the graph is almost ignored. Therefore, we propose a multi-head attention mechanism to fully employ node content and graph structure information. The core idea is to introduce the interactions in the topological structure into the existing GATs. This method can more accurately estimate the attention weights among nodes, thereby improving the convergence of GATs. Second, the mechanism is lightweight and efficient, requires no training to learn, can accurately analyze higher-order structural information, and can be strongly interpreted through heatmaps. We name the proposed model content- and structure-based graph attention network (CSGAT). Furthermore, our proposed model achieves state-of-the-art performance on a number of datasets in node classification. The code and data are available at https://github.com/CroakerShark/CSGAT.
Introduction
Graph-structured data [37, 47] expresses the relational model of real things. Thus, it received a growing amount of attention in recent years, due to the ubiquity of graph-structured data in real-world applications, including molecular structures [2, 42], social networks [28, 48], citation networks [8, 25] and protein interactions [19, 30]. Graph-structured data plays an important role in analyzing real-world models. Graph neural networks (GNNs) [18, 60], as generic deep learning models for graph structured data, have demonstrated great superiority on network analysis. GNNs aim to learn effective vector representation for each node by taking into account the graph structure and aggregating information from its neighbors. The learned representations will be further utilized for the classification task. It is worth noting that graph-structured data is usually only partially known, that is, network analysis focuses on predicting unknown information [26, 46]. Semi-supervised node classification on graphs [9] means at training time we only know the labels of a subset of the graph nodes, and its task aims at predicting the labels for the unlabeled nodes. An efficient GNNs model is the key to successfully predict unknown labels, which means that GNNs need to make full use of the known information in the graph, including node content and topology [16, 61].
Message passing neural network(MPNN) [11, 58] is a general framework for the working mechanism of graph neural network. The power of current GNNs largely relies on their message-passing mechanisms [24], which recursively update the features of a node by aggregating the received features from its neighbor nodes. In some research, messages are passed along edges uniformly without accounting for priority of either graph structure or node attributes [55]. However, each neighbor node’s impact is distinctive to the center node in the node classification task. Thus, attention-based GNNs are proposed to further evaluate how the contribution of neighbors to the center node varies. GAT [35] is the first attempt to introduce attention module into graph node classification, in which the weighting coefficients for each neighbor are calculated based on a self-attention strategy. Despite the numerous success, GAT computed the attention weighting coefficients mainly based on node attributes. Those strategies may be suboptimal because they neglect the importance of the structure information in the graph [20]. Note that graph consists of two pieces of information, node content and network structure. For example, Fig. 1 depicts the node content and network structure of a graph. Imaging that ww need to update the features of node A, the features of node B and C should be aggregated. In Fig. 1(a), it can be found through observation that node A and C have higher similarity in node content, which means that they are more likely to be in the same category. This information is useful for graph learning. However, due to the tradeoff between efficiency and performance, the content based attention mechanism may be accompanied by the loss of important information when mapping content, which interferes with the weight allocation. In Fig. 1(b), you can intuitively see that node C is more impactive than node B, because node A and C belong to the same community, while A and B do not. This structural information cannot be extracted by content based attention. However, it has important guiding significance for neural network to focus on important nodes.

Presentation of a graph data example.
The example mentioned above demonstrates that content and structure information have important impact on achieving a more discriminative node representation. However, the existing neural network models cannot take into account both content and structure information, so there are great limitations. In this paper, we propose a novel attention-based algorithm for semi-supervised node classification. The algorithm includes a hybrid attention mechanism, in which node content and structure information are all considered to compute the attention weights. This method can make up for the defects of the existing model effectively and make full use of the information in the graph. Specifically, node features among nodes are employed for calculating the content based coefficient, while structure in adjacent matrix are considered to compute structure based coefficient. Then, two types of coefficients are combined to form attention weights. These weighs are used to linearly combine the feature vectors of neighbor nodes, then the features of the central node are updated. Note that friendly neighbor nodes are better discovered through this hybrid attention mechanism. Extensive experiments verify that combining those two types of information deduce the more effective performance for semi-supervised node classification. Importantly, the mechanism is lightweight and efficient. It can accurately calculate the weight based on the high-order structure without complicated learning tasks, which makes the proposed model further improve the performance in the task while maintaining the efficiency. Moreover, our model is more explanatory, and ablation analysis and heatmap are used to support the model in the experimental part.
Graph neural network
Convolutional neural networks (CNNs) were used to extract features from images [15, 51]. Motivated by the effectiveness of CNN, the graph convolution operation is designed to extract information from graphs, which generally falls into two categories: spectral and spatial methods [6, 43]. Bruna et al. [14] utilized Fourier base vector to perform convolution in the spectral domain. ChebNet [27] introduced a well-approximated K-order Chebyshev polynomials for convolution operations, in which the computationally expensive Laplacian eigen decomposition is removed. Spatial methods are defined by a message-passing mechanism operating on the target node and its topological neighbors. The most representative early work is the algorithm GCN [45]. It conducts the layer-wise convolution operation iteratively in order to expand a node’s neighborhood, so as to aggregate the neighbor nodes’ feature to update its representation. Different from GCN, GraphSAGE [49] generates a node’s representation by aggregating features from a fixed number of neighbors, which can deal with inductive learning for unseen nodes or entirely new graphs. Further, GNNs have been widely studied from various perspectives, including over-smoothing problem [7], the adversarial attack and defense [50], and the model explainability [13]. Simplifying graph convolutional network (SGC) [10] considers that GCN may inherit unnecessary complexity and computation. Therefore, it removes nonlinearities between GCN layers, resulting in a linear model for repeated feature transformation and propagation. Labeling real data has a huge cost, but when labels are limited, GNN performance suffers a severe drop. Contrastive graph poisson networks (CGPN) [40] improves model performance by efficiently propagating limited labels throughout the graph. K-hop graph neural networks (k-hopGNNs) [12] update the representation of nodes by aggregating information from its k-hop neighborhood and immediate neighbors, which solves the problem that GNNS cannot recognize basic graph properties. The framework recognizes connectivity and triangular degrees of freedom. Two-level graph neural network (TL-GNN) [52] proposes a novel subgraph counting method based on dynamic programming algorithm that combines subgraph level and node-level information to enrich the features captured by the graph neural network, such as high-level information and node-level information. Furthermore, it overcomes the characterization limitation of existing neural networks, which is caused by local permutation problem. Multiresolution reservoir graph neural network (MRGNN) [23] reduces the high computational cost in the process of model training by generating explicit k-hop unsupervised graph representation processing, and excessive smoothing will occur in reservoir calculation when iterating graph structure. In general, this algorithm has absolute efficiency advantage over traditional algorithms, and maintains the performance. Weighted feature fusion of convolutional neural network and graph attention network (WFCG) [53] integrates convolutional neural network and graph neural network with the goal of complementary advantages, so as to fully explore high-dimensional features and achieve excellent results in hyperspectral images. Meta-learning spatial-temporal graph attention network (MetaSTGAT) [32] captures the dynamic changes of nodes in the graph based on the graph neural network, and then performs dynamic weighting.Furthermore, the model also takes advantage of the potential correlation between spatial and temporal information through graph neural network to improve the effectiveness of the model.
Attention mechanism
It is intuitive that the impact of each neighbor node is different when aggregating its features to the central node. To deal with this problem, the algorithm GAT utilizes a self-attention mechanism to learn the different contributions from each neighbor node and aggregates information from these neighbors based on the learned weights. Graph attention mechanisms have been used in many directions, such as targeted drug research [41, 59] and molecular property prediction [4]. Li et al. [29] demonstrated that GAT is dataset-dependence and the distributions across heads and layers are uniform. They adapted GAT by proposing margin-based constraints to reduce information propagation among nodes belonging to different classes. However, in most current attention-based approaches, the learned aggregation weights are based on node features, and they neglect other useful information inheriting in graph structure. Multi-view graph attention networks (MGAT) [56] explore an attention-based model that learns node representation from each view, and design a novel regularization term constraint for this algorithm, in order to cooperatively integrate multiple types of relationships in different views. To improve the generalization performance of the model. It can capture various relationships in the multi-view network, and then update the low-dimensional representation of nodes, and finally characterize the complex types of proximity relationships in the graph. Graph conjoint attention networks (CATs) [1] have proposed combining content and structure intervention to calculate attention score, and good results are achieved. However, fatally, it requires a large number of heavy and complex learning tasks to obtain additional representations of nodes, and the experiment lacks a convincing explanation. Structural attention network (SAN) [1] uses transition matrix to extract structural information among nodes. In order to distinguish the structure of multi-order in the graph, the output features of nodes in the graph are expressed as series of multi-order features. Although structural information is beneficial to the weighting process of the model, the node content information of the model is ignored in the weighting process, so there are limitations. Sparse graph attention networks (SGATs) [57] found that Gats were prone to overfitting and could not avoid noise interference when dealing with a large number of graph structure data. Therefore, SGATs sets activation state for connections by generating edge sparse maps, thus eliminating noise connections and maintaining model classification accuracy. Path reliability-based graph attention networks (PRGATs) [54] use multi-hop adjacent context information in the calculation of attention scores, which further expands the receptive domain of the network and captures the dependence relationships in a farther range. Solved the problem that GAT could not capture multi-hop neighbor information. Secondly, the model also uses an attention layer based on path reliability to further improve the performance of the graph against attacks. However, these models hardly fully exploit the potential of attention mechanism. Therefore, we propose to use both content and structure information to learn better attention patterns for node classification. We use a lightweight attention mechanism that greatly reduces the learning stress of the model, is more interpretive, and yields better results.
The proposed algorithm
How to learn a new feature representation for each node over multiple layers is an important issue for graph node classification. In this section, we introduce the proposed incorporated attention mechanism to do it, in which two primary components are involved for attention computation: graph structure and node attributes. In other words, we believe that the weight required for the information aggregated by a central node from its neighbors depends on these two components. Figure 2 depicts the architecture of the proposed incorporated attention mechanism. The upper part of Fig. 2 illustrates the generation of content coefficient, while the lower part illustrates the generation of structural coefficient. Then, those two type of coefficients are combined to be used in the message passing step to update node features. The downstream task-node classification is done by means of a softmax classifier based on the final representation.

A simple architectural diagram of the proposed model.
For convenience, mathematical preliminaries and notations used in the paper are firstly illustrated. Given a graph G (V, E) with a set of nodes V and a set of edges E. Each node i has an original feature vector
We follow the method introduced in GAT for content coefficient generation. For a L-layer model, the content coefficient f ij is calculated as Equation (1). f ij describes the relevance between two adjacent nodes i and j, which is normalized among its neighbourhood.
Considering the adjacent matrix A that describes the structural information, we generates the structure coefficients from it. Please note that A needs to add a self-connection, because the node’s own information is the key to node feature update. For node i, a i ∈ A denotes the adjacency information between it and the other nodes. Therefore, a intuitive way to evaluate the structure coefficient between node i and node j is shown as Equation (2).
Note that the adjacency matrix A needs to be normalized, which guarantees the value of s ij is between 0 and 1. The bigger the value of s ij , the more similar their structure is. For example, if node i and node j share lots of neighbors or they allocate in the same community, s ij will be large.
In order to incorporate content coefficients and structure coefficients, we introduce two learnable parameters g f and g s to adjust the relative importance between s ij and f ij . r f and r s are normalized results of g f and g s respectively, as shown in Equations (3) and (4).
Then, the attention weights are computed as
In Equation (5), r f and r s are used to control their proportions between f ij and s ij . e ij measures the influence of node j on node i. Note that e ij is different from e ji , because node i and node j have different neighbors.
After getting attention weights from the center’s neighbors, it features can be updated according to the messaging passing mechanism.
To make the self-attention calculation more stable, we applied multi-head attention for feature transformation and concatenated the resulting features as the returned output:
We set K = 8. Also, we used the Dropout mechanism and set it to 60 percent. This means that during model training, the activation value of the node stops working with a certain probability. Because this mechanism alleviates the over-dependence of some local features, it improves the model generalization. Secondly, this measure will reduce the complexity of the model in the forward propagation process and improve the training efficiency of the model. In addition, it can effectively alleviate the occurrence of overfitting and achieve the regularization effect to a certain extent.
After updating each node through information aggregation, we can obtain the corresponding class label of each node. In general, the dimension of the node representation obtained through the neural network is the same as the number of categories of the dataset. After processing by softmax, each element in the node representation represents the probability that it belongs to the corresponding class, and the class with the highest probability is the predicted result.
In the semi-supervised node classification task, we use standard cross entropy as the objective function, and use backpropagation to optimize the parameters in the model. Our goal is to minimize the inconsistency between the ground truth label and the prediction label. The specific formula is as follows:
For the node classification, the cross entropy loss is adopted, in which the inconsistence between the ground true labels and the predicted labels is minimized. In the training state, we construct the two-layer network structure for the proposed approach. The first layer utilizes multi-head attention. The output of the second layer is used to predict, so multi-head attention is no longer sensible. The whole process is described in Algorithm 1. Note that it would be run for multiply epochs to achieve convergence.
Experiments
We have validated our model against a wide variety of state-of-the-art baselines and previous approaches, and achieved the best performance across all of them. The dataset, comparison algorithms, results and qualitative analysis are included in this section.
Datasets
We conducted experiments on three widely-used datasets, including Cora, Cite and Pubmed [21, 34]. They are standard citation network, in which nodes represent documents and edges represent citations. The original node features correspond to elements of a bag-of-words representation of a document. And, each node has a class label. All datasets are splitted to three parts: training, validation and testing. Considering the huge amount of nodes and edges in Pubmed, a connected subgraph is selected for our experiments. Table 1 showns the statistics of these datasets.
Datasets statistics
Datasets statistics
We compared out method with nine typical algorithms, including multilayer perceptron (MLP), Deepwalk, text-associated deep walk (TADW), deep graph infomax (DGI), GCN, edge enhanced graph neural network (EGNN), CAT, heat kernel graph convolutional network (HKGCN) and GAT. For convenience, we name the proposed model as content- and structure-based graph attention network (CSGAT).
Parameter settings
Our uses a two-layer model. The first layer is multi-head convolution, which consists of eight graph convolutions. Before entering the second layer, we concatenate the output of the first layer. The second layer consists of a single graph convolution, followed by a softmax activation, and its output is used for the classification task, and the element corresponding to the maximum value of the final output is the prediction result. To prevent overfitting, regularization is used. During training, we use λ = 0.0005 for Cora and Cite datasets, λ = 0.001 for Pubmed. Also, Dropout [31] p = 0.6 is applied to the input of each layer. All the above measures can effectively prevent overfitting and improve the generalization performance of the model.
The training objective is to minimize the cross-entropy on the training nodes, and the learning rate is 0.005 for all datasets except Pubmed which is 0.01. During the training process, we used an early stop strategy for the cross-entropy loss of the validation nodes, with a patience of 100 epochs. That is, if the validation loss does not decrease within 100 epochs, we stop training.
Evaluation
The experimental results of all methods are evaluated by Micro-F1, Macro-F1 and Weighted-F1. In those evaluations, precision P and recall R will be used in the calculation process.
Among them, TP is the number of instances which are positive and predicted correctly. FP is the number of instances which are negative but predicted to be positive. And, FN is the number of instances which are positive but predicted to be negative.
Equations (9) and (10) are calculated over all tested instances, which means that they do not need to distinguish between categories. Micro-F1 directly uses the precision p and recall R to calculate the F1-score:
If Equations (9) and (10) are calculated for each category, Macro-F1 can be calculated based on them. Specifically, denote P e and R e as precision and recall for class label e, respectively. Then, Macro-F1 can be calculated:
Weighted-F1 considers the importance of different categories. The ratio a e of the number of instances in each category e to the total number of instance is used as the weight to calculate it:
For each set of testing data, all aproaches are run ten times to obtain the statistically steady performance. Table 2 records the summary of results in terms of Micro-F1, Macro-F1 and Weighted-F1 over three datasets. For each measure, we highlight the best results over each dataset in bold. MLP is a network embedding method based on word vector information extraction. Deepwalk is a random walk-based network embedding method. We can see that graph weighted convolution based methods like GCN, GAT, etc. outperform MLP and Deepwalk. DGI is an unsupervised model, so its model parameters do not perform as well as semi-supervised models in the absence of effective optimization. Interestingly, EGNN exploits edge features but does not yield overall performance advantages over GCN. CAT uses structural interventions to learn new node representations in model training. However, the lack of accurate information extraction mechanism makes the performance greatly limited. HKGCN, as a simplified version of the GCN model, not only improves the model efficiency, but also achieves stronger performance in some experimental structures. It is worth noting that GCN based on topological weighting and GAT that introduces node content attention mechanism usually have good performance. However, our method effectively improves GAT, making its graph structure information and node content a dual evaluation metric for the attention mechanism. This enables our model to discover meaningful neighbor nodes better than traditional method assignments. In addition, the structure-based attention used by the model is lightweight and efficient. It does not need to be trained in the process of model training, but only needs to simply traverse the graph structure before training, so that the structural information can be effectively extracted accurately. This mechanism can effectively extract higher-order structure information. In the following chapters, we will use heatmap to visualize it to prove its effectiveness. CSGAT algorithm demonstrates the best performance being achieved across all three datasets. It is worth mentioning that because we adopt more advanced attention mechanism compared with GAT, we can optimize the model parameters more efficiently, which speeds up the convergence rate of the model, which means that the training time is effectively shortened. In conclusion, the experimental results verify that the mixture of content-based and structure-based attention is effective.
Experimental results in terms of F1-Score
Experimental results in terms of F1-Score
We perform complexity analysis on the proposed algorithm CSGAT and compare it with GAT. As shown in Table 3, we record the running time of the algorithm on the Cora dataset to represent the complexity of the algorithm. It can be seen that our algorithm takes slightly more time than GAT in a single epoch because our algorithm makes full use of the graph information. However, thanks to the complementary advantages of content and structure information, our algorithm can converge faster in graph learning, which makes CSGAT reach the optimal solution with less number of epochs. In other words, CSGAT can achieve better performance in less total time. Therefore, CSGAT achieves better results in both running time and performance.
Time taken to run the algorithm on the Cora dataset
Time taken to run the algorithm on the Cora dataset
In this section, we analyze the parameter sensitivity through the Cite data set and report the performance of ours with different number of multi-head attention. The results are shown in Fig. 3. Note that when the number of multi-head attention is set to 1, the multi-head attention will be removed. It can be seen that with the increase of the number of multi-head attention, the overall performance of ours shows a trend of improvement. Therefore, to obtain the best performance on a certain dataset, we recommend searching for this parameter value first.

Parameter analysis of different number of multi-head attention K.
To further explore the key component of our proposed hybrid attention mechanism, we design two variants of CSGAT method for ablation analysis, named CSGAT-content and CSGAT-structure. Note that CSGAT-content is the same as GAT in fact. CSGAT-content only utilizes the content-based attentions, while CSGAT-structure only utilizes the structure-based attentions. For convenience, all variants are single-layered, including no dropout and multi-head. In this way, other factors that affect the experimental results are removed. Table 4 reports the results of the comparing algorithms. It clearly indicates that the hybrid attention mechanism achieves the best performance. And, the content-based attention achieves the worst performance, which indicates the structure-based attention is superior to the content-based attention to some extend.
Experimental results for ablation analysis
Experimental results for ablation analysis
We test the ability of CSGAT-structure to discover dominant neighbors under different graph structures, and visualize it as a heatmap in Fig. 4. In Fig. 4(a) we design a large community, and through CSGAT-structure, we can find important neighbor nodes within 2-hop neighbors. In Fig. 4(b)-(d), we design several special subgraphs, all of which contain communities. CSGAT-structure can well judge whether the central node is in the community or not, and further confirm whether the neighbor nodes belong to the same community or not. We eliminate the community in Fig. 4(e), (f), as we can see from the heatmap, the weights of the neighbor nodes have changed. Nodes that originally belonged to different communities from the central node gained relatively greater weights due to the elimination of the community structure. This is very beneficial.

The heatmap of the attention mechanism based on topology structure under different graph structures. Left: A simple artificially constructed graph, the yellow node is the center node, and the blue node is the neighbor node, and the two are connected by edges; Middle: Visualization of weights; Right: Contours of weights.
As shown in Equation (5), r f and r s adjust the content-based scores and structure scores, leading to a standard weighted average. The two variable are tuned in the training. It is interesting to explore the importance of them in the proposed model. We visualized these those two weight coefficients in CSGAT model over the datasets Cora, Cite and Pubmed, as shown in Fig. 6. According to Fig. 6, the content-based scores are more important over dataset Cora, while the structure-based scores are more important over datasets Cite and Pubmed. Because Cora has the densest network structure in all datasets, it also has the most node classes, which means that there are connections between a large number of nodes of different classes. Therefore, there is a lot of noise in the structural information in Cora, which degrades the performance of the structure-based attention mechanism. Then, thanks to the adaptive mechanism of the network, the weight distribution takes the content as the main index. This does not mean that structure-based attention mechanisms have completely failed. As mentioned earlier, our attention mechanisms are mixed. Under different datasets, the attention mechanism will adopt different strategies, that is, different proportions, to achieve the optimal solution of the model. To sum up, both weighted scores play a role in constituting the attention of the model, however, their importance is different in different application domains.

Neighbors and their weights of node p408 in dataset Cite. (a), (b), (c) represent the neighbors, weights in CSGAT-Content and weights in CSGAT, respectively.

An illustration of weight coefficients between content-based scores and structure-based scores.
Some important neighbors that are useful for specific tasks should have greater attention value. The key to improving the performance of node embeddings is how to discover important neighbor nodes and assign appropriate weights to the nodes. As mentioned earlier, take node P408 in the dataset Cite as an example. We enumerate the neighbors of node P408 and their weights achieving in CSGAT-content and CSGAT, as shown in Fig. 5. Here are two neighbors for discussion. Among them, P408 belongs to ’label1’ and P637 belongs to ’label4’. The CSGAT-content assigns maximum weight to P637, which is obviously unreasonable. Therefore, the unreasonable weight distribution of CSGAT-content makes them predict the p408 node as ’label4’. which obviously reduces the performance of it. It is worth noting that CSGAT gives the most reasonable weight combination after comprehensively considering the graph structure and node content. We assigned the highest weight to the meaningful node P408 and finally made the correct prediction. Based on the above analysis, we can see that CSGAT has an excellent performance in discriminating the differences between neighbor nodes.
Conclusion
In this paper, we point out the defects of the existing neural network model in weight allocation, that is, it does not use both content and structure information. We propose a graph attention mechanism based on content and structure information to make full use of the information in the graph, so as to make up for the shortcomings of existing models. In this paper, we propose a new adaptive graph attention network structure for node classification. First, unlike traditional GAT, Our considers not only the similarity of node content, but also the similarity of structural information. Secondly, the structure-based attention mechanism is efficient and explainable, and it can accurately assign weight to neighbor nodes without additional training and learning. Furthermore, the attention mechanisms we use are mixed, which means they can extract more effective information from multiple dimensions. Finally, because the proposed model can find the key neighbor node for the central node more efficiently, the convergence rate of the model is improved, which means that our model not only has better performance, but also is more efficient than the existing model. In the experiment, we used ablation analysis and heatmaps to support and analyze the model. It can be seen from the experimental results that this improvement effectively improves the representation learning ability of graph neural network. In the future, we will further improve this method, for example, by removing the nonlinear transformation to further improve the efficiency while striving to improve its performance.
Footnotes
Acknowledgment
This work is supported by National Social Science Fund (Grant No. 21BTJ011). The authors also gratefully acknowledge the helpful comments and suggestions of the reviewers, which have improved the presentation. Yong Chen proposed the methodology, performed the experiment and wrote the manuscript; Xiao-zhu Xie and Wei Weng revised the manuscript and helped perform the analysis with constructive discussions.
https://github.com/shenweichen/GraphEmbedding
https://github.com/dedekinds/Graph-Embedding
https://github.com/PetarV-/DGI
https://github.com/tkipf/pygcn
https://github.com/vietph34/Edge GNN
https://github.com/he-tiantian/cats
https://github.com/hazdzz/HKGCN
https://github.com/Diego999/pyGAT
