Attribute graph clustering via transformer and graph attention autoencoder

Abstract

Graph clustering is a crucial technique for partitioning graph data. Recent research has concentrated on integrating topology and attribute information from attribute graphs to generate node embeddings, which are subsequently clustered using classical algorithms. However, these methods have some limitations, such as insufficient information inheritance in shallow networks or inadequate quality of reconstructed nodes, leading to suboptimal clustering performance. To tackle these challenges, we introduce two normalization techniques within the graph attention autoencoder framework, coupled with an MSE loss, to facilitate node embedding learning. Furthermore, we integrate Transformers into the self-optimization module to refine node embeddings and clustering outcomes. Our model can induce appropriate node embeddings for graph clustering in a shallow network. Our experimental results demonstrate that our proposed approach outperforms the state-of-the-art in graph clustering over multiple benchmark datasets. In particular, we achieved 76.3% accuracy on the Pubmed dataset, an improvement of at least 7% compared to other methods.

Keywords

Attribute graph clustering, transformer, autoencoder, node embedding

1. Introduction

Deep neural networks such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have demonstrated remarkable performance in dealing with traditional Euclidean-structured data, such as images, text, and video. However, in practical applications, the relationships between objects are intricate and complex. They are naturally represented as graphs, with each node containing a wealth of attribute information. Graph clustering is a fundamental technique for identifying dense subgraphs within a given graph. The objective of this technique is to partition graph data into distinct clusters, wherein nodes within the same cluster are structurally closely connected and exhibit as many similarities in node attributes as possible, while nodes in different clusters are distant from each other and exhibit as little similarity in attribute features as possible. Graph clustering has been extensively applied in various domains, including personalized recommendations, disease analysis, link prediction, community detection, face clustering, target detection, information retrieval, image segmentation, and contrastive learning. Therefore, the development and implementation of graph clustering algorithms have gained considerable attention among researchers.

Traditional clustering algorithms such as subspace clustering networks [1], fuzzy number data clustering [2], evidential clustering [3], density-based clustering [4], and extended clustering [5] are unsuitable for graph structures. To overcome this limitation, Graph Neural Networks (GNNs) have been developed as an effective solution.

There has been a proliferation of graph clustering algorithms based on GNNs over the years, covering a wide range of techniques such as sparse autoencoder-based graph clustering algorithms [6], random walk-based graph clustering algorithms [7], and non-negative matrix decomposition-based graph clustering algorithms [8]. These algorithms prioritize the graph topology, which is essential for identifying dense subgraphs within a given graph. However, in practical applications, relationships between objects are intricate and complex, with each graph node containing a wealth of attribute information. The presence of attribute information necessitates considering both graph topology and node attributes, leading to a significant challenge in the implementation of attribute graph clustering.

In response to this challenge, Wang et al. [9] proposed the deep attention embedded graph clustering approach DAEGC, which is a goal-directed learning method that enables graph embedding and self-training processes to be learned together within the unified framework. Weng et al. [10] use lightweight attention mechanisms to capture the relationships between nodes and their neighbors. More recently, Peng et al. [11] integrated a deep autoencoder (DAE) and a graph convolutional network (GCN) [12] to extract node attribute information and topology information simultaneously. Also, they proposed a deep attention-guided graph clustering method with dual self-supervision to improve the clustering performance [13]. These methods demonstrate the potential of combining topology and attribute information to improve the accuracy and efficiency of graph clustering.

Although the aforementioned methods successfully integrate both structure and attribute information, their effectiveness is still limited by certain factors. As the utilization of these two types of information may not be comprehensive enough, those methods show suboptimal clustering results, which become more apparent with large graphs. Therefore, there is a need to explore more advanced and efficient methods that can address these limitations and improve the overall performance of attribute graph clustering.

To overcome the limitations of existing attribute graph clustering algorithms, we propose a novel approach that combines the Transformer and the graph attention autoencoder. Building on DAEGC, our algorithm employs the graph attention autoencoder to generate robust node embeddings by encoding both the graph topology and the node attributes into a latent space. The resulting node embeddings are then refined by a Transformer, and the clustering is optimized via a clustering loss and a reconstruction loss. Compared to DAEGC, this paper makes the following improvements.

(1) We introduce Mean Squared Error (MSE) loss into the model to improve the quality of the reconstructed graphs, and incorporate L2-Norm normalization and layer normalization to ensure smoother attention coefficients and stable feature distributions across the data.
(2) We employ the Transformer to refine the node embeddings obtained from the encoder to improve clustering performance.
(3) Comparative experiments demonstrate that our proposed method is highly competitive for attribute graph clustering.

2. Related work

2.1 Attribute graph clustering

The significance of research in attribute graph clustering is highlighted by its wide range of real-world applications. Yang et al. [14] proposed the CESNA (Communities from Edge Structure and Node Attributes) algorithm, which integrates the network structure with node attributes to detect overlapping communities in attribute graphs. Kuang et al. [15] developed a graph clustering algorithm based on symmetric non-negative matrix decomposition, but their approach only considers graph topology. To overcome this, Huang et al. [16] and Li et al. [17] applied non-negative matrix decomposition to merge topological structure and node attribute information.

In recent years, the advancement of GNNs has significantly improved the performance of attribute graph clustering. Many of these methods employ deep neural networks to learn node embeddings and then use traditional clustering techniques to perform clustering. Kipf et al. [18] proposed a graph autoencoder (GAE) and a variational graph autoencoder (VGAE), which use a GCN encoder and an inner-product decoder to learn latent representations of undirected graphs. To enhance clustering performance, Zhu et al. [19] introduced supervised information into VGAE and proposed a collaborative decision-reinforced self-supervision method for learning graph node representations. Wang et al. [20] proposed the multi-scale graph attention subspace clustering network (MSGA) to improve clustering performance. Zhang et al. [21] proposed a method for attribute graph clustering using multi-task embedding learning. This method constructs two prediction tasks based on structure and features, ensuring that node embeddings retain both types of information after backpropagation. Li et al. [22] improved the information extraction capability of non-negative matrix decomposition and learned the graph structure and node attribute information jointly in a contrastive learning framework. However, these methods do not always take into account the smoothness of the attention coefficients and the stability of the node feature distribution, which may lead to poor-quality node features.

2.2 Transformer

The Transformer model [23] was introduced by the Google Brain team in 2017 primarily for natural language processing (NLP). It aimed to address the issue of recurrent network models, such as the long short-term memory [24] and the gated recurrent unit [25], which could not be efficiently trained in parallel. Building on the Transformer’s fundamental architecture, researchers have developed numerous related models that demonstrate the Transformer’s potent language learning capabilities.

The application of Transformer-based models to graph-related tasks is becoming increasingly important. For instance, Dwivedi et al. [26] developed a graph Transformer model that handles arbitrary graphs using Laplacian eigenvectors as node positional encodings. Similarly, Kreuzer et al. [27] proposed the Spectral Attention Network (SAN), which employs learnable positional encodings to improve the performance of downstream tasks. Furthermore, Mialon et al. [28] introduced the GraphiT model, which leverages the Transformer structure to better incorporate both graph structure and node position information, thereby outperforming classical graph neural networks. These studies have demonstrated the effectiveness of Transformer-based models in graph-related applications beyond natural language processing.

In summary, the Transformer architecture is widely used in natural language processing and computer vision tasks [29, 30]. However, its application in graph clustering is still limited, making it a challenging task to explore the correlation between Transformer-based models and graph clustering.

3. Method

3.1 Formal definition

In this paper, we use $G = (V, E, X)$ to denote the graph data, $V = \{v_{1}, v_{2}, \dots, v_{n}\}$ to denote the set of nodes, and $E$ to denote the set of edges in the graph data. Specifically, $E = \{e_{i j}\}$, where $e_{i j}$ denotes an edge between node $v_{i}$ and node $v_{j}$. The topology of the graph is represented by the adjacency matrix $A = \{a_{i j}\} \in R^{n \times n}$, in which $a_{i j} = 1$ means there exists an edge between node $v_{i}$ and node $v_{j}$; otherwise, $a_{i j} = 0$. $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ are the attribute values, where $x_{i} \in R^{d}$ is the vector of real-valued attributes associated with the vertex $v_{i}$. The main terms used in this paper are shown in Table 1.

Table 1
Summary of key symbols

Notations	Descriptions
$X$	$\in R^{n \times d}$	The attribute values
$A$	$\in R^{n \times n}$	The adjacency matrix
$Z$	$\in R^{n \times d}$	The feature matrix
$\bar{A}$	$\in R^{n \times n}$	The reconstructed structure matrix of the graph
$B$	$\in R^{n \times n}$	The transfer matrix
$M$	$\in R^{n \times n}$	Structural information of order $t$
$C$	$\in R^{n \times n}$	Attention coefficient matrix
$P$	$\in R^{n \times k}$	The target distribution
$Q$	$\in R^{n \times k}$	Student’s t-distribution
$n$		The number of samples
$d$		The dimension of $X$
$k$		The number of clusters
$l$		The number of network layers
$\cdot \| \cdot$		The concatenation operation

3.2 General structure

Figure 1 illustrates the proposed attribute graph clustering algorithm that combines Transformer and graph attention autoencoder. The algorithm is divided into two main modules: graph embedding and self-optimization. The first module consists of two stages. In the first stage, the graph adjacency matrix and node attribute information are fed into a graph attention autoencoder to obtain transformed features, which are then normalized to obtain the node embedding. In the second stage, a simple inner-product decoder is used to reconstruct the graph structure. The reconstruction loss is calculated using MSE and binary cross-entropy. The second module takes the node embedding as input, uses a Transformer to obtain new node embeddings, and then optimizes the node embedding and clustering results with a Kullback-Leibler (KL) divergence clustering loss together with the reconstruction loss.

3.2.1 Graph attentional autoencoder

Figure 1. The workflow of the proposed attribute graph clustering algorithm, which combines a Transformer and a graph attention autoencoder.

We modified the encoder used in DAEGC [9]. Specifically, to achieve smoother attention coefficients and increase the generalization ability of the model, the attention coefficients are processed by L2-Norm normalization. The effectiveness of this approach is verified through ablation experiments in Section 4.5. The attention coefficient matrix is denoted as $C = \{c_{i j}\}$, and the following equation holds:

\begin{aligned} L2 (c_{i j}) = \frac{c_{i j}}{\max (\| c_{j} \|_{2}, ϵ)}, \end{aligned}

(1)

where $ϵ$ is introduced to prevent division by 0, $c_{j}$ refers to the $j$th column of the attention coefficient matrix $C$, and $c_{i j}$ represents the attention coefficient of node $i$ and node $j$, which can be computed using the following equation:

\begin{aligned} c_{i j} = M_{i j} {\vec{a}}^{T} [W x_{i} \| W x_{j}], \end{aligned}

(2)

where $M = (B + B^{2} + \dots + B^{t}) / t$, with $B$ representing the transfer matrix, and $B_{i j} = 1 / d_{i}$ when $e_{i j} \in E$. Otherwise, $B_{i j} = 0$. $d_{i}$ denotes the degree of node $i$, and $t$ denotes the order of the neighborhood. $M_{i j}$ denotes the topological correlation of node $j$ to node $i$. When $M_{i j} > 0$, $j$ is a neighbor of $i$. $\vec{a}$ is a vector of weights for the self-attentive module. To facilitate comparisons between different nodes, the attention coefficients obtained from the L2-Norm normalization calculation are further passed through a softmax function. The formula is as follows:

\begin{aligned} α_{i j} = softmax (L 2 (c_{i j})) = \frac{\exp (L 2 (c_{i j}))}{\sum_{r \in N_{i}} \exp (L 2 (c_{i r}))}, \end{aligned}

(3)

$N_{i}$ denotes the neighboring nodes of $i$ in $M$ . After adding the LeakyReLU activation function (denoted here as $δ (\cdot)$ ), the attention coefficient $α_{i j}$ can be expressed as follows:

\begin{aligned} α_{i j} = \frac{\exp (δ (L2 (M_{i j} ({\vec{a}}^{T} [W x_{i} \| W x_{j}]))))}{\sum_{r \in N_{i}} \exp (δ (L2 (M_{i r} ({\vec{a}}^{T} [W x_{i} \| W x_{r}]))))} . \end{aligned}

(4)
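The computation in Eqs. (1)-(4) can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, `h[i]` stands for the already projected feature $W x_{i}$, the LeakyReLU slope of 0.2 is an assumed value that the text does not specify, and the default `eps` follows the value of $ϵ$ chosen in Section 4.5.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def proximity_matrix(adj, t):
    """M = (B + B^2 + ... + B^t) / t with B_ij = 1/d_i for every edge (i, j)."""
    n = len(adj)
    degree = [sum(row) for row in adj]
    B = [[adj[i][j] / degree[i] if degree[i] else 0.0 for j in range(n)]
         for i in range(n)]
    power, total = B, [row[:] for row in B]
    for _ in range(t - 1):
        power = matmul(power, B)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return [[value / t for value in row] for row in total]

def attention_coefficients(M, h, a, eps=1e-12, slope=0.2):
    """Eqs. (1)-(4); h[i] plays the role of W x_i, a is the attention vector."""
    n, d = len(h), len(h[0])
    # a^T [h_i || h_j] splits into a left part on h_i and a right part on h_j.
    left = [sum(a[k] * h[i][k] for k in range(d)) for i in range(n)]
    right = [sum(a[d + k] * h[j][k] for k in range(d)) for j in range(n)]
    c = [[M[i][j] * (left[i] + right[j]) for j in range(n)] for i in range(n)]
    # Eq. (1): divide every entry by the L2 norm of its column (at least eps).
    col_norm = [max(math.sqrt(sum(c[i][j] ** 2 for i in range(n))), eps)
                for j in range(n)]
    c = [[c[i][j] / col_norm[j] for j in range(n)] for i in range(n)]
    alpha = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = [j for j in range(n) if M[i][j] > 0]
        if not neighbours:
            continue
        # LeakyReLU (slope is an assumption), then softmax over the neighbours.
        scores = [c[i][j] if c[i][j] > 0 else slope * c[i][j] for j in neighbours]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        for j, w in zip(neighbours, weights):
            alpha[i][j] = w / norm
    return alpha
```

For a three-node path graph with $t = 2$, every row of $M$ equals $(0.25, 0.5, 0.25)$, so each node attends to all three nodes and each row of the resulting coefficient matrix sums to 1.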

We leverage the node neighbor information and node attributes to learn the embedding representation of each node. Unlike most existing methods that treat node neighbors equally, the proposed method assigns different weights to node neighbors using the attention coefficients. Additionally, layer normalization [31] is applied in the final layer of the network. This helps to stabilize the distribution of each node's features, mitigate overfitting, and provide a smoother training process. The formula for layer normalization is expressed as follows:

\begin{aligned} LayerNorm (u_{i}^{l + 1}) = γ \frac{u_{i}^{l + 1} - μ_{i}^{l + 1}}{\sqrt{(σ_{i}^{l + 1})^{2} + ε}} + β . \end{aligned}

(5)

In Eq. (5), $u_{i}^{l + 1}$ represents the output representation of node $i$ in layer $l + 1$, $γ$ and $β$ are learnable parameters, and $ε = 1 \times 10^{- 5}$. $μ_{i}^{l + 1}$ and $σ_{i}^{l + 1}$ denote the mean and standard deviation of the features of the $i$th node in layer $l + 1$, respectively:

\begin{aligned} μ_{i}^{l + 1} = \frac{1}{d} \sum_{k^{'} = 1}^{d} u_{i k^{'}}^{l + 1}, \end{aligned}

(6)

\begin{aligned} σ_{i}^{l + 1} = \sqrt{\frac{1}{d} \sum_{k^{'} = 1}^{d} (u_{i k^{'}}^{l + 1} - μ_{i}^{l + 1})^{2}} . \end{aligned}

(7)
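As a small worked illustration of layer normalization for a single node, the sketch below normalizes one feature vector with the statistics of Eqs. (6) and (7). It follows the usual convention of deep-learning libraries, which add $ε$ to the variance before taking the square root; the function name is hypothetical, and $γ$ and $β$ default to 1 and 0 here only for illustration, whereas in the model they are learned.

```python
import math

def layer_norm(u, gamma=None, beta=None, eps=1e-5):
    """Normalize one node's feature vector u over its d features."""
    d = len(u)
    if gamma is None:
        gamma = [1.0] * d   # illustrative default; learnable in the model
    if beta is None:
        beta = [0.0] * d    # illustrative default; learnable in the model
    mean = sum(u) / d                                 # Eq. (6)
    variance = sum((x - mean) ** 2 for x in u) / d    # square of Eq. (7)
    scale = math.sqrt(variance + eps)
    return [g * (x - mean) / scale + b for x, g, b in zip(u, gamma, beta)]
```

With the default parameters, the output has zero mean and a variance marginally below one (because of $ε$), whatever the scale of the input features.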

Thus the final node embedding can be obtained by the following equation:

\begin{aligned} z_{i}^{l + 1} = δ (LayerNorm (\sum_{j \in N_{i}} α_{i j} W z_{j}^{l})), \end{aligned}

(8)

where $z_{i}^{l + 1}$ denotes the output representation of node $i$ at layer $l + 1$, and $δ (\cdot)$ denotes the LeakyReLU function. In this experiment, we use two graph attention layers as follows:

\begin{aligned} z_{i}^{(1)} = δ (\sum_{j \in N_{i}} α_{i j} W^{(0)} x_{j}), \end{aligned}

(9)

\begin{aligned} z_{i}^{(2)} = δ (LayerNorm (\sum_{j \in N_{i}} α_{i j} W^{(1)} z_{j}^{(1)})) . \end{aligned}

(10)

The node embedding obtained by the graph encoder above can be used for subsequent decoding work.

3.2.2 Reconstruction loss

After obtaining the node embeddings, decoding is performed to predict the relationships between nodes. Various types of decoders can be employed for this purpose. In this study, a straightforward inner product decoder is used. The formula for the inner product decoder is as follows:

\begin{aligned} {\bar{A}}_{i j} = sigmoid (z_{i}^{T} z_{j}) . \end{aligned}

(11)

In order to better reconstruct the graph, we add the MSE loss. Thus, our reconstruction loss is composed of both binary cross-entropy loss and MSE loss. The two loss functions are formulated as follows:

\begin{aligned} L_{1} = - \frac{1}{N} \sum_{i = 1, j = 1}^{N} \left[ A_{i j} \cdot \log ({\bar{A}}_{i j}) + (1 - A_{i j}) \cdot \log (1 - {\bar{A}}_{i j}) \right], \end{aligned}

(12)

\begin{aligned} L_{2} = \frac{1}{N} \sum_{i = 1, j = 1}^{N} {(A_{i j} - {\bar{A}}_{i j})}^{2} . \end{aligned}

(13)

To improve the training results, we employ two different loss functions with different weight coefficients during training. The overall reconstruction loss is thus defined as follows:

\begin{aligned} L_{first} = L_{1} + 0.05 * L_{2} . \end{aligned}

(14)
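A minimal sketch of the decoder of Eq. (11) and the combined loss of Eqs. (12)-(14) is given below. The function names are hypothetical. Two details are assumptions rather than statements from the text: both terms are averaged over all entries of the adjacency matrix, and the predicted probabilities are clamped slightly away from 0 and 1 so that the logarithm stays finite.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def inner_product_decoder(Z):
    """Eq. (11): reconstructed adjacency from pairwise inner products."""
    n = len(Z)
    return [[sigmoid(sum(a * b for a, b in zip(Z[i], Z[j]))) for j in range(n)]
            for i in range(n)]

def reconstruction_loss(A, A_bar, mse_weight=0.05, tiny=1e-12):
    """Eq. (14): binary cross-entropy plus weighted MSE, averaged over entries."""
    n = len(A)
    bce, mse = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            p = min(max(A_bar[i][j], tiny), 1.0 - tiny)  # clamp (assumption)
            bce -= A[i][j] * math.log(p) + (1 - A[i][j]) * math.log(1.0 - p)
            mse += (A[i][j] - A_bar[i][j]) ** 2
    count = n * n
    return bce / count + mse_weight * mse / count
```

Because the MSE term is bounded by 1 per entry while the cross-entropy term is unbounded, the small weight of 0.05 keeps the cross-entropy dominant and lets the MSE act as a mild additional penalty on confident mistakes.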

3.2.3 Combining transformer with self-optimizing embedding

As the graph clustering task is unsupervised, it is not feasible to directly assess the quality of the model’s output features during the training process. To address this issue, a self-optimizing embedding algorithm is introduced. Here, we minimize the following objectives:

\begin{aligned} L_{K L} = KL (P \| Q) = \sum_{i} \sum_{u} p_{i u} \log \frac{p_{i u}}{q_{i u}} . \end{aligned}

(15)

Graph Transformers allow nodes to interact with all other nodes in the graph, making it easier to access global information and mitigating the problem of over-smoothing caused by sparse graphs [32]. Therefore, we use the Transformer to refine the node embedding $z_{i}$ obtained from the encoder before calculating $p_{i u}$ and $q_{i u}$. The attention at the core of the Transformer is computed as follows:

\begin{aligned} Att (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V, \end{aligned}

(16)

where $d_{k}$ denotes the dimensionality of $K$. The Transformer model applies the multi-head self-attention mechanism, which enables the model to jointly attend to information from different representation subspaces at different positions. Thus, several $Q$, $K$, and $V$ values are generated by the $h$ parallel attention “heads”, whose outputs are concatenated and aggregated to form the representation:

\begin{aligned} {head}_{i} = Att (Q W_{i}^{Q}, K W_{i}^{K}, V W_{i}^{V}), \end{aligned}

(17)

\begin{aligned} MultiAtt (Q, K, V) = Concat ({head}_{1}, \dots, {head}_{h}) W, \end{aligned}

(18)

where $W_{i}^{Q}$, $W_{i}^{K}$, $W_{i}^{V}$, and $W$ are the projection matrices. In the following, the Transformer is denoted by $Trans (\cdot)$, and $z_{i}$ represents the node embedding obtained from previous training, so the refined features are calculated as follows:

\begin{aligned} T_{i} = Trans (z_{i}) . \end{aligned}

(19)
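The attention of Eqs. (16)-(18) can be sketched in plain Python as follows. This is only the attention core under stated assumptions: the function names are hypothetical, the projection matrices are passed in explicitly instead of being learned, and the residual connections, feed-forward sublayers, and normalization of a full Transformer layer are left out.

```python
import math

def matmul(P, Q):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def transpose(P):
    return [list(column) for column in zip(*P)]

def softmax(row):
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [v / total for v in exps]

def attention(Q, K, V):
    """Eq. (16): softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(Z, heads, W_out):
    """Eqs. (17)-(18) as self-attention on node embeddings Z.

    heads is a list of (W_q, W_k, W_v) triples, one per attention head.
    """
    outputs = [attention(matmul(Z, W_q), matmul(Z, W_k), matmul(Z, W_v))
               for W_q, W_k, W_v in heads]
    # Concatenate the head outputs node by node, then apply the projection W.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(Z))]
    return matmul(concat, W_out)
```

Since every node's query is compared with the keys of all nodes, each refined embedding is a weighted mixture over the whole graph rather than over the local neighborhood only.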

Then $q_{i u}$ is calculated as follows:

\begin{aligned} q_{i u} = \frac{(1 + \| T_{i} - μ_{u} \|^{2})^{- 1}}{\sum_{k} (1 + \| T_{i} - μ_{k} \|^{2})^{- 1}} . \end{aligned}

(20)

The $μ_{u}$ indicates the cluster center, which is obtained by single K-means clustering, and $q_{i u}$ represents the similarity between $T_{i}$ and $μ_{u}$ . At the same time, the target distribution $p_{i u}$ needs to be calculated, which is expressed as follows:

\begin{aligned} p_{i u} = \frac{q_{i u}^{2} / \sum_{i} q_{i u}}{\sum_{k} (q_{i k}^{2} / \sum_{i} q_{i k})} . \end{aligned}

(21)

To increase the credibility of the target distribution $P$ , $Q$ is raised to the second power. The clustering loss forces the current distribution $Q$ to be close to the target distribution $P$ , and the training of $Q$ is supervised by employing soft distributions with high probabilities as soft labels.
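The soft assignment of Eq. (20), the target distribution of Eq. (21), and the clustering loss of Eq. (15) can be written out directly. The sketch below uses hypothetical function names and takes the refined features and the cluster centers as plain lists; in the model the centers are initialized by K-means and the features come from the Transformer.

```python
import math

def soft_assignment(T, centers):
    """Eq. (20): Student's t similarity between each feature and each center."""
    Q = []
    for t in T:
        weights = [1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(t, mu)))
                   for mu in centers]
        total = sum(weights)
        Q.append([w / total for w in weights])
    return Q

def target_distribution(Q):
    """Eq. (21): square q, divide by the soft cluster size, renormalize per node."""
    k = len(Q[0])
    cluster_size = [sum(q[u] for q in Q) for u in range(k)]
    P = []
    for q in Q:
        weights = [q[u] ** 2 / cluster_size[u] for u in range(k)]
        total = sum(weights)
        P.append([w / total for w in weights])
    return P

def kl_divergence(P, Q):
    """Eq. (15): KL(P || Q) summed over nodes and clusters."""
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0)

def predict_labels(Q):
    """Assign each node to the cluster with the largest q."""
    return [max(range(len(q)), key=q.__getitem__) for q in Q]
```

For two well-separated groups of points, the target distribution assigns each point a higher probability for its nearest center than $Q$ does, which is the sharpening effect that the squaring is meant to produce.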

In this model, the loss function is used as follows:

\begin{aligned} L_{second} = L_{K L} + γ L_{first}, \end{aligned}

(22)

where $L_{K L}$ denotes the clustering loss, and $γ$ is a non-zero parameter used to adjust the ratio of clustering loss to reconstruction loss. We can get the clustering result directly from the last optimized $Q$ and use the result as the label prediction for node $v_{i}$:

\begin{aligned} s_{i} = \arg \max_{u} q_{i u} . \end{aligned}

(23)

3.3 Training strategies

To achieve improved clustering performance, this study employs a two-phase model training approach. The first training phase focuses on generating node embeddings $z_{i}$, while the second training phase utilizes a Transformer to calculate both the target distribution $P$ and the current distribution $Q$. During the first training phase, the model is optimized using the Adam optimizer and the $L_{first}$ loss function. In the second training phase, the model is again optimized with the Adam optimizer, but using the $L_{second}$ loss function. In the experiments, the target distribution $P$ is updated every five iterations to stabilize the training process.
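The two-phase schedule can be outlined as a short skeleton. Everything in it is hypothetical scaffolding: the four callables stand in for one optimizer step on $L_{first}$, the computation of $Q$, the computation of $P$, and one optimizer step on $L_{second}$, and the choice to refresh $P$ at iteration 0 and then at every fifth iteration is one reading of the update rule.

```python
def train_two_phase(pretrain_step, compute_q, compute_p, cluster_step,
                    pretrain_epochs, cluster_epochs, update_interval=5):
    """Skeleton of the two-phase schedule; all callables are placeholders."""
    # Phase 1: learn node embeddings with the reconstruction loss L_first.
    for _ in range(pretrain_epochs):
        pretrain_step()
    # Phase 2: refine embeddings and clusters with L_second.
    P = None
    for epoch in range(cluster_epochs):
        Q = compute_q()
        if epoch % update_interval == 0:
            P = compute_p(Q)   # the target stays fixed between refreshes
        cluster_step(P, Q)
    return compute_q()
```

Holding $P$ fixed for several iterations gives the optimizer a stable target to move $Q$ towards, instead of a target that shifts after every step.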

4. Experiments and results

To thoroughly evaluate the performance of our proposed algorithm, this section conducts extensive experimental investigations.

4.1 Datasets

The performance of the proposed algorithm was evaluated on three widely accepted benchmark datasets for attribute graph clustering. The datasets are briefly summarized in Table 2.

Table 2
Benchmark graph datasets

Datasets	Nodes	Edges	Features	Clusters
Cora	2708	5429	1433	7
Citeseer	3327	4732	3703	6
Pubmed	19717	44338	500	3

The Cora dataset consists of 2708 machine learning papers, each represented by a word vector of dimension 1433. In other words, the graph in this dataset consists of 2708 nodes, each with a feature dimension of 1433. There are 5429 edges between the nodes. The dataset is categorized into seven research areas: case based, genetic algorithms, neural networks, probabilistic methods, reinforcement learning, rule learning and theory.

The Citeseer dataset is another citation network graph dataset, comprising 3327 research papers, each with a feature dimension of 3703. The nodes of this graph are connected by 4732 edges. The papers in this dataset are categorized into six research areas: Agents, Artificial Intelligence (AI), Databases (DB), Information Retrieval (IR), Machine Learning (ML), and Human-Computer Interaction (HCI).

The Pubmed dataset comprises 19717 research papers related to diabetes, which can be categorized into three groups: Diabetes Mellitus Experimental, Diabetes Mellitus Type 1, and Diabetes Mellitus Type 2. There are 44,338 links between papers, with each paper represented by a 500-dimensional word vector.

4.2 Baseline algorithm

This paper compares the proposed algorithm with several classical algorithms, which can be categorized into three groups: those focusing exclusively on the structural information of the graph, such as Deepwalk [7]; those concentrating exclusively on the attribute information of graph nodes, such as K-means [33]; and those considering both structure and attribute information of the graph nodes, including TADW [34], AANE [35], ASNE [36], ARGA & ARVGA [37], DAEGC [9], Similarity-based Attention Embedding Approach for Attributed Graph Clustering (SAEAGC) [10], CDNMF [22].

K-means: K-means is a renowned clustering algorithm mainly used for unsupervised learning.

Deepwalk: Deepwalk is a classic graph embedding algorithm that learns the vector representation of nodes based on the structural relationships between graph nodes. In other words, the algorithm only takes into account the structural information of the graph.

TADW: TADW integrates textual information with structural information, taking into account both structural and node attribute information of the graph.

AANE: AANE uses a distributed approach to achieve fusion learning of graph structure and node attribute information.

ASNE: ASNE combines structural proximity with attribute proximity to represent the final embedding of a node.

ARGA & ARVGA: These are two adversarial methods, namely the adversarial regularized graph autoencoder (ARGA) and the adversarial regularized variational graph autoencoder (ARVGA), which learn graph embeddings by jointly optimizing the graph encoder and the adversarial regularization in a unified framework.

DAEGC: DAEGC employs an attention mechanism to learn node embeddings and incorporates graph structure information to achieve goal-oriented attribute graph clustering. The method jointly optimizes embedding learning and graph clustering to achieve its objectives.

SAEAGC: SAEAGC applies a lightweight attention scheme to capture the relationship between a node and its neighbors, improving the aggregation of the graph structure and node attribute information.

CDNMF: CDNMF uses a contrastive learning framework to integrate graph topology and node attributes.

4.3 Evaluation indicators and parameter settings

4.3.1 Evaluation metrics

This paper employs four commonly used evaluation metrics in clustering tasks [38]: Accuracy (ACC), F-score, Normalized Mutual Information (NMI), and Adjusted Rand Index (ARI). The higher the value of each metric, the better the clustering performance.
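Two of these metrics are easy to state in code. The sketch below is illustrative only and uses hypothetical function names. ACC requires matching cluster indices to class labels before counting hits; here the best one-to-one matching is found by trying all permutations, which is practical only for a small number of clusters (the Hungarian algorithm is the usual choice otherwise). For NMI, normalizing by the arithmetic mean of the two entropies is an assumption, since the text does not say which normalization is used.

```python
import math
from collections import Counter
from itertools import permutations

def clustering_accuracy(y_true, y_pred):
    """ACC under the best one-to-one mapping from clusters to classes."""
    classes = sorted(set(y_true))
    clusters = sorted(set(y_pred))
    counts = Counter(zip(y_pred, y_true))
    # Pad with dummy classes when there are more clusters than classes.
    if len(classes) < len(clusters):
        classes = classes + [None] * (len(clusters) - len(classes))
    best = 0
    for assigned in permutations(classes, len(clusters)):
        hits = sum(counts[(c, label)] for c, label in zip(clusters, assigned))
        best = max(best, hits)
    return best / len(y_true)

def normalized_mutual_information(y_true, y_pred):
    """NMI with arithmetic-mean normalization (an assumption)."""
    n = len(y_true)
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    joint = Counter(zip(y_true, y_pred))
    mutual = sum(c / n * math.log(c * n / (true_counts[a] * pred_counts[b]))
                 for (a, b), c in joint.items())
    h_true = -sum(c / n * math.log(c / n) for c in true_counts.values())
    h_pred = -sum(c / n * math.log(c / n) for c in pred_counts.values())
    if h_true + h_pred == 0:
        return 1.0
    return mutual / ((h_true + h_pred) / 2)
```

Both functions are invariant to how the clusters are numbered, which matters because an unsupervised model has no way to know the label names.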

4.3.2 Parameter settings

In this study, the model is specified to have 256 hidden layer neurons and 16 embedding layer neurons, as suggested in previous research [9]. The Layer Normalization module is applied with a “normalized_shape” parameter of 16, and the L2-Norm normalization is applied along dimension 0. The number of layers in the Transformer is set according to the dataset. The Adam optimizer is utilized for optimization with a learning rate of 0.0001.

The comparison methods employed in this study are initialized according to the description of the original paper. For Deepwalk, the dimension of node embeddings is set to 128, and the number of random walks generated for each node is set to 80. In TADW, the regularization term $λ$ is set to 0.2. In AANE, the embedding dimension is set to 100. In ASNE, the softsign activation function is utilized for the hidden layers, and the batch size and learning rate parameters are obtained via grid search. In ARGA and ARVGA, the encoder consists of 32 hidden layer neurons and 16 embedding layer neurons. In both DAEGC and SAEAGC, there are 256 hidden layer neurons and 16 embedding layer neurons. Finally, in CDNMF, the number of hidden layers is 3. The algorithm is executed the same number of times as described in the original article.

4.4 Analysis of results

Table 3 displays the experimental results for the Citeseer, Cora, and Pubmed datasets. In the table’s input section, “F” means that only node attribute information is taken into account, “S” means to focus exclusively on graph structure information, and “B” indicates the incorporation of both node attribute and graph structure information. The most outstanding experimental outcomes are highlighted in bold.

Table 3
Experimental results of each method on different datasets

DataSet	Method	Input	ACC	F-score	NMI	ARI
Cora	K-means	F	0.381	0.379	0.245	0.135
	Deepwalk	S	0.554	0.454	0.396	0.307
	TADW	B	0.683	0.677	0.477	0.437
	AANE	B	0.377	0.333	0.208	0.127
	ASNE	B	0.329	0.314	0.111	0.078
	ARGA	B	0.664	0.651	0.477	0.434
	ARVGA	B	0.697	0.671	0.495	0.471
	DAEGC	B	0.710	0.641	0.514	0.470
	SAEAGC	B	0.707	0.678	0.510	0.486
	CDNMF	B	0.609	0.543	0.401	0.379
	Ours	B	0.726	0.711	0.529	0.496
Citeseer	K-means	F	0.471	0.428	0.236	0.191
	Deepwalk	S	0.370	0.291	0.121	0.129
	TADW	B	0.539	0.479	0.285	0.248
	AANE	B	0.566	0.527	0.295	0.279
	ASNE	B	0.377	0.369	0.103	0.087
	ARGA	B	0.615	0.585	0.349	0.353
	ARVGA	B	0.607	0.572	0.316	0.322
	DAEGC	B	0.684	0.641	0.425	0.442
	SAEAGC	B	0.689	0.645	0.425	0.444
	CDNMF	B	0.471	0.417	0.255	0.228
	Ours	B	0.696	0.653	0.430	0.453
Pubmed	K-means	F	0.595	0.581	0.311	0.280
	Deepwalk	S	0.698	0.683	0.290	0.317
	TADW	B	0.627	0.620	0.258	0.211
	AANE	B	0.511	0.519	0.163	0.147
	ASNE	B	0.635	0.634	0.281	0.255
	ARGA	B	0.657	0.659	0.255	0.250
	ARVGA	B	0.493	0.378	0.075	0.042
	DAEGC	B	0.669	0.663	0.285	0.275
	SAEAGC	B	0.665	0.665	0.267	0.264
	CDNMF	B	0.653	0.487	0.221	0.261
	Ours	B	0.763	0.755	0.370	0.421

The experimental results indicate that the proposed method outperforms all compared methods on every metric and every dataset. This suggests that both structure and attribute information are crucial for node embedding, and that considering both can enhance the quality of the node embedding representation. Fig. 2 shows the same comparison more intuitively: the proposed method achieves the best result for each metric on each dataset, indicating that it is effective in learning node embeddings and improving clustering performance. The gain is largest on the Pubmed dataset (ACC of 0.763 versus 0.698 for the best baseline), suggesting that the method is well suited to large datasets. Furthermore, this paper analyzes the impact of L2-Norm normalization and finds that its use results in an ACC of 0.717, while its non-use results in an ACC of 0.669. This demonstrates the effectiveness of L2-Norm normalization.

Figure 2. A comparative analysis of the performance of each method on various datasets.

4.5 Parameter analysis

(1) Analysis of the parameter $ϵ$: To evaluate its impact, we select $ϵ$ from the set $S = \{0, 10^{-15}, 10^{-14}, 10^{-13}, 10^{-12}, 10^{-11}, 10^{-10}, 10^{-9}, 10^{-8}\}$. Figure 3 shows the experimental results on the Cora and Citeseer datasets in terms of four metrics: ACC, NMI, ARI, and F-score. Because $ϵ$ is only a threshold that keeps the denominator from becoming too small, a small value is adopted. As can be seen from the figure, the model is more effective when $ϵ = 10^{-12}$ than with the other values, so $10^{-12}$ is used as the value of $ϵ$.

(2) Parametric analysis of the loss function: We investigate the effect of hyperparameters in the reconstruction loss and clustering loss on the clustering effect. We performed a corresponding comparison experiment using the Citeseer dataset as an example. Figure 4 shows the resulting clustering effect when different values are chosen for these two parameters:

– Since the MSE loss is a summation of the squares of the differences between the predicted and actual values, it is more sensitive to outliers. To mitigate this, we give a smaller parameter to the MSE loss. We can see that the performance is the worst when the weight of $L_{2}$ is 0, which also proves the validity of the MSE loss, and when the weight of $L_{2}$ is 0.05, the effect is optimal, so we choose to set the weight of $L_{2}$ to 0.05.
– In this paper, we combine the reconstruction loss with the clustering loss to optimize the clustering results. As can be seen from Fig. 4, the clustering effect is the best when $γ =$ 0.1, so we choose to use 0.01 as the weight of reconstruction loss.
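The outlier sensitivity that motivates the small MSE weight can be seen in a toy calculation. The sketch below is illustrative only: the vectors are made-up values, the mean absolute error is included purely as a comparison point and is not part of the proposed model, and the MSE is averaged rather than summed. Only the weight 0.05 is taken from the analysis above.

```python
def mse(pred, target):
    """Mean of squared differences between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def mae(pred, target):
    """Mean of absolute differences (comparison point only)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


# Hypothetical reconstruction targets and predictions.
target = [0.0, 0.0, 0.0, 0.0]
clean = [0.1, -0.1, 0.1, -0.1]
with_outlier = [0.1, -0.1, 0.1, 3.0]  # one badly reconstructed entry

mse_growth = mse(with_outlier, target) / mse(clean, target)
mae_growth = mae(with_outlier, target) / mae(clean, target)

mse_weight = 0.05  # weight of L_2 reported as optimal above
weighted_mse = mse_weight * mse(with_outlier, target)

print(f"MSE grows by a factor of {mse_growth:.1f}")
print(f"MAE grows by a factor of {mae_growth:.1f}")
print(f"weighted MSE term: {weighted_mse:.4f}")
```

A single outlier inflates the squared-error term far more than the absolute-error term, so scaling the MSE term by a small weight limits how much such entries can dominate the combined objective.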

4.6 Model complexity analysis

We also analyzed the complexity of the models. Tables 4 and 5 report the model parameter size and the running time on each dataset as indicators of model complexity. Because a Transformer is added in the second stage, our algorithm requires more parameters and a longer runtime, but it obtains better clustering results. For example, the accuracy on the Pubmed dataset is improved by 10% without significantly increasing the number of model parameters and running time, so the approach in this paper achieves a good balance between parameter size, running time, and performance.

Table 4
Model parameter sizes across multiple datasets

DataSet	DAEGC (one stage)	DAEGC (two stage)	Ours (one stage)	Ours (two stage)
Cora	1453KB	1455KB	1455KB	1731KB
Citeseer	3723KB	3725KB	3724KB	3725KB
Pubmed	520KB	520KB	513KB	797KB

Figure 3.

The effect of the value of $ϵ$ on different data sets.

Table 5

Model run time (s/100 epoch)

DataSet	One stage (s/100 epoch)	Two stage (s/100 epoch)	Total (s/100 epoch)
Cora (DAEGC)	40.07	6.37	46.44
Citeseer (DAEGC)	37.43	7.88	45.31
Pubmed (DAEGC)	414.10	291.85	705.95
Cora (Ours)	40.40	13.64	54.04
Citeseer (Ours)	44.32	17.10	61.42
Pubmed (Ours)	543.21	835.47	1378.68
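To make the cost comparison concrete, the relative overhead can be computed directly from the entries of Tables 4 and 5. The snippet below is a small arithmetic sketch: the numbers are copied from the tables, while the dictionary layout, the helper name `overhead`, and the choice to compare the two-stage model sizes are assumptions made for illustration.

```python
# Values copied from Tables 4 and 5; the ratios are derived here for illustration.
size_kb = {  # dataset: (DAEGC two stage, Ours two stage)
    "Cora": (1455, 1731),
    "Citeseer": (3725, 3725),
    "Pubmed": (520, 797),
}
runtime_s = {  # dataset: ((DAEGC stage 1, stage 2), (Ours stage 1, stage 2)), s/100 epoch
    "Cora": ((40.07, 6.37), (40.40, 13.64)),
    "Citeseer": ((37.43, 7.88), (44.32, 17.10)),
    "Pubmed": ((414.10, 291.85), (543.21, 835.47)),
}


def overhead(dataset):
    """Return (size ratio, total runtime ratio) of Ours relative to DAEGC."""
    daegc_kb, ours_kb = size_kb[dataset]
    daegc_t, ours_t = runtime_s[dataset]
    return ours_kb / daegc_kb, sum(ours_t) / sum(daegc_t)


for name in size_kb:
    size_ratio, time_ratio = overhead(name)
    print(f"{name}: size x{size_ratio:.2f}, runtime x{time_ratio:.2f}")
```

The per-stage times sum to the totals listed in Table 5, so the runtime ratio can equivalently be read from the Total column.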

Figure 4.

The effect of the hyperparameter $γ$ and the weight of $L_{2}$ on the Citeseer dataset.

Figure 5.

A comparative visualization of the clustering effects from four distinct methods.

4.7 Visualization analysis

In this paper, we use the t-distributed Stochastic Neighbor Embedding (t-SNE) algorithm [39] to visualize, in a two-dimensional space, the clustering results of different methods on the Citeseer dataset, as depicted in Fig. 5. Each color denotes a distinct category. The figure shows that the clustering results of Deepwalk, which only considers the graph structure, are scattered and disordered. The results of ARVGA, which considers both graph structure and attribute information, are roughly partitioned into several categories, but the boundaries between different categories are still unclear. The results of DAEGC have clearer boundaries, but the categories are not separated clearly enough, and there is some overlap between them. In contrast, the proposed clustering method achieves relatively well-distinguished boundaries between categories.

5. Summary and prospect

In this paper, we propose several improvements to the DAEGC method. We optimized graph embedding and clustering by incorporating a normalization step into the node embedding learning process and jointly leveraging MSE loss, binary cross-entropy loss, and clustering loss. Furthermore, we strengthened the clustering effect by incorporating the Transformer architecture to deepen the learning of node embeddings while making full use of the topology and attribute information of the attribute graph. In addition, we optimized the attention mechanism to strengthen the generalization ability of the model. In the future, we will focus on improving the quality of node embeddings by optimizing the position encoding in the Transformer architecture.

Acknowledgments

This work is partially supported by the Natural Science Foundation of Fujian Province of China (Nos. 2021J011187 and 2021J011182). The authors extend their gratitude to the reviewers for their constructive comments and suggestions, which have significantly enhanced the quality of the presentation. Wei Weng and Fengxia Hou collaboratively conceived the model framework. Wei Weng took responsibility for revising the manuscript, while Fengxia Hou was responsible for devising the algorithm, conducting the experimental investigations, and preparing the initial draft of the paper. Shengchao Gong, Fen Chen, and Dongsheng Lin jointly contributed to discussions on the model design.

References

1. Peng, Z.H., Liu, Jia, Y.H., Hou, J.H., Adaptive attribute and structure subspace clustering network, IEEE Transactions on Image Processing 31 (2022), 3430–3439.

2. Liu, Credal-based fuzzy number data clustering, Granular Computing 8 (2023), 1907–1924.

3. Liu, Huang, H.J., Letchmunan, Adaptive weighted multi-view evidential clustering, in: International Conference on Artificial Neural Networks, 2023, pp. 265–277.

4. Chen, Zhou, Shen, T.Y., A domain density peak clustering algorithm based on natural neighbor, Intelligent Data Analysis 27(2) (2023), 443–462.

5. Xie, H.B., Shi, Y.F., Extended clustering algorithm based on cluster shape boundary, Intelligent Data Analysis 26(3) (2022), 567–582.

6. Tian, Gao, Cui, Chen, E.H., Liu, T.Y., Learning deep representations for graph clustering, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2014, pp. 1293–1299.

7. Perozzi, Al-Rfou, Skiena, Deepwalk: Online learning of social representations, in: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014, pp. 701–710.

8. Wang, Cui, Wang, Pei, Zhu, W.W., Yang, S.Q., Community preserving network embedding, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2017, pp. 203–209.

9. Wang, Pan, S.R., R.Q., Long, G.D., Jiang, Zhang, C.Q., Attributed graph clustering: a deep attentional embedding approach, in: Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019, pp. 3670–3676.

10. Weng, Liao, J.C., Guo, Chen, Wei, B.W., Similarity-based attention embedding approach for attributed graph clustering, Journal of Network Intelligence 7 (2022), 848–861.

11. Peng, Z.H., Liu, Jia, Y.H., Hou, J.H., Graph Augmentation Clustering Network, 2022, arXiv preprint arXiv:2211.10627.

12. Kipf, T.N., Welling, Semi-supervised classification with graph convolutional networks, 2016, arXiv preprint arXiv:1609.02907.

13. Peng, Z.H., Liu, Jia, Y.H., Hou, J.H., Deep attention-guided graph clustering with dual self-supervision, IEEE Transactions on Circuits and Systems for Video Technology 33(7) (2022), 3296–3307.

14. Yang, McAuley, Leskovec, Community detection in networks with node attributes, in: 13th International IEEE Conference on Data Mining, 2013, pp. 1151–1156.

15. Kuang, Ding, Park, Symmetric nonnegative matrix factorization for graph clustering, in: Proceedings of the SIAM International Conference on Data Mining, 2012, pp. 106–117.

16. Huang, Z.C., Y.M., X.T., Liu, Chen, H.J., Joint weighted nonnegative matrix factorization for mining attributed graphs, in: Advances in Knowledge Discovery and Data Mining: 21st Pacific-Asia Conference, 2017, pp. 368–380.

17. Sha, C.F., Huang, Zhang, Y.C., Community detection in attributed graphs: An embedding approach, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2018, pp. 338–345.

18. Kipf, T.N., Welling, Variational graph auto-encoders, 2016, arXiv preprint arXiv:1611.07308.

19. Zhu, P.F., J.L., Wang, Xiao, Zhao, Collaborative decision-reinforced self-supervision for attributed graph clustering, IEEE Transactions on Neural Networks and Learning Systems 34(12) (2022), 10851–10863.

20. Wang, J.H., Zhang, Z.Q., Zhou, Chen, Liu, S.S., Multi-scale graph attention subspace clustering network, Neurocomputing 459(C) (2021), 302–314.

21. Zhang, X.T., Liu, Zhang, Liu, X.Y., Attributed graph clustering with multi-task embedding learning, Neural Networks 152 (2022), 224–233.

22. Y.C., Chen, J.L., Chen, Yang, Zheng, Z.B., Contrastive Deep Nonnegative Matrix Factorization for Community Detection, 2023, arXiv preprint arXiv:2311.02357.

23. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, A.N., Kaiser, Ł., Polosukhin, Attention is all you need, in: Proceedings of the 31st Conference on Neural Information Processing Systems, 2017, pp. 5998–6008.

24. Hochreiter, Schmidhuber, Long short-term memory, Neural Computation 9(8) (1997), 1735–1780.

25. Chung, Gulcehre, Cho, Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, 2014, arXiv preprint arXiv:1412.3555.

26. Dwivedi, V.P., Bresson, A generalization of transformer networks to graphs, 2020, arXiv preprint arXiv:2012.09699.

27. Kreuzer, Beaini, Hamilton, Tossou, Rethinking graph transformers with spectral attention, in: Proceedings of the 35th Conference on Neural Information Processing Systems, 2021, pp. 21618–21629.

28. Mialon, Chen, D.X., Selosse, Mairal, Graphit: Encoding graph structure in transformers, 2021, arXiv preprint arXiv:2106.05667.

29. Shen, Xie, Zhu, J.Q., Zhu, X.B., Zeng, H.Q., Git: Graph interactive transformer for vehicle re-identification, IEEE Transactions on Image Processing 32 (2023), 1039–1051.

30. Chu, Wang, You, Ling, H.B., Liu, Transmot: Spatial-temporal graph transformer for multiple object tracking, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 4870–4880.

31. J.L., Kiros, J.R., Hinton, G.E., Layer normalization, 2016, arXiv preprint arXiv:1607.06450.

32. Ramp, Galkin, Dwivedi, V.P., Luu, A.T., Wolf, Beaini, Recipe for a general, powerful, scalable graph transformer, in: Proceedings of the 36th Conference on Neural Information Processing Systems, 2022, pp. 14501–14515.

33. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1965, pp. 281–297.

34. Yang, Liu, Z.Y., Zhao, D.L., Sun, M.S., Chang, E.Y., Network representation learning with rich text information, in: Proceedings of the 24th International Joint Conference on Artificial Intelligence, 2015, pp. 2111–2117.

35. Huang, J.D., Accelerated attributed network embedding, in: Proceedings of the 2017 SIAM International Conference on Data Mining, 2017, pp. 633–641.

36. Liao, Zhang, Chua, T.S., Attributed social network embedding, IEEE Transactions on Knowledge and Data Engineering 30(12) (2018), 2257–2270.

37. Pan, S.R., R.Q., Long, G.D., Jiang, Yao, Zhang, C.Q., Adversarially regularized graph autoencoder for graph embedding, in: Proceedings of the 27th International Joint Conference on Artificial Intelligence, 2018, pp. 2609–2615.

38. Xia, R.K., Pan, Yin, Robust multi-view spectral clustering via low-rank and sparse decomposition, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2014, pp. 2149–2155.

39. Van der Maaten, Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9(86) (2008), 2579–2605.
