Temporal link prediction in directed networks based on self-attention mechanism

Abstract

The development of graph convolutional networks (GCN) makes it possible to learn structural features from evolving complex networks. Even though a wide range of realistic networks are directed, few existing works have investigated the properties of directed and temporal networks. In this paper, we address the problem of temporal link prediction in directed networks and propose a deep learning model based on GCN and the self-attention mechanism, namely TSAM. The proposed model adopts an autoencoder architecture, which utilizes graph attentional layers to capture the structural features of neighborhood nodes, as well as a set of graph convolutional layers to capture motif features. A graph recurrent unit layer with self-attention is utilized to learn temporal variations in the snapshot sequence. We run comparative experiments on four realistic networks to validate the effectiveness of TSAM. Experimental results show that TSAM outperforms most benchmarks under two evaluation metrics.

Keywords

Directed network, temporal link prediction, graph neural network, autoencoder, self-attention mechanism

1. Introduction

Complex systems in the real world can be naturally described with complex networks, where nodes represent entities and links represent the interactions between them. Complex networks are highly dynamic objects whose topology evolves quickly over time with the appearance of new interactions [17]. Predicting the dynamics of complex networks is a meaningful and promising problem. For example, in data center networks, the prediction of network topology can guide the design of routing protocols to improve efficiency [19]. In online social networks such as Twitter and Sina Weibo, the prediction of people’s interactions can help infer potential friends and recommend them to users, increasing their loyalty to the platform in return [22]. In most of the literature, this task is referred to as temporal link prediction, the goal of which is to predict the future topology of evolving networks based on historical network information. Generally, temporal link prediction is more challenging than link prediction in static networks, because it requires capturing not only the structural features of the network but also its temporal evolution [30].

A number of methods have been proposed to solve the temporal link prediction problem in the last two decades. Most existing works compress the historical network structures into a single network and use methods designed for static networks to predict future links. Similarity-based temporal link prediction methods are the simplest and most efficient ones; they assume that nodes with higher similarity will form links in the future [21]. Such methods include Common Neighbors [16], Jaccard [10], Adamic-Adar [1], Resource Allocation [37], Katz [13], etc. [22]. Even though these methods are efficient, they only take into account the structural features of the previous moment and disregard informative temporal features such as the evolving pattern. Recently, the development of machine learning and graph neural networks (GNN) has made it possible to build learning models for non-Euclidean data, including complex networks [15]. Taking advantage of GNNs, many works solve the temporal link prediction problem with machine learning models that learn structural and temporal features at the same time. These works mainly fall into two categories. The first kind of method uses dynamic graph representation learning to learn latent representations of nodes in temporal networks and trains a downstream logistic regression classifier for link prediction. Representative works of this kind include DySAT [27], dyngraph2vec [6], DynamicTriad [36], etc. The other kind of method reconstructs the predicted network through an autoencoder architecture, such as GCNGAN [19], GC-LSTM [4], etc. Even though the latter is more complicated, with more parameters to train, it usually performs better in the task of temporal link prediction.

In reality, plenty of realistic networks are directed, and their link orientations have specific meanings. The prediction of link orientation is as important as the prediction of connectivity [28]. For example, in email networks, a directed link represents an outgoing email from one person to another. If only the existence of a link is predicted, the result is ambiguous because one cannot decide the source and target of this email. Directed networks are quite different from undirected ones in their topological properties, making temporal link prediction for directed networks a complicated and challenging problem. Since most existing works idealize their subjects as undirected networks, few works have investigated this problem thoroughly.

In this paper, we address the problem of predicting temporal links in directed networks. Our main contributions can be summarized as follows:

(1) We design a model for link prediction in directed and temporal networks, namely TSAM. The model utilizes graph attentional layers to capture structural features of each snapshot. It also leverages matrix transformation to mine additional structural features from local network structure.
(2) To capture the temporal features efficiently and overcome the long-term dependency of evolving structure, we use graph recurrent units with self-attention mechanism to learn from a sequence of snapshots and predict future snapshots.
(3) Both node-level self-attention and time-level self-attention mechanisms are adopted in our model to accelerate the learning process and improve the prediction performance.
(4) We use comparative experiments on realistic networks to validate the effectiveness of our model.

The remainder of this paper is organized as follows: Section 2 introduces several related works. Section 3 describes the problem addressed in this paper. Section 4 presents the proposed method. Section 5 describes the experimental setup and analyzes the results. Finally, Section 6 concludes the paper.

2. Related works

2.1 Temporal link prediction

Temporal networks are usually described in two ways: snapshot sequence which is a set of evolving snapshots at discrete time, and timestamped graph which is a graph with timestamped links. In this paper we adopt the first description.

Zhou et al. [36] leverage the concept of triadic closure as guidance to capture the evolving pattern across different snapshots. Goyal et al. [7] proposed DynGEM, which is based on a deep autoencoder and incrementally updates node embeddings through initialization from the previous step. These methods cannot capture the dynamics over a long period of time, which limits their accuracy. Since recurrent neural networks (RNN) and their variations are powerful tools to capture temporal dynamics of input sequences, they are adopted in many temporal link prediction methods. Chen et al. [4] proposed GC-LSTM, which uses a long short-term memory network (LSTM) with embedded graph convolutional network (GCN) to predict temporal links. Instead of learning structural and temporal features separately, Pareja et al. [26] use the RNN to evolve the GNN so that the dynamic features are captured in the evolving network parameters. Nevertheless, these recurrent methods are inefficient in capturing the most relevant historical snapshots because they treat the effect of each snapshot equally. In our proposed TSAM model, we utilize the self-attention mechanism to differentiate the influence of snapshots.

On the other hand, some works have studied the problem of link prediction in directed and temporal networks. Jawed et al. [11] addressed time frame based link prediction problem in directed citation networks and proposed time frame-based score. Bütün et al. [3] designed a measure by extending neighbor based measures as directional pattern based ones to consider the role of link directions.

2.2 Graph neural network

Graph convolution is a key technique to perform machine learning on non-Euclidean data structures such as complex networks. It generalizes the standard definition of convolution over a regular grid topology to graph structure. Graph aggregators are basic building blocks of graph convolution methods. Most existing works on graph aggregators are based on either pooling over neighborhoods or the weighted sum of neighboring features. Typically, graph convolutions can be categorized into two types: spectral domain convolution and spatial domain convolution. The classic GCN designed by Kipf et al. [15] employs spectral domain convolution by leveraging the decomposition of the Laplacian matrix. Since the Laplacian matrix should be symmetric to perform the decomposition, GCN cannot deal with directed networks whose adjacency matrices are asymmetric. Hamilton et al. [9] extended graph convolutional methods through trainable neighbor aggregation functions, and proposed a spatial domain convolution method named GraphSAGE. Since GraphSAGE does not use the decomposition of the Laplacian matrix, it can deal with directed networks. Other efforts have also been made to learn representations of directed networks, such as motif2vec [5], DIAGRAM [14], ATP [29], MotifNet [23], etc.

2.3 Self-attention mechanism

Neural attention networks use a subnetwork to compute the correlation weight of the elements in a set. Among the family of attention models, the multi-head attention model has proved to be effective for machine translation tasks. It was later adopted as a graph aggregator to solve the node classification problem [32] and the static link prediction problem [8], referred to as graph attention network (GAT). Each attention head sums the elements that are similar to the query vector in one representation subspace, which naturally provides more modeling power. Zhang et al. [35] further proposed gated attention networks (GaAN), which weight the contributions of different attention heads unequally. Leveraging the self-attention mechanism, Sankar et al. [27] proposed DySAT to learn representations of temporal networks. Our work differs from theirs in two aspects: 1) On the structural level, our TSAM model targets directed networks and leverages additional structural features through matrix transformation and feature fusion, while DySAT aims at undirected networks. 2) On the temporal level, TSAM adopts a gated recurrent unit (GRU) layer and a self-attention layer to capture temporal features, while DySAT only uses a self-attention layer, which cannot distinguish between different positions of the input.

3. Problem description

A directed and temporal network can be described with a sequence of network snapshots $G=\{G_{1},G_{2},\cdots,G_{T}\}$ , where $T$ is the number of time steps, $G_{t}=G({V}_{t},{E}_{t})$ is the snapshot at time step $t$ , with ${V}_{t}$ and ${E}_{t}$ being the set of nodes and links, respectively. For simplicity, we only investigate the evolution of links and assume all snapshots share the same set of nodes, i.e., $V$ . The adjacency matrix at time step $t$ can then be denoted as $\mathbf{A}_{t}=[a_{ij}^{t}]_{N\times N}$ , where $N=|V|$ is the total number of nodes. Since we focus on directed and unweighted networks, $a_{ij}^{t}=1$ when link $e(i,j)\in{E}_{t}$ , otherwise $a_{ij}^{t}=0$ . Notice that in directed networks, $a_{ij}^{t}$ is not necessarily equal to $a_{ji}^{t}$ . Denote $\mathbf{X}\in\mathbb{R}^{N\times F}$ as the feature matrix of nodes, where $F$ is the number of features in each node.

Given the adjacency matrices during $T$ , i.e., $\{\mathbf{A}_{t-T},\mathbf{A}_{t-T+1},\cdots,\mathbf{A}_{t}\}$ , temporal link prediction problem aims at learning a function $f(\cdot)$ which can predict the adjacency matrix $\mathbf{A}_{t+1}$ at time step $t+1$ based on the link formation history, denoted as:

$\displaystyle\hat{\mathbf{A}}_{t+1}=f(\mathbf{A}_{t-T},\mathbf{A}_{t-T+1},\cdots,\mathbf{A}_{t})$ (1)

The goal of our model is to learn function $f(\cdot)$ of a given network and use it to predict its future links. For simplicity, in the following we denote $\mathbf{A}_{t-T}^{t}=\langle\mathbf{A}_{t-T},\mathbf{A}_{t-T+1},\cdots,\mathbf{A}_{t}\rangle$ .
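The snapshot and window notation above can be illustrated with a minimal NumPy sketch. It assumes timestamped links given as `(u, v, ts)` tuples with zero-based node indices; the function names `build_snapshots` and `sliding_windows` and the toy data are hypothetical and not part of the original implementation.

```python
import numpy as np

def build_snapshots(edges, num_nodes, start, span, num_snapshots):
    """Bin timestamped directed links (u, v, ts) into binary adjacency matrices."""
    A = np.zeros((num_snapshots, num_nodes, num_nodes), dtype=np.float32)
    for u, v, ts in edges:
        idx = int((ts - start) // span)      # index of the snapshot covering ts
        if 0 <= idx < num_snapshots:
            A[idx, u, v] = 1.0               # direction matters: only u -> v is set
    return A

def sliding_windows(A, T):
    """Yield (history, target): history is A_{t-T}, ..., A_t and target is A_{t+1}."""
    for t in range(T, A.shape[0] - 1):
        yield A[t - T:t + 1], A[t + 1]

# Toy example (hypothetical data): 3 nodes, 4 snapshots of unit length
edges = [(0, 1, 0.5), (1, 2, 1.2), (2, 0, 2.7), (0, 2, 3.1), (1, 0, 3.9)]
A = build_snapshots(edges, num_nodes=3, start=0.0, span=1.0, num_snapshots=4)
windows = list(sliding_windows(A, T=2))
```

Each history tensor has shape $(T+1)\times N\times N$, matching the sequence $\mathbf{A}_{t-T}^{t}$, and the target is the single matrix $\mathbf{A}_{t+1}$.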

4. The proposed method

We propose a temporal link prediction model based on the self-attention mechanism, referred to as TSAM. The basic architecture of the TSAM model is an autoencoder, as shown in Fig. 1. First, the temporal encoder, consisting of graph convolutional layers, graph attention layers and GRU layers, draws both the structure-related and time-related features from the input snapshots, generating the node embedding of the last time step. Then the decoder utilizes fully-connected layers to decode the embedding into the predicted adjacency matrix. Here we introduce the main parts of the TSAM model separately.

Figure 1.

Overall architecture of the proposed TSAM model.

4.1 Node-level attention block

For each snapshot at time step $t$ , we take advantage of the graph attentional (GAT) layer to assign different weights to different nodes in a neighborhood. The GAT layer attends over the immediate neighbors of a node in snapshot $G_{t}$ by computing the attention weights as a function of their input feature vectors. At time step $t$ , the inputs to the GAT layer are the feature matrix $\mathbf{X}=\{\mathbf{x}_{1},\mathbf{x}_{2},\cdots,\mathbf{x}_{N}\}$ , $\mathbf{x}_{i}\in\mathbb{R}^{F}$ and the adjacency matrix $\mathbf{A}_{t}\in\mathbb{R}^{N\times N}$ , where $N$ is the number of nodes and $F$ is the number of features in each node. The output of the GAT layer is a new set of features $\mathbf{Y}^{o}=\{\mathbf{y}^{o}_{1},\mathbf{y}^{o}_{2},\cdots,\mathbf{y}^{o}_{N}\}$ , $\mathbf{y}^{o}_{i}\in\mathbb{R}^{F^{\prime}}$ , where $F^{\prime}$ is the dimension of new features in each node. First, a shared linear transformation parametrized by $\mathbf{W}^{(n)}\in\mathbb{R}^{F^{\prime}\times F}$ is applied to every node. Then a shared attentional mechanism $a:\mathbb{R}^{F^{\prime}}\times\mathbb{R}^{F^{\prime}}\to\mathbb{R}$ computes the attention coefficients of two nodes as:

$\displaystyle e_{ij}=\textit{LeakyReLU}(\mathbf{a}^{\mathrm{T}}\cdot\textit{Concat}(\mathbf{W}^{(n)}\mathbf{x}_{i},\mathbf{W}^{(n)}\mathbf{x}_{j}))$ (2)

where $\textit{LeakyReLU}(\cdot)$ is the LeakyReLU activation function with $\alpha=0.2$ , $\cdot^{\mathrm{T}}$ represents vector transposition operation, $\textit{Concat}(\cdot,\cdot)$ represents vector concatenation operation.

The calculated attention coefficient $e_{ij}$ represents the importance of node $j$ to node $i$ . Since we focus on directed networks, we perform the so-called masked attention, which only computes $e_{ij}$ for nodes $j\in{\cal{N}}^{\mathrm{in}}_{t}(i)$ , where ${\cal{N}}^{\mathrm{in}}_{t}(i)$ is the set of first-order incoming neighbors of node $i$ in the snapshot. The attention coefficients are then normalized with the softmax function, denoted as:

$\displaystyle\alpha_{ij}=\frac{\mathrm{exp}(e_{ij})}{\sum_{k\in{\cal{N}}^{\mathrm{in}}_{t}(i)}\mathrm{exp}(e_{ik})}$ (3)

We follow [32] to perform the multi-head attention mechanism, which can stabilize the learning process of self-attention. $K_{N}$ independent attention mechanisms are performed and their output features are averaged to form the final result, denoted as:

$\displaystyle\mathbf{y}^{o}_{i}=\textit{ELU}\left(\frac{1}{K_{N}}\sum_{k=1}^{K_{N}}\sum_{j\in{\cal{N}}^{\mathrm{in}}_{t}(i)}\alpha^{k}_{ij}\mathbf{W}^{(n)}_{k}\mathbf{x}_{j}\right)$ (4)

where $\alpha^{k}_{ij}$ is the attention coefficient of the $k$ -th attention head. $\textit{ELU}(\cdot)$ is the exponential linear unit (ELU) activation function.
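A minimal NumPy sketch of the node-level attention block follows. It assumes the convention $a_{ij}=1$ for a link $i\to j$, so the incoming neighbors of node $i$ are read from column $i$ of the adjacency matrix, and it assumes the aggregation runs over the neighbor features $\mathbf{x}_{j}$ with the heads averaged. Nodes without incoming links receive a zero vector here, which is a choice of this sketch; the function name `gat_layer` and all shapes are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def gat_layer(X, A, W_heads, a_heads):
    """X: (N, F); A: (N, N) with A[i, j] = 1 for a link i -> j;
    W_heads: (K, F_out, F); a_heads: (K, 2 * F_out)."""
    N = X.shape[0]
    mask = A.T > 0                           # mask[i, j]: j is an incoming neighbor of i
    out = np.zeros((N, W_heads.shape[1]))
    for W, a in zip(W_heads, a_heads):
        H = X @ W.T                          # row i holds W x_i
        f_out = H.shape[1]
        # a^T Concat(W x_i, W x_j) splits into a part for i and a part for j (Eq. 2)
        e = leaky_relu((H @ a[:f_out])[:, None] + (H @ a[f_out:])[None, :])
        # Masked, numerically stable softmax over incoming neighbors (Eq. 3)
        e_max = np.max(np.where(mask, e, -1e30), axis=1, keepdims=True)
        w = np.where(mask, np.exp(np.where(mask, e - e_max, 0.0)), 0.0)
        denom = w.sum(axis=1, keepdims=True)
        alpha = np.divide(w, denom, out=np.zeros_like(w), where=denom > 0)
        out += alpha @ H                     # weighted sum of neighbor features
    return elu(out / len(W_heads))           # average over heads (Eq. 4)

rng = np.random.default_rng(0)
A = np.zeros((4, 4))
A[0, 1] = A[2, 1] = A[1, 3] = 1.0            # links 0 -> 1, 2 -> 1, 1 -> 3
X = rng.normal(size=(4, 5))
W_heads = rng.normal(size=(2, 3, 5))
a_heads = rng.normal(size=(2, 6))
Y = gat_layer(X, A, W_heads, a_heads)
```

In this toy graph node 3 has a single incoming neighbor (node 1), so its attention weight is 1 and its output reduces to the head-averaged transform of $\mathbf{x}_{1}$.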

4.2 Feature generation

In order to leverage more structural information of directed networks, we generate a set of transformed adjacency matrices using the simplest matrix operations. In detail, at each time step $t$ , we define a set of mapping functions $\{g_{M_{1}},g_{M_{2}},\cdots,g_{M_{i}}\}$ , where $\{M_{1},M_{2},\cdots,M_{i}\}$ denote different transformed adjacency matrices, and function $g_{M_{i}}:{\mathbb{R}}^{N\times N}\to{\mathbb{R}}^{N\times N}$ maps the adjacency matrix to the transformed adjacency matrices as:

$\displaystyle{\mathbf{C}}_{t}^{M_{i}}=g_{M_{i}}({\mathbf{A}}_{t})$ (5)

There are multiple choices of mapping functions. Here we list four simplest forms as:

$\displaystyle\mathbf{C}_{t}^{M_{1}}=\mathbf{A}_{t}\cdot\mathbf{A}_{t},\quad\mathbf{C}_{t}^{M_{2}}=\mathbf{A}_{t}^{\mathrm{T}}\cdot\mathbf{A}_{t},\quad\mathbf{C}_{t}^{M_{3}}=\mathbf{A}_{t}\cdot\mathbf{A}_{t}^{\mathrm{T}},\quad\mathbf{C}_{t}^{M_{4}}=\mathbf{A}_{t}^{\mathrm{T}}\cdot\mathbf{A}_{t}^{\mathrm{T}}$ (6)

The meaning of these four transformed adjacency matrices can be interpreted with network motifs [2]. Take two simple networks shown in Fig. 2 as example. In Fig. 2a, operation $\mathbf{A}\cdot\mathbf{A}$ can be calculated as:

$\displaystyle(\mathbf{A}\cdot\mathbf{A})_{13}=\sum_{k=1}^{3}a_{1k}\cdot a_{k3}=a_{12}\cdot a_{23}=1$ (7)

In Fig. 2b, operation $\mathbf{A}\cdot\mathbf{A}$ can be calculated as:

$\displaystyle(\mathbf{A}\cdot\mathbf{A})_{15}=\sum_{k=1}^{5}a_{1k}\cdot a_{k5}=a_{12}\cdot a_{25}+a_{13}\cdot a_{35}+a_{14}\cdot a_{45}=3$ (8)

Evidently, operation $\mathbf{A}\cdot\mathbf{A}$ counts the number of motifs $\{u\to t\to v\}$ between two nodes, generating a matrix whose element $(u,v)$ is the number of motifs of this type from $u$ to $v$ . Other operations in Eq. (6) have similar meanings.
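The four transformations can be checked numerically with a short NumPy sketch. The toy network below is assumed to match the second example, i.e., node 1 links to nodes 2, 3 and 4, each of which links to node 5; the helper name `motif_matrices` is hypothetical.

```python
import numpy as np

def motif_matrices(A):
    """Return the four transformed adjacency matrices of Eq. (6)."""
    return {
        "M1": A @ A,        # paths u -> k -> v
        "M2": A.T @ A,      # common in-neighbors:  k -> u and k -> v
        "M3": A @ A.T,      # common out-neighbors: u -> k and v -> k
        "M4": A.T @ A.T,    # paths v -> k -> u
    }

# Toy network (assumed layout): 1 -> {2, 3, 4} -> 5, using 1-based labels
A = np.zeros((5, 5), dtype=int)
for u, v in [(1, 2), (1, 3), (1, 4), (2, 5), (3, 5), (4, 5)]:
    A[u - 1, v - 1] = 1

C = motif_matrices(A)
paths_1_to_5 = C["M1"][0, 4]    # number of two-step paths from node 1 to node 5
```

Note that $\mathbf{A}^{\mathrm{T}}\cdot\mathbf{A}$ and $\mathbf{A}\cdot\mathbf{A}^{\mathrm{T}}$ are symmetric by construction, while $\mathbf{A}\cdot\mathbf{A}$ and $\mathbf{A}^{\mathrm{T}}\cdot\mathbf{A}^{\mathrm{T}}$ are transposes of each other.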

Figure 2.

Example of matrix transformation in directed networks.

We use a set of graph convolutional layers (GCL) to extract structural features and generate corresponding embeddings from the transformed adjacency matrices. For each transformed adjacency matrix $\mathbf{C}^{M_{i}}_{t}$ , the output of the GCL can be denoted as:

$\displaystyle\mathbf{Y}^{M_{i}}_{t}=\textit{ELU}(\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{C}}^{M_{i}}_{t}\hat{\mathbf{D}}^{-1/2}\mathbf{X}\mathbf{W}^{(g)})$ (9)

where $\hat{\mathbf{C}}^{M_{i}}_{t}=\mathbf{C}^{M_{i}}_{t}+\mathbf{I}_{N}$ , $\mathbf{I}_{N}$ is the $N$ -dimensional identity matrix. $\hat{\mathbf{D}}_{uu}=\sum_{v=1}^{N}(\hat{\mathbf{C}}^{M_{i}}_{t})_{uv}$ is the degree matrix.
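The propagation rule of Eq. (9) can be sketched in NumPy as follows. The weight shape $F\times F^{\prime}$ for $\mathbf{W}^{(g)}$ is an assumption inferred from the surrounding dimensions, and the function name `gcl` is hypothetical.

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def gcl(C, X, W):
    """One graph convolutional layer on a transformed adjacency matrix C.
    C: (N, N) non-negative; X: (N, F); W: (F, F_out), an assumed shape."""
    C_hat = C + np.eye(C.shape[0])               # add self-loops
    deg = C_hat.sum(axis=1)                      # row sums, at least 1 because of I_N
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    C_norm = d_inv_sqrt[:, None] * C_hat * d_inv_sqrt[None, :]
    return elu(C_norm @ X @ W)

rng = np.random.default_rng(0)
A = np.zeros((5, 5))
for u, v in [(0, 1), (0, 2), (0, 3), (1, 4), (2, 4), (3, 4)]:
    A[u, v] = 1.0
X = rng.normal(size=(5, 4))
W = rng.normal(size=(4, 3))
Y_m1 = gcl(A @ A, X, W)                          # embedding from the first transformation
```

Because the identity matrix is added before normalization, every degree is at least 1, so the inverse square root is always defined.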

The features extracted by the GAT layer and the set of GCLs capture different structural properties of the input network $G_{t}$ . We adopt a feature fusion layer to combine them. For such early fusion tasks, concatenation and addition are the two simplest effective methods. Here we adopt element-wise addition to ensure that the output feature has the same dimension as the input. Notice that some information may be lost when features are added up directly, but computational costs are saved at the same time. The output of the feature fusion layer is denoted as:

$\displaystyle\mathbf{Y}=LN(\textit{Add}(\mathbf{Y}^{o},\mathbf{Y}^{M_{1}},\cdots,\mathbf{Y}^{M_{i}}))$ (10)

where $LN(\cdot)$ is the layer normalization function to normalize the outputs with range [0, 1], $\textit{Add}(\cdot,\cdot)$ is the element-wise add function. Finally, a flatten layer reshapes the output embeddings into row-wise vectors in order to feed them into the RNN networks later.
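The fusion and flattening steps can be sketched as below. The sketch uses the common zero-mean, unit-variance form of layer normalization over each node's feature vector, without learned gain and bias; this is an assumption about how $LN(\cdot)$ is realized, and the names `layer_norm` and `fuse` are hypothetical.

```python
import numpy as np

def layer_norm(Y, eps=1e-6):
    """Normalize each row (node) over its feature dimension (assumed form of LN)."""
    mean = Y.mean(axis=-1, keepdims=True)
    var = Y.var(axis=-1, keepdims=True)
    return (Y - mean) / np.sqrt(var + eps)

def fuse(features):
    """Element-wise addition of equally shaped (N, F') feature matrices,
    followed by layer normalization and flattening to a 1 x (N * F') row vector."""
    Y = layer_norm(np.sum(features, axis=0))
    return Y.reshape(1, -1)

rng = np.random.default_rng(0)
feats = [rng.normal(size=(4, 3)) for _ in range(5)]   # e.g. Y^o plus four Y^{M_i}
y_t = fuse(feats)
```

Addition keeps the fused feature at $N\times F^{\prime}$ regardless of how many transformed matrices are used, whereas concatenation would grow the width with each additional input.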

Figure 3.

Diagram of feature generation in each snapshot.

4.3 Time-level attention block

To capture the temporal variations of network structure through multiple time steps, we feed the sequence of row-wise vectors into a time-level attention block. The time-level attention block consists of two layers: a nonlinear RNN layer and a temporal self-attention layer.

Recurrent neural networks have a flexible nonlinear transformation ability on time-series inputs, while the attention mechanism has limited representational power because it uses a weighted sum to generate output vectors. To increase the expressive power on the time level, the row-wise vector sequence $\mathbf{Y}^{t}_{t-T}=\{\mathbf{y}_{t-T},\mathbf{y}_{t-T+1},\cdots,\mathbf{y}_{t}\},\mathbf{y}_{t}\in\mathbb{R}^{1\times(N\times F^{\prime})}$ is first fed into a recurrent neural network to mine the evolving patterns of the temporal directed network. LSTM and GRU are two well-performing RNN models which are capable of learning the long-term dependencies of sequential data. Since GRU is able to achieve performance similar to LSTM with fewer trainable parameters and lower computational complexity, we use a GRU hidden layer for this task. At each time step $t$ , the input vector $\mathbf{y}_{t}$ and the state vector of the last time step $\mathbf{h}_{t-1}$ are taken as the input of the GRU cell, and the output state vector $\mathbf{h}_{t}$ can be denoted as:

$\displaystyle\mathbf{z}_{t}=\sigma(\mathbf{W}^{(z)}\mathbf{y}_{t}+\mathbf{U}^{(z)}\mathbf{h}_{t-1}+\mathbf{b}^{(z)})$ (11)

$\displaystyle\mathbf{r}_{t}=\sigma(\mathbf{W}^{(r)}\mathbf{y}_{t}+\mathbf{U}^{(r)}\mathbf{h}_{t-1}+\mathbf{b}^{(r)})$ (12)

$\displaystyle\tilde{\mathbf{h}}_{t}=\mathrm{tanh}(\mathbf{W}^{(n)}\mathbf{y}_{t}+\mathbf{r}_{t}\odot(\mathbf{U}^{(n)}\mathbf{h}_{t-1})+\mathbf{b}^{(n)})$ (13)

$\displaystyle\mathbf{h}_{t}=(1-\mathbf{z}_{t})\odot\mathbf{h}_{t-1}+\mathbf{z}_{t}\odot\tilde{\mathbf{h}}_{t}$ (14)

where $\{\mathbf{W}^{(i)},\mathbf{U}^{(i)},\mathbf{b}^{(i)}\},i=\{z,r,n\}$ are the trainable parameters of the update gate, reset gate and new memory, respectively. $\sigma(\cdot)$ and $\mathrm{tanh}(\cdot)$ respectively represent the sigmoid and tanh activation function. $\odot$ denotes the Hadamard product. We denote $H_{R}$ as the hidden layer dimension of the GRU layer.
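A single GRU step can be sketched in NumPy as follows. The sketch assumes the reset gate is applied to $\mathbf{U}^{(n)}\mathbf{h}_{t-1}$ in the candidate state, treats vectors as one-dimensional arrays, and uses a hypothetical parameter dictionary `p` with small toy dimensions rather than the sizes used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(y_t, h_prev, p):
    """One GRU step: update gate z, reset gate r, candidate state, new state."""
    z = sigmoid(p["Wz"] @ y_t + p["Uz"] @ h_prev + p["bz"])
    r = sigmoid(p["Wr"] @ y_t + p["Ur"] @ h_prev + p["br"])
    h_tilde = np.tanh(p["Wn"] @ y_t + r * (p["Un"] @ h_prev) + p["bn"])
    return (1.0 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
d_in, d_hid = 6, 4                       # toy sizes standing in for N * F' and H_R
p = {}
for g in "zrn":
    p["W" + g] = rng.normal(scale=0.1, size=(d_hid, d_in))
    p["U" + g] = rng.normal(scale=0.1, size=(d_hid, d_hid))
    p["b" + g] = np.zeros(d_hid)

Y_seq = rng.normal(size=(5, d_in))       # five flattened snapshot embeddings
h = np.zeros(d_hid)
states = []
for y_t in Y_seq:
    h = gru_cell(y_t, h, p)
    states.append(h)
states = np.stack(states)                # hidden states passed on to temporal attention
```

When the update gate is close to zero the cell copies the previous state forward, which is the mechanism that lets the layer retain information across many snapshots.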

Afterwards, the hidden states of the GRU layer are fed into a temporal self-attention layer to differentiate their influences on each other. In detail, we take the hidden state vector at time step $t$ , i.e., $\mathbf{h}_{t}$ , as the query to attend over its historical representations, tracing the evolution of structural features. We follow [31] to adopt scaled dot-product attention in order to accelerate computation, as shown in Fig. 4. At each time step, the temporal self-attention layer takes the hidden state vectors $\mathbf{h}_{t-T}^{t}=\{\mathbf{h}_{t-T},\mathbf{h}_{t-T+1},\cdots,\mathbf{h}_{t}\}$ as input and produces a new sequence $\mathbf{z}_{t-T}^{t}=\{\mathbf{z}_{t-T},\mathbf{z}_{t-T+1},\cdots,\mathbf{z}_{t}\}$ . Similar to the node-level attention block, we employ the multi-head attention mechanism to learn features from different latent spaces and enhance the representational ability of our model [33]. In the $l$ -th attention head, linear transformations are first performed on the input vectors to generate the queries, keys and values, denoted as:

$\displaystyle\mathbf{Q}=\mathbf{h}_{t-T}^{t}\mathbf{W}^{(q)},\quad\mathbf{K}=\mathbf{h}_{t-T}^{t}\mathbf{W}^{(k)},\quad\mathbf{V}=\mathbf{h}_{t-T}^{t}\mathbf{W}^{(v)}$ (15)

where $\mathbf{W}^{(q)},\mathbf{W}^{(k)},\mathbf{W}^{(v)}\in\mathbb{R}^{H_{R}\times F^{\prime\prime}}$ are the trainable weights, $F^{\prime\prime}$ is the output feature dimension. Then the output vector is computed as:

$\displaystyle\mathbf{z}^{(l)}=\bm{\beta}\mathbf{V}$ (16)

where $\beta_{ij}\in\bm{\beta}$ ,

$\displaystyle\beta_{ij}=\frac{\exp(e_{ij})}{\sum\limits_{k=t-T}^{t}\exp(e_{ik})},\quad i,j\in[t-T,t]$ (17)

is the softmax function, and

$\displaystyle\mathbf{e}=\frac{\mathbf{Q}\mathbf{K}^{\mathrm{T}}}{\sqrt{F^{\prime\prime}}}$ (18)

is the attention coefficient matrix, with $e_{ij}\in\mathbf{e}$ indicating the influence of snapshot $i$ on snapshot $j$ . Since the input vectors are time-relevant, we follow [27] to add a mask matrix $\mathbf{M}\in\mathbb{R}^{l\times l}$ in Eq. (18) to enhance the auto-regressive property, such that:

$\displaystyle e_{ij}=\frac{(\mathbf{Q}\mathbf{K}^{\mathrm{T}})_{ij}}{\sqrt{F^{\prime\prime}}}+M_{ij}$ (19)

where ${M}_{ij}=-\infty$ when $i>j$ , otherwise 0. This ensures that when $i>j$ , the softmax operation in Eq. (17) generates a zero attention weight, i.e., $\beta_{ij}=0$ , which switches off the attention from time step $i$ to $j$ . The outputs of $K_{T}$ independent attention heads are concatenated as the final embedding, denoted as:

$\displaystyle\mathbf{z}_{t-T}^{t}=\textit{Concat}(\mathbf{z}^{(1)},\mathbf{z}^{(2)},\cdots,\mathbf{z}^{(K_{T})})$ (20)

Figure 4.

Diagram of temporal self-attention layer.

In this case, the shape of $\mathbf{z}_{t-T}^{t}$ is $T\times(K_{T}\times F^{\prime\prime})$ .
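The masked multi-head temporal attention can be sketched in NumPy as follows. The sketch uses the common row-as-query convention, in which each time step attends only to itself and to earlier steps; this indexing is an assumption and may be the transpose of the notation used in Eq. (19). The function name `temporal_attention` and the toy dimensions are hypothetical.

```python
import numpy as np

def temporal_attention(H, Wq, Wk, Wv):
    """H: (L, H_R) hidden states ordered in time; Wq, Wk, Wv: (K_T, H_R, F2).
    Returns an (L, K_T * F2) matrix of concatenated head outputs."""
    L = H.shape[0]
    steps = np.arange(L)
    # Causal mask (assumed convention): query step i sees key steps j <= i only
    M = np.where(steps[None, :] > steps[:, None], -np.inf, 0.0)
    heads = []
    for q, k, v in zip(Wq, Wk, Wv):
        Q, K, V = H @ q, H @ k, H @ v
        e = Q @ K.T / np.sqrt(q.shape[1]) + M        # scaled dot product plus mask
        e = e - e.max(axis=1, keepdims=True)         # stabilize the softmax
        beta = np.exp(e)
        beta /= beta.sum(axis=1, keepdims=True)
        heads.append(beta @ V)
    return np.concatenate(heads, axis=1)

rng = np.random.default_rng(0)
L, H_R, K_T, F2 = 4, 6, 2, 3
H = rng.normal(size=(L, H_R))
Wq, Wk, Wv = (rng.normal(size=(K_T, H_R, F2)) for _ in range(3))
Z = temporal_attention(H, Wq, Wk, Wv)
z_last = Z[-1]                                       # embedding handed to the decoder
```

With this mask the first row attends only to itself, and changing the last hidden state leaves all earlier outputs untouched, which is the auto-regressive property the mask is meant to enforce.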

4.4 The decoder network

In order to predict the network at time $t+1$ , we treat the output embedding at time $t$ , i.e., $\mathbf{z}_{t}\in\mathbb{R}^{1\times(K_{T}\times F^{\prime\prime})}$ , as the learned embedding of the historical snapshots. The output vector is then fed into a decoder network to generate the prediction result. Briefly, the decoder network consists of a fully-connected hidden layer and a fully-connected output layer, denoted as:

$\displaystyle D(\mathbf{z}_{t})=\textit{ReLU}(\textit{ReLU}(\mathbf{z}_{t}\mathbf{W}^{(h)}+\mathbf{b}^{(h)})\mathbf{W}^{(o)}+\mathbf{b}^{(o)})$ (21)

where $\mathbf{W}^{(h)}\in\mathbb{R}^{(K_{T}\times F^{\prime\prime})\times H_{D}}$ and $\mathbf{W}^{(o)}\in\mathbb{R}^{H_{D}\times(N\times N)}$ are the weights of the hidden layer and output layer, $H_{D}$ is the dimension of hidden layer. We denote $\mathbf{S}_{t+1}=D(\mathbf{z}_{t})$ as the output vector of the decoder network, which is then reshaped from $1\times(N\times N)$ to $N\times N$ , with $[\mathbf{S}_{t+1}]_{ij}\in[0,1]$ representing the existence probability of links in snapshot $G_{t+1}$ .
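Eq. (21) and the final reshape can be sketched as below. The function name `decode` and the toy sizes are hypothetical. The ReLU output is only guaranteed to be non-negative, and the sketch does not clip it to $[0,1]$.

```python
import numpy as np

def decode(z_t, Wh, bh, Wo, bo, N):
    """Two fully-connected ReLU layers followed by a reshape to N x N scores."""
    hidden = np.maximum(z_t @ Wh + bh, 0.0)       # hidden layer of width H_D
    scores = np.maximum(hidden @ Wo + bo, 0.0)    # output layer of width N * N
    return scores.reshape(N, N)

rng = np.random.default_rng(0)
N, d_z, H_D = 3, 8, 5                             # toy sizes; d_z stands for K_T * F''
z_t = rng.normal(size=(1, d_z))
Wh, bh = rng.normal(size=(d_z, H_D)), np.zeros(H_D)
Wo, bo = rng.normal(size=(H_D, N * N)), np.zeros(N * N)
S_next = decode(z_t, Wh, bh, Wo, bo, N)           # S_next[i, j]: score of link i -> j
```

Because each of the $N\times N$ entries has its own output unit, the scores for $i\to j$ and $j\to i$ are produced independently, which is what allows the decoder to predict link direction.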

4.5 Model optimization

Since our goal is to predict links in $G_{t+1}$ based on historical snapshots, the predicted score matrix $\mathbf{S}_{t+1}$ and the ground-truth adjacency matrix $\mathbf{A}_{t+1}$ should be close geometrically. In other words, if link $e(i,j)\in\mathbf{A}_{t+1}$ , the corresponding score $s_{ij}\in\mathbf{S}_{t+1}$ is close to 1 for a good prediction, otherwise 0. We use Frobenius norm to describe the distance between matrices, and train the TSAM model at time step $t$ by optimizing the following loss function:

$\displaystyle L_{t}(\theta;\mathbf{A}_{t-T}^{t},\mathbf{A}_{t+1})=\left\|\left(\mathbf{S}_{t+1}-\mathbf{A}_{t+1}\right)\odot{\cal B}\right\|_{F}^{2}+\frac{\lambda}{2}\left\|\theta\right\|_{2}^{2}$ (22)

where ${\mathbf{S}}_{t+1}=f({\mathbf{A}}_{t-T}^{t})$ is the predicted score matrix. ${\cal B}$ is the penalty matrix that deals with the sparsity of the adjacency matrix, following [34]: ${\cal B}_{ij}=\beta$ for $e(i,j)\in\mathbf{A}_{t+1}$ , otherwise ${\cal B}_{ij}=1$ . In this case, when $\beta>1$ , the loss penalizes inaccurate predictions of observed links more than those of unobserved links. To prevent over-fitting, we add an ${\cal L}_{2}$ regularizer on the model's weights to the loss function, with $\lambda$ as the hyperparameter controlling its relative weight. We adopt the Adam optimizer to minimize the loss function.
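The loss of Eq. (22) can be computed with a few lines of NumPy. The function name `tsam_loss` and the values of `beta` and `lam` below are hypothetical placeholders chosen for illustration only.

```python
import numpy as np

def tsam_loss(S, A_next, params, beta=5.0, lam=1e-5):
    """Weighted squared Frobenius reconstruction error plus L2 weight penalty."""
    B = np.where(A_next > 0, beta, 1.0)           # penalty beta on observed links
    recon = np.sum(((S - A_next) * B) ** 2)       # ||(S - A) ⊙ B||_F^2
    reg = 0.5 * lam * sum(np.sum(p ** 2) for p in params)
    return recon + reg

A_next = np.array([[0.0, 1.0], [0.0, 0.0]])
missed_link = tsam_loss(np.zeros((2, 2)), A_next, [], beta=5.0)
false_alarm = tsam_loss(np.array([[1.0, 1.0], [0.0, 0.0]]), A_next, [], beta=5.0)
```

In this toy case, missing the single observed link costs $\beta^{2}=25$ while a spurious link costs 1, so with $\beta>1$ the model is pushed away from the trivial all-zero prediction that a sparse target would otherwise favor.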

5. Experiments

5.1 Datasets

We use four real-world temporal directed networks to evaluate the performance of the TSAM model, including two email networks, a social network and an interaction network. Basic statistics of the four datasets are presented in Table 1, and a brief introduction to each is given as follows.

Table 1
Summary statistics of four temporal networks

Network	MAN	EEC	UCI	LEM
$\#$ of nodes	167	964	889	485
$\#$ of links	81,127	291,167	10,034	196,364
Average degree	971.58	604.08	22.57	809.75
Start date	2010/1/3	xxxx/3/17	2004/6/27	1979/4/1
End date	2010/9/27	xxxx/6/7	2004/10/26	2004/6/1
Total time span	9 months	15 months	4 months	302 months
Snapshot range	1 week	1 week	3 days	6 months
$\#$ of snapshots	38	64	40	50
Window size $T$	8	8	5	8

(1) Manufacturing emails (MAN) [18]: An email network between employees of a mid-sized manufacturing company. The network is directed and nodes represent employees. The left node represents the sender and the right node represents the recipient. A directed link $e(u,v,t)$ represents that employee $u$ sent an email to employee $v$ at time $t$ . We use the links in one week to generate a snapshot, and construct 38 snapshots with a total duration of 9 months, as shown in Fig. 5a.

(2) Email-Eu-core-temporal (EEC) [25]: An email network generated from the email data of a large European research institution. A directed link $e(u,v,t)$ represents that a person $u$ sent an email to another person $v$ at time $t$ . The timestamps of this dataset start from 0 and no starting date is specified. We neglect the isolated links and only generate the snapshots based on links that occurred during 15 consecutive months. Each snapshot contains the links of one week, so we get 64 snapshots in total, as shown in Fig. 5b.

(3) UC Irvine messages (UCI) [24]: A network consisting of private messages sent on an online social network at the University of California, Irvine. A directed link $e(u,v,t)$ represents that user $u$ sent a private message to user $v$ at time $t$ . We choose the links of 4 months in the experiments, and generate snapshots with a range of 3 days, as shown in Fig. 5c.

(4) LevantMonths (LEM) [20]: An interaction network collected by the Kansas Event Data System based on folders containing WEIS-coded events within eight countries: Egypt, Israel, Jordan, Lebanon, Palestinians, Syria, USA, and Russia. The dataset contains interactions from April 1979 to June 2004. Each snapshot contains the links that occurred within 6 months, and 50 snapshots are constructed in total, as shown in Fig. 5d.

Figure 5.

Histogram of link numbers for each snapshot in four networks.

5.2 Evaluation metrics

We use two standard evaluation metrics to evaluate the performance of temporal link prediction models. The area under the receiver operating characteristic curve (AUC) is a widely adopted evaluation metric for classification models, which considers both the sensitivity and specificity of the model. In link prediction problems, if among $n$ independent comparisons there are $n^{\prime}$ times that the score of a randomly chosen existent link is higher than that of a randomly chosen non-existent link, and $n^{\prime\prime}$ times that they get the same score, then AUC is calculated as:

$\displaystyle\mathrm{AUC}=\frac{n^{\prime}+0.5n^{\prime\prime}}{n}$ (23)

A larger AUC score indicates a better prediction performance for a given model. Similar to AUC, the area under the precision-recall curve (PRAUC) is designed for evaluation on sparse networks. However, neither AUC nor PRAUC can evaluate the added and removed links at the same time. Therefore, we additionally adopt the geometric mean of AUC and PRAUC (GMAUC) [12] to evaluate both added and removed links, defined as:

$\displaystyle\mathrm{GMAUC}=\left(\frac{\mathrm{PRAUC}-\frac{L_{A}}{L_{A}+L_{R}}}{1-\frac{L_{A}}{L_{A}+L_{R}}}\cdot 2\left(\mathrm{AUC}-0.5\right)\right)^{1/2}$ (24)

where $L_{A}$ and $L_{R}$ are the numbers of added and removed links respectively, $\mathrm{PRAUC}$ is the PRAUC value computed on the new links, and $\mathrm{AUC}$ is the AUC score computed on the originally existing links.
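Both metrics can be sketched in NumPy as follows. The sampling-based AUC follows Eq. (23) directly, and the GMAUC helper takes the PRAUC and AUC values as inputs rather than computing them. The function names, the number of comparisons `n`, and the seed are hypothetical.

```python
import numpy as np

def sampled_auc(pos_scores, neg_scores, n=10000, seed=0):
    """Eq. (23): compare n random (existent, non-existent) score pairs."""
    rng = np.random.default_rng(seed)
    p = rng.choice(pos_scores, size=n)            # scores of existent links
    q = rng.choice(neg_scores, size=n)            # scores of non-existent links
    return (np.sum(p > q) + 0.5 * np.sum(p == q)) / n

def gmauc(prauc_new, auc_prev, L_A, L_R):
    """Eq. (24): geometric mean of rescaled PRAUC (new links) and AUC (existing links)."""
    base = L_A / (L_A + L_R)
    return np.sqrt((prauc_new - base) / (1.0 - base) * 2.0 * (auc_prev - 0.5))

perfect = sampled_auc([0.9, 0.8], [0.1, 0.2])     # every comparison is won
tied = sampled_auc([0.5], [0.5])                  # every comparison is a tie
best_gmauc = gmauc(1.0, 1.0, L_A=10, L_R=30)
```

Both factors in Eq. (24) are rescaled so that a chance-level predictor maps to 0 and a perfect one maps to 1. Note that the square root is undefined when exactly one of the two factors falls below its chance level.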

5.3 Performance evaluation

Experiments are performed on an Ubuntu 16.04 LTS system with 48 cores, 128 GB RAM and 2.20 GHz clock frequency. We implement the model in TensorFlow 1.15.0, and train it on an NVIDIA Titan Xp GPU. We use a sliding window of size $T$ to get continuous snapshot sequences from all snapshots in each network. At each time step $t$ , we train a separate model on the snapshots up to $t$ and evaluate it at $t+1$ for each $t=1,\cdots,T$ .

Table 2
Parameter settings of TSAM in four networks

Network	$N$	$F^{\prime}$	$H_{R}$	$F^{\prime\prime}$	$K_{N}$	$K_{T}$	$H_{D}$	$lr$	$\lambda$
MAN	167	32	1024	256	4	8	128	0.001	0
EEC	964	64	4096	1024	2	4	512	0.001	$1e{-5}$
UCI	889	64	4096	1024	2	4	512	0.001	$1e{-5}$
LEM	485	32	2048	512	4	8	256	0.005	0

We compare the performance of TSAM and five state-of-the-art models on temporal link prediction in directed networks, including:

(1) TNE [38]: It models the temporal network with a Markov process and uses matrix factorization to learn node embeddings.

(2) GC-LSTM [4]: It is an end-to-end temporal link prediction model which uses GCN to extract structural features and LSTM to extract temporal features. We set the order of GCN $K=3$ to aggregate 3-hop neighbors.

(3) EvolveGCN [26]: It is a graph embedding model which adapts the GCN model along the temporal dimension without resorting to node embeddings. It captures the dynamism of graph sequences using an RNN to evolve the GCN parameters.

(4) dyngraph2vec [6]: It uses multiple non-linear layers to learn structural patterns of each snapshot, and then uses recurrent layers to learn temporal transitions in the network. We choose one of its variations, namely dyngraph2vecAERNN, which uses LSTM in the encoder to extract node embeddings and a fully-connected network in the decoder to generate predictions. The hyperparameter $\mathrm{lb}$ is set to $T$ .

(5) DySAT [27]: It is a graph embedding model based on joint self-attention along the two dimensions of structural neighborhood and temporal dynamics.

Table 3

Results of prediction performance in MAN and EEC

Method	MAN		EEC
	AUC (%)	GMAUC (%)	AUC (%)	GMAUC (%)
TNE	70.80 $\pm$ 1.1	77.15 $\pm$ 1.4	67.75 $\pm$ 1.0	69.87 $\pm$ 1.1
GC-LSTM	72.17 $\pm$ 0.7	73.03 $\pm$ 0.6	66.72 $\pm$ 0.7	69.51 $\pm$ 1.2
EvolveGCN	78.81 $\pm$ 0.4	82.53 $\pm$ 0.3	70.10 $\pm$ 0.7	72.53 $\pm$ 0.9
dyngraph2vec	76.60 $\pm$ 0.4	80.16 $\pm$ 0.4	68.83 $\pm$ 0.8	70.35 $\pm$ 1.1
DySAT	81.20 $\pm$ 0.2	84.05 $\pm$ 0.3	82.03 $\pm$ 0.3	85.50 $\pm$ 0.2
TSAM	80.87 $\pm$ 0.2	84.53 $\pm$ 0.3	84.21 $\pm$ 0.2	86.75 $\pm$ 0.2

Table 4

Results of prediction performance in UCI and LEM

Method	UCI		LEM
	AUC (%)	GMAUC (%)	AUC (%)	GMAUC (%)
TNE	67.11 $\pm$ 0.7	69.54 $\pm$ 0.7	76.20 $\pm$ 0.9	78.51 $\pm$ 1.1
GC-LSTM	67.54 $\pm$ 0.5	70.81 $\pm$ 0.7	79.52 $\pm$ 0.7	79.90 $\pm$ 0.7
EvolveGCN	69.53 $\pm$ 0.5	72.01 $\pm$ 0.8	80.71 $\pm$ 0.4	83.55 $\pm$ 0.6
dyngraph2vec	75.51 $\pm$ 0.4	78.14 $\pm$ 0.5	89.15 $\pm$ 0.4	91.00 $\pm$ 0.3
DySAT	79.87 $\pm$ 0.2	83.86 $\pm$ 0.2	88.52 $\pm$ 0.3	90.49 $\pm$ 0.3
TSAM	81.15 $\pm$ 0.2	84.91 $\pm$ 0.1	90.81 $\pm$ 0.3	91.90 $\pm$ 0.2

Note that, since we are dealing with directed networks, spectral-domain convolution cannot be applied directly to the asymmetric adjacency matrices, because the corresponding asymmetric Laplacian matrices do not in general admit the eigendecomposition that spectral convolution relies on. Therefore, we replace all GCN units in GC-LSTM and EvolveGCN with GraphSAGE to perform spatial-domain convolution on directed networks, using the mean aggregator function of GraphSAGE to aggregate neighborhood features. Moreover, since TNE, EvolveGCN and DySAT are graph embedding models that learn node embeddings $\mathbf{e}_{t}$ from the evolving network $G_{t-T}^{t}$, we define the similarity score matrix $\mathbf{S}_{t+1}=\mathbf{e}_{t}\cdot\mathbf{e}_{t}^{\mathrm{T}}$ in order to use them for the temporal link prediction task. In TSAM, we adopt the four transformations $\{\mathbf{C}^{M_{1}},\mathbf{C}^{M_{2}},\mathbf{C}^{M_{3}},\mathbf{C}^{M_{4}}\}$ listed in Eq. (6) as additional feature inputs. The other parameter settings for the four networks are presented in Table 2.
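
The similarity score matrix used for the embedding baselines can be sketched in a few lines of NumPy. The function name `similarity_scores` and the convention that row $i$ of the input holds the embedding of node $i$ are assumptions made for illustration.

```python
import numpy as np


def similarity_scores(embeddings: np.ndarray) -> np.ndarray:
    """Return S = E E^T for an (N, d) embedding matrix E.

    Entry (u, v) is the inner product of the embeddings of nodes u and v.
    The row-per-node layout is an assumption of this sketch.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    if embeddings.ndim != 2:
        raise ValueError("expected a 2-D array of shape (num_nodes, dim)")
    return embeddings @ embeddings.T


# Toy example with three nodes and two-dimensional embeddings.
E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
S = similarity_scores(E)
```

A product of this form is symmetric by construction, so it assigns the same score to the candidate links $u\to v$ and $v\to u$.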

Figure 6.

Average AUC of TSAM and baselines in four networks at each time step. Each value is the average of 20 independent experiments.

Tables 3 and 4 present the prediction performance of TSAM and the baselines in the four networks. TSAM achieves consistent AUC gains of roughly 1–2 percentage points over the best baseline in EEC, UCI and LEM. In MAN, the AUC of TSAM is slightly lower than that of DySAT, but TSAM achieves a higher GMAUC, which indicates better performance in predicting both added and removed links. The performance of TSAM is also more stable across the four networks, as reflected by its smaller standard deviations. We also notice that, compared with dyngraph2vec and GC-LSTM, TSAM and DySAT achieve clear improvements in both AUC and GMAUC in most networks (the exception is LEM, where dyngraph2vec slightly outperforms DySAT), which indicates the effectiveness of the self-attention mechanism for the temporal link prediction task.

Figure 6 presents the average AUC at each time step in the four networks. The performance of TSAM is generally better and more stable than that of the baselines. The advantage of TSAM is most evident in comparison with TNE and GC-LSTM, whose AUC curves drop sharply at certain time steps.

5.4 Influence of feature fusion

An important difference between TSAM and the baselines is the feature fusion block, which uses matrix transformations to capture motif features in directed networks. Here we analyze the influence of the feature fusion block under different choices of additional features. Tables 5 and 6 present the performance of TSAM in three cases: (1) using only the output of GAT, without additional features, as the input of GRU; (2) using $\mathbf{C}^{M_{1}}$ as the additional feature; and (3) using the matrices listed in Eq. (5) as additional features. For each case we adopt the parameters that give the best performance. The results show that, without additional features, TSAM achieves lower AUC and GMAUC than DySAT in MAN and EEC. When $\mathbf{C}^{M_{1}}$ is adopted, AUC improves in MAN, EEC and LEM, most clearly in EEC, where both AUC and GMAUC rise by more than 2 percentage points. Since $\mathbf{C}^{M_{1}}$ captures the local motif $u\to t\to v$, this suggests that taking the motif structure of directed networks into account can improve prediction performance. We also find that TSAM performs best when all four additional features are used, at the expense of higher computational cost.
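
To make the $u\to t\to v$ motif concrete, the sketch below counts, for every ordered node pair $(u,v)$, the number of intermediate nodes $t$ with links $u\to t$ and $t\to v$. It assumes that such counts are given by the square of the directed adjacency matrix; the exact transformation that defines $\mathbf{C}^{M_{1}}$ in TSAM is not reproduced here, and the function name `two_step_path_counts` is hypothetical.

```python
import numpy as np


def two_step_path_counts(adj: np.ndarray) -> np.ndarray:
    """Count directed two-step paths u -> t -> v for all ordered pairs (u, v).

    adj[u, t] == 1 means there is a link u -> t. The result is adj @ adj,
    whose (u, v) entry sums adj[u, t] * adj[t, v] over all t.
    This is an illustrative stand-in, not the paper's exact C^{M_1}.
    """
    adj = np.asarray(adj)
    if adj.ndim != 2 or adj.shape[0] != adj.shape[1]:
        raise ValueError("expected a square adjacency matrix")
    return adj @ adj


# Toy directed graph with links 0->1, 1->2, 0->3, 3->2.
A = np.zeros((4, 4), dtype=int)
for u, v in [(0, 1), (1, 2), (0, 3), (3, 2)]:
    A[u, v] = 1
C = two_step_path_counts(A)
```

In this toy graph the only non-zero entry of `C` is at position (0, 2), with value 2, because node 0 reaches node 2 through both node 1 and node 3. Unlike a symmetric similarity score, the count is direction-aware: the entry at (2, 0) stays 0.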

5.5 Influence of attention heads

Another important hyperparameter of TSAM is the number of attention heads in the self-attention layers. Here we analyze the influence of the number of attention heads on the performance of TSAM. Figure 7 presents the performance of TSAM when we independently vary the number of attention heads in the node-level and time-level self-attention blocks within the range $\{1,2,4,8,16\}$. The results show that more attention heads lead to better performance in both AUC and GMAUC. When $K_{N}\geqslant 4$ and $K_{T}\geqslant 8$, the performance tends to stabilize, which indicates that 4 node-level attention heads and 8 time-level attention heads are sufficient to capture the latent features.

Table 5
Comparison on the influence of feature fusion in MAN and EEC

Method	MAN		EEC
	AUC (%)	GMAUC (%)	AUC (%)	GMAUC (%)
No feature	78.11 $\pm$ 0.3	82.31 $\pm$ 0.3	80.15 $\pm$ 0.3	81.24 $\pm$ 0.3
$\{\mathbf{C}^{M_{1}}\}$	80.19 $\pm$ 0.2	81.95 $\pm$ 0.3	82.60 $\pm$ 0.3	83.83 $\pm$ 0.2
$\{\mathbf{C}^{M_{1}},\mathbf{C}^{M_{2}},\mathbf{C}^{M_{3}},\mathbf{C}^{M_{4}}\}$	80.87 $\pm$ 0.2	84.53 $\pm$ 0.3	84.21 $\pm$ 0.2	86.75 $\pm$ 0.2

Table 6

Comparison on the influence of feature fusion in UCI and LEM

Method	UCI		LEM
	AUC (%)	GMAUC (%)	AUC (%)	GMAUC (%)
No feature	80.81 $\pm$ 0.2	82.53 $\pm$ 0.2	89.14 $\pm$ 0.3	91.12 $\pm$ 0.3
$\{\mathbf{C}^{M_{1}}\}$	79.90 $\pm$ 0.2	81.97 $\pm$ 0.2	89.57 $\pm$ 0.3	90.26 $\pm$ 0.2
$\{\mathbf{C}^{M_{1}},\mathbf{C}^{M_{2}},\mathbf{C}^{M_{3}},\mathbf{C}^{M_{4}}\}$	81.15 $\pm$ 0.2	84.91 $\pm$ 0.1	90.81 $\pm$ 0.3	91.90 $\pm$ 0.2

Figure 7.

Performance comparison on the number of attention heads in node-level and time-level self-attention blocks (in percentage).

6. Conclusions

Predicting the connectivity and direction of links in temporal networks is both meaningful and challenging. In this paper we propose a temporal link prediction model for directed networks based on graph neural networks and the self-attention mechanism. The basic architecture of our model is an autoencoder. In the encoder, the local structural features of each snapshot are extracted by the GAT layer and several GCLs. A GRU hidden layer then captures the temporal variations of the snapshot sequence, and a time-level self-attention layer differentiates the effect of each snapshot. In the decoder, a fully connected layer transforms the learned embedding into the predicted adjacency matrix. Experimental results on realistic networks demonstrate the effectiveness of our model in comparison with the baselines.

In future work, we will focus on the prediction of weighted links in directed networks. A possible direction for extending TSAM to weight prediction problems is to leverage the structure of generative adversarial networks to refine the prediction accuracy.

Acknowledgments

This research was funded by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (Grant No. 61521003), and National Natural Science Foundation of China (Grant No. 61803384).

References

1. Adamic L.A. and Adar, Friends and neighbors on the Web, Soc. Netw 25(3) (2003), 211–230.

2. Bianconi, Gulbahce and Motter A.E., Local structure of directed networks, Phys. Rev. Lett 100(11) (Mar. 2008).

3. Bütün, Kaya and Alhajj, Extension of neighbor-based link prediction methods for directed, weighted and temporal social networks, Information Sciences 463–464 (Oct. 2018), 152–165.

4. Chen and Zheng, GC-LSTM: Graph Convolution Embedded LSTM for Dynamic Link Prediction, arXiv, Dec. 2018.

5. Dareddy M.R., Das and Yang, Motif2vec: Motif aware node representation learning for heterogeneous networks, Aug. 2019.

6. Goyal, Chhetri S.R. and Canedo, dyngraph2vec: Capturing Network Dynamics using Dynamic Graph Representation Learning, arXiv, Sept. 2018.

7. Goyal, Kamra and Liu, DynGEM: Deep Embedding Method for Dynamic Graphs, arXiv, May 2018.

8. Gao, Lou and Zhang, Link prediction via graph attention network, arXiv, Oct. 2019.

9. Hamilton W.L., Ying and Leskovec, Inductive representation learning on large graphs, arXiv, June 2017.

10. Jaccard, Étude Comparative de La Distribution Florale Dans Une Portion Des Alpes et Du Jura, Jan. 1901.

11. Jawed, Kaya and Alhajj, Time frame based link prediction in directed citation networks, in: Proceedings of the 2015 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining 2015 – ASONAM ’15, pages 1162–1168, Paris, France, 2015. ACM Press.

12. Junuthula R.R., K.S. and Devabhaktuni V.K., Evaluating link prediction accuracy in dynamic networks with added and removed edges, in: 2016 IEEE International Conferences on Big Data and Cloud Computing (BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pages 377–384, 2016.

13. Katz, A new status index derived from sociometric analysis, Psychometrika 18(1) (1953), 39–43.

14. Kefato Z.T., Sheikh and Montresor, Which way? Direction-Aware Attributed Graph Embedding, arXiv, Jan. 2020.

15. Kipf T.N. and Welling, Semi-supervised classification with graph convolutional networks, in: International Conference on Learning Representations, Apr. 2017.

16. Kossinets, Effects of missing data in social networks, Soc. Netw 28(3) (July 2006), 247–268.

17. Kumar, Novak and Tomkins, Structure and evolution of online social networks, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining – KDD ’06, pages 611–617, Philadelphia, PA, USA, 2006. ACM Press.

18. Kunegis, KONECT: the Koblenz network collection, in: Proceedings of the 22nd International Conference on World Wide Web – WWW ’13 Companion, pages 1343–1350, Rio de Janeiro, Brazil, 2013. ACM Press.

19. Lei, Qin, Bai, Zhang and Yang, GCN-GAN: A non-linear temporal link prediction model for weighted dynamic networks, in: IEEE INFOCOM 2019 – IEEE Conference on Computer Communications, pages 388–396, Apr. 2019.

20. Leskovec and Krevl, SNAP Datasets: Stanford large network dataset collection, Jan. 2014.

21. Liben-Nowell and Kleinberg, The link-prediction problem for social networks, Journal of the American Society for Information Science and Technology 58(7) (May 2007), 1019–1031.

22. Lü and Zhou, Link prediction in complex networks: A survey, Phys. A 390(6) (Mar. 2011), 1150–1170.

23. Monti, Otness and Bronstein M.M., MotifNet: A motif-based graph convolutional network for directed graphs, in: 2018 IEEE Data Science Workshop (DSW), pages 225–228, Lausanne, Switzerland, June 2018. IEEE.

24. Panzarasa, Opsahl and Carley K.M., Patterns and dynamics of users’ behavior and interaction: Network analysis of an online community, J. Am. Soc. Inf. Sci. Technol 60(5) (May 2009), 911–932.

25. Paranjape, Benson A.R. and Leskovec, Motifs in Temporal Networks, in: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining – WSDM ’17, pages 601–610, Cambridge, United Kingdom, 2017. ACM Press.

26. Pareja, Domeniconi, Chen, Suzumura, Kanezashi, Kaler, Schardl T.B. and Leiserson C.E., EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs, arXiv, Feb. 2019.

27. Sankar, Gou, Zhang and Yang, DySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention Networks, in: Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM ’20, pages 519–527, Houston, TX, USA, Jan. 2020. Association for Computing Machinery.

28. Shang K.-k., Small, X.-k. and Yan W.-s., The role of direct links for link prediction in evolving networks, EPL 117(2) (Jan. 2017), 28002.

29. Sun, Bandyopadhyay, Bashizade, Liang, Sadayappan and Parthasarathy, ATP: Directed Graph Embedding with Asymmetric Transitivity Preservation, arXiv, Nov. 2018.

30. Tylenda, Angelova and Bedathur, Towards time-aware link prediction in evolving social networks, in: Proceedings of the 3rd Workshop on Social Network Mining and Analysis – SNA-KDD ’09, pages 1–10, Paris, France, 2009. ACM Press.

31. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser and Polosukhin, Attention Is All You Need, arXiv, June 2017.

32. Veličković, Cucurull, Casanova, Romero, Liò and Bengio, Graph attention networks, arXiv, Oct. 2017.

33. Voita, Talbot, Moiseev, Sennrich and Titov, Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned, arXiv, May 2019.

34. Wang, Cui and Zhu, Structural Deep Network Embedding, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, pages 1225–1234, San Francisco, California, USA, Aug. 2016. Association for Computing Machinery.

35. Zhang, Shi, Xie, King and Yeung D.-Y., GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs, arXiv, Mar. 2018.

36. Zhou, Yang, Ren and Zhuang, Dynamic network embedding by modeling triadic closure process, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

37. Zhou, Lü and Zhang, Predicting missing links via local information, Eur. Phys. J. B 71(4) (2009), 623–630.

38. Zhu, Guo, Yin, Steeg G.V. and Galstyan, Scalable temporal latent space inference for link prediction in dynamic social networks, IEEE Trans. Knowl. Data Eng 28(10) (Oct. 2016), 2765–2777.
