Combining jumping knowledge into traffic forecasting: An attention-based spatial-temporal adaptive integration gated network

Abstract

Traffic forecasting has become a core component of Intelligent Transportation Systems. However, accurate traffic forecasting is very challenging because of the complexity of traffic road networks. Most existing forecasting methods do not fully consider the topological structure information of road networks, making it difficult to extract accurate spatial features. In addition, spatial and temporal features have different impacts on traffic conditions, but existing studies ignore the distribution of spatial-temporal features across traffic regions. To address these limitations, we propose a novel graph neural network architecture named Attention-based Spatial-Temporal Adaptive Integration Gated Network (AST-AIGN). The originality of AST-AIGN lies in obtaining a spatial feature that more accurately reflects the topological structure of the road networks by embedding Graph Attention Network (GAT) into Jumping Knowledge Net (JK-Net). We propose a data-dependent function called spatial-temporal adaptive integration gate to handle the diversity of feature distributions and highlight the features in road networks that significantly affect traffic conditions. We evaluate our model on two real-world traffic datasets from the Caltrans Performance Measurement System (PEMS04 and PEMS08), and the extensive experimental results demonstrate that the proposed AST-AIGN architecture outperforms the baselines.

Keywords

Traffic forecasting, spatial-temporal dependences, jumping knowledge, gating mechanism, self-attention

1. Introduction

Traffic forecasting is a typical problem of spatial-temporal forecasting, which aims to predict traffic conditions in a certain period of time based on historical traffic data. Traffic forecasting is an indispensable component of Intelligent Transportation System (ITS) [1]. ITS has a wide range of applications, from managing transportation in an efficient manner to alleviating traffic congestion and reducing road accidents. However, as the non-linear and non-stationary traffic data depends on the dynamic road conditions, how to discover its inherent spatial-temporal patterns and generate accurate traffic predictions is extremely challenging.

Figure 1.

Visualization of traffic road networks topological structure.

Traffic forecasting has received significant research interest, and many methods have been proposed. For example, traditional statistical methods [2, 3, 4, 5, 6, 7] apply time series models to simulate the patterns of complex traffic data. However, these models usually assume that the input time series is stationary and linear, an assumption that traffic data violates. Hence, it is hard for traditional statistical models to jointly exploit the spatial and temporal correlations in traffic flows and to make accurate predictions. Deep learning traffic forecasting models have achieved higher accuracy and gradually replaced traditional statistical models. For instance, existing deep learning models [8, 9] combine Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to model spatial and temporal correlations, respectively. Although CNNs can extract the spatial features of traffic data, they ignore the spatial topological structures of road networks. Furthermore, it is difficult for RNNs to extract long-term temporal features. In addition, these methods usually adopt autoregressive prediction, leading to propagation and accumulation of prediction errors. To overcome these shortcomings, Graph Convolutional Networks (GCNs) [10] were introduced to model the topological structure of road networks. Traffic sensors on roads are regarded as nodes in the road network graph, and the edges of the graph are constructed from the distance or similarity between each pair of nodes. By combining GCNs with CNNs [11, 12, 13] or RNNs [14, 15, 16], graph-based deep learning methods can capture both spatial and temporal dependences and achieve promising results. To model the highly dynamic spatial dependence caused by varying traffic speeds and factors such as weather conditions, some researchers [17, 18, 19, 20] introduce self-attention mechanisms into GCNs, while others [21, 22, 23] combine GCNs with the Transformer [24].
Although the effectiveness of the graph-based deep learning methods in traffic forecasting tasks has been shown, the existing methods still ignore two important issues.

First, the previous studies generally fail to consider two types of important topological structure information of road networks when extracting spatial features: (1) directed relationships among traffic nodes, (2) location of traffic nodes in the road network. These studies assume the relationships among nodes are undirected and do not distinguish the node position when extracting spatial features. Figure 1a represents the entire road networks in San Jose, California. Figure 1b visualizes the networks’ topological structure with the light-colored nodes representing transportation hubs and the dark-colored nodes indicating traffic edge nodes. Figure 1c illustrates the local topological structure circled by the red circle in Fig. 1b and indicates the spatial correlation between hub node 1 and its surrounding traffic nodes. Taking nodes 1 and 2 as an example, the edge weights between this pair are not the same, and node 1 has more significant influence on node 2 because node 1 is a traffic hub and possesses more 1-hop neighbors than node 2 [1]. Therefore, the relationship among the nodes should be regarded as directed. Existing graph-based approaches extract spatial feature through the neighborhood aggregation process (Graph convolution is essentially a neighborhood aggregation process [25], and we describe the process in detail in Section 3.2). In this process, the central hub nodes could get more global spatial feature but lose local ones, while the edge node retains local spatial features but has difficulty in obtaining global ones. Nodes with different locations can hardly get appropriate neighborhood ranges where nodes draw feature representations.

Second, the existing studies tend to ignore the diversity of the spatial-temporal feature distributions of road networks in different regions. In some areas, spatial dependence dominates traffic conditions, while in others, temporal dependence dominates. The traffic data of two pairs of traffic sensors in different regions are visualized in Fig. 2 to illustrate these different distributions. Within the same time span, the data from sensor 1 and sensor 2 show clear periodicity and trend, meaning that the temporal factor has a greater influence on this region. In contrast, sensor 3 responds to changes at sensor 4, which indicates a stronger spatial dependence in that region.

Figure 2.

Traffic data with different spatial and temporal feature distribution.

In this work, we propose a novel method, AST-AIGN, which addresses the two limitations above. In this method, we combine GAT [17] with the JK-Net framework [26] to describe the directed relationship among traffic nodes and obtain topological structure-aware spatial feature representations of all nodes. A data-dependent gate, called spatial-temporal adaptive integration gate, is presented to process the diversity of spatial-temporal feature distributions in road networks and selectively integrate the spatial-temporal features in different regions. Meanwhile, our model is based on the self-attention to capture dynamic spatial dependence and long-term temporal dependence, complementing the feature extraction process. The main contributions of this work are as follows:

We propose the JK-GAT by embedding GAT into the JK-Net framework to obtain a spatial feature that more accurately reflects the topological structure of the road networks. JK-GAT simulates the directedness among traffic nodes through calculating the attention weight of GAT. With jump connection and aggregation mechanism in JK-Net, JK-GAT can obtain the topological structure-aware feature representations of traffic nodes.

We propose a data-dependent gate that can depict the spatial-temporal feature distributions in different regions without using any extra information (e.g., land-use data) but historical traffic data and highlight features that significantly impact traffic conditions.

We propose a novel forecasting model AST-AIGN to extract more accurate spatial-temporal dependences and complete traffic flow prediction. We further use the PEMS04 and PEMS08 datasets to evaluate the proposed method AST-AIGN. The extensive experimental results demonstrate that our model is superior to all baseline methods for the traffic forecasting problem.

The rest of this paper is structured as follows. Section 2 reviews the related work on traffic forecasting. Section 3 presents the problem definition and neighborhood aggregation scheme. Section 4 offers the details of the method proposed in the paper. Section 5 evaluates the prediction performance of AST-AIGN using real traffic datasets. Finally, this paper is concluded in Section 6.

2. Related work

2.1 Graph convolutional networks

We can analyze traffic networks and mobile networks as graphs for tasks such as traffic forecasting, assignment, and energy efficiency [27, 28]. Many recent studies try to apply CNNs and RNNs to graph-structured data. Although these neural networks can process Euclidean data efficiently, it is still hard for them to handle non-Euclidean data (e.g., graphs). GCNs were proposed to solve this problem and generate the feature representations of each node in a graph. GCNs typically adopt the Laplacian matrix or the adjacency matrix to portray the relationships among nodes and can be divided into two categories: spectral methods and spatial methods. Researchers designed spectral domain graph convolution [29, 30] based on spectral graph theory. As an extension, Defferrard et al. [31] proposed fast localized convolutional filters on graphs to decrease the computational complexity. Because spectral methods are slower to compute and less applicable to directed graphs, spatial methods have been developed much more widely. For instance, Will et al. [32] designed a new sampling strategy for neighborhood node aggregation and applied the algorithm to large graphs for inductive representation learning. Veličković et al. [17] proposed GAT, which dynamically calculates the relevance among nodes and enhances the accuracy of node classification. Michael et al. [33] presented the relational GCN (R-GCN) for tasks on heterogeneous graphs with relational data. Brody et al. [34] combined GAT and R-GCN to further increase the capacity of graph representation learning. Moreover, many studies [26, 35] have addressed graph convolution theory to solve the inherent problems of GCNs, such as over-smoothing.

2.2 Traffic forecasting

The existing traffic forecasting methods can be divided into two categories: model-driven methods and data-driven methods. Model-driven methods mainly use mathematical tools and physical knowledge to describe traffic problems through mathematical analysis [36]. This approach performs comprehensive system modeling based on prior knowledge and requires high computational power, but it is vulnerable to the limitations such as noise interference and sampling point distribution.

Data-driven methods aim at traffic condition forecasting and assessment based on statistical characteristics of data [37, 38]. These methods are more flexible and do not require analysis of the physical characteristics of the road networks, and they can be mainly divided into parametric and non-parametric models. Parametric models determine model parameters by processing raw data, and forecast traffic based on regression functions, such as the Kalman filter model [2], exponential smoothing model [3] and autoregressive moving average model (ARIMA) [5]. Among them, ARIMA is the most widely used. Hamed et al. [39] used ARIMA to predict urban traffic flow, and many variants of the model were proposed to improve the prediction accuracy [40, 41, 42]. The parametric model algorithm is simple and based on certain linear assumptions, but it is difficult to predict non-smooth traffic data effectively. Non-parametric models based on shallow machine learning can model the nonlinear characteristics of traffic data. Common models include K-neighborhood [4], support vector machine regression (SVR) [6], Bayesian networks [7], etc. However, it is still difficult to explore the complex spatial-temporal dependences of traffic data, which leads to poor prediction accuracy.

Recently, rapidly developing deep learning models have been able to capture the dynamic features of traffic data effectively through stacked nonlinear networks; for example, [43, 44, 45] used RNNs to predict highway traffic flow. Shao et al. [46] proposed a long short-term memory (LSTM) network using a memory unit and gating mechanism to capture long-term temporal dependence in traffic series. Fu et al. [47] adopted a simpler gated recurrent unit (GRU), also based on a gating mechanism, to extract temporal features and reduce computational cost. However, these models do not consider the topological structure of road networks, which limits their ability to capture spatial features. Liu et al. [9] proposed the Conv-LSTM model and a novel historical data matrix. Conv-LSTM adopts CNNs to model the spatial dependence from the matrix row vectors and LSTM to model the temporal dependence from the column vectors. In this way, Conv-LSTM models both spatial and temporal dependences, leading to better prediction performance. Despite adopting CNNs for spatial dependence modeling, it is inherently unable to process non-Euclidean traffic data.

GCN-based methods can exploit the information in the topological structure of traffic networks. Li et al. [14] proposed the diffusion convolutional recurrent neural network (DCRNN) based on the seq2seq model, using bi-directional diffusion graph convolution to replace the matrix operations of the gated recurrent unit (GRU) and capture spatial dependence by random walks on the graph. Cui et al. [16] proposed the traffic graph convolutional long short-term memory neural network (TGC-LSTM) to learn the interplay among the different roads within the traffic network and to predict the overall traffic conditions. Huang et al. [48] introduced a novel multi-scale temporal dual graph convolution network (MD-GCN) to capture multi-scale temporal dependencies using a combination of channel attention and inception structures. Yu et al. [11] proposed a spatial-temporal graph convolution network (STGCN), using GCNs and CNNs to model the spatial and temporal dependences of traffic flows, respectively. The 1-D convolution in STGCN is more effective than RNNs at capturing the temporal dependence. GraphWaveNet [12] uses dilated convolution (TCN) to extract long-term temporal dependence and constructs an adaptive adjacency matrix to complement the road network topology. STSGCN [13] integrates spatial graphs of adjacent time steps into a local spatial-temporal graph and defines a spatial-temporal graph convolution module to capture features. Most of the above methods are based on static assumptions, but real traffic flows have highly dynamic spatial dependence, and the spatial relationships among nodes evolve over time. Vaswani et al. [24] proposed the multi-head self-attention mechanism, which dynamically computes the correlation between sequence elements, so many studies have turned to combining attention mechanisms with GCNs.
STTNs [20], a spatial-temporal prediction model adopting the Scaled Dot-Product self-attention mechanism, stack spatial-temporal blocks to capture dynamic spatial dependence and long-term temporal dependence in traffic flow. Li et al. [21] proposed Forecaster based on Transformer architecture, which represents the spatial-temporal correlation among nodes by constructing correlation graphs and adopting sparse process on the weighted matrix. To achieve a more refined modeling result of the traffic networks, current researchers introduce the concept of multi-graph convolution. Li et al. [49] used a new progressive multi-graph convolution network (PMGCN), in which multiple graph convolutions adopt progressive connections and spatial-temporal attention dynamically adjusts each node in the graph. Ni et al. [50] introduced a novel multi-graph model based on an attention 1-D CNN and a gated interpretable framework to model historical traffic data. Qin et al. [51] proposed a dynamic multi-graph convolution recurrent network (DMGCRN) to capture coarse-grained region methods dynamically. Ke et al. [52] adopted the spatial correlation between roads and semantic correlation to construct the multi-graph and fused the results after separate convolution on the multi-graph. Yang et al. [53] integrated the multi-graph first and then conducted graph convolution on the multi-graph. Zeng et al. [54] introduced Point of Interest (POI) data to construct a traffic knowledge graph that reflected land utilization. However, most of these approaches do not fully use the topological structure information of the road networks and ignore different distributions of spatial-temporal features in traffic regions.

To address the limitations of the existing studies and step beyond the state of the art, this paper proposes a new graph neural network structure, AST-AIGN, for the traffic forecasting task on urban road networks.

Figure 3.

Aggregation of 2-hop neighborhood.

3. Preliminaries

3.1 Problem definition

In this study, the topological structure of road networks is described by a weighted graph $g=(V,E,A)$, where ${V=\{v_{1},v_{2},\ldots,v_{N}\}}$ represents the set of traffic nodes, and $N$ denotes the number of nodes; $E$ is the set of edges that physically connect traffic nodes; and $A\in R^{N\times N}$ is the adjacency matrix indicating the Euclidean distance or similarity among traffic nodes.

Traffic forecasting is a classic spatial-temporal forecasting problem, where traffic data is described by a matrix $X\in R^{N\times T\times D}$ containing $D$ kinds of attribute data such as speed, flow and density possessed in $N$ traffic nodes within past time step $T$ . Knowing the road network structure $g$ and traffic data $X$ , the mapping function $f$ can be learned and the traffic condition in the future time step ${T}^{\prime}$ can be calculated according to $f$ . The traffic forecasting problem can be stated as:

$\displaystyle[X_{t+1},\ldots,X_{t+{T}^{\prime}}]=f(g;[X_{t-T},\ldots,X_{t}])$ (1)

where $t$ represents the current time point, $T$ represents the length of the historical time series and ${T}^{\prime}$ represents the length of the time series to be forecasted.
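The mapping in Eq. (1) is usually trained on (history, future) pairs cut from one long recording. The following is a minimal sketch of that windowing, not the authors' preprocessing: the function name `make_windows`, the toy sizes, and the choice of exactly $T$ historical steps ending at $t$ are assumptions made for illustration.

```python
import numpy as np

def make_windows(series, T, T_prime):
    """Cut a (N, L, D) recording into (history, future) pairs.

    history: the T steps ending at time t; future: the next T_prime steps.
    (Hypothetical helper; the window convention is an assumption.)
    """
    N, L, D = series.shape
    inputs, targets = [], []
    # t is the index of the last observed step of each window
    for t in range(T - 1, L - T_prime):
        inputs.append(series[:, t - T + 1 : t + 1, :])
        targets.append(series[:, t + 1 : t + 1 + T_prime, :])
    return np.stack(inputs), np.stack(targets)

# Toy recording: 3 nodes, 20 time steps, 1 attribute (e.g. flow)
series = np.arange(3 * 20, dtype=float).reshape(3, 20, 1)
X_hist, Y_future = make_windows(series, T=12, T_prime=3)
print(X_hist.shape, Y_future.shape)
```

Each sample keeps the node axis intact, so a model $f$ sees all $N$ nodes of one window together with the graph $g$.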

3.2 Neighborhood aggregation

Diverse types of graph neural network [31, 10, 32, 17, 34] that were proposed to apply convolution operation on non-Euclidean data can be abstractly described as a neighborhood aggregation scheme [25]. The scheme is divided into two parts: feature generation and feature aggregation, which is mathematically presented as:

$\displaystyle{X}_{i}^{k}=\gamma({X}_{i}^{k-1},\Box_{j\in N(i)}\phi^{k}({X}_{i}^{k-1},{X}_{j}^{k-1},e_{j,i}))$ (2)

where ${X}_{i}^{k-1}$ represents the feature of node $i$ in the $(k-1)$-th graph convolution layer, and $e_{j,i}$ represents the edge weight from node $j$ to $i$. $\phi$ generates new feature representations from neighbor node features and edge weights, while $\Box$ and $\gamma$ represent the aggregation methods applied to the generated features.

Figure 3 shows the process of aggregating 2-hop neighbors’ features of the central node by two graph convolution layers. The $k$ -th graph convolution layer aggregates the $k$ -hop neighbor node features for the central node. When using existing graph-based methods for traffic forecasting, traffic hub nodes will get overlarge neighborhood ranges while edge nodes will get limited ones.
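The uneven neighborhood ranges described above can be seen on a toy graph. The sketch below is an illustration under assumptions, not the scheme of any cited model: it uses plain mean aggregation as $\Box$, takes $\phi$ to be the identity on the neighbor feature, and the graph and helper names are hypothetical.

```python
import numpy as np

# Toy undirected graph: node 0 is linked to 1, 2, 3; nodes 3 - 4 - 5 form a chain.
edges = [(0, 1), (0, 2), (0, 3), (3, 4), (4, 5)]
N = 6
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

def aggregate(X, A):
    """One mean-aggregation layer: each node averages itself and its 1-hop neighbours."""
    A_hat = A + np.eye(len(A))
    return (A_hat @ X) / A_hat.sum(axis=1, keepdims=True)

def receptive_field_sizes(A, k):
    """Number of nodes whose features can reach each node after k layers."""
    reach = np.linalg.matrix_power(A + np.eye(len(A)), k) > 0
    return reach.sum(axis=1)

X = np.eye(N)                        # one-hot features make the mixing visible
X2 = aggregate(aggregate(X, A), A)   # two stacked layers = 2-hop aggregation
sizes = receptive_field_sizes(A, 2)
print(sizes)                         # [5 4 4 6 4 3]
```

After two layers the well-connected nodes 0 and 3 already mix features from five or six of the six nodes, while the peripheral node 5 sees only three.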

4. Proposed model

This section focuses on traffic forecasting task on real road networks using Attention-based Spatial-Temporal Adaptive Integration Gated Network (AST-AIGN). AST-AIGN consists of three main components: feature encoding layer, two spatial-temporal layers, and prediction layer. The feature encoding and the prediction layer are composed of one and two layers of 1-D convolution, respectively.

Figure 4.

Overall structure of AST-AIGN.

4.1 Overall architecture

Figure 4 shows the overall structure of the AST-AIGN model, in which each spatial-temporal layer includes a module of spatial feature, a module of temporal feature and a data-dependent spatial-temporal adaptive integration gate. The traffic forecasting process is described as follows. The real traffic data $X\in R^{N\times T\times D}$ is convolved by the feature encoding layer to obtain the refined feature data $X_{L0}\in R^{N\times T\times D_{E}}$ as the input of the first spatial-temporal layer. $SM(\cdot)$ represents the module of spatial feature, $TM(\cdot)$ represents the module of temporal feature and $G(\cdot)$ represents the spatial-temporal adaptive integration gate. Features are extracted from $X_{L0}$ as follows. The spatial features $X_{L1}^{SM}=SM(X_{L0},A)$ are captured using the information from the initial adjacency matrix $A$. Meanwhile, the module of temporal feature extracts the temporal feature $X_{L1}^{TM}=TM(X_{L0})$. The gate then integrates both spatial and temporal features, $X_{L1}=G(X_{L0},X_{L1}^{TM},X_{L1}^{SM})$, as the output of the first spatial-temporal layer. Similarly, we get the output of the second spatial-temporal layer, $X_{L2}$. Finally, through the two-layer convolution operation of the prediction layer, the prediction result $Y\in R^{N\times{T}^{\prime}\times D}$ is obtained.

4.1.1 Module of spatial feature

The structure of the module of spatial feature is shown in the top left of Fig. 4, which contains two parts: the self-attention mechanism and JK-GAT. The traffic spatial features are highly dynamic due to weather changes, traffic accidents, and other factors. Specifically, the edge weights between each sensor and its neighboring sensors change over time. We use self-attention without initial topological relationships as the basis of our model to capture dynamic spatial dependence. We propose JK-GAT to model the directedness of the road network structure and obtain node feature representations that are adaptive to the topological structure. After getting $X_{L}^{\textit{attn}}$ (obtained from the self-attention mechanism) and $X_{L}^{\textit{JK-GAT}}$ (obtained from JK-GAT), we adopt a linear transformation to integrate the features and get the output of the module of spatial feature, $X_{L+1}^{SM}$.

$\displaystyle X_{L+1}^{SM}=W_{1}X_{L}^{\textit{attn}}+W_{2}X_{L}^{\textit{JK-GAT}}$ (3)

4.1.2 Module of temporal feature

The structure of the module of temporal feature is shown in the bottom left of Fig. 4. Capturing the long-term temporal dependence is important in the traffic forecasting task. A limited time dependence can hardly reflect the trend and periodicity of traffic flow adequately, so we also adopt the self-attention mechanism to capture the long-term temporal dependence $X_{L+1}^{TM}$ effectively.

4.2 Self-attention for extracting spatial feature and temporal feature

The self-attention mechanism is a variant of the attention mechanism, which is widely applied in computer vision and natural language processing. By calculating attention weights from its own input, the self-attention mechanism reduces the dependence on external information and is better suited to exploring the internal relevance of data or features. We adopt the self-attention mechanism to model dynamic spatial dependence and long-term temporal dependence. Taking the extraction of spatial features as an example, we first adopt spatial position embedding to combine a trainable, randomly initialized parameter matrix $M^{S}\in R^{N\times N}$ with $X_{L}$ and obtain the embedded feature data $X_{L}^{SE}=[M^{S},X_{L}]\in R^{N\times T\times D_{E}}$. Then $X_{L}^{SE}$ is projected into three subspaces: query subspace $Q^{SM}\in R^{N\times d_{q}^{S}}$, key subspace $K^{SM}\in R^{N\times d_{q}^{S}}$, and value subspace $V^{SM}\in R^{N\times d_{q}^{S}}$. The projection process can be expressed as follows:

$\displaystyle Q^{SM}=W^{SM}_{q}X_{L}^{SE},\quad K^{SM}=W^{SM}_{k}X_{L}^{SE},\quad V^{SM}=W^{SM}_{v}X_{L}^{SE}$ (4)

where $W^{SM}_{q}$, $W^{SM}_{k}$, and $W^{SM}_{v}$ are the spatial weight matrices for the query, key, and value projections, respectively.

After getting the projection subspace, the node correlation weight matrix $C_{L}^{SM}\in R^{N\times N}$ is obtained by scaled dot-product describing the undirected connectivity between traffic nodes.

$\displaystyle C_{L}^{SM}=\textit{softmax}\left(\frac{Q^{SM}(K^{SM})^{T}}{\sqrt{d_{k}}}\right)$ (5)

The weight matrix is projected back to the original space domain, and a three-layer feedforward neural network with nonlinear activations is connected to complete the extraction of the dynamic spatial dependence $X_{L}^{\textit{attn}}\in R^{N\times T\times D_{E}}$.

$\displaystyle X_{L}^{\textit{attn}}=\textit{Relu}(\textit{Relu}(C^{SM}_{L}V^{SM}W_{0}^{SM})W_{1}^{SM})W_{2}^{SM}$ (6)

where $W_{0}^{SM}$ , $W_{1}^{SM}$ , $W_{2}^{SM}$ are the weight matrices of the feedforward network.
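Equations (4) to (6) can be traced numerically for a single time step. The sketch below is an illustration under assumptions rather than the authors' implementation: weights are random instead of trained, the position embedding is omitted, features are stored one row per node so the projections are written as right-multiplications, and all sizes and names are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(X, Wq, Wk, Wv, W0, W1, W2):
    """Eqs. (4)-(6) for one time step; X has shape (N, D_E), one row per node."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                  # Eq. (4)
    C = softmax(Q @ K.T / np.sqrt(K.shape[-1]))       # Eq. (5), shape (N, N)
    relu = lambda z: np.maximum(z, 0.0)
    X_attn = relu(relu(C @ V @ W0) @ W1) @ W2         # Eq. (6)
    return X_attn, C

rng = np.random.default_rng(0)
N, D_E, d = 5, 8, 4                                   # toy sizes (assumed)
X = rng.normal(size=(N, D_E))
Wq, Wk, Wv = (rng.normal(size=(D_E, d)) for _ in range(3))
W0 = rng.normal(size=(d, D_E))
W1 = rng.normal(size=(D_E, D_E))
W2 = rng.normal(size=(D_E, D_E))
X_attn, C = spatial_self_attention(X, Wq, Wk, Wv, W0, W1, W2)
print(X_attn.shape, C.shape)                          # (5, 8) (5, 5)
```

Row $i$ of `C` holds the weights with which node $i$ attends to every node, so each row sums to one. Nothing in the computation refers to the adjacency matrix, which is why this branch can follow changing traffic conditions.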

When capturing long-term temporal dependence with self-attention, we adopt one-hot time encoding to initialize the temporal position embedding matrix $M^{T}$. The one-hot time encoding injects the time step information into each node in the traffic networks. After we obtain the temporal embedded feature data $X^{TE}_{L}=[M^{T},X_{L}]$, the same self-attention mechanism is adopted to capture the temporal dependence $X_{L+1}^{TM}$.

4.3 JK-GAT for extracting spatial feature

Graph convolution or neighborhood aggregation is inherently a smoothing operation on graph signals. Each node in the graph draws features from its neighborhood range, which depends on the topological position of the node. Unlike other deep learning networks, graph convolutional networks tend to produce over-smoothed results as they get deeper, leading to a performance decline in related prediction tasks. For the traffic forecasting task specifically, the features of traffic hub nodes are easily propagated to neighboring nodes. In contrast, the features of edge nodes are hard to propagate to other nodes, which makes it difficult to get appropriate feature representations. To overcome this problem, we introduce a general graph convolutional network framework, JK-Net, which enables layer-wise neural networks to obtain structure-aware feature representations. As shown in Fig. 5, JK-Net applies two changes to the networks: jump connections and a layer-aggregation mechanism.

In general graph convolution networks, each layer enlarges the neighborhood range reached by the previous layer. The jump connections pass the output of each layer directly to the last layer, where the layer-aggregation mechanism selects among the spatial feature representations obtained from each layer. Let $h_{v}^{k}\in R^{T\times D_{E}}$ be the feature at the $k$-th layer; the feature representations available at the last layer are then $[h_{v}^{1},h_{v}^{2},\ldots,h_{v}^{k}]$, where $v$ represents the node index.

Figure 5.

Structure of $k$ -layers JK-GAT.

The key idea of the layer-aggregation mechanism is to determine the importance of a node’s feature representation at different neighborhood ranges after looking at the features on all layers. JK-Net provides Concatenation and Max-pooling aggregation methods. Concatenation is the most direct way to combine features in all layers, which is suitable for small graphs and graphs with regular structure. Max-pooling can choose the most influential layer based on the traffic data pattern during training. Max-pooling is adaptive to topology structure and needs no additional parameters to train. The detail of the Max-pooling layer-aggregation mechanism can be expressed as follows:

$\displaystyle h_{\textit{Max-pooling}}=\max([h^{1}_{v}\lvert\rvert h^{2}_{v}\ldots\lvert\rvert h^{k}_{v}])$ (7)

where $\lvert\rvert$ represents the vector concatenation on the highest dimension of input data. After concatenating the layer-wise representations connected in the last layer through the jump connection, $\max(\cdot)$ selects the largest column in the concatenation result. Then, the chosen column is transformed into the same dimension as the input data, and we get the aggregation representation $h_{\textit{Max-pooling}}$ of the Max-pooling method. In the whole process, no external parameters are introduced.
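A small numerical example may help to compare the two layer-aggregation options. This is a sketch under a stated assumption: max-pooling is implemented as the element-wise maximum over the layer axis, which is how the original JK-Net formulation defines it and which keeps the output the same shape as one layer's features. The helper names are hypothetical.

```python
import numpy as np

def jk_concat(layer_outputs):
    """Concatenation aggregation: stack all layers along the feature axis."""
    return np.concatenate(layer_outputs, axis=-1)

def jk_max_pool(layer_outputs):
    """Max-pooling aggregation: element-wise max over layers, no parameters."""
    stacked = np.stack(layer_outputs, axis=-1)          # (..., k)
    return stacked.max(axis=-1), stacked.argmax(axis=-1)

# Outputs of k = 3 layers for one node v, each of shape (T, D_E) = (2, 2)
h1 = np.array([[1.0, 5.0], [0.0, 2.0]])
h2 = np.array([[3.0, 4.0], [1.0, 1.0]])
h3 = np.array([[2.0, 6.0], [0.0, 3.0]])

h_max, chosen_layer = jk_max_pool([h1, h2, h3])
print(h_max)          # [[3. 6.] [1. 3.]]
print(chosen_layer)   # [[1 2] [1 2]]  (0-based index of the winning layer)
print(jk_concat([h1, h2, h3]).shape)   # (2, 6)
```

The `chosen_layer` array shows which neighborhood range supplied each entry, so different features of the same node can be drawn from different depths.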

We adopt GAT in JK-Net to extract the directed spatial features in road networks. Figure 5 shows a stacked JK-GAT with a $k$-layer structure. GAT is a spatial domain graph convolutional network that uses the attention mechanism combined with the initial topological structure information to model the directedness of the traffic road network. From the perspective of neighborhood aggregation, the GAT process can be split into two stages:

Feature generation. Using the node characteristics ${h_{1}^{k},h_{2}^{k},\ldots,h_{v}^{k}}$ at the $k$ -th layer to calculate the correlation weights $a_{ij}$ of node $i$ with its 1-hop neighborhood node $j$ .

$\displaystyle{a_{ij}=\frac{\exp(\textit{LeakyReLU}(\vec{a}^{T}([W_{\textit{GAT}}h_{i}^{k}\lvert\rvert W_{\textit{GAT}}h_{j}^{k}])))}{\sum_{v\in V(i)}\exp(\textit{LeakyReLU}(\vec{a}^{T}([W_{\textit{GAT}}h_{i}^{k}\lvert\rvert W_{\textit{GAT}}h_{v}^{k}])))}}$ (8)

where $\lvert\rvert$ represents vector concatenation, $W_{\textit{GAT}}$ represents the attention mapping matrix, $V(i)$ represents the 1-hop neighbors of node $i$, $\vec{a}^{T}$ represents the weight vector of a single-layer feedforward neural network, and $\textit{LeakyReLU}(\cdot)$ represents the activation function. If $A_{ij}=0$ in the initial adjacency matrix, the correlation weight $a_{ij}$ is set to 0. Moreover, when the 1-hop neighbors of nodes $i$ and $j$ differ, $a_{ij}\neq a_{ji}$ holds, which describes the directedness of the traffic network. After getting the correlation weight, the new feature representation of node $j$, $F_{j}=a_{ij}W_{\textit{GAT}}h_{j}^{k}$, is generated.

Feature aggregation. After all the new feature representations of 1-hop neighbor are generated, the weighted aggregation of all 1-hop neighbor feature representations of node $i$ is used to update the feature of node $i$ :

$\displaystyle h_{i}^{k+1}=\sigma\left(\sum_{j\in V(i)}F_{j}\right)$ (9)

where $\sigma(\cdot)$ represents the nonlinear activation function.
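The two stages in Eqs. (8) and (9) can be checked on a small hub-and-spoke graph. The sketch below is a single-head illustration with random weights, not the authors' code. The LeakyReLU slope of 0.2, the use of ReLU for $\sigma$, the requirement that every node has at least one neighbor, and all names are assumptions.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_layer(H, A, W, a):
    """One GAT layer. H: (N, F_in), A: (N, N), W: (F_in, F_out), a: (2 * F_out,)."""
    mask = A != 0
    assert mask.any(axis=1).all(), "every node needs at least one neighbour"
    Z = H @ W                                           # W_GAT h for every node
    F_out = Z.shape[1]
    # a^T [z_i || z_j] splits into a[:F_out] . z_i + a[F_out:] . z_j
    scores = leaky_relu((Z @ a[:F_out])[:, None] + (Z @ a[F_out:])[None, :])
    scores = np.where(mask, scores, -np.inf)            # a_ij = 0 where A_ij = 0
    scores = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    alpha = e / e.sum(axis=1, keepdims=True)            # Eq. (8), softmax over V(i)
    H_next = np.maximum(alpha @ Z, 0.0)                 # Eq. (9) with sigma = ReLU
    return H_next, alpha

# Hub node 0 with three spokes; the adjacency matrix itself is symmetric.
A = np.zeros((4, 4))
for i, j in [(0, 1), (0, 2), (0, 3)]:
    A[i, j] = A[j, i] = 1.0

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=(4,))
H_next, alpha = gat_layer(H, A, W, a)
print(alpha[0, 1], alpha[1, 0])
```

Node 1 has the hub as its only neighbor, so `alpha[1, 0]` equals 1, whereas the hub spreads its attention over three spokes and `alpha[0, 1]` is strictly smaller. The learned weights are therefore asymmetric even though `A` is symmetric.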

Overall, JK-Net is a general framework of graph convolution to solve the over-smoothing problem in the learning process. We embed GAT into the JK-Net framework to combine their advantages in traffic forecasting tasks. The JK-GAT with $k$ layers structure combines the output of each layer by jump connections and layer-aggregation mechanism to obtain structure-aware feature representation $X_{L}^{\textit{JK-GAT}}\in R^{N\times T\times D_{E}}$ of traffic nodes.

$\displaystyle X_{L}^{\textit{JK-GAT}}=\textit{Layer-Aggregation}(h_{v}^{1},h_{v}^{2},\ldots,h_{v}^{k})$ (10)

4.4 Spatial-temporal adaptive integration gate

We extract the spatial and temporal features in parallel so that the features do not interfere with each other and degrade the prediction performance. In traffic data, the dominant feature of traffic conditions differs across traffic areas. We show the diversity of spatial-temporal feature distributions in Fig. 6. In business regions, spatial features have much more influence on traffic flow, while in industrial areas, temporal features have more influence. It is necessary to selectively integrate spatial-temporal features in different regions and highlight the features that significantly impact traffic conditions. POI data can reflect specific points with different functional attributes in urban areas, but this information is inaccessible in many scenarios. We propose a data-dependent function called spatial-temporal adaptive integration gate, which directly explores the distribution in the historical traffic data without any external information. Thus our method is more flexible and universal for traffic forecasting.

Figure 6.

Spatial-temporal feature distribution obtained from the gate: dark blue means spatial features are more influential in the region, and dark red means temporal features are more influential.

Definition 1: spatial-temporal adaptive integration gate. We use two successive 2-D convolution operations to learn the original spatial-temporal feature distribution in different traffic regions from the input data $X_{L}\in R^{N\times T\times D_{E}}$. A data-dependent vector $G\in R^{N\times T\times 2}$ is obtained through the gate:

$\displaystyle G=F(\sigma(N(F(X_{L}))))$ (11)

where $F(\cdot)$ represents the 2-D convolution function, $N(\cdot)$ represents the normalization function, and $\sigma(\cdot)$ represents the activation function, for which ReLU is generally used. The first 2-D convolution on $X_{L}$ is performed by $3\times 3\times D_{E}$ filters ($D_{E}$ is the number of convolution kernels). We add up the channels of the resulting $N\times T\times D_{E}$ feature map (matrix addition) to get a single-channel feature map of dimension $N\times T$. In the second 2-D convolution, we apply two convolution kernels with learnable parameters to the single-channel feature map to get two $N\times T$ feature maps (the vector $G$), which are the original spatial and temporal feature distributions, respectively. The values in the two feature maps represent the weights of the spatial and temporal features of every traffic node. A larger weight means the corresponding feature has a more significant influence on the traffic conditions around the traffic node.
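The computation in Eq. (11) can be sketched in NumPy as follows. This is only an illustration under several assumptions that the text does not fix: the helper names `conv2d_same` and `gate` are hypothetical, zero padding is used to keep the $N\times T$ size, the channel sum is applied directly after the first convolution, $N(\cdot)$ is taken to be a zero-mean, unit-variance standardization of the whole map, and the second convolution also uses $3\times 3$ kernels.

```python
import numpy as np

def conv2d_same(x, w):
    """Zero-padded 2-D convolution (cross-correlation, as in deep learning libraries).

    x: (N, T, C_in) feature map, w: (kh, kw, C_in, C_out) with odd kh and kw.
    """
    kh, kw, _, c_out = w.shape
    n, t, _ = x.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2), (0, 0)))
    out = np.zeros((n, t, c_out))
    for i in range(kh):
        for j in range(kw):
            # Shifted window times the (C_in, C_out) weights at kernel offset (i, j).
            out += xp[i:i + n, j:j + t, :] @ w[i, j]
    return out

def gate(x, w1, w2, eps=1e-5):
    """G = F(sigma(N(F(X_L)))) under the assumptions stated in the lead-in."""
    f1 = conv2d_same(x, w1).sum(axis=-1, keepdims=True)    # channel sum -> (N, T, 1)
    normed = (f1 - f1.mean()) / np.sqrt(f1.var() + eps)     # assumed normalization N
    activated = np.maximum(normed, 0.0)                     # ReLU
    return conv2d_same(activated, w2)                       # (N, T, 2)

rng = np.random.default_rng(0)
x_l = rng.normal(size=(5, 12, 8))               # N = 5 nodes, T = 12 steps, D_E = 8
w1 = 0.1 * rng.normal(size=(3, 3, 8, 8))        # D_E filters of size 3 x 3 x D_E
w2 = 0.1 * rng.normal(size=(3, 3, 1, 2))        # two kernels on the single-channel map
g = gate(x_l, w1, w2)
```

In a trained model `w1` and `w2` would be learned jointly with the rest of the network; the random weights here only demonstrate the shapes.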

Definition 2: activation function $\delta$. The vector $G$ is activated using the nonlinear activation function $\delta(\cdot)$ to obtain the final feature distribution. The function is designed as follows:

$\displaystyle\delta(\cdot)=\max(0,\textit{Tanh}(\cdot))$ (12)

Therefore, the distribution $\gamma_{L}=\left\{\begin{array}{l}\gamma^{SM}_{L}\\ \gamma^{TM}_{L}\end{array}\right\}\in R^{N\times T\times 2}$ can be calculated by $\delta(G)$. $\gamma^{SM}_{L}$ represents the weight of spatial features in different traffic regions, and $\gamma^{TM}_{L}$ represents the weight of temporal features. Fig. 6 visualizes how $\gamma^{SM}_{L}$ and $\gamma^{TM}_{L}$ represent the distribution of spatial-temporal features in different regions.

After obtaining the distribution through the gate, we can selectively fuse the dependences $X^{SM}_{L+1}$ and $X^{TM}_{L+1}$ captured by the spatial feature module and the temporal feature module. In addition, a residual connection is employed in each spatial-temporal layer to improve stability and speed up convergence during training. The integration process can be described as follows:

$\displaystyle X_{L+1}=(X_{L+1}^{TM}+X_{L})\gamma^{TM}_{L}+(X_{L+1}^{SM}+X_{L})\gamma^{SM}_{L}$ (13)
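Eqs. (12) and (13) can be sketched together in NumPy. The function names `delta` and `integrate` are hypothetical, and two details are assumptions for illustration: the first channel of $G$ is taken as the spatial weight and the second as the temporal weight, and each $N\times T$ weight map is broadcast over the $D_{E}$ feature dimension.

```python
import numpy as np

def delta(g):
    """Eq. (12): max(0, tanh(g)), so every weight lies in [0, 1)."""
    return np.maximum(0.0, np.tanh(g))

def integrate(x_prev, x_sm, x_tm, g):
    """Eq. (13): gate-weighted fusion of both branches, each with a residual term.

    x_prev, x_sm, x_tm: (N, T, D_E); g: (N, T, 2) raw gate output.
    """
    gamma = delta(g)
    gamma_sm = gamma[..., 0:1]   # spatial weights, broadcast over D_E (assumed order)
    gamma_tm = gamma[..., 1:2]   # temporal weights, broadcast over D_E (assumed order)
    return (x_tm + x_prev) * gamma_tm + (x_sm + x_prev) * gamma_sm

x_prev = np.ones((3, 4, 5))
x_sm = 2.0 * np.ones((3, 4, 5))
x_tm = 3.0 * np.ones((3, 4, 5))
closed = integrate(x_prev, x_sm, x_tm, np.full((3, 4, 2), -5.0))   # both gates shut
opened = integrate(x_prev, x_sm, x_tm, np.full((3, 4, 2), 20.0))   # both gates saturated
```

Because $\delta$ clips negative values to zero, a region can suppress one branch entirely, which is what allows the gate to express that only spatial or only temporal features matter there.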

Table 1

Detailed information on the PEMS04 and PEMS08 datasets

Datasets	Nodes	Prediction feature	Max	Min	Mean	Covered distance	Data unit
PEMS04	307	Traffic flow	919	0	211.70	$\sim$ 130 km	Vehicle/5-min
PEMS08	107	Traffic flow	1147	0	230.68	$\sim$ 90 km	Vehicle/5-min

5. Experiment

5.1 Dataset description

To validate the effectiveness of the proposed model, AST-AIGN needs to be verified on large-scale datasets. Two California highway traffic datasets, PEMS04 and PEMS08, were selected for experimental validation because of their integrity, correctness, and ease of acquisition and processing. The raw traffic data are collected by the Caltrans Performance Measurement System (PEMS) in real time at 30-second intervals and aggregated every 5 minutes to obtain the traffic datasets. The system deploys more than 39,000 traffic sensors on California freeways and records the location information of each sensor. Of the three traffic attributes (flow, average speed, and average occupancy), we select traffic flow as the experimental data. Detailed information on the two datasets is shown in Table 1.

The PEMS04 dataset contains traffic data collected by 3848 sensors on 29 roads in the San Francisco Bay area over a 58-day period from January to February 2018. PEMS04 contains relative distance information among 340 sensors, sensor distribution, and adjacency matrix distribution.

The PEMS08 dataset contains traffic data collected by 1979 sensors on 8 roads in the San Bernardino area over a total of 62 days from July to August 2016. PEMS08 contains information on the relative distances among 295 sensors.

The traffic data are normalized to the interval [0, 1], with the first 80% of the traffic data used as the training set and the remaining 20% as the test set. We use the most recent 60 minutes (12 time steps) of traffic flow data to predict traffic flow at prediction horizons of 5 min, 15 min, 30 min, 45 min, and 60 min (1, 3, 6, 9, and 12 time steps).
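The preprocessing described above can be sketched as follows. The helper names `min_max_normalize` and `make_windows`, the synthetic flow series, and the choice of computing the minimum and maximum over the whole series are assumptions for illustration; the text does not state which statistics are used for normalization.

```python
import numpy as np

def min_max_normalize(x):
    """Scale values linearly to the interval [0, 1]."""
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

def make_windows(series, n_in=12, horizon=12):
    """Build (input window, target) pairs from a (steps, N) series.

    Each input holds n_in consecutive steps; the target is the step that lies
    `horizon` steps after the end of the input window.
    """
    xs, ys = [], []
    for start in range(len(series) - n_in - horizon + 1):
        xs.append(series[start:start + n_in])
        ys.append(series[start + n_in + horizon - 1])
    return np.stack(xs), np.stack(ys)

rng = np.random.default_rng(0)
flow = rng.integers(0, 900, size=(100, 3)).astype(float)   # synthetic: 100 steps, 3 sensors
norm = min_max_normalize(flow)
split = int(0.8 * len(norm))                                 # chronological 80/20 split
train, test = norm[:split], norm[split:]
x_train, y_train = make_windows(train, n_in=12, horizon=3)  # 15 min horizon
```

Splitting chronologically rather than at random keeps every test window strictly later than the training data, which matches the forecasting setting.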

We utilize the distance information to determine the initial weights among sensors, which describe the degree of mutual influence between traffic nodes. The adjacency matrix $A$ that depicts the road network topology structure is constructed based on the equation:

$\displaystyle A_{ij}=\left\{\begin{array}{ll}\exp\left(-\frac{d_{ij}^{2}}{\sigma^{2}}\right),&i\neq j\\ 1,&\text{otherwise}\end{array}\right.$ (14)

where $A_{ij}$ represents the edge weight determined by the distance $d_{ij}$ between sensors $i$ and $j$, and $\sigma^{2}$ is the threshold controlling the distribution and sparsity of the adjacency matrix, which is set to 50. In addition, we add the self-loops of the nodes to the adjacency matrix. The adjacency matrices of PEMS04 and PEMS08 are visualized in Fig. 7.
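Eq. (14) translates almost directly into NumPy. In the sketch below, the function name `gaussian_adjacency` and the small distance matrix are hypothetical, and the distance unit is left unspecified because the text does not give it.

```python
import numpy as np

def gaussian_adjacency(dist, sigma2=50.0):
    """Eq. (14): A_ij = exp(-d_ij^2 / sigma^2) for i != j, and 1 on the diagonal."""
    a = np.exp(-np.square(dist) / sigma2)
    np.fill_diagonal(a, 1.0)   # self-loops
    return a

# Hypothetical pairwise distances between three sensors.
dist = np.array([[0.0, 2.0, 10.0],
                 [2.0, 0.0, 5.0],
                 [10.0, 5.0, 0.0]])
adj = gaussian_adjacency(dist)
```

With $\sigma^{2}=50$, a pair at distance 2 keeps a weight of about 0.92 while a pair at distance 10 drops to about 0.14, so the kernel down-weights distant sensors smoothly instead of cutting them off.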

Figure 7.

Sensor distribution and heat map of the adjacency matrix for the PEMS08 and PEMS04 datasets.

5.2 Evaluation metrics and baselines

To evaluate the performance of the proposed model, the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE) are used as evaluation metrics. Let $x=x_{1},x_{2},\ldots,x_{\Omega}$ denote the true values and $\hat{x}=\hat{x}_{1},\hat{x}_{2},\ldots,\hat{x}_{\Omega}$ denote the predicted values.

Mean Absolute Error

$\displaystyle\textit{MAE}(x,\hat{x})=\frac{1}{\lvert\Omega\rvert}\sum_{i\in\Omega}\lvert x_{i}-\hat{x}_{i}\rvert$ (15)

Root Mean Square Error

$\displaystyle\textit{RMSE}(x,\hat{x})=\sqrt{\frac{1}{\lvert\Omega\rvert}\sum_{i\in\Omega}(x_{i}-\hat{x}_{i})^{2}}$ (16)

Mean Absolute Percentage Error

$\displaystyle\textit{MAPE}(x,\hat{x})=\frac{1}{\lvert\Omega\rvert}\sum_{i\in\Omega}\left\lvert\frac{x_{i}-\hat{x}_{i}}{x_{i}}\right\rvert$ (17)
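The three metrics in Eqs. (15) to (17) can be computed with a few lines of NumPy. One detail is an assumption: Eq. (17) is undefined where $x_{i}=0$, and Table 1 lists a minimum flow of 0, so this sketch skips zero-valued ground truth in MAPE. The text does not state how such entries are handled.

```python
import numpy as np

def mae(x, x_hat):
    """Eq. (15): mean absolute error."""
    return float(np.mean(np.abs(x - x_hat)))

def rmse(x, x_hat):
    """Eq. (16): root mean square error."""
    return float(np.sqrt(np.mean(np.square(x - x_hat))))

def mape(x, x_hat):
    """Eq. (17) as a fraction, skipping entries whose true value is 0 (assumption)."""
    mask = x != 0
    return float(np.mean(np.abs((x[mask] - x_hat[mask]) / x[mask])))

truth = np.array([100.0, 200.0, 0.0, 50.0])   # hypothetical flow values
pred = np.array([110.0, 180.0, 5.0, 50.0])
scores = (mae(truth, pred), rmse(truth, pred), mape(truth, pred))
```

Multiplying the MAPE value by 100 gives the percentage form used in the result tables.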

To conduct convincing experiments, we choose several widely adopted or state-of-the-art approaches as baseline models for comparison with the proposed model.

FC-LSTM [46]. A network model combining long short-term memory networks with fully connected layers. The size of the input layer and the number of hidden layers are both set to 3, and the numbers of hidden units in the hidden layers are [400, 400, 400].

DCRNN [14]. An improved seq2seq model combined with bi-directional diffusion graph convolution. In the model, the receptive field width of the filter is set to 3, and the number of filters is set to 64.

STGCN [11]. Spatial-temporal graph convolutional network, combining graph convolution with one-dimensional convolutional units. The channels of three layers in convolution blocks are [64, 16, 64] respectively. Both the spatial and temporal convolution kernel sizes are set to 3.

ASTGCN(r) [18]. Attention-based spatial-temporal graph convolutional network, which introduces spatial-temporal attention into the network structure. The number of attention layers is set to 3, and the kernel size along the temporal dimension is set to 3.

GraphWaveNet [12]. GraphWaveNet introduces an adaptive adjacency matrix with 1-D diffusion convolution in graph convolution. The dilation factors of the 8 blocks are set to [1, 2, 1, 2, 1, 2, 1, 2]. The diffusion step in graph convolution is set to 2.

STTNs [20]. A stackable deep spatial-temporal graph convolutional network that combines the self-attention mechanism with graph convolution. In the model, the number of blocks is set to 3, the numbers of spatial and temporal attention layers are set to [4, 2], respectively, and the number of attention heads is set to 2.

5.3 Experimental settings

The experiments were conducted in a Tianhe supercomputing environment with 2 Intel(R) Xeon(R) Gold 6132 14-core CPUs and 4 GPUs per compute node. Each compute node has 256 GB of memory (shared between the 2 CPUs). The GPUs are NVIDIA Tesla V100 SXM2 with 16 GB of HBM2 memory. We apply AST-AIGN with 2 spatial-temporal layers. The spatial feature module and the temporal feature module use single-head, single-layer self-attention. The input data are extended to 64 dimensions through a 1-D convolution operation in the input layer. We use a uniformly initialized random distribution of size 2 to ensure the reproducibility of the experimental results. During model training, we use the Adam optimizer with a learning rate of 0.001 that decays by 3% every 50 epochs, MSE as the loss function, a batch size of 64, 200 training epochs, and a dropout rate of 0.05.
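The learning-rate schedule described above can be written as a small function. This sketch assumes the decay is applied as a staircase, that is, the rate is multiplied by 0.97 once after every 50 training passes over the data; the text does not say whether the decay is stepwise or continuous, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.03, step=50):
    """Staircase decay: shrink the rate by `decay` after every `step` epochs."""
    return base_lr * (1.0 - decay) ** (epoch // step)

# Rates at the start of each 50-epoch stage of a 200-epoch run.
schedule = [learning_rate(e) for e in (0, 50, 100, 150)]
```

Under this reading the rate falls from 0.001 to roughly 0.00091 in the last stage, so the decay is mild and mainly stabilizes the later part of training.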

The number of JK-GAT layers $k=$ [1, 2, 3, 4, 5] and the layer-aggregation method [Concatenation, Max-pooling] are two important parameters of AST-AIGN, and different values directly affect the prediction accuracy. We first evaluate AST-AIGN with single-layer JK-GAT and different aggregation methods on predicting traffic flow 5 minutes ahead. Table 2 shows the results achieved by the AST-AIGN model on the PEMS04 and PEMS08 datasets. For both datasets, the prediction results are better when the ‘Max-pooling’ aggregation method is used. This is because the ‘Concatenation’ method simply splices all the layer-wise representations without selection, whereas the ‘Max-pooling’ method is adaptive and can select the most influential representations from all layers. It is therefore reasonable that ‘Max-pooling’ performs better than ‘Concatenation’, and we choose the ‘Max-pooling’ method to aggregate layers.

Table 2
Results of AST-AIGN with single-layer JK-GAT and different aggregation methods

	PEMS04			PEMS08
Aggregation methods	RMSE	MAE	MAPE	RMSE	MAE	MAPE
Concatenation	27.73	17.41	11.92%	20.45	13.36	8.55%
Max-pooling	27.61	17.33	11.84%	20.37	13.22	8.46%

Table 3

Results of ablation experiments for both datasets

	PEMS04			PEMS08
Models	RMSE	MAE	MAPE	RMSE	MAE	MAPE
AST-AIGN-para	33.82	21.95	15.64%	28.77	18.93	12.06%
AST-AIGN-gcn	33.01	21.35	15.16%	26.85	17.44	10.97%
AST-AIGN-last	32.97	21.32	15.19%	27.48	17.90	11.14%
AST-AIGN	32.75	21.20	15.26%	26.69	17.37	11.05%

Table 4

Results of AST-AIGN with $k$ -layers of JK-GAT for both datasets

	PEMS04 (5 min)			PEMS08 (5 min)
Layers	RMSE	MAE	MAPE	RMSE	MAE	MAPE
1	27.61	17.33	11.84%	20.50	13.33	8.45%
2	27.60	17.33	11.82%	20.40	13.25	8.47%
3	27.66	17.35	11.74%	20.48	13.33	8.50%
4	27.43	17.19	11.73%	20.52	13.33	8.48%
5	27.61	17.33	11.81%	20.37	13.22	8.46%
	PEMS04 (15 min)			PEMS08 (15 min)
Layers	RMSE	MAE	MAPE	RMSE	MAE	MAPE
1	28.49	17.99	12.53%	21.70	14.07	8.92%
2	28.46	17.95	12.41%	21.70	14.07	8.92%
3	28.40	17.92	12.15%	22.02	14.36	9.03%
4	28.49	17.96	12.35%	22.04	14.37	9.07%
5	29.11	18.43	12.80%	21.67	14.04	8.82%
	PEMS04 (30 min)			PEMS08 (30 min)
Layers	RMSE	MAE	MAPE	RMSE	MAE	MAPE
1	29.18	18.49	12.94%	22.69	14.71	9.20%
2	29.19	18.49	12.96%	22.69	14.71	9.21%
3	29.18	18.49	12.94%	22.66	14.70	9.15%
4	29.10	18.43	12.70%	22.57	14.65	9.16%
5	29.25	18.57	12.89%	22.37	14.50	9.06%
	PEMS04 (45 min)			PEMS08 (45 min)
Layers	RMSE	MAE	MAPE	RMSE	MAE	MAPE
1	30.74	19.66	13.75%	24.67	15.95	9.99%
2	30.74	19.66	13.75%	24.68	15.95	9.99%
3	30.78	19.67	13.80%	24.67	16.01	10.13%
4	30.69	19.60	13.73%	24.58	15.90	9.99%
5	30.97	19.79	13.82%	24.54	15.93	10.11%
	PEMS04 (60 min)			PEMS08 (60 min)
Layers	RMSE	MAE	MAPE	RMSE	MAE	MAPE
1	32.93	21.31	15.39%	26.69	17.37	11.05%
2	32.75	21.20	15.26%	26.72	17.39	11.06%
3	33.01	21.33	15.08%	27.88	18.24	11.47%
4	33.24	21.51	15.40%	27.01	17.56	11.14%
5	33.16	21.51	15.45%	27.27	17.77	11.15%

Figure 8.

Relationship of the JK-GAT layers and the prediction horizon with accuracy from AST-AIGN model on PEMS04 and PEMS08. The $x$ -axis is the prediction horizon and the $y$ -axis is the prediction accuracy.

Table 5

Comparison of prediction results among AST-AIGN and baselines on both datasets

	PEMS04 (5 min)			PEMS08 (5 min)
Models	RMSE	MAE	MAPE	RMSE	MAE	MAPE
FC-LSTM	44.42	31.28	29.78%	35.93	25.54	19.41%
STGCN	32.97	21.48	15.71%	27.90	18.91	12.41%
DCRNN	31.99	19.41	16.17%	25.06	16.51	16.08%
ASTGCN(r)	28.21	17.56	12.45%	21.20	13.72	9.21%
GraphWaveNet	27.60	17.31	11.60%	20.65	12.45	8.51%
STTNs	27.80	17.51	11.92%	20.55	13.33	8.46%
AST-AIGN	27.43	17.19	11.73%	20.37	13.22	8.46%
	PEMS04 (15 min)			PEMS08 (15 min)
Models	RMSE	MAE	MAPE	RMSE	MAE	MAPE
FC-LSTM	47.69	34.25	35.63%	39.40	28.85	23.42%
STGCN	33.09	21.34	12.86%	27.16	18.28	12.05%
DCRNN	32.78	20.12	17.56%	25.87	16.92	15.86%
ASTGCN(r)	29.55	18.51	12.64%	23.02	14.54	9.32%
GraphWaveNet	29.22	18.48	12.56%	22.56	14.35	9.12%
STTNs	28.78	18.05	12.51%	22.00	14.19	9.00%
AST-AIGN	28.40	17.92	12.15%	21.67	14.04	8.82%
	PEMS04 (30 min)			PEMS08 (30 min)
Models	RMSE	MAE	MAPE	RMSE	MAE	MAPE
FC-LSTM	50.36	36.51	39.62%	42.00	31.10	26.27%
STGCN	34.45	22.20	16.08%	28.14	18.88	12.39%
DCRNN	33.44	20.50	16.60%	26.88	17.60	15.83%
ASTGCN(r)	30.20	19.56	13.68%	23.22	14.98	9.87%
GraphWaveNet	29.92	18.93	13.15%	22.80	14.88	9.72%
STTNs	29.34	18.45	12.77%	22.56	14.59	9.29%
AST-AIGN	29.10	18.43	12.70%	22.37	14.50	9.06%
	PEMS04 (45 min)			PEMS08 (45 min)
Models	RMSE	MAE	MAPE	RMSE	MAE	MAPE
FC-LSTM	57.67	42.74	50.87%	49.38	37.04	32.24%
STGCN	36.34	23.68	17.55%	29.05	19.40	12.66%
DCRNN	35.12	21.87	18.04%	28.60	18.62	16.36%
ASTGCN(r)	32.02	20.26	14.95%	25.99	16.67	10.65%
GraphWaveNet	31.68	20.16	14.05%	24.95	15.98	10.43%
STTNs	31.09	19.65	13.61%	24.74	15.96	10.00%
AST-AIGN	30.69	19.60	13.73%	24.54	15.93	10.11%
	PEMS04 (60 min)			PEMS08 (60 min)
Models	RMSE	MAE	MAPE	RMSE	MAE	MAPE
FC-LSTM	65.24	56.80	62.45%	67.33	50.93	44.29%
STGCN	39.76	26.27	19.94%	32.72	21.99	14.33%
DCRNN	38.47	24.62	21.05%	32.23	21.28	18.11%
ASTGCN(r)	37.06	24.04	17.23%	28.70	18.65	11.25%
GraphWaveNet	33.46	22.10	15.59%	27.41	17.83	11.61%
STTNs	33.73	21.68	15.53%	27.06	17.56	11.02%
AST-AIGN	32.75	21.20	15.26%	26.69	17.37	11.05%

Next, we conduct traffic flow forecasting experiments for all prediction horizons with different numbers of JK-GAT layers on both datasets. The results are summarized in Table 4. Figure 8a shows the number of JK-GAT layers that achieves the best results at each prediction horizon on PEMS04. For every number of JK-GAT layers, the RMSE of AST-AIGN shows an overall increasing trend as the prediction horizon becomes longer, which indicates that a longer horizon degrades the prediction performance to some extent. On the short-term and medium-term forecasting tasks, a larger number of JK-GAT layers achieves the lowest error, which suggests that the influence of a broader range of traffic edge nodes needs to be considered. On the long-term forecasting task, in contrast, the lowest error is obtained with two JK-GAT layers, which suggests that adjacent nodes are more influential. Specifically, the best prediction results for the prediction horizons [5 min, 15 min, 30 min, 45 min] were obtained with [4, 3, 4, 4] JK-GAT layers, respectively, whereas at the prediction horizon of 60 min the best result was obtained with two JK-GAT layers.

Similar conclusions can be drawn from the experimental results on PEMS08. As shown in Fig. 8b, the best prediction results at the prediction horizons [5 min, 15 min, 30 min, 45 min] were all obtained with 5 JK-GAT layers, while at the prediction horizon of 60 min the best result was obtained with a single JK-GAT layer. This indicates that the two datasets have similar traffic flow patterns.
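The layer counts quoted above can be re-derived from the RMSE columns of Table 4 by taking, for each horizon, the number of layers with the lowest RMSE. The short Python check below copies those RMSE values from the table; the dictionary layout and variable names are only illustrative.

```python
# RMSE from Table 4, indexed by horizon in minutes; list position = layers - 1.
rmse = {
    "PEMS04": {
        5: [27.61, 27.60, 27.66, 27.43, 27.61],
        15: [28.49, 28.46, 28.40, 28.49, 29.11],
        30: [29.18, 29.19, 29.18, 29.10, 29.25],
        45: [30.74, 30.74, 30.78, 30.69, 30.97],
        60: [32.93, 32.75, 33.01, 33.24, 33.16],
    },
    "PEMS08": {
        5: [20.50, 20.40, 20.48, 20.52, 20.37],
        15: [21.70, 21.70, 22.02, 22.04, 21.67],
        30: [22.69, 22.69, 22.66, 22.57, 22.37],
        45: [24.67, 24.68, 24.67, 24.58, 24.54],
        60: [26.69, 26.72, 27.88, 27.01, 27.27],
    },
}

# Best number of JK-GAT layers per dataset and horizon (lowest RMSE).
best_layers = {
    dataset: {h: values.index(min(values)) + 1 for h, values in table.items()}
    for dataset, table in rmse.items()
}
```

The margins between layer counts are often a few hundredths of RMSE, so the selected number of layers should be read as a tendency rather than a sharp optimum.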

5.4 Comparison study

Table 5 compares the prediction performance of AST-AIGN with the above baselines on the PEMS04 and PEMS08 datasets for prediction time steps [1, 3, 6, 9, 12]. We can make the following observations from the experimental results in Table 5.

As shown in Table 5, all three metrics (RMSE, MAPE, and MAE) of each model grow when the prediction time step becomes longer, which indicates that the longer the prediction time step, the worse the prediction performance of each model. That is because predicting farther points in the time dimension requires much more accurate temporal dependence.

As can be seen from Table 5, FC-LSTM is the worst performing model because it is usually used to capture temporal features but not spatial features. When the prediction time step becomes longer, the FC-LSTM prediction error grows most rapidly.

The graph-based deep learning networks STGCN and DCRNN achieve better performance than FC-LSTM on both datasets because they capture both spatial and temporal features for the final prediction.

ASTGCN and GraphWaveNet outperform the STGCN and DCRNN models on both datasets. ASTGCN introduces an attention mechanism to describe dynamic spatial features, and GraphWaveNet proposes an adaptive adjacency matrix to achieve the same effect. These results indicate the effectiveness of modeling dynamic spatial dependence. The STTNs model builds on the Transformer framework to extract dynamic spatial dependence and long-term temporal dependence; it performs better than ASTGCN but only slightly better than GraphWaveNet. The STTNs model extracts spatial-temporal dependences step by step, resulting in insufficient integration of spatial-temporal features.

Our AST-AIGN achieves the lowest RMSE at all prediction time steps on both datasets and the lowest MAE in all but one case, while its MAPE is the lowest or close to the lowest (Table 5). We attribute this to JK-GAT capturing spatial features that more accurately reflect the topological structure and to the data-dependent gate handling the diversity of the road networks’ feature distribution.

To analyze the performance of each model at different prediction time steps, we visualize the results of the error analysis. The prediction results of each model under different prediction horizons are shown in Figs 9 and 10, which show that the prediction accuracy of each model generally decreases as the horizon becomes longer. However, AST-AIGN achieves the best prediction results regardless of the horizon. Moreover, the prediction results of our model change less as the horizon grows, indicating that AST-AIGN is less affected by the scale of the prediction horizon and is suitable for prediction tasks at different time scales.

In summary, the results of the comparison study show that our AST-AIGN is effective in modeling complex spatial-temporal dependences and can achieve promising performance in traffic forecasting tasks.

Figure 9.

Comparison of prediction results of all models at different time steps on PEMS04.

Figure 10.

Comparison of prediction results of all models at different time steps on PEMS08.

5.5 Ablation study

To verify the effectiveness of each component of AST-AIGN, we further design three variants: (1) AST-AIGN-para removes the adaptive integration gate from the model. (2) AST-AIGN-gcn replaces the GAT layer with a GCN layer, which makes it difficult for the model to capture the directness of the traffic road network; the GCN layers are still connected to each other through ‘Max-pooling’. (3) AST-AIGN-last has no ‘Max-pooling’ connection between GAT layers. We compare the prediction performance of AST-AIGN with its three variants at 12 time steps on both datasets, and from Table 3 we can observe that:

AST-AIGN-para has the highest RMSE, MAE, and MAPE on both datasets. This indicates that not considering the spatial-temporal feature distribution leads to inadequate integration of dependences, which ultimately results in poorer prediction results.

AST-AIGN-gcn and AST-AIGN-last both outperform AST-AIGN-para on both datasets. On PEMS04, AST-AIGN-last achieves the better results of the two (lower RMSE and MAE). This indicates that in the dataset with a larger number of nodes, describing the directness among nodes is more significant than obtaining the structure-aware feature representation for each node. Conversely, AST-AIGN-gcn achieves better results on the PEMS08 dataset, which has fewer nodes.

Compared with the three variants, AST-AIGN achieves the lowest RMSE and MAE on both datasets. This demonstrates the advantages of our AST-AIGN in extracting spatial-temporal features from traffic flow data.

Overall, the results of the ablation experiments demonstrate the effectiveness of AST-AIGN in modeling spatial-temporal relationships. JK-GAT and the spatial-temporal adaptive integration gate both play important roles in exploiting the spatial-temporal information in traffic data.
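To put the ablation results in relative terms, the RMSE values in Table 3 can be converted into the percentage by which the full model reduces each variant's error. The snippet below copies the RMSE values from Table 3; the helper `relative_reduction` is a hypothetical name, and the percentages are derived here rather than reported in the original text.

```python
# RMSE from Table 3.
rmse = {
    "PEMS04": {"AST-AIGN-para": 33.82, "AST-AIGN-gcn": 33.01,
               "AST-AIGN-last": 32.97, "AST-AIGN": 32.75},
    "PEMS08": {"AST-AIGN-para": 28.77, "AST-AIGN-gcn": 26.85,
               "AST-AIGN-last": 27.48, "AST-AIGN": 26.69},
}

def relative_reduction(variant_rmse, full_rmse):
    """Percentage by which the full model lowers the variant's RMSE."""
    return 100.0 * (variant_rmse - full_rmse) / variant_rmse

reductions = {
    dataset: {
        name: relative_reduction(value, scores["AST-AIGN"])
        for name, value in scores.items() if name != "AST-AIGN"
    }
    for dataset, scores in rmse.items()
}
```

On these numbers, removing the gate costs about 3.2% RMSE on PEMS04 and about 7.2% on PEMS08, which is larger than the effect of either JK-GAT variant and consistent with AST-AIGN-para being the weakest variant.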

5.6 Model interpretation

To better understand AST-AIGN proposed in this paper, we visualize the results of the comparison study. In PEMS04, we choose the traffic flow data of sensor 123 on one day in February 2018 to visualize. In PEMS08, we choose sensor 110 on one day in August 2016. Figure 11 shows prediction results of all models on different time steps compared to ground truth. We can observe that:

AST-AIGN tends to generate smooth predictions when there are oscillation points in the traffic flow data. This indicates that the model has better robustness and is less vulnerable to sudden change points.

Figure 11.

Comparison visualization results for different prediction time steps of all models on PEMS04 (top) and PEMS08 (bottom).

When the prediction time step becomes longer, the prediction results of AST-AIGN remain stable, while the prediction results of other baselines vary more. This reflects that the prediction time step has limited influence on the model.

To visually investigate the role of JK-GAT in modeling the directness of road networks, we select the first 50 sensors in PEMS04 and present heatmaps of the adjacency matrix and the graph attention matrix in Fig. 12. In the heatmaps, the value in the $i$-th row and $j$-th column represents the influence weight from sensor $i$ to sensor $j$. Of these 50 sensors, sensor 42 has nine 1-hop neighbors, which makes it a traffic hub node, whereas sensor 12 links with only two 1-hop neighbors, which makes it a traffic edge node. As shown in Fig. 12, the influence weights between sensors 42 and 12 in the adjacency matrix are the same in both directions. In the graph attention matrix, we can observe that sensor 42 has much more influence on sensor 12 than vice versa, which reflects the directness of road networks. This analysis shows that AST-AIGN not only achieves promising prediction results but also offers the advantage of interpretability.
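One way to see why attention weights can be asymmetric even when the adjacency weights are symmetric is that attention is normalized separately over each node's own neighborhood. The toy Python example below is a simplified illustration and not the learned attention of JK-GAT: it assumes identical raw attention scores everywhere and a self-loop on every node, and the variable names are hypothetical.

```python
import math

def softmax(scores):
    """Normalize raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A hub with nine 1-hop neighbors and an edge node with two, each also
# attending to itself, with the same raw score for every connection.
hub_weights = softmax([0.0] * (9 + 1))    # weights the hub spreads over 10 nodes
edge_weights = softmax([0.0] * (2 + 1))   # weights the edge node spreads over 3 nodes

weight_edge_assigns_to_hub = edge_weights[0]   # share of the edge node's update
weight_hub_assigns_to_edge = hub_weights[0]    # share of the hub's update
```

Even in this uniform setting the hub accounts for one third of the edge node's update while the edge node accounts for only one tenth of the hub's, so the pair of weights is no longer symmetric; learned scores can sharpen or soften this difference.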

Figure 12.

Adjacency matrix (left) and the graph attention matrix (right) obtained from the JK-GAT.

6. Conclusion

To discover the inherent spatial-temporal patterns in traffic flow and generate accurate predictions, we propose a new spatial-temporal traffic data prediction network structure AST-AIGN, which introduces a data-dependent adaptive integration gate by considering different spatial-temporal feature distributions to fuse spatial-temporal features. In addition, we employ JK-GAT to portray the directness among traffic nodes and obtain a better structure-aware feature representation of all nodes. Finally, AST-AIGN is used to handle the spatial-temporal traffic forecasting task, evaluated on two real datasets, PEMS04 and PEMS08, and compared with FC-LSTM, STGCN, DCRNN, ASTGCN(r), GraphWaveNet, and STTNs. AST-AIGN can successfully capture spatial-temporal features and is not limited to traffic forecasting but can also be applied to other spatial-temporal tasks.

Despite the superior performance of our model in traffic flow prediction tasks, there are some research gaps that we will focus on in the future, including the effects of external factors (weather and social events) on traffic flow prediction and the model’s generalizability to other spatial-temporal prediction tasks such as traffic accident prediction and weather forecasting.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) (U1911205) and the Hubei Provincial Department of Natural Resources Science and Technology Project (No. ZRZY2023KJ15).

Conflict of interest

The authors declare that they have no conflict of interest.

Data availability statements

Datasets used in this paper can be downloaded at http://pems.dot.ca.gov.

References

Mori

Mendiburu

Álvarez

and Lozano

J.A.

, A review of travel time estimation and forecasting for advanced traveller information systems, Transportmetrica A: Transport Science 11(2) (2015), 119–157.

Wang

and Ren

, Research on short-term traffic flow prediction based on microcosmic simulation, Journal of System Simulation 21(14) (2009), 4501–4503.

Xueli

Shuai

and Jianjun

, Application of quadric exponential smoothing model in short-term prediction of traffic information, Journal of Highway and Transportation Research and Development 28(S1) (2011), 101–104.

Van Lint

and Van Hinsbergen

, Short-term traffic and travel time prediction models, Artificial Intelligence Applications to Critical Transportation Issues 22(1) (2012), 22–41.

Qiu

and Yang

, A short-term traffic flow forecast algorithm based on double seasonal time series, Journal of Sichuan University (Engineering Science Edition) 45(5) (2013), 64–68.

Han

and Xu

, Short-term traffic flow forecasting model based on support vector machine regression, Journal of South China University of Technology (Natural Science Edition) 41(9) (2013), 71–76.

Lin

Z.S.C.

and Zhaoming

, Bayesian network model for traffic flow estimation using prior link flows, Journal of Southeast University (English Edition) 29(3) (2013), 322–327.

and Tan

, Short-term traffic flow forecasting with spatial-temporal correlation in a hybrid deep learning framework, arXiv preprint arXiv:1612.01022, 2016.

Liu

Zheng

Feng

and Chen

, Short-term traffic flow prediction with conv-lstm, in: 2017 9th International Conference on Wireless Communications and Signal Processing (WCSP), IEEE, 2017, pp. 1–6.

10.

Kipf

T.N.

and Welling

, Semi-supervised classification with graph convolutional networks, arXiv preprint arXiv:1609. 02907, 2016.

11.

Yin

and Zhu

, Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting, arXiv preprint arXiv:1709.04875, 2017.

12.

Pan

Long

Jiang

and Zhang

, Graph wavenet for deep spatial-temporal graph modelingï¼Œ arXiv preprint arXiv:1906.00121, 2019.

13.

Song

Lin

Guo

and Wan

, Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 914–921.

14.

Shahabi

and Liu

, Diffusion convolutional recurrent neural network: Data-driven traffic forecasting, arXiv preprint arXiv:1707.01926, 2017.

15.

Zhao

Song

Zhang

Liu

Wang

Lin

Deng

and Li

, T-gcn: A temporal graph convolutional network for traffic prediction. IEEE Transactions on Intelligent Transportation Systems 21(9) (2019), 3848–3858.

16.

Cui

Henrickson

and Wang

, Traffic graph convolutional recurrent neural network: A deep learning framework for network-scale traffic learning and forecasting, IEEE Transactions on Intelligent Transportation Systems 21(11) (2019), 4883–4894.

17.

Veličković

Cucurull

Casanova

Romero

Lio

and Bengio

, Graph attention networks, arXiv preprint arXiv:1710.10903, 2018.

18.

Guo

Lin

Feng

Song

and Wan

, Attention based spatial-temporal graph convolutional networks for traffic flow forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 922–929.

19.

Chen

Xie

Cao

Gao

and Feng

, Multi-range attentive bicomponent graph convolutional network for traffic forecasting, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 3529–3536.

20.

Dai

Liu

Gao

Lin

G.-J.

and Xiong

, Spatial-temporal transformer networks for traffic flow forecasting, arXiv preprint arXiv:2001.02908, 2020.

21.

and Moura

J.M.

, Forecaster: A graph transformer for forecasting spatial and time-dependent data, arXiv preprint arXiv:1909.04019, 2020.

22.

Yun

Jeong

Kim

Kang

and Kim

H.J.

, Graph transformer networks, Advances in neural information processing systems, 2020, 32.

23.

Cai

Janowicz

Mai

Yan

and Zhu

, Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting, Transactions in GIS 24(3) (2020), 736–755.

24.

Vaswani

Shazeer

Parmar

Uszkoreit

Jones

Gomez

A.N.

Kaiser

Ł.

and Polosukhin

, Attention is all you need, Advances in Neural Information Processing Systems 30(3) (2017), 36–55.

25.

Gilmer

Schoenholz

S.S.

Riley

P.F.

Vinyals

and Dahl

G.E.

, Neural message passing for quantum chemistry, in: International Conference on Machine Learning, PMLR, 2017, pp. 1263–1272.

26.

Tian

Sonobe

Kawarabayashi

K.-i.

and Jegelka

, Representation learning on graphs with jumping knowledge networks, in: International Conference on Machine Learning, PMLR, 2018, pp. 5453–5462.

27.

Mohajer

Daliri

M.S.

Mirzaei

Ziaeddini

Nabipour

and Bavaghar

, Heterogeneous computational resource allocation for noma: Toward green mobile edge-computing systems, IEEE Transactions on Services Computing 16(2) (2022), 1225–1238.

28.

Mohajer

Sorouri

Mirzaei

Ziaeddini

Rad

K.J.

and Bavaghar

, Energy-aware hierarchical resource management and backhaul traffic optimization in heterogeneous cellular networks, IEEE Systems Journal 16(4) (2022), 5188–5199.

29.

Bruna

Zaremba

Szlam

and LeCun

, Spectral networks and locally connected networks on graphs, arXiv preprint arXiv:1312.6203, 2013.

30.

Henaff

Bruna

and LeCun

, Deep convolutional networks on graph-structured data, arXiv preprint arXiv:1506. 05163, 2015.

31.

Defferrard

Bresson

and Vandergheynst

, Convolutional neural networks on graphs with fast localized spectral filtering, Advances in Neural Information Processing Systems 33(1) (2016), 65–78.

32.

Hamilton

Ying

and Leskovec

, Inductive representation learning on large graphs, Advances in Neural Information Processing Systems 30(1) (2017), 23–35.

33.

Schlichtkrull

Kipf

T.N.

Bloem

Berg

R.v.d.

Titov

and Welling

, Modeling relational data with graph convolutional networks, in: European Semantic Web Conference, Springer, 2018, pp. 593–607.

34.

Brody

Alon

and Yahav

, How attentive are graph attention networks? arXiv preprint arXiv:2105.14491, 2021.

35.

Wang

Wei

Kuang

and Ding

, Graph neural networks with node-wise architecture, in: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2022, pp. 1949–1958.

36.

Vlahogianni

E.I.

, Computational intelligence and optimization for transportation big data: Challenges and opportunities, Engineering and Applied Sciences Optimization, 2015, 107–128.

37.

Xiangjie

, Short-term traffic volume intelligent hybrid forecasting model and its application, Systems Engineering-Theory & Practice 31(3) (2011), 562–568.

38.

Shan

Zhao

and Xia

, Urban road traffic speed estimation for missing probe vehicle data based on multiple linear regression model, in: 16th International IEEE Conference on Intelligent Transportation Systems (ITSC 2013), IEEE, 2013, pp. 118–123.

39.

Hamed

M.M.

Al-Masaeid

H.R.

and Said

Z.M.B.

, Short-term prediction of traffic volume in urban arterials, Journal of Transportation Engineering 121(3) (1995), 249–254.

40.

Van Der Voort

Dougherty

and Watson

, Combining kohonen maps with arima time series models to forecast traffic flow, Transportation Research Part C: Emerging Technologies 4(5) (1996), 307–318.

41.

Lee

and Fambro

D.B.

, Application of subset autoregressive integrated moving average model for short-term freeway traffic volume forecasting, Transportation Research Record 1678(1) (1999), 179–188.

42. Williams B.M. and Hoel L.A., Modeling and forecasting vehicular traffic flow as a seasonal ARIMA process: Theoretical basis and empirical results, Journal of Transportation Engineering 129(6) (2003), 664–672.

43. Duan, Kang and Wang F.-Y., Traffic flow prediction with big data: A deep learning approach, IEEE Transactions on Intelligent Transportation Systems 16(2) (2014), 865–873.

44. Guo, Wang and Yuan, A survey of connected shared vehicle-road cooperative intelligent transportation systems, Control and Decision 34(11) (2019), 2375–2389.

45. Van Lint, Hoogendoorn and van Zuylen H.J., Freeway travel time prediction with state-space neural networks: Modeling state-space dynamics with recurrent neural networks, Transportation Research Record 1811(1) (2002), 30–39.

46. Shao and Soong B.-H., Traffic flow prediction with long short-term memory networks (LSTMs), in: 2016 IEEE Region 10 Conference (TENCON), IEEE, 2016, pp. 2986–2989.

47. Zhang and Li, Using LSTM and GRU neural network methods for traffic flow prediction, in: 2016 31st Youth Academic Annual Conference of Chinese Association of Automation (YAC), IEEE, 2017, pp. 324–328.

48. Huang, Wang, Lan, Jiang and Yuan, MD-GCN: A multi-scale temporal dual graph convolution network for traffic flow prediction, Sensors 23(2) (2023), 841.

49. Han, Zhang, Sun and Chen, PMGCN: Progressive multi-graph convolutional network for traffic forecasting, ISPRS International Journal of Geo-Information 12(6) (2023), 241.

50. and Zhang, STGMN: A gated multi-graph convolutional network framework for traffic flow prediction, Applied Intelligence 52(13) (2022), 15026–15039.

51. Qin, Fang, Luo, Zhao and Wang, DMGCRN: Dynamic multi-graph convolution recurrent network for traffic forecasting, arXiv preprint arXiv:2112.02264, 2021.

52. Qin, Yang, Zheng, Zhu and Ye, Predicting origin-destination ride-sourcing demand with a spatio-temporal encoder-decoder residual multi-graph convolutional network, Transportation Research Part C: Emerging Technologies 122 (2021), 102858.

53. Yang, Tang and Liu, Dual temporal gated multi-graph convolution network for taxi demand prediction, Neural Computing and Applications 22 (2022), 1–16.

54. Zeng and Tang, Combining knowledge graph into metro passenger flow prediction: A split-attention relational graph convolutional network, Expert Systems with Applications 213 (2022), 118790.

