Abstract
Traffic forecasting has become a core component of Intelligent Transportation Systems. However, accurate traffic forecasting is very challenging because of the complexity of traffic road networks. Most existing forecasting methods do not fully consider the topological structure information of road networks, which makes it difficult to extract accurate spatial features. In addition, spatial and temporal features have different impacts on traffic conditions, but existing studies ignore the distribution of spatial-temporal features across traffic regions. To address these limitations, we propose a novel graph neural network architecture named Attention-based Spatial-Temporal Adaptive Integration Gated Network (AST-AIGN). The originality of AST-AIGN lies in obtaining spatial features that more accurately reflect the topological structure of road networks by embedding the Graph Attention Network (GAT) into the Jumping Knowledge Net (JK-Net). We propose a data-dependent function called the spatial-temporal adaptive integration gate to process the diversity of feature distributions and highlight the features in road networks that significantly affect traffic conditions. We evaluate our model on two real-world traffic datasets from the Caltrans Performance Measurement System (PEMS04 and PEMS08), and the extensive experimental results demonstrate that the proposed AST-AIGN architecture outperforms the baselines.
Keywords
Introduction
Traffic forecasting is a typical spatial-temporal forecasting problem, which aims to predict traffic conditions over a certain period of time based on historical traffic data. Traffic forecasting is an indispensable component of Intelligent Transportation Systems (ITS) [1]. ITS have a wide range of applications, from managing transportation efficiently to alleviating traffic congestion and reducing road accidents. However, because the non-linear and non-stationary traffic data depend on dynamic road conditions, discovering their inherent spatial-temporal patterns and generating accurate traffic predictions is extremely challenging.
Visualization of the topological structure of traffic road networks.
Traffic forecasting has received significant research interest, and many methods have been proposed. For example, traditional statistical methods [2, 3, 4, 5, 6, 7] apply time series models to simulate the patterns of complex traffic data. However, these models usually assume that the input time series is stationary and linear, an assumption that traffic data violate. Hence, it is hard for traditional statistical models to jointly exploit the spatial and temporal correlations in traffic flows and to make accurate predictions. Deep learning traffic forecasting models have delivered higher accuracy and gradually replaced traditional statistical models. For instance, existing deep learning models [8, 9] combine Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to model spatial and temporal correlations, respectively. Although CNNs can extract the spatial features of traffic data, they ignore the spatial topological structure of road networks. Furthermore, it is difficult for RNNs to extract long-term temporal features. In addition, these methods usually adopt autoregressive prediction, which leads to the propagation and accumulation of prediction errors. To address these shortcomings, Graph Convolutional Networks (GCNs) [10] were proposed to model the topological structure of road networks. Traffic sensors on roads are regarded as nodes in the road network graph, and the edges of the graph are constructed from the distance or similarity between each pair of nodes. By combining GCNs with CNNs [11, 12, 13] or RNNs [14, 15, 16], graph-based deep learning methods can capture both spatial and temporal dependences and achieve promising results. To model the highly dynamic spatial dependence caused by varying traffic speeds and factors such as weather conditions, some researchers [17, 18, 19, 20] introduce self-attention mechanisms into GCNs, while others [21, 22, 23] combine GCNs with the Transformer [24].
Although the effectiveness of the graph-based deep learning methods in traffic forecasting tasks has been shown, the existing methods still ignore two important issues.
First, previous studies generally fail to consider two important types of topological structure information of road networks when extracting spatial features: (1) the directed relationships among traffic nodes, and (2) the location of traffic nodes in the road network. These studies assume that the relationships among nodes are undirected and do not distinguish node positions when extracting spatial features. Figure 1a shows the entire road network in San Jose, California. Figure 1b visualizes the network's topological structure, with the light-colored nodes representing transportation hubs and the dark-colored nodes indicating traffic edge nodes. Figure 1c illustrates the local topological structure marked by the red circle in Fig. 1b and indicates the spatial correlation between hub node 1 and its surrounding traffic nodes. Taking nodes 1 and 2 as an example, the edge weights in the two directions between this pair are not the same: node 1 has a more significant influence on node 2 because node 1 is a traffic hub and possesses more 1-hop neighbors than node 2 [1]. Therefore, the relationships among the nodes should be regarded as directed. Existing graph-based approaches extract spatial features through the neighborhood aggregation process (graph convolution is essentially a neighborhood aggregation process [25], which we describe in detail in Section 3.2). In this process, the central hub nodes obtain more global spatial features but lose local ones, while the edge nodes retain local spatial features but have difficulty obtaining global ones. As a result, nodes at different locations can hardly obtain appropriate neighborhood ranges from which to draw their feature representations.
Second, existing studies tend to ignore the diversity of the spatial-temporal feature distributions of road networks in different regions. In some areas, spatial dependence dominates traffic conditions, while in others, temporal dependence dominates. The traffic data of two pairs of traffic sensors in different regions are visualized in Fig. 2 to illustrate the different distributions. Within the same time span, the data from sensor 1 and sensor 2 show apparent periodicity and trend, meaning that the temporal factor has a greater influence in this region. In contrast, sensor 3 responds to changes at sensor 4, which indicates a stronger spatial dependence in that region.
Traffic data with different spatial and temporal feature distributions.
In this work, we propose a novel method, AST-AIGN, which addresses the two limitations above. In this method, we combine GAT [17] with the JK-Net framework [26] to describe the directed relationships among traffic nodes and obtain topological structure-aware spatial feature representations of all nodes. A data-dependent gate, called the spatial-temporal adaptive integration gate, is presented to process the diversity of spatial-temporal feature distributions in road networks and selectively integrate the spatial-temporal features in different regions. Meanwhile, our model is based on self-attention to capture dynamic spatial dependence and long-term temporal dependence, complementing the feature extraction process. The main contributions of this work are as follows:
We propose JK-GAT, which embeds GAT into the JK-Net framework to obtain spatial features that more accurately reflect the topological structure of road networks. JK-GAT simulates the directedness among traffic nodes by calculating the attention weights of GAT. With the jump connections and aggregation mechanism of JK-Net, JK-GAT can obtain topological structure-aware feature representations of traffic nodes.
We propose a data-dependent gate that depicts the spatial-temporal feature distributions in different regions using only historical traffic data, without any extra information (e.g., land-use data), and highlights the features that significantly impact traffic conditions.
We propose a novel forecasting model, AST-AIGN, to extract more accurate spatial-temporal dependences and perform traffic flow prediction. We evaluate the proposed AST-AIGN on the PEMS04 and PEMS08 datasets. The extensive experimental results demonstrate that our model is superior to all baseline methods for the traffic forecasting problem.
The rest of this paper is structured as follows. Section 2 reviews the related work on traffic forecasting. Section 3 presents the problem definition and neighborhood aggregation scheme. Section 4 offers the details of the method proposed in the paper. Section 5 evaluates the prediction performance of AST-AIGN using real traffic datasets. Finally, this paper is concluded in Section 6.
Graph convolutional networks
Traffic networks and mobile networks can be analyzed as graphs for tasks such as traffic forecasting, assignment, and energy efficiency [27, 28]. Many previous studies tried to apply CNNs and RNNs to graph-structured data. Although these neural networks can process Euclidean data efficiently, they still struggle to handle non-Euclidean data (e.g., graphs). GCNs were proposed to solve this problem and generate feature representations for each node in a graph. GCNs typically adopt the Laplacian matrix or the adjacency matrix to portray the relationships among nodes and can be divided into two categories: spectral methods and spatial methods. Researchers designed spectral domain graph convolution [29, 30] based on spectral graph theory. As an extension, Defferrard et al. [31] proposed fast localized convolutional filters on graphs to decrease the computational complexity. Because spectral methods are computationally slower and less applicable to directed graphs, spatial methods have been developed much more widely. For instance, Will et al. [32] designed a new sampling strategy for neighborhood node aggregation and applied the algorithm to large graphs for inductive representation learning. Veličković et al. [17] proposed GAT, which dynamically calculates the relevance among nodes and improved the accuracy of node classification. Michael et al. [33] presented the relational GCN (R-GCN) for tasks on heterogeneous graphs with relational data. Brody et al. [34] incorporated GAT and R-GCN to further increase the capacity of graph representation learning. Moreover, many studies [26, 35] have examined graph convolution theory to solve the inherent problems of GCNs, such as over-smoothing.
Traffic forecasting
Existing traffic forecasting methods can be divided into two categories: model-driven methods and data-driven methods. Model-driven methods mainly use mathematical tools and physical knowledge to describe traffic problems through mathematical analysis [36]. This approach performs comprehensive system modeling based on prior knowledge and requires high computational power, and it is vulnerable to limitations such as noise interference and the distribution of sampling points.
Data-driven methods aim at traffic condition forecasting and assessment based on the statistical characteristics of data [37, 38]. These methods are more flexible and do not require analysis of the physical characteristics of road networks, and they can be mainly divided into parametric and non-parametric models. Parametric models determine model parameters by processing raw data and forecast traffic based on regression functions, such as the Kalman filter model [2], the exponential smoothing model [3] and the autoregressive integrated moving average (ARIMA) model [5]. Among them, ARIMA is the most widely used. Hamed et al. [39] used ARIMA to predict urban traffic flow, and many variants of the model were proposed to improve the prediction accuracy [40, 41, 42]. Parametric models are algorithmically simple and rely on certain linear assumptions, so it is difficult for them to predict non-stationary traffic data effectively. Non-parametric models based on shallow machine learning can model the nonlinear characteristics of traffic data. Common models include k-nearest neighbors [4], support vector machine regression (SVR) [6], Bayesian networks [7], etc. However, it is still difficult for them to explore the complex spatial-temporal dependences of traffic data, which leads to poor prediction accuracy.
Recently, rapidly developing deep learning models have been able to capture the dynamic features of traffic data effectively through stacked nonlinear networks. Several studies [43, 44, 45] used RNNs to predict highway traffic flow. Shao et al. [46] proposed a long short-term memory (LSTM) network that uses a memory unit and a gating mechanism to capture long-term temporal dependence in traffic series. Fu et al. [47] adopted the simpler gated recurrent unit (GRU), which is also based on a gating mechanism, to extract temporal features and reduce the computational cost. However, these models do not consider the topological structure of road networks, which limits their ability to capture spatial features. Liu et al. [9] proposed the Conv-LSTM model and a novel historical data matrix. Conv-LSTM adopts CNNs to model the spatial dependence from the row vectors of the matrix and LSTM to model the temporal dependence from the column vectors. In this way, Conv-LSTM models both spatial and temporal dependences, leading to better prediction performance. Although it adopts CNNs for spatial dependence modeling, it is inherently unable to process non-Euclidean traffic data.
GCN-based methods can exploit the information in the topological structure of the traffic network. Li et al. [14] proposed the diffusion convolutional recurrent neural network (DCRNN) based on the seq2seq model, using bi-directional diffusion graph convolution to replace the matrix operations of the gated recurrent unit (GRU) and capture spatial dependence through random walks on the graph. Cui et al. [16] proposed the traffic graph convolutional long short-term memory neural network (TGC-LSTM) to learn the interplay among the different roads within the traffic network and to predict the overall traffic conditions. Huang et al. [48] introduced a novel multi-scale temporal dual graph convolution network (MD-GCN) to capture multi-scale temporal dependencies using a combination of channel attention and inception structures. Yu et al. [11] proposed a spatial-temporal graph convolution network (STGCN), using GCNs and CNNs to model the spatial and temporal dependences of traffic flows, respectively. The 1-D convolution in STGCN is more effective than RNNs at capturing temporal dependence. GraphWaveNet [12] uses dilated convolution (TCN) to extract long-term temporal dependence and constructs an adaptive adjacency matrix to complement the road network topology. STSGCN [13] integrates the spatial graphs of adjacent time steps into a local spatial-temporal graph and defines a spatial-temporal graph convolution module to capture features. Most of the above methods are based on static assumptions, but real traffic flows have highly dynamic spatial dependence, and the spatial relationships among nodes evolve over time. Vaswani et al. [24] proposed the multi-head self-attention mechanism, which dynamically computes the correlation between sequence elements, and many studies have since turned to combining attention mechanisms with GCNs.
STTNs [20], a spatial-temporal prediction model adopting the scaled dot-product self-attention mechanism, stacks spatial-temporal blocks to capture dynamic spatial dependence and long-term temporal dependence in traffic flow. Li et al. [21] proposed Forecaster, based on the Transformer architecture, which represents the spatial-temporal correlation among nodes by constructing correlation graphs and sparsifying the weight matrix. To model traffic networks at a finer level, recent studies introduce the concept of multi-graph convolution. Li et al. [49] proposed a progressive multi-graph convolution network (PMGCN), in which multiple graph convolutions adopt progressive connections and spatial-temporal attention dynamically adjusts each node in the graph. Ni et al. [50] introduced a novel multi-graph model based on an attention 1-D CNN and a gated interpretable framework to model historical traffic data. Qin et al. [51] proposed a dynamic multi-graph convolution recurrent network (DMGCRN) to capture coarse-grained region methods dynamically. Ke et al. [52] used the spatial correlation between roads and semantic correlation to construct the multi-graph and fused the results after separate convolutions on each graph. Yang et al. [53] integrated the multi-graph first and then conducted graph convolution on it. Zeng et al. [54] introduced Point of Interest (POI) data to construct a traffic knowledge graph that reflects land utilization. However, most of these approaches do not fully use the topological structure information of road networks and ignore the different distributions of spatial-temporal features in traffic regions.
To address the limitations of existing studies and step beyond the state of the art, a new graph neural network structure, AST-AIGN, is proposed in this paper for the traffic forecasting task on urban road networks.
Aggregation of 2-hop neighborhood.
Problem definition
In this study, the topology structure of road networks is described by a graph with the weights
Traffic forecasting is a classic spatial-temporal forecasting problem, where traffic data is described by a matrix
where
Diverse types of graph neural networks [31, 10, 32, 17, 34] proposed to apply convolution operations on non-Euclidean data can be abstractly described as a neighborhood aggregation scheme [25]. The scheme is divided into two parts, feature generation and feature aggregation, and is mathematically presented as:
where
Figure 3 shows the process of aggregating 2-hop neighbors’ features of the central node by two graph convolution layers. The
This section focuses on the traffic forecasting task on real road networks using the Attention-based Spatial-Temporal Adaptive Integration Gated Network (AST-AIGN). AST-AIGN consists of three main components: a feature encoding layer, two spatial-temporal layers, and a prediction layer. The feature encoding layer and the prediction layer are composed of one and two layers of 1-D convolution, respectively.
Overall structure of AST-AIGN.
Figure 4 shows the overall structure of the AST-AIGN model, in which each spatial-temporal layer includes a module of spatial feature, a module of temporal feature and a data-dependent spatial-temporal adaptive integration gate. The traffic forecasting process is described as follows: The real traffic data
Module of spatial feature
The structure of the module of spatial feature is shown in the top left of Fig. 4, which contains two parts: the self-attention mechanism and JK-GAT. The traffic spatial features are highly dynamic due to weather changes, traffic accidents, and other factors. Specifically, the edge weights between each sensor and its neighboring sensors change over time. We use self-attention without initial topological relationships as the basis of our model to capture dynamic spatial dependence. We propose JK-GAT to model the directedness of the road network structure and obtain node feature representations that are adaptive to the topological structure. After getting
The structure of the module of temporal feature is shown in the bottom left of Fig. 4. Capturing long-term temporal dependence is important in the traffic forecasting task. A limited temporal range can hardly reflect the trend and periodicity of traffic flow adequately, so we also adopt the self-attention mechanism to capture the long-term temporal dependence
Self-attention for extracting spatial feature and temporal feature
The self-attention mechanism is a variant of the attention mechanism, which is widely applied in computer vision and natural language processing. By calculating attention weights from the input itself, the self-attention mechanism reduces the dependence on external information and is better suited to exploring the internal relevance of data or features. We adopt the self-attention mechanism to model dynamic spatial dependence and long-term temporal dependence adequately. Taking the extraction of spatial features as an example, we first adopt spatial position embedding to combine a trainable random initialization parameter matrix
where
After getting the projection subspace, the node correlation weight matrix
The weight matrix is projected back to the original space domain, and a three-layer nonlinear activation feedforward neural network is connected to complete the extraction of the dynamic spatial dependence
where
When capturing long-term temporal dependence with self-attention, we adopt one-hot time encoding to initialize the temporal position embedding matrix
Graph convolution, or neighborhood aggregation, is inherently a smoothing operation on graph signals. Each node in the graph draws features from its neighborhood range, which depends on the topological position of the node. Unlike other deep learning networks, graph convolutional networks are likely to become over-smoothed as they get deeper, leading to a performance decline in related prediction tasks. For the traffic forecasting task specifically, the features of traffic hub nodes are easily propagated to neighboring nodes. In contrast, the features of edge nodes are difficult to propagate to other nodes, which makes it difficult to obtain appropriate feature representations. To overcome this problem, we introduce a general graph convolutional network framework, JK-Net, which enables layer-wise neural networks to obtain structure-aware feature representations. As shown in Fig. 5, JK-Net applies two powerful changes to networks: jump connections and a layer-aggregation mechanism.
In general graph convolution networks, each layer increases the size of the neighborhood range on the basis of the previous layer. The jump connections make the output of each layer jump to the last layer, where the layer-aggregation mechanism selects among the spatial feature representations obtained from each layer. Assuming
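As an illustration of jump connections and layer aggregation, the following minimal sketch stacks plain mean-aggregation layers on a toy graph and combines the per-layer outputs with an element-wise max. The mean aggregator, the toy graph, and all function names are assumptions made for this example; they stand in for the GAT layers used in this paper and are not the authors' implementation.

```python
from typing import Dict, List

Graph = Dict[int, List[int]]       # node -> list of neighbor nodes
Features = Dict[int, List[float]]  # node -> feature vector


def mean_aggregate(graph: Graph, feats: Features) -> Features:
    """One aggregation layer: average each node's features with its neighbors'."""
    out: Features = {}
    for node, neighbors in graph.items():
        group = [feats[node]] + [feats[n] for n in neighbors]
        dim = len(feats[node])
        out[node] = [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
    return out


def jk_max_pool(graph: Graph, feats: Features, num_layers: int) -> Features:
    """Keep every layer's output (jump connections) and combine them with an
    element-wise max over layers (layer aggregation)."""
    per_layer: List[Features] = []
    current = feats
    for _ in range(num_layers):
        current = mean_aggregate(graph, current)
        per_layer.append(current)
    dim = len(next(iter(feats.values())))
    return {
        node: [max(layer[node][d] for layer in per_layer) for d in range(dim)]
        for node in graph
    }


# Hypothetical toy road graph: node 0 is a hub, nodes 1-3 are edge nodes.
toy_graph: Graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
toy_feats: Features = {0: [4.0], 1: [0.0], 2: [0.0], 3: [0.0]}

# Output of a plain two-layer stack (only the last layer is kept).
last_layer_only = mean_aggregate(toy_graph, mean_aggregate(toy_graph, toy_feats))
# Output with jump connections and max aggregation over both layers.
jk_output = jk_max_pool(toy_graph, toy_feats, num_layers=2)
```

In this toy case the edge node 1 keeps its 1-hop representation (2.0) under the max aggregation, whereas the plain two-layer stack smooths it to 1.5, so each node effectively selects its own neighborhood range.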
Structure of 
The key idea of the layer-aggregation mechanism is to determine the importance of a node’s feature representation at different neighborhood ranges after looking at the features on all layers. JK-Net provides Concatenation and Max-pooling aggregation methods. Concatenation is the most direct way to combine features in all layers, which is suitable for small graphs and graphs with regular structure. Max-pooling can choose the most influential layer based on the traffic data pattern during training. Max-pooling is adaptive to topology structure and needs no additional parameters to train. The detail of the Max-pooling layer-aggregation mechanism can be expressed as follows:
where
We adopt GAT in JK-Net to draw the directed spatial features in road networks. Figure 5 represents a stacked JK-GAT with
where
where
Overall, JK-Net is a general framework of graph convolution to solve the over-smoothing problem in the learning process. We embed GAT into the JK-Net framework to combine their advantages in traffic forecasting tasks. The JK-GAT with
We extract the spatial and temporal features in parallel so that they do not interfere with each other and degrade the prediction performance. In traffic data, the dominant feature of traffic conditions differs across traffic areas. We show the diversity of the spatial-temporal feature distribution in Fig. 6. In the business regions, spatial features have much more influence on traffic flow, while in the industrial areas, temporal features have more influence. It is therefore necessary to selectively integrate spatial-temporal features in different regions and highlight the features that significantly impact traffic conditions. POI data can reflect specific points with different functional attributes in urban areas, but this information is inaccessible in many scenarios. We propose a data-dependent function called the spatial-temporal adaptive integration gate, which directly explores the distribution in the historical traffic data without any external information. Thus, our method is more flexible and universal for traffic forecasting.
Spatial-temporal feature distribution obtained from the gate: dark blue indicates that spatial features have more influence in a region, and dark red indicates that temporal features have more influence.
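To make the idea of a data-dependent gate concrete, the sketch below shows a generic sigmoid-gated fusion of a spatial and a temporal feature vector. It is only an assumed form chosen for illustration: the scalar weights `w_s`, `w_t`, the `bias`, and the convex combination are hypothetical and are not taken from the definition of the spatial-temporal adaptive integration gate in this paper.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(
    spatial: List[float],
    temporal: List[float],
    w_s: float,
    w_t: float,
    bias: float,
) -> List[float]:
    """Fuse two feature vectors with a gate computed from the features themselves.

    Assumed form (hypothetical, for illustration only):
        g     = sigmoid(w_s * s + w_t * t + bias)
        fused = g * s + (1 - g) * t
    A gate value near 1 favors the spatial feature, near 0 the temporal one.
    """
    fused = []
    for s, t in zip(spatial, temporal):
        g = sigmoid(w_s * s + w_t * t + bias)
        fused.append(g * s + (1.0 - g) * t)
    return fused


# With zero weights and bias the gate is 0.5 everywhere: a plain average.
balanced = gated_fusion([2.0, 4.0], [0.0, 0.0], w_s=0.0, w_t=0.0, bias=0.0)
# A large positive bias pushes the gate towards 1: the spatial feature dominates.
spatial_dominant = gated_fusion([2.0, 4.0], [0.0, 0.0], w_s=0.0, w_t=0.0, bias=50.0)
```

Because the gate is computed from the features rather than from external data such as POI labels, regions in which one kind of dependence dominates can receive different mixing weights.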
where
Therefore, the distribution
After obtaining the distribution through the gate, we can selectively fuse the dependences
Detailed information on the PEMS04 and PEMS08 datasets
Dataset description
To validate the effectiveness of the proposed model, AST-AIGN needs to be verified on large-scale datasets. Two California highway traffic datasets, PEMS04 and PEMS08, were selected for experimental validation because of their integrity, correctness, and ease of acquisition and processing. The raw traffic data are collected by the Caltrans Performance Measurement System (PEMS) in real time at 30-second intervals and aggregated every 5 minutes to obtain the traffic datasets. The system deploys more than 39,000 traffic sensors on California freeways and records the location information of each sensor. From the three traffic attributes (flow, average speed, and average occupancy), we select traffic flow as the experimental data. Detailed information on the two datasets is shown in Table 1.
The traffic data are normalized to the interval [0, 1], with the first 80% of the traffic data used as the training set and the remaining 20% as the test set. We use the most recent 60 minutes (12 time steps) of traffic flow data to predict traffic flow at prediction horizons of 5 min, 15 min, 30 min, 45 min, and 60 min (1, 3, 6, 9, 12 time steps).
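The preprocessing described above can be sketched for a single sensor as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the min-max statistics are computed over the whole series before splitting (following the order of the description), and the synthetic `flow` series only stands in for real PEMS data.

```python
from typing import List, Sequence, Tuple

HISTORY_STEPS = 12            # 60 minutes of 5-minute readings
HORIZONS = (1, 3, 6, 9, 12)   # 5, 15, 30, 45 and 60 minutes ahead


def min_max_normalize(series: Sequence[float]) -> List[float]:
    """Rescale a series to the interval [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:  # constant series: avoid division by zero
        return [0.0 for _ in series]
    return [(v - lo) / (hi - lo) for v in series]


def chronological_split(
    series: Sequence[float], train_ratio: float = 0.8
) -> Tuple[List[float], List[float]]:
    """Use the first part of the series for training and the rest for testing."""
    cut = int(len(series) * train_ratio)
    return list(series[:cut]), list(series[cut:])


def make_samples(
    series: Sequence[float],
    history: int = HISTORY_STEPS,
    horizons: Sequence[int] = HORIZONS,
) -> List[Tuple[List[float], List[float]]]:
    """Slide a window over the series: 12 past steps in, one target per horizon out."""
    samples = []
    for start in range(len(series) - history - max(horizons) + 1):
        window = list(series[start:start + history])
        last = start + history - 1  # index of the most recent observation
        targets = [series[last + h] for h in horizons]
        samples.append((window, targets))
    return samples


# Synthetic single-sensor flow series standing in for real data (assumption).
flow = [float(i) for i in range(100)]
normalized = min_max_normalize(flow)
train, test = chronological_split(normalized)
train_samples = make_samples(train)
```

Each sample pairs a 12-step input window with five targets, one for each prediction horizon.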
We utilize the distance information to determine the initial weights among sensors, which describe the degree of mutual influence between traffic nodes. The adjacency matrix
where
Sensor distribution and heat map of the adjacency matrix for the PEMS08 and PEMS04 datasets.
To evaluate the performance of the model proposed in this paper, the mean absolute error (MAE), root mean square error (RMSE) and mean absolute percentage error (MAPE) are used as evaluation metrics. Let
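The three metrics named above can be computed as in the following minimal sketch, which uses their standard definitions. The function names and the choice to skip near-zero ground-truth values in MAPE (controlled by the hypothetical `eps` parameter) are assumptions for illustration rather than the evaluation code used in the experiments.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true: Sequence[float], y_pred: Sequence[float], eps: float = 1e-8) -> float:
    """Mean absolute percentage error, in percent.

    Entries whose ground truth is (near) zero are skipped, since the relative
    error is undefined there; this masking is an assumption of the sketch.
    """
    ratios = [abs(t - p) / abs(t) for t, p in zip(y_true, y_pred) if abs(t) > eps]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical flow values and predictions for three time steps.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
example_mae = mae(observed, predicted)
example_rmse = rmse(observed, predicted)
example_mape = mape(observed, predicted)
```

RMSE penalizes large individual errors more strongly than MAE, while MAPE expresses the error relative to the observed flow.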
To conduct convincing experiments, we choose several widely adopted or state-of-the-art approaches as baseline models for comparison with the proposed model.
The experiments were conducted in a Tianhe supercomputing environment with 2 Intel(R) Xeon(R) Gold 6132 14-core CPUs and 4 GPUs per compute node. 256 GB of memory (shared between the 2 CPUs) was available on each compute node. The GPU is an NVIDIA Tesla V100 SXM2 with 16 GB of HBM2 memory. AST-AIGN with 2 spatial-temporal layers is applied. The module of spatial feature and the module of temporal feature use single-head, single-layer self-attention. The input data are extended to 64 dimensions through a 1-D convolution operation in the input layer. We use a uniform initialized random distribution of size 2 to ensure the reproducibility of experimental results. During model training, the Adam optimization algorithm is chosen, the learning rate is set to 0.001 and decays by 3% every 50 epochs, the loss function is MSE, the batch size is set to 64, the model is trained for 200 epochs, and a dropout rate of 0.05 is used.
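The learning-rate schedule can be written out explicitly. The sketch below assumes that the 3% decay is applied multiplicatively, as a step decay once every 50 full training passes over the data; the function name and its parameters are hypothetical.

```python
def learning_rate(
    epoch: int,
    base_lr: float = 0.001,
    decay: float = 0.03,
    step: int = 50,
) -> float:
    """Step decay: multiply the rate by (1 - decay) once every `step` epochs.

    The multiplicative interpretation of the 3% decay is an assumption.
    """
    return base_lr * (1.0 - decay) ** (epoch // step)


# Learning rate at the start of each 50-epoch block of the 200-epoch run.
schedule = [learning_rate(e) for e in (0, 50, 100, 150)]
```

Under this assumption the rate falls from 0.001 to roughly 0.00091 over the final block of the 200 training passes, so the schedule is a mild refinement rather than an aggressive annealing.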
The number of JK-GAT layers
Results of AST-AIGN with single-layer JK-GAT and different aggregation methods
Result of ablation experiments for both datasets
Results of AST-AIGN with
Relationship of the JK-GAT layers and the prediction horizon with accuracy from AST-AIGN model on PEMS04 and PEMS08. The 
Comparison of prediction results among AST-AIGN and baselines on both datasets
Next, we proceed with experiments for traffic flow forecasting of all time steps with different JK-GAT layers on both datasets. The results are summarized in Table 4. Figure 8a shows the number of JK-GAT layers that achieve the best results at different prediction horizons on PEMS04. When the prediction horizon becomes longer for all JK-GAT layers, the RMSE of our AST-AIGN shows an overall increasing trend, which indicates that a longer horizon will affect the prediction performance to some extent. It is clear that on the short-term and medium-term forecasting tasks, a larger number of JK-GAT layers achieves the lowest error, which means the influence of a broader range of traffic edge nodes needs to be considered more. While on the long-term forecasting task, the lowest error is obtained when the number of JK-GAT layers is 2, which means the effect of adjacent nodes is more influential. Specifically, the best prediction results for the prediction horizon [5 min, 15 min, 30 min, 45 min] were obtained with the number of JK-GAT layers [4, 3, 4, 4], respectively. On the contrary, at a prediction horizon of 60 min, the best prediction result was obtained with two layers of JK-GAT.
Similar conclusions can be drawn from the experimental results on PEMS08. As shown in Fig. 8b, the best prediction results at the prediction horizons [5 min, 15 min, 30 min, 45 min] were all obtained with 5 JK-GAT layers. At the prediction horizon of 60 min, the best prediction result is obtained with a single JK-GAT layer. This indicates that the two datasets have similar traffic flow patterns.
Table 5 compares the prediction performance of AST-AIGN with the above baselines on the PEMS04 and PEMS08 datasets for prediction time step
As shown in Table 5, all three metrics (RMSE, MAPE, and MAE) of each model grow when the prediction time step becomes longer, which indicates that the longer the prediction time step, the worse the prediction performance of each model. That is because predicting farther points in the time dimension requires much more accurate temporal dependence. As can be seen from Table 5, FC-LSTM is the worst performing model because it is usually used to capture temporal features but not spatial features. When the prediction time step becomes longer, the FC-LSTM prediction error grows most rapidly. The graph-based deep learning networks STGCN and DCRNN achieved better performance than FC-LSTM on both datasets due to STGCN and DCRNN capturing both spatial and temporal features for final prediction. ASTGCN and GraphWaveNet outperform STGCN and DCRNN models on both datasets. ASTGCN introduces attention mechanism to describe dynamic spatial features, and GraphWaveNet proposes an adaptive adjacency matrix to achieve the same effect. These models indicate the effectiveness of modeling dynamic spatial dependence. The STTNs model combined with the Transformer framework to extract dynamic spatial dependence and long-term temporal dependence, which performs better than ASTGCN but hardly higher than GraphWaveNet. The STTNs model extracts spatial-temporal dependences step by step, resulting in insufficient integration of spatial-temporal features. Our AST-AIGN achieves the best results on all prediction time steps for both datasets, with JK-GAT capturing spatial feature that more accurately reflects the topological structure and a data-dependent gate processing the diversity of road networks’ feature distribution.
To analyze the performance of each model at different prediction time steps, we visualize the result of error analysis. The prediction results of each model under different prediction horizon are shown in Figs 9 and 10, and it shows that the prediction accuracy of each model generally decreases when the horizon becomes longer. However, AST-AIGN achieves the best prediction results regardless of the variation of the horizon. And the changing trend of our model’s prediction results is smaller, indicating that the AST-AIGN is less affected by the scale of the prediction horizon and is suitable for prediction tasks at different time scales.
In summary, the results of the comparison study show that our AST-AIGN is effective in modeling complex spatial-temporal dependences and achieves promising performance in the traffic forecasting task.
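For reference, the per-horizon evaluation discussed above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in our experiments: the function name `horizon_metrics`, the array layout (samples, horizons, nodes), and the masking of near-zero ground-truth values in MAPE are assumptions made for the example.

```python
import numpy as np


def horizon_metrics(y_true, y_pred, eps=1e-8):
    """Compute RMSE, MAE, and MAPE (%) separately for each prediction horizon.

    y_true, y_pred: arrays of shape (samples, horizons, nodes) (assumed layout).
    Returns three arrays, each of shape (horizons,).
    """
    err = y_pred - y_true
    # Average over samples and nodes, keeping the horizon axis.
    rmse = np.sqrt(np.mean(err ** 2, axis=(0, 2)))
    mae = np.mean(np.abs(err), axis=(0, 2))
    # Assumption: entries with near-zero ground truth are excluded from MAPE
    # to avoid division by zero.
    mask = np.abs(y_true) > eps
    safe_true = np.where(mask, np.abs(y_true), 1.0)
    ape = np.where(mask, np.abs(err) / safe_true, 0.0)
    mape = 100.0 * ape.sum(axis=(0, 2)) / np.maximum(mask.sum(axis=(0, 2)), 1)
    return rmse, mae, mape
```

Plotting the three returned arrays against the horizon index gives curves of the kind compared in Figs 9 and 10.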
Comparison of prediction results of all models at different time steps on PEMS04.
Comparison of prediction results of all models at different time steps on PEMS08.
To verify the superiority of AST-AIGN, we further design three variants: (1) AST-AIGN-para, which removes the adaptive integration gate from the model; (2) AST-AIGN-gcn, which replaces the GAT layers with GCN layers, making it difficult for the model to capture the directionality of the traffic road network, although the GCN layers are still connected to each other by ‘Max-pooling’; and (3) AST-AIGN-last, which has no ‘Max-pooling’ connection between GAT layers. We compare the prediction performance of AST-AIGN over 12 time steps with that of its three variants on both datasets, and from Table 3 we can observe that:
AST-AIGN-para has the highest RMSE, MAE, and MAPE on both datasets. This indicates that ignoring the spatial-temporal feature distribution leads to inadequate integration of dependences, which ultimately results in poorer predictions. AST-AIGN-gcn and AST-AIGN-last both outperform AST-AIGN-para on both datasets. On PEMS04, AST-AIGN-last achieves the better results of the two. This indicates that, in the dataset with a larger number of nodes, describing the directionality among nodes is more important than obtaining a structure-aware feature representation for each node. Conversely, AST-AIGN-gcn achieves the better results on PEMS08, the dataset with fewer nodes. Compared with its three variants, AST-AIGN achieves the lowest RMSE and MAE on both datasets, which demonstrates the advantages of AST-AIGN in extracting spatial-temporal features from traffic flow data.
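To make the role of the component removed in AST-AIGN-para concrete, the following is a minimal sketch of a generic data-dependent gate that fuses a spatial and a temporal feature map. It is not the exact formulation of our spatial-temporal adaptive integration gate: the sigmoid-gated convex combination, the function name `adaptive_gate`, and the parameters `W` and `b` are assumptions chosen for illustration only.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def adaptive_gate(h_s, h_t, W, b):
    """Fuse spatial features h_s and temporal features h_t with a learned gate.

    h_s, h_t: arrays of shape (nodes, features).
    W: array of shape (2 * features, features); b: array of shape (features,).
    Assumed form: g = sigmoid([h_s, h_t] W + b), out = g * h_s + (1 - g) * h_t.
    """
    # The gate depends on the data itself, so each node and each feature
    # channel receives its own spatial/temporal mixing weight.
    g = sigmoid(np.concatenate([h_s, h_t], axis=-1) @ W + b)
    fused = g * h_s + (1.0 - g) * h_t
    return fused, g
```

With such a gate, removing it (as in AST-AIGN-para) amounts to combining the two feature maps with fixed weights that no longer adapt to the input.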
Overall, the results of the ablation experiments demonstrate the effectiveness of AST-AIGN in modeling spatial-temporal relationships. JK-GAT and the spatial-temporal adaptive integration gate both play important roles in exploiting the spatial-temporal information in traffic data.
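The difference between the ‘Max-pooling’ connection and the AST-AIGN-last variant can be illustrated with a short sketch. Here the per-layer node features are plain arrays standing in for the outputs of stacked GAT layers; the function names `jk_max_pool` and `jk_last` are hypothetical and used only for this example.

```python
import numpy as np


def jk_max_pool(layer_outputs):
    """Jumping-knowledge style aggregation: element-wise max over all layers.

    layer_outputs: list of arrays, each of shape (nodes, features).
    """
    return np.max(np.stack(layer_outputs, axis=0), axis=0)


def jk_last(layer_outputs):
    """No jumping connection: keep only the output of the final layer."""
    return layer_outputs[-1]


# Toy features of two nodes after two stacked layers (placeholder values).
h1 = np.array([[1.0, 5.0], [3.0, 0.0]])
h2 = np.array([[2.0, 4.0], [1.0, 7.0]])
pooled = jk_max_pool([h1, h2])
last = jk_last([h1, h2])
```

With max-pooling, each node can select, feature by feature, the layer (and hence the neighborhood range) that is most informative for it, whereas the last-layer variant fixes the same range for every node.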
Model interpretation
To better understand the proposed AST-AIGN, we visualize the results of the comparison study. For PEMS04, we visualize the traffic flow data of sensor 123 on one day in February 2018. For PEMS08, we choose sensor 110 on one day in August 2016. Fig. 11 shows the prediction results of all models at different time steps compared with the ground truth. We can observe that:
AST-AIGN tends to generate smooth predictions when there are oscillation points in the traffic flow data. This indicates that the model has better robustness and is less vulnerable to sudden change points. When the prediction time step becomes longer, the prediction results of AST-AIGN remain stable, while the prediction results of the baselines vary more. This reflects that the prediction time step has limited influence on the model.
Visual comparison of the prediction results of all models at different prediction time steps on PEMS04 (top) and PEMS08 (bottom).

To visually investigate the role of JK-GAT in modeling the directionality of road networks, we select the first 50 sensors in PEMS04 and present heatmaps of the adjacency matrix and the graph attention matrix in Fig. 12. In heatmaps, the value of the
Adjacency matrix (left) and graph attention matrix (right) obtained from the JK-GAT.
To discover the inherent spatial-temporal patterns in traffic flow and generate accurate predictions, we propose AST-AIGN, a new network architecture for spatial-temporal traffic data prediction. It introduces a data-dependent adaptive integration gate that fuses spatial and temporal features while accounting for their different distributions. In addition, we employ JK-GAT to portray the directionality among traffic nodes and obtain a better structure-aware feature representation of all nodes. Finally, we apply AST-AIGN to the spatial-temporal traffic forecasting task, evaluate it on two real-world datasets, PEMS04 and PEMS08, and compare it with FC-LSTM, STGCN, DCRNN, ASTGCN(r), GraphWaveNet, and STTNs. AST-AIGN successfully captures spatial-temporal features and, in principle, is not limited to traffic forecasting but could also be applied to other spatial-temporal tasks.
Despite the superior performance of our model in traffic flow prediction tasks, some research gaps remain that we will address in future work, including the effects of external factors (such as weather and social events) on traffic flow prediction and the model’s generalizability to other spatial-temporal prediction tasks such as traffic accident prediction and weather forecasting.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC) (U1911205) and the Hubei Provincial Department of Natural Resources Science and Technology Project (No. ZRZY2023KJ15).
Conflict of interest
The authors declare that they have no conflict of interest.
Data availability statements
Datasets used in this paper can be downloaded at
