Abstract
The key to solving traffic congestion is accurate traffic speed forecasting. However, this is difficult owing to the intricate spatial-temporal correlations of traffic networks. Most existing studies ignore either the correlations among distant sensors or the time-varying spatial features, and therefore cannot extract accurate and reliable spatial-temporal features. To overcome these shortcomings, this study proposes a new deep learning framework, named spatial-temporal gated graph convolutional network, for long-term traffic speed forecasting. First, a new spatial graph generation method is proposed, which uses the adjacency matrix to generate a global spatial graph with more comprehensive spatial features. Then, a new spatial-temporal gated recurrent unit is proposed to extract comprehensive spatial-temporal features from traffic data by embedding a new graph convolution operation into the gated recurrent unit. Finally, a new self-attention block is proposed to extract global features from the traffic data. Evaluation on two real-world traffic speed datasets demonstrates that the proposed model can accurately forecast long-term traffic speed and outperforms the baseline models on most evaluation metrics.
Introduction
Traffic congestion has become an increasingly serious problem as urbanization deepens [1], and accurate traffic speed forecasting has become the key to solving it. Reliable traffic speed forecasting can not only help the government arrange road management strategies reasonably, but also allow the public to plan the time and route of travel [2]. Phusittrakool et al. [3] stated that accurate long-term traffic speed forecasting can solve traffic problems more effectively than short-term traffic speed forecasting. However, owing to the intricate spatial-temporal correlations of traffic networks, reliable long-term traffic speed forecasting remains very difficult.
As computing power has grown rapidly, classical machine learning methods and deep learning techniques have increasingly been applied in many fields, including traffic speed forecasting. Classical machine learning methods such as autoregressive integrated moving average (ARIMA) [4] and support vector regression (SVR) [5] roughly forecast traffic by establishing a mapping relationship. These methods can learn simple temporal features from traffic data, but they cannot learn the complex spatial-temporal features of traffic networks.
Deep learning techniques have been widely recognized as feasible tools for traffic forecasting during the past decade [6]. The recurrent neural network (RNN) and its variants, the long short-term memory network (LSTM) [7] and the gated recurrent unit network (GRU) [8], have proven effective in capturing temporal features in traffic data to forecast traffic speed [9]. However, RNN-based works slice the traffic networks into independent traffic data, ignoring the spatial correlations between sensors. The convolutional neural network (CNN) [10] has been employed to extract the spatial features in traffic networks, but it tends to produce information distortion because it treats the irregular traffic networks as a regular two-dimensional matrix.
Because the traffic network has a non-Euclidean structure, several previous studies [11, 12] have used the graph convolutional network (GCN) to model traffic networks. However, many existing GCN-based models extract only local spatial features and ignore the correlations among distant sensors, making it challenging for the models to learn the spatial features comprehensively. Some spatial-temporal models [13, 14] use separate spatial and temporal modules to extract spatial and temporal features from traffic data, respectively. However, they struggle to represent accurate and reliable spatial-temporal features comprehensively because they do not consider time-varying spatial features.
Therefore, to address the shortcomings of existing studies, this study proposes a new spatial-temporal gated graph convolutional network (STGGCN) to learn the reliable spatial-temporal features for accurate long-term traffic speed forecasting. The following three aspects are the primary contributions of this study:
A new spatial graph generation method is proposed to comprehensively describe the spatial correlation between the traffic sensors. It enhances the adjacency matrix with the k-hop algorithm to generate a global spatial graph with more comprehensive spatial features.
A new spatial-temporal gated recurrent unit (STGRU) is proposed to efficiently extract comprehensive spatial-temporal features from traffic data. It extracts both spatial features and temporal features by embedding the graph convolution operation into GRU.
A new self-attention block is proposed to extract the global features from the traffic data. This reduces the risk of overfitting and improves the model training effect, which helps to forecast the traffic speed more accurately.
The remainder of this paper is structured as follows. Section 2 reviews the characteristics of some existing models. Section 3 defines the concepts related to traffic speed forecasting. Section 4 presents the structure of the proposed model. Section 5 describes the datasets, experimental settings, and experimental results. Section 6 summarizes the study and discusses future work.
Related work
Deep learning for traffic forecasting
As recent technological advancements have made computing resources cheaper, deep learning methods have developed significantly and are widely used to forecast traffic in order to address the traffic congestion caused by urbanization. Our previous work [15] has demonstrated that deep learning can effectively extract high-dimensional spatial and temporal features in traffic networks. Chen and Chen [16] proposed a deep learning model for data interpolation, which effectively handled the traffic flow prediction task with data loss, showing that deep learning can have strong generalization ability.
Earlier research used RNN to learn temporal dependencies in traffic data, but RNN tends to suffer from vanishing and exploding gradients, which make it difficult to memorize long-term traffic sequences. Extending the RNN, Hochreiter and Schmidhuber [7] proposed LSTM, and Cho et al. [8] proposed the GRU to solve these problems using multiple gating units and cell state to capture temporal features. Cui et al. [17] stacked bidirectional LSTM with unidirectional LSTM and added a data imputation mechanism to LSTM to deal with traffic states with missing values. Zhu et al. [18] used a seasonal trend decomposition algorithm to decompose traffic data, and then used LSTM, MLP, and seasonal cycle to extract deep temporal dependence. Kumar et al. [19] combined the Internet of Things with GRU and LSTM to predict short-term traffic flow. Although RNN-based models can make accurate traffic forecasts to a certain extent, they slice the traffic networks into independent traffic data and ignore the spatial features of the traffic networks.
To increase the forecasting accuracy of traffic networks, some studies began to extract both spatial and temporal features from traffic networks. Liu et al. [10] used both 1D-Convolution and LSTM to extract spatiotemporal dependencies in traffic data, and then employed a bidirectional LSTM module to learn the heterogeneity of traffic data. Yao et al. [20] exploited local CNN to learn spatial dependencies in traffic networks, and combined it with LSTM to build spatial and temporal views. However, these CNN-based models treat the irregular traffic network as a regular two-dimensional matrix, which will cause information distortion, resulting in loss of local spatial information.
To better handle non-Euclidean data [21] such as traffic networks, GCN has been proposed to replace CNN to extract more accurate spatial features. Bruna et al. [22] employed GCN that applied convolution operations to non-Euclidean data based on the graph Laplacian. Defferrard et al. [23] reduced the complexity of GCNs using Chebyshev polynomials. Kipf and Welling [24] proposed a renormalization propagation method to further improve the computational effectiveness and forecasting ability of GCN. With the development and maturity of GCN, various studies began to use GCN to extract spatial features in traffic networks. For example, Yu et al. [13] expressed the traffic speed prediction problem on a traffic graph structure and used GCN and temporal gated convolution to extract spatial-temporal features in the traffic networks, which accelerated the model training speed. Geng et al. [11] divided the traffic network into multiple regions, and then used multi-graph convolution to model the correlation between regions to forecast ride-hailing demand. Zhao et al. [14] combined GRU and GCN into a temporal graph convolution network (T-GCN) to forecast traffic speed, and obtained relatively good results. Lv et al. [12] encoded various semantic information into multiple graphs, and then used multi-graph convolution and GRU to extract the spatial-temporal features of the traffic network. Zhao et al. [25] used double graph convolution to replace the fully connected layer of the gating unit in GRU to extract the spatial-temporal dependency of the traffic network.
However, although the above GCN-based models can extract the spatial features of traffic networks, they tend to ignore the correlations among distant sensors. Moreover, although they use spatial and temporal modules to extract spatial and temporal features respectively, they tend to ignore the time-varying spatial features.
Attention mechanism
Attention mechanisms can allocate limited computing resources to more important goals and have been applied in various application areas, including natural language processing (NLP) [26], image processing [27], and time series forecasting [28]. Zhang et al. [29] proposed a gated attention network, which uses a convolutional sub-network to control the distribution of attention, and then combines the aggregator with GRU to forecast traffic speeds. Guo et al. [30] proposed a new spatial-temporal attention mechanism composed of spatial attention and temporal attention to learn dynamic spatial and temporal features. Zhao et al. [31] combined attention with dynamic graphs to capture dynamic, hidden, and long-term correlations in traffic data by assigning different weights to sensors at different times and locations.
However, the attention mechanism is difficult to optimize and ignores contextual relations. Fortunately, the self-attention mechanism can make up for these shortcomings. Vaswani et al. [32] proposed a deep learning model based on the self-attention mechanism for solving machine translation tasks. Wang et al. [28] combined a self-attention mechanism and a graph neural network layer with a location attention mechanism to forecast traffic speed, demonstrating the usefulness of the self-attention mechanism in traffic speed forecasting. Reza et al. [33] proposed a multi-head self-attention model, demonstrating that the number of attention layers and multi-head attention significantly improve the model's predictive effectiveness. Yan et al. [34] combined the self-attention mechanism with LSTM to predict traffic speed, showing that the self-attention mechanism can effectively capture the long-term dependence of traffic speed.
The above work demonstrates the usefulness of the self-attention mechanism, but the mechanism is vulnerable to overfitting and is difficult to train, because it calculates the similarity point by point and generates large amounts of data.
Problem definition
In this section, several basic concepts about traffic speed forecasting are first defined, and then the purpose of the traffic speed forecasting model is explained.
The aim of the traffic speed prediction model is to learn a mapping function F(·) that forecasts future traffic speed from known historical traffic data and the traffic network graph. The function F(·) is shown in Equation (1).
Figure 1 depicts the framework of the STGGCN model proposed in this study. It consists of three parts: the spatial graph generation module, the STGRU module, and the self-attention module. In the spatial graph generation module, the adjacency matrix is enhanced with the k-hop algorithm to generate multiple k-hop adjacency matrices with different distance ranges, and then these local spatial graphs are fused into a global spatial graph with more comprehensive spatial features. In the STGRU module, a new k-hop graph convolution (KGC) operation is performed on the global spatial graph to extract time-varying spatial features, and the comprehensive spatial-temporal features are extracted by embedding KGC into GRU. In the self-attention module, a new self-attention block is used to extract the global features in the traffic data, and an improved residual network is utilized to better train the model and improve its forecasting accuracy.

Fig. 1. Framework of the proposed STGGCN.
To model the traffic networks, the correlations among distant sensors matter as well as those among close sensors. Therefore, as shown in Fig. 1-A, an adjacency matrix is first generated based on the geographical location of the traffic sensors. Then, the adjacency matrix is enhanced using the k-hop algorithm to generate a global spatial graph with comprehensive spatial features, which can represent the correlations among both close and distant sensors.
To represent the correlation between two sensors accurately, the threshold-based Gaussian kernel formula [35] is employed to construct an adjacency matrix that represents the spatial structure of the traffic networks well, as shown in Equation (2).
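As an illustration of the idea, the following is a minimal sketch of a thresholded Gaussian kernel, assuming the form commonly used in the traffic forecasting literature: the weight is exp(-d²/σ²) and is set to zero when it falls below a threshold. The function name, the parameters `sigma` and `epsilon`, and the toy distance matrix are assumptions for illustration and are not taken from Equation (2).

```python
import math

def gaussian_kernel_adjacency(dist, sigma, epsilon):
    """Build a weighted adjacency matrix from pairwise sensor distances.

    dist[i][j] is the distance between sensors i and j. Weights smaller
    than the threshold epsilon are set to zero to keep the graph sparse.
    """
    n = len(dist)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # assumption: no self-loops in this sketch
            weight = math.exp(-(dist[i][j] ** 2) / (sigma ** 2))
            if weight >= epsilon:
                adj[i][j] = weight
    return adj

# Toy example with three sensors (hypothetical distances).
dist = [[0.0, 1.0, 4.0],
        [1.0, 0.0, 2.0],
        [4.0, 2.0, 0.0]]
A = gaussian_kernel_adjacency(dist, sigma=2.0, epsilon=0.1)
```

In this toy example, sensors 0 and 2 are far enough apart that their weight falls below the threshold, so they are not directly connected.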
To facilitate subsequent calculation, the adjacency matrix A is first transformed into
To generate a global spatial graph $G_{st}$ with more comprehensive spatial features, multiple k-hop adjacency matrices with different distance ranges (i.e., local spatial graphs) are produced and aggregated, according to Equation (5).
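The k-hop idea can be sketched as follows. This is only an illustration under stated assumptions: a k-hop matrix is taken to mark sensor pairs whose shortest path is exactly k hops, and the fusion step is taken to be an element-wise union. The aggregation actually used in Equation (5) may differ, and all function names here are hypothetical.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search: number of hops from source to every node."""
    n = len(adj)
    hops = [None] * n
    hops[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in range(n):
            if adj[u][v] > 0 and hops[v] is None:
                hops[v] = hops[u] + 1
                queue.append(v)
    return hops

def k_hop_matrices(adj, max_k):
    """Return [A_1, ..., A_max_k]; A_k marks pairs exactly k hops apart."""
    n = len(adj)
    mats = [[[0] * n for _ in range(n)] for _ in range(max_k)]
    for i in range(n):
        for j, h in enumerate(hop_distances(adj, i)):
            if h is not None and 1 <= h <= max_k:
                mats[h - 1][i][j] = 1
    return mats

def fuse(mats):
    """Assumed fusion: element-wise union of the local spatial graphs."""
    n = len(mats[0])
    return [[int(any(m[i][j] for m in mats)) for j in range(n)]
            for i in range(n)]

# Four sensors on a line: 0 - 1 - 2 - 3.
chain = [[0, 1, 0, 0],
         [1, 0, 1, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 0]]
local_graphs = k_hop_matrices(chain, max_k=2)
global_graph = fuse(local_graphs)
```

With `max_k=2`, sensor 0 becomes connected to sensor 2 in the fused graph, while sensor 3 (three hops away) remains unconnected to it.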
As shown in Fig. 1-B, combined with the global spatial graph, the traffic data with time steps are input into the proposed STGRU to extract the time-varying spatial features and temporal features.
Referring to Fig. 2, STGRU extends the original GRU [8] involving reset gate and update gate with a new spatial gate, performing the graph convolution operation to extract the time-varying spatial features. The spatial gate $s_g$ captures the time-varying spatial features from the global spatial graph $G_{st}$, the update gate $r_g$ retains the significant spatial-temporal features, and the reset gate $z_g$ removes the unwanted features.

Fig. 2. Architecture of STGRU.
Although the graph convolution operation of the spatial gate based on the adjacency matrix was proposed by Kipf and Welling [24], this does not apply to the k-hop adjacency matrices in the global spatial graph. Therefore, a new KGC operation is proposed in Equation (6).
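For reference, the standard propagation step of Kipf and Welling [24] mentioned above can be sketched as below. The sketch shows only the renormalized adjacency D^(-1/2)(A + I)D^(-1/2) applied to node features; the learned weight matrix and activation are omitted, and it is not the KGC operation of Equation (6). Function names and the toy data are assumptions.

```python
import math

def normalize_adjacency(adj):
    """Renormalization trick: D^(-1/2) (A + I) D^(-1/2)."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own signal.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def propagate(a_hat, x):
    """One propagation step a_hat @ x for node features x (n x f)."""
    n, f = len(x), len(x[0])
    return [[sum(a_hat[i][k] * x[k][c] for k in range(n)) for c in range(f)]
            for i in range(n)]

# Two connected sensors with speeds 60 and 40 (toy values).
adj = [[0.0, 1.0],
       [1.0, 0.0]]
a_hat = normalize_adjacency(adj)
speeds = [[60.0], [40.0]]
smoothed = propagate(a_hat, speeds)
```

Each sensor's feature becomes a degree-normalized mix of its own value and its neighbours' values, which is why this operation captures only local (1-hop) spatial structure.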
After the time-varying spatial features are extracted from the global spatial graph, they are merged with the cell states. Finally, the fused data will go through the reset gate and update gate, resulting in the comprehensive spatial-temporal features. The formula for figuring out
RNN-based models, including GRU, can only extract features in chronological order [28]. To extract the global features of traffic data for more accurate traffic speed forecasting, this study proposes a new self-attention block that combines the self-attention mechanism [32] with an improved residual network. The self-attention mechanism has been proven effective in extracting the global features of traffic data [28], but it suffers from overfitting and training difficulties because of the large amount of computation, which can be addressed by the improved residual network.
As shown in Fig. 1-C, the self-attention module includes positional encoding, a self-attention block (consisting of a self-attention layer and the improved residual network), and an output layer. The positional encoding is sinusoidal, and the self-attention layer uses the multi-head self-attention mechanism. Please refer to Vaswani et al. [32] for the details of sinusoidal positional encoding and the self-attention layer.
Extending the original residual network [27], Ding et al. [36] increased the number of identities of the residual network, and added 3×3 and 1×1 convolutions on the identities, improving the efficiency of image feature extraction. However, because the traffic network has a non-Euclidean structure, convolution is not suitable for extracting global features from traffic data. Inspired by, but differing from, Ding et al. [36], this study employs two different non-linear activation functions (i.e., ReLU and sigmoid) in the improved residual network to extract the global features from different aspects. Then the two different types of global features are connected with the initial features to obtain the fused global features, which are finally normalized through layer normalization into the data range [0,1]. The specific calculation is shown in Equation (12).
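To make the point about global features concrete, the following is a minimal single-head sketch of scaled dot-product self-attention in the sense of Vaswani et al. [32]. As a simplifying assumption, the queries, keys, and values are the inputs themselves (no learned projections and no multiple heads), so it is not the block proposed in this study.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    """Single-head scaled dot-product self-attention with Q = K = V = x."""
    d = len(x[0])
    outputs, weights = [], []
    for query in x:
        # Similarity of this time step to every time step, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in x]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of all time steps: every output sees the whole sequence.
        outputs.append([sum(w[t] * x[t][c] for t in range(len(x)))
                        for c in range(d)])
    return outputs, weights

# Three time steps with two features each (toy values).
x = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
out, attn = self_attention(x)
```

Because each output row mixes all time steps at once, the mechanism is not restricted to chronological processing; the cost is one similarity score per pair of time steps.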
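For readers unfamiliar with it, the sinusoidal positional encoding of Vaswani et al. [32] can be sketched as follows. The sequence length of 12 mirrors the 12 time steps used in the experiments, while `d_model=8` is an arbitrary illustrative choice, not a setting reported in this study.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position codes.

    Even dimensions use sine and odd dimensions use cosine, with
    wavelengths forming a geometric progression controlled by 10000.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=12, d_model=8)
```

The table is added to the input features so that the otherwise order-agnostic self-attention layer can distinguish time steps.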
After obtaining the final global features, the output layer (i.e., the fully connected layer) is used to obtain the final multi-step traffic speed forecasting results.
Dataset selection and preprocessing
To measure the forecasting performance and robustness of the proposed model, this study selects two real-world traffic speed datasets to evaluate the model. These datasets were collected by the Caltrans Performance Measurement System (PeMS) [38], and are briefly described as follows.
PEMS-BAY: The dataset comprises traffic speed data from 150 sensors in the Bay Area of California from January 1st 2017 to February 28th 2017. During this period, there was frequent heavy rainfall, and the rainfall in February broke the precipitation record of the Bay Area of California. The sensor distribution is shown in Fig. 3(a).
PEMS4: The dataset comprises traffic speed data from 128 sensors in the Bay Area of California from June 1st 2017 to June 30th 2017. During this period, there were frequent gales. The sensor distribution is shown in Fig. 3(b).

Fig. 3. The sensor distribution of the two datasets.
The traffic data in these two datasets were collected by traffic sensors every 5 minutes. The data were preprocessed through Z-score normalization, as shown in Equation (13):
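A minimal sketch of Z-score normalization is given below. It assumes, as is common practice but not stated in the text, that the mean and standard deviation are estimated on the training split only and reused for the other splits; the function names and sample speeds are hypothetical.

```python
import math

def fit_zscore(values):
    """Estimate mean and (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)

def zscore(values, mean, std):
    """Map raw speeds to zero-mean, unit-variance values."""
    return [(v - mean) / std for v in values]

def inverse_zscore(values, mean, std):
    """Map normalized forecasts back to the original speed scale."""
    return [v * std + mean for v in values]

train_speeds = [50.0, 60.0, 70.0]
mean, std = fit_zscore(train_speeds)
normalized = zscore(train_speeds, mean, std)
restored = inverse_zscore(normalized, mean, std)
```

The inverse transform matters for evaluation, since error metrics are normally reported on the original speed scale.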
All experimental results in this section are averages over five runs. All the comparison models in this study run in a computing environment with Python 3.6 and PyTorch 1.10. The computer running the experiments is configured with an Intel(R) Core(TM) i7-8700 3.20 GHz CPU and an NVIDIA GeForce GTX 1080 Ti GPU with 11 GB of memory.
The dataset was divided into training, validation, and test sets according to the ratio of 7:2:1. When evaluating the model performance, 12 known historical time steps were used to forecast the next 12 consecutive time steps. The mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) were used as evaluation indicators for the comparison models. The specific calculation method for the metrics is shown in Equations (14)–(16).
In training the STGGCN proposed in this study, the number of training epochs was set to 50, the learning rate was set to 0.002, the batch size was set to 200, and the optimizer was Adam. The k value of the k-hop algorithm in the spatial graph generation module was set to 4, and the number of heads for multi-head self-attention in the self-attention block was set to 4.
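The three metrics follow their standard definitions, which can be sketched as below. The sample values are hypothetical, and the MAPE function assumes that no true speed is zero; how zero or missing readings are handled in the experiments is not specified here.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (true values nonzero)."""
    return 100.0 * sum(
        abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [60.0, 50.0, 40.0]
y_pred = [57.0, 52.0, 40.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

For all three metrics, lower values indicate better forecasts.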
To evaluate the relative performance of the proposed model in the field of traffic speed forecasting, this study compares the proposed STGGCN with some classical machine learning methods and some deep learning models as baseline models. The hyperparameters of the baseline models were set according to the default parameters. These baseline models are briefly introduced as follows:
HA: The historical average (HA) method calculates the average value of the previous 12 time steps as the forecasted value of the next moment.
SVR [5]: SVR uses support vector regression method to fit the temporal features in traffic data for forecasting.
GRU [8]: GRU extracts the temporal features in the traffic data, and uses the hidden state to store and transfer the learned features.
GCN [24]: GCN learns the spatial features of the traffic networks through the adjacency matrix.
STGCN [13]: Spatio-temporal graph convolutional networks (STGCN) integrates GCN and gated temporal convolution to extract local spatial-temporal features in traffic networks.
STGNN [28]: Spatial temporal graph neural network (STGNN) utilizes the location attention mechanism to capture spatial features, and utilizes GRU and transformer to capture both local and global temporal features.
STSGCN [39]: Spatial-temporal synchronous graph convolutional network (STSGCN) utilizes a spatial-temporal graph convolution module to capture local spatial-temporal features and multi-module layers to capture heterogeneity in long spatio-temporal graphs.
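As a point of reference for the simplest baseline in the list above, the HA method can be sketched in a few lines. The function name and the toy speed series are assumptions for illustration.

```python
def historical_average(history, window=12):
    """Forecast the next value as the mean of the last `window` readings."""
    recent = history[-window:]
    return sum(recent) / len(recent)

# 24 past readings (toy values); only the last 12 are used.
history = [float(v) for v in range(1, 25)]
forecast = historical_average(history)
```

Because the forecast is a plain average, HA cannot react to trends or to conditions at neighbouring sensors, which is why it serves as a lower bound for the learned models.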
Table 1 lists the forecasting results of the proposed STGGCN and the baseline models on the PEMS-BAY and PEMS4 datasets for 15-minute, 30-minute, 45-minute, and 60-minute forecasting. The optimal forecasting results at each time point are bolded. The experimental results are analyzed as follows.
Table 1. Forecasting results of STGGCN and baseline models on two real-world datasets
Note: the bolded data in the table represent the optimal results.
The deep learning models (GRU, GCN, STGCN, STGNN, STSGCN, and STGGCN) outperform the classical machine learning models (HA and SVR) for traffic speed forecasting in most cases. This is because the traditional time series forecasting models have simple linear structures and cannot capture complex non-linear features in traffic data. In most cases, the models based on spatial-temporal features (STGCN, STGNN, STSGCN, and STGGCN) perform better than the models based only on temporal or spatial features (HA, SVR, GRU, and GCN), because traffic speed is both spatially and temporally correlated. Compared with the deep learning models that extract only local spatial-temporal features (STGCN and STSGCN), STGGCN shows similar forecasting performance for short-term traffic speed forecasting. However, as the forecasting horizon increases, STGGCN performs better than STGCN and STSGCN. This is because STGGCN makes full use of the global features of traffic data. Compared with STGNN, STGGCN performs worse for short-term traffic speed forecasting. However, as the forecasting horizon increases, the advantage of STGGCN over STGNN grows. This is because STGGCN can extract more comprehensive spatial-temporal features.
To compare the forecasting performance of STGGCN and STGNN more intuitively, their forecasting results on randomly selected sensors at 288 consecutive time steps are visualized in Fig. 4. It can be observed that STGGCN is more sensitive to traffic speed changes than STGNN, and its forecast results are close to the true values, demonstrating that STGGCN is well suited to capturing complex traffic features for traffic speed forecasting, particularly for long-term traffic speed forecasting.

Fig. 4. Comparison of forecasting performance between STGGCN and STGNN on PEMS-BAY.
To demonstrate the usefulness of each module in the proposed STGGCN, ablation experiments are conducted to evaluate four variant models obtained by combining different modules with the basic GRU: GRU-GCN, e-GRU-GCN, e-GRU-KGC, and e-GRU-KGC-SA. They are described as follows.
GRU-GCN: This model uses GRU and GCN to extract temporal features and spatial features respectively, and then combines them [14].
e-GRU-GCN: This model embeds GCN into GRU, i.e., adds GCN as a spatial gate of GRU to extract spatial-temporal features.
e-GRU-KGC: This model embeds KGC into GRU, i.e., adds KGC in the spatial gate of GRU to extract comprehensive spatial-temporal features.
e-GRU-KGC-SA: This model combines a self-attention block, which adopts the original residual network [40], with e-GRU-KGC to extract global features.
Table 2 shows the forecasting results of GRU, its variant models, and the proposed STGGCN on two real-world datasets, including those for 15 minutes, 30 minutes, 45 minutes, and 60 minutes forecasting. The optimal forecasting results at each time point are bolded. Figure 5 shows the comparison of MAE, RMSE, and MAPE values among the different models on two real-world datasets. The experimental results were analyzed as follows.
Table 2. Forecasting results of GRU, its variant models and proposed STGGCN on two real-world datasets
Note: the bolded data in the table represent the optimal results.

Fig. 5. Comparison of GRU, its variant models and proposed STGGCN on two real-world datasets.
GRU performs well at 15 min and 30 min, but its forecasting performance at 45 min and 60 min is unsatisfactory on both datasets, because it utilizes only the temporal features of traffic data and ignores the spatial features. GRU-GCN performs worse than GRU at 15 min and 30 min, but better than GRU at 45 min and 60 min on the PEMS-BAY dataset. This is because the utilization of simple spatial-temporal features can improve the accuracy of long-term traffic forecasting. However, GRU-GCN always performs worse than GRU on the PEMS4 dataset. This is because the spatial features in the PEMS4 dataset are relatively complex, and GRU-GCN is not good at extracting complex spatial features, in particular the time-varying spatial features. e-GRU-GCN performs better than GRU-GCN on both datasets, because it embeds GCN into GRU and extracts time-varying spatial features together with temporal features. This indicates that the time-varying spatial features are beneficial for traffic speed forecasting. e-GRU-KGC performs better than e-GRU-GCN on both datasets, because the proposed KGC operation embedded in GRU helps to extract more comprehensive spatial-temporal features. e-GRU-KGC-SA performs better than e-GRU-KGC on both datasets, because it extracts the global features from traffic data through the self-attention block. This indicates that the global features are helpful for accurately forecasting traffic speed. The proposed STGGCN performs better than e-GRU-KGC-SA on both datasets, suggesting that the proposed improved residual network reduces the probability of overfitting and enhances the training effect of the model.
In summary, each proposed module in STGGCN can significantly enhance the forecasting performance of the model.
This study proposes a new deep learning model STGGCN to enhance the abilities of traffic speed forecasting. Experiments on two real-world traffic speed datasets verified its superiority over baseline models and the usefulness of the modules involved. The following conclusions are made in light of the experimental results:
The proposed spatial graph generation module can enhance the adjacency matrix with the k-hop algorithm to produce the global spatial graph with comprehensive spatial features.
The proposed STGRU performs a new KGC operation, and embeds KGC into GRU to extract comprehensive spatial-temporal features.
The proposed improved residual network in the self-attention block helps aggregate more global features from different aspects to reduce the probability of overfitting and improve the forecasting accuracy of the model.
Although STGGCN can accurately forecast traffic speed most of the time, its short-term forecasting performance still has room for improvement, and there are many aspects that can be explored in the future. First, this model considers only the spatial-temporal information of traffic speed while ignoring other information, such as semantic correlation and events (such as sports games and concerts). Second, the attention module can be integrated with the graph convolution operation to enhance the aggregation of local spatial features and improve the short-term traffic speed forecasting ability of the model. Finally, the self-attention block can be improved to extract more comprehensive global features (such as feature interaction and vector interaction).
Acknowledgments
This work was supported by Zhejiang Key R & D Project of China (No. 2022C01005, No. 2022C01082), Humanities and Social Sciences Research Project of Ministry of Education, China (No. 20YJC870003).
