Spatial-temporal gated graph convolutional network: a new deep learning framework for long-term traffic speed forecasting

Abstract

Accurate traffic speed forecasting is key to alleviating traffic congestion. However, it is difficult owing to the intricate spatial-temporal correlation of traffic networks. Most existing studies ignore either the correlations among distant sensors or the time-varying spatial features, and therefore cannot extract accurate and reliable spatial-temporal features. To overcome these shortcomings, this study proposes a new deep learning framework, named the spatial-temporal gated graph convolutional network, for long-term traffic speed forecasting. First, a new spatial graph generation method is proposed, which uses the adjacency matrix to generate a global spatial graph with more comprehensive spatial features. Then, a new spatial-temporal gated recurrent unit is proposed to extract comprehensive spatial-temporal features from traffic data by embedding a new graph convolution operation into the gated recurrent unit. Finally, a new self-attention block is proposed to extract global features from the traffic data. The evaluation on two real-world traffic speed datasets demonstrates that the proposed model can accurately forecast long-term traffic speed and outperforms the baseline models on most evaluation metrics.

Keywords

Traffic speed forecasting; graph convolution operation; gated recurrent unit; self-attention block

1 Introduction

Traffic congestion has become an increasingly serious problem as urbanization deepens [1], and accurate traffic speed forecasting has become the key to solving this problem. Reliable traffic speed forecasting can not only help the government to arrange road management strategies reasonably, but also allow the public to plan the time and route of travel [2]. Phusittrakool et al. [3] stated that accurate long-term traffic speed forecasting can solve traffic problems more effectively than short-term traffic speed forecasting. However, due to the intricate spatial-temporal correlation of traffic networks, reliable long-term traffic speed forecasting remains very difficult.

As the computing power of computers develops rapidly, classical machine learning methods and deep learning techniques have increasingly been applied in various fields, including traffic speed forecasting. Classical machine learning methods, such as autoregressive integrated moving average (ARIMA) [4] and support vector regression (SVR) [5], forecast traffic roughly by establishing a mapping relationship. These methods can learn simple temporal features from traffic data, but they cannot learn the complex spatial-temporal features of traffic networks.

Deep learning techniques have been widely recognized as feasible tools for traffic forecasting during the past decade [6]. The recurrent neural network (RNN) and its variants, the long short-term memory network (LSTM) [7] and the gated recurrent unit network (GRU) [8], have proven effective in capturing temporal features in traffic data to forecast traffic speed [9]. However, RNN-based works slice the traffic networks into independent traffic data, ignoring the spatial correlation between sensors. The convolutional neural network (CNN) [10] has been employed to extract the spatial features in traffic networks, but it tends to produce information distortion because it treats the irregular traffic networks as a regular two-dimensional matrix.

Because the traffic network is a non-Euclidean structure, several previous studies [11, 12] have used graph convolutional network (GCN) to model traffic networks. However, various existing GCN-based models only extract local spatial features, but ignore the correlations among distant sensors, making it challenging for the models to learn the spatial features comprehensively. Some spatial-temporal models [13, 14] use spatial and temporal modules to extract spatial and temporal features from traffic data, respectively. However, they struggle to represent the accurate and reliable spatial-temporal features comprehensively because they do not consider time-varying spatial features.

Therefore, to address the shortcomings of existing studies, this study proposes a new spatial-temporal gated graph convolutional network (STGGCN) to learn the reliable spatial-temporal features for accurate long-term traffic speed forecasting. The following three aspects are the primary contributions of this study:

A new spatial graph generation method is proposed to comprehensively describe the spatial correlation between the traffic sensors. It enhances the adjacency matrix with the k-hop algorithm to generate a global spatial graph with more comprehensive spatial features.

A new spatial-temporal gated recurrent unit (STGRU) is proposed to efficiently extract the comprehensive spatial-temporal features from traffic data. It extracts both spatial features and temporal features by embedding a graph convolution operation into the GRU.

A new self-attention block is proposed to extract the global features from the traffic data. This reduces the risk of overfitting and improves the model training effect, which helps to more accurately forecast the traffic speed.

The remainder of this paper is structured as follows. Section 2 illustrates the characteristics of some existing models. Section 3 defines some concepts related to traffic speed forecasting through problem definition. Section 4 explores the structure of the proposed model. Section 5 demonstrates the dataset, experimental settings, and the experimental results. Section 6 summarizes the study and discusses the future work.

2 Related work

2.1 Deep learning for traffic forecasting

With recent technological advancements making computing resources cheaper, deep learning methods have developed significantly and are widely used to forecast traffic in order to alleviate the traffic congestion caused by urbanization. Our previous work [15] has demonstrated that deep learning can effectively extract high-dimensional spatial and temporal features in traffic networks. Chen and Chen [16] proposed a deep learning model for data interpolation, which effectively handled the traffic flow prediction task with data loss, proving that deep learning can have strong generalization ability.

Earlier research used RNN to learn temporal dependencies in traffic data, but RNN tends to suffer from vanishing and exploding gradients, which make it difficult to memorize long-term traffic sequences. Extending the RNN, Hochreiter and Schmidhuber [7] proposed LSTM, and Cho et al. [8] proposed the GRU to solve these problems using multiple gating units and cell state to capture temporal features. Cui et al. [17] stacked bidirectional LSTM with unidirectional LSTM and added a data imputation mechanism to LSTM to deal with the traffic state with missing values. Zhu et al. [18] used a seasonal trend decomposition algorithm to decompose traffic data, and then used LSTM, MLP, and seasonal cycle to extract deep temporal dependence. Kumar et al. [19] combined the Internet of Things with GRU and LSTM to predict short-term traffic flow. Although RNN-based models can make accurate traffic forecasts to a certain extent, they slice the traffic networks into independent traffic data and ignore the spatial features of the traffic networks.

To increase the forecasting accuracy of traffic networks, some studies began to extract both spatial and temporal features from traffic networks. Liu et al. [10] used both 1D-Convolution and LSTM to extract spatiotemporal dependencies in traffic data, and then employed a bidirectional LSTM module to learn the heterogeneity of traffic data. Yao et al. [20] exploited local CNN to learn spatial dependencies in traffic networks, and combined it with LSTM to build spatial and temporal views. However, these CNN-based models treat the irregular traffic network as a regular two-dimensional matrix, which will cause information distortion, resulting in loss of local spatial information.

To better handle non-Euclidean data [21] such as traffic networks, GCN has been proposed to replace CNN to extract more accurate spatial features. Bruna et al. [22] employed GCN that applied convolution operations to non-Euclidean data based on the graph Laplacian. Defferrard et al. [23] reduced the complexity of GCNs using Chebyshev polynomials. Kipf and Welling [24] proposed a renormalization propagation method to further improve the computational effectiveness and forecasting ability of GCN. With the development and maturity of GCN, various studies began to use GCN to extract spatial features in traffic networks. For example, Yu et al. [13] expressed the traffic speed prediction problem on a traffic graph structure and used GCN and temporal gated convolution to extract spatial-temporal features in the traffic networks, which accelerated model training. Geng et al. [11] divided the traffic network into multiple regions, and then used multi-graph convolution to model the correlation between regions to forecast ride-hailing demand. Zhao et al. [14] combined GRU and GCN into a temporal graph convolution network (T-GCN) to forecast traffic speed, and obtained relatively good results. Lv et al. [12] encoded various semantic information into multiple graphs, and then used multi-graph convolution and GRU to extract the spatial-temporal features of the traffic network. Zhao et al. [25] used double graph convolution to replace the fully connected layer of the gating unit in GRU to extract the spatial-temporal dependency of the traffic network.

However, although the above GCN-based models can extract the spatial features of traffic networks, they tended to ignore the correlations among distant sensors. Although they used spatial and temporal modules to extract spatial and temporal features respectively, they tended to ignore the time-varying spatial features.

2.2 Attention mechanism

Attention mechanisms can allocate limited computing resources to more important goals and have been applied in various application areas, including natural language processing (NLP) [26], image processing [27], and time series forecasting [28]. Zhang et al. [29] proposed a gated attention network, which uses a convolutional sub-network to control the distribution of attention, and then combines the aggregator with GRU to forecast traffic speeds. Guo et al. [30] proposed a new spatial-temporal attention mechanism composed of spatial attention and temporal attention to learn dynamic spatial and temporal features. Zhao et al. [31] combined attention with dynamic graphs to capture dynamic, hidden, and long-term correlations in traffic data by assigning different weights to sensors at different times and locations.

However, the attention mechanism is difficult to optimize and ignores contextual relations. Fortunately, the self-attention mechanism can make up for these shortcomings. Vaswani et al. [32] proposed a deep learning model based on the self-attention mechanism for solving machine translation tasks. Wang et al. [28] combined a self-attention mechanism and a graph neural network layer with a location attention mechanism to forecast traffic speed, demonstrating the usefulness of the self-attention mechanism in traffic speed forecasting. Reza et al. [33] proposed a multi-head self-attention model, demonstrating that the number of attention layers and multi-head attention significantly improve the model’s predictive effectiveness. Yan et al. [34] combined the self-attention mechanism with LSTM to predict traffic speed, proving that the self-attention mechanism can effectively capture the long-term dependence of traffic speed.

The above work proves the usefulness of the self-attention mechanism, but the mechanism is vulnerable to overfitting and is difficult to train, because it calculates the similarity point by point and generates large amounts of data.

3 Problem definition

In this section, several basic concepts about traffic speed forecasting are first defined, and then the purpose of the traffic speed forecasting model is explained.

Definition 1. Traffic network graph: Although traffic roads are directional, many traffic problems affect both upstream and downstream, so in this study the traffic network graph is considered to be an undirected graph G = (V, E, A), where V = {v₁, v₂, …, v_N} represents the N sensors in the traffic network, E is the set of connections between the sensors, and A represents the adjacency matrix of the traffic network graph.

Definition 2. Adjacency matrix: The adjacency matrix $A \in ℝ^{N \times N}$ can be used to represent the adjacency relationship between sensors on the traffic roads, in which each element A_ij denotes the strength of the spatial association between v_i and v_j, and it should be noted that A_ij ∈ [0, 1] and A_ii = 1, where 0 means there is no relationship between the sensors, and 1 means there is a strong relationship between the sensors.

The aim of the traffic speed forecasting model is to learn a mapping function F (·) that forecasts the future traffic speed from the known historical traffic data and the traffic network graph. The function F (·) is shown in Equation (1). $Y_{t + 1}, Y_{t + 2}, \dots, Y_{t + T} = F (X_{t - H + 1}, X_{t - H + 2}, \dots, X_{t}, G)$ (1) where Y denotes the future traffic speed to be forecasted, t is the current time point, T is the number of future time steps to be forecasted, H is the number of known historical time steps, X_t is the historical traffic data at time t, and G is the traffic network graph.
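The input and output of Equation (1) can be illustrated with a sliding window over a chronological list of speed snapshots. The following is a minimal sketch only; the function name `make_windows` and the toy data are hypothetical and not taken from the original implementation.

```python
from typing import List, Tuple

Snapshot = List[float]  # speeds of all N sensors at one time step


def make_windows(series: List[Snapshot], H: int, T: int
                 ) -> List[Tuple[List[Snapshot], List[Snapshot]]]:
    """Split a chronological series into (history, target) pairs.

    history = X_{t-H+1}, ..., X_t   (H known steps)
    target  = Y_{t+1}, ..., Y_{t+T} (T steps to forecast)
    """
    samples = []
    # t is the index of the current time point; it needs H-1 steps
    # before it and T steps after it.
    for t in range(H - 1, len(series) - T):
        history = series[t - H + 1: t + 1]
        target = series[t + 1: t + T + 1]
        samples.append((history, target))
    return samples


# Toy data (hypothetical): 30 snapshots of 2 sensors.
speeds = [[60.0 + i, 55.0 - i] for i in range(30)]
pairs = make_windows(speeds, H=12, T=12)
print(len(pairs))  # 30 - 12 - 12 + 1 = 7 samples
```

Each pair holds H input snapshots and T target snapshots; the graph G is the same for every sample and is therefore not stored per window.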

4 Methodology

Figure 1 depicts the framework of the STGGCN model proposed in this study. It consists of three parts: the spatial graph generation module, the STGRU module, and the self-attention module. In the spatial graph generation module, the adjacency matrix is enhanced with the k-hop algorithm to generate multiple k-hop adjacency matrices with different distance ranges (i.e., local spatial graphs), and then these local spatial graphs are fused into a global spatial graph with more comprehensive spatial features. In the STGRU module, a new k-hop graph convolution (KGC) operation is performed on the global spatial graph to extract time-varying spatial features, and the comprehensive spatial-temporal features are extracted by embedding KGC into GRU. In the self-attention module, a new self-attention block is used to extract the global features in the traffic data, and an improved residual network is utilized to better train the model and improve its forecasting accuracy.

Fig. 1

Framework of the proposed STGGCN.

4.1 Spatial graph generation module

To model the traffic networks, the correlations among distant sensors matter as well as those among close sensors. Therefore, as shown in Fig. 1-A, an adjacency matrix is first generated based on the geographical location of the traffic sensors. Then, the adjacency matrix is enhanced using the k-hop algorithm to generate a global spatial graph with comprehensive spatial features, which can represent the correlations among both close and distant sensors.

To represent the correlation between two sensors accurately, the threshold-based Gaussian kernel formula [35] is employed to construct an adjacency matrix that represents the spatial structure of the traffic networks well, as shown in Equation (2). $W_{ij} = \begin{cases} \exp \left( - \frac{dis (i, j)^{2}}{s^{2}} \right), & \text{if } \exp \left( - \frac{dis (i, j)^{2}}{s^{2}} \right) \geq κ \\ 0, & \text{otherwise} \end{cases}$ (2) where W_ij is the association between sensor i and sensor j, dis (i, j) is the linear spatial distance between sensor i and sensor j, s is the standard deviation of the linear spatial distance between sensors, and κ is the set threshold. The weights W_ij form the adjacency matrix $A \in ℝ^{N \times N}$ , where N is the number of sensors.
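Equation (2) can be sketched in a few lines. This is an illustrative sketch, not the original code: the function name `gaussian_adjacency` is hypothetical, and computing s as the population standard deviation over all entries of the distance matrix is an assumption, since the text does not say which pairs enter the standard deviation.

```python
import math
import statistics
from typing import List

Matrix = List[List[float]]


def gaussian_adjacency(dist: Matrix, kappa: float) -> Matrix:
    """Threshold-based Gaussian kernel of Equation (2)."""
    n = len(dist)
    # Assumption: s is the population standard deviation of all
    # pairwise distances, including the zero diagonal.
    s = statistics.pstdev(d for row in dist for d in row)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            w = math.exp(-dist[i][j] ** 2 / s ** 2)
            # Weak associations below the threshold are cut to zero.
            W[i][j] = w if w >= kappa else 0.0
    return W


# Toy distances between 3 sensors (hypothetical values).
dist = [[0.0, 1.0, 4.0],
        [1.0, 0.0, 3.0],
        [4.0, 3.0, 0.0]]
W = gaussian_adjacency(dist, kappa=0.1)
```

In this toy case only sensors 0 and 1 remain connected: the kernel values of the two longer distances fall below κ = 0.1 and are set to zero, while every diagonal entry equals exp(0) = 1, in line with Definition 2.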

To facilitate subsequent calculation, the adjacency matrix A is first transformed into $\tilde{A}$ according to Equation (3). $\tilde{A} = A + I$ (3) where $\tilde{A}$ is the transformed adjacency matrix and I is the identity matrix. It is noted that all elements in $\tilde{A}$ belong to [0,1]. The matrix $\tilde{A}$ is then processed into ${\tilde{A}}^{k}$ using the k-hop algorithm, as shown in Equation (4). ${\tilde{A}}^{k} = norm ((A + I)^{k})$ (4) where ${\tilde{A}}^{k}$ is a k-hop adjacency matrix and norm is the normalization function. Different k-hop coefficients allow the k-hop adjacency matrix to represent the local spatial features with different distance ranges among sensors. The min-max normalization function is used to ensure that all elements in ${\tilde{A}}^{k}$ still belong to [0,1].
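A small sketch of Equations (3) and (4) follows. It assumes that the exponent k denotes the k-th matrix power and that min-max normalization is applied over the whole matrix; both are readings of the text rather than confirmed details. The helper names and the toy graph are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(a[i][l] * b[l][j] for l in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def min_max(m: Matrix) -> Matrix:
    """Min-max normalize all elements of a matrix into [0, 1]."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in m]
    return [[(v - lo) / (hi - lo) for v in row] for row in m]


def k_hop_adjacency(A: Matrix, k: int) -> Matrix:
    """Equation (4): norm((A + I)^k), assuming a matrix power."""
    n = len(A)
    a_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    power = a_tilde
    for _ in range(k - 1):
        power = matmul(power, a_tilde)
    return min_max(power)


# Toy chain of 5 sensors: 0 - 1 - 2 - 3 - 4 (binary, no self-loops,
# so that A + I has entries in [0, 1]).
A = [[0.0, 1.0, 0.0, 0.0, 0.0],
     [1.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0, 0.0, 1.0],
     [0.0, 0.0, 0.0, 1.0, 0.0]]
hop1, hop2, hop3 = (k_hop_adjacency(A, k) for k in (1, 2, 3))
```

On this chain, sensor 0 has no weight towards sensor 2 in `hop1`, a positive weight in `hop2`, and first reaches sensor 3 in `hop3`, which illustrates how larger k covers larger distance ranges.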

To generate a global spatial graph G_st with more comprehensive spatial features, multiple k-hop adjacency matrices with different distance ranges (i.e., local spatial graphs) are produced and aggregated, according to Equation (5). $G_{st} = relu (rownorm (W_{1} {\tilde{A}}^{1} + W_{2} {\tilde{A}}^{2} + \dots + W_{k} {\tilde{A}}^{k}))$ (5) where relu is the relu activation function, W₁, W₂, ... , W_k are trainable weights, and rownorm is also a normalization method, which normalizes each element in the global spatial graph by applying min-max normalization on each row of the global spatial graph to eliminate the influence caused by the large weight gap among different sensors.
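The fusion in Equation (5) can be sketched as follows. The sketch assumes that each trainable weight W_k is a scalar and uses fixed example values in place of learned ones; the function names and the toy matrices are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def row_min_max(m: Matrix) -> Matrix:
    """rownorm: min-max normalize each row separately."""
    out = []
    for row in m:
        lo, hi = min(row), max(row)
        span = hi - lo
        out.append([(v - lo) / span if span > 0 else 0.0 for v in row])
    return out


def fuse_graphs(hops: List[Matrix], weights: List[float]) -> Matrix:
    """Equation (5): relu(rownorm(sum_k W_k * A~^k)).

    Assumption: every W_k is a scalar (fixed here, trainable in a model).
    """
    n = len(hops[0])
    mixed = [[sum(w * h[i][j] for w, h in zip(weights, hops))
              for j in range(n)] for i in range(n)]
    # After row-wise min-max the values are already non-negative,
    # so ReLU leaves them unchanged in this sketch.
    return [[max(0.0, v) for v in row] for row in row_min_max(mixed)]


# Two toy local spatial graphs for 3 sensors (hypothetical values).
hop_a = [[1.0, 0.5, 0.0], [0.5, 1.0, 0.5], [0.0, 0.5, 1.0]]
hop_b = [[0.8, 0.6, 0.0], [0.6, 1.0, 0.6], [0.0, 0.6, 0.8]]
G_st = fuse_graphs([hop_a, hop_b], weights=[0.7, 0.3])
```

Because the normalization is applied per row, every sensor's strongest association is scaled to 1, which is how the weight gap among different sensors is removed.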

4.2 Spatial-temporal gated recurrent unit module

As shown in Fig. 1-B, combined with the global spatial graph, the traffic data with time steps are input into the proposed STGRU to extract the time-varying spatial features and temporal features.

Referring to Fig. 2, STGRU extends the original GRU [8], which involves a reset gate and an update gate, with a new spatial gate that performs the graph convolution operation to extract the time-varying spatial features. The spatial gate s_g captures the time-varying spatial features from the global spatial graph G_st, the reset gate r_g removes the unwanted features, and the update gate z_g retains the significant spatial-temporal features. $h_{t - 1}^{n}$ and $h_{t}^{n}$ denote the cell states of the n-th sensor at time t-1 and time t respectively, X_t is the traffic speed of all sensors at time t, $x_{t}^{n}$ denotes the traffic speed of the n-th sensor at time t, and $x_{t + 1}^{n}$ denotes the forecasted traffic speed of the n-th sensor at time t + 1. KGC represents the k-hop graph convolution, ⊕ and ⊗ represent the gating mechanisms, and σ and tanh represent the sigmoid and tanh activation functions, respectively.

Fig. 2

Architecture of STGRU.

Although a graph convolution operation based on the adjacency matrix was proposed by Kipf and Welling [24], it does not apply to the k-hop adjacency matrices in the global spatial graph. Therefore, a new KGC operation is proposed for the spatial gate in Equation (6). $KGC = (G_{st} - diag (G_{st})) X_{t}$ (6) where diag denotes the function that obtains the diagonal matrix (i.e., the correlation of each sensor with itself). Removing the diagonal excludes the traffic speed of the current sensor from the aggregation over all sensors, which avoids data redundancy: the traffic speed of the current sensor is already input into the STGRU directly, so including it in KGC as well would input it twice.
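Equation (6) amounts to a matrix-vector product with the diagonal of the global spatial graph zeroed out. The sketch below assumes one speed value per sensor; the function name `kgc` and the toy numbers are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def kgc(G_st: Matrix, x_t: List[float]) -> List[float]:
    """Equation (6): (G_st - diag(G_st)) X_t for one feature per sensor."""
    n = len(G_st)
    # Subtracting diag(G_st) sets every self-correlation to zero.
    off_diag = [[0.0 if i == j else G_st[i][j] for j in range(n)]
                for i in range(n)]
    # Each sensor aggregates the speeds of the other sensors only.
    return [sum(off_diag[i][j] * x_t[j] for j in range(n))
            for i in range(n)]


# Toy global spatial graph and speeds for 3 sensors (hypothetical).
G_st = [[1.0, 0.5, 0.0],
        [0.5, 1.0, 0.2],
        [0.0, 0.2, 1.0]]
x_t = [60.0, 50.0, 40.0]
spatial = kgc(G_st, x_t)
```

For sensor 1 the result is 0.5 · 60 + 0.2 · 40 = 38; its own speed of 50 does not contribute, since it reaches the STGRU separately as $x_{t}^{n}$.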

After the time-varying spatial features are extracted from the global spatial graph, they are merged with the cell states. Finally, the fused data go through the reset gate and update gate, resulting in the comprehensive spatial-temporal features. The formulas for computing $h_{t}^{n}$ (i.e., $x_{t + 1}^{n}$ ) are defined in Equations (7)–(11), in which Equations (8), (9) and (11) are borrowed from GRU [8]. $s_{g} = σ (KGC \cdot W_{s}^{k})$ (7) $r_{g} = σ (x_{t}^{n} \cdot W_{x}^{r} + h_{t - 1}^{n} \cdot W_{h}^{r} + b_{r})$ (8) $z_{g} = σ (x_{t}^{n} \cdot W_{x}^{z} + h_{t - 1}^{n} \cdot W_{h}^{z} + b_{z})$ (9) $\tilde{h_{t}^{n}} = \tanh ((s_{g} + r_{g} ⊙ h_{t - 1}^{n}) \cdot W_{h}^{h} + x_{t}^{n} \cdot W_{x}^{h} + b_{n})$ (10) $h_{t}^{n} = (1 - z_{g}) ⊙ h_{t - 1}^{n} + z_{g} ⊙ \tilde{h_{t}^{n}}$ (11) where $W_{s}^{k}$ , $W_{x}^{r}$ , $W_{h}^{r}$ , $W_{x}^{z}$ , $W_{h}^{z}$ , $W_{h}^{h}$ , and $W_{x}^{h}$ are learnable weight matrices, b_r, b_z, and b_n are the biases of the reset gate, update gate, and cell state respectively, $\tilde{h_{t}^{n}}$ is the temporary cell state, ⊙ is the Hadamard product operator, and · represents matrix multiplication. Then the spatial-temporal features obtained by the different STGRUs are concatenated to produce the input from which the global features are extracted in the next step.
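To make the data flow of Equations (7)–(11) concrete, the following sketch performs one STGRU step for a single sensor. It deliberately assumes a hidden size of 1, so every weight matrix reduces to a scalar; the parameter dictionary, its key names, and the example values are hypothetical and not trained weights.

```python
import math
from typing import Dict


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def stgru_step(kgc_n: float, x_n: float, h_prev: float,
               p: Dict[str, float]) -> float:
    """One STGRU update for sensor n with scalar weights (hidden size 1)."""
    s_g = sigmoid(kgc_n * p["W_s"])                                    # Eq. (7)
    r_g = sigmoid(x_n * p["W_xr"] + h_prev * p["W_hr"] + p["b_r"])     # Eq. (8)
    z_g = sigmoid(x_n * p["W_xz"] + h_prev * p["W_hz"] + p["b_z"])     # Eq. (9)
    h_cand = math.tanh((s_g + r_g * h_prev) * p["W_hh"]
                       + x_n * p["W_xh"] + p["b_n"])                   # Eq. (10)
    return (1.0 - z_g) * h_prev + z_g * h_cand                         # Eq. (11)


# Hypothetical parameter values, chosen only for illustration.
params = {"W_s": 0.5, "W_xr": 0.1, "W_hr": -0.2, "b_r": 0.0,
          "W_xz": 0.3, "W_hz": 0.4, "b_z": 0.1,
          "W_hh": 0.7, "W_xh": -0.5, "b_n": 0.05}
# Inputs are assumed to be normalized (e.g. Z-scored) values.
h_next = stgru_step(kgc_n=0.8, x_n=0.3, h_prev=0.4, p=params)
```

The spatial gate enters only the candidate state in Equation (10), where it is added to the reset-gated previous state, while Equation (11) is the usual GRU interpolation between the previous state and the candidate.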

4.3 Self-attention module

The RNN-based models, including GRU, can only extract features in chronological order [28]. To extract the global features of traffic data for more accurate traffic speed forecasting, this study proposes a new self-attention block based on the combination of the self-attention mechanism [32] with an improved residual network. The self-attention mechanism has been proven to be effective in extracting the global features of traffic data [28], but it suffers from overfitting and training difficulties because of the large amount of computation, which can be addressed by the improved residual network.

As shown in Fig. 1-C, the self-attention module includes positional encoding, a self-attention block (consisting of a self-attention layer and the improved residual network), and an output layer. Among them, the positional encoding is sinusoidal, and the self-attention layer uses the multi-head self-attention mechanism. Please refer to Vaswani et al. [32] for the details of sinusoidal positional encoding and the self-attention layer.

Extending the original residual network [27], Ding et al. [36] increased the number of identities of the residual network, and added 3×3 and 1×1 convolutions on the identities, improving the efficiency of image feature extraction. However, because the traffic network is a non-Euclidean structure, convolution is not suitable for extracting global features from traffic data. Inspired by, but differing from, Ding et al. [36], this study employs two different non-linear activation functions (i.e., relu and sigmoid) in the improved residual network to extract the global features from different aspects. Then the two different types of global features are connected with the initial features to obtain the fused global features, which are finally normalized through layer normalization into the data range [0,1]. The specific calculation is shown in Equation (12). $output = layernorm (σ (value) + relu (value) + value)$ (12) where output is the normalized and fused global features, layernorm is the layer normalization [37], and σ and relu are activation functions.
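Equation (12) can be sketched for a single feature vector as follows. The layer normalization used here is the standard per-vector form without learnable scale and shift, which is an assumption; the exact variant used in the model may differ. The function names and the input values are hypothetical.

```python
import math
from typing import List


def layer_norm(vec: List[float], eps: float = 1e-5) -> List[float]:
    """Standard layer normalization over one feature vector
    (assumption: no learnable scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def improved_residual(value: List[float]) -> List[float]:
    """Equation (12): layernorm(sigmoid(value) + relu(value) + value)."""
    fused = [1.0 / (1.0 + math.exp(-v))   # sigmoid branch
             + max(0.0, v)                # relu branch
             + v                          # identity (residual) branch
             for v in value]
    return layer_norm(fused)


# Toy output of the self-attention layer (hypothetical values).
value = [-1.0, 0.0, 2.0]
output = improved_residual(value)
```

The identity branch keeps the original features available to later layers, while the two activation branches add a bounded view (sigmoid) and a one-sided view (relu) of the same values before normalization.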

After obtaining the final global features, the output layer (i.e., the fully connected layer) is used to obtain the final multi-step traffic speed forecasting results.

5 Experiments and analysis

5.1 Dataset selection and preprocessing

To measure the forecasting performance and robustness of the proposed model, this study selects two real-world traffic speed datasets to evaluate the model. These datasets were collected by the Highway Performance Measurement System (PEMS) [38], and are briefly described as follows.

PEMS-BAY: The dataset comprises traffic speed data from 150 sensors in the Bay Area of California from January 1st 2017 to February 28th 2017. During this period, there was frequent heavy rainfall, and the rainfall in February broke the precipitation record of the Bay Area of California. The sensor distribution is shown in Fig. 3(a).

PEMS4: The dataset comprises traffic speed data from 128 sensors in the Bay Area of California from June 1st 2017 to June 30th 2017. During this period, there were frequent gales. The sensor distribution is shown in Fig. 3(b).

Fig. 3

The sensor distribution of the two datasets.

The traffic data in these two datasets were collected by traffic sensors every 5 minutes. The data were preprocessed through Z-score normalization, as shown in Equation (13): $Z = \frac{X - μ}{s}$ (13) where Z is the normalized data, X is the traffic data, μ is the mean of X, and s is the standard deviation of X.
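Equation (13) and its inverse can be sketched as below. The text does not state whether μ and s are computed on the training split only, so this sketch simply computes them on whatever data it is given; the function names and sample values are hypothetical.

```python
import statistics
from typing import List, Tuple


def z_score(x: List[float]) -> Tuple[List[float], float, float]:
    """Equation (13): Z = (X - mu) / s, returning mu and s as well."""
    mu = statistics.mean(x)
    s = statistics.pstdev(x)  # population standard deviation
    return [(v - mu) / s for v in x], mu, s


def inverse_z_score(z: List[float], mu: float, s: float) -> List[float]:
    """Map normalized forecasts back to speed units."""
    return [v * s + mu for v in z]


# Toy speed readings (hypothetical values).
speeds = [55.0, 60.0, 65.0, 70.0]
z, mu, s = z_score(speeds)
restored = inverse_z_score(z, mu, s)
```

Keeping μ and s allows forecasts made in normalized space to be converted back to speed units before the error metrics are computed.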

5.2 Experimental settings

All experimental results in this section are averages over 5 runs. All the comparison models in this study run in a computing environment with Python 3.6 and PyTorch 1.10. The computer running the experiments is configured with an Intel (R) Core (TM) i7-8700 3.20 GHz CPU, 11 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti.

The dataset was divided into training, validation, and test sets according to the ratio of 7:2:1. When evaluating the model performance, 12 known historical time steps were used to forecast the next 12 consecutive time steps. The mean absolute error (MAE), root mean squared error (RMSE), and mean absolute percentage error (MAPE) were used as evaluation indicators for the comparison models. The specific calculation method for the metrics is shown in Equations (14)–(16). $MAE = \frac{1}{T} \sum_{i = 1}^{T} \left| Y_{i} - {\hat{Y}}_{i} \right|$ (14) $RMSE = \sqrt{\frac{1}{T} \sum_{i = 1}^{T} (Y_{i} - {\hat{Y}}_{i})^{2}}$ (15) $MAPE = \frac{100\%}{T} \sum_{i = 1}^{T} \left| \frac{Y_{i} - {\hat{Y}}_{i}}{Y_{i}} \right|$ (16) where T represents the number of forecasted time steps, Y_i represents the true value at time point i, and ${\hat{Y}}_{i}$ is the forecasted value at time point i.
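The three metrics in Equations (14)–(16) can be written directly from their definitions. The sketch below works on plain lists of true and forecasted speeds; the sample values are hypothetical, and MAPE assumes that no true value is zero.

```python
import math
from typing import List


def mae(y: List[float], y_hat: List[float]) -> float:
    """Equation (14): mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rmse(y: List[float], y_hat: List[float]) -> float:
    """Equation (15): root mean squared error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mape(y: List[float], y_hat: List[float]) -> float:
    """Equation (16): mean absolute percentage error, in percent.
    Assumes every true value is non-zero."""
    return 100.0 / len(y) * sum(abs((a - b) / a) for a, b in zip(y, y_hat))


# Toy true and forecasted speeds (hypothetical values).
y_true = [60.0, 50.0, 40.0]
y_pred = [57.0, 50.0, 44.0]
scores = (mae(y_true, y_pred), rmse(y_true, y_pred), mape(y_true, y_pred))
```

For these values the absolute errors are 3, 0 and 4, so the MAE is 7/3, the RMSE is the square root of 25/3, and the MAPE is 5%.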

In training the STGGCN proposed in this study, the training epoch was set to 50, the learning rate was set to 0.002, the batch size was set to 200, and the optimizer was Adam. The k value of the k-hop algorithm in the spatial graph generation module was set to 4, and the number of heads for multi-head self-attention in the self-attention block was set to 4.

5.3 Baseline experiments

To evaluate the relative performance of the proposed model in the field of traffic speed forecasting, this study compares the proposed STGGCN with some classical machine learning methods and some deep learning models as baseline models. The hyperparameters of the baseline models were set according to the default parameters. These baseline models are briefly introduced as follows:

HA: The history average (HA) method calculates the average value of the previous 12 time steps as the forecasted value of the next moment.

SVR [5]: SVR uses support vector regression method to fit the temporal features in traffic data for forecasting.

GRU [8]: GRU extracts the temporal features in the traffic data, and uses the hidden state to store and transfer the learned features.

GCN [24]: GCN learns the spatial features of the traffic networks through the adjacency matrix.

STGCN [13]: Spatio-temporal graph convolutional networks (STGCN) integrates GCN and gated temporal convolution to extract local spatial-temporal features in traffic networks.

STGNN [28]: Spatial temporal graph neural network (STGNN) utilizes the location attention mechanism to capture spatial features, and utilizes GRU and transformer to capture both local and global temporal features.

STSGCN [39]: Spatial-temporal synchronous graph convolutional network (STSGCN) utilizes a spatial-temporal graph convolution module to capture local spatial-temporal features and multi-module layers to capture heterogeneity in long spatio-temporal graphs.
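As a point of reference for the simplest baseline above, HA can be sketched in a few lines. Reusing the same average for every forecast horizon is an assumption here; the function name and the toy data are hypothetical.

```python
from typing import List

Snapshot = List[float]  # speeds of all sensors at one time step


def history_average(history: List[Snapshot], T: int) -> List[Snapshot]:
    """Forecast T steps as the per-sensor mean of the given history.

    Assumption: the same average is reused for every horizon.
    """
    n = len(history[0])
    avg = [sum(snap[i] for snap in history) / len(history)
           for i in range(n)]
    return [list(avg) for _ in range(T)]


# Toy history of 12 steps for 2 sensors (hypothetical values).
history = [[60.0, 50.0]] * 6 + [[66.0, 56.0]] * 6
forecast = history_average(history, T=12)
```

Under this assumption the forecast is flat over the horizon, so the method cannot react to changes in speed within the forecasting window.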

Table 1 presents the forecasting results of the proposed STGGCN and the baseline models on the PEMS-BAY and PEMS4 datasets for 15-minute, 30-minute, 45-minute, and 60-minute forecasting. The optimal forecasting results at each time point are bolded. The experimental results are analyzed as follows.

Table 1
Forecasting results of STGGCN and baseline models on two real-world datasets

Dataset	Model	15 min MAE	15 min RMSE	15 min MAPE	30 min MAE	30 min RMSE	30 min MAPE	45 min MAE	45 min RMSE	45 min MAPE	60 min MAE	60 min RMSE	60 min MAPE
PEMS-BAY	HA	2.760	6.160	5.590%	2.760	6.160	5.590%	2.760	6.160	5.590%	2.760	6.160	5.590%
PEMS-BAY	SVR	2.575	7.612	6.401%	2.700	7.852	6.509%	2.827	8.094	6.628%	2.954	8.341	6.766%
PEMS-BAY	GRU	1.544	3.171	3.358%	2.125	4.519	4.940%	2.624	5.450	6.358%	3.065	6.180	7.650%
PEMS-BAY	GCN	3.455	5.652	7.714%	3.061	6.158	8.420%	3.902	6.644	9.094%	4.079	7.089	9.752%
PEMS-BAY	STGCN	1.334	2.767	2.841%	1.863	4.121	4.146%	2.298	5.112	5.086%	2.706	5.993	5.897%
PEMS-BAY	STGNN	1.357	2.935	2.828%	1.839	4.229	4.128%	2.180	5.091	5.154%	2.356	5.709	6.008%
PEMS-BAY	STSGCN	1.408	3.034	2.992%	1.778	4.110	3.992%	2.026	4.714	4.652%	2.216	5.182	5.160%
PEMS-BAY	STGGCN	1.434	2.882	3.076%	1.713	3.600	3.788%	1.878	3.860	4.220%	2.042	4.299	4.656%
PEMS4	HA	3.380	6.510	8.250%	3.380	6.510	8.250%	3.380	6.510	8.250%	3.380	6.510	8.250%
PEMS4	SVR	3.832	11.095	8.345%	3.944	11.312	8.454%	4.063	11.550	8.563%	4.192	11.815	8.693%
PEMS4	GRU	1.951	3.757	4.340%	2.625	5.208	6.224%	3.176	6.203	7.914%	3.714	6.979	9.548%
PEMS4	GCN	5.771	9.350	15.026%	5.816	9.455	15.244%	5.864	9.565	15.458%	5.920	9.681	15.682%
PEMS4	STGCN	1.992	3.763	4.331%	2.827	5.614	6.375%	3.496	6.960	7.934%	4.080	8.073	9.221%
PEMS4	STGNN	1.825	3.603	3.886%	2.413	5.126	5.408%	2.875	6.158	6.748%	3.274	6.935	7.976%
PEMS4	STSGCN	3.122	6.080	7.662%	3.298	6.482	8.154%	3.466	6.840	8.648%	3.656	7.186	9.190%
PEMS4	STGGCN	1.988	3.708	4.612%	2.403	4.753	5.868%	2.667	5.351	6.690%	2.885	5.722	7.330%

Note: the bolded data in the table represent the optimal results.

The deep learning models (GRU, GCN, STGCN, STGNN, STSGCN, and STGGCN) outperform the classical machine learning models (HA and SVR) for traffic speed forecasting in most cases. This is because the traditional time series forecasting models have simple linear structures and cannot capture complex non-linear features in traffic data.

In most cases, the models based on spatial-temporal features (STGCN, STGNN, STSGCN, and STGGCN) perform better than other models based only on temporal or spatial features (HA, SVR, GRU, and GCN), because traffic speed is both spatially and temporally correlated.

In comparing STGGCN with deep learning models that extracts only local spatial-temporal features (STGCN and STSGCN), it was found that their forecasting performance is similar for short-term traffic speed forecasting. However, as the forecasting time length increases, it is found that STGGCN performs better than STGCN and STSGCN. This is because STGGCN makes full use of the global features of traffic data.

In comparing STGGCN with STGNN, it is found that STGGCN performs worse than STGNN for short-term traffic speed forecasting. However, as the forecasting time length increases, STGGCN performs better and better than STGNN. This is because STGGCN can extract more comprehensive spatial-temporal features.

To compare the forecasting performance of STGGCN and STGNN more intuitively, their forecasting results on randomly selected sensors at 288 consecutive time steps are visualized in Fig. 4. STGGCN is more sensitive to traffic speed changes than STGNN, and its forecasts are close to the true values, which indicates that STGGCN is well suited to capturing complex traffic features for traffic speed forecasting, particularly long-term forecasting.

Fig. 4

Comparison of forecasting performance between STGGCN and STGNN on PEMS-BAY.

5.4 Ablation experiments

To demonstrate the usefulness of each module in the proposed STGGCN, ablation experiments are conducted on four variant models obtained by combining different modules with the basic GRU: GRU-GCN, e-GRU-GCN, e-GRU-KGC, and e-GRU-KGC-SA. They are described as follows.

GRU-GCN: This model uses GRU and GCN to extract temporal features and spatial features respectively, and then combines them [14].

e-GRU-GCN: This model embeds GCN into GRU, i.e., adds GCN as a spatial gate of GRU to extract spatial-temporal features.

e-GRU-KGC: This model embeds KGC into GRU, i.e., adds KGC in the spatial gate of GRU to extract comprehensive spatial-temporal features.

e-GRU-KGC-SA: This model combines e-GRU-KGC with a self-attention block that adopts the original residual network [40] to extract global features.
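
As a rough illustration of what embedding a graph convolution into a GRU can look like, the following sketch replaces the dense transformations inside the GRU gates with a graph convolution over the sensor network. It is a minimal sketch under stated assumptions, not the formulation used in this paper: the class name GraphGRUCell, the random weights, the absence of bias terms, and the symmetric normalization of the adjacency matrix (the standard GCN choice [24]) are all assumptions made for illustration.

```python
import numpy as np


def normalized_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, the symmetric GCN normalization."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class GraphGRUCell:
    """Hypothetical GRU cell whose gates use a graph convolution A_norm @ X @ W."""

    def __init__(self, A, in_dim, hid_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.A_norm = normalized_adjacency(A)
        shape = (in_dim + hid_dim, hid_dim)
        # Illustrative random weights; a real model would learn these.
        self.W_z = rng.normal(0.0, 0.1, shape)  # update gate
        self.W_r = rng.normal(0.0, 0.1, shape)  # reset gate
        self.W_c = rng.normal(0.0, 0.1, shape)  # candidate state

    def gconv(self, X, W):
        # Mix each node's features with those of its neighbours, then project.
        return self.A_norm @ X @ W

    def step(self, x, h):
        xh = np.concatenate([x, h], axis=1)
        z = sigmoid(self.gconv(xh, self.W_z))
        r = sigmoid(self.gconv(xh, self.W_r))
        c = np.tanh(self.gconv(np.concatenate([x, r * h], axis=1), self.W_c))
        # Blend the previous state with the candidate state.
        return z * h + (1.0 - z) * c


# Toy example: 3 sensors on a line (0-1-2), 1 input feature, 4 hidden units.
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
cell = GraphGRUCell(A, in_dim=1, hid_dim=4)
h = np.zeros((3, 4))
speeds = np.random.default_rng(1).uniform(0.0, 1.0, size=(12, 3, 1))  # 12 time steps
for x_t in speeds:
    h = cell.step(x_t, h)
print(h.shape)  # (3, 4)
```

In this sketch every gate sees the neighbours' inputs and hidden states at every time step, which is one way spatial information can vary over time, in contrast to a pipeline that applies a graph convolution and a GRU separately.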

Table 2 shows the forecasting results of GRU, its variant models, and the proposed STGGCN on the two real-world datasets for forecasting horizons of 15, 30, 45, and 60 minutes. The optimal forecasting results at each horizon are bolded. Fig. 5 compares the MAE, RMSE, and MAPE values of the different models on the two datasets. The experimental results are analyzed as follows.

Table 2
Forecasting results of GRU, its variant models and proposed STGGCN on two real-world datasets

Dataset	Model	15 minutes			30 minutes			45 minutes			60 minutes		
		MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
PEMS-BAY	GRU	1.544	3.171	3.358%	2.125	4.519	4.940%	2.624	5.450	6.358%	3.065	6.180	7.650%
	GRU-GCN	2.143	3.837	4.554%	2.295	4.171	4.952%	2.444	4.439	5.320%	2.648	4.727	5.766%
	e-GRU-GCN	1.677	3.175	3.544%	1.936	3.775	4.206%	2.097	4.126	4.614%	2.223	4.356	4.910%
	e-GRU-KGC	1.586	3.098	3.396%	1.838	3.764	4.112%	2.003	4.142	4.578%	2.151	4.485	4.948%
	e-GRU-KGC-SA	1.537	3.050	3.330%	1.808	3.731	4.036%	1.981	4.135	4.508%	2.154	4.484	4.998%
	STGGCN	1.434	2.882	3.076%	1.713	3.600	3.788%	1.878	3.860	4.220%	2.042	4.299	4.656%
PEMS4	GRU	1.951	3.757	4.340%	2.625	5.208	6.224%	3.176	6.203	7.914%	3.714	6.979	9.548%
	GRU-GCN	3.562	6.130	8.786%	2.633	6.297	8.982%	3.756	6.523	9.356%	3.921	6.795	9.864%
	e-GRU-GCN	2.263	4.022	5.282%	2.689	4.951	6.444%	2.963	5.511	7.196%	3.171	5.885	7.722%
	e-GRU-KGC	2.125	3.933	5.060%	2.500	4.798	6.154%	2.754	5.355	6.902%	2.956	5.767	7.444%
	e-GRU-KGC-SA	2.020	3.769	4.708%	2.455	4.839	6.084%	2.723	5.432	6.884%	2.943	5.808	7.492%
	STGGCN	1.988	3.708	4.612%	2.403	4.753	5.868%	2.667	5.351	6.690%	2.885	5.722	7.330%

Note: the bolded data in the table represent the optimal results.

Fig. 5

Comparison of GRU, its variant models and proposed STGGCN on two real-world datasets.
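
For reference, the three error metrics reported in the tables can be computed as in the following sketch. It uses the standard definitions of MAE, RMSE, and MAPE; the function names, the toy speed values, and the masking of near-zero true speeds in MAPE are assumptions made for illustration and may differ from the evaluation code used in the experiments.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred, eps=1e-8):
    """Mean absolute percentage error, in percent.

    Near-zero true values are skipped to avoid division by zero (assumption).
    """
    mask = np.abs(y_true) > eps
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100.0)


# Toy speeds, made up for illustration only.
y_true = np.array([60.0, 50.0, 40.0, 20.0])
y_pred = np.array([57.0, 52.0, 40.0, 25.0])

print(mae(y_true, y_pred))   # 2.5
print(rmse(y_true, y_pred))  # sqrt(9.5), about 3.082
print(mape(y_true, y_pred))  # about 8.5 (percent)
```

RMSE penalizes large errors more heavily than MAE, and MAPE weights errors at low speeds more heavily, which is why the three metrics do not always rank the models in the same order.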

GRU performs well at 15 min and 30 min, but its forecasting performance at 45 min and 60 min is unsatisfactory on both datasets, because it utilizes only the temporal features of traffic data and ignores the spatial features.

GRU-GCN performs worse than GRU at 15 min and 30 min, but better than GRU at 45 min and 60 min, on the PEMS-BAY dataset. This is because even simple spatial-temporal features can improve the accuracy of long-term traffic forecasting. However, GRU-GCN always performs worse than GRU on the PEMS4 dataset. This is because the spatial features in the PEMS4 dataset are relatively complex, and GRU-GCN is not good at extracting complex spatial features, in particular time-varying spatial features.

e-GRU-GCN performs better than GRU-GCN on both datasets, because it embeds GCN into GRU and extracts both time-varying spatial features and temporal features. This shows that time-varying spatial features are beneficial for traffic speed forecasting.

e-GRU-KGC performs better than e-GRU-GCN in most cases on both datasets, because the proposed KGC operation embedded in GRU helps to extract more comprehensive spatial-temporal features.

e-GRU-KGC-SA performs better than e-GRU-KGC in most cases on both datasets, because it extracts global features from traffic data through the self-attention block. This shows that global features are helpful for accurately forecasting traffic speed.

The proposed STGGCN performs better than e-GRU-KGC-SA on both datasets, demonstrating that the proposed improved residual network reduces the probability of overfitting and enhances the training effect of the model.

In summary, each proposed module in STGGCN can significantly enhance the forecasting performance of the model.
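
The combined effect of the modules can be quantified directly from Table 2. The following sketch computes the relative MAE reduction of STGGCN over the basic GRU on PEMS-BAY at each forecasting horizon; the MAE values are copied from Table 2, while the function name relative_reduction is a hypothetical helper introduced for illustration.

```python
# MAE values copied from Table 2 (PEMS-BAY) for the 15/30/45/60-minute horizons.
horizons = [15, 30, 45, 60]
mae_gru = [1.544, 2.125, 2.624, 3.065]
mae_stggcn = [1.434, 1.713, 1.878, 2.042]


def relative_reduction(baseline, model):
    """Percentage by which `model` lowers the error of `baseline`, per horizon."""
    return [100.0 * (b - m) / b for b, m in zip(baseline, model)]


reductions = relative_reduction(mae_gru, mae_stggcn)
for horizon, reduction in zip(horizons, reductions):
    print(f"{horizon:2d} min: MAE reduced by {reduction:.1f}%")
```

On these values the reduction grows from about 7% at 15 minutes to about 33% at 60 minutes, which is consistent with the observation that the spatial and global features matter most for long-term forecasting.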

6 Conclusion

This study proposes a new deep learning model, STGGCN, to improve traffic speed forecasting. Experiments on two real-world traffic speed datasets verify its superiority over the baseline models and the usefulness of the modules involved. The following conclusions are drawn from the experimental results:

The proposed spatial graph generation module enhances the adjacency matrix with the k-hop algorithm to produce a global spatial graph with comprehensive spatial features.

The proposed STGRU embeds a new KGC operation into GRU to extract comprehensive spatial-temporal features.

The proposed improved residual network in the self-attention block helps aggregate more global features from different aspects to reduce the probability of overfitting and improve the forecasting accuracy of the model.
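
To make the first conclusion concrete, the following sketch shows one simple way a k-hop expansion can turn a local adjacency matrix into a graph that also links distant sensors. It is a minimal sketch, assuming an unweighted binary graph; the function name k_hop_graph is hypothetical, and the weighting scheme used by the proposed spatial graph generation module is not reproduced here.

```python
import numpy as np


def k_hop_graph(A, k):
    """Binary graph linking every pair of nodes at most k hops apart."""
    n = A.shape[0]
    # One step: move to a direct neighbour or stay in place.
    step = ((A > 0) | np.eye(n, dtype=bool)).astype(int)
    reach = np.eye(n, dtype=int)
    for _ in range(k):
        reach = (reach @ step > 0).astype(int)
    return reach


# Toy road network: 4 sensors on a line (0-1-2-3).
A = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

print(k_hop_graph(A, 1)[0].tolist())  # [1, 1, 0, 0]
print(k_hop_graph(A, 2)[0].tolist())  # [1, 1, 1, 0]
print(k_hop_graph(A, 3)[0].tolist())  # [1, 1, 1, 1]
```

With k = 1 sensor 0 is linked only to its direct neighbour, whereas with k = 3 it is linked to every sensor on the line, which is the sense in which a larger k yields a more global spatial graph.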

Although STGGCN forecasts traffic speed accurately in most cases, its short-term forecasting performance still has room for improvement, and several directions can be explored in the future. First, the model considers only the spatial-temporal information of traffic speed and ignores other information, such as semantic correlations and events (e.g., sports games and concerts). Second, the attention module can be integrated with the graph convolution operation to enhance the aggregation of local spatial features and improve the short-term forecasting ability of the model. Finally, the self-attention block can be improved to extract more comprehensive global features (such as feature interaction and vector interaction).

Acknowledgments

This work was supported by the Zhejiang Key R&D Project of China (No. 2022C01005, No. 2022C01082) and the Humanities and Social Sciences Research Project of the Ministry of Education, China (No. 20YJC870003).

References

1. D.H., Predicting short-term traffic flow in urban based on multivariate linear regression model, Journal of Intelligent & Fuzzy Systems 39(3) (2020), 1417–1427.

2. Wang, Shi Q.X., Short-term traffic speed forecasting hybrid model based on Chaos-Wavelet Analysis-Support Vector Machine theory, Transportation Research Part C: Emerging Technologies 27 (2013), 219–232.

3. Phusittrakool, Jeenanunta, Prathombutr, Evaluation of network performance under provision of short predictive traffic information, Walailak Journal of Science and Technology 13(6) (2015), 433–450.

4. Mohammad M.M., Hashem R.A., Bani S.Z.M., Short-term prediction of traffic volume in urban arterials, Journal of Transportation Engineering 121(3) (1995), 249–254.

5. C.H., Wei C.C., Su D.C., Chang M.H., Ho J.M., Travel time prediction with support vector regression, IEEE Transactions on Intelligent Transportation Systems 5(4) (2004), 276–281.

6. Polson N.G., Sokolov V.O., Deep learning for short-term traffic flow prediction, Transportation Research Part C: Emerging Technologies 79 (2017), 1–17.

7. Hochreiter, Schmidhuber, Long short-term memory, Neural Computation 9 (1997), 1735–1780.

8. Cho, Merrienboer B.V., Gulcehre, Bahdanau, Bougares, Schwenk and Bengio Yoshua, Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 19th Conference on Empirical Methods in Natural Language Processing, (2014) October 25–29, Doha, Qatar, pp. 1724–1734.

9. X.L., Tao Z.M., Wang Y.H., Yu H.Y., Wang Y.P., Long short-term memory neural network for traffic speed prediction using remote microwave sensor data, Transportation Research Part C: Emerging Technologies 54 (2015), 187–197.

10. Liu Y.P., Zheng H.F., Feng X.X., Chen Z.H., Short-term traffic flow prediction with Conv-LSTM. In Proceedings of the 9th International Conference on Wireless Communications and Signal Processing, (2017) October 11–13, Nanjing, China, doi: 10.1109/WCSP.2017.8171119.

11. Geng, Li Y.G., Wang L.Y., Zhang L.Y., Yang, Ye J.P., Liu, Spatiotemporal multi-graph convolution network for ride-hailing demand forecasting. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, January 27–February 1, Hawaii, USA, (2019) pp. 3656–3663.

12. M.Q., Hong Z.X., Chen T.M., Zhu T.T., Temporal multi-graph convolutional network for traffic flow prediction, IEEE Transactions on Intelligent Transportation Systems 22(6) (2021), 3337–3348.

13. Yin H.T., Zhu Z.X., Spatio-temporal graph convolutional networks: A deep learning framework for traffic forecasting. In Proceedings of the 27th International Joint Conferences on Artificial Intelligence Organization (2018) July 13–19, Stockholm, Sweden, pp. 3634–3640.

14. Zhao, Song Y.J., Zhang, Liu, Wang, Lin, Deng, Li H.F., T-GCN: A temporal graph convolutional network for traffic prediction, IEEE Transactions on Intelligent Transportation Systems 21(9) (2020), 3848–3858.

15. Zhang W.Y., Zhu, Zhang, Chen, Xu J.Y., Dynamic graph convolutional networks based on spatiotemporal data embedding for traffic flow forecasting, Knowledge-Based Systems 250 (2022), 109028.

16. Chen, Chen X.M., A novel reinforced dynamic graph convolutional network model with data imputation for network-wide traffic flow prediction, Transportation Research Part C: Emerging Technologies 143 (2022), 103820.

17. Cui Z.Y., Ke R.M., Pu Z.Y., Wang Y.H., Stacked bidirectional and unidirectional LSTM recurrent neural network for forecasting network-wide traffic state with missing values, Transportation Research Part C: Emerging Technologies 118 (2020), 102674.

18. Zhu, Zhang W.Y., Zhang Z.Q., A novel hybrid deep learning model for taxi demand forecasting based on decomposition of time series and fusion of text data, Journal of Intelligent & Fuzzy Systems 41(3) (2021), 3355–3371.

19. Kumar B.P., Hariharan, Shanmugam, Shriram, Sridhar, Enabling internet of things in road traffic forecasting with deep learning models, Journal of Intelligent & Fuzzy Systems 43(5) (2022), 6265–6276.

20. Yao H.X., Wu, Ke J.T., Tang X.F., Jia Y.T., Lu S.Y., Gong P.H., Ye J.P., Li Z.H., Deep multi-view spatial-temporal network for taxi demand prediction. In Proceedings of the 32nd Association for the Advancement of Artificial Intelligence, February 2–7, New Orleans, Louisiana, USA (2018), pp. 2588–2595.

21. Bronstein M.M., Bruna, LeCun, Szlam, Vandergheynst, Geometric deep learning: Going beyond Euclidean data, IEEE Signal Processing Magazine 34(4) (2017), 18–42.

22. Bruna, Zaremba, Szlam, LeCun, Spectral networks and deep locally connected networks on graphs. In Proceedings of the 28th Conference and Workshop on Neural Information Processing Systems, (2014) December 7–13, Montreal, Canada.

23. Defferrard, Bresson, Vandergheynst, Convolutional neural networks on graphs with fast localized spectral filtering. In Proceedings of the 30th Conference on Neural Information Processing Systems, December 5–10, Barcelona, Spain, (2016) pp. 3844–3852.

24. Kipf T.N., Welling, Semi-supervised classification with graph convolutional networks. In Proceedings of the 5th International Conference on Learning Representations, (2017) April 24–26, Toulon, France.

25. Zhao, Chen M.C., Du Y.T., Yang H.Y., Wang C.J., Spatial-Temporal Graph Convolutional Gated Recurrent Network for Traffic Forecasting (2022). arXiv preprint arXiv: 2210.02737.

26. Bahdanau, Cho K.H., Bengio, Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations, (2015) May 7–9, San Diego, CA, USA.

27. Li Shen, Albanie, Sun and Wu E.H., Squeeze-and-excitation networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 42(8) (2019), 2011–2033.

28. Wang X.Y., Ma, Wang Y.Q., Jin, Wang, Tang J.L., Jia C.Y., Yu, Traffic flow prediction via spatial temporal graph neural network. In Proceedings of the 20th International World Wide Web Conference Committee, (2020) April 20–24, Taipei, Taiwan, pp. 1082–1092.

29. Zhang J.N., Shi X.J., Xie J.Y., Ma, King, Yeung D.Y., GaAN: Gated attention networks for learning on large and spatiotemporal graphs. In Proceedings of the 34th Conference on Uncertainty in Artificial Intelligence, (2018) August 6–10, Monterey, California, USA, pp. 339–349.

30. Guo S.N., Lin Y.F., Feng, Song, Wan H.Y., Attention based spatial-temporal graph convolutional networks for traffic flow forecasting. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, (2019) January 27–February 1, Hawaii, USA, pp. 922–929.

31. Zhao J.L., Liu Z.B., Sun Q.X., Li, Jia X.Y., Zhang R.M., Attention-based dynamic spatial-temporal graph convolutional networks for traffic speed forecasting, Expert Systems with Applications 204 (2022), 117511.

32. Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez A.N., Kaiser, Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems, December 4–9, Long Beach, CA, USA, (2017), pp. 5998–6008.

33. Reza, Ferreira M.C., Machado J.J.M., Tavares J.M.R.S., A multi-head attention-based transformer model for traffic flow forecasting with a comparative analysis to recurrent neural networks, Expert Systems with Applications 202 (2022), 117275.

34. Yan, Gan X.H., Wang, Qin T.J., Self-attention eidetic 3D-LSTM: Video prediction models for traffic flow forecasting, Neurocomputing 509 (2022), 167–176.

35. Shuman D.I.S.K., Frossard, Ortega, Vandergheynst, The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains, IEEE Signal Processing Magazine 30(3) (2013), 83–98.

36. Ding X.H., Zhang X.Y., Ma N.N., Han J.G., Ding G.G., Sun, RepVGG: Making VGG-style convnets great again. In Proceedings of the 34th IEEE Conference on Computer Vision and Pattern Recognition, (2021) June 19–25, Online, pp. 13733–13742.

37. J.L., Kiros J.R., Hinton G.E., Layer normalization (2016). arXiv preprint arXiv: 1607.06450.

38. Chen, Petty, Skabardonis, Varaiya, Jia Z.F., Freeway performance measurement system mining loop detector data, Transportation Research Record 1 (2001), 96–102.

39. Song, Lin Y.F., Guo S.N., Wan H.Y., Spatial-temporal synchronous graph convolutional networks: A new framework for spatial-temporal network data forecasting. In Proceedings of the 34th Association for the Advancement of Artificial Intelligence, February 7–12, New York, USA (2020), pp. 914–921.

40. K.M., Zhang X.Y., Ren S.Q., Sun, Deep residual learning for image recognition. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition (2016), June 26th–July 1st, pp. 770–778.