Multi-Stage Fusion Framework for Short-Term Passenger Flow Forecasting in Urban Rail Transit Systems Using Multi-Source Data

Abstract

To improve real-time operation and management in urban rail transit (URT) systems, accurate and reliable short-term passenger flow forecasting at the network level is a crucial task. Although numerous endeavors have been devoted to this field, the insufficient topological representation for passenger flows in the URT network, the overlooking of intrinsic correlations among multi-source data, and the information loss in deep-learning frameworks are still critical issues that need to be addressed. This study proposes a multi-stage fusion passenger forecasting (MSFPF) model to accomplish short-term multi-step passenger forecasting leveraging multi-source data, and overcome the above-mentioned challenges. Based on the characteristics of passenger flows in the URT network, time-based origin–destination flow data is involved and utilized to enhance the representation of flows and provide spatial-temporal features. Then, the interaction and relationship among multi-source data are estimated to capture their intrinsic correlations. To effectively and comprehensively extract temporal and spatial features, a transformer long short-term memory block and a depth-wise attention block are constructed with attention mechanisms and employed. Furthermore, we construct the multi-stage fusion (MSF) structure to alleviate the information loss during the learning process, which is a significant component in improving the forecasting accuracy. In addition, the model is applied to two large-scale real-world datasets, in which it outperforms nine widely used baselines and four specific variants of itself. The quantitative experiments demonstrate the robustness and superiority of the proposed MSFPF model, and the significant contribution of the MSF structure in the model.

Keywords

urban rail transit system; deep learning and advanced computing applications; short-term forecasting; spatial-temporal data mining; multi-source data fusion

Urban rail transit (URT) is a significant component of public transport in cities, offering high-capacity, punctual, and high-speed service to a large number of passengers. Accurate and reliable short-term passenger flow forecasting (STPFF) plays a vital role in URT systems because it supports many downstream tasks, such as timetable optimization, passenger flow control, and service quality improvement. Short-term predictions provide detailed forecasting information, while multi-step prediction extends the total forecasting period; multi-step short-term passenger prediction therefore balances detail and horizon.

It is well known that time-series forecasting is fundamental to STPFF. To model the temporal dependencies in time-series data, many statistical, machine learning, and deep-learning methods¹ have been employed, among which deep-learning methods have achieved outstanding performance. Exploring spatial correlations is also essential in topological URT networks. Thus, neural networks with great potential for spatial correlation extraction have been widely studied,² but some models originally proposed for road traffic forecasting are still directly introduced to URT networks. Furthermore, to extract comprehensive spatial-temporal features and achieve accurate prediction, multi-source data³ also play a significant role in STPFF, since they provide comprehensive information on factors that affect passenger generation and attraction.

Moreover, STPFF remains challenging because of its complex spatial correlations, temporal correlations, and numerous influencing factors, such as weather, air conditions, and land-use characteristics. Although many endeavors have been devoted to this field in the past decades and a solid foundation has been laid, several critical issues are still unaddressed. (1) The modeling of the topological URT network is a crucial point in this task. However, in many existing studies, the spatial correlations are modeled directly from the physical adjacency between stations,¹,² an approach that was originally proposed for, and is effective in, road traffic modeling. It is limited in representing the URT network because road traffic flows are strictly constrained by traffic flow theories and paths, whereas flows in URT networks can be abstractly regarded as a skip from the entry station to the exit station, which represents a passenger trip without strictly following “adjacency continuity.” To provide a more intuitive comparison, the two flow mechanisms are illustrated in Figures 1 and 2, respectively. Therefore, how to model the spatial correlations in the URT network still requires further exploration. (2) Multi-source input data (e.g., passenger flow data, origin–destination [OD] flow data, point of interest [POI] data) are essential for comprehensive spatial-temporal feature extraction and accurate forecasting.³ In general, intrinsic relations exist between these data sources, yet many existing studies only considered the interaction and relationship of these multi-source data at the decision-wise level, instead of at the data-wise level.
(3) During multi-source fusion, information loss is inevitable because of the learning operations used to model temporal correlations and spatial dependencies, such as convolution, pooling, dimensionality reduction, and non-linear mapping.⁴ For example, the sparsity of passenger flow data after convolutional layers, the degradation of graphical representation caused by neural networks,³ and the masking of some typical features can all be regarded as forms of information loss. Because the passenger flow matrix cannot be easily visualized or quantified, an image representing the URT network is used in its place to visualize the phenomenon. As shown in Figure 3, after convolution and pooling operations the image contains less information than the original network graph: the right-hand side graphs are blurry and suffer from color separation.⁵ The loss of crucial information affects prediction accuracy, and how to preserve more information to improve forecasting accuracy is still a challenge.

Figure 1.

Illustration of road traffic flows.

Figure 2.

Illustration of passenger flows in the urban rail transit (URT) system.

Figure 3.

Illustration of information loss.

To address these problems, we propose a multi-step forecasting framework to conduct STPFF tasks based on multi-source data. The network-level forecasting framework collaboratively models all stations in the URT system, taking into account temporal dependencies, spatial correlations, and network topology features. Time-based OD flow data, extracted in a specific timestep-based manner, is utilized to improve the representation of connectivity in the URT network and to provide sufficient spatial features and extra temporal features. Moreover, a multi-stage fusion (MSF) structure, which spans the whole network and consists of three fusion stages, is constructed to largely prevent information loss in the learning process. In addition, a transformer long short-term memory block (TLB) and a depth-wise attention block (DAB) are built to extract temporal and spatial features effectively and enhance prediction accuracy. When applied to two real-world large-scale datasets, our method outperforms a set of baselines, which demonstrates the superiority and robustness of the proposed model. The main contributions of this work are listed as follows; they can provide valuable references for related transportation studies, such as URT network modeling, information preservation, and feature extraction from traffic big data.

(1) The study proposes an end-to-end framework to successfully extract comprehensive spatial-temporal features from multi-source data and achieve STPFF. The experimental result compared to several widely used baselines demonstrates its superiority and robustness.

(2) The study conducts matrix operations at the data-wise level to capture the intrinsic relationship between passenger flow data and time-based OD data. Moreover, spatial and temporal features in POI data are extracted and involved independently using self-attention layers.

(3) The study constructs the MSF structure that spans through the whole framework to prevent information loss during the learning process and achieve comprehensive complementarity to a great extent via information preserving and skip connections.

Literature Review

Short-Term Passenger Flow Forecasting

It is well known that the extraction of temporal features is the key point in time-series forecasting. In early research, models based on mathematical statistics were mainly utilized. The autoregressive integrated moving average (ARIMA)⁶ and seasonal ARIMA⁷ have been widely applied to time-series prediction tasks. However, these methods, with fixed model structures and parameters, are limited in coping with uncertain and complex non-linear correlations, and cannot meet the real-time and precision requirements of current STPFF tasks. Next, the rise of machine learning offered a new solution for capturing time-series correlations. For example, the back-propagation neural network (BPNN)⁸ and support vector regression (SVR)⁹ were utilized for short-term passenger flow prediction and showed better performance. Besides, since a single model might miss some less obvious features, hybrid models such as the ARIMA-wavelet model¹⁰ and the wavelet-SVM (wavelet support vector machine) mixed model¹¹ were proposed to leverage the advantages of the combined methods. However, these models are still limited when processing high-dimensional and complex spatiotemporal data, especially in large URT systems. Benefiting from their deep structure, deep-learning-based models performed even better in time-series forecasting and promoted the development of STPFF.¹² Among them, the most noteworthy model structures are the recurrent neural network (RNN)¹³ and its variants, the long short-term memory (LSTM) network¹⁴ and the gated recurrent unit (GRU),¹⁵ which were proposed to overcome some shortcomings of the RNN. The LSTM network and GRU are able to capture temporal features in time-series data, including long-term dependencies, efficiently and adequately. Besides, the LSTM network was also combined with SVR¹⁶ to improve the model structure and prediction accuracy. Nevertheless, these models focused merely on temporal dependencies and ignored the spatial correlations in the URT network, which is a topological network containing extensive spatial features.

Therefore, considering spatial correlations in topological metro networks, an LSTM-based spatial-temporal forecasting model was proposed,¹⁷ using a time–cost matrix to represent spatial correlations. However, the spatial correlations within the URT network cannot be adequately represented in this way, since the matrix only connects stations via travel time and crucial topological information in the network¹ is overlooked. Concurrently, convolutional neural networks (CNNs) showed an extensive capability to capture spatial dependencies¹⁸ and achieved great performance in STPFF when combined with a dynamic time warping algorithm.¹⁹ However, the CNN model can only handle Euclidean data, and much traffic information is prone to be lost during the conversion from non-Euclidean to Euclidean data. Subsequently, as research on traffic forecasting progressed, graph convolutional networks (GCNs) emerged as a more fitting approach for topological traffic networks.²⁰ Thus, one study²¹ constructed a multi-GCN to extract spatial-temporal features and topological information. In this regard, graph attention networks (GATs) equipped with attention mechanisms were proposed²² to emphasize the significance of various connections in the network. The GAT model is capable of focusing on essential features, thereby enhancing model interpretability. Nonetheless, the GAT’s ability may be constrained in high-dimensional data because of the number of layers and its incompatibility with residual connections. To capture high-order correlations, one study²³ put forward a GCN-based hypergraph structure to predict passenger flow. Despite demonstrating tremendous potential for road traffic forecasting, GCN-based models that rely on the adjacency matrix²⁴ have limitations in representing the connectivity among stations in URT systems, since these models were proposed for road traffic forecasting,²³ which differs from the URT system. Road traffic flows are strictly constrained by the corresponding traffic flow theories, but flows in the URT network can be abstracted as a skip from the entry station to the exit station. To address the problem, one study²⁴ illustrated the appropriateness of extracting spatial features with OD data, while another²⁵ constructed inter-station graphs to extract potential network interactions instead of physical connections. Moreover, another study²⁶ constructed a metro network graph based on OD data to consider travel behaviors.

Multi-Source Data Fusion

To comprehensively capture spatial-temporal features in STPFF tasks, numerous recent studies have proposed hybrid architectures and leveraged multi-source data, which can enhance the prediction accuracy and robustness of the model. Generally, distinct neural networks are employed to extract features from different data sources, and this feature extraction is itself a significant part of multi-source data fusion. For instance, Zhao et al.¹ integrated the GCN and GRU, where the GCN is used to learn topological dependencies and the GRU is applied to extract temporal dynamic changes. Similarly, Geng et al.²⁷ extracted spatial features via the multi-GCN and analyzed time-series data using a contextual gated RNN. Li and Lasenby²⁸ introduced an attention mechanism in their study, employing a multi-head GAT and LSTM to conduct traffic forecasting, whereas Zhang et al.³ proposed ResLSTM, which is based on the ResNet, GCN, and LSTM. ResLSTM also considers weather and air conditions, which affect passengers’ travel plans, and utilizes residual connections to prevent gradient vanishing while retaining more low-level information. Moreover, one study²⁹ also involves meteorological data for passenger demand forecasting and processes it appropriately. All of these studies considered multi-source data, from which spatial-temporal features, and even some external influencing features, can be obtained to improve forecasting precision. Therefore, how to fuse multi-source data reasonably and efficiently has become a significant topic. In terms of the fusion stage, some studies conduct feature-wise fusion,³,²⁹,³⁰ whose performance depends heavily on the extracted features. Other studies fuse the sources at the decision level,²³,²⁷ which has low computational requirements but cannot comprehensively take advantage of multi-source complementarity. Generally, a single fusion layer fails to fully leverage each data source and achieve complementarity. Although multi-source data fusion and hybrid architectures have achieved better accuracy, the loss of information during the learning and fusion process is still inevitable, and the intrinsic relationships between the data sources are usually overlooked.

Problem Definition

The purpose of the study is to conduct STPFF in the URT system for the next multiple time intervals, utilizing historical automatic fare collection (AFC) data and POI data; the forecasting target is denoted $Y_{t}$.

Based on AFC data, the passenger inflow matrix $P_{t} \in R^{M \times N}$ is extracted with 10-min time granularity, where $M$ represents the number of stations in the system, $N$ represents the number of historical timesteps utilized for prediction, and each element $p(i, n)$ denotes the observed volume at station $i$ during time interval $n$. Besides, the timestep-based OD matrix $OD_{t} \in R^{N \times M \times M}$ is extracted from the AFC data with the same time granularity, in which each element $od_{ij}(n)$ represents the number of trips from station $i$ to station $j$ at time interval $n$. The time-based OD matrix with 10-min intervals is extracted according to entry time, so it aligns temporally with the inflow matrix. In addition, three temporal patterns are considered for the historical timesteps to capture long-term temporal correlations more comprehensively: the real-time pattern, daily pattern, and weekly pattern, which are explained in the Model Configuration section. This timestep-based form of OD data is sufficient to model the passenger flow in the URT network in general.

Moreover, POI data are pre-processed to form the matrix $POI \in R^{M \times C}$ , where $C$ is the number of categories of the POIs, and each element $poi (i, c)$ denotes the number of points for category $c$ within a radius of 1000 m around station $i$ . Eventually, involving the multi-step prediction ( $k$ steps), prediction of passenger inflow from $Y_{t + 1}$ to $Y_{t + k}$ is formulated as follows:

$$ \begin{bmatrix} Y_{t+1} & \dots & Y_{t+k} \end{bmatrix} = f(P_{t}, OD_{t}, POI) \tag{1} $$

where $f(\cdot)$ denotes the model that will be trained from the proposed deep-learning architecture, representing the non-linear relations between the prediction values and input matrices.
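The extraction of $P_{t}$ and $OD_{t}$ described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the record layout `(entry_minute, entry_station, exit_station)`, the function name `build_inflow_and_od`, and the toy records are all assumptions made for the example. Both arrays are binned by entry time, as stated in the text.

```python
import numpy as np

def build_inflow_and_od(records, num_stations, num_steps,
                        start_minute=0, step_minutes=10):
    """Bin AFC-like trips by entry time (hypothetical record layout).

    records: iterable of (entry_minute, entry_station, exit_station).
    Returns the inflow matrix P (M x N) and the OD tensor (N x M x M).
    """
    inflow = np.zeros((num_stations, num_steps), dtype=np.int64)
    od = np.zeros((num_steps, num_stations, num_stations), dtype=np.int64)
    for entry_minute, origin, destination in records:
        n = (entry_minute - start_minute) // step_minutes  # 10-min interval index
        if 0 <= n < num_steps:
            inflow[origin, n] += 1           # p(i, n)
            od[n, origin, destination] += 1  # od_ij(n)
    return inflow, od

# Toy records, purely illustrative.
records = [(3, 0, 2), (7, 0, 1), (12, 1, 2), (25, 2, 0)]
P, OD = build_inflow_and_od(records, num_stations=3, num_steps=3)
```

Because both arrays are built from the same entry-time bins, summing the OD tensor over destinations reproduces the inflow matrix, which is one way to check the temporal alignment mentioned above.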

Methodology

This section presents the proposed multi-stage fusion passenger forecasting (MSFPF) model, illustrated in Figure 4, which leverages three data sources, namely inflow data, time-based OD data, and POI data. To overcome the challenge of information loss during the learning and multi-source fusion process, our MSF structure spans through the entire network and consists of three fusion stages, as described in the Multi-Stage Fusion section. To facilitate interaction and establish the relationship between different data sources, we apply two specific operations (spatial interaction and temporal interaction) between the inflow and OD matrices, which each transfer related inflow distribution and outflow features to the other. In addition, self-attention extractors are introduced to extract spatial and temporal features from the POI data, and the feature maps are then fused with the other two sources, as described in the Multi-Source Data Interaction and Relationship section. Furthermore, to capture temporal features more efficiently from the processed time-series data, the TLB is employed, as described in the Transformer Long Short-Term Memory Block section. Besides, the DAB, based on channel attention and depth-wise separable convolution modules, is developed to extract spatial and topological features from fused time-based OD data, as described in the Depth-Wise Attention Block section.

Figure 4.

Framework of the proposed multi-stage fusion passenger forecasting model.

Multi-Stage Fusion

In a deep-learning network, information loss during the learning process is inevitable since non-linear mapping, convolutions, dimensionality reduction, and other operations are required for feature extraction and data fusion. Meanwhile, the loss of crucial information degrades forecasting accuracy. To overcome this problem, we propose the MSF structure, which consists of three fusion stages, namely early-fusion, deep-fusion, and late-fusion. It spans the whole network, and adjacent stages are connected by residual connections. Thus, once the data undergoes operations that potentially lead to information loss, the fused matrix derived from the previous stage complements the feature maps. The details of each fusion stage are introduced below.

Early-fusion stage

Early-fusion can directly process low-level information and preserve the most comprehensive information. In our study, pre-processing is required because the inflow and OD matrices have different dimensions. A bidirectional long short-term memory (Bi-LSTM) network is employed to pre-process the inflow matrix, since previous studies have demonstrated the great potential of LSTM networks in processing time-series passenger flow and effectively capturing temporal features and long-term dependencies. Meanwhile, the time-based OD matrix is processed by DenseNet³¹ to extract the spatial-temporal features, as shown in Figure 5. The dense block encourages feature reuse and allows for the efficient propagation of information between layers, which ensures the preservation of comprehensive information. Subsequently, the two feature maps, which have the same dimension, are combined into an early-feature matrix $F_{E}$ by weighted fusion. The fusion is calculated by Equation 2, where $A_{E1}, A_{E2}$ represent the feature maps from the two sources after pre-processing and $w_{E1}, w_{E2}$ indicate the weights, which are initialized as 1 and updated during learning:

$$ F_{E} = w_{E1} A_{E1} + w_{E2} A_{E2} \tag{2} $$

Figure 5.

The structure of DenseNet.

Deep-fusion stage

The deep-fusion stage allows for more flexible control over the fusion process and enables the preservation of features from multiple sources. Its main role is merging the feature maps processed by the TLB and DAB. In addition, to complement information lost in previous operations, the early-feature matrix is involved in this stage. By using weighted fusion, a deep-feature matrix $F_{D}$ is obtained by Equation 3, where $A_{D1}, A_{D2}$ represent the processed feature maps from the TLB and DAB, and $w_{D1}, w_{D2}, w_{D3}$ represent the learnable weight parameters, which are initialized as 1:

$$ F_{D} = w_{D1} A_{D1} + w_{D2} A_{D2} + w_{D3} F_{E} \tag{3} $$

Late-fusion stage

Late-fusion is efficient and has low computational requirements since the features have already been processed to the decision level at this stage. Its main role is fusing the decision matrices from different sources. Besides, to compensate for information loss in the fully connected layer and for inadequate decisions, the deep-feature matrix is included. To accomplish the fusion effectively and model the distinctive significance of each element, element-wise fusion is conducted by Equation 4, where $A_{L1}, A_{L2}$ represent the decision matrices from different sources, $W_{L1}, W_{L2}, W_{L3}$ denote learnable weights that are initialized as the identity matrix, and $\circ$ is the Hadamard product:

$$ F_{L} = W_{L1} \circ A_{L1} + W_{L2} \circ A_{L2} + W_{L3} \circ F_{D} \tag{4} $$
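The chain of Equations 2 to 4 can be sketched in NumPy as below. This is only an illustration under simplifying assumptions: all feature maps are given one shared hypothetical shape, the inputs are random placeholders instead of real network outputs, and the weight values used in the demo are arbitrary stand-ins rather than the initialisation described in the text. The helper name `fuse` is not from the paper.

```python
import numpy as np

def fuse(feature_maps, weights):
    """Weighted sum of same-shaped maps.

    With scalar weights this matches Equations 2 and 3; with weight
    arrays of the same shape as the maps, `*` is the Hadamard product
    of Equation 4.
    """
    return sum(w * a for w, a in zip(weights, feature_maps))

rng = np.random.default_rng(0)
shape = (4, 3)  # hypothetical shape shared by every map in this sketch
A_E1, A_E2, A_D1, A_D2, A_L1, A_L2 = (rng.normal(size=shape) for _ in range(6))

F_E = fuse([A_E1, A_E2], [1.0, 1.0])            # early-fusion, Eq. 2
F_D = fuse([A_D1, A_D2, F_E], [1.0, 1.0, 1.0])  # deep-fusion, Eq. 3 (reuses F_E)
W_L = [np.ones(shape) for _ in range(3)]        # placeholder element-wise weights
F_L = fuse([A_L1, A_L2, F_D], W_L)              # late-fusion, Eq. 4 (reuses F_D)
```

The point of the sketch is the skip structure: `F_E` re-enters the deep stage and `F_D` re-enters the late stage, so earlier information remains additively available downstream.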

Multi-Source Data Interaction and Relationship

The interaction operations are conducted at the data-wise level to establish a relationship between the inflow and time-based OD matrices, whose intrinsic correlations can support spatial-temporal modeling. Data-wise-level operations capture multi-source relationships within the input data and allow low-level information to be processed directly. Moreover, POI data is utilized to provide additional spatial and temporal features independently for the different data sources.

In the temporal interaction, the related outflow feature map, which carries temporal information, is extracted from the OD matrix by accumulating the passenger flows that exit at station $j$ at timestep $n$, giving the related outflow $o(j, n)$ in Equation 5. In addition, the categories and quantities of POIs also reflect specific temporal regularities related to land use, which are usually observed as morning or evening peaks. For instance, stations near residential areas always show morning peaks on weekdays, while stations near markets usually attract more passengers on weekends. The temporal features caused by POIs can be extracted by a self-attention layer; the self-attention mechanism is discussed in detail in the Transformer Long Short-Term Memory Block section. Furthermore, the outflow matrix and the temporal features caused by POIs are combined with the inflow matrix using weighted summation, as shown in Equation 6:

$$ o(j, n) = \sum_{i=1}^{M} od_{ij}(n), \quad j = 1, 2, \dots, M \tag{5} $$

$$ U_{1}(n) = w_{u1} P(n) + w_{u2} O(n)^{T} + w_{u3} Att_{1}, \quad n = 1, 2, \dots, N \tag{6} $$

where $o(j, n)$ represents the passenger outflow at station $j$ at timestep $n$, $U_{1}(n)$ is the unified feature map from multi-source data with temporal features, $P(n), O(n)$ denote the inflow matrices and outflow vectors at timestep $n$, respectively, $Att_{1}$ represents the temporal features, and $w_{u1}, w_{u2}, w_{u3}$ are learnable weights for the collaboration.
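Equations 5 and 6 can be illustrated with a small NumPy sketch. It assumes the OD tensor is stored as `od[n, i, j]`, stacks all timesteps so the result has the same (M, N) layout as the inflow matrix, and uses a zero array as a stand-in for the POI temporal features $Att_{1}$, which in the model come from a self-attention layer. Function names and numbers are illustrative only.

```python
import numpy as np

def related_outflow(od):
    """Eq. 5: o(j, n) = sum over origins i of od_ij(n).

    od has shape (N, M, M) indexed as od[n, i, j]; the result has
    shape (M, N) so that it lines up with the inflow matrix.
    """
    return od.sum(axis=1).T

def temporal_interaction(inflow, od, att_temporal, weights=(1.0, 1.0, 1.0)):
    """Eq. 6 applied to all timesteps at once (weights are placeholders)."""
    w1, w2, w3 = weights
    return w1 * inflow + w2 * related_outflow(od) + w3 * att_temporal

# Toy OD tensor with N = 2 timesteps and M = 2 stations.
od = np.array([[[0, 3], [1, 0]],
               [[2, 0], [4, 1]]], dtype=float)
inflow = od.sum(axis=2).T       # inflow implied by the same trips
att1 = np.zeros_like(inflow)    # stand-in for the POI temporal features
o = related_outflow(od)
U1 = temporal_interaction(inflow, od, att1)
```

In this toy case station 0 receives 1 trip in the first interval and 6 in the second, which is what `o[0]` holds.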

In the spatial interaction, we extract related distribution feature maps from inflow data that can support the spatial feature extraction via calculating the proportion $r (i, n)$ of passenger inflow in station $i$ among the overall network at timestep $n$ , expressed in Equation 7, which illustrates the distribution of passenger inflows. Moreover, POI data also carry spatial dependencies, in which it is usually observed that stations near markets, residential areas, and office buildings attract more passengers. The POI spatial features can be extracted separately from origin and destination perspectives for the OD matrix using two self-attention networks. Furthermore, the inflow distribution matrix and POI spatial features are combined with the OD matrix by weighted summation, shown in Equation 8:

$$ r(i, n) = \frac{p(i, n)}{\sum_{j=1}^{M} p(j, n)}, \quad i = 1, 2, \dots, M \tag{7} $$

$$ U_{2}(n) = w_{u4} OD(n) + w_{u5} R(n) + w_{u6} Att_{2} + w_{u7} Att_{3}, \tag{8} $$

where $r(i, n)$ represents the ratio of passengers at station $i$ at timestep $n$, $U_{2}(n)$ is the unified feature map from multi-source data with spatial features, $OD(n)$ denotes the OD matrix at timestep $n$, $R(n)$ denotes the ratio matrix at timestep $n$, formed by concatenating the ratio vectors from all stations, $Att_{2}, Att_{3}$ represent the POI spatial feature maps from the origin and destination perspectives, respectively, and $w_{u4}, w_{u5}, w_{u6}, w_{u7}$ are learnable weights for the interaction.
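A possible NumPy reading of Equations 7 and 8 is sketched below. Two points are assumptions of this sketch and not statements from the text: the ratio matrix $R(n)$ is built by repeating the origin ratio $r(i, n)$ across all destination columns, and a small `eps` guard is added to avoid division by zero in empty intervals. The POI feature maps $Att_{2}, Att_{3}$ are replaced by zero placeholders.

```python
import numpy as np

def inflow_ratio(inflow, eps=1e-12):
    """Eq. 7: share of network-wide inflow per station and timestep.

    inflow has shape (M, N). The eps guard is an addition of this
    sketch for intervals with no passengers at all.
    """
    totals = inflow.sum(axis=0, keepdims=True)
    return inflow / np.maximum(totals, eps)

def spatial_interaction(od, inflow, att_origin, att_dest,
                        weights=(1.0, 1.0, 1.0, 1.0)):
    """Eq. 8 for all timesteps; od has shape (N, M, M)."""
    w4, w5, w6, w7 = weights
    r = inflow_ratio(inflow)  # (M, N)
    # ASSUMPTION: R(n)[i, j] = r(i, n) for every destination j.
    ratio_map = np.broadcast_to(r.T[:, :, None], od.shape)
    return w4 * od + w5 * ratio_map + w6 * att_origin + w7 * att_dest

od = np.array([[[0, 3], [1, 0]],
               [[2, 0], [4, 1]]], dtype=float)
inflow = od.sum(axis=2).T
zeros = np.zeros_like(od)  # stand-ins for Att_2 and Att_3
r = inflow_ratio(inflow)
U2 = spatial_interaction(od, inflow, zeros, zeros)
```

Each column of `r` sums to one, so the ratio carries the distribution of inflow across stations independently of the total volume in that interval.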

Transformer Long Short-Term Memory Block

Existing studies have demonstrated the great potential of LSTM networks in effectively extracting temporal features and long-term dependencies in time-series data. Moreover, the multi-head attention mechanism in the transformer enables the model to focus on the main features and improve forecasting efficiency. Therefore, by replacing the feedforward network with a Bi-LSTM network in the conventional transformer, the TLB is constructed to process the fused data, as illustrated in Figure 6.

Figure 6.

The structure of the transformer long short-term memory block.

In the TLB, input data first passes through the positional encoding layer. This layer encodes the temporal dependencies among different timesteps and records their order, information that is otherwise unavailable because the conventional transformer does not treat its input as naturally ordered. Moreover, multi-head attention is able to capture sufficient and accurate global spatial-temporal features in large-scale data by performing parallel computations of self-attention $N$ times, as illustrated in Figure 6. In each attention head $n$, the temporal matrix from the positional encoding layer is mapped to the query $Q$, key $K$, and value $V$ through three learnable weight matrices $W^{Q}, W^{K}, W^{V}$, respectively, as shown in Equations 9–11. Based on these three projections, a non-linear mapping is conducted in Equation 12 to realize self-attention, where $\sqrt{d_{k}}$ is the scaling factor that adjusts the result of the dot product:

$$ Q_{n} = U_{1} W_{n}^{Q} + b_{n}^{Q} \tag{9} $$

$$ K_{n} = U_{1} W_{n}^{K} + b_{n}^{K} \tag{10} $$

$$ V_{n} = U_{1} W_{n}^{V} + b_{n}^{V} \tag{11} $$

$$ Z_{n} = f(Q_{n}, K_{n}, V_{n}) = \mathrm{Softmax}\left(\frac{Q_{n} K_{n}^{T}}{\sqrt{d_{k}}}\right) V_{n} \tag{12} $$

Furthermore, the feature maps from multi-head attention will be processed by a Bi-LSTM network. It processes the data in both forward and backward directions, which allows for better handling of the sequence data and extracting comprehensive temporal features, especially long-term dependencies.
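A single attention head following Equations 9 to 12 can be sketched in NumPy as below. The positional encoding and the Bi-LSTM of the TLB are omitted, and the shapes, random weights, and zero biases are hypothetical values chosen only to make the example run.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(u, w_q, w_k, w_v, b_q, b_k, b_v):
    """One head: linear projections (Eqs. 9-11), then scaled
    dot-product attention (Eq. 12). u has shape (T, d_model)."""
    q = u @ w_q + b_q
    k = u @ w_k + b_k
    v = u @ w_v + b_v
    d_k = q.shape[-1]
    scores = softmax(q @ k.T / np.sqrt(d_k), axis=-1)  # (T, T) attention weights
    return scores @ v, scores

rng = np.random.default_rng(0)
T, d_model, d_k = 5, 8, 4  # hypothetical sizes
u = rng.normal(size=(T, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
b = np.zeros(d_k)
z, scores = attention_head(u, w_q, w_k, w_v, b, b, b)
```

Each row of `scores` is a probability distribution over timesteps, so every output row of `z` is a convex combination of the value vectors.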

Depth-Wise Attention Block

The input data in the DAB carry substantial quantities of temporal and spatial information, with a size (n, m, m), where $n$ and $m$ respectively represent the number of timesteps and stations. CNN-based methods are good at parallel processing and are effective in extracting features from intricate data. The DAB incorporates a channel attention module and a depth-wise separable convolution module, which is illustrated in Figure 7.

Figure 7.

The structure of the depth-wise attention block.

In the channel attention module,³² the input data is squeezed along the channel dimension, and each two-dimensional (2D) OD matrix in channel $c$ is converted to value $a_{c}$ by a global average pooling operation, expressed in Equation 13. Subsequently, the excitation operator generates channel weights $W$ with a fully connected layer and Sigmoid function, as illustrated in Equation 14. The final output of the module $O D_{SE}$ is obtained by rescaling unified OD matrix $U_{2}$ with channel weights $W$ , which can be regarded as a self-attention channel map. The module effectively models the significance of distinct channels and extracts the main spatial-temporal features from different timesteps:

$$ a_{c} = F_{sq}(U_{2}) = \frac{1}{M \times M} \sum_{i=1}^{M} \sum_{j=1}^{M} U_{2,c}(i, j) \tag{13} $$

$$ W = F_{ex}(a) = \mathrm{Sigmoid}(FC(a)) \tag{14} $$

$$ OD_{SE} = W \circ U_{2} \tag{15} $$
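The squeeze, excitation, and rescaling steps of Equations 13 to 15 can be sketched in NumPy as follows. The sketch follows the text in using a single fully connected layer before the sigmoid; the channel count, station count, and weight values are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(u2, fc_weight, fc_bias):
    """u2 has shape (C, M, M): one M x M OD map per channel (timestep).

    Returns the rescaled tensor (Eq. 15) and the channel weights (Eq. 14).
    """
    a = u2.mean(axis=(1, 2))              # Eq. 13: global average pooling, shape (C,)
    w = sigmoid(fc_weight @ a + fc_bias)  # Eq. 14: one FC layer, then sigmoid
    return w[:, None, None] * u2, w       # Eq. 15: per-channel rescaling

rng = np.random.default_rng(0)
u2 = rng.random((3, 4, 4))  # hypothetical: 3 channels, 4 stations
od_se, w = channel_attention(u2, rng.normal(size=(3, 3)), np.zeros(3))

# With all-zero FC parameters every channel weight is sigmoid(0) = 0.5.
od_half, w_half = channel_attention(u2, np.zeros((3, 3)), np.zeros(3))
```

Because the weights lie strictly between 0 and 1, the module can only attenuate channels relative to one another; it never changes the spatial pattern inside a channel.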

The core concept of the depth-wise separable convolution module³³ is decomposing a standard convolution into two steps: depth-wise convolution and point-wise convolution. It collaborates well with the preceding module, taking the self-attention channel map as input. In the depth-wise convolution layer, a different filter is used for each input channel, and the same number of intermediate feature maps is obtained after the convolution operations. The point-wise convolution layer generates output features from the intermediate maps via $1 \times 1$ convolutions and weighted combination. Compared with standard convolution, the proposed module reduces the number of parameters.³³
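The parameter reduction can be made concrete with a short calculation. For a $k \times k$ kernel with $C_{in}$ input and $C_{out}$ output channels (biases ignored), a standard convolution has $k^{2} C_{in} C_{out}$ weights, while the depth-wise plus point-wise pair has $k^{2} C_{in} + C_{in} C_{out}$. The kernel size and channel counts below are hypothetical and are not taken from the paper.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Depth-wise (one k x k filter per input channel) plus
    point-wise (1 x 1 across channels) weights, bias ignored."""
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 30, 64  # illustrative values only
std = standard_conv_params(k, c_in, c_out)   # 17280
sep = separable_conv_params(k, c_in, c_out)  # 2190
ratio = sep / std                            # equals 1/c_out + 1/k**2
```

For these values the separable form needs roughly 13% of the weights of the standard convolution, and the ratio $1/C_{out} + 1/k^{2}$ shows that the saving grows with kernel size and output width.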

Experiments

Experimental Setup

Data Description

In this study, two large-scale real-world AFC datasets from the URT system are utilized, collected in 2020 and 2021, respectively. A total of 64 metro stations are involved in the first dataset, and 85 stations are considered in the second dataset because stations were added between the two periods. Each dataset covers five weeks, and each record contains the passenger ID, entry time, entry station, exit time, and exit station. To ensure data quality, “meaningless data,” “blank data,” and “error data” are removed in data cleaning, such as records in which the entry time is much earlier than the operation time of rail transit, some information is missing, or the inbound time is later than the outbound time. Eventually, 17,790,319 and 18,232,887 valid records remain in the two datasets, respectively. As mentioned in the Problem Definition section, inflow matrices and time-based OD matrices are extracted for the study.

In addition, two POI datasets are used, corresponding to the two periods of AFC data. Each raw record contains the POI name, code, location, and category, of which the location and category are the main focus of this study. We count the number of POIs of each category within a radius of 1000 m around each station to obtain the POI matrices.
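The counting step for one station can be sketched as follows. The haversine great-circle distance, the coordinates, and the category names are assumptions made for illustration; the paper does not specify how the 1000 m radius is measured.

```python
import math
from collections import Counter

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lam = math.radians(lon2 - lon1)
    h = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def poi_counts(station, pois, radius_m=1000.0):
    """Count POIs per category within radius_m of a station."""
    counts = Counter()
    for poi in pois:
        d = haversine_m(station["lat"], station["lon"], poi["lat"], poi["lon"])
        if d <= radius_m:
            counts[poi["category"]] += 1
    return counts


# Hypothetical station and POIs.
station = {"lat": 30.0, "lon": 120.0}
pois = [
    {"lat": 30.005, "lon": 120.0, "category": "residential"},  # about 556 m away
    {"lat": 30.0, "lon": 120.005, "category": "office"},       # about 481 m away
    {"lat": 30.02, "lon": 120.0, "category": "office"},        # about 2.2 km away
]
counts = poi_counts(station, pois)
print(dict(counts))
```

Stacking one such count vector per station, with a fixed category order, gives a station-by-category matrix of the kind described above.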

Model Configuration

This section describes the implementation details of the proposed model. Each dataset was divided into a training set, validation set, and testing set with a split rate of 85% (10%) and 15%. In each forecasting process, 30 historical timesteps from each of three temporal patterns, namely the real-time, daily, and weekly patterns, are employed to predict the network-level passenger flow one, two, and three steps ahead. The real-time pattern represents the data in the most recent timesteps. The daily and weekly patterns denote the data in the same time intervals of the previous day and the previous week, respectively. To balance training time and forecasting precision, we set the batch size to 32 and the learning rate to 0.0005 after several experiments. During training, the Adam optimizer is used and the mean squared error (MSE) is selected as the training loss. Moreover, model checkpointing and early stopping are employed to save the candidate model and to end training early to avoid overfitting, respectively. Furthermore, to reduce the impact of parameter initialization, we train the model five times and report the average values as the result.
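One way to read the three temporal patterns is as three index windows into the flow series. The sketch below rests on assumptions: the daily and weekly windows are taken to be the real-time window shifted back by one day and one week, and `steps_per_day=108` (18 operating hours at 10-min intervals) is a hypothetical value, since the paper does not state the number of timesteps per day.

```python
def pattern_windows(t, history=30, steps_per_day=108, horizon=3):
    """Index windows used to forecast timesteps t, t+1, ..., t+horizon-1.

    steps_per_day is an assumed value; the shift-by-one-day and
    shift-by-one-week reading of the daily and weekly patterns is also
    an assumption.
    """
    realtime = list(range(t - history, t))
    daily = [i - steps_per_day for i in realtime]
    weekly = [i - 7 * steps_per_day for i in realtime]
    targets = list(range(t, t + horizon))
    return realtime, daily, weekly, targets


realtime, daily, weekly, targets = pattern_windows(t=1000)
print("real-time:", realtime[0], "to", realtime[-1])
print("daily:    ", daily[0], "to", daily[-1])
print("weekly:   ", weekly[0], "to", weekly[-1])
print("targets:  ", targets)
```

Under this reading, the first usable target index is `history + 7 * steps_per_day`, because the weekly window must not reach before the start of the series.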

Baseline Models

To validate the performance of the proposed model, nine widely used traffic forecasting models are employed for comparison. Moreover, four variants of the proposed model are constructed by modifying specific structures. A brief description of the selected baselines and variants is given below. In addition, some modifications are required to apply the baselines to our task³⁴ because some of them were proposed for different tasks (e.g., road traffic state forecasting) or use different data inputs (e.g., passenger inflow, outflow, and adjacency matrix).

(1) SVR: SVR was applied to traffic forecasting at an early stage; it performs regression analysis of the data based on the principles of support vector machines.

(2) LSTM¹⁴: LSTM can effectively capture temporal features in time-series data and model long-term dependencies.

(3) ConvLSTM (convolutional long short-term memory)³⁵: traditional LSTM can effectively process time-series data, but the spatial correlations in a topological URT network are overlooked. ConvLSTM therefore employs convolution layers to address this problem.

(4) Bi-LSTM: Bi-LSTM contains two LSTM networks that process the sequence in the forward and backward directions, respectively, and has a higher capability in feature extraction.

(5) Spatio-Temporal Residual Network (ST-ResNet)³⁶: ST-ResNet can model the topological networks based on convolution operations, and address gradient vanishing and explosion in deep networks by skip connection.

(6) GCN²⁰: the GCN is an extended form of the CNN on topological graph $G = (V, E, A)$ , which focuses on capturing spatial dependencies in the URT network.

(7) Temporal Graph Convolutional Network (T-GCN)¹: the framework employs the GCN to learn spatial dependencies in topological networks, and utilizes the GRU to capture temporal dependencies in time-series data.

(8) Graph-WaveNet³⁷: the model captures spatial dependencies using an adaptive correlation matrix and extracts temporal dependencies with stacked dilated causal convolutions.

(9) Transformer: the model achieves sequence modeling through a self-attention mechanism and effectively captures long-term dependencies within the sequence, which improves prediction accuracy.

(10) Spatial-temporal Integrated Prediction Model (STIPM)²⁴: The framework accomplishes multi-step STPFF and extracts spatial and temporal features from various data sources based on the attention mechanism, LSTM, and CNN.

(11) MSFPF-no MSF: the MSF structure, containing three fusion layers and skip connections between them, is removed from the proposed MSFPF model, and the other configurations remain.

(12) MSFPF-no TLB: the TLB is replaced by a conventional LSTM network in the proposed MSFPF model, and the other configurations remain.

(13) MSFPF-no DAB: the DAB is replaced by two CNN layers in the proposed MSFPF model, and the other configurations remain.

(14) MSFPF-no POI: the input POI data and corresponding data interaction are removed from the proposed MSFPF model, and the other configurations remain.

Evaluation Metrics

In this study, the root mean square error (RMSE), mean absolute error (MAE), and weighted mean absolute percentage error (WMAPE) are selected as evaluation metrics. The definitions of the three metrics are given in Equations 16–18, where $n$ represents the number of samples, $y_{i}$ denotes the ground truth values, and ${\hat{y}}_{i}$ represents the corresponding forecasting values:

\begin{matrix} RMSE = \sqrt{MSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}} \end{matrix}

(16)

\begin{matrix} MAE = \frac{1}{n} \sum_{i = 1}^{n} | (y_{i} - {\hat{y}}_{i}) | \end{matrix}

(17)

\begin{matrix} WMAPE = \sum_{i = 1}^{n} (\frac{y_{i}}{\sum_{j = 1}^{n} y_{j}} | \frac{y_{i} - {\hat{y}}_{i}}{y_{i}} |) \end{matrix}

(18)
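Equations 16–18 translate directly into code. The sketch below is illustrative, with made-up sample values; it also shows that the $y_{i}$ terms in Equation 18 cancel for positive $y_{i}$, so WMAPE reduces to the total absolute error divided by the total ground-truth flow, a form that stays defined even when individual $y_{i}$ are zero.

```python
import math


def rmse(y, y_hat):
    """Root mean square error (Equation 16)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean absolute error (Equation 17)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def wmape(y, y_hat):
    """WMAPE in simplified form: sum |y - y_hat| / sum y."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / sum(y)


def wmape_eq18(y, y_hat):
    """WMAPE exactly as written in Equation 18 (requires every y_i != 0)."""
    total = sum(y)
    return sum((a / total) * abs((a - b) / a) for a, b in zip(y, y_hat))


y = [10.0, 20.0, 30.0, 40.0]      # made-up ground truth
y_hat = [12.0, 18.0, 33.0, 36.0]  # made-up forecasts

print(f"RMSE  = {rmse(y, y_hat):.4f}")
print(f"MAE   = {mae(y, y_hat):.4f}")
print(f"WMAPE = {100 * wmape(y, y_hat):.2f}%")
```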

Results and Analysis

Network-level forecasting performance analysis

Tables 1 and 2 present the quantitative comparisons between the MSFPF model, its variants, and the baselines on the two real-world datasets. With respect to forecasting accuracy, deep-learning-based models such as the LSTM perform better than the conventional machine-learning-based SVR model, particularly at longer horizons. For instance, on dataset 1, the LSTM network outperforms the SVR model by 7.33%, 16.48%, and 23.09% with respect to MAE for the three prediction steps, respectively. Meanwhile, LSTM-based models, particularly the LSTM and Bi-LSTM models, achieve relatively better forecasting results, which illustrates the strength of LSTM in processing medium and long-term time-series data. However, the LSTM network fails to effectively capture spatial correlations. Although the transformer model is also suitable for time-series forecasting, the standard transformer does not inherently encode the order of the input sequence, so additional pre-processing (such as positional encoding) is required. Moreover, although the ST-ResNet model can extract spatial correlations, it is relatively weak in temporal dependency extraction. In addition, the CNN cannot directly obtain high-quality spatial-temporal features via convolutions, while the GCN and T-GCN models do not achieve remarkable results because of the weak topological representation in the URT system, as mentioned before. Finally, the models that use multi-source data, such as the ST-ResNet, Graph WaveNet, and STIPM, suffer from information loss and thus fail to achieve the desired effect.

Table 1.

Quantitative Comparison of the Proposed Model and Baselines (Dataset 1)

	1-Step (10 min)			2-Step (20 min)			3-Step (30 min)
	RMSE	MAE	WMAPE/%	RMSE	MAE	WMAPE/%	RMSE	MAE	WMAPE/%
SVR	24.27	13.92	17.79	28.16	15.66	20.09	31.51	17.37	22.37
LSTM	24.50	12.90	15.93	25.81	13.08	16.19	26.45	13.36	16.53
ConvLSTM	27.35	14.66	15.97	28.03	14.70	16.19	28.61	14.92	16.40
Bi-LSTM	21.05	12.16	14.81	21.53	12.36	15.15	22.57	12.78	15.49
ST-ResNet	27.83	15.65	17.17	30.86	16.73	18.09	31.54	17.34	18.63
GCN	27.85	15.54	17.15	28.99	15.88	17.45	29.90	16.39	17.97
T-GCN	29.25	15.90	17.52	30.54	16.15	17.74	30.56	17.11	18.68
GWN	23.17	12.68	15.70	23.19	12.97	15.98	23.58	13.19	16.31
Transformer	25.35	13.81	16.92	25.51	14.58	17.83	25.52	14.85	18.20
STIPM	21.67	11.91	14.65	22.45	12.19	14.96	22.60	12.34	15.23
MSFPF-no TLB	17.69	9.38	11.63	18.14	9.78	11.99	19.17	10.79	13.30
MSFPF-no DAB	18.28	9.17	11.33	18.49	9.83	12.19	18.71	10.44	12.93
MSFPF-no POI	17.85	9.31	11.53	17.68	9.76	12.09	18.75	10.00	12.38
MSFPF-no MSF	25.95	13.88	17.12	27.87	14.14	17.51	27.06	14.25	17.62
MSFPF (ours)	17.47	9.00	11.10	17.57	9.32	11.52	18.40	9.77	12.06

Note: RMSE = root mean square error; MAE = mean absolute error; WMAPE = weighted mean absolute percentage error; SVR = support vector regression; LSTM = long short-term memory; ConvLSTM = convolutional long short-term memory; Bi-LSTM = bidirectional long short-term memory; GCN = graph convolutional network; MSFPF = multi-stage fusion passenger forecasting; TLB = transformer long short-term memory block; DAB = depth-wise attention block; POI = point of interest; MSF = multi-stage fusion; ST-ResNet = Spatio-Temporal Residual Network; T-GCN = Temporal Graph Convolutional Network; GWN = Graph WaveNet; STIPM = Spatial-temporal Integrated Prediction Model. The use of bold/italics aims to emphasize our method.

Table 2.

Quantitative Comparison of the Proposed Model and Baselines (Dataset 2)

	1-Step (10 min)			2-Step (20 min)			3-Step (30 min)
	RMSE	MAE	WMAPE/%	RMSE	MAE	WMAPE/%	RMSE	MAE	WMAPE/%
SVR	23.65	12.16	19.58	27.91	13.81	22.22	30.85	15.20	24.42
LSTM	22.11	11.67	18.56	22.13	11.71	18.65	22.82	11.83	18.81
ConvLSTM	20.96	11.22	0.18	21.89	11.43	18.20	21.93	11.61	18.45
Bi-LSTM	17.07	9.87	16.49	17.59	9.90	16.47	18.27	10.17	16.89
ST-ResNet	25.24	13.13	20.76	25.66	13.24	21.04	25.93	13.65	21.26
GCN	21.17	11.37	18.04	22.06	11.69	18.47	23.40	12.17	19.43
T-GCN	21.87	11.73	18.74	22.60	11.95	19.02	23.24	12.28	19.50
GWN	20.11	10.63	17.79	20.18	10.97	18.35	20.56	11.05	18.51
Transformer	19.16	11.25	18.51	20.59	11.54	19.14	21.31	11.98	19.69
STIPM	18.10	10.26	16.80	18.67	10.37	17.21	19.48	10.79	17.88
MSFPF-no TLB	16.19	8.29	13.88	16.12	8.94	14.87	17.17	9.53	15.93
MSFPF-no DAB	15.15	8.50	14.29	15.32	9.36	15.74	16.01	9.49	15.89
MSFPF-no POI	15.36	9.49	15.83	15.99	9.37	15.60	16.15	9.50	15.93
MSFPF-no MSF	27.27	13.84	22.46	26.96	13.57	22.87	26.35	13.55	22.14
MSFPF (ours)	15.44	7.87	13.04	15.32	8.86	14.80	15.72	9.31	15.52

In addition, the proposed MSFPF model consistently outperforms the baselines on all evaluation metrics, forecasting timesteps, and datasets. For example, on dataset 1, MSFPF improves the RMSE, MAE, and WMAPE for single-step, two-step, and three-step forecasting; its single-step RMSE of 17.47 compares with 21.05 for the best-performing baseline (Bi-LSTM). The prediction performance can be seen more intuitively in Figure 8, which compares the predicted flow with the ground truth, using a weekday as an example. The ground truth passenger flows in the metro network on that day are shown in red, while the predicted values are displayed in blue; the two curves agree closely for all forecasting timesteps. Overall, these results indicate that both spatial and temporal features are extracted comprehensively and appropriately by the MSFPF model, and they demonstrate its robustness and superiority. Furthermore, the significance of the MSF structure is evident from the variant MSFPF-no MSF, whose performance drops dramatically. A detailed analysis of the variants is given in the Ablation Experiments section.

Figure 8.

Illustration of the comparison between prediction flow and ground truth.

Single-station forecasting performance analysis

Beyond the network-level view in Figure 8, different stations show distinct regularities and forecasting performances. Thus, we illustrate the comparisons between forecasting values and ground truth at three typical stations in Figures 9–11, respectively. (a) The first station is a commuting station surrounded mainly by residential areas and office buildings, and it exhibits obvious morning peaks on weekdays. (b) The second station is located near large commercial areas and shows significant evening peaks and a sudden increase in passenger flow at weekends. This station also has greater total passenger flows than the other stations. (c) The third station is a large integrated transport hub, where railways, URT, and ground public transport meet. Its passenger flow does not follow strong regularities and fluctuates sharply over time because of the impact of railway timetables. Nevertheless, the proposed model can still capture the fluctuation trends and respond promptly. Accurate URT passenger forecasting can also support the demand prediction of other transit modes,³⁸,³⁹ for which passengers from the URT system are a significant source. Overall, although the time-varying patterns of passenger flow differ from station to station, the proposed MSFPF model can effectively capture these temporal dynamics at each station, even at the transport hub station. This demonstrates the effectiveness, robustness, and practical application value of the model, fulfilling the requirements of STPFF at the network level.

Figure 9.

Prediction–ground truth comparison at commuting station with multi-step.

Figure 10.

Prediction–ground truth comparison at commercial station with multi-step.

Figure 11.

Prediction–ground truth comparison at transport hub station with multi-step.

Error distribution analysis

The evaluation metrics shown in Tables 1 and 2 measure the forecasting accuracy in terms of average errors over the whole network. Therefore, to further validate the forecasting performance, besides analyzing the overall and individual performances, it is also important to explore the error distributions across all metro stations and data points. We use boxplots to illustrate the evaluation metrics of all data points in Figures 12–14, which show the forecasting error distributions of the MSFPF model and the selected baseline models in terms of the RMSE, MAE, and WMAPE, respectively. The boxplots show that the MSFPF model produces few abnormal data points, and that its error statistics are generally lower than those of the baseline models. Furthermore, Figure 15 shows the joint distribution of the prediction values and ground truth values, where the data points lie close to the line $y = x$ and are evenly distributed on both sides of it. In addition, regardless of whether the passenger flow is large or small, the proposed MSFPF model provides reliable and accurate forecasting results, demonstrating the model’s robustness and its effect in reducing outliers.

Figure 12.

Boxplots of the multi-stage fusion passenger forecasting (MSFPF) model and baselines with multi-step (root mean square error [RMSE]).

Figure 13.

Boxplots of the multi-stage fusion passenger forecasting (MSFPF) model and baselines with multi-step (mean absolute error [MAE]).

Figure 14.

Boxplots of the multi-stage fusion passenger forecasting (MSFPF) model and baselines with multi-step (weighted mean absolute percentage error [WMAPE]).

Figure 15.

Scatters and fitting of prediction–ground truth points.

Multi-step forecasting performance analysis

Figure 16 compares the multi-step prediction performance, taking the RMSE as an example. Compared to simple medium and long-term forecasting, multi-step short-term forecasting provides more detailed and comprehensive information within the same forecasting period. Among the models, the SVR model, which uses the “direct multi-step prediction” strategy, shows the most significant change because of the incoherence between the second and third steps. The other models employ the “multi-input and multi-output” strategy to accomplish multi-step prediction. As shown in Figure 16, the prediction accuracy generally decreases as the number of forecasting timesteps increases, because the total prediction period extends and the task difficulty increases;⁴⁰ rare exceptions may occur, as in dataset 2, because of the uncertainty and repetition of experiments. Nevertheless, the accuracy changes remain within an acceptable range, indicating the feasibility of multi-step forecasting in STPFF tasks.
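The difference between the two strategies lies in how the training targets are arranged. The following sketch uses a toy series and hypothetical helper names (`mimo_samples`, `direct_samples`); it illustrates the general strategies as commonly defined, not the exact data pipeline of the paper.

```python
def mimo_samples(series, history, horizon):
    """Multi-input multi-output: one sample maps a window to all horizon steps."""
    samples = []
    for t in range(history, len(series) - horizon + 1):
        samples.append((series[t - history:t], series[t:t + horizon]))
    return samples


def direct_samples(series, history, horizon):
    """Direct strategy: a separate training set (and model) for each step h."""
    per_step = {h: [] for h in range(1, horizon + 1)}
    for t in range(history, len(series) - horizon + 1):
        window = series[t - history:t]
        for h in range(1, horizon + 1):
            per_step[h].append((window, series[t + h - 1]))
    return per_step


series = list(range(10))  # toy flow series 0, 1, ..., 9
mimo = mimo_samples(series, history=4, horizon=3)
direct = direct_samples(series, history=4, horizon=3)

print("MIMO first sample:  ", mimo[0])
print("direct step-3 first:", direct[3][0])
```

With the direct strategy each step is fitted independently, so nothing ties the step-2 and step-3 forecasts together, whereas a multi-input multi-output model produces all steps from one shared representation.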

Figure 16.

Performance comparison for multi-step forecasting (root mean square error [RMSE]).

Ablation Experiments

An ablation experiment removes or modifies specific components of the proposed deep-learning framework to validate the effect of those components. Therefore, a series of variants of the MSFPF model is constructed for further analysis. The details of the variants are described in the Baseline Models section, and the model performances are reported in Tables 1 and 2. In the following analysis, dataset 1 is taken as an example.

As shown in Table 1, (1) MSFPF-no TLB: replacing the TLB with a conventional LSTM network slightly decreases the forecasting precision, which indicates that the transformer component improves the prediction by capturing the major features and global temporal dependencies. (2) MSFPF-no DAB: replacing the DAB with CNN layers also reduces accuracy, and the number of model parameters becomes nearly 1.65 times that of the original model. This suggests that the DAB structure improves prediction accuracy by extracting rich spatial features while reducing the number of model parameters. (3) MSFPF-no POI: the forecasting accuracy also decreases after removing the POI data source, including the POI data input, the integrated spatial-temporal feature extraction, and the data fusion operation. This suggests that the land use around the stations affects passenger generation and attraction, and that the proposed model can extract the corresponding spatial-temporal dependencies from the POI data to support short-term passenger prediction. Nevertheless, these three variants still perform better than the selected baselines, which demonstrates the structural robustness of the MSFPF model. (4) MSFPF-no MSF: in contrast, removing the MSF structure leads to a dramatic decline in forecasting accuracy, with all the evaluation metrics increasing substantially. For example, the RMSE increases by nearly 48.54%, 58.62%, and 47.07% for single-step, two-step, and three-step forecasting, respectively. Therefore, although information loss cannot be quantified directly, the variant without the MSF structure indicates that significant information loss occurs during feature extraction, matrix operations, and multi-source data fusion. The MSF structure preserves this information to a great extent and is the key to improving the forecasting accuracy of the proposed MSFPF model.
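The percentages quoted for the MSF ablation follow from the RMSE values in Table 1. The short check below recomputes them in both directions (the increase relative to the full model and the decrease relative to the variant) and also converts the 1.65 parameter ratio of the DAB ablation into a percentage reduction; the variable names are illustrative.

```python
# RMSE values from Table 1 (dataset 1) for 1-, 2-, and 3-step forecasting.
rmse_full = [17.47, 17.57, 18.40]     # MSFPF
rmse_no_msf = [25.95, 27.87, 27.06]   # MSFPF-no MSF

# Increase when the MSF structure is removed, relative to the full model.
increase = [round(100 * (v - f) / f, 2) for f, v in zip(rmse_full, rmse_no_msf)]
# Decrease when the MSF structure is added, relative to the variant.
decrease = [round(100 * (v - f) / v, 2) for f, v in zip(rmse_full, rmse_no_msf)]

# A CNN variant with about 1.65 times the parameters implies this reduction
# for the DAB relative to the CNN variant.
param_reduction = round(100 * (1 - 1 / 1.65), 1)

print("RMSE increase without MSF (%):", increase)
print("RMSE decrease with MSF (%):   ", decrease)
print("Parameter reduction (%):      ", param_reduction)
```

The two sets of percentages describe the same gap with different denominators, which is why the ablation discussion and the conclusions report different figures for the same experiment.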

Conclusion

This study proposes a deep-learning framework named the MSFPF model to conduct multi-step STPFF at the network level. Unlike most previous studies, time-based OD data is used to represent the topological information in the URT system instead of physical adjacency. Second, the interactions and relationships among the multi-source data are established at the data level. Moreover, we construct the MSF structure, TLB, and DAB to extract and leverage spatial-temporal features effectively and comprehensively. Furthermore, the MSFPF model is validated on two real-world large-scale datasets. Based on its performance, the following conclusions can be drawn.

(1) The proposed framework extracts and leverages sufficient spatial and temporal features from multi-source data, and the prediction performances and ablation experiments demonstrate the superiority and robustness of the MSFPF model compared with widely used baselines and specific variants.

(2) The MSF structure preserves information via three fusion stages and residual connections. It alleviates the problem of information loss during feature extraction and multi-source fusion, and significantly improves forecasting accuracy: relative to the variant without the MSF structure, the RMSE on dataset 1 decreases by 32.68%, 36.96%, and 32.00% for single-step, two-step, and three-step forecasting, respectively.

(3) In addition, establishing the interactions and relationships among multi-source data, instead of treating the sources independently, helps to explore the intrinsic correlations between the data sources and enhances the effect of data utilization.

(4) The TLB and the DAB extract corresponding spatial-temporal features effectively and comprehensively, where the TLB illustrates great potential in time-series data processing and the DAB successfully extracts sufficient spatial features while reducing the size of model parameters by nearly 39.4%.

In future research, it is worthwhile to improve prediction accuracy at integrated transport hubs, where the passenger flow fluctuates rapidly and shows poor regularity. For instance, railway timetable data could be introduced to help model the flow changes and provide more temporal information. Furthermore, transfer behaviors or large events⁴¹ could also be taken into consideration to improve the demand prediction of multiple transit modes simultaneously.³⁸,³⁹,⁴²

Footnotes

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: Y. Chen, J. Zhang; data collection: Y. Chen, J. Zhang; analysis and interpretation of results: Y. Chen, J. Zhang; draft manuscript preparation: Y. Chen, J. Zhang, Y. Lu, K. Yang, H. Liu, Y. Liang. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Nos. 72201029, 72288101, 73222022).

References

1. Zhao, Song, Zhang, Liu, Wang, Lin, Deng. T-GCN: A Temporal Graph Convolutional Network for Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems, Vol. 21, No. 9, 2020, pp. 3848–3858. https://doi.org/10.1109/TITS.2019.2935152

2. Zhao. Multi-STGCnet: A Graph Convolution Based Spatial-Temporal Framework for Subway Passenger Flow Forecasting. 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, IEEE, New York, 2020, pp. 1–8. https://doi.org/10.1109/IJCNN48605.2020.9207049

3. Zhang, Chen, Cui, Guo, Zhu. Deep Learning Architecture for Short-Term Passenger Flow Forecasting in Urban Rail Transit. IEEE Transactions on Intelligent Transportation Systems, Vol. 22, No. 11, 2021, pp. 7004–7014. https://doi.org/10.1109/TITS.2020.3000761

4. Zhang, Ren, Sun. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, IEEE, New York, 2016, pp. 770–778.

5. Wang, Turko, Shaikh, Park, Das, Hohman, Kahng, Polo Chau D. H. CNN Explainer: Learning Convolutional Neural Networks With Interactive Visualization. IEEE Transactions on Visualization and Computer Graphics, Vol. 27, No. 2, 2020, pp. 1396–1406. https://doi.org/10.1109/TVCG.2020.3030418

6. Zhu. N Days Average Volume Based ARIMA Forecasting Model for Shanghai Metro Passenger Flow. 2010 International Conference on Artificial Intelligence and Education (ICAIE), Hangzhou, China, IEEE, New York, 2010, pp. 631–635. https://doi.org/10.1109/ICAIE.2010.5641088

7. Williams B. M., Hoel L. A. Modeling and Forecasting Vehicular Traffic Flow as a Seasonal ARIMA Process: Theoretical Basis and Empirical Results. Journal of Transportation Engineering, Vol. 129, No. 6, 2003, pp. 664–672. https://doi.org/10.1061/(ASCE)0733-947X(2003)129:6(664)

8. Wei, Chen M.-C. Forecasting the Short-Term Metro Passenger Flow with Empirical Mode Decomposition and Neural Networks. Transportation Research Part C: Emerging Technologies, Vol. 21, No. 1, 2012, pp. 148–162. https://doi.org/10.1016/j.trc.2011.06.009

9. Chen. Railway Passenger Volume Forecasting Based on Support Vector Machine and Genetic Algorithm. 2009 ETP International Conference on Future Computer and Communication, Wuhan, China, IEEE, New York, 2009, pp. 282–284. https://doi.org/10.1109/FCC.2009.81

10. Wang, Zhang, Chen, Zhang. Short-Term Forecasting of Urban Rail Transit Ridership Based on ARIMA and Wavelet Decomposition. AIP Conference Proceedings, Vol. 1967, 2018, p. 040025. https://doi.org/10.1063/1.5039099

11. Sun, Leng, Guan. A Novel Wavelet-SVM Short-Time Passenger Flow Prediction in Beijing Subway System. Neurocomputing, Vol. 166, 2015, pp. 109–121. https://doi.org/10.1016/j.neucom.2015.03.085

12. Szegedy, Liu, Jia, Sermanet, Reed, Anguelov, Erhan, Vanhoucke, Rabinovich. Going Deeper With Convolutions. 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, IEEE, New York, 2015, pp. 1–9. https://doi.org/10.1109/CVPR.2015.7298594

13. Zaremba, Sutskever, Vinyals. Recurrent Neural Network Regularization. arXiv:1409.2329, 2015. http://arxiv.org/abs/1409.2329

14. Tao, Wang. Long Short-Term Memory Neural Network for Traffic Speed Prediction Using Remote Microwave Sensor Data. Transportation Research Part C: Emerging Technologies, Vol. 54, 2015, pp. 187–197. https://doi.org/10.1016/j.trc.2015.03.014

15. Chung, Gulcehre, Cho, Bengio. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555, 2014. http://arxiv.org/abs/1412.3555

16. Guo, Xie, Qin, Jia, Wang. Short-Term Abnormal Passenger Flow Prediction Based on the Fusion of SVR and LSTM. IEEE Access, Vol. 7, 2019, pp. 42946–42955. https://doi.org/10.1109/ACCESS.2019.2907739

17. Tang, Yang. ST-LSTM: A Deep Learning Approach Combined Spatio-Temporal Features for Short-Term Forecast in Rail Transit. Journal of Advanced Transportation, Vol. 2019, 2019, pp. 1–8. https://doi.org/10.1155/2019/8392592

18. Masci, Meier, Cireşan, Schmidhuber. Stacked Convolutional Auto-Encoders for Hierarchical Feature Extraction. In Artificial Neural Networks and Machine Learning - ICANN 2011 (T. Honkela, Duch, Girolami, Kaski, eds.), Springer, Berlin Heidelberg, 2011, pp. 52–59. https://doi.org/10.1007/978-3-642-21735-7_7

19. Zhang, Liu, Zheng. Short-Term Prediction of Passenger Demand in Multi-Zone Level: Temporal Convolutional Neural Network With Multi-Task Learning. IEEE Transactions on Intelligent Transportation Systems, Vol. 21, No. 4, 2020, pp. 1480–1490. https://doi.org/10.1109/TITS.2019.2909571

20. Kipf T. N., Welling. Semi-Supervised Classification with Graph Convolutional Networks. International Conference on Learning Representations (ICLR) 2017, arXiv:1609.02907, 2017. http://arxiv.org/abs/1609.02907

21. Zhang, Chen, Guo. Multi-Graph Convolutional Network for Short-Term Passenger Flow Forecasting in Urban Rail Transit. IET Intelligent Transport Systems, Vol. 14, No. 10, 2020, pp. 1210–1217. https://doi.org/10.1049/iet-its.2019.0873

22. Veličković, Cucurull, Casanova, Romero, Liò, Bengio. Graph Attention Networks. International Conference on Learning Representations (ICLR) 2018, arXiv:1710.10903, 2018. http://arxiv.org/abs/1710.10903

23. Wang, Zhang, Wei, Piao, Yin. Metro Passenger Flow Prediction via Dynamic Hypergraph Convolution Networks. IEEE Transactions on Intelligent Transportation Systems, Vol. 22, No. 12, 2021, pp. 7891–7903. https://doi.org/10.1109/TITS.2021.3072743

24. Zhang, Chen, Panchamy, Jin, Wang, Yang. Attention-Based Multi-Step Short-Term Passenger Flow Spatial-temporal Integrated Prediction Model in URT Systems. Journal of Geo-information Science, Vol. 25, No. 4, 2023, pp. 698–713. https://doi.org/10.12082/dqxxkx.2023.220817

25. Wang, Zhao, Yin, Liu. IG-Net: An Interaction Graph Network Model for Metro Passenger Flow Forecasting. IEEE Transactions on Intelligent Transportation Systems, Vol. 24, No. 4, 2023, pp. 4147–4157. https://doi.org/10.1109/TITS.2023.3235805

26. Zeng, Tang. Combining Knowledge Graph into Metro Passenger Flow Prediction: A Split-Attention Relational Graph Convolutional Network. Expert Systems with Applications, Vol. 213, 2023, p. 118790. https://doi.org/10.1016/j.eswa.2022.118790

27. Geng, Wang, Zhang, Yang, Liu. Spatiotemporal Multi-Graph Convolution Network for Ride-Hailing Demand Forecasting. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 3656–3663. https://doi.org/10.1609/aaai.v33i01.33013656

28. Lasenby. Spatiotemporal Attention-Based Graph Convolution Network for Segment-Level Traffic Prediction. IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 7, 2022, pp. 8337–8345. https://doi.org/10.1109/TITS.2021.3078187

29. Bai, Yao, Wang, Zhang. Deep Spatial–Temporal Sequence Modeling for Multi-Step Passenger Demand Prediction. Future Generation Computer Systems, Vol. 121, 2021, pp. 25–34. https://doi.org/10.1016/j.future.2021.03.003

30. Jiang, Song, Fan, Xia, Wang, Chen, Cai, Shibasaki. Transfer Urban Human Mobility via POI Embedding over Multiple Cities. ACM Transactions on Data Science, Vol. 2, No. 1, 2021, pp. 1–26.

31. Huang, Liu, Van Der Maaten, Weinberger K. Q. Densely Connected Convolutional Networks. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, IEEE, New York, 2017, pp. 2261–2269. https://doi.org/10.1109/CVPR.2017.243

32. Shen, Sun. Squeeze-and-Excitation Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, IEEE, New York, 2018, pp. 7132–7141.

33. Howard A. G., Zhu, Chen, Kalenichenko, Wang, Weyand, Andreetto, Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861, 2017. http://arxiv.org/abs/1704.04861

34. Jiang, Yin, Wang, Deng, Liu, Cai, Deng, Song, Shibasaki. DL-Traff: Survey and Benchmark of Deep Learning Models for Urban Traffic Prediction. Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Association for Computing Machinery, Queensland, Australia, 2021, pp. 4515–4525.

35. Chen, Liu. A Deep Learning Model with Conv-LSTM Networks for Subway Passenger Congestion Delay Prediction. Journal of Advanced Transportation, Vol. 2021, 2021, pp. 1–10. https://doi.org/10.1155/2021/6645214

36. Zhang, Zheng. Deep Spatio-Temporal Residual Networks for Citywide Crowd Flows Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, No. 1, 2017, pp. 1655–1661. https://doi.org/10.1609/aaai.v31i1.10735

37. Pan, Long, Jiang, Zhang. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), AAAI Press, Macao, China, 2019, pp. 1907–1913.

38. Kontou, Garikapati, Hou. Reducing Ridesourcing Empty Vehicle Travel with Future Travel Demand Prediction. Transportation Research Part C: Emerging Technologies, Vol. 121, 2020, p. 102826.

39. Yang, Zhang, Yang, Gao. Short-Term Passenger Flow Prediction for Multi-Traffic Modes: A Transformer and Residual Network Based Multi-Task Learning Method. Information Sciences, Vol. 642, 2023, p. 119144.

40. Jiang, Cai, Wang, Yang, Fan, Chen, Tsubouchi, Song, Shibasaki. DeepCrowd: A Deep Model for Large-Scale Citywide Crowd Density and Flow Prediction. IEEE Transactions on Knowledge and Data Engineering, Vol. 35, 2021, pp. 276–290.

41. Jiang, Song, Huang, Song, Xia, Cai, Wang, Kim K. S., Shibasaki. DeepUrbanEvent: A System for Predicting Citywide Crowd Dynamics at Big Events. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Association for Computing Machinery, AK, USA, 2019, pp. 2114–2122.

42. G. J., Stern. Taxi Utilization Rate Maximization by Dynamic Demand Prediction: A Case Study in the City of Chicago. Transportation Research Record: Journal of the Transportation Research Board, Vol. 2676, No. 4, 2022, pp. 367–379.