Abstract
To improve real-time operation and management in urban rail transit (URT) systems, accurate and reliable short-term passenger flow forecasting at the network level is a crucial task. Although numerous endeavors have been devoted to this field, the insufficient topological representation for passenger flows in the URT network, the overlooking of intrinsic correlations among multi-source data, and the information loss in deep-learning frameworks are still critical issues that need to be addressed. This study proposes a multi-stage fusion passenger forecasting (MSFPF) model to accomplish short-term multi-step passenger forecasting leveraging multi-source data, and overcome the above-mentioned challenges. Based on the characteristics of passenger flows in the URT network, time-based origin–destination flow data is involved and utilized to enhance the representation of flows and provide spatial-temporal features. Then, the interaction and relationship among multi-source data are estimated to capture their intrinsic correlations. To effectively and comprehensively extract temporal and spatial features, a transformer long short-term memory block and a depth-wise attention block are constructed with attention mechanisms and employed. Furthermore, we construct the multi-stage fusion (MSF) structure to alleviate the information loss during the learning process, which is a significant component in improving the forecasting accuracy. In addition, the model is applied to two large-scale real-world datasets, in which it outperforms nine widely used baselines and four specific variants of itself. The quantitative experiments demonstrate the robustness and superiority of the proposed MSFPF model, and the significant contribution of the MSF structure in the model.
Keywords
Urban rail transit (URT) is a significant component of public transport in cities, offering high-capacity, punctual, and high-speed service to a large number of passengers. Accurate and reliable short-term passenger flow forecasting (STPFF) plays a vital role in URT systems because it supports many downstream tasks, such as timetable optimization, passenger flow control, and service quality improvement. Short-term predictions provide more detailed forecasting information, while multi-step prediction extends the total forecasting period; multi-step short-term passenger prediction therefore balances detail and horizon.
It is well known that time-series forecasting is fundamental to STPFF. To model the temporal dependencies in time-series data, many statistical, machine learning, and deep-learning methods 1 are employed, among which deep-learning methods have achieved outstanding performance. On the other hand, exploring the spatial correlations is also essential in topological URT networks. Thus, neural networks with great potential in spatial correlation extraction have been widely studied, 2 but some models originally proposed for road traffic forecasting are still directly introduced to URT networks. Furthermore, to extract comprehensive spatial-temporal features and achieve accurate prediction, the use of multi-source data 3 also plays a significant role in STPFF, since such data provides comprehensive information on the factors that affect passenger generation and attraction.
Moreover, STPFF remains challenging because of its complex spatial correlations, temporal correlations, and numerous influencing factors, such as the weather, air conditions, and land-use characteristics. In the past decades, although many endeavors have been devoted to this field and a solid foundation has been laid, several critical issues are still unaddressed. (1) The modeling of the topological URT network is a crucial point in this task. However, in many existing studies, the spatial correlations are directly modeled with respect to the physical adjacency between stations,1,2 an approach that was originally proposed for, and is effective in, road traffic modeling. It is limited in representing the URT network: road traffic flows are strictly constrained by certain traffic flow theories and paths, whereas flows in URT networks can be abstractly regarded as a skip from the entry station to the exit station, which represents a passenger trip without strictly following “adjacency continuity.” To provide a more intuitive comparison, the two flow mechanisms are illustrated in Figures 1 and 2, respectively. Therefore, how to model the spatial correlations in the URT network still requires further exploration. (2) Multi-source input data (e.g., passenger flow data, origin–destination [OD] flow data, point of interest [POI] data) is essential for comprehensive spatial-temporal feature extraction and accurate forecasting. 3 In general, intrinsic relations exist between these data sources, but many existing studies only considered the interaction and relationship of these multi-source data at the decision-wise level, instead of at the data-wise level. (3) During multi-source fusion, information loss is inevitable because of the learning operations used to extract temporal correlations and spatial dependencies, such as convolution, pooling, dimensionality reduction, and non-linear mapping. 4 For example, the sparsity of passenger flow data after convolutional layers, the degradation of graphical representation caused by neural networks, 3 and the masking of some typical features can all be regarded as forms of information loss. Moreover, an image representing the URT network is utilized to visualize the phenomenon of information loss, substituting for the passenger flow matrix, since the latter cannot be easily visualized or quantified. As shown in Figure 3, after convolution and pooling operations the contained information is not as rich as in the original network graph: the right-hand side graphs are blurry and suffer from color separation. 5 The loss of crucial information affects prediction accuracy, and how to preserve more information to improve forecasting accuracy is still a challenge.

Illustration of road traffic flows.

Illustration of passenger flows in the urban rail transit (URT) system.

Illustration of information loss.
To address these problems, we propose a multi-step forecasting framework to conduct STPFF tasks based on multi-source data. The network-level forecasting framework collaboratively models all stations in the URT system, taking into account temporal dependencies, spatial correlations, and network topology features. Time-based OD flow data, extracted in a specific timestep-based manner, is utilized to improve the representation of connectivity in the URT network and to provide sufficient spatial features and extra temporal features. Moreover, a multi-stage fusion (MSF) structure, which spans the whole network and consists of three fusion stages, is constructed to largely prevent information loss in the learning process. In addition, a transformer long short-term memory block (TLB) and a depth-wise attention block (DAB) are built to extract temporal and spatial features effectively and enhance prediction accuracy. When applied to two real-world large-scale datasets, our method outperforms a set of baselines, which demonstrates the superiority and robustness of the proposed model. Overall, the main contributions of this work are listed as follows, and these contributions can provide valuable references for related transportation studies, such as URT network modeling, information preservation, and feature extraction from traffic big data.
(1) The study proposes an end-to-end framework to successfully extract comprehensive spatial-temporal features from multi-source data and achieve STPFF. The experimental result compared to several widely used baselines demonstrates its superiority and robustness.
(2) The study conducts matrix operations at the data-wise level to capture the intrinsic relationship between passenger flow data and time-based OD data. Moreover, spatial and temporal features in POI data are extracted and involved independently using self-attention layers.
(3) The study constructs the MSF structure that spans through the whole framework to prevent information loss during the learning process and achieve comprehensive complementarity to a great extent via information preserving and skip connections.
Literature Review
Short-Term Passenger Flow Forecasting
It is well known that the extraction of temporal features is the key point in time-series forecasting. In early research, mathematical statistics-based models were mainly utilized. The autoregressive integrated moving average (ARIMA) 6 and seasonal ARIMA 7 have been widely applied to time-series prediction tasks. However, these methods, with fixed model structures and parameters, are limited in coping with uncertain and complex non-linear correlations, and cannot meet the real-time and precision requirements of current STPFF tasks. Next, the rise of machine learning offered a new way to capture time-series correlations. For example, the back-propagation neural network (BPNN) 8 and support vector regression (SVR) 9 were utilized for short-term passenger flow prediction and showed better performance. Besides, because a single model might miss some less obvious features, hybrid models such as the ARIMA-wavelet model 10 and the wavelet-SVM (wavelet support vector machine) mixed model 11 were proposed to leverage the advantages of the combined methods. However, these models are still limited when processing high-dimensional and complex spatiotemporal data, especially in large URT systems. Benefiting from the deep structure, deep-learning-based models performed even better in time-series forecasting and promoted the development of STPFF. 12 Among them, the most noteworthy model structures are the recurrent neural network (RNN) 13 and its specific variants, the long short-term memory (LSTM) network 14 and the gated recurrent unit (GRU), 15 which were proposed to overcome some shortcomings of the RNN. The LSTM network and GRU are able to capture the temporal features in time-series data efficiently and adequately, including long-term dependencies. Besides, the LSTM network was also combined with SVR 16 to improve the model structure and prediction accuracy.
Nevertheless, these models focused merely on temporal dependencies and ignored the spatial correlations in the URT network, a topological network that contains extensive spatial features.
Therefore, considering spatial correlations in topological metro networks, an LSTM-based spatial-temporal forecasting model was proposed, 17 using a time–cost matrix to represent spatial correlations. However, the spatial correlations within the URT network cannot be adequately represented in this way, since the matrix only connects stations via travel time, and crucial topological information in the network 1 is overlooked. Concurrently, convolutional neural networks (CNNs) showed an extensive capability to capture spatial dependencies 18 and achieved great performance in STPFF when combined with a dynamic time warping algorithm. 19 However, the CNN model can only handle Euclidean data, and much traffic information is prone to being lost during the conversion from non-Euclidean to Euclidean data. Subsequently, as research on traffic forecasting progressed, graph convolutional networks (GCNs) emerged as a more fitting approach for topological traffic networks. 20 Thus, one study 21 constructed a multi-GCN to extract spatial-temporal features and topological information. In this regard, graph attention networks (GATs) equipped with attention mechanisms were proposed 22 to emphasize the significance of various connections in the network. The GAT model is capable of focusing on essential features, thereby enhancing model interpretability. Nonetheless, the GAT’s ability may be constrained in high-dimensional data because of the number of layers and its incompatibility with residual connections. To capture high-order correlations, 23 one study put forward a GCN-based hypergraph structure to predict passenger flow. Despite demonstrating tremendous potential for road traffic forecasting, GCN-based models that rely on the adjacency matrix 24 have limitations in representing the connectivity among stations in URT systems, since these models were proposed for road traffic forecasting, 23 which differs from the URT system.
Road traffic flows are strictly constrained by corresponding traffic flow theories, but flows in the URT network can be abstracted to a skip from entry station to exit station. To address the problem, 24 one study illustrated the appropriateness of extracting spatial features with OD data, while another 25 constructed inter-station graphs to extract potential network interactions instead of physical connections. Moreover, another study 26 also constructed a metro network graph based on OD data to consider travel behaviors.
Multi-Source Data Fusion
To comprehensively capture spatial-temporal features in STPFF tasks, numerous recent studies have proposed hybrid architectures and leveraged multi-source data, which can enhance the prediction accuracy and robustness of the model. Generally, distinct neural networks are employed for feature extraction from different data sources, and this feature extraction is itself a significant part of multi-source data fusion. For instance, Zhao et al. 1 integrated the GCN and GRU, where the GCN is used to learn topological dependencies and the GRU is applied to extract temporal dynamic changes. Similarly, Geng et al. 27 extracted spatial features via the multi-GCN and analyzed time-series data using a contextual gated RNN. Li and Lasenby 28 introduced an attention mechanism in their study, employing a multi-head GAT and LSTM to conduct traffic forecasting, whereas Zhang et al. 3 proposed ResLSTM, which is based on the ResNet, GCN, and LSTM. ResLSTM also considers weather and air conditions, which affect passengers’ travel plans, and utilizes residual connections to prevent gradient vanishing while retaining more low-level information. Moreover, one study 29 also involves meteorological data for passenger demand forecasting and processes it appropriately. All of these studies considered multi-source data, from which spatial-temporal features, and even some external influencing features, can be obtained to improve forecasting precision. Therefore, how to fuse multi-source data reasonably and efficiently has become a significant topic. According to the fusion stage, some studies conduct feature-wise fusion,3,29,30 whose performance highly depends on the extracted features. Other studies fuse the sources at the decision level,23,27 which has low computational requirements but cannot take full advantage of multi-source complementarity. Generally, a single fusion layer fails to completely leverage each data source and achieve complementarity.
Although multi-source data fusion and hybrid architectures have achieved better accuracy, the loss of information during the learning and fusion process is still inevitable, and the intrinsic relationship between the data sources used is usually overlooked.
Problem Definition
The purpose of the study is to conduct STPFF in the URT system at the next multiple time intervals, utilizing historical automatic fare collection (AFC) data and POI data and defining that the forecasting target is
Based on AFC data, the passenger inflow matrix
Moreover, POI data are pre-processed to form the matrix
where
Methodology
This section presents the proposed multi-stage fusion passenger forecasting (MSFPF) model, illustrated in Figure 4, which leverages three data sources, namely inflow data, time-based OD data, and POI data. To overcome the challenge of information loss during the learning and multi-source fusion process, our MSF structure spans through the entire network and consists of three fusion stages, as described in the Multi-Stage Fusion section. To facilitate interaction and establish the relationship between different data sources, we apply two specific operations (spatial interaction and temporal interaction) between the inflow and OD matrices, which each transfer related inflow distribution and outflow features to the other. In addition, self-attention extractors are introduced to extract spatial and temporal features from the POI data, and the feature maps are then fused with the other two sources, as described in the Multi-Source Data Interaction and Relationship section. Furthermore, to capture temporal features more efficiently from the processed time-series data, the TLB is employed, as described in the Transformer Long Short-Term Memory Block section. Besides, the DAB, based on channel attention and depth-wise separable convolution modules, is developed to extract spatial and topological features from fused time-based OD data, as described in the Depth-Wise Attention Block section.

Framework of the proposed multi-stage fusion passenger forecasting model.
Multi-Stage Fusion
In a deep-learning network, information loss during the learning process is inevitable since non-linear mapping, convolutions, dimensionality reduction, and other operations are required for feature extraction and data fusion. Meanwhile, the loss of crucial information has negative effects on forecasting accuracy. To overcome this problem, we propose the MSF structure, which consists of three fusion stages, namely early-fusion, deep-fusion, and late-fusion. It spans the whole network, and adjacent stages are connected by residual connections. Thus, once the data undergoes operations that potentially lead to information loss, the fused matrices derived from the previous stage complement the feature maps. The details of each fusion stage are introduced below.
Early-fusion stage
Early-fusion can directly process low-level information and preserve the most comprehensive information. In our study, pre-processing is required because the inflow and OD matrices have different dimensions. A bidirectional long short-term memory (Bi-LSTM) network is employed to pre-process the inflow matrix, since previous studies have demonstrated the great potential of LSTM networks in processing time-series passenger flow, effectively capturing temporal features and long-term dependencies. Meanwhile, the time-based OD matrix is processed by DenseNet 31 to extract the spatial-temporal features, as shown in Figure 5. The dense block encourages feature reuse and allows for the efficient propagation of information between layers, which ensures the preservation of comprehensive information. Subsequently, two feature maps with the same dimension are combined into an early-feature matrix

The structure of DenseNet.
Deep-fusion stage
The deep-fusion stage allows for more flexible control over the fusion process and enables the preservation of features from multiple sources. Its main role is merging the feature maps processed by the TLB and DAB. In addition, to complement information lost in previous operations, the early-feature matrix is involved in this stage. By using weighted fusion, a deep-feature matrix
Late-fusion stage
Late-fusion is efficient and has low computational requirements since the features have already been processed to the decision-wise level at this stage. Its main role is fusing the decision matrices from different sources. Besides, considering the information loss in the fully connected layer and inadequate decisions, the deep-feature matrix is included. To effectively accomplish the fusion and model the distinctive significance, element-wise fusion is conducted by Equation 4, where
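The three stages share one pattern: the feature maps of the current stage are combined with the fused matrix carried over from the previous stage. The fusion equations are not reproduced in this copy of the text, so the following is only a minimal sketch that assumes an element-wise weighted sum; the function name `weighted_fuse`, the toy matrices, and the fixed weights are hypothetical (in a trained model the weights would presumably be learnable parameters).

```python
def weighted_fuse(feature_maps, weights):
    """Element-wise weighted sum of equally shaped 2D feature maps.

    feature_maps: list of 2D lists that all share one shape.
    weights: list of 2D lists (same shape), one weight matrix per feature map.
    """
    rows, cols = len(feature_maps[0]), len(feature_maps[0][0])
    fused = [[0.0] * cols for _ in range(rows)]
    for fmap, w in zip(feature_maps, weights):
        for i in range(rows):
            for j in range(cols):
                fused[i][j] += w[i][j] * fmap[i][j]
    return fused


# Toy stand-ins for the TLB output, the DAB output, and the early-feature
# matrix that is carried forward to complement lost information.
tlb_out = [[1.0, 2.0], [3.0, 4.0]]
dab_out = [[0.5, 0.5], [0.5, 0.5]]
early_feature = [[2.0, 0.0], [0.0, 2.0]]

ones = [[1.0, 1.0], [1.0, 1.0]]
half = [[0.5, 0.5], [0.5, 0.5]]

deep_feature = weighted_fuse([tlb_out, dab_out, early_feature], [ones, ones, half])
print(deep_feature)
```

Passing the previous stage's matrix as one more input is what plays the role of the skip connection in this sketch: its contribution survives even if the other branches have discarded detail.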
Multi-Source Data Interaction and Relationship
The interaction operations are conducted at the data-wise level to improve the interaction and establish a relationship between the inflow and time-based OD matrices, because these two sources have intrinsic correlations that can support spatial-temporal modeling. Data-wise-level operations can capture multi-source relationships within the input data, which allows for direct processing of low-level information. Moreover, POI data is utilized to provide additional spatial and temporal features independently for different data sources.
In the temporal interaction, the related outflow feature map that carries temporal information is extracted from the OD matrix by accumulating the passenger flows that exit at station
where
In the spatial interaction, we extract related distribution feature maps from inflow data that can support the spatial feature extraction via calculating the proportion
where
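The two interactions can be pictured with small matrices. The sketch below is an assumption-laden illustration, not the paper's formulas (which are missing from this copy): the temporal interaction is taken to be the column sum of one OD matrix, giving the outflow per exit station, and the spatial interaction is taken to be a row-wise destination proportion applied to the inflow vector. All function names and toy numbers are hypothetical.

```python
def outflow_from_od(od):
    """Temporal interaction (assumed form): sum each column of an OD matrix,
    i.e. accumulate all flows that exit at station j."""
    m = len(od)
    return [sum(od[i][j] for i in range(m)) for j in range(m)]


def destination_proportions(od):
    """Share of each origin's trips that goes to each destination (rows sum to 1
    unless the origin has no trips, in which case the row stays all zeros)."""
    props = []
    for row in od:
        total = sum(row)
        props.append([v / total if total else 0.0 for v in row])
    return props


def distribute_inflow(inflow, props):
    """Spatial interaction (assumed form): spread each station's inflow over
    destinations according to the proportions."""
    return [[inflow[i] * p for p in props[i]] for i in range(len(inflow))]


# One timestep of a 3-station toy network: od[i][j] = trips from i to j.
od = [[0, 3, 1],
      [2, 0, 2],
      [5, 5, 0]]
inflow = [8, 4, 10]

outflow = outflow_from_od(od)
distribution = distribute_inflow(inflow, destination_proportions(od))
print(outflow)
print(distribution)
```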
Transformer Long Short-Term Memory Block
Existing studies have demonstrated the great potential of LSTM networks in effectively extracting temporal features and long-term dependencies in time-series data. Moreover, the multi-head attention mechanism in the transformer enables the model to focus on the main features and improve forecasting efficiency. Therefore, by replacing the feedforward network with a Bi-LSTM network in the conventional transformer, the TLB is constructed to process the fused data, as illustrated in Figure 6.

The structure of the transformer long short-term memory block.
In the TLB, input data first passes through the positional encoding layer. This layer encodes the temporal order of the timesteps and records their time-series information, which would otherwise be unavailable because the conventional transformer model does not treat its input as naturally ordered. Moreover, multi-head attention is able to capture sufficient and accurate global spatial-temporal features in large-scale data by performing parallel computations of self-attention
Furthermore, the feature maps from multi-head attention will be processed by a Bi-LSTM network. It processes the data in both forward and backward directions, which allows for better handling of the sequence data and extracting comprehensive temporal features, especially long-term dependencies.
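To make the role of the positional encoding layer concrete, the sketch below uses the sinusoidal encoding of the original transformer. Whether the TLB uses this exact scheme is not stated in the text, so treat it as an assumption; the function name and the sizes (30 timesteps, 8 feature dimensions) are illustrative only.

```python
import math


def positional_encoding(num_steps, d_model):
    """Sinusoidal positional encoding: even columns use sine, odd columns use
    cosine, with wavelengths growing geometrically across the feature axis."""
    pe = [[0.0] * d_model for _ in range(num_steps)]
    for pos in range(num_steps):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# Each timestep gets a distinct vector that is added to its feature vector,
# so the attention layers can tell timesteps apart.
pe = positional_encoding(num_steps=30, d_model=8)
print(pe[0][:4])
print(pe[1][:4])
```

Because every row differs, two timesteps with identical flow values still enter the attention layer as different vectors, which is how ordering information reaches an otherwise order-agnostic model.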
Depth-Wise Attention Block
The input data in the DAB carry substantial quantities of temporal and spatial information, with a size (n, m, m), where

The structure of the depth-wise attention block.
In the channel attention module, 32 the input data is squeezed along the channel dimension, and each two-dimensional (2D) OD matrix in channel
The core concept of the depth-wise separable convolution module 33 is decomposing a standard convolution into two steps: depth-wise convolution and point-wise convolution. It collaborates well with the preceding module, taking the self-attention channel map as input. In the depth-wise convolution layer, different filters are used for each input channel, and the same number of intermediate feature maps are obtained after the convolution operations. The point-wise convolution layer can generate output features from intermediate maps via
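The two-step decomposition can be written out directly. The following pure-Python sketch uses no padding and stride 1, which are assumptions (the block's actual hyperparameters are not given in this copy); the function names and toy tensors are hypothetical.

```python
def depthwise_conv(x, kernels):
    """Apply one k x k filter per input channel (no padding, stride 1).

    x: list of C channels, each an H x W 2D list.
    kernels: list of C filters, each a k x k 2D list.
    Returns C intermediate feature maps, one per input channel.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        h, w = len(channel), len(channel[0])
        fmap = [[sum(channel[r + a][c + b] * kernel[a][b]
                     for a in range(k) for b in range(k))
                 for c in range(w - k + 1)]
                for r in range(h - k + 1)]
        out.append(fmap)
    return out


def pointwise_conv(x, weights):
    """1 x 1 convolution: weights[o][c] mixes input channel c into output o."""
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(weights[o][c] * x[c][r][col] for c in range(len(x)))
              for col in range(w)]
             for r in range(h)]
            for o in range(len(weights))]


# Two 3 x 3 input channels (think of two timesteps of a tiny OD matrix).
x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],
     [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]
kernels = [[[1, 0], [0, 1]],
           [[1, 1], [1, 1]]]

intermediate = depthwise_conv(x, kernels)
output = pointwise_conv(intermediate, [[1, 1], [1, -1]])
print(intermediate)
print(output)
```

With C input channels, O output channels, and a k x k kernel, the separable form needs k·k·C + C·O weights instead of the k·k·C·O of a standard convolution, which is the usual motivation for the decomposition.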
Experiments
Experimental Setup
Data Description
In this study, two large-scale real-world AFC datasets from the URT system are utilized, collected in 2020 and 2021, respectively. A total of 64 metro stations are involved in the first dataset, and 85 stations are considered in the second dataset because stations were added between the two periods. Each dataset covers five weeks, and each record contains the passenger ID, entry time, entry station, exit time, and exit station. To ensure data quality, “meaningless data,” “blank data,” and “error data” are removed in data cleaning, such as cases where the entry time is much earlier than the operation time of rail transit, where some information is missing, or where the inbound time is later than the outbound time. Eventually, 17,790,319 and 18,232,887 valid records remain in the two datasets, respectively. As mentioned in the Problem Definition section, inflow matrices and time-based OD matrices are extracted for the study.
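As a rough illustration of the cleaning rules and the matrix extraction described above, the sketch below filters toy AFC records and aggregates them into per-timestep inflow and OD counts. Several details are assumptions rather than facts from the study: times are minutes since midnight, service is taken to start at 05:00, the timestep is 15 minutes, trips are binned by entry time, and any entry before service start is dropped (the text only says "much earlier"). All names are hypothetical.

```python
def clean(records, service_start=300):
    """Drop records with missing fields, entries before service start,
    or an entry time later than the exit time."""
    kept = []
    for rec in records:
        if any(v is None for v in rec):
            continue  # "blank data"
        _, t_in, _, t_out, _ = rec
        if t_in < service_start or t_in > t_out:
            continue  # "meaningless" or "error" data
        kept.append(rec)
    return kept


def aggregate(records, num_stations, num_steps, step=15, service_start=300):
    """Count entries per (timestep, station) and trips per (timestep, origin, destination)."""
    inflow = [[0] * num_stations for _ in range(num_steps)]
    od = [[[0] * num_stations for _ in range(num_stations)]
          for _ in range(num_steps)]
    for _, t_in, s_in, _, s_out in records:
        t = (t_in - service_start) // step
        if 0 <= t < num_steps:
            inflow[t][s_in] += 1
            od[t][s_in][s_out] += 1
    return inflow, od


# (passenger_id, entry_minute, entry_station, exit_minute, exit_station)
raw = [
    ("a", 305, 0, 330, 1),
    ("b", 310, 0, 340, 2),
    ("c", 320, 1, 350, 0),
    ("d", 100, 2, 130, 0),   # before service start
    ("e", 330, 1, None, 2),  # missing exit time
    ("f", 340, 2, 335, 1),   # entry later than exit
]

valid = clean(raw)
inflow, od = aggregate(valid, num_stations=3, num_steps=4)
print(len(valid), inflow[0], inflow[1])
```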
In addition, two POI datasets are used, corresponding to the two periods of AFC data. Each raw record contains the POI name, code, location, and category; this study focuses primarily on the location and category. We counted the number of points in each category within a radius of 1000 m around each station to obtain the POI matrices.
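The radius-based counting can be sketched as follows. The distance measure is not specified in the text, so great-circle (haversine) distance is assumed here; the coordinates, category names, and function names are made up for illustration.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def poi_matrix(stations, pois, categories, radius_m=1000.0):
    """Rows are stations, columns are categories, entries are POI counts
    within radius_m of the station. POIs of unlisted categories are ignored."""
    index = {c: k for k, c in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in stations]
    for i, (s_lat, s_lon) in enumerate(stations):
        for lat, lon, cat in pois:
            if cat in index and haversine_m(s_lat, s_lon, lat, lon) <= radius_m:
                matrix[i][index[cat]] += 1
    return matrix


# Hypothetical coordinates: two stations about 5.5 km apart.
stations = [(39.90, 116.40), (39.95, 116.40)]
pois = [
    (39.901, 116.401, "office"),       # about 140 m from station 0
    (39.905, 116.400, "residential"),  # about 560 m from station 0
    (39.951, 116.400, "office"),       # about 110 m from station 1
    (39.920, 116.400, "office"),       # more than 2 km from both
    (39.900, 116.401, "park"),         # category not tracked
]

matrix = poi_matrix(stations, pois, ["office", "residential"])
print(matrix)
```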
Model Configuration
This section describes the implementation details of the proposed model. Each dataset was divided into a training set, validation set, and testing set with a split rate of 85% (10%) and 15%. In each forecasting process, 30 historical timesteps of three temporal patterns, namely the real-time, daily, and weekly patterns, are employed to predict network-level passenger flow for the next one, two, and three steps, respectively. The real-time pattern represents the data in recent timesteps. The daily and weekly patterns denote the data in the same time intervals of the last day and the last week, respectively. To balance training time and forecasting precision, we set the batch size to 32 and the learning rate to 0.0005 after several experiments. In the training process, the Adam optimizer is used and the mean squared error (MSE) is selected as the training loss. Moreover, model checkpointing and early stopping are employed to save the candidate model and to end training early to avoid overfitting, respectively. Furthermore, to reduce the impact of parameter initialization, we train the model five times and take the average values as the result.
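The three temporal patterns amount to three windows cut from the same series at different offsets. The sketch below shows one way to index them; it assumes that each pattern gets its own window of `history` steps and that the daily and weekly windows are the real-time window shifted back by one day and one week. How the 30 historical timesteps are divided among the patterns, and the number of timesteps per day, are not stated here, so `history` and `steps_per_day` are hypothetical parameters.

```python
def build_sample(series, t, history, steps_per_day, horizon=3):
    """Return (realtime, daily, weekly, target) windows for forecast origin t.

    series: list indexed by timestep (each item could be a per-station vector).
    realtime: the `history` steps right before t.
    daily / weekly: the same window shifted back by one day / one week.
    target: the next `horizon` steps starting at t.
    """
    steps_per_week = 7 * steps_per_day
    if t - steps_per_week - history < 0 or t + horizon > len(series):
        raise IndexError("not enough data around timestep t")
    realtime = series[t - history:t]
    daily = series[t - steps_per_day - history:t - steps_per_day]
    weekly = series[t - steps_per_week - history:t - steps_per_week]
    target = series[t:t + horizon]
    return realtime, daily, weekly, target


# Toy series whose value equals its timestep index, so windows are easy to read.
series = list(range(1000))
realtime, daily, weekly, target = build_sample(
    series, t=600, history=10, steps_per_day=72)
print(realtime[0], daily[0], weekly[0], target)
```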
Baseline Models
To validate the performance of the proposed model, nine widely used traffic forecasting models are employed for comparison. Moreover, four variants of the proposed model are constructed by modifying specific structures. A brief description of the selected baselines and variants is given below. In addition, some specific modifications are required to run the baselines 34 because some of them were proposed for distinct tasks (e.g., road traffic state forecasting) or utilized different data inputs (e.g., passenger inflow, outflow, and adjacency matrix).
(1)
(2)
(3)
(4)
(5)
(6)
(7)
(8)
(9)
(10)
(11)
(12)
(13)
(14)
Evaluation Metrics
In this study, the root mean square error (RMSE), mean absolute error (MAE), and weighted mean absolute percentage error (WMAPE) are selected as evaluation metrics. The definitions of the three metrics are introduced in Equations 16–18, where n represents the number of samples,
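Equations 16–18 are not reproduced in this copy, so the sketch below falls back on the commonly used definitions of the three metrics; in particular, WMAPE is taken as the sum of absolute errors divided by the sum of absolute ground-truth values, which is an assumption about the paper's exact form. Function names and toy values are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error over n samples."""
    n = len(y_true)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / n)


def mae(y_true, y_pred):
    """Mean absolute error over n samples."""
    n = len(y_true)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / n


def wmape(y_true, y_pred):
    """Weighted MAPE: total absolute error relative to total ground-truth flow.
    Unlike plain MAPE, it stays defined when individual true values are zero."""
    total = sum(abs(a) for a in y_true)
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / total


y_true = [100, 200, 300, 400]
y_pred = [110, 190, 330, 360]

print(rmse(y_true, y_pred), mae(y_true, y_pred), wmape(y_true, y_pred))
```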
Results and Analysis
Network-level forecasting performance analysis
Tables 1 and 2 present the quantitative comparisons between the MSFPF model, its variants, and the baselines on the two real-world datasets. With respect to forecasting accuracy, the deep-learning-based models perform better than the conventional machine-learning-based models. For instance, the LSTM network outperforms the SVR model by 7.33%, 16.48%, and 23.09% with respect to MAE for the three prediction steps, respectively. Meanwhile, LSTM-based models, particularly the LSTM and Bi-LSTM models, achieve relatively better forecasting results, which illustrates the superiority of LSTM in processing medium and long-term time-series data. However, the LSTM network fails to effectively capture spatial correlations. Although the transformer model is also suitable for time-series forecasting, the standard transformer does not treat its input as naturally ordered in time, so pre-processing is required. Moreover, although the ST-ResNet model can extract spatial correlations, it is relatively weak in temporal dependency extraction. In addition, the CNN cannot directly obtain high-quality spatial-temporal features via convolutions, while the GCN and T-GCN models do not achieve remarkable results because of the weak topological representation in the URT system, as mentioned before. Moreover, the models that use multi-source data, such as the ST-ResNet, Graph WaveNet, and STIPM, suffer from information loss and thus fail to achieve the desired effect.
Quantitative Comparison of the Proposed Model and Baselines (Dataset 1)
Note: RMSE = root mean square error; MAE = mean absolute error; WMAPE = weighted mean absolute percentage error; SVR = support vector regression; LSTM = long short-term memory; ConvLSTM = convolutional long short-term memory; Bi-LSTM = bidirectional long short-term memory; GCN = graph convolutional network; MSFPF = multi-stage fusion passenger forecasting; TLB = transformer long short-term memory block; DAB = depth-wise attention block; POI = point of interest; MSF = multi-stage fusion; ST-ResNet = Spatio-Temporal Residual Network; T-GCN = Temporal Graph Convolutional Network; GWN = Graph WaveNet; STIPM = Spatial-temporal Integrated Prediction Model. The use of bold/italics aims to emphasize our method.
Quantitative Comparison of the Proposed Model and Baselines (Dataset 2)
In addition, the proposed MSFPF model consistently shows superior performance over the baselines on all evaluation indexes, forecasting timesteps, and datasets. For example, in dataset 1, compared to the baselines, MSFPF significantly improves the RMSE, MAE, and WMAPE indexes for single-step, two-step, and three-step forecasting. The prediction performance can be seen more intuitively in Figure 8, which compares the predicted flow with the ground truth, using a weekday as an example. The ground truth passenger flows in the metro network on that day are shown in red, while the predicted values are displayed in blue; the two agree closely for all forecasting timesteps. Overall, these results indicate that both spatial and temporal features are extracted comprehensively and appropriately by the MSFPF model, which demonstrates its robustness and superiority. Furthermore, the significance of the MSF structure is evident from the variant MSFPF-no MSF, whose performance drops dramatically. A detailed analysis of the variants is given in the Ablation Experiments section.

Illustration of the comparison between prediction flow and ground truth.
Single-station forecasting performance analysis
In Figure 8, different stations show specific regularities and forecasting performances. Thus, we illustrate the comparisons between forecasting values and ground truth in three typical stations in Figures 9–11, respectively.

Prediction–ground truth comparison at commuting station with multi-step.

Prediction–ground truth comparison at commercial station with multi-step.

Prediction–ground truth comparison at transport hub station with multi-step.
Error distribution analysis
The evaluation metrics shown in Tables 1 and 2 measure the forecasting accuracy with respect to average errors in the whole network. Therefore, to further validate the forecasting performance, besides analyzing the overall and individual performances, it is also important to explore the error distributions among all metro stations and data points. We utilize boxplots to illustrate the evaluation metrics of all data points in Figures 12–14, which show the forecasting error distributions of the MSFPF model and the selected baseline models with the RMSE, MAE, and WMAPE, respectively. The boxplots show that the MSFPF model produces few abnormal data points, and its error indicators are generally lower than those of the baseline models. Furthermore, Figure 15 shows the actual distribution of the prediction values and ground truth values, where the data points are closely fitted on the line

Boxplots of the multi-stage fusion passenger forecasting (MSFPF) model and baselines with multi-step (root mean square error [RMSE]).

Boxplots of the multi-stage fusion passenger forecasting (MSFPF) model and baselines with multi-step (mean absolute error [MAE]).

Boxplots of the multi-stage fusion passenger forecasting (MSFPF) model and baselines with multi-step (weighted mean absolute percentage error [WMAPE]).

Scatters and fitting of prediction–ground truth points.
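The statistics behind an error boxplot can be sketched as below. This is an illustrative helper, not the plotting procedure used for Figures 12–14: the function name is hypothetical, and outliers are assumed to follow the conventional 1.5 × IQR rule.

```python
import statistics


def boxplot_summary(errors):
    """Quartiles, whisker ends, and outliers under the 1.5 * IQR rule."""
    q1, median, q3 = statistics.quantiles(errors, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [e for e in errors if low_fence <= e <= high_fence]
    outliers = [e for e in errors if e < low_fence or e > high_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical per-station errors (not data from the paper).
station_errors = [1, 2, 3, 4, 5, 6, 7, 8, 100]
summary = boxplot_summary(station_errors)
print(summary["median"], summary["outliers"])  # 5.0 [100]
```

A model with a lower median, a narrower box, and fewer outliers has errors that are both smaller and more evenly spread across stations, which is the property the boxplots are used to check.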
Multi-step forecasting performance analysis
Figure 16 illustrates the comparison of multi-step prediction performance, taking the RMSE metric as an example. Compared to simple medium- and long-term forecasting, multi-step short-term forecasting provides more detailed and comprehensive information within the same forecasting period. Among the models, the SVR model, which uses the “direct multi-step prediction” strategy, shows the most significant change because of the incoherence in the second and third steps. The other models employ the “multi-input and multi-output” strategy to accomplish multi-step prediction. As shown in Figure 16, the prediction accuracy decreases as the number of forecasting timesteps increases, although rare exceptions may occur, as in dataset 2, because of the uncertainty across repeated experiments. Meanwhile, as the total prediction period extends, the task difficulty also increases (40). Nevertheless, the accuracy changes are still within an acceptable range, indicating the feasibility of using multi-step forecasting in STPFF tasks.

Performance comparison for multi-step forecasting (root mean square error [RMSE]).
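The difference between the two multi-step strategies mentioned above can be sketched at the level of training-sample construction. The helper names, the lag length, and the toy series are assumptions for illustration only; the sketch does not reproduce the input pipeline of the MSFPF model or of the SVR baseline.

```python
def mimo_samples(series, n_lags, horizon):
    """Multi-input multi-output: each window maps to a vector of future steps."""
    samples = []
    for start in range(len(series) - n_lags - horizon + 1):
        inputs = series[start:start + n_lags]
        targets = series[start + n_lags:start + n_lags + horizon]
        samples.append((inputs, targets))
    return samples


def direct_samples(series, n_lags, horizon):
    """Direct strategy: one training set per step, each with a scalar target."""
    windows = mimo_samples(series, n_lags, horizon)
    return {
        step: [(inputs, targets[step - 1]) for inputs, targets in windows]
        for step in range(1, horizon + 1)
    }


# Hypothetical flow series (not data from the paper).
flows = [10, 20, 30, 40, 50, 60]

mimo = mimo_samples(flows, n_lags=3, horizon=2)
direct = direct_samples(flows, n_lags=3, horizon=2)

print(mimo[0])       # ([10, 20, 30], [40, 50])
print(direct[2][0])  # ([10, 20, 30], 50)
```

Under the direct strategy, a separate model would be fitted to each entry of `direct`, so the predictions for different steps are not learned jointly. Under the multi-input and multi-output strategy, a single model is trained on `mimo` and outputs all steps at once.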
Ablation Experiments
An ablation experiment removes or modifies specific components of the proposed deep-learning framework to validate the effect of those components. Therefore, a series of variants of the MSFPF model is constructed for further analysis. The details of the variants are described in the Baseline Models section, and the model performances are reported in Tables 1 and 2. In the following analysis, dataset 1 is taken as an example.
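When a variant is compared with the full model, the contribution of the removed component is commonly expressed as a relative error reduction. The sketch below assumes that the reduction is measured relative to the variant's error. The function name and the numbers are hypothetical placeholders, not values from Tables 1 and 2.

```python
def relative_reduction(variant_error, full_error):
    """Percentage by which the full model lowers the variant's error."""
    if variant_error <= 0:
        raise ValueError("variant_error must be positive")
    return 100.0 * (variant_error - full_error) / variant_error


# Hypothetical RMSE values for an ablated variant and the full model.
reduction = relative_reduction(variant_error=50.0, full_error=40.0)
print(reduction)  # 20.0
```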
As shown in Table 1,
Conclusion
This study proposes a deep-learning framework, the MSFPF model, to conduct multi-step STPFF at the network level. First, unlike most previous studies, time-based OD data is used to represent the topological information in the URT system instead of physical adjacency. Second, data interaction and relationship establishment among multi-source data are conducted data-wise. Moreover, we construct the MSF structure, TLB, and DAB to extract and leverage spatial-temporal features effectively and comprehensively. Furthermore, the MSFPF model is validated on two real-world large-scale datasets. Based on its performance, the following conclusions can be drawn.
(1) The proposed framework extracts and leverages sufficient spatial and temporal features from multi-source data, and the prediction performances and ablation experiments demonstrate the superiority and robustness of the MSFPF model compared with widely used baselines and specific variants.
(2) The MSF structure preserves information via three fusion stages and residual connections. It alleviates the problem of information loss during feature extraction and multi-source fusion, and significantly improves forecasting accuracy, with the RMSE index decreasing by 32.68%, 36.96%, and 32.00% for single-step, two-step, and three-step forecasting, respectively.
(3) In addition, conducting multi-source data interaction and relationship establishment, instead of treating the data sources independently, can explore the intrinsic correlations between various data sources and enhance the effect of data utilization.
(4) The TLB and the DAB extract corresponding spatial-temporal features effectively and comprehensively, where the TLB illustrates great potential in time-series data processing and the DAB successfully extracts sufficient spatial features while reducing the size of model parameters by nearly 39.4%.
In future research, it is worthwhile to consider improving prediction accuracy at integrated transport hubs, where the passenger flow fluctuates rapidly and shows poor regularity. For instance, railway timetable data could be introduced to help model the flow changes and provide more temporal information. Furthermore, transfer behaviors or large events (41) could also be taken into consideration to improve the demand prediction of multiple transit modes simultaneously (38, 39, 42).
Footnotes
Author Contributions
The authors confirm contribution to the paper as follows: study conception and design: Y. Chen, J. Zhang; data collection: Y. Chen, J. Zhang; analysis and interpretation of results: Y. Chen, J. Zhang; draft manuscript preparation: Y. Chen, J. Zhang, Y. Lu, K. Yang, H. Liu, Y. Liang. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Nos. 72201029, 72288101, 73222022).
