Abstract
Water resource management and disaster risk reduction depend on accurate rainfall-runoff modeling (RRM). Time-series data frequently exhibit fine details and local variations that are difficult for traditional LSTM models to capture. To overcome these problems, we introduce an improved RRM model based on a spatial attention-enhanced transductive LSTM (TLSTM) network. By applying transductive learning to data points that closely resemble the test set, the model captures subtle temporal differences and improves performance. A spatial attention mechanism lets the model focus on the most important parts of the input. The model also includes an enhanced differential evolution (DE) algorithm for hyperparameter tuning, whose mutation strategy uses K-means clustering to identify the most promising cluster of candidate solutions. We evaluated the model on the Catchment Attributes and Meteorology for Large Sample Studies (CAMELS) dataset for both individual and regional RRM. For individual basins, the model achieved a Nash-Sutcliffe efficiency (NSE) of 0.728 for 8-day runoff prediction. For regional assessment, it achieved an NSE of 0.878. These results indicate that combining TLSTM with spatial attention and an enhanced DE algorithm increases the accuracy and reliability of rainfall-runoff (RR) predictions, providing opportunities for improved water resource planning and disaster management. The source code is publicly available at https://github.com/ZhaoChenchina/RRM.
Introduction
In hydrology, reservoir management, flood forecasting, and water resource planning all depend on RRM. 1 Despite recent progress, the RR process is complex, nonlinear, and dynamic, which makes runoff hard to predict accurately. This complexity is compounded by factors such as topography, land use, rainfall variability, initial soil moisture conditions, and infiltration rates. Sudden storms and impermeable surfaces can rapidly turn runoff into flooding, so a thorough understanding of the underlying rainfall processes is essential. 2
There are two main types of RRM: physically based models and data-driven models.3–5 Hydrologic engineering center-hydrologic modeling system (HEC-HMS),6,7 Xinanjiang (XAJ) model, 8 soil and water assessment tool (SWAT), 9 MIKE-SHE, 10 and HSPF 11 are examples of physically-based models that use mathematical representations to simulate hydrological behaviors. 12 The effectiveness of these models has been quantitatively evaluated in several recent studies. A 2025 case study conducted in the Indian Ravishankar Sagar catchment, for instance, revealed excellent HEC-HMS performance. During calibration and validation, the NSE of the model was 0.910 and 0.835, respectively. 13 Applying bias correction resulted in a significant improvement in SWAT performance in another 2025 study conducted in Morocco. Before the correction, NSE was less than 0.60; following the correction, it surpassed 0.78. 14 Although these physical models offer useful insights, their development is complex and time-consuming due to the need for in-depth knowledge of hydrological dynamics and specific catchment parameters. 15
Data-driven models offer an effective solution by forming relationships between input and output data without necessitating a comprehensive understanding of the physical systems. 16 Utilizing methods like artificial neural networks (ANN),17,18 support vector regression (SVR),19–21 random forests (RF),22,23 and fuzzy logic,24,25 these models leverage historical rainfall and runoff data. They excel in managing nonlinear and stochastic environments and can make precise forecasts by detecting patterns and relationships within historical data. 17 For instance, the SVR model in ref. 26 optimized for global flood hotspot prediction demonstrated superior accuracy. In training, it achieved a mean squared error (MSE) of 0.016, a root MSE (RMSE) of 0.0130, and a standard deviation of 0.129. In testing, it achieved MSE = 0.029, RMSE = 0.170, and a standard deviation of 0.170, while delivering AUC values of 0.86 for training and 0.84 for testing. However, these traditional machine learning (ML) methods often struggle with long-term temporal dependencies, spatial heterogeneity, and high-dimensional input features.
Recently, deep learning (DL) has gained prominence in hydrological studies due to its ability to handle intricate nonlinear relationships.27,28 DL methods can improve traditional data-driven methods by automatically extracting features. Among various DL frameworks, LSTM networks are particularly noted for their success in predicting rainfall,29–32 forecasting floods,33,34 and estimating river water levels.35–38 For example, in a 2025 runoff forecasting study in the Velika Morava basin (Serbia), an LSTM model achieved an MSE
To overcome this shortfall, this paper uses the TLSTM approach for RRM. This method employs transductive learning to precisely adjust weights based on the closeness of new data points, improving predictive performance. In the storm example, TLSTM dynamically prioritizes training samples that are hydrologically similar to the storm-affected basin, enabling more accurate peak runoff estimation. This approach combines transductive learning with the robust capabilities of LSTM to provide an advanced method for time series analysis. It can identify both distinct local patterns and broader trends. Furthermore, we integrate a spatial attention mechanism into TLSTM to highlight the sub-regions and input features that are most influential for runoff generation. This integration improves the ability of the model to handle spatial heterogeneity in RR processes. In contrast, recent rainfall-runoff models, including transformer architectures, 42 primarily apply attention to temporal sequences. These models focus on time–step dependencies rather than explicitly identifying spatially dominant regions. By embedding spatial attention within TLSTM, our approach extracts localized features tailored to heterogeneous catchments. This design enhances its capacity to capture the unique hydrological dynamics of a basin. 43
The sensitivity of recurrent networks to hyperparameter settings is a significant problem, particularly for TLSTM models. The precise adjustment of variables such as learning rate, layer count, and unit size is crucial to optimal performance. Inadequate setups may result in underfitting, where important patterns are missed, or overfitting, where the model captures noise. Tuning consequently becomes a difficult and time-consuming process. Researchers have employed optimization techniques, such as grid search 44 and genetic algorithms, 45 to address this issue. Although grid search is straightforward, it loses efficiency in high-dimensional spaces. Although genetic algorithms are more adaptable, their wider application is constrained by the need for discretizing parameters. 46 Metaheuristic methods, such as the DE algorithm, have become popular as a more efficient alternative. One notable feature of DE is its robust search capability, which iteratively refines candidate solutions based on their fitness. DE uses crossover and differential mutation to explore the parameter space. It frequently converges more quickly than conventional techniques while maintaining efficiency. Because of its capacity to manage continuous variables, it is especially well-suited for optimizing DL models, like TLSTM, where accuracy in hyperparameter control is closely linked to performance.
The DE algorithm operates in three essential steps: mutation, crossover, and selection. During the mutation phase, new solutions are created by adding scaled differences between existing solutions to a base solution. During the crossover phase, the mutated vector is combined with an existing solution, which introduces fresh variation into the population. During the selection phase, the newly generated solutions are evaluated against the existing ones to determine which remain in the population. The mutation stage plays an essential role because it introduces new variations. By renewing the solution pool, it keeps the optimization process active and prevents it from stagnating, which the algorithm needs to remain flexible in complex solution spaces. 47 To enhance DE, we adopted concepts from the human mental search (HMS) approach, which give mutation a more prominent role. As part of this refinement, the k-means algorithm groups the current population of solutions into clusters. The cluster with the lowest average objective function value is then selected, and the best candidate within that cluster is used to begin the mutation stage. Finally, a specialized technique updates the candidate solutions in the population, which improves the accuracy of the hyperparameter tuning process.
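The mutation, crossover, and selection steps, together with the cluster-based choice of the base vector, can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names (`kmeans`, `clustered_de`), the parameter defaults, the plain Lloyd's k-means, and the use of two random population members for the difference vector are all hypothetical choices, and the final population-update technique mentioned above is omitted.

```python
import numpy as np

def kmeans(points, k, rng, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def clustered_de(objective, bounds, pop_size=20, k=3, F=0.5, CR=0.9,
                 generations=50, seed=0):
    """Minimize `objective` with DE whose base vector comes from the best cluster."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T
    pop = rng.uniform(low, high, size=(pop_size, len(low)))
    fit = np.array([objective(x) for x in pop])
    for _ in range(generations):
        # Cluster the population and select the cluster with the lowest mean objective.
        labels = kmeans(pop, k, rng)
        best_cluster = min(np.unique(labels), key=lambda c: fit[labels == c].mean())
        members = np.flatnonzero(labels == best_cluster)
        # The best candidate of that cluster is the base vector for mutation.
        base = pop[members[fit[members].argmin()]].copy()
        for i in range(pop_size):
            # Mutation: base vector plus a scaled difference of two population members.
            r1, r2 = rng.choice(pop_size, size=2, replace=False)
            mutant = np.clip(base + F * (pop[r1] - pop[r2]), low, high)
            # Crossover: mix mutant and current vector, keeping at least one mutant gene.
            mask = rng.random(len(low)) < CR
            mask[rng.integers(len(low))] = True
            trial = np.where(mask, mutant, pop[i])
            # Selection: the trial replaces the current vector only if it is no worse.
            f_trial = objective(trial)
            if f_trial <= fit[i]:
                pop[i], fit[i] = trial, f_trial
    best = fit.argmin()
    return pop[best], fit[best]
```

For hyperparameter tuning, `objective` would train the model with the decoded hyperparameters and return a validation error; on a toy function such as `lambda x: float(np.sum(x ** 2))` the routine simply returns a point close to the origin.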
The goal of this study is to create a new RR model that combines a TLSTM network with spatial attention and employs an HMS-based DE algorithm for hyperparameter optimization. The main contributions are as follows: (i) To capture minute temporal variations in RRM, the model presents a spatial attention-enhanced TLSTM network. By using transductive learning and focusing on data points that are similar to the test set, this method provides notable gains over conventional LSTM models and adapts better to the specific and complex features of time-series data. (ii) A DE algorithm explicitly designed for the challenging task of hyperparameter tuning in RRM is incorporated into the model. DE enables more thorough and efficient optimization, improving performance by tailoring the hyperparameters to the characteristics of the RR data. (iii) The model implements a new mutation technique based on k-means clustering to enhance DE. By identifying promising clusters in the population of candidate solutions, this method introduces new candidates into the existing pool of hyperparameter configurations. The improved mutation approach expands both the diversity and the quality of the solutions examined during optimization and yields model configurations that produce more accurate results.
The remainder of this paper is organized as follows. Section 2 reviews the current literature on RR, Section 3 presents the proposed model, Section 4 presents the results, and Section 5 concludes the paper.
Related works
Given its important role in flood prediction, water resource management, and environmental preservation, RRM needs accurate forecasting. 48 Flood prediction, reservoir management, and water resource management all depend on dependable models that help reduce the consequences of extreme weather. Precise forecasts of reservoir conditions enable water resource managers to adjust operations to control surpluses and prevent shortages. Precise runoff estimation is essential in flood prediction, where it triggers warning systems that protect people and their property. The analysis of runoff patterns is also fundamental to environmental protection because it supports both ecosystem health monitoring and land-use impact evaluation. The development of precise RRM therefore serves both theoretical research and practical decision-making in environmental and hydrological management. 15
The research on ML and DL applications in RRM is compiled in this section.
ML
In 2022, Xiao et al. 49 employed four ML techniques at the Wuzhou station, located in the central section of the Xijiang River, to develop a highly accurate runoff forecasting model. The best model for forecasting runoff and water levels was the generalized regression neural network (GRNN). Singh et al. 50 used diverse data-driven models to predict RR in the Gola watershed, including RF, multivariate adaptive regression splines (MARS), and multiple linear regression (MLR). Several graphical plots were used in their thorough analysis to effectively present the results.
In 2023, Anaraki et al. 51 compared nine separate and hybrid ML models to simulate RR processes. These models included hybrids, including KNN enhanced with a gorilla troop optimizer, as well as more conventional models, including ANN and least squares SVM. Additionally, they presented a new feature selection technique that integrated empirical mode decomposition and principal component analysis (PCA). Shekar et al. 52 estimated monthly streamflow in the Murredu River basin using the SWAT model with eight AI models. They included KNN regression, SVR, and LSTM.
In 2024, four distinct ML models, including support vector machines (SVMs), gene expression programming (GEP), multilayer perceptrons, and multivariate adaptive regression splines (MARS), were used by Fuladipanah et al. 53 to study RRM. They examined a dataset comprising 4765 daily records from the rainfall and hydrometric stations of the Malwathu Oya watershed, collected between July 18, 2005, and September 30, 2018. Using metrics like RMSE, mean absolute error (MAE), R2, and developed discrepancy ratio (DDR), the study assessed the models and concluded that the GEP model performed best. In Slovenia's karst Ljubljanica River catchment, Sezen and Šraj 54 used three combined conceptual models, including a snow module with ML models for hourly RRM. They combined the CemaNeige Génie Rural à 4 paramètres Horaires (GR4H) conceptual model with a wavelet-based regression tree (WRT) and a wavelet-based extreme learning machine (WELM). The performance of these hybrid models was contrasted with that of independent conceptual and ML models. Shah et al. 55 used ML techniques with models like the ANN multilayer perceptron (ANN-MLP), RF, and MLR to predict runoff in Quetta Valley, located in the Pishin Lora Basin. They utilized remotely sensed meteorological data from 1990 to 2022, gathered from the Modern-Era Retrospective Analysis for Research and Applications-2 (MERRA-2) satellite, and modeled runoff over daily, weekly, and monthly timescales. The inputs included rainfall, humidity, and temperature, with runoff data simulated using the soil conservation service curve number (SCS-CN) method due to the absence of actual records. Iamampai et al. 56 devised a technique to refine RR model predictions by leveraging rainfall accumulation at various intervals alongside the soil water index (SWI). This method addressed the shortcomings of models that depend solely on rainfall by incorporating changes in soil moisture and runoff. 
They employed RF and ANN models to simulate daily runoff, noting that accumulated rainfall was a pivotal input. Remarkably, the RF model surpassed the ANN and conceptual models in accuracy. Using soil moisture data spanning 39 years, Kantharia et al. 57 used an adaptive neuro-fuzzy inference system (ANFIS) to estimate daily discharge in the Damanganga basin.
In 2025, Houénafa et al. 58 proposed a novel framework that combines ML techniques with stochastic hydrologic modeling. They employed wavelet-enhanced models, such as wavelet-based gated recurrent units (WGRU) and wavelet-based extreme gradient boosting (WXGBoost). These models improved streamflow prediction by utilizing daily discharge variability and statistical features. Sayama et al. 59 presented a parameter regionalization technique for distributed RRM, in which Bayes' theorem and conditional probability direct parameter assignment. This method decreased computational demand by connecting model parameters to regional hydrological features. To simulate monthly RR processes in watersheds in southern Thailand, Kaewthong et al. 60 compared the performance of ML algorithms, including MLR, MLP, and SVM, with that of the Modèle du Génie Rural à 2 paramètres Mensuel (GR2M) conceptual model.
The analysis of different ML approaches reveals common issues when these methods are used for RR. The methods depend heavily on data quality and perform poorly when data are sparse, and they face challenges with overfitting during training and with geographic generalization. Additionally, these models need substantial computing resources and expert knowledge to optimize their numerous parameters effectively.
DL
In 2023, to overcome the limitations of DL and conceptual models in hydrology, Kapoor et al. 48 combined sophisticated DL architectures, particularly convolutional neural network (CNN) and LSTM networks, with the Modèle du Génie Rural à 4 paramètres Journalier (GR4J) RR model. The hybrid method unites the simplicity of conceptual models with the predictive power of deep learning for enhanced performance and simplified model design. Their research demonstrates that hybrid models adapt better across different hydrological settings and outperform both conceptual models and traditional deep neural networks (DNNs), specifically in arid catchment areas. Dang et al. 61 sought to enhance RR prediction accuracy by optimizing ML and DL models. They integrated four optimizers into LSTM models: root mean square propagation (RMSprop), Adagrad, Adadelta, and Adam. They also tested dropout rates of 0%, 10%, 20%, and 30% in LSTM architectures at two hydro-meteorological stations in the Mekong Delta, Vietnam. The aim was to identify the optimizer and dropout rate that together yield the best model performance and reduce overfitting in hydrological forecasting.
In 2024, Ishida et al. 2 presented a novel DL architecture for hourly RR modeling. The architecture links LSTM networks to a one-dimensional CNN (1D-CNN) in a sequential framework. The LSTM handles short-term data, while the 1D-CNN processes extended hourly meteorological data, and integrating the CNN-extracted features improves predictions. The approach was tested on the Ishikari River watershed in Japan, where it improved hourly forecasts. Li et al. 28 developed a process-driven DL model by combining a conceptual hydrological model (EXP-HYDRO) with a recurrent neural network (RNN). The PRNN-EA-LSTM model integrates an entity-aware LSTM cell, which acts as a post-processor, with EXP-HYDRO as a process driver inside an RNN cell. This structure unites the accurate predictions of DL methods with the fundamental understanding of process-based models. The model demonstrates enhanced robustness when driven by sub-process variables from EXP-HYDRO across 531 catchments. Li et al. 27 developed a hybrid model that combines LSTM with the transformer architecture and random search (RS) for efficient flood forecasting. The model accurately reproduces flood patterns using RR data from the Jingle watershed of the Yellow River, and in doing so gets around the drawbacks of data-driven models and traditional physical hydrology. Shekar et al. 62 investigated the adaptability of AI for RRM in the Indian Bardha watershed. They applied six AI models, namely ANN, KNN, XGBoost, RF, CNN, and CNN-RNN, to data recorded between 2003 and 2009. The calibration period spanned 2003–2007, and the validation period was 2008–2009. This approach tests model performance based on spatial distributions of rainfall, temperature, and discharge.
Wang et al. 63 introduced a new method of long-term runoff forecasting that uses a transformer and baseflow separation. The transformer network addresses the computational inefficiency and error accumulation that affect LSTM on long sequences. To enhance the credibility of flood prediction, Kim et al. 64 combined a rainfall-runoff model with AI to address the persistent issue of flood-related damages in Vietnam. They employed a genetic algorithm (GA) and pattern search (PS) to calibrate the parameters, and used the Tank model to compute the peak flow rate of the flood discharge. They used evaluation criteria (WSSR and SSR) to assess the peak flood discharge, and used a rating curve and Monte Carlo simulation to transform flood discharge into floodwater levels and establish confidence bounds. These results were then used to improve the flood level predictions of the AI models, in particular DNNs and LSTMs. Yoon et al. 65 presented a semi-supervised learning method, called self-training, to augment data-driven models of RR relationships, which often suffer from a lack of paired observations of climate data and streamflow. They applied self-training in a teacher-student framework, in which a teacher model trained on a small number of paired samples generates pseudo streamflow for unpaired samples. These pseudo streamflow observations, combined with the real observations, provide the training data for the student model. The approach uses an annealing-capable loss function to deal with the uncertainties of the pseudo streamflow. Their results showed large improvements over fully supervised models using an LSTM network and the CAMELS dataset and provided a feasible approach to RRM with augmented data.
To improve RR predictions, Wang et al. 66 developed a novel informer neural network in combination with the empirical wavelet transform (EWT). The technique, which considers only single rainfall events, uses EWT to reduce the non-linearity and non-stationarity of runoff data. To reduce the impact of data variability, the model also splits RR data into three sections using fractal theory. GPM precipitation data and 15 years of USGS runoff data were used to evaluate the method, which proved superior to traditional LSTM-based methods. Li et al. 67 investigated the effectiveness of ML-based RRM under changing environmental conditions. They used an LSTM network, a multi-hidden-layer back propagation neural network (MBP), and category boosting (CatBoost) for daily RRM in the Dongwan catchment of the Yiluo River Basin. These models were tuned repeatedly, in non-real time, for non-stationary hydrological data. Their ability to handle the variation and irregularities of hydrological time series was evaluated using criteria including the deterministic coefficient, peak flow error, and runoff depth error.
In 2025, six AI models for RR analysis, including SVR, MLR, XGBoost, LSTM, CNN, and a hybrid CNN-RNN, were assessed by Shekar et al. 68 during training and validation periods spanning 1998 to 2006. Jiang et al. 69 introduced a model for RRM, which utilizes a convolutional neural network architecture to analyze frequency-domain transformed hydrological time series, capturing local and global data variations. Li et al. 70 developed a model, integrating LSTM with Fourier Kolmogorov Arnold networks and convolutional feature extraction to enhance spatial data handling and improve precipitation-runoff modeling accuracy. Zhang et al. 71 developed a hybrid explainable streamflow forecasting model using CNN-LSTM-attention, focusing on typical river source regions in the eastern Qinghai Tibet Plateau, incorporating base flow to enhance high-flow predictions. Yin et al. 72 introduced a multi–step regional RRM approach using a pyramidal transformer (PT) with hierarchical attention (PTHA). The model integrates dynamic and static hydrological attributes to capture spatiotemporal dependencies more effectively than conventional LSTM models. Using DL techniques, Bao et al. 73 created a runoff forecasting model that improved feature extraction and captured temporal dependencies in hydrological time-series data by combining transformer architectures and LSTM networks with sparse and full attention mechanisms. Zhu et al. 74 used a hybrid DL technique that combines transformer architectures with self-attention mechanisms and CNNs (ResNet). This design allows for precise runoff prediction in large river basins by capturing intricate spatiotemporal patterns in multi-source hydrological data. To capture nonlinear hydrological patterns with enhanced spatiotemporal feature learning, Xu et al. 75 developed a runoff prediction model that combines wavelet feature extraction, a transformer encoder for sequence representation, and a TimeBlock decoding module. 
To improve feature extraction over conventional LSTM and transformer approaches, Yin et al. 76 developed a temporal-periodic transformer (TPT) model that combines temporal and periodic attention mechanisms to capture time-dependent and annually recurring patterns in streamflow. A lightweight attention-based modern CNN (LMCNN) was proposed by Jiang et al. 77 To effectively capture multi-channel temporal and cross-variable runoff features, it combines depthwise convolution (DWConv), pointwise and squeeze-and-excitation (SE)-enhanced convolution (PSConv), and improved SE networks (SENets).
Previous studies on RRM have often emphasized the predictive ability of LSTM models. Nonetheless, many LSTM models are trained globally on the entire dataset, which makes it harder for them to capture regional differences in hydrological dynamics. In certain situations, this lowers forecast accuracy because the model may overlook small but significant variations in rainfall and runoff. We propose TLSTM as a solution to this gap. This transductive approach uses the proximity of incoming data points to adjust weights. By adding local adaptability while retaining the advantages of LSTM, the method enhances the model's ability to capture short-term fluctuations. The study thus contributes by addressing the limitations of traditional LSTM models in representing fine-grained hydrological responses.
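One way to picture proximity-based weighting is to give each training sample a weight that decays with its distance from the query (test) input, and to use those weights in the training loss. The sketch below is an illustration only: the Gaussian kernel, the Euclidean distance, the `bandwidth` parameter, and the function names are assumptions made for this example and are not taken from the TLSTM formulation used in this study.

```python
import numpy as np

def transductive_weights(train_x, query_x, bandwidth=1.0):
    """Normalized similarity weights of training samples relative to one query input.

    Assumption: similarity is a Gaussian kernel on squared Euclidean distance.
    """
    train_x = np.asarray(train_x, dtype=float)
    query_x = np.asarray(query_x, dtype=float)
    # Sum squared differences over every axis except the sample axis.
    sq_dist = np.sum((train_x - query_x) ** 2, axis=tuple(range(1, train_x.ndim)))
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    return weights / weights.sum()

def weighted_mse(y_true, y_pred, weights):
    """Squared error in which samples closer to the query count more."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sum(weights * err ** 2))
```

With such weights, training samples that resemble the query dominate the loss, so the fitted model adapts locally instead of averaging over the whole dataset.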
The lack of effective and methodical hyperparameter optimization for deep learning models in RRM represents another gap in the literature. Previous research employed grid search or trial-and-error methods. These approaches often yield suboptimal results and are computationally expensive. We employ the HMS-based DE algorithm to solve this. It works well for locating balanced parameter configurations and exploring high-dimensional search spaces. In doing so, our study enhances prediction accuracy and stability while lowering computational load. To summarize, this study makes two contributions: (i) it introduces TLSTM to overcome the limitations of global LSTM training and capture localized hydrological patterns; and (ii) it uses the HMS-based DE algorithm for hyperparameter optimization to increase rainfall–runoff prediction accuracy, efficiency, and robustness.
Method
The proposed approach to RRM combines spatial attention-enhanced TLSTM networks with DE-based hyperparameter optimization. Figure 1 shows the workflow of the suggested model. The process starts with the input dataset, which includes catchment characteristics, runoff, and rainfall. The data are preprocessed to improve prediction accuracy and prepare them for modeling. After preprocessing, DE identifies the best hyperparameters, which are used in the subsequent model training. The spatial attention module selects the most important spatial features, allowing the model to focus on the areas that produce the most runoff. The TLSTM network uses the attention-weighted inputs to learn complex temporal relationships within hydrological sequences. Finally, the model generates runoff predictions, which are evaluated for stability and reliability using standard hydrological performance metrics.

Figure 1. Process of the suggested RR model.
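Among the standard hydrological performance metrics, the Nash-Sutcliffe efficiency (NSE) is the one reported in this paper. A minimal sketch of its usual definition, one minus the ratio of the squared prediction error to the variance of the observations around their mean, is given below; the function name `nse` is chosen for this example.

```python
import numpy as np

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit, 0 matches the observed mean."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    residual = np.sum((observed - simulated) ** 2)
    variance = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - residual / variance
```

A model that always predicts the observed mean scores 0, and negative values indicate predictions worse than that baseline.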
In this work, we assess the suggested model using the CAMELS dataset. CAMELS 78 is a standard in RRM benchmarks due to its wide scope and the variety of catchment characteristics it encompasses. For thorough model validation, this dataset offers comprehensive meteorological and runoff data. The method places a strong emphasis on detailed and varied data to ensure comprehensive analyses tailored to the distinct features of each dataset. However, it remains flexible and applicable to various hydrological datasets.
CAMELS contains 11,981 records from 674 basins across the United States. This large dataset provides daily runoff records, catchment characteristics, and meteorological forcings for every basin from October 1, 1980, to December 31, 2014, which facilitates thorough model evaluation. Individual basin RRM uses daily meteorological data from Daymet, while regional models use data from the Maurer dataset. These meteorological data sources were selected in accordance with accepted guidelines for RRM, and both are commonly used in individual and regional models. A total of 32 features can be extracted from each record in the dataset. For individual basin modeling, we replicate the benchmarks by selecting the same 673 basins. Similarly, for regional modeling, we use the same 448 basins used in benchmark studies. To maintain a fair comparison, our methodology employs the same daily meteorological forcings and catchment characteristics as those used in the benchmarks.
For building and evaluating the model, the dataset was partitioned into a 70:30 ratio for training and testing. This ratio aligns with standard practice in hydrology and machine learning. In these fields, allocating 20–30% of data for testing is recommended to balance model learning and evaluation reliability. 79 Empirical studies, including recent streamflow modeling with machine learning, show that 70:30 splits typically provide robust generalization while maintaining sufficient unseen data for unbiased assessment. 79 In this setting, the test set consisted of the last 30% of temporal sequences in each catchment, separated chronologically to avoid data leakage. During training, only the input features of these test sequences without target runoff values were used to adjust the model in the transductive step. The actual runoff values were reserved exclusively for final evaluation. An 80:20 split provides more training examples. However, it can reduce the statistical diversity of the validation set and increase the variance of performance estimates. Using less than 20% test data can also inflate performance metrics due to overfitting. Therefore, considering the dataset size, hydrological variability, and model complexity, the 70:30 split provided the most reliable balance for both training and evaluation.
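The chronological 70:30 partition described above can be sketched as follows for a single catchment. The function name and the rounding of the split index are assumptions made for this example.

```python
import numpy as np

def chronological_split(series, test_fraction=0.3):
    """Split a time-ordered array so that the last `test_fraction` forms the test set."""
    series = np.asarray(series)
    split = int(round(len(series) * (1.0 - test_fraction)))
    # No shuffling: every training step precedes every test step.
    return series[:split], series[split:]
```

Because the data are never shuffled, no future observation can leak into the training portion.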
Data preprocessing
The pre-processing stage focuses on preparing raw hydrological and meteorological data to ensure the model receives consistent, complete, and high-quality inputs for RR prediction. First, the collected rainfall, runoff, and catchment attribute datasets are systematically inspected to detect missing or anomalous records that may result from sensor malfunctions, transmission errors, or reporting delays.
Missing values are filled using nearest-neighbor imputation, which replaces each missing value with the nearest valid observation to preserve time-series continuity. Statistical thresholds based on z-scores and interquartile ranges (IQR) are used to detect unrealistic measurements, including negative rainfall values and extreme runoff spikes that do not match precipitation events. Isolated anomalies are removed, whereas continuous anomalies are replaced with interpolated values. This multi-step cleaning procedure produces a final dataset that is dependable and suitable for precise RRM.
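The imputation and anomaly-detection steps can be sketched as follows. The thresholds (`z_max=3.0`, `iqr_factor=1.5`), the rule that a value is flagged when either test fires, and the function names are assumptions made for this example; the paper does not state these settings.

```python
import numpy as np

def nearest_neighbor_impute(values):
    """Replace each NaN with the nearest valid observation in time."""
    values = np.asarray(values, dtype=float)
    valid = np.flatnonzero(~np.isnan(values))
    if valid.size == 0:
        raise ValueError("series has no valid observations")
    filled = values.copy()
    for i in np.flatnonzero(np.isnan(values)):
        # On a tie, the earlier neighbor is used (argmin returns the first minimum).
        filled[i] = values[valid[np.abs(valid - i).argmin()]]
    return filled

def flag_outliers(values, z_max=3.0, iqr_factor=1.5):
    """Boolean mask of values flagged by the z-score test or the IQR test."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    z = np.abs(values - values.mean()) / std if std > 0 else np.zeros_like(values)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    outside_iqr = (values < q1 - iqr_factor * iqr) | (values > q3 + iqr_factor * iqr)
    return (z > z_max) | outside_iqr
```

Flagged values could then be set to NaN and passed back through the imputation step, or interpolated when they form a continuous run.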
The next step after data cleaning involves min-max scaling to bring all input features into uniform numerical ranges between 0 and 1. The training process becomes more stable because this procedure stops high-magnitude variables like runoff from overpowering smaller-scale variables such as rainfall intensity during the learning process.
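A minimal sketch of min-max scaling is shown below. Fitting the minimum and maximum on the training portion only, and then reusing them for the test portion, is an assumption made for this example; the function names are also hypothetical.

```python
import numpy as np

def fit_min_max(train):
    """Per-feature minimum and maximum, computed on the training data."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def apply_min_max(data, lo, hi):
    """Scale features to [0, 1] with respect to the fitted minimum and maximum."""
    # Constant features get a span of 1 to avoid division by zero.
    span = np.where(hi > lo, hi - lo, 1.0)
    return (np.asarray(data, dtype=float) - lo) / span
```

Under this choice, test values outside the training range fall slightly outside [0, 1], which keeps test statistics out of the scaler.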
A sliding-window method is used to transform the data into time-series sequences. The sliding window method produces overlapping sequences of input and output pairs through a fixed-length window that slides across the temporal data. This enables the model to learn both long-term dependencies (from accumulated hydrological memory) and short-term patterns (from recent rainfall and runoff). As inputs to the model, the pre-processed data are subsequently organized into feature matrices that integrate spatial attributes and temporal sequences.
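The scaling and windowing steps can be illustrated as below. This is a sketch with hypothetical names (`min_max_scale`, `sliding_windows`) and a toy series; the window length and horizon are placeholders, not the values used in the study. In practice the minimum and maximum would normally be taken from the training portion only, so that the test data do not influence the scaling.

```python
def min_max_scale(values):
    """Rescale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def sliding_windows(series, window, horizon):
    """Return overlapping (input_window, target) pairs.
    The target lies `horizon` steps after the end of each input window."""
    pairs = []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        pairs.append((series[start:end], series[end + horizon - 1]))
    return pairs

scaled = min_max_scale([2.0, 4.0, 6.0, 8.0, 10.0, 12.0])
pairs = sliding_windows(scaled, window=3, horizon=1)
```

Increasing `horizon` to 2, 4, or 8 yields targets for the longer lead times considered later in the paper.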
Hyperparameter optimization
An essential component of DL is hyperparameter optimization, which strongly influences the efficacy and efficiency of training algorithms. Hyperparameters, such as learning rate, batch size, and number of epochs, regulate the training process of the model. They also influence the ability of the model to generalize from training data to unseen data. The selection of these parameters can significantly affect the overall performance of the model, its computational load, and its rate of convergence.
Table 1 lists the hyperparameters that were modified during our investigation; the value ranges were selected based on prior DL applications in RR. These ranges provide a framework for understanding how the DE algorithm works, enabling modifications and enhancements tailored to specific project needs. To create the initial population for the DE optimization process, random initial hyperparameter values were generated within the specified ranges (defined in Table 1) using the Random Key encoding.
Hyperparameters optimized in this research.
The Random Key technique is used in this paper to optimize hyperparameters. It is simple, efficient in continuous search spaces, and easy to integrate with evolutionary algorithms. It facilitates a thorough investigation of various model configurations and supports parallelization, which reduces computation cost and time. It also enables global searches, which reduces the likelihood of becoming trapped in local minima. These properties make it highly adaptable for tuning deep-learning architectures with multiple hyperparameters, which ultimately improves overall model performance. 80
Random Key employs a method that translates solutions into a series of T numeric vectors, each having D dimensions, collectively termed
An example of the Random Key approach for the hyperparameter ‘number of layers’ where

The random key method for the hyperparameter “number of layers” with
We enhance the Random Key method using the DE algorithm for its robustness and adaptability in complex, high-dimensional hyperparameter optimization. DE maintains a diverse population, using mutation, crossover, and selection to avoid local optima and ensure global exploration. DE is flexible with both continuous and discrete variables. Its simplicity and ease of integration allow it to combine seamlessly with the Random Key method. This integration ensures thorough and efficient optimization. It also leads to more accurate and reliable final model results.
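To make the encoding concrete, the sketch below shows one possible random-key decoding for a discrete hyperparameter. It rests on assumptions that are not confirmed by the text: each candidate value receives one continuous key in [0, 1), and the candidate with the largest key is selected. The candidate list for the number of layers and the function name `decode_random_key` are hypothetical.

```python
import random

def decode_random_key(keys, candidates):
    """Pick the candidate whose key is largest (assumed decoding rule)."""
    if len(keys) != len(candidates):
        raise ValueError("need one key per candidate value")
    best = max(range(len(keys)), key=lambda i: keys[i])
    return candidates[best]

layer_options = [1, 2, 3, 4]        # hypothetical range for "number of layers"
keys = [0.31, 0.87, 0.12, 0.55]     # one continuous key per option
n_layers = decode_random_key(keys, layer_options)

# A random individual for the initial population is just a vector of keys.
random.seed(0)
random_individual = [random.random() for _ in layer_options]
```

Because the keys are continuous, DE can mutate and recombine them freely while the decoded value always remains a valid discrete choice.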
The three stages of DE are mutation, crossover, and selection. Because it creates new genetic variations, mutation is essential for evolution. It accomplishes this by employing weighted differences between other solutions to modify base solutions. By doing this, diversity is maintained, exploration is encouraged, and convergence to less-than-ideal outcomes is avoided. The mutation phase guides the algorithm toward optimal results by generating a diverse range of excellent solutions. These procedures guarantee a successful search when paired with crossover and selection. They also improve the overall optimization potential of DE.
In DE, the following method is used by the mutation phase to create a new vector:
Here,
In this approach,
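For orientation, the classic DE/rand/1 mutation builds a mutant as v = x_r1 + F (x_r2 − x_r3), where r1, r2, and r3 are distinct random indices different from the target and F is the scale factor. The sketch below implements that textbook operator; it is not necessarily the exact form used in this paper, and the function name, default `scale`, and toy population are assumptions.

```python
import random

def de_rand_1(population, target_index, scale=0.5, rng=random):
    """Classic DE/rand/1 mutation: v = x_r1 + F * (x_r2 - x_r3)."""
    others = [i for i in range(len(population)) if i != target_index]
    if len(others) < 3:
        raise ValueError("population must contain at least four individuals")
    r1, r2, r3 = rng.sample(others, 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    return [a + scale * (b - c) for a, b, c in zip(x1, x2, x3)]

# Toy population of five 2-dimensional individuals.
population = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
mutant = de_rand_1(population, target_index=0)
```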
The newly developed DE variant introduces a novel mutation mechanism, drawing inspiration from recent advancements in the field. 81
Initially, k-means clustering is applied to the current population to identify distinct groups within the search space. This approach segments the population into several clusters, with the number of clusters, k, randomly chosen within the range
The following equation defines a new mutation operator inspired by this clustering strategy:
In this formulation,
Selection: k candidate solutions are randomly selected to serve as the initial centroids for the k-means clustering algorithm.
Generation: Through mutation, M new candidate solutions are created, creating the group
Replacement: M candidates are randomly chosen from the population to assemble a set B.
Update: The new set
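A rough sketch of a clustering-guided mutation is given below. It assumes a minimization problem and one plausible form of the operator: the base vector is the best individual of the cluster with the lowest mean objective value, perturbed by a scaled difference of two random individuals. The operator defined by the paper's own equation may differ. The study used Scikit-learn for k-means; a tiny pure-Python k-means is substituted here so the example is self-contained. All names and default values are hypothetical.

```python
import random

def kmeans(points, k, iters=20, rng=random):
    """Minimal Lloyd's k-means; returns one cluster label per point."""
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = [0] * len(points)
    for _ in range(iters):
        for n, p in enumerate(points):
            labels[n] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_mutation(population, fitness, k, scale=0.5, rng=random):
    """v = win + F * (x_r1 - x_r2); `win` is the best member of the cluster
    with the lowest mean fitness (assumed form, minimization)."""
    labels = kmeans(population, k, rng=rng)
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    best_cluster = min(
        clusters.values(), key=lambda m: sum(fitness[i] for i in m) / len(m)
    )
    win = population[min(best_cluster, key=lambda i: fitness[i])]
    r1, r2 = rng.sample(range(len(population)), 2)
    return [w + scale * (a - b) for w, a, b in zip(win, population[r1], population[r2])]

population = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05], [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
fitness = [sum(x * x for x in ind) for ind in population]  # toy objective
mutant = cluster_mutation(population, fitness, k=2)
```

Repeating the call M times would produce the group of new candidates that the replacement and update steps then merge back into the population.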
Instead of allocating attention evenly throughout the dataset, spatial attention concentrates it on relevant portions of the input data. This focused approach improves the ability of the model to rank important variables and characteristics, thereby protecting it from noise or irrelevant data. Between the input sequence and the TLSTM layers, spatial attention acts as a bridge. By performing a weighted summarization of the inputs, the ability of the model to highlight key characteristics during training is increased. An
The process for calculating the weighted sum of the input features is shown in Equation 5:
The process directs the focus of the scheme to the most important features to be considered. This ensures maximum efficiency and relevance of the data processing.
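The weighted summarization can be illustrated with a softmax over per-feature scores. In the actual model the scores would come from a learned layer; here they are passed in directly, and the function names and toy inputs are hypothetical. The sketch only shows the generic attention pattern, not the specific form of Equation 5.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weighted_sum(features, scores):
    """Return attention weights and the weighted sum of the feature vectors."""
    weights = softmax(scores)
    dim = len(features[0])
    context = [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
    return weights, context

features = [[1.0, 2.0], [3.0, 4.0]]  # two hypothetical feature vectors
weights, context = attention_weighted_sum(features, scores=[0.0, 0.0])
focused_weights, _ = attention_weighted_sum(features, scores=[2.0, 0.0])
```

With equal scores the result is a plain average; a higher score shifts the weight, and therefore the summary, toward the corresponding input.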
The LSTM architecture incorporates a gating mechanism that effectively manages the flow of information through the network. As illustrated in Figure 3, the structure of an LSTM unit controls the retention, updating, and retrieval of information within its memory cells across various time intervals. This model strategically employs three distinct gates to regulate information flow: the input gate (

Structure of an LSTM and TLSTM unit.
The gates within the LSTM architecture utilize the sigmoid function
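For reference, the standard LSTM update is usually written as follows, with forget, input, and output gates $f_t$, $i_t$, $o_t$, sigmoid $\sigma$, and element-wise product $\odot$. This is the textbook formulation; the symbols and the grouping into numbered equations may differ from the notation used in this paper.

```latex
\begin{align}
f_t &= \sigma\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \tanh\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh\left(c_t\right)
\end{align}
```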
Newer versions, such as TLSTM, have been developed since the original LSTM to expand its application and enhance performance across various research domains. 80
The TLSTM model retains a structure similar to a conventional LSTM, as outlined in Figure 2 and Equations 6–10. The main difference lies in how the parameters behave. Standard LSTM keeps parameters fixed and independent of the test instance. In contrast, TLSTM adjusts its parameters based on the feature vector of the test sample. During this process, the true label of the test sample is not available. Instead, the sample affects the weighting of training instances. These weights are assigned according to the similarity between the test sample features and the training data features. Consequently, the state space of the TLSTM model is expressed as follows:
The symbol
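The similarity-based weighting of training instances can be sketched as follows. The Gaussian kernel, the `bandwidth` parameter, and the normalization to unit sum are assumptions chosen for illustration; the text only states that weights depend on the similarity between test and training features. Note that only the features of the test sample are used, never its runoff value.

```python
import math

def similarity_weights(train_features, test_feature, bandwidth=1.0):
    """Gaussian-kernel weights over training samples, normalized to sum to 1."""
    raw = []
    for x in train_features:
        sq_dist = sum((a - b) ** 2 for a, b in zip(x, test_feature))
        raw.append(math.exp(-sq_dist / (2.0 * bandwidth ** 2)))
    total = sum(raw)
    return [r / total for r in raw]

train_features = [[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]]  # hypothetical training inputs
test_feature = [0.9, 1.1]                              # features of one test sample
weights = similarity_weights(train_features, test_feature)
```

Such weights could then scale each training sample's loss term, so that samples resembling the test point have more influence on the fitted parameters.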
Metrics
Key performance measures, including NSE, RMSE, and absolute threshold percentage error (ATPE-2%), are used in this study to evaluate RRM. These metrics were chosen because of their capacity to thoroughly assess model performance from various angles. NSE was selected as a trustworthy indicator of overall model accuracy because it effectively measures how well the predicted values align with the observed data. Because it is sensitive to significant errors and highlights the impact of significant deviations between observed and predicted values, the RMSE is chosen to provide insight into the precision of the model. Finally, the ATPE-2% is used to measure the prediction accuracy of the model within a limited acceptable error range. This provides a targeted performance metric in situations where small errors are crucial. These metrics, which encompass accuracy, precision, and practical applicability, collectively provide a comprehensive assessment of RRM.
The following formula calculates NSE:
Here,
A lower RMSE value indicates greater model accuracy. The following formula defines ATPE-2%, which evaluates the accuracy of peak flow predictions:
Here,
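NSE and RMSE in their standard forms can be computed as in the sketch below. The function names and sample values are hypothetical; ATPE-2% is not included because its exact definition is given by the paper's own equation.

```python
import math

def nse(observed, simulated):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of observations about their mean."""
    mean_obs = sum(observed) / len(observed)
    num = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    den = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - num / den

def rmse(observed, simulated):
    """Root mean squared error."""
    sq = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    return math.sqrt(sq / len(observed))

observed = [1.0, 2.0, 3.0, 4.0]
perfect = nse(observed, observed)
mean_only = nse(observed, [2.5, 2.5, 2.5, 2.5])
```

A perfect simulation gives NSE = 1, while predicting the observed mean at every step gives NSE = 0, which is why NSE is read as skill relative to a mean-flow benchmark.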
We also employ the Kling–Gupta Efficiency (KGE), R2, and mean bias error (MBE) metrics to assess the performance of the suggested model from other angles, particularly those that address bias and correlation. Because it integrates correlation, variability, and bias into a single index, KGE is selected. This integration enables a fair evaluation of the accuracy and dependability of the model. The strength of the linear relationship between observed and predicted values is expressed in terms of R2. This measure indicates how effectively the model captures the patterns of the data. To measure systematic bias, MBE is used. It displays whether the target variable is consistently overestimated or underestimated by the model.
KGE is defined as follows:
Higher KGE values indicate better performance.
Greater predictive accuracy and a better match between the simulated and observed outcomes are indicated by larger R2 values. MBE is computed as follows:
Better performance is indicated by MBE values that are closer to zero, regardless of sign.
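The three additional metrics can be sketched as below, under stated assumptions: KGE follows the widely used form 1 − sqrt((r − 1)² + (α − 1)² + (β − 1)²) with α the ratio of standard deviations and β the ratio of means, R2 is taken as the squared Pearson correlation, and MBE is computed as simulated minus observed. The sign convention and function names are choices made for this example.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def _std(values):
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def pearson_r(observed, simulated):
    mo, ms = _mean(observed), _mean(simulated)
    cov = sum((o - mo) * (s - ms) for o, s in zip(observed, simulated)) / len(observed)
    return cov / (_std(observed) * _std(simulated))

def kge(observed, simulated):
    """Kling-Gupta efficiency from correlation, variability ratio, and bias ratio."""
    r = pearson_r(observed, simulated)
    alpha = _std(simulated) / _std(observed)
    beta = _mean(simulated) / _mean(observed)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

def r_squared(observed, simulated):
    return pearson_r(observed, simulated) ** 2

def mbe(observed, simulated):
    """Mean bias error; positive values indicate overestimation (assumed sign)."""
    return _mean([s - o for o, s in zip(observed, simulated)])

observed = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]  # same shape, constant positive bias
```

The shifted series illustrates why the metrics complement each other: its R2 stays at 1 because the pattern is reproduced exactly, whereas KGE and MBE both expose the constant bias.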
To handle the intricate calculations required for the modeling process, the study was conducted on a 64-bit Windows operating system. The hardware configuration consisted of a 3.60 GHz Intel Core i9-9900K CPU and 32 GB of DDR4 RAM (double data rate fourth-generation random-access memory), which provided the computational capability and memory needed to handle large datasets and run complex algorithms. An NVIDIA GeForce RTX 2080 Ti graphics processing unit (GPU) was used to train the deep learning models and run the spatial attention mechanism efficiently. Python (version 3.8) was chosen as the main programming language because of its extensive data processing and ML libraries. The DL models, TLSTM and LSTM, were implemented with TensorFlow (version 2.5) and Keras; TensorFlow's GPU support made training considerably faster. The spatial attention mechanism was implemented as custom TensorFlow layers. The DEAP library, which is flexible and efficient for evolutionary computation, was used to implement the DE algorithm. Other ML procedures required by the mutation strategy of the DE process, such as the k-means clustering algorithm, were carried out using Scikit-learn (version 0.24). NumPy (version 1.19) and Pandas (version 1.2) were used for data analysis and processing. Hierarchical Data Format version 5 (HDF5) was used to store and retrieve the large hydrological and meteorological datasets in CAMELS efficiently. Purpose-built Python scripts were used to calculate the performance measures. To evaluate the accuracy and reliability of our models, we adopted a 5-fold stratified cross-validation technique.
This methodology was selected for two reasons. First, 5-fold cross-validation offers a good compromise between the reliability of the performance estimate and the cost of evaluation. The dataset is partitioned into five disjoint subsets; each fold is used once for testing while the other four folds are used for training. This reduces the risk of overfitting and gives a better estimate of model performance across several subsets of the data. Second, stratification guarantees that each fold has the same class label distribution as the original dataset. This is especially important for imbalanced data, where sparse classes can exist. By making each fold reflect the class balance of the overall dataset, stratification avoids bias during evaluation. This approach enhances the validity of our results by providing a more accurate estimate of the model's ability to generalize to new data.
We conducted an in-depth comparative analysis of our RR algorithm against twelve leading models: CNN-LSTM, 2 RS-LSTM, 27 PRNN-EA-LSTM, 28 MLR, 50 CNN-RNN, 62 Transformer, 63 Tank, 64 Self-training, 65 CNN-LSTM-attention, 71 PTHA, 72 LSTM-Transformer, 73 and ResNet-Transformer. 74 To determine the distinct influence of each feature on the performance of the model, we also evaluated the results of removing specific elements from our model, namely TLSTM, spatial attention (SA), and hyperparameter optimization (HO). Results for 1-, 2-, 4-, and 8-day-ahead runoff predictions were presented. Our analysis concentrated on individual and regional RRM. The testing phase results are shown in Tables 2 and 3, and the training phase results are shown in Tables 4 and 5.
Comparison of the proposed RR model with other state-of-the-art and ablation models using NSE, RMSE, and ATPE-2% across various prediction horizons in individual modeling during the testing phase.
Bold values indicate the best results in each column.
Comparison of the proposed RR model with other state-of-the-art and ablation models using NSE, RMSE, and ATPE-2% across various prediction horizons in regional modeling during the testing phase.
Bold values indicate the best results in each column.
Comparison of the proposed RR model with other state-of-the-art and ablation models using NSE, RMSE, and ATPE-2% across various prediction horizons in individual modeling during the training phase.
Bold values indicate the best results in each column.
Comparison of the proposed RR model with other state-of-the-art and ablation models using NSE, RMSE, and ATPE-2% across various prediction horizons in regional modeling during the training phase.
Bold values indicate the best results in each column.
In individual modeling during the testing phase, recent transformer-based approaches in hydrology, such as Transformer, LSTM-Transformer, and ResNet-Transformer, perform competitively at certain prediction horizons, particularly for short-term forecasts. At the 1-day-ahead horizon, LSTM-Transformer achieves an RMSE of 1.995 and an NSE of 0.760. This corresponds to a 1.26% reduction in RMSE and a 17.9% increase in NSE compared to the standard Transformer. This indicates better capture of spatiotemporal dependencies. In catchments with relatively stable flow regimes, RMSE gains can be modest. For example, in the CAMELS catchment US_01491000, which has low seasonal variability, LSTM-Transformer improves RMSE over CNN-LSTM by only 2.3% at the 1-day-ahead horizon (0.934 vs. 0.956). However, in the same case, NSE increases by 8.5%, while ATPE-2% decreases by 14.2%. This indicates that significant gains in other metrics can still accompany small RMSE improvements. Performance declines are more pronounced for longer prediction horizons. At the 8-day-ahead horizon, LSTM-Transformer achieves an RMSE of 2.472, compared to 2.477 for Transformer, but ATPE-2% remains relatively high. This reflects a trade-off between accuracy and stability. Non-transformer models such as CNN-LSTM-attention and PTHA perform well in short-term forecasts but tend to degrade more noticeably over longer horizons. These limitations arise from three factors. First, no method optimizes temporal and spatial features together. Second, the models are sensitive to noisy hydrological inputs. Third, optimization strategies are not tailored to the distribution of hydrological data. Similar patterns are observed in the training phase, with relative improvements closely matching those in testing.
In comparison with strong baseline models, the proposed model combines TLSTM, spatial attention, and a k-means-enhanced DE algorithm. This combination has not been applied in hydrology, particularly in comparison to transformer-based baselines, making it a novel and practical approach. Across most prediction horizons, the model shows consistent gains, although some basins record smaller improvements. For example, in CAMELS catchment US_01548500, which has a stable annual hydrograph and low inter-annual variability, the RMSE improvement over CNN-LSTM is modest at 2.4% for the 1-day-ahead horizon (0.945 vs. 0.968). However, in the same catchment, NSE improves by 9.1%, while ATPE-2% decreases by 15.6%. This indicates that even when RMSE gains are limited, other performance aspects, such as error stability and distribution, still improve. This pattern indicates that in basins with low temporal variability, TLSTM and spatial attention have limited scope to reduce RMSE but can still enhance prediction stability. At longer horizons, gains are more notable. At the 8-day-ahead horizon, the proposed model improves over the best transformer-based competitor (LSTM-Transformer) by 23.2% in ATPE-2% and 40.6% in RMSE. At the 4-day-ahead horizon, it exceeds CNN-LSTM-attention by 36.3% in ATPE-2%, 3.57% in RMSE, and 12.7% in NSE. For the 2-day-ahead horizon, the gains over LSTM-Transformer are 17.1% in ATPE-2% and 22.8% in NSE. At the 1-day-ahead horizon, it achieves an 18.7% reduction in ATPE-2% and an 11.3% increase in NSE. These results suggest that TLSTM models complex hydrological sequences, spatial attention improves the weighting of important regions, and the modified DE algorithm speeds convergence toward suitable configurations. Together, these components address key weaknesses of transformer-based hydrological models by reducing data needs for convergence and limiting performance loss at long prediction horizons.
Similar improvement patterns are seen in training, confirming the consistency of results across both phases.
In the ablation analysis during testing, removing TLSTM (Proposed w/o TLSTM) reduces the average NSE across horizons by 36.9% and increases RMSE by 12.0%. This suggests that TLSTM accounts for about 56% of the overall improvement. Omitting SA (Proposed w/o SA) results in a 23.0% drop in NSE and a 6.0% rise in RMSE, contributing roughly 23% of the total gain. Removing HO causes a 20.8% decrease in NSE and a 5.7% increase in RMSE, representing around 21% of the improvement. These results demonstrate that each component makes a measurable contribution and that their combined effect yields a stronger overall performance. The impact of each element varies depending on the hydrological conditions. In the CAMELS catchment US_01646000, a mountainous basin in Virginia with steep slopes and diverse land cover, SA improves NSE by 14.2% compared to the no-SA variant. This result highlights the ability of SA to capture localized spatial influences. In US_06888500, which has highly seasonal precipitation with distinct wet and dry periods, TLSTM reduces RMSE by 21.5% over the no-TLSTM variant. This shows an advantage for learning long-term temporal patterns. In US_04127918, where mixed snowmelt and rainfall drive flows and inter-annual variability is high, HO improves RMSE by 9.8% compared to the no-HO variant. This reflects a role in adjusting parameters to accommodate complex seasonal transitions. In US_10109000, where rainfall is evenly distributed throughout the year and flow patterns are relatively stable, TLSTM, SA, and HO together yield smaller RMSE gains of about 4–5%. Even so, ATPE-2% and NSE stability still improve. The patterns observed during training are similar to those in testing, which supports the conclusion that the contribution of each component is consistent across both phases.
During regional testing, and excluding the proposed approach, the ranking of strong baselines differs from the ranking in individual modeling. Transformer-based architectures, such as Transformer, LSTM-Transformer, and ResNet-Transformer, perform well at some horizons but often exhibit instability across different timescales. For instance, at the 8-day-ahead horizon, LSTM-Transformer reaches an NSE of 0.706, which is 14.0% higher than CNN-LSTM (0.647). However, at the 2-day-ahead horizon, its NSE falls to 0.538, a 24.6% drop compared to CNN-LSTM (0.777). At a 4-day-ahead forecast, ResNet-Transformer has an RMSE of 1.857, which is 5.6% lower than the base Transformer (1.966). At a 1-day-ahead forecast, its NSE is 0.692, which is 14.3% lower than CNN-RNN (0.791). Among non-transformer baselines, CNN-RNN and PRNN-EA-LSTM show solid mid-term performance. PRNN-EA-LSTM improves NSE by 8.9% over CNN-LSTM at the 4-day-ahead horizon. However, both lack a mechanism that jointly captures regional spatial dependencies and temporal dynamics. This limits generalization over long ranges in heterogeneous catchments. Performance differences between models vary notably with hydrological conditions. In CAMELS catchment US_12144000 (large basin, 4850 km2, humid subtropical climate, moderate precipitation noise), LSTM-Transformer achieves an NSE of 0.812 at 4-day-ahead, which is 9.4% higher than CNN-LSTM. At a 1-day-ahead forecast, the advantage narrows to 2.1% because flows are largely stable at low levels. In US_06746095 (a medium-sized basin in a semi-arid region with high interannual variability and noisy precipitation inputs), the same model exhibits a decline in NSE from 0.774 at 1-day-ahead to 0.542 at 8-day-ahead, indicating sensitivity to input noise. In US_13337000 (snowmelt-driven basin in a mountainous region with seasonal high-flow events), ResNet-Transformer reduces RMSE by 12.6% relative to Transformer at 8-day-ahead during peak flow. 
During extended periods of low flow, the gain is minimal. These cases suggest that basin size, climatic variability, and precipitation noise significantly impact model performance patterns, and that advantages are not uniform across different hydrological scenarios. Similar patterns appear in the regional training phase, with relative improvements closely matching those in testing.
When compared to the strongest baselines, integrating TLSTM, spatial attention, and the k-means-enhanced DE algorithm in the proposed model leads to consistent gains in regional modeling. In the testing phase, at the 8-day-ahead horizon, the method improves over the top transformer-based competitor (Transformer) with a 50.7% reduction in ATPE-2% (0.294 vs. 0.596), a 15.1% decrease in RMSE (2.053 vs. 2.419), and a 10.6% increase in NSE (0.878 vs. 0.794). At the 4-day-ahead horizon, it outperforms the best non-transformer baseline (MLR) by 55.6% in ATPE-2%, 22.8% in RMSE, and 2.3% in NSE. For the short-term 1-day-ahead horizon, it achieves a 78.4% reduction in ATPE-2%, a 55.6% reduction in RMSE, and an 8.8% increase in NSE compared to CNN-LSTM. The improvements are notable in many cases but smaller in some regional basins with highly irregular precipitation–runoff patterns. In the CAMELS dataset, short-horizon NSE gains are modest for two types of basins: (1) those in the lowest-variability quartile, which have a high baseflow index and low coefficient of variation of daily flow, and (2) those with an aridity index greater than 1.0. In these cases, the main advantages are evident in ATPE-2% and RMSE, reflecting improved error stability. Snowmelt-dominated basins, which have a high proportion of precipitation as snow and strong seasonality, and semi-arid intermittent basins, which have high daymet precipitation intermittency, gain more in ATPE-2% and RMSE than in NSE. This pattern suggests that the main benefit is error stabilization rather than an increase in peak NSE values. These results indicate that TLSTM is important for modeling inter-catchment temporal dependencies, while spatial attention improves region-specific feature weighting. The k-means-based DE helps adapt hyperparameters to varied hydrological regimes, which is particularly valuable in regional contexts where variability is greater than in individual catchment modeling. 
Similar relative gains were seen in the regional training phase, confirming that the performance patterns are consistent across both training and testing.
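The relative improvements quoted for the regional 8-day-ahead comparison with the Transformer follow from a simple relative-change calculation, illustrated here with the values reported above (ATPE-2% 0.294 vs. 0.596, RMSE 2.053 vs. 2.419, NSE 0.878 vs. 0.794). The helper name `percent_change` is hypothetical.

```python
def percent_change(baseline, proposed):
    """Relative change of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (proposed - baseline) / abs(baseline)

atpe_change = percent_change(0.596, 0.294)   # negative: error reduced
rmse_change = percent_change(2.419, 2.053)   # negative: error reduced
nse_change = percent_change(0.794, 0.878)    # positive: efficiency increased
```

Rounded to one decimal place, these give reductions of 50.7% and 15.1% and an increase of 10.6%, matching the reported figures.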
Ablation studies in the regional setting help quantify the contribution of each component. During the testing phase, removing TLSTM reduces the average NSE across horizons by 17.0% and increases the RMSE by 8.3%, suggesting that TLSTM contributes approximately 39% of the total improvement. Excluding spatial attention results in a 13.2% decrease in NSE and a 5.4% increase in RMSE, accounting for approximately 26% of the gains. Omitting hyperparameter optimization results in a 9.7% decrease in NSE and a 3.9% increase in RMSE, representing approximately 21% of the improvement. The remaining gains come from the interaction between components. Basin-level analysis reveals that TLSTM yields greater benefits in catchments characterized by strong seasonal flow cycles. For example, in US_09352900 (Colorado), a snowmelt- and seasonality-driven basin, pronounced wet–dry cycles produce a 19.4% increase in NSE and a 10.7% reduction in RMSE compared to the no-TLSTM variant. Spatial attention yields higher gains in topographically complex basins, such as US_01646000 (located in Virginia's Blue Ridge region). In this location, steep elevation gradients and heterogeneous land cover create localized flow responses. In this case, SA improves NSE by 14.1% over the no-SA variant. Hyperparameter optimization is most beneficial in basins with high variability in precipitation inputs. For instance, in US_06888500 (semi-arid Kansas), rainfall events are intermittent and highly skewed. In this basin, HO reduces the RMSE by 8.5% compared to fixed hyperparameters, demonstrating its ability to adapt to noisy forcing data.
To strengthen the statistical evaluation, we complemented paired t-tests with effect sizes (Cohen's d) and 95% confidence intervals (CIs) for all comparisons in Tables 2 and 3. These additional metrics provide a clearer view of the magnitude and reliability of observed differences, beyond what p-values alone can convey. Effect sizes were interpreted using conventional thresholds (small: 0.2, medium: 0.5, large: 0.8). In individual modeling, the proposed RR model demonstrated large effect sizes (d > 0.8) in all horizons for NSE, RMSE, and ATPE-2% when compared with top-performing transformer-based baselines. For example, against the LSTM-Transformer, Cohen's d for NSE improvements ranged from 0.84 (1-day-ahead) to 1.21 (8-day-ahead). The 95% CIs excluded zero, indicating statistically and practically significant gains. Similarly, relative to the Transformer, the model showed medium-to-large effects (d = 0.63–1.05) in NSE. For RMSE and ATPE-2%, the effects were consistent and medium in size (d = 0.45–0.70 and d = 0.48–0.73), demonstrating stable advantages in accuracy and error reduction. Comparisons with the ResNet-Transformer revealed large effects in short-horizon predictions (d ≈ 0.95 at 1-day-ahead in NSE) and moderate effects at longer horizons (d ≈ 0.54 at 8-day-ahead in NSE). In regional modeling, the advantage of the proposed model over transformer-based baselines was more pronounced. Against the Transformer, effect sizes for NSE exceeded 1.0 across all horizons, with the largest difference at the 8-day-ahead horizon (d = 1.28, CI: [0.96, 1.60]). Compared with the LSTM-Transformer, the gains were especially strong in long-term forecasts (8-day-ahead, d = 1.35, CI: [1.05, 1.66] in NSE) and remained moderate to large in shorter horizons. The ResNet-Transformer comparison similarly favored the proposed model, particularly at the 4-day-ahead (d = 1.12, CI: [0.81, 1.42]) in NSE. 
These results suggest that the combined use of TLSTM, spatial attention, and k-means–enhanced DE provides systematic benefits across spatial and temporal contexts. In the testing phase, across both modeling settings, the calculated p-values for NSE, RMSE, and ATPE-2% were consistently below 0.01 in all comparisons between the proposed model and baselines, indicating strong statistical significance. The 95% CIs for effect sizes did not cross zero in any case, with lower bounds ranging from 0.41 to 0.95 and upper bounds extending to 1.66. When the training-phase results for individual and regional modeling are included as well, the paired t-test p-values remained consistently below 0.05, which again indicates statistical significance. In this broader set, the 95% CIs for effect sizes also excluded zero, with lower bounds ranging from 0.48 to 0.96 and upper bounds extending to 1.69. This pattern confirms that the observed improvements are not only statistically reliable but also practically meaningful, with stable magnitude across horizons and datasets.
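As an illustration of how a paired effect size relates to the paired t-test, the sketch below computes Cohen's d as the mean of the paired differences divided by their sample standard deviation. This is one common convention for paired designs, and the study may have used another; confidence intervals are not computed. The per-basin NSE values are hypothetical and serve only to exercise the functions.

```python
import math
import statistics

def paired_cohens_d(a, b):
    """Cohen's d for paired samples: mean difference / SD of the differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def paired_t_statistic(a, b):
    """Paired t statistic, which equals d * sqrt(n) under this convention."""
    return paired_cohens_d(a, b) * math.sqrt(len(a))

# Hypothetical per-basin NSE values for two models (not results from the paper).
proposed_nse = [0.80, 0.82, 0.78, 0.85, 0.81]
baseline_nse = [0.70, 0.75, 0.72, 0.74, 0.73]
d = paired_cohens_d(proposed_nse, baseline_nse)
t = paired_t_statistic(proposed_nse, baseline_nse)
```

Reporting d alongside the p-value separates the size of an improvement from the sample size, since t grows with the square root of n even when d stays fixed.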
Tables 6 and 7 compare the proposed RR model with several state-of-the-art baselines and its ablation variants. The comparison uses KGE, R2, and MBE across 1-, 2-, 4-, and 8-day-ahead horizons in both individual and regional modeling during the testing phase. Tables 8 and 9 show the results for individual and regional modeling during the training phase. In the individual testing phase, the proposed model generally achieves high R2 values (0.922–0.970), indicating strong explanatory power. Its KGE values (0.106–0.314) are higher than those of most competing methods, suggesting a better balance between correlation, variability, and bias. MBE results show lower systematic bias, particularly at short horizons. For instance, at the 1-day-ahead horizon, the model records an MBE of 0.604, compared to 0.793 for the Transformer and 0.746 for CNN-LSTM, corresponding to reductions of 23.8% and 19.0%. The proposed approach shows larger KGE improvements than Transformer-based hydrological models such as LSTM-Transformer and ResNet-Transformer, particularly at mid- and long-term horizons. At the 8-day-ahead horizon, KGE increases by 110.6% over LSTM-Transformer (0.106 vs. −0.004) and by 106.4% over ResNet-Transformer (0.106 vs. −0.010). These improvements result from three combined factors: TLSTM for temporal modeling, SA for spatial feature weighting, and k-means–enhanced DE for adaptive hyperparameter tuning.
Comparison of the proposed RR model with other state-of-the-art and ablation models using KGE, R2, and MBE across various prediction horizons in individual modeling during the testing phase.
Bold values indicate the best results in each column.
Comparison of the proposed RR model with other state-of-the-art and ablation models using KGE, R2, and MBE across various prediction horizons in regional modeling during the testing phase.
Bold values indicate the best results in each column.
Comparison of the proposed RR model with other state-of-the-art and ablation models using KGE, R2, and MBE across various prediction horizons in individual modeling during the training phase.
Bold values indicate the best results in each column.
Comparison of the proposed RR model with other state-of-the-art and ablation models using KGE, R2, and MBE across various prediction horizons in regional modeling during the training phase.
Bold values indicate the best results in each column.
Ablation results clarify the role of each component. In the testing phase, removing TLSTM lowers KGE by up to 66.2% (e.g., at the 2-day-ahead horizon). Removing SA reduces it by up to 67.2%, and removing HO results in a drop of 65.5% at certain horizons. These proportions align with earlier NSE and RMSE analyses, which estimate the contributions of TLSTM at approximately 56%, SA at 23%, and HO at 21%. Basin-level analysis shows that model benefits vary by hydrological setting. In basins with snowmelt dominance and strong seasonality, such as US_09352900 in Colorado, TLSTM captures recurring wet–dry flow cycles more effectively. This results in a 19.4% increase in KGE and an R2 of 0.960 compared to the variant without TLSTM. In topographically complex basins such as US_01646000 in the Blue Ridge region of Virginia, SA identifies localized flow responses caused by steep elevation gradients and diverse land cover more effectively. In this case, R2 improves by 14.1% over the no-SA variant. In semi-arid basins with intermittent and skewed rainfall, such as US_06888500 in Kansas, the HO model adjusts its parameters to handle noisy and irregular precipitation inputs. This reduces MBE from 0.82 to 0.75 and improves error balance. These examples show how seasonality, terrain complexity, and rainfall intermittency influence the impact of each component.
In the regional testing phase, the proposed model shows stable performance across all prediction horizons. At mid- and long-term horizons, R2 remains above 0.96, indicating strong explanatory power in aggregated basin settings. KGE values range from 0.285 at the 8-day-ahead horizon to 0.465 at the 1-day-ahead horizon. These results are generally higher than those of most baseline models, suggesting a balanced representation of correlation, variability, and bias. Bias analysis shows lower MBE for the proposed model, particularly at shorter horizons. For example, at the 1-day-ahead horizon, MBE is 0.617, compared with 0.672 for the Transformer and 0.621 for the CNN-LSTM. This represents reductions in absolute bias of 8.2% and 0.6%, respectively. The proposed approach yields higher KGE values than Transformer-based hydrological models, such as LSTM-Transformer and ResNet-Transformer, across all horizons. The largest improvement appears at the 8-day-ahead horizon. KGE is 0.285, compared with 0.002 for LSTM-Transformer and 0.008 for ResNet-Transformer. These findings point to three drivers. TLSTM models temporal dependencies, SA weights spatial features adaptively, and k-means–enhanced DE tunes hyperparameters. Together, these components capture spatial heterogeneity and temporal dynamics in regional modeling.
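The statement that KGE reflects correlation, variability, and bias together can be made concrete with a short calculation. The following is a minimal sketch, assuming the commonly used 2009 form of the Kling-Gupta efficiency and a mean bias error defined as the mean of simulated minus observed values; the exact variants used for the tables are not restated in this section, and the function names are illustrative only.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def _std(values):
    # Population standard deviation (divides by n).
    m = _mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson_r(obs, sim):
    # Linear correlation between observed and simulated runoff.
    mo, ms = _mean(obs), _mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    return cov / (_std(obs) * _std(sim))


def kge(obs, sim):
    # Assumed form: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2),
    # with alpha the ratio of standard deviations (variability)
    # and beta the ratio of means (bias).
    r = pearson_r(obs, sim)
    alpha = _std(sim) / _std(obs)
    beta = _mean(sim) / _mean(obs)
    return 1.0 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)


def mbe(obs, sim):
    # Assumed sign convention: positive values mean overestimation.
    return _mean([s - o for o, s in zip(obs, sim)])
```

Under this form, a simulation that doubles every observed value keeps a perfect correlation but is penalized on both the variability and bias terms, which is why a model can show a high R2 together with a modest KGE.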
Ablation results highlight the contribution of each component. At the 8-day-ahead horizon, removing TLSTM lowers KGE from 0.285 to 0.039, showing its role in capturing long-term temporal dependencies. Omitting SA reduces KGE to 0.045, indicating its value in improving spatial feature representation. Without HO, KGE drops to −0.015, reflecting its effect on controlling bias and improving stability. These patterns match findings from individual modeling. In regional settings, however, the gains are more evenly distributed across horizons. This suggests that jointly using spatial and temporal information benefits performance under varied hydrological conditions. Basin-level examples illustrate these effects. In US_09352900 (Colorado), which is a snowmelt-driven basin with strong seasonality, TLSTM increases KGE by 21.7% at the 8-day-ahead horizon compared to the no-TLSTM variant. This improvement captures the melt–runoff cycle more effectively. In US_01646000 (Virginia, Blue Ridge region), which has steep slopes and diverse land cover, SA raises R2 by 16.2% at the 4-day-ahead horizon. This is achieved by weighting localized spatial features more effectively. In US_06888500 (Kansas), which is a semi-arid basin with irregular rainfall, HO lowers MBE from 0.82 to 0.74 at the 1-day-ahead horizon. This improves bias control under noisy precipitation inputs. Overall, the combination of temporal sequence learning, spatial attention, and adaptive hyperparameter optimization supports robust performance across a range of basin types and forecast horizons.
To complement the statistical testing in Tables 6 and 7, we computed effect sizes (Cohen's d) and 95% CIs for all comparisons between the proposed RR model and baselines across all horizons. In individual modeling, the RR model showed large effect sizes (d > 0.8) for R2 and KGE in most horizons. The 95% CIs excluded zero in all cases when compared with the Transformer. For example, in R2, the effect sizes ranged from 0.86 (1-day-ahead) to 1.28 (8-day-ahead), while for KGE they ranged from 0.84 to 1.20, indicating practically meaningful improvements in predictive skill. For MBE, the model showed medium-to-large effects (d = 0.56–0.92) across horizons, reflecting consistent bias reduction. In regional modeling, the advantages were even more pronounced. Against the Transformer, R2 improvements yielded large effect sizes in all horizons (d = 0.93–1.35), with the highest value at the 4-day-ahead horizon. KGE also displayed consistently large effects (d = 0.87–1.18), while MBE effects were moderate to large (d = 0.61–0.95). Across both modeling settings, the paired t-tests yielded p-values consistently below 0.01 for all three metrics, confirming statistical significance. The 95% CIs for all effect sizes did not cross zero, with lower bounds ranging from 0.41 to 0.96 and upper bounds extending to 1.66. Across all experimental results, including both individual and regional modeling during training, the p-values from paired t-tests were consistently below 0.04. This confirms statistical significance. The 95% CIs for effect sizes did not cross zero in any case, with lower bounds ranging from 0.51 to 0.97 and upper bounds extending to 1.73. These results demonstrate that the RR model yields statistically reliable and practically substantial improvements over the baselines. The advantages are most evident across various metrics, horizons, and spatial contexts, particularly when compared to transformer-based models.
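The effect-size analysis can be illustrated with a small example. This is a sketch only: it assumes a paired-samples form of Cohen's d (mean difference divided by the standard deviation of the differences) and a percentile bootstrap for the 95% CI, neither of which is confirmed as the exact procedure behind the reported values. The function names and sample scores are hypothetical.

```python
import random
import statistics


def cohens_d_paired(a, b):
    # Paired-samples effect size: mean difference over SD of differences.
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def bootstrap_ci(a, b, n_boot=2000, level=0.95, seed=0):
    # Percentile bootstrap over paired samples (resampling pairs jointly).
    rng = random.Random(seed)
    n = len(a)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs = [a[i] - b[i] for i in idx]
        sd = statistics.stdev(diffs)
        if sd > 0:  # skip degenerate resamples with identical differences
            estimates.append(statistics.mean(diffs) / sd)
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return lower, upper


# Hypothetical per-basin scores for two models (illustrative values only).
proposed = [0.81, 0.79, 0.85, 0.77, 0.88, 0.74, 0.83, 0.80]
baseline = [0.75, 0.76, 0.78, 0.74, 0.80, 0.73, 0.77, 0.79]
d = cohens_d_paired(proposed, baseline)
ci_low, ci_high = bootstrap_ci(proposed, baseline)
print(f"d = {d:.2f}, 95% CI = [{ci_low:.2f}, {ci_high:.2f}]")
```

A CI whose lower bound stays above zero corresponds to the "did not cross zero" criterion used in the text.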
Figure 4 shows the Taylor-like diagram comparing the performance of different rainfall–runoff models at the 8-day-ahead horizon in individual modeling. The diagram integrates key metrics. The radial axis shows error magnitude (RMSE), and the angular dimension reflects model fit quality. This structure enables a compact visualization that allows for the assessment of accuracy, variability, and agreement with observations simultaneously. It helps distinguish models that achieve both lower prediction errors and stronger alignment with observed runoff patterns. Each point represents the trade-off of a model between RMSE and NSE. Most baseline methods cluster in a similar region, with moderate correlation and higher RMSE values. By contrast, the Proposed model appears separated. It lies closer to the ideal region of high correlation and low RMSE. This indicates that the Proposed model captures temporal dynamics and spatial variability more effectively than CNN-LSTM, RS-LSTM, Transformer, and their variants. Hybrid Transformer-based approaches, such as LSTM-Transformer and ResNet-Transformer, also fall behind. They show higher errors and weaker explanatory power. Overall, the diagram highlights robustness, stability, and superior predictive accuracy for the Proposed model across complex hydrological conditions.

Taylor-like diagram of rainfall–runoff models at the 8-day-ahead horizon in individual modeling.
Figure 5 shows the beeswarm plots of NSE, RMSE, and ATPE-2% across different rainfall–runoff models for the 8-day-ahead prediction horizon in individual modeling. The proposed full model outperforms all baselines. Its NSE values cluster near 0.88, which indicates superior predictive reliability. In contrast, Transformer-based models, such as Transformer, LSTM-Transformer, and ResNet-Transformer, achieve moderate performance but still fall short by approximately 8–12% in NSE. In terms of RMSE, the proposed model yields the lowest error values, at approximately 2.05. Most alternatives range between 2.3 and 2.7, which reflects an improvement of about 10–20% in accuracy. For ATPE-2%, the proposed model achieves the lowest distribution. This shows better error tolerance and stability, confirming robustness and practical suitability for long-horizon forecasts.
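For readers who wish to reproduce the three quantities plotted in the beeswarm and box plots, the sketch below computes NSE, RMSE, and a top-2% peak-flow error. The NSE and RMSE forms are standard. The ATPE-2% form used here (sum of absolute errors over the highest 2% of observed flows, divided by the sum of those observed flows) is an assumption based on common usage and may differ from the definition applied in the tables.

```python
import math


def nse(obs, sim):
    # Nash-Sutcliffe efficiency: 1 minus error variance over observed variance.
    mean_obs = sum(obs) / len(obs)
    num = sum((o - s) ** 2 for o, s in zip(obs, sim))
    den = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - num / den


def rmse(obs, sim):
    return math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / len(obs))


def atpe(obs, sim, top_fraction=0.02):
    # Assumed definition: relative absolute error on the largest observed flows.
    k = max(1, int(round(top_fraction * len(obs))))
    top = sorted(range(len(obs)), key=lambda i: obs[i], reverse=True)[:k]
    return sum(abs(sim[i] - obs[i]) for i in top) / sum(obs[i] for i in top)
```

An NSE of 1 indicates a perfect fit, and an NSE of 0 indicates a prediction no better than the observed mean, so higher NSE and lower RMSE and ATPE-2% are preferred.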

Beeswarm plots of NSE, RMSE, and ATPE-2% for rainfall–runoff models at the 8-day-ahead prediction horizon in individual modeling.
The box plots in Figure 6 provide a comparative view of the performance of different rainfall–runoff models for the 8-day-ahead prediction horizon in individual modeling, as evaluated by the NSE, RMSE, and ATPE-2% metrics. For ATPE-2%, the proposed full model achieves the lowest error (≈0.29). This is a substantial reduction compared to CNN-LSTM (≈0.52) and Transformer (≈0.61), and it demonstrates the higher precision of the model in capturing runoff dynamics. Regarding RMSE, the proposed model outperforms all alternatives with the lowest values (≈2.05). CNN-LSTM, Transformer, and hybrid models such as ResNet-Transformer report values above 2.20. Lower RMSE highlights the superior accuracy and reduced residual errors of the proposed framework. For NSE, the proposed model reaches the highest efficiency (≈0.87). It significantly surpasses CNN-LSTM (≈0.57) and Transformer (≈0.63). This improvement indicates better agreement between observed and predicted runoff. Taken together, these results support the superior robustness and reliability of the proposed model compared with both conventional and Transformer-based approaches.

Box plots of NSE, RMSE, and ATPE-2% for rainfall–runoff models at the 8-day-ahead prediction horizon in individual modeling.
The comparative analysis in Table 10 highlights clear advantages of the proposed model in computational efficiency, making it suitable for real-time rainfall–runoff applications. In terms of runtime, the proposed model completes simulations in 4385 s, which is 16.7% faster than the Transformer (5263 s) and 15.8% faster than the LSTM-Transformer (5206 s). Compared with the ResNet-Transformer (6526 s), the improvement is even more striking at 32.8%, underscoring its efficiency over transformer-based frameworks. GPU usage is moderate at 19.0 GB, which is lower than that of the Transformer (23.9 GB, a 20.5% saving) and of other deep learning baselines such as CNN-LSTM-attention (20.6 GB) and PTHA (21.7 GB). This balance of reduced runtime and controlled memory footprint demonstrates that the proposed model can process large-scale hydrological data streams with less delay, strengthening its potential for operational real-time deployment in flood forecasting and water resource management.
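The percentages quoted for Table 10 are relative reductions with respect to each baseline. As a quick check, the following sketch recomputes them from the runtimes and GPU figures stated in the text; the helper name is illustrative.

```python
def percent_reduction(baseline, proposed):
    # Relative saving of the proposed value with respect to the baseline.
    return 100.0 * (baseline - proposed) / baseline


proposed_runtime_s = 4385
baseline_runtime_s = {
    "Transformer": 5263,
    "LSTM-Transformer": 5206,
    "ResNet-Transformer": 6526,
}

for name, runtime in baseline_runtime_s.items():
    saving = percent_reduction(runtime, proposed_runtime_s)
    print(f"{name}: {saving:.1f}% shorter runtime")

# GPU memory: 19.0 GB for the proposed model against 23.9 GB for the Transformer.
print(f"GPU memory saving: {percent_reduction(23.9, 19.0):.1f}%")
```

The recomputed values are 16.7%, 15.8%, and 32.8% for runtime and 20.5% for GPU memory, matching the figures reported above.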
Computational efficiency comparison of various RR models based on runtime and GPU usage.
Bold values indicate the best results in each column.
Figure 7 illustrates the training and validation loss trajectories across 250 epochs, showing a general downward trend that indicates effective model learning. In the early epochs, both training and validation losses drop steeply, reflecting rapid adaptation as the model captures key patterns. This rapid reduction is typical, as models initially learn the most prominent features of the data. As training advances, the decline in loss slows, showing incremental improvements as parameters are fine-tuned. A notable observation is the consistent gap between training and validation losses that appears after approximately 50 epochs. The training loss is slightly lower, which is expected because the model has direct access to the training data and can fit it more closely. However, the small difference between the curves demonstrates good generalization with limited overfitting. Around epoch 150, both curves stabilize, with only minor fluctuations. This stability shows that the model continues to learn at a steady rate. Small oscillations in validation loss in later epochs likely reflect sensitivity to features or noise in the validation data. These variations are normal and do not signal poor performance. After epoch 200, convergence of the curves highlights the balance of the model between learning from training data and generalizing to unseen samples.

Training and validation loss trajectories over 250 epochs.
To further demonstrate the effectiveness of our proposed model, we present a case study on individual and regional RRM, shown in Figure 8. These figures illustrate model performance over one year, from January 1, 2013, to January 1, 2014. They provide a comprehensive view of predictive capability. The graphs display runoff predictions alongside observed data. This highlights accuracy in tracking real-world phenomena across different timescales and scenarios. The first graph illustrates the ability of the suggested model to manage fluctuating runoff levels throughout the year, closely resembling observed runoff patterns with minimal deviation. Even during periods of peak runoff, there is a close alignment between the observed and predicted values. This suggests accuracy and resilience. Given that precise forecasts directly affect flood prevention and water resource management, this correlation shows practical utility. The second graph shows performance over a larger area and represents a regional approach. Regional modeling is intrinsically difficult. With only minor deviations, the suggested model nevertheless retains its high accuracy and closely resembles observed runoff patterns. This indicates that the model is accurately calibrated and generalizes to various climatic and geographic conditions without sacrificing predictive accuracy. These illustrations confirm the efficacy and highlight the possibility of implementation in various operational and environmental contexts. Reliability and versatility are demonstrated by consistent performance at both the individual and regional levels. The model is therefore a handy tool for hydrological planning and forecasting.

Comparative analysis of observed and predicted runoff using the proposed model over the course of one year, from January 1, 2013, to January 1, 2014.
The distribution of decision-making times for the suggested model in real-time bidding (RTB) settings is shown in Figure 9. The histogram illustrates the frequency of decisions within different time intervals, measured in milliseconds. The distribution centers on a peak, demonstrating that most decisions fall within a narrow time frame. This reflects efficient operation, with the mode at about 80 milliseconds. The sharpness of the peak indicates low variability, highlighting stability and reliability in fast bid processing, which is crucial in RTB systems where milliseconds can significantly impact outcomes. A gradual tail towards longer times reveals occasional cases where more complex inputs extend computation. These instances are rare and do not reduce overall efficiency. The right skew of the distribution indicates that while most decisions are quick, the model can also handle computationally demanding cases when required. This adaptability ensures consistent performance under varied data conditions, supporting reliable decision-making in RTB environments.

Decision time distributions for the designed model in real-time bidding situations.
To provide a robust assessment of the effectiveness of the proposed DE algorithm in hyperparameter tuning, we compare it against several well-established metaheuristic optimization techniques. These include the human mental search (HMS), salp swarm algorithm (SSA), cuckoo optimization algorithm (COA), firefly algorithm (FA), bat algorithm (BA), artificial bee colony (ABC), and original DE. Each metaheuristic algorithm used a population size of 80. The maximum number of iterations was limited to 512. The detailed values of the other parameters specific to each metaheuristic method are summarized in Table 11. The optimal values of each hyperparameter were selected by reviewing existing literature and validating them through testing and evaluation.
Optimized hyperparameters for metaheuristic algorithms.
Tables 12 and 13 present the results of NSE, RMSE, and ATPE-2% for RR predictions. These results cover individual and regional modeling at 1-day, 2-day, 4-day, and 8-day ahead horizons during the testing phase. Tables 14 and 15 present the results of these metrics for individual and regional modeling during the training phase. Tables 16 and 17 report testing results using MBE, R2, and KGE for individual and regional modeling. Tables 18 and 19 present training results for the MBE, R2, and KGE metrics.
Comparison of the proposed DE algorithm with other metaheuristic algorithms across various prediction horizons using NSE, RMSE, and ATPE-2% in individual modeling during the testing phase.
Bold values indicate the best results in each column.
Comparison of the proposed DE algorithm with other metaheuristic algorithms across various prediction horizons using NSE, RMSE, and ATPE-2% in regional modeling during the testing phase.
Bold values indicate the best results in each column.
Comparison of the proposed DE algorithm with other metaheuristic algorithms across various prediction horizons using NSE, RMSE, and ATPE-2% in individual modeling during the training phase.
Bold values indicate the best results in each column.
Comparison of the proposed DE algorithm with other metaheuristic algorithms across various prediction horizons using NSE, RMSE, and ATPE-2% in regional modeling during the training phase.
Bold values indicate the best results in each column.
Comparison of the proposed DE algorithm with other metaheuristic algorithms across various prediction horizons using KGE, R2, and MBE in individual modeling during the testing phase.
Bold values indicate the best results in each column.
Comparison of the proposed DE algorithm with other metaheuristic algorithms across various prediction horizons using KGE, R2, and MBE in regional modeling during the testing phase.
Bold values indicate the best results in each column.
Comparison of the proposed DE algorithm with other metaheuristic algorithms across various prediction horizons using KGE, R2, and MBE in individual modeling during the training phase.
Bold values indicate the best results in each column.
Comparison of the proposed DE algorithm with other metaheuristic algorithms across various prediction horizons using KGE, R2, and MBE in regional modeling during the training phase.
Bold values indicate the best results in each column.
Across individual and regional modeling, the proposed DE generally outperforms competing algorithms on most evaluation metrics, with improvements that range from small to substantial. In 1-day-ahead regional predictions, it reduces the RMSE by 45.2% compared to the original DE (0.816 vs. 1.489). It also increases NSE by 38.9% (0.728 vs. 0.524). Similar patterns are observed at longer horizons, where RMSE reductions range from 25% to nearly 50% and NSE gains from 20% to 40%. The proposed DE also achieves substantial gains in MBE, R2, and KGE. In 1-day-ahead regional predictions, MBE drops from 0.958 with standard DE to 0.604. R2 improves from 0.229 to 0.970, and KGE rises from 0.067 to 0.314. Comparable improvements are also observed at other horizons and in individual modeling. MBE reductions typically exceed 50%, R2 stays above 0.95, and KGE increases twofold to fivefold relative to baseline algorithms. These consistent gains across RMSE, NSE, MBE, R2, and KGE show that the proposed DE offers more than incremental benefits. Instead, it delivers a more balanced and reliable optimization strategy that improves both accuracy and robustness of rainfall–runoff predictions across multiple lead times.
The key improvement comes from the modified mutation strategy that integrates k-means clustering. In standard DE, mutation can be misled by noisy or poorly distributed candidate solutions. Clustering focuses mutation within the most promising regions of the search space. This reduces wasted exploration and enables more precise adjustments to parameter sets. For example, in basins with high rainfall variability and rapidly changing flow dynamics, such as HUC-0204 (Redwood River, MN), standard DE often converged to suboptimal solutions. This produced higher RMSE and lower NSE. By contrast, the proposed DE maintained stability and delivered consistent improvements. It corrected bias in MBE and enhanced hydrological realism through higher R2 and KGE. Moreover, performance gains are not uniform across hydrological settings. In snowmelt-dominated basins, improvements over the original DE were more modest because parameters are less sensitive to short-term variability. In flood-prone basins with sharp runoff peaks, the proposed DE showed clear advantages. It reduced RMSE by up to 40% and achieved noticeable increases in NSE and R2. These results indicate that the k-means-based mutation strategy is particularly effective in high-variability environments. In such cases, accurately capturing parameter interactions is critical.
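The idea of concentrating mutation in a promising cluster can be sketched as follows. This is an illustrative approximation rather than the exact algorithm evaluated here: it assumes that the population is clustered with a few Lloyd iterations of k-means, that the winning cluster is the one with the lowest mean objective value, and that its best member serves as the base vector for a standard difference-vector mutation. The sphere function stands in for the validation loss of a trained model, and the population size and iteration count are far smaller than the 80 and 512 used in the experiments.

```python
import random


def kmeans_labels(points, k, rng, iters=10):
    # Plain Lloyd iterations; returns a cluster label for every point.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        for c in range(k):
            members = [points[i] for i in range(len(points)) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def de_generation(pop, fit, objective, rng, k=3, F=0.5, CR=0.9):
    # Group the population and pick the cluster with the lowest mean objective.
    clusters = {}
    for i, label in enumerate(kmeans_labels(pop, k, rng)):
        clusters.setdefault(label, []).append(i)
    winner = min(clusters.values(), key=lambda idx: sum(fit[i] for i in idx) / len(idx))
    base = pop[min(winner, key=lambda i: fit[i])]  # best member of that cluster
    dim = len(base)
    new_pop, new_fit = [], []
    for i, target in enumerate(pop):
        r1, r2 = rng.sample([j for j in range(len(pop)) if j != i], 2)
        mutant = [base[d] + F * (pop[r1][d] - pop[r2][d]) for d in range(dim)]
        j_rand = rng.randrange(dim)  # guarantees at least one mutated gene
        trial = [
            mutant[d] if (rng.random() < CR or d == j_rand) else target[d]
            for d in range(dim)
        ]
        f_trial = objective(trial)
        if f_trial <= fit[i]:  # greedy selection keeps the better vector
            new_pop.append(trial)
            new_fit.append(f_trial)
        else:
            new_pop.append(list(target))
            new_fit.append(fit[i])
    return new_pop, new_fit


def sphere(x):
    # Stand-in objective; a real run would return a validation loss.
    return sum(v * v for v in x)


def run_demo(generations=30, pop_size=20, dim=4, seed=0):
    rng = random.Random(seed)
    pop = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    fit = [sphere(p) for p in pop]
    start_best = min(fit)
    for _ in range(generations):
        pop, fit = de_generation(pop, fit, sphere, rng)
    return start_best, min(fit)
```

Because selection is greedy, the best objective value in the population never worsens from one generation to the next. Drawing the base vector from the winning cluster is what biases the search toward the promising region described above.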
Overall, the analysis demonstrates that the proposed DE offers superior generalization while maintaining stability across various prediction horizons. The improvements are not equal in magnitude across all cases. However, they are systematic and can be explained by the design of the mutation strategy. This supports the argument that the proposed DE is a meaningful step forward in metaheuristic optimization for rainfall–runoff modeling.
We complemented the paired t-tests in Tables 11 and 12 by also computing effect sizes (Cohen's d) and 95% CIs for comparisons between the proposed DE and baseline models across all horizons. In individual modeling, the proposed DE demonstrated large effect sizes for NSE (d = 0.85–1.22) and ATPE-2% (d = 0.88–1.31) across horizons. For RMSE, the effect sizes were medium to large (d = 0.62–0.95), reflecting consistent error reductions. In all cases, the 95% CIs excluded the value of zero. The lower bounds ranged from 0.41 to 0.83, while the upper bounds extended to 1.54. Compared to the original DE, improvements were most notable in 1-day and 2-day ahead predictions. In these cases, effect sizes for NSE and ATPE-2% exceeded 1.0. In regional modeling, the proposed DE achieved stronger gains. NSE showed consistently large effects (d = 0.92–1.40). ATPE-2% ranged from d = 0.81 at the 8-day horizon to d = 1.33 at the 1-day horizon. RMSE also showed large effects in shorter horizons (d = 0.90–1.15). All 95% CIs remained strictly positive, reinforcing the robustness of these improvements.
For Tables 15 and 16, we extended the analysis to include KGE, R2, and MBE. In individual modeling, the proposed DE showed large effect sizes for R2 (d = 0.93–1.28) and KGE (d = 0.87–1.19) across all horizons. MBE exhibited medium-to-large effects (d = 0.55–0.91), indicating substantial bias reduction compared to baselines. The 95% CIs again excluded zero. The lower bounds ranged from 0.46 to 0.89, and the upper bounds reached up to 1.65. Compared with the original DE, the proposed DE reduced bias most effectively at the 8-day and 4-day horizons. In these cases, Cohen's d exceeded 0.9. In regional modeling, the proposed DE achieved very large effect sizes for R2 (d = 1.01–1.45) and KGE (d = 0.95–1.32). MBE effects were moderate at longer horizons (d = 0.58–0.76). At shorter horizons, they became large (d > 0.90). Across all metrics, the 95% CIs were strictly positive, confirming the stability of improvements.
Overall, across all comparisons during testing and training, paired t-tests yielded p-values consistently below 0.01, establishing statistical significance. Moreover, 95% confidence intervals for effect sizes did not cross zero, with lower bounds ranging from 0.41 to 0.97 and upper bounds extending to 1.73. These findings confirm that the proposed DE not only provides statistically significant improvements but also achieves practically meaningful gains across all six metrics (RMSE, NSE, ATPE-2%, KGE, R2, and MBE), in both individual and regional modeling.
Figure 10 displays the minimization of loss over 250 iterations using the proposed DE optimization technique. The graph shows a clear reduction in loss values as iterations progress, confirming the efficiency of the DE algorithm in optimizing model hyperparameters. The initial drop in loss is steep. This indicates that the search rapidly improves the parameters and minimizes errors early in the iterations, and that the DE technique efficiently explores the parameter space to identify better solutions. With more iterations, the reduction rate becomes moderate, as expected when the search approaches an optimal solution. By about iteration 150, the loss curve levels off at a nearly constant value. This indicates that the search has reached a state of convergence, where improvements occur in small steps rather than large, abrupt changes. This convergence shows that the parameters are optimized to a level where the DE search yields diminishing returns, a common outcome in optimization tasks. In later iterations, the loss values remain steady with minimal fluctuations. This stability suggests that the model does not overfit. The small oscillations towards the end may result from fine-tuning by the algorithm, where parameters adjust slightly to explore nearby minima.

Minimization trajectories of loss over 250 iterations with the suggested DE optimization process.
This article uses a TLSTM network with spatial attention for rainfall–runoff modeling. Rainfall–runoff modeling has traditionally relied on recurrent models such as LSTM networks. LSTM captures sequential dependencies but often struggles with long-range temporal effects and heterogeneous hydrological dynamics. 2,27,28,62
Transformer-based architectures have gained popularity in recent research. 63,72–74 On the CAMELS dataset, Fang et al. 82 employed a time-series dense encoder with transformer components. They reported a reduction in RMSE of approximately 14 m³/s and a median NSE of 0.82. Their model outperformed the vanilla transformer and LSTM on every metric. Jiang et al. 83 also developed a Transformer-based meta-learning framework for few-shot flood forecasting. Compared with state-of-the-art techniques, they reported an MAE reduction of up to 19% under sparse data. In another study, He et al. 84 presented a regional transformer for ungauged catchments that consistently beat benchmark models in runoff forecasts ranging from one to three days. Our TLSTM with spatial attention frequently meets or surpasses these benchmarks in similar hydrological conditions. It produced an RMSE of 0.68 and an NSE of 0.88 in regional forecasts made one day in advance, and this NSE exceeds the median NSE of 0.82 reported by Fang et al. Large training sets are required for transformer approaches, such as the Jiang et al. model, to achieve a competitive MAE. In catchments with little data, they frequently struggle. Because TLSTM uses spatial attention and transductive learning, it performs better in this situation. In addition, Jiang et al. demonstrated higher KGE and lower MAE when compared to GRU and sequence-to-sequence (Seq2Seq) models. This aligns with the KGE improvements observed in our findings. Better bias control with lower MBE was another finding of the regional transformer study. Our TLSTM exhibits a comparable pattern in regional modeling.
The addition of spatial attention strengthens this design by explicitly modeling dependencies among basins. This aspect is often underrepresented in transformer-based architectures. 63,72–74 Transformers primarily emphasize sequence-level self-attention, focusing on temporal relationships within the input data. In contrast, TLSTM with spatial attention captures both spatial and temporal interactions. This is crucial in rainfall–runoff processes, where variability emerges from complex spatiotemporal dynamics. With spatial awareness, TLSTM handles the temporal dependencies targeted by transformer models while also representing cross-basin hydrological influences. The strength of this approach is evident in the results across six metrics (RMSE, NSE, KGE, R2, MBE, and ATPE-2%) and across multiple prediction horizons. In both individual and regional modeling, the proposed TLSTM generally achieved better outcomes than the baseline LSTM and transformer models. The improvements were consistent across most metrics and horizons, though the magnitude of gains varied depending on hydrological conditions and forecast lead time. These findings indicate that TLSTM addresses conceptual limitations of existing approaches while offering measurable advances in predictive accuracy, stability, and hydrological realism.
Another key aspect of the proposed framework is the use of a k-means clustering-based DE algorithm for hyperparameter optimization. Hyperparameter tuning in TLSTM is inherently complex due to the high-dimensional and nonlinear nature of rainfall–runoff data. In these situations, traditional DE works well for global search. 27 However, standard mutation may result in suboptimal solutions and premature convergence. To improve exploration, the proposed DE utilizes a k-means-based mutation to concentrate the search on promising clusters. This method improves stability over a range of time horizons and converges more quickly than the original DE. In comparison to the baseline DE, the proposed DE consistently improved NSE, R2, and KGE while lowering RMSE in our tests. The proposed DE also shows a better balance between exploration and exploitation than other metaheuristic algorithms, such as BA and ABC. These findings demonstrate that the k-means-enhanced DE improves the robustness and accuracy of TLSTM-based RR predictions, while also improving search efficiency.
Beyond RRM, the suggested model can be adapted for other uses. For example, it can be applied to time-series forecasting problems with temporal dependencies, such as environmental monitoring, energy consumption forecasting, and financial market prediction. TLSTM adapts to training data that resembles the test set, which makes it well suited to applications where data or conditions change over time. DE also optimizes numerous parameters effectively, so the model can be applied in complex situations involving large datasets and many variables. This method could be useful for ML applications, including DL and reinforcement learning, that require precise hyperparameter tuning. The fundamental ideas of TLSTM and DE can be modified to accommodate the requirements and limitations of different domains, which increases the adaptability and usefulness of the model.
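The way a transductive model emphasizes training data that resembles a test point can be illustrated with a similarity-weighted loss. The sketch below is a simplified assumption, not the TLSTM formulation itself: it uses a Gaussian kernel on the Euclidean distance between each training input and the test input, and the function names and bandwidth parameter are hypothetical.

```python
import math


def transductive_weights(train_inputs, test_input, bandwidth=1.0):
    # Gaussian similarity between each training input and the test input,
    # normalized so that the weights sum to one.
    sq_dists = [
        sum((a - b) ** 2 for a, b in zip(x, test_input)) for x in train_inputs
    ]
    raw = [math.exp(-d / (2.0 * bandwidth ** 2)) for d in sq_dists]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_squared_error(errors, weights):
    # Training samples close to the test point contribute more to the loss.
    return sum(w * e * e for e, w in zip(errors, weights))


# Three hypothetical training inputs and one test input.
train = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
weights = transductive_weights(train, [0.9, 1.1])
print([round(w, 3) for w in weights])
```

In this toy case the second training sample is closest to the test input and therefore receives the largest weight, which is the behavior that lets a transductive model adapt when conditions drift.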
The limitations of the model are as follows: Generalization across various environments: The difficulty of generalizing across various hydrological environments that differ significantly from the training and testing datasets is a major drawback of the suggested model. TLSTM can adjust to data that is similar to the test set. However, in settings with distinct geological, climatic, or land-use characteristics not included in the training data, its performance might deteriorate. This restriction raises questions about the reliability of predictions in unusual or extreme situations. Basins impacted by fast urbanization, glacier-fed systems, or areas experiencing unheard-of climate changes are a few examples. Future studies could investigate domain adaptation techniques, transfer learning, or meta-learning frameworks that enable the model to adapt dynamically to novel situations, thereby lessening this problem. Model adaptability can also be enhanced by incorporating auxiliary datasets, such as soil moisture records, climate indices, or remote sensing products. Furthermore, ensemble methods that integrate forecasts from several specialized models could improve resilience to regional variability. These tactics would enhance the ability of the model to handle unforeseen circumstances while maintaining predictive accuracy. Computational complexity: Combining TLSTM with an improved DE algorithm results in an increased computational complexity. Long training times and high computational demands are the results of the iterative hyperparameter optimization process and the deep recurrent architecture of TLSTM. This restricts the applicability of the model in real-time scenarios. Examples include emergency management and flood forecasting, where quick forecasts are essential. Techniques that reduce training overhead without compromising accuracy are necessary to overcome this constraint. 
Pruning, quantization, and low-rank factorization are examples of model compression techniques that can significantly reduce memory and processing demands. For operational use, knowledge distillation, in which a smaller student model learns from the larger TLSTM-DE framework, may also provide accurate yet portable substitutes. While specialized hardware accelerators, such as GPUs or tensor processing units (TPUs), can improve computational efficiency, parallelized or distributed training techniques can further reduce training time. Using reduced-order hydrological representations or surrogate models to approximate TLSTM outputs at a lower cost is another exciting avenue. When combined, these tactics would enhance the viability of the model in situations with limited resources and time constraints. Sensitivity to hyperparameter settings: The model is still sensitive to initial parameter configurations even though the suggested DE-based optimization framework enhances hyperparameter search. Ineffective initial values can result in longer optimization times or convergence to less-than-ideal solutions, which lowers overall efficacy. Particularly when applied to novel datasets with unknown or highly variable characteristics, this dependency is a drawback. Future research could investigate hybrid frameworks that combine local refinement techniques, such as gradient-based optimization, with global search techniques, like DE, to address this problem. The optimization process might also be stabilized by implementing adaptive parameter control, in which the crossover and mutation probabilities change dynamically during training. Additional opportunities to enhance efficiency are offered by Bayesian optimization and hyperparameter tuning, both of which are informed by reinforcement learning. Furthermore, meta-optimization can lessen reliance on initial guesses by using insights from prior optimization tasks to inform subsequent runs. 
Sensitivity analysis can also be used to determine which hyperparameters matter most and to rank them in order of importance. These measures would improve the scalability and usefulness of the model while strengthening its robustness to hyperparameter initialization.
Data dependency: The efficacy of the TLSTM component depends on access to representative, high-quality datasets. The model cannot learn general hydrological patterns if the data is sparse, noisy, or biased. In operational tasks, this frequently results in poor predictive performance. This restriction is especially important in areas with limited monitoring infrastructure, including developing nations or basins with limited data availability. Data augmentation is one solution. For instance, to increase training coverage, generative adversarial networks (GANs) can produce realistic hydrological sequences. Performance may also be improved by transferring knowledge from well-instrumented basins to areas with little data. Integrating multimodal data sources, including reanalysis datasets, climate indices, and satellite remote sensing, is an additional approach that can supplement limited ground-based observations. By prioritizing the most informative samples for labeling or collection, active learning techniques can increase data acquisition efficiency. Noise-robust training techniques, such as regularization and denoising autoencoders, can lessen the effect of faulty data. By integrating these strategies, the model can maintain its predictive power in a variety of real-world scenarios while becoming more robust to data constraints.
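As an illustration of the adaptive parameter control mentioned above under hyperparameter sensitivity, the following minimal sketch uses a jDE-style self-adaptation scheme, in which each individual carries its own mutation factor and crossover rate and occasionally resamples them. This is not the K-means-based mutation strategy used in this study; it is a standard DE/rand/1/bin loop, and the function name `jde_minimize`, the default settings, and the generic `objective` (a stand-in for a validation loss over hyperparameters) are assumptions made for illustration only.

```python
import numpy as np


def jde_minimize(objective, bounds, pop_size=20, generations=100,
                 tau_f=0.1, tau_cr=0.1, seed=0):
    """Minimize `objective` with DE/rand/1/bin and jDE-style self-adaptation.

    Hypothetical helper for illustration; requires pop_size >= 4.
    """
    rng = np.random.default_rng(seed)
    bounds = np.asarray(bounds, dtype=float)
    low, high = bounds[:, 0], bounds[:, 1]
    dim = len(bounds)

    # Random initial population inside the search box.
    pop = rng.uniform(low, high, size=(pop_size, dim))
    fitness = np.array([objective(x) for x in pop])

    # Each individual carries its own mutation factor F and crossover rate CR.
    F = np.full(pop_size, 0.5)
    CR = np.full(pop_size, 0.9)

    for _ in range(generations):
        for i in range(pop_size):
            # With small probability, resample the control parameters.
            f_i = 0.1 + 0.9 * rng.random() if rng.random() < tau_f else F[i]
            cr_i = rng.random() if rng.random() < tau_cr else CR[i]

            # DE/rand/1 mutation from three distinct individuals other than i.
            others = [j for j in range(pop_size) if j != i]
            a, b, c = rng.choice(others, size=3, replace=False)
            mutant = np.clip(pop[a] + f_i * (pop[b] - pop[c]), low, high)

            # Binomial crossover; at least one gene always comes from the mutant.
            mask = rng.random(dim) < cr_i
            mask[rng.integers(dim)] = True
            trial = np.where(mask, mutant, pop[i])

            # Greedy selection: parameters survive only if the trial does.
            trial_fit = objective(trial)
            if trial_fit <= fitness[i]:
                pop[i], fitness[i] = trial, trial_fit
                F[i], CR[i] = f_i, cr_i

    best = int(np.argmin(fitness))
    return pop[best], float(fitness[best])
```

Because control parameters are kept only when they produce a surviving trial vector, settings that work well on the problem at hand tend to propagate, which reduces the need to hand-tune them before the search starts.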
Conclusion
Our study demonstrates that the spatial attention-enhanced TLSTM model, equipped with an optimized DE algorithm, improves RRM accuracy. The TLSTM model captures subtle temporal differences, and the spatial attention mechanism focuses on the most important parts of the input. These abilities result in marked improvements over traditional LSTM models. In addition, the novel mutation strategy based on K-means clustering within the DE algorithm has proven effective for fine-tuning hyperparameters, which allows the model to adapt well to both individual and regional datasets. Evaluations using the CAMELS dataset show that the proposed model often performs better than existing LSTM-based models, achieving higher Nash-Sutcliffe efficiency (NSE) values in several cases. For instance, in 8-day runoff predictions for individual basins, the model achieved an NSE of 0.728, which was higher than that of its nearest competitor. In regional modeling, the approach obtained an NSE of 0.878, suggesting improved robustness and effectiveness in handling complex, large-scale hydrological data. The magnitude of improvements varies depending on basin characteristics and forecasting horizons. The results support the potential of integrating advanced machine learning techniques, such as spatial attention and transductive learning, into hydrological models. This integration enhances the predictive accuracy and reliability of RRM, which is crucial for effective disaster risk reduction, mitigation, and water resource management.
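For readers interpreting the reported scores, the Nash-Sutcliffe efficiency is one minus the ratio of the squared simulation error to the variance of the observations about their mean. A value of 1 indicates a perfect match, and a value of 0 means the model is no better than predicting the observed mean. The short sketch below computes it with NumPy; the function name is an assumption for illustration and is not taken from the released source code.

```python
import numpy as np


def nash_sutcliffe_efficiency(observed, simulated):
    """Return NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    observed = np.asarray(observed, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    if observed.shape != simulated.shape:
        raise ValueError("observed and simulated must have the same shape")

    # Spread of the observations around their own mean.
    denominator = np.sum((observed - observed.mean()) ** 2)
    if denominator == 0.0:
        raise ValueError("NSE is undefined when observations are constant")

    # Squared error of the simulation against the observations.
    numerator = np.sum((observed - simulated) ** 2)
    return 1.0 - numerator / denominator
```

Calling the function with an observed runoff series and the corresponding model output returns a single score; negative values are possible and indicate a simulation that performs worse than the observed mean.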
In future work, several limitations of the present study can be addressed to strengthen the applicability and extend the scope of the proposed model. One limitation is the long training time associated with TLSTM, which may restrict its scalability in operational hydrology. Future research could explore model compression, pruning, or knowledge distillation to reduce computational overhead without sacrificing accuracy. Another concern is that the spatial attention mechanism may sometimes highlight irrelevant spatial patterns, which introduces noise into predictions. Reliability could be improved by integrating adaptive attention regularization or combining spatial attention with domain-informed constraints. Regarding DE-based hyperparameter optimization, the K-means-based mutation strategy improved efficiency, but questions remain about its scalability in very high-dimensional parameter spaces. Future studies could test hybrid approaches that combine DE with Bayesian optimization or reinforcement learning to improve adaptability. Moreover, extending the framework to multimodal data, such as remote sensing and climate indices, could further improve robustness under diverse hydrological regimes. Comparative evaluations with emerging models, such as graph neural networks and advanced transformer variants, are also recommended to benchmark generalization across basins.
Footnotes
Funding
This work was supported by the major teaching and research project of Anhui Province's Quality Engineering in 2024, “Construction and Practice of Safety Education System in Experimental Training Rooms of Higher Vocational Colleges” (No. 2024jyxm0948).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
