A model based on decomposition and bidirectional long short-term memory network for short-term wind speed prediction

Abstract

Accurate wind speed forecasting is fundamental to stable wind power generation and to the management of power systems. We propose a short-term wind speed prediction model based on decomposition and a bidirectional long short-term memory network. First, the short-term wind speed series is processed by complete ensemble empirical mode decomposition with adaptive noise, which decomposes it into components with different local characteristic information and thereby decreases the complexity of the wind speed pattern. Then, a bidirectional long short-term memory network with an attention mechanism is fitted to the decomposed data, and the particle swarm optimization algorithm is used to optimize the hyper-parameters of the network to reduce modeling errors. The final prediction is obtained by adding the forecasted values output by each component model. Experimental results on two real short-term wind speed datasets verify that the designed approach achieves high accuracy in short-term wind speed forecasting and that its predictions are better than those of the comparison models.

Keywords

short-term wind speed prediction; complete ensemble empirical mode decomposition with adaptive noise; bidirectional long short-term memory; particle swarm optimization; attention mechanism

Introduction

In wind power generation, energy storage is both difficult and costly. Most generators adopt a linear conversion architecture with direct source acquisition and direct terminal power supply. In addition, grid integration is a major challenge for wind power generation (Shahid et al., 2023). During peak periods of power consumption, connecting a large number of intermittent energy sources affects the frequency modulation range of the power system, resulting in power instability (Kumar et al., 2025).

According to the prediction horizon, wind speed prediction can be divided into short-term, medium-term and long-term prediction. Long-term prediction mainly focuses on the general trend of annual wind speed, with the year as the time scale, and its results can provide scientific support for wind farm siting decisions. Medium-term forecasting covers the wind speed from weeks to months ahead and is the key basis for wind farm power generation planning. Short-term forecasting focuses on wind speed changes over the next few minutes to hours and can directly assist the real-time operation and regulation of wind turbines. Therefore, improving the accuracy of short-term wind speed prediction is well worth studying for power system scenarios (Dhaka et al., 2024).

At present, many results on short-term wind speed prediction have been reported, including physical models, artificial intelligence models, statistical models and combined prediction models (Tian et al., 2025). Physical models usually produce estimates by solving the Navier-Stokes (N-S) equations on the basis of real-time information such as on-site wind speed, wind direction and numerical weather prediction (Yang et al., 2025a). Artificial intelligence models comprise deep learning models and machine learning models. Machine learning models such as random forest and support vector regression build predictors from the nonlinear relationships within wind speed data. Statistical models predict wind speed from historical statistical data. Combined prediction models take the prediction performance of the single models into account: on the basis of the forecasted values of the different approaches, a statistical analysis model is established to integrate or weight the individual prediction models.

At present, many deep learning-based approaches are applied to short-term wind speed forecasting. The Long Short-Term Memory (LSTM) network was chosen as the forecasting model, and the results showed that the LSTM is superior to other methods (Huang et al., 2021). However, the model complexity of the LSTM network is high and its training time is relatively long. Aiming to decrease the complexity, Adam et al. proposed a deep learning framework based on the Gated Recurrent Unit (GRU), and a comparative study against the LSTM network found that the GRU is superior to the LSTM in both accuracy and training time (Adam et al., 2021). Although each method has scenarios to which it is suited, it is often a challenge for a single model to produce the best prediction. Mohammed and Mohammed constructed a Convolutional Neural Network prediction model based on the GRU network (CNN-GRU), in which the convolutional layer extracts the data features to improve the forecasting accuracy, while the gated recurrent unit stores the information in memory. The comparison results show that the CNN-GRU is superior to other benchmark approaches (Mohammed and Mohammed, 2022). However, the GRU network requires too many hyper-parameters to be adjusted when dealing with different types of data.

Many combined models pair a decomposition algorithm with one or more prediction models to forecast short-term wind speed, so that the advantages of each model or algorithm are fully utilized to improve prediction accuracy. Han et al. combined LSTM with Variational Mode Decomposition (VMD) for modeling and found that the forecasting accuracy of wind speed was significantly improved after VMD processing (Han et al., 2019). However, the computational complexity of VMD is high, and the computation time may be long, especially for long time series. Liu et al. applied Empirical Mode Decomposition (EMD) to decompose the short-term wind speed and subsequently used a neural network for prediction. Their approach makes prediction more convenient and enhances its accuracy (Liu et al., 2013). However, the EMD method is prone to modal aliasing and endpoint effects. He and Wang combined the improved Ensemble Empirical Mode Decomposition (EEMD) with the least absolute shrinkage and selection operator-quantile regression neural network model to overcome the modal aliasing problem and enhance the prediction accuracy (He and Wang, 2021). However, interference noise may remain in the decomposed sequence, which affects the prediction accuracy. Aiming to suppress this noise, Xiong et al. introduced Complete Ensemble Empirical Mode Decomposition (CEEMD) to decompose the wind power samples, which effectively eliminates the interference noise in the decomposed sequence and thus improves the fitting degree of the prediction (Xiong et al., 2023). However, the disadvantage of CEEMD is that if the amplitude of the added white noise and the number of iterations are not properly selected, redundant IMF components are generated, which need to be restructured or processed.

In our study, the original short-term wind speed samples are handled with Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN). During the decomposition process, the final reconstructed signal has less noise residue than the EEMD result, thus minimizing the number of screenings. This is achieved by summing the IMF components of each order resulting from the white noise EMD decomposition. The decomposed components are input into the Bidirectional Long Short-term Memory (BiLSTM) model with attention mechanism (BiLSTM-Attention) for forecasting. BiLSTM-Attention can not only capture long-term dependencies on historical time steps in the sequence, but also handle importance-based sampling. Meanwhile, the Particle Swarm Optimization (PSO) algorithm optimizes the hyper-parameters of the BiLSTM-Attention model to improve the forecasting performance. Simulation comparisons demonstrate the strong forecasting performance of the approach presented in this research. The main innovative work is as follows.

1. CEEMDAN is adopted to decompose the short-term wind speed samples, which reduces the complexity of the original data and makes the data easier for the prediction models to fit.

2. The attention mechanism is introduced into the BiLSTM-Attention model, which can capture the long-term dependence on the historical time step in the sequence, and can process the importance-based sampling information.

3. The PSO algorithm is used as optimization strategy to determine the hyper-parameters of the BiLSTM-Attention model to reduce forecasting error.

The other contents of this study are arranged as follows. Section 'Basic theory' gives the principles of CEEMDAN algorithm, BiLSTM with the attention mechanism, PSO algorithm and hyper-parameter optimization of BiLSTM. Section 'Designed prediction model' presents the implementation process of the designed prediction model. The effectiveness of the designed approach is validated in Section 'Case studies'. Section 'Conclusion' introduces the conclusions and future work.

Basic theory

CEEMDAN algorithm

CEEMDAN draws on the idea of the EEMD algorithm: it adds Gaussian white noise to the signal and performs multiple superposition averaging on the basis of the improved EMD, so as to remove noise (Li et al., 2025). It successfully solves the problem of the excessive averaging time of the EEMD algorithm and improves the decomposition efficiency. The CEEMDAN algorithm can be described as follows:

1. For the original short-term wind speed signal $s(k)$, the white noise $n_{b}(k)$ is added to obtain the new signal sequence

s_{b}(k) = s(k) + γ n_{b}(k)

(1)

where $γ$ is the amplitude of the white noise.

2. The newly generated signal sequence $s_{b}(k)$ is processed by EMD to obtain the first component $\mathrm{IMF}_{1}$ and the residual $R_{1}(k)$,

\mathrm{IMF}_{1}(k) = \frac{1}{m} \sum_{b=1}^{m} \mathrm{IMF}_{b1}(k)

(2)

R_{1}(k) = s(k) - \mathrm{IMF}_{1}(k)

(3)

where m is the number of white noise realizations added.

3. After adding the white noise $γ_{1} E_{1}(n_{b}(k))$ to the residual $R_{1}(k)$, where $E_{1}(\cdot)$ denotes the operator that extracts the first mode obtained by EMD, EMD decomposition is performed to obtain the second modal component $\mathrm{IMF}_{2}$ and the residual $R_{2}(k)$, i.e.,

\mathrm{IMF}_{2}(k) = \frac{1}{m} \sum_{b=1}^{m} E_{1}(R_{1}(k) + γ_{1} E_{1}(n_{b}(k)))

(4)

R_{2}(k) = R_{1}(k) - \mathrm{IMF}_{2}(k)

(5)

4. The above steps are repeated, with white noise added before each further decomposition, until the resulting residual cannot be decomposed any further. The original short-term wind speed signal is then expressed by equation (6) as the sum of the B extracted components and the final residual.

s(k) = \sum_{b=1}^{B} \mathrm{IMF}_{b}(k) + R_{B}(k)

(6)
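The averaging and residual bookkeeping of steps 1 to 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: `first_mode` is a hypothetical stand-in (the signal minus its moving average) for the EMD first-mode operator, the same first-mode operator is applied to the noise at every stage, and the function names, the fixed number of modes and the default values are chosen only for the example.

```python
import numpy as np

def first_mode(x, window=5):
    """Hypothetical stand-in for the EMD first-mode operator:
    the signal minus its moving average. Real CEEMDAN uses EMD sifting here."""
    pad = window // 2
    padded = np.pad(x, pad, mode="edge")
    trend = np.convolve(padded, np.ones(window) / window, mode="valid")
    return x - trend

def ceemdan_sketch(s, n_modes=4, m=50, gamma=0.2, seed=0):
    """Return (imfs, residual) following the structure of steps 1 to 4."""
    rng = np.random.default_rng(seed)
    noises = rng.standard_normal((m, s.size))  # m white-noise realizations n_b(k)
    # Steps 1-2: average the first mode of the m noisy copies s + gamma * n_b
    imf = np.mean([first_mode(s + gamma * n) for n in noises], axis=0)
    imfs = [imf]
    residual = s - imf
    # Steps 3-4: add the first mode of the noise to the residual and repeat
    for _ in range(n_modes - 1):
        imf = np.mean(
            [first_mode(residual + gamma * first_mode(n)) for n in noises], axis=0
        )
        imfs.append(imf)
        residual = residual - imf
    return np.array(imfs), residual

# Toy signal: a slow oscillation plus a fast one around a mean wind speed
k = np.arange(200)
s = 8 + 2 * np.sin(2 * np.pi * k / 50) + 0.5 * np.sin(2 * np.pi * k / 5)
imfs, residual = ceemdan_sketch(s)
```

Because each residual is obtained by subtracting the newly extracted component, the components and the final residual add back to the original signal by construction, which is the reconstruction property stated in equation (6).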

BiLSTM with the attention mechanism

A standard LSTM can only mine feature information from the past time steps of a series and cannot use information from later time steps, which limits its learning efficiency and prediction accuracy. BiLSTM not only uses the information of past samples but also learns the features of future samples by mining the time series in both directions, which enhances the model's utilization of the samples and improves its learning ability (Lu et al., 2025; Zhu et al., 2025). The frame of BiLSTM is shown in Figure 1.

Figure 1.

Frame of BiLSTM.

The attention mechanism is a way for a neural network to focus on the significant portions of the input data. When dealing with input samples, the attention mechanism emphasizes the most relevant data by assigning a weight to each input time point (Wu et al., 2026). In this way, even when data at some time points are missing or abnormal, the BiLSTM can still make accurate forecasts.

The basic principle behind the BiLSTM-Attention model is to add an attention layer to the BiLSTM model structure. The attention layer samples the input time series data according to its importance, and the importance-sampled data are then fed into the BiLSTM model for training, modeling and prediction. The BiLSTM-Attention model can therefore handle both importance-based sampling and the long-term dependence of sequences on historical time steps. The BiLSTM in Figure 1 introduces the attention mechanism by defining an attention layer whose weights are denoted by $W = (w^{1}, w^{2}, \ldots, w^{L})$. Through these attention weights, the BiLSTM-Attention model samples the significance of the incoming sequential data $X = (X_{1}, X_{2}, \ldots, X_{T})$. The sampled data are defined as $\tilde{X}_{t} = (x_{t}^{1} w^{1}, x_{t}^{2} w^{2}, \ldots, x_{t}^{L} w^{L})$, and the importance-sampled data $\tilde{X}_{t}$ are then input into the BiLSTM network to obtain the forecasted values.
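The element-wise weighting that produces the importance-sampled input can be illustrated with a few lines of NumPy. This is only a sketch: in the model the weights are learned during training, whereas here they are obtained from hypothetical raw scores through a softmax, and the array shapes and function names are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax, so the weights are positive and sum to 1."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def apply_attention(X, scores):
    """X has shape (T, L): T time steps with L features each.
    scores has shape (L,): hypothetical raw attention scores.
    Returns the weights w and the weighted input with entries x_t^l * w^l."""
    w = softmax(scores)
    return w, X * w  # broadcasting multiplies feature l of every time step by w[l]

X = np.arange(12, dtype=float).reshape(4, 3)  # toy input, T = 4, L = 3
w, X_tilde = apply_attention(X, np.array([0.0, 1.0, 2.0]))
```

The weighted array `X_tilde` has the same shape as `X`, so it can be passed to a recurrent layer in place of the raw input.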

PSO algorithm

Some hyper-parameters in the BiLSTM can affect the prediction accuracy of the model, such as regularization parameters, learning rate, number of training iterations, batch size and the number of neurons, etc. (Li et al., 2022). The optimization algorithm can be applied to optimize the above hyper-parameters. As a classical optimization algorithm, PSO algorithm has achieved excellent performance in many fields.

As a swarm intelligence algorithm, PSO is inspired by the foraging behavior of birds (Alhussan et al., 2023; Priyadarshi and Kumar, 2025). The problem of birds searching for food in one-dimensional space is extended to multidimensional space. Suppose that in a K-dimensional space the position of a particle is denoted by $X (X_{1, K}, X_{2, K}, \ldots, X_{M, K})$ , where M is the number of particles, the velocity of a particle is denoted by $V (V_{1, K}, V_{2, K}, \ldots, V_{M, K})$ , the best position found by an individual particle is $P_{i} (P_{1, K}, P_{2, K}, \ldots, P_{i, K})$ , and the best position found by the whole swarm is $P_{g} (P_{1, K}, P_{2, K}, \ldots, P_{g, K})$ . The velocity and position of each particle are updated as follows:

V_{M, K} = a \cdot V_{M, K} + c_{1} \cdot \mathrm{rand} \cdot (P_{i, K} - X_{M, K}) + c_{2} \cdot \mathrm{rand} \cdot (P_{g, K} - X_{M, K})

(7)

X_{M, K} = V_{M, K} + X_{M, K}

(8)

where a denotes the inertia coefficient, which takes a non-negative value; $c_{1}$ and $c_{2}$ denote the learning factors; and rand is a function that generates a random number in [0, 1]. The optimization process of the PSO algorithm is as follows:

Step 1 Initialize the velocity and position of the population, set the population size M and the maximum number of iterations L, take the current position of each particle as its historical optimal position $p_{best}$ , and take the best particle in the population as the current $g_{best}$ .

Step 2 At each evolutionary stage, the fitness function value of each particle is calculated.

Step 3 $p_{best}$ is updated when the current fitness function value of a particle is superior to its historical optimum.

Step 4 $g_{best}$ is updated when the current fitness function value is superior to the historical optimum of the whole population.

Step 5 The position and velocity of each particle are updated according to the above equations.

Step 6 Repeat Steps 2 to 5 above to continue searching for the global optimum. Stop iteration after finding the global optimum location or reaching the maximum number of iterations.
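Steps 1 to 6 and the update rules in equations (7) and (8) can be sketched as follows. This is a minimal illustration rather than the code used in this study: the function name `pso`, the clipping of positions to the search bounds, the zero initial velocity and the toy fitness function are assumptions, and the default values are borrowed from Table 2.

```python
import numpy as np

def pso(fitness, lower, upper, n_particles=25, n_iter=30,
        a=0.5, c1=1.5, c2=1.7, seed=0):
    """Minimize `fitness` over the box [lower, upper]; returns (g_best, g_val)."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size
    # Step 1: initialize positions, velocities and the historical optima
    x = rng.uniform(lower, upper, size=(n_particles, dim))
    v = np.zeros((n_particles, dim))
    p_best = x.copy()
    p_val = np.full(n_particles, np.inf)
    g_best, g_val = x[0].copy(), np.inf
    for _ in range(n_iter):  # Step 6: repeat until the iteration limit
        # Step 2: evaluate the fitness of every particle
        val = np.array([fitness(p) for p in x])
        # Step 3: update each particle's own best position
        improved = val < p_val
        p_best[improved] = x[improved]
        p_val[improved] = val[improved]
        # Step 4: update the best position of the whole population
        g = int(np.argmin(p_val))
        if p_val[g] < g_val:
            g_best, g_val = p_best[g].copy(), float(p_val[g])
        # Step 5: velocity update (7) and position update (8)
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        v = a * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = np.clip(x + v, lower, upper)  # clipping to the bounds is an assumption
    return g_best, g_val

# Toy usage: minimize the sphere function on [-5, 5]^2
best, val = pso(lambda p: float(np.sum(p ** 2)), [-5, -5], [5, 5])
```

In the hyper-parameter setting described in the next subsection, the toy sphere function would be replaced by a fitness that trains the network and returns its prediction error.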

Hyper-parameter optimization of BiLSTM

The forecasting accuracy of the BiLSTM model is closely related to the values of its hyper-parameters. At present, hyper-parameter values are mostly chosen from experience, sometimes through multiple trials, to achieve better prediction accuracy. The PSO algorithm has significant advantages for such complex problems owing to its powerful optimization ability and fast convergence speed, so it is chosen to optimize the hyper-parameters of the BiLSTM. For the BiLSTM model, parameters such as the number of memory units and the dropout rate do affect network capabilities such as accuracy and generalization ability, but their impact is not as significant as that of the learning rate, the number of hidden neurons and the batch size. Klaus et al. pointed out that the learning rate, the number of hidden neurons and the batch size are the hyper-parameters with the greatest impact on the prediction performance of BiLSTM (Klaus et al., 2017). Many studies also optimize these three parameters (Qiao et al., 2023; Yang et al., 2025b). Introducing PSO increases the complexity of the model, and the number of optimization variables also affects its time cost. Based on the above considerations, this paper uses PSO to optimize three parameters: the learning rate, the number of hidden layer neurons and the batch size. Figure 2 provides the specific optimization process.

Figure 2.

Flowchart of PSO optimized BiLSTM-Attention.

The implementation process of PSO optimized BiLSTM-Attention is introduced as follows.

Step 1 Obtain raw short-term wind speed samples.

Step 2 The division of datasets. The short-term wind speed data is divided into training set and test set. The training set is the first 80% of the samples, and the test set is the last 20% of the samples.

Step 3 Initialize the model and the corresponding parameters. Initialize the parameters of the BiLSTM model, such as the learning rate and the batch size. At the same time, initialize the parameters of the PSO algorithm, such as the maximum number of iterations. The fitness function is the Root Mean Square Error (RMSE) between the actual value of the sample and the predicted value of BiLSTM-Attention, given by equation (9).

\mathrm{fitness}(k) = \sqrt{\frac{1}{T} \sum_{i=1}^{T} (s_{i} - \hat{s}_{i})^{2}}

(9)

where k is the current number of iterations, T is the sample size, $s_{i}$ is the real value of the sample, and $\hat{s}_{i}$ is the predicted value.

Step 4 Parameter optimization of BiLSTM. When the PSO algorithm is used to optimize the BiLSTM parameters, each particle continuously updates its position and velocity with reference to its own previous optimal position. All particles also track the particle at the globally optimal position, continuously updating their positions and velocities to approach the global optimal position. Finally, the optimal particle is obtained from the global optimal position, which gives the optimal parameters of the BiLSTM.

Step 5 Input the optimal parameters obtained by PSO algorithm into the BiLSTM model, and BiLSTM performs in-depth analysis on the samples. Then, the model output values are subjected to inverse normalization to obtain the final forecasted results.
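The data division in Step 2 and the fitness evaluation in Step 3 can be sketched as below. This is a hedged illustration, not the code of this study: `decode_particle`, `rmse_fitness` and `persistence_model` are hypothetical names, rounding the two integer-valued hyper-parameters is an assumption, and the persistence forecaster merely stands in for training the BiLSTM-Attention model.

```python
import numpy as np

def chronological_split(series, train_ratio=0.8):
    """Split in time order: the first 80% for training, the last 20% for testing."""
    n_train = int(len(series) * train_ratio)
    return series[:n_train], series[n_train:]

def decode_particle(position):
    """Map a 3-dimensional particle position to the three hyper-parameters.
    Rounding the neuron count and batch size to integers is an assumption."""
    lr, hidden, batch = position
    return {
        "learning_rate": float(lr),
        "hidden_neurons": int(round(float(hidden))),
        "batch_size": int(round(float(batch))),
    }

def rmse_fitness(position, fit_predict, train, actual):
    """Equation (9): RMSE between the actual values and the model predictions."""
    params = decode_particle(position)
    predicted = fit_predict(train, len(actual), **params)
    return float(np.sqrt(np.mean((np.asarray(actual) - predicted) ** 2)))

def persistence_model(train, horizon, **hyper_params):
    """Hypothetical stand-in for BiLSTM-Attention: repeats the last training value
    and ignores the hyper-parameters."""
    return np.full(horizon, train[-1])

series = np.arange(10.0)
train, test = chronological_split(series)
score = rmse_fitness([0.005, 49.6, 85.2], persistence_model, train, test)
```

A function of this shape, with the stand-in replaced by actual model training, is what a PSO routine would call once per particle per iteration, which is why the number of particles and iterations dominates the time cost.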

Designed prediction model

The flowchart of the designed CEEMDAN-PSO-BiLSTM-Attention is shown in Figure 3.

Figure 3.

Flowchart of designed short-term wind speed prediction model.

The specific process of the model can be described as follows.

Step 1 The recorded raw samples contain interference, abnormal data, etc.; therefore, data pre-processing is required. First, abnormal data are removed to ensure the credibility of subsequent modeling. The Chauvenet method is chosen to detect outliers. The deviation between each data point and the mean value is calculated and compared with a standardized value (Z-score) to determine which samples are outliers. For each data point y, calculate $Z = (y - σ) / δ$ , where $σ$ is the mean value and $δ$ is the standard deviation. The corresponding standardized value is determined according to the sample size n and can normally be found by looking up a table. If the standard score z value of a sample is greater than $D_{\max} = z / 2$ , the sample is an outlier and needs to be removed from the wind speed samples. Deleted outliers and missing data then need to be filled in; this is done by mean interpolation, which replaces each deleted or missing value with the mean of the samples.

Step 2 After the outliers and missing data have been handled, the noise in the samples is processed. Median filtering can effectively eliminate noise, especially impulse noise and abrupt changes, from the samples. Assuming the data sample is $x_{1}, x_{2}, \cdots, x_{n}$ , take an odd window length l and perform median filtering: take l consecutive samples from the input, sort them by value, and take the middle value as the filter output, that is, $y_{i} = \mathrm{Med}\{x_{i - v}, \cdots, x_{i}, \cdots, x_{i + v}\}, v = (l - 1) / 2$ .

Step 3 The short-term wind speed samples are decomposed by the CEEMDAN algorithm into several smoother IMF components with different local feature information.

Step 4 The obtained IMF components and residual need to be normalized to increase the precision of the modeling. The normalization method is shown in equation (10).

w^{*} = \frac{w - W_{\min}}{W_{\max} - W_{\min}}

(10)

where $w^{*}$ is the normalized data, $w$ is the original wind speed data, $W_{\min}$ is the minimum value of the sample, and $W_{\max}$ is the maximum value of the sample. The normalized IMF components are modeled using the BiLSTM-Attention model, where the first 80% of the data is used for training and the remaining 20% for prediction.

Step 5 The hyper-parameters in the BiLSTM-Attention model are optimized by PSO algorithm.

Step 6 Predictions are made for each component using the optimized BiLSTM-Attention model, and the component predictions are then summed and inverse-normalized to obtain the final prediction.
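The pre-processing operations in Steps 1, 2 and 4 can be sketched as follows. The sketch makes several assumptions that are not stated in the text: the outlier test uses the textbook probability form of Chauvenet's criterion (a sample is rejected when the expected number of samples at least as extreme is below 0.5) rather than a table lookup, the median filter pads the series by repeating its edge values, and all function names are hypothetical.

```python
import math
import numpy as np

def chauvenet_mask(y):
    """True where a sample is rejected by Chauvenet's criterion (textbook form)."""
    z = np.abs(y - y.mean()) / y.std(ddof=1)
    # Two-sided normal tail probability P(|Z| > z) = erfc(z / sqrt(2))
    tail = np.array([math.erfc(v / math.sqrt(2)) for v in z])
    return len(y) * tail < 0.5

def fill_with_mean(y, mask):
    """Mean interpolation: replace rejected samples with the mean of the kept ones."""
    out = y.copy()
    out[mask] = y[~mask].mean()
    return out

def median_filter(x, l=5):
    """Median filter with odd window length l; edge padding is an assumption."""
    v = (l - 1) // 2
    padded = np.pad(x, v, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, l)
    return np.median(windows, axis=1)

def minmax(w):
    """Equation (10); also returns the bounds needed to invert the scaling."""
    w_min, w_max = w.min(), w.max()
    return (w - w_min) / (w_max - w_min), (w_min, w_max)

def inverse_minmax(w_star, bounds):
    """Inverse of equation (10)."""
    w_min, w_max = bounds
    return w_star * (w_max - w_min) + w_min

# Toy wind speed record with one obvious spike
y = np.array([5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.3, 5.1, 4.9, 25.0])
mask = chauvenet_mask(y)
cleaned = median_filter(fill_with_mean(y, mask), l=3)
scaled, bounds = minmax(cleaned)
```

Keeping the minimum and maximum returned by `minmax` is what makes the inverse normalization in Step 6 possible after prediction.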

Case studies

Datasets

The data of the study come from a wind turbine SCADA system in Turkey in 2018 and are divided into dataset A and dataset B according to the sampling interval. The data source is https://www.kaggle.com/datasets/berkerisen/wind-turbine-scada-dataset?resource=download. In dataset A, the collection period is from 00:00 on 1 February 2018 to 23:50 on 28 February 2018, the sample size is 4463, and the sampling period is 10 minutes. In dataset B, the collection period is from 00:00 on 1 August 2018 to 23:30 on 31 August 2018, the sample size is 1488, and the sampling period is 30 minutes. The training set and test set are divided in a ratio of 8:2. For dataset A, the training set contains 3570 samples and the test set contains 893 samples; for dataset B, the training set contains 1190 samples and the test set contains 298 samples. The original wind speed samples are displayed in Figures 4 and 5.

Figure 4.

Original short-term wind speed of dataset A.

Figure 5.

Original short-term wind speed of dataset B.

The operating system used in the experiment is Windows 10, and Anaconda is used to build the virtual environment required for the experiment. The framework used for the model is PyTorch, and the experimental code development platform is Jupyter Notebook. The GPU model is NVIDIA GeForce MX250, the CPU model is Intel(R) Core(TM) i5-10400H 2.6 GHz, the memory is 16 GB, and the video memory is 6 GB.

CEEMDAN decomposition and comparison

The CEEMDAN algorithm has two key parameters, namely the signal-to-noise ratio (SNR) and the average number of white noise additions (n). SNR is usually set to 0.1–0.3, while n is set to 50–200. The number of components obtained from the decomposition is determined adaptively by CEEMDAN, as in the EMD algorithm. This paper determines the two parameters by comparing the energy difference between the original short-term wind speed data and the decomposed components. The energy of a data sample is usually defined as the sum of the squares of the sample values, and then the average is taken. Table 1 shows the energy difference between the original short-term wind speed data samples and the sum of the corresponding decomposed components for different SNR and n values. Usually, the smaller the energy difference, the better the decomposition effect. As can be seen from Table 1, for dataset A, when SNR is 0.2 and n is 150, the energy difference between the original dataset A and the sum of the energies of each component is minimized. Similarly, for dataset B, the minimum is obtained when SNR is 0.2 and n is 100.

Table 1.

The energy difference with different SNR and n.

Dataset A			Dataset B
SNR	n	Energy difference	SNR	n	Energy difference
0.1	50	34.3681	0.1	50	43.3657
0.1	100	36.2587	0.1	100	38.2247
0.1	150	31.0025	0.1	150	41.3361
0.1	200	32.0057	0.1	200	44.3328
0.2	50	33.0365	0.2	50	32.0027
0.2	100	26.3367	0.2	100	28.7726
0.2	150	23.0365	0.2	150	32.7627
0.2	200	24.3647	0.2	200	34.3368
0.3	50	28.0048	0.3	50	38.3025
0.3	100	32.0017	0.3	100	31.0057
0.3	150	36.3684	0.3	150	36.0034
0.3	200	38.0046	0.3	200	39.0017
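The energy criterion used to fill Table 1 can be written compactly. The sketch below follows the definition given above (mean of the squared sample values) and compares the energy of the original series with the sum of the energies of the components; taking the absolute value of the difference and the function names are assumptions.

```python
import numpy as np

def energy(x):
    """Energy of a sample: the mean of the squared values."""
    return float(np.mean(np.square(x)))

def energy_difference(original, components):
    """Absolute difference between the energy of the original series and the
    sum of the energies of its decomposed components."""
    return abs(energy(original) - sum(energy(c) for c in components))

# Toy check with two orthogonal sinusoids over whole periods
k = np.arange(1000)
a = np.sin(2 * np.pi * 3 * k / 1000)
b = np.sin(2 * np.pi * 7 * k / 1000)
diff_orthogonal = energy_difference(a + b, [a, b])
diff_overlapping = energy_difference(a, [0.5 * a, 0.5 * a])
```

When the components are mutually orthogonal the difference is close to zero, whereas components that share content (as in the second call) leave a clear gap, which is why a small energy difference is read as a sign of a clean decomposition.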

To verify the performance of the CEEMDAN decomposition algorithm, a comparison is made with two decomposition algorithms, EMD and EEMD. Figures 6–8 respectively show the decomposition effects of EMD, EEMD, and CEEMDAN on dataset A. Similarly, Figures 9–11 show the decomposition effect on dataset B.

Figure 6.

EMD decomposition results (dataset A).

Figure 7.

EEMD decomposition results (dataset A).

Figure 8.

CEEMDAN decomposition results (dataset A).

Figure 9.

EMD decomposition results (dataset B).

Figure 10.

EEMD decomposition results (dataset B).

Figure 11.

CEEMDAN decomposition results (dataset B).

Taking the decomposition results of dataset A as an example, CEEMDAN decomposes dataset A into 9 IMF components and one residual component. Among them, IMF0 to IMF2 are high-frequency IMFs, with high fluctuation frequency and short wavelength. IMF3 to IMF5 are intermediate-frequency IMFs, with a decrease in frequency and a corresponding increase in wavelength. IMF6 to IMF8 are low-frequency IMFs, with lower frequencies and longer wavelengths. Different IMFs carry different elements of the original sample and have their own fluctuation frequencies. Similarly, for dataset B, CEEMDAN also achieves the best decomposition effect.

The components obtained from CEEMDAN and EEMD decomposition exhibit overall consistency; both effectively suppress the mode aliasing of EMD decomposition and have good decomposition effects. To compare the effectiveness of the three decomposition algorithms (CEEMDAN, EEMD, and EMD) more clearly, the decomposed components are reconstructed and the residual between the reconstructed signal and the original dataset is calculated. The specific results are shown in Figures 12 and 13.

Figure 12.

Comparison of residuals of decomposition reconstruction for dataset A.

Figure 13.

Comparison of residuals of decomposition reconstruction for dataset B.

The value of the reconstructed residual is an important indicator for judging the decomposition effect. The smaller the residual, the smaller the difference between the decomposed signal and the original signal, and the better the decomposition effect. From Figures 12 and 13, it can be seen that for dataset A and dataset B, the residual range of CEEMDAN is between −1.5 and 1.5, which is smaller than the residuals of EEMD and EMD algorithms. Therefore, the CEEMDAN algorithm has higher reconstruction ability for short-term wind speed data.

Hyper-parameters optimization results of BiLSTM

Afterward, the corresponding BiLSTM-Attention prediction model is established for each IMF component, and the PSO algorithm is used to determine the hyper-parameters of the BiLSTM. PSO is run with the parameter settings and constraints shown in Table 2. The optimal hyper-parameters of each component obtained after optimization are shown in Tables 3 and 4. Finally, the predicted values output by each model are added to generate the final forecasted value.

Table 2.

Parameter setting of PSO algorithm.

Parameter	Parameter value
Population size	N = 25
Maximum number of iterations	M = 30
Upper limit of parameters	[0.01,100,128]
Lower limit of parameters	[0.001,10,64]
Learning factor, social factor	c1 = 1.5, c2 = 1.7
Random factor	r1 = 0.8, r2 = 0.3
Inertia weight	a = 0.5

Table 3.

Optimal hyper-parameters of each component of dataset A.

Component	Learning rate	Number of hidden neurons	Batch size
IMF1	0.00974	49	86
IMF2	0.00816	82	102
IMF3	0.00770	64	92
IMF4	0.00713	70	93
IMF5	0.00810	25	105
IMF6	0.00986	54	95
IMF7	0.00695	69	86
IMF8	0.00558	54	107
IMF9	0.00492	74	78

Table 4.

Optimal hyper-parameters of each component of dataset B.

Component	Learning rate	Number of hidden neurons	Batch size
IMF1	0.00186	74	105
IMF2	0.00942	63	94
IMF3	0.00754	23	80
IMF4	0.00778	46	95
IMF5	0.00803	70	87
IMF6	0.00875	14	89
IMF7	0.00838	35	96
IMF8	0.00555	59	87

The introduction of attention mechanism can significantly improve the identification ability of BiLSTM model for key local features of input sequence when modeling the components obtained from each decomposition. This mechanism enhances the contribution of important time step features to network parameter optimization by adaptively learning the attention weight distribution of sequence elements, while maintaining the model complexity equivalent to the original structure, thus achieving accurate capture of the dynamic characteristics of the sequence and improving prediction accuracy. Figure 14 shows the distribution of weights for the attention mechanism corresponding to the 10 components of dataset A after CEEMDAN decomposition. Similarly, the distribution of attention mechanism weights corresponding to the nine components of dataset B after CEEMDAN decomposition is shown in Figure 15.

Figure 14.

The distribution of weights for the attention mechanism corresponding to the 10 components (dataset A).

Figure 15.

The distribution of weights for the attention mechanism corresponding to the 9 components (dataset B).

From Figures 14 and 15, it can be seen that the Attention layer added to the BiLSTM calculates the similarity between the current time and the historical times and generates normalized attention weights through the Softmax function. The weight matrix is multiplied with the output of the BiLSTM to achieve dynamic weighted fusion of temporal features, highlighting the contribution of key time steps.

To verify the prediction results, this paper adopts RMSE, Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE) and the coefficient of determination (R²) as the evaluation indexes, which are calculated by the following formulas:

RMSE

\mathrm{RMSE} = \sqrt{\frac{1}{Q} \sum_{j=1}^{Q} (s_{j} - \hat{s}_{j})^{2}}

(11)

MAE

\mathrm{MAE} = \frac{1}{Q} \sum_{j=1}^{Q} | s_{j} - \hat{s}_{j} |

(12)

MAPE

\mathrm{MAPE} = \frac{1}{Q} \sum_{j=1}^{Q} \frac{| s_{j} - \hat{s}_{j} |}{s_{j}} \times 100\%

(13)

R²

R^{2} = 1 - \frac{\sum_{j=1}^{Q} (s_{j} - \hat{s}_{j})^{2}}{\sum_{j=1}^{Q} (s_{j} - \bar{s})^{2}}

(14)

where $s_{j}$ is the real value, $\hat{s}_{j}$ is the forecasted value, $\bar{s}$ is the mean of the real values, and Q is the sample size used for training or testing.
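For reference, the four indexes in equations (11) to (14) can be computed as in the following sketch. The function name `evaluate` and the toy values are assumptions; the real values are assumed to be strictly positive, as equation (13) requires a non-zero denominator.

```python
import numpy as np

def evaluate(s, s_hat):
    """Return RMSE, MAE, MAPE (in percent) and R² for real values s and forecasts s_hat."""
    s = np.asarray(s, dtype=float)
    s_hat = np.asarray(s_hat, dtype=float)
    err = s - s_hat
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),                          # equation (11)
        "MAE": float(np.mean(np.abs(err))),                                 # equation (12)
        "MAPE": float(np.mean(np.abs(err) / s) * 100),                      # equation (13)
        "R2": float(1 - np.sum(err ** 2) / np.sum((s - s.mean()) ** 2)),    # equation (14)
    }

metrics = evaluate([2.0, 4.0, 6.0, 8.0], [2.5, 3.5, 6.5, 7.5])
```

RMSE and MAE carry the unit of the data (m/s for wind speed), MAPE is a percentage, and R² is dimensionless, so the four values are not directly comparable with one another.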

Ablation experiment

The BiLSTM model, CEEMDAN-BiLSTM, BiLSTM-Attention and CEEMDAN-BiLSTM-Attention are chosen to verify the effectiveness of the designed approach. The forecasting results for dataset A and dataset B are provided in Figures 16 and 17, respectively. For these models, the number of hidden neurons is 32 and the batch size is 64. Comparing the real and forecasted value curves of each model in Figures 16 and 17, we find that the forecasted values of every model are close to the real values, but the forecasted curve of the designed CEEMDAN-PSO-BiLSTM-Attention has the largest overlap with the actual value curve.

Figure 16.

Ablation experiment results of dataset A.

Figure 17.

Ablation experiment results of dataset B.

For the models mentioned above, the performance metrics RMSE, MAE, MAPE, and R² for dataset A and dataset B are given in Table 5 and Table 6, respectively. The results in Table 5 show that, relative to the CEEMDAN-BiLSTM model, the CEEMDAN-BiLSTM-Attention model reduces the RMSE, MAE, and MAPE error metrics by 24.2%, 27.9%, and 20.8%, respectively, and improves R² by 0.62%. Relative to the CEEMDAN-BiLSTM-Attention model, the CEEMDAN-PSO-BiLSTM-Attention model further reduces the RMSE, MAE, and MAPE error metrics by 15.5%, 18.0%, and 24.1%, respectively, and improves R² by 1.41%. Table 6 similarly demonstrates that the CEEMDAN-PSO-BiLSTM-Attention model outperforms the other approaches. In summary, the prediction approach in this paper is superior to the other prediction approaches in terms of the error performance metrics.

Table 5.

Comparison of performance metrics of dataset A.

Model	Training time (s)	RMSE (m/s)	MAE (m/s)	MAPE (%)	R²
BiLSTM	61.572	0.871	0.630	5.779	0.961
CEEMDAN-BiLSTM	127.890	0.764	0.583	4.860	0.967
CEEMDAN-BiLSTM-Attention	187.352	0.579	0.420	3.849	0.973
CEEMDAN-PSO-BiLSTM-Attention	10706.313	0.489	0.344	2.920	0.987
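The relative changes quoted for dataset A can be checked against the rounded entries of Table 5. The sketch below assumes that each percentage is the change divided by the baseline value; the helper name `relative_change_percent` is hypothetical. Because the table entries are rounded to three decimals, the recomputed figures may differ from the quoted ones in the last digit.

```python
def relative_change_percent(baseline, improved):
    """Percentage change of `improved` relative to `baseline` (positive = reduction)."""
    return 100.0 * (baseline - improved) / baseline


# Dataset A values copied from Table 5.
ceemdan_bilstm = {"RMSE": 0.764, "MAE": 0.583, "MAPE": 4.860, "R2": 0.967}
ceemdan_bilstm_attention = {"RMSE": 0.579, "MAE": 0.420, "MAPE": 3.849, "R2": 0.973}

# Error metrics: a positive value means the attention variant has a lower error.
reductions = {
    name: relative_change_percent(ceemdan_bilstm[name], ceemdan_bilstm_attention[name])
    for name in ("RMSE", "MAE", "MAPE")
}

# R^2 should increase, so its gain is the negated reduction.
r2_gain = -relative_change_percent(ceemdan_bilstm["R2"], ceemdan_bilstm_attention["R2"])
```

The same helper can be applied to any other pair of rows in Tables 5 to 10.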

Table 6.

Comparison of performance metrics of dataset B.

Model	Training time (s)	RMSE (m/s)	MAE (m/s)	MAPE (%)	R²
BiLSTM	16.523	1.036	0.776	8.649	0.902
CEEMDAN-BiLSTM	31.402	0.820	0.636	6.045	0.951
CEEMDAN-BiLSTM-Attention	173.952	0.716	0.523	5.437	0.953
CEEMDAN-PSO-BiLSTM-Attention	9629.680	0.555	0.412	3.932	0.971

Comparison with classical models

To further illustrate the superiority of the CEEMDAN-PSO-BiLSTM-Attention model, four classical models are chosen for comparison experiments: CNN-GRU, EMD-LSTM, Transformer, and the temporal convolutional network with bidirectional gated recurrent unit (TCN-BiGRU). For CNN-GRU, the number of 1D convolution kernels of the CNN is set to 64, the size of the convolution kernel is 3 × 3, and the number of GRU hidden neurons is 32. For EMD-LSTM, EMD decomposes the original wind speed data into 8 IMF components and a residual. For Transformer, the batch size is 32, n_head is 8, the number of epochs is 100, the learning rate is 0.0001, and drop_out is 0.1. For the TCN-BiGRU model, the number of hidden layer neurons is 10, the activation function is ReLU, the number of convolution kernels is 1, the size of the convolution kernels is 5, and the learning rate is 0.01. Figures 18 and 19 show the prediction results of these models for the two datasets, respectively. Comparing the real values with the forecasted values of the various models in the two figures, it is clear that the CEEMDAN-PSO-BiLSTM-Attention model fits well and tracks the changes in the actual short-term wind speed better than the other models.

Figure 18.

The prediction results of these models for dataset A.

Figure 19.

The prediction results of these models for dataset B.

Table 7 and Table 8 list the performance metrics of the prediction approaches for dataset A and dataset B, respectively. As shown in Table 7, the RMSE, MAE, and MAPE of the model in this paper are lower than those of the other approaches, and its R² is higher than that of the other models. Table 8 also shows that the CEEMDAN-PSO-BiLSTM-Attention model performs better than the others. The performance metrics in both tables indicate that the proposed CEEMDAN-PSO-BiLSTM-Attention model outperforms the comparison approaches.

Table 7.

Comparison of classical model performance metrics of dataset A.

Model	Training time (s)	RMSE (m/s)	MAE (m/s)	MAPE (%)	R²
CNN-GRU	5.362	0.876	0.625	5.352	0.961
EMD-LSTM	107.673	0.826	0.605	6.156	0.964
Transformer	176.466	0.722	0.525	4.713	0.968
TCN-BiGRU	152.030	0.844	0.618	5.652	0.960
CEEMDAN-PSO-BiLSTM-Attention	10706.314	0.489	0.344	2.920	0.987

Table 8.

Comparison of classical model performance metrics of dataset B.

Model	Training time (s)	RMSE (m/s)	MAE (m/s)	MAPE (%)	R²
CNN-GRU	2.954	1.034	0.769	8.305	0.902
EMD-LSTM	23.473	0.904	0.683	6.785	0.923
Transformer	86.114	0.757	0.632	5.368	0.938
TCN-BiGRU	53.031	0.840	0.709	6.442	0.920
CEEMDAN-PSO-BiLSTM-Attention	9629.681	0.555	0.412	3.932	0.971

Because the CEEMDAN decomposition algorithm generates multiple components and each component requires hyper-parameter optimization by the PSO algorithm, the training time of the proposed model is considerably longer than that of the comparison models (Tables 5 to 8). For online applications, the model can be trained offline and retrained at scheduled intervals to alleviate the time pressure.

Multi-step prediction results

Multi-step prediction is an effective method to test the accuracy of forecasting models (Yu et al., 2025). In many forecasting applications, it is necessary to predict the trend over a future period of time. For example, multi-step wind speed prediction with a large step length gives wind farms more time for grid adjustment. Therefore, the CEEMDAN-BiLSTM-Attention model and the proposed CEEMDAN-PSO-BiLSTM-Attention model are chosen to perform three-step, five-step, and ten-step forecasting experiments. Figures 20 and 21 present the multi-step forecasting results of dataset A and dataset B, respectively.

Figure 20.

Multi-step forecasting results (dataset A).

Figure 21.

Multi-step forecasting results (dataset B).

The performance metrics are compared in Table 9 and Table 10. The results show that the accuracy of multi-step forecasting is lower than that of single-step forecasting. The main cause is that the accuracy of multi-step forecasting is affected by the step length: the forecast at each step depends on the forecasts of the preceding steps, so even if the first few steps are relatively accurate, errors may accumulate over time and produce a larger overall forecasting error. Therefore, the forecasting errors increase gradually from three to five to ten steps. Nevertheless, the performance metrics of the CEEMDAN-PSO-BiLSTM-Attention model remain better than those of the CEEMDAN-BiLSTM-Attention model.
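The dependence of each step on earlier forecasts can be sketched as follows, assuming a recursive multi-step strategy in which every prediction is fed back as an input for the next step. The paper does not list its implementation, so `recursive_forecast` and the moving-average stand-in for the trained one-step predictor are hypothetical, and the sample wind speeds are made up.

```python
def recursive_forecast(history, one_step_model, steps, window):
    """Forecast `steps` values ahead by feeding each prediction back as input."""
    if len(history) < window:
        raise ValueError("history must contain at least `window` values")
    buffer = list(history[-window:])  # most recent observations
    predictions = []
    for _ in range(steps):
        next_value = one_step_model(buffer)
        predictions.append(next_value)
        # The oldest input is dropped and the prediction (not a measurement)
        # is appended, so any error in it propagates to all later steps.
        buffer = buffer[1:] + [next_value]
    return predictions


def moving_average_model(window_values):
    """Hypothetical stand-in for a trained one-step predictor."""
    return sum(window_values) / len(window_values)


history = [5.0, 6.0, 7.0, 8.0]  # made-up wind speeds in m/s
three_step = recursive_forecast(history, moving_average_model, steps=3, window=2)
```

From the second step onward the input window contains forecasted rather than measured values, which is one way the error accumulation described above can arise.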

Table 9.

Comparison of multi-step prediction performance metrics of dataset A.

Model	Step	RMSE (m/s)	MAE (m/s)	MAPE (%)	R²
CEEMDAN-BiLSTM-Attention	3 step	1.115	0.809	7.104	0.935
CEEMDAN-PSO-BiLSTM-Attention	3 step	0.977	0.703	6.686	0.951
CEEMDAN-BiLSTM-Attention	5 step	1.375	1.003	8.956	0.901
CEEMDAN-PSO-BiLSTM-Attention	5 step	1.200	0.896	8.504	0.925
CEEMDAN-BiLSTM-Attention	10 step	1.735	1.289	11.597	0.844
CEEMDAN-PSO-BiLSTM-Attention	10 step	1.506	1.096	10.856	0.883

Table 10.

Comparison of multi-step prediction performance metrics of dataset B.

Model	Step	RMSE (m/s)	MAE (m/s)	MAPE (%)	R²
CEEMDAN-BiLSTM-Attention	3 step	1.215	0.917	10.400	0.859
CEEMDAN-PSO-BiLSTM-Attention	3 step	1.044	0.800	7.582	0.912
CEEMDAN-BiLSTM-Attention	5 step	1.436	1.112	11.788	0.806
CEEMDAN-PSO-BiLSTM-Attention	5 step	1.278	0.992	9.358	0.846
CEEMDAN-BiLSTM-Attention	10 step	1.732	1.393	15.295	0.715
CEEMDAN-PSO-BiLSTM-Attention	10 step	1.685	1.339	12.373	0.788

Conclusion

In this paper, the CEEMDAN algorithm is selected to process the fluctuating short-term wind speed data, decomposing it into smoother and more regular components to decrease the complexity and volatility of the short-term wind speed series. Each decomposed component is then modeled and predicted using the BiLSTM-Attention model, which not only captures the long-term dependence on historical time steps in the sequence but also weights the inputs according to their importance. In addition, the PSO algorithm is adopted to find the optimal key hyper-parameters of the BiLSTM, which decreases the forecasting error. Two real short-term wind speed datasets are used as the study target. The ablation experiments, the comparison with other models, and the multi-step prediction results demonstrate the effectiveness of the designed approach.

The proposed model still needs improvement in terms of training time. In the future, hardware accelerators such as Tensor Processing Units and Field Programmable Gate Arrays can be used to further speed up model training. These accelerators are specifically optimized for machine learning and deep learning tasks, providing higher computational performance and energy efficiency.

Footnotes

ORCID iD

Zhongda Tian

Author contributions

Zhongda Tian: Conceptualization, Methodology, Software, Validation, Writing, and Funding acquisition. Xinru Shao: Software and Validation.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper is supported by the Science Research Project of Liaoning Education Department (LJKZ0143), and the Open Project of State Key Laboratory of Synthetical Automation for Process Industries (2023-kfkt-01).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The data used to support the results of this study can be obtained from the corresponding author.

References

Adam, Liu (2021) Wind power forecasting – a data-driven method along with gated recurrent neural network. Renewable Energy 163: 1895–1909.

Alhussan, El-Kenawy, Abdelhamid, et al. (2023) Wind speed forecasting using optimized bidirectional LSTM based on dipper throated and genetic optimization algorithms. Frontiers in Energy Research 11: 1172176.

Dhaka, Sreejeth, Tripathi (2024) A survey of artificial intelligence applications in wind energy forecasting. Archives of Computational Methods in Engineering 31(8): 4853–4878.

Han, Zhang, Wang, et al. (2019) Multi-step wind power forecast based on VMD-LSTM. IET Renewable Power Generation 13(10): 1690–1700.

Wang (2021) Short-term wind power prediction based on EEMD–LASSO–QRNN model. Applied Soft Computing 105: 107288.

Huang, Xiang, et al. (2021) A new wind power forecasting algorithm based on long short-term memory neural network. International Transactions on Electrical Energy Systems 31(12): e13233.

Klaus, Jan, et al. (2017) LSTM: a search space odyssey. IEEE Transactions on Neural Networks and Learning Systems 28(10): 2222–2232.

Kumar, Paul, Saha, et al. (2025) Recent advancements in planning and reliability aspects of large-scale deep sea offshore wind power plants: a review. IEEE Access 13: 3738–3767.

Song, Wang, et al. (2022) A novel offshore wind farm typhoon wind speed prediction model based on PSO–Bi-LSTM improved by VMD. Energy 251: 123848.

Sun, et al. (2025) Enhancing financial time series forecasting with hybrid deep learning: CEEMDAN-Informer-LSTM model. Applied Soft Computing 177: 113241.

Liu, Wang (2013) RBF prediction model based on EMD for forecasting GPS precipitable water vapor and annual precipitation. Advanced Materials Research 765: 2830–2834.

Tang, Liu, et al. (2025) Unsupervised damage localization method based on GAN-BiLSTM response modeling. Engineering Structures 328: 119714.

Mohammed (2022) Accurate photovoltaic power prediction models based on deep convolutional neural networks and gated recurrent units. Energy Sources, Part A: Recovery, Utilization, and Environmental Effects 44(3): 6303–6320.

Priyadarshi, Kumar (2025) Evolution of swarm intelligence: a systematic review of particle swarm and ant colony optimization approaches in modern research. Archives of Computational Methods in Engineering 32(6): 3609–3650.

Qiao, Xin, et al. (2023) Gas production prediction using AM-BiLSTM model optimized by whale optimization algorithm. Applied Geophysics 20(4): 499–506.

Shahid, Zhuang, Malik, et al. (2023) Large-scale wind power grid integration challenges and their solution: a detailed review. Environmental Science and Pollution Research International 30(47): 103424–103462.

Tian, Feng (2025) Short-term wind speed prediction model based on long short-term memory network with feature extraction. Earth Science Informatics 18(4): 333.

Fan, Tian, et al. (2026) Enhanced edge detection of harmful algal blooms using diffusion probability models and Sobel-convolutional attention mechanisms. Expert Systems with Applications 298: 129663.

Xiong, Peng, Tao, et al. (2023) A dual-scale deep learning model based on ELM-BiLSTM and improved reptile search algorithm for wind power prediction. Energy 266: 126419.

Yang, Guo, Huang (2025a) Wind power ultra-short-term prediction method based on NWP wind speed correction and double clustering division of transitional weather process. Energy 282: 128947.

Yang, Meng, et al. (2025b) Optimization of analog circuit parameters using bidirectional long short-term memory coupled with an enhanced whale optimization algorithm. Mathematics 13(1): 121.

Cao, Wang, et al. (2025) Dual-splitting conformal prediction for multi-step time series forecasting. Applied Soft Computing 184: 113825.

Zhu, Chen, Wang, et al. (2025) VSPNet: a vehicle speed prediction model incorporating transformer and BiLSTM. Measurement Science and Technology 36(2): 026118.