Prediction of Remaining Life of Aircraft Engines Based on BiLSTM-GRU-Attention Model

Abstract

This study aims to enhance the prediction precision of aircraft engine remaining useful life (RUL) by overcoming common challenges in current models, such as ineffective feature extraction and insufficient modeling of long-term temporal dependencies. We propose a novel multilayer hybrid architecture that combines bidirectional long short-term memory (BiLSTM) and gated recurrent unit (GRU) networks, augmented with an attention mechanism to enhance the model’s focus on informative temporal patterns. In this framework, raw time series data are initially processed by the BiLSTM to extract bidirectional features associated with engine health conditions. The GRU network is subsequently used to effectively model long-range dependencies, thereby enriching the temporal representation. An adaptive attention module is included to assign varying importance to different features, allowing the model to focus on key indicators of engine condition. Evaluation results on the FD001 and FD003 datasets show that the model achieves root mean squared error reductions ranging from 8.81% to 30.60% and from 7.48% to 37.96%, validating its performance and robustness in RUL forecasting. In comparison with conventional BiLSTM and GRU models, the proposed BiLSTM-GRU-Attention architecture integrates attention-based feature weighting with a hybrid recurrent framework, thereby offering a concise and effective approach to RUL prediction for aircraft engines.

Keywords

aircraft engines; bidirectional long short-term memory network; gated recurrent unit; remaining useful life

Introduction

As the core propulsion system of aerospace vehicles, the operational status of aircraft engines directly affects equipment reliability and flight safety. Aircraft engine failures can pose significant threats to the stable operation of the entire system. Therefore, maintenance and inspection of aircraft engines must be prioritized in aviation maintenance planning.¹ With the continuous advancement of science and technology, a new form of productive force—characterized by innovation capability, technological advancement, and intelligent applications—has been widely applied in the industrial sector, bringing transformative changes to the maintenance and management of aero-engines. In Prognostics and Health Management, the integration of advanced data analytics, artificial intelligence (AI), and internet of things technologies has become a key enabler for improving the accuracy and reliability of remaining useful life (RUL) estimation.

Accurate prediction of the RUL of aircraft engines supports stable operation and enables effective condition-based maintenance, thereby reducing the likelihood of catastrophic failures during operation. It also assists stakeholders in formulating scientifically grounded maintenance strategies. This facilitates a shift from traditional time-based maintenance to condition-based maintenance, enhancing both the safety and economic efficiency of aviation operations. In this process, big data analytics, machine learning, and intelligent decision support tools, which are key components of emerging productivity paradigms, play a pivotal role in optimizing predictive models, improving prediction accuracy, and enhancing real-time responsiveness. Consequently, these technologies have driven substantial technological advancements and management innovations within the aerospace industry.

Existing methodologies for predicting the RUL of aircraft engines are generally classified into three main categories: physics-based models, data-driven approaches, and hybrid techniques,²⁻⁴ with a comparative overview given in Table 1.

Table 1.

Comparison of commonly used remaining useful life prediction methods

Prediction accuracy: high → low
Category	Empirical method	Data method	Model method
Physical model	Unnecessary	Unnecessary	Necessary
Failure history	Necessary	Unnecessary	Unnecessary
Historical working status	Unnecessary	Necessary	Necessary
Current status	Unnecessary	Necessary	Necessary
Identify fault type	Unnecessary	Necessary	Necessary
Maintenance history	Preferable	Unnecessary	Preferable

Physics-based approaches estimate the RUL by incorporating observed data into mathematical or physical models that describe the degradation behavior of industrial system components.⁵ Li et al.⁶ simplified the computation by utilizing the stress ratio to represent variable loading sequences and used particle filtering to estimate the parameters of the crack propagation model. The findings validated the capability of this approach in accurately estimating crack propagation and fatigue lifespan.

Nevertheless, physics-based approaches have notable limitations in practice. They rely on an in-depth understanding of degradation mechanisms and precise mathematical formulations, which are often difficult to establish under nonlinear, multicondition, or noisy sensor environments. Moreover, the high cost of acquiring complete parameters and boundary conditions further restricts their applicability. Consequently, attention has increasingly shifted toward data-driven approaches.⁷ By directly extracting features and learning degradation patterns from historical monitoring data, data-driven methods avoid the need for explicit physical modeling and offer greater flexibility in capturing nonlinear dynamics and multidimensional characteristics. This methodological transition not only addresses the shortcomings of physics-based models but also lays the groundwork for introducing advanced deep-learning techniques.

Data-driven prognostic methods leverage statistical and machine learning algorithms to directly extract relevant features and patterns from monitoring data, thereby avoiding the limitations inherent in physics-based models.⁸ Owing to their flexibility, these methods can accommodate various forms of data and effectively uncover subtle features that conventional rule-based systems are often unable to detect. Statistical and AI techniques⁹ represent the two main branches of data-driven prognostics. Classical statistical models, such as the Wiener and gamma processes,¹⁰ can effectively model the degradation as a stochastic process but face challenges in feature extraction for complex equipment components. Wang et al.¹¹ proposed a nonlinear Wiener process-based model to address the nonlinear characteristics and triple-source uncertainties commonly observed during the performance degradation of aircraft engines. Liu et al.¹² evaluated the reliability of wind turbine blades based on a nonlinear Wiener degradation process, taking into account the influence of stochastic failure thresholds across different degradation stages.

In the field of machine learning, methods such as support vector machines, relevance vector machines, and neural networks have been widely used for RUL prediction. Among these, recurrent neural networks (RNNs) and convolutional neural networks (CNNs)¹³ are two representative algorithmic paradigms. Jiahao et al.¹⁴ proposed an RNN variant that mitigates the difficulty of learning long-term dependencies, uncovering latent patterns within sensor data for more accurate RUL prediction of aircraft engines. Ren et al.¹⁵ introduced a multiscale dense gated recurrent unit (GRU) model composed of a feature layer, multiscale layer, GRU skip connections, and dense layers, in which the feature layer was initialized using a pretrained restricted Boltzmann machine. Experimental validation based on real-world bearing datasets demonstrated the model’s effectiveness in RUL prediction. As a representative deep learning architecture, the CNN is a feedforward neural network comprising convolutional layers, pooling layers, and fully connected (FC) layers. Li et al.¹⁶ proposed a novel deep CNN-based approach that processes data along the temporal dimension for RUL prediction.

In practical applications, the temporal evolution of various feature parameters exhibits diverse patterns, which poses challenges for achieving high prediction accuracy through a single approach. Therefore, hybrid methods that integrate physics-informed models and data-driven models have emerged, combining the high precision of model-based methods with the strong robustness of data-driven techniques to enhance both prediction accuracy and reliability. Remadna et al.¹⁷ proposed a hybrid model combining CNNs and bidirectional long short-term memory (BiLSTM) to capture spatial and temporal features, respectively, for predicting engine RUL. Liu et al.¹⁸ applied attention mechanisms (AM) to condition monitoring data to evaluate the relevance of input features, followed by bidirectional GRU (BGRU) and CNN for encoding critical information, and ultimately used an FC network for decoding and RUL estimation.

With the rapid advancement of machine learning and sensor technologies, machine learning-based hybrid approaches have demonstrated significant advantages and promising research prospects in the field of RUL prediction.¹⁹ Although machine learning has exhibited outstanding performance in time-series data modeling, most existing methods tend to overlook the quantification of uncertainty. Quantification of uncertainty in RUL prediction aims to characterize the distribution of predicted RUL by accounting for multiple sources of uncertainty inherent in the prediction process, thereby enabling interval estimation. Confidence intervals for RUL predictions can offer insight into the reliability of the forecast results and are of critical importance for optimal maintenance decision-making. Currently, the main approaches to uncertainty quantification in RUL prediction include stochastic processes,²⁰ filtering algorithms,²¹ and Bayesian neural networks.²²

Beyond aero-engine prognostics, deep learning has advanced a wide spectrum of predictive tasks across domains. For example, education analytics leverages CNN-derived multimedia features fused with ensemble learners to predict academic success, achieving state-of-the-art performance on LMS data.²³ In medical imaging, multistage pipelines built upon high-capacity CNN backbones (e.g., NASNet) have reported consistent gains for lesion detection under limited supervision and noisy acquisition settings.²⁴

In industrial and engineering domains, deep learning frameworks have been widely used for fault diagnosis and health monitoring of rotating machinery,²⁵ as well as for RUL prediction of aero-engines using transformer-based architectures and multisensor fusion.²⁶ Recent studies also explore multiscale attention networks for time series prognostics in manufacturing systems, achieving improved feature representation and robustness under covariate shifts.²⁷ These trends are aligned with our design choices—bidirectional temporal encoding (BiLSTM), efficient long-range modeling (GRU), and data-dependent weighting (self-attention)—which together yield stronger RUL estimates under single-condition subsets.

Based on the investigation and application of the aforementioned methods, this research proposes a new network architecture that combines BiLSTM, GRU, and AM, along with dropout-based uncertainty quantification, for the prediction of aircraft engine RUL. This study improves RUL prediction by separating the model into three steps: learning local patterns in both time directions (BiLSTM), capturing longer-term trends (GRU), and weighting the most informative signals (attention). Unlike single models or hybrids that blend these steps, our BiLSTM–GRU–Attention design keeps their roles distinct, trains more stably, and makes it easier to see which inputs matter.

Materials and Methods

LSTM model and BiLSTM model

LSTM²⁸ is a special type of RNN,²⁹ specifically designed to address the long-term dependency problem in time series data.³⁰ By introducing gating mechanisms, LSTM effectively overcomes the gradient vanishing and exploding issues commonly encountered by traditional RNNs when processing long sequences.³¹ The structural diagram of the LSTM model is illustrated in Figure 1.

FIG. 1.

Long short-term memory (LSTM) model structure diagram (adapted from refs.²⁸⁻³⁰; redrawn by authors).

The fundamental unit of an LSTM network is referred to as an “LSTM cell.” Each LSTM cell consists of four primary components: the forget gate, input gate, candidate memory-cell update, and output gate.³² These gating mechanisms are realized via specialized neural network layers and nonlinear activation functions, allowing the model to selectively preserve, modify, and release information across temporal sequences.

Forget Gate

The forget gate regulates the degree to which the previous cell state is preserved at the current time step.³³ It receives the current input $χ_{t}$ and the previous hidden state $h_{t - 1}$ , and outputs a value between 0 and 1, which serves as a weighting factor for the previous memory. The associated equation is defined as follows:

f_{t} = σ (W_{f} \cdot [h_{t - 1}, χ_{t}] + b_{f})

(1)

The symbol $σ$ denotes the Sigmoid activation function, whereas $W_{f}$ and $b_{f}$ represent the weight matrix and bias vector of the forget gate, respectively.

Input Gate

The input gate determines the extent to which current input data impact the memory cell state. It comprises two components: the input gate itself and the candidate memory state. The input gate controls the admission of new information into the memory cell, whereas the candidate memory state produces potential memory content. The relationship is quantitatively described by the following equation:

i_{t} = σ (W_{i} \cdot [h_{t - 1}, χ_{t}] + b_{i})

(2)

{\tilde{C}}_{t} = \tanh (W_{C} \cdot [h_{t - 1}, χ_{t}] + b_{C})

(3)

where $\tanh$ is the hyperbolic tangent activation function, and $W_{i}$, $b_{i}$, $W_{C}$, and $b_{C}$ denote the weight matrices and bias vectors associated with the input gate and the candidate memory state, respectively.

Cell State Update

The cell state is modified according to the outputs of the forget and input gates. The forget gate specifies the portion of the prior cell state to preserve, while the input gate dictates how much of the new candidate memory should be added. The update rule is defined as follows:

C_{t} = f_{t} \cdot C_{t - 1} + i_{t} \cdot {\tilde{C}}_{t}

(4)

Here, $C_{t}$ is the memory cell state at the current time step, and $C_{t - 1}$ is the memory cell state at the previous time step.

Output Gate

The output gate controls the influence of the memory cell state on the hidden state at the current time step. It integrates the current input with the updated cell state to generate the new hidden state. The computation is given by the following:

o_{t} = σ (W_{o} \cdot [h_{t - 1}, χ_{t}] + b_{o})

(5)

h_{t} = o_{t} \cdot \tanh (C_{t})

(6)

where $W_{o}$ and $b_{o}$ are the weight matrix and bias vector of the output gate, respectively.
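The gate equations (1)–(6) can be traced in a few lines of plain Python. The following is a minimal sketch rather than the authors' implementation: the dictionary keys (`W_f`, `b_f`, and so on), the helper `zero_params`, and the all-zero weights are illustrative assumptions chosen so that the result can be checked by hand.

```python
import math


def sigmoid(v):
    """Element-wise logistic function for a list of floats."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(W, z, b):
    """Return W @ z + b, with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, z)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step following Eqs. (1)-(6).

    p maps hypothetical names "W_f", "b_f", ... to weights and biases;
    every weight matrix has shape (hidden, hidden + input).
    """
    z = h_prev + x_t                                    # concatenation [h_{t-1}, x_t]
    f_t = sigmoid(affine(p["W_f"], z, p["b_f"]))        # forget gate, Eq. (1)
    i_t = sigmoid(affine(p["W_i"], z, p["b_i"]))        # input gate, Eq. (2)
    c_hat = [math.tanh(a) for a in affine(p["W_C"], z, p["b_C"])]  # candidate, Eq. (3)
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]  # Eq. (4)
    o_t = sigmoid(affine(p["W_o"], z, p["b_o"]))        # output gate, Eq. (5)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]  # hidden state, Eq. (6)
    return h_t, c_t


def zero_params(n_in, n_hidden):
    """All-zero parameters, used only to make the example easy to verify."""
    p = {}
    for gate in ("f", "i", "C", "o"):
        p["W_" + gate] = [[0.0] * (n_hidden + n_in) for _ in range(n_hidden)]
        p["b_" + gate] = [0.0] * n_hidden
    return p


# With zero weights every gate equals 0.5 and the candidate is 0,
# so the previous cell state is simply halved.
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], zero_params(3, 2))
print(c_t)  # [0.5, -0.5]
```

In a trained network the weights are learned, so the gates take input-dependent values between 0 and 1 instead of the constant 0.5 seen here.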

BiLSTM³⁴ extends the LSTM by integrating information from both past and future contexts, enabling the model to process a sequence in two directions. By contrast, a traditional LSTM can only process sequential data in one direction (typically from past to future),³⁵ which limits its performance in many practical applications. The BiLSTM architecture captures contextual dependencies from both past and future time steps by incorporating hidden layers that process the sequence in forward and backward directions simultaneously. These hidden layers jointly extract key information from both directions at each time step,³⁶ thereby further enhancing the capability to model time series data. The architecture of the BiLSTM model is illustrated in Figure 2.

FIG. 2.

Structural illustration of the bidirectional long short-term memory (BiLSTM) network (adapted from refs.³⁴,³⁵,³⁷).

In a BiLSTM network, one LSTM layer captures information from past to future, while the other traverses the sequence in reverse.³⁷ Each LSTM layer generates an output sequence of the same length as the input. The final representation at each time step is obtained by concatenating the outputs from both directions, thereby capturing contextual information from both the past and the future.³⁸ Specifically, for a given input sequence $X = (x_{1}, x_{2}, \ldots, x_{T})$, the forward and backward layers of the BiLSTM compute:

\vec{h_{t}} = LSTM (x_{t}, \vec{h_{t - 1}})

(7)

\overset{\leftarrow}{h_{t}} = LSTM (x_{t}, \overset{\leftarrow}{h_{t + 1}})

(8)

At time step $t$ , $\vec{h_{t}}$ and $\overset{\leftarrow}{h_{t}}$ denote the LSTM’s hidden states in the forward and backward directions, respectively. The final output $h_{t}$ is obtained by concatenating these two hidden states:

h_{t} = [\vec{h_{t}} ; \overset{\leftarrow}{h_{t}}]

(9)
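The two passes and the concatenation in Eqs. (7)–(9) can be sketched as follows. To keep the example short, a hypothetical `toy_step` function stands in for a full LSTM cell; only the direction handling and the concatenation are the point here, and all function names are illustrative.

```python
import math


def run_direction(step, xs, h0, reverse=False):
    """Run a recurrent step over xs and return states aligned with the input order."""
    order = range(len(xs) - 1, -1, -1) if reverse else range(len(xs))
    states = [None] * len(xs)
    h = h0
    for t in order:
        h = step(xs[t], h)   # the state carried over comes from the previously visited step
        states[t] = h
    return states


def bidirectional(step_fwd, step_bwd, xs, h0):
    """Concatenate forward and backward hidden states at every time step."""
    fwd = run_direction(step_fwd, xs, h0)                 # Eq. (7): x_1 -> x_T
    bwd = run_direction(step_bwd, xs, h0, reverse=True)   # Eq. (8): x_T -> x_1
    return [f + b for f, b in zip(fwd, bwd)]              # Eq. (9): list concatenation


def toy_step(x_t, h_prev):
    """Hypothetical stand-in for an LSTM cell (not the model used in the paper)."""
    return [math.tanh(0.5 * h + x) for h, x in zip(h_prev, x_t)]


# A single impulse at the first time step of a length-3 sequence.
outputs = bidirectional(toy_step, toy_step, [[1.0], [0.0], [0.0]], [0.0])
for t, h in enumerate(outputs):
    print(t, h)
```

The forward half of the last output still carries a trace of the impulse, while the backward half is zero there because the backward pass has not yet reached the impulse. This is why the concatenated state at each step reflects both earlier and later parts of the window.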

GRU network model

GRU is an improved variant of RNNs. Compared with LSTM networks, GRU introduces two main modifications: First, it merges the forget gate and input gate into a single gate called the update gate, alongside another gate termed the reset gate; second, it does not introduce an additional internal cell state $C,$ but instead establishes a direct linear dependency between the current hidden state $h_{t}$ and the previous hidden state $h_{t - 1}$ . The GRU architecture is more streamlined and has demonstrated excellent performance across various tasks.

The fundamental building block of the GRU network is referred to as the “GRU cell.” Each GRU cell comprises two primary gating mechanisms: the reset gate and the update gate. These gates regulate the flow of information and are implemented through distinct neural network layers and nonlinear activation functions. The architectural structure of the GRU model is illustrated in Figure 3.

FIG. 3.

Gated recurrent unit (GRU) model structure diagram (adapted from refs.³⁹,⁴⁰).

Reset Gate

The reset gate regulates how much the previous hidden state influences the candidate activation at the current time step.³⁹ It receives the current input vector $χ_{t}$ and the hidden state from the prior time step $h_{t - 1}$ as inputs, producing a gating vector with values ranging from 0 to 1. This vector modulates the impact of the previous hidden state. The mathematical formulation of the reset gate is given by the following:

r_{t} = σ (W_{r} \cdot [h_{t - 1}, χ_{t}] + b_{r})

(10)

Here, $σ$ represents the Sigmoid activation function, while $W_{r}$ and $b_{r}$ denote the weight matrix and bias vector of the reset gate, respectively.

Update Gate

The update gate determines how the current hidden state $h_{t}$ is divided between the previous hidden state $h_{t - 1}$ and the current candidate hidden state ${\tilde{h}}_{t}$. It takes the current input $χ_{t}$ and the preceding hidden state $h_{t - 1}$ as inputs and produces an output constrained to the interval [0, 1].⁴⁰ The specific formulation is as follows:

z_{t} = σ (W_{z} \cdot [h_{t - 1}, χ_{t}] + b_{z})

(11)

Candidate Hidden State Update

The candidate hidden state ${\tilde{h}}_{t}$ is a combination of the current input information and the previous hidden state. The influence of the previous hidden state is regulated by the reset gate to generate the candidate hidden state. The specific formulation is as follows:

{\tilde{h}}_{t} = \tanh (W \cdot [r_{t} * h_{t - 1}, χ_{t}] + b)

(12)

where $\tanh$ denotes the hyperbolic tangent activation function, and $W$ and $b$ represent the weight matrix and bias vector of the candidate hidden state, respectively.

Hidden State Update

The updated hidden state $h_{t}$ results from a weighted combination of the previous hidden state $h_{t - 1}$ and the candidate hidden state ${\tilde{h}}_{t}$ , where the update gate $z_{t}$ determines their respective contributions. The specific formula is as follows:

h_{t} = (1 - z_{t}) * h_{t - 1} + z_{t} * {\tilde{h}}_{t}

(13)
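Equations (10)–(13) translate directly into a single update function. The sketch below uses plain Python lists; the parameter names and the all-zero weights are assumptions made for illustration so that the output can be verified by hand, and the blending convention follows Eq. (13) as written.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def affine(W, z, b):
    """Return W @ z + b, with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, z)) + bias for row, bias in zip(W, b)]


def gru_step(x_t, h_prev, p):
    """One GRU time step following Eqs. (10)-(13)."""
    z_in = h_prev + x_t                                              # [h_{t-1}, x_t]
    r_t = [sigmoid(a) for a in affine(p["W_r"], z_in, p["b_r"])]     # reset gate, Eq. (10)
    z_t = [sigmoid(a) for a in affine(p["W_z"], z_in, p["b_z"])]     # update gate, Eq. (11)
    gated = [r * h for r, h in zip(r_t, h_prev)] + x_t               # [r_t * h_{t-1}, x_t]
    h_hat = [math.tanh(a) for a in affine(p["W"], gated, p["b"])]    # candidate, Eq. (12)
    # Eq. (13): blend the previous state and the candidate using the update gate.
    return [(1.0 - z) * h + z * g for z, h, g in zip(z_t, h_prev, h_hat)]


def zero_params(n_in, n_hidden):
    """All-zero parameters (illustrative only)."""
    shape = lambda: [[0.0] * (n_hidden + n_in) for _ in range(n_hidden)]
    return {
        "W_r": shape(), "b_r": [0.0] * n_hidden,
        "W_z": shape(), "b_z": [0.0] * n_hidden,
        "W": shape(), "b": [0.0] * n_hidden,
    }


# With zero weights both gates equal 0.5 and the candidate is 0,
# so the new hidden state is half of the previous one.
h_t = gru_step([1.0, 2.0, 3.0], [0.8, -0.4], zero_params(3, 2))
print(h_t)
```

Compared with the LSTM step, there is no separate cell state to carry: the hidden state is the only quantity passed between time steps, which is the source of the smaller parameter count per unit.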

Attention mechanism

The AM is a widely used technique in deep learning models, originally introduced in machine translation. It has since been extensively applied to various time series processing tasks and has become a crucial component for enhancing model performance. The core idea of attention lies in allowing the model to dynamically adjust attention weights for each input by evaluating the importance of different segments of the input sequence. This enables the model to more effectively capture long-range dependencies and salient features.⁴¹

Several forms of attention have been developed, including additive, multiplicative, and self-attention.⁴² This study uses the self-attention mechanism,⁴³ a variant that establishes dependencies among different positions within a single sequence. The core principle involves computing the pairwise correlations among all positions within the input sequence for the purpose of allocating importance weights to each position. When processing sequential data, self-attention can capture dependencies between any two positions in the sequence, thereby improving the model’s representation capability and predictive performance. The specific formulation of the self-attention mechanism is as follows:

e_{i j} = \frac{(Q K^{T})}{\sqrt{d_{k}}}

(14)

where $Q$ represents the query vector, $K$ denotes the key vector, and $d_{k}$ is the dimensionality of the key vector.
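Equation (14) gives only the raw scores. In the standard scaled dot-product form of self-attention, these scores are normalized with a softmax and used to weight a set of value vectors. The sketch below assumes that standard form; the value matrix `V`, the softmax step, and the function names are not spelled out in the text above and are included here as assumptions.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(a - m) for a in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors.

    Returns the attended outputs and the attention weights.
    """
    d_k = len(K[0])
    # Eq. (14): pairwise scores between every query position and every key position.
    scores = [
        [sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in K]
        for q_i in Q
    ]
    weights = [softmax(row) for row in scores]   # each row sums to 1 (assumed step)
    outputs = [
        [sum(w * v[c] for w, v in zip(w_row, V)) for c in range(len(V[0]))]
        for w_row in weights
    ]
    return outputs, weights


# Two identical keys receive equal weight, so the output is the mean of the values.
out, w = self_attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[2.0], [4.0]])
print(w)    # [[0.5, 0.5]]
print(out)  # [[3.0]]
```

When queries, keys, and values are all derived from the same sequence of hidden states, the weights indicate how strongly each time step attends to every other time step in the window.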

Uncertainty quantification

In engineering practice, data collection processes often encounter issues such as missing data or measurement errors, which exacerbate the uncertainty in model predictions.⁴⁴ The raw signals acquired by monitoring systems are typically accompanied by noise interference, causing point estimates in each prediction to vary. Both data-driven and model-driven prognostic methods inevitably face two types of uncertainties: epistemic uncertainty and aleatory uncertainty. Epistemic uncertainty reflects the limitations in understanding the model’s inherent properties, whereas aleatory uncertainty arises from noise and disturbances during data acquisition and transmission.⁴⁵

Assuming the model is expressed as $y = a e^{b x} + c e^{d x}$, with model parameters $θ = \{a, b, c, d\}$ and real-world responses $(x, y)$, the actual observations obtained, denoted as $(X_{i}, Y_{i})$, are often incomplete due to limitations in data acquisition and technical research. This leads to biased estimates of $θ$, which can be characterized by the probability distribution $P (θ | X, Y)$. Epistemic uncertainty primarily arises from the inherent limitations of the model structure and its parameters. Moreover, even with a large amount of observational data, measurement and data collection processes are usually affected by random noise (e.g., measurement errors). The influence of such random noise can be described by $P (θ | X, Y) = P (y = a e^{b x} + c e^{d x}; σ)$, where $σ$ denotes the noise variance. This stochastic interference is referred to as aleatory uncertainty,⁴⁶ as illustrated in Figure 4.

FIG. 4.

Sources of uncertainty (adapted from refs.⁴⁶).
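The Introduction mentions dropout-based uncertainty quantification. One common way to realize it is Monte Carlo dropout, in which dropout stays active at prediction time and the spread of repeated stochastic passes is read as an approximation of epistemic uncertainty. The sketch below illustrates the idea on a toy linear predictor. The predictor, the number of passes, and the normal-approximation interval (mean ± 1.96 standard deviations) are all assumptions for illustration and are not taken from the paper.

```python
import random
import statistics


def dropout_forward(features, weights, rate, rng):
    """One stochastic pass of a toy linear predictor with inverted dropout."""
    keep = 1.0 - rate
    total = 0.0
    for x, w in zip(features, weights):
        if rng.random() < keep:       # keep this unit with probability `keep`
            total += w * x / keep     # rescale so the expected output is unchanged
    return total


def mc_dropout_predict(features, weights, rate=0.2, passes=200, seed=0):
    """Return the sample mean and an approximate 95% interval over repeated passes."""
    rng = random.Random(seed)
    samples = [dropout_forward(features, weights, rate, rng) for _ in range(passes)]
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    return mean, (mean - 1.96 * std, mean + 1.96 * std)


# Toy inputs and weights (placeholders, not engine data).
mean, (low, high) = mc_dropout_predict([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(round(mean, 2), round(low, 2), round(high, 2))
```

A wide interval signals that the prediction is sensitive to which units are dropped. Note that this procedure does not by itself capture aleatory uncertainty, which stems from noise in the data.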

Building on the above modeling components and uncertainty handling, the next section presents the overall architecture and information flow of the proposed BiLSTM–GRU–Attention model.

Remaining Life Prediction Model Based on BiLSTM-GRU-Attention

BiLSTM–GRU–attention network architecture

This study proposes a hybrid network architecture combining BiLSTM and GRU enhanced by an AM, as depicted in Figure 5. The architecture is composed of four sequential layers. The first is a BiLSTM layer that extracts bidirectional temporal dependencies. The second uses a GRU to enhance the learning of sequential patterns while reducing computational complexity. The third introduces an AM that dynamically allocates weights to significant temporal features, and the final FC layer generates the output prediction. Guided by this rationale, the layered design proceeds from bidirectional within-window encoding to cross-scale temporal integration and data-dependent weighting, culminating in a robust regression of RUL.

FIG. 5.

Bidirectional long short-term memory (BiLSTM)-gated recurrent unit (GRU)-attention prediction model flowchart.

Initially, in the context of multivariate time series prediction, continuous sequences are generated by setting a specific cycle interval. A sliding window approach is then applied to produce multiple groups of multivariate time series, thereby constructing a time series training set for various devices. After normalization and sample construction, the BiLSTM aggregates forward and backward information at each time step to extract more complete degradation features, alleviating gradient-related issues in long-sequence modeling.

Subsequently, the outputs from the BiLSTM layer are input into the GRU layer to further extract complex temporal features. The GRU layer, through its update and reset gates, effectively captures long-term dependencies within the sequence and alleviates the gradient vanishing problem that may occur in LSTM, thereby enhancing the model’s feature representation capability.

Following this, the attention layer performs weighted processing on the outputs of both the BiLSTM and GRU layers. By computing importance weights for each time step, the AM enables the model to emphasize the most relevant features and create a feature representation that accentuates key information, leading to improved prediction accuracy.⁴⁷ Finally, the features processed by the AM are fed into an FC layer for further dimensionality reduction. Through two successive linear transformations, the feature dimension is progressively reduced while retaining essential information, enabling accurate multivariate time series forecasting.

Evaluation metrics for RUL prediction models

To compare the RUL prediction performance among different models, two commonly used evaluation metrics in regression tasks,⁴⁸ namely root mean squared error (RMSE) and the score function,⁴⁹ are adopted as rating indicators. The RMSE is commonly utilized to evaluate the extent of deviation between estimated and actual values, and is computed using the following formulation:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}}

(15)

where ${\hat{y}}_{i}$ denotes the predicted value, $y_{i}$ is the actual value, and $n$ represents the total number of samples. RMSE provides an intuitive measure of the prediction error magnitude; a smaller RMSE value indicates better predictive performance. The score function is calculated as follows:

S = \begin{cases} \sum_{i = 1}^{n} \left( e^{- \frac{h}{13}} - 1 \right), & h < 0 \\ \sum_{i = 1}^{n} \left( e^{\frac{h}{10}} - 1 \right), & h \geq 0 \end{cases}

(16)

where $n$ denotes the total number of engines in the test dataset, and $S$ refers to the evaluated score. Let $h = R_{prediction} - R_{actual}$, where $R_{prediction}$ and $R_{actual}$ represent the predicted and true RUL, respectively. The scoring function imposes an asymmetric penalty, assigning lower penalties when the predicted RUL underestimates the actual value and higher penalties when it overestimates. This design discourages late predictions of engine failure, which could result in severe losses of life and property.⁵⁰ A lower score indicates better prediction performance.
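Both metrics are straightforward to compute. The following minimal sketch implements Eq. (15) and the per-engine form of Eq. (16), with $h$ evaluated separately for each engine before summing; the function names are illustrative.

```python
import math


def rmse(y_pred, y_true):
    """Root mean squared error, Eq. (15)."""
    n = len(y_true)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(y_pred, y_true)) / n)


def score(y_pred, y_true):
    """Asymmetric score, Eq. (16), with h = predicted RUL - actual RUL per engine."""
    total = 0.0
    for p, a in zip(y_pred, y_true):
        h = p - a
        if h < 0:
            total += math.exp(-h / 13.0) - 1.0   # early prediction: milder penalty
        else:
            total += math.exp(h / 10.0) - 1.0    # late prediction: harsher penalty
    return total


# The same absolute error of 10 cycles is penalized differently by the score.
early = score([40.0], [50.0])   # RUL underestimated by 10 cycles
late = score([60.0], [50.0])    # RUL overestimated by 10 cycles
print(rmse([40.0], [50.0]), rmse([60.0], [50.0]))
print(early < late)  # True
```

RMSE treats both errors identically, whereas the score grows faster for overestimates, which matches the maintenance-safety motivation described above.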

Results and Discussion

Dataset

Due to the confidentiality requirements and high acquisition costs associated with the aerospace industry, this study uses the publicly available C-MAPSS turbofan engine degradation simulation dataset released by NASA.⁵¹ Owing to its practical relevance and variability in testing environments, the dataset, made available in 2008, has become a benchmark resource widely adopted in RUL-related studies across the global academic community. It records critical condition monitoring data and the corresponding operational parameters of turbofan engines throughout their operational cycles until failure. Experts and scholars have used it for research on RUL prediction,⁵² performance degradation analysis, and fault diagnosis, thereby validating the effectiveness of proposed models or algorithms.⁵³

All four subsets within the C-MAPSS dataset (FD001–FD004) share a consistent set of parameters, and the corresponding raw data are made available in standardized plain text (.txt) files. Each subset is organized as an $n \times 26$ matrix, where $n$ denotes the number of data points within the subset, and each row corresponds to the measurements collected during one operational cycle of a single engine. Specifically, the first column represents the engine unit number, the second column denotes the flight cycle index, columns 3–5 correspond to three operating conditions, and columns 6–26 contain readings from 21 different sensors. As summarized in Table 2, the sensor data include information on engine operating parameters, vibration, temperature, etc. provided by the 21 sensors.

Table 2.

Column parameter names

Column number	1	2	3–5	6–26
Parameter name	Engine ID	Current cycle number of the engine	Operating conditions	Sensor data

As illustrated in Figure 6, os_1–3 denote the three operational conditions, while sm_1–21 represent the 21 sensor measurements. It was observed that the readings from sensors 1, 5, 6, 10, 16, 18, and 19 remain constant throughout the engine’s operational life cycle and exhibit no correlation with the degradation process. Therefore, these seven sensor signals were excluded from the analysis. The remaining 14 sensor measurements, in conjunction with the three operational condition parameters, were selected as input features for subsequent model evaluation and training.

FIG. 6.

Sensor data.
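The column layout in Table 2 and the sensor selection described above can be sketched as follows. This assumes whitespace-separated rows with 26 values each; the function names and the synthetic row are illustrative and do not reproduce actual dataset values.

```python
# Sensors reported as constant over the engine life cycle (1-based indices).
CONSTANT_SENSORS = {1, 5, 6, 10, 16, 18, 19}


def parse_row(line):
    """Split one 26-column row into engine ID, cycle, settings, and sensors."""
    values = line.split()
    if len(values) != 26:
        raise ValueError("expected 26 columns, got %d" % len(values))
    return {
        "unit": int(values[0]),                    # column 1: engine ID
        "cycle": int(values[1]),                   # column 2: cycle number
        "os": [float(v) for v in values[2:5]],     # columns 3-5: operating conditions
        "sm": [float(v) for v in values[5:26]],    # columns 6-26: 21 sensors
    }


def select_features(row):
    """Keep the 3 operating conditions and the 14 non-constant sensors."""
    kept = [row["sm"][k - 1] for k in range(1, 22) if k not in CONSTANT_SENSORS]
    return row["os"] + kept


# Synthetic row: sensor k simply holds the value k, to make the selection visible.
fields = ["1", "1", "0.1", "0.2", "0.3"] + [str(float(k)) for k in range(1, 22)]
features = select_features(parse_row(" ".join(fields)))
print(len(features))  # 17
print(features[3:])
```

The resulting 17-dimensional vector per cycle (3 operating conditions plus 14 sensors) is the input feature set described in the text.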

Data preprocessing

Differences in units and scales can cause large-magnitude features to dominate training and mask others. To remove dimensional/scale effects and improve stability and convergence, we apply min–max normalization to [0,1] for all features.⁵⁴ The corresponding normalization procedure is mathematically defined by Equation (17) in this article:

X_{i, j}^{'} (t) = \frac{X_{i, j} (t) - \min (X_{, j})}{\max (X_{, j}) - \min (X_{, j})}

(17)

where $X_{i, j}^{'} (t)$ denotes the normalized value of the raw sensor data $X_{i, j} (t)$ corresponding to the $j$th sensor of the $i$th engine at time $t$. The terms $\min (X_{, j})$ and $\max (X_{, j})$ represent the minimum and maximum values, respectively, of the $j$th sensor across all engines and all time steps.
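Equation (17) can be implemented as a fit step that records per-feature minima and maxima, followed by an apply step that reuses them. In the sketch below, the handling of constant columns (mapped to 0.0 to avoid division by zero) is an assumption, and the numbers are placeholders.

```python
def fit_min_max(rows):
    """Compute the per-column minimum and maximum over a list of feature rows."""
    columns = list(zip(*rows))
    return [min(c) for c in columns], [max(c) for c in columns]


def apply_min_max(rows, mins, maxs):
    """Scale rows with previously fitted statistics, Eq. (17)."""
    scaled = []
    for row in rows:
        scaled.append([
            (x - lo) / (hi - lo) if hi > lo else 0.0   # constant column: assumed fallback
            for x, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


# Fit on placeholder "training" rows, then reuse the statistics on another row.
train_rows = [[0.0, 5.0], [10.0, 5.0], [5.0, 5.0]]
mins, maxs = fit_min_max(train_rows)
print(apply_min_max(train_rows, mins, maxs))   # [[0.0, 0.0], [1.0, 0.0], [0.5, 0.0]]
print(apply_min_max([[20.0, 5.0]], mins, maxs))  # [[2.0, 0.0]]
```

The second call shows that when the statistics come from one set and are applied to another, scaled values can fall outside [0, 1]. This is expected behavior and not an error.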

To preclude information leakage, all splits were conducted strictly at the engine ID level. We used the official C-MAPSS train/test partition and did not reassign engines across sets; thus, no engine appears in more than one set. Within the training set, a validation set was created by randomly selecting a fixed proportion of engines (engine-level split; fixed random seed) and reserving the remaining engines for training. Sliding-window sequences were then generated independently within each engine, ensuring that windows do not cross engine boundaries. Feature normalization (min–max scaling) was fit on the training set only and subsequently applied to the validation and test sets using the same parameters. This protocol eliminates cross-set contamination arising from windowing or scaling and yields a credible evaluation of generalization performance.
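The leakage-avoidance protocol above can be sketched as an engine-level split followed by per-engine windowing. The validation fraction (0.2), the seed, and the function names are hypothetical, since the text states only that a fixed proportion and a fixed seed were used.

```python
import random


def split_engines(engine_ids, val_fraction, seed):
    """Split engine IDs into disjoint training and validation sets."""
    ids = sorted(set(engine_ids))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(ids)
    n_val = max(1, int(round(len(ids) * val_fraction)))
    return set(ids[n_val:]), set(ids[:n_val])


def sliding_windows(series, window):
    """All contiguous windows of the given length within one engine's series."""
    return [series[s:s + window] for s in range(len(series) - window + 1)]


def windows_for(data_by_engine, engine_ids, window):
    """Generate windows engine by engine so that none crosses an engine boundary."""
    out = []
    for eid in sorted(engine_ids):
        out.extend(sliding_windows(data_by_engine[eid], window))
    return out


# Placeholder data: 10 engines, each with 35 cycles of a single dummy feature.
data = {eid: [[float(c)] for c in range(35)] for eid in range(1, 11)}
train_ids, val_ids = split_engines(data.keys(), val_fraction=0.2, seed=42)
train_windows = windows_for(data, train_ids, window=30)
print(len(train_ids), len(val_ids), len(train_windows))  # 8 2 48
```

Each engine with 35 cycles yields 35 − 30 + 1 = 6 windows, so the 8 training engines produce 48 windows. An engine with fewer cycles than the window length yields none in this sketch.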

Model parameter settings

After normalization, the input data derived from the C-MAPSS dataset are scaled to the range of [0,1]. The normalized data are subsequently processed using a sliding time window to generate the input sequences for the network. During model training, to mitigate the risk of overfitting, the dropout technique is used. Specifically, dropout randomly deactivates a subset of neurons during each training iteration, thereby compelling the network to learn more robust and generalizable feature representations. This approach reduces the model’s dependency on specific training data and effectively enhances both its generalization capability and robustness.

To ensure the reliability of the experimental results, the model was trained and validated multiple times, and the final hyperparameter configuration is summarized in Table 3. The learning rate was set to 0.01, which provided a balance between convergence speed and stability, avoiding both slow convergence at lower rates and instability at higher ones. The dropout rate was fixed at 0.2 to mitigate overfitting while preserving model capacity for complex feature learning. A batch size of 100 was used to balance computational efficiency and generalization. The number of training epochs was set to 32 for FD001 and 64 for FD003, determined by observed convergence behavior to ensure sufficient training without overfitting. The sliding window length was set to 30 cycles, enabling the capture of key temporal dependencies while reducing noise. These settings were validated through repeated experiments, confirming stable and robust performance of the proposed framework.

Table 3.

Model parameter settings

Dataset	Batch size	Epoch	Winsize	Dropout rate	Learning rate
FD001	100	32	30	0.2	0.01
FD003	100	64	30	0.2	0.01
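For reproducibility, the settings in Table 3 can be held in a small configuration object. The sketch below is one possible layout; the class name `TrainConfig` and the dictionary `CONFIGS` are hypothetical, and only the numeric values come from Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters from Table 3 (field names are illustrative)."""
    batch_size: int
    epochs: int
    window_size: int
    dropout_rate: float
    learning_rate: float


CONFIGS = {
    "FD001": TrainConfig(batch_size=100, epochs=32, window_size=30,
                         dropout_rate=0.2, learning_rate=0.01),
    "FD003": TrainConfig(batch_size=100, epochs=64, window_size=30,
                         dropout_rate=0.2, learning_rate=0.01),
}
```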

Complexity comparison

To assess the complexity of the proposed BiLSTM-GRU-Attention prediction model relative to the LSTM, BiLSTM, and GRU models, the number of parameters of the four models was compared. The results are presented in Table 4. As shown in Table 4, the BiLSTM-GRU-Attention model has a greater number of parameters (195,937) than the LSTM (13,600), BiLSTM (27,200), and GRU (15,926) models.

Table 4.

Parameter comparison across different models

Prediction methods	Parameter quantity
LSTM	13600
BiLSTM	27200
GRU	15926
BiLSTM-GRU-Attention	195937

BiLSTM, bidirectional long short-term memory; GRU, gated recurrent unit; LSTM, long short-term memory.
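To make the parameter counts easier to interpret, the sketch below applies the standard formula for a single LSTM layer with one bias vector per gate, 4·h·(d + h + 1), where d is the input dimension and h the number of hidden units. The layer sizes are not stated in this section; d = 17 and h = 50 are assumed values, chosen only because they reproduce the LSTM and BiLSTM counts in Table 4. The function names are hypothetical.

```python
def lstm_params(input_dim, hidden):
    """Trainable parameters of one LSTM layer (four gates, one bias each)."""
    return 4 * hidden * (input_dim + hidden + 1)


def bilstm_params(input_dim, hidden):
    """A bidirectional layer holds two independent LSTMs of the same size."""
    return 2 * lstm_params(input_dim, hidden)


# Assumed sizes (not given in the article): 17 input features, 50 hidden units.
print(lstm_params(17, 50))    # 13600
print(bilstm_params(17, 50))  # 27200
```

The doubling from LSTM to BiLSTM follows directly from running one recurrent layer in each direction. The much larger count of the hybrid model reflects its stacked recurrent layers and attention module.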

Analysis of prediction results

To thoroughly evaluate the proposed model’s performance, ablation studies were conducted on the BiLSTM-GRU-Attention framework. Specifically, the BiLSTM, BiLSTM-Attention, and BiLSTM-GRU models were trained individually, and their prediction performance was compared with that of the full model. The FD001 and FD003 subsets of the C-MAPSS dataset, each characterized by a single operating condition, were chosen for several experimental trials. One representative set of RUL prediction results is presented in Figure 7 and Table 5. In Figure 7, the x-axis represents sequential sample indices, and the y-axis represents the normalized RUL values. Due to the high sampling density, some tick labels on the x-axis may overlap. All curves are plotted using the original data, and slight smoothing is applied only for visual clarity.

FIG. 7.

Ablation experiment prediction results.

Ablation Experiment

The final results demonstrate that the proposed BiLSTM-GRU-Attention-based RUL prediction method outperforms the BiLSTM, BiLSTM-Attention, and BiLSTM-GRU models in terms of both RMSE and score metrics. As shown in Table 5, for the FD001 and FD003 datasets, the predicted values generated by the proposed method are the closest to the actual values, yielding the lowest mean absolute error and maximum absolute error among the four models. Specifically, for the FD001 dataset, the BiLSTM-GRU-Attention model reduces the RMSE by 30.60% compared with the BiLSTM model, and by 8.81% and 10.24% compared with the BiLSTM-Attention and BiLSTM-GRU models, respectively. For the FD003 dataset, the RMSE is reduced by 37.96% relative to the BiLSTM model, and by 7.48% and 13.64% relative to the BiLSTM-Attention and BiLSTM-GRU models, respectively.

Table 5.

Root mean squared error and score results of ablation experiments on a single-condition dataset

Dataset	BiLSTM		BiLSTM-Attention		BiLSTM-GRU		BiLSTM-GRU-Attention
Dataset	RMSE	Score	RMSE	Score	RMSE	Score	RMSE	Score
FD001	20.59	374.53	15.67	326.15	15.92	346.85	13.12	273.69
FD003	22.34	722.56	14.98	324.13	16.05	332.37	13.86	314.07

RMSE, root mean squared error.
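The relative RMSE reductions quoted in the text follow from the table entries as (baseline − proposed) / baseline. The short sketch below works through the FD003 row of Table 5; the function and variable names are illustrative.

```python
def relative_reduction(baseline, proposed):
    """Percentage by which `proposed` lowers the error relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


# FD003 RMSE values from Table 5.
fd003_baselines = {"BiLSTM": 22.34, "BiLSTM-Attention": 14.98, "BiLSTM-GRU": 16.05}
fd003_proposed = 13.86

reductions = {
    name: round(relative_reduction(value, fd003_proposed), 2)
    for name, value in fd003_baselines.items()
}
print(reductions)
# {'BiLSTM': 37.96, 'BiLSTM-Attention': 7.48, 'BiLSTM-GRU': 13.64}
```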

To visually illustrate the comparative results of the ablation experiments, a bar chart is presented in Figure 8. As observed, the BiLSTM-GRU-Attention model consistently yields lower RMSE and score values than the other three models on both datasets, demonstrating a clear improvement in predictive accuracy. The following subsections examine the approach further through uncertainty quantification and comparison with other methods.

FIG. 8.

Bar chart comparing the ablation experiment results.

Uncertainty Quantification

Predictive uncertainty is estimated via Monte Carlo dropout: dropout is kept active at inference, the mean of the stochastic forward passes serves as the point estimate, and the 95% interval across passes quantifies confidence. Tighter intervals on FD001 indicate higher model certainty on that dataset.
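The idea can be sketched with a toy model in plain Python. This is not the trained network: the linear model, the function names, the number of passes, and the use of empirical 2.5th and 97.5th percentiles for the 95% interval are all assumptions made for illustration.

```python
import random
import statistics


def stochastic_forward(x, weights, rate, rng):
    """One forward pass of a toy linear model with dropout left active."""
    keep = 1.0 - rate
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:      # this input unit survives the pass
            total += wi * xi / keep  # inverted-dropout rescaling
    return total


def mc_dropout_predict(x, weights, rate=0.2, passes=200, seed=0):
    """Return (mean, lower, upper) over repeated stochastic passes."""
    rng = random.Random(seed)
    samples = sorted(
        stochastic_forward(x, weights, rate, rng) for _ in range(passes)
    )
    mean = statistics.fmean(samples)
    lower = samples[int(0.025 * (passes - 1))]  # empirical 2.5th percentile
    upper = samples[int(0.975 * (passes - 1))]  # empirical 97.5th percentile
    return mean, lower, upper


x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.25, 0.1, 0.05]
mean, lower, upper = mc_dropout_predict(x, w)
```

A narrow gap between `lower` and `upper` means the passes agree closely. Setting `rate=0.0` collapses the interval to a single value, which is the deterministic prediction.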

As illustrated in Figure 9, the uncertainty in the FD001 dataset gradually decreases, indicating that the model’s comprehension of the data improves with an increasing number of samples. The narrowing of the confidence interval suggests more stable and reliable predictions over time. In contrast, the FD003 dataset exhibits pronounced fluctuations in predictive uncertainty, with considerable instability persisting even at higher sample indices. This behavior may be attributed to higher levels of noise, greater environmental variability, or increased heterogeneity in the data characteristics. These complexities hinder the model’s ability to accurately learn the underlying patterns, resulting in less consistent and more volatile predictions.

FIG. 9.

Uncertainty quantification results.

Comparative Experiment

To rigorously evaluate the proposed prediction approach, a comparative analysis was conducted among the proposed BiLSTM-GRU-Attention model, conventional architectures such as BiLSTM and GRU, and several methods reported in the literature. All models were evaluated on the same single-operating-condition subsets, FD001 and FD003, of the C-MAPSS dataset, and the RMSE and score of each method were recorded for comparison.
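For reference, the two metrics can be computed as in the sketch below. It assumes the asymmetric scoring function commonly used with C-MAPSS, with constants 13 for early and 10 for late predictions; if the definition given earlier in this article differs, that definition takes precedence. The function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted RUL."""
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def cmapss_score(y_true, y_pred):
    """Asymmetric score: late predictions (d > 0) are penalized more heavily."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        d = p - t  # positive when the predicted RUL is too optimistic
        if d < 0:
            total += math.exp(-d / 13.0) - 1.0
        else:
            total += math.exp(d / 10.0) - 1.0
    return total


true_rul = [100.0, 50.0]
pred_rul = [103.0, 46.0]
print(rmse(true_rul, pred_rul), cmapss_score(true_rul, pred_rul))
```

Lower is better for both metrics. Unlike RMSE, the score is a sum over test engines rather than a mean, so it grows with the size of the test set.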

As shown in Table 6, the proposed BiLSTM-GRU-Attention model achieves lower score and RMSE values than the BiLSTM and GRU models on the FD001 and FD003 subsets. It also outperforms the other approaches reported in the literature on both evaluation metrics across the same subsets. These results demonstrate that the proposed multilayer architecture, with enhanced feature extraction capability, achieves superior predictive performance compared with the conventional GRU and BiLSTM models.

Table 6.

Evaluation of experimental outcomes across various methods

Model	FD001		FD003
Model	RMSE	Score	RMSE	Score
BiLSTM	20.59	374.53	22.34	722.56
GRU	16.31	529.11	15.59	664.51
BiLSTM-GRU-Attention	13.12	273.69	13.86	314.07
LSTM⁵⁵	16.14	338.00	16.18	852.00
RF	17.91	479.95	20.27	711.13
GB⁵⁶	15.67	474.01	16.84	576.72

Results marked as “proposed” are obtained from our experiments, whereas all other baseline results are either (1) reproduced under identical data split and normalization settings using open implementations, or (2) directly cited from the corresponding literature.⁵⁰,⁵¹ All methods were evaluated using the same metrics and datasets to ensure fair comparison.

Conclusions

This study investigates and addresses key shortcomings in conventional RUL prediction approaches for aircraft engines, focusing on enhancing feature extraction, improving long-sequence data modeling, and increasing model interpretability. To address these issues, an optimized hybrid model integrating BiLSTM and GRU with an attention mechanism (AM) was proposed to improve the accuracy of RUL prediction. The study draws the following three conclusions:

(1) The BiLSTM-GRU-Attention hybrid model integrates BiLSTM’s bidirectional feature extraction capability, the AM’s capability to select critical features, and GRU’s efficient handling of long-term dependencies. This integration facilitates the extraction and combination of diverse feature representations across different network layers, thereby constructing a richer and more comprehensive feature space. Furthermore, the model reduces computational complexity and memory requirements, which improves both training and inference efficiency; it also exhibits enhanced generalization ability and adaptability.

(2) By integrating the AM, the proposed model dynamically emphasizes the most critical features, thereby further improving prediction accuracy. Results from ablation studies indicate that models incorporating attention consistently achieve lower RMSE and score values than their counterparts without attention. This demonstrates that attention-enabled models effectively identify and prioritize the features with the greatest influence on prediction outcomes, enhancing sensitivity to salient information while mitigating the impact of irrelevant or secondary features.

(3) Experimental results show that the proposed BiLSTM-GRU-Attention model attains enhanced prediction accuracy by significantly reducing errors in RUL estimation for aircraft engines. Compared with the BiLSTM, BiLSTM-Attention, and BiLSTM-GRU models, the BiLSTM-GRU-Attention model achieves superior performance on the RMSE and score metrics.

Although the proposed BiLSTM-GRU-Attention framework demonstrates significant progress in feature extraction, long-sequence modeling, and prediction accuracy, certain limitations remain. The experiments were conducted solely on the C-MAPSS dataset, which may not fully capture the complexity and diversity of real-world operating conditions. Moreover, the uncertainty analysis is primarily qualitative and has not yet systematically incorporated quantitative reliability metrics. Future research will focus on extending the framework to more representative and diverse datasets, integrating more rigorous quantitative uncertainty evaluation methods, and further refining the theoretical foundations of the hybrid architecture, with the aim of enhancing the robustness and practical applicability of RUL prediction models.

Authors’ Contributions

Q.H. wrote the first draft of the article, contributed to the conception of the study, and worked on the coding of tables and figures. X.G. contributed to the design of the study and helped perform the analysis with constructive discussions. All the authors read the article and approved the final version of the article.

Footnotes

Author Disclosure Statement

The authors have no relevant financial or nonfinancial interests to disclose.

Funding Information

This article was supported by the Beijing Municipal Education Commission Research Plan General Project (grant number: KM202411232007).

Data Sharing Agreement

The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

References

1. Guo, Liu. Prediction method of remaining useful life of aero-engine based on long sequence. J. Beijing Univ. Aeronaut. Astronaut, 2024; 50(3):774–784; doi: 10.13700/j.bh.1001-5965.2022.0354

2. Lei, Li, Guo, et al. Machinery health prognostics: A systematic review from data acquisition to RUL prediction. Mech. Syst. Signal Process, 2018; 104:799–834; doi: 10.1016/j.ymssp.2017.11.016

3. […], Sun, Hou. Remaining useful life prediction of aero-engine based on feature on attention mechanism and resnet. Comput. Appl. Softw, 2025; 42(3):22–28. 47.

4. Zheng, Du, Zhang, et al. Lifetime prediction analysis of proton exchange membrane fuel cells based on empirical mode decomposition-temporal convolutional network. Batteries (Basel), 2025; 11(6):226.

5. Cubillo, Perinpanayagam, Esperon Miguez. A review of physics-based models in prognostics: Application to gears and bearings of rotating machinery. Adv. Mech. Eng, 2016; 8(8); doi: 10.1177/1687814016664660

6. […], Liu, Li, et al. Fatigue crack growth prediction model under variable amplitude loading conditions. Int. J. Damage Mech, 2021; 30(9):1315–1326; doi: 10.1177/1056789521998737

7. Zhao, Addepalli. Remaining useful life prediction using deep learning approaches: A review. Procedia Manuf, 2020; 49:81–88.

8. Zhao, Li, Yang, et al. Remaining useful life prediction based on attention mechanism CNN-LSTM. Rolling Stock, 2022; 60:1–7.

9. Zhang, Yang, Liu. Servo system state prediction algorithm based on deep learning. Comput. Appl. Softw, 2019; 36(3):236–242.

10. […], Wang, Zhang, et al. Residual life prediction of low-voltage circuit breaker thermal trip based on the Wiener process. Energies (Basel), 2024; 17(5):1189.

11. Wang, Hu, Ren, et al. Performance degradation modeling and remaining useful life prediction for aero-engine based on nonlinear Wiener process. Acta Aeronaut. Astronaut. Sin, 2020; 41(2):195–205.

12. Liu, Bi, Li, et al. Two-stage reliability assessment of wind turbine blades considering random failure thresholds. Acta Energ. Solar. Sin, 2024; 45(12):269–276; doi: 10.19912/j.0254-0096.tynxb.2023-1295

13. Singh. A text independent speaker identification system using ANN, RNN, and CNN classification technique. Multimed Tools Appl, 2023; 83(16):48105–48117.

14. Jiahao, Qingyu, Yuelin, et al. Research on Whale Optimization Algorithm-bidirectional Long-short-term Memory Neural Network for Prediction of Machinery Remaining Useful Life of Circuit Breaker. High Volt. Eng, 2024; 50(1):250–262.

15. Ren, Cheng, Wang, et al. Multiscale dense gate recurrent unit networks for bearing remaining useful life prediction. Future Gener. Comput. Syst, 2019; 94:601–609; doi: 10.1016/j.future.2018.12.009

16. […], Ding, Sun. Remaining useful life estimation in prognostics using deep convolution neural networks. Reliab. Eng. Syst. Saf, 2018; 172:1–11.

17. Remadna, Terrissa, Zemouri, et al. (2020) Leveraging the power of the combination of CNN and bidirectional LSTM networks for aircraft engine RUL estimation, in: Proc. Prognostics and Health Management Conference, IEEE, Besançon, France, pp. 116–121.

18. Liu, Liu, Jia, et al. Remaining useful life prediction using a novel feature-attention-based end-to-end approach. IEEE Trans Ind Inf, 2021; 17(2):1197–1207; doi: 10.1109/TII.2020.2983760

19. Lin, Yu, Wang, et al. Remaining useful life prediction in prognostics using multi-scale sequence and long short-term memory network. J. Comput. Sci, 2022; 57:101508.

20. Yan, Ma, Huang, et al. Two-stage physics-based Wiener process models for online RUL prediction in field vibration data. Mech. Syst. Signal Process, 2021; 152:107378; doi: 10.1016/j.ymssp.2020.107378

21. Liu, Cai, Yuan, et al. A hybrid multi-stage methodology for remaining useful life prediction of control system: Subsea christmas tree as a case study. Expert Syst. Appl, 2023; 215:119335; doi: 10.1016/j.eswa.2022.119335

22. Blundell, Cornebise, Kavukcuoglu, et al. (2015) Weight uncertainty in neural network, in: Bach, Blei (eds), Proceedings of the 32nd International Conference on Machine Learning, JMLR.org, Lille, France, pp. 1613–1622.

23. Al-Ameri, Al-Shammari, Castiglione, et al. Student academic success prediction using learning management multimedia data with convoluted features and ensemble model. J Data and Information Quality, 2025; 17(3):1–16; doi: 10.1145/3687268

24. Altamimi, Alrowais, Karamti, et al. An improved skin lesion detection solution using multi-step preprocessing features and NASNet transfer learning model. Image Vis Comput, 2024; 144:104969; doi: 10.1016/j.imavis.2024.104969

25. Umer, Alabdulqader, Alarfaj, et al. Cyberbullying detection using PCA extracted GLOVE features and RoBERTaNet transformer learning model. IEEE Trans Comput Soc Syst, 2025; 12(5):3881–3890; doi: 10.1109/TCSS.2024.3422185

26. Altamimi, Umer, Hanif, et al. Employing Siamese MaLSTM model and ELMO word embedding for Quora duplicate questions detection. IEEE Access, 2024; 12:29072–29082; doi: 10.1109/ACCESS.2024.3367978

27. Ashraf, Chen, Innab, et al. Novel 3-D deep neural network architecture for crop classification using remote sensing-based hyperspectral images. IEEE J Sel Top Appl Earth Observations Remote Sensing, 2024; 17:12649–12665; doi: 10.1109/JSTARS.2024.3422078

28. Zhang, He, Yang, et al. High-speed train group tracking trajectory prediction method based on LSTM-KF model. J. Transp. Eng, 2024; 24(3):296–310; doi: 10.19818/j.cnki.1671-1637.2024.03.021

29. […], Qi, Zhou, et al. Trajectory planning and power allocation of UAV swarm with sensing-communication-computing integration. J. Xi’an Univ. Electron. Sci. Technol, 2023; 50(3):61–74; doi: 10.19665/j.issn1001-2400.2023.03.006

30. […], Chen, Bao. Consumer finance risk detection model based on kNN-Smote-LSTM: A case study of credit card fraud detection. Syst. Sci. Math, 2021; 41(2):481–498.

31. […], Duan, Hu, et al. Wind direction prediction based on feature decomposition and Bi-LSTM-Attention model. Electr. Power Sci. Eng, 2024; 40(8):63–69.

32. […], Ma, Yu, et al. Long-short term memory networks aided fault detection of power facilities. Intell. Syst. Appl, 2024; 22:200395.

33. Yadav, Khargotra, Lee, et al. Novel applications of various neural network models for prediction of photovoltaic system power under outdoor condition of mountainous region. Sustain. Energy Grids Netw, 2024; 38:101318.

34. Qin, Qin, Jiang, et al. Forecasting carbon price with attention mechanism and bidirectional long short-term memory network. Energy, 2024; 299:131410.

35. Alizadegan, Rashidi Malki, Radmehr, et al. Comparative study of long short-term memory (LSTM), bidirectional LSTM, and traditional machine learning approaches for energy consumption prediction. Energy Explor. Exploit, 2025; 43(1):281–301.

36. Zhao, Jiang, Lin. Remaining life prediction of rolling bearing based on CNN-BiLSTM model with attention mechanism. J. Mech. Electr. Eng, 2021; 38:1253–1260.

37. Olatinwo, Abu-Mahfouz, Myburgh. Mental Disorder Assessment in IoT-Enabled WBAN Systems with Dimensionality Reduction and Deep Learning. JSAN, 2025; 14(3):49.

38. Unlu, Peña. Comparative analysis of hybrid deep learning models for electricity load forecasting during extreme weather. Energies (Basel), 2025; 18(12):3068.

39. Zhang, Ren, Liu, et al. Federated deep reinforcement learning for varying-scale multi-energy microgrids energy management considering comprehensive security. Appl. Energy, 2025; 380:125072.

40. Kansal, Jain, Biswas, et al. SuspAct: Novel suspicious activity prediction based on deep learning in the real-time environment. Neural Comput. Appl, 2024:1–14.

41. Chen, Guo, Chen, et al. Prediction of remaining useful life of aero-engine based on residual NLSTM network and attention mechanism. J. Aeronaut. Power, 2023; 38(5):1176–1184; doi: 10.13224/j.cnki.jasp.20210728

42. Xia, Yang, Xue. Large-scale network document sentiment analysis using self-attention mechanism. Comput. Eng. Des., 2021; 42(9):2642–2648; doi: 10.16208/j.issn1000-7024.2021.09.032

43. Guo, Xu, Guo. Aero-engine life prediction algorithm optimized by self-attention based on improved GRU. J. Aeronaut. Power, 2024; 39(12):447–457; doi: 10.13224/j.cnki.jasp.20220984

44. Zhang, Fang. Optimization Control of Complex Surface CNC Laser Cutting Tool Path Under Improved C-means Clustering Algorithm. J. Baoding Univ, 2025; 38(3):108–116; doi: 10.13747/j.cnki.bdxyxb.2025.03.013

45. Russell, Reale. Multivariate uncertainty in deep learning. IEEE Trans Neural Netw Learn Syst, 2022; 33(12):7937–7943; doi: 10.1109/TNNLS.2021.3086757

46. Jiangyan, Ma, Wu. A regularized constrained two-stream convolution augmented transformer for aircraft engine remaining useful life prediction. Eng. Appl. Artif. Intell, 2024; 133:108161.

47. […], Wang, Liu. (2024) Deep learning empowered blockchain transaction prediction and anomaly detection, in: Proc. Blockchain and Web3 Technology Innovation and Application Exchange Conf., Springer Nature Singapore, Singapore, pp. 50–61.

48. Peng, Sun, Su, et al. An improved similarity-based RUL prediction method considering degradation degree of multiple condition monitoring parameters for aero-engines. Meas. Sci. Technol., 2024; 35(12):126213.

49. Yang, Liu, Yu. Sequence recommendation based on user interest boundary and learnable filter. J. Northeast Normal Univ. (Nat. Sci. Ed.), 2025; 57(2):82–89.

50. Wang, Peng, Liu. Prediction of aero-engine remaining useful life combined with fault information. Machines (Basel), 2022; 10(10):927; doi: 10.3390/machines10100927

51. Ensarioğlu, İnkaya, Emel. Remaining useful life estimation of turbofan engines with deep learning using change-point detection based labeling and feature engineering. Appl. Sci, 2023; 13(21):11893.

52. Muneer, Taib, Fati, et al. Deep-learning based prognosis approach for remaining useful life prediction of turbofan engine. Symmetry (Basel), 2021; 13(10):1861; doi: 10.3390/sym13101861

53. Chao, Kulkarni, Goebel, et al. Fusing physics-based and deep learning models for prognostics. Reliab. Eng. Syst. Saf, 2022; 217:107961; doi: 10.1016/j.ress.2021.107961

54. Zhang, Peng, Xue. Oil well production prediction model based on improved graph attention network. J. Jilin Univ. (Sci. Ed.), 2024; 62(4):933–942.

55. Zheng, Ristovski, Farahat, et al. (2017) Long short-term memory network for remaining useful life estimation, in: Proc. 2017 IEEE International Conference on Prognostics and Health Management (ICPHM), IEEE, USA, pp. 88–95.

56. Chong, Pin, K, et al. Multiobjective deep belief networks ensemble for remaining useful life estimation in prognostics. IEEE Trans. Neural Netw. Learn. Syst, 2016; 28(10):2306–2318; doi: 10.1109/TNNLS.2016.2582798