Abstract
Modern industrial processes often have nonlinearity, multivariate, time-delay, and measurement outliers, which make accurate data-driven modeling of key performance indicators difficult. To address these issues, a robust and regularized long short-term memory (LSTM) neural network for soft sensors in complex industrial processes was proposed. First, a conventional LSTM architecture was used as the basic model to deal with nonlinearity and time delay. Thereafter, a novel LSTM loss function that combines the excellent resistance to outliers of Huber M-Loss with the superior model reduction capability of ℓ1 regularization was designed. Subsequently, a backpropagation through time training algorithm for the proposed LSTM was developed, including the chain derivative calculation and updating formulas. The adaptive moment estimation was applied to perform the gradient update, while the grid search and moving window cross-validation were used to find the optimal hyperparameters. Finally, nonlinear artificial datasets with time series and outliers, as well as an industrial dataset of a desulfurization process, were applied to investigate the performance of the proposed soft sensor. Simulation results show that the proposed algorithm outperforms other state-of-the-art soft sensors in terms of predictive accuracy and training time. The causal relationship of the data-driven soft sensor trained by the proposed algorithm is consistent with the field operation and chemical reactions of the desulfurization process.
Introduction
In modern industrial processes, the accurate and real-time monitoring of key performance indicators is necessary to ensure process safety, production efficiency, and product quality. However, some indicators, such as the slurry composition, fluid viscosity, and gas concentration, are difficult to measure directly using hardware sensors. Specific online analyzers are often expensive and time-consuming to measure. Soft sensors provide fast, low-cost, and convenient alternatives to estimate hard-to-measure variables with easy-to-measure variables, such as temperature, pressure, and flow rate [1].
Soft sensors are categorized into two classes: mechanism modeling [2] and data-driven modeling [3]. Mechanistic modeling approaches have been developed using theorems of physics or chemistry, such as mass balance, energy balance, and reaction principles. However, owing to the complexity and ambiguity of actual industrial processes, it is impractical to establish sufficiently accurate mechanistic models. Data-driven soft sensors use the historical data collected by process automation systems to establish mathematical models without much dependency on the mechanism knowledge of the processes and are therefore much easier to implement. Commonly used data-driven soft sensors consist of linear algorithms, such as principal component analysis [4], partial least squares [5], nonnegative garrote [6], and nonlinear algorithms, such as support vector machines [7] and neural networks (NN) [8].
Recently, NNs have increasingly been used in soft-sensor development owing to their powerful nonlinear mapping capabilities. Depending on whether there are loops in the structure, NNs are divided into two categories: feedforward and recurrent NNs. Feedforward NNs, such as the multilayer perceptron (MLP) [9], convolutional neural network [10] and stochastic configuration networks [11], pass the data in only one direction from the input layer to the output layer. Such a structure is ideally suitable for modeling processes that are not temporally dynamic and is inefficient for time-series processes [12]. Recurrent neural networks (RNNs) are more suitable for time-series processes because they provide bidirectional signal paths by introducing loops to the network architecture [13].
To overcome the gradient vanishing and gradient explosion problems of a standard RNN, long short-term memory (LSTM) and gated recurrent units (GRU) can be used to control the flow of information; gated units can be designed to forget previous useless information and store useful information in the memory cell for long periods and have therefore been widely studied in the field of soft sensors. For example, Yuan et al. [14] proposed a spatiotemporal attention-based LSTM network for key variable prediction in the hydrocracking of industrial crude oil. Liu et al. [15] designed an LSTM network with a spatiotemporal attention mechanism to forecast dissolved oxygen. In [16], an LSTM-based soft sensor was developed and applied to predict the oxygen content in the boiler flue gas. In [17], a novel integrated GRU stacked autoencoder for extracting features and learning dynamic relations was proposed and applied to the soft sensor for a debutanizer column. Ma et al. [18] designed a bidirectional GRU structure by combining future and historical information and applied it to the quality prediction of a hot rolling process. Although GRU has better computational efficiency than LSTM, it is less competent for tasks that require long-term memory and long-distance dependencies.
Vaswani et al. [19] proposed a Transformer model, which can capture time-series information by introducing multi-head self-attention mechanism and position encoding. Geng et al. [20] developed a dynamic soft-sensing model with a Transformer and gated convolutional neural network and applied it to practical industrial processes. However, the Transformer depended highly on the self-attention mechanism, whose computational complexity was quadratic to the length of the input sequence.
Owing to the increasing number of input variables, the structures of NN-based soft sensors have become increasingly complex, which brings many redundant neurons and weights and undermines the performance of soft sensors. Tibshirani [21] proposed an ℓ1 regularization algorithm, called the least absolute shrinkage and selection operator (LASSO), to shrink the coefficients of multiple linear regression problems. Recently, the ℓ1 regularization has been extended to multiple nonlinear regression models. Sun et al. introduced LASSO into MLP to perform variable selection and developed a soft sensor for practical crude oil processing [22]. Subsequently, Fan et al. [23] proposed a double LASSO for MLP, not only to shrink the input variables but also to simplify the hidden layer. Wu et al. [24] designed an input selection algorithm with LASSO for LSTM, which is referred to as LASSO-LSTM.
However, the aforementioned LSTM-based algorithms have two drawbacks. First, these algorithms use the mean square error (MSE) as the loss function for training and optimization owing to their excellent characteristics, such as convexity, smoothness, and low computational cost under the Gaussian assumption. In [25], the statistical results demonstrated that model estimation with the MSE loss function has weak resistance to outliers. Practical industrial processes are sometimes unstable, which inevitably introduces outliers in the measured data. To address this problem, the design of an effective robust loss function helps improve the robustness of soft sensors. There are various robust loss functions, such as Huber M-Loss [26], maximum correlation criterion [27], and least absolute deviation [28], implemented in linear or nonlinear models. Among them, Huber M-Loss, which combines squared loss with absolute loss, is quadratic for small errors but grows linearly for large errors. Specifically, the Huber M-Loss can adapt to the process with different error distributions by adjusting the tunable hyperparameter.
The second drawback is the computational cost of variable selection and structural optimization for LSTM. For example, the spatiotemporal attention mechanism needs to calculate the relevance between the input variables and output variables and then assign different weights to the input variable at each time step. LASSO-LSTM introduces the ℓ1 penalty into a well-trained LSTM to create a quadratic optimization equation and then achieves variable selection by solving the equation. These algorithms are time consuming owing to the intrinsic complexity of the LSTM structure. Therefore, the development of fast training and easily implemented optimization methods for LSTM has become significant.
In this study, a new regularized and robust LSTM learning algorithm for industrial soft sensors is proposed. The main contributions of this study are as follows: To improve the resistance to outliers and reduce the influence of redundant input variables, Huber M-Loss combined with ℓ1 regularization was introduced into the loss function of LSTM. A backpropagation through time (BPTT) training algorithm for the proposed LSTM was developed, including derivative calculation and updating formulas of network weights. The superiority of the proposed soft sensor was demonstrated by artificial datasets and its application in an actual desulfurization process. The feature importance analysis shows that the presented soft-sensing model has valuable interpretability that is consistent with chemical reactions and field operations.
The remainder of this paper is organized as follows. Section 2 provides a brief introduction to related works. In Section 3, the proposed algorithm is described in detail. In Section 4, the proposed algorithm is verified using artificial datasets and is compared with other approaches. In Section 5, the application of the proposed soft sensor to a practical desulfurization process and related discussions are presented. Finally, concluding remarks and future work are presented in Section 6.
Related work
In this study, all vectors are column vectors and are denoted by boldface lower-case letters. Matrices are represented in bold capital letters. For vector
Long short-term memory neural network
LSTM can predict time series by retaining the state at each moment during the learning process. When a model is built, a sliding regression method must be introduced to handle the data. For the original time series

Unit structure of LSTM.
The three inputs are the current input
The LASSO for the multivariate linear regression model is formulated as:
The first term in the formula is the residual loss that controls the approximation degree of the model, whereas the second term is the ℓ1 penalty that regularizes the model coefficients. λ is a nonnegative tunable parameter that controls the shrinking strength. When λ equals zero, model (8) is equivalent to an ordinary least square estimate. As λ keeps increasing, the coefficients of some variables shrink to zero, and the variable selection or model reduction is achieved.
Huber M-Loss is robust and adaptive because it combines a quadratic loss with a linear loss, and its expression is given as
Here, parameter δ controls the degree to which Huber loss is quadratic or linear. The quadratic loss is based on squared errors and punishes larger errors more, which makes it effective when the errors are normally distributed. However, if outliers exist in the data, the square part of the function results in a biased estimation of the model. By measuring the absolute differences, the linear loss is less sensitive to outliers in the data. Huber M-Loss incorporates quadratic loss and linear loss in one expression, which is robust to outliers and differentiable everywhere. The hyperparameter δ that regulates the degree of quadratic or linear, can be determined through a data-driven cross-validation approach.
In this study, a new soft-sensing algorithm based on robust and regularized LSTM is proposed. A detailed description of the proposed algorithm is provided below.
LSTM with Huber M-Loss and ℓ1 regularization
Compared to the standard LSTM, there are two major revisions to the loss function of the proposed LSTM. First, to improve the resistance to outliers, the MSE loss of the conventional LSTM is replaced by the Huber M-Loss. By adjusting the value of δ in the Huber M-Loss, the LSTM can adapt to various processes with different error distributions. Second, the ℓ1 penalty is added to the loss function to reduce the model complexity. As λ increases, the values of the weight matrices continuously shrink until some of them are forced to zero, implying that some redundant units are eliminated from the LSTM. Hence, it reduces the possibility of overfitting and improves the generalization performance.
Combining Equation (8) with Equation (9), the loss function is designed as
In this study, a BPTT training algorithm is developed for the proposed LSTM to minimize
where
The gradient of the relevant parameters can be obtained by the chain derivative rule:
The detailed derivation process of the weight parameters of the proposed LSTM is shown in Appendix A. In practical applications, soft sensors are usually trained and executed in industrial computers, which may have different integrated development environments (IDEs). The BPTT formulas can be easily implemented in different IDEs for industrial applications. After completing the derivation of the BPTT algorithm for the proposed LSTM, the adaptive moment estimation (Adam) of [29] is implemented to perform the gradient update. Adam is an efficient optimization approach for the large-scale weight estimation of NNs and can provide fast, stable, and lower error convergence for the proposed model.
Hyperparameters are essential for neural network learning because they have a significant impact on model performance. There are two categories of hyperparameters in the proposed algorithm. The first category comprises the regular training hyperparameters of the LSTM, such as the learning rate, number of hidden layer neurons, maximum epochs, and size m in the sliding window, which can be determined by experience and some trials. The second category comprises the advanced hyperparameters, including δ in Huber M-Loss and λ in ℓ1 regularization, which are tuned by grid search and moving window cross-validation (MWCV) [30].
Grid search: In this study, the process is performed in three steps: (1) Set the bound domain of the hyperparameters. The bound domain of δ is set to [0 . 1, 1], and the bound domain of λ is set to [0, λ ub ], where λ ub is sufficiently large to make the model underfit. (2) A list of possible hyperparameter combinations is generated from the defined domains. (3) Find the best hyperparameter combination through two loops, in which MWCV is used to evaluate the average loss function in the inner loop, and the outer loop ensures that every possible hyperparameter combination is traversed.
MWCV: In this study, we construct multiple windows that move synchronously along the sub-dataset with a step size of 1 to split a validation set Z val = Zk+1, and consider the union of the previous sub-datasets as the training dataset Z train = [Z1, Z2 … Z k ]. Under the given combination, MWCV is repeated K - 1 times to obtain an average validation loss. After trying every possible combination of δ and λ, the combination with the smallest validation loss is considered the optimal [δ, λ].
Combining the above methods, the pseudocode for determining the optimal [δ, λ] is presented in Algorithm 1.
Computational flow of proposed algorithm
The overall computational flow of the proposed algorithm is as follows: First, a multi-step time-series dataset is constructed and divided into a training set and a testing set. Second, the loss function of LSTM is reformulated by introducing Huber M-Loss and ℓ1 regularization into the standard LSTM. Subsequently, the optimal [δ, λ] is determined using Algorithm 1. Finally, BPTT and Adam are used to train the proposed LSTM with the designed loss function. The pseudocode of the proposed ℓ1-HM-LSTM is presented in Algorithm 2.
Simulation results on artificial datasets
Experimental setup
In this study, the proposed algorithm was simulated on numerical cases and a practical industrial dataset and compared with LASSO-MLP [22], LSTM, Transformer, LASSO-LSTM [24], STA-LSTM [15] and ℓ1-HM-GRU. To ensure fairness, all the LSTM-based soft-sensing algorithms had the same network structure and training methods, while the hyperparameters of LASSO-MLP were determined using cross-validation. The optimal hyperparameters of different algorithms involved in the experiments are shown in Appendix B.
The algorithms were simulated in the same environment, which had a Ryzen 7 5800 H (3.2 GHz) CPU with 16GB RAM, a Windows 11 operating system, and Python 3.6 programming software. The following statistical indices were used to evaluate the algorithms’ performance.
1) Root mean square error (RMSE):
The artificial datasets were generated based on the Friedman dataset [31], which has five relevant input variables and 2000 samples. The output variable was derived as:
Here, the independent variable x is randomly sampled from a uniform distribution (0,1) and ɛ denotes the Gaussian noise. After that, redundant variables and outliers were added to test the performance of the algorithm.
The detailed procedure of the dataset generation is as follows. First, the dataset of five relevant input variables (x1, x2, x3, x4, x5) was generated as
Five irrelevant input variables have been added to the test case, and the data are normally distributed. Table 1 shows the mean, standard deviation, and worst and best experimental results for each algorithm over 10 runs. The comparative results demonstrate that ℓ1-HM-LSTM is better than the other algorithms for all accuracy indices. Meanwhile, the MLP-based model, as a typical feedforward neural network, is not suitable for modeling time series with different durations.
Simulation results on the artificial dataset of 10 input variables with normal distribution
Simulation results on the artificial dataset of 10 input variables with normal distribution
Moreover, the time cost of completing a single model training after all the hyperparameters are determined is compared. Notable is the fact that the time cost of the ℓ1-HM-LSTM is close to that of LSTM and much smaller than other LSTM-based models. Because the proposed algorithm is a conventional LSTM with a modified loss function, the training cost does not increase significantly. However, both LASSO-LSTM and STA-LSTM are additional optimizations of the well-trained LSTM model, so their training costs increase significantly.
To demonstrate the effect of ℓ1 regularization and the interpretability of the trained model, a mean impact value (MIV) analysis for all input variables was performed on the conventional LSTM and the ℓ1-HM-LSTM. MIV is an indicator that measures the importance of the input variables to the output in complex black box models [32]. The computational steps are as follows: First, for every input variable of a well-trained model, the data of the corresponding column in the input dataset increases and decreases by 10% to obtain two modified datasets. Second, the two modified datasets are input into the model to obtain their corresponding outputs. Third, the absolute value of the difference in the outputs is calculated to quantify the effects of the input variable on the output, represented as Δ i , i = 1, 2, …, s. After calculating the MIV of every input variable, an MIV vector Δ = (Δ1, Δ2, …, Δ s ) is obtained. The contribution ratio of variable i to the output can be measured as
The contribution ratio assessment was performed with the testing data on the best models of the LSTM and the ℓ1-HM-LSTM over 10 runs. The percentage diagrams of the contribution ratios of all variables are shown in Fig. 2.

(a) Feature importance distribution of the LSTM model (b) Feature importance distribution of the ℓ1-HM-LSTM model.
First, the relevant variables, that is, x1 ∼ x5, have larger shares of contribution ratio than irrelevant variables in both diagrams, which is consistent with Equation (24). Second, the contribution ratio of variable x4 is significantly larger than that of the other variables because it has multiple time delays to the output, as shown in Equation (24). The results show that LSTM can effectively deal with short- and long-term dependencies between the input and output. Third, for the conventional LSTM, the total contribution ratio of irrelevant variables is so significant that it significantly deteriorates model performance. For the ℓ1-HM-LSTM model, the contribution ratio of irrelevant variables is compressed to a negligible scale.
In addition, to verify the potential relationship between the optimal network structure and the feature importance analysis, the sum of the absolute values of the weights connecting each input variable in the optimal ℓ1-HM-LSTM are calculated. Then, the percentage of each input variable to all variables is calculated according to their respective sums and shown in Fig. 3. The top five largest values are variables 4, 3, 1, 2, and 5, which is consistent with the feature importance analysis of Fig. 2. (b). The result further demonstrates the interpretability of the soft sensor model with our approach.

Feature importance distribution of ℓ1-HM-LSTM model with weights analysis.
To explore the model reduction ability of the ℓ1 regularization, the shrinkage on the weights connecting a relevant variable and an irrelevant variable is presented. For the test case, the number of weights connecting one input variable is 120. The frequency distributions of the weights connecting the relevant variable x4 and the irrelevant variable x8 of the LSTM and the ℓ1-HM-LSTM are shown in Figs. 4 and 5, respectively. First, the weights are normally distributed in the LSTM model, while most of the weights are shrunk to very small values close to zero in the ℓ1-HM-LSTM model. The results demonstrate the powerful model reduction ability of the ℓ1 regularization. Second, the weights connecting the irrelevant variable x8 are shrunk much more completely than those of variable x4, illustrating the shrinking precision of the ℓ1 regularization.

(a) Frequency distribution of the weights connecting x4 before sparsity (b) Frequency distribution of the weights connecting x4 after sparsity.

(a) Frequency distribution of the weights connecting x8 before sparsity (b) Frequency distribution of the weights connecting x8 after sparsity.
The artificial dataset consists of 10 input variables and 10% of the samples are augmented with outliers to demonstrate the robustness of the proposed ℓ1-HM-LSTM. The simulation results for the modified dataset are presented in Table 2.
Simulation results on the artificial dataset of 10 input variables with 10% outliers
Simulation results on the artificial dataset of 10 input variables with 10% outliers
Compared to the results on the artificial dataset with normal distribution, the performance of all algorithms deteriorates. The ℓ1-HM-LSTM shows better performance than the other algorithms, with small performance degradation.
To further verify the robustness of the proposed algorithm when the number of outliers and redundancy increase, larger artificial datasets of 50 input variables with different degrees of outliers were constructed. The contamination ratio η ranged from 0.1 to 0.6. The statistical results with η = 0.1 are shown in Table 3. It can be seen that LSTM suffers from overfitting owing to the existence of redundant input variables and outliers, while the proposed ℓ1-HM-LSTM shows significant advantages over the other algorithms.
Simulation results on the artificial dataset of 50 input variables with 10% outliers
Simulation results on the artificial dataset of 50 input variables with 10% outliers
To demonstrate how the hyperparameters δ and λ affect the performance of the proposed algorithm, a dataset of 50 input variables and η = 0 . 3 was taken. After the MWCV process,λ = 0 . 0015 and δ = 0 . 4 were determined as the optimal hyperparameter combination of the ℓ1-HM-LSTM. Then, the model performance was compared by regulating one hyperparameter while holding the other constant to illustrate the influence of a single hyperparameter.
First, let λ = 0 . 0015; the RMSE with different δ is given in Fig. 6. A small δ means that the Huber loss function favors MAE and therefore makes the model estimation biased, while a large δ means that the function prefers MSE and makes the model less sensitive to outliers. Second, let δ = 0 . 4; the RMSE with different λ is presented in Fig. 7. Obviously, the smallest RMSE appears at λ = 0 . 0015, where the redundant nodes are well shrunk, and an optimal model is obtained. If λ is very small, it means there is little shrinkage on the redundant nodes, which makes the model performance inadequate. As the λ continues to increase beyond the optimal value, more and more valuable nodes are removed, which makes the model performance deteriorate successively.

The validation RMSE of ℓ1-HM-LSTM with different δ when λ = 0 . 0015.

The validation RMSE of ℓ1-HM-LSTM with different λ when δ = 0 . 4.
The comparative results with different values of η are shown in Fig. 8, which demonstrates that ℓ1-HM-LSTM has the best performance in each evaluation index. Especially, when η > 0 . 4, the R2 of the other algorithms other than ℓ1-HM-GRU is below 0.7, which means that the model performance of these algorithms began to collapse. Besides, the ℓ1-HM-GRU is obviously worse than ℓ1-HM-LSTM in prediction accuracy. Since the GRU has only one update gates to control information updating, it is less effective than LSTM in handling complex sequence modeling tasks with long-term dependencies. Although both ℓ1 regularization and STA technology can reduce the model complexity and improve the model performance, the LSTM with MSE loss has insufficient resistance to outliers. The proposed Huber M-Loss with ℓ1 regularization not only simplifies the LSTM model, but has sufficient resistance to outliers, and therefore produces more robust models.

Performance comparison with different η.
Process description
In this section, the proposed algorithm is applied to predict the SO2 content in the gas emitted from the desulfurization process of an actual thermal power plant. In the plant, a double absorption tower structure with limestone-gypsum wet flue gas desulfurization (FGD) technology is adopted. A schematic of the process is shown in Fig. 9.

Process flow chart of double-tower desulfurization system.
The process flow is briefly described as follows: First, the lump limestone is ground and crushed into powder using a wet ball mill and mixed with water to generate limestone slurry, which is then further treated and entered the oxidation reaction pool in the absorption tower to form a circulating slurry. Second, the circulating slurry is sent into the spray layer by a circulating water pump, which is atomized in the spray layer and fully in contact with the flue gas to produce gypsum slurry. Finally, the gypsum slurry is dewatered by the gypsum treatment system to form gypsum by-products. The moisture of the purified flue gas is removed by the absorption tower demister, and the dust is removed by the electrostatic precipitator before it is discharged into the atmosphere. The principle of the chemical reaction is as follows.
In this system, the SO2 concentration in the flue gas emitted from the secondary absorption tower is a key performance indicator related to environmental standards and energy consumption. However, this indicator is unstable, and an online analyzer requires regular maintenance. It is necessary to design soft sensors to provide calibration information and improve the system’s reliability. In addition, the indicator can be used as the output to design a model predictive control system to improve absorption efficiency and reduce energy consumption.
However, there are two challenges when performing data-driven modeling of the process. First, the double-tower structure increases the duration and complexity of the process, which leads to a considerable time delay between the input and output, as well as model redundancy. Second, owing to process disturbances, instrument deviations, and personnel operation errors, there are some outliers in the measured data. Therefore, it is of great significance to develop a soft sensor with long-term memory, resistance to outliers, and reduced model redundancy.
The dataset collected from the FGD process consists of 53 variables, of which SO2 is the response variable. After data analysis and preprocessing, a candidate input variable pool consisting of 35 variables was determined, as shown in Table 4. A total of 2676 samples were collected at one-minute sampling intervals, the first 80% of which were taken for training, and the others were taken for testing.
Candidate input variables of desulfurization system
Candidate input variables of desulfurization system
*FG = flue-gas; CSP = circulating slurry pump; DE = desulfurization efficiency.
In this study, the dataset was taken without any modification, and the anomaly detection toolkit (ADTK) [33] was used to detect the presence of outliers from the dataset. Figure 10 shows the detected outliers of the studied dataset with ADTK criteria, in which 71 samples were identified as outliers.

Distribution of measured SO2 concentration.
Table 5 shows the statistical results of the SO2 prediction with different soft sensors. First, it can be observed that the conventional LSTM model performs poorly owing to excessive irrelevant variables and outliers. Second, the LASSO-MLP shows unsatisfactory performance owing to the time delays with unknown durations caused by the design of the double-tower structure and the complex reaction mechanism. The feedforward structure of MLP is inefficient for the modeling of time-series processes. Third, although the Transformer model using the attention mechanism can capture global key features, the encoder is easily disturbed by outliers and has poor robustness. Fourth, although LASSO-LSTM and STA-LSTM can capture the time-series characteristics of the process and simplify the model, their performance is less favorable. The proposed ℓ1-HM-LSTM is superior to other algorithms in all criteria, indicating that model redundancy and outliers in the measured data are handled well. Our approach has better generalization performance and higher robustness in industrial processes with outliers. The best predictive results with ℓ1-HM-LSTM and different algorithms’ error distributions are shown in Figs. 11 and 12, respectively. The fitting curve demonstrates that the proposed algorithm can accurately follow the dynamics of the target variable. The box-plot error distribution of our approach is the densest, which illustrates the superiority of our algorithm in actual industrial applications.

Fitting curve of SO2 concentration with ℓ1-HM-LSTM.

Error box plots of different algorithms.
Simulation results on the SO2 prediction
To demonstrate the interpretability of the soft-sensor model trained with our approach, a feature importance analysis was performed on the conventional LSTM and the ℓ1-HM-LSTM, with the percentage diagrams of the contribution ratio shown in Fig. 13. In the percentage diagram of the conventional LSTM, the contribution ratio of each variable is relatively even. The relevant input variables do not contribute significantly more than the irrelevant variables in the LSTM model. In contrast, the distribution of the contribution ratio of the ℓ1-HM-LSTM is much more explicit.

Feature importance distribution of the LSTM model and the ℓ1-HM-LSTM model.
In the ℓ1-HM-LSTM model, the first variable that prevails is variable 3, and the outlet flue gas SO2 content of absorption tower #1. It is highly correlated with the output. The second highest contributing variable is variable 12, the current of #1 circulating slurry pump of absorption tower #2. In the field operation, this variable is used to control the amount of slurry sprayed in absorption tower #2, which is important for the SO2 absorption. Moreover, variables 2, 5, and 8 are the other three most influential variables. Variable 2 is the inlet flue gas SO2 content of absorption tower #1, which has an inevitable but smaller impact on the output than variable 3. Variable 5 is the flow rate of the limestone slurry in the absorption tower, which appears to influence the SO2 absorption of the system, as shown in Equation (26). Variable 8, that is, the #1 circulating slurry pump of absorption tower #1, is used to control the amount of slurry sprayed in absorption tower #1.
There are four circulating pumps in absorption tower #1 and three pumps in absorption tower #2. Under normal field conditions, the “3 + 2” operation mode is adopted, in which three pumps in tower #1 and two in tower #2 are kept running, and the others are idle. Sometimes, the “2 + 1” operation mode is taken to save energy when the load on the power plant is low at night. The #1 pumps of both towers run as manipulated variables to control the SO2 absorption. As it is closer to the process outlet, pump #1 of tower #2 contributes more to the output variable than pump #1 of tower #1, as shown in Fig. 8. In summary, the feature importance distribution presented by the ℓ1-HM-LSTM is consistent with the chemical reaction and field operation of the plant.
To further illustrate the contribution of each component in the ℓ1-HM-LSTM model, ablation experiments were conducted on the artificial dataset of 50 input variables with outliers and η = 0.3 as well as the FGD process dataset. The experimental results are shown in Fig. 14. It can be observed that ℓ1 regularization and Huber M-Loss are effective for the experimental cases, and removing each component leads to a decline in the model performance. By comparison, removing the ℓ1 regularization has a more significant impact on performance than removing the Huber M-Loss does, indicating that ℓ1 regularization with model parsimony contributes more to prediction accuracy when there are excessive redundant variables and outliers.

Ablation study results in terms of (a) the R2, (b) the RMSE and (c) the MAE on the artificial dataset and the FGD dataset.
Considering that the soft sensing of practical industrial processes often experiences multivariate, time series, and outliers, a robust LSTM with Huber M-Loss and ℓ1 regularization was proposed. The proposed algorithm takes advantage of the adaptability of the Huber M-Loss to outliers and applies ℓ1 regularization to achieve input variable selection and model reduction. The weight-updating formulas for the designed LSTM were derived using the BPTT algorithm. The hyperparameters of Huber M-Loss and ℓ1 regularization were determined using the MWCV method. Numerical datasets with different dimensions and outliers were used to test the performance. The comparative results with other approaches demonstrate the excellent robustness and rapidity of the proposed algorithm. Furthermore, the proposed algorithm was implemented for the SO2 estimation of an actual desulfurization process. Our approach presents sufficiently accurate models that can provide a solid foundation for the optimization and control system improvement of the process.
Regarding future research, our approach can be improved in a few ways. First, the proposed algorithm is limited to regression problems, and the algorithm can be extended to classification problems. Moreover, the proposed algorithm can only train offline models, which is very effective for the process with stable operating conditions. However, the model performance degrades when concept drift occurs or operating conditions change. Therefore, another direction for future research is to develop online learning or incremental learning techniques for such algorithms.
Footnotes
Acknowledgments
The authors would like to thank the editors and reviewers for the time and effort they spent reviewing this paper as well as for their detailed and constructive comments for the paper improvement in terms of presentation and quality.
Funding
This work is funded in part by the Open Foundation of State Key Laboratory of Process Automation in Mining & Metallurgy (BGRIMM-KZSKL-2021-07), the Shandong Provincial Natural Science Foundation of China (ZR2021MF022).
Appendix A
In this section, the detailed derivation of the gradient update equations for the proposed ℓ1-HM-LSTM is provided as follows. At time t - 1, the error term of BPTT is defined as,
From Equation (16), we obtain
Note that we only provide the variations applied to the output gate of the LSTM network, whereas the variations for the remaining gates can be derived in a similar manner. The weight gradient for
At time t, the gradient of
Appendix B
Optimal hyperparameters of different algorithms on the SO2 prediction
| Hyperparameters | LASSO-MLP | Transformer | LSTM | LASSO-LSTM | STA-LSTM | ℓ1-HM-GRU | ℓ1-HM-LSTM |
| Input dimension | 35 | 35 | 35 | 35 | 35 | 35 | 35 |
| Output dimension | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Number of hidden neurons | 40 | 40 | 40 | 40 | 40 | 40 | 40 |
| λ | 0.5 | – | – | 0.04 | 0.003 | 0.002 | |
| δ | – | – | – | – | – | 0.4 | 0.5 |
| Initial learning rate | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| Sliding window m | – | 8 | 8 | 8 | 8 | 8 | 8 |
| Max epochs | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
