Data reconciliation using MA-PCA and EWMA-PCA for large dimensional data

Abstract

In process industries, measurements usually contain errors caused by instrument inaccuracy, physical leakage in process streams and nodes, and inaccurate recording or reporting. As a result, the measurements violate conservation laws and do not conform to the process constraints. Data reconciliation (DR) resolves the discrepancy between measurements and constraints; it also reduces the effect of random errors and gives more accurate estimates of the true values. Principal Component Analysis (PCA) is a multivariate technique that yields estimates of the true values while preserving the most significant inherent variation, and it reduces the dimensionality of the data with minimum information loss. In this paper, two new DR techniques are proposed, moving-average PCA (MA-PCA) and exponentially weighted moving average PCA (EWMA-PCA), to improve the performance of DR and obtain more accurate and consistent data. These DR techniques are compared on the basis of root mean square error (RMSE) and are further analyzed for different values of sample size, weighting factor, and variance.

Keywords

Data reconciliation, MA-PCA, EWMA-PCA

1 Introduction

In industries, process data play an important role in process control, monitoring, optimization, and decision making. As technological innovations continue to rise in the industrial sector, visibility into a process and its monitoring are frequently obstructed by randomness in on-line measurements and by instrument inaccuracies and malfunctions; the resulting errors are classified as either gross errors or random errors [2, 13]. Data Reconciliation (DR) techniques are used to minimize the errors in process measurement data [1, 16].

A method popularly used for data processing and multivariate analysis is principal component analysis (PCA). PCA is mainly used for dimensionality reduction and aids in developing predictive multivariate models. The transformed variables (known as Principal Components or PCs) are orthogonal, and when they are arranged in decreasing order of variance, the first few principal components capture the largest amount of variation contained in the multivariate data [6, 15].

In most process monitoring applications, steady-state process variables are assumed to have a fixed mean. Because of process dynamics, environmental variations, and equipment degradation, however, process data become time-dependent. PCA does not capture this non-stationary behavior, and scaling (of statistical parameters such as mean and variance) is required before decomposing large dimensional data. When the measured values of the process variables span several orders of magnitude, the reconciliation may be inaccurate because the variables are stretched along a single axis, known in this context as the dominant PC axis. PCA based DR is therefore sensitive to the magnitude of the process variables. In this paper, two techniques, MA-PCA and EWMA-PCA based DR, are introduced to overcome this limitation [7, 19].

The paper is organized as follows. Section 2 summarizes the MA-PCA and EWMA-PCA based DR techniques. Section 3 presents the characteristics of the chosen benchmark problem (a steam metering circuit), the simulation assumptions considered for the process environment, and the simulation results together with their discussion. Section 4 concludes the paper.

2 Data reconciliation (DR)

DR estimates the true values of process variables by de-noising the measurement data, so the reconciled data are more accurate than the original data. The difference between data reconciliation and other filtering techniques is that data reconciliation uses the process constraints in the model (typically based on conservation laws).

The steady state data reconciliation problem is defined in equation (1):

$\min \; d = \sum_{i = 1}^{n} \left( \frac{\ddot{x}_{i} - x_{i}}{\sigma_{i}} \right)^{2}$ (1)

subject to $G_{m} (x_{i}) = 0$, m = 1, 2, . . . ,

where $\ddot{x}_{i}$ is the reconciled value, x_i is the raw measurement, σ_i is the standard deviation of the measurement error, and G_m contains the steady state process constraints. The constraint equations are represented by a constraint matrix through the following set of linear relationships:

$C_{(m \times n)} \, x_{(n \times 1)} = 0_{(m \times 1)}$ (2)

where, in equation (2), x_i ∈ R and C represent the true values of the measurements and the process constraints, respectively. Let y_i (k) ∈ R represent the raw measurements, which are contaminated by random error e (k), where k = 1, 2, 3, . . . , N and i = 1, 2, . . . , n. These observations are organized into the raw measurement matrix Y of dimension (n × N):

$Y = X + E$ (3)

where n is the number of process variables and N is the number of observations.
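For linear constraints of the form C x = 0, the weighted least-squares problem above has a well-known closed-form solution, obtained with Lagrange multipliers: reconciled = y − Σ Cᵀ (C Σ Cᵀ)⁻¹ C y, where y is the raw measurement vector and Σ the diagonal covariance of the measurement errors. The following is a minimal sketch of that formula applied to a single mass balance (Node 1 of the benchmark in Section 3). The function name `reconcile_linear` and the noisy readings are illustrative assumptions, not values from the paper.

```python
import numpy as np

def reconcile_linear(y, C, sigma):
    """Weighted least-squares reconciliation subject to C @ x = 0.

    y     : (n,) raw measurements
    C     : (m, n) constraint matrix with full row rank
    sigma : (n,) standard deviations of the measurement errors
    """
    y = np.asarray(y, dtype=float)
    C = np.asarray(C, dtype=float)
    Sigma = np.diag(np.asarray(sigma, dtype=float) ** 2)
    # Lagrange-multiplier solution: x_hat = y - Sigma C^T (C Sigma C^T)^-1 C y
    correction = Sigma @ C.T @ np.linalg.solve(C @ Sigma @ C.T, C @ y)
    return y - correction

# Node 1 balance F1 + F2 + F4 - F3 = 0, with variables ordered (F1, F2, F3, F4)
C = np.array([[1.0, 1.0, -1.0, 1.0]])
y = np.array([10.4, 9.7, 30.9, 10.2])   # hypothetical noisy readings
x_hat = reconcile_linear(y, C, sigma=np.ones(4))
print(x_hat)        # reconciled flows
print(C @ x_hat)    # balance residual after reconciliation
```

With equal standard deviations, the imbalance of the raw readings is spread evenly over the four flows, so the reconciled values satisfy the node balance.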

2.1 Data reconciliation using pre-processing techniques

Figure 1 illustrates the process of data reconciliation to estimate true values. First, raw measurements are obtained and pre-processed using MA or EWMA. PCA is then applied to the pre-processed data (Y_ma or Y_e) to estimate the reconciled data ( $\ddot{X}$ ). The MA-PCA and EWMA-PCA based DR techniques are explained in the following sections.

Fig. 1

MA-PCA and EWMA-PCA based DR.

2.1.1 MA-PCA based DR

Moving Average (MA) is a method popularly used for pre-processing data to filter ‘white noise’. It averages a specified number of consecutive samples of a variable, giving equal weight to each sample [4, 19]. The moving average can be expressed mathematically as:

$y_{ma_{i}} (k) = \frac{1}{k^{*}} \sum_{j = (k - k^{*}) + 1}^{k} y_{i} (j)$ (4)

where k^* is the number of samples being averaged and y_{ma_i} is the moving average of the i^th variable. To obtain a reconciled estimate, PCA is then applied to the moving-averaged data y_{ma_i}.
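The moving-average step can be sketched in a few lines of NumPy, reading equation (4) as a trailing window of k^* samples that ends at sample k. The function name `moving_average` and the choice to return only fully populated windows (so the output has N − k^* + 1 columns) are assumptions made for illustration; the paper does not state how the first k^* − 1 samples are treated.

```python
import numpy as np

def moving_average(Y, k_star):
    """Trailing moving average of each row of Y (n variables x N samples).

    Returns an (n, N - k_star + 1) array: column c averages samples
    c .. c + k_star - 1, so only fully populated windows are kept.
    """
    Y = np.asarray(Y, dtype=float)
    if not 1 <= k_star <= Y.shape[1]:
        raise ValueError("k_star must lie between 1 and the number of samples")
    # Prepend a zero column so window sums become differences of cumulative sums.
    csum = np.cumsum(np.hstack([np.zeros((Y.shape[0], 1)), Y]), axis=1)
    return (csum[:, k_star:] - csum[:, :-k_star]) / k_star

Y = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
              [10.0, 10.0, 13.0, 10.0, 10.0]])
Y_ma = moving_average(Y, k_star=3)
print(Y_ma)
```

In the second row, the single spike of 13 is spread across every window that contains it, which is the smoothing effect the pre-processing relies on.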

2.1.2 EWMA-PCA based DR

The exponentially weighted moving average (EWMA) is a widely used method that gives greater weight to recent data, with the weight of older samples decaying exponentially [4, 10]. Here, the same approach is applied through the statistic

$y_{e} (k) = \lambda y_{i} (k) + (1 - \lambda) y_{e} (k - 1)$ (5)

where λ is the weighting factor, 0 < λ < 1, and y_e is the exponentially weighted moving-averaged data. The selection of λ plays a major role and is discussed by Lucas and Saccucci (1990) [7]. PCA is then applied to the EWMA data and the reconciled estimates are calculated.
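The recursion in equation (5) can be sketched as follows. The function name `ewma` and the initialization (starting the recursion from the first sample unless `y0` is supplied) are assumptions, since the paper does not specify the starting value.

```python
import numpy as np

def ewma(Y, lam, y0=None):
    """Apply y_e(k) = lam * y(k) + (1 - lam) * y_e(k - 1) to each row of Y."""
    Y = np.asarray(Y, dtype=float)
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must satisfy 0 < lam < 1")
    Y_e = np.empty_like(Y)
    # Assumption: the recursion starts from the first sample unless y0 is given.
    prev = Y[:, 0] if y0 is None else np.asarray(y0, dtype=float)
    for k in range(Y.shape[1]):
        prev = lam * Y[:, k] + (1.0 - lam) * prev
        Y_e[:, k] = prev
    return Y_e

Y = np.array([[10.0, 12.0, 8.0, 10.0]])
Y_e = ewma(Y, lam=0.5)
print(Y_e)
```

A small λ makes the filter respond slowly and smooth heavily, while a λ close to 1 follows the raw samples closely; λ = 0.5 is used here only to keep the arithmetic easy to follow.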

2.1.3 Application of PCA in DR

PCA based DR techniques are used to reduce the random errors present in raw measurements and to obtain reconciled estimates. In this technique, principal components (PCs) are obtained by linearly transforming the original variables into new ones that are mutually uncorrelated (orthogonal). The new variable with the highest variance is termed the first PC. The second PC is orthogonal to the first and has the next highest variance, and so on. The dimensionality of the data is then reduced to obtain the reconciled data by retaining the first few PCs that capture the most variance (typically selected using the 80:20 Pareto Principle). To obtain these PCs, Singular Value Decomposition (SVD) is used [11, 18].

To obtain the PCs, SVD is applied to Y_ma or Y_e, as shown in equation (6):

$Y_{s} = svd (Y_{ma} \text{ or } Y_{e})$ (6)

Equation (6) can be expanded as in equation (7):

$Y_{s} = V S W^{T} = V_{1} S_{1} W_{1}^{T} + V_{2} S_{2} W_{2}^{T}$ (7)

The reconciled estimates are obtained from Y_s as

$\ddot{X} = V_{1} S_{1} W_{1}^{T}$ (8)

where V and W are matrices whose columns are the orthonormal eigenvectors of $Y_{s} Y_{s}^{T}$ and $Y_{s}^{T} Y_{s}$, respectively. V₁ contains the eigenvectors corresponding to the p largest eigenvalues of the covariance matrix and V₂ those corresponding to the (n − p) smallest, while S₁ and S₂ are the corresponding diagonal matrices of singular values [6, 15]. Only the first p PCs are retained to obtain the reconciled data $\ddot{X}$ of equation (8).
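The truncation in equations (7) and (8) can be sketched with `numpy.linalg.svd`. The function name `pca_reconcile`, the synthetic three-flow example, and the choice p = 1 are illustrative assumptions. Following equation (6), the SVD is applied to the data matrix directly, without mean-centering or scaling.

```python
import numpy as np

def pca_reconcile(Y_s, p):
    """Rank-p reconstruction V1 S1 W1^T of Y_s, as in equations (7) and (8)."""
    Y_s = np.asarray(Y_s, dtype=float)
    # numpy returns the singular values in descending order.
    V, s, Wt = np.linalg.svd(Y_s, full_matrices=False)
    return V[:, :p] @ np.diag(s[:p]) @ Wt[:p, :]

rng = np.random.default_rng(0)
X_true = np.outer([10.0, 20.0, 30.0], np.ones(200))   # constant flows, rank 1
Y = X_true + rng.standard_normal(X_true.shape)        # add N(0, 1) noise
X_hat = pca_reconcile(Y, p=1)

print(np.linalg.norm(Y - X_true))      # Frobenius error of the raw data
print(np.linalg.norm(X_hat - X_true))  # Frobenius error after truncation
```

Keeping all PCs (p = n) returns the input unchanged, so the noise reduction comes entirely from discarding the V₂ S₂ W₂ᵀ term.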

2.1.4 Performance measure of DR

Root Mean Square Error (RMSE) is the performance metric applied to the true values and the reconciled estimates. The individual differences between the true values and the reconciled data are called residuals; RMSE aggregates them so that the estimation errors of different DR techniques can be compared [17]. For the i^th variable, it is defined as:

$RMSE_{i} = \sqrt{\frac{1}{N} \sum_{k = 1}^{N} (\ddot{x}_{i} (k) - x_{i} (k))^{2}}$ (9)
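Since the results are reported per flow variable, equation (9) can be read as a row-wise computation over an (n × N) data matrix. A minimal sketch follows; the function name `rmse_per_variable` and the toy values are illustrative assumptions.

```python
import numpy as np

def rmse_per_variable(X_hat, X_true):
    """Row-wise RMSE between two (n variables x N samples) arrays."""
    diff = np.asarray(X_hat, dtype=float) - np.asarray(X_true, dtype=float)
    return np.sqrt(np.mean(diff ** 2, axis=1))

X_true = np.array([[10.0, 10.0, 10.0, 10.0],
                   [20.0, 20.0, 20.0, 20.0]])
X_hat = np.array([[11.0, 9.0, 11.0, 9.0],
                  [20.0, 20.0, 20.0, 24.0]])
scores = rmse_per_variable(X_hat, X_true)
print(scores)
```

The second row shows how squaring penalizes one large residual: a single error of 4 in four samples gives the same mean squared error as four errors of 2.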

3 Simulation and results

Figure 2 shows a large steam metering circuit, which is used to generate the data matrix on which the DR techniques are implemented. The circuit consists of 28 flow variables and 11 nodes [3, 5]. At each node, inflows and outflows are marked by arrows that indicate the direction of flow. The base values of the flow rates in the circuit (Fig. 2) are obtained from the mass-balance equations.

Fig. 2

Steam metering circuit.

The constraint matrix (C) is derived from the mass balance equation of each node. Thus, 11 constraint equations are obtained as in Equation (10). Here, base values are assigned to the flow variables as shown in Table 1.

Table 1

Base value of flow variable

Flow variable	1	2	3	4	5	6	7	8	9	10	11	12	13	14
Base value	10	10	30	10	20	20	30	20	10	10	10	20	10	10
Flow variable	15	16	17	18	19	20	21	22	23	24	25	26	27	28
Base value	10	5	5	40	5	5	10	20	20	10	5	20	45	15

Node 1: $F_{1} + F_{2} + F_{4} - F_{3} = 0$

Node 2: $- F_{5} - F_{6} + F_{7} + F_{8} - F_{9} = 0$

$\vdots$

Node 11: $- F_{8} + F_{20} + F_{26} - F_{28} = 0$ (10)

The DR techniques described above are implemented in Python.

To simulate the data, the following assumptions were made [12, 15]:

The flow process operates under the linear steady-state condition.

The measurement error of each flow variable is normally distributed with zero mean, i.e., e (j) ∼ N (0, σ²)

Process constraints are linear.

The variables in the process have identical variances.

The measurement errors are independent and identically distributed (i.i.d.) across samples.

The covariance matrix of the measurement errors is known and taken as Σ = I
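Under these assumptions, the simulation pipeline can be sketched end to end for a small subset of the circuit. The block below is a hedged illustration, not the authors' code. It uses only the four flows of Node 1 with their base values from Table 1, and the number of samples N = 1000, the number of retained PCs p = 1, the filter coefficient λ = 0.05, the random seed, and all function names are assumptions. It is not expected to reproduce the values reported for the full 28-variable circuit.

```python
import numpy as np

def rank_p(Y, p):
    # Truncated SVD reconstruction (no centering or scaling).
    V, s, Wt = np.linalg.svd(Y, full_matrices=False)
    return (V[:, :p] * s[:p]) @ Wt[:p, :]

def moving_average(Y, k_star):
    # Trailing mean over k_star samples; only full windows are returned.
    csum = np.cumsum(np.hstack([np.zeros((Y.shape[0], 1)), Y]), axis=1)
    return (csum[:, k_star:] - csum[:, :-k_star]) / k_star

def ewma(Y, lam):
    # y_e(k) = lam * y(k) + (1 - lam) * y_e(k - 1), started at the first sample.
    out = np.empty_like(Y)
    prev = Y[:, 0]
    for k in range(Y.shape[1]):
        prev = lam * Y[:, k] + (1.0 - lam) * prev
        out[:, k] = prev
    return out

def rmse(X_hat, X_true):
    # Row-wise RMSE, one value per flow variable.
    return np.sqrt(np.mean((X_hat - X_true) ** 2, axis=1))

rng = np.random.default_rng(42)                 # arbitrary seed
base = np.array([10.0, 10.0, 30.0, 10.0])       # F1..F4 from Table 1
N, k_star, lam, p = 1000, 50, 0.05, 1           # N, lam and p are assumptions
X = np.outer(base, np.ones(N))                  # steady-state true values
Y = X + rng.standard_normal(X.shape)            # i.i.d. N(0, 1) errors

results = {
    "raw": rmse(Y, X),
    "PCA": rmse(rank_p(Y, p), X),
    # The moving average drops the first k_star - 1 samples, so trim X to match.
    "MA-PCA": rmse(rank_p(moving_average(Y, k_star), p), X[:, k_star - 1:]),
    "EWMA-PCA": rmse(rank_p(ewma(Y, lam), p), X),
}
for name, values in results.items():
    print(f"{name:9s}", np.round(values, 4))
```

Because the true values are constant in time, the noise-free data matrix has rank 1, which is why a single retained PC is a natural choice in this toy setting; a different p or initialization would change the numbers.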

The DR techniques described in Section 2 were implemented, and the results are presented in Table 2, which compares the performance of PCA, MA-PCA (sample size k^* = 50), and EWMA-PCA (λ = 0.01) based DR. For node 8, the RMSE values of PCA based DR for the inflow variables 15, 22, and 23 are 0.4401, 0.8777, and 0.878, respectively, while those for the outflow variables 18 and 24 are 1.7544 and 0.4405, respectively. Flow variables 15 (an inflow) and 24 (an outflow) share the base value 10 and have almost the same RMSE. This shows that the RMSE of a flow variable does not depend on whether it is an inflow or an outflow of the node.

Table 2

Performance of DR techniques

Flow variable	RMSE (raw data)	RMSE (PCA)	RMSE (MA-PCA, k^* = 50)	RMSE (EWMA-PCA, λ = 0.01)	Flow variable	RMSE (raw data)	RMSE (PCA)	RMSE (MA-PCA, k^* = 50)	RMSE (EWMA-PCA, λ = 0.01)
1	1.0021	0.4412	0.0649	0.0359	15	1.0021	0.4401	0.0647	0.0357
2	1.0021	0.4413	0.0649	0.0359	16	1.0021	0.2243	0.0327	0.0184
3	1.0021	1.3154	0.1941	0.1073	17	1.0021	0.2242	0.0327	0.0184
4	1.0021	0.4404	0.0648	0.0359	18	1.0021	1.7544	0.259	0.1432
5	1.0021	0.8772	0.1294	0.0716	19	1.0021	0.2242	0.0327	0.0183
6	1.0021	0.8777	0.1295	0.0716	20	1.0021	0.224	0.0327	0.0183
7	1.0021	1.315	0.1941	0.1073	21	1.0021	0.4412	0.0648	0.0359
8	1.0021	0.877	0.1294	0.0715	22	1.0021	0.8777	0.1295	0.0716
9	1.0021	0.4408	0.0648	0.0358	23	1.0021	0.878	0.1295	0.0717
10	1.0021	0.4407	0.0648	0.0358	24	1.0021	0.4405	0.0648	0.0358
11	1.0021	0.4409	0.0648	0.0358	25	1.0021	0.2242	0.0327	0.0183
12	1.0021	0.8773	0.1294	0.0716	26	1.0021	1.3159	0.1942	0.1074
13	1.0021	0.4407	0.0648	0.0358	27	1.0021	1.9727	0.2912	0.161
14	1.0021	0.4407	0.0648	0.0358	28	1.0021	0.659	0.0971	0.0538

Further, the RMSE values of the flow variables associated with nodes 5, 9, and 10 show no systematic pattern with respect to flow direction. From this, it is concluded that the performance of the DR techniques is not influenced by whether a variable is an inflow or an outflow.

When the results were analyzed with respect to the base values (30, 40, and 45), the RMSE of PCA based DR for the flow variables F3, F18, and F27 was 1.3154, 1.7544, and 1.9727, respectively. This shows that RMSE increases steadily with the magnitude of the flow variable.

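Further, the performance of the DR techniques is compared with the raw data in Table 2. After MA-PCA and EWMA-PCA based DR, every flow variable has a smaller RMSE than the raw data (1.0021). Plain PCA based DR also lowers the RMSE of most variables, but for F3, F7, F18, F26, and F27 its RMSE (1.315 to 1.9727) exceeds that of the raw data. Overall, DR improves the estimates of the true values of the process variables, with the exception of plain PCA on these five variables.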

Figure 3 shows the performance of the DR techniques for varying magnitude of the flow variables. To compare the techniques, seven flow variables (F1, F5, F7, F18, F25, F27, and F28) are selected and their RMSE is plotted. The plot shows that the RMSE of the raw data is constant across the flow variables, whereas for the DR techniques RMSE increases with the magnitude of the flow variable. The DR techniques are therefore sensitive to the magnitude of the flow variables. PCA has the steepest slope and EWMA-PCA the shallowest, which indicates that PCA is the most sensitive and EWMA-PCA the least sensitive to magnitude. It is also inferred that EWMA-PCA and MA-PCA provide greater DR accuracy than PCA.

Fig. 3

Performance of DR techniques for varying magnitude.

MA-PCA based DR is performed for different values of the sample group size (k^* = 3, 7, 10, 20, 50). Figure 4 shows that MA-PCA with a large k^* (k^* = 50) has a low slope and is less sensitive to the magnitude of the flow variables than PCA. This indicates that MA-PCA provides the greatest improvement for large sample groups (k^*).

Fig. 4

Performance of MA-PCA for varying k^*.

Figure 5 presents the performance of MA-PCA for varying sample sizes (k^*). Sample sizes of 3, 7, 10, 20, and 50 are selected for simulation; sample sizes above 50 may require longer computation time. The performance of MA-PCA improves as the sample size increases, and the RMSE of all selected variables changes little between k^* = 20 and k^* = 50.

Fig. 5

Performance of MA-PCA based DR techniques for varying magnitude.

Figure 6 shows the RMSE of different variables for varying values of the filter coefficient (λ). The performance of the DR technique declines for higher values of the filter coefficient, so the filter coefficient is restricted to the range 0.01 to 0.05. For λ = 0.01 the RMSE of all variables is low, and it increases slowly for higher λ.

Fig. 6

Performance of EWMA-PCA for varying λ.

Figure 7 plots EWMA-PCA based DR for varying magnitude with filter coefficients λ of 0.01 and 0.05. The curve for λ = 0.01 has the lower slope, which implies that it is less sensitive to the magnitude of the flow variables; EWMA-PCA therefore performs better for a small weighting factor (λ). These DR techniques were also compared for different values of variance (σ²), and it was observed that a smaller variance made the DR performance less sensitive to the magnitude of the flow variables.

Fig. 7

Performance of EWMA-PCA based DR techniques for varying magnitude.

4 Conclusion

In this paper, two DR techniques, MA-PCA and EWMA-PCA, are proposed and implemented in Python. Simple PCA based DR was first implemented to obtain reconciled estimates by reducing the dimensionality as well as the effect of random errors. The performance of PCA was observed to be sensitive to the magnitude of the flow variables. To reduce this sensitivity, MA-PCA and EWMA-PCA based DR techniques are recommended.

The results show that EWMA-PCA is the least sensitive to the magnitude of the flow variables and provides the most accurate estimates of the process data, and that MA-PCA is more accurate than PCA. MA-PCA and EWMA-PCA were also evaluated for different values of the sample group size and filter coefficient. MA-PCA with a large sample group (k^*) and EWMA-PCA with a small filter coefficient (λ) improved on PCA and provided more consistent data.

The DR techniques were not influenced by whether a variable is an inflow or an outflow. This study assumes that the measurement errors are independent and identically distributed (i.i.d.), with all variables having identical variance. Further studies can analyze the performance of these DR techniques for variables with different variances and under a different set of assumptions.

References

1. Bhattacharyya, Yogi, Singla, Bhushan, Kelkar M.G., Tiwari A.A., Pramank and Belur M.N., Adaptive, online models to detect and estimate gross error in SPND, Proceedings of Indian Control Conference (ICC), Guwahati, (2017), 149–154. doi: 10.1109/INDIANCC.2017.7846467.

2. Dyskin A.V., Basarir and Doherty, Computational monitoring in real time: review of methods and applications, Geomech Geophys Geo-energ Geo-resour 4 (2018), 235–271.

3. Varshith C.R., Rishika J.R., Ganesh and Jeyanthi, Principal component analysis-based data reconciliation for a steam metering circuit, Proceedings of International Conference on Soft Computing and Signal Processing, Advances in Intelligent Systems and Computing 2 (2017), 619–626.

4. Seborg D.E., Edgar T.F., Duncan, Mellichamp D.A. and Doyle F.J., Process Dynamics and Control, 3rd Edition, Wiley & Sons, Inc., USA. 1990.

5. Valle E.C., Kalid R.E., Secchi A.R. and Kiperstok, Collection of benchmark test problems for data reconciliation and gross error detection and identification, Computers and Chemical Engineering 111 (2018), 134–148. doi: https://doi.org/10.1016/j.compchemeng.2018.01.002.

6. Jolliffe I.T. and Cadima, Principal component analysis: a review and recent developments, Phil Trans R Soc A 374:20150202 (2016). doi: https://doi.org/10.1098/rsta.2015.0202.

7. Lucas J.M. and Saccucci M.S., Exponentially Weighted Moving Average Control Schemes: Properties and Enhancements, Technometrics 32 (1990).

8. Ratheesh K.M., Seah L.K. and Murukeshan V.M., Spectral phase-based automatic calibration scheme for swept source-based optical coherence tomography systems, Phys Med Biol 61 (2016), 7652–7663. http://iopscience.iop.org/0031-9155/61/21/7652.

9. Saimurugan and Ramprasad, A dual sensor signal fusion approach for detection of faults in rotating machines, Journal of Vibration and Control 24(12) (2018), 2621–2630.

10. Mehran and Movahhedinia, Non-uniform EWMA-PCA based cache size allocation scheme in Named Data Networks, China Inf Sci 61 (2018). https://doi.org/10.1007/s11432-016-0501-5.

11. Megha, Sowmya and Soman K.P., Effect of dynamic mode decomposition-based dimension reduction technique on hyperspectral image classification, Lecture Notes in Electrical Engineering 490 (2018), 89–99.

12. Jeyanthi and Devanathan, Addressing Higher Order Serial Correlation in Techniques for Gross Error Detection, J Comput Theor Nanosci 3 (2012), 236–244.

13. Meleppat R.K., Matham M.V. and Seah L.K., An efficient phase analysis-based wavenumber linearization scheme for swept source optical coherence tomography systems, Laser Physics Letters 12(5) (2015), 1–7. http://iopscience.iop.org/1612-202X/12/5/055601.

14. Akrami S.A., El-Shafie, Naseri and Santos C.A.G., Rainfall data analyzing using moving average (MA) model and wavelet multi-resolution intelligent model for noise evaluation to improve the forecasting accuracy, Neural Computing and Applications 25(7-8) (2014), 1853–1861.

15. Narasimhan and Bhatt, Deconstructing principal component analysis using a data reconciliation perspective, Computers and Chemical Engineering 77 (2015), 74–84. https://doi.org/10.1016/j.compchemeng.2015.03.016.

16. Narasimhan and Shah, Model identification and error covariance matrix estimation from noisy data using PCA, Control Engineering Practice 16 (2008), 146–155. https://doi.org/10.1016/j.conengprac.2007.04.006.

17. Neill S.P. and Hashemi M.R., Ocean Modelling for Resource Characterization, Fundamentals of Ocean Renewable Energy, First edn. Academic Press, (2018), 193–235.

18. Babu Y.M.M., Subramanyam M.V. and Giriprasad M.N., PCA based image denoising, Signal & Image Processing, Int J SIPIJ 17 (2020), 297–302.

19. Zhao and Liu, Industrial monitoring based on moving average PCA and neural network, Proceedings of 30th Annual Conference of IEEE Industrial Electronics Society (IECON 2004), Busan, South Korea, 3 (2004), 2168–2177. doi: 10.1109/IECON.2004.1432133.