Abstract
As grids become more intelligent and demand response databases grow, a governance method for compound problem data in demand response is needed. Traditional data governance methods ignore the temporal nature of the data and overlook the impact of noise and duplicate data on data repair. This study therefore develops an anomalous data extraction and repair model based on bidirectional long short-term memory (Bi-LSTM) networks, and repairs the anomalous data through noise smoothing, missing data filling, and duplicate data cleaning. The adaptive moment estimation (Adam) algorithm is used to optimise the model and raise its accuracy. The results show that the model’s precision for anomalous data extraction was 100% and its recall was 80%, a significant improvement over the model before improvement. For anomalous data repair, the research model had the lowest root mean square error and the lowest mean absolute percentage error among the compared models, at 0.0049 MPa and 1.375% respectively. Both the extraction and repair performance of the research model are greatly improved over the related models, which makes it valuable for governing abnormal data in demand response databases.
Introduction
With the development of the electric power industry and the continuous improvement of grid intelligence, users’ demand for electricity keeps rising, and most electric power companies have begun to implement management mechanisms based on user demand response. As demand response data continues to grow, anomalies in demand response databases have grown as well, so a data governance method for the abnormal data in these databases is of great significance. Data governance research covers both abnormal data extraction and abnormal data processing. In abnormal data extraction, traditional machine learning methods do not consider the temporal characteristics of the data [1]. In abnormal data processing, most data cleaning techniques handle only missing data and fail to account for the effects of duplicate and noisy data. Because the time-series nature of the data is also ignored by the model, prediction accuracy is low. Long short-term memory (LSTM) networks contain memory cells and gating mechanisms that can preserve and selectively forget previous input information, which lets them capture long-distance dependencies in long time sequences. Therefore, to address the low accuracy of traditional data governance methods that ignore the time-series nature of the data, this research establishes a time-series data extraction and data cleaning model based on the LSTM network to deal effectively with the compound abnormal data of demand response. The first part introduces demand-response-based data extraction techniques and the current state of research on data cleaning methods. The second part develops a model for extracting and repairing temporal anomalous data based on the Bidirectional Long Short-Term Memory (Bi-LSTM) network and introduces the Adam algorithm to optimise the model.
Related works
In the area of anomalous data governance, many scholars have conducted related research. Long et al. proposed an image-based approach for identifying and cleaning wind turbine anomaly data. It entails three steps: data pre-cleaning, normal data extraction, and data tagging [2]. Luo et al. proposed a comprehensive method for locating and eliminating outliers in wind power data. It uses the density difference between normal and outlier points to improve cleaning performance and combines boundary extraction with boundary regularization to eliminate boundary outliers completely [3]. Jun et al. addressed the problem that using raw skeletal data in skeleton-based anomalous gait recognition lowers recognition performance. Their technique, which extracts features from skeletal gait data, improves the performance of skeleton-based anomalous gait recognition and achieves higher recognition accuracy, according to testing results [4]. Wang et al. proposed a method for detecting network anomalies that combines principal component analysis with a single-stage face identification algorithm. Experiments showed that this method greatly surpasses previous detection methods in terms of speed and accuracy [5]. Dong et al. addressed the problem that most current anomaly detection models are evaluated in simulation experiments on a common dataset that contains too many kinds of data for traditional machine learning methods to handle [6].
Numerous researchers have also carried out pertinent research in the areas of data cleaning and data pre-processing. Wang et al. proposed a data cleaning method based on mobile edge nodes for data collection. The method obtains model training data through angle-based outlier detection and builds a cleaning model with a support vector machine. Experiments show that the method greatly improves the cleaning efficiency of data while maintaining data reliability [7]. Zyblewski et al. designed a data processing framework that first trains the base classifiers using the hierarchical bagging method and then integrates data preprocessing and dynamic ensemble selection methods into the classification of unbalanced data streams. The study designed four preprocessing techniques and two dynamic selection methods for bagging classifiers and base estimators. The experimental results show that the dynamic ensemble selection method combined with data preprocessing outperforms the existing state-of-the-art methods for highly unbalanced data streams [8]. Wang et al. proposed a method for detecting and identifying anomalous data, in response to problems in identifying and correcting state-based anomalous data such as non-convergence in iterative computation and difficulty in handling large amounts of data effectively. The method builds a data constraint model based on temporal data similarity, and experiments show that it can effectively detect and repair anomalous data [9]. Diffusion-weighted magnetic resonance imaging is the primary method for noninvasive study of white matter tissue in the human brain. Cieslak et al. constructed a preprocessing platform for diffusion image data that is compatible with all diffusion-weighted magnetic resonance imaging sampling modalities [10].
In summary, typical machine learning approaches do not take the temporal properties of the data into account when extracting anomalous data. Most data cleaning techniques consider only the filling of missing data when processing anomalous data and fail to account for the effects of noise and duplicate data. Moreover, ignoring data temporality can lead to the loss of time-related key information: the model cannot distinguish the order of events in time and thus cannot accurately capture the temporal features of the data, which leads to modelling errors, errors in feature importance, and lower prediction accuracy. To analyse the compound anomalous demand response data efficiently, this research develops a temporal data extraction and data cleaning model based on the Bi-LSTM network.
Bi-LSTM based anomaly data pre-processing method research
In data governance research, the traditional approach does not consider the time-series characteristics of the data, and its prediction accuracy is not high enough. Based on the demand response database, this research builds a machine-learning-based data extraction and cleaning model to correctly predict abnormal data and to clean and repair it.
Design of a Bi-LSTM-based data governance framework
The data anomalies of the demand response database mainly include missing data values, duplicate data values, inconsistent data and other anomalies. Only by carrying out data governance operations such as anomaly extraction and data cleaning on these anomalies can the data of the demand response database be true and reliable and deliver its proper value [11]. This research classifies the recorded data into three categories. The first is dead data, i.e. data recorded in the database that always holds the same value without any change. The second is empty data, i.e. records that exist but do not actually contain any data. The third is live data, i.e. data that can change continuously within a certain range and can be used normally. Dead data and empty data are reported as errors, and live data is extracted for further data governance operations. The abnormal data governance framework for the demand response database in this study is shown in Fig. 1.
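The three-way classification above can be illustrated with a minimal sketch. The function name `classify_series`, the `tol` parameter, and the representation of missing values as `None` or NaN are assumptions made for illustration, not details given in the study.

```python
import math
from typing import Optional, Sequence


def classify_series(values: Sequence[Optional[float]], tol: float = 0.0) -> str:
    """Label one recorded series as 'empty', 'dead', or 'live'.

    Assumption: missing entries are stored as None or NaN.
    """
    # Keep only the entries that actually carry a value.
    present = [v for v in values if v is not None and not math.isnan(v)]
    if not present:
        return "empty"  # records exist but contain no data
    if max(present) - min(present) <= tol:
        return "dead"   # the value never changes
    return "live"       # the value varies and can be used normally
```

Under this sketch, series labelled `"dead"` or `"empty"` would be reported as errors, and only `"live"` series would be passed on to extraction and repair.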
Abnormal data governance framework.
As shown in Fig. 1, the data governance solution for this study is broadly divided into three modules. The first is the classification and collation of the data: the demand response database contains a large number of different types of data, which need to be collated and classified before cleaning can be carried out. The second is the abnormal data extraction module of the model. The third is the data repair module, which carries out data repair operations on the extracted abnormal data. The study selects three main data repair methods, namely data noise smoothing, missing data filling and duplicate data cleaning. The data in the response database contains a certain amount of white noise. For the data anomalies caused by noise, the study introduces the smooth function to carry out data repair, and the Generalized Cross-Validation (GCV) method is chosen to select the smoothing parameter [12, 13]. Equation (1) displays the smooth function’s mathematical expression.
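As a rough illustration of how GCV can select a smoothing parameter, the sketch below assumes a Gaussian kernel smoother on equally spaced samples rather than the study’s own smooth function. The names `kernel_smooth`, `gcv_score` and `select_bandwidth`, and the idea of choosing the bandwidth `h` from a candidate list, are hypothetical. The score used is the standard GCV criterion for a linear smoother with hat matrix H: (RSS / n) / (1 - tr(H) / n)^2.

```python
import math
from typing import List, Sequence, Tuple


def kernel_smooth(y: Sequence[float], h: float) -> Tuple[List[float], float]:
    """Gaussian kernel smoother on equally spaced samples.

    Returns the smoothed series and the trace of the hat matrix H.
    """
    n = len(y)
    smoothed: List[float] = []
    trace = 0.0
    for i in range(n):
        weights = [math.exp(-0.5 * ((i - j) / h) ** 2) for j in range(n)]
        total = sum(weights)
        smoothed.append(sum(w * v for w, v in zip(weights, y)) / total)
        trace += weights[i] / total  # diagonal entry H[i][i]
    return smoothed, trace


def gcv_score(y: Sequence[float], h: float) -> float:
    """Generalized cross-validation score for bandwidth h."""
    n = len(y)
    smoothed, trace = kernel_smooth(y, h)
    rss = sum((a - b) ** 2 for a, b in zip(y, smoothed))
    return (rss / n) / (1.0 - trace / n) ** 2


def select_bandwidth(y: Sequence[float], candidates: Sequence[float]) -> float:
    """Pick the candidate bandwidth with the lowest GCV score."""
    return min(candidates, key=lambda h: gcv_score(y, h))
```

Very small bandwidths push tr(H) towards n and make the score numerically unstable, so the candidate list in this sketch should start at roughly half a sample spacing or more.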
As in Eq. (1),
Equation (2) is the expression formula for the temporal data points of the response database, and the GCV method was chosen for the study to carry out the selection of the smooth parameter
As in Eq. (3),
As in Eq. (4),
This study employs a combined data filling strategy based on the Gaussian process regression (GPR) model and the LSTM neural network model; Eq. (6) gives the method’s mathematical expression.
As in Eq. (6),
LSTM model structure diagram.
In Fig. 2, the product of vector elements is represented by
As in Eq. (7),
As in Eq. (8),
As in Eq. (9),
In this study, the Bi-LSTM-based anomaly data governance model depicted in Fig. 1 is divided into two modules: anomaly data extraction and anomaly data repair. For the anomaly data extraction module, the research first normalises the input data set and then carries out anomaly data prediction and extraction through the Bi-LSTM network model optimised by the Adam algorithm. The training process of the optimised Bi-LSTM network model is shown in Fig. 3.
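For readers unfamiliar with Adam, the following is a minimal sketch of the standard Adam update rule applied to a generic gradient function. It is not the study’s training code: the function name `adam_minimise` is hypothetical, and the hyperparameter defaults (`lr`, `beta1`, `beta2`, `eps`, `steps`) are commonly used values assumed here, not values reported in the study.

```python
import math
from typing import Callable, List, Sequence


def adam_minimise(
    grad: Callable[[List[float]], Sequence[float]],
    theta0: Sequence[float],
    lr: float = 0.1,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
    steps: int = 500,
) -> List[float]:
    """Minimise a function with the Adam update rule, given its gradient."""
    theta = list(theta0)
    m = [0.0] * len(theta)  # running mean of gradients (first moment)
    v = [0.0] * len(theta)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(theta)
        for i, gi in enumerate(g):
            m[i] = beta1 * m[i] + (1.0 - beta1) * gi
            v[i] = beta2 * v[i] + (1.0 - beta2) * gi * gi
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias-corrected first moment
            v_hat = v[i] / (1.0 - beta2 ** t)  # bias-corrected second moment
            theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

For example, calling `adam_minimise(lambda p: [2 * (p[0] - 3.0)], [0.0])` drives the single parameter towards the minimum of (x - 3)^2. In network training the same per-parameter update is applied to every weight, with the gradient obtained by backpropagation.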
Optimized Bi-LSTM network model training process.
As shown in Fig. 3, the model first preprocesses the dataset and converts it into a 3D tensor matching the model input, with the time step set to 4 and the feature dimension set to 5. Dropout is added after the Bi-LSTM layer to avoid overfitting during training. As data extraction is a binary classification problem, the study sets the activation function of the dense layer to the sigmoid function, so that the output lies in the interval (0, 1). The study also chose the Adam algorithm to optimise model training, with the mean square error (MSE) as the loss function [16]. The normalisation formula for the data set is shown in Eq. (10).
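The preprocessing step described above can be sketched as follows, assuming min-max scaling to [0, 1] (which may differ from the normalisation in Eq. (10)) and overlapping sliding windows. The function names `min_max_scale` and `make_windows` are hypothetical; only the time step of 4 and the feature dimension of 5 come from the text.

```python
from typing import List, Sequence


def min_max_scale(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Scale each feature column to [0, 1]; constant columns map to 0.0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [max(c) - min(c) for c in columns]
    return [
        [(x - lo) / span if span else 0.0 for x, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def make_windows(
    rows: Sequence[Sequence[float]], time_steps: int = 4
) -> List[List[List[float]]]:
    """Build overlapping windows with shape (samples, time_steps, features)."""
    return [
        [list(r) for r in rows[i:i + time_steps]]
        for i in range(len(rows) - time_steps + 1)
    ]
```

With 10 records of 5 features each, `make_windows(min_max_scale(rows), 4)` yields 7 windows of shape 4 by 5, which is the 3D layout a recurrent layer expects.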
As in Eq. (10),
As in Eq. (11),
Abnormality identification and repair program process.
As shown in Fig. 4, the module first cleans the extracted anomalous data one by one and then calculates the residuals of the model, the formula for which is shown in Eq. (12).
As in Eq. (12),
In this study, two categorical evaluation metrics, Precision and Recall, will be introduced to assess the effectiveness of abnormal data extraction, and Relative Error, Mean Absolute Percentage Error (MAPE) and Root Mean Squared Error (RMSE) will be introduced to assess the effectiveness of data restoration of the model [17]. The relative error is mainly used to assess the degree of fit between the measured and predicted values, and the relevant formula is shown in Eq. (13).
As shown in Eq. (13),
As in Eq. (14),
The smaller the RMSE, the less dispersed the prediction errors and the better the model prediction.
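The evaluation metrics named above can be sketched with their standard textbook definitions, which are assumed to match the study’s formulas up to notation. The function names are hypothetical, labels are assumed to be 1 for anomalous and 0 for normal, and `mape` assumes that no actual value is zero.

```python
import math
from typing import Sequence, Tuple


def precision_recall(actual: Sequence[int], predicted: Sequence[int]) -> Tuple[float, float]:
    """Precision and recall for binary anomaly labels (1 = anomalous)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

As an illustration, a detector that flags four of five true anomalies and raises no false alarms has a precision of 1.0 and a recall of 0.8 under these definitions.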
To test the viability of the research model and evaluate how effectively it performs in both anomalous data extraction and anomalous data repair, the study conducts data governance experiments on real anomalous data from the demand response database. The model must be trained in a specific experimental environment, and the hardware and software requirements for this experiment are specified in Table 1.
Experimental software and hardware environment
The raw data for this experiment came from the real load demand response database of Chaoyang City, Liaoning Province, China. A total of 2000 demand response records from one month were arbitrarily selected as the raw data for the abnormal data extraction and repair experiments. To confirm the model’s anomalous data repair effect, the model training experiments used the first 1500 records as the training group and the final 500 as the test group. A sample of the raw data is shown in Fig. 5.
Abnormality identification and repair program process.
Figure 5 shows the raw data of the training and test sets for the experiment, where the training set data will be processed by the Bi-LSTM model optimised by the Adam algorithm and the test set data will be predicted by the unimproved LSTM model. The experiments first extract the anomalies from the raw data, with the discriminant rule for anomalies established from the standard deviation of the model’s residuals. Figure 6 displays the cumulative probability plot of the training and test models’ residuals.
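A residual-based discriminant rule of this kind can be sketched as follows. The residual is assumed to be the actual value minus the predicted value, and the multiplier `k = 3.0` (a three-sigma rule) is an assumption for illustration. The function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def residual_sigma(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Standard deviation of the model residuals (actual minus predicted)."""
    return statistics.pstdev(a - p for a, p in zip(actual, predicted))


def flag_anomalies(
    actual: Sequence[float],
    predicted: Sequence[float],
    sigma: float,
    k: float = 3.0,
) -> List[bool]:
    """Flag points whose absolute residual exceeds k times sigma."""
    return [abs(a - p) > k * sigma for a, p in zip(actual, predicted)]
```

In this sketch `sigma` would be estimated once from residuals on data believed to be normal and then reused when screening new records, so that a few large anomalies do not inflate the threshold.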
Model residual cumulative probability graph.
As shown in Fig. 6, the training errors of the improved model were more concentrated than before the improvement, and the training errors of the model before and after the improvement were better fitted to the normal reference line, and the standard deviation of the residuals calculated by the model was about 0.005 MPa. The training error of most of the data is within the error interval (
The loss and accuracy of the model on training and testing sets.
Figure 7 shows that the model’s accuracy on the training set was 100%, 30.5% higher than on the test set; the recall was 80%, 25.5% higher than on the test set; and the model loss was 7.1%, 46.2% lower than on the test set. The improved model also converged more quickly, with training accuracy converging the fastest. Overall, accuracy, recall and loss all converged at approximately 300 iterations, at which point the model reached its optimum. Because the demand response data is time-series data, the model is tested on identifying continuous outliers in the original data. The results of this outlier extraction are displayed in Fig. 8.
Abnormal data extraction results of the model before and after improvement.
As shown in Fig. 8(a), the time-series LSTM model before the improvement has obvious missed and false detections. Although the first anomaly in each run of continuous anomalous data is identified accurately, the locations of the actual and predicted anomalies do not consistently correspond. This is because the model is influenced by the time-series features and exhibits a certain lag, which easily leads to false and missed detections in the subsequent predictions of anomalous data. As shown in Fig. 8(b), the research model, which uses the bidirectional LSTM structure to avoid the influence of data lag and is optimised by the Adam algorithm, greatly improves the recognition accuracy of anomalous data.
Based on the model’s anomalous data extraction results, the study then repairs the anomalous data. On the same dataset as the research model, the experiments introduce the Back Propagation (BP) neural network model and the Support Vector Machine (SVM) model to compare and analyse the data repair effect of the research model, with RMSE and MAPE used to assess performance. The first step of anomaly repair is to discriminate the type of anomaly, and the repair operation can only be performed if the anomalous data is live data. The accuracy, precision and recall of the anomaly discrimination of each model are shown in Fig. 9.
Abnormal point discrimination result graph.
As shown in Fig. 9, the improved Bi-LSTM model had the highest accuracy, precision and recall among the experimental models and converged the fastest on all three, with 100% precision, 95.07% accuracy and 80% recall. The BP and SVM models had relatively high precision, at 90.39% and 94.71% respectively, but their recall was almost 10% lower than that of the research model, and the unimproved LSTM model performed worst. Overall, the research model improved performance, with both accuracy and recall significantly higher than those of the common models. Next, data repair experiments covering noise smoothing, data filling and duplicate data cleaning will be performed on the anomalous data, and the repair effect will be evaluated by RMSE and MAPE. The data repair evaluation results of the model are shown in Table 2.
Abnormal data repair evaluation results
As shown in Table 2, the improved Bi-LSTM model had the best anomaly data repair performance, with the lowest RMSE and MAPE values of 0.0049 MPa and 1.375%, respectively. The BP and SVM models had identical RMSE and MAPE values of 0.006 MPa and 1.684%, respectively. The unimproved LSTM model had the worst repair performance, with RMSE and MAPE values of 0.008 MPa and 2.463% respectively; compared with this pre-improvement model, the research model reduced the RMSE by 0.0031 MPa and the MAPE by 1.088%. Overall, the anomalous data repair effectiveness of the research model was improved over both the pre-improvement LSTM model and the related machine learning methods. Data repair experiments were conducted on the anomalous data with the research model, and the repair results compared with the actual data are shown in Fig. 10.
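The reductions quoted above follow directly from the Table 2 values for the unimproved LSTM model and the research model. The relative reductions in the right-hand column are derived here from those values and are not stated in the study.

```latex
\begin{align*}
\Delta_{\mathrm{RMSE}} &= 0.008 - 0.0049 = 0.0031\ \text{MPa}, &
\frac{0.0031}{0.008} &\approx 38.8\%, \\
\Delta_{\mathrm{MAPE}} &= 2.463\% - 1.375\% = 1.088\ \text{percentage points}, &
\frac{1.088}{2.463} &\approx 44.2\%.
\end{align*}
```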
Abnormal data repair result graph.
As shown in Fig. 10(a), the repaired values of the improved model fit the actual values well, and the repair effect of the model is in line with the real situation. As shown in Fig. 10(b), the unimproved LSTM model produces repair values that are too high at the crests, which may be due to errors in the model’s preservation of information about the extreme values of the data. Both models also share the problem that the repaired values at the troughs are too flat, so some of the fluctuation information is lost. Overall, the repair results of the improved model are better than those of the pre-improvement model, and its anomalous data repair is more effective.
To meet the power industry’s demand response data processing needs, the study designs LSTM-based anomalous data extraction and repair models for the effective processing of compound anomalous demand response data, and introduces relevant algorithms for comparative experiments. The experimental results show that, firstly, in terms of anomalous data extraction, the standard deviation of the residuals of the research model is around 0.005 MPa, with an error interval (
Acknowledgments
The work was financially supported by the 2023 Open Fund Project of the Beijing Key Laboratory of Demand Side Multi-Energy Carriers Optimization and Interaction Technique (Research on Construction and Data Governance Method of Demand Response Surveying Database Based on Machine Learning, YDB51202301442).
