Abstract
Monthly electricity generation forecasting is constrained by the scarcity of actual data, suffers large errors caused by factors such as weather and holidays, and is served by existing methods that apply only to limited scenarios. To address these problems, this paper proposes a monthly electricity generation forecasting model based on similar month screening and Seasonal and Trend decomposition using Loess (STL). The complementary advantages of Multiple Linear Regression (MLR) and an improved Random Forest Regression (RFR) are combined to predict monthly electricity generation at the provincial level. The model does not require a large amount of data to achieve good prediction accuracy, and it overcomes the limitation of existing monthly electricity prediction models, which are suitable only for a particular industry or region. Experiments performed on an actual electric power generation series validate the effectiveness of the proposed model.
Introduction
In recent years, China’s electric power industry has developed rapidly. Scientific prediction of power data provides important basic data for the rational planning and design of the electric power system. Accurate monthly electricity forecasts provide a guiding basis for formulating medium- and long-term energy deployment plans, equipment maintenance and overhaul plans, and medium- and long-term power generation plans [1]. Consequently, considerable progress has been made in electricity forecasting technologies [2, 3, 4, 5].
The existing monthly electricity prediction models mainly fall into two categories: one considers only the historical electricity data and predicts future values according to the change law of the historical data; the other also takes the relevant influencing factors into account. The former is more suitable for data series with strong autocorrelation, whereas for electricity data with weak autocorrelation the prediction must draw on the relevant influencing factors. In literature [6], the Pearson correlation coefficient was adopted to quantify industry correlation and select strongly correlated industries. Then, a Long Short-Term Memory (LSTM) model was used to forecast the medium- and long-term industrial load in combination with the historical data of the related industries. The feasibility and accuracy of this model were illustrated by taking 2 major industries and 30 related industries in Jiangsu Province as examples. However, the model is not suitable for forecasting the electricity of the whole society in a province. In literature [7], a prediction model of available photovoltaic power consumption based on the autoregressive integrated moving average (ARIMA) model was established to realize the horizontal monthly prediction. Then, a forecast value correction method based on the hidden Markov model was established to realize the vertical correction of the future predicted values. However, the ARIMA model used there can by nature capture only linear relationships, so the prediction accuracy is not optimal. Literature [8] established a medium- and long-term power load prediction model based on a recurrent neural network, considering relevant influencing factors such as holidays, weather and temperature. The relationship between inputs and outputs was obtained through extended calculation in the hidden layer.
The model was trained by backpropagation through time, so that the parameters of the neural network were optimized to achieve a better prediction effect. In literature [9], a medium- and long-term electricity prediction model considering correlation factors was proposed. Firstly, a correlation matrix was used to screen out the factors strongly correlated with the electricity data among 27 influencing factors, and the X-12 method was used to decompose the influencing factor data and the electricity data into three components each. Then, the lag effect of the influencing factors was taken into account, and the data were reduced by principal component analysis, so the final prediction accuracy was clearly improved. Literature [10] integrated the annual, monthly and daily electricity data and took the influencing factors of the electricity data into account to propose a stacked LSTM model for medium- and long-term electricity prediction. The effectiveness of this model was verified through comparative experiments. However, this model is not suitable for monthly electricity forecasting with small data volumes. In literature [11], the influencing factors of the electricity consumption data of various industries were selected based on causality analysis, and a sub-industry electricity prediction model with embedded causality conduction was proposed based on seasonal decomposition and vector error correction. The applicability of the proposed model was verified through case analysis, but the model is limited to monthly electricity forecasts for individual industries.
Literature [12] used principal component analysis to process the features, calculated the grey correlation degree between each principal component and the maximum load, determined the weights of the regression model according to the degree of association, established a locally weighted regression prediction model based on grey correlation degree and particle swarm optimization, and verified the effectiveness of the model through case analysis. In summary, since power data are affected by a variety of external factors and tend to exhibit strongly stochastic and highly nonlinear sequence features, taking the external factors into account improves the prediction accuracy, and when the amount of data is small, machine learning models can perform better than deep learning models.
Seasonal and Trend decomposition using Loess (STL), which is based on locally weighted regression, is a time series decomposition algorithm. In literature [13], the STL algorithm was used to decompose expressway traffic flow; the optimal model was selected for each decomposed component, and the components were then predicted separately. Literature [14] analyzed the characteristics of the runoff subsequences obtained from STL decomposition, and used a multi-model combination to obtain the predicted value of runoff. Literature [15] proposed a monthly electricity prediction model based on STL and X12 decomposition. Firstly, the X12 decomposition algorithm was used to extract the trend component of economic factor data, and the STL decomposition algorithm was used to extract the trend, seasonal and residual components of electricity data. Then the trend component was predicted according to the changing trend of the economy, the seasonal component was predicted according to whether the month is a seasonal inflection point, and the average of the historical data was directly adopted as the predicted value of the residual component. Finally, the predicted value of monthly electricity data was obtained by superimposing the predicted results. Through case analysis, this body of literature has shown that STL is a general time series decomposition algorithm that is robust to outliers and can handle data with arbitrary periodic changes; because both the rate of change of the periodic component and the smoothness of the trend component are controllable, it is widely used. In this paper, the monthly electricity data are decomposed with the STL algorithm to fully explore their changing law.
For monthly electricity data, the basic data are often limited because the data collection facilities of the power system were completed only recently and the collection interval is long, which often leads to low accuracy of monthly electricity forecasting. Literature [16] designed an algorithm based on the correlation between historical meteorological data and electricity data to expand the electricity data using historical meteorological data, but this model depends on having a large amount of meteorological data. Literature [17] proposed a seasonal index model based on grey model optimization, which achieves high and stable forecasting performance even when the monthly electricity dataset is small. However, the model considers only the trend and seasonal pattern of the electricity data and ignores external influences, so it becomes ineffective when the electricity data show no obvious trend or cyclical pattern.
However, for monthly electricity prediction, the application scenarios of the existing research results are all limited, and the results cannot be fully applied to province-wide electricity forecasting. Therefore, based on the similar day theory [18, 19, 20, 21], this paper proposes a monthly electricity prediction model based on similar month screening and STL decomposition. Taking weather and holidays into account, similar month screening and STL decomposition are used to preprocess the data, which avoids model failure when the power data show no obvious trend or periodic law. The decomposed components are predicted with the models better suited to them (a Multiple Linear Regression (MLR) model for the trend component, and an improved Random Forest Regression (RFR) model for the detrended component), which improves the prediction accuracy. Finally, the predicted values are superimposed to obtain the predicted monthly electricity generation. This model addresses the problems that monthly electricity generation prediction is limited by the scarcity of actual electricity data and that the prediction results suffer large errors under the influence of factors such as weather and holidays, and it overcomes the limited application scenarios of existing research results.
This article is organized as follows: Section 1 reviews the literature; Section 2 presents the theoretical basis; Section 3 describes the forecasting process of the proposed model; Section 4 presents a case analysis, including comparative experiments; Section 5 gives a summary and outlook.
Theoretical basis
Maximal information coefficient algorithm
Maximal Information Coefficient (MIC) is an algorithm used to analyze the degree of correlation between two variables based on mutual information and grid partitioning [22] for linear or nonlinear data, and is widely used in feature screening for machine learning.
For variable
Where,
The calculating steps of MIC are as follows:
Step1: The scatter diagram D composed of variables
In summary, the definition of MIC formula is shown in Eq. (2).
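The idea behind MIC (grid partitioning, mutual information, normalization, then a maximum over grids) can be illustrated with a deliberately simplified sketch. It is an approximation, not the full algorithm: it only tries equal-width grids, whereas MIC proper also optimizes where the grid lines are placed. The function names and the grid budget exponent `alpha=0.6` (a common default for the grid size limit) are assumptions made for this illustration.

```python
import math
from collections import Counter


def _bin_index(value, low, high, bins):
    """Map a value to one of `bins` equal-width cells on [low, high]."""
    if high == low:
        return 0
    idx = int((value - low) / (high - low) * bins)
    return min(idx, bins - 1)  # the maximum value falls in the last cell


def grid_mutual_information(xs, ys, nx, ny):
    """Mutual information (in bits) of the data binned on an nx-by-ny grid."""
    n = len(xs)
    x_lo, x_hi, y_lo, y_hi = min(xs), max(xs), min(ys), max(ys)
    cells = [(_bin_index(x, x_lo, x_hi, nx), _bin_index(y, y_lo, y_hi, ny))
             for x, y in zip(xs, ys)]
    joint = Counter(cells)
    px = Counter(i for i, _ in cells)
    py = Counter(j for _, j in cells)
    mi = 0.0
    for (i, j), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with raw counts
        mi += p_xy * math.log2(p_xy * n * n / (px[i] * py[j]))
    return mi


def approx_mic(xs, ys, alpha=0.6):
    """Largest normalized grid MI over equal-width grids with nx*ny <= n**alpha."""
    budget = len(xs) ** alpha
    best = 0.0
    nx = 2
    while nx * 2 <= budget:
        ny = 2
        while nx * ny <= budget:
            mi = grid_mutual_information(xs, ys, nx, ny)
            # Normalizing by log2(min(nx, ny)) keeps the score in [0, 1].
            best = max(best, mi / math.log2(min(nx, ny)))
            ny += 1
        nx += 1
    return best


# Hypothetical data: a linear and a U-shaped dependence on the same x values.
xs = [i / 49 for i in range(50)]
score_linear = approx_mic(xs, xs)
score_curve = approx_mic(xs, [(x - 0.5) ** 2 for x in xs])
```

Because the score is built from mutual information rather than from a linear fit, a nonlinear dependence such as the U-shaped curve can still score well above zero, which is the property that makes MIC useful for feature screening.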
The Grey Relation Analysis (GRA) algorithm quantitatively analyzes the degree of correlation between samples according to the similarity or dissimilarity of their variation trends. Since the corresponding elements of each group of samples have the same units, the data do not need to be made dimensionless in this paper. The correlation coefficients are calculated as follows:
Step1: Set reference sequence
Step2: Calculate the grey correlation coefficient
Where,
Step3: Calculate the correlation degree
As can be seen from Eq. (6), the correlation degree averages the correlation coefficients between the values of each influencing factor in a sample and the corresponding values of the month to be predicted, condensing the scattered coefficients into a single value. The GRA algorithm can therefore be used to screen the months similar to the month to be predicted.
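The three GRA steps can be sketched as follows. This is a minimal illustration that assumes the commonly used form of the grey relational coefficient with a resolution coefficient `rho = 0.5`; the function names, the sample values and the threshold are hypothetical, and the raw values are used without normalization, as the text describes.

```python
def grey_relational_degrees(reference, candidates, rho=0.5):
    """Return the grey relational degree of each candidate to the reference.

    reference  -- feature values of the month to be predicted
    candidates -- list of historical months, each a list of the same features
    rho        -- resolution coefficient (0.5 is a common default, assumed here)
    """
    # Absolute differences between each candidate and the reference sequence.
    diffs = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    flat = [d for row in diffs for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:  # every candidate equals the reference
        return [1.0 for _ in candidates]

    degrees = []
    for row in diffs:
        # Grey relational coefficient for every feature of this candidate.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        # The relational degree is the mean coefficient over all features.
        degrees.append(sum(coeffs) / len(coeffs))
    return degrees


def screen_similar_months(reference, history, threshold):
    """Keep the historical months whose relational degree reaches the threshold."""
    labels = list(history)
    degrees = grey_relational_degrees(reference, [history[k] for k in labels])
    return [(label, r) for label, r in zip(labels, degrees) if r >= threshold]


# Hypothetical monthly samples: [air temperature, pressure, dew point].
target = [27.0, 1005.0, 22.0]
history = {
    "month_a": [27.0, 1005.0, 22.0],  # identical to the target
    "month_b": [26.0, 1006.0, 21.5],
    "month_c": [12.0, 1020.0, 5.0],   # a much colder month
}
similar = screen_similar_months(target, history, threshold=0.85)
```

With these made-up values, the two warm months pass the threshold and the cold month is excluded, which is the screening behaviour the similar set relies on.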
The principle of STL decomposition follows [23]. Seasonal and Trend decomposition using Loess (STL) is a general decomposition algorithm, robust to outliers, that decomposes a time series into three components: trend, seasonal and residual terms [23], as shown in Eq. (7). Loess (locally weighted regression), a nonlinear relationship estimation algorithm, is used to obtain smoothed estimates of the components.
Where
The STL algorithm is divided into an inner loop and an outer loop. Each inner loop uses the Loess algorithm to perform weighted regression on the data, smoothing the trend and periodic terms. After each inner loop ends, the outer loop calculates robustness weights from the results of the inner loop to weaken the effect of noise on the trend and periodic terms in the next inner loop. The inner loop process is shown in Fig. 1 and includes six steps: detrending, smoothing the periodic subsequences, low-pass filtering of the smoothed periodic subsequences, detrending the smoothed periodic subsequences, deseasonalizing, and smoothing the trend term. A new trend component and seasonal component are obtained at the end of each inner loop, and the residual component is calculated from these two components in the outer loop.
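The outer-loop step can be made concrete with a short sketch of the robustness weights. It assumes the bisquare weighting of the standard STL formulation, with the scale set to six times the median absolute residual; the function name and the sample residuals are hypothetical.

```python
from statistics import median


def robustness_weights(residuals):
    """Bisquare robustness weights computed from the residual component."""
    abs_res = [abs(r) for r in residuals]
    h = 6.0 * median(abs_res)
    if h == 0:  # no residual spread: weight every point fully
        return [1.0 for _ in residuals]
    weights = []
    for a in abs_res:
        u = a / h
        # Bisquare function: (1 - u^2)^2 inside the unit interval, else 0.
        weights.append((1.0 - u * u) ** 2 if u < 1.0 else 0.0)
    return weights


residuals = [0.1, -0.2, 0.1, 5.0, -0.1]  # hypothetical values with one outlier
weights = robustness_weights(residuals)
```

Points with small residuals keep a weight close to one, while the outlier receives a weight of zero and therefore no longer influences the Loess fits of the next inner loop.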
Flow of the inner loop.
Principle of RFR algorithm.
The principle of the MLR model is to set up a regression equation that uses multiple independent variables to fit the dependent variable, and to explain and predict the value of the dependent variable by calculating the coefficients of the independent variables and the intercept. When the data present an approximately linear relationship, MLR trains faster than a neural network [24]. The main idea of using MLR for trend item prediction is to take the moment
Where,
Random Forest (RF) is an algorithm [25] that builds a forest of multiple decision trees in a random way. RF can be used for both classification and regression problems. For classification, each tree in the forest judges the category of a new sample and the result of voting is taken as the output of RF; for regression, the output of the random forest is the average of the outputs of all decision trees.
The principle of RFR is shown in Fig. 2, and the steps are as follows:
Step1: Assuming that the number of samples in a given training set is
It can be seen that there are two random processes in the construction of an RFR model: one is that the training set of each decision tree is randomly sampled; the other is that the feature candidates are randomly sampled when the decision tree is constructed. These two random processes reduce the correlation between the decision trees and improve the generalization ability of the prediction model.
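The two random processes can be sketched in a few lines. The function names, the sample size and the feature counts below are hypothetical, chosen only to illustrate bootstrap sampling of rows and random selection of candidate features.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples row indices with replacement (one tree's training set)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate features for one node split."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = bootstrap_sample(10, rng)
features = random_feature_subset(5, 2, rng)
# Rows that were never drawn are "out of bag" for this tree.
out_of_bag = sorted(set(range(10)) - set(rows))
```

Because every tree sees a different resampled training set and a different subset of candidate features, the individual trees make partly independent errors, which averaging can then cancel.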
Support vector regression
Support Vector Regression (SVR) shows strong generalization ability. Its principle is to find a decision plane with minimal structural risk that minimizes the distance between the sample points and the plane [26], so as to fit the data.
Suppose that the training set and the regression function are respectively:
Where,
The basic SVR model is:
In order to make the model have stronger generalization ability, relaxation variables
Obviously, Eq. (2.6) is a quadratic programming problem, so the Lagrange multiplier method is introduced to transform it into its dual problem:
The above problem needs to satisfy the KKT conditions, and the resulting regression function is:
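For orientation, the standard soft-margin formulation of SVR can be written as below. The notation is assumed for this illustration and may differ from the symbols used in the paper: \(w\) and \(b\) are the weight vector and bias, \(\varphi\) the feature map, \(\varepsilon\) the width of the insensitive tube, \(C\) the penalty coefficient, \(\xi_i, \xi_i^{*}\) the slack variables, \(\alpha_i, \alpha_i^{*}\) the Lagrange multipliers, and \(K\) the kernel function.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi,\,\xi^{*}} \quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \left( \xi_i + \xi_i^{*} \right) \\
\text{s.t.} \quad & y_i - w^{\top}\varphi(x_i) - b \le \varepsilon + \xi_i, \\
& w^{\top}\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \\
& \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \quad i = 1, \dots, n,
\end{aligned}
\qquad
f(x) = \sum_{i=1}^{n} \left( \alpha_i - \alpha_i^{*} \right) K(x_i, x) + b .
```

Only the samples with \(\alpha_i - \alpha_i^{*} \neq 0\), the support vectors, contribute to \(f(x)\), which is one reason SVR copes well with small samples.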
Model flow chart.
On the basis of the above theories, this paper proposes a monthly electricity generation forecasting model based on similar month screening and STL decomposition, as shown in Fig. 3.
The prediction steps are as follows:
Step1: Data decomposition. Use the STL algorithm to decompose the monthly electricity data into trend, periodic and residual components, and superpose the periodic and residual components to obtain the detrended component. The trend component is mainly affected by long-term factors such as the economy, and the detrended component is mainly affected by uncertain factors such as holidays and weather.
Step2: Screening meteorological factors. Use the MIC algorithm to analyze the correlation between the periodic component of the electricity data and the meteorological data, and keep the meteorological factors with strong correlation.
Step3: Construct the similarity set. Use the GRA method to obtain the historical months whose meteorological samples are similar to those of the month to be tested as the similarity set.
Step4: Forecasting. Use the MLR model for the trend component prediction, and the improved RFR algorithm for the detrended component prediction.
Step5: Superimpose the two prediction results of Step4 to obtain the monthly electricity generation forecast value.
Among them, the improved RFR model is shown in Fig. 4; it essentially fuses the RFR model and the SVR model. The traditional RFR model gives the same weight to each decision subtree, so the prediction result is the simple average of the output values of the decision subtrees, which does not take into account the performance of each individual subtree. SVR shows relatively excellent prediction performance due to its ability to flexibly handle high-dimensional, nonlinear, and small-sample data. Therefore, this paper proposes a prediction model fusing RFR and SVR.
According to Fig. 4, the original samples are first divided into input and output data and normalized. The training samples of each decision subtree are obtained through the Bootstrap sampling method. The decision results of the decision subtrees are used as the input of the SVR model, while the normalized output data are used as the output of the SVR model. The support vector regression algorithm is thus used to integrate the performance of each decision subtree and further improve the prediction accuracy of the RFR model.
Improved RFR prediction model.
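The fusion described for Fig. 4 can be sketched with scikit-learn, whose `RandomForestRegressor` and `SVR` expose the parameters named later in the paper (`n_estimators`, `random_state`, `C`, `gamma`). This is a minimal illustration under assumptions, not the authors' implementation: the class name `ForestSVRFusion`, the min-max normalization scheme, the default parameter values and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR


class ForestSVRFusion:
    """Random forest whose per-tree outputs are combined by an SVR (a sketch)."""

    def __init__(self, n_estimators=50, random_state=0, C=10.0, gamma="scale"):
        self.forest = RandomForestRegressor(
            n_estimators=n_estimators, random_state=random_state)
        self.svr = SVR(kernel="rbf", C=C, gamma=gamma)

    def _scale_inputs(self, X):
        return (np.asarray(X, dtype=float) - self.x_min) / self.x_span

    def _tree_outputs(self, X_scaled):
        # One column per decision tree: shape (n_samples, n_estimators).
        return np.column_stack(
            [tree.predict(X_scaled) for tree in self.forest.estimators_])

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Min-max normalization of inputs and output (assumed scheme).
        self.x_min = X.min(axis=0)
        span = np.ptp(X, axis=0)
        self.x_span = np.where(span == 0, 1.0, span)
        self.y_min = y.min()
        self.y_span = float(np.ptp(y)) or 1.0
        X_scaled = self._scale_inputs(X)
        y_scaled = (y - self.y_min) / self.y_span
        # Each tree is trained on its own bootstrap sample by default.
        self.forest.fit(X_scaled, y_scaled)
        # The SVR learns how to combine the individual tree outputs.
        self.svr.fit(self._tree_outputs(X_scaled), y_scaled)
        return self

    def predict(self, X):
        scaled = self.svr.predict(self._tree_outputs(self._scale_inputs(X)))
        return scaled * self.y_span + self.y_min  # undo the output scaling


# Synthetic data standing in for five input features and one target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(60, 5))
y = 10.0 * np.sin(2 * np.pi * X[:, 0]) + 5.0 * X[:, 4] + rng.normal(scale=0.5, size=60)
model = ForestSVRFusion().fit(X[:55], y[:55])
pred = model.predict(X[55:])
```

One design caveat of this sketch: the SVR is fitted on tree outputs for the same samples the trees were trained on, which can make the trees look more accurate than they are. Using out-of-bag or held-out tree predictions for the SVR stage would be a possible refinement.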
Experimental data pre-processing
In this paper, the monthly electricity generation data of a province from February 2005 to May 2022 were collected, together with the corresponding meteorological data (temperature, barometric pressure, etc.) and the number of holidays in each month. The electricity data were collected at an interval of one month, and the meteorological data were the monthly averages for the capital city of the province, giving a total of 212 sets of samples. The samples from February 2005 to May 2020 are used as the training set, and the samples from June 2020 to May 2022 are used as the test set.
Electricity decomposition results
The STL decomposition algorithm is used to extract the trend component, periodic component and residual component from the original electricity data, as shown in Fig. 5.
STL decomposition diagram.
As can be seen from Fig. 5, the trend component shows an approximately linear change, which is mainly influenced by long-term factors such as the economy. The periodic component has a cycle of 12 months and is mainly influenced by meteorological factors. However, as the trend component grows year by year, the fluctuation range of the periodic component becomes larger, so the value of the trend component should also be used as a feature of the detrended component. The residual component has a certain randomness and is mainly affected by uncertainties such as holidays and major social events. The detrended component is obtained by superposing the periodic component and the residual component.
The meteorological data collected in this paper include air temperature (T), atmospheric pressure (Pa), wind speed (Ff), relative humidity (U) and dew point temperature (Td). Figure 6 shows scatter plots of each meteorological factor against the periodic component of electricity. According to the scatter plots, air temperature, atmospheric pressure and dew point temperature show clear nonlinear relationships with the periodic component of electricity, while wind speed and relative humidity show no significant correlation with it. In order to quantify these correlations, the MIC algorithm is used to calculate the correlation degree between each meteorological factor and the periodic component of electricity, and the results are shown in Table 1. Table 2 shows the general criteria for the degree of correlation.
Results of correlation analysis of meteorological factors
Correlation degree classification criteria
Scatter diagram of each meteorological factor and periodic component of electricity.
Based on the quantitative results in Table 1 and the correlation degree classification criteria in Table 2, temperature has the highest correlation with the periodic component of electricity, with a correlation coefficient of 0.83, which falls in the “very strong” category. Relative humidity has the lowest correlation, with a coefficient of 0.22, which falls in the “weak” category. Therefore, the threshold of the correlation coefficient is set to 0.7 in this paper; that is, the meteorological factors with a “very strong” or “strong” correlation to the periodic component, namely air temperature, atmospheric pressure and dew point temperature, are selected as feature data.
In order to reduce the training difficulty of the random forest regression prediction model, it is necessary to analyze the degree of similarity between the weather series of the month to be tested and the weather series of the historical months, and to set a threshold to screen the historical data with weather similar to that of the month to be tested as the training set. The similarity threshold is obtained through multiple tests and comparisons: the similar set of the predicted month is screened under different thresholds, and the similarity value that yields the highest prediction accuracy after repeated tests is taken as the threshold. According to the criteria in Table 2, the similar months screened under the thresholds determined in this way are all “very strong” correlated to the predicted month. The similarity thresholds for each month are shown in Table 3. Taking June 2020 as an example, a similarity threshold of 0.85 gives the highest prediction accuracy for the detrended component, and the resulting similar set of that month is shown in Table 4 (Note: r in the table is the correlation).
Similarity threshold for each month
Similarity sets for June 2020
Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Symmetric Mean Absolute Percentage Error (SMAPE) are selected to evaluate the performance of the model proposed in this paper. The calculation formulae of each index are shown in Eqs (15)–(17).
Where,
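The three indexes can be computed as in the following sketch. The SMAPE variant used here, with the mean of the absolute actual and predicted values in the denominator and the result expressed in percent, is an assumption; other definitions of SMAPE exist, and the sample values are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def mae(actual, predicted):
    """Mean absolute error."""
    n = len(actual)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / n


def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (assumed variant)."""
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        if denom > 0:  # a pair of zeros contributes no error
            total += abs(p - a) / denom
    return 100.0 * total / len(actual)


# Hypothetical actual and predicted monthly values.
actual = [100.0, 200.0]
predicted = [110.0, 190.0]
scores = (rmse(actual, predicted), mae(actual, predicted), smape(actual, predicted))
```

RMSE and MAE are in the units of the electricity data, whereas SMAPE is scale-free, which makes it easier to compare months with very different generation levels.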
For the trend component of monthly electricity data, this paper uses the MLR algorithm to make rolling one-month-ahead predictions of the trend component from historical data with a lag of 12 months. For the detrended component, the improved RFR prediction algorithm is used, with the monthly averages of air temperature (T), atmospheric pressure (Pa) and dew point temperature (Td), the number of holiday days (F), and the predicted value of the trend component (Trend) as inputs. The input-output relationships of the trend (Trend) and detrended component (Detrend) prediction models are shown in Fig. 7 (Note:
Input-output relationship of prediction model.
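The rolling trend prediction can be sketched as follows. The sketch assumes that "a lag of 12 months" means the previous 12 monthly trend values serve as regressors, and that each forecast is fed back as an input for the next step; both are interpretations, and the function names and the synthetic trend are hypothetical.

```python
import numpy as np


def fit_lagged_mlr(series, n_lags=12):
    """Least-squares fit of y_t on an intercept and its n_lags previous values."""
    y = np.asarray(series, dtype=float)
    rows = [y[t - n_lags:t] for t in range(n_lags, len(y))]
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    # lstsq also copes with collinear lag columns (minimum-norm solution).
    coef, *_ = np.linalg.lstsq(X, y[n_lags:], rcond=None)
    return coef


def predict_next(series, coef):
    """One-step-ahead forecast from the most recent lags."""
    n_lags = len(coef) - 1
    recent = np.asarray(series[-n_lags:], dtype=float)
    return float(coef[0] + recent @ coef[1:])


def rolling_forecast(series, horizon, n_lags=12):
    """Roll forward one month at a time, feeding each forecast back in."""
    history = list(series)
    coef = fit_lagged_mlr(history, n_lags)
    forecasts = []
    for _ in range(horizon):
        nxt = predict_next(history, coef)
        forecasts.append(nxt)
        history.append(nxt)
    return forecasts


# Synthetic, perfectly linear trend: 48 months growing by 2 units per month.
trend = [100.0 + 2.0 * t for t in range(48)]
next_three = rolling_forecast(trend, horizon=3)
```

On an exactly linear trend the lagged regression reproduces the line, so the forecasts continue it; on real trend components the fit is only approximate, and the predicted trend value then becomes one of the inputs of the detrended-component model.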
In this section, the collected data are substituted into the proposed prediction model for case analysis, and two sets of comparison experiments are designed to verify the validity of the proposed model in this paper using the model evaluation metrics in Section 4.2:
(1) Experiment 1
In order to verify the performance of the monthly combined predictive model in this paper, the prediction results of the following commonly used single models and the medium- and long-term electricity prediction model proposed in the literature [12] are compared and analyzed.
Model-I: The model of this paper.
Model-II: The medium- and long-term electricity prediction model proposed in literature [12].
Model-III: Support vector regression (SVR) model.
Model-IV: Random forest regression (RFR) model.
Model-V: Multiple linear regression (MLR) model.
The monthly electricity generation forecast results of the five models for the test set (i.e., June 2020 to May 2022) are shown in Fig. 8, and the relative errors of the forecasts for each month are shown in Fig. 9.
The evaluation indexes in 4.2 are selected to evaluate the performance of the five models, and the evaluation results are shown in Table 5.
Errors of prediction results for each model in experiment 1
Forecast results of the five models in experiment 1.
Relative errors in the prediction for each month in experiment 1.
Comparison experiment 1 clearly shows that the model proposed in this paper has higher prediction accuracy. The single models focus only on the trend of the historical monthly electricity data, and their ability to mine the nonlinear relationships in the historical data is very limited, which leads to poor prediction performance. The model in literature [12] takes into account only the relationship between the influencing factors and the electricity data, and does not consider the trend and periodicity of the electricity data itself, so its prediction accuracy is low. In contrast, the prediction model in this paper combines different data preprocessing methods and time series forecasting models so that the advantages of the individual models complement each other, and it also takes the influencing factors of the electricity data into account, which further enhances its predictive performance.
(2) Experiment 2
In order to verify the necessity of similar month screening and of the RFR model improvement in the prediction model of this paper, the prediction results of the following models are compared and analyzed.
Model-I: The model of this paper.
Model-II: The improved RFR in the model of this paper is replaced with the unimproved RFR algorithm.
Model-III: The improved RFR in the model of this paper is replaced with the SVR algorithm.
Model-IV: All historical data samples are directly taken as the training set for each month to be tested, without similarity set screening.
Model-V: The similarity set is screened using the same similarity threshold (0.85) for every month to be predicted.
The monthly electricity generation forecast results of the five models for the test set (i.e., June 2020–May 2022) are shown in Fig. 10, and the relative errors of the forecast values for each month are shown in Fig. 11.
The evaluation indexes in 4.2 are selected to evaluate the performance of the five models, and the evaluation results are shown in Table 6.
Errors of prediction results for each model in experiment 2
Forecast results of the five models in experiment 2.
Relative errors in the prediction for each month in experiment 2.
Based on comparison experiment 2, the following conclusions can be drawn. According to Fig. 10, the predicted value curve of Model-I is closest to the actual value curve on the test set; according to Fig. 11, the relative error of Model-I is smaller for most months; according to Table 6, all three average errors of Model-I are smaller than those of the other models. The reasons are as follows. Model-II and Model-III use the RFR and SVR models respectively to predict the detrended component, with limited prediction accuracy, while the model in this paper fuses the RFR model and the SVR model and uses SVR to take into account the performance of each decision subtree in RFR, which further improves the performance of the prediction model. Model-IV does not screen the similar set of the month to be tested; instead, it takes all the data as the training set for the same improved RFR prediction model. Feeding in all the data without selection not only increases the training difficulty of the model, but also lowers the prediction accuracy. Model-V uses the same similarity threshold when screening the similar sets and does not consider the differences in the electricity data and in the amount of data between months, so its performance is inferior to that of Model-I. Thus, it is demonstrated that preprocessing the data using similar month screening and STL decomposition can avoid model failure when there is no obvious trend or periodic pattern in the electricity data, and that predicting the decomposed trend component and detrended component with the MLR model and the improved RFR model, respectively, can greatly improve the prediction accuracy.
Although the model proposed in this paper performs better in general, the prediction errors for months such as February 2021 are still large, probably due to sudden changes in the electricity data caused by socially significant events such as the epidemic and by changes in the composition of electricity generation and consumption equipment. It is difficult to achieve high accuracy when only meteorological and holiday factors are considered.
In addition, the experiments in this paper were run on a 12th Generation Intel(R) Core(TM) i7-12700F 2.10 GHz processor. The training time of the proposed model is about 383 ms, and a single prediction takes about 4.6 ms. The main parameters that need to be tuned include: n_estimators (the number of estimators in the RFR model), random_state (the random seed of the RFR model), C (the penalty coefficient of the SVR model) and gamma (the kernel coefficient of the SVR model when “rbf” is selected as the kernel function). Some models in the control groups have shorter training and prediction times because they involve fewer data processing methods, prediction models and parameters. For example, Model-V in experiment 1 has a training time of about 320 ms and a prediction time of about 1.5 ms, and Model-III in experiment 2 has a training time of about 335 ms and a prediction time of about 2.5 ms. Model-IV in experiment 2, which uses all historical data samples as the training set for each month instead of screening a similar set, is more expensive to train: training takes about 631 ms and a single prediction about 3.7 ms.
For monthly electricity forecasting, most existing models treat the data as a time series and analyze the trend of the historical data to predict future data. However, monthly electricity data are often little influenced by the trend of historical data, so the forecasting accuracy of such models is not high. Therefore, this paper proposes a monthly electricity generation forecasting model based on similar month screening and STL decomposition, and the following conclusions are obtained:
Meteorological factors and holiday factors are fully considered, and the MIC is used to screen the meteorological factors and remove those with low correlation, thus reducing the complexity of the model and improving the prediction accuracy.
The trend and detrended components of the electricity data are extracted by STL decomposition, and appropriate prediction models are then selected to predict each component; the effectiveness of the combined model is verified through case analysis.
The months similar to the month to be tested are screened and used as the training set to reduce the training difficulty of the prediction model, and the necessity of similar set screening is verified by comparison experiments.
The random forest regression algorithm is improved and used to predict the detrended component, and the effectiveness of the improved algorithm is verified by comparison experiments.
The influence of major social events, economic conditions and other factors on electricity needs to be analyzed further in the future to improve the accuracy of monthly electricity generation forecasting.
Acknowledgments
This work was fully supported by the scientific and technological project of State Grid Fujian Electric Power Co. Ltd. (B3130N22000Q).
