Abstract
Machine learning methods have been adopted in the literature as contenders to conventional methods for solving energy time series forecasting (TSF) problems. Recently, deep learning methods have emerged in the artificial intelligence field, attaining astonishing performance in a wide range of applications. Yet the evidence about their performance in solving energy TSF problems, in terms of accuracy and computational requirements, is scanty. Most review articles that handle the energy TSF problem are systematic reviews; a combined qualitative and quantitative study of the energy TSF problem is not yet available in the literature. The purpose of this paper is twofold. First, it provides a comprehensive analytical assessment of conventional, machine learning, and deep learning methods that can be used to solve various energy TSF problems. Second, it carries out an empirical assessment of several selected methods on three real-world datasets. These datasets relate to an electrical energy consumption problem, a natural gas problem, and the electric power consumption of an individual household. The first two problems are univariate TSF problems and the third is a multivariate TSF problem. Compared to both conventional and machine learning contenders, the deep learning methods attain a significant improvement in accuracy across the forecasting horizons examined. At the same time, their computational requirements are notably greater than those of the other contenders. Finally, the paper identifies a number of challenges, potential research directions, and recommendations that may serve as a basis for further research in the energy forecasting domain.
Abbreviations
Time Series Forecasting
Autoregression Integrated Moving Average
Autoregression Moving Average
Decline Curve Analysis
Vector Autoregressive
Bayesian VAR
Support Vector Machine
Support Vector Regression
Least Squares SVR
k-Nearest Neighbour
Artificial Neural Networks
Univariate Time Series
Multivariate Time Series
Exponential Smoothing
Simple Exponential Smoothing
Multivariate Exponential Smoothing
Multilayer Perceptron
Backpropagation
Recurrent Neural Networks
Deep RNN
Long Short-Term Memory
Deep LSTM
Echo State Network
Mean Absolute Error
Mean Square Error
Root MSE
Root Mean Square Percentage Error
Mean Absolute Percentage Error
Introduction
Problem overview
The advent of sensors and measurement technologies has resulted in an exponential growth of time series data recorded from multiple sources over time [87]. This kind of data is rich in dynamical information and is characterized by temporal dependencies and high dimensionality. These properties have attracted practitioners to use time series data for forecasting in a wide spectrum of applications [18]. Forecasting is in fact one of the oldest problems in human history: the ancient Egyptians established various mechanisms to measure the flow of the Nile river in order to predict floods [14]. The emphasis of this paper, however, is on forecasting using time series data, henceforth called time series forecasting (TSF) [69].
TSF is the rational prediction of future trends based on past and current observations. It rests on the idea that past observations contain intrinsic patterns that carry information about the future behavior of the problem at hand. This property enables time series data to capture the causalities of the underlying processes. Therefore, the TSF problem arises in a wide spectrum of applications, including energy.
Energy TSF is one of the most exciting and potentially ground-breaking research fields. In past decades, with the dominant usage of traditional energy sources, there was little need for forecasting techniques because energy demand could be readily matched with energy supply. In recent decades, however, worldwide energy markets have grown drastically, particularly with the increasing exploitation of renewable energy sources and with intraday real-time trading [49]. These challenges, and many more, require more accurate and fine-grained forecasting techniques that can predict energy demand days and months ahead [40].
Certainly, accurate forecasting of energy supply and demand is an essential requirement for adjusting energy production and consumption and thus for the stability of the energy markets. However, obtaining reasonably accurate predictions from energy time series data is quite difficult due to several inevitable limitations. One limitation is that time series data notoriously violate the independence and identical distribution (i.i.d.) property of spatial statistics [64]. Another is the high volatility and uncertainty of some energy applications, such as household energy load [32]. Other limitations include strong linear dependence between observations, lack of stationarity, and the curse of dimensionality [49].
Due to these inherent limitations, only a limited number of efficient forecasting models have been presented so far, despite sincere research efforts [8]. For several years, conventional forecasting methodologies have been widely used to solve the energy TSF problem [49]. The conventional methodologies are based on three stochastic parametric methods, namely autoregression (AR), moving average (MA), and the Arps equation. AR and MA were later combined to establish a new method called autoregression moving average (ARMA) [18]. ARMA and its variant, autoregression integrated moving average (ARIMA), remain widespread techniques in diverse forecasting applications. The Arps equation is the basis of a well-known forecasting method, decline curve analysis (DCA), which has been widely used in industrial domains [55]. The vector autoregressive (VAR) method is also widely used for forecasting with multiple time series [20].
In the last decade, with the outstanding progress in AI, research on energy forecasting based on machine learning has been booming. Several machine learning algorithms have drawn wide attention and have been presented as serious contenders to conventional methods in the forecasting community due to their favorable performance [75]. Examples of these approaches are the support vector machine (SVM) [57], the k-nearest neighbour (kNN) [54], and artificial neural networks (ANNs) [27], together with their variations. They are widely used in modeling energy TSF problems. The inherently nonlinear structure of traditional ANNs, with shallow architectures, is particularly versatile for capturing the complex underlying relationships in many real-world energy forecasting problems [67].
Recently, with the rapid spread of sensing technology and the resulting growth in the volume of energy time series datasets, the demand for suitable forecasting models has grown. An ANN with a shallow architecture might not be adequate to cope with such volume and complexity, particularly when modeling long-interval and nonlinear time series datasets [63]. Consequently, deep learning has emerged in the AI field, achieving impressive performance in a vast range of classification and regression applications [48, 90]. In energy research, there is remarkable interest among researchers in using deep learning approaches to solve various energy TSF problems [32, 80].
Related work
It is expected that the growing research interest in developing efficient energy management systems will continue in light of the global sustainability drive [11]. This makes real-time energy forecasting systems very important. In line with this direction, several comprehensive reviews addressing the problem of energy forecasting have been carried out in the last two decades [59]. For example, in 2001, Hippert et al. [31] presented a systematic review of short-term load forecasting using ANNs. Ten years later, Zhao et al. [35] presented another systematic review of statistical and AI techniques for the prediction of building energy consumption. In 2015, Raza et al. [67] presented a review of short-term load forecasting based on AI techniques. In the same year, Martinez-Alvarez et al. [22] presented a survey of data mining techniques for time series forecasting of electricity. Daut et al. [52] presented a review of building electrical energy consumption forecasting using conventional and AI methods.
However, a number of recent articles handling various problems in the energy forecasting domain are comparable to our paper. For example, in 2017, Deb et al. [13] presented a systematic review of nine popular machine learning techniques for forecasting univariate and multivariate time series energy consumption. No experiments were conducted in their paper; instead, they relied on the experiments described in other papers that used any of these nine techniques. In 2018, Chou et al. [44] presented another review of machine learning techniques using data collected from a smart grid installed in a building. They proposed a number of hybrid models built on platforms, e.g. MATLAB, which are not easy to use. In addition, they based their findings on one short-term dataset, which does not ensure generalization.
In 2019, Wei et al. [59] presented a similar review of conventional and AI-based methods for energy consumption across various forecasting horizons. Wei’s review showed that conventional methods outperform the AI-based methods. As in other reviews, no experiments were conducted in their paper; they reviewed the techniques described in other articles. In the same year, Divina et al. [23] presented a comparative study of different forecasting methods for the energy consumption of smart buildings. Divina’s study showed that machine learning methods are best suited to this task.
Certainly, the aforementioned reviews provide vital information about various energy forecasting strategies across different scales and horizons. Nevertheless, some of the earlier related works are based on static data, where a dependent variable is usually fitted to a set of independent variables [31, 35]. Most of the recently reported reviews in the energy forecasting domain are systematic reviews of a specific energy problem. Systematic research, sometimes called taxonomy research, refers to the process of systematically dividing the baseline problem into several categories [34]. The author of a systematic review, or taxonomy, paper usually counts the number of published articles in each category; for example, 40 papers are reviewed in [31], 116 papers in [59], and 157 papers in [13].
Yet the most common practice in these taxonomy articles is to compare the performance of existing methods on different datasets and under different experimental conditions; see, for example, [13, 67]. In such cases, the evaluation of existing methods lacks a unified basis of comparison, which reduces the benefit of such reviews. Very few review papers have conducted an empirical comparison alongside the analytical review, and those that did used only one dataset, such as [23, 44].
Objectives of the paper
To fill the aforementioned research gaps, the objective of this study is twofold.
Accordingly, the objectives (or contributions) of the paper can be summarized as follows: (1) provide a comprehensive review of energy TSF problems along with a review of the 15 most common techniques; (2) carry out a qualitative analysis of these techniques that identifies their advantages and disadvantages; (3) carry out a quantitative empirical comparison among these techniques in terms of accuracy and computational requirements; and (4) assess the performance of deep learning techniques in solving various energy TSF problems.
Finally, it is worth mentioning that this study focuses on the optimization of conventional, machine learning, and deep learning methods that are suited to the energy TSF problem, rather than on the optimization of energy consumption or demand itself. Consequently, the scope of this study can be easily extended to other TSF domains.
The remainder of the paper is organized as follows. Section 2 presents a number of basic concepts. Section 3 describes three categories of techniques used to solve the TSF problem, along with a brief overview of selected models from each category. Section 4 describes the evaluation metrics used to assess the forecasting techniques. The dataset configuration and the empirical results for each case study are provided in Section 5. Section 6 discusses and analyses the results obtained. A number of challenges and future research directions are given in Section 7. Section 8 presents the conclusions and recommendations of the paper.
Background
This section presents a few basic concepts that will help the reader follow the paper.
Statement of a TSF problem
A time series is a sequence of observations, or data points, of a variable of interest measured in chronological order. It can be either discrete or continuous. In a discrete time series, the observations are measured at discrete time intervals, whereas in a continuous time series, the observations are measured at every time instant. If a time series contains observations of one variable, it is called a univariate time series (UTS); otherwise it is a multivariate time series (MTS) [87].
A TSF problem is the task of predicting future values of time series data using either previous data of the same signal, i.e. UTS forecasting, or previous data of other correlated signals, i.e. MTS forecasting. Mathematically, the UTS problem can be formalized as a sequence of n real-valued numbers,
$$X = \{x_1, x_2, \ldots, x_n\}, \qquad x_t \in \mathbb{R}.$$
In UTS forecasting problems, the forecasting model usually predicts the variable of interest x using the past values that precede x. In this manner, the forecasting model perceives the structure, learns the trend of the underlying pattern, and extrapolates the process of interest into the future. The MTS forecasting problem, on the other hand, has two or more correlated variables. It can be defined as a finite sequence of several UTS problems, such that each UTS problem models a different pattern. The following formula,
$$X = \{x_1, x_2, \ldots, x_m\},$$
represents an MTS problem that includes m variables, where the component corresponding to the jth variable, $x_j$, is a UTS problem of length n and can be given as,
$$x_j = \{x_{j,1}, x_{j,2}, \ldots, x_{j,n}\}.$$
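The UTS and MTS formulations can be made concrete with a small sketch. Assuming NumPy and a hypothetical helper `make_windows` (neither comes from the paper), the code below turns a series into pairs of past values and the next value to predict, for one variable or for m correlated variables. The window length `lag` is an illustrative choice, not a value used in the experiments.

```python
import numpy as np

def make_windows(series, lag):
    """Build (inputs, targets) pairs from a time series.

    series: array of shape (n,) for a UTS or (n, m) for an MTS.
    lag: number of past observations used to predict the next one.
    """
    series = np.asarray(series, dtype=float)
    if series.ndim == 1:              # treat a UTS as an MTS with m = 1
        series = series[:, None]
    n = series.shape[0]
    if lag < 1 or lag >= n:
        raise ValueError("lag must satisfy 1 <= lag < n")
    # Each input holds `lag` consecutive observations of all m variables.
    inputs = np.stack([series[t - lag:t] for t in range(lag, n)])
    # Each target is the observation that immediately follows its window.
    targets = series[lag:]
    return inputs, targets

uts = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X_uts, y_uts = make_windows(uts, lag=2)       # shapes (3, 2, 1) and (3, 1)

mts = np.column_stack([uts, 10.0 * uts])      # m = 2 correlated variables
X_mts, y_mts = make_windows(mts, lag=2)       # shapes (3, 2, 2) and (3, 2)
```

Any of the forecasting models discussed later can, in principle, be trained on such input and target pairs.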
Regardless of whether the problem is univariate or multivariate, common TSF problems are divided into three categories based on the time horizon [18]: short-term forecasting, which extends from one hour to one day, weeks, or months ahead; medium-term forecasting, which extends from one month to one or two years ahead; and long-term forecasting, which extends from one year up to ten years or more ahead.
It has been shown that four factors affect the observations of a time series, of which the first two are the most challenging for a forecasting model [18]. Seasonality: a pattern of short-term change that repeats at the same time in every period. For example, power consumption on summer days is generally high compared to winter days throughout a year. Trend: the smooth long-term direction of a time series, with increasing or decreasing patterns over a very long period, as with electricity prices. Cyclic: a pattern of rise and fall of a time series over periods longer than one year, which depends on the type of problem. Irregularity: the residual of the time series after removing the seasonal, trend, and cyclic components.
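As a rough illustration of these components, the sketch below splits a synthetic series into trend, seasonal, and irregular parts using a classical additive decomposition. NumPy, the function name `additive_decompose`, the period of 12, and the synthetic data are all assumptions made for the example. The cyclic component is not separated here and stays with the trend estimate.

```python
import numpy as np

def additive_decompose(y, period):
    """Split y into trend + seasonal + residual (additive model)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # A centered moving average over one full period estimates the trend.
    if period % 2 == 0:
        weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
    else:
        weights = np.ones(period) / period
    half = len(weights) // 2
    trend = np.full(n, np.nan)                 # undefined near both ends
    trend[half:n - half] = np.convolve(y, weights, mode="valid")
    # Average the detrended values at each position within the period.
    detrended = y - trend
    seasonal_means = np.array(
        [np.nanmean(detrended[i::period]) for i in range(period)]
    )
    seasonal_means -= seasonal_means.mean()    # seasonal effects sum to zero
    seasonal = np.tile(seasonal_means, n // period + 1)[:n]
    residual = y - trend - seasonal            # the irregular component
    return trend, seasonal, residual

t = np.arange(48)
y = 0.5 * t + 3.0 * np.sin(2 * np.pi * t / 12)   # linear trend + seasonal swing
trend, seasonal, residual = additive_decompose(y, period=12)
```

On this noise-free series the residual is close to zero wherever the trend is defined; on real energy data it would carry the irregular fluctuations.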
The energy TSF problem is challenging due to the significant volatility and uncertainty of the natural factors involved. It is widely demonstrated that energy time series datasets are complex and have an abnormal distribution [49]. In addition, they are highly nonlinear and nonstationary. These limitations make energy time series data difficult to analyse using either statistical or conventional computing methods. Furthermore, due to economic and population growth, the consumption of energy resources has increased dramatically in recent years.
Accordingly, developments in energy TSF have significant research value and should attract additional research effort in order to mitigate any energy crisis expected in the future [40, 49]. In this paper, three different energy TSF problems are examined. A brief description of each problem is given below.
Electrical energy consumption problem
Electrical energy consumption and its forecasting techniques are crucial to stakeholders for estimating electric energy usage and for making the right decisions about the future expansion of power systems. Energy stakeholders use energy consumption forecasting models to monitor changes in energy consumption behavior and to compare the outcomes with the predicted values over a certain period. For this reason, many forecasting algorithms focusing on electrical energy consumption have been presented with the aim of improving electrical energy efficiency, as indicated in [2, 94]. Several factors make electrical energy consumption forecasting a challenging problem; they can be summarized as follows. Big data challenge: the volume and dimensionality of the time series data grow and, consequently, time series analysis based on learning-based techniques is required. Change in consumers' consumption behavior: energy consumption is dynamic and may change every minute based on changes in consumers' activities and events. Weather conditions: energy consumption can vary depending on the state of the weather. Demography: the rate of population growth is not fixed every year and follows an increasing pattern, and so does energy consumption.
Natural gas consumption problem
In contrast to the electrical energy field, prediction activities in the natural gas field are still immature [45], despite the significant importance of natural gas for people and industry. As confirmed by the International Energy Outlook [38], natural gas will remain an essential resource until at least 2040. The production and delivery of natural gas pass through three main phases, namely production, transmission and storage, and distribution [56]. Consumption prediction is also an important process in this industry. As with electrical energy consumption, the variability of natural gas consumption over time depends on similar external factors [45].
Household load forecasting problem
Individual household load forecasting is another challenging problem in the energy domain in terms of system volatility, owing to dynamic behavior composed of many individual components [35]. This problem is influenced by a number of external factors, including customer behavior, device characteristics, time and day of the week, holidays, geographic patterns, weather conditions, and other economic factors. In recent years, the problem of household load forecasting has received much attention following the emergence of smart grids and the advent of advanced metering infrastructure, energy storage systems, and home area networks.
Parametric vs. nonparametric techniques
In sensor-based applications, one of the key requirements is to maintain the integrity of the underlying sensory data so that it can be monitored, analyzed, and managed in a trusted manner. Sensory data integrity has been analyzed by either parametric or nonparametric techniques. A parametric method is a learning model that represents the data with a fixed-size set of parameters regardless of the number of training samples [77]. Though these techniques are simple and fast at learning the features in the data using a functional form, they are highly constrained to this specified form. In addition, regression models based on parametric techniques are highly sensitive to the models’ parameters, and they are unlikely to match the underlying mapping function exactly. Examples of these techniques are Bayesian methods, Naive Bayes, the perceptron, linear discriminant analysis, and logistic regression.
Unlike parametric techniques, nonparametric techniques do not assume a specific form for the learning function. In other words, there are no restrictions on the function that can be learned from the training data. Nonparametric techniques are very suitable when there is no prior knowledge about the learning function and there are enough training data to construct the mapping function [77]. Accordingly, these techniques retain some ability to generalize to test data and have great flexibility to fit a large number of functional forms. Despite these advantages, they require plenty of training data to estimate the functional form, which can make the model slow and prone to overfitting. Examples of these techniques are kNN, ANNs, decision trees, and SVM. The following section presents different models, each of which falls into either the parametric or the nonparametric category.
Methodologies
This section gives a brief description of a number of conventional, machine learning, and deep learning techniques used in our empirical assessment. Figure 1 shows the name and category of the selected techniques covered in this paper. These techniques have been selected specifically for their astonishing performance in solving TSF problems. For each of these techniques, myriad variations have been developed in the literature. Besides the description given in this section, Table 11 at the end of this paper summarizes the advantages and disadvantages of each technique.

Figure 1: Stream of selected techniques.
Conventional methods use a variety of mathematical statistics, probability theory, and stochastic processes to establish a mapping between historical time series data and the generated outputs. Generally speaking, conventional methods have a simple modeling process compared with other methods. In addition, they require fewer input data than learning-based methods and show good performance, particularly in yearly energy consumption forecasting [59]. Though several conventional methods exist, ARIMA, VAR, and exponential smoothing (ES) are the conventional methods most commonly used for energy forecasting.
Autoregressive integrated moving average
Previously, most conventional methods were based on the autoregression (AR) and moving average (MA) methods, which were developed through the efforts of Yule, Slutsky, Walker, and Yaglom during the 1920s [66]. Shortly after, Wold combined the AR and MA methods into one method known as ARMA, which can be used to model all stationary time series data [30]. More precisely, ARMA is a parametric model for series in which both the mean and the covariance among observations do not change with time [79].
Next, Box and Jenkins popularized the use of the ARMA method and proposed an integrated model known as ARIMA, which stands for autoregressive integrated moving average [24]. Since then, the ARIMA model has been widely employed in forecasting, as it can be used to solve nonstationary time series problems. In addition, the ARIMA model is highly efficient and widely used in short-term forecasting problems, as short-term factors are expected to change slowly [28]. The ARIMA modeling for a time series can be represented as a linear relation as follows [13],
However, several variants of ARIMA have been discussed as special cases of the original model, such as white noise, autoregression, and the random walk without drift [66]. It is therefore much easier to work with the back-shift notation (B) to describe the process of differencing and of combining the components (i.e. AR, I, and MA) to form the more complicated and special cases of the ARIMA model. Thus, Eq. 7 can be rewritten in the back-shift notation as [66],
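For reference, a common textbook form of the ARIMA(p, d, q) model and of its back-shift version is sketched below. The symbols are assumptions, since they are not defined in the surrounding text: $y'_t$ is the series differenced d times, c is a constant, $\phi_i$ and $\theta_j$ are the AR and MA coefficients, $\varepsilon_t$ is white noise, and B is the back-shift operator with $B y_t = y_{t-1}$.

```latex
\begin{align*}
y'_t &= c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p}
        + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}
        + \varepsilon_t, \\
(1 - \phi_1 B - \cdots - \phi_p B^p)\,(1 - B)^d\, y_t
     &= c + (1 + \theta_1 B + \cdots + \theta_q B^q)\,\varepsilon_t .
\end{align*}
```

Under this notation, p = d = q = 0 with c = 0 gives white noise, d = q = 0 gives a pure autoregression, and p = q = 0 with d = 1 and c = 0 gives the random walk without drift.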
However, ARIMA has a number of limitations, such as its inability to represent the dynamic behavior and the nonlinearity of energy time series data [13, 75]. In addition, ARIMA does not support seasonality in energy time series data. To solve the seasonality problem, an extension of ARIMA called seasonal ARIMA (SARIMA) was developed [88]. This extension adds three new hyper-parameters to ARIMA that specify the seasonal component of the series, as well as an additional parameter that represents the number of periods in each season. It can be written as ARIMA (p, d, q) (P, D, Q) m, where (p, d, q) is the non-seasonal part and (P, D, Q) is the seasonal part with seasonal period m [66, 88]. Simply put, the additional seasonal terms are multiplied by the non-seasonal terms. Although it predicts efficiently from univariate time series data, ARIMA is not suitable for multivariate time series (MTS) data. This is due to its inability to represent the dynamic behavior of a multi-variable dataset.
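To make the roles of the differencing orders and the seasonal period m more tangible, the sketch below applies seasonal and ordinary differencing to a synthetic series, fits a single AR(1) coefficient to the differenced series by least squares, and then undoes the differencing to obtain a one-step forecast. It is a minimal illustration with NumPy and hypothetical helper names, not a full ARIMA or SARIMA estimator: MA terms and seasonal AR and MA terms are omitted, and the data are made up.

```python
import numpy as np

def difference(y, lag=1):
    """Return y[t] - y[t - lag]; lag=1 is ordinary, lag=m is seasonal."""
    y = np.asarray(y, dtype=float)
    return y[lag:] - y[:-lag]

def fit_ar1(z):
    """Least-squares estimate of phi in z[t] = phi * z[t-1] + noise."""
    prev, curr = z[:-1], z[1:]
    return float(prev @ curr / (prev @ prev))

rng = np.random.default_rng(0)
m = 12                                          # assumed seasonal period
t = np.arange(120)
season = 10.0 * np.sin(2 * np.pi * t / m)
# A random walk (calls for d = 1) plus a fixed seasonal pattern (D = 1).
y = np.cumsum(rng.normal(size=t.size)) + season

w = difference(y, lag=m)                        # seasonal differencing, D = 1
z = difference(w, lag=1)                        # ordinary differencing, d = 1
phi = fit_ar1(z)

# One-step forecast: predict the next differenced value, then integrate back.
z_next = phi * z[-1]
w_next = w[-1] + z_next                         # undo d = 1
y_next = y[-m] + w_next                         # undo D = 1
```

In practice a dedicated library would estimate all (p, d, q)(P, D, Q) m terms jointly; the point here is only how differencing and integration bracket the stationary model.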
One of the commonly used generalizations of ARIMA is the VAR method, which provides a flexible means to model and forecast MTS problems [29]. VAR is a parametric model in which each variable is a linear function of its own past values and the past values of all the other variables [53]. In addition, VAR enables the user to measure and visualise the estimated effect of a change in an explanatory variable on the dependent variable over time [29].
The VAR model containing n time series variables and a lag length of k is written as,
$$y_t = \alpha_0 + \sum_{i=1}^{k} \alpha_i \, y_{t-i} + u_t,$$
where $\alpha_0$ is an (n x 1) vector of intercepts, $\alpha_i$ (i = 1, ..., k) are (n x n) coefficient matrices, $y_t$ is an (n x 1) vector of endogenous variables, and $u_t$ is an (n x 1) vector of white noise residuals. The vector $\alpha_0$ contains n intercept terms and each matrix $\alpha_i$ contains $n^2$ coefficients. Therefore, the overall number of coefficients that must be estimated is $(n + kn^2)$, which grows quadratically with the number of variables in the system and linearly with the lag length. Accordingly, a major problem in the computation of the VAR model occurs when k is large, which yields over-parameterization: too many coefficients must be estimated relative to the sample size [29].
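The coefficient count and the structure of the model can be checked with a short sketch. Assuming NumPy and ordinary least squares as the estimator (the paper does not state how its VAR models are fitted), the hypothetical function `fit_var` below estimates the intercept vector and the k coefficient matrices from a (T x n) data matrix; the simulated data and the matrix `A_true` are made up for the example.

```python
import numpy as np

def var_coefficient_count(n, k):
    """Number of coefficients in a VAR with n variables and lag length k."""
    return n + k * n * n

def fit_var(Y, k):
    """Estimate alpha_0 and alpha_1..alpha_k by ordinary least squares.

    Y: array of shape (T, n), one row per time step.
    Returns (alpha0 of shape (n,), alphas of shape (k, n, n)).
    """
    Y = np.asarray(Y, dtype=float)
    T, n = Y.shape
    # Regressor row for time t: [1, y_{t-1}, ..., y_{t-k}], length 1 + k*n.
    rows = [np.concatenate([[1.0]] + [Y[t - i] for i in range(1, k + 1)])
            for t in range(k, T)]
    X = np.array(rows)
    B, *_ = np.linalg.lstsq(X, Y[k:], rcond=None)    # shape (1 + k*n, n)
    alpha0 = B[0]
    # Rows 1 + i*n .. 1 + (i+1)*n of B map y_{t-(i+1)} to y_t; transpose them.
    alphas = np.stack([B[1 + i * n: 1 + (i + 1) * n].T for i in range(k)])
    return alpha0, alphas

# Simulate a stable VAR(1) with n = 2 and check that the fit recovers it.
rng = np.random.default_rng(1)
A_true = np.array([[0.5, 0.1],
                   [0.0, 0.3]])
Y = np.zeros((2000, 2))
for t in range(1, len(Y)):
    Y[t] = A_true @ Y[t - 1] + rng.normal(scale=0.1, size=2)

alpha0, alphas = fit_var(Y, k=1)
```

For example, `var_coefficient_count(3, 2)` gives 21, while ten variables with the same lag length already require 210 coefficients.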
A variant of the VAR model, called Bayesian VAR (BVAR), can avoid the expensive computation of the original VAR by allowing the model to include many coefficients while simultaneously controlling how strongly they are influenced by the data. This improves forecast performance by reducing the number of spurious correlations captured by the original VAR model [53]. Only very recently has VAR been used for energy forecasting with univariate and multivariate datasets, where it outperformed other autoregressive methods [39]. For example, Liu et al. [91] built a VAR forecast model based on three weather variables, for 61 cities in the US, to model electricity supply and demand.
ES is a traditional statistical parametric method, introduced in the 1950s, that is commonly used to solve TSF problems. It is regarded as a collection of ad hoc techniques for extrapolating various types of UTS data. The main concept of ES is the assumption that the weights of past data decay exponentially over time, whereas ARIMA works by converting the time series into a stationary series [20].
Though ES has a solid theoretical basis and works well even with little input data, it struggles when the data has a trend or seasonality, particularly with long-term data. In addition, like ARIMA, it is not suitable for modeling the nonlinearity of MTS data. There is a simple version of the ES method called simple exponential smoothing (SES). The SES method is used only when there is no clear trend or seasonal pattern in the data. Eq. (10) shows how the weights decay exponentially, where α (0 ≤ α ≤ 1) determines the rate at which they decrease.
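The exponential decay of the weights can be illustrated with a few lines of code. In the usual textbook form of SES (assumed here), the one-step forecast is a weighted average of past observations in which the observation j steps back receives weight α(1 − α)^j, which is equivalent to updating a running level with each new observation. The function names, the initialization of the level with the first observation, and the data values below are illustrative assumptions.

```python
def ses_forecast(y, alpha):
    """One-step-ahead SES forecast; the level is initialized to y[0]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    level = y[0]
    for value in y[1:]:
        # New level = weighted mix of the latest observation and the old level.
        level = alpha * value + (1.0 - alpha) * level
    return level

def ses_weights(alpha, count):
    """Weights given to the most recent `count` observations, newest first."""
    return [alpha * (1.0 - alpha) ** j for j in range(count)]

load = [20.0, 22.0, 21.0, 23.0]           # made-up consumption values
forecast = ses_forecast(load, alpha=0.5)  # 22.0
weights = ses_weights(0.5, 3)             # [0.5, 0.25, 0.125]
```

A value of α close to 1 makes the forecast follow the latest observation, while a value close to 0 keeps it near the initial level.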
Holt extended the SES method to allow forecasting of data that include a trend [12]. Holt and Winters then extended Holt’s method to capture seasonality by adding a seasonality smoothing parameter γ, so that three components are smoothed: a level value, a trend over time, and a seasonal pattern [66]. Independently of the univariate ES, a multivariate form of this method has been developed in the form of a state space model. One of the early contributions that outlined a multivariate ES (MES) seasonal specification was presented by Pfeffermann et al. using two univariate tourist data sets [19]. In their paper, the authors compared the MES method with the VAR method and the original ES method. The experimental results showed that MES produced more accurate results than the other contenders.
Overall, conventional forecasting methodologies have drawn much attention due to their relative simplicity of representation and their reliance on solid theoretical and mathematical bases. However, most conventional methods are linear and parametric and therefore show poor forecasting performance, particularly when treating long-term data and MTS problems [27]. In addition, the variables in most energy applications exhibit highly nonlinear behavior [28]. Therefore, conventional methodologies are not appropriate for modeling these nonlinear variables, because they poorly represent the complex relations and interactions among energy variables [14].
Machine learning methods
Machine learning methods have flexible structures and nonparametric procedures that are sufficient for capturing and identifying complex interactions and nonlinear relationships among the variables of the TSF problem [8]. Simply put, machine learning can be defined as an automated form of human learning. Human beings learn through experience, using trial and error to discover which actions should be triggered in given circumstances. This enables humans to make abstractions and build knowledge. Machine learning is similar: it can be regarded as a set of algorithms whose objective is to improve a performance measure by automatically extracting their own rules and creating their own models based on the given information [75].
During the last two decades, there has been growing interest in employing various machine learning models in the forecasting domain, particularly for the energy TSF problem [15]. The most important advantage of machine learning methodologies over conventional methodologies is their capability to learn and thereby improve model performance over time in a trial-and-error fashion [75]. Artificial neural networks (ANNs) are the machine learning algorithms most widely used in energy forecasting applications.
Artificial neural network (ANN)
An ANN is a modeling technique that mimics the functions of the biological neural network in the human brain. As shown in Figure 2, an ANN consists of many computing units linked by directed connections and is capable of performing many complex computations. The network outputs are given through the outgoing connections, where $w_i$ represents the strength of the connection with the input $x_i$. ANN algorithms are well suited to dealing with the intrinsic properties that commonly exist in energy activities [49]. For more details about ANNs and forecasting, see [6, 27].

Figure 2: The standard artificial neural network architecture.
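The weighted-connection idea can be sketched in code. The example below assumes NumPy, a tanh activation, one hidden layer with four units, and a linear output; none of these choices is taken from the paper or from Figure 2, and the weights are random rather than trained. It only shows how each unit combines its inputs x_i through connection strengths w_i.

```python
import numpy as np

def neuron(x, w, b=0.0):
    """Output of one unit: a nonlinearity applied to the weighted input sum."""
    return np.tanh(np.dot(w, x) + b)

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a network with one hidden layer and a linear output."""
    hidden = np.tanh(W1 @ x + b1)     # each row of W1 holds one unit's weights
    return W2 @ hidden + b2

rng = np.random.default_rng(0)
x = np.array([0.2, 0.4, 0.6])         # e.g. three lagged load values (made up)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
y_hat = mlp_forward(x, W1, b1, W2, b2)    # shape (1,); weights are untrained
```

Training would adjust `W1`, `b1`, `W2`, and `b2` to reduce a forecasting error; that step is left out of this sketch.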
Many advantages are reported for ANNs, such as their ability to deal efficiently with extremely noisy data and error-prone time series. In addition, ANNs can handle the complexity and nonlinearity of variables and processes in the energy domain, a problem that is alleviated by the mathematically amenable structure of ANNs [14]. Moreover, ANNs are considered parallel distributed processors that are capable of storing information for further use.
Nevertheless, the most important aspect of treating TSF problems with ANNs is that the statistical distribution of the raw time series data need not be known in advance. This is because the nonstationary aspects of time series data, such as trends and seasonality, are implicitly estimated by the internal structure of the ANN [14]. Hence, machine learning and ANN algorithms are deemed powerful alternatives to conventional methods. In the following subsections, we briefly describe the best-known machine learning and ANN algorithms that can be employed to solve various energy TSF problems.
The kNN algorithm is a nonparametric method used for both classification and regression problems [21]. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether kNN is used for classification or regression. In kNN classification, the output is a class membership: an object is classified by a majority vote of its neighbours and is assigned to the most common class among its k nearest neighbours (k is a small positive integer). If k = 1, the object is simply assigned to the class of its single nearest neighbour. In kNN regression, the output is the property value for the object, computed as the average of the values of its k nearest neighbours.
In the case of regression, given a data point, we compute the Euclidean distance between this point and all points in the training set. We then pick the k closest training points and set the prediction to the average of their target output values. If J(x) is the set of the k nearest neighbours of point x, then the prediction y is given by y = (1/k) Σ_{i ∈ J(x)} y_i, where y_i is the target value of the i-th neighbour.
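The averaging rule described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments; the function name `knn_forecast` and the toy data are hypothetical.

```python
import numpy as np

def knn_forecast(X_train, y_train, x_query, k=3):
    """Predict the target of x_query as the mean target of its k nearest neighbours."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from the query point to every training point
    dists = np.linalg.norm(X_train - np.asarray(x_query, dtype=float), axis=1)
    # Indices of the k closest training points, i.e. the set J(x)
    nearest = np.argsort(dists)[:k]
    # Prediction = average of the neighbours' target values
    return y_train[nearest].mean()

# Toy data (hypothetical): each row holds two lagged load values,
# and the target is the value that followed them.
X_train = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [10.0, 11.0]]
y_train = [3.0, 4.0, 5.0, 12.0]
prediction = knn_forecast(X_train, y_train, [2.0, 3.0], k=3)
```

Every query computes a distance to all training points, which is the run-time cost that grows with the size of the training set.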
In the field of energy, Lora et al. [7] proposed a kNN-based method to solve the TSF problem of short-term electric load forecasting. Their empirical results indicate that the kNN method is more accurate than dynamic regression models on the selected problem. Sun et al. [54] suggested a kNN-based technique for decreasing the cost of the energy used by a home energy management system. kNN helped the authors to analyze their classification and regression datasets and to simulate the energy needed by each appliance. Johannesen et al. [58] developed a kNN regression approach for regional electric load forecasting that correlates lower-level categorical variables (season, day of the week) with weather parameters.
Nevertheless, the use of kNN in forecasting applications is restricted by several limitations [72]. Specifically, kNN may show poor run-time performance when the training dataset is large, because the distance from each query instance to all training samples must be computed. In addition, it is very sensitive to irrelevant or redundant features, because all features contribute to the similarity and thus to the classification or regression result; careful feature selection (or weighting) can mitigate this limitation. Moreover, it is not clear which distance metric and which attributes yield the best results [62].
The support vector machine (SVM) is a machine learning algorithm widely used in classification and recognition applications [83]. When SVM is applied to regression analysis of TSF problems, it is called support vector regression (SVR) [57]. SVM is based on two concepts of statistical learning theory, namely the decision plane and the decision boundary. The decision plane is a plane that separates sets of objects belonging to different classes. Basically, the SVM uses a linear function to implement nonlinear class boundaries through a nonlinear mapping of the input vectors x into a high-dimensional feature space [83].
As in the SVM approach, the motivation in SVR is to optimize the generalization bounds given for regression. SVR relies on a loss function that ignores errors lying within a certain distance of the actual values. This loss function is called the epsilon-insensitive loss function; Figure 3 shows a one-dimensional linear regression function with its epsilon-insensitive band. The goal of SVR is to find a function f(x) = w^T x + b that deviates by no more than ε from the targets y_i for all training data.
For linearly separable data, the corresponding quadratic optimization problem is given as,
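A common textbook statement of this problem is given below as a sketch. It is not necessarily the exact notation of the original equation: ε denotes the half-width of the insensitive band (the maximum tolerated deviation from the targets), and n, the number of training samples, is a symbol assumed here.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
\begin{cases}
y_i - w^{T}x_i - b \le \varepsilon,\\[2pt]
w^{T}x_i + b - y_i \le \varepsilon,
\end{cases}
\qquad i = 1,\dots,n
```

Minimizing the norm of w keeps the regression function as flat as possible, while the two constraints keep every training target inside the band.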

The epsilon-insensitive band via one-dimensional linear regression function.

The epsilon-insensitive band via one-dimensional nonlinear regression function.
Suykens et al. developed an extension of SVR called least squares SVR (LSSVR). It is a more stable alternative to SVR and is also faster to train [43]. The LSSVR algorithm transforms the quadratic programming problem of SVM into a system of linear equations by using a quadratic loss function instead of the epsilon-insensitive loss function. Accordingly, it improves the accuracy and reduces the computational burden of SVR. The LSSVR regression model is obtained by solving the following optimization problem,
where γ is a constant similar to C in the standard SVR, φ(x_i) is the mapping to the high-dimensional feature space as in SVR, and e_i ∈ R are the error variables. However, the performance of LSSVR degrades if the time series data have chaotic characteristics [17].
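For reference, the LSSVR problem is commonly written in the following form, using the regularization constant γ, the feature mapping, and the error variables e_i defined above. The bias b and the sample count n are symbols assumed here, and the sketch follows the standard least-squares formulation rather than a confirmed copy of the original equation.

```latex
\min_{w,\,b,\,e}\ \frac{1}{2}\, w^{T} w + \frac{\gamma}{2} \sum_{i=1}^{n} e_i^{2}
\quad \text{subject to} \quad
y_i = w^{T}\phi(x_i) + b + e_i, \qquad i = 1,\dots,n
```

Because the constraints are equalities and the loss is quadratic, the optimality conditions reduce to a linear system, which is why training is faster than solving the quadratic program of standard SVR.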
The MLP is a conventional neural network model [1] consisting of one input layer, one or more hidden layers, and one output layer, as shown in Figure 5. Each hidden layer includes a number of units, called neurons, each of which can be considered a single-output perceptron network. The output unit is likewise equivalent to a single-output perceptron, and it can be regarded as a soft-thresholded linear combination of the units of the preceding hidden layer. Each hidden and output unit calculates a linear combination of its inputs x and then applies the sigmoid function σ(z) = 1 / (1 + e^(-z)) to the net result z.
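A single sigmoid unit of the kind described above can be sketched as follows, assuming the standard logistic sigmoid. The weights, bias, and inputs are hypothetical values chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real value into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def unit_output(x, w, b):
    """Output of one unit: sigmoid of the weighted sum of its inputs plus a bias."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs to the unit
w = np.array([0.2, 0.4, 0.1])    # hypothetical connection weights
b = 0.1                          # hypothetical bias
out = unit_output(x, w, b)       # the weighted sum is 0 here, so out is about 0.5
```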

The multilayer perceptron architecture.
MLP learning employs the backpropagation (BP) algorithm to minimize the sum of squared errors by changing the connection weights among neurons. Ideally, the MLP constructs a model that maps the inputs to the known outputs using historical data. The resulting model can then be used to predict unknown outputs.
In TSF terminology, the number of neurons in the input layer of the MLP network equals the number of independent variables, such as the number of days, the day of the month, or the number of hours, while the number of neurons in the output layer equals the number of dependent variables, such as the amount of gas or power consumed. The MLP learns the forecasting function from lag observations of the time series data [42]. Due to this simple representation, the MLP is commonly used in power, load, and gas forecasting problems [16, 85]. For energy time series, the MLP can take multiple observations from prior time steps (lag observations) as input features and use them to produce multiple-step-ahead forecasts [42].
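The use of lag observations as input features can be illustrated by converting a series into input/output pairs with a sliding window. This is a minimal sketch; the helper name `make_lagged_dataset` and its parameters `n_lags` and `horizon` are assumptions made for the example, not part of the original study.

```python
import numpy as np

def make_lagged_dataset(series, n_lags, horizon=1):
    """Turn a 1-D series into (X, Y) pairs for supervised learning.

    Each row of X holds n_lags consecutive observations; the matching
    row of Y holds the next `horizon` observations to be forecast.
    """
    series = np.asarray(series, dtype=float)
    X, Y = [], []
    last_start = len(series) - n_lags - horizon + 1
    for start in range(last_start):
        X.append(series[start:start + n_lags])
        Y.append(series[start + n_lags:start + n_lags + horizon])
    return np.array(X), np.array(Y)

# Three lagged inputs, two-step-ahead targets
series = [10, 11, 12, 13, 14, 15]
X, Y = make_lagged_dataset(series, n_lags=3, horizon=2)
```

With this layout, `n_lags` fixes the number of input neurons and `horizon` the number of output neurons for a multiple-step-ahead forecast.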
However, the MLP has some drawbacks, such as slow convergence, linearity, a tendency to get trapped in local minima, possible oscillations during the search, and sensitivity to the learning rate [5]. In addition, determining the best MLP architecture requires a large number of experiments, which becomes impractical for voluminous data. Moreover, increasing the number of layers and neurons leads to overfitting and training difficulties [81].
The RNN is a widespread neural network approach adopted to solve energy TSF problems. Compared to traditional neural networks, the RNN has a looping mechanism that allows information to flow from one step to the next, as shown in Figure 6. This information is the hidden state, which is a representation of the previous inputs [92]. The hidden state h_t is a nonlinear mapping f of the current input x_t and the previous hidden state h_{t-1}, written as h_t = f(x_t, h_{t-1}).
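The recurrence can be sketched with NumPy under one common choice of the nonlinear mapping, h_t = tanh(W_x x_t + W_h h_{t-1} + b). The tanh activation, the weight names `W_x`, `W_h`, `b`, and the random initial values are assumptions made for illustration only.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrence step: the new hidden state mixes the current input and the previous state."""
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

def rnn_forward(xs, W_x, W_h, b):
    """Run the recurrence over a whole sequence, starting from a zero hidden state."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W_x, W_h, b)   # the same weights are shared across all time steps
        states.append(h)
    return np.array(states)

rng = np.random.default_rng(0)
n_in, n_hidden = 1, 4
W_x = rng.normal(scale=0.5, size=(n_hidden, n_in))      # input-to-hidden weights
W_h = rng.normal(scale=0.5, size=(n_hidden, n_hidden))  # hidden-to-hidden (loop) weights
b = np.zeros(n_hidden)

xs = np.array([[0.1], [0.2], [0.3]])   # a short univariate sequence
states = rnn_forward(xs, W_x, W_h, b)  # one hidden state per time step
```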
As depicted in Figure 6, the structure of the standard RNN contains an internal memory cell. It recursively computes a new output by applying an activation function to the previous state and the new inputs. This allows RNNs to process information sequentially and to exhibit temporal behaviour over a time sequence while retaining information from the past [92]. As a result of this distinct structure, many attempts have been made to use RNNs in the power, electric load, and natural gas forecasting domains [10, 71].

RNN Architecture, h refers to the hidden state, X refers to the observation, W refers to the cell’s weight, and Z refers to the output.
Nevertheless, in the context of TSF, the main drawback of RNNs is that they require intensive connections among cells as well as more memory in simulation than other BP-based neural networks. In addition, RNNs are not able to keep track of long-term dependencies because of the vanishing and exploding gradient problems, which prevent information from propagating to the early layers of the architecture [47]. Several remedies have been developed to address these drawbacks, notably the long short-term memory (LSTM) and the echo state network (ESN).
When learning from sequential data, the standard RNN aims to learn representations of patterns that occur repeatedly in past observations by sharing parameters across all time steps. As time goes on, however, the memory of previously learned patterns fades. As a special type of RNN with a different structure, the LSTM allows the memory cell to retain the data sequence for a longer period by establishing propagation paths that preserve the flow of gradients to earlier states [86].
The standard LSTM model [73] is composed of one hidden LSTM layer followed by a feed-forward output layer. LSTM differs from traditional ANN models in that it contains memory blocks that replace the summation unit in each cell. The internal structure of an LSTM cell is shown in Figure 7.

The internal structure of an LSTM cell.
Each LSTM layer is a set of cells in which the information is stored. Each cell contains gates built upon a sigmoidal neural network layer, which allow the cell to selectively let data pass through. The output of each sigmoidal layer lies in the range between zero and one and represents the amount of data allowed through: a value of zero means that no information passes, whereas a value of one means that all information passes. Each cell has three types of gates:
Forget gate: determines whether the data is removed or retained.
Memory gate: determines which data needs to be stored in the cell.
Output gate: determines which data is useful and can be used for the current forecast.
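The interplay of the three gates can be sketched as one step of a standard LSTM cell. This follows the widely used textbook formulation, not necessarily the exact variant of Figure 7: the memory gate is taken to be what is usually called the input gate, and the stacked weight matrix `W`, the bias `b`, and the toy inputs are assumptions of the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.

    W maps the concatenated vector [h_prev, x_t] to four stacked
    pre-activations: forget gate, memory (input) gate, candidate
    values, and output gate.
    """
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[0:n])            # forget gate: 0 = discard, 1 = keep the old cell state
    i = sigmoid(z[n:2 * n])        # memory gate: how much new data to store
    g = np.tanh(z[2 * n:3 * n])    # candidate values to be stored
    o = sigmoid(z[3 * n:4 * n])    # output gate: how much of the cell to expose
    c_t = f * c_prev + i * g       # updated cell state
    h_t = o * np.tanh(c_t)         # new hidden state used for the forecast
    return h_t, c_t

rng = np.random.default_rng(1)
n_in, n_hidden = 2, 3
W = rng.normal(scale=0.3, size=(4 * n_hidden, n_hidden + n_in))
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x_t in np.array([[0.5, 1.0], [0.4, 0.9], [0.3, 0.8]]):
    h, c = lstm_step(x_t, h, c, W, b)
```

When the forget gate saturates near one and the memory gate near zero, the cell state is carried forward unchanged, which is the mechanism that lets the LSTM retain information over long sequences.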
Recently, the LSTM has been used for regression in different kinds of energy TSF problems [76, 94]. However, using LSTM has a few limitations. The first limitation is that the number of memory cells is linked to the size of the recurrent weight matrices. More precisely, an LSTM with N_h memory cells requires a recurrent weight matrix with
The ESN is a simplified type of RNN that uses a reservoir as the medium for information processing in order to avoid the limitations of the RNN, such as expensive and slow computation. The ESN is composed of an input layer, a middle (reservoir) layer, and an output layer. It overcomes the RNN limitations by simplifying the learning procedure: the connections of the reservoir neurons are generated once and their weights are never updated afterwards, so the output weights are the only weights subject to training [26].
Compared with the original RNN, the ESN has a simplified structure and training approach, which ensures a simple and fast training procedure with low computational cost. Compared to other ANN models, the ESN contains a large number of sparsely connected neurons within the reservoir, as depicted in Figure 8. Although the ESN has been applied in many research fields, to the best of our knowledge its application to energy TSF problems is still limited; specifically, it has been used only to model the TSF problem of wind energy [33].
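The idea of a fixed reservoir with a trained output layer can be sketched as follows. This is an illustrative toy, not the configuration used in the experiments: the reservoir size, sparsity, spectral radius of 0.9, ridge penalty, washout length, and the sine-wave series are all assumptions.

```python
import numpy as np

def run_reservoir(inputs, W_in, W_res):
    """Drive the fixed reservoir with the input sequence and collect its states."""
    state = np.zeros(W_res.shape[0])
    states = []
    for u in inputs:
        state = np.tanh(W_in @ np.atleast_1d(u) + W_res @ state)
        states.append(state)
    return np.array(states)

def train_readout(states, targets, ridge=1e-6):
    """Fit only the output weights, by ridge regression in closed form."""
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)

rng = np.random.default_rng(42)
n_res = 50
W_in = rng.uniform(-0.5, 0.5, size=(n_res, 1))             # fixed input weights
W_res = rng.uniform(-0.5, 0.5, size=(n_res, n_res))        # fixed reservoir weights
W_res[rng.random((n_res, n_res)) > 0.1] = 0.0              # keep roughly 10% of the connections
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))    # rescale to spectral radius 0.9

series = np.sin(0.2 * np.arange(300))                      # toy series
states = run_reservoir(series[:-1], W_in, W_res)
targets = series[1:]                                       # one-step-ahead targets
washout = 50                                               # discard the initial transient
W_out = train_readout(states[washout:], targets[washout:])
predictions = states @ W_out
train_rmse = np.sqrt(np.mean((predictions[washout:] - targets[washout:]) ** 2))
```

Only `W_out` is learned; `W_in` and `W_res` stay as generated, which is why training reduces to a single linear solve.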

The ESN architecture.
Deep learning is the branch of artificial intelligence that is currently driving significant progress. Deep learning methods based on deep ANN architectures have repeatedly outperformed their shallow neural network counterparts [48]. Hence, real applications based on deep learning have grown rapidly because of the high-performance computing capability of deep learning, including its capability to deal with large datasets [63]. Deep learning is therefore well suited to energy TSF problems, as it is readily applicable to large datasets, complex variables, multivariate inputs, and forecasts over multiple time steps [32, 80].
Deep NNs have the same structure as shallow NNs except that they include more hidden layers, such that each layer processes a portion of the underlying task [78], as shown in Figure 9. The role of the additional hidden layers is twofold: first, they recombine the features learned by the preceding layers; second, they create new representations at higher levels of abstraction [78]. Deep learning algorithms have been used efficiently in many energy TSF applications, such as renewable energy [15], solar energy [89], electricity demand [41], and load and electricity consumption forecasting [2, 32].

The deep neural networks architecture.
The most common way to build a deep network is to stack layers one above another, with various ways of stacking and various learning mechanisms [68]. In this paper, we build two deep models, namely DRNN and DLSTM, in the same way as described in our previous works [3, 4], which give more information about the structure of these deep models, hyperparameter selection, optimization technique, loss function, and other experimental conditions. These two contributions provided experimental evidence of the significant benefits of building RNN and LSTM models with deep architectures.
We have reached the end of the first part of this paper, which is concerned with the analytical phase. Overall, we reviewed the structures and properties of various conventional, machine learning, and deep learning methods. Specifically, we briefly described the most common methods of each methodology, 15 in total, that can be tailored to solve various energy TSF problems, with pointers to existing works. A qualitative comparison among these methods, including the strengths and limitations of each method, is listed in Table 11 at the end of this paper. In the second part of this paper, we conduct an empirical assessment and comparison of all the methods described in the first part, using the same experimental conditions and datasets. This assessment is intended as a guide for researchers and practitioners in the energy domain on how to select the relevant forecasting model efficiently.
Forecasting evaluation metrics
Evaluating the performance of a forecasting model is an essential step before selecting a suitable forecast. The key factor in selecting the appropriate forecasting model is accuracy, which is determined by how well the chosen model performs on unseen data samples. In a forecasting experiment, the dataset is divided into two disjoint subsets, training and testing. The training data are used to fit the model parameters, whereas the testing data are used to evaluate the model's predictions. Since the testing data are not used during model training, they should provide a reliable indication of how well the selected model forecasts unseen data.
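A split into disjoint training and testing sets can be sketched as below. The sketch assumes that the split is chronological (earliest observations for training, latest for testing), which is the usual practice for time series but is not stated explicitly here; the helper name and the default fraction of 0.67 are illustrative.

```python
import numpy as np

def chronological_split(series, train_fraction=0.67):
    """Split a series into an early training part and a later testing part."""
    series = np.asarray(series)
    n_train = int(len(series) * train_fraction)
    # No shuffling: the test set lies strictly after the training set in time
    return series[:n_train], series[n_train:]

series = np.arange(2738)                  # stand-in for 2,738 ordered observations
train, test = chronological_split(series)
```

Keeping the order intact prevents future observations from leaking into the training data.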
For this purpose, many performance measures have been introduced to evaluate the model accuracy and calculate the forecast error. The forecast error is the difference between the actual value and the forecast value over the underlying interval of the time series. This can be written mathematically as [69] e_t = y_t - ŷ_t, where y_t is the actual value and ŷ_t is the forecast at time t.
There are two broad kinds of prediction errors [74]: scale-dependent errors, such as RMSE, and percentage errors, such as MAPE and RMSPE.
Forecast measures based on percentage errors have the disadvantage of being undefined if the data contain zero values, i.e. y_t = 0, in the selected time period. Compared to scale-dependent errors, however, percentage error measures are unit-free, and they are widely reported to be more accurate and efficient in tracking forecasting precision and evaluating forecast performance, because they are scale-independent. Accordingly, percentage error measures are frequently used in practice to assess different forecasts, particularly when datasets with different scales are used [65]. For this reason, in the experimental section of this study, we rely on the two measures MAPE and RMSPE.
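The three measures reported in the experiments can be sketched with their usual definitions. The function names are hypothetical, and the choice to raise an error when an actual value is zero is an assumption that mirrors the undefined case noted above.

```python
import numpy as np

def _as_arrays(y_true, y_pred):
    return np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)

def _percentage_errors(y_true, y_pred):
    y_true, y_pred = _as_arrays(y_true, y_pred)
    if np.any(y_true == 0):
        raise ValueError("percentage errors are undefined when an actual value is zero")
    return (y_true - y_pred) / y_true

def rmse(y_true, y_pred):
    """Root mean squared error (scale-dependent, in the units of the data)."""
    y_true, y_pred = _as_arrays(y_true, y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(_percentage_errors(y_true, y_pred)))

def rmspe(y_true, y_pred):
    """Root mean squared percentage error, in percent."""
    return 100.0 * np.sqrt(np.mean(_percentage_errors(y_true, y_pred) ** 2))

y_true = [100.0, 200.0, 400.0]   # toy actual values
y_pred = [110.0, 180.0, 400.0]   # toy forecasts
scores = {"RMSE": rmse(y_true, y_pred),
          "MAPE": mape(y_true, y_pred),
          "RMSPE": rmspe(y_true, y_pred)}
```

Because RMSPE squares the relative errors before averaging, it penalizes large relative misses more heavily than MAPE does.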
In this section, we conduct empirical comparisons among the selected models described in section 3, 15 models in total. These models are SES, MES, ARIMA, SARIMA, and VAR as representatives of conventional models; kNN, SVR, LSSVR, MLP, ESN, RNN, and LSTM as representatives of machine learning models; and DRNN and DLSTM as representatives of deep learning models.
The assessments are conducted through three case studies employing three different energy time series datasets, each for a different energy application. Two of these datasets contain univariate time series observations, whereas the third contains multivariate time series observations. In each experiment, we describe the employed dataset and then show the related forecasting results.
It is not logical, and may be unfair, to assess and compare all models of the three methodologies on the same basis in spite of the differences in the nature and architecture of each model. Therefore, to make a convincing assessment, in each case study we first compare the performance of the conventional methods against the standard machine learning models. Then, we compare the performance of the standard machine learning (or shallow neural network) models against the deep learning (or deep neural network) models.
Hardware and software platforms
All experiments in this paper were implemented on an HP workstation PC running the Ubuntu 16.04 operating system. The computational time of each algorithm was estimated on a system with an Intel Core™ i7-6700 CPU @ 3.40 GHz (x64-based processor) and 8.00 GB RAM, under the Python 2.7 software environment. For the ANN methods, the Keras library was used with the open-source TensorFlow [50] library as the back end.
Case Study-I: Univariate time series
This case study concerns an electrical energy consumption problem.
Description of dataset
The data samples of this dataset were collected from the City of Bloomington Utilities (CBU) intake tower. This facility pumps water from Lake Monroe to the Monroe water treatment plant in southern Indiana, U.S. The daily energy consumption of this facility is measured in MWh from January 2011 to June 2018. The data are available for public use on the website [36]. The dataset has 2,738 observations, split into a training set (67%) and a testing set (33%). The power consumption is shown in Figure 10 (daily) and Figure 11 (weekly).

CBU Monroe intake tower power: daily consumption.

CBU Monroe intake tower power: weekly consumption.
Table 1 shows the performance of the conventional methods and the machine learning methods, along with the best parameter values for each method. It is clear that the machine learning models improve on the conventional methods across the three performance metrics. Taking MAPE as an example, SES showed 8.75%, ARIMA showed 8.53%, LSSVR showed 5.44%, MLP showed 6.17%, ESN showed 7.9%, RNN showed 5.32%, and LSTM showed 5.19%. The LSTM not only outperformed the conventional counterparts but also reduced the forecasting error compared to the other machine learning counterparts (ESN, RNN, and MLP), even though the difference with RNN is small. The same trend can be observed in the results of the RMSPE measure. On the RMSE measure, the same trend appears: both ARIMA and SES showed about 1.7%, MLP showed 1.31%, and ESN showed 1.54%, whereas both RNN and LSTM showed the same error rate of about 1.06%.
Comparison among conventional and machine learning models for case study-I. NoL: No of layers; NoN: No of neurons
Since the RNN and LSTM demonstrated the smallest forecasting errors among the ANN counterparts, we examined their performance when deep architectures are adopted. Table 2 and Table 3 display the comparison between shallow and deep architectures for RNN and LSTM, respectively. It is clear that the deep models improved on the performance of the standard (shallow) models. On the RMSPE metric, the RNN showed around 7% whereas the DRNN showed around 3.5%, and the LSTM showed around 6.9% whereas the DLSTM showed around 3.3%.
Comparison between RNN and DRNN for case study-I
Comparison between LSTM and DLSTM for case study-I
The reduction in forecasting error brought by the deep models over the shallow models is clearly large. Indeed, this assessment confirms the superiority of deep learning models over conventional and machine learning models. For visual comparison, Figure 12 illustrates the consumption prediction for the last four months of the first year using the methods SARIMA, LSTM, and DLSTM. We selected these three methods and this specific period in order to have a clear visual illustration.

The consumption prediction during the last four months of the first year for the methods ARIMA, LSTM, and DLSTM.
This case concerns a natural gas consumption problem.
Description of dataset
The data samples of this case study were collected at 5-minute intervals for three months, from January to March 2014. They record the natural gas consumption of building number 74 on the Lawrence Berkeley National Lab campus (BNC). Lawrence Berkeley National Laboratory (Berkeley Lab) is a U.S. Department of Energy (DOE) Office of Science laboratory managed by the University of California. Since natural gas meters measure volume and not energy content, natural gas companies use a therm factor to convert the volume of gas used to its heat equivalent and thus calculate the actual energy use. Accordingly, the natural gas consumption at BNC is measured in Therms/hr. The dataset has 25,908 observations, split into a training set (67%) and a testing set (33%). The daily natural gas consumption at BNC is shown in Figure 13.

The daily natural gas consumption at BNC.
We follow the same scenario adopted in the previous case study. Table 4 displays the performance of the conventional methods and the machine learning methods, with the best parameter values for each model. Again, the machine learning models improve on the conventional methods across the three performance metrics. Taking the RMSPE measure this time, ARIMA showed 18.6% whereas SES showed about 18.1%. Although MLP and ESN are machine learning models, they showed errors of 17.61% and 16.98%, respectively, which are very close to the performance of the conventional models. In contrast, RNN showed 2.9% and LSTM showed about 3.1%, a significant improvement over the other contenders. The same trend can be observed in the results of the MAPE measure. On the RMSE measure, the same trend appears: ARIMA, SES, ESN, and MLP showed about 0.7%, whereas a big improvement was brought by both RNN and LSTM, which showed the same error rate of about 0.1%.
Comparison among conventional and machine learning models on dataset-II. NoL: No of layers; NoN: No of neurons
For the deep learning models, Table 5 and Table 6 display the comparison between shallow architectures and deep architectures for each model separately. Again, it is clear that the deep architectures improve on the performance of the shallow architectures. On the RMSPE measure, the RNN showed 2.9% whereas the DRNN showed about 2.1%, and the LSTM showed about 3.1% whereas the DLSTM showed about 1.9%. The improvement in forecasting brought by the deep models over the shallow ones is considerable, which confirms their superiority over conventional and standard machine learning models.
Comparison between RNN and DRNN for case study-II
Comparison between LSTM and DLSTM for case study-II
This case study concerns the electric power consumption of an individual household. The individual household sector is one of the largest consumers of electric energy, so the rational consumption of electricity at home is of great importance [11]. Accordingly, this case study is considered a large-scale TSF problem.
Description of dataset
The household power consumption dataset is a multivariate time series dataset that describes the electricity consumption of a single household over almost four years (47 months). The instances were collected every minute, which yields a total of 2,075,259 instances [82]. Besides the date and time, the dataset contains the following seven variables of interest: global-active-power (the total active power consumed by the household), global-reactive-power (the total reactive power consumed by the household), voltage (average voltage), global-intensity (average current intensity), sub-metering-1 (active energy for the kitchen), sub-metering-2 (active energy for the laundry), and sub-metering-3 (active energy for climate control systems). The active energy is the real power consumed by the household, whereas the reactive energy is the unused power in the electric lines.
The best way to understand these two core variables is graphical visualization, which makes it possible to detect whether the data contain a significant trend, consistent or irregular patterns, or seasonality. For this purpose, we create a separate plot for each of the seven variables, as shown in Figure 15. A close look at the first panel, global-active-power, shows a strong annual seasonality. In addition, some trend can be noticed, as well as some random effects that influence the data distribution. Specifically, we can easily notice lower consumption over the summer months (middle of the year) and higher consumption in the winter months (at the edges of the plots). The same phenomena can be noticed, to some extent, in the second panel, global-reactive-power.

The consumption prediction of the last day in the first month for the methods SARIMA, LSTM, and DLSTM.

Visualization of the decomposed variables of the household power consumption (multivariate) dataset.
All the aforementioned properties of the dataset contribute to the nonlinearity and complexity of the current case study. Due to its voluminous size and nonlinearity, this dataset is ideal for evaluating the deep learning models against conventional methods. As explained in section 3.1, since the current problem is a multivariate regression problem, we use VAR instead of ARIMA and SARIMA, and MES instead of ES and SES. Also, due to technical limitations of kNN and SVR in modeling multivariate regression, we excluded them from this experiment.
Table 7 shows the performance of the conventional methods and the machine learning models. At first glance, the error values are large, particularly the values of the RMSPE and MAPE metrics, as a reflection of the larger values of the data samples [82]. Regardless of their magnitude, it is clear that the machine learning models outperformed the conventional methods on the three performance metrics. On the MAPE metric, MES showed 103.66 whereas VAR showed 89.64. In contrast, MLP showed 67.62, RNN showed 45.91, and LSTM showed 46.50. The same performance can be observed in the other metrics, RMSE and RMSPE. Overall, the RNN reduced the forecasting error compared to the other machine learning models (LSTM and MLP) and compared to the conventional methods.
Comparison among conventional and machine learning models on case study-III. NoL: No of layers; NoN: No of neurons
To examine the impact of deep architectures, we applied the DRNN and DLSTM in this case study as well. Table 8 and Table 9 display the comparison between shallow and deep architectures for RNN and LSTM, respectively. It is clear that the deep models improve on the performance of the shallow models. On the RMSPE metric, the RNN showed an error rate of 64.22 whereas the DRNN showed 62.57. Similarly, the LSTM showed an error rate of 66.08 whereas the DLSTM showed 64.71. The same trend can be noticed on the MAPE metric; however, all models showed an error rate of 0.56 on the RMSE metric. Figure 16 visualizes the performance of the shallow RNN against that of the DRNN. Undoubtedly, including many inputs, or variables, considerably increases the computational complexity of a model. This case study confirms the superiority of deep learning models over the other contenders in terms of forecasting accuracy.
Comparison between RNN and DRNN for case study-III
Comparison between LSTM and DLSTM for case study-III

Global active power prediction for RNN and DRNN.
Based on the empirical assessments described in the previous section, it is clear that the deep learning models, represented by DRNN and DLSTM, outperformed both the machine learning models and the conventional models. In this section, we analyze the performance of all models from different aspects.
Analysis in terms of accuracy and performance
Undoubtedly, the key factor in selecting the appropriate forecasting model is its accuracy and overall performance. Besides the presented tables, Figures 17-19 give a visual representation of the percentage errors, MAPE and RMSPE, of all models. It is clear that the errors shown by DLSTM and DRNN are lower than those shown by all other contenders. This performance is steady across the three case studies and did not change with the employed dataset.

Percentage errors RMSPE and MAPE for Case Study-I.

Percentage errors RMSPE and MAPE for Case Study-II.

Percentage errors RMSPE and MAPE for Case Study-III.
The figures point out that the performance of conventional methods (ARIMA, SES, SARIMA, VAR, MES), across the three case studies, is poor. This performance asserts that the conventional methods could not effectively extract the features from a time series dataset. The same poor performance, but with a little improvement, is shown by MLP, kNN, SVR, and LSVSR. In contrast, consistent improvement in performance is shown by the shallow neural network models, while a significant improvement is brought by their deep counterparts. The reason for this improvement brought by deep models over the shallow models is that, the deep NN learns the nonlinear combinations, or correlations, of the features in higher layers of the network. However, the hidden neurons in shallow NN models learn the nonlinear combinations of the inputs as the features. Therefore, the deep architecture will learn the hierarchical features, for example in the load demand problem, at different layers smoother than shallow architecture [32].
Specifically, the significant improvement brought by DRNN and DLSTM is attributed to the properties of these models. They exploit the benefits of memory loops that integrated in a deep architecture, enabling each network to learn from a long horizon dataset [48, 68]. In addition, both models are relying on the dynamic (or active) learning mode, in which the model integrates the historical observations with the recent ones to efficiently predict the future [41]. This property is very useful particularly when process an energy demand problem with long horizon. On the contrary side, the conventional methods and other machine learning methods are static learning based approaches. Actually, they do not rely on the explicit relationship among the historical data and future data. Rather they just learn from the available historical observations for prediction [41]. Such a static learning mode faces troubles when a long-term forecasting is needed. Certainly, the longer the forecasting horizon, the greater is the possibility of a forecasting error.
The presented results indicate a relation between the forecasting error and the size of the dataset. Specifically, for the first dataset, which is small (2738 samples), the conventional methods reached about 11.5% on the RMSPE measure and 8.5% on the MAPE measure. As the dataset grows larger, as in the second dataset with 25908 instances, the error rates jump to about 18% on the RMSPE measure and 10% on the MAPE measure. When the dataset becomes very large and includes multiple variables, as in the third dataset with 2075259 instances, the difference in performance becomes clear, with the deep learning models being superior. This performance gives credit to deep models for treating a broad spectrum of prediction applications, particularly if the observations correlate many variables [63].
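To make the two error measures quoted above concrete, the following minimal Python sketch computes MAPE and RMSPE under their commonly used definitions, assuming that the definitions given earlier in this paper match them up to notation. The function names and the sample values are illustrative assumptions and are not taken from the case studies.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent (common definition, assumed here)."""
    ratios = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(ratios) / len(ratios)

def rmspe(actual, predicted):
    """Root mean square percentage error, in percent (common definition, assumed here)."""
    squared = [((a - p) / a) ** 2 for a, p in zip(actual, predicted)]
    return 100.0 * math.sqrt(sum(squared) / len(squared))

# Illustrative values only; actual observations must be non-zero.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

mape_value = mape(actual, predicted)
rmspe_value = rmspe(actual, predicted)
print(f"MAPE = {mape_value:.2f}%, RMSPE = {rmspe_value:.2f}%")
```

Because RMSPE squares the relative errors before averaging, it is never smaller than MAPE on the same forecasts and it penalizes large individual errors more heavily, which is consistent with the RMSPE levels above being higher than the MAPE levels.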
Looking inside each dataset, we notice that the observations of all datasets were recorded at short intervals: every one and five minutes for datasets I and III, respectively, and every hour for dataset II. Given their poor performance, the conventional methods are not a good fit for such short-interval observations. This conclusion is consistent with the result reported in [59], which states that conventional methods are preferred for yearly energy consumption. In line with the latter conclusion, and given their good performance, the deep learning models are more robust and stable across all data intervals and forecasting horizons. They provide the best forecasting for both short-term and long-term energy consumption. Surprisingly, this contradicts the statement by Wei et al. that deep learning models need further validation compared to the conventional methods [59].
In terms of data structure, the machine learning models, including the deep learning ones, have a strong capability to handle nonlinear and heterogeneous time series data in both short-term and long-term forecasting problems. On the contrary, the conventional methods struggle when handling nonlinearity and nonstationarity in time series data. Nevertheless, it is worth highlighting that the conventional methods, in general, do not need much historical data to function, unlike the learning-based models, which require plentiful data to learn.
Analysis in terms of computational complexity
Toward a rational assessment, we calculated the computational complexity (CC) of each method in order to determine the time taken to fit a model and to use it for prediction. In the same way as described in [75], the CC of a specific method for accomplishing a specific task can be represented as shown in Eq. (23). As the denominator in Eq. (23) is the same for all computations, we can rely on the numerator value only and, eventually, obtain a value that indicates the proportional time taken by a model to achieve the prediction.
To avoid redundancy, we calculate the CC required by each model using only the samples of case study I, as shown in Table 10. The fitting time is the time required to train the model, whereas the prediction time is the time required to test or validate the model. It is clear that the computational requirements of the conventional methods are very low compared to the other contenders, particularly the deep models. For example, the prediction times of ARIMA and SES are about 0.001 and 0.002 seconds, respectively, whereas DRNN and DLSTM need 0.33 and 0.6 seconds to predict, respectively. This is undoubtedly a major advantage of the conventional methods over the other models.
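The distinction between fitting time and prediction time can be illustrated with a short Python sketch that measures both as wall-clock durations. This is only one plausible way to obtain such timings and is not necessarily the procedure behind Eq. (23) or Table 10; the `MeanForecaster` class, the `time_model` helper, and the synthetic series are hypothetical stand-ins introduced for illustration.

```python
import time

class MeanForecaster:
    """Hypothetical stand-in model: forecasts the mean of the training series."""

    def fit(self, series):
        self.mean_ = sum(series) / len(series)
        return self

    def predict(self, steps):
        return [self.mean_] * steps

def time_model(model, train, steps):
    """Return (fitting time, prediction time, forecast), with times in seconds."""
    start = time.perf_counter()
    model.fit(train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    forecast = model.predict(steps)
    predict_seconds = time.perf_counter() - start
    return fit_seconds, predict_seconds, forecast

# Synthetic series with the same length as the first dataset (2738 samples);
# the values themselves are made up.
train = [float(i % 24) for i in range(2738)]
fit_s, pred_s, forecast = time_model(MeanForecaster(), train, steps=10)
print(f"fitting: {fit_s:.6f} s, prediction: {pred_s:.6f} s")
```

Any model exposing `fit` and `predict` could be passed to the same helper, so that all contenders are timed on identical data and hardware. Averaging over several runs would reduce the influence of timer noise on very fast models.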
Model computational time in seconds
At the same time, the fastest machine learning model is kNN, which requires 0.001 seconds to predict. The fastest among the ANN models is ESN, which requires about 0.03 seconds to converge. The ESN time requirement is reasonable, since ESN was developed as a faster version of RNN. Meanwhile, it is entirely plausible that the deep learning models require a longer time to function as the number of hidden layers and neurons increases. Another negative aspect of increasing the number of layers in deep models is the possible occurrence of overfitting, especially when the data lack diversity [32]. Nevertheless, using convolution optimization techniques may solve the overfitting problem by optimally selecting the proper parameters of the deep model according to the problem at hand [46]. In addition, with the ongoing progress in hardware and learning algorithms, there are many remedies for the expensive computation cost of deep learning models [9, 25].
Overall, the main finding of this paper is the suitability of deep learning models for various energy problems. The three case studies confirm the stability, good performance, and robustness of the deep learning models DRNN and DLSTM. Accordingly, there is no reason for some recent articles to assume that deep learning is not stable because it has not yet been tested [59]. We are very certain that the adoption of deep learning methods in the energy domain is now at the same maturity level as the conventional methods, which have been the dominant methodologies for decades.
Challenges for future research
Despite the comprehensive analytical and empirical review presented in this paper, a number of challenges remain as open-ended questions. The probabilistic prediction of various energy problems has not attracted enough attention; however, it is a vital research domain that should be addressed. In energy management systems, probabilistic predictions can provide a range of energy changes that may help to quantify the uncertainty involved [11]. Most of the research in the energy TSF domain considers only the historical data and neglects other influencing or exogenous factors. For example, in energy consumption problems, many factors influence the rates of consumption in a building or a city, including weather conditions, human occupancy, and indoor conditions. These influencing factors, and many others, need to be taken into account in order to obtain efficient forecasting results. Owing to their robust mathematical and statistical foundations, conventional methods are able to perform well in representing the relationship between the historical data and the influencing factors. In contrast, since ANNs are black-box approaches, they might not have the same ability. Finally, handling collaborative training on multiple energy prediction tasks is essential for the future of energy research.
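As a simple illustration of the probabilistic prediction mentioned above, the following Python sketch turns a point forecast into a prediction interval using the empirical quantiles of past forecast residuals. This is a basic, well-known technique chosen here for illustration only; it is not a method evaluated in this paper, and the function names and residual values are hypothetical.

```python
def empirical_quantile(values, q):
    """Quantile of a list by linear interpolation between order statistics (0 <= q <= 1)."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    fraction = position - low
    return ordered[low] * (1 - fraction) + ordered[high] * fraction

def prediction_interval(point_forecast, residuals, coverage=0.9):
    """Interval around a point forecast from residuals (actual minus predicted)."""
    alpha = (1 - coverage) / 2
    lower = point_forecast + empirical_quantile(residuals, alpha)
    upper = point_forecast + empirical_quantile(residuals, 1 - alpha)
    return lower, upper

# Hypothetical residuals collected on a validation set.
residuals = [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]
lower, upper = prediction_interval(50.0, residuals, coverage=0.9)
print(f"90% interval: [{lower:.1f}, {upper:.1f}]")
```

The interval assumes that future errors behave like the past residuals, which may not hold for long horizons or under changing consumption patterns. More elaborate probabilistic models would be needed to relax that assumption.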
All these challenges represent future research directions that should be further investigated by researchers. In addition, a number of extensions to the work presented in this paper could be addressed. For example, the forecasting problem of multiple variables needs deeper investigation, particularly in the case of missing and distorted sequential observations. Furthermore, various optimization techniques can be adopted in order to improve the prediction accuracy of deep learning models. Moreover, developing an intelligent prediction platform, with comprehensive auto-analysis and visualization of useful insights, would be a great enhancement toward intelligent energy management systems [11].
Conclusions and recommendations
The time series forecasting problem is of remarkable importance in various practical and industrial applications nowadays. Various methodologies have been widely employed to solve different time series forecasting problems. This paper presents a qualitative analytical review along with a quantitative empirical assessment of the conventional, machine learning, and deep learning methodologies that are applicable in the energy TSF domain. In the analytical review part, we do not claim to have reviewed all existing models in this wide domain; however, we selected the most remarkable models, 14 in total, along with the advantages and disadvantages of each model, as shown in Table 11 at the end of this paper. In the empirical assessment part, we conducted several experiments using the selected models. Three real-world datasets from the energy domain were used: two datasets include univariate observations and one dataset includes multivariate observations. Apart from the high computational requirements, the analytical and empirical assessments indicate the superiority of the deep learning models in terms of accuracy and the forecasting horizons examined, compared to the other contenders. Ideally, the outcome of this study should motivate the AI and machine learning community toward further development in this line of research.
Analytical comparison among the selected models
Based on the aforementioned qualitative and quantitative assessments, we can suggest the following recommendations. The deep learning models DRNN and DLSTM are robust and stable enough to be applied in all forecasting horizons of energy applications. In addition, they support the dynamic learning mode, in which the model integrates the historical observations with the recent ones to predict the future, unlike the other machine learning methods and the conventional methods, which rely directly on the historical observations only, a mode known as static learning. It is possible that hybrid methods, which combine one or more conventional and machine learning methods, outperform a single method, even a deep learning one [93]. This combination is not addressed in our paper; however, the experiments of this paper reveal that single models are the best choice for a user who would like to promptly estimate an energy task. In contrast, hybrid models are the best choice for users who know the basics of machine learning techniques and can integrate more than one model with a suitable optimization algorithm. Nevertheless, keep in mind that the growing number of parameters that results from combining models is the main drawback of such hybridization.
Footnotes
Both the ANNs and the conventional models aim to improve the forecasting accuracy by minimizing the sum of squared errors, but they differ in how this minimization process is carried out.
