Abstract
In recent years, deep learning models have been developed to address probabilistic forecasting tasks, assuming an implicit stochastic process that relates past observed values to uncertain future values. These models are capable of capturing the inherent uncertainty of the underlying process, but they ignore the model uncertainty that comes from the fact of not having infinite data. This work proposes addressing the model uncertainty problem using Monte Carlo dropout, a variational approach that assigns distributions to the weights of a neural network instead of simply using fixed values. This allows to easily adapt common deep learning models currently in use to produce better probabilistic forecasting estimates, in terms of their consideration of uncertainty. The proposal is validated for prediction intervals estimation on seven energy time series, using a popular probabilistic model called Mean Variance Estimation (MVE), as the deep model adapted using the technique.
Keywords
Introduction
Several human activities depend on the forecasting of future values. A big ammount of research in machine learning has focused on how to perform these forecasting tasks using data of previous historical measurements. In general, these approaches are supervised, meaning that training data is used to build a model that transforms observations of values related to the forecasted phenomenon into predictions of the future values. These predictions, though, are uncertain, because of two reasons: on the one hand, it is impossible to know all the information related to the process generating the data. For example, when predicting load demand of a household, it depends on the activities of those who live in the house, and those activities depend on many factors that are not completly known, such as weather, their health, neighborhood activities, and more. This kind of uncertainty is called inherent or aleatoric uncertainty of the process, and it is present in most real life forecasting situations. On the other hand, many models can be used to produce predictions using the same previous information. Machine learning methods usually select one model from all possible ones by minimizing a desired loss function, chosen in advance, on training data. When doing so, it leaves out many other models that may fit the same data, and that, due to incomplete data, may be more close to the real process. This type of uncertainty, related to the model to use itself, is called epistemic uncertainty. In most probabilistic forecasting tasks, instead of a probabilistic distribution for future values, the problem is tackled producing prediction intervals, which give decision makers a good insight into how confident they can be of model predictions. This has been done for tasks that present a big inherent uncertainty, such as wind speed, wind power and electric consumer load forecasting, as mentioned in [1, 2]. Therefore, failing to consider all sources of uncertainty can lead to narrow prediction intervals, making systems that depend on these forecasts to be less reliable. An example of this can be found in wind power generation forecasting, where prediction intervals are used to decide which among the other power generators of the system have to work to satisfy the overall demand in case wind power is not enough. Consequently, narrow prediction intervals result in higher energy costs and a less reliable electrical grid [3].
During recent years, deep learning models composed of neural networks have attracted attention due to their capacity to model complex non linear relationships between past values and future predictions. To perform probabilistic forecasts, a popular technique is to apply the Mean Variance Estimation (MVE) method, which makes use of a neural network architecture with two outputs and is trained to predict the mean and variance of each future value [4]. This method allows the model to learn to predict the inherent uncertainty of the process and tackle prediction in heteroscedastic time series, that is, that have non-constant variance, and seems to work well in practice, as shown in [1, 2]. Unfortunately, the MVE method alone does not tackle the model uncertainty source, which is important for proabilistic forecasting. In order to tackle this uncertainty, other approaches have been explored recently that are based on variational inference, and have been applied on tasks that are different from probabilistic forecasting. These approaches replace network weights values by a posterior distribution learnt during training. One such approach is Monte Carlo (MC) dropout, a technique that randomly disables connections within the network, allowing it to learn to extract relevant features with redundancy during training, and then, during evaluation, can be seen as a simulation of an ensemble of networks that share parameters. To observe the effects that modelling both inherent and model uncertainty has compared to only modelling the former one, the results of a toy example are presented in Fig. 1. Here, we draw comparisons between a single MVE network, and an MVE network using MC dropout trained on synthetic one dimensional training data. Both models learn to predict a mean and a variance similar to the one used to generate the data in regions with training data, but they differ on how they make predictions in other regions. The MVE network cannot provide a reliable uncertainty estimation in other regions, because it does not consider the epistemic uncertainty, and this is specially noted in the uncertainty of the mean estimation. It can be seen that MC dropout estimates of the mean are more uncertain in regions without training data. For probabilistic forecasting tasks, this translates into wider and more accurate prediction intervals.
Mean, variance and final prediction in black, for two models trained using the same synthetic training data in orange (grey when printed in black and white). First row: An MVE network captures changing variance in data, but does not consider model uncertainty in regions lacking training data. Second row: An MVE network using MC dropout captures both variance in data, and uncertainty of other regions, specially for the mean prediction.
The hypothesis of this work is that applying MC dropout to neural network models used for probabilistic forecasting tasks improves their performance due to incorporing the model uncertainty. In particular, this is experimentally supported for the estimation of prediction intervals using a Long Short-Term Memory (LSTM) network, comparing between a traditional MVE network and one with MC dropout for four wind speed, one wind power, one consumer electrical load and one utility power load forecasting tasks. This work is an extension of the research presented in [5]. The main differences are:
An additional wind power dataset and an additional consumer electrical load dataset have been added to the experimental evaluation. A thorough review of the related work was done and included in an individual section. There is a deeper quantitatively and qualitatively comparison between models. There is a comparison between the proposed model and an ensemble of models, for one dataset.
The paper is structured as follows: Section 2 makes a review of related work, Section 3 introduces how prediction intervals are estimated, Section 4 details the proposal, Section 5describes the experiments to validate the proposal, and Section 6 makes final remarks and describe future research directions.
Initial probabilistic forecasting approaches based on neural networks, considered only homoscedastic time series [6], which is different to the approach of this work that can handle heteroscedastic time series. The works of [7, 8, 9] can handle heteroscedasticity using different optimization algorithms and objectives functions to build intervals. This work, instead, uses stochastic gradient optimization, which is simpler than those ad-hoc algorithms. Another work [10], also use a mix of complex models, which are harder to interpret and to optimize. Simpler approaches using similar optimization techniques than this work are [11, 12], which are based on Extreme Learning Machines (ELM), but do not use a recurrent model, as in this work. In particular, [12] considers the model uncertainty problem, as in this work, but using bootstrap, which is an ensemble technique, heavier than the proposed variational approach. At the same time, a complete solution using also a bootstrap ensemble of ELM networks to handle the model uncertainty problem is provided in [13], while this work handles it using MC dropout. The work of [14] improves these models using a recurrent network, as in this work, instead of a feed forward one or ELM, but it uses a complex optimization algorithm, and not stochastic gradient optimization as in this work. An interesting approach using deep architectures is [15], which considers an LSTM stacked to a CNN in order to produce point forecasts, and then the probabilistic part is modelled using clusters of similar data, although they do not consider the model uncertainty problem. Some works rely on input scenarios and output simulations to produce probabilistic forecasts, such as [16, 17, 18], using neural network models that are not essentially probabilistic, as in this work, and consider the model uncertainty using ensembles. In [19], a recurrent network is used as well, and its output is optimized to produce quantiles based on an adjustable distribution shaped by splines, which is flexible. In constrast, this work uses a simpler normal distribution, as the focus is to validate the model uncertainty issue. In the same line, [20] uses a single layer feed forward network and [21] uses a more complex neural network to produce quantiles, minimizing the pinball loss, while the output distribution is estimated using Kernel Density Estimation. Both works have networks with many outputs to produce quantiles, differently from this work where the network has only two ouputs: mean and variance. Another approach is [22], that uses a Deep Belief Network model, much harder to optimize and use than the simpler LSTM network used in this work. Closer to this work, [23] uses an MVE network, although not exploring the model uncertainty problem. Also, [24] uses a similar MVE network and addresses the model uncertainty problem, just as in this paper, but with two differences: it is trained using an ad-hoc metric for prediction intervals, which cannot be extended easily to other probabilistic forecasting tasks, and the model uncertainty problem is tackled with bootstrap, building an ensemble of models trained by resampling training data. Similarly, [25] addresses both sources of uncertainty using an MVE-like model, but the network architecture is more complex, having a wavelet transform and then applying a 2D CNN, which is different from the recurrent network used in this work. Finally, [26] is the closest approach to this work, using an MVE model for the inherent uncertainty and MC dropout to handle the model uncertainty, differing from this work because they use a GRU recurrent network instead of an LSTM, and they only apply it to a wind speed forecasting dataset, while in this work it is applied to seven datasets, including wind speed, wind power and load demand forecasting tasks.
A few works have tackled the model uncertainty problem without considering the inherent process uncertainty. For example, the ensemble model from [27] produces many point forecasts, changing the input for them, but each output stays deterministic, meaning the inherent process uncertainty is not considered, as in this work. Others assume homoscedasticity, that is, a fixed distribution for all times. In this scenario the use of ensemble is explored [28, 29, 30].
Aside from the forecasting community, there is a branch of research that explore neural network parameters uncertainty. In general, they make assumptions regarding the distribution of those parameters, and use variational inference to approximate their posterior distribution, as explained by [31]. One example is [32], which assumes Gaussian parameters distributions. Also, as explained by [33], many regularization techniques usual in the deep learning community can be interpreted as assumptions on the parameters distributions. In particular, MC dropout has been adapted to this framework and applied to many kinds of networks. This work is based on he application of MC dropout for recurrent neural networks, shown possible by [34]. Only few works have tackled the problem of model uncertainty using this variational approach for prediction intervals estimation. Among them, [35] deserves mention, which is applied for Uber data, focusing mostly on the coverage of produced intervals, and [5], which this work is an extended version of.
Prediction interval estimation
In this work, the probabilistic forecasting task being addressed is the estimation of prediction intervals. This task can be posed as the following problem: Given an univariate time-series
where
from where it can be seen that the distribution of
Given training data
The first term is the variance of the mean estimation, as
where
Usually, neural network models based on Mean Variance Estimation (MVE) are trained to get a single value for its parameters,
The proposal of this work is to use a neural network model that combines the MVE method and MC dropout in order to build prediction intervals that include both inherent and model uncertainty, and that can handle heteroscedastic time series, allowing to produce better prediction intervals when presented with previously unseen data, compared to those obtained using the MVE method alone or including dropout only during training, as it is typically done.
Formally, the posterior distribution
The inequality comes from Jensen’s inequality, and the fact that
Based on this likelihood, the variational bound has to be maximized, which can be treated as minimizing its negative version:
This minimization is done using a stochastic gradient optimization algorithm, such as ADAM [38]. For each minibatch of size
To state clearly the steps performed by the training procedure, Algorithm 4 shows a step by step pseudocode version of it.
MVE Neural Network Model Training[1] Training time series
Note that, when no MC dropout is used, there is a similar estimation to be minimized by the optimization algorithm, although instead of using a sampled
This strategy was implemented for a Long-Short Term Memory (LSTM), as explained in Fig. 2, which is a recurrent neural network that is commonly used for sequence processing, such as time series forecasting. To make a forecast, the network receives as input the
Summary of the seven datasets used in this research to validate the proposal
The input 
A model that considers model parameters uncertainty should have better estimates of uncertainty, specially in the long term, because errors usually accumulate through time steps. A model trained to perform one step ahead forecasts can be used to produce prediction intervals for horizons larger than a single time step in the following way: the output of a one step ahead prediction is injected as input for the model to produce the next one until the horizon is completed. This is done many times to simulate multiple future trajectories which can then be transformed into prediction intervals, as described in Fig. 3. This procedure is sometimes called rolling forward forecasting. Then, prediction intervals are built for each time step using the same sorting strategy explained for one step intervals.
To build a trajectory of 
A step by step pseudocode procedure to produce prediction intervals for the following
Multi-step Probabilistic Forecasting[1] Network with parameters
Data
To assess the proposal, seven datasets of four representative probabilistic forecasting tasks were considered, summarized in Table 1, showing different global distribution of values, as shown in Fig. 4. A brief description of each of them is given below:
B08, D05a, D08 and E01: These are four wind speed measurements datasets taken by prospecting towers located in different places of the Atacama Desert, in Northern Chile.1
Data can be downloaded from
Histogram of time series values for all datasets.
UCI individual household electric power consumption: This dataset is composed of electrical load consumption measured every minute in a home for almost 4 years in France [39]. It presents many missing values, which were filled before the averaging process by copying the value at the same hour and minute from the previous day. For example, if the 10 am measurement of a day is missing, the value at 10 am of the previous day is used. The high resolution of this dataset makes this time series very hard to forecast, as the model would require information from many time steps in the past days for its input, in order to make a good minutely forecast. In this work, instead, the resolution was reduced to mean hourly values to be able to build models using the same architecture as the ones applied to the other datasets, taking the time series comprised by the hourly sum of each minute value as the input for the model. The distribution of values is highly two modal, as it can be seen in Fig. 4.
GEFCom2014-L: It corresponds to the power load of an electric utility in the US measured every hour for 7 years, and it was made available for a Global Energy Forecasting Competition [40]. Its distribution is smoother than the other series, with one dominant mode.
Canela 1: This dataset consists of hourly values of power injected to the electrical grid by a wind power plant located in Chile.2
Data can be downloaded from
In order to evaluate the proposal over different regimes of a time series, each dataset was divided in five blocks, considering a small overlap between them, as shown in Fig. 5. For each block, the last 1000 measurements were left as test set and, from the remaining block data, 10% was considered as a validation set, used to select the best hyperparameters. For hyperparameters selection, a random search was done, as recommended by [41], and the selected parameters were those having the minimum average loss in the validation set, considering at least 5 repetitions with different random seeds. Hyper parameter values ranged according to values shown in Table 2.
Range of hyper parameter values explored
The data was normalized for each block, considering only the training set to fix the minimum and maximum normalization parameters.
Each time series is divided into five blocks of train, validation and test sets.
As the model proposed in this work was evaluated for the probabilistic forecasting task of prediction intervals estimation, the quality of the produced intervals has to be assessed considering the following three criteria:
Coverage: Each prediction interval is always associated to a Prediction Interval Nominal Confidence (PINC) that informs the associated probability of the next observation falling within the interval, as given by the model. Obviously, it is desired that future observations always fall inside the predicted intervals, but in reality this may not be the case. PINC can be seen as the expected percentage of future observations that may be inside of the predicted interval. In this work, all prediction intervals were evaluated for a nominal confidence (PINC) of 90%.
Sharpness: It would be easy to satisfy the coverage requirement by just making the intervals as wide as possible, but then they would be useless. Because of this, it is desired that the intervals be as narrow as possible, while still covering the required PINC.
Resolution: In this paper we work with heteroscedastic time series, which means that the width of the interval should change in time, adapting to the real uncertainty of each time period.
Many metrics exist in the literature that consider these features, differing on the importance given to each of them, and the penalization in case the coverage is not met by the intervals. In this work, the Winkler Loss, also called Interval Score (IS), was used as a general proxy of the quality of the interval. For each observation, it gives a base weight given by the width of the interval, and penalizes linearly the observations that fall out of the interval, according to:
where
Prediction Interval Coverage Probability (PICP): It is a global measure of the percentage of observations falling inside the produced intervals: Prediction Interval Normalized Average Width (PINAW): It measures the average width of produced intervals, as a percentage of the data range: Coverage Width Criterion (CWC): It measures the quality of the interval globally, giving an exponential penalty when the coverage is not met. This metric is only low when the interval is working in terms of coverage, and then it favours narrower intervals:
Average Winkler, PICP and CWC metrics, considering different prediction horizons, for wind speed datasets. Standard deviation of each metric is displayed as shaded areas.
Average Winkler, PICP and CWC metrics, considering different prediction horizons, for electrical load and wind power datasets. Standard deviation of each metric is displayed as shaded areas.
All MVE models, including the proposed model, may also be used to forecast in the traditional sense, that is, not probabilistically, producing the median or the most expected future value. For this reason, the models were also evaluated in terms of the following point prediction metrics:
Mean Absolute Error (MAE): It allows to compare the models in terms of their median point forecasting performance, Root Mean Squared Error (RMSE): It allows to compare the models in terms of their mean point forecasting performance,
Average metrics for each dataset and dropout mode (N: No dropout, Y: Regularization dropout, MC: Monte Carlo dropout), for 1 step ahead forecasting
Average metrics for each dataset and dropout mode (N: No dropout, Y: Regularization dropout, MC: Monte Carlo dropout), for 12 step ahead forecasting
In order to study the behavior of the proposal when using Monte Carlo dropout, it was compared against similar neural network models, first using regularization dropout only during training, and then without using dropout at all. While the models are trained to produce one step prediction intervals, they are run to produce multi step prediction intervals for 24 steps in the future, as explained previously, meaning that they produce intervals for each hour of the next day. Resulting average metrics for all time steps are displayed in Fig. 6, for wind speed datasets, and Fig. 7, for the other datasets. The best set of hyper parameters values was chosen for each dataset and time step shown.
It can be seen that, in general, MC dropout performs similarly to regularization dropout and no dropout for few time steps, but its handling of model uncertainty allows it to perform better in the long term, displaying reduced Winkler loss for many time steps. Looking at the PICP metric, it is clear that the coverage is higher using Monte Carlo dropout, translating into more observations falling inside the produced intervals, except for the D08 dataset. Interestingly, for this dataset, the Winkler loss is still low having a reduced PICP, meaning that, while having many observations out of the interval, its width was smaller than the linear penalization used by the metric. In general, as the coverage is still met for further time steps in the future, the CWC metric is also low in comparison with regularization or no dropout, even more than Winkler loss, because it penalizes exponentially when the coverage is not met. In the case of dataset B08, the reduced coverage also makes CWC higher and, for dataset D05a, CWC growths further for MC dropout, because unfortunately one of the repetitions had low coverage, which turned in an exponential penalty that, when averaged with the other repetitions, made it grow large. This effect is also seen on the big standard deviations related to CWC in general. For the same dataset D05a, and for the UCI dataset, the average Winkler loss was very similar for regularization and Monte Carlo dropout models, meaning that, while the coverage was increased, it was due to a bigger interval width. Interestingly, for datset UCI, Winkler and CWC losses were very similar for regularization and MC dropout. This could be related to the fact that the series has two modes instead of only one, making it difficul for models based on normal distributions to improve the produced intervals.
Average metrics for each dataset and dropout mode (N: No dropout, Y: Regularization dropout, MC: Monte Carlo dropout), for 24 step ahead forecasting
Average metrics for each dataset and dropout mode (N: No dropout, Y: Regularization dropout, MC: Monte Carlo dropout), for 24 step ahead forecasting
Average Winkler loss for GEFCom2014-L dataset, comparing dropout modes
In order to compare the models in detail, Tables 3–5 show average metrics for 1, 12 and 24 steps ahead forecasting, including point forecasting metrics MAE and RMSE. While for 1 time step only dataset B08 presents a notable improvement, this is not the case for the other datasets, as seen in Table 3, meaning that considering the model uncertainty for one step ahead forecasting is not really necessary. Nevertheless, 12 step ahead results (Table 4) present mixed improvements, while 24 steps ahead results (Table 5) present a clear improvement using Monte Carlo dropout. This means that, although not necessary for short term prediction, considering model uncertainty using Monte Carlo dropout makes the model more robust for medium or long term forecasting tasks. This effect was specially notable for dataset GEFCom2014-L, an electrical load consumption dataset, which has the smoothest distribution, closer to the normal distribution used by the model. Also, as shown by point prediction metrics, adding Monte Carlo dropout does not reduce the performance of the model for point forecasting tasks, as shown by the MAE and RMSE metrics, meaning that models using Monte Carlo dropout can safely be used for multi step point forecasting tasks as well.
Average metrics for Canela 1 dataset, considering different prediction horizons, and considering an ensemble of models. Standard deviation of each metric is displayed as shaded areas.
Detailed metrics for selected time steps for Canela 1 dataset
In Table 6, average Winkler loss for all time steps is displayed for dataset GEFCom2014-L. It is clear that using Monte Carlo dropout handles better the uncertainty for long term probabilistic forecasting tasks. Also, standard deviation, in brackets, grows more slowly when using Monte Carlo dropout than when not using it. This is in line with our consideration of model parameters uncertainty, because it means that the model is more robust to changes in the random seed that controls the randomized parts of the training, like initialization of parameters and mini-batch selection.
As mentioned before, applying MC dropout to a neural network may be interpreted as simulating an ensemble of subnetworks that share parameters. In order to compare the proposed model against an ensemble, similar experiments as described for each model were conducted using an ensemble of 20 similar neural networks using regularization dropout or no dropout at all, only for Canela 1 dataset, due to the very long time an ensemble takes to train. The results for all time horizons are displayed in Fig. 8, where it can be seen that the ensemble behaves better than the other models, specially for long term forecasting, as expected. The coverage (PICP) is increased for all time steps, slightly making the intervals wider (PINAW), while keeping metrics Winkler and CWC low for intervals, and low RMSE for point prediction. The standard deviation of all metrics was gathered considering 5 repetitions, and the ensemble model always had the lowest, as expected, because the ensemble considers all sources of uncertainty, while the MC dropout proposal is the one having the lowest standard deviation after the ensemble. This is inline with the assumption that using MC dropout considers model uncertainty, thus reducing the uncertainty of the training procedure and finally making the technique more robust. In Table 7 the same metrics are detailed for some selected time steps, and also the number of LSTM hidden layers and LSTM nodes of each layer. As MC dropout disables some connections, the selection of hyper parameters chose networks with more hidden nodes, thus making the network more similar to those of an ensemble of models after applying dropout.
This work has explored how the behavior of a probabilistic forecasting recurrent deep learning model changes when considering the model uncertainty using a variational technique, in particular Monte Carlo dropout, for the task of producing prediction intervals. From the results, applying this technique for a recurrent model makes it to effectively handle the model uncertainty, and it is clear that this technique is specially important for multi-step prediction intervals generation, due to the increasing uncertainty of the future over the medium or long term, that is not completely handled by the base MVE neural network model.
From the results, it is concluded that, while the use of dropout is focused on handling the model uncertainty for probabilistic forecasting, its addition does not affect the point forecasting capabilities of the models, meaning that they can safely be improved for probabilistic forecasting using MC dropout, without interferring point forecasting metrics.
Additionally, the proposal seems to behave more similar to an ensemble of models for one dataset in terms of measured metrics. While this was expected, further comparison would be needed if their similarities and differences want to be clearly understood.
Although the proposal has been shown for univariate time series, the analysis for multivariate series is similar, and extending the use of MVE networks with MC dropout for them is straightforward, adding outputs of mean and variance estimates for each output dimension. Also, although in this work the deep architecture used is a recurrent one, the use of MC dropout can be done for other kind of networks, such as feedforward, convolutionals, extreme learning machines, etc.
As a future work, it is recommended to explore other variational techniques than dropout, in order to handle model parameters uncertainty, allowing more interpretability of the network parameters. Also, the assumption used in this work regarding the normal distribution of the outputs may be improved to handle time series with other distributions, specially those that are asymmetric or multimodal.
Footnotes
Acknowledgments
This work was supported in part by Conicyt doctoral scholarship 21170109, Fondecyt grant 1170123, Basal project FB0821, and UTFSM-PIIC grant 2019.
