Abstract
The paper presents a novel approach for hourly short term forecast of load active power using discrete state Hidden Markov Models. The load data used belongs to the New York Independent System Operator (NYISO) and has been recorded in the years 2014-2017. In the first step, features the best explaining load power changes from the set of weather data, market data (price for load, losses or congestion) and calendar data (day type, day of week, season) are defined. Due to strong seasonality in the data, also a filtering step is included. Finally, the forecast itself is executed with 24 discrete state Hidden Markov Models with a high number of states. Besides the direct comparison with the forecast results obtained by NYISO, the approach is evaluated using an additional benchmark method.
Keywords
Introduction
Due to the increasing implementation of renewable generation in the transmission and distribution networks, the task of power system operation, planning and maintenance is getting more and more challenging. One of the ways to support the system operator is to provide load forecasts which may be used to conduct power flow analysis for future time points to predict problems in the network or to assure the acquisition of sufficient generated power in balance to the expected load. Short term load forecasting provides a prediction of load demand typically reaching from few hours until one week ahead. Depending on the length of the prediction horizon, one distinguishes between online and offline short term load forecasts [1]. Online load forecasts are made typically on the fly to predict only few hours. This kind of load forecasting is required by the online grid operation and used for example for economic load dispatch. Offline load forecasts are applied to prediction horizons up to few days and are used for example for unit commitment or energy trading [2].
Related work
In the area of short-term load forecasting, two main method groups, i.e., Statistical Techniques and Intelligent approaches, are distinguished [3]. A classical statistical approach is multiple regression [4, 5]. The method is capable of considering trend, seasonality, weather and calendar data and additionally the auto-correlation of load data. To prevent over-fitting, a sufficient amount of training data (usually the year before the start of the forecast period) is required. Other very popular forecast methods from the category of Statistical Techniques are General Exponential Smoothing [6] and stochastic time series like for example autoregressive integrated moving average (ARIMA) [7]. Both methods can consider seasonality and trend, they come along with short training data history, are simple and therefore very well suited for online short-term load forecasting. However, they are not capable of including external factors such as weather into their models. Another method from the category of Statistical Techniques is Kalman Filter [8, 9]. It can consider exogenous variables and adapt to data with evolving time. It can be easily implemented due to its recursive structure and it can cope with small training data sets.
Statistical Techniques require an exact mathematical specification of the relationship between the load as a dependent and the influencing factors as an independent variable. Intelligent techniques like Artificial Neural Networks (ANNs) [10–12] or Fuzzy Logic approaches [13–15] get along with a coarse model structure. ANNs are capable of learning the underlying relationship from the data. Fuzzy approaches require a coarse description of the relationship in the fuzzy domain in the form of human-readable "IF"-"THEN" statements. Both methods are very well suited for problems where a mathematical formulation is too complex or impossible. However, they require on average a higher amount of data samples and therefore longer training times [3]. As one of the fewer techniques, ANNs provide the possibility of a multi-time instant forecasting in one step using a multi-output model [11]. Using such models may reduce the propagation of the forecast error from earlier to later time instances. However, multi-output models may get very complex and tend toover-fitting [16].
The approach presented in this paper makes use of the Hidden Markov Model, a Markov model with unobservable (hidden) states emitting unhidden observations. Because the HMM structure describes only generally the dependency between load and exogenous variables, the HMMs can be classified as an AI approach, even if the underlying process is a statistical one. Until now HMMs have been applied amongst others with great success to speech recognition problems outperforming even the ANNs [17]. However, in the field of short-term load forecasting, this technique is not as well established as other already mentioned methods. Two lately published conference papers [18, 19] use HMMs to execute day ahead forecasts for three Island groups of the Philippines and data center load and achieve MAPE values of 5% -7% and 2.9% -3.5% respectively. Both approaches work with a small number of discrete states (3-4) which represent the hidden variables of the Hidden Markov Model. The actual load data is modeled as observations. The model parameters are trained on historical data using the Baum-Welch algorithm. The most probable state sequence is obtained before the forecasting step using the Viterbi algorithm on the base of the recent demand data. The subsequent load demand prediction utilizes the last observations, the state sequence and the model parameters for a day ahead forecast. Such modeling is the classical one. However, the introduction of artificial hidden states leads to some drawbacks like the separation of the calculation of the optimal state sequence and the actual forecasting task. Furthermore, the optimal state sequence has to be calculated for recent demand data which may be strongly different from the data to predict. Already [20] remarks that a moving window approach may be insufficient for historical data not matching the day type (workday or holiday) or temperature profile of the data to be forecasted. [18, 19] also do not clearify what amount of data is sufficient for the Baum-Welch training which has to be executed repeatedly.
Because the load input is not observable only in the forecasting horizon, the here proposed method models the load data as a hidden variable and weather information as observations. Therefore, no Baum-Welch training is required because the states are directly available in the historical data. However, depending on the numerical range of the load data and the applied bin size, the number of states may exceed the value of 1000. This problem is solved using a reduced state Viterbi algorithm. Different than [18] and [19], no separate forecast step is needed because the optimal state sequence corresponds to the searched sequence of the forecasted load and is obtained as a result of the reduced state Viterbi algorithm. This makes the application of a multi-time instant forecast in one step possible and is the main advantage of the approach also in comparison to Statistical Techniques. At the same time, the applied model is much less complex than in [11]. The problem of model suitability and the seasonality in the load data regarding the season of the year is addressed by a simple filtering step. It collects only such data for the HMM creation which is expected to match the observations and therefore to capture the underlying load patterns.
To sum up, the main scientific contribution of the presented approach is a novel, multi-time instant load forecast in one step using 24 discrete state Hidden Markov Models with a large number of states. The approach combines proven techniques like Hidden Markov Model, filtering and probabilistic clustering to a method that provides a high prediction accuracy and flexibility regarding the inclusion of further predictors. Additionally, it extends the Viterbi algorithm so that it can be applied with sufficient performance to a large number of discrete states. Because of the application of the HMMs, only a coarse model structure specification is required. At the same time, the model parameters related to the Gaussian distributions can be directly obtained from the data, and the Baum-Welch training can be spared.
In the next section, the data set is presented and the choice of the features is explained. After this, in section IV the main definitions of Markov Chain and Hidden Markov Model are summarized. On the base of these definitions, the particular steps of the method are explained in section V. The simulation results are compared with the results achieved by NYISO and with an additional benchmark method. Both are discussed in section VI. The conclusions are made in the last section VII.
Data set and data analysis
The data set used to develop the approach is available online and belongs to the New York Independent System Operator (NYISO) [21]. It includes integrated hourly load forecasts and related measured load power in Mega WATS. Additionally, NYISO provides market data related to price for power delivery, losses and congestion in US Dollar and weather forecasts related to ambient and wet bulb temperature in Fahrenheit. The choice of the data set is motivated by the fact, that it comes from a real utility, which also publishes its forecast results, so that a direct comparison is possible. The data is complete starting from 2008. For this study, the data for all eleven zones has been chosen. The training data encloses the period 2013-2015. The data from year 2016 has been used as test data. The final evaluation has been executed for the year 2017. Tested features included all data provided by NYISO. The auto-correlation of load power, ambient temperature, wet bulb and calendar data like day type (working day or not), hour of the day and day length have proven as the most significantfeatures.
A feature with the strongest correlation is the load power from subsequent hours h - 1 and h. With an increasing number of subsequent hours, this dependency gets weaker. Fig. 1 shows the relation between the load from hour h - 1 and hour h based on the load data for the year 2015. Because of the strong auto-correlation, the usage of the Markov process to model the load behavior seemed to beappropriate. The next feature is related to the auto-correlation of load power. It is called in following hourly delta load power and it refers to the difference of load power at two subsequent hours h and h - 1 as specified in (1), with P
h
and Ph-1 referring to load power at hours h and h - 1.

High correlation of load power of two subsequent hours.
Related to the hour of the day, this feature shows a clear pattern as well. For example in winter months, for hours between 0-3 am and after 8 pm the delta load power is negative indicating hourly power decreases. For hours between 5-10 am the delta load is positive meaning hourly power increases. For remaining hours, there is a mix between increase and decrease with a clear tendency to powerincrease.
The temperature forecasts available from NYISO include daily minimum and maximum temperature and wet bulb for the forecast day and for the next day. Prior to the processing in the presented algorithm, hourly temperature values are interpolated using the R package chillR [22] which uses a mathematical model provided by [23]. This model is based on a sine curve representation for daytime temperatures and a logarithmic decay function for night temperatures. Additionally, the package chillR considers differences in day length between locations as well.
The correlation between the load power and the temperature differs depending on the season. It is negative in the months from January until April. It switches into positive values for the months Mai until October and gets again negative in November and December. The reason for this changing correlation is the difference in load patterns as a result of seasonal weather changes. For example in winter months, the decreasing temperature leads to increased power demand due to the usage of heating appliances. In summer months, falling temperatures have reduced air conditioning demand as a consequence. Fig. 2 shows the relation between temperature and load for the New York City zone. The correlation for temperatures below 60 F is negative. As already noticed in [24], the change in load for a specified temperature delta Δ temp is much smaller than for temperatures >60 F. That means, temperatures below 60 F have lower impact on load.

Correlation of load power and temperature for New York City data in 2015.
From the varying correlation between load power and temperature the necessity of a filtering step prior to the forecast execution has been concluded in the manner of the similar day approach [20, 25]. The goal is to filter time series with similar temperature and load patterns to the time series under prediction. The details of the filtering step are discussed in subsection 5.1.
To capture the seasonal cycles, an attribute related to the day length rather than the usually taken month of the year [5] is preferred. Despite the fact that seasonality is strongly connected with the length of the day, it can be easier dealt with because it is numerical and not categorical data.
Before starting the training phase, the data has been checked for completeness and correctness. Incomplete time series or time series containing outliers have been removed from the training data set.
1st order Markov chain
A Markov chain is an extension of a weighted finite automaton with a set of states and transitions between the states [26]. As such Markov chain can be applied to sequential data. The states of a Markov chain represent random variables and the weights express transition probabilities between the states. Fig. 3. shows a simple Markov chain.

Structure of a first order Markov Model.
Therefore a Markov chain can be described by a set of states S (2), the initial state ∏ (3), the set of transition probabilities A (4) and the rule that the sum of all transition probabilities from a defined state i to all other successor states must sum up to 1 (5). Since elements of the set A are probabilities, they must be in the range of [0, 1].
A 1st order Markov property states that the transition to the next state is determined only by a previous state in the chain and not by any other previously passed state in the chain (6).
A Hidden Markov Model [26, 27], extends the Markov Models by a set of observations O (7) and the set of emission probabilities B (8) as shown schematically in Fig. 4. Each element in B describes the probability b of producing an observation o while being in state s (9).

Schema of a Hidden Markov Model.
A Hidden Markov Model is defined by the model parameter set H consisting of sets of state transition and emission probabilities A and B (10).

Trellis diagram for calculation of the Viterbi algorithm.
The Viterbi algorithm gets executed in three steps:
Initialization: The state values for time step 0 are set to the probabilities of the initial state (11).
Induction: Each state value j at time step t is obtained as a maximum value of the products for each state value i at the previous time step, the transition probability between states i and j and the emission probability for state j and observationt (12).
Termination: First, the most probable state value at time step T is found (13). From this state value, the path is expanded backward through the maximal product of transition probability and state value (14).
Also for unsupervised learning of the model parameters A and B, HMMs offer very efficient algorithms, for example, the Baum-Welch algorithm. However, since all available training examples contain also the hidden states, the usage of unsupervised learning algorithms was not required. For supervised learning of the model parameters, the MLE (Maximum Likelihood Estimation) is usually used which is in discrete case equivalent to counting the ratio of state transitions from state S
i
to state S
j
and number of occurrences of S
j
as described by equation (15).
Filtering of training data for HMM creation
As mentioned in section 3, prior to the creation of the Hidden Markov Model, the training data is filtered based on the equation (16). In this way, historical time series with features having a strong similarity to the features of the time series under prediction can be used for the HMM creation. The similarity of time series ts
i
and ts
j
is measured by means of the weighted Euclidean distance d (ts
i
, ts
j
). The smaller the distance the bigger the similarity.
daily minimum, maximum and the next day minimum temperature daily minimum, maximum and next day minimum wet bulb day type (holiday or workday) day length
The weights for coordinates related to temperature and wet bulb are obtained from a linear mapping of the daily average temperature of the time series under prediction ts
i
to a target value range [w1 = 0, w2 = 20]. The limits of daily average temperature are set to [t1 = 0, t2 = 100] (17)-(18).
The weights for the day type and day length are set to a hight value of 100. For temperatures above 60 F however, the attribute day length is irrelevant due to high correlation between temperature and load.
For working days with daily average temperatures below 60 F only the last year data is used during the filtering step. For remaining days, the data from last three years was required. The filtering starts with the initially low distance value to capture the most similar historical time series. If less than numTs = 20 time series are found, the distance is iteratively relaxed and thus time series with less similarity to the time series under prediction are considered.
The choice of an initially small distance leads on average to smaller forecast errors. Larger distance values result in increased forecast error and in a linear only increase of computation time for the HMM creation.
Using the filtered time series, 24 separate Hidden Markov Models are created as presented in Fig. 6. The indices mark the according hour. The symbols S, T and Δ are related to the load power, ambient temperature, and delta load power at the according hour. It is assumed that the observable feature temperature and the conducted feature delta load are conditionally independent.

Structure of the applied Hidden Markov Model for load forecast.
Each single HMM stores the information about the state transitions and the emissions related to the hour h. According to the model structure applied required are three joint distribution: load power of hours h - 1 and h
load power and temperature at hour h
load power and delta load at hour h
The calculation and the usage of joint distributions is described in detail in following sections.
A discrete HMM consists of conditional probability tables that store the conditional probabilities calculated by (15). However, the load active power and the temperature are continuous variables. The discretization into sufficiently large bins leads to information losses and higher prediction errors. The usage of too small bins leads either to over-fitting or makes the prediction impossible, since many value combinations from forecast data did not occur in the training data. Therefore, the here applied HMM is based on discrete states, but it uses the continuous observation densities [27]. However, not as proposed in [27] the densities are modeled as mixtures of Gaussian joint density functions p (θ
h
) (19)-(22) for each prediction hour h and not for each state separately. The mixture components are assumed to follow the normal bivariate Gaussian distribution. The mixture component φh,i, joint mean vector
An example of three joint Gaussians for temperature and load on a winter day at 0 am is shown in Fig. 7. The mean and covariance matrix are calculated by the Expectation Maximization algorithm [28] for three clusters.

Three mixtures of joint normal distribution of temperature and load.
The mean and co-variance matrix for the calculation of the conditional is obtained from (23)-(25) [28].
With:
The Viterbi algorithm has a quadratic complexity O (N2k) with N denoting the number of states and k referring to the length of the predicted time series. At hour h there are up to N states (“from”-states) which have to be considered for each one of the N states (“to”-states) at hour h + 1. So, at each transition from hour h to hour h + 1, N * N states have to be taken into account. The hidden states may take on a wide range of values resulting in a huge number of states. To speed up the calculation, the number of evaluated states has to be limited. Three ways to do it are here applied.
Firstly, the choice of the state binning size. Coarse bins may lead to a decrease in computation time and unfortunately in prediction quality. To find a sufficient compromise between the prediction error and computation time, the bin size is dynamically calculated from the minimum and maximum load data values stored in the zone training data. The load range is divided by a sampling constant ω = 1500 as specified in (26). In this way, zones with a smaller spread in load power will have smaller bin sizes than zones with a high difference between the minimal and maximal load power.
Secondly, the number of states which must be evaluated while the transition step from hour h to hour h + 1 is narrowed down. To do so, the implementation of the forward algorithm is changed as shown in Fig. 8. For each state i at hour h all reachable states at hour h + 1 are targeted in opposition to the procedure shown initially in Fig. 5. In this way, the already available conditional density distribution function for load power at subsequent hours (23) conditioned on the state i at hour h: (Si,h) can be applied for state reduction. According to (27)-(29) only states at hour h + 1 which are located between the limits and have therefore sufficient high probability to be reached from Si,h shall be evaluated by Viterbi. The reduction step is clearly based on the strong correlation of load power of subsequent hours.

Modified forward step for state reduction in Viterbi algorithm.
Thirdly, according to the Viterbi algorithm after the calculation of all probabilities at hour h, the states with non-zero probabilities become the “from”-states at the next hour h + 1. The set of the “from-states” is again reduced to such ones with the lowest subpath costs. The obtained costs at the subpath at hour h are clustered using the k means clustering method with a number of clusters proportional to the number of paths. The states with paths belonging to the cluster with the smallest sub costs are considered as the “from”-states at the next hour.
Fig. 9 shows an overview of algorithm applied to conduct the short term load forecast. As input data, the algorithm expects the information about the time series to predict:
daily minimum, maximum and the next day minimum temperature daily minimum, maximum and next day minimum wet bulb hourly temperature day type (holiday or workday) day length
If the iteration counter Counter does not exceed MaxIterCnt then time series are filtered from the database with attributes matching the time series under prediction (16). From the filtered time series, the parameters of mixture joint densities are obtained using the Expectation-Maximization algorithm for each prediction hour separately. In the next step, the reduced states Viterbi algorithm is calculated. If a Viterbi path with a probability >0 exists, the algorithm terminates. Otherwise, the distance value is increased and thus the constraints for the selection of the time series for the creation of next HMMs are relaxed. If the number of maximal allowed iterations is exceeded, the algorithm terminates as well.

Overview of the applied algorithm.
The final evaluation has been done for the whole year 2017 and all eleven zones of NYISO. Table 1 contains an overview of the zone names and abbreviations used in the diagrams below. The zones differ strongly regarding the size of the population, the power consumption, and the geographic area spread. New York City is, for example, one of the smallest zones with the largest population and thus the highest consumed load power. Mohawk Valley has a large area but a relatively small population and load consumption [21].
NYISO load zones
NYISO load zones
The results obtained from the HMM approach are compared with the results of two other approaches. On the one hand, the forecasts delivered by NYISO [21] are considered. The method used by NYISO incorporates an Artificial Neuronal Network (ANN) [29] and provides six days ahead forecasts. Here, always only the first 24 hours are evaluated. The second benchmark method taken into account is based on General Linear Model based Load Forecaster - Benchmark [5] and is called Tao’s Vanilla Benchmark. It has been used as a benchmark during a forecast competition GEFCom2012 and it belonged to the 25% of the best results [30]. The Vanilla Benchmark is a multiple linear regression approach as defined by (30) [5, 30] and therefore very clear in formulation and easy in implementation.
Trend in (30) is a natural number assigned to each hour of the data ascending with the timestamp. M, W, H and T in (30) are related to the day of the week (1-7), the hour of the day (1-24) and temperature respectively. L represents the load power value from the preceding hour. If the forecast sequence is longer than 1 hour, the latest forecasted value is taken instead [5, 30]. For the comparison with the Hidden Markov Model approach, TVB has been implemented in R using the function lm [31].
As a measure of the prediction quality, the Mean Average Percentage Error (MAPE) is calculated according to (31), with A as actual value and F as forecasted value.
Fig. 10 shows the results obtained for the year 2017 and all load zones together with the two benchmarks. The HMM approach outperforms both methods and delivers very stable results across all load zones. Only for two zones, LONGIL and MHK VL the error is higher than 3%. The best result is achieved in New York City, the worst in the load zone MHK VL. The forecast error of NYISO’s approach is in all zones higher than HMM’s MAPE. For CENTRL, HUD VL the difference between HMM and NYISO MAPE is higher than 2%. In MHK VL it exceeds even around 10%. TVB returns similar to HMM very stable results for all zones with however higher MAPE values.

Day ahead forecast of all NYISO zones in 2017.
Fig. 11 shows the daily MAPE for the New York City load zone returned by HMM and NYISO. The HMM error is generally lower for all days of the year than the MAPE produced by the NYISO approach. The highest HMM error is around 12%, the highest NYISO error is around 17%. The errors above 10% occur sporadically and are mainly concentrated between the months of May and September, i.e. between 120th and 270th day of the year. In this period, the daily temperature and load variations are high which seems to be problematic for the HMM approach.

Comparison of a daily MAPE of day ahead forecast for NYC in 2017.
Fig. 12 compares the daily MAPE results for the load zone MHK VL in 2017. Considered are HMM and TVB results. The highest error above 10% in HMM approach occurs in June, i.e. around the 158th day of the year. For all other months, there are no such error peaks. It is assumed, that the difference in prediction quality between MHK VL and New York City is caused by differing load consumption and area sizes. MHK VL has a relatively small population and thus small consumption. On the other hand, it is spread about 15,230 square kilometers and temperature forecasts based on the data from several weather stations are required. Weather conditions at different locations in the MHK VL and consequently the load may strongly vary which must lead to forecast inaccuracies on the aggregation level.

Comparison of daily MAPE of a day ahead forecast for MHK VL in 2017.
The HMM method has been tested on a PC with Intel Core i5-6300U CPU@2.40GHz with 32 GB RAM. The implementation is done with C++ language. Table 2 summarizes the average execution time in seconds for an hourly day ahead forecast for each zone separately. Additionally, the load power bin size used during the calculation is presented. The load zones with the highest spread in load power, i.e.: NYC, LONGIL and CENTRL get the largest bin sizes. The computation time for all zones differs only slightly because the binning approach produces a similar number of states. The average computation time accross all zones is about 1.82 secons for one time series of 24 time points.
Execution time and bin size for NYISO load zones
The presented approach uses 24 discrete state Hidden Markov Models with a large number of states to conduct a 24 hours ahead load forecast in a single step. The feature set used for the prediction includes the ambient temperature, wet bulb, the day type and the hour of the day. Additionally, a calculated variable, i.e., the hourly delta load is considered as an observation input. Before the forecast execution, the historical data is filtered for inputs similar to the features on the forecast day. The resulting time series are used to create the Hidden Markov Models. The actual forecasting is executed employing a reduced state Viterbi algorithm. The reduction step utilizes the strong correlation of the load at subsequenthours.
Due to achieved simulation results, the method combines a very good prediction quality with simplicity given by the usage of 24 Hidden Markov Models. The algorithm achieves stable and very good results for all eleven NYISO zones compared with two other approaches. The applied reduced state Viterbi algorithm significantly decreases the execution time of the prediction algorithm.
One of the limitations of the algorithm is the assumption of a strong auto-correlation in the load data. For data sets with no or weak auto-correlation of the data related to the states, the approach will not work. The filtering step is based on the observation that the load is more sensitive to temperatures above 60 F. For data with a different relationship between temperature and load, this part of the algorithm has to be adjusted. Also, since the state transitions and emissions are assumed to be Gaussians, this method will be not directly applicable to data with a different underlying distribution type. Finally, the algorithm requires at least 20 historical time series for the creation of HMMs. However, nowadays utilities do collect and store such amounts of data.
In future work, the focus will be put on Gaussian Processes and ANNs. Also, forecast points at lower voltage levels will be addressed.
