Interval forecast for model averaging methods

Abstract

This paper analyzes and verifies the formula for the number of degrees of freedom of a combined model obtained by averaging across a set of regression models. The formula was proposed by Moiseev (2017) but has not been thoroughly analyzed. Its key feature is that it is applicable to any averaging method, which considerably widens its scope of application. We note that the exact number of degrees of freedom of the combined model cannot be computed because the variance-covariance matrix of the submodels’ errors is unknown. However, a simulation study shows that using an unbiased estimator of this matrix still yields reliable confidence intervals. The formula is therefore important for computing interval forecasts with model averaging methods.

Keywords

Interval forecast, confidence intervals, degrees of freedom, t-distribution, forecast combination, model averaging, mean-squared forecast error

JEL codes: C52, C53

1. Introduction

When analyzing and modeling economic data, researchers very often construct regression models, which nearly always requires model specification and model selection. Model selection can be performed using different model efficiency criteria, e.g. mean-squared forecast error, F-statistics, mean-squared bootstrapped error and various information criteria. One of them is the Bayesian information criterion (BIC), introduced by Schwarz (1978). Since then many papers have been written on its application in econometrics, including Raftery et al. (1997), Hoeting et al. (1999), Brock and Durlauf (2001), Brock et al. (2003), Fernandez and Steel (2001), Garratt et al. (2003) and Sala-i-Martin et al. (2004). Another selection method is the Mallows criterion, which was introduced by Mallows (1973) and resembles the Akaike (1973) and Shibata (1980) information criteria, whose asymptotic optimality was studied by Shibata (1980, 1981, 1983), Lee and Karagrigoriou (2001), Ing (2003, 2004, 2007) and Ing and Wei (2003, 2005). The focused information criterion (FIC) was first proposed by Claeskens and Hjort (2003) and is also used for choosing among a set of models.

However, recent literature focuses more on averaging across a set of models than on picking just one of them. Combination of forecasts was pioneered by Bates and Granger (1969) and further developed in Granger and Ramanathan (1984). Since then its effect of decreasing the forecast error has been confirmed by a number of econometricians, and its efficiency is not in doubt; see, for example, Granger (1989), Clemen (1989), Hendry and Clements (2002), Timmermann (2006) and Stock and Watson (2006). However, there is still no consensus on how to select forecast weights. The most recent papers focus on five major methods of weight selection: simple averaging, Bayesian and Akaike model averaging (BMA), Mallows model averaging (MMA), focused model averaging (FMA) and mean-squared forecast error (MSFE) model averaging.

Simple averaging works quite well when the submodels are properly specified. However, if a poor model is included in the set of models for averaging, simple averaging pays the penalty. In this respect, simple averaging should only incorporate submodels that are very close in their characteristics, which imposes quite tough restrictions on submodel selection. The Bayesian information criterion was proposed for model averaging by Min and Zellner (1993), and its efficiency has since been demonstrated by Stock and Watson (1999, 2004, 2005) and Wright (2003a, b). The use of the exponential Akaike information criterion (AIC) for computing model weights was introduced by Akaike (1979) and further developed in Burnham and Anderson (2002). Mallows model averaging and closely related averaging techniques were studied in depth by Hansen (2007, 2008, 2014), Hansen and Racine (2012) and Cheng and Hansen (2015), who provided strong empirical evidence of their efficiency compared to the most common averaging methods. Focused model averaging is based on the focused information criterion (FIC); it was elaborated by Hjort and Claeskens (2006) and further developed in Claeskens and Hjort (2008). MSFE model averaging was proposed in Zubakin et al. (2015) and further elaborated by Moiseev (2016, 2017); it minimizes an unbiased estimator of the mean-squared forecast error of the combined model.

As numerous econometricians have noted, no existing method outperforms all the others in all types of initial settings. Averaging methods can therefore coexist, each yielding satisfactory results when applied in an appropriate situation. However, when forecasting economic processes it is equally (if not more) important to obtain a reliable confidence interval for the constructed forecast, since it is crucial for risk management, investment, strategic planning, etc. Unfortunately, most papers devoted to model averaging techniques omit this issue. The exceptions are Zubakin et al. (2015) and Moiseev (2016, 2017), where the authors show that an interval forecast can be obtained by applying t-distribution quantiles with the number of degrees of freedom computed by a formula also provided in their research. In this paper we analyze the properties of this degrees of freedom formula and use simulation testing to assess its applicability to forecasting economic processes by combination of models. We also show that obtaining reliable confidence intervals does not depend on the chosen model averaging method, which significantly widens the field of application of the formula.

The paper is structured as follows. Section 2 reviews the principles of model averaging, in particular MSFE model averaging, and discusses the analytical expression for the number of degrees of freedom of the combined model. Section 3 analyzes the properties of the number of degrees of freedom and presents the results of out-of-sample simulation testing. Section 4 sums up the key points of the paper and emphasizes the main characteristics of the proposed method.

2. Review of MSFE model averaging principles

Let $\{y_{t},X_{t}:t=1,\ldots,n\}$ be a real-valued sample, where $y_{t}$ is the target variable and $X_{t}=(1,x_{1t},x_{2t},\ldots)$ is a countable set of possible explanatory variables. Suppose we can compute $M$ regression models, where the $i$-th model incorporates $k_{i}$ parameters and has the following form:

$\displaystyle y_{t}=X_{it}B_{i}+e_{it}$ (1)

or

$\displaystyle\hat{y}_{it}=X_{it}B_{i},$ (2)

where $X_{it}$ denotes the explanatory variables selected for the $i$-th model, $e_{it}$ is the model’s observed residual at time $t$, and $B_{i}$ is the column-vector of parameters, computed in the usual way:

$B_{i}=\left({X_{i}^{T}X_{i}}\right)^{-1}X_{i}^{T}Y,$ (3)

where

$X_{i}=\begin{pmatrix}{X_{in_{i}}}\\ {X_{i\left({n_{i}-1}\right)}}\\ \vdots\\ {X_{i1}}\\ \end{pmatrix},Y=\begin{pmatrix}{y_{n_{i}}}\\ {y_{n_{i}-1}}\\ \vdots\\ {y_{1}}\\ \end{pmatrix}.$

Note that $X_{i}$ may have a different number of observations than $X_{j}$. Our goal is to construct a combined model

$\displaystyle\bar{\hat{y}}_{t}=\sum\limits_{i=1}^{M}\hat{y}_{it}w_{i}$ (4)

or

$\displaystyle y_{t}=\bar{\hat{y}}_{t}+\bar{e}_{t},$ (5)

where

$E(\bar{\varepsilon}_{t}|X_{1t},X_{2t},\ldots,X_{Mt},w_{1},w_{2},\ldots,w_{M})=0,\quad E(\bar{\varepsilon}_{t}^{2}|X_{1t},X_{2t},\ldots,X_{Mt})=\bar{\sigma}^{2},$

and $w_{i}$ is an element of the weight vector, whose elements are non-negative and sum to one:

$\begin{cases}\sum\limits_{i=1}^{M}w_{i}=1,\\ 0\leqslant w_{i}\leqslant 1.\end{cases}$ (6)

We also suppose that each of $M$ considered submodels has an error term $\varepsilon_{it}$ and satisfies the OLS assumptions:

Assumption 1: Strict exogeneity, i.e. $E(\varepsilon_{it}|X_{i})=0$ ;

Assumption 2: Homoscedasticity, i.e. $E(\varepsilon_{it}^{2}|X_{i})=\sigma_{i}^{2}$ ;

Assumption 3: Normality, i.e. $\varepsilon_{it}\sim N(0,\sigma_{i}^{2})$ ;

Assumption 4: No perfect multicollinearity, i.e. $X_{i}^{T}X_{i}$ is a positive-definite matrix;

Assumption 5: No autocorrelation, i.e. $\mathrm{cov}(\varepsilon_{iq};\varepsilon_{iu})=0,\forall q\neq u$ .

In addition to the traditional OLS assumptions, we assume the following:

Assumption 6: No autocorrelation of error terms among submodels, i.e. $\mathrm{cov}(\varepsilon_{iq};\varepsilon_{ju})=0,\forall q\neq u$ .

It is worth noting that Assumption 6 holds almost automatically if Assumption 5 holds.

For MSFE model averaging the goal is to choose the weights so that they yield the lowest expected mean-squared forecast error while satisfying the constraints in Eq. (6). It is well known that the MSFE of a linear model consists of the error term variance plus the variance of the regression line; see, for example, Mood et al. (1974). Thus, this model averaging method aims to minimize the expression below:

$\displaystyle\textit{MSFE}_{n+1}=\textit{Var}\left(\bar{\hat{y}}_{n+1}-y_{n+1}\right)=\textit{Var}\left(\bar{\varepsilon}\right)+\textit{Var}\left\{\bar{\hat{y}}_{n+1}-E\left(\bar{\hat{y}}_{n+1}\right)\right\}=\textit{Var}\left(\sum\limits_{i=1}^{M}\varepsilon_{i}w_{i}\right)+\textit{Var}\left[\sum\limits_{i=1}^{M}\left\{\hat{y}_{i(n+1)}-E\left(\hat{y}_{i(n+1)}\right)\right\}w_{i}\right].$ (7)

As shown in Moiseev (2016), the explicit formula for the MSFE is as follows:

$\textit{MSFE}_{n+1}=\sum\limits_{i=1}^{M}\sum\limits_{j=1}^{M}\frac{\breve{e}_{i}^{T}\breve{e}_{j}}{tr\left(A_{i}A_{j}\right)}w_{i}w_{j}\left\{1+X_{i(n+1)}\left(X_{i}^{T}X_{i}\right)^{-1}\breve{X}_{i}^{T}\breve{X}_{j}\left(X_{j}^{T}X_{j}\right)^{-1}X_{j(n+1)}^{T}\right\},$ (8)

where $\breve{X}_{i}=\begin{pmatrix}0\\ X_{i}\end{pmatrix}$ and $0$ denotes a matrix of zeroes that augments $X_{i}$ so that its number of rows equals that of the longest data frame under consideration (if $X_{i}$ is itself the longest data frame, no zeroes are added). Similarly, $\breve{e}_{i}=\begin{pmatrix}0\\ e_{i}\end{pmatrix}$, where $0$ denotes a column-vector of zeroes that augments $e_{i}$ to the number of rows of the longest data frame under consideration. The matrix $A_{i}$ is defined as

$A_{i}=\breve{{I}}_{ni}-\breve{{X}}_{i}(X_{i}^{T}X_{i})^{-1}\breve{{X}}_{i}^{T},$ (9)

where $\breve{I}_{ni}=\begin{pmatrix}0&0\\ 0&I_{ni}\end{pmatrix}$ is the identity matrix of the size of the $i$-th data frame, augmented with zeroes so that it has the size of the longest data frame under consideration.
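As a minimal sketch of Eq. (9) and of the ratio $\breve{e}_{i}^{T}\breve{e}_{j}/tr(A_{i}A_{j})$ that enters Eq. (8), the NumPy code below pads two data frames of different lengths with leading zero rows and builds the corresponding matrices $A_{i}$. The helper names (`pad_top`, `residual_maker`), the random data and the sample sizes are illustrative assumptions. We also assume that rows are in chronological order and that the shorter data frame covers the most recent observations.

```python
import numpy as np

def pad_top(X, n_max):
    """Prepend zero rows so that X has n_max rows (the 'breve' operation)."""
    return np.vstack([np.zeros((n_max - X.shape[0], X.shape[1])), X])

def residual_maker(X, n_max):
    """A_i = I_breve - X_breve (X^T X)^{-1} X_breve^T, as in Eq. (9)."""
    n_i = X.shape[0]
    Xb = pad_top(X, n_max)
    I_b = np.zeros((n_max, n_max))
    I_b[n_max - n_i:, n_max - n_i:] = np.eye(n_i)   # zero-augmented identity
    return I_b - Xb @ np.linalg.solve(X.T @ X, Xb.T)

rng = np.random.default_rng(0)
n1, n2 = 20, 7                                       # illustrative lengths
y = rng.standard_normal(n1)
X1 = np.column_stack([np.ones(n1), y + rng.standard_normal(n1)])
X2 = np.column_stack([np.ones(n2), y[-n2:] + rng.standard_normal(n2)])

A1 = residual_maker(X1, n1)
A2 = residual_maker(X2, n1)
e1 = A1 @ y          # residuals of submodel 1
e2 = A2 @ y          # zero-padded residuals of submodel 2

ucov_11 = (e1 @ e1) / np.trace(A1)        # error variance estimate, submodel 1
ucov_12 = (e1 @ e2) / np.trace(A1 @ A2)   # error covariance estimate
```

Because $A_{i}$ is symmetric and idempotent, $tr(A_{i}A_{i})=tr(A_{i})=n_{i}-k_{i}$, so the diagonal ratio reduces to the usual unbiased variance estimate of a single regression.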

Weight selection can be implemented by solving the following optimization problem:

$\textit{MSFE}_{n+1}\to\min\quad\text{subject to}\quad\begin{cases}\sum\limits_{i=1}^{M}w_{i}=1,\\ 0\leqslant w_{i}\leqslant 1.\end{cases}$ (10)

Because of the imposed constraints, an analytical solution of Eq. (10) is not available for $M>3$. It must therefore be found numerically using quadratic programming. With modern computers a numerical solution can be found relatively quickly with any standard algorithm, even for quite large $M$.
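Since Eq. (8) is a quadratic form in the weights, Eq. (10) can be written as minimizing $w^{T}Qw$, where $Q_{ij}$ is the coefficient of $w_{i}w_{j}$ in Eq. (8). The sketch below solves this problem with SciPy's general-purpose SLSQP solver rather than a dedicated quadratic programming routine. The function name `msfe_weights` and the numbers in `Q` are hypothetical and serve only as an illustration.

```python
import numpy as np
from scipy.optimize import minimize

def msfe_weights(Q):
    """Minimise w^T Q w subject to sum(w) = 1 and 0 <= w_i <= 1 (Eq. (10))."""
    Q = np.asarray(Q, dtype=float)
    M = Q.shape[0]
    w0 = np.full(M, 1.0 / M)                  # start from simple averaging
    res = minimize(
        lambda w: w @ Q @ w,                  # objective: estimated MSFE
        w0,
        jac=lambda w: (Q + Q.T) @ w,          # gradient of the quadratic form
        method="SLSQP",
        bounds=[(0.0, 1.0)] * M,
        constraints=[{"type": "eq", "fun": lambda w: np.sum(w) - 1.0}],
    )
    return res.x

# Hypothetical coefficient matrix of w_i * w_j in Eq. (8); numbers are made up.
Q = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.5, 0.7],
              [0.5, 0.7, 2.0]])
w = msfe_weights(Q)
msfe_opt = w @ Q @ w
msfe_equal = np.full(3, 1 / 3) @ Q @ np.full(3, 1 / 3)
```

Starting from equal weights makes it easy to compare the optimized MSFE with that of simple averaging on the same matrix.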

Turning to the interval forecast, recall that the confidence interval for a linear regression is computed using the t-distribution with $n-k$ degrees of freedom, where $k$ is the number of model parameters:

$\hat{{y}}-T_{\alpha,n-k}\cdot s<y<\hat{{y}}+T_{\alpha,n-k}\cdot s,$

where $s$ is the square root of an unbiased estimator of the error term variance and $\alpha$ is the significance level.

For a combined model, it is reasonable to use the same formula to obtain the confidence interval for the point forecast at time period $n+1$:

$\bar{{\hat{{y}}}}-T_{\alpha,r}\cdot\sqrt{\textit{MSFE}_{n+1}}<y<\bar{{\hat{{y}% }}}+T_{\alpha,r}\cdot\sqrt{\textit{MSFE}_{n+1}},$ (11)

where $r$ is the number of degrees of freedom.
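The interval in Eq. (11) can be sketched as follows. We assume here that $T_{\alpha,r}$ denotes the two-sided quantile, i.e. the $1-\alpha/2$ quantile of the t-distribution with $r$ degrees of freedom; the function name and the input values are hypothetical.

```python
from math import sqrt
from scipy.stats import t

def interval_forecast(y_hat, msfe, r, alpha=0.05):
    """Two-sided (1 - alpha) interval of Eq. (11); r may be fractional."""
    q = t.ppf(1.0 - alpha / 2.0, df=r)   # assumed meaning of T_{alpha, r}
    half_width = q * sqrt(msfe)
    return y_hat - half_width, y_hat + half_width

# Illustrative inputs only: point forecast, estimated MSFE, degrees of freedom.
lower, upper = interval_forecast(y_hat=0.3, msfe=1.2, r=7.4)
```

Since the t-distribution is defined for non-integer degrees of freedom, a fractional $r$ can be passed directly.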

Hence the problem of constructing the confidence interval reduces to computing the number of degrees of freedom $r$. Since

$\frac{\frac{\hat{y}-y}{\sigma}}{\sqrt{\frac{(n-k)s^{2}}{(n-k)\sigma^{2}}}}\sim T_{n-k},$

we can infer that

$\frac{(n-k)s^{2}}{\sigma^{2}}\sim\chi_{n-k}^{2}\Rightarrow\frac{r\cdot\textit{MSFE}_{n+1}}{\bar{\sigma}^{2}}\sim\chi_{r}^{2}.$ (12)

The explicit formula for the number of degrees of freedom was derived in Moiseev (2017) and is as follows:

$r=\frac{2\left(\sum\limits_{i=1}^{M}\sum\limits_{j=1}^{M}\frac{\breve{e}_{i}^{T}\breve{e}_{j}}{tr(A_{i}A_{j})}w_{i}w_{j}\theta_{ij}\right)^{2}}{\sum\limits_{a=1}^{M}\sum\limits_{b=1}^{M}\sum\limits_{c=1}^{M}\sum\limits_{d=1}^{M}w_{a}w_{b}w_{c}w_{d}\theta_{ab}\theta_{cd}\frac{\Theta_{abcd}\,tr([A_{a}A_{b}]\circ[A_{c}A_{d}])+\Psi_{abcd}}{tr(A_{a}A_{b})\,tr(A_{c}A_{d})}},$ (13)

where $\circ$ denotes the Hadamard product,

$\displaystyle\theta_{ij}=1+X_{i(n+1)}(X_{i}^{T}X_{i})^{-1}\breve{X}_{i}^{T}\breve{X}_{j}(X_{j}^{T}X_{j})^{-1}X_{j(n+1)}^{T},$ (14)

$\displaystyle\Psi_{abcd}=\textit{UCOV}_{ac}\textit{UCOV}_{bd}\left[tr([A_{a}A_{b}]\times[A_{c}A_{d}]^{T})-tr([A_{a}A_{b}]\circ[A_{c}A_{d}])\right]+\textit{UCOV}_{ad}\textit{UCOV}_{bc}\left[tr([A_{a}A_{b}]\times[A_{c}A_{d}])-tr([A_{a}A_{b}]\circ[A_{c}A_{d}])\right],$ (15)

$\displaystyle\textit{UCOV}=\begin{pmatrix}\frac{\breve{e}_{1}^{T}\breve{e}_{1}}{tr(A_{1})}&\frac{\breve{e}_{1}^{T}\breve{e}_{2}}{tr(A_{1}A_{2})}&\ldots&\frac{\breve{e}_{1}^{T}\breve{e}_{M}}{tr(A_{1}A_{M})}\\ \frac{\breve{e}_{2}^{T}\breve{e}_{1}}{tr(A_{2}A_{1})}&\frac{\breve{e}_{2}^{T}\breve{e}_{2}}{tr(A_{2})}&\ldots&\frac{\breve{e}_{2}^{T}\breve{e}_{M}}{tr(A_{2}A_{M})}\\ \vdots&\vdots&\ddots&\vdots\\ \frac{\breve{e}_{M}^{T}\breve{e}_{1}}{tr(A_{M}A_{1})}&\frac{\breve{e}_{M}^{T}\breve{e}_{2}}{tr(A_{M}A_{2})}&\ldots&\frac{\breve{e}_{M}^{T}\breve{e}_{M}}{tr(A_{M})}\end{pmatrix},$ (16)

$\displaystyle\Theta_{abcd}=\{(I+P_{\textit{vec}(\textit{UCOV})})\textit{UCOV}\otimes\textit{UCOV}\}_{\langle abcd\rangle},$ (17)

where $\langle abcd\rangle$ refers to the matrix element located by the following procedure: we find the position of $\textit{UCOV}_{ab}\textit{UCOV}_{cd}$ in $\textit{vec}(\textit{UCOV})\,\textit{vec}(\textit{UCOV})^{T}$ and then take the element with the same position in $(I+P_{\textit{vec}(\textit{UCOV})})\textit{UCOV}\otimes\textit{UCOV}$. Here $\otimes$ denotes the Kronecker product and $P_{\textit{vec}(\textit{UCOV})}$ is the transposition-permutation matrix, which is defined by the following equality:

$\textit{vec}({X^{T}})=P_{\textit{vec}(X)}\textit{vec}(X),$ (18)

where $\textit{vec}(\cdot)$ denotes a vectorization of a matrix.
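To make Eqs. (17) and (18) concrete, the sketch below builds the transposition-permutation matrix and extracts one element $\Theta_{abcd}$. It assumes that $\textit{vec}(\cdot)$ stacks columns and uses zero-based indices; the function names and the example matrix are hypothetical.

```python
import numpy as np

def vec(X):
    """Column-stacking vectorisation (an assumed convention)."""
    return X.reshape(-1, order="F")

def transposition_permutation(m, n):
    """Return P (mn x mn) with vec(X.T) = P @ vec(X) for any m x n matrix X."""
    P = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            # X[i, j] sits at index j*m + i in vec(X) and i*n + j in vec(X.T)
            P[i * n + j, j * m + i] = 1.0
    return P

def theta_element(ucov, a, b, c, d):
    """Element <abcd> of (I + P)(UCOV kron UCOV), zero-based indices."""
    M = ucov.shape[0]
    P = transposition_permutation(M, M)
    big = (np.eye(M * M) + P) @ np.kron(ucov, ucov)
    # UCOV_ab * UCOV_cd is at row b*M + a, column d*M + c of vec(UCOV) vec(UCOV)^T
    return big[b * M + a, d * M + c]

X = np.arange(6.0).reshape(2, 3)
P = transposition_permutation(*X.shape)
ucov = np.array([[2.0, 0.5],
                 [0.5, 1.0]])          # made-up symmetric example
th = theta_element(ucov, 0, 1, 0, 1)
```

Under these conventions the extracted element equals $\textit{UCOV}_{ac}\textit{UCOV}_{bd}+\textit{UCOV}_{ad}\textit{UCOV}_{bc}$, which matches the pairs of covariances appearing in Eq. (15).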

Note that computing the exact number of degrees of freedom requires the variance-covariance matrix of the true errors of the considered submodels, COV, displayed below. Since this matrix is not available, we replace it with the matrix UCOV.

$\textit{COV}=\begin{pmatrix}\sigma_{1}^{2}&\textit{cov}(\varepsilon_{1t};\varepsilon_{2t})&\ldots&\textit{cov}(\varepsilon_{1t};\varepsilon_{Mt})\\ \textit{cov}(\varepsilon_{2t};\varepsilon_{1t})&\sigma_{2}^{2}&\ldots&\textit{cov}(\varepsilon_{2t};\varepsilon_{Mt})\\ \vdots&\vdots&\ddots&\vdots\\ \textit{cov}(\varepsilon_{Mt};\varepsilon_{1t})&\textit{cov}(\varepsilon_{Mt};\varepsilon_{2t})&\ldots&\sigma_{M}^{2}\end{pmatrix}.$ (19)
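The following is a sketch of Eq. (13) under two stated assumptions: the symbol $\times$ in Eq. (15) is read as the ordinary matrix product, and $\Theta_{abcd}$ is expanded elementwise as $\textit{UCOV}_{ac}\textit{UCOV}_{bd}+\textit{UCOV}_{ad}\textit{UCOV}_{bc}$, which is what Eq. (17) gives for a symmetric UCOV. The function names and all numerical inputs are hypothetical and only illustrate the mechanics.

```python
import numpy as np
from itertools import product

def degrees_of_freedom(A, ucov, theta, w):
    """Eq. (13): A is a list of M padded matrices A_i; ucov, theta are M x M."""
    M = len(A)
    AA = [[A[a] @ A[b] for b in range(M)] for a in range(M)]
    tr = np.array([[np.trace(AA[a][b]) for b in range(M)] for a in range(M)])
    num = 2.0 * np.sum(ucov * theta * np.outer(w, w)) ** 2
    den = 0.0
    for a, b, c, d in product(range(M), repeat=4):
        P, Q = AA[a][b], AA[c][d]
        had = np.sum(np.diag(P) * np.diag(Q))            # tr(P o Q)
        big_theta = ucov[a, c] * ucov[b, d] + ucov[a, d] * ucov[b, c]
        psi = (ucov[a, c] * ucov[b, d] * (np.trace(P @ Q.T) - had)
               + ucov[a, d] * ucov[b, c] * (np.trace(P @ Q) - had))
        den += (w[a] * w[b] * w[c] * w[d] * theta[a, b] * theta[c, d]
                * (big_theta * had + psi) / (tr[a, b] * tr[c, d]))
    return num / den

def padded_A(n_i, n_max, rng):
    """Zero-padded A_i of Eq. (9) for a random two-parameter design."""
    X = np.column_stack([np.ones(n_i), rng.standard_normal(n_i)])
    Xp = np.vstack([np.zeros((n_max - n_i, 2)), X])
    Ip = np.diag(np.r_[np.zeros(n_max - n_i), np.ones(n_i)])
    return Ip - Xp @ np.linalg.solve(X.T @ X, Xp.T)

rng = np.random.default_rng(0)
n1, n2 = 20, 7
A = [padded_A(n1, n1, rng), padded_A(n2, n1, rng)]
ucov = np.array([[0.5, 0.25], [0.25, 0.5]])      # illustrative values
theta = np.array([[1.10, 1.05], [1.05, 1.30]])   # illustrative values

r_first = degrees_of_freedom(A, ucov, theta, np.array([1.0, 0.0]))
r_second = degrees_of_freedom(A, ucov, theta, np.array([0.0, 1.0]))
r_mixed = degrees_of_freedom(A, ucov, theta, np.array([0.5, 0.5]))
```

A useful sanity check of this reading is that when all weight is placed on one submodel, the sum collapses to a single term and the function returns $n_{i}-k_{i}$ for that submodel, i.e. the degrees of freedom of an ordinary regression.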

To summarize the MSFE model averaging procedure, we provide a stepwise algorithm for combined model computation.

Step 1:

Specify a set of submodels satisfying Assumptions 1–6 computed either on one or on different data frames.

Step 2:

Compute a weight vector by solving the optimization problem Eq. (10).

Step 3:

Implement combination of submodels by Eq. (4) using the weight vector from Step 2.

Step 4:

Calculate a point forecast using the combined model from Step 3.

Step 5:

Compute an interval forecast by Eq. (11) with the number of degrees of freedom computed by Eq. (13).

3. Properties of the number of degrees of freedom formula

As previously noted, the number of degrees of freedom for the combined model calculated from Eq. (13) is not exact and depends to some extent on the unbiased estimates of the variances and covariances of the true errors of the submodels under consideration. To show this dependence graphically, we conduct a simulation experiment. Suppose we average across two regression models, which model $y_{t}$ as a function of the factors $x_{1t}$ and $x_{2t}$, respectively. The values of $y_{t}$ are generated from the normal distribution with zero mean and unit variance. The factors $x_{1t}$ and $x_{2t}$ are modelled as $x_{it}=y_{t}+N(0,1)$. Submodel one is constructed on a data frame of $n_{1}$ observations and submodel two on a data frame of $n_{2}$ observations. In this experiment we consider the empirical probability density of the number of degrees of freedom of the combined model for varying variances and covariances of the true errors of the submodels, holding the other parameters constant. To calculate each probability density graph, we use 30,000 random generations of unbiased estimates of the variance-covariance matrix of the true errors of the submodels. Figure 1a–d present the empirical probability densities of the number of degrees of freedom for the combined model given $n_{1}=20$, $n_{2}=7$ and different weights assigned to the considered submodels.

As can be seen from Fig. 1a–d, the distribution of the number of degrees of freedom for the combined model is positively skewed for $w_{1}=0.2$, $w_{1}=0.5$, and even for $w_{1}=0.7$. Only when the weight of the first model, which is calculated on the longer data frame, sufficiently exceeds the second weight does the distribution become negatively skewed. Also, when $w_{1}$ approaches one, the distribution has a much larger variance than for values of $w_{1}$ close to zero. In the latter case, the number of degrees of freedom can be calculated quite accurately even when the variances and covariances of the true errors of the submodels are unknown. However, for equal weights or when $w_{1}$ exceeds $w_{2}$, uncertainty about the true number of degrees of freedom increases.

Figure 1.

a. Probability density of the number of degrees of freedom given $n_{1}=20$ , $n_{2}=7$ , $w_{1}=0.2$ , $w_{2}=0.8$ , b. Probability density of the number of degrees of freedom given $n_{1}=20$ , $n_{2}=7$ , $w_{1}=0.5$ , $w_{2}=0.5$ , c. Probability density of the number of degrees of freedom given $n_{1}=20$ , $n_{2}=7$ , $w_{1}=0.7$ , $w_{2}=0.3$ , d. Probability density of the number of degrees of freedom given $n_{1}=20$ , $n_{2}=7$ , $w_{1}=0.9$ , $w_{2}=0.1$ .

Figure 2.

It should also be noted that even if the variance-covariance matrix of the true errors of the weighted submodels is known, the number of degrees of freedom for the combined model is still not a constant but depends on the elements $A_{i}$ and $\theta_{ij}$. To illustrate this dependence, we take some variance-covariance matrix of the model errors as the true one and randomly generate the data sets that determine $A_{i}$ and $\theta_{ij}$ using the scheme described above. As in the previous case, 30,000 random generations were used to calculate each probability density chart.

Figure 2a and b show the probability densities of the number of degrees of freedom for constant covariances and variances of errors and changing $A_{i}$ and $\theta_{ij}$ .

As can be seen from Fig. 2a and b, with a known variance-covariance matrix the number of degrees of freedom for the combined model is much less volatile than in the case of unknown variances and covariances of the true errors. The distribution density in this case is almost symmetrical, regardless of the values of the weight coefficients.

Thus, under uncertainty conditions concerning the variance-covariance matrix of true errors, an accurate interval forecast for the combined model can be calculated using Bayesian statistics. The idea is to obtain a marginal distribution of the probability density of the true errors for the combined model, taking into account the probability distributions of the variances and the covariances of the true errors of the submodels. This marginal distribution can be represented as shown below:

$\displaystyle f(y_{n+1}|r)=\int\ldots\int\limits_{T}f(y_{n+1}|\sigma_{1}^{2},\sigma_{2}^{2},\ldots,\sigma_{M}^{2},\textit{cov}_{12},\ldots,\textit{cov}_{(M-1)M})\cdot f(\textit{cov}_{12}|\sigma_{1}^{2},\sigma_{2}^{2})\ldots\cdot f(\textit{cov}_{(M-1)M}|\sigma_{1}^{2},\sigma_{2}^{2},\ldots,\sigma_{M}^{2},\textit{cov}_{12},\ldots,\textit{cov}_{(M-2)M})\cdot f(\sigma_{1}^{2})\ldots f(\sigma_{M}^{2})\,d\sigma_{1}^{2}\ldots d\sigma_{M}^{2}\,d\textit{cov}_{12}\ldots d\textit{cov}_{(M-1)M},$ (20)

where the domain of integration is $T:=\{(\sigma_{1}^{2},\sigma_{2}^{2},\ldots,\sigma_{M}^{2}):\sigma_{i}^{2}\geqslant 0,|\textit{cov}_{ij}|\leqslant\sigma_{i}\sigma_{j}\}$ .

The variance-covariance matrix of the true errors of the submodels can be modeled using the Wishart distribution. However, the problem here is that, when models are combined, each element of this matrix has its own number of degrees of freedom, which makes the known functional form of the distribution inapplicable. Technically it is possible to derive an analytic expression for the probability distribution of the variance-covariance matrix with a different number of degrees of freedom for each element, but this derivation is omitted in this paper because it does not contribute significantly to the accuracy or speed of the calculations. To calculate the confidence interval for the forecasts of the combined model, we recommend using a numerical method. Below we present a step-by-step algorithm for calculating the interval forecast using this method.

1. Perform the Cholesky decomposition of an unbiased estimate of the variance-covariance matrix of submodels’ errors $\textit{UCOV}=SS^{T}$ .

2. FOR $i=1$ TO $T$ DO, where parameter $T$ determines the accuracy of the calculated confidence interval.

3. Set $n=1$ .

4. WHILE $n\leqslant N$ DO, where $N$ is the largest number of degrees of freedom in the set of considered submodels.

a) Generate a realization of matrix COV with one degree of freedom according to the following formula:

$\textit{COV}=SZZ^{T}S^{T},$ (21)

where $Z\sim N(0,1)$ and $E(ZZ^{T})=I$ .

b) Repeat Step a) until $n<n_{i}$ , where $n_{i}$ is the number of degrees of freedom for the COV elements, arranged in ascending order.

c) In case $n_{i}$ is a fractional number, multiply the last realization of the $\textit{COV}_{kl}$ element corresponding to $n$ , obtained in a), by the fractional part of $n_{i}$ .

d) Go to the next $n_{i}$ , but do not accumulate elements of COV that have already been considered.

5. ENDDO.

6. Sum the obtained realizations of $\textit{COV}_{kl}$ and calculate the number of degrees of freedom $r_{i}$ with the obtained variance-covariance matrix of submodels’ errors.

7. ENDDO.
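The core of the algorithm above, Eq. (21), can be sketched in a deliberately simplified setting: every element of COV is assumed to share the same integer number of degrees of freedom `n_df`, so steps b) to d), which handle element-specific and fractional degrees of freedom, are not reproduced. Dividing the sum by `n_df` to keep the draw on the scale of UCOV is a choice made here, and the function name and input matrix are hypothetical.

```python
import numpy as np

def draw_cov(ucov, n_df, rng):
    """Average of n_df one-degree-of-freedom realisations S z z^T S^T (Eq. (21))."""
    S = np.linalg.cholesky(ucov)            # UCOV = S S^T
    M = ucov.shape[0]
    Z = rng.standard_normal((M, n_df))      # each column is an N(0, I) vector
    SZ = S @ Z
    return (SZ @ SZ.T) / n_df               # sum of the rank-one terms, rescaled

rng = np.random.default_rng(1)
ucov = np.array([[1.0, 0.4],
                 [0.4, 2.0]])               # made-up estimate
draws = np.array([draw_cov(ucov, n_df=5, rng=rng) for _ in range(20000)])
mean_draw = draws.mean(axis=0)              # Monte Carlo mean of the draws
```

In the full procedure each such draw would be passed to Eq. (13) in place of UCOV to obtain one value $r_{i}$.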

The desired probability distribution for the predicted value will be a simple average of scaled t-distributions with $r_{i}$ degrees of freedom:

$f(y_{n+1})=\frac{1}{T}\sum\limits_{i=1}^{T}t\_\textit{scaled}(\textit{MSFE},r_{i}).$ (22)
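A confidence interval can be read off the mixture in Eq. (22) by inverting its cumulative distribution function numerically. The sketch below assumes that each scaled t-distribution is centered at the point forecast with scale $\sqrt{\textit{MSFE}}$; the function names and the list of $r_{i}$ values are hypothetical.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import t

def mixture_cdf(y, y_hat, msfe, r_draws):
    """CDF of the equally weighted mixture of scaled t-distributions (Eq. (22))."""
    z = (y - y_hat) / np.sqrt(msfe)
    return float(np.mean(t.cdf(z, df=np.asarray(r_draws, dtype=float))))

def mixture_interval(y_hat, msfe, r_draws, alpha=0.05):
    """Two-sided (1 - alpha) interval; the mixture is symmetric around y_hat."""
    span = 1e3 * np.sqrt(msfe)              # wide bracket for the root search
    lower = brentq(
        lambda y: mixture_cdf(y, y_hat, msfe, r_draws) - alpha / 2.0,
        y_hat - span, y_hat,
    )
    return lower, 2.0 * y_hat - lower

r_draws = [5.8, 6.4, 7.1, 9.3, 12.0]        # illustrative r_i values
lower, upper = mixture_interval(y_hat=1.0, msfe=4.0, r_draws=r_draws)
```

With a single value in `r_draws` the result coincides with the interval of Eq. (11), which gives a simple check of the implementation.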

Figure 3.

a. Empirical probability of $y_{n+1}$ falling into 95% confidence interval, b. Empirical probability of $y_{n+1}$ falling into 90% confidence interval, c. Empirical probability of $y_{n+1}$ falling into 80% confidence interval, d. Empirical probability of $y_{n+1}$ falling into 70% confidence interval.

Figure 4.

Dependence of the number of degrees of freedom for combined model on the weight coefficients for two submodels.

Figure 5.

Dependence of the number of degrees of freedom for combined model on the weight coefficients for three submodels.

However, it should be noted that even the simple calculation of the number of degrees of freedom for the combined model by Eq. (13) yields a sufficiently reliable interval forecast. To verify this statement, we perform a simulation experiment. As in the previous case, suppose that we average two regression models that model $y_{t}$ as a function of the factors $x_{1t}$ and $x_{2t}$, respectively. The values of $y_{t}$ are generated from the normal distribution with zero mean and unit variance. The factors $x_{1t}$ and $x_{2t}$ are modeled as $x_{it}=y_{t}+N(0,1)$. The first submodel is built on a data frame of $n_{1}=20$ observations, the second on a data frame of $n_{2}=7$ observations. In this experiment we consider the empirical probability that the predicted value $y_{n+1}$ falls within the confidence interval calculated from Student’s t-distribution with several numbers of degrees of freedom: the number associated with the shortest data window ($df=5$), a number yielding a distribution close to the normal ($df=1000$), and the number calculated according to Eq. (13) ($df=r$).

Figure 3a–d show the results of the simulation experiment, namely the empirical probabilities of the predicted value $y_{n+1}$ falling into the 95%, 90%, 80% and 70% confidence intervals for the three ways of choosing the number of degrees of freedom described above and for different values of the weight coefficients $w_{1}$ and $w_{2}$. Each point of the graph is based on 10,000 simulations.

As can be seen from Fig. 3a–d, using $df=5$ gives sufficiently reliable confidence intervals for small values of $w_{1}$, but as the weight coefficient of the first submodel increases, the reliability of this method decreases. The confidence interval calculated for $df=5$ overestimates the true one, which means that it captures a larger percentage of realized values than stated. Using a near-normal distribution to calculate the interval forecast for the combined model can be reliable either for values of $w_{1}$ close to one or if the considered data frames are sufficiently long. From Fig. 3a–d we can also conclude that the proposed formula for the number of degrees of freedom of the combined model yields reliable confidence intervals regardless of the values of the weight coefficients and the lengths of the data frames. Thus, the conducted testing suggests that the numerical algorithm for calculating the interval forecast given above is not strictly necessary, since the point estimate of the number of degrees of freedom (see Eq. (13)) yields confidence intervals of satisfactory accuracy.
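A reduced version of this coverage experiment can be sketched as follows. It uses fixed weights, the MSFE of Eq. (8), and only the two fixed choices $df=5$ and $df=1000$; the case $df=r$ from Eq. (13) is not included. The function name `coverage`, the number of simulations and the assumption that the shorter data frame contains the most recent observations are ours.

```python
import numpy as np
from scipy.stats import t

def coverage(w1, df, n1=20, n2=7, n_sim=2000, level=0.95, seed=0):
    """Share of simulated y_{n+1} inside the interval of Eq. (11) for a fixed df."""
    rng = np.random.default_rng(seed)
    w = np.array([w1, 1.0 - w1])
    q = t.ppf(0.5 + level / 2.0, df)
    n_obs = [n1, n2]
    hits = 0
    for _ in range(n_sim):
        y = rng.standard_normal(n1 + 1)                  # y[n1] plays y_{n+1}
        x = y[:, None] + rng.standard_normal((n1 + 1, 2))
        A, G, eb, y_hat = [], [], [], []
        for i in range(2):
            pad = n1 - n_obs[i]
            Xi = np.column_stack([np.ones(n_obs[i]), x[pad:n1, i]])
            yi = y[pad:n1]
            B = np.linalg.solve(Xi.T @ Xi, Xi.T @ yi)    # Eq. (3)
            Xp = np.vstack([np.zeros((pad, 2)), Xi])     # zero-padded design
            Ip = np.diag(np.r_[np.zeros(pad), np.ones(n_obs[i])])
            A.append(Ip - Xp @ np.linalg.solve(Xi.T @ Xi, Xp.T))   # Eq. (9)
            x_new = np.array([1.0, x[n1, i]])
            G.append(Xp @ np.linalg.solve(Xi.T @ Xi, x_new))
            eb.append(np.r_[np.zeros(pad), yi - Xi @ B])
            y_hat.append(x_new @ B)
        msfe = 0.0
        for i in range(2):
            for j in range(2):
                ucov_ij = eb[i] @ eb[j] / np.trace(A[i] @ A[j])
                theta_ij = 1.0 + G[i] @ G[j]             # Eq. (14)
                msfe += ucov_ij * theta_ij * w[i] * w[j] # Eq. (8)
        point = w @ np.array(y_hat)
        half = q * np.sqrt(max(msfe, 0.0))               # guard against a negative estimate
        hits += int(abs(y[n1] - point) <= half)
    return hits / n_sim

cov_df5 = coverage(w1=0.2, df=5)
cov_df1000 = coverage(w1=0.2, df=1000)
```

Because both calls use the same seed, they evaluate the two intervals on identical simulated data, so any difference in coverage comes only from the choice of degrees of freedom.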

Next, we study how the number of degrees of freedom for the combined model depends on the values of the weight coefficients. First, we consider the two-dimensional case, in which averaging is performed across only two submodels. Then the weight coefficient $w_{1}$ automatically determines the value of the remaining weight coefficient $w_{2}=1-w_{1}$. It is therefore natural to depict this dependence on a two-dimensional graph with $w_{1}$ on the X-axis and the number of degrees of freedom on the Y-axis.

In Fig. 4 we present a graph of dependence of the number of degrees of freedom for combined model on $w_{1}$ given ${df}_{1}=n_{1}-2=18$ and ${df}_{2}=n_{2}-2=5$ .

In Fig. 4 one can trace a distinct nonlinear, asymmetric, logistic-type curve. For $w_{1}\to 0$ the number of degrees of freedom tends to ${df}_{2}$, and conversely, for $w_{1}\to 1$ it tends to ${df}_{1}$. The shape of the curve also shows that the smaller of the two degrees of freedom (in this case ${df}_{2}$) has greater “priority”: $w_{1}$ must substantially exceed $w_{2}$ for the number of degrees of freedom to reach even the average of ${df}_{1}$ and ${df}_{2}$. We can therefore also conclude that using the simple average of the submodels’ degrees of freedom will not yield reliable confidence intervals either.

Next, consider the three-dimensional case, when three submodels are averaged. The weight coefficients $w_{1}$ and $w_{2}$ unambiguously determine the value of the weight coefficient $w_{3}=1-w_{1}-w_{2}$ , therefore in this case we will use a three-dimensional space for depicting considered relation. Figure 5 shows the dependence of the number of degrees of freedom of the combined model on $w_{1}$ and $w_{2}$ given ${df}_{1}=38$ , ${df}_{2}=18$ and ${df}_{3}=5$ .

The graph in Fig. 5 confirms the earlier statement about the greater “priority” of a smaller number of degrees of freedom over a larger one, since it has a distinct asymmetric shape. For the number of degrees of freedom to approach ${df}_{1}=38$ it would require a significant advantage of $w_{1}$ over the remaining two weight coefficients.

4. Conclusion

This paper has examined the properties of the formula for the number of degrees of freedom of a combined model constructed by an arbitrary averaging method. We note that the number of degrees of freedom depends not only on the variance-covariance matrix of the true errors of the submodels but also on the values of the factors involved, as illustrated in several graphs. However, despite the uncertainty about the error variances and covariances, the analyzed formula yields reliable and consistent results, as confirmed by the simulation study. We also investigate the dependence of the number of degrees of freedom on the value of a weight coefficient, which appears to give more priority to the number of degrees of freedom of the submodel constructed on the shortest data frame. The analyzed formula is therefore needed in many real-life situations: even if a researcher averages across data frames of lengths twenty and one hundred, simply using the normal distribution to obtain a confidence interval will return a significantly underestimated interval forecast. The same underestimation occurs for a simple average of the degrees of freedom of the averaged submodels, although this method can be considered a significant improvement over using the normal distribution. Thus, we recommend using the analyzed formula when the shortest considered data frame has fewer than thirty observations, and we encourage econometricians to calculate the number of degrees of freedom correctly when dealing with model averaging methods.

Acknowledgments

This research was funded by Plekhanov Russian University of Economics.

Conflict of interest

The authors declare that they have no conflict of interest.

References

1. Akaike (1973). Information theory and an extension of the maximum likelihood principle. In Petrov, B., & Csaki, F. (eds.), 2nd International Symposium on Information Theory, Akademiai Kiado, Budapest, 267-281.

2. Akaike (1979). A Bayesian extension of the minimum AIC procedure of autoregressive model fitting. Biometrika, 66, 237-242.

3. Bates, J. M., & Granger, C. W. J. (1969). The combination of forecasts. Operations Research Quarterly, 20, 451-468.

4. Brock, W. A., & Durlauf, S. N. (2001). Growth empirics and reality. World Bank Economic Review, 15, 229-272.

5. Brock, W. A., Durlauf, S. N., & West, K. D. (2003). Policy analysis in uncertain economic environments. Brookings Papers on Economic Activity, 1, 235-322.

6. Buckland, S. T., Burnham, K. P., & Augustin, N. H. (1997). Model selection: An integral part of inference. Biometrics, 53, 603-618.

7. Burnham, K. P., & Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, Second ed. Springer, New York.

8. Cheng, & Hansen, B. E. (2015). Forecasting with factor-augmented regression: A frequentist model averaging approach. Journal of Econometrics, 186, 280-293.

9. Claeskens, & Hjort, N. L. (2003). The focused information criterion. Journal of the American Statistical Association, 98(1), 900-916.

10. Claeskens, & Hjort, N. L. (2008). Model selection and model averaging. Cambridge University Press, Cambridge.

11. Clemen, R. T. (1989). Combining forecasts: A review and annotated bibliography. International Journal of Forecasting, 5, 559-581.

12. Fernandez, L. C. E., & Steel, M. F. J. (2001). Benchmark priors for Bayesian model averaging. Journal of Econometrics, 100, 381-427.

13. Garratt, Lee, Pesaran, M. H., & Shin (2003). Forecasting uncertainties in macroeconomic modelling: An application to the UK economy. Journal of the American Statistical Association, 98, 823-838.

14. Granger, C. W. J. (1989). Combining forecasts – twenty years later. Journal of Forecasting, 8, 167-173.

15. Granger, C. W. J., & Ramanathan (1984). Improved methods of combining forecast accuracy. Journal of Forecasting, 19, 197-204.

16. Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75, 1175-1189.

17.

Hansen

B. E.

(2008). Least-squares forecast averaging. Journal of Econometrics, 146, 342-350.

18.

Hansen

B. E.

(2014). Model averaging, asymptotic risk and regressor groups. Quantitative Economics, 5, 495-530.

19.

Hansen

B. E.

, & Racine

J. S.

(2012). Jackknife model averaging. Journal of Econometrics, 167, 38-46.

20.

Hendry

D. F.

, & Clements

M. P.

(2002). Pooling of forecasts. Econometrics Journal, 5, 1-26.

21.

Hjort

N. L.

, & Claeskens

(2006). Focused information criteria and model averaging for the Cox hazard regression model. Journal of the American Statistical Association, 101, 1449-1464.

22.

Hoeting

J. A.

Madigan

Raftery

A. E.

, & Volinsky

C. T.

(1999). Bayesian model averaging: A tutorial. Statistical Science, 14(4), 382-417.

23.

Ing

C.-K.

(2003). Multistep prediction in autoregressive processes. Econometric Theory, 19, 254-279.

24.

Ing

C.-K.

(2004). Selecting optimal multistep predictors for autoregressive processes of unknown order. Annals of Statistics, 32, 693-722.

25.

Ing

C.-K.

(2007). Accumulated prediction errors, information criteria and optimal forecasting for autoregressive time series. Annals of Statistics, 35, 1238-1277.

26.

Ing

C.-K.

Wei

C.-Z.

(2003). On same-realization prediction in an infinite-order autoregressive process. Journal of Multivariate Analysis, 85, 130-155.

27.

Ing

C.-K.

Wei

C.-Z.

(2005). Order selection for same-realization predictions in autoregressive processes. Annals of Statistics, 33, 2423-2474.

28.

Lee

, & Karagrigoriou

(2001). An asymptotically optimal selection of the order of a linear process. Sankhya Series, A63, 93-106.

29.

Liu

C.-A.

(2015). Distribution theory of the least square averaging estimator. Journal of Econometrics, 186, 142-159.

30.

Mallows

C. L.

(1973). Some comments on Cp. Technometrics, 15, 661-675.

31.

Min

C.-K.

, & Zellner

(1993). Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates. Journal of Econometrics, 56, 89-118.

32.

Moiseev

N. A.

(2016). Linear model averaging by minimizing mean-squared forecast averaging unbiased estimator. Model Assisted Statistics and Applications, 11(4), 325-338.

33.

Moiseev

N. A.

(2017). Forecasting time series of economic processes by model averaging across data frames of various lengths. Journal of Statistical Computation and Simulation. Forthcoming.

34.

Mood

Graybill

, & Boes

(1974). Introduction to the Theory of Statistics (3rd ed). McGraw-Hill. New-York City. p. 557.

35.

Nydick

S. W.

(2012). The Wishart and Inverse Wishart Distributions, [Online]. Available: http://www.tc.umn.edu/∼nydic001/docs/unpubs/Wishart_Distribution.pdf.

36.

Raftery

A. E.

Madigan

, & Hoeting

J. A.

(1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 92(437), 179-191.

37.

Rissanen

(1986). Order estimation by accumulated prediction errors. Journal of Applied Probability, 23A, 55-61.

38.

Doppelhofer

, & Miller

R. I.

(2004). Determinants of long-term growth: A Bayesian averaging of classical estimates (BACE) approach. American Economic Review, 94, 813-835.

39.

Schwartz

(1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.

40.

Shibata

(1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Annals of Statistics, 8, 147-164.

41.

Shibata

(1981). An optimal selection of regression variables. Biometrika, 68, 45-54.

42.

Shibata

(1983). Asymptotic mean efficiency of a selection of regression variables. Annals of the Institute of Statistical Mathematics, 35, 415-423.

43.

Stock

J. H.

, & Watson

M. W.

(1999). A comparison of linear and nonlinear univariate models for forecasting macroeconomic time series. Engle, R., & White, H. (Eds.), Cointegration, Causality and Forecasting: A Festschrift for Clive W. J. Granger. Oxford University Press, Oxford, 1-44.

44.

Stock

J. H.

, & Watson

M. W.

(2004). Combination forecasts of output growth in a seven-country data set. Journal of Forecasting, 23, 405-430.

45.

Stock

J. H.

, & Watson

M. W.

(2005). An empirical comparison of methods for forecasting using many predictors. Working Paper, NBER.

46.

Stock

J. H.

, & Watson

M. W.

(2006). Forecasting with many predictors. Elliott, G., Granger, C. W. J., & Timmermann, A. (Eds.), Handbook of Economic Forecasting, vol. 1. Elsevier, Amsterdam, pp. 515-554.

47.

Timmermann

(2006). Forecast combinations. Elliott, G., Granger, C. W. J., & Timmermann, A. (Eds.), Handbook of Economic Forecasting, vol. 1. Elsevier, Amsterdam, pp. 135-196.

48.

Wan

Zhang

, & Zou

(2010). Least squares model averaging by Mallows criterion. Journal of Econometrics, 156, 277-283.

49.

Wan

A. T. K.

Zhang

, & Wang

(2013). Frequentist model averaging for multinomial and ordered logit models. International Journal of Forecasting, 30, 118-128.

50.

Wishart

(1928). The generalized product moment distribution in samples from a normal multivariate population. Biometrica 20A, 32-52.

51.

Wright

J. H.

(2003a). Bayesian model averaging and exchange rate forecasting. Federal Reserve Board International Finance Discussion Papers, 779.

52.

Wright

J. H.

(2003b). Forecasting US Inflation by Bayesian Model Averaging. Federal Reserve Board International Finance Discussion Papers, 780.

53.

Zhang

, & Liang

(2011). Focused information criterion and model averaging for generalized additive partial linear models. Annals of Statistics, 39(1), 174-200.

54.

Zubakin

V. A.

Kosorukov

O. A.

, & Moiseev

N. A.

(2015). Improvement of regression forecasting models. Modern Applied Science, 9(6), 344-353.