Abstract
There are two main approaches to carrying out prediction in the context of penalized regression: with low-rank basis and penalties or through the smooth mixed models. In this article, we give further insight in the case of P-splines showing the influence of the penalty on the prediction. In the context of mixed models, we can connect the new predicted values to the observed values through a joint normal distribution, which allows us to compute prediction intervals. In this work, we propose an alternative approach, called the extended mixed model approach, that allows us to fit and predict data simultaneously. The methodology is illustrated with two real datasets, one of them on aboveground biomass and the other on monthly sulphur dioxide (
Introduction
There are many situations in which prediction of new observations in the context of regression is needed; in particular, when ‘out-of-range’ prediction is required, that is, beyond the range of observed covariates. This problem extends in the framework of smoothing, that is, for models where the regression function is a smooth but otherwise unspecified function. Examples are numerous; for instance, hourly temperatures at a weather station or the yearly number of deaths, where the latter has a major impact in areas such as demography (mortality tables). A scatterplot of data often exhibits patterns, such as an upward or downward trend, or a pattern that repeats, seasonal variation, both of which might be used to predict new values.
The generalities of the problem and the particular characteristics of datasets are one of the main reasons that encourages us to work in the prediction field and base our work on the forecasting method proposed in Currie et al. (2004). They have shown how the method of penalized splines (P-splines), introduced by Eilers and Marx (1996) and extensively discussed in Ruppert et al. (2003), can be extended to smooth and predict two-dimensional mortality tables. In particular, the authors show how to construct the appropriate regression bases and penalty matrices for forecasting.
Most of the existing literature in the area is related to the prediction of new observations in a temporal context, that is, forecast of new observations. Although we have a more general approach, we start by providing a brief review of the main literature related to forecasting in smoothing models by commenting the main approaches of Ba et al. (2012) and Sacks et al. (1989). We also refer to Hyndman et al. (2008) who gave an overview of exponential smoothing methods.
Exponential smoothing or weighted smoothing, respectively, refers to a class of forecasting methods, each of them having the property that forecasts are weighted combinations of past observations, with recent observations given relatively more weight than older observations. Hyndman et al. (2008) provide extensive information about exponential smoothing methods. A summary of the exponential smoothing history shows the great importance of this method. It certainly is the most popular forecasting method used in business and industry since the 1950s.
In this article, we focus on computationally driven methods. A recent strategy is to use penalized splines to fit and forecast time series data (Ba et al., 2012). In this case, one minimizes the penalized least squares criteria:
where
A similar prediction problem is tackled in the framework of global optimization where the interest is to evaluate an unknown function at point
Beyond the aforementioned articles, we have not found methods to predict in penalized regression. In this article, we propose a general framework for prediction in penalized regression and also delve deeper into our knowledge of the forecasting method proposed in Currie et al. (2004). Although we will use B-spline basis and penalties based on differences, the methodology proposed here can be extended to any basis and quadratic penalty. We have organized the remaining of the article as follows. Section 2 is dedicated to introducing the fundamentals of the method proposed by Currie et al. (2004) and to show some properties that relate the order of the penalty with the shape of the predictions. In Section 3, we describe how forecasting can be carried out in the context of smooth mixed models. It can be done through two-stages procedures, following Gilmour et al. (2004) or based on the conditional distribution of the new values given the observed data. We show how the last procedure allows us to compute prediction intervals (see Section 3.1). In addition, we propose an alternative method for prediction with smooth mixed models that can be done through a one-stage procedure, by extending the proposal of Currie et al. (2004) to the mixed model framework (see Section 3.2). The equivalence of two-stage and one-stage procedures (in terms of predicted values and choice of smoothing parameter) is shown in the particular case of penalties based on the differences between adjacent coefficients (Eilers and Marx, 1996). The proposed methodology is illustrated in Section 4 with the analysis of two real datasets. Finally, concluding remarks are made in Section 5.
Prediction with smooth models and quadratic penalties
We start our presentation by giving a brief revision of penalized regression. Consider the case of univariate data with response variable
where
where
where
Currie et al. (2004) proposed a method to fit and predict simultaneously in penalized regression models. We call their proposal the missing value approach, subsequently, and give a brief summary of their methodology.
In the framework of model (2.1), given a vector
which contains the observed response
Example of an extended basis to the right of the data (forward)
Associated to the new basis
In the case that
that is,
The model can be fitted and predicted simultaneously by minimizing the following penalized least squares criterion for
where the unknown
and
Writing the fit and the forecast as a function of the extended penalty matrix (2.4) and applying Theorem 9.6.1 given in Harville (2000), we get the fit and the prediction
where the superscript (
When the aim of prediction is to forecast future values, the independent error assumption in (2.2) might not always be appropriate. If this is the case, the estimation of the smoothing parameter will be affected by the correlation that is not being accounted for. The methods presented here are easily extended to the case of non i.i.d. errors where smoothing and correlation parameters can be estimated simultaneously (Durbán and Currie, 2003). See also Krivobokoa and Kauermann (2007) for related results in mixed models.
When penalties are based on differences between adjacent coefficients, for interpolating, there is no need for new coefficients and the B-spline coefficients form a polynomial sequence of degree The fit remains the same regardless of the forecast horizon (the coefficients that yield the fit do not change). The shape of the forecast is determined by the order of the penalty.
These properties are an immediate consequence of the following theorems.
The first The coefficients for the
Proof. Substituting the blocks of
that is the first
If the knots are not equally spaced, the expression above would be modified, since the penalty would have to account for the difference between the knots (Eilers and Marx, 2010).
Corollary 1. Given penalties of order
As the most popular penalties are of second or third order, the proof of the previous corollary for such cases and for penalties of order 1 is given in the complementary material.
In many situations, the fit is not greatly affected by the order of the penalty. However, there is an immediate connection between the penalty (or prior distribution) on the coefficients on the shape of the out-of-sample prediction shown in the above corollary. This is known in the framework of Bayesian P-splines, where the difference penalty corresponds to assuming a random walk prior on the coefficients (see Lang and Brezger, 2004); however, it is not common knowledge in a non-Bayesian context. We believe this is an important result that needs to be addressed when using this methodology.
The connection between penalized smoothing and mixed models was established more than 30 years ago in (Green, 1987; see also Currie and Durbán, 2002; Wand, 2003). The key point of this equivalence is the fact that the smoothing parameter becomes a ratio of variances,
In order to reparameterize a penalized smooth model, it is necessary to find a new basis that allows the representation of model (2.1) as a mixed model. Like before, we replace the smooth function by a basis representation which is now written as
where
There are different alternatives for the reparameterization of the original smooth model described before which was based on quadratic penalties as a mixed model in (3.1). The idea is to find a transformation
where
Prediction in the context of mixed models has always been done as a two-stage procedure: first fit and then predict. We will first show how to apply the existing results to the context of smooth mixed models, and then, we will propose an alternative one-stage approach. As we will see, the variance–covariance matrix for the random effects in the two-stage approach is a direct extension of the variance–covariance matrix of the random effects in the fit. This implies that the extended mixed model transformation used has to provide a direct extension of the variance-covariance matrix of the random effects in the fit. One of the advantages of the one-stage approach is that we can use any transformation.
Standard methodology for prediction
Once we have established the connection between mixed models and P-splines, we can use the results given in Gilmour et al. (2004) to predict new observations. In this case, the prediction is a linear function of the best linear unbiased predictor (BLUP) of random effects and the best linear unbiased estimator (BLUE) of the fixed effects in the model. The results are based on the following augmented mixed model,
that is,
with
where
Now we need to formulate the extended P-spline model into the extended mixed model (3.2). For that, we define the extended transformation matrix
Then, the new predicted values are
with
where
It follows that the predicted random effects vector for
Therefore, the
with
The previous method can be seen from the point of view of conditional distributions, which allows us to compute prediction intervals. We rewrite (3.2) as
with
where
which equals (3.3). Note that the first term in the above equation is the result of plugging
where the latter equality holds, since in the Normal model the conditional variance does not depend on the value of the variable we condition on. With these results, we can construct
where observed values are considered as fixed. Note that, as we have mentioned before, this approach allows us to work out the posterior predictive distribution
The previous method was a two-stage procedure. First, the model is fitted to the available data and, second, based on the fitted model we predict the new observations. As mentioned previously, this approach imposes constraints on the reparameterization used to obtain the smooth mixed model, since the variance–covariance matrix of the extended model (the model that includes out-of-sample prediction) needs to be an extension of the variance–covariance matrix of the original model. Now we propose an alternative approach which can be used with any reparameterization and yields the same results as the two-step approach for appropriate transformations. This approach relates the above results to the method presented in Section 2. We will call it ‘extended mixed model approach’, since we include
where
where
For the particular case of penalties based on differences, we choose
We use the double hat symbol (
Like the above, we have
As we mentioned earlier, the above extension of the missing value approach to the mixed model framework fits and predicts simultaneously, while the approach of Gilmour et al. (2004) is a two-stage method. In order to know the relationship between the two methods, we need to know the relationship between the covariance matrix of the random effects that gives the fit and the extended covariance matrix. This is shown in the following theorem.
Under the previous hypothesis, the variance components (
Equivalent results would be obtained using ML.
The proof can be found in the complementary material. Since the fitted and predicted values in both approaches do not depend on the transformations used, the previous theorem states a stronger result: The approaches always give the same solution, regardless of the used transformations. We have stated the theorem for a particular transformation to establish a relationship between both the methods. The last statement of the previous theorem means that the variance parameters used to predict are the same as the ones used for estimating the original fit. In other words, prediction in and out of sample cannot only be done simultaneously but also the optimal smoothing parameter is the same.
In this section, we apply the proposed methods to two real datasets. The first one, allows us to show an example of predicting within the framework of additive models and in the case when out-of-sample prediction is needed to the left and right of the interval where the covariate is observed. The second dataset illustrates a classic example where forecasting is needed, when data are collected over time. Although, as we have shown, all the methods give us the same result, in order to obtain the prediction intervals we use the two-stage approach.
Prediction of aboveground biomass
All the results presented in the previous sections are obtained in the case of smooth models with a single covariate. However, it is immediate to extend these results to the case of semiparametric or additive models. In this section, we apply the proposed methodology to such dataset. The dataset corresponds to an agricultural trial carried out in Spain (Rivas-Martínez et al., 2002) with the aim of evaluating the economic viability of Populus trees prior to harvesting. In this context, it is essential to estimate aboveground biomass and obtain accurate predictions using only a minimum set of easily obtainable information, that is diameter and height. Sánchez-González et al. (2016) proposed the use of smooth additive mixed models for predicting aboveground biomass. Here, we analyse data for a single clone of the nine included in the trials. The aim is to estimate the productions (measured as aboveground dry biomass) as a function of diameter and height, and give out-of-sample predictions. The observed data consists of
Plot of weight versus diameter (left panel) and plot of weight versus height (right panel)
Plot of weight versus diameter (left panel) and plot of weight versus height (right panel)
From the plot, it is immediate to see that the weight is a non-linear function of diameter and height. Therefore, we fit the following model:
where
We predict weight for
Once we have extended the basis and the penalty, it is straightforward to obtain the fit and the prediction applying Equation (2.7). But in order to obtain confidence and prediction intervals and to avoid identifiability problems, since the column of
Fit, forecast,
confidence interval (grey lines) and
prediction interval (dashed lines) of the additive smooth term for diameter (left panel) and for height (right panel), result of applying the methodology of a dataset on the stem biomass
The estimation of the covariance parameters was carried out by the REML through the SOP algorithm, proposed in Rodríguez-Álvarez et al. (2018).
Now, we analyse data where out-of-sample prediction is needed and missing observations are present in the data. In this situation, the one-stage approach offers an elegant solution to solve simultaneously both problems. We consider measurements on sulphur dioxide (
Time series plot of
data for station AT02
Time series plot of
data for station AT02
The data were collected through the ‘European Monitoring and Evaluation Programme’ (EMEP) under the Co-operative Programme for Monitoring and Evaluation of the Long-range Transmission of Air Pollutants in Europe (see further information available at
First of all, let us introduce the smooth modulation model based on P-splines suggested by Eilers et al. (2008):
where
where
If we want to obtain the coefficients
where
To obtain the fit and the forecast simultaneously, we just have to extend the B-spline basis for the trend and the modulation components
Once
Keeping in mind that in a penalized spline modulation model, we have the trend and the seasonality components, it is straightforward to represent it as a mixed model.
As we have seen the order of the penalty,
The matrix of the random component,
The one-stage approach can deal with missing observations within and out of sample just by setting to zero the corresponding diagonal entries of
Forecast for AT02 station. Top and bottom left figures show the data (points), the fitted and forecasted trend (black line), 95% confidence interval (grey lines) and 95% prediction interval (dashed lines) for second and third order penalties, respectively. Top and bottom right figures show the data (points), the fit and the forecast in the modulation model (black line), 95% confidence interval (grey lines) and 95% prediction interval (dashed lines) for second and third order penalties, respectively
Smoothing techniques have become a very popular tool for describing patterns in noisy data. However, prediction is still an open area of research. In this article we have proposed a general framework for prediction of new observations in penalized regression. The methodology proposed can be accommodated to the different frameworks in which smoothing is carried out, regardless of the basis and quadratic penalty used:
1. Extend the basis used for regression and the penalty to control the smoothness in the framework of penalized regression based on quadratic penalties. 2. Extend the fixed and random components in the context of mixed models and carry out the prediction as a two-stage procedure or using a new approach that allows us to fit and predict simultaneously.
The approach given in Section 3.1.2 is a particular case of prediction in the context of Gaussian process regression, under the assumption that the regression function is a realization of a stochastic process. Then, prediction of new values are based on Gaussian process prior and a P-spline covariance matrix. Hence, we can use the flexibility of the Gaussian process and the choice of suitable covariance matrix to model any non-linear model nonparametrically. In this context, prediction is straightforward due to the properties of Gaussian processes.
In the context of penalties based on differences between adjacent coefficients, we have proved that all the approaches compared in this work are equivalent, but the one-stage approach can be used regardless of the mixed model reparameterization used. Furthermore, when predicting out-of-sample observations, the horizon of prediction has no effect on the fit or on the selection of the smoothing parameter.
These results are based on the basis constructed from equally spaced knot. However, the results apply also to the non-equal spaced knots case, defining the appropriate scaled penalty matrices.
To illustrate the methodology and the proved results, we have analysed two real datasets and showed the performance of the extended mixed model approach with B-spline basis and penalties based on differences. These examples illustrate how to obtain out-of-sample prediction in the case of additive models. In the first example, we show how the method can do backward and forward predictions, and in the second example, we show how the one-stage approach can deal with out-of-sample and missing value predictions simultaneously.
The results obtained here have opened new areas of research; in particular, we have already obtained similar results in the case of multidimensional smoothing and preliminary work is being done to extend this work to the case of non-Gaussian data.
Supplementary material
Supplementary material for this article are available from
Footnotes
Declaration of conflicting interests
Funding
The first and second authors acknowledge financial support from the Spanish Ministry of Economy and Competitiveness MTM2014-52184-P. The fourth author acknowledges the financial support by the Basque Government through the BERC 2018-2021 programme and by Spanish Ministry of Economy and Competitiveness MINECO through BCAM Severo Ochoa excellence accreditation SEV-2013-0323 and through project MTM2017-82379-R funded by (AEI/FEDER, UE) and acronym AFTERAM.
