Abstract
The type of effect (e.g., linear or non-linear) of a covariate in a regression model can be specified based on graphical assessment, on subject matter knowledge or on data-driven model choice procedures. For the latter, we present a boosting approach that is available for a large number of different model classes. Boosting is an indirect regularization technique that leads to variable selection and can also easily incorporate non-linear or smooth effects. Furthermore, the algorithm can be adapted to automatically select whether to model a continuous variable with a smooth or a linear effect. We enhance this model choice procedure by trying to compensate for the inherent bias towards the more complex effect with a pragmatic and simple deselection technique that was originally implemented for enhanced variable selection. We illustrate our approach in the analysis of T3 thyroid hormone levels from a larger Galician cohort and investigate its performance in a simulation study.
Introduction
With the introduction of newer and more complex model classes (e.g., distributional regression, Rigby and Stasinopoulos, 2005; Kneib, 2013) and of more flexible types of effects, the toolbox of data analysts has become richer in statistical modelling options over the last decades. However, this increased flexibility also brings the burden of having to select the best possible option from a vast number of suitable solutions. This is also reflected by the introduction of P-splines (Eilers and Marx, 1996), based also on the work by Brian Marx, whom we remember with this special issue. P-splines allow for a very flexible form of effects for continuous variables, but at the same time their flexibility is regulated by penalization in order to ensure smooth effects. This trade-off between flexibility and simplicity is at the core of many discussions in statistical modelling among researchers trying to find the best solution for the research question at hand. Generally, the problem of selecting a model as complex as necessary but as simple as possible can be tackled either from a subject-matter perspective or in a data-driven automated manner. With the emergence of larger data sets and new data sources (e.g., proteomics, genomics), there is also an increased need for automated procedures to address dimensionality, complexity reduction and model choice.
Penalized regression techniques are a natural solution for complexity reduction in the context of statistical modelling. The most popular regularized regression techniques, such as the lasso (Tibshirani, 1996), ridge regression (Hoerl and Kennard, 1970) or the elastic net (Zou and Hastie, 2005), typically aim for linear effects. Boosting, on the other hand, is an indirect approach to regularized regression that can easily incorporate non-linear and smooth effects (as introduced by Schmid and Hothorn, 2008). For the smooth component, the boosting framework relies on the iterative application of P-splines (Eilers and Marx, 1996) as base-learners, which were also extended to monotone or cyclical effects (Hofner et al., 2016). Boosting algorithms originally emerged from machine learning, but were later adapted to statistical modelling (Bühlmann and Hothorn, 2007) and are nowadays a versatile option to fit various model classes, including very complex ones (for an overview see Mayr et al., 2014a, b). For the remainder of this article, we will focus on statistical boosting algorithms in the sense of component-wise gradient boosting with regression models as base-learners (Bühlmann and Hothorn, 2007). It can be shown that, with linear base-learners, boosting with early stopping leads to solutions very similar to those of the lasso (Hepp et al., 2016).
In the context of dimensionality or complexity reduction, statistical boosting algorithms allow simultaneously: (a) to select the most influential variables from a potentially high-dimensional set of candidate variables; (b) to perform model choice by selecting the most appropriate type of effect via a decomposition into linear and non-linear parts; and (c) to estimate the final prediction model.
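To make points (a) and (c) concrete, the following is a minimal sketch of component-wise gradient boosting for the squared-error loss with one simple linear base-learner per covariate. It is an illustration under stated assumptions, not the implementation used in this article: the function name `componentwise_l2_boost`, the step length `nu`, the stopping iteration `m_stop` and the simulated data are hypothetical choices, and every covariate is assumed to be non-constant.

```python
import numpy as np


def componentwise_l2_boost(X, y, m_stop=100, nu=0.1):
    """Component-wise L2 boosting with one linear base-learner per column.

    Returns the offset, the coefficient vector and the empirical risk
    reduction attributed to each base-learner (all names are illustrative).
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)            # centre covariates (assumed non-constant)
    offset = y.mean()                  # start from the mean of the outcome
    coef = np.zeros(p)
    risk_reduction = np.zeros(p)
    fit = np.full(n, offset)
    for _ in range(m_stop):
        u = y - fit                    # negative gradient of the L2 loss = residuals
        # least-squares fit of every base-learner to the negative gradient
        betas = Xc.T @ u / (Xc ** 2).sum(axis=0)
        rss = ((u[:, None] - Xc * betas) ** 2).sum(axis=0)
        j = int(np.argmin(rss))        # best-fitting base-learner in this iteration
        risk_before = np.mean((y - fit) ** 2)
        coef[j] += nu * betas[j]       # update only the selected component
        fit += nu * betas[j] * Xc[:, j]
        risk_reduction[j] += risk_before - np.mean((y - fit) ** 2)
    return offset, coef, risk_reduction


# Hypothetical example data: two informative and eight non-informative covariates.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=200)
offset, coef, risk_reduction = componentwise_l2_boost(X, y, m_stop=50, nu=0.1)
selected = np.flatnonzero(coef)        # covariates updated at least once
```

Because only one base-learner is updated per iteration, covariates that are never chosen before `m_stop` keep a coefficient of exactly zero, which is how early stopping induces variable selection. The per-base-learner risk reduction recorded here is the kind of quantity a deselection step could be based on.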
The notion of model choice in the context of statistical boosting was introduced by Kneib et al. (2009). The basic idea is to decompose the effect of a continuous variable into a linear component, represented by a linear base-learner, and a smooth component, represented by a non-linear base-learner. To avoid bias towards more complex base-learners, either when choosing between linear and smooth effects or in the presence of categorical variables, Hofner et al. (2011) developed a framework for unbiased model choice by incorporating penalized least squares base-learners.
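In formula form, the decomposition can be sketched as follows. The notation is an assumption made for illustration only: x denotes a continuous covariate, \beta_0 + \beta_1 x the linear component and f_{\text{centred}}(x) the smooth deviation from the linear trend.

```latex
f(x) \;=\; \underbrace{\beta_0 + \beta_1 x}_{\text{linear base-learner}}
      \;+\; \underbrace{f_{\text{centred}}(x)}_{\text{smooth base-learner}}
```

If only the linear base-learner is selected, the variable enters the model with a linear effect; if the smooth base-learner is selected as well, the overall effect of the variable is non-linear.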
Over the last years, however, practical experience has suggested that statistical boosting still tends to select overly complex models, particularly in settings with a relatively large number of observations n in comparison to the number of potential predictor variables p (Staerk and Mayr, 2021; Strömer et al., 2022). As overfitting is less problematic in rather low-dimensional settings, the algorithm stops relatively late and tends to include many base-learners. The effect of some of these base-learners in the final model might be very small, as they were likely updated only once or twice. However, this behaviour limits the interpretability of the prediction model without having a relevant impact on prediction performance. This carries over to the question of model choice.
Further regularization might be indicated if variable selection or model choice is key. An approach that can be applied to many regularization techniques is stability selection (Meinshausen and Bühlmann, 2010; Shah and Samworth, 2013), which was later successfully transferred to boosting (Hofner et al., 2015; Mayr et al., 2016; Thomas et al., 2018). However, this approach is computationally expensive, and it remains unclear how to fit the final model.
Here, we propose to tackle this issue with an approach similar to the one described by Strömer et al. (2022) in the context of variable selection: we directly enforce model choice by deselecting more complex base-learners that do not contribute enough to the final model. In other words, if the smooth component of a continuous predictor was selected but does not have a larger impact on the predictive risk than the linear one, we deselect it and re-fit the model with only the simpler linear component. We investigate the performance of this pragmatic and simple approach in a small simulation study and illustrate its merits by modelling thyroid hormone levels from a large Galician cohort based on different continuous and categorical predictor variables.
Methods
Our approach to model choice is based on the application of boosting which emerged from machine learning (Freund and Schapire, 1996) performing classification based on simple decision trees, but was later adapted to estimate regression models via gradient descent in function space (Friedman et al., 2000; Friedman, 2001). For given observations
by estimating a statistical model
and afterwards every base-learner is fitted to the negative gradient vector
Note that in most cases, the number of base-learners
P-splines are typically used as base-learners to model a potentially non-linear effect of a continuous covariate on the outcome (Schmid and Hothorn, 2008). P-spline base-learners can be expressed via a simple penalized least squares regression model, regardless of the distributional assumption of the overall model (Eilers and Marx, 2021; Fahrmeir et al., 2022, Ch. 8). The effect estimate for a P-spline base-learner
where
In Figure 1 this aspect is illustrated, displaying in the upper row the spline fit after different numbers of boosting iterations and in the bottom row a smoothing spline through the current residuals at this iteration (see also Mayr and Hofner, 2018). One can nicely observe how the P-spline fitted to the residuals (corresponding here to
Figure 1. Illustration of the iterative fitting process when boosting a single P-spline. The figure visualizes the influence of the stopping iteration on the complexity of the spline, from oversmoothing (few iterations, left) to overfitting (many iterations, right). The dashed line in the upper plot displays the true underlying function. The red solid line (top) is the boosting fit at a given iteration. For the same iteration, the blue line (bottom) is a classical cubic smoothing spline displaying the remaining structure of the residuals (which correspond in this case to the gradient of the loss).
As discussed above, boosting with early stopping allows for variable selection as only one base-learner is updated per boosting iteration. Hence, not all variables will be incorporated in the final model. If we define for each variable additionally separate base-learners per effect type (linear, smooth, interaction), one can see that the selection of base-learners naturally leads to the selection of different effect types, that is, model choice. It is obvious that a linear effect
with intercept
As before, the crucial parameter for the complexity of the model is the stopping iteration. If the boosting algorithm chooses only the linear effect of
As discussed in the Introduction, boosting models eventually tend to select too many different base-learners. If variable selection or model choice is of key interest, additional sparsity is warranted. Our approach builds on the classical decomposition of linear and smooth components of continuous variables as described in Section 2.2 but additionally incorporates a recent deselection procedure (Strömer et al., 2022) that was developed for enhanced variable selection. The core idea is to identify selected base-learners with minor importance for an initial boosting model and remove them from the set of potential predictors before boosting the model again on the subset of base-learners that were initially selected and have not been deselected. In the second boosting run, the same number of iterations
Where
Strömer et al. (2022) propose to use a pragmatic threshold of
To overcome this, we adapted the original deselection procedure and deselect all base-learners referring to the variable
to ensure that the decomposition does not lead to a tendency to deselect these variables. To enforce simpler and more interpretable models, we additionally propose to further adapt the procedure by Strömer et al. (2022) to enhance model choice. The base-learner
In other words, we remove the spline base-learner representing the deviation from the linear trend when its contribution to the risk reduction is smaller than that of the linear base-learner. The justification for this procedure is pragmatic: in case of doubt, go for the simpler solution. This does not mean that the deviation from the linear trend is considered non-existent; however, our approach aims at ensuring that only variables with an effect clearly deviating from linearity are actually modelled with a spline.
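The two deselection steps described above can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the function name `deselect_base_learners`, the input layout (a dictionary of risk reductions per variable and base-learner type), the example values and the threshold `tau = 0.05` are hypothetical, and the variable-level rule (comparing the summed contribution of all base-learners of a variable with the threshold) is one reading of the adapted procedure.

```python
def deselect_base_learners(risk_reduction, tau):
    """Return the base-learners to keep for the second boosting run.

    risk_reduction: {variable: {base-learner type: attributed risk reduction}}
    tau: threshold on a variable's share of the total risk reduction
         (illustrative parameter; its value is not taken from the text).
    """
    total = sum(sum(parts.values()) for parts in risk_reduction.values())
    keep = {}
    for var, parts in risk_reduction.items():
        # Step 1 (variable selection): judge the variable by the summed
        # contribution of all its base-learners, so that the decomposition
        # into linear and smooth parts does not penalize it.
        if total <= 0 or sum(parts.values()) / total < tau:
            continue
        selected = [kind for kind, value in parts.items() if value > 0]
        # Step 2 (model choice): drop the smooth deviation if it reduced
        # the risk less than the linear base-learner of the same variable.
        if "smooth" in selected and parts["smooth"] < parts.get("linear", 0.0):
            selected.remove("smooth")
        keep[var] = selected
    return keep


# Hypothetical risk reductions from an initial boosting fit.
example = {
    "x1": {"linear": 0.30, "smooth": 0.05},   # smooth part weaker than linear
    "x2": {"linear": 0.02, "smooth": 0.10},   # clear deviation from linearity
    "x3": {"linear": 0.001, "smooth": 0.0},   # negligible contribution
}
kept = deselect_base_learners(example, tau=0.05)
# kept == {"x1": ["linear"], "x2": ["linear", "smooth"]}
```

The model would then be boosted again using only the base-learners listed in `kept`, so that `x1` enters linearly, `x2` keeps its smooth deviation and `x3` is removed altogether.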
Empirical results
Simulation
Simulations conducted by Hofner et al. (2011) showed a biased selection of categorical base-learners with more factor levels and smooth base-learners compared to linear base-learners. To overcome this bias, the authors proposed to use a different, more appropriate definition of the degrees of freedom and to assign equal degrees of freedom (i.e., usually df = 1, as described in Hofner et al. (2011) in more detail) to all competing base-learners. Simulations indicate that despite these adjustments boosting still has a tendency to select the more complex base-learners even if not needed; for example, the smooth base-learner gets selected even though the true underlying effect was strictly linear. To illustrate this, we use first a setup with six informative variables of which five have a linear effect and one has a non-linear one as well as
As an additional scenario, we consider a model of four informative linear predictors, three informative non-linear predictors and only six non-informative predictors with reduced noise term
Results show that in the first scenario 9.8% of variables with true linear effect were falsely incorporated also or only with a smooth base-learner (averaged over all true linear variables and all simulation steps). This proportion increased to 47% in the second scenario. Applying the proposed deselection procedure for enhanced variable selection and model choice from Section 2.3 the proportion of variables incorrectly identified as representing a smooth effect was successfully decreased to
Modelling thyroid hormone levels from the AEGIS study
The A Estrada Glycation and Inflammation Study (AEGIS) is a cross-sectional observational trial performed in Galicia, Northwestern Spain, including
We consider overall p=33 potential predictor variables, of which 29 are continuous and four categorical. The final sample size for a complete case analysis reduced to n=1 352. A first graphical assessment of the data showed potentially non-linear associations between some of the continuous predictor variables and thyroid hormones, particularly for T3 which we will focus on in the following. The results of the same analyses with T4 and TSH can be found in the Supplementary Material.
For the classical boosting fit with the
The variables in the final model, in descending order with respect to overall risk reduction, were hemoglobin, age, monocytes percentage, red blood cells count, height, erythrocyte sedimentation rate, transferrin, ferritin, transferrin saturation, platelet mean volume, red cell distribution width, smoking, platelet distribution width and physical activity. The three variables with non-linear effects were transferrin saturation, ferritin and red cell distribution width. The two remaining categorical variables were smoking and physical activity (both with three categories). The deselection process adjusted the originally smoothly modelled variables monocytes percentage, transferrin saturation, age and height so that they were included in the final model as linear predictors.
In order to further evaluate the stability of our proposed procedure we repeated the fitting, deselection and re-fitting on 1 000 bootstrap samples of the original data set. The estimated effects from standard boosting and from boosting with the previously described deselection procedure with
Results from the AEGIS study: the dashed lines refer to the estimated effects from standard boosting with model choice (left) and the enhanced approach with deselection (right). We show effects of two variables (red cell distribution width and ferritin) that remained with their smooth component in the model and two variables (transferrin saturation and age) that were switched to a linear effect after deselection. The shaded areas refer to empirical bootstrap confidence intervals.
These results are in accordance with other studies reporting that environmental factors (smoking, diet, physical exercise, body mass index) can affect thyroid function. Iron metabolism (among others hemoglobin, ferritin and transferrin) is also very intricately connected to thyroid hormone metabolism. Thyroid hormone insufficiency may lead to deficiency of iron and vice versa. Other factors like oxidative stress may also play a role (erythrocyte sedimentation rate). Finally, we should keep in mind that the thyroid gland is the organ most commonly affected by autoimmune disease.
To evaluate model fit and predictive performance we computed the RMSE both on the training as well as on the test data. For the standard boosting algorithm this resulted in
Root Mean Square Errors (RMSE) on training and test data (generated via 1 000 bootstrap samples) for standard boosting with decomposition and the deselection approach for enhanced model choice and variable selection for T3 thyroid hormone levels from the AEGIS study.
To summarize the results, in the example of modelling T3 thyroid levels from a larger Galician cohort with
We have proposed a simple and pragmatic approach to enhance model choice regarding the decision whether to include a continuous variable with a linear or a smooth effect in a boosted statistical model. We have illustrated, both in a simulation study and on thyroid hormone data, that our approach can help researchers to decide on the type of effect. At least in the considered settings, it led to very promising results: the resulting models were sparser and incorporated only those non-linear effects that were really necessary, while yielding essentially the same fit and prediction accuracy.
There are, however, several points and limitations to consider: First, our approach actively changes the optimal model selected by the boosting procedure. As the boosting algorithm is typically tuned for predictive risk, it hence should be expected that deselecting components that were initially selected will yield some kind of loss in prediction accuracy. The initial deselection step is controlled by the
With all that in mind, we still have reason to believe that this simple procedure might be a valuable option for practical data modelling. Automated model choice is an often cited and highlighted feature of statistical boosting, but its practical relevance over the last years seemed to be limited. Combined with the proposed pragmatic deselection procedure, the decomposition of effects into linear and non-linear components could become a true asset, and not only for the simple Gaussian models presented here. The development of more complex and flexible model classes (Kneib et al., 2021) also calls for methods to reduce the complexity again in order to keep the model manageable for data analysts and interpretable for subject matter researchers.
Further research is warranted on how this procedure performs in multi-dimensional optimization problems where the same continuous variable could enter in different model components (like for location and scale, Mayr et al., 2012). Another field of future research might be to extend the procedure not only for selecting the type of effect, but also to allocate variables to different model components like in joint models for longitudinal and time-to-event data (Waldmann et al., 2017; Rappl et al., 2022).
Supplemental Material
Supplemental material is available for this article online.
Supplemental Material for Linear or smooth? Enhanced model choice in boosting via deselection of base-learners by Andreas Mayr, Tobias Wistuba, Jan Speller, Francisco Gude and Benjamin Hofner, in Statistical Modelling
Footnotes
Acknowledgements
This article would not have been possible without two fellow researchers and friends that are no longer with us and are deeply missed. The work of Professor Brian D. Marx (
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors received no financial support for the research, authorship and/or publication of this article.
