Abstract
Simple models are preferred over complex models, but overly simplistic models can lead to erroneous interpretations. The classical approach is to start with a simple model, whose shortcomings are assessed in residual-based model diagnostics. Eventually, one increases the complexity of this initial overly simple model and obtains a better-fitting model. I illustrate how transformation analysis can be used as an alternative approach to model choice. Instead of adding complexity to simple models, step-wise complexity reduction is used to help identify simpler and more interpretable models. As an example, body mass index (BMI) distributions in Switzerland are modelled by means of transformation models to understand the impact of sex, age, smoking and other lifestyle factors on a person's BMI. In this process, I searched for a compromise between model fit and model interpretability. Special emphasis is given to the understanding of the connections between transformation models of increasing complexity. The models used in this analysis ranged from evergreens, such as the normal linear regression model with constant variance, to novel models with extremely flexible conditional distribution functions, such as transformation trees and transformation forests.
Introduction
Let's face it. The work of statisticians is considered boring in the public eye. Nobody publishes page turners on the thrilling aspects of data analysis, yet the quest for a good model can be as exciting as detective work. One of my favourite paperback characters is LAPD detective Harry Bosch in the crime novels of Michael Connelly. Like Harry, who follows the traces left by the murderer on the crime scene to form a theory about the culprit, the experienced data analyst follows the traces left by the data-generating process in the residuals of an over-simplistic model. Unlike Harry, who of course always succeeds in arresting the murderer, the statistician can never be sure whether the correct or even an approximately useful model was found. In the quests for a suspect or for a good model, parsimonious explanations are preferred by Occam's razor. Therefore, in residual-based model diagnostics, the data analyst starts with a very simple model, whose complexity is increased by step-wise refinement until all signs of lack of fit disappear from the residuals. I refer to such a procedure as ‘bottom-up model choice’ because one moves from simple to more complex models. In this tutorial, I consider moving in the opposite direction, that is, from complex to simple models, for distributional regression. This ‘top-down approach’ to model choice begins with the most complex model that one can come up with that explains both signal and noise without overfitting the data. In a regression setup, such a model would describe as accurately as possible the conditional distribution of the response given the explanatory variables. Once such a model is established as a benchmark for comparison with simpler models, one can start to reduce model complexity step-wise. In the crime novel scenario, the top-down data analyst takes the role of an eyewitness at the scene. What one ‘sees’ in this process is, of course, still a portrayal and not the real thing. 
There is no way to ‘see’ the correct model. In top-down model choice, however, the trajectories through model space will be guided by assessments of vital models. In bottom-up model choice, by contrast, the horizon is limited by the amount of information that one can find in traces in deceased models.
In this tutorial, I focus on top-down model choice in continuous regression problems.
Conceptually, a regression model is a family of conditional distributions for some
response Y, given a specific configuration of explanatory variables
X = x. The model describes both signal and
noise, that is, the variability explained by the explanatory
variables and the unexplained variability. Unfortunately, this point of view only
applies to relatively simple models that assume a certain parametric distribution,
whose parameters partially depend on the explanatory variables. The so-called
‘non-parametric regression models’ (Fahrmeir et al., 2013) often restrict their
attention to the signal
The implementation of top-down model choice is much simpler when the most complex and
the most simple model are members of the same family. Conditional transformation
models from the transformation family of distributions (Hothorn et al., 2014, 2017) include many important established
off-the-shelf regression models. In addition, tailored models can be created,
in vivo with our brains and in silico using
open-source software, which allow smooth transitions between models of different
complexity. In a nutshell, the class of conditional transformation models
Model complexity in the class of conditional transformation models is linked to
smooths
I will proceed by introducing the Swiss Health Survey (SHS) and the variables dealt with in Section 2. In a very simple setup, I first illustrate a bottom-up route, starting with a normal linear model and ending with a more complex non-normal transformation model, for describing the BMI distribution of females and males at various levels of smoking (Section 3). I then try to reduce the complexity again until an interpretable model that fits the data roughly as well as the most complex model can be found. In addition to a consideration of sex and smoking, I consider age and some lifestyle variables in a more realistic setup of top-down transformation choice in Section 4.
Body mass index in the Swiss Health Survey
The SHS is a population-based cross-sectional survey. It has been conducted every five years since 1992 by the Swiss Federal Statistical Office (Bundesamt für Statistik, 2013). For this tutorial, I restricted the sample to 16 427 individuals aged between 18 and 74 years from the 2012 survey. Study samples were obtained by stratified random sampling using a database with all private household landline telephone numbers. Data were collected by telephone interviews and self-administered questionnaires. Height and weight were self-reported in telephone interviews. Observations with extreme values of height and weight were excluded (highest and lowest percentile by sex). Smoking status was categorized into never, former, light (1–9 cigarettes per day), moderate (10–19) and heavy smokers (>19). Never smokers stated that they did not currently smoke and had never regularly smoked for longer than six months; former smokers had quit smoking but had smoked for more than six months during their life course. One cigarillo or pipe counted as two cigarettes, and one cigar counted as four cigarettes. The following lifestyle variables were included and assessed by telephone interview and self-administered questionnaire: fruit and vegetable consumption, physical activity, alcohol intake, level of education, nationality and place of residence. Fruit and vegetable consumption was combined in one binary variable indicating whether both fruits and vegetables were consumed daily. The variable describing physical activity was defined as the number of days per week on which a subject started to sweat during leisure time physical activity and was categorized as >2 days, 1–2 days and none. Alcohol intake was included as a continuous variable measured in grams per day.
Education was included as highest degree obtained and was categorized into mandatory (International Standard Classification of Education, ISCED 1–2), secondary (ISCED 3–4) and tertiary (ISCED 5–8) (UNESCO Institute for Statistics, 2012). Nationality had the two categories Swiss and foreign. Language reflected cultural and regional differences within Switzerland, and the three categories German/Romansh, French and Italian were taken into account. Sampling weights of this representative survey were considered for the estimation of all models reported in this tutorial. More detailed information about this study and an analysis using simple transformation models is given in Lohse et al. (2017).
Sex- and smoking-specific BMI distributions
I start with the very simple situation where the conditional distribution of BMI depends on sex and smoking only. Smoking was assessed on five different levels (never smoked, former smokers, light smokers, moderate smokers and heavy smokers). Therefore, I am interested in the conditional distribution of BMI in these 10 groups of participants. Figure 1 presents the empirical CDFs, that is, the non-parametric maximum likelihood estimators for the underlying continuous distributions, for each of the 10 combinations of sex and smoking. At the same time, the plot represents the uncompressed raw data. With a high enough resolution, one could recover the original BMI values and the corresponding sampling weights from such an image. Consequently, goodness of fit can be assessed by overlaying the empirical CDFs with their model-based counterparts in this simple setup. I will try to find a suitable parametric model this way. In addition to this rather informal approach, I will study the increase of the log-likelihoods as model complexity is increased. In the classical bottom-up approach, one would start with a very simple model assuming conditional normal distributions. The next section discusses possible choices in this model class.
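The weighted empirical CDF used as the reference in this comparison is simple to compute. The following is a minimal sketch, not the code behind Figure 1; the function name `weighted_ecdf` and the toy values are hypothetical and stand in for the BMI values and sampling weights of a single sex and smoking cell.

```python
def weighted_ecdf(values, weights):
    """Return the sorted unique values and the weighted empirical CDF at each."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total = float(sum(weights))
    xs, cdf = [], []
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if xs and xs[-1] == value:
            # Tied observation: enlarge the jump at the existing value.
            cdf[-1] = running / total
        else:
            xs.append(value)
            cdf.append(running / total)
    return xs, cdf


# Hypothetical toy data for one sex/smoking cell (not SHS data).
bmi = [21.3, 24.8, 19.9, 24.8, 30.1]
w = [1.0, 2.0, 0.5, 1.0, 0.5]
xs, cdf = weighted_ecdf(bmi, w)
```

Each jump of the step function equals the normalized sampling weight attached to that BMI value, which is why the plot can be read as a lossless display of the raw data.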
The empirical cumulative distribution function (CDF) of BMI given sex and
smoking. For each combination of sex and smoking, the weighted empirical CDF
taking sampling weights into account is presented
The normal cell-means model with constant variance
Normal cell-means model (3.1) with constant variance. Estimated means of BMI for each combination of sex and smoking, with 95% confidence intervals are shown
Normal cell-means model (3.1) with constant variance. The empirical (blue) and model-based (yellow) CDF of BMI given sex and smoking are shown
How well does this model fit the data? I want to answer this question by
graphically comparing the conditional distribution functions obtained from this
model to the corresponding empirical conditional distributions, and thus the raw
data. The model-based conditional CDFs
The log-likelihood in this 20-parameter model increased to −44 801.19, and the corresponding conditional distribution functions in Figure 3 were closer to the empirical CDFs. For males, the model-based normal distributions were very close to the empirical conditional BMI distributions. For females, however, there still was a considerable discrepancy between model and data, especially in the lower tails. The BMI distributions of females deviated from normality much more than the BMI distributions of males (note that I am not saying that males are normal and females are not!). It is clear that one has to move to a non-normal error model, at least for females, and the transformation models discussed in the following are a convenient way to do so.
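The step from constant to cell-specific variances can be sketched with weighted maximum likelihood estimates. This is an illustration under stated assumptions, not the code used for the reported results: the function names and the toy data are hypothetical. With 10 cells, the heterogeneous version has 10 means and 10 variances, which agrees with the 20 parameters mentioned above.

```python
import math


def weighted_normal_loglik(y, w, mu, sigma):
    """Weighted log-likelihood of N(mu, sigma^2) for values y with weights w."""
    return sum(
        wi * (-0.5 * math.log(2.0 * math.pi * sigma ** 2)
              - (yi - mu) ** 2 / (2.0 * sigma ** 2))
        for yi, wi in zip(y, w)
    )


def fit_cell_means(groups):
    """groups maps a cell label to (values, weights).

    Returns the maximized log-likelihoods of the cell-means model with
    constant variance and with cell-specific variances.
    """
    means, rss, wsum = {}, {}, {}
    for cell, (y, w) in groups.items():
        wsum[cell] = sum(w)
        means[cell] = sum(wi * yi for yi, wi in zip(y, w)) / wsum[cell]
        rss[cell] = sum(wi * (yi - means[cell]) ** 2 for yi, wi in zip(y, w))
    # Constant variance: one ML estimate pooled over all cells.
    sigma_pooled = math.sqrt(sum(rss.values()) / sum(wsum.values()))
    # Heterogeneous variance: one ML estimate per cell.
    sigma_cell = {c: math.sqrt(rss[c] / wsum[c]) for c in groups}
    ll_const = sum(
        weighted_normal_loglik(y, w, means[c], sigma_pooled)
        for c, (y, w) in groups.items()
    )
    ll_hetero = sum(
        weighted_normal_loglik(y, w, means[c], sigma_cell[c])
        for c, (y, w) in groups.items()
    )
    return ll_const, ll_hetero


# Hypothetical toy data for two cells (not SHS data).
toy = {
    "female/never": ([20.1, 22.4, 25.0, 31.2], [1.0, 1.0, 1.0, 1.0]),
    "male/never": ([24.0, 25.1, 26.3, 27.0], [1.0, 1.0, 1.0, 1.0]),
}
ll_const, ll_hetero = fit_cell_means(toy)
```

Because the constant-variance model is nested in the heterogeneous one, the second log-likelihood can never be smaller than the first; the open question is only whether the gain is large enough to matter.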
Normal cell-means model (3.2) with heterogeneous variance. The empirical (blue) and model-based (yellow) CDF of BMI given sex and smoking are shown
The normal models are a special case of transformation models and thus the latter
class is a very natural extension of the former. To see the connection, consider
the conditional distribution function
The core concept of a transformation model is a potentially non-linear
monotonically increasing transformation function
Transformation model (3.3) stratified by sex and smoking. The empirical (blue) and model-based (yellow) CDF of BMI given sex and smoking are shown
Transformation model (3.3) stratified by sex and smoking. Deviations from normality indicated by the non-linear transformation functions (blue) compared to the linear transformation functions (yellow) obtained from the normal cell-means model (3.2) with heterogeneous variances are shown
One nice feature of model (3.3) is the possibility to easily derive characterizations of the distribution other than the distribution function. Density, quantile, hazard, cumulative hazard or other characterizing functions can be derived from (3.3), and Figure 6 depicts the densities for males and females at the various levels of smoking. The right skewness of the distribution, and thus deviation from normality, was more pronounced for females. The BMI distributions for females put more weight on smaller BMI values than those for males. Except for heavy smokers, the effects of smoking seemed to be rather small.
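How density and quantile functions follow from a distribution function of the form Φ(h(y)) can be sketched in a few lines. The transformation function below, h(y) = A + B log(y), is a hypothetical stand-in chosen only because it is monotone and non-linear; it is not the fitted transformation from model (3.3), and the standard normal Φ is assumed as the outer distribution function.

```python
import math


def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def std_normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Hypothetical transformation function (illustration only):
# h(y) = A + B * log(y) is increasing for B > 0 and non-linear in y.
A, B = -16.0, 5.0


def h(y):
    return A + B * math.log(y)


def h_prime(y):
    return B / y


def cdf(y):
    """Distribution function F(y) = Phi(h(y))."""
    return std_normal_cdf(h(y))


def density(y):
    """Density f(y) = phi(h(y)) * h'(y), by the chain rule."""
    return std_normal_pdf(h(y)) * h_prime(y)


def quantile(p, lo=1.0, hi=200.0, tol=1e-10):
    """Invert the monotone CDF by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


median = quantile(0.5)
```

The median is the point where h crosses zero, here exp(16/5). The skewness of the resulting density comes entirely from the curvature of h, since a linear h would give back a normal distribution.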
Transformation model (3.3) stratified by sex and smoking. The model-based conditional densities of BMI given sex and smoking are shown
The model fit of this stratified transformation model is now satisfactory, as it essentially smoothly interpolates the empirical distribution functions and thus the data in Figure 4. This most complex model describes the data well, but, unfortunately, it is difficult to learn anything from this model. That is, one wants to understand the differences between the conditional distributions in terms of simple parameters and not complex non-linear functions. A simpler model is needed. A top-down approach to transformation choice might help to identify a model with simpler and interpretable transformation functions, but any necessary compromises to the model fit should not be too demanding.
Because the BMI distributions differed most between males and females, I first
simplify the model by conditioning on smoking and stratifying by sex, that is, I
introduce sex-specific transformations
The log-likelihood for this model with 20 parameters was found to be −43 602.03, a moderate reduction compared to the log-likelihood of the most complex transformation model (−43 564.30). Figure 7 shows only minor differences between the empirical and model-based conditional distribution functions. Thus, it seems that a more parsimonious model was found without paying too high a price in terms of log-likelihood reduction.
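The price of each simplification can be expressed as a drop in log-likelihood relative to the most complex model. The short calculation below uses only the values reported in the text; the dictionary labels are shorthand introduced here, and the assignment of −44 801.19 to model (3.2) and −43 564.30 to model (3.3) follows my reading of the surrounding paragraphs.

```python
# Log-likelihoods as reported in the text (larger is better).
loglik = {
    "normal, heterogeneous variance (3.2)": -44801.19,
    "stratified by sex and smoking (3.3)": -43564.30,
    "stratified by sex, smoking shift (3.4)": -43602.03,
}

benchmark = loglik["stratified by sex and smoking (3.3)"]

# Drop in log-likelihood relative to the most complex model.
drops = {name: benchmark - value for name, value in loglik.items()}

for name, drop in drops.items():
    print(f"{name}: {drop:.2f}")
```

Seen this way, moving from (3.3) to (3.4) costs about 38 log-likelihood units, whereas falling back to the heterogeneous normal model costs more than 1 200.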
Linear transformation model (3.4) with sex and smoking-specific shift, stratified by sex. The empirical (blue) and model-based (yellow) CDF of BMI given sex and smoking are shown
The conceptual problem with this model, however, is lack of interpretability of
the shift term
The parameterization
Interpretability does not come for free: the log-likelihood decreased further, to −43 639.74. However, the model-based and empirical conditional BMI distribution functions look very much the same as those presented in Figure 7 (additional plot not shown). The sex-specific BMI-independent odds ratios of smoking, compared to never smoking, are given in Table 2. Former smokers had, on average, a larger BMI than never smokers, and the effect was stronger for males. A similar effect was observed for male heavy smokers. Female light smokers showed a BMI distribution shifted to the left, compared with female never smokers.
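Why such odds ratios do not depend on the BMI value can be checked numerically. The sketch below assumes a logistic outer distribution function and the sign convention P(Y ≤ y) = expit(h(y) − β), under which exp(β) > 1 shifts the distribution to the right; both the convention and the baseline transformation h are assumptions made for illustration, not the fitted model (3.5).

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def h(y):
    # Hypothetical baseline transformation (illustration only).
    return -16.0 + 5.0 * math.log(y)


def cdf(y, beta):
    """P(Y <= y) under the assumed convention expit(h(y) - beta)."""
    return expit(h(y) - beta)


def odds_of_exceeding(y, beta):
    """Odds of a BMI larger than y."""
    p = cdf(y, beta)
    return (1.0 - p) / p


beta = math.log(1.3)  # hypothetical shift, i.e. an odds ratio of 1.3

# Odds ratio relative to the baseline group (beta = 0) at several cut-offs.
ratios = [
    odds_of_exceeding(y, beta) / odds_of_exceeding(y, 0.0)
    for y in (20.0, 25.0, 30.0)
]
```

All three ratios equal exp(β) = 1.3, whichever BMI cut-off is used. This is what allows a single number per smoking category to summarize a shift of the whole distribution.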
Linear transformation model (3.5) with sex and smoking-specific shift, stratified by sex. Odds ratios to the baseline category never smoking along with 95% confidence intervals for males and females are shown. Odds ratios larger than one indicate a shift of the BMI distribution to the right
Maintaining interpretability, one could go further and assume equal smoking
effects for males and females in the model
My aim is to estimate the conditional BMI distribution given sex, smoking, age and
the lifestyle variables alcohol intake, education, physical activity, fruit and
vegetable consumption, residence and nationality as explanatory variables
Transformation trees and forests
A transformation tree (Hothorn
and Zeileis, 2017) starts with an unconditional transformation model
Transformation tree. The conditional BMI distributions (depicted in terms of their densities) are given in each subgroup corresponding to the terminal nodes of the tree. Variables: education (edu) at levels mandatory (I), secondary (II) and tertiary (III); alcohol intake (agramtag)
Transformation forest. Likelihood-based permutation variable
importance for all variables in the forest. The
-axis shows the mean
decrease in the log-likelihood caused by permuting one variable before
computing the in-sample tree log-likelihood
A transformation forest (Hothorn and Zeileis, 2017) allows less rough conditional parameter
functions
The generic random forest algorithm essentially relies on multiple transformation
trees fitted to subsamples of the data, with a random selection of variables to
be considered for splitting in each node. Unlike the original random forest
(Breiman, 2001), a
transformation forest can be understood as a procedure assigning a parametric
model to each observation. For subject
On the downside, this black-box model makes it very difficult to understand the impact of the explanatory variables on the conditional BMI distribution. The likelihood-based permutation variable importance (Figure 9) indicated that only sex, age, education, physical activity and smoking have an impact on BMI, where again sex seems to be the most important variable. Age was a more important factor than education or physical activity, and thus the only numeric variable one needs to consider. The association between sex, smoking, age and BMI as described by the transformation forest is given in terms of a partial dependency plot of conditional deciles in Figure 10. In general, the median BMI increases with age, as does the BMI variance. For males, there seemed to be a level-effect whose onset depends on smoking category. Females tended to higher BMI values, and the variance was larger compared to males. There seemed to be a bump in BMI values for females, roughly around 30 years. This corresponds to mothers giving birth to their first child around this age. It is important to note that the right skewness of the conditional BMI distributions in Figure 10 renders conditional normal distributions inappropriate, even under variance heterogeneity.
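The idea behind a likelihood-based permutation importance can be sketched generically: permute one explanatory variable, recompute the in-sample log-likelihood and record the mean decrease. The code below is a simplified illustration with a fixed unit-variance normal model and simulated data; it is not the algorithm implemented for transformation forests, and all names are hypothetical.

```python
import math
import random


def loglik(model_mean, rows, y):
    """In-sample log-likelihood of a unit-variance normal model."""
    return sum(
        -0.5 * math.log(2.0 * math.pi) - 0.5 * (yi - model_mean(row)) ** 2
        for row, yi in zip(rows, y)
    )


def permutation_importance(model_mean, rows, y, column, repeats=20, seed=1):
    """Mean decrease in log-likelihood after permuting one column."""
    rng = random.Random(seed)
    baseline = loglik(model_mean, rows, y)
    decreases = []
    for _ in range(repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        decreases.append(baseline - loglik(model_mean, permuted, y))
    return sum(decreases) / repeats


# Simulated toy data: the response depends on the first variable only.
rng = random.Random(0)
rows = [(rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0)) for _ in range(200)]
y = [3.0 * x1 + rng.gauss(0.0, 0.5) for x1, _ in rows]


def model(row):
    # Stand-in for a fitted model that uses only the first variable.
    return 3.0 * row[0]


imp_x1 = permutation_importance(model, rows, y, column=0)
imp_x2 = permutation_importance(model, rows, y, column=1)
```

Permuting the informative variable lowers the log-likelihood, whereas permuting the ignored variable leaves it unchanged. As in Figure 9, such a ranking says which variables matter but nothing about the direction or size of their effects.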
This complex model would be sufficient if one were only interested in the estimation of conditional BMI distributions for persons with specific configurations of sex, smoking, age and the remaining explanatory variables. The variable importances can be used to rank variables according to their impact on the conditional BMI distributions but cannot replace effect measures, let alone an assessment of their variability. Communication with subject-matter scientists and publication of results in subject-matter journals require simplification of these models. Top-down transformation choice can help to find models of appropriate complexity, as will be seen in the next section.
Transformation forest. Partial dependency of sex, smoking and age. Conditional decile curves for BMI depending on age, separately for all combinations of sex and smoking, are shown
The analysis using transformation trees and especially transformation forests
revealed strong effects of sex and age; the latter variable was not considered
in the analysis presented in Section 3. A more structured model roughly as
powerful as the transformation forest must therefore allow the conditional
distribution of BMI to change with both sex and age in very general ways. The
remaining variables were less important, and one can hopefully cut some corners
here by assuming simple linear main effects for these variables. I start the
top-down search for a simpler model with a conditional transformation model of
the form
With 89 parameters, the log-likelihood −42 778.14 of model (4.2) was only slightly smaller than the log-likelihood of the transformation forest (−42 520.18). In a certain sense, this conditional transformation model can be seen as an approximation of the black-box transformation forest. The effects of sex, smoking and age, with all remaining variables being constant, are again best visualized using the conditional decile functions (Figure 11). The decile functions are now smooth in age due to the parameterization of the age effect in terms of Bernstein polynomials. For males, the BMI increased with age; the BMI reduction in males older than 65 years was not visible in the decile curves of the transformation forests (Figure 10). The slope was largest for young men up to 25 years, followed by a linear increase until the age of 65. The male BMI distribution was right skewed, with only a small increase in the variance towards older people. For females, a bump in the BMI distribution was again identified around the age of 30, corresponding to pregnancies and breast-feeding times. The effect seemed more pronounced in higher deciles. Right skewness and a variance increase towards older women can be inferred from this figure.
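The smoothness in age comes from the Bernstein polynomial basis. The sketch below shows how such a basis can be evaluated on the age range of the sample (18 to 74 years); the coefficients are hypothetical and the function names are introduced here for illustration only.

```python
import math


def bernstein_basis(x, order, lower, upper):
    """Evaluate the order + 1 Bernstein basis polynomials at x in [lower, upper]."""
    t = (x - lower) / (upper - lower)
    return [
        math.comb(order, k) * t ** k * (1.0 - t) ** (order - k)
        for k in range(order + 1)
    ]


def bernstein_poly(x, coefs, lower, upper):
    """Smooth function of x as a linear combination of the basis functions."""
    basis = bernstein_basis(x, len(coefs) - 1, lower, upper)
    return sum(c * b for c, b in zip(coefs, basis))


# Hypothetical coefficients for an age effect (illustration only).
coefs = [0.0, 0.8, 1.0, 1.4, 1.3]
ages = list(range(18, 75))
curve = [bernstein_poly(float(a), coefs, 18.0, 74.0) for a in ages]
```

The basis functions are non-negative and sum to one, so the curve starts at the first coefficient, ends at the last one and stays within the range of the coefficients in between. Shape constraints such as monotonicity can therefore be imposed through simple constraints on the coefficients.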
Conditional transformation model (4.2). Conditional decile curves for BMI depending on age, separately for all combinations of sex and smoking, are shown
The main advantage of this complexity reduction is the interpretability of the
regression coefficients
The term ‘distribution regression’ (Chernozhukov et al., 2013) is commonly
used to describe response-varying coefficients. In survival analysis, the term
‘time-varying coefficients’ is more typical. Here, a BMI-varying coefficient of
age is a means of simplifying the conditional transformation model (4.2).
In the simpler model, I assume a smoothly varying but sex-specific coefficient
of age
Distribution regression model (4.3). Conditional decile curves for BMI depending on age, separately for all combinations of sex and smoking, are shown
I extend the stratified linear transformation model (3.5)
with a sex-specific age effect and a linear predictor
The three columns presented in Table 3 refer to the same parameters, estimated by three models differing only with respect to the complexity of the age effect. The effects of smoking, alcohol intake, education, physical activity, fruit and vegetable consumption, residence and nationality were remarkably constant. Alcohol intake had no impact on the BMI in this study, and right shifts in BMI distributions were associated with low fruit and vegetable consumption, moderate and low physical activity, short education, being a foreigner or living in the German-speaking part of Switzerland. These conclusions can be drawn from all three models in the same way. The effects of smoking were less pronounced than the effects obtained in the initial analysis that ignored age and the lifestyle variables (Table 2). Light smokers had lower BMIs than never smokers; the remaining effects are questionable.
The core of top-down transformation choice is a family of decreasingly complex, yet
fully comparable, conditional transformation models. Model parameterization and
interpretation in the family of transformation models are always based on the
conditional distribution function
A unique feature of conditional transformation models is the ability to formulate, estimate, compare, evaluate, interpret and understand models seemingly as far apart as a normal linear model with constant variance and a transformation forest in the same theoretical framework. Straightforward answers to some questions that have plagued data analysis for decades, for example ‘Is it appropriate to assume normal errors?’ or ‘How should the response be transformed prior to analysis?’, are easily obtained from conditional transformation models.
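The question about normal errors can be made concrete with a short derivation. The symbols μ(x), σ and h are notation assumed here for illustration and need not match the numbered models above: a normal linear model with constant variance is a transformation model whose transformation function is linear in y.

```latex
\begin{align*}
\mathbb{P}(Y \le y \mid X = x)
  &= \Phi\!\left(\frac{y - \mu(x)}{\sigma}\right)
   = \Phi\bigl(h(y) - \tilde{\mu}(x)\bigr),
  \qquad h(y) = \frac{y}{\sigma}, \quad \tilde{\mu}(x) = \frac{\mu(x)}{\sigma}.
\end{align*}
```

Replacing the linear h by a monotone non-linear function relaxes the normality assumption while keeping the same model structure, so a comparison of the fitted h with a straight line answers the question directly.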
One practical and interesting question relates to the impact of the order
This tutorial did not address any issue regarding model estimation or model
inference. Details about maximum likelihood estimation in conditional transformation
models can be found in Hothorn
et al. (2017). Locally adaptive maximum likelihood estimation for
transformation trees and transformation forests has been introduced in Hothorn and Zeileis (2017).
More elaborate discussions of model parameterization in conditional transformation
models and of connections to other models can be found in Hothorn et al. (2014) and Hothorn et al. (2017).
Applications of conditional transformation models can be found in Hothorn et al. (2013),
Möst et al. (2014)
and Möst and Hothorn
(2015). An introduction to the
Reproducibility
Data from the Swiss Health Survey 2012 can be obtained from the Swiss Federal
Statistical Office (E-mail:
The code used for producing the results presented in this paper can be evaluated on a
smaller artificial data set sampled from the transformation forest by running
Acknowledgments
I thank the students participating in the course ‘STA660 Advanced R Programming’ that I taught in the spring semester of 2017 for producing the code underlying Figure 8 as part of their homework assignments. Parts of this article were written during a research sabbatical at Universität Innsbruck, financially supported by the Swiss National Science Foundation (grant number IZSEZ0_177091).
