Abstract
Quantile regression quantifies the association of explanatory variables with a conditional quantile of a dependent variable without assuming any specific conditional distribution. It hence models the quantiles, instead of the mean as done in standard regression. In cases where either the requirements for mean regression, such as homoscedasticity, are violated or interest lies in the outer regions of the conditional distribution, quantile regression can explain dependencies more accurately than classical methods. However, many quantile regression papers are rather theoretical so the method has still not become a standard tool in applications. In this article, we explain quantile regression from an applied perspective. In particular, we illustrate the concept, advantages and disadvantages of quantile regression using two datasets as examples.
Introduction
Mathematically speaking, the probable [...] and the improbable [...] are not
different in kind, but only in frequency, whereby the more frequent appears
a priori more probable. But the occasional occurrence of the improbable does
not imply the intervention of a higher power, something in the nature of a
miracle, as the layman is so ready to assume. The term probability includes
improbability at the extreme limits of probability, and when the improbable
does occur this is no cause for surprise, bewilderment or mystification.
This is a quote from Max Frisch's most famous book Homo Faber, in which the main character meets his daughter, whose existence he did not know of, and falls in love with her. This is only one of the many coincidences in this book, or, as he puts it, one of the improbable events. Another one is a plane crash, which leads him to the reasoning quoted above. This example is not very close to our everyday experience. As an introduction to the topic of this tutorial article, we hence use a dataset on the body mass index (BMI) of Dutch males aged 0 to 21 (from now on referred to as the ‘Dutch boys’ dataset). The data will be described in detail in Section 2. The average BMI of the 7 294 observations is 18.027. Thus, we can say ‘the expected BMI for a Dutch male between 0 and 21 years is’ 18.027. However, reducing an experiment to its expectation is exactly what the main character of our novel criticizes the ‘layman’ for, who is surprised by the occurrence of events at ‘the extreme limits of probability’. Furthermore, those extreme limits are highly relevant for the BMI data, since obesity and (for the Dutch population perhaps less so) underweight matter more than the simple question of the average weight. But what are those ‘limits of probability’ and how can we capture them? A histogram of the data, together with a fitted Gaussian distribution, is displayed in the left panel of Figure 1. Distributions can be described quite accurately by their moments, that is, expectation, variance, skewness and so on, as well as by a list of quantiles.
Histogram of the BMI from the Dutch boys dataset with the Gaussian
distribution with mean and standard deviation taken from the dataset
In the case of the Gaussian distribution, just as for most distributions from a parametric family, those statistics are known and can be calculated for arbitrary parameters. As we can see in the left panel of Figure 1, however, the shape of the fitted Gaussian distribution is quite different from that of the observed distribution of the data, as the histogram is clearly asymmetric in comparison to the distribution curve. When we are interested in the extreme values of the dataset (e.g., the 95% quantile), we would hence refrain from using the quantiles calculated via the distribution and simply use the quantile of the data itself: while the 95% quantile of the fitted normal distribution is 22.809, the 95% quantile of the data is 23.612. This is a relevant difference; the approximation with the Gaussian distribution would substantially misinform us. Researchers, however, hardly ever deal only with the simple description of a univariate dataset, but are more interested in the influence of covariates on a dependent variable or in prediction based on independent variables. In our case, we could imagine wanting to predict the BMI depending on the age of an individual. The right panel of Figure 1 displays the scatterplot of age and BMI. The asymmetry already visible in the univariate representation also shows in the scatterplot. When applying classical regression to this relation, however, we would assume exactly the same thing as when plainly approximating our histogram with a Gaussian distribution: that the BMI conditional on age is normally distributed. This assumption then underlies all statistics we derive from the model (such as the conditional 95% quantile). The solution to this problem is to use quantile regression, that is, to calculate the impact of covariates on quantiles directly, rather than assuming an underlying conditional distribution. This tutorial article explains what quantile regression is and how it can be calculated.
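The gap between a quantile implied by a fitted Gaussian distribution and the quantile of the data itself can be reproduced with a minimal sketch. The following Python snippet is only an illustration under stated assumptions: it uses the standard library rather than the R code from the supplementary material, and the sample is synthetic, right-skewed data with made-up parameters, not the Dutch boys dataset.

```python
import random
from statistics import NormalDist, mean, quantiles, stdev

random.seed(1)

# Synthetic, right-skewed values (an assumption; NOT the Dutch boys data).
sample = [14.0 + random.lognormvariate(1.2, 0.6) for _ in range(20000)]

# 95% quantile implied by a Gaussian with the sample mean and standard deviation.
fitted = NormalDist(mean(sample), stdev(sample))
gaussian_q95 = fitted.inv_cdf(0.95)

# 95% quantile taken directly from the data (95th of the 99 percentile cut points).
empirical_q95 = quantiles(sample, n=100)[94]

print(f"Gaussian-implied 95% quantile: {gaussian_q95:.3f}")
print(f"Empirical 95% quantile:        {empirical_q95:.3f}")
```

For a sample with a long right tail, the Gaussian approximation tends to place the upper quantile too low, which is the same direction of error as reported for the BMI data.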
The remainder of the article is structured as follows: the second section deals with a univariate quantile regression model, illustrated by the aforementioned dataset on the BMI of Dutch boys. The dataset will be analysed with both a standard regression method and quantile regression, and the differences will be discussed. The third section presents a second example dataset, which is then used to show an additive quantile regression model containing different types of covariates. In the fourth section, different estimation methods and related models are introduced. The last section consists of a short summary and a guideline on when to use quantile regression.
The supplementary material to this article includes the commented code for both examples. The models are estimated with component-wise gradient boosting (Hofner et al., 2017); for a more detailed explanation of this method, please see the tutorial on boosting (Mayr and Hofner, 2018).
Data example I: The BMI in the Netherlands
As a first illustrative example, we present the Fourth Dutch Growth Study which
was already mentioned in the introduction. The dataset originates from a
cross-sectional study that measures growth and development of the Dutch
population between 0 and 21 years. We use a subset of the data available in the
R package
The variable of interest in this example will be the BMI of the individuals with
a special focus on those who are overweight. Recall Figure 1, where the fitted Gaussian
distribution yields
When only analysing the BMI, we ignore the fact that its conditional distribution
varies with age, as can be seen in the right panel of Figure 1. The question
asked should therefore not be restricted to the simple quantile (2.1) but
should concern a quantile conditional on age:
In the following, both mean regression and quantile regression models will be applied to the presented dataset.
Mean regression
The conventional non-linear regression model estimating the association of
age with the expected value of BMI was calculated with R-package
Scatterplot of age and BMI of the dataset Dutch boys with the
regression line from a non-linear Gaussian regression model
The assumptions for conventional mean regression models which are violated
quite obviously in this dataset are homoscedasticity and the symmetry of the
Gaussian distribution. In addition to the mean curve, the quantiles which
are implied by the assumption of a Gaussian distribution are displayed for
When modelling the quantiles independently of distributional assumptions yet
conditional on the data (i.e.,
Scatterplot of age and BMI of the dataset Dutch boys. Left hand
side: the bold line is the regression line from a non-linear
Gaussian regression model, the other lines depict the
quantiles of the mean regression model. Right hand side: the lines
depict the
quantile regression models
A third feature of quantile regression is the robustness against outliers. As can be seen in Figure 4, the median regression simply divides the data into two 50% parts. This figure also shows that the values scattered further away from the dense central cloud (e.g., those close to 35) only have an influence on the 99% quantile regression curve and not the more central curves. This, however, is not a problem since those are the data points that should be captured and described when looking at the 1% of the population with the highest BMI. The fact that the mean regression is above the 50% quantile regression curve is due to the long tail of the conditional distribution of the BMI.
Scatterplot of age and BMI of the dataset Dutch boys. The solid line displays the median regression and the dashed line displays the mean regression
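The robustness argument can be checked numerically in a simplified setting. The sketch below is an assumption-laden illustration: it uses Python's standard library, synthetic data with made-up parameters, and unconditional quantiles without any covariate, so it does not reproduce the fitted curves in the figures. It pushes the largest 1.5% of the observations further out and compares the summaries before and after.

```python
import random
from statistics import mean, quantiles

random.seed(7)

# Synthetic values standing in for a response variable (not the article's data).
values = [random.gauss(18.0, 2.5) for _ in range(2000)]


def summary(data):
    """Return the mean and selected empirical quantiles of the data."""
    cuts = quantiles(data, n=100)  # 99 percentile cut points
    return {"mean": mean(data), "q50": cuts[49], "q75": cuts[74], "q99": cuts[98]}


# Move the 30 largest observations (1.5% of the sample) 10 units further out.
ordered = sorted(values)
shifted = ordered[:-30] + [v + 10.0 for v in ordered[-30:]]

before = summary(values)
after = summary(shifted)

for key in before:
    print(f"{key}: before = {before[key]:.3f}, after = {after[key]:.3f}")
```

Only the mean and the 99% quantile react to the shifted observations; the median and the 75% quantile stay exactly where they were, because they depend only on the ordering of the central part of the sample.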
Data example II: Stunting in India
For the second data example, a dataset called
Modelling the Z-score of Indian children
The model presented in the previous section (i.e., a model with one continuous
explanatory variable) is a comparatively simple explanatory model. Just like
mean regression, however, quantile regression has been extended in many
different ways towards more complex model classes, for example, models with
measurement errors (Wei and Carroll, 2009), time series (Kley et al., 2016) and
additive, partly non-linear models (Lee et al., 2010). For an overview of the
progress that has been made, see Koenker (2017); for the range of types of
effects that can be included in structured additive quantile regression, see
Kneib (2013). In the following, we present an additive model for the dataset on
stunting in India described earlier. The model includes three covariates, which
are modelled in two different ways. While the influence of the age of the child
(just as for the Dutch boys dataset) and of the BMI of the mother will be
modelled by P-splines (Eilers and Marx, 1996), the geographical differences will
be modelled by a Markov random field. This method allows every district to have
its own effect on the dependent variable (or, in our case, on the quantile of
interest of the dependent variable), but it does not allow neighbouring
districts to differ too much from each other. This leads to a similar type of
penalization as in P-splines and thus also to a smooth surface (Fahrmeir et al.,
2004). The model formula hence looks like this:
Scatterplot of age and z-score of the dataset on stunting in India.
The lines depict the
quantile regression models adjusted for the rest of the data
Scatterplot of the BMI of the mother and z-score of the dataset on
stunting in India. The lines depict the
quantile regression models adjusted for the rest of the data
Effect of the district on three quantiles. The framed part contains the district of Bihar, which is
further displayed in more detail
The call
A set of models with different quantile levels
To this end, we use a centred version of the effect and then add the mean of the
conditional quantile (i.e., the average over the prediction for all individuals)
of interest. The different shape of the curves for different quantiles can be
seen, just like in Section 2. Especially, the steeper decline for the lowest
quantile (
The problem can be illustrated by a simple example: there are 7 women who have a
BMI higher than 35. Thus, between 35 and 40, all 11 functions are estimated
based on those 7 data points, which is very little information. We hence try to
construct functions that split 7 data points into 12 groups, without even
informing any one function about the position of the others. When the models are
estimated independently for the different quantiles (as we did here), this can even
lead to crossing quantiles. For references to methods that avoid crossing, see
Section 4. Figure 7
displays the spatial effects for the quantiles
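Whether independently estimated quantile curves cross can be checked directly on a grid of covariate values. The following Python sketch is purely illustrative: the helper `crossing_points` is hypothetical, and the two linear curves use made-up coefficients rather than estimates from the stunting data.

```python
def crossing_points(lower_curve, upper_curve, grid):
    """Return the grid values at which the lower-quantile prediction
    exceeds the upper-quantile prediction, i.e., where the curves cross."""
    return [x for x, lo, hi in zip(grid, lower_curve, upper_curve) if lo > hi]


# Hypothetical grid for the BMI of the mother (integers from 20 to 40).
grid = list(range(20, 41))

# Made-up predictions of two separately fitted quantile curves for the z-score.
q10 = [-3.00 + 0.04 * x for x in grid]  # lower quantile, rising with BMI
q20 = [-1.05 - 0.02 * x for x in grid]  # upper quantile, falling with BMI

crossings = crossing_points(q10, q20, grid)
print("Curves cross at BMI values:", crossings)
```

In this constructed case the two lines intersect at a BMI of 32.5, so the ordering of the quantiles is violated for every grid value above that point, which is the sparse region of the covariate where little data constrains the fits.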
In this article a gradient boosting approach was used to produce the results for the illustrative example. This could however have been done with many different approaches, which will be explained in the following.
Quantile regression has many advantages, but a major disadvantage is that parameters
are harder to estimate than in Gaussian or generalized regression. Inference on them
can get complicated because the estimators for coefficients are not available in
closed form. There are many ways to estimate quantile regression parameters; three
of them are sketched in the following. The first and original way is a
linear optimization algorithm, which was put forward by Koenker and Bassett (1978) together with
the original proposal. This approach is implemented in the R-package
Another form of regression that goes beyond simply modelling the mean but does not suffer from crossing quantiles is the generalized additive model for location, scale and shape (GAMLSS; Rigby and Stasinopoulos, 2005). For a tutorial paper on this topic see Stasinopoulos et al. (2018). In this type of regression, all parameters of the assumed distribution can be modelled by covariates. The model is estimated as a whole, hence problems similar to the quantile crossing issue cannot occur. This method is in some cases more stable than quantile regression (Kneib, 2013) and delivers similar results, as long as the assumed distribution is close enough to the real data distribution. This can be checked either graphically or by cross-validation. The Bayesian approach called distributional regression works analogously (Umlauf and Kneib, 2018). A related idea that has been explored less is the so-called Bayesian density regression (Dunson et al., 2007). In this case, the distribution of the dependent variable does not have to be specified beforehand but is drawn in a Dirichlet process mixture scheme. Further closely related models are the so-called conditional transformation models (Hothorn, 2018), which have the advantage of flexibly estimating the conditional distribution of the variable of interest.
There is one further model type which is very similar to quantile regression. Expectile regression is a generalization of mean regression in the same way that quantile regression is a generalization of median regression. The resulting regression functions minimize asymmetrically weighted squared residuals, where the weights are the same as the weights in quantile regression. The interpretation of those models is slightly harder than for quantile regression, yet they are very useful in the analysis of financial risks, for example, for estimating the expected shortfall (Taylor, 2008).
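The relation between the two loss functions can be made concrete in the simplest possible setting, a model without covariates. The Python sketch below is an illustration under assumptions: the function names are hypothetical, the data are made up, and the iteratively reweighted mean used for the expectile is one simple way to minimize the asymmetric squared loss, not necessarily the method used in the cited literature.

```python
def asymmetric_weight(residual, tau):
    """Weight tau for non-negative residuals and 1 - tau for negative ones."""
    return tau if residual >= 0 else 1.0 - tau


def check_loss(residual, tau):
    """Asymmetrically weighted absolute residual (quantile regression)."""
    return asymmetric_weight(residual, tau) * abs(residual)


def expectile_loss(residual, tau):
    """Asymmetrically weighted squared residual (expectile regression)."""
    return asymmetric_weight(residual, tau) * residual ** 2


def sample_quantile(values, tau):
    """Minimize the total check loss over constants.
    A minimizer is always attained at one of the data points."""
    return min(sorted(values),
               key=lambda c: sum(check_loss(v - c, tau) for v in values))


def sample_expectile(values, tau, iterations=100):
    """Minimize the total expectile loss by iteratively reweighted means."""
    estimate = sum(values) / len(values)
    for _ in range(iterations):
        weights = [asymmetric_weight(v - estimate, tau) for v in values]
        estimate = sum(w * v for w, v in zip(weights, values)) / sum(weights)
    return estimate


values = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up sample with one extreme value

print("0.5 quantile (median):", sample_quantile(values, 0.5))
print("0.5 expectile (mean): ", sample_expectile(values, 0.5))
print("0.9 quantile:         ", sample_quantile(values, 0.9))
print("0.9 expectile:        ", sample_expectile(values, 0.9))
```

With equal weights the check loss yields the median and the squared loss yields the mean; moving the weight away from 0.5 shifts both summaries towards the corresponding tail, with the expectile reacting to the size of the extreme value and the quantile only to its rank.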
Summary and Recommendations
Our goal in this article is to explain quantile regression in a simple way, in order to provide some guidance on deciding for which research questions it should be used. When interest lies only in the mean, its characteristics and predictions, standard tools already provide enough information. Quantile regression, on the other hand, is suitable for the following situations:
Acknowledgments
We gratefully acknowledge the support of the Interdisciplinary Centre for Clinical Research (IZKF) of the Friedrich-Alexander-University Erlangen-Nürnberg (Project J61). Special thanks go to Graeme Hickey for proofreading the article.
