Abstract
While extensive research has been devoted to univariate quantile regression, this is considerably less the case for the multivariate (longitudinal) version, even though there are many potential applications, such as the joint examination of growth curves for two or more growth characteristics, such as body weight and length in infants. Quantile functions are easier to interpret for a population of curves than mean functions. While the connection between multivariate quantiles and the multivariate asymmetric Laplace distribution is known, it is less well known that its use for maximum likelihood estimation poses mathematical as well as computational challenges. Therefore, we study a broader family of multivariate generalized hyperbolic distributions, of which the multivariate asymmetric Laplace distribution is a limiting case. We offer an asymptotic treatment. Simulations and a data example supplement the modelling and theoretical considerations.
Introduction
Since the pioneering work of Koenker and Bassett (1978), quantile regression has taken a prominent role in both theoretical and applied statistics, as alternative to classical mean regression. While mean regression models solely grasp the central behaviour of the data, quantile regression allows to examine the effect of a set of covariates on different quantiles of a response variable. Quantile regression has also been shown useful when the distribution of the response is skewed rather than symmetric, when the data contain outliers, or when flexibility to the error distribution is important. For non-crossing curves, the mean curve is typically different in shape from all subject curves (Molenberghs and Verbeke, 2005), often causing confusion, whereas this is not the case for median and other quantile curves. Furthermore, growth curves or other profiles are often examined together, such as body length and body height in growing infants, or blood pressure and other vital parameters in patients. We refer to Koenker (2005) for a book length treatment of the methodology of quantile regression.
Whereas classical means are obtained by minimizing the quadratic loss function, the quantile
where
We focus on multivariate and/or longitudinal data, in this article. The methodology is illustrated on diabetes data in Section 6. The dataset consists of information of 2 495 patients with type 2 diabetes mellitus that were enrolled in the Leuven diabetes project. It contains several variables that were measured at the beginning of the programme and one year later. Among these, LDL-cholesterol (low-density lipoprotein cholesterol, mg/dl) and HbA1c (glycosylated haemoglobin, %) are some of the most important measures to evaluate how well diabetes is controlled. A reduction of these endpoints indicates an improvement in the physical condition of the patients due to the treatment. Therefore, we are interested in studying the joint evolution of these outcomes in time, and to explore the effect of some potentially disease-related characteristics of the patients. The joint modelling allows to address the association between both responses in the estimation process. The data are, therefore, multivariate and longitudinal in nature.
In this article, we consider a quantile regression model with multivariate and/or longitudinal data, and we propose a novel way to take the longitudinal structure of the data into account in the estimation of the model. To do this, first recall the well-known equivalence in the univariate case between the minimization of the check function and the maximization of the likelihood based on an asymmetric Laplace distribution (see Kotz et al., 2001). Indeed, in the univariate case, using the parametrization of Poiraud-Casanova and Thomas-Agnan (2000), if the density of
then it is easily seen that
The article is organized as follows. In the next section, we give an overview of the existing literature on quantile regression with longitudinal data. In Section 3, we define our multivariate quantile regression model, and propose an estimator of the model parameters via the maximization of the MGH likelihood. We also show the asymptotic normality of the proposed estimator. Section 4 describes some of the computational issues related to the proposed estimator, whereas a detailed simulation study and the analysis of diabetes data are studied in Sections 5 and 6, respectively. Some conclusions and ideas for future research are given in Section 7. Finally, additional simulation results and technical details are exhibited in the Supplementary Materials available online.
The existing literature on quantile regression for multivariate and/or longitudinal data can be roughly classified into two groups of articles, namely those articles that only posit a model for the quantile of interest, and those that impose a model on the entire distribution. Only assuming a model for the quantile of interest is of course a less restrictive assumption, but the existing approaches suffer from one or several possible drawbacks. For example, in some articles the independence working model is assumed, in which case conventional quantile regression techniques can be used. They lead to consistent, yet inefficient estimators, due to the incorrect specification of the variance structure. So, while an independence working assumption does not in itself imply that zero correlation is assumed to be correct, the intrinsic corrections that take place in such a semi-parametric approach and that ensure consistency, come at the price of considerable efficiency loss. We refer the reader to Lipsitz et al. (1997) and Chen et al., (2004), among others. These authors follow a semi-parametric approach whereby Lipsitz et al. (1997) also allows for incomplete data. Parente and Santos Silva (2016) examine the performance of the univariate model even when data are clustered. Another option is to add subject-specific fixed effects to the usual quantile regression model. This approach has the drawback that it potentially leads to a large number of parameters in the model. This is especially the case when the number of subjects is large compared to the number of repeated measures, in which case the estimation of these fixed effects is inconsistent. To bypass these problems, Koenker (2004) and Lamarche (2010) proposed penalized estimators, where the penalty is necessary to control the large number of fixed effects. An overview of this and other approaches in the econometrics literature, mainly based on fixed effects, can be found in Galvao and Kato (2017).
The second category of articles on quantile regression for longitudinal data contains articles that impose a model not only for the quantile of interest, but for the entire distribution. The proposed approaches are mostly likelihood based. One possibility is to replace the subject-specific fixed effects mentioned above by random effects. In that case, one essentially binds together univariate Laplace distributions by using normal or other types of random effects; see, for example, Geraci and Bottai (2007), Geraci and Bottai (2014), Liu and Bottai (2009), Yuan and Yin (2010), Lee and Neocleous (2010), Farcomeni (2012) and references therein. Note, however, that random effects are formulated at the subject level. In case one is interested in marginal functions as in our quantile regression case, often cumbersome integration over such random effects is needed. If these random-effects enter the model in a non-linear fashion, integration over them is not straightforward and, importantly, often does not provide closed forms. Moreover, the so obtained functions may be cumbersome to interpret in the light of the population under study, as it may not correspond to an observable quantile function.
Instead of specifying two distributions (the univariate distribution and the random effects distribution), it is therefore a valid and very sensible alternative to assume a marginal parametric model for the multivariate distribution of the response vector for a given subject, in which case the specification of the dependence between the observations is effectuated via the multivariate distribution. This facilitates a more direct and easier to interpret specification of the model when compared with the random-effects model. Waldmann and Kneib (2015) and Petrella and Raponi (2019) contributed to this research line. In the former article, the authors considered a Bayesian bivariate quantile regression model in which the asymmetric Laplace distribution is an auxiliary error distribution, whereas in the latter article the asymmetric Laplace distribution was replaced by a multivariate extension of it, proposed and studied by Kozubowski and Podgórski (2000) and Kotz et al. (2001). Petrella and Raponi (2019) showed via simulations that, in spite of the peaks and non-differentiability problems that are inherent to the latter distribution, it is possible to estimate the model correctly. They use an intuitive approach to this problem, by maximizing the likelihood as if it is a classical, well behaved likelihood. However, as Kotz et al. (2001) point out, there are some issues with the multivariate asymmetric Laplace (MAL) distribution, that require special treatment and attention. In particular, a peculiarity of this distribution is that it has an asymptote at the origin, the implications of which have so far not been carried forward to multivariate data analysis. One of the special features of the likelihood surface is that in certain cases it contains a ‘minefield’ of spikes, leading to problems in the maximization of the likelihood and even more in the calculation of the gradient and Hessian matrix. We solve these issues using the link between the MAL distribution and the multivariate generalized hyperbolic (MGH) distribution. Furthermore, we show that the estimator is asymptotically normal and that standard errors can be computed via this minor modification in the likelihood function.
Model and methodology
Multivariate generalized hyperbolic distribution
Suppose that
where
This formulation is generic and encompasses several special cases. To see this, write
We are interested in conditional quantiles of order
An important property of the MAL distribution is that its marginals are univariate asymmetric Laplace distributions (see Kotz et al. 2001), and hence this distribution is a natural extension of the univariate setting, while at the same time taking the dependence structure into account. Despite this important property, there are also serious concerns and drawbacks related to this model, which we explain in Section G of the Supplementary Material. They are all related to the fact that for
with density
for some
Note that this density is bounded for
If
where
We are interested in doing inference for
where
be the maximum likelihood estimator (MLE), where
In this subsection, we will develop the asymptotic theory for the MLE
where
and where
Note that if the model is correctly specified, the estimator
Computational aspects
The model and algorithm described in the previous section can be summarized as follows. First, in order to overcome the problems created by the MAL likelihood we propose to work with the MGH density defined in (3.2) above. Unlike the MAL density, the MGH density has no spikes and its log-likelihood given in (3.4) is therefore smooth in all its parameters for positive values of
Since the maximization of the log-likelihood cannot be achieved analytically, the maximum likelihood estimator is obtained iteratively. To do so, the estimator is implemented in the
To guarantee that the solution of
for
The initial values are obtained using a method-of-moments estimator (MME). First, we estimate the diagonal elements of
where
and
respectively, where
The covariance matrix of
and
respectively.
Note that the maximization is done using the MGH density after adding
The choice of
The algorithm does not always get trapped in a peak. Nevertheless, we recommend in any case to conduct a sensitivity analysis checking the estimates and standard errors for several values of
When
Simulations
The main objective of the simulations is to evaluate the performance of our proposal not only for estimating the parameters of the multivariate asymmetric Laplace distribution, but most importantly, for estimating the coefficients of the quantile regression with correlated data.
Therefore, we consider two simulation settings. First, we consider a bivariate asymmetric Laplace distribution as data-generating model for the outcome. Thereby, we illustrate that our estimator provides efficient estimates for the parameters of the MAL distribution. The main results of this simulation study are shown in Section B of the Supplementary Materials.
Second, we contemplate settings in the quantile regression context. Here, we compare our proposal with the univariate quantile regression estimator (UQR). The data-generating models and results of these simulations are presented in Sections 5.1 and 5.2, respectively.
Settings
The settings resemble longitudinal data with two measurements per subject. The bivariate data-generating model is:
with
where the covariance matrix
We are aiming to estimate the
where
For these scenarios, we set
Parameters of interest
The parameter of interest is
where
Estimators
We estimate the parameter of interest using the MLE with
Although the inclusion of a small value for
As a reference, we compare the MLE with the univariate quantile regression estimator (UQR). The comparison is done using the relative bias (RB) and the relative efficiency (RE) of
If
Results
In this section, we present the main findings of the estimators of
The relative bias and efficiency of the MLE for each element of
Bivariate normal and Student-
cases. Relative bias (%) and efficiency of the MLE with
for different values of
,
, and
Bivariate normal and Student-
cases. Relative bias (%) and efficiency of the MLE with
for different values of
,
, and
The coverage of the 95% confidence intervals of each element of
Bivariate normal and Student-
cases. Coverage of the 95% confidence intervals based on the MLE with
for different values of
,
, and
The Leuven diabetes project (LDP) is a randomized trial with before/after measurements and two intervention arms, conducted from January
Figure 1 displays the relationship between the LDL-cholesterol and HbA1c at the beginning of the programme and one year later. There is a moderate positive correlation (around 0.5) between each variable at both time points. However, the correlations between variables are considerably low. Furthermore, both variables show a positive asymmetric behaviour in their densities with a larger asymmetry for the HbA1c.
LDP data. Scatter plot matrix of the LDL-cholesterol and HbA1c at baseline (LDL.0 and Hba1c.0), and after one year on the programme (LDL.1 and Hba1c.1). The entries above the main diagonal displays the correlations, the entries below the main diagonal the scatter plots, and in the main diagonal the densities
LDP data. Scatter plot matrix of the LDL-cholesterol and HbA1c at baseline (LDL.0 and Hba1c.0), and after one year on the programme (LDL.1 and Hba1c.1). The entries above the main diagonal displays the correlations, the entries below the main diagonal the scatter plots, and in the main diagonal the densities
In the analysis, we consider the following covariates: gender, age, use of insulin, diabetes duration (in decades), and body mass index (BMI) of the patient. Unfortunately, there are missing values in both outcomes, roughly 15% in LDL-cholesterol and 6% in HbA1c, and in all covariates, ranging between 2% and 11%. For the analysis, the complete cases are considered, that is, patients without missing values neither in the outcomes and covariates in both measurement points. Therefore, the dataset is reduced to 1 562 patients.
Let
where
The selection of
Table 3 displays the parameter estimates for quantile levels
LDP data. Parameter estimates, standard errors, and
-values for the maximum likelihood estimator (MLE) with
, and the univariate quantile regression (UQR) for different quantile levels
. Important differences between the
-values for the two estimators are in bold
The results in Table 3 show that both estimators provide similar estimates and standard errors. Nevertheless, some differences can be found. For both endpoints, there is a significant negative effect of time in all quantile levels indicating that the treatment successfully improves patients’ conditions. However, this effect is consistently smaller for the MLE with a lower standard error. Similarly, the effect of the use of insulin is significantly negative for LDL-cholesterol, but positive for HbA1c. The former is stronger for the MLE, with a larger standard error, for quantiles
The new methodology opens doors to many extensions, improvements and adaptations, that will be studied in the future. For instance, we can extend the current methodology to a pairwise pseudo-likelihood approach (see, e.g., Hermans et al., 2018). This will reduce the multivariate log-likelihood to a bivariate (pseudo) log-likelihood, allowing us to use the methodology developed in this article for the special case of
A further extension exists in allowing for more flexibility in the model, by modelling and estimating the location parameter of the MGH distribution non-parametrically via for example (penalized) B-splines (see, e.g., Bollaerts et al., 2006) or local polynomial modelling (see, e.g., Gijbels et al., 2019). These nonparametric specifications of the location parameter can then also be used for model diagnostics, by comparing them with the linear fit by means of a suitable test statistic that measures the distance between the two fits.
Footnotes
Acknowledgments
The authors would like to thank Michael G. Kenward and Geert Verbeke for helpful suggestions.
Supplementary materials
The R-code for executing the simulations and the data analysis is available at
Additional results and technical details are exhibited in the Supplementary Materials available online (
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: The authors gratefully acknowledge Special Research Fund (Bijzonder Onderzoeksfonds) of Hasselt University [BOF14NI06], and the European Research Council (2016-2021, Horizon 2020 / ERC grant agreement No. 694409).
