Abstract
Clustered partly interval-censored survival data naturally arise from many medical and epidemiological studies. We propose a Bayesian semiparametric approach for fitting a mixed effects proportional hazards (PH) model to clustered partly interval-censored data. The proposed method allows for not only a random intercept as most frailty models do for clustered survival data, but also random effects of covariates. We assume a normal prior for each random intercept/random effect, seeing the instability of a gamma prior for a frailty in this situation. Simulation studies with data generated from both mixed effects PH model and mixed effects accelerated failure times model are conducted, to evaluate the performance of the proposed method and compare it with the three methods currently available in the literature. The application of the proposed approach is illustrated through analyzing the progression-free survival data derived from a phase III metastatic colorectal cancer clinical trial.
Keywords
Introduction
Partly interval-censored data often occur in health-related studies that involve periodic examinations for symptoms of interest such as oncology clinical trials, epidemiological studies, dental research, and studies of infectious diseases. With partly interval-censored data, the exact time-to-event is known for some subjects, while is only known to be within certain time intervals for the others. It is a combination of exact failure times and general interval-censored (Huang and Wellner, 1997) failure times; or equivalently, a combination of exact, left-censored, interval-censored, and right-censored failure times. For instance, in cancer clinical trials, progression-free survival, defined as time from randomization to disease progression or death due to any cause, is actually partly interval-censored as disease progression needs to be detected by periodic imaging assessments. Also disease-free survival, defined as the length of time a patient stays free of a disease or cancer after a particular treatment, is also partly interval-censored as the recurrence of cancer needs to be detected by imaging assessments too. Some of the main methodological papers for partly interval-censored data are Huang, (1999), Kim, (2003), Komáek and Lesaffre (2007), Zhao et al. (2008), Gao et al. (2017), Zhou and Hanson (2018a), Pan et al. (2020), and Pan and Cai (2020).
When we have patients treated at different medical centres or recurrent events within different patients, these patients or recurrent events may be dependent, that is clustered by centre or by patient. This requires the inclusion of a random intercept/random effects in our survival models to account for the dependency within clusters. Traditionally survival models with a random intercept/random effects are for clustered right-censored data. Klein (1992) proposed a semiparametric estimation method of the proportional hazards (PH) model with a random intercept based on the expectation-maximization (EM) algorithm. Aslanidou et al. (1998) proposed a Bayesian PH model with a shared gamma frailty. Sargent (1998) proposed a Bayesian PH model with a normal random intercept. Ripatti and Palmgren (2000) proposed a mixed effects PH model with the random effects following a multivariate normal distribution. Only during recent years, has there been growing attention to clustered interval-censored data. Goethals et al. (2009) developed a shared gamma frailty model where they integrated out the frailty analytically. Pan et al. (2015) proposed a Bayesian PH model for clustered interval-censored data with multiple gamma frailties. Yavuz and Lambert (2016) developed a Bayesian semiparametric shared-frailty model for clustered interval-censored data, assuming a normal and a flexible P-splines form for the frailty distribution. Chen et al. (2020) developed a multiple imputation approach to convert clustered interval-censored data to right-censored data and analyze it under the additive hazards model. As we can see, most of the methods for both clustered right-censored data and clustered interval-censored data contain only a shared-frailty, or equivalently, a random intercept.
There are only three publications in the literature that we are aware of for modelling clustered partly interval-censored data. Both Komáek and Lesaffre (2007) and Komáek et al. (2007) developed a Bayesian mixed effects accelerated failure times (AFT) model for clustered partly interval-censored data, with the random effects following a multivariate normal prior. They differ in how the error term (equivalently, baseline distribution) is modelled. The accompanying R functions that fit these two models are
In this article, we introduce an efficient and easy-to-implement Bayesian semiparametric approach for fitting a mixed effects PH model to clustered partly interval-censored data. The main differences between the proposed method and the three Bayesian methods mentioned above are: (1) Komáek and Lesaffre (2007) fit a mixed effects AFT model with the error term specified as a normal mixture where the number of components follows a Poisson distribution. Komáek et al. (2007) fit a mixed effects AFT model with the error term specified as a mixture of G-splines (normal densities with equidistant means and constant variances) where the number of components is user-specified. While we fit a mixed effects PH model with the cumulative baseline hazard function specified as a linear combination of basis I-splines. (2) Both Komáek and Lesaffre (2007) and Komáek et al. (2007) assume a multivariate normal prior for the vector of random effects, which forces all random effects to jointly follow a multivariate normal distribution as posterior. This distributional constraint may limit the exploration of sampling space during the Markov chain Monte Carlo (MCMC) updates of the random effects and the associated fixed effects. We have tried a multivariate normal prior for the random effects under the proposed method, and the estimated fixed effect with a corresponding random effect has a relatively large bias. (3) Zhou and Hanson (2018a)’s method only allows for a random intercept which assumes a normal prior. A mixed effects model with a general Q-variate random effects covariate vector is in demand to also account for random variations of treatment effect and risk factors. Such a mixed effects PH model has not yet been considered in the partly interval-censored data literature. (4) We have no way to evaluate how Zhou and Hanson (2018a)’s method would perform if random effects of covariates were included. In their MCMC algorithm, the vector of regression coefficients is drawn from a multivariate normal with mean equal to the current value and variance equal to sample variance with a small adjustment. If we imagine this sampler were also to be used for random effects, then it may face the same distributional constraint as in the Komáek and Lesaffre (2007)’s method and the Komáek et al. (2007)’s method. (5) Adding random effects of covariates to a random intercept/shared-frailty only model poses non-trivial challenges to both method development and MCMC algorithm computation. This will be demonstrated in Section 3, where the Komáek and Lesaffre (2007)’s method and the Komáek et al. (2007)’s method perform well when only a random intercept is included. However, the MCMC chain of a regression coefficient will not converge if a corresponding random effect is added.
The current work is related to our previous work (Pan et al., 2015, 2020; Pan and Cai, 2020), in the sense that they all use a monotone spline to flexibly approximate the cumulative baseline hazard function and use a two-step data augmentation based on a nonhomogeneous Poisson process to facilitate posterior computation. Pan et al. (2015) proposed a multiple-frailty model for clustered interval-censored data with frailty selection. Pan et al. (2020) proposed a Bayesian semiparametric PH model for partly interval-censored data. Pan and Cai (2020) proposed a PH model with spatial frailty for spatially dependent partly interval-censored data. Here our main advancement is to modify the model from Pan et al. (2020) by inclusion of random intercept and random effects, such that heterogeneity of baseline survival and heterogeneity of covariate effects are accounted for.
The article is organized as follows. Section 2 presents the proposed Bayesian semiparametric approach under the mixed effects PH model. Section 3 presents two simulation studies that compare the performance of the proposed method versus current models under two data-generating settings. In Section 4, we illustrate the application of the proposed method using the progression-free survival data derived from a phase III metastatic colorectal cancer clinical trial. Finally, Section 5 provides conclusions and discussions.
The model
Data structure
Suppose there are
Model and likelihood
We propose a Bayesian semiparametric approach for analyzing clustered partly interval-censored data under the mixed effects PH model. Specifically, it is assumed that the hazard function for the
where
For an exact observation
where
For a general interval-censored observation
where
Assuming that subjects are independent conditional on the random effects
Following our previous work (Pan et al., 2015, 2020; Pan and Cai, 2020), we approximate the unspecified cumulative baseline hazard function
where
Based on the fact that the baseline hazard function
Per definition in Ramsay (1988),
It would be difficult to draw MCMC samples from the posteriors derived directly from the conditional likelihood function in Equation (2.4). To facilitate posterior computation, we construct the following data augmentations in order to obtain more posterior distributions of standard forms.
Plugging the I-splines approximation of
It is impossible to extract a specific
from which we can extract
For the general interval-censored observations part, we introduce a two-step data augmentation. Let
where
Furthermore, based on the additive property of Poisson distribution and the additive form of
where
The overall augmented data likelihood then becomes:
which will enable efficient sampling in the posterior computation to be presented in the Appendix. The latent variables, spline coefficients, and hyperparameters can all be sampled from standard distributions. Only the posteriors of regression coefficients and random effects are of non-standard form, which will be sampled using the Metropolis-Hastings algorithm Hastings (1970).
The detailed MCMC algorithm and formulas of the posterior distributions are presented in the Appendix. The latent variables
Model comparison criteria
Spiegelhalter et al. (2002) proposed a Bayesian criterion for comparing complex hierarchical models: deviance information criterion (DIC). Assume a model for observed data
Geisser and Eddy (1979) proposed a Bayesian predictive approach to model selection. Assume there are several possible models
Simulation studies
Simulation I
We evaluate the performance of the proposed method through simulation studies. We fit the mixed effects PH model, the Zhou and Hanson (2018a)’s PH model with a random intercept with the
Failure times are generated from a PH model with a random intercept (Data 1) and a PH model with a random intercept and a random treatment effect (Data 2):
where
In fitting the proposed method, we set the degree
Table 1 presents the simulation results of fitting the proposed method,
Simulation Ia: Estimation of regression coefficients, ESS, NLLK, and DIC based on the proposed method, survregbayes , bayessurvreg1 , bayessurvreg2 and coxph under Data 1.
Simulation Ia: Estimation of regression coefficients, ESS, NLLK, and DIC based on the proposed method, survregbayes , bayessurvreg1 , bayessurvreg2 and coxph under Data 1.
As seen in Table 1, for Data 1, the proposed method and
Table 2 summarizes the simulation results of fitting the proposed method,
Simulation Ib: Estimation of regression coefficients, ESS, NLLK, and DIC based on the proposed method, bayessurvreg1 , bayessurvreg2 and coxph under Data 2.
Traceplot of MCMC chain of
– Left panel: Proposed method; middle panel: bayessurvreg1 ; right panel: bayessurvreg2 .
Figure 2 plots the estimated baseline survival curves versus the true under Data 1 and Data 2 based on the proposed method,
Plot of
– Left panel: Data 1 with a normal random intercept; right panel: Data 2 with a normal random intercept and a random treatment effect.
To give an advantage to the mixed effects AFT model by Komáek and Lesaffre (2007) and Komáek et al. (2007) (implemented by the R functions
where we set
We compare the proposed method, Zhou and Hanson (2018a)’s method (implemented by
As we can see in Table 3,
Simulation IIa: Estimation of regression coefficients, ESS, NLLK, and DIC based on the proposed method, survregbayes , bayessurvreg1 , bayessurvreg2 and coxph under Data 3.
Simulation IIa: Estimation of regression coefficients, ESS, NLLK, and DIC based on the proposed method, survregbayes , bayessurvreg1 , bayessurvreg2 and coxph under Data 3.
The results under Data 4 are presented in Table 4. Similar to what we have seen under Data 2, by ignoring the random treatment effect, the estimated standard error of treatment effect (
Simulation IIb: Estimation of regression coefficients, ESS, NLLK, and DIC based on the proposed method, survregbayes , bayessurvreg1 , bayessurvreg2 and coxph under Data 4.
Traceplot of MCMC chain of
– Left panel: Proposed method; middle panel: bayessurvreg1 ; right panel: bayessurvreg2 .
Figure 4 plots the estimated baseline survival curves averaged over the 100 datasets generated under Data 3 and Data 4 respectively, based on the proposed method,
Plot of
– Left panel: Data 3 with a normal random intercept; right panel: Data 4 with a normal random intercept and a random treatment effect.
The proposed method can be implemented with the main function
We analyze the data with a random intercept using the proposed method,
Metastatic colorectal cancer trial (
): Estimation result based on the proposed method, survregbayes , bayessurvreg1 , bayessurvreg2 and coxph .
Metastatic colorectal cancer trial (
): Estimation result based on the proposed method, survregbayes , bayessurvreg1 , bayessurvreg2 and coxph .
Figure 5 presents the estimated survival curves for the four subgroups formed by treatment arm and mutation status, based on the proposed method. The survival probability is obviously the highest for wild-type patients receiving panitumumab
Metastatic colorectal cancer trial (
): Estimated survival curves for four subgroups based on the proposed method.
There are currently only three methods in the literature for analyzing clustered partly interval-censored data: Komáek and Lesaffre (2007), Komáek et al. (2007), and Zhou and Hanson (2018a). We developed a Bayesian approach under the PH model which is flexible by its semiparametric nature. The proposed method outperforms the Zhou and Hanson (2018a)’s method by allowing for not only a random intercept but also random effects of covariates. As we have shown in Simulation I and Simulation II, if the random effect of a covariate exists, fitting the Zhou and Hanson (2018a)’s method would lead to a reduced estimated standard error for the covariate’s fixed effect. That could give a false positive inferential conclusion. The Komáek and Lesaffre (2007)’s method and the Komáek et al. (2007)’s method include both a random intercept and random effects of covariates. However, their methods may not quite work when the random effect of a covariate is included in the model, as we have shown in both Simulation I and Simulation II. For the regression coefficient
As we have seen in the literature review, more than half of the current publications assume a gamma frailty, for the benefit of a conjugate gamma posterior distribution. We also used gamma frailties for clustered interval-censored data (Pan et al., 2015). However, due to increased complexity of data structure, we have found that gamma frailties do not work well for our specific data structure and model here. The algorithm could get stuck due to some very large values generated from gamma posterior distributions for frailties. It makes sense if we think of the skewed nature of a gamma distribution. While with a normal distribution which is symmetric and hence less likely to give extreme values, we did not have this problem. Curious about a Bayesian nonparametric prior, we have also tried a Dirichlet process normal mixture prior (Görür and Rasmussen, 2010) for each random effect. Specifically, we assume each random effect to follow a normal distribution with mean zero and precision parameter following a Dirichlet process. The results are very similar to our current method and the algorithm is significantly more complicated due to the involvement of the blocked Gibbs sampler (Ishwaran and Zarepour, 2000; Ishwaran and James, 2001). After a certain number of MCMC iterations, all clusters converge to sharing the same precision parameter. So the Dirichlet process normal mixture prior reduces to a normal prior. We have also explored a multivariate normal prior for the vector of random effects. It turns out the estimation for fixed effects are not as good as the current method, which assumes a normal prior for each random intercept/random effect separately. We think the reason is that more constraints are actually being imposed if we force the joint distribution of the random effects to be of a specific form. Even though it has been a common practice for linear models, it may not be as good a choice for clustered partly interval-censored data. Komáek and Lesaffre (2007) and Komáek et al. (2007) assumed a multivariate normal prior for the random effects, and we have seen that the MCMC chain for the fixed effect with an associated random effect does not converge. Of course, it is unknown whether assuming a normal prior for each random intercept/random treatment effect separately under their model framework would solve this non-convergence problem. Yavuz and Lambert (2016) employed a flexible P-splines specification to approximate the distribution of a frailty, this nonparametric form could be something that is worthy of exploring in the future.
Appendix
After initializing values for the parameters, the proposed MCMC algorithm proceeds in the following steps.
Let
If
Sample
For
For
where
Sample
Sample
For a specific
For a specific
Supplementary Materials
Supplementary files are available online.
Supplemental Material for Bayesian semiparametric mixed effects proportional hazards model for clustered partly interval-censored data by Chun Pan and Bo Cai, in Statistical Modelling
Footnotes
Acknowledgements
The authors thank the reviewers and the associate editor for their insightful comments which greatly improved this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
CP was funded by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number SC2GM135078.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
