Abstract
This research is motivated from the data from a large Selenium and Vitamin E Cancer Prevention Trial (SELECT). The prostate specific antigens (PSAs) were collected longitudinally, and the survival endpoint was the time to low-grade cancer or the time to high-grade cancer (competing risks). In this article, the goal is to model the longitudinal PSA data and the time-to-prostate cancer (PC) due to low- or high-grade. We consider the low-grade and high-grade as two competing causes of developing PC. A joint model for simultaneously analysing longitudinal and time-to-event data in the presence of multiple causes of failure (or competing risk) is proposed within the Bayesian framework. The proposed model allows for handling the missing causes of failure in the SELECT data and implementing an efficient Markov chain Monte Carlo sampling algorithm to sample from the posterior distribution via a novel reparameterization technique. Bayesian criteria,
Introduction
In many clinical trials and medical studies, often both longitudinal (repeated measurements of a response) and time-to-event (time until the occurrence of an event of interest) outcomes are collected along with some other baseline covariates. Studies dealing with such outcomes mainly concern to investigate how the change in longitudinal outcome is associated with different covariates and also to determine the relationship among the longitudinal outcome, the survival outcome and other covariates. The traditional random effects model (Laird and Ware, 1982) and Cox proportional hazards model (Cox, 1972) may be considered for analysing the longitudinal and survival outcomes separately. However, a separate analysis of these two outcomes often leads to biased estimates of the model parameters (Hu et al., 2009). It is due to the fact that the longitudinal measurements are often incomplete (missing) and are subject to measurement error (Tsiatis et al., 1995), which may induce informative censoring mechanism for the survival outcome (Wulfsohn and Tsiatis, 1997). Thus, treating longitudinal outcome as a time-dependent covariate in the survival model fails to capture the endogenity of the longitudinal process and may lead us to a wrong interpretation of the model parameters (Prentice, 1982). In such cases, joint modelling of longitudinal and survival outcomes (Faucett and Thomas, 1996; Wulfsohn and Tsiatis, 1997; Henderson et al., 2000; Rizopoulos, 2012) has been widely used in clinical studies.
The standard joint model assumes that the event of interest takes place due to only one cause. However, the event of interest may occur due to one of the many plausible causes under consideration. This study is motivated from the Selenium and Vitamin E Cancer Prevention Trial (SELECT), where the main variable of interest is the time-to-diagnosis of prostate cancer (PC) due to low-grade or high-grade. We consider the two grades of PC as two distinct causes of PC. In the SELECT data, prostate specific antigens (PSAs) of the subjects are collected longitudinally, which is considered to be an important biomarker for the PC. In this study, our goal is to quantify the improvement in the fit of the time-to-PC due to low or high grade after the inclusion of longitudinal PSA in the survival sub-model. We also intend to quantify the cause-specific association between the time-to-PC and longitudinal progression of PSA.
Joint modelling of longitudinal and competing risks survival data has been proposed in the literature. In the joint modelling setting, Elashoff et al. (2007); Elashoff et al. (2008) extended the cause-specific hazards model and the mixture model by allowing latent variables to adjust for the association between the longitudinal and survival outcomes. However, the development in this area has not been in line under the Bayesian paradigm. A Bayesian extension of the cause-specific hazards models to analyse longitudinal and survival outcomes has been proposed by Huang et al. (2011) and Hu et al. (2009). However, the earlier works do not account for the missing causes of failure, which is common in clinical studies.
In the SELECT data, there are a substantial number of subjects who developed PC without knowing whether PC should be attributed to low-grade or high-grade. This missing cause of PC could happen when subjects reported cancer diagnostics outside the studies home institution. Considering the subjects with unknown causes as missing or ignoring those subjects from the analysis may lead to biased parameter estimates and thus the inference could be misleading (Lu and Tsiatis, 2001). Particularly, when the missing percentage is substantial (e.g., about 26% for SELECT data), the analysis could be erroneous. Within the competing risks setting, multiple imputation technique has been proposed by Lu and Tsiatis (2001), while Gao and Tsiatis (2005) developed an inverse probability weighted complete case estimator. Within the joint modelling framework, the literature to account for the missing causes under the competing risks setting is still sparse. In this study, we propose a shared parameter joint model within the Bayesian framework. Our proposed joint model imputes the missing causes of PC within the Markov chain Monte Carlo (MCMC) sampling algorithm.
In the SELECT data, the substantially large sample size of 32 261 might also impact the performance of the joint model resulting in weak convergence of the variance covariance matrix of the random effects (Zhang et al., 2019). This can also influence the convergence of other parameters that depend on the random effects in MCMC sampling. To deal with this issue, we use the Cholesky factorization of the variance covariance matrix of the random effects similar to Zhang et al. (2019) and propose a novel reparameterization of the regression coefficients associated with the random effects under the survival sub-model, which leads to a convenient but efficient implementation of the MCMC sampling algorithm under the joint model.
One of the main goals of this study is to quantify the improvement in the fit of the competing risks survival data due to the inclusion of the longitudinal PSA. Several measures, for example, Akaike information criterion (AIC), Bayesian information criterion (BIC), widely applicable or Watanabe–Akaike information criterion (WAIC; Watanabe, 2010), deviance information criterion (DIC; Spiegelhalter et al., 2002), etc., have been routinely used to quantify the overall fit of the model under consideration. Under the joint modelling setting, in order to quantify the overall goodness of fit, it is often desirable to assess the separate contribution of each component of the joint model towards the overall fit. Zhang et al. (2014) developed a decomposition of AIC and BIC under the joint model of longitudinal and survival outcomes. A useful SAS macro has been developed by Zhang et al. (2016) that allows to use various flexible models under the joint modelling framework and computes the decomposition of AIC and BIC. Within the Bayesian framework, a novel decomposition of DIC and logarithm of the pseudo-marginal likelihood (LPML; Ibrahim et al., 2001) has been proposed by Zhang et al. (2017). To investigate the performance of our proposed model, we define a BIC,
The remainder of the article is organized as follows. The description of the SELECT data and some preliminary statistics are presented in Section 2. The development of the proposed methodologies is discussed in Section 3. The Bayesian inference and computation are presented in Section 4. The simulation study design and results are presented in Section 5. A detailed analysis of the SELECT data is carried out in Section 6. We conclude the article with a brief discussion in Section 7. The technical details of MCMC sampling are given in the Supplementary Materials (
Percentage distribution of PSA repeated measurements for different censoring status
Percentage distribution of PSA repeated measurements for different censoring status
Distribution of cancer status by cancer grade
Distribution of cancer status by cancer grade
The main variable of interest is the time-to-PC due to low-grade or high-grade (survival outcome), where the low- and high-grades are considered as two different causes of PC. The main objective of this study is to investigate the association between the longitudinal PSA (longitudinal outcome) and the competing risks survival outcome adjusting for other covariates. We include the cases in the analysis, in which the number of follow-up visits is at least two, leading to a total of 22 792 patients. The follow-up visiting times and event times are recorded in years for each patient. For the longitudinal PSAs, the median follow-up times are 4.02 years and 1.95 years, respectively, for the censored and observed patients with an overall median follow-up time of 3.96 years. The observed minimum, median and maximum survival times are 0.003, 8.071 and 14.249 years, respectively. The median number of follow-up visits for the patients is 6 with the minimum and maximum of 2 and 10, respectively. Figure 1 shows the percentage distribution of the number of repeated measurements for different censoring status. We observe that for low-grade and high-grade, the percentage for the patients with less than five repeated measurements is higher compared to those for censored and missing-grade patients.
In Table 1, the distribution of PC status and the causes are presented. It is observed that about 7% patients developed cancer among whom for more than 26% patients, the cause of PC is missing. In Figure 2, the longitudinal trajectories and corresponding survival times for eight selected patients are presented. The figure shows that with the increasing PSA trajectory, the survival time tends to be lower compared to those for non-increasing PSA trajectory cases. This motivates us to develop the model to investigate the association between longitudinal PSA measures and the time to PC due to two different causes. Also, the missing cause cases should also be taken care of by imputation in MCMC sampling for carrying out an appropriate inference of the SELECT data.
The joint model is comprised of a longitudinal sub-model and a survival sub-model. In the following subsections, the notation and the sub-models are defined under the general setting.
PSA trajectories and survival times for eight random patients
PSA trajectories and survival times for eight random patients
Suppose there are
where
This study is motivated from the PC data-SELECT (discussed in Section 2), where PSA measurements are collected longitudinally. The earlier studies (Ferrer et al., 2016) suggest that the logarithm of PSAs can be explained by a linear mixed effects model with a latent linear trajectory function of observation time. In this study, we consider a linear trajectory for the longitudinal sub-model defined in (3.1). Within the Bayesian framework, the posterior estimation of
where
where
Let
where
With the reparameterization of
where
With this reparameterization, the hazard model (3.4) for the
and the hazard function (3.5) for the unknown cause of failure is redefined as
We further assume that the baseline hazard function due to the
Let {
Let
where
In the following Section 4, we present the Bayesian estimation of the parameters under the joint model.
The Bayesian estimation of the proposed model involves sampling from the full conditional distributions, adaptive rejection sampling (Gilks and Wild, 1992) and Metropolis–Hastings sampling(Metropolis et al., 1953; Hastings, 1970). In the SELECT data, there are a substantial percentage of missing causes of failure. In MCMC sampling from the posterior distribution, we add one additional step to impute the missing causes of failure. Let
The probability for high-grade cancer can be calculated similarly. Let
The other components of the augmented likelihood remain the same as defined in (3.10). In the augmented data likelihood, the longitudinal and survival data are independent to each other conditional on the random effects. Thus, the posterior distributions of the longitudinal and survival sub-model parameters depend on the random effects, which are unobserved. We sample the random effects
where
In this article, the random effects
where
and the deviance function
The DIC for the survival sub-model with the above consideration
where
The following assessment criterion is defined
which quantifies the gain in the fit in the competing risks survival sub-model due to the inclusion of longitudinal data penalizing the additional parameters in the competing survival sub-model.
The effective number of parameters is evaluated as
Finally, the WAIC
where
The interpretation of the
Simulation design
We carry out a simulation study to examine the empirical performance of
where
For the survival sub-model, the same set of covariates is used as for the longitudinal sub-model and set
Thus the survival sub-model for the
We generate the true event time due to the
where
The design values of the survival sub-model parameters are given as True-SPM: In this scenario, we generate the data from the proposed share parameter joint model. In particular, we generate the longitudinal data from the model (5.1) and the survival data from the model (5.2) with True-Surv: In this scenario, the data are generated from the survival data only model by specifying
Posterior estimates under the joint model for True-SPM in Section 5.1
Posterior estimates under the joint model for True-SPM in Section 5.1
We repeat a similar analysis for the simulated data under the True-Surv scenario. To preserve the space, we present the simulation results under the proposed joint model and the survival data only model in Tables S.1 and S.2, respectively, in the Supplementary Materials (
The boxplots of
Boxplots of
DIC
under True-SPM and True-Surv model
Posterior estimates under the survival data only model for True-SPM in Section 5.1
Summary of
Boxplots of
WAIC
under True-SPM and True-Surv model
Summary statistics of continuous covariates at baseline by PC
Summary statistics of continuous covariates at baseline by PC
Summary statistics of categorical covariates by PC
The main goal of this study is to assess the association between the longitudinal PSA and the time-to-PC due to low-grade or high-grade after adjusting for the covariates under consideration. For the longitudinal sub-model, we consider the natural logarithm of PSA as the response variable. For both the longitudinal and survival data, the same seven covariates are used. We consider the shared parameter joint model, in which the random effects are assumed for the longitudinal and survival data. We consider the piecewise constant forms for the baseline hazards functions due to two different causes. To construct the appropriate partitions
The posterior estimates of
The survival sub-model parameter estimates under the joint model and the survival data only model are presented in Table 9. The results show that age is positively associated with the risk of high-grade cancer and not significantly associated with low-grade cancer, which is found to be consistent under both the joint model and the survival data only model. Both joint model and he survival data only model show that BMI is not associated with the risk of low-grade cancer. Although BMI is significantly positively associated with high-grade cancer under the survival data only model, BMI is not significantly associated with the risk of high-grade cancer under the joint model. Hispanic patients were found to have a lower risk of developing low-grade cancer compared to non-Hispanic patients and Hispanic status is found to be not associated with the risk of developing high-grade cancer. Although race of the patients is a significant factor for the cancer development under the survival data only model, race is not a significant factor associated with the risk of developing low-grade and high-grade cancer under the joint model. We also present the estimates and the 95% HPD intervals of
DIC
Posterior estimates and 95% HPD intervals of
In this article, a joint model for the longitudinal and competing risks survival data is proposed with an applications to SELECT data. Our proposed model is flexible to account for the uncensored subjects with missing causes of failure. We develop a novel reparameterization to facilitate an efficient and convenient implementation of the MCMC sampling algorithm to sample from the posterior distribution. The proposed model is applied to SELECT data and to assess the model fit, we introduce four model assessment criteria,
95% HPD interval of the
Posterior estimates, 95% HPD intervals, DIC
Footnotes
Supplementary materials
Supplementary materials for this article are available at
Acknowledgement
The authors wish to thank the editors and two referees for the insightful comments and suggestions, which helped to improve the article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was partially supported by US NIH grant #GM 70335.
