Abstract
In this paper we develop self-consistent and smoothed dependent estimators for the cause-specific failure time density in a competing risks context, employed in the presence of both left-censored and right-censored data, while allowing for masking of the failure cause. Dependence will be incorporated between the failure times and both the censoring times and the masked causes with the use of both Kernel Regression and Multivariate Multiple Regression at each iteration of the algorithm. Our approach to modeling the cause-specific failure times is intended to be the most automated and data-driven approach possible.
Introduction
The theory of competing risks is employed by statisticians, actuaries (multiple decrement theory), engineers (reliability theory), demographers, biologists, and others. Scientists from these various disciplines who model time-to-event data with competing modes of failure will very likely encounter censored and/or masked data, as well as statistical dependence issues, in their work. The present work addresses a significant void in the literature by providing a nonparametric framework for modeling competing risks data while simultaneously allowing for censoring, masking, and the statistical dependence between these phenomena and the failure times themselves. The doubly-censored and nonparametric SC-CR Algorithm of Adamic (2010) will be utilized as the engine to generate the cumulative incidence probabilities. However, the SC-CR Algorithm will be enhanced at each iteration by employing either Kernel Regression or Multivariate Multiple Regression (MMR) to account for the dependence between the failure time distribution and both the censoring times and the masked failure causes, in such a manner as to maintain the statistical attribute of self-consistency for the resulting estimators. For illustrative purposes, the proposed models will be applied to a bivariate Trypanosomiasis data set.
This research fits well within the overall trajectory of scholarship that has taken shape in the niche area of nonparametric competing risks research. Nonparametric maximum likelihood estimators (NPMLEs) of the cumulative incidence for competing risks data were pioneered by Aalen (1976) and Kalbfleisch and Prentice (1980). Subsequently, Dinse (1982) proposed an NPMLE for right-censored and masked competing risks data, computed with the explicit use of the Expectation-Maximization (EM) algorithm of Dempster et al. (1977). Hudgens et al. (2001) first presented an NPMLE, estimated using an EM algorithm, for competing risks data subject to both interval-censoring and truncation. Jewell et al. (2003) and Groeneboom et al. (2008) followed this trajectory with similar studies of nonparametric maximum likelihood estimators for current status data. More recently, Adamic (2010), Adamic et al. (2010), and Adamic and Guse (2016) developed generalizations of Turnbull’s (1974, 1976) classical univariate algorithms for modeling competing risks. These models, based on Turnbull’s self-consistent algorithm, can be shown to be special cases of EM algorithms. Overall, distribution-free models that can be employed in a multiple decrement context have received relatively little attention to date, in part because of the complexity that censoring and masking introduce. Our research is geared towards developing models that are as automated as possible by keeping the number of assumptions to an absolute minimum, and the present work pursues this aim.
The SC-CR Algorithm for doubly-censored data
The first portion of Section 2, which summarizes the SC-CR Algorithm for doubly-censored data, is drawn predominantly from Adamic (2010) and/or Adamic et al. (2010). The steps, statements, and logic of the algorithm are direct generalizations of those of the single-variable algorithm in the standard textbook by Klein & Moeschberger (1997). The algorithm can be implemented as follows.
The SC-CR algorithm for doubly-censored data
Provide initial estimates of the overall survival probabilities at each Using the current estimates of
Using the results of the previous step, estimate the number of cause-specific failures at time
Compute
jointly for all of the time points
Partial masking can be introduced into the algorithm. For details, consult Adamic and Guse (2016). In terms of self-consistency, Adamic (2010) outlines a proof that the SC-CR Algorithms (for both the partially masked and completely masked cases) produce self-consistent estimators of the CIF’s for each failure mode. Although we strongly suspect that the CIF’s derived from the SC-CR Algorithms are also NPMLE’s, we will be content to rely on the statistical merits of self-consistency for the present work.
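To make the iterative structure of the steps above concrete, the following is a minimal sketch of the single-variable self-consistency iteration for doubly-censored data, in the spirit of the textbook algorithm that the SC-CR Algorithm generalizes. It is not the SC-CR Algorithm itself: it handles one failure mode only, ignores masking, and the function names, the tie-handling convention, and the toy counts are assumptions made purely for illustration.

```python
def product_limit(events, right_censored):
    """Product-limit estimate of S(t_j) on an ordered time grid.

    events[j] may be fractional (expected counts); right_censored[j] is the
    number of right-censored observations at t_j. Left-censored
    observations are deliberately not passed in here.
    """
    surv, s = [], 1.0
    for j in range(len(events)):
        # Assumed convention: observations at t_j are still at risk at t_j.
        at_risk = sum(events[j:]) + sum(right_censored[j:])
        if at_risk > 0:
            s *= 1.0 - events[j] / at_risk
        surv.append(s)
    return surv


def self_consistent_survival(d, r, c, tol=1e-10, max_iter=10_000):
    """Iterate the doubly-censored self-consistency steps on one time grid.

    d[j], r[j], c[j]: observed failures, right-censored counts and
    left-censored counts at the j-th ordered time point.
    """
    m = len(d)
    # Initial estimate: ignore the left-censored observations.
    surv = product_limit(d, r)
    for _ in range(max_iter):
        # Probability mass currently assigned to each time point.
        mass = [(surv[j - 1] if j > 0 else 1.0) - surv[j] for j in range(m)]
        # Redistribute each left-censored count c[i] over the points
        # t_j <= t_i in proportion to P(X = t_j | X <= t_i).
        d_hat = [float(x) for x in d]
        for i in range(m):
            below = 1.0 - surv[i]
            if c[i] == 0 or below <= 0.0:
                continue
            for j in range(i + 1):
                d_hat[j] += c[i] * mass[j] / below
        # Product-limit estimate from the augmented failure counts.
        new_surv = product_limit(d_hat, r)
        if max(abs(a - b) for a, b in zip(new_surv, surv)) < tol:
            return new_surv
        surv = new_surv
    return surv


# Hypothetical counts at three ordered time points.
toy_survival = self_consistent_survival([2, 1, 3], [1, 0, 1], [0, 1, 2])
```

When there are no left-censored observations, the iteration stops after one pass and returns the ordinary product-limit estimate, which is a convenient sanity check on any implementation.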
Despite the novelty of modeling masked competing risks data using an EM-type algorithm, there is a significant drawback associated with the various approaches that have been developed to date. As noted by Hudgens et al. (2001), estimators of this type have the unwelcome property that the resulting estimates of the survival distribution are undefined over a potentially large set of regions. Indeed, the problem is even more acute in the multiple-decrement environment: the SC-CR Algorithms of Adamic (2010) and Adamic and Guse (2016) can be seen to converge only over a class of intervals that were dubbed cause-specific innermost intervals. To remedy this problem, we have chosen to generalize a univariate kernel density estimator found in Braun et al. (2005), which was used to fill in the gaps between the univariate innermost intervals created by the self-consistent EM algorithm of Turnbull (1976). The converged estimator of the failure rate distribution is often difficult to smooth, owing to the large gaps often exhibited between innermost intervals, as well as the attendant multi-modal dispersion of the probability distribution that typically arises when there are many gaps between the probability masses. As mentioned in Duchesne and Stafford (2001), adopting a kernel smoothed estimator at each iteration avoids the bias created by arbitrarily assigning probability mass at the right end points of the innermost intervals, as is also recommended by Pan (2000), and furthermore borrows more information from neighboring data points than would otherwise be the case. Duchesne and Stafford (2001) go on to state that, since the innermost intervals are effectively no longer present, the kernel modification moves the algorithm away from problem-causing areas, that is, areas where Turnbull’s algorithm can sometimes get stuck at local solutions (see also Li et al., 1997, for further details on this point).
For motivation, let us first assume the survival data are interval-censored. Stafford (2005), drawing on the work of Goutis (1997), argues that a natural extension of the standard kernel smoothing weight is to define
the kernel density estimate of
for a fixed kernel function
The function npreg, part of the nonparametric np Package in R, is used to execute the Kernel Regression, an approach based on Li and Racine (2003), Li and Racine (2004), and Racine and Li (2004). The function utilizes data-driven (sometimes referred to as automated) bandwidth selection methods. As noted by Li and Racine (2003), traditional nonparametric kernel methods presume that the underlying data is continuous in nature, which is frequently not the case (and is not the case in the way we are regressing on censored and masked data, which are indicator based, and hence categorical), and so they develop their methodology using what they call generalized product kernels. Further details regarding the npreg function can easily be found online in the R documentation for the np Package.
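As a rough illustration of what a kernel regression with a generalized product kernel computes, the sketch below implements a local-constant (Nadaraya-Watson) estimate with one continuous regressor (time) and one unordered categorical regressor (an indicator). It is not the npreg implementation: the Gaussian kernel, the Aitchison-Aitken form of the categorical kernel, the fixed bandwidths `h` and `lam`, and all function names are assumptions for illustration, whereas npreg selects its bandwidths by data-driven methods.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, used for the continuous regressor."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def aitchison_aitken(x, xi, lam, n_levels):
    """Unordered categorical kernel: 1 - lam on a match, lam / (c - 1) otherwise."""
    return 1.0 - lam if x == xi else lam / (n_levels - 1)


def kernel_regress(t0, z0, times, indicators, responses, h, lam, n_levels=2):
    """Local-constant estimate of E[response | time = t0, indicator = z0]."""
    num = den = 0.0
    for t, z, y in zip(times, indicators, responses):
        # Product kernel: continuous weight times categorical weight.
        w = gaussian_kernel((t0 - t) / h) * aitchison_aitken(z0, z, lam, n_levels)
        num += w * y
        den += w
    return num / den
```

With `lam = 0` the categorical kernel gives zero weight to observations in a different category, so the fit uses only matching observations; larger values of `lam` pool information across categories.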
Using a kernel smoothing mechanism at each iteration of the SC-CR Algorithm, the density estimate at the
conditioning on all observations,
where it is understood that the estimator is derived from the particulars of the chosen Kernel Regression routine. The span,
The following theorem shows that as the bandwidth tends to zero, the smoothed estimator of the CIF approaches the self-consistent CIF. The theorem statement, steps, and logic of the proof are direct generalizations of an analogous univariate proof from Braun et al. (2005).
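The limiting behaviour described here can be checked numerically in a simplified setting. The sketch below smooths a step-function CIF, whose probability masses sit at a few atoms, with a Gaussian kernel and evaluates it as the bandwidth shrinks. The atoms, masses, kernel choice, and function names are hypothetical, and the smoother is a plain kernel CDF rather than the regression-based estimator of this paper, so the example illustrates only the idea of the limit.

```python
import math


def normal_cdf(u):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def step_cif(t, atoms, masses):
    """Unsmoothed CIF: total mass at atoms no larger than t."""
    return sum(p for a, p in zip(atoms, masses) if a <= t)


def smoothed_cif(t, atoms, masses, h):
    """Kernel-smoothed CIF: each atom's mass is spread by a Gaussian kernel."""
    return sum(p * normal_cdf((t - a) / h) for a, p in zip(atoms, masses))


# Hypothetical cause-specific masses (they sum to less than one, as a
# sub-distribution for a single cause would).
atoms, masses = [2.0, 5.0, 9.0], [0.2, 0.1, 0.3]
gaps = {h: abs(smoothed_cif(6.0, atoms, masses, h) - step_cif(6.0, atoms, masses))
        for h in (2.0, 0.5, 0.01)}
```

At a point that is not itself an atom, such as t = 6 here, the entries of `gaps` shrink towards zero as `h` decreases.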
Let
since
The kernel weight function is also a valid PDF that approaches the empirical distribution, and so,
since
The multivariate linear regression model is
where
with
or, summarily,
Using least squares estimates
The orthogonality present among the residuals, predicted values, and columns of
This property is useful, as we want to explicitly maintain the statistical attribute of self-consistency when adding the dependent smoothing at each iteration of the SC-CR Algorithm. A proof to this effect is as follows.
Generalizing and adapting the preceding notation, the MMR estimator at a failure time
By keeping track of the censoring and masking types at each time point, we can construct indicator (or categorical, as necessary) regressors just as we did for Kernel Regression. We can then run the modified SC-CR Algorithm as before, this time using MMR at each iteration.
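The least-squares properties that this section relies on, namely residuals that are orthogonal to the regressors and to the fitted values, can be illustrated with a small sketch. The code fits a multivariate multiple regression through the normal equations using plain Python lists. The design matrix (an intercept, a masking indicator, and a censoring indicator) and the two response columns are invented toy values, not the trypanosomiasis data, and the helper names are hypothetical.

```python
def transpose(a):
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a X = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mmr_fit(Z, Y):
    """Least-squares coefficients, fitted values and residuals for Y = Z B + E."""
    Zt = transpose(Z)
    B = solve(matmul(Zt, Z), matmul(Zt, Y))
    fitted = matmul(Z, B)
    resid = [[y - f for y, f in zip(yr, fr)] for yr, fr in zip(Y, fitted)]
    return B, fitted, resid


# Toy design: intercept, masking indicator, censoring indicator (hypothetical).
Z = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1], [1, 1, 0], [1, 0, 1]]
# Toy responses: one column per failure cause (hypothetical values).
Y = [[0.10, 0.02], [0.15, 0.05], [0.20, 0.04],
     [0.25, 0.09], [0.18, 0.03], [0.12, 0.07]]
B, fitted, resid = mmr_fit(Z, Y)
```

Because the design includes an intercept column, the residuals in each response column sum to zero, so the fitted values carry the same total as the observed responses.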
Application to a data set
Trypanosoma brucei is a parasite that causes the rare disease African trypanosomiasis, colloquially referred to as African sleeping sickness. There are two forms that the disease can assume: the neurological form (N) and the lymphatic-sanguine (LS) form. These will comprise the two competing modes of failure, where the failure time,
Summary of the trypanosomiasis data
The SC-CR Algorithm was applied to the data. The raw results are given in the first and second columns of Table 2 under the subheading “Unsmoothed”. As can be gleaned from the abundance of zeros, there are many gaps between the cause-specific innermost intervals. Therefore, the use of Kernel Regression smoothing is especially opportune. The first Kernel Regression used only the masking information for creating the regressors, not the censoring. Masking regressors (or covariates) can easily be created by assigning an indicator of 1 or 0 according to whether a specific cause (in this case, cause N or LS) was possible at each failure time point in the data set. The npreg Kernel Regression routine was then executed at each iteration of the SC-CR Algorithm.
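The construction of the masking regressors described above can be sketched in a few lines. The representation of each observation as a set of possible causes, the function name, and the three example observations are assumptions for illustration only and are not taken from the trypanosomiasis data.

```python
CAUSES = ("N", "LS")


def masking_indicators(possible_causes, causes=CAUSES):
    """Return one 0/1 regressor per cause for a single observation."""
    return [1 if cause in possible_causes else 0 for cause in causes]


# Hypothetical observations: cause known to be N, known to be LS, and masked.
observations = [{"N"}, {"LS"}, {"N", "LS"}]
design = [masking_indicators(obs) for obs in observations]
# design == [[1, 0], [0, 1], [1, 1]]
```

A masked observation, for which both causes remain possible, receives a 1 in both columns.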
Dependent kernel regression results
Converged CIF’s for the kernel regression models.
The first set of results is shown under the “Dep Masking” subheading in Table 2, with each result rounded to the nearest one-thousandth. Note that the innermost intervals are no longer present (the few zeros that remain, to the nearest one-thousandth, are due to their values being very small; they are not nil). In particular, for cause LS, many more meaningful failure probabilities are now available, with the additional benefit that they draw on the dependence between the type of masking and the failure time. Indeed, the lymphatic-sanguine (LS) risk was more associated with higher ages of onset of infection than the neurological form, N. Note also that the magnitudes of the failure probabilities are themselves revealing. For example, consider age 50. Before the kernel modification, it was not known whether failure due to cause N or cause LS was more likely (since both estimates were zero). Now, we can estimate that infection at age 50 is more than twice as likely to be due to form N as to form LS.
The next Kernel Regression was fit using only the censoring type, expressed in terms of regressors; the results are summarized under the subheading “Dep Censoring” in Table 2. Creating the regressors was again straightforward, as there were only left-censored and exact observations in this data set (that is, indicator variables sufficed). As can be seen from a cursory inspection of the data, left-censoring was, on average, far more common than exact observation at the higher ages. Accounting for this with a dependent smoothing mechanism is therefore advantageous, leading to more accurate estimates of the failure probabilities than an independent kernel approach. Interestingly, the results for cause N under dependent censoring were almost the same as those under dependent masking, especially at the lower ages.
MMR dependence model results
Converged CIF’s for the MMR dependence models.
The final Kernel Regression was fit to all of the created regressors, whether masking-based or censoring-based. The final two columns of Table 2 furnish the results. For cause N, the probability distribution is very consistent with the results from the previous two model fits. For cause LS, however, virtually no smoothing ensued. One possible explanation is over-parameterization of the model in this case (i.e., too many regressors for a relatively small number of data points), but this explanation has not been verified. Figure 1 depicts the final CIF curves for all three Kernel Regression fits.
The entire process can be performed again, this time using MMR instead of Kernel Regression. Table 3 presents the results. The results from the MMR fits were similar in many ways to those of the kernel approach, but different in other respects. In terms of similarities, the results for cause N closely mimic the pattern exhibited by the kernel approach: the probability distributions were roughly equivalent, and the fit using both masking and censoring information again produced results similar to those obtained when only masking or only censoring information was used. However, the results were very different for cause LS: less smoothing occurred for the MMR model fits when only censoring or only masking information was used, whereas more smoothing emerged for the MMR fits when both masking and censoring were employed together. Figure 2 plots all of the resulting MMR fits.
The main advantages of the MMR approach over a kernel-based method are (a) ease of understanding; (b) ease in adopting further enhancements such as adding, say, interaction terms between the created masking and censoring covariates, if desired; and (c) computational efficiency. Indeed, this was corroborated by experience, as the algorithms converged much more quickly using the MMR approach. The main advantage of the Kernel Regression over the MMR approach is that it is entirely nonparametric, with the user not required to make any distributional/parametric assumptions whatsoever.
The purpose of this paper was to relax the restrictive independence assumption commonly invoked between failure times and the censoring and masking variables found in nonparametric competing risks modeling. This was achieved by incorporating dependent smoothing of the cause-specific failure time density into the SC-CR paradigm, while maintaining the statistical attribute of self-consistency. Dependence was incorporated between the failure times and both the censoring times and the masked causes by employing both Kernel Regression and Multivariate Multiple Regression at each iteration of the SC-CR Algorithm. Our approach to modeling the CIF’s in a multiple decrement setting is intended to be the most automated and data-driven approach possible, and in this respect is unrivaled in the literature to date.
Future work
In terms of future work, we note first that the dependent smoothing enhancements can also be incorporated into the interval-censored version of the SC-CR Algorithm of Adamic et al. (2010). This would represent a valuable contribution to the literature, as interval-censoring is extremely commonplace in practice. A second avenue to explore would be whether or not a multivariate copula approach could be used to account for the dependence between failure times with censoring and/or masking, as opposed to the regression approaches we have adopted in the present work. We have already made reference in the introduction to active research in copula-based competing risks scholarship, and a thorough investigation as to whether these methods can be used in the SC-CR framework is certainly warranted. Specifically, the use of a fully nonparametric copula would be in keeping with our goals to provide models that require the least number of assumptions for the analyst. Some research in this regard in the survival analysis sphere has already begun to take shape; a good example is Gribkova and Lopez (2015), who espouse a nonparametric copula approach under bivariate censoring.
Acknowledgments
This work was supported by the Government of Canada, Natural Sciences and Engineering Research Council of Canada under a 2017 Discovery Grant, RGPIN-2017-05595: Actuarial Modeling of Competing Risks Under Various Dependence Structures.
Appendix
Proof that E[E[X
Let
Thus, it follows that
