Time-to-event models are a popular tool to analyse data where the outcome
variable is the time to the occurrence of a specific event of interest. Here, we
focus on the analysis of time-to-event outcomes that are either intrinsically
discrete or grouped versions of continuous event times. In the literature, there
exists a variety of regression methods for such data. This tutorial provides an
introduction to how these models can be applied using open source statistical
software. In particular, we consider semiparametric extensions comprising the
use of smooth nonlinear functions and tree-based methods. All methods are
illustrated by data on the duration of unemployment of US citizens.
The objective of many statistical analyses is to model a duration time until a
specific event occurs. This is usually referred to as time-to-event
or survival analysis. In biostatistics, for example, one often
examines the time to death or the progression of a disease. In economics and the
social sciences, popular examples include the modelling of the duration of
unemployment or the time to retirement. Generally, in regression models for
time-to-event data, the event time itself is the response variable, and one wants to
investigate the association of the response with several explanatory variables. Most
often it is assumed in these analyses that the survival time is given by a random
variable measured on a continuous scale. This case has been studied
extensively in the literature; see, for example, Kalbfleisch and Prentice (2002) and Klein and Moeschberger
(2003). However, in practice, measurements of time are often discrete.
Durations, for example, are often measured in days, years or months. Moreover, there
are situations where the exact event time may not be known, but only an interval
during which the event of interest took place.
Here, we consider the application of regression models for discrete
time-to-event data, which are characterized by an ordinal response variable taking
the numbers ,
with equal to the number of event
times. These numbers either refer to a situation where event times are intrinsically
discrete (such as the time to pregnancy, which in clinical applications is usually
measured by the number of menstrual cycles) or when continuous event times have been
grouped. In the latter case, the numbers
refer to mutually exclusive time intervals , , ,
, with fixed
boundaries .
Generally, a great advantage of discrete time-to-event models is that they can be
viewed as regression models with binary response, giving rise, for example, to the
application of logistic regression or probit regression (Willett and Singer, 1993).
A comprehensive treatment of the statistical methodology for discrete time-to-event
data has recently been given by Tutz and Schmid (2016). Similar to Gaussian regression, a large part of
this methodology has been designed to estimate predictor-response relationships
using a linear combination of the explanatory variables. In
addition, Tutz and Schmid
(2016) discuss several (less well known) approaches for
semiparametric discrete time-to-event modelling. The aim of
this tutorial is to provide an in-depth explanation of how these semiparametric
models can be fitted and implemented using the R software for statistical computing
(R Core Team, 2017).
In particular, we will explain how smooth nonlinear functions and tree-based methods
can be incorporated into discrete time-to-event models.
A frequently observed phenomenon in time-to-event analysis is censoring. Generally, a
duration time is termed ‘censored’, if its total length has not been fully observed.
In this article, we consider the most common type of censoring, the so-called
type-I or right censoring, which means that
the beginnings of the duration times are observed for all individuals in a study,
whereas the respective ends are only observed for part of the individuals. Hence,
for some of the individuals, it is only known that the event occurred later than the
observed time.
All models discussed in this article will be illustrated by means of a publicly
available dataset on the duration of unemployment. The data comprise observations
obtained from
US citizens and were collected between 1986 and 1992 as part of the January Current
Population Surveys Displaced Workers Supplements (DWS). The original dataset is
available as part of the R add-on package Ecdat (Croissant, 2016). The
response variable that will be considered here is the time to re-employment in any
kind of job, which includes full-time, part-time or other kind of jobs. Due to the
study design, the observed unemployment durations are discrete, as they were
measured in two-week intervals. In this article, we will analyse the data over a
period of 40 weeks comprising 21 possible event times ,
where
refers to event times
weeks. Explanatory variables that will be included in our analyses are the age of
the US citizen in years (age), an indicator on whether an
unemployment insurance claim was submitted (ui), the eligible
replacement rate (reprate, defined by the weekly benefit
amount divided by the amount of weekly earnings in the lost job), the eligible
disregard rate (disrate, defined as the amount up to which
recipients of unemployment insurance who accept part-time work can earn without any
reduction in unemployment benefits divided by the weekly earnings in the lost job),
the log weekly earnings in the lost job in US$ (logwage) and
the tenure in the lost job in years (tenure). The summary
statistics of the six explanatory variables are presented in Table 1. Due to missing values in the
variables, some observations were excluded from the data, arriving at a sample
containing the data of citizens.
Summary statistics of the six explanatory variables used in the modelling
of the US unemployment data ()
Summary Statistics
Variable
age
20
27
34
35.45
43
61
reprate
0.06
0.39
0.50
0.45
0.52
2.05
disrate
0.01
0.05
0.10
0.11
0.15
1.02
logwage
2.70
5.29
5.68
5.69
6.05
7.60
tenure
0
0
2
4.11
5
40
ui
no: 1 437 ()
yes: 1 773 ()
Observed time to re-employment (measured in two-week intervals) in the US
unemployment data. The median observation time in the data is 4,
corresponding to a time period of eight weeks
The observed times to re-employment are visualized in Figure 1. If an individual is still jobless
at the end of the survey (i.e., after 40 weeks) or dropped out of the study before
finding a job, it is subject to right censoring. In this case, its observation time
corresponds to the censoring time, otherwise to the true time of
re-employment.
The analysis in this tutorial shows how the relationship between the chance of
re-employment and the explanatory variables can be estimated in a flexible way,
using tailored semiparametric models.
The rest of this tutorial is organized as follows: Section 2 provides the basic
theoretical framework and an introduction to parametric as well as semiparametric
discrete time-to-event modelling. Details on model fitting and data preparation are
given in Section 3. Section 4 presents measures that are useful for assessing the
goodness of fit of discrete time-to-event models. In Section 5, we illustrate the
methods by presenting a detailed analysis of the US unemployment data, showing how
the various regression models can be applied by use of the R language for
statistical computing. Furthermore, we provide guidance on model choice and compare
the models in terms of their prediction accuracy. Section 6 discusses additional
aspects related to discrete time-to-event modelling and puts the methods considered
in this article into perspective.
The R code to reproduce all the numerical results is provided as electronic
supplement to this tutorial.
Notation and basic concepts
Given observations, let, in the
following,
denote the event time and
the censoring time of individual i, .
and
are assumed to be independent random variables taking discrete values in
.
In addition, one observes a vector of
time-constant explanatory variables . For
right-censored data, the observation time is defined by , that is,
corresponds to the true event time, if ,
and to the censoring time otherwise. If originally continuous data have been
grouped, the discrete event times
refer to time intervals , where
means that the event occurred in time interval . For example, in
our application on unemployment durations, where time was measured in two-week
intervals,
implies that re-employment of individual
took place between four and six weeks after the start of the study.
The main tool to model discrete time-to-event data is the hazard
function, which captures the dynamics of the survival process at each
time point. For a given vector of time-constant explanatory variables
xi, the hazard function is defined
by
describing
the conditional probability of an event at time ,
given that the individual survived until .
The corresponding survival function is given by
denoting
the probability that an event occurs later than at time ,
or, alternatively, the probability of surviving interval .
An important consequence of the definition of the hazard function in (2.1) is that
for a fixed time , the hazard drives a binary variable that
distinguishes between the event taking place at time
or not, conditional on .
Therefore, a model for the discrete hazard function can be derived from regression
modelling strategies for a binary response data. This allows to use established
tools and reliable software packages that have been developed for this class of
models.
Parametric discrete hazard models
A general class of binary response models applied to the discrete hazard function
is defined by
where is a
strictly monotone increasing distribution function. A common assumption is that
the model contains a time-varying intercept and a set of covariate effects that
are fixed over time. Hence, the linear predictor of the model, ,
comprises the intercepts ,
,
and a vector of regression coefficients
independent of . Note that there is no
intercept parameter for ,
as the hazard function in (2.1) is fully determined by
and the
coefficients .
The hazard of the last interval is not explicitly modelled, but by definition is
given by .
The most popular version of model (2.3) is the logistic
discrete hazard model or proportional continuation ratio
model, which is specified by the equation
By
definition, the proportional continuation ratio model uses the logistic
distribution function for . It can be shown that an
alternative representation of the model is
The ratio compares the
probability of an event at time to the probability of an
event later than . It is also known as
continuation ratio; see, for example, Agresti (2013). One has to note, that
the odds in (2.5) are equivalent to the odds under the condition .
This representation of the model allows for an easy interpretation of the
effects see the application in Section 5.2.
Generally, the number of parameters in model (2.4) depends on the number of
time points, as there is a separate intercept for each . The set of intercepts
defines the hazard that is always present for any given set of covariates. This
hazard is usually referred to as baseline hazard, and the
intercepts
correspond to the log continuation ratio when all covariates are zero.
Semiparametric extensions
The parametric model introduced in the previous section is linear in, implying
that each covariate has a linear effect on the transformed hazard. In practice,
this linearity assumption may be too restrictive, as predictor-response
relationships are often characterized by nonlinear functional forms.
Furthermore, so far it has been assumed that the baseline hazard is represented
by a separate intercept coefficient for each . This can lead to numerical
problems, especially when the number of time points (and, hence, the number of
intercept parameters) is large relative to the sample size, implying that the
event counts at some of these time points may become small. In the following, we
will consider popular semiparametric alternatives for the definition of
that address
these issues. We first consider additive hazard models and
subsequently tree-based methods, which can also be embedded
into the framework of binary response models.
To avoid numerical problems in the estimation of the baseline hazard, it is often
convenient to consider an additive model with predictor
where is a smooth
(possibly nonlinear) function of time. By relating the values of the baseline
hazard at neighbouring time points via , the number
of parameters involved in model fitting effectively reduces, and low event
counts at some time points become less problematic. A common way to specify the
smooth function in is to use splines,
which are represented by a weighted sum of basis functions. One
possible representation of is by
-spline basis functions.
These are polynomials of fixed degree differing from zero in
adjacing intervals. For a comprehensive introduction to -splines, see De Boor (1978). Very
flexible spline functions can be obtained by choosing a relatively large number
of basis functions and at the same time
using a penalty term to prevent estimates becoming too rough (‘wiggly’). This
approach, on which we will focus in this article, is called
P-splines and was first proposed by Eilers and Marx (1996).
An extension of the semiparametric model (2.6) that weakens the linearity
assumption on the effects of the covariates is given by the additive model
where the are unknown
smooth functions. That is, the effects of the covariates (or subsets of the
covariates) are determined by smooth, possibly nonlinear, functions. A common
approach is again to use -splines and to
expand each function separately by a weighted sum of -spline basis functions
depending on the covariates.
Possible further extensions of model (2.7) are, for example, the use
of smooth time-varying effects of the form
or . This kind
of models is extensively discussed in Bender et al. (2018). The tutorial by
Bender
et al.(2018) illustrates the use of generalized additive mixed models
for semiparametric continuous time-to-event modelling and is also part of this
special issue.
In the semiparametric models with predictors (2.6) and (2.7), it
is assumed that the predictor is given by an additive function of time and a
linear (or additive) function of the covariates. Although these models are very
flexible, they may not capture the structure of the data very well, if
interactions between covariates are present. For example, it is quite
conceivable that the effect of a covariate on the hazard depends on the values
of a second covariate, implying the presence of an interaction between the two
covariates. The problem when incorporating interactions in parametric or
additive models is that the relevant interactions have to be known and specified
before model fitting. Furthermore, parametric and additive models are hard to
handle if the interaction terms involve more than two covariates. An alternative
regression approach that addresses these problems is recursive
partitioning, which is also known as tree
modelling. The most popular tree method is classification
and regression trees (CART), as proposed and described in detail by
Breiman et al.
(1984). The basic CART method is conceptually very simple: The
covariate space is partitioned recursively into a set of rectangles, and in each
rectangle a simple model (for example, a covariate-free model) is fitted. A
user-friendly introduction to the basic concepts of tree modelling is found in
Hastie et al.
(2009). Recently, Schmid et al.(2016) proposed a recursive partitioning method that is
specifically designed to model discrete time-to-event data. The main principle
is to fit a discrete hazard model of the form
where is
represented by a classification tree with binary outcome. Each split of this
tree is determined by either (treated as an ordinal
variable) or one of the covariates. As a result, each terminal node of the tree
refers to an estimate of the hazard function for a specific covariate
combination and a specific time interval . For details
on the calculation of the corresponding estimates, see Section 5.3.
Estimation and data preparation for additive hazard models
To derive the log-likelihood function for discrete hazard models, it is useful to
introduce a binary variable indicating whether the target event was observed or not:
Thus,
becomes 1, if the exact true time is observed; otherwise, .
In the case where continuous time-to-event data are grouped,
and
implies an event in interval and
.
Similarly,
and
implies
and survival beyond ,
that is, .
Note that when continuous time-to-event data are grouped or rounded, additional
assumptions are implicitly imposed on the censoring mechanism. To see this, consider
the case where both the continuous event time
and the continuous censoring time
are within the same interval, say, . Then, by
definition, ,
,
and ,
leading to the usual interpretation that an event was observed in interval
. At the same
time, however, this interpretation implicitly assumes ,
that is, the continuous event time is not allowed
to be larger than the continuous censoring time . Without this
assumption, the scenario where both and
would result in
and
but no observed event in . This assumption
on the nature of the censoring mechanism is often referred to as ‘censoring at the
end of the interval'.
With data ,
the contribution of the -th observation to the
likelihood function is given by
A crucial assumption that is usually made to simplify the likelihood function (3.2) is
that the censoring process does not depend on the parameters determining the event
times .
A consequence of this assumption is that the terms involving the censoring times can
be ignored in the maximization of the likelihood function for the time-to-event
process. Omitting the terms involving
in (3.2)
and inserting the definitions of the hazard function (2.1) and the survival function
(2.2),
one obtains (expect for some constants)
Note that, by definition of
in (3.1),
one always obtains
and .
if
is equal to the last time point . For maximum likelihood estimation,
it is therefore convenient to re-code observations with
as follows:
making use of the fact that the value of the likelihood contribution in (3.3) will
not be altered by this transformation.
With some algebra, it can be shown that the likelihood function (3.3) is
equal to the likelihood of a binary response model with outcome variables
For individuals where the exact event time is observed, one defines the observation
vector of length
.
For censored individuals, the observation vector contains only zeros. According to
this definition, one has
binary observations for each individual , resulting in a total of
observations. Using these definitions, the log-likelihood of the proportional
continuation ratio model becomes
The main advantage of this representation of the log-likelihood is that it allows to
use software for fitting binary response models. For example, it follows from (3.6) that
fitting a continuation ratio model is equivalent to fitting a logistic regression
model with predictor (2.6) or (2.7). In
this model, the values of the binary responses can be interpreted as binary
decisions for the transition from interval to
. For instance,
in the application on unemployment duration, one observes
for each two-week interval as long as the individual
is not re-employed yet.
The models with smooth components (2.6) and (2.7) can be
fitted by maximizing a penalized likelihood of the form
where
is a penalty parameter and
is the penalty term already mentioned in Section 2.2, putting restrictions on the
weights of the -spline basis functions and
preventing estimates from becoming too rough. When using -splines,
is a difference penalty on adjacent -spline coefficients. A common
procedure is to use cubic B-splines ()
with second order differences. Then, for example, the penalty term for the
estimation of the smooth baseline hazard in model (2.6)
contains only the parameters ,
corresponding to and has the form
The degree of smoothness is determined by the tuning parameter . The larger the value of
, the smoother is the resulting
function, and vice versa. When several smooth functions are included in the model,
one uses a difference penalty for each spline effect, based on the differences of
adjacent -spline coefficients for the
corresponding covariate. The smoothness of the individual spline estimates can be
determined by the same or separate penalty parameters .
Before fitting proportional continuation ratio models with software for binary
outcome data, one has to generate the required binary observations presented in
(3.5). This is done by the generation of an augmented data
matrix. For the setup of the matrix, one has to distinguish between
censored and non-censored individuals. For an individual whose event was observed
()
at time ,
the augmented data matrix is given by
For an individual that is censored ()
at time ,
the augmented data matrix is given by
The first column in the augmented data matrices corresponds to the binary responses
.
The second column is the time interval running from
to .
When fitting a model with fixed intercept parameters ,
this column has necessarily to be coded as a nominal factor, for example, via dummy
variables. The remaining part of the data contains the covariates. When the
covariates are constant over time, the values in each row of columns
to are the same,
that is, covariate vector
is repeated row-wise. This is also the case in the US unemployment data. Otherwise,
when time-varying covariates are considered, the observed time series are entered in
the respective columns of the augmented data matrices. For each individual, the
augmented data matrix has
rows, and the whole data matrix, which is obtained by ‘glueing’ the individual
augmented matrices together, has
rows.
In R, the augmented data matrix can be generated by applying the function
dataLong() in the R package discSurvWelchowski and Schmid
(2017). The general interface of the function is
The function requires the original data of class data.frame in
‘non-augmented’ short format (argument dataSet), the column
name of the observed discrete event times (argument
timeColumn) and the column name of the binary event
indicator as defined in Equation (3.1) (argument
censColumn). The variable required by
timeColumn can either be numeric or coded as an ordinal
or nominal factor. If timeAsFactor = TRUE, the time column in
the augmented data matrix will be returned as a nominal factor. The variable
required by censColumn can either be a numerically coded 0/1
vector or a labelled factor variable. Note that dataLong()
assumes that the covariates are constant over time. If this is not the case, the
function dataLongTimeDep() should be used instead to generate
the augmented data matrix. For details on the required format of the raw data
matrix, we refer to the documentation in discSurv.
The augmented data matrix returned by dataLong() contains the
binary responses as defined in Equation (3.5) in the form of a numerically
coded 0/1 vector named y. Further details on the output are
given by the application in Section 5.1.
Goodness-of-fit measures
In this tutorial, we consider two diagnostic tools that are useful to investigate
discrete hazard models in terms of their goodness of fit. By appropriate
visualizations, both tools can be used to check whether a model is well
calibrated, that is, to check how well the fitted probabilities
agree with their corresponding observed proportions. Note that these graphical
checks do not constitute ‘formal’ calibration tests, in the sense that they neither
rely on asymptotics nor on distributional results.
First, one can generate a calibration plot. The idea is to compare
the estimated hazards ,
of the model to the relative frequencies of observed events ()
in predefined subsets of the augmented set of observations. More specifically, one
splits the data into subsets ,
defined by the percentiles of the estimated hazards. Common choices for
are
or .
Then the relative frequency of observed events (‘empirical hazard’) is calculated in
each subset by
where
corresponds to
the number of observations in subset .
If the fit of the model is satisfactory, the empirical hazard measure in (4.1) should
be close to the average of the estimated hazards in
for all , that is, close to the mean of
.
Examples of calibration plots are shown in Figure 3 in the application.
Second, we consider martingale residuals, which allow for assessing
the importance of single covariates .
The idea of the martingale residuals is to compare for each individual the observed
number of events with the expected number of events up to .
Using the binary response variables
the residuals are defined as
For a well-fitting model that includes all relevant predictors, the difference
between and
should be
‘random’ and therefore uncorrelated with the covariate values. To assess the
importance of a covariate graphically, one can plot the residuals against the
covariate values. Martingale residuals can be computed by the function
martingaleResid() contained in the discSurv
package. Examples are shown in Figure 3 in the application.
Application: Duration of unemployment
In the following, discrete hazard regression modelling is illustrated by means of a
step-by-step analysis of the US unemployment data. Throughout this section, we use
the logistic link function, that is, we consider the fitting of a proportional
continuation ratio model.
Preprocessing of the data
To fit a logistic discrete hazard model of the form (2.4),
the original data matrix first has to be transformed to an augmented data
matrix, as described earlier. The dataset UnempDur, which
(after application of the preprocessing steps outlined in the Introduction) is a
slightly modified version of the data frame available in the R package
Ecdat, has the following form:
>head(UnempDur)
spell age ui reprate disrate logwage tenure status
1 5 41 no 0.179 0.045 6.89568 3 1
2 13 30 yes 0.520 0.130 5.28827 6 1
4 3 26 yes 0.448 0.112 5.97889 3 1
5 9 22 yes 0.320 0.080 6.31536 0 1
6 11 43 yes 0.187 0.047 6.85435 9 0
8 3 32 no 0.373 0.093 6.16121 0 1
The first column named spell is the observed time to
re-employment of individual and contains the values of
.
As mentioned earlier, these values correspond to the lengths of the spells
(measured in two week intervals), whose distribution is displayed in Figure 1. The last column
named status indicates whether the exact event time of
individual has been observed
(status)
or if the individual is subject to right censoring
(status);
it corresponds to the random variable
defined in equation (3.1). Summarizing the
status column yields a censoring rate of 0.391:
> table(UnempDur$status)/nrow(UnempDur)
0 1
0.3909657 0.6090343
Columns two to seven of the data frame UnempDur contain
the explanatory variables described in Table 1. All covariates are constant
over the time of the survey.
When using dataLong() to obtain the augmented data matrix,
one has to pass the column names spell and
status to the arguments
timeColumn and censColumn,
respectively:
The new columns are obj, which is an identifier of the
individuals, timeInt, which contains the discrete time
values (i.e., the second column of the augmented matrices in (3.9)
and (3.10), stored as a nominal factor) and y,
which contains the binary response variables .
The head of the augmented data matrix is given by
>UnempDurLong[UnempDurLongobj==1,]
obj timeInt y spell age ui reprate disrate logwage tenure
status
1 1 1 0 5 41 no 0.179 0.045 6.89568 3 1
1.1 1 2 0 5 41 no 0.179 0.045 6.89568 3 1
1.2 1 3 0 5 41 no 0.179 0.045 6.89568 3 1
1.3 1 4 0 5 41 no 0.179 0.045 6.89568 3 1
1.4 1 5 1 5 41 no 0.179 0.045 6.89568 3 1,
showing that the first individual (obj)
had an event after 10 weeks (spell
and status).
Accordingly, the augmented data matrix for the first individual has five rows,
where each row corresponds to one time interval (timeInt).
The corresponding vector of responses is . The values
of the covariates remain constant over time and are therefore the same in each
row.
As a second example, consider the augmented data matrix of the 12th individual
(obj).
This individual is censored after six weeks (spell
and status).
Hence, the corresponding data matrix has three rows with response
:
>UnempDurLong[UnempDurLong obj==12,]
obj timeInt y spell age ui reprate disrate logwage tenure
status
14 12 1 0 3 40 yes 0.52 0.13 4.95583 0 0
14.1 12 2 0 3 40 yes 0.52 0.13 4.95583 0 0
14.2 12 3 0 3 40 yes 0.52 0.13 4.95583 0 0.
Regression modelling
A parametric proportional continuation ratio model with a linear predictor is
estimated in R by passing the augmented data matrix
UnempDurLong to glm() with the
usual specifications:
+ data = UnempDurLong, family = binomial(link =
”logit”)).
The left-hand side of the formula argument contains the
binary response vector y. In addition to the names of the
six covariates, the right-hand side contains the discrete time variable as a
nominal factor without intercept (timeInt ’ 1). The
family argument binomial(link =
”logit”) is the same as for ‘usual’ logistic regression models
with binary outcome.
The more complex model with smooth baseline hazard (Equation (2.6)) is
estimated by use of the R package mgcvWood (2017). A
detailed introduction to the estimation procedures is found in Wood (2011). The
corresponding function gam() essentialy has the same
interface as glm():
+ data = UnempDurLong, family = binomial(link =
”logit”)).
Before passing the nominal factor timeInt to
gam(), it has to be transformed to a continuous
variable (timeIntNum). Then a smooth baseline hazard is
specified by use of the function s() on the right-hand
side of the model formula. Required arguments are the type of spline smoother
bs, the dimension of the basis
k (which corresponds to the number of basis functions
)
and the order of the penalty m. Here, we use
-splines with cubic basis
functions (bs = ”ps”) and a second order difference
penalty (m).
The chosen dimension (k)
results in four cubic basis functions. Based on these specifications, the
optimal smoothing parameter , see Equation (3.7),
is computed by generalized cross-validation see Wood (2006). Its estimate is stored in
the argument sp. In case of the US unemployment data,
sp is estimated as:
>model2sp
s(timeIntNum)
0.08562284.
Generally, the mgcv package implements a large variety of
alternative spline estimators and methods for smoothing parameter optimization.
In principle, all of these methods may be used for discrete hazard modelling, in
the same way as they would be used in logistic regression (or, more generally,
in additive models with a binary response).
The estimated baseline hazards of model1 and
model2 are shown in Figure 2. The discrete baseline hazard
obtained for model1 is visualized by black dots. From
these estimates, a reasonable interpretation is hard to derive. On the other
hand, the smooth baseline hazard obtained for model2,
visualized by grey squares, is more meaningful. It is seen that the conditional
probability of re-employment decreases until week 20 and subsequently increases
up to week 32 before it diminishes again. The reason for this might be that in
many US states, workers are eligible for up to 26 weeks of benefits from the
state-funded unemployment compensation programme.
Analysis of the US unemployment data. The figure shows the estimated
discrete baseline hazard of model1 (black dots)
and the smooth baseline hazard of model2 (grey
squares)
Analysis of the US unemployment data. The table contains coefficient
estimates (coef), estimated standard errors (se) and
p-values of the covariate effects obtained for
model1 and model2 (bh
= baseline hazard)
model1 (discrete bh)
model2 (smooth bh)
coef
se
-value
coef
se
-value
age
0.012
0.003
0.000
0.012
0.003
0.000
reprate
0.285
0.342
0.406
0.301
0.338
0.373
disrate
0.764
0.383
0.046
0.755
0.379
0.047
logwage
0.231
0.072
0.001
0.236
0.071
0.001
tenure
0.005
0.005
0.280
0.006
0.005
0.266
ui
1.151
0.052
0.000
1.175
0.051
0.000
Table 2 shows the
estimates of the coefficients, the corresponding estimated standard errors, and
the -values of the covariate
effects obtained for model1 and
model2. Apart from the eligible replacement rate
(reprate) and the tenure in the lost job,
(tenure), the covariates are significantly associated
with the time to re-employment. According to the signs of the estimates of both
models, the chance of getting re-employed decreases with increasing values of
age, increasing disregard rate and with the filing of an unemployment claim. On
the other hand, the higher the earnings in the lost job, the better the chance
of re-employment. Table
2 also shows that the differences in coefficient estimates between
the two models are small. By use of Equation (2.5), the effects can be
interpreted in an easy way. For example, let us compare citizens who submitted
an unemployment insurance claim (ui)
to those who did not (ui).
Based on the estimate of model1 (),
one obtains that the probability of re-employment at (any) time , compared to the probability
of re-employment later than , decreases for citizens who
submitted an unemployment insurance claim by the factor .
The chance of re-employment is therefore much smaller in this group. One might
speculate that due to benefits from the state-funded unemployment, the
motivation to search for a new job is lower.
Analysis of the US unemployment data. The two panels show the
calibration plot (left) and the martingale residuals against the values
of age (right) obtained for model1 (a) and
model2 (b) without age, respectively. The
trend lines were obtained by local polynomial regression
The goodness-of-fit measures for model1 and
model2 are presented in Figure 3. The left figures show the
calibration plots (average fitted hazards against the relative frequencies of
events). It is seen that the values of model1 are closer
to the 45 degree line than those of model2, indicating a
better model fit. The right figures show the martingale residuals, defined in
(4.2), against the values of the covariate age
for the fit of model1 and model2
without age. The black lines correspond to the estimated trend obtained by a
local polynomial regression using the R function loess().
The functional form of the trend lines (compared to the zero line) shows a
nonlinear effect on the martingale residuals for both models. This indicates
that the covariate age is an influential variable (cf.
Table 2) with a
possibly nonlinear effect on the response.
Therefore, as a possible extension, we consider a model where the baseline hazard
as well as the covariate age are both modelled as smooth -spline functions:
>model3 < ’ gam(formula = y s(timeIntNum, bs = ”ps”, k = 5, m
= 2) +
+ s(age, bs = ”ps”, k = 25, m = 2) +
+ reprate + disrate + logwage + tenure + ui,
+ data = UnempDurLong, family = binomial(link =
”logit”)).
As seen from the R code, the estimation of the smooth function of covariate age
in model3 is based on 24 cubic basis functions (dimension
k).
The estimated penalty parameter
for age, stored in sp, is:
>model3sp[”s(age)”]
s(age)
56 860.2.
From the resulting function shown in Figure 4, it is seen that the association
between the time to re-employment and age is definitely not linear. Its form is
very similar to the loess trend shown in Figure 3. The value on the
-axis of the figure
corresponds to the contribution of age to the predictor of the model. The chance
of re-employment has a peak between 20 years and 30 years, and subsequently
decreases.
Analysis of the US unemployment data. The plot shows the estimated
P-spline function for the covariate age in
model3
Tree-based modelling
Finally, we fit a recursive partitioning model of the form (2.8).
Again, we consider a procedure that is based on the augmented data matrix with
binary outcomes .
When growing trees, one has to take two main decisions: First, one has to choose
an appropriate criterion for performing the splits. Criteria that have already
been used in the early days of tree construction are impurity measures. For
discrete survival trees, a natural measure of node impurity is the Brier
score, which evaluates the average squared difference between the
binary outcome values and the
respective hazard estimate in each node
(see Schmid et al. ,
2016). It can be shown that using the Brier score is equivalent to
the traditional Gini impurity measure. For a single node , the Gini impurity is given
by
where
is the proportion of ones in node (see Breiman 1996). This equivalence implies
that the traditional CART algorithm based on the Gini criterion can be used for
the construction of the tree. The latter is done by using the function
rpart() of the eponymous R package rpart
(Therneau et al.,
2015).
Second, one has to determine the optimal size of the tree. For the discrete
survival tree, an appropriate tuning parameter controlling tree size is the
minimal number of observations in each terminal node (‘minimal node size’).
Optimizing this number avoids overfitting, as the number of terminal nodes is
prevented from becoming too large and, at the same time, the node sizes are
prevented from becoming too small. Growing the largest possible tree with
exactly one observation in each terminal node, for example, is not desirable, as
the estimated hazards would all be either or in this case. Accordingly,
splitting is stopped when further splitting in any of the current nodes would
result in an additional node containing less observations than the minimal node
size. Given a sequence of tree estimates depending on the minimal node size, the
optimal tree (i.e., the tree with ‘optimal’ minimal node size) is determined by
either minimization of an information criterion (such as AIC or BIC, see in the
following) or maximization of the predictive log-likelihood. The latter strategy
means to repeatedly draw subsamples from the original non-augmented data (for
example, by cross-validation, bootstrapping or subsampling without replacement)
and to calculate the log-likelihood for the omitted observations. One determines
the optimal tree as the one for which the predictive log-likelihood (averaged
across the subsamples) becomes maximal. The R function
survivalTree() automatically generates the augmented
data matrix by dataLong(), estimates the discrete
survival tree by rpart() and returns the optimal one
according to the specified performance criterion. The function is part of the
electronic supplement of this article.
Once the optimal minimal node size has been determined, the estimate of
is given by
the relative frequency of events (proportion of ones) in each node, possibly
after applying some sort of correction procedure like the Laplace correction
(see in the following).
To fit a tree model to the US unemployment data, we call the
survivalTree function using the following
arguments:
> source(”survivalTree.R”)
> model4 < ’ survivalTree(formula = y timeInt + age
+
+ reprate + disrate + logwage + tenure + ui,
+ data = UnempDur, tuning = ”BIC”,
+ timeColumn = ”spell”, censColumn = ”status”,
+ minimal_ns = seq(100, 1500, by = 10),
+ trace = TRUE).
The formula required for the tree model is analogous to the one specified for a
model with linear predictor. Note that internally the time variable
timeInt is coded as a numeric vector. This is in
analogy to model2 and model3 with
smooth baseline hazards. The original data frame UnempDur
(in non-augmented format) is passed to the data argument.
In addition, one has to specify the timeColumn and
CensColumn arguments used in
dataLong(). The performance criterion is specified by
the argument tuning. For tuning, we use the Bayesian
information criterion (BIC) defined by
where
is the log-likelihood
(3.6),
is the number of rows of the augmented data matrix and
denotes the number of splits as a measure of the complexity of the tree. Other
possible arguments for tuning are ”AIC” (Akaike's
information criterion) and ”ll” (predictive
log-likelihood method). When using ”ll”,
survivalTree() performs a five-fold cross-validation
based on subsamples without replacement. To ensure that each subsample contains
a sufficient number of observations per observed event time, the subsamples are
stratified by spell. The
survivalTree function searches for the best model
among the sequence of models with minimal node sizes
minimal_ns. If minimal_ns is
not specified, the sequence of minimal node sizes is set to .
Analysis of the US unemployment data. The plot shows the BIC values
for the sequence of survival trees that was obtained by fitting
model4 with minimal node sizes ranging from
100 to 1 500. The minimal BIC value (obtained for node size 840) is
marked by the vertical dashed line
The BICs obtained for model4 with minimal node sizes
(in steps of 10) are
shown in Figure 5. If an
increase of the minimal node size does not change the number of splits and
therefore does not influence the resulting tree, the BIC remains the same. This
is the case, for example, between minimal node sizes and . According to the BIC,
the optimal tree model has minimal node size 840, marked by the dashed line in
Figure 5. This
results in a tree with 11 splits or 12 terminal nodes. The estimated tree is
shown in Figure 6.
Analysis of the US unemployment data. The graph visualizes
the survival tree obtained from fitting
model4 with BIC-optimal minimal node size
840. The numbers at the terminal nodes refer to the estimated
hazards. All estimated hazards were additionally post-processed by
application of the Laplace correction, which was
suggested by Ferri et al. (2003) to correct for estimates near the
boundaries 0 and 1 in nodes with very few observations. The Laplace
correction is automatically performed by
survivalTree()
The most important covariate, which was chosen in the first split of the tree, is
ui. As already derived from the parametric models,
the submission of an unemployment insurance claim (ui =
‘yes’) has a negative effect on the ‘chance’ of re-employment. Within the group
of citizens who submitted an insurance claim, the chance is lowest for citizens
aged years, or older and
with a tenure in the lost job of at least 6 years (leftmost node in Figure 6). For citizens
younger than years, all further
splits are performed with regard to the discrete time variable
(timeInt). This confirms the results from Figure 2 in that the
chance of re-employment is highly time-dependent. With the high hazard estimate
of after 26 weeks
(timeInt)
of unemployment in this group, the tree estimate also indicates similar
tendencies that were already seen in Figure 2 after 20 weeks. The best
opportunities of re-employment are observed for citizens without an unemployment
insurance claim, within the first six weeks (timeInt)
of unemployment, and with log weekly earnings of at least
(rightmost node in Figure
6). For this subgroup, the estimated hazard rate is . The two covariates
reprate and disrate were not
selected in any of the splits and are therefore exluded from the model. This is
in contrast to model1 and model2
(see Table 2), where
disrate showed a significant effect on the
hazard.
Comparison and model choice
In the previous sections, we presented the results of the analysis of the US
unemployment data, having used three different parametric models and a
tree-based model. In addition to the goodness-of-fit measures defined in Section
4, the interpretability of the model coefficients and the performance with
regard to predicting events of future observations can be used as criteria for
model choice.
Regarding the interpretation of the covariate effects, there is an important
difference between the parametric models and the tree-based model. After the
first split in the tree, all nodes represent interactions between either the
covariates or between the covariates and time. For example, the second split in
the left node (ui = ‘yes’) in the covariate
age implies an interaction between the covariates
ui and age (see Figure 6). In contrast to
the interaction effects, it is usually difficult to detect and quantify main
effects using tree modelling (e.g., by age). On the other
hand, in the parametric models, the main covariate effects can easily be
interpreted. For example, according to our analysis, covariate
age has a negative linear effect (see Table 2) or smooth
effect (see Figure 4) on
the chance of re-employment. Higher order interactions (e.g., between
ui, age and
tenure) are usually hard to model by the parametric
models, as they—unlike trees—do not include a data-driven selection of
interaction terms during model fitting. This implies that interactions either
need to be known or that a large number of parametric models
(including/excluding the various interaction terms) need to be fitted and
compared.
A common measure for the prediction accuracy of the models is the deviance D =
-2ℓ, evaluated on ,
of an independent test dataset comprising
observations. Because D is small if the log-likelihood ℓ is
large, one should prefer the model with minimal D. To further
compare the models, we drew 100 subsamples without replacement of size
n = 2 568 (i.e., 80% of the original sample), estimated the
four models on each of the 100 subsamples and computed the predictive deviances
from the remaining 100 test sets of
observations each. Subsampling was stratified by spell to
ensure a sufficient number of observations per observed event. From the results
in Figure 7, it is seen
that model1 with discrete baseline hazard performed
better than model2 and model3 with
smooth baseline hazard. This was already indicated by the calibration plots in
Figure 3. The
tree-based model (rightmost boxplot) had a smaller predictive deviance on
average than the parametric models and may hence be considered the
best-performing model for the US unemployment data.
Analysis of the US unemployment data. The boxplots show the
predictive deviance values of the four models based on 100
subsamples without replacement of size n = 2 568
each The models were evaluated on the remaining 100 test sets of
= 642 observations each. For a better comparison, the median of the
tree-based model is marked by a dashed line
Concluding remarks
In this tutorial, we have described a basic set of tools to fit semiparametric
regression models with a discrete time-to-event outcome. All presented models are
very general, in that they are applicable to any type of censored discrete response,
regardless of whether the data-generating process is defined by an intrinsically
discrete process or by the rounding/grouping of continuous event times. Furthermore,
the presented methods are applicable in basically any field of research, as for
example, in the social sciences, biostatistics, epidemiology and many more. The US
unemployment data considered in this article is therefore only one of many possible
examples. Further applications are presented in Tutz and Schmid (2016).
It is important to realize that all models considered in this tutorial can be fitted
easily by use of standard software for binary regression modelling. This has the
great advantage that established tools for estimation and inference can be used,
that are already available. The most important functions in R are
glm(), gam() (of the
mgcv package) and rpart() (of the eponymous
package). In addition to the P-spline and CART methodologies
considered here, many other options for semiparametric discrete time-to-event
modelling exist in R. For example, mgcv provides a variety of
alternative spline modelling tools such as cardinal splines and smoothing splines,
which can be used for discrete hazard modelling by specifying the
bs argument in gam() accordingly.
Further extensions also include time-varying coefficient models as considered in
Bender et al. (2018)
for continuous data. Similarly, there is an alternative tree modelling approach
developed by Bou-Hamad et al.
(2009) that operates directly on the non-augmented time-to-event data.
This procedure is implemented in the R package DStree (Mayer et al., 2014).
The basic functionalities required for applying the aforementioned software packages
are all implemented in the discSurv package. Next to the functions used
in this tutorial, discSurv provides additional functions to calculate,
for example, measures for model evaluation like the concordance index (Schmid et al., 2017), and
alternative tools for residual analysis.
We finally note that there exists a number of additional modelling options that are
beyond the scope of this tutorial. These include, among many others, (a) regularized
estimation via boosting or penalized optimization of the log-likelihood, which is
useful for variable selection in higher-dimensional settings, (b) random effects and
finite mixture modelling, which account for unobserved heterogeneity in the data and
(c) competing-risks models, which extend the models considered in this tutorial by
allowing for more than one target event. For details on further methodology,
including semiparametric extensions, see Tutz and Schmid (2016). In particular, a
basic introduction into boosting for regression modelling is given by Mayr and Hofner, (2018) as
part of this special issue.
Footnotes
Acknowledgments
The work of MS was supported by the German Research Foundation (DFG), grant SCHM
2966/1-2.
References
1.
AgrestiA (2013)
Categorical Data Analysis, 3rd
edition. New York, NY: John Wiley & Sons.
2.
BenderAGrollAScheiplF (2018)
A tutorial on the estimation of time-varying coefficient
models in event data analysis. Statistical
Modelling,
18(299–321).
3.
Bou-HamadILarocqueDBen-AmeurHMâsseLCVitaroFTremblayRE (2009)
Discrete-time survival trees.
Canadian Journal of Statistics,
37, 17–32.
4.
BreimanL (1996)
Technical note: Some properties of splitting
criteria. Machine
Learning, 24,
41–47.
5.
BreimanLFriedmanJHOlshenRAStoneJC (1984)
Classification and Regression Trees.
Monterey, CA:
Wadsworth.
6.
CroissantY (2016)
Ecdat: Data sets for econome- trics. R package
version 0.3-1. URL
http://CRAN.R-project.org/package=Ecdat
7.
De BoorC (1978)
A Practical Guide to SplinesNew York, NY:
Springer.
8.
EilersPHMarxBD (1996)
Flexible smoothing with B-splines and
penalties. Statistical
Science, 11,
89–102.
9.
FerriCFlachPAHernández-OralloJ (2003)
Improving the AUC of probabilistic es- timation
trees. In NadaLavračDraganGambergerHendrikBlockeelLjupčoTodorovskiEuropean Conference on Machine Learning,
121–132. Berlin
Heidelberg:
Springer.
10.
HastieTTibshiraniRFriedmanJH (2009)
The Elements of Statistical Learning, 2nd
edition. New York, NY:
Springer.
11.
KalbfleischJPrenticeR (2002)
The Survival Analysis of Failure Time Data, 2nd
edition.New Jersey: Wiley
Inter-Science.
12.
KleinJMoeschbergerM (2003)
Survival Analysis: Statistical Methods for Censored and
Truncated Data.New York, NY:
Springer.
13.
MayerPLarocqueDSchmidM (2014)
DStree: Recursive partitioning for discrete- time
survival trees. R package version 1.0. URL
https://CRAN.R-project.org/package=DStree
14.
MayrAHofnerB (2018)
Boosting for statistical modelling: A non-technical
introduction. Statistical
Modelling18365–384.
15.
R Core Team
(2017) R: A Language and Environment for
Statistical Computing.Vienna: R Foundation for Statistical Comp- uting. URL
https://www.R-project.org/
16.
SchmidMKüchenhoffHHoeraufATutzG (2016)
A survival tree method for the analysis of discrete event
times in clinical and epidemiological studies.
Statistics in Medicine,
35, 734–751.
17.
SchmidMTutzGWelchowskiT (2017)
Discrimination measures for discrete time- to-event
predictions. Econometrics and
Statistics.
18.
TherneauTAtkinsonBRipleyB (2015)
rpart: Recursive Partitioning and Regression
Trees. R package version 4.1-10.URL
https://CRAN.R-project.org/package=rpart
19.
TutzGSchmidM (2016)
Modeling Discrete Time-to-event Data.New York, NY:
Springer.
20.
WelchowskiTSchmidM (2017)
discSurv: Discrete Time Survival
Analysis.R package version 1.1.7 URL
http://CRAN.R-project.org/package=discSurv
21.
WillettJBSingerJD (1993)
Investigating onset, cessation, relapse, and recovery: Why
you should, and how you can, use discrete-time survival analysis to examine
event occurrence. Journal of Consulting and
Clinical Psychology, 61,
952–965.
22.
WoodS (2006)
Generalized Additive Models: An Introduction with
R. Florida:
CRC press.
23.
WoodS (2017)
mgcv: Mixed GAM Computation Vehicle with GCV/AIC/REML
Smoothness Estimation.R package version 1.8-15. URL https://CRAN.R-project.org/package=
mgcv
24.
WoodS (2011)
Fast stable restricted maximum likelihood and marginal
likelihood esti- mation of semiparametric generalized linear
models. Journal of the Royal Statistical
Society: Series B (Statistical Methodology),
73, 3–36.