Abstract
The case-time-control design is a tool to control for measured, time-varying covariates that increase monotonically in time within each subject while also controlling for all unmeasured covariates that are constant within each subject across time. Until recently, the design was restricted to data with only two timepoints and a single binary covariate, or data with a binary exposure. Sjölander (2017) made an important extension that allows for an arbitrary number of timepoints and covariates and a nonbinary exposure. However, his estimation method requires fairly strong model assumptions, and it may create bias if these assumptions are violated. We propose a novel estimation method for the case-time-control design, which to a large extent relaxes the model assumptions in Sjölander (2017). We show in simulations that this estimation method performs well under a range of scenarios and gives consistent estimates when Sjölander’s estimation method does not.
1. Introduction
Sociological and epidemiological research often aims to estimate the causal effect of an exposure on a binary outcome. One popular way to reduce confounding bias in observational studies is to use each subject as his or her own control, that is, to compare the risk of the outcome at exposed and unexposed timepoints within the same subject. Technically, this is often accomplished by using a conditional logistic regression model. This model conditions on the subject, implicitly controlling for all covariates that are constant within the subject across time, such as genetic makeup (Allison 2009; Allison and Christakis 2006).
Often, scholars will also want to control for measured time-varying covariates in the model, such as age, income, and socioeconomic status. This may be challenging, particularly when (a) some of these covariates are monotonically increasing in time within each subject (e.g., age) and (b) the outcome of interest is “absorbing,” which means the subject leaves the cohort when the outcome occurs (death is the most obvious example). Under these conditions, it is not possible to fit a conditional logistic regression model. This is because a monotonically increasing covariate perfectly predicts an absorbing outcome, so it is impossible to obtain an estimate of the parameter for that covariate (Allison 2009; Allison and Christakis 2006).
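The perfect-prediction problem can be checked numerically. The following sketch is our own illustration and is not taken from the cited papers. It assumes a single hypothetical subject observed at ages 15 to 19 who experiences the absorbing outcome at the last timepoint, and it evaluates that subject's conditional logistic log-likelihood contribution, the log of exp(beta * x_T) / sum_t exp(beta * x_t), on a grid of values for the age coefficient beta. The function name and the data are assumptions made for illustration.

```python
import math

def conditional_loglik(beta, x, event_index):
    """Conditional logistic log-likelihood contribution of one subject
    who has exactly one event, located at position event_index."""
    scores = [beta * xt for xt in x]
    # Log-sum-exp of the scores, computed stably
    m = max(scores)
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[event_index] - log_denom

# Hypothetical subject: age rises by one year per interview, and the
# absorbing outcome occurs at the last observed timepoint.
ages = [15, 16, 17, 18, 19]
event_index = len(ages) - 1

betas = [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]
logliks = [conditional_loglik(b, ages, event_index) for b in betas]
for b, ll in zip(betas, logliks):
    print(f"beta = {b:5.1f}   log-likelihood = {ll:.6f}")
```

Because the event always coincides with the largest value of the covariate, the log-likelihood keeps increasing toward zero as beta grows, so no finite maximizer exists. Every subject with an event contributes in the same way, which is why the fit fails to converge.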
The case-time-control design has been proposed as a solution to this problem. Briefly, the design uses a reformulation of the regression model that allows the target parameter to be estimated without convergence problems. There are two common reformulations: one proposed by Suissa (1995) and one proposed by Allison and Christakis (2006). Both of these have important limitations. Suissa’s (1995) reformulation assumes each subject is observed at only two timepoints but allows the exposure variable to be either binary or continuous. Allison and Christakis’s (2006) reformulation allows for more than two timepoints but requires the exposure variable to be binary.
In recent work, Sjölander (2017) proposed an extension of the case-time-control design that allows for an arbitrary number of timepoints, an arbitrary number of (possibly nonbinary) time-varying covariates, and a nonbinary exposure. This extension uses a conditional generalized linear model (GLM) for the exposure. Sjölander (2017) showed that the target parameter (i.e., the exposure effect) can be written as the ratio between the outcome coefficient and the dispersion parameter in this GLM. Sjölander (2017) proposed estimating the outcome coefficient and the dispersion parameter separately using conditional generalized estimating equations (CGEEs) and the method of moments, respectively, and estimating the target parameter as the ratio of these. A major limitation of this two-step estimation method is that it requires the link function in the GLM to be specified. In practice, the analyst often does not know which link function is most appropriate, and an incorrectly specified link function may give biased estimates. Furthermore, the estimation method proposed by Sjölander (2017) only allows for the identity link and the log link; it cannot handle other common link functions in GLMs, such as the inverse or inverse squared.
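To see why the exposure effect takes the form of a ratio, consider the Gaussian special case. The sketch below is our own illustration under assumed values, not code from Sjölander (2017). It assumes that the exposure given a binary outcome Y is normal with mean a + gamma * Y and variance phi (the dispersion parameter in the Gaussian family), and it applies Bayes' rule to obtain the log odds of the outcome given the exposure. The names `a`, `gamma`, `phi`, and `prior` are hypothetical.

```python
import math

def normal_logpdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)

def log_odds_y_given_x(x, a, gamma, phi, prior):
    """Log odds of Y = 1 given exposure x, obtained from Bayes' rule:
    posterior odds = prior odds * likelihood ratio."""
    log_lr = normal_logpdf(x, a + gamma, phi) - normal_logpdf(x, a, phi)
    return math.log(prior / (1 - prior)) + log_lr

# Assumed illustrative values (not estimates from any data set)
a, gamma, phi, prior = 1.0, 0.6, 2.0, 0.3

# Log odds ratio for a one-unit increase in the exposure
log_or = (log_odds_y_given_x(3.0, a, gamma, phi, prior)
          - log_odds_y_given_x(2.0, a, gamma, phi, prior))
print(log_or, gamma / phi)
```

In this special case the two printed numbers agree: the log odds ratio per unit of exposure equals the outcome coefficient divided by the dispersion parameter, regardless of the assumed intercept and marginal outcome probability.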
In this article, we propose a conditional maximum likelihood (CML) estimation method for the case-time-control design that does not suffer from these limitations. This CML method allows for an arbitrary canonical link function in the GLM for the exposure. Furthermore, it does not require the analyst to specify this link function. Thus, this CML method is both more general and more robust than the two-step method proposed by Sjölander (2017).
The article is organized as follows. In Section 2, we introduce notation and assumptions and define the target parameter. In Section 3, we briefly review the two-step method proposed by Sjölander (2017), and in Section 4, we present our CML method. In Section 5, we evaluate the performance of the CML method with a simulation study and compare it to the two-step method proposed by Sjölander (2017). As an illustration of the two-step method, Sjölander (2017) analyzed a real data set containing information on 1,151 teenage girls who were interviewed annually for five years. In Section 6, we reanalyze these data with the CML method and compare the results to those obtained by Sjölander (2017).
2. Notation, Assumptions, and Target Parameter
We assume data have been collected on
Most estimation methods for the case-time-control design rely on three independence assumptions (Jensen et al. 2014; Sjölander 2017):
and
Assumption 1 states that the observations from different subjects are independent. This is uncontroversial because it typically holds by study design. Assumptions 2 and 3 are more questionable. Although these are probabilistic assumptions, they have important causal implications. Assumption 2 implies that the exposure at any given timepoint has no direct causal effect on the exposure at future timepoints, and Assumption 3 implies that the exposure at a given timepoint has no direct causal effect on the outcome and covariates at future timepoints, and vice versa. The causal diagram (Pearl 1995, 2009) in Figure 1 illustrates the scenario for two timepoints. In this diagram, an arrow from

Figure 1. A causal diagram for which Assumptions 2 and 3 hold.
Note that Sjölander’s (2017) two-step method does not require Assumption 2 to hold. To our knowledge, this is the only proposed estimation method for the case-time-control design that does not require this assumption.
The case-time-control design aims to estimate the odds ratio (OR)
which measures the association of the outcome with
where
A standard way to estimate
where
3. The Two-Step Method Proposed by Sjölander (2017)
Sjölander (2017) proposed using a generalized linear model (GLM; McCullagh and Nelder 1989) for the exposure. A GLM assumes that the variable being modeled, in this case the exposure, has a distribution within the exponential dispersion family. This family of distributions is fairly flexible and can accommodate various shapes. It includes, for instance, the Gaussian distribution, the Poisson distribution, the gamma distribution, and the inverse Gaussian distribution. Technically, Sjölander’s (2017) GLM assumes that the conditional distribution for exposure
In this model, the parameters
The conditional mean of
Using Bayes' rule, Sjölander (2017) showed that the target parameter
with
Notably, CGEEs are restricted to the identity link and log link, and they may give biased estimates if the link function is misspecified. In the next section, we derive an estimation method that allows for an arbitrary canonical link function in the GLM for the exposure and does not require that the analyst specify this link function.
4. The CML Method
Let
where
We emphasize that the conditional likelihood
Kalbfleisch (1978) derived a similar conditional likelihood for nonclustered data in the context of permutation tests. However, in that context, the target parameter is the unscaled parameter
5. Simulation
We carried out a simulation study to assess the finite sample properties of the proposed CML method and compare it with Sjölander’s (2017) two-step method. The code for the simulation is provided in Online Appendix B.
We generated samples of
When simulating the exposure from the gamma and inverse Gaussian distributions, we subtracted the constant 2.2 from the confounder
Table 1 displays the results. When
Simulation Results When
Note. Target parameter:
In all these simulations, the analysis models were correctly specified: The outcome was generated from an exogenous model, and the exposure was generated from the endogenous Model 6. To assess the performance of the estimators under model misspecification, we modified the simulation scheme so
with
Table 2 displays the results. No scenario or estimator has a substantially larger bias than in Table 1. In some cases, the bias is even smaller (e.g., the CML estimator in scenario 3). These results indicate some robustness of the estimators to model misspecification.
Simulation Results When
Note. Target parameter:
6. Real Data Analysis
As an illustration, Sjölander (2017) used the data set teenpov, which was borrowed from Allison (2009) and contains information on 1,151 teenage girls who were interviewed annually for five years, beginning in 1979. The data set contains the variables ID (a unique subject-identifier), nonpov. (1 if the girl is currently not in poverty according to U.S. federal standards, 0 else), hours (the number of hours currently worked per week), in school (1 if the girl is currently enrolled in school, 0 else), spouse (1 if the girl is currently living with a spouse, 0 else), age (the girl’s current age), and mother (1 if the girl currently has at least one child, 0 else). To be consistent with previous notation, we rename ID as
Sjölander (2017) aimed to investigate how much each additional working hour increases the probability of shifting from poverty to nonpoverty. He thus restricted attention to girls who were in poverty at the first interview and followed them until the first interview when they were no longer in poverty or the fifth interview, whichever came first. After this restriction, the data set contains 1,342 interviews from 401 girls, so that
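The restriction described above can be sketched in a few lines. This is a minimal illustration on toy data, not the code used for the published analysis. It assumes long-format records with one row per interview and the hypothetical keys `id`, `wave` (interview number), and `nonpov`.

```python
from itertools import groupby
from operator import itemgetter

def restrict_to_first_exit(rows):
    """Keep subjects who are in poverty at their first interview, and keep
    their interviews up to and including the first one with nonpov == 1."""
    kept = []
    rows = sorted(rows, key=itemgetter("id", "wave"))
    for _, group in groupby(rows, key=itemgetter("id")):
        history = list(group)
        if history[0]["nonpov"] == 1:
            continue  # not in poverty at the first interview: excluded
        for row in history:
            kept.append(row)
            if row["nonpov"] == 1:
                break  # outcome treated as absorbing: follow-up ends here
    return kept

def make_subject(subject_id, outcomes):
    """Build toy rows for one subject from a list of nonpov values."""
    return [{"id": subject_id, "wave": w, "nonpov": y}
            for w, y in enumerate(outcomes, start=1)]

# Toy data (invented): subject 1 leaves poverty at wave 3, subject 2 is not
# in poverty at wave 1, and subject 3 stays in poverty throughout.
toy = (make_subject(1, [0, 0, 1, 0, 1])
       + make_subject(2, [1, 0, 0, 0, 0])
       + make_subject(3, [0, 0, 0, 0, 0]))

restricted = restrict_to_first_exit(toy)
for row in restricted:
    print(row)
```

After the restriction, each remaining subject has at most one interview with the outcome, and that interview is always the last one, which is what makes the outcome absorbing in the analysis data.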
In a first analysis, Sjölander (2017) fitted the ordinary logistic regression model
In Model 10, the parameter
By conditioning on the subject-identifier
However, Model 11 cannot be fitted to the data because the covariates age and mother increase monotonically in time within each subject and the outcome (nonpoverty) is absorbing. To solve this problem, Sjölander (2017) used the two-step method with an assumed identity link, obtaining an estimate of
To facilitate use of the proposed methods, we wrote an R package, cglm, which is freely available on CRAN. We use this package to reconstruct the analysis by Sjölander (2017) and illustrate our novel proposal. We first load the package by typing
The package has one single function,
The
We summarize the results by typing
These results are identical to those presented by Sjölander (2017). We caution the reader that the estimated coefficients for in school, spouse, age, and mother in this output, as well as in the following outputs, have no obvious relevance for the research question as these measure the association between the measured covariates and the exposure (scaled by the dispersion parameter).
To use the two-step method with an assumed log link, we type
The estimate of
To use the CML method, we type
The CML estimate of
In the previous reanalysis, we assumed that the exposure-outcome odds ratio is constant across levels of covariates (e.g.,
The interaction term on the last row is nonsignificant, and the main effect on the first row is almost identical to the main effect in the aforementioned simpler analysis.
7. Discussion
The case-time-control design makes it possible to control for measured, time-varying covariates that increase monotonically in time within each subject while also controlling for all unmeasured covariates that are constant within each subject across time. Until recently, the design was restricted to data with only two timepoints and a single binary covariate, or data with a binary exposure. Sjölander (2017) made an important extension that allows for an arbitrary number of timepoints and covariates and a nonbinary exposure. However, his two-step estimation method requires specification of the link function in the GLM for the exposure and is restricted to the identity link and the log link.
In this article, we proposed a novel CML estimation method for the case-time-control design. This method allows for an arbitrary canonical link function and does not require the analyst to specify the link function. Our simulations show that the CML method works well and delivers a consistent estimate of the target parameter for a wide range of link functions. In contrast, we have shown that Sjölander’s (2017) two-step estimation method may be highly sensitive to the choice of link function and may have a large bias if this is misspecified.
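The reason a likelihood that conditions on the within-subject order statistic of the exposure can avoid the link function may be sketched as follows. This is our own minimal illustration, not the implementation in the cglm package. It assumes a canonical-link GLM for the exposure whose linear predictor at timepoint t is a subject-specific intercept plus a coefficient times the outcome y[t], with covariates omitted for brevity. Under that assumption, the intercept and the family-specific terms of the density are the same for every arrangement of a subject's exposure values, so they cancel when the observed arrangement is compared with all of its permutations. What remains depends only on `psi`, the outcome coefficient scaled by the dispersion parameter. All names are hypothetical.

```python
import math
from itertools import permutations

def cml_loglik_subject(psi, x, y):
    """Log conditional likelihood of one subject's exposure sequence x,
    given the outcomes y and the unordered set of exposure values."""
    def score(arrangement):
        return psi * sum(xt * yt for xt, yt in zip(arrangement, y))
    # Denominator: sum over all T! arrangements of the observed exposures,
    # computed with a stable log-sum-exp
    scores = [score(p) for p in permutations(x)]
    m = max(scores)
    log_denom = m + math.log(sum(math.exp(s - m) for s in scores))
    return score(x) - log_denom

# Hypothetical subject with four timepoints and the outcome at the last one
x = [4.0, 7.5, 2.0, 9.0]
y = [0, 0, 0, 1]

ll_zero = cml_loglik_subject(0.0, x, y)
ll_pos = cml_loglik_subject(0.5, x, y)
print(ll_zero, ll_pos)
```

In practice one would sum such contributions over subjects and maximize over `psi`. The full enumeration of permutations is feasible only for a small number of timepoints per subject, such as the five interviews in the real data example.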
Despite the strong dependency on the link function, we note three potential advantages of the two-step method. First, our simulations show that the two-step estimator is more efficient than the CML estimator when the link function is correct. This is not surprising given that the CML method makes weaker model assumptions than does the two-step method and the CML method conditions on the order statistic
A third, more subtle advantage of the two-step method is that it does not require the independence Assumption 2 to hold. Whereas the CML estimator requires that the exposure at a given timepoint has no direct causal effect on the exposure at future timepoints, the two-step estimator does not. This is potentially an important advantage as exposure-to-exposure causal effects are likely present in many scenarios. Unfortunately, it is difficult to study violations of this assumption in isolation: Both the two-step method and the CML method require that the exposure follows the GLM in Equation 6 at each timepoint, and it is difficult to see how one could make this model hold while violating the independence Assumption 2. We recognize this as an important topic for future research.
Supplemental Material
Supplemental material, SM862259_Supplemental_Appendix_B for A General and Robust Estimation Method for the Case-Time-Control Design by Arvid Sjölander and Yang Ning in Sociological Methodology
Footnotes
Appendix A
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Funding was provided by the Swedish Research Council (grant no. 2016-01267).
