Statistical Power for Estimating Treatment Effects Using Difference-in-Differences and Comparative Interrupted Time Series Estimators With Variation in Treatment Timing
Research article. First published online August 2022
This article develops new closed-form variance expressions for power analyses for commonly used difference-in-differences (DID) and comparative interrupted time series (CITS) panel data estimators. The main contribution is to incorporate variation in treatment timing into the analysis. The power formulas also account for other key design features that arise in practice: autocorrelated errors, unequal measurement intervals, and clustering due to the unit of treatment assignment. We consider power formulas for both cross-sectional and longitudinal models and allow for covariates. An illustrative power analysis provides guidance on appropriate sample sizes. The key finding is that accounting for treatment timing increases required sample sizes. Further, DID estimators have considerably more power than standard CITS and ITS estimators. An available Shiny R dashboard performs the sample size calculations for the considered estimators.
When randomized controlled trials (RCTs) are not feasible and time series data are available, panel data methods can be used to estimate treatment effects on outcomes by exploiting variation in policies and conditions over time and across locations. A feature of panel data is that treatment timing often varies across the sample, for example, due to differences across locations in treatment implementation (such as school turnaround efforts; Redding & Nguyen, 2020), adoption of laws, or natural phenomena (such as the onset of Covid-19). Only recently has the literature addressed this timing feature for difference-in-differences (DID) estimators of treatment effects, and it has not yet done so for other panel data estimators (Athey & Imbens, 2018; Borusyak & Jaravel, 2017; Callaway & Sant’Anna, 2021; de Chaisemartin & D’Haultfœuille, 2020; Goodman-Bacon, 2018; Sun & Abraham, 2020). Another feature of panel data is that outcomes are often autocorrelated over time, which, if not addressed, can lead to estimated standard errors that are seriously biased downward (Bertrand et al., 2004).
This article discusses new power formulas to assess required sample sizes that account for both of these real-world features for two classes of commonly used panel data estimators: (1) DID estimators that contrast changes in outcomes during the pre- and posttreatment periods across the treated and comparison groups and (2) comparative interrupted time series (CITS) and associated ITS estimators that rely on fitted pre- and postperiod trend lines. We adopt common specifications found in the literature for these estimators and build on the much smaller associated literature on statistical power that has focused only on settings without variation in treatment timing. We consider binary treatments—where individuals or groups who become treated remain so during the following periods—and focus on both continuous outcomes (such as student test scores) and binary outcomes aggregated as rates to the cluster level.
Another contribution of this work is that we incorporate recent approaches for adjusting for the clustering of individuals within groups (such as geographic or educational units), an issue on which there has been considerable confusion in the causal inference literature for quasi-experimental designs (QEDs; Abadie et al., 2017). We consider panel designs with separate cross sections of individuals as well as longitudinal designs where the same individuals are followed over time. Further, unlike previous power analysis studies, we allow for time periods to be unevenly spaced and for the inclusion of model covariates to adjust for potential confounding bias and to improve precision. Using an event history approach, we develop power formulas for both point-in-time effects (to examine treatment dynamics) and pooled effects averaged over the full postperiod. We use a potential outcomes framework (Holland, 1986; Rubin, 1974, 2005) to facilitate comparisons across the estimators. Our approach accommodates studies using either individual- or cluster-level (aggregate) data. A free Shiny R dashboard, Power_Panel, available for download in the Supplementary Materials with a .pdf documentation file, performs the sample size calculations for the considered estimators.
The rest of this article is in eight sections. The second section discusses the related power analysis literature, and the third section discusses our statistical power framework. In the fourth section, we discuss clustering, and the fifth section presents notation and target estimands. The sixth and seventh sections present sample size formulas for the DID and CITS/ITS estimators, where our main results are presented in a series of theorems. In the eighth section, we present an illustrative power analysis for the considered estimators and then conclude. For reference, Table 1 lists the equation numbers for the variance formulas presented in the text for each estimator. Online Appendix A provides more details on the power formulas.
Table 1. Equation Numbers of Variance Formulas in the Text, by Estimator
Panel Data Estimator (Relative to the Average Preperiod and Averaged Across All Timing Groups)
Equation 19, omitting preperiod slope terms and including autocorrelations for individual-level errors
a The discrete estimator is similar to the fully interacted estimator, except that the postperiod in Equation 19 is modeled using discrete postperiod indicators rather than linear trendlines. The pooled variance estimators are the same for the discrete and fully interacted estimators. Variances for the point-in-time discrete estimators can be obtained using Equation 24 and Note 6 in the main text.
Related Literature
We extend the previous power analysis literature for DID estimators to simultaneously allow for variation in treatment timing (not previously addressed), autocorrelated errors, clustering, both average and point-in-time effects, and uneven time intervals and covariates (both not previously addressed). Frison and Pocock (1992) provide sample size formulas for nonclustered, longitudinal DID designs with constant correlations across time but do not consider clustered designs. McKenzie (2012) considers similar nonclustered designs with more general error structures but does not provide explicit variance formulas. While Somers et al. (2013) consider clustered designs, they examine two time periods only and do not address staggered timing or autocorrelation. Burlig et al. (2019) provide sample size formulas for cross-sectional models with both clustering and autocorrelated errors for a more restrictive model specification than ours, but do not allow for staggered timing, uneven time intervals, covariates, or point-in-time effects.
To our knowledge, Bloom (1999) is the only article that has developed statistical power formulas for CITS and ITS estimators. However, this study only considers models with discrete postperiod indicators, but not more common specifications that model both pre- and postperiod trendlines that are considered here. Further, Bloom (1999) considers a single follow-up period only and does not address staggered timing, autocorrelation, or covariates. Zhang et al. (2011), Kontopantelis (2018), Hawley et al. (2019), and Liu et al. (2019) use simulations to calculate power for the ITS design but do not develop power formulas or address staggered timing. Several articles conduct within-study comparisons using RCT data to empirically compare RCT impact estimates with those obtained using CITS or ITS methods (Baicker & Svoronos, 2019; Clair et al., 2016) but do not develop variance formulas. In the theory sections, we show how our general formulas reduce to those from the literature.
Statistical Power Framework
Our power analysis relies on formulas for minimum detectable impacts, which represent the smallest true impacts that can be detected with a high probability. We scale the minimum detectable impacts into effect size (standard deviation) units, hereafter referred to as MDEs, which is common in policy research, especially for outcomes that are difficult to interpret in nominal units (although this scaling is not necessary). This approach parallels the statistical power literature for RCTs (see, e.g., Bloom, 1995; Donner & Klar, 2000; Murray, 1998; Schochet, 2008) and regression discontinuity designs (Schochet, 2009).
All the considered impact estimators are multiple regression estimators. Accordingly, relying on asymptotic normality and a classical hypothesis testing approach, we assume the use of t-statistics to test null hypotheses of the form $H_0\colon \beta = 0$, where $\beta$ is the impact parameter of interest. This approach yields the following MDE formula for a two-tailed hypothesis test:
$$\mathrm{MDE} = \mathrm{Factor}(\alpha, \gamma, df)\,\frac{\sqrt{\mathrm{Var}(\hat{\beta})}}{\sigma_y}, \tag{1}$$
where $\mathrm{Factor}(\cdot) = T^{-1}(1 - \alpha/2) + T^{-1}(\gamma)$, $T^{-1}$ is the inverse of the Student’s t distribution function, $\alpha$ is the significance level, $\gamma$ is the statistical power level, $df$ is the degrees of freedom, $\hat{\beta}$ is the impact estimator, $\mathrm{Var}(\hat{\beta})$ is the variance of $\hat{\beta}$ based on the specific design features, and $\sigma_y$ is the standard deviation of the outcome variable at a particular time point.
Alternatively, one can solve Equation 1 to calculate required sample sizes or the number of time periods to achieve a given MDE value, assuming specific values of $\alpha$, $\gamma$, and the design parameters in $\mathrm{Var}(\hat{\beta})$. Study sample sizes enter Equation 1 primarily through $\mathrm{Var}(\hat{\beta})$ but also through $\mathrm{Factor}(\cdot)$ because of $df$. Hill et al. (2008) and Lipsey et al. (2012) discuss a framework for selecting MDE targets for education evaluations, for example, by examining the natural growth in outcomes over time, policy-relevant gaps across subgroups, and observed effect sizes from previous similar evaluations.
The power formula in Equation 1 hinges critically on $\mathrm{Var}(\hat{\beta})$. Thus, this article focuses on calculating closed-form expressions for $\mathrm{Var}(\hat{\beta})$ for the considered designs. However, we begin by discussing the general issue of clustering that can have a large effect on precision.
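As a rough illustration of how Equation 1 is used once a variance is in hand, the sketch below computes an MDE in standard deviation units. It is a minimal sketch, not the article's dashboard: it assumes large degrees of freedom and therefore swaps the Student's t quantiles for standard normal quantiles, and the function names and example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mde_factor_normal(alpha: float = 0.05, power: float = 0.80) -> float:
    """Two-tailed multiplier for the MDE formula.

    ASSUMPTION: normal quantiles stand in for t quantiles, which is
    reasonable only when the degrees of freedom are large.
    """
    z = NormalDist()
    return z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)


def mde(var_impact: float, sd_outcome: float,
        alpha: float = 0.05, power: float = 0.80) -> float:
    """Minimum detectable effect: multiplier * sqrt(variance) / outcome SD."""
    return mde_factor_normal(alpha, power) * sqrt(var_impact) / sd_outcome


# Hypothetical inputs: an impact variance of 0.01 and an outcome SD of 1.
factor = mde_factor_normal()
example_mde = mde(var_impact=0.01, sd_outcome=1.0)
```

Because t quantiles exceed normal quantiles at these probability levels, this approximation understates the MDE when the number of clusters is small.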
Clustering
Clustering occurs when the outcomes of individuals (or broader units) in the data set are correlated. Consider a panel analysis using data on the same individuals (or units) over time. In this case, outcomes are likely to be correlated for the same individual (or unit) across time periods, and our variance formulas incorporate this form of clustering. However, there has been much confusion in the QED causal inference literature about how to account for clustering across different individuals (or units) in the sample (Abadie et al., 2017).
In this article, we adopt an approach developed for RCTs and inspired by design-based theory to account for clustering across sample members or units (Abadie et al., 2017; Freedman, 2008; Imbens & Rubin, 2015; Schochet, 2010, 2013, 2016, 2020). Under this approach, one must be explicit about what is random under repeated sampling. For instance, if the study units are assumed to be fixed (which occurs in finite population settings), the main source of randomness is the treatment assignment itself. Accordingly, in this setting, the presence of clustering hinges on the considered unit of treatment assignment.
To determine this unit, it is useful to consider what would be the parallel unit under a hypothetical RCT. To help fix concepts, consider a panel data study examining whether Covid-19 had a larger effect on students in states where the pandemic hit earlier than in states where it hit later. In this case, the hypothetical unit of treatment assignment under an RCT would be the state, and thus, similarly in the panel data context. However, suppose the study instead examined the effects of pandemic-related educational policies (such as distance learning) that differed by school district. In this case, we would treat districts as the unit of treatment assignment and the source of clustering.
Note that in some designs, study units are formally sampled for the study from broader populations (e.g., studies using national data sets with multistage sample designs or multisite random block designs) or are deemed to have been so. In these cases, study results are assumed to generalize beyond the study sample, and clustering from the sampling of units becomes pertinent (Schochet, 2008). We do not formally consider such forms of clustering because their variance estimators differ from those developed here due to the emergence of correlations between the outcomes of the treatment and comparison groups within the same higher level clusters. However, our variance formulas are likely to be conservative in these cases (because they ignore these correlations). Further, our approach applies to settings where the higher level clusters are treated as fixed effects rather than as random blocks.
Clustering increases the MDEs in Equation 1 for two interrelated reasons (Donner & Klar, 2000; Hedges, 2007; Murray, 1998; Schochet, 2008). First, design effects due to correlated outcomes reduce effective sample sizes, and hence precision, as can be quantified by intraclass correlation coefficients (ICCs). Second, the degrees of freedom are based on the number of clusters, not the number of individuals, which increases the t-distribution critical values that enter Equation 1.
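The first of these two reasons can be illustrated with the textbook design effect for equal-sized clusters in a single cross section. This is a simplified sketch under that assumption, not the variance expressions developed in this article, and the function names and inputs are hypothetical.

```python
def design_effect(n_per_cluster: int, icc: float) -> float:
    """Textbook (Kish) design effect for equal-sized clusters."""
    return 1.0 + (n_per_cluster - 1) * icc


def effective_sample_size(n_clusters: int, n_per_cluster: int, icc: float) -> float:
    """Number of independent observations the clustered sample is worth."""
    return n_clusters * n_per_cluster / design_effect(n_per_cluster, icc)


# Hypothetical design: 20 clusters of 50 individuals with an ICC of 0.10.
deff = design_effect(50, 0.10)
n_eff = effective_sample_size(20, 50, 0.10)
```

In this hypothetical design, 1,000 individuals carry roughly the information of 170 independent observations, which is why the number of clusters, rather than cluster size, tends to drive power.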
While many designs in our setting are likely to be clustered, there may be longitudinal designs where the unit of treatment assignment is the individual. For instance, consider a study that aims to examine the effects of tutoring on the achievement of low-performing students who have access to tutors at different time points. In this case, the unit of treatment assignment could be the student. While we consider more general clustered designs, our variance formulas also apply (reduce) to nonclustered, longitudinal designs by setting the pertinent ICC parameters to 0 and treating clusters as individuals (as discussed further in Section “DID Estimator”).
Definitions, Assumptions, and Target Estimands
We consider a clustered panel data setting with $M$ total clusters defined by the unit of treatment assignment. We assume $M^T$ clusters in the treatment group and $M^C = M - M^T$ untreated clusters in the comparison group. We assume the same clusters remain in the sample for each of $P$ time periods (e.g., where time is measured in months, semesters, or years). We denote time periods by $t = 1, \ldots, P$, with both pre- and posttreatment periods. As discussed later, the time periods do not necessarily have to be evenly spaced, but we avoid this notational inconvenience for now.
We assume K treatment timing groups in total, with treatment start times of for timing group ordered from earliest to most recent, with treatment clusters in group k. We assume comparison clusters are matched to timing group k ( and can differ). We do not consider an analysis where each timing group is compared to the full comparison group (as is sometimes done in the literature), because under our event history framework, this approach considerably complicates the variance formulas due to covariance terms that arise across the timing group estimators due to the common comparison group. It can be shown, however, that the power formulas developed here are good approximations for panel designs with an overlapping comparison group sample (e.g., if this sample is proportionally split across the timing groups). Our stratified approach may also be desirable in practice, especially if there are differences in cluster-level characteristics across the timing groups (see Daw & Hatfield, 2018, for a discussion of matching issues in panel studies).
We assume multiple time periods and the availability of both pre- and postperiod data ( for the DID design and for the CITS and ITS designs for all k). We assume treatment effects are observed and measured starting in period Sk, which pertains to the first posttreatment period. Let Tj be the treatment indicator that equals 1 for ever-treated clusters and 0 for comparison (never-treated) clusters. Further, let denote the group of treatment clusters in timing group k (with onset time Sk), where are the group of matched comparison clusters. Finally, let denote the number of postintervention (“after”) periods, where are the number of preintervention (“before”) periods. For reference, Table 2 provides key parameter definitions used throughout this article.
Table 2. Key Notation and Definitions
Input Variable
Definition
Variable Range
,
Number of total, treatment, and comparison clusters, where is the share of treatment clusters
for DID/CITS designs and for ITS design
P, t
Total number of time periods, where time is indexed by t
for DID design; for CITS and ITS designs
N
Number of individuals per cluster per time period
Tj
Treatment indicator that equals 1 for ever-treated clusters and 0 for comparison (never-treated) clusters indexed by j
;
Time of outcome measurement, in elapsed calendar time since a common reference point (e.g., months or quarters). These time points could be evenly or unevenly spaced
K
Number of treatment timing groups in the treatment sample
Sk;
Treatment start period for each timing group, measured using the time label, t, not the elapsed calendar time variable, . Treatment effects can be measured starting in Sk
for DID design; for CITS/ITS designs
;
Number of post- and preintervention periods
Ak, for DID design; for CITS/ITS designs
1/0 indicators of preperiod () or postperiod
0 or 1
, , ,
Number of treatment and associated comparison clusters in timing group k, where and
;
, , Gj
for treatment clusters in timing group k, for matched comparison clusters, and for both groups
,
Potential outcomes in the treated and untreated conditions for individual i in cluster j at time t
Continuous or binary
Mean potential outcomes averaged over all clusters in a timing group at time q ( or over a specific period
Means or proportions
Observed outcomes and cluster-level means at a time point or period, averaged over a timing group and treatment status
Continuous or binary
Weight assigned to timing group k in postperiod q ( or to timing group k when calculating pooled effects (wk)
Correlation of cluster-level outcomes over time, modeled using an AR(1) structure (or assumed constant over time)
Mean correlations over specific time periods
Intraclass correlation coefficient measuring the percentage of total variance in the outcome that is due to variation between cluster-time cells
;
Correlation of outcomes for the same individual over time for longitudinal designs, modeled using an AR(1) structure
;
Regression R2 value from the inclusion of a vector of covariates, , conditional on the other model parameters
; can be continuous or binary
Average R2 value from regressing the treatment status interaction terms on the model covariates (treatment-covariate collinearity)
Note. DID = difference-in-differences; CITS = comparative interrupted time series.
We assume data are available for individuals in each study cluster at each time period, which allows us to fully examine sources of variation in the model error terms. However, the power formulas apply also to analyses using data averaged to the cluster level or to lower level units within the study clusters. Data aggregation (averaging) aligns with our framework because we consider linear panel data estimators, so the regression model structure is not affected by aggregation (Greene, 2012; Schochet, 2020). The analysis, however, cannot be conducted if data are available only for units at higher levels than the unit of treatment assignment.
We index individuals by for individual i in cluster j in time t. For each cluster, either all individuals are treated or not. For simplicity and as is customary in the power literature, we assume a balanced design with individuals per cluster in each time period; for unbalanced designs, the average sample size across all time periods and clusters, , can be used in the power formulas as an approximation (Donner & Klar, 2000; Kish, 1995; Murray, 1998; Schochet, 2008). This assumption does not materially affect the analysis as power for clustered designs is determined primarily by the number of clusters, not the number of individuals per cluster (unless this number is small). We allow for both separate cross sections of individuals (e.g., separate cohorts of fourth graders in each time period) and longitudinal data where the same persons are followed over time.
To make the large amount of notation concrete, consider the panel study conducted by Zimmer et al. (2017) that estimated the effects of school turnaround strategies in low-performing Tennessee schools on student test scores. In this study, there were time periods with schools (clusters) in the treatment group and comparison clusters. There was variation in treatment timing, with 19 treatment schools treated at time point , 20 schools treated at , and 12 schools treated at . Thus, the study had three timing groups (, 2, and 3) with associated treatment start times of , , and and treatment cluster sample sizes of in timing group 1 (), in timing group 2 (), and in timing group 3 (). The lengths of the pre- and postperiods were and for , and for , and and for . Finally, there was an average of students per school per time period. We use this running example throughout this article.
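The bookkeeping in this example can be sketched in a few lines. The treatment cluster counts (19, 20, and 12) come from the example; the total number of periods and the start periods used below are hypothetical placeholders, chosen only to illustrate staggered onset. The sketch assumes that the start period is the first posttreatment period and that every timing group is observed through the final period, so that the preperiod length is the start period minus 1 and the postperiod length is the number of periods minus the start period plus 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingGroup:
    k: int          # timing group label
    start: int      # first posttreatment period (time label)
    n_treated: int  # number of treatment clusters in the group

    def n_pre(self) -> int:
        """Number of preintervention periods."""
        return self.start - 1

    def n_post(self, n_periods: int) -> int:
        """Number of postintervention periods, including the start period."""
        return n_periods - self.start + 1


N_PERIODS = 6  # HYPOTHETICAL placeholder for the total number of periods
groups = [
    TimingGroup(k=1, start=4, n_treated=19),  # start periods are HYPOTHETICAL
    TimingGroup(k=2, start=5, n_treated=20),
    TimingGroup(k=3, start=6, n_treated=12),
]

total_treated = sum(g.n_treated for g in groups)
pre_lengths = [g.n_pre() for g in groups]
post_lengths = [g.n_post(N_PERIODS) for g in groups]
```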
We invoke several assumptions that apply to both the DID and CITS/ITS estimators to obtain unbiased estimators of well-defined treatment effect parameters (estimands), where additional assumptions specific to each design are presented in Sections “DID Estimator” and “CITS and ITS Estimators”:
Assumption 1. The stable unit treatment value assumption (SUTVA; Rubin, 1986). Let $Y_{ijt}(\mathbf{T})$ denote the potential outcome for an individual given the random vector of all cluster treatment assignments, $\mathbf{T}$. Then, if $T_j = T_j'$ for cluster j, we have that $Y_{ijt}(\mathbf{T}) = Y_{ijt}(\mathbf{T}')$.
SUTVA allows us to express $Y_{ijt}(\mathbf{T})$ as $Y_{ijt}(T_j)$, so that the potential outcomes of an individual in cluster j depend only on the cluster’s treatment assignment and not on the treatment assignments of other clusters in the sample. More specifically, SUTVA allows us to define $Y_{ijt}(1)$ as the potential outcome for the individual in the treatment condition and $Y_{ijt}(0)$ as the potential outcome in the nontreated condition. Potential outcomes are assumed to be continuous variables, although our methods also approximately apply to binary outcomes that are aggregated as rates or proportions to the cluster level. Using this framework, the data generating process for the observed outcome at each time point, $y_{ijt}$, can be expressed as $y_{ijt} = T_j Y_{ijt}(1) + (1 - T_j) Y_{ijt}(0)$. SUTVA also requires that units cannot receive different forms of the treatment.
Assumption 2. No anticipatory behavior. We assume that a future treatment does not affect past outcomes, so that $Y_{ijt}(1) = Y_{ijt}(0)$ for $t < S_k$. This assumption is required to rule out situations where individuals or units change their behavior in anticipation of a treatment that is likely to occur in the future, which could influence preperiod outcomes.
To define the focal estimands for our analysis, we adopt an “event history” approach that aggregates treatment effect parameters across particular treatment timing groups and time periods (Callaway & Sant’Anna, 2021; Sun & Abraham, 2020). With variation in treatment timing, an important consideration for defining our focal estimands is whether to measure the postprogram impacts in calendar time or relative to exposure to the treatment. Consider first a calendar-time analysis, where we define as the average effect of treatment on the treated (ATT) for clusters in timing group in postperiod q (assuming ):
Here, is the mean cluster-level outcome in the treated condition at time q for timing group k, and is the mean outcome if these clusters had not been treated. Following Callaway and Sant’Anna (2021), our first impact parameter of interest averages the estimands across timing groups:
where is an indicator that equals 1 if q is a postperiod for timing group k and 0 otherwise, and are weights (which we set to 1, as discussed in Section “Cross-Sectional Analysis: Framework”). Some timing groups may not contribute to Equation 3. For instance, in our running example, is a postperiod for timing groups and with and , but not for timing group with , so timing group 3 would be excluded from Equation 3 to calculate .
Note that at calendar time q, exposure to the intervention, , will differ across timing groups, which could complicate the interpretation of the estimands if impacts vary over time. Using our running example, at , the exposure time is three periods for timing group 1, compared to only two periods for timing group 2 and one period for timing group 3. Thus, an alternative approach used by Sun and Abraham (2020) is to realign the data to measure postperiod impacts relative to treatment exposure, where exposure time, te, can be mapped to calendar time using . This approach yields the following estimand pertaining to impacts at exposure point l that are averaged over the timing groups:
where is the parameter for timing group k at exposure point l; is an indicator that equals 1 if the length of the postperiod (Ak) is at least l periods and 0 otherwise; and are weights (which we set to 1). These point-in-time estimands are useful for examining treatment dynamics (Sun & Abraham, 2020), although potential differences in impacts across timing groups could affect interpretation.
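The realignment from calendar time to exposure time can be sketched as follows. The sketch assumes that exposure equals 1 in a timing group's start period (its first posttreatment period), so that exposure at a calendar period is that period minus the start period plus 1. The number of periods and the start periods are hypothetical placeholders, and the function names are illustrative.

```python
def exposure_at(calendar_q: int, start: int) -> int:
    """Exposure time at calendar period q for a group starting in `start`."""
    return calendar_q - start + 1


def calendar_time(exposure_l: int, start: int) -> int:
    """Calendar period at which a group reaches exposure point l."""
    return start + exposure_l - 1


def has_exposure(exposure_l: int, start: int, n_periods: int) -> bool:
    """True if the group's postperiod is at least l periods long."""
    return (n_periods - start + 1) >= exposure_l


N_PERIODS = 6                # HYPOTHETICAL number of periods
STARTS = {1: 4, 2: 5, 3: 6}  # HYPOTHETICAL start periods by timing group

# Exposure reached by each timing group in the final calendar period.
exposure_in_last_period = {k: exposure_at(N_PERIODS, s) for k, s in STARTS.items()}

# Timing groups that contribute to the estimand at exposure point l = 2.
contributors_at_2 = [k for k, s in STARTS.items() if has_exposure(2, s, N_PERIODS)]
```

Under these placeholder values, the three groups have three, two, and one periods of exposure in the final calendar period, and only the first two groups contribute at the second exposure point.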
Power concerns are similar for the point-in-time and estimands as they both rely on the same components, differing in only how these components are organized and averaged. Thus, our analysis applies to both impact parameters. While we prefer the estimand (since it has a clearer interpretation), for ease of exposition, we use calendar time notation because it conforms to the format of time series data that are typically available for analysis; we then transform the results into exposure time as needed. The choice of notation does not change the results.
Another key estimand for our analysis, , takes a weighted average of either the or parameters over their respective postperiods to obtain pooled treatment effects averaged over the full observed postperiod (both approaches yield the same result). This pooled estimand has policy relevance as a summary measure of treatment effects, for instance, for use in benefit–cost analyses. An intuitive way to express the estimand is as follows:
where we set for our analysis.1 This aggregate parameter is flexible in that it allows for heterogeneous treatment effects both across timing groups and across time within timing groups.
The estimation challenge for the considered treatment effect parameters is that we do not observe counterfactual outcomes for the treated groups during the posttreatment period, that is, the terms in Equation 2. Thus, our considered DID and CITS/ITS panel methods estimate these counterfactuals under various identification assumptions, which then allows for consistent estimation of the , , and estimands in Equations 3–5.
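For orientation, the estimands described in this section can be written compactly as below. This is a sketch in assumed notation: the symbols $\tau$, $D$, and $w$ are labels chosen here for illustration and may differ from the article's own, and the equations are deliberately left unnumbered.

```latex
% Assumed notation: \tau for estimands, D for inclusion indicators,
% w for weights, S_k for start periods, A_k for postperiod lengths.
\begin{align*}
\tau_{kq} &= \bar{Y}_{kq}(1) - \bar{Y}_{kq}(0), \qquad q \ge S_k, \\
\tau_{q} &= \frac{\sum_{k=1}^{K} w_{kq} D_{kq}\, \tau_{kq}}{\sum_{k=1}^{K} w_{kq} D_{kq}},
  \qquad D_{kq} = 1\{q \ge S_k\}, \\
\tau^{e}_{l} &= \frac{\sum_{k=1}^{K} w_{kl} D^{e}_{kl}\, \tau_{k,\,S_k + l - 1}}{\sum_{k=1}^{K} w_{kl} D^{e}_{kl}},
  \qquad D^{e}_{kl} = 1\{A_k \ge l\}, \\
\tau^{\mathrm{pool}} &= \frac{\sum_{k=1}^{K} \sum_{q=S_k}^{P} w_{kq}\, \tau_{kq}}{\sum_{k=1}^{K} \sum_{q=S_k}^{P} w_{kq}}.
\end{align*}
```

With all weights set to 1, the pooled expression averages every timing-group-by-period effect equally, so a timing group's share of the pooled estimand is proportional to the length of its postperiod.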
DID Estimator
DID methods identify causal effects from panel data by contrasting changes in outcomes during the pre- and posttreatment periods across the treatment and comparison groups. The large literature on DID methods focuses on designs without variation in treatment timing (e.g., Angrist & Pischke, 2009; Ashenfelter, 1978; Ashenfelter & Card, 1985; Bertrand et al., 2004). However, a smaller recent literature (cited in the Introduction section) considers impact estimation with staggered treatment timing to produce unbiased estimators. This literature underlies our analysis.
The key identifying assumption for DID methods is that in the absence of treatment, the mean outcomes for the treatment and comparison groups would exhibit parallel trends over time (Abadie, 2005; Heckman et al., 1997):
Assumption DID.1. Parallel trends. For each timing group, , mean counterfactual outcomes for the treatment and comparison groups exhibit parallel trends for each postperiod, , relative to the average preperiod:
Here, is the mean counterfactual outcome averaged over all preperiods and all treatment clusters in timing group k (using equal preperiod weights), and similarly for for the comparison clusters.
Intuitively, this assumption allows us to obtain an unbiased estimate of the unobserved counterfactual, , using , where is the observed preperiod mean outcome for treatments in timing group k, and similarly for and ; this yields the DID estimator discussed below.
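The logic of the parallel trends assumption and the counterfactual it implies can be sketched as follows. The notation is assumed for illustration: superscripts $T$ and $C$ denote treatment and comparison clusters, the subscript "pre" denotes the equally weighted preperiod average, uppercase terms are potential outcome means, and lowercase terms are observed means.

```latex
% Assumed notation; a sketch of the assumption and the estimator it implies.
\begin{align*}
\bar{Y}^{T}_{kq}(0) - \bar{Y}^{T}_{k,\mathrm{pre}}(0)
  &= \bar{Y}^{C}_{kq}(0) - \bar{Y}^{C}_{k,\mathrm{pre}}(0), \qquad q \ge S_k, \\
\widehat{\bar{Y}^{T}_{kq}(0)}
  &= \bar{y}^{T}_{k,\mathrm{pre}} + \left(\bar{y}^{C}_{kq} - \bar{y}^{C}_{k,\mathrm{pre}}\right), \\
\bar{y}^{T}_{kq} - \widehat{\bar{Y}^{T}_{kq}(0)}
  &= \left(\bar{y}^{T}_{kq} - \bar{y}^{T}_{k,\mathrm{pre}}\right)
   - \left(\bar{y}^{C}_{kq} - \bar{y}^{C}_{k,\mathrm{pre}}\right).
\end{align*}
```

The last line is the familiar difference-in-differences contrast: the treatment group's change from its preperiod average minus the comparison group's change over the same periods.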
In what follows, we develop variance formulas for the DID estimator, first for the cross-sectional design and then for the longitudinal design.
Cross-Sectional Analysis: Framework
To consider DID impact and associated variance estimators for the , , and estimands, we rely on the following regression model, adapted from the literature to our context, using stacked data on separate cross sections of individuals nested within study units (clusters) over time:
In this model, and are cluster and time fixed effects, is an indicator that equals 1 if either or and 0 otherwise, and is a period q indicator that equals 1 if and 0 otherwise. The random error terms are assumed to have mean zero and to be distributed independently of each other: captures the correlations of individuals within the same cluster and time period, and are iid individual-level errors. Following the influential work of Bertrand et al. (2004), we allow to be correlated over time using an autoregressive process of order 1 (AR(1)), where is the autocorrelation parameter and are iid errors with . For our analysis, we assume the same autocorrelation structure during the pre- and posttreatment periods and for each timing group and their associated comparison group. Under the AR(1) model with long panels, , which we assume hereafter.2 This is not a restrictive assumption as converges quickly over time (e.g., for , the variance stabilizes after three periods).
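The error structure described here can be summarized as below. The sketch covers only the error terms, not the full regression model, and the symbols are assumed labels: $u_{jt}$ for the cluster-time error, $e_{ijt}$ for the individual-level error, $\rho$ for the autocorrelation parameter, and $\eta_{jt}$ for the iid innovation with variance $\sigma^2_{\eta}$.

```latex
% Assumed notation; standard AR(1) properties for the cluster-time error.
\begin{align*}
\text{composite error}_{ijt} &= u_{jt} + e_{ijt}, \\
u_{jt} &= \rho\, u_{j,t-1} + \eta_{jt}, \qquad |\rho| < 1, \\
\mathrm{Var}(u_{jt}) &= \frac{\sigma^{2}_{\eta}}{1 - \rho^{2}}
  \quad \text{(stationary, long-panel value)}, \\
\mathrm{Corr}(u_{jt}, u_{jt'}) &= \rho^{\,|t - t'|}.
\end{align*}
```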
Under the AR(1) model, correlations are larger for cluster observations closer in time than for those further apart, declining geometrically with the number of periods separating them. Accounting for autocorrelated errors is important in many settings because cluster-level outcomes are typically highly correlated over time, which, if ignored, can lead to serious underestimates of standard errors (Bertrand et al., 2004). For instance, pertinent to studies in education, when estimating Equation 7 using National Assessment of Educational Progress (NAEP) public use data on 28 school districts over nine years (excluding the interaction terms), we find an estimated autocorrelation of 0.49 for fourth grade reading scores and 0.46 for fourth grade math scores (both are statistically significant at the 1% level). Thus, the autocorrelations are substantial, even after controlling for time fixed effects.
In DID settings, time intervals are often equally spaced, but not always. Thus, we consider general settings where time measurements can be unevenly spaced within timing groups but not across them. We follow Baltagi and Wu (1999) who discuss ordinary least squares (OLS) estimation with unequal time intervals that fully maintains the AR(1) structure.3 Under this approach, correlations between successive observations are based on their time differences (e.g., the correlation is for successive observations three time units apart). To use this approach, for each t, we use the notation, , to denote the elapsed time between outcome measurement and a common reference point; thus, t refers to a time label (counter), whereas refers to elapsed calendar time.
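A minimal sketch of this idea builds the AR(1) correlation matrix directly from elapsed times, so that the correlation between two measurements depends on the calendar time separating them rather than on their position in the sequence. The function name, elapsed times, and autocorrelation value are hypothetical.

```python
from typing import List, Sequence


def ar1_correlation_matrix(elapsed_times: Sequence[float], rho: float) -> List[List[float]]:
    """AR(1) correlations based on elapsed time between measurements.

    Entry (s, t) equals rho raised to the absolute difference between
    the elapsed times of measurements s and t.
    """
    return [[rho ** abs(a - b) for b in elapsed_times] for a in elapsed_times]


# Hypothetical schedule with a three-unit gap between the second and
# third measurements, and a hypothetical autocorrelation of 0.5.
times = [0, 1, 4, 5]
corr = ar1_correlation_matrix(times, rho=0.5)

adjacent_one_unit_apart = corr[0][1]     # rho ** 1
adjacent_three_units_apart = corr[1][2]  # rho ** 3
```

With evenly spaced times the matrix reduces to the usual AR(1) pattern in which the correlation depends only on the number of periods apart.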
The regression model in Equation 7 includes three-way interactions between indicators of timing group, time period, and treatment status; the model, however, excludes interactions for each comparison group and preperiod. Thus, the resulting OLS estimators, , provide DID estimates for each posttreatment time point relative to the mean preperiod. Formally, the DID estimator for timing group k in time period is , where and are mean observed outcomes defined above. For instance, using our running example, the DID estimator at time for timing group with can be calculated using . However, the DID estimator, , would not be germane for timing group with , because is not a postperiod for these clusters. Note that the OLS estimates using Equation 7 are the same as those from models run separately by timing group.4
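The DID contrast for a single timing group can be sketched directly from period means, without running the regression. The data below are made-up cluster-level means for one timing group and its matched comparison group, the start period is a hypothetical placeholder, and the function name is illustrative.

```python
from statistics import mean
from typing import Dict


def did_estimate(treat_means: Dict[int, float], comp_means: Dict[int, float],
                 start: int, q: int) -> float:
    """DID estimate at postperiod q relative to the average preperiod.

    treat_means / comp_means map each period label to the group's mean outcome.
    """
    if q < start:
        raise ValueError("q must be a postperiod for this timing group")
    pre_periods = [t for t in treat_means if t < start]
    treat_pre = mean(treat_means[t] for t in pre_periods)
    comp_pre = mean(comp_means[t] for t in pre_periods)
    return (treat_means[q] - treat_pre) - (comp_means[q] - comp_pre)


# HYPOTHETICAL period means for a timing group that starts treatment in period 4.
treat = {1: 10.0, 2: 11.0, 3: 12.0, 4: 15.0, 5: 16.0, 6: 17.0}
comp = {1: 9.0, 2: 10.0, 3: 11.0, 4: 12.0, 5: 13.0, 6: 14.0}

estimate_at_5 = did_estimate(treat, comp, start=4, q=5)
```

Calling the function with a period before the start period raises an error, mirroring the point that the estimator is not defined for periods that are not postperiods for a timing group.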
We can now aggregate the estimators across timing groups to obtain an unbiased estimator for the calendar-time estimand in Equation 3 using
Similarly, we can estimate the exposure-specific estimand in Equation 4 using
where converts exposure time to calendar time to select the pertinent estimators. Finally, we can aggregate or across their postperiods to obtain an unbiased estimator for the estimand in Equation 5 using
or using the equivalent expressions in Note 1.
For weighting, we follow Sun and Abraham (2020) who use , so that each included timing group receives the same weight in calculating the point-in-time estimators, . This weighting scheme also implies that when calculating , the weight for each timing group is Ak, the length of its postperiod (i.e., ). For instance, in our running example, timing group 1 with would receive a larger weight than both timing group 2 with and timing group 3 with , because it has more exposure periods. In this setting, can be interpreted as the average treatment effect observed in the sample. Other options exist for and , such as weighting according to the number of clusters ( and ; Callaway & Sant’Anna, 2021) or inversely proportional to timing group variances.
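The implication of this weighting scheme for the pooled estimator can be checked with a short Python sketch. The function name and the example estimates are hypothetical; the sketch assumes only that each timing group's mean postperiod effect is weighted by the length of its postperiod, which is equivalent to taking a simple average over all timing group by postperiod estimates.

```python
def pooled_estimate(post_estimates):
    """Pooled estimate with timing group weights proportional to A_k.

    post_estimates maps each timing group to its list of point-in-time
    DID estimates, one per postperiod, so len(list) is A_k.
    """
    total_post = sum(len(est) for est in post_estimates.values())
    weighted_sum = 0.0
    for est in post_estimates.values():
        a_k = len(est)
        group_mean = sum(est) / a_k
        weighted_sum += a_k * group_mean
    return weighted_sum / total_post


# Hypothetical estimates: group 1 has four postperiods, group 2 has two.
estimates = {1: [1.0, 2.0, 3.0, 4.0], 2: [2.0, 2.0]}
pooled = pooled_estimate(estimates)
simple_average = sum(sum(v) for v in estimates.values()) / 6
```

The timing group with the longer postperiod contributes more to the pooled estimate, as described above.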
Finally, we note that our event history approach differs from the approach of Goodman-Bacon (2018) who examines impact estimation for a variant of model (7) where the three-way interaction terms are replaced by , where is an indicator that equals 1 if cluster j in the treatment group was treated at or before time t and 0 otherwise and where a common comparison group is used across timing groups. Under this specification, is the average parameter over the postperiod. With staggered treatment timing, Goodman-Bacon (2018) shows that the OLS estimator, , becomes a weighted average of separate two-by-two DID estimators, where each treatment timing group is compared not only to the common comparison group but also to each other based on treatment timing. More specifically, a later timing group serves as a comparison group for the early timing group before its treatment begins and the early group then serves as a comparison group for the later timing group after its treatment begins.
We do not adopt this specification, however, because it recovers the estimand only when treatment effects are homogeneous across timing groups and over time (Callaway & Sant’Anna, 2021; de Chaisemartin & D’Haultfœuille, 2020; Sun & Abraham, 2020). These homogeneity assumptions are unrealistic in practice. Instead, our event history approach allows for both sources of heterogeneity and yields unbiased and more interpretable impact estimates.
Cross-Sectional Analysis: Variance Estimation
In this section, we first obtain the variance formula for the pooled DID estimator, in Equation 10 and then discuss how this formula reduces to the variance formulas for the point-in-time estimators, in Equation 8 and in Equation 9.
To obtain the variance of , it is useful to express in Equation 10 as an average of treatment effects across each timing group, , as follows:
where , recalling that our weighting scheme uses and . Because of the assumed independence of the outcomes across the timing groups, the variance of can then be obtained using the following simple relation:
We can now calculate by first calculating using the variances and covariances across the pre- and postperiod means based on the model error structure in Equation 7.
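The independence relation can be sketched as follows. This is a minimal illustration under the stated assumption that outcomes are independent across timing groups, so that the variance of a weighted average is the sum of squared weights times the timing group variances; the function name and input values are hypothetical.

```python
def pooled_variance(post_lengths, group_variances):
    """Variance of a weighted average of independent timing group effects.

    Weights are proportional to the postperiod lengths A_k, following
    the weighting scheme described above.
    """
    total = sum(post_lengths)
    weights = [a_k / total for a_k in post_lengths]
    return sum(w * w * v for w, v in zip(weights, group_variances))


# Hypothetical inputs: two timing groups with A_1 = 4 and A_2 = 2.
variance = pooled_variance([4, 2], [0.01, 0.02])
```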
To fix concepts, we first consider the variance formula for timing group k without the AR(1) structure, which can be expressed as follows:
This variance is intuitive in that it gets smaller as the number of clusters ( and ), the number of individuals per cluster (N), and the lengths of the pre- and postintervention periods (Bk and Ak) increase. Clustering effects arise due to . If , , and and are redefined to be the number of individuals rather than clusters, then Equation 13 reduces to the nonclustered design.
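The qualitative features just described can be reproduced with a short Python sketch. It is derived from first principles rather than transcribed from Equation 13, and it assumes that cluster-time errors and individual-level errors are each independent and identically distributed, that time-invariant cluster effects are absorbed by the model, and that the estimator is a difference of four independent means. All parameter names are illustrative.

```python
def did_variance_no_ar1(m_treat, m_comp, n_per_cell, pre_len, post_len,
                        var_cluster, var_indiv):
    """Variance of a timing group DID estimator without autocorrelation.

    Each cluster-time mean has variance var_cluster + var_indiv / N.
    Averaging over clusters and over pre- or postperiods divides this
    variance by the number of independent cells in each mean.
    """
    cell_variance = var_cluster + var_indiv / n_per_cell
    cluster_term = 1.0 / m_treat + 1.0 / m_comp
    period_term = 1.0 / pre_len + 1.0 / post_len
    return cell_variance * cluster_term * period_term


# Hypothetical design: 20 treatment and 20 comparison clusters,
# 100 individuals per cell, 4 preperiods, 4 postperiods, ICC of 0.05.
base = did_variance_no_ar1(20, 20, 100, 4, 4, 0.05, 0.95)
more_clusters = did_variance_no_ar1(40, 40, 100, 4, 4, 0.05, 0.95)
```

Doubling the number of clusters in both groups halves the variance in this sketch, whereas increasing the number of individuals per cluster shrinks only the individual-level component.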
If we now add the AR(1) structure, the calculations become considerably more complex, because the mean outcomes for a particular cluster become correlated across time periods (both within the pre- and postperiods as well as across them), where the correlations diminish over time. This leads to our first main variance result provided in the following new theorem.
Theorem 1. The variance of the pooled DID estimator, , in Equation 12 obtained from the model in Equation 7 that incorporates clustering, variation in intervention timing, the AR(1) error structure, and general measurement intervals is as follows:
Here, denotes the average autocorrelation in the postperiod, denotes the average autocorrelation in the preperiod, denotes the average autocorrelation between the pre- and postperiods, and other terms are defined above.
In this expression, if (the typical case), we find that the and terms increase variance whereas the term reduces variance. In most cases, the and terms will be larger than the term as cluster observations are, on average, closer in time within a pre- or postperiod than across them. Thus, accounting for the AR(1) structure tends to reduce precision.
However, there are several cases where accounting for the autocorrelated errors improves precision. First, this will occur if we assume constant autocorrelations across time as is sometimes specified in the power literature for shorter panels (Frison & Pocock, 1992; McKenzie, 2012). In this case, , and the formula reduces to . Second, in short panels with and , the formula reduces to , so the term improves precision in this case (a similar result holds if or ). Finally, if is close to 1, the term can offset the sum of the and terms in Equation 14, leading to a reduction in variance. Note that Equation 14 reduces to the AR(1) estimator in Burlig et al. (2019) assuming no staggered treatment timing, , and equal time intervals. The illustrative power analysis presented in Section “An Illustrative Power Analysis” discusses other key features of Equation 14.
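The opposing roles of the within-period and cross-period autocorrelations can be verified numerically. The following Python sketch does not implement the closed-form expression in Theorem 1; instead, it assumes a single cluster series with AR(1) cluster-level errors only and computes the variance of the postperiod mean minus the preperiod mean directly from the error correlation matrix. Function names and parameter values are hypothetical.

```python
def ar1_corr(times, rho):
    """AR(1) correlation matrix based on elapsed time differences."""
    return [[rho ** abs(ti - tj) for tj in times] for ti in times]


def pre_post_contrast_variance(times, n_pre, rho, sigma2=1.0):
    """Variance of (postperiod mean - preperiod mean) for one cluster."""
    n_post = len(times) - n_pre
    contrast = [-1.0 / n_pre] * n_pre + [1.0 / n_post] * n_post
    corr = ar1_corr(times, rho)
    size = len(times)
    return sigma2 * sum(contrast[i] * corr[i][j] * contrast[j]
                        for i in range(size) for j in range(size))


# Longer panel: 4 preperiods and 4 postperiods, evenly spaced.
long_times = list(range(1, 9))
long_no_ar = pre_post_contrast_variance(long_times, 4, rho=0.0)
long_ar = pre_post_contrast_variance(long_times, 4, rho=0.4)

# Short panel: 1 preperiod and 1 postperiod.
short_no_ar = pre_post_contrast_variance([1, 2], 1, rho=0.0)
short_ar = pre_post_contrast_variance([1, 2], 1, rho=0.4)
```

With four pre- and postperiods, the within-period correlations dominate and the variance rises with autocorrelation. With one preperiod and one postperiod, only the cross-period correlation remains and the variance falls, consistent with the short-panel case discussed above.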
Next, we present a Corollary to Theorem 1 that provides new variance formulas for the point-in-time treatment effect estimators, and .
Corollary to Theorem 1. The variance formula for in Equation 9 that pertains to the DID estimator after l periods of treatment exposure is
where is calculated at postperiod . Further, for the calendar-time DID estimator, we can calculate using Equation 15, replacing with and calculating at postperiod q rather than at postperiod .
The result in Equation 15 follows from Theorem 1 by averaging over only those timing groups with l periods of observed exposure, setting , and translating exposure time into calendar time. The expression for can be obtained similarly.
In practice, standard errors for clustered DID designs using Equation 7 are often estimated using OLS with cluster-robust standard errors (CRSEs) using data aggregated to the cluster-time level (Cameron & Miller, 2015; Liang & Zeger, 1986). Based on simulation evidence in Bertrand et al. (2004) using AR(1) errors, CRSEs are typically adjusted for clustering at the unit of treatment assignment only.
Cross-Sectional Analysis: Power Calculations
To facilitate power calculations, it is customary to express variance formulas in terms of ICCs of the error variances. We adopt this strategy by setting and , where is the total error variance. Using this formulation, no longer enters the formula in Equation 1, and the power calculations can be conducted by specifying a value for rather than for the error variances directly.
Power calculations can now be conducted by inserting Equation 14 or 15 into Equation 1 to calculate values for prespecified values of P, N, , Ak, Bk, , , and . Calculating the degrees of freedom () for clustered designs is complex (Donald & Lang, 2007; Hedges, 2007). However, in panel settings, it is customary to use the number of cluster-time observations adjusted by the number of model parameters (Cameron & Miller, 2015). Using Equation 7, this approach yields for the pooled estimator, . In our running example, (where and ). Similarly, we can use for the point-in-time estimator, , where and are sample sizes for the included timing groups, and similarly for .
Alternatively, Equation 1 can be solved to calculate the total number of clusters, M, required to attain a given MDE value. For example, for the pooled DID estimator, we can calculate the required cluster sample size using
where V is the variance in Equation 14, setting and , where the inputs , , and are user-specified. Because is also a function of M, iterative methods to solve nonlinear equations can be used to solve for M (Power_Panel uses the secant method). For nonclustered designs, one can set equal to 0, N to 1, and M to the number of individuals.
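The iterative solution for M can be sketched in Python. This is not the Power_Panel implementation. It assumes that the variance in effect size units is inversely proportional to M for fixed group shares, that the MDE equals a factor (the sum of two t quantiles) times the standard error, and that t quantiles can be approximated by a first-order Cornish-Fisher expansion so that only the standard library is needed. The degrees of freedom function and the input values are purely illustrative.

```python
from statistics import NormalDist


def t_quantile(p, df):
    """Approximate Student-t quantile (first-order Cornish-Fisher)."""
    z = NormalDist().inv_cdf(p)
    return z + (z ** 3 + z) / (4.0 * df)


def mde(m, unit_variance, df_of_m, alpha=0.05, power=0.80):
    """MDE in effect size units when the variance is unit_variance / m."""
    df = df_of_m(m)
    factor = t_quantile(1.0 - alpha / 2.0, df) + t_quantile(power, df)
    return factor * (unit_variance / m) ** 0.5


def solve_clusters(target_mde, unit_variance, df_of_m,
                   m0=20.0, m1=200.0, m_min=4.0, tol=1e-10, max_iter=100):
    """Secant iteration on f(M) = MDE(M) - target_mde."""
    f0 = mde(m0, unit_variance, df_of_m) - target_mde
    f1 = mde(m1, unit_variance, df_of_m) - target_mde
    for _ in range(max_iter):
        if abs(f1) < tol or f1 == f0:
            break
        # Secant update, kept above m_min so that df stays positive.
        m2 = max(m_min, m1 - f1 * (m1 - m0) / (f1 - f0))
        m0, f0 = m1, f1
        m1, f1 = m2, mde(m2, unit_variance, df_of_m) - target_mde
    return m1


def illustrative_df(m):
    """Hypothetical df rule: cluster-time cells minus model parameters."""
    return 8 * m - m - 10


m_star = solve_clusters(0.20, 0.5, illustrative_df)
```

In practice, the solution would be rounded up to the next integer, and an exact t quantile function (for example, `scipy.stats.t.ppf`) could replace the approximation.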
Longitudinal Analysis
Under the longitudinal AR(1) design, the same individuals within the study clusters are followed over time. The analysis for this design extends the cross-sectional analysis, where the individual-level errors, , are now assumed to follow an AR(1) structure with autocorrelation parameter, . Thus, under the longitudinal AR(1) design, we allow both cluster-level outcomes and individual-level outcomes within clusters to be correlated over time.
The resulting variance formulas are discussed in Online Appendix A.1. They parallel the structure for the cross-sectional estimators, except they now incorporate the added autocorrelation structure for the individual-level errors. For instance, with constant autocorrelations, we find that the variance for the longitudinal DID estimator for timing group k is , where the term is now multiplied by to reflect outcome correlations of individuals over time. Note that setting in this expression and redefining M as the number of individuals yields the variance estimator in Frison and Pocock (1992) and McKenzie (2012) for the nonclustered design. The same pattern holds for the AR(1) specification. In general, the longitudinal estimators will have less precision than the cross-sectional estimators, except for special cases that parallel those discussed in Section “Cross-Sectional Analysis: Variance Estimation” above for the cross-sectional analysis.
Incorporating Covariates
In DID analyses, a vector of time-varying covariates, , with associated parameter vector, , is often included in the models to adjust for potential confounding bias. To examine the effects of covariates on precision, to fix concepts, we begin by assuming no confounding bias by invoking a parallel trends assumption for the covariates (which we then relax):
Assumption DID.2. Parallel trends for the covariates. For each timing group , posttreatment time period , and covariate :
where is the number of covariates and the and covariate means are defined analogously to the and outcome means defined in Equations 2 and 6.
Because is assumed constant over time and across timing groups, the implication of this assumption is that mean covariate values during the pre- and posttreatment periods are independent of treatment status for each timing group. The result is that the addition of covariates to the models does not change the DID estimators in expectation.
To see this result more formally, it can be shown that OLS estimators for with and without covariates, and , can be related using , where and are vectors of covariate means, and is the OLS estimator for (see also Schochet et al., 2021). Note that the term in parentheses is a vector containing DID estimators for each covariate. But these DID estimators each have zero expectation under Assumption DID.2. Thus, , even when .
Thus, under Assumption DID.2, precision gains from covariates can be quantified by multiplying the variance terms from the model without covariates by , where is the R2 value from covariate inclusion. We assume the same R2 value to explain and . With covariates, is reduced by the number of covariates, . This approach parallels the power analysis approach for RCTs where baseline covariates are independent of treatment status due to randomization (Raudenbush, 1997; Schochet, 2008).
Assumption DID.2, however, is not likely to hold in practice. Thus, precision gains from covariates will typically be reduced by treatment-covariate collinearity. Thus, the net variance reduction due to the covariates can be approximated using the ratio , where is the average R2 value from regressing the treatment interaction terms in Equation 7 on the covariates. This ratio can be less than 1. We summarize this result in the following theorem.
Theorem 2. If covariates are added to the model in Equation 7, the variance formula for can be approximated using
and similarly for the point-in-time estimators, and , and the longitudinal estimators.
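The covariate adjustment can be sketched as a simple multiplicative factor. The Python function below assumes that the ratio takes the familiar regression form in which the variance is scaled by one minus the outcome R2 and divided by one minus the R2 from regressing the treatment terms on the covariates; this form, the function name, and the input values are assumptions for illustration rather than a transcription of Theorem 2.

```python
def covariate_adjusted_variance(variance, r2_outcome, r2_treatment=0.0):
    """Approximate variance after adding covariates to the model.

    r2_outcome: share of error variance explained by the covariates.
    r2_treatment: R2 from regressing the treatment terms on the
    covariates (zero when covariates are unrelated to treatment).
    """
    if not (0.0 <= r2_outcome < 1.0 and 0.0 <= r2_treatment < 1.0):
        raise ValueError("R2 values must lie in [0, 1).")
    return variance * (1.0 - r2_outcome) / (1.0 - r2_treatment)


# Hypothetical values: covariates explain half of the outcome variance.
no_collinearity = covariate_adjusted_variance(0.01, 0.5)
with_collinearity = covariate_adjusted_variance(0.01, 0.5, r2_treatment=0.2)
```

Under this form, treatment-covariate collinearity offsets part of the precision gain from covariates.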
CITS and ITS Estimators
CITS estimators are based on specifications that model trends before and after the introduction of the treatment. They quantify whether once the treatment begins, the treatment group deviates from its preintervention trend by a greater amount than does the comparison group. The more common ITS estimators pertain to designs without a comparison group, where the treatment group is compared to its own preperiod trend only. CITS and ITS estimators were popularized in the seminal works of Cook and Campbell (1979) and Shadish et al. (2002); recent applications and reviews in various disciplines are provided in Bloom (2003), Somers et al. (2013), Kontopantelis et al. (2015), Linden (2015), Clair et al. (2016), Bernal et al. (2017), and Baicker and Svoronos (2019). We assume at least three (preferably four) pre- and postperiod time points each, so that the trends can be adequately modeled (Cook & Campbell, 1979).
For our CITS and ITS power analysis, we adopt an event history approach for the reasons discussed above for the DID design. We model pre- and postperiod trends using a linear specification, where we allow the intercepts and slopes to differ across the two time periods, which we refer to as the “fully interacted” model. This approach can be extended to allow for segmented regression lines across different postperiods. We also present results for a restricted model that assumes common slopes across the pre- and postperiods. Bernal et al. (2017) and Ferron and Rendina-Gobioff (2014) discuss other CITS estimands not considered here.
We rely on the following fully interacted regression model using data for all timing groups:
where is an indicator that equals 1 if the time period corresponds to the preperiod (i.e., if ) and 0 otherwise, and is an indicator that corresponds to the postperiod, recalling that denotes when the outcome was measured in calendar time relative to a common reference point.
In this model, and are preperiod intercept and slope parameters for treatments in timing group k; and are postperiod intercepts and slopes for treatments in timing group k; and , , , and are corresponding parameters for the matched comparison group. In this formulation, we include the treatment-by-time indicator terms (such as ) rather than the cluster-level fixed effects as we did for the DID design to simplify variance estimation. Using our running example, the model in Equation 19 contains 24 parameters (eight pre- and postperiod intercept and slope parameters for each of the three timing groups, split evenly between the treatment and comparison groups).
Applying OLS to the model in Equation 19, we can estimate the estimand in Equation 2 using
The first part of this estimator, , measures the difference between the predicted values at time q based on the fitted pre- and postperiod trend lines for the treatment group, and the second part, , measures the pre–post difference—the “forecast error”—for the matched comparison group. Note that is the estimated counterfactual outcome at time q for the treatment group in the absence of the intervention; it incorporates both the forecast at time q from the fitted treatment group preperiod trend line as well as the comparison group forecast error.
Using Equation 20, we see that the CITS estimator differs from the DID estimator in that it measures treatment-comparison differences in estimated trendlines, not just estimated intercepts; thus, the CITS estimator will have a larger variance (which we formalize below). The ITS estimator, , is based on the treatment group trendlines only.
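The logic of the fully interacted CITS and ITS estimators for one timing group can be sketched in Python. The sketch assumes group mean outcomes by time period, fits separate pre- and postperiod lines by simple least squares, and uses hypothetical function names and data; it is meant only to illustrate the comparison of fitted trend lines at time q.

```python
def fit_line(times, values):
    """Least squares intercept and slope for a single trend line."""
    n = len(times)
    t_bar = sum(times) / n
    y_bar = sum(values) / n
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in zip(times, values))
    slope = sxy / sxx
    return y_bar - slope * t_bar, slope


def predicted_gap(times, values, n_pre, q):
    """Postperiod fitted value at q minus the preperiod forecast at q."""
    pre_int, pre_slope = fit_line(times[:n_pre], values[:n_pre])
    post_int, post_slope = fit_line(times[n_pre:], values[n_pre:])
    return (post_int + post_slope * q) - (pre_int + pre_slope * q)


def cits_estimate(times, treat, comp, n_pre, q):
    """Treatment gap net of the comparison group forecast error."""
    return (predicted_gap(times, treat, n_pre, q)
            - predicted_gap(times, comp, n_pre, q))


# Hypothetical group means over 8 periods with treatment after period 4.
times = [1, 2, 3, 4, 5, 6, 7, 8]
treat = [13.0, 14.0, 15.0, 16.0, 20.0, 21.5, 23.0, 24.5]
comp = [11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0, 18.0]
cits = cits_estimate(times, treat, comp, n_pre=4, q=6)
its = predicted_gap(times, treat, n_pre=4, q=6)
```

Because the comparison group stays on its preperiod trend in this example, its forecast error is zero and the CITS and ITS estimates coincide.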
As with the DID approach, we can now average the estimators over k to yield the point-in-time CITS estimators, and , which can then be averaged over their postperiods to yield the pooled estimator, (and similarly for the ITS estimators). These estimators are unbiased for the , , and estimands under Assumptions 1 and 2, assuming the linear specification in Equation 19 is correct, and under the following added assumption:
Assumption CITS.1. Parallel deviations from preperiod trend lines. In the absence of the intervention, deviations from pretreatment trend lines would be equivalent for the treatment and comparison groups for all timing groups and postperiods. This assumption can be expressed in terms of potential outcomes in the untreated condition as
This condition requires that in the absence of treatment, forecast errors from the preperiod trend lines would be the same, on average, for the treatment and comparison groups. The parallel condition for the ITS estimator is perfect mean forecasts for the treatment group: .
To calculate the variance of , we can use Equation 20 to compute the variances and covariances of the terms. To facilitate the calculations, note that the OLS estimates from Equation 19 are identical to those from separate models estimated using four groups of observations for each timing group k: Group 1: treatments in the postperiod (to obtain ), Group 2: treatments in the preperiod (to obtain ), Group 3: comparisons in the postperiod (to obtain ), and Group 4: comparisons in the preperiod (to obtain ).
For illustration, consider OLS estimation using Group 2 data for timing group k (with clusters) aggregated to the cluster-time level for a model with an intercept and time variable. For estimation, we center around its preperiod mean, , which facilitates the calculations because now becomes the preperiod mean outcome, .
Consider first the model without the AR(1) error structure that we use to highlight key features of the more general, complex variance estimator presented later in Theorem 3. For this specification, we find after applying standard OLS methods that the variance of the forecast at time q based on the fitted preperiod trend line for Group 2 is
where is the preperiod variance of the time variable.
In this expression, the terms inside the first parentheses capture the variances of the estimated intercept and slope parameters as well as their covariance. We see that precision decreases for later postperiods than earlier ones (as q increases) because the forecasts become more tentative (as reflected in a larger value for the term). Further, increasing the number of preperiods, Bk, will increase precision both because becomes smaller and becomes larger as the trend lines become more precisely estimated. Similar variance expressions exist for the , , and terms.
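The behavior of the forecast variance can be illustrated with the textbook OLS prediction variance for a simple linear trend. The Python sketch below assumes independent cluster-time errors, data aggregated to the cluster-time level, and a time variable centered at its preperiod mean; its layout may differ from Equation 22, and the names and values are hypothetical.

```python
def forecast_variance(pre_times, q, m_clusters, cell_variance):
    """Variance of the forecast at time q from a fitted preperiod line.

    cell_variance is the variance of a cluster-time mean outcome.
    The first term reflects the estimated intercept (the preperiod
    mean) and the second reflects the estimated slope.
    """
    n_pre = len(pre_times)
    t_bar = sum(pre_times) / n_pre
    time_var = sum((t - t_bar) ** 2 for t in pre_times) / n_pre
    scale = cell_variance / (m_clusters * n_pre)
    return scale * (1.0 + (q - t_bar) ** 2 / time_var)


# Hypothetical design: 10 clusters, 4 evenly spaced preperiods.
early = forecast_variance([1, 2, 3, 4], q=5, m_clusters=10, cell_variance=0.0595)
late = forecast_variance([1, 2, 3, 4], q=8, m_clusters=10, cell_variance=0.0595)
longer_pre = forecast_variance([1, 2, 3, 4, 5, 6], q=8, m_clusters=10,
                               cell_variance=0.0595)
```

Forecasts further from the preperiod are less precise, and adding preperiods improves precision for a given forecast date.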
Because the four groups are independent in the model without the AR(1) error structure, we can sum their variances to obtain the following variance expression for for the model without autocorrelated errors:
where is the postperiod variance of the time variable, and other terms are defined above. The parallel variance expression for the ITS estimator omits the term in Equation 23.
Comparing Equation 23 to the variance in Equation 13 for the corresponding DID estimator, we see that the CITS variance is larger than the DID variance because of the addition of the two time-related terms that reflect the estimation error in the fitted trend lines. Thus, the CITS estimator will require larger sample sizes than the DID estimator to yield impacts with the same precision level. The DID power gains come at the expense of stronger conditions for parameter identification.
Note that the regression R2 value from the model in Equation 19—that measures the fit of the trend lines—enters the variance formula in Equation 22 through its effects on the error variances, and . Thus, a stronger linear time-outcome relationship leads to smaller values of and . But these R2 values do not enter the power calculations—as measured in effect size units—because they also enter the error variances in the denominator of Equation 1 that are used to construct the values and thus cancel. Similar issues apply to the DID estimator.
As with the DID estimator, including the AR(1) structure considerably complicates the CITS analysis due to the emergence of correlations between the fitted pre- and postperiod trend lines. To calculate the variance formulas in this setting, consider, as before, estimating using Group 2 data aggregated to the cluster-time level with observations. Switching to matrix notation to simplify notation, let be the matrix of independent variables from the regression model that includes the constant term and centered time variables, with parameter vector, . With the AR(1) error structure, the variance of the OLS estimator, , becomes , where is an block-diagonal error variance–covariance matrix, containing the block submatrices, , for the Bk observations in the same cluster, with entries along the diagonal and along the off-diagonal cells.
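The matrix calculation can be sketched in pure Python for the two-parameter preperiod trend model. The sketch assumes the usual sandwich form for the OLS variance with a known error covariance matrix, identical and independent clusters, and AR(1) cluster-level errors; the optional diagonal term for the individual-level error and all names are illustrative assumptions.

```python
def ar1_ols_covariance(pre_times, rho, var_cluster, m_clusters,
                       var_indiv_mean=0.0):
    """Covariance of the OLS intercept and slope under AR(1) errors.

    Computes (X'X)^-1 X' Omega X (X'X)^-1 for one cluster and divides
    by the number of clusters. Time is centered, so X'X is diagonal.
    """
    n_pre = len(pre_times)
    t_bar = sum(pre_times) / n_pre
    x = [(1.0, t - t_bar) for t in pre_times]
    omega = [[var_cluster * rho ** abs(ti - tj)
              + (var_indiv_mean if i == j else 0.0)
              for j, tj in enumerate(pre_times)]
             for i, ti in enumerate(pre_times)]
    xtx_inv = (1.0 / n_pre, 1.0 / sum(row[1] ** 2 for row in x))
    meat = [[sum(x[i][a] * omega[i][j] * x[j][b]
                 for i in range(n_pre) for j in range(n_pre))
             for b in range(2)] for a in range(2)]
    return [[xtx_inv[a] * meat[a][b] * xtx_inv[b] / m_clusters
             for b in range(2)] for a in range(2)]


# Hypothetical inputs: 4 evenly spaced preperiods and 10 clusters.
cov_ar = ar1_ols_covariance([1, 2, 3, 4], rho=0.4, var_cluster=0.05,
                            m_clusters=10)
cov_iid = ar1_ols_covariance([1, 2, 3, 4], rho=0.0, var_cluster=0.05,
                             m_clusters=10)
```

With evenly spaced time periods, the intercept-slope covariance in this sketch is zero, and positive autocorrelation inflates the variance of the estimated intercept relative to the independent-errors case.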
Using this framework, we can then compute , and similarly for the variances of , , and . If we then calculate the covariances across these estimators and average across postperiods and timing groups, we obtain a complex, but closed-form expression for the variance of the pooled CITS estimator, , that we present in the following new theorem.
Theorem 3. The variance of the pooled CITS estimator, , obtained from the model in Equation 19 that incorporates clustering, variation in intervention timing, the AR(1) error structure, and general measurement intervals is as follows:
where
and other terms are defined above. This variance formula can be used for the power calculations with . The ITS variance formula omits the term in Equation 24.
The expressions in Equation 24 have a ready interpretation. is the average variance of the forecasts from the fitted preperiod regression lines for both the treatment and comparison groups, where averaging is taken over all postperiods. captures the covariances between the estimated intercepts and slopes from the fitted preperiod regression lines (which are zero for evenly spaced time periods or for constant autocorrelations). pertains to the covariances between the estimated intercepts from the fitted preperiod trend lines and the slopes from the fitted postperiod trend lines. In these expressions, the terms arise due to the matrices from above and pertain to various time-related variances weighted by the autocorrelations. Finally, is similar to but pertains to the individual-level errors which are uncorrelated over time, so it does not include the terms.
It is interesting that for the pooled estimator, the predicted values from the postperiod trend lines averaged over all postperiods are simply the mean postperiod outcomes for the treatments and comparisons, and . Thus, Equation 24 does not account for the variances of the postperiod slopes. For the same reason, the variance in Equation 24 also applies to a variant of the model in Equation 19 where the postperiod is modeled using discrete postperiod indicators rather than using a linear trend line (see, e.g., Bloom, 1999; Clair et al., 2016; Somers et al., 2013, which use this specification).5
The Corollary to Theorem 3 in Online Appendix A.2 shows the associated variances for the point-in-time CITS estimators, and . It is surprising that these variances are more complex than for the pooled CITS estimator, as they must also account for the variances of the postperiod slopes. In these variances, if we set and assume one timing group (), then at calendar time q, we have that , which is the expression in Bloom (1999). For the calculations, we can use for the estimator, where and are sample sizes for the included timing groups, and similarly for .6
Next, we consider a popular restricted variant of the model in Equation 19 that assumes common slope parameters for the pre- and postperiod trendlines, where we omit the and terms. For this “common-slopes” specification, the OLS estimators of interest are , which are treatment-comparison differences between the estimated pre- and postperiod intercepts. For this specification, the pooled and point-in-time treatment effects are the same because this approach assumes a constant treatment effect over time (for each k).
To obtain variance expressions for , we can use the same methods as for the fully interacted estimators above. Without the AR(1) structure, we find that
where is the variance of the time variable computed over the full observation period, and other terms are defined above. We see that this variance gets smaller as Bk or Ak increases and is symmetric if Bk and Ak are switched (if time measurements are evenly spaced). Further, comparing Equation 25 to Equation 13, we see that the variance of the common-slopes estimator is larger than for the DID estimator, because the third bracketed term in Equation 25 is greater than 1. In addition, our illustrative power analysis in Section “An Illustrative Power Analysis” shows that the variance of the common-slopes estimator is considerably smaller than for the fully interacted pooled CITS estimator, which has weaker parameter identification assumptions.
Theorem 4 in Online Appendix A.3 presents the more complex variance formula for the common-slopes estimator that includes the AR(1) structure. For power calculations using this estimator, we can use degrees of freedom defined analogously to those above. The choice between the common-slopes and fully interacted specifications is an empirical issue and can be tested by examining the statistical significance of the differences in the estimated pre- and postperiod trend lines.
Issues regarding the inclusion of model covariates are similar for the CITS and ITS estimators as for the DID estimators. For instance, invoking the parallel deviations assumption in Equation 21 for each covariate, the variance terms in the model without covariates can be multiplied by . This holds because the CITS estimator with and without model covariates, and , can be related using , where are parameter estimates for the covariates and is a vector of CITS estimators for the covariates when each covariate is sequentially treated as the dependent variable in the model in Equation 19. Under the parallel deviations assumption for each covariate, has zero expectation, so . However, in practice, it is more realistic to multiply the variance terms by the factor to adjust for treatment-covariate collinearity.
Finally, for the AR(1) longitudinal design, the off-diagonal block entries of become , which yields a parallel structure for the cluster- and individual-level variance components, leading to parallel variance formulas (see Online Appendix A.1).
An Illustrative Power Analysis
To highlight key features of the variance formulas for the considered DID, CITS, and ITS estimators, this section presents an illustrative power analysis. Due to the large number of parameters that enter the analysis, our goal is not to present a compendium of power results across many combinations of parameter values. Rather, we broadly address two power-related questions that align with the focus of our theoretical analysis: (1) How does variation in treatment timing and the AR(1) error structure affect precision? and (2) To what extent does precision differ for the DID, CITS, and ITS estimators? We conduct our analysis using plausible parameter values, with a focus on sample size requirements to attain a given MDE value for a two-tailed significance test (α = .05) at 80% power using the power formula in Equation 16. The calculations were conducted using the Power_Panel dashboard available in the Online Supplementary Materials, which readers can use to conduct power analyses for their specific designs.
Our first main finding is that both variation in treatment timing and the AR(1) error structure increase required cluster sample sizes. These increases become more pronounced if there are a small number of preperiods for any timing group (which occurs in our setting when S1 and S2 diverge) and as the AR(1) autocorrelation coefficient increases (but only up to a point).
Figure 1 illustrates these findings using the pooled DID estimator. The figure shows design effects, defined as the ratio of sample size requirements for a design with staggered treatment timing and/or an AR(1) error structure relative to a reference design without these design features (or that ignores them, which yields bias). The calculations assume P = 8 time periods (as is the case in our running example), a cross-sectional design, equal time spacing of measurements, a 50–50 treatment-comparison split, N = 100 individuals per cluster-time cell, no model covariates, and an ICCθ of 0.05, a common value found in education research (Hedges & Hedberg, 2007; Schochet, 2008). For the design with staggered treatment timing, we assume K = 2 timing groups of equal size with various start times (S1 and S2). For the reference design with K = 1, we set the treatment start date as the average of S1 and S2, rounded to the nearest integer. The design effects were then calculated by varying the AR(1) autocorrelation parameter, S1, and S2. Note that the results hold for any MDE value, because this value cancels when calculating the design effects using Equation 16.
Figure 1. Design effects for the pooled difference-in-differences estimator with staggered treatment timing and/or AR(1) errors relative to the reference design without these features. (A) Design with staggered treatment timing and AR(1) errors relative to the reference design. (B) Design with AR(1) errors only (no staggering) relative to the reference design. Note. Design effects are sample inflation factors for the total number of clusters required to attain a given minimal detectable effect size relative to the reference design (for a two-tailed significance test (α = .05) at 80% power), and pertain to pooled effects during the posttreatment period. Calculations assume P = 8 time periods, a cross-sectional design, equal time spacing of measurements, a 50–50 treatment-comparison split, N = 100 individuals per cluster-time cell, no model covariates, and an ICCθ of 0.05. For the design with staggered treatment timing, the calculations assume K = 2 timing groups of equal size with various start times (S1 and S2), whereas for the reference design with K = 1, the treatment start date (S) is the average of S1 and S2, rounded to the nearest integer.
As shown in Figure 1, the presence of both staggered treatment timing and the AR(1) error structure yields design effects for the pooled DID estimator that range from 1.13 to 2.32 across the considered designs, with a mean of 2.05, or a doubling of required cluster samples (Panel 1A). Staggered treatment timing matters as can be seen by examining the design effects when , which become larger as S1 and S2 become more spread out (i.e., as timing group 1 has fewer preperiods). The AR(1) structure also matters as evidenced by increases in the design effects as the autocorrelation coefficient increases, until it gets large or if the pre- or postperiods are short, as predicted by the theory from Section “DID Estimator” (Panel 1B). Similar results hold for the CITS and ITS estimators (not shown).
Our second main finding is that the CITS estimators require larger cluster samples than the DID estimator to attain the same level of precision, especially for the fully interacted CITS estimator. This occurs due to the estimation error in the fitted CITS (and ITS) trend lines. The effects are largest for the fully interacted CITS estimator, which estimates four trend lines compared to only two for the common-slopes CITS estimator and half as many for the ITS estimators (which exclude the comparison group).
These results are illustrated in Table 3, which shows cluster sample sizes required to achieve an MDE of 0.20, a common target used for clustered education RCTs (see, e.g., Schochet, 2008). We present results for the pooled estimators averaged across all postperiods as well as for the point-in-time estimators measured one, three, and five periods after treatment exposure. We apply many of the same assumptions as for Figure 1, except we now consider a broader range of time periods and treatment start times (S1 and S2) to allow for more pre- and postperiods for the CITS and ITS estimators (recall that the fully interacted estimators require at least three preperiods and three postperiods each) and we fix the AR(1) autocorrelation parameter at 0.4 (based on our NAEP analysis presented in Section “Cross-Sectional Analysis: Framework”). We also present selected results for the pooled estimators for the model with constant autocorrelations and for the longitudinal AR(1) design (with ).
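An MDE target can be translated into a precision target, since the MDE is conventionally a multiplier times the standard error of the impact estimator in effect size units. The sketch below uses a normal approximation for that multiplier; this is an assumption made for illustration, because the article's Equation 16 is not reproduced in this section and may rely on t-distribution critical values that depend on the degrees of freedom. The function names are hypothetical.

```python
from statistics import NormalDist


def mde_multiplier(alpha: float = 0.05, power: float = 0.80,
                   two_tailed: bool = True) -> float:
    """Normal-approximation factor linking the standard error to the MDE."""
    std_normal = NormalDist()
    tail = alpha / 2 if two_tailed else alpha
    # Critical value of the test plus the quantile that delivers the target power.
    return std_normal.inv_cdf(1 - tail) + std_normal.inv_cdf(power)


def max_standard_error(target_mde: float, alpha: float = 0.05,
                       power: float = 0.80, two_tailed: bool = True) -> float:
    """Largest standard error (effect size units) compatible with the target MDE."""
    return target_mde / mde_multiplier(alpha, power, two_tailed)


print(round(mde_multiplier(), 2))
print(round(max_standard_error(0.20), 4))
```

Under these assumptions the multiplier is about 2.80, so an MDE of 0.20 corresponds to a standard error of roughly 0.071 standard deviations.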
Table 3. Cluster Sample Sizes Needed to Detect an MDE of 0.20 Standard Deviations
Columns: Number of Time Periods (P) | Treatment Start Times (S1, S2) | DID Estimatorᵃ | Fully Interacted: CITS Estimatorᵃ, ITS Estimatorᵃ | Common Slopes: CITS Estimatorᵃ, ITS Estimatorᵃ
Panel: Pooled estimator (cross-sectional, AR(1) errors unless indicated otherwise)
Note. The figures show cluster sample sizes required to attain an MDE value of 0.20 for a two-tailed significance test (α = .05) at 80% power using the power formula in Equation 16. Calculations assume a cross-sectional design unless otherwise noted, equal time spacing of measurements, a 50–50 treatment-comparison split, individuals per cluster-time cell, no model covariates, an of , and timing groups of equal size. NA = not applicable because there are not enough preperiod observations for estimation; MDE = minimal detectable impacts in effect size units; DID = difference-in-differences; CITS = comparative interrupted time series.
a Figures pertain to the total number of treatment and comparison group clusters for the DID and CITS estimators, but only to the number of treatment group clusters for the ITS estimators.
Consider first the top panel of Table 3 showing results for the pooled estimators. We see that for all designs, the fully interacted CITS pooled estimator (which is identical to the discrete CITS pooled estimator) requires larger cluster samples than do the other estimators. For instance, when , , and , the fully interacted CITS estimator requires total clusters, compared to for the DID estimator and for the common-slopes CITS estimator (the ITS estimators require half as many treatment group clusters as their associated CITS estimators). Further, if P increases to 12, then with and , the required sample size balloons to for the fully interacted CITS estimator, but decreases to for the DID estimator and to for the common-slopes CITS estimator.
Sample sizes for the fully interacted pooled estimators increase with P because their point-in-time forecast errors increase considerably over time (see below). Further, for a given P, these estimators tend to lose precision with fewer preperiods and more postperiods. In contrast, the pooled DID and common-slopes estimators do not exhibit these patterns; rather, their precision tends to increase as P increases and is less sensitive to specific values of Bk and Ak (holding P fixed). Nonetheless, the pooled common-slopes estimator requires two to three times larger cluster samples than the pooled DID estimator (Table 3). Note that precision for the DID estimator noticeably improves if the minimum Bk increases from 1 to 3.
The top panel of Table 3 also shows that, consistent with theory, required sample sizes decrease for the model with constant autocorrelations, as this specification actually improves power. These reductions are largest for the DID pooled estimator, where required samples are less than half of those needed for the model with AR(1) errors. We also find that sample sizes are only slightly larger for the pooled longitudinal AR(1) estimator than for the pooled cross-sectional one.
The bottom three panels of Table 3 provide additional perspective on the pooled findings using the point-in-time estimators. First, we see that if for all k, power levels for the fully interacted and common-slopes CITS estimators are comparable for one-period forecasts, although they both still have less power than the DID estimator. In this setting, the ITS estimators require fewer treatment group clusters than the DID estimator. However, precision levels for the fully interacted estimators rapidly decay for the three- and five-period forecasts (as is also the case for the discrete estimators [not shown]). For example, when , , and , the required sample size for the fully interacted CITS estimator is for the one-period forecast, compared to for the three-period forecast and for the five-period forecast. In contrast, while the DID estimator loses some precision over time (because the term in Equation 15 decreases), these effects are small, apart from some five-period forecasts that have low power because they are calculated excluding timing group 2 for whom Period 5 is not an observed postperiod (e.g., when , , and ). Similarly, the common-slopes estimator maintains its precision levels over time as it assumes constant treatment effects (apart from the five-period forecasts that exclude timing group 2).
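The rapid loss of precision for multi-period forecasts from a fitted trend line can be seen in a much simpler setting than the one analyzed here. The sketch below uses the textbook variance of an extrapolated ordinary least squares trend line for a single series with independent, homoskedastic errors. It is an assumption-laden illustration, not the article's variance formulas, which also account for AR(1) errors, clustering, the comparison group, and timing groups. The function name is hypothetical.

```python
def trend_forecast_variance_factor(n_pre: int, horizon: int) -> float:
    """Variance of an extrapolated OLS trend line, in units of the error variance.

    A linear trend is fit to periods 1..n_pre and extrapolated to period
    n_pre + horizon. Assumes independent, homoskedastic errors.
    """
    if n_pre < 2:
        raise ValueError("at least two preperiods are needed to fit a trend")
    if horizon < 1:
        raise ValueError("horizon must be at least one period")
    times = range(1, n_pre + 1)
    t_bar = sum(times) / n_pre
    sxx = sum((t - t_bar) ** 2 for t in times)
    target = n_pre + horizon
    # Intercept uncertainty plus slope uncertainty scaled by distance from the mean time.
    return 1.0 / n_pre + (target - t_bar) ** 2 / sxx


# Four preperiods, forecasts one, three, and five periods after treatment starts.
for h in (1, 3, 5):
    print(h, trend_forecast_variance_factor(4, h))
```

With four preperiods the factor grows from 1.5 at one period to 4.3 at three periods and 8.7 at five periods, because the slope error is multiplied by the squared distance from the center of the preperiod data.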
The same pattern of results holds if we vary other power parameters, although the required M values can change. For instance, if the targeted MDE value is halved from 0.20 to 0.10, the required M increases fourfold relative to those shown in Table 3 for all estimators (because M is inversely proportional to the square of the MDE, as shown in Equation 16). Similarly, the required M is roughly proportional to the value (for moderate N) and inversely proportional to for models with covariates (see Theorem 2). Further, precision tends to decrease with more timing groups. The results are less sensitive to the number of individuals per cluster-time cell, N, as precision for clustered designs is typically driven by M, unless N is small. For instance, when , , and , the required M for the pooled common-slopes CITS estimator is for , compared to for and for . The results are also relatively insensitive to the and allocations as long as they range from 0.3 to 0.7.
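The fourfold increase from halving the target follows from the required number of clusters scaling with the inverse square of the MDE. A minimal sketch of that rescaling is given below; the function name and the starting value of 50 clusters are hypothetical, and the calculation ignores any change in critical values that comes from the degrees of freedom changing with M.

```python
import math


def rescale_clusters(m_reference: int, mde_reference: float, mde_new: float) -> int:
    """Rescale a required cluster count to a new MDE target.

    Uses M proportional to 1 / MDE**2, holding all other design parameters fixed.
    The product is rounded to six decimals before the ceiling to absorb
    floating-point noise.
    """
    if m_reference <= 0 or mde_reference <= 0 or mde_new <= 0:
        raise ValueError("all inputs must be positive")
    scaled = m_reference * (mde_reference / mde_new) ** 2
    return math.ceil(round(scaled, 6))


# Hypothetical design that needs 50 clusters to detect an MDE of 0.20.
print(rescale_clusters(50, 0.20, 0.10))  # tighter target
print(rescale_clusters(50, 0.20, 0.25))  # looser target
```

The same rule works in the other direction: relaxing the target from 0.20 to 0.25 reduces the requirement to 64% of the original count.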
Conclusions
This article developed new closed-form variance expressions for power analyses for commonly used DID, CITS, and ITS regression estimators. The main contribution was to incorporate variation in treatment timing into the variance formulas, but the formulas also account for several other key design features that arise in practice: autocorrelated errors, unequal measurement intervals, and clustering due to the unit of treatment assignment. Using an event history approach, we considered power formulas for both cross-sectional and longitudinal data structures and allowed for the inclusion of model covariates that can improve precision. Further, we considered both point-in-time estimators and those pooled over the entire postperiod. These variance formulas can be used to calculate MDE values for a given number of time periods and study clusters or, conversely, to calculate required sample sizes to attain a given MDE. The free Panel_Power dashboard, available for download in the Online Supplementary Materials along with a PDF documentation file, can be used to perform the sample size calculations.
Our theory and illustrative power analysis both demonstrated that variation in treatment timing and the AR(1) error structure increase required cluster sample sizes for the considered panel estimators (except for situations where the treatment occurs very early or very late in the panel). Thus, it is important that these design features (if pertinent) be considered when assessing appropriate sample sizes in designing panel studies.
Further, our analysis showed that the CITS estimators require larger cluster samples than the DID estimator to attain the same level of precision, especially for the fully interacted and discrete CITS estimators. The reason is simple: The CITS estimator must account for the estimation error in the fitted trend lines, not just the fitted intercepts. These effects are most pronounced for the fully interacted CITS estimator (which estimates four trend lines) but are also present for the common-slopes CITS estimator (which estimates two trend lines) and the corresponding ITS estimators. Power for the fully interacted and discrete point-in-time estimators deteriorates rapidly over time, yielding large sample size requirements for the pooled estimators averaged over multiple postperiods. While these time effects do not affect the common-slopes CITS estimator, it still requires about two to three times larger cluster samples than the pooled DID estimator across the considered designs. Power losses for the CITS and ITS estimators would be even greater assuming more complex specifications of the time variable in the model (e.g., quadratic relationships).
The results suggest a trade-off between estimator precision and the strength of the identification assumptions for obtaining unbiased ATT estimators. The DID estimator requires the smallest sample sizes, which may be attainable in practice, especially if there are at least three preperiod observations, but it imposes the strongest identification conditions. In contrast, the fully interacted CITS estimator imposes weaker identification conditions but requires large samples (except for the point-in-time estimators soon after treatment exposure if there are at least five preperiod observations). The common-slopes CITS estimator lies somewhere in between. The ITS estimators have more power than the CITS estimators as they avoid the variance contribution from the comparison group but typically have less power than the DID estimator. Thus, while the fully interacted CITS estimator may best control potential selection biases, it may be less feasible for many studies from a power (and mean squared error) perspective. In these cases, the DID, ITS, and common-slopes estimators may be better alternatives (if the data support these specifications).
As quantified in this article, the inclusion of model covariates can reduce required samples, although these precision gains are tempered if the time-varying covariates are associated with treatment, which is likely to occur in practice. Thus, obtaining predictive covariates—such as prior measures of the primary study outcomes for cross-sectional cohorts—may be critical for achieving target precision levels for panel studies, especially for the CITS/ITS estimators.
Of course, sample size viability will depend on the specific study context, including the unit of treatment assignment (which often defines the study clusters) as well as data availability. We recognize that sample sizes for panel studies may be limited by the number of time periods of available data (such as from administrative records or national surveys). However, there may be some choice in the number of units selected for the study. Further, even if there is little flexibility in study sample sizes, the calculation of statistical power is still important to assess the ability of the study to detect impacts of a realistic size and to help researchers and funders prioritize research questions that can be addressed with sufficient statistical power at reasonable cost.
Supplemental Material
Supplemental Material, sj-pdf-1-jeb-10.3102_10769986211070625 for Statistical Power for Estimating Treatment Effects Using Difference-in-Differences and Comparative Interrupted Time Series Estimators With Variation in Treatment Timing by Peter Z. Schochet in Journal of Educational and Behavioral Statistics
Supplemental Material, sj-pdf-2-jeb-10.3102_10769986211070625 for Statistical Power for Estimating Treatment Effects Using Difference-in-Differences and Comparative Interrupted Time Series Estimators With Variation in Treatment Timing by Peter Z. Schochet in Journal of Educational and Behavioral Statistics
Supplemental Material, sj-r-1-jeb-10.3102_10769986211070625 for Statistical Power for Estimating Treatment Effects Using Difference-in-Differences and Comparative Interrupted Time Series Estimators With Variation in Treatment Timing by Peter Z. Schochet in Journal of Educational and Behavioral Statistics
Footnotes
Acknowledgment
The author would like to thank the three anonymous reviewers and journal editors for excellent comments and suggestions.
References
Abadie, A. (2005). Semiparametric difference-in-differences estimators. The Review of Economic Studies, 72(1), 1–19.
Abadie, A., Athey, S., Imbens, G., & Wooldridge, J. (2017). When should you adjust standard errors for clustering? arXiv:1710.02926 [Math.ST].
Angrist, J., & Pischke, J. S. (2009). Mostly harmless econometrics: An empiricist’s companion. Princeton University Press.
Ashenfelter, O. (1978). Estimating the effect of training programs on earnings. The Review of Economics and Statistics, 60(1), 47–57.
Ashenfelter, O., & Card, D. (1985). Using the longitudinal structure of earnings to estimate the effects of training programs. Review of Economics and Statistics, 67, 648–660.
Athey, S., & Imbens, G. W. (2018). Design-based analysis in difference-in-difference settings with staggered adoption (Working Paper 24963). National Bureau of Economic Research.
Baicker, K., & Svoronos, T. (2019). Testing the validity of the single interrupted time series design (CID Working Papers 364). Center for International Development at Harvard University.
Baltagi, B. H., & Wu, P. X. (1999). Unequal spaced panel data regression with AR(1) disturbances. Econometric Theory, 15, 814–823.
Bernal, J. L., Cummins, S., & Gasparrini, A. (2017). Interrupted time series regression for the evaluation of public health interventions: A tutorial. International Journal of Epidemiology, 46, 348–355.
Bertrand, M., Duflo, E., & Mullainathan, S. (2004). How much should we trust differences-in-differences estimates? The Quarterly Journal of Economics, 119, 249–275.
Bloom, H. S. (1995). Minimum detectable effects: A simple way to report the statistical power of experimental designs. Evaluation Review, 19(5), 547–556.
Bloom, H. S. (1999). Estimating program impacts on student achievement using “short” interrupted time series (Working Paper). Manpower Demonstration Research Corporation.
Bloom, H. S. (2003). Using ‘short’ interrupted time-series analysis to measure the impacts of whole-school reforms: With applications to a study of accelerated schools. Evaluation Review, 27(3), 3–49.
Borusyak, K., & Jaravel, X. (2017). Revisiting event study designs (Working Paper). Harvard University.
Burlig, F., Preonas, L., & Woerman, M. (2019). Panel data and experimental design (Working Paper 26250). National Bureau of Economic Research.
Callaway, B., & Sant’Anna, P. H. C. (2021). Difference-in-differences with multiple time periods. Journal of Econometrics, 225(2), 200–230.
Cameron, A. C., & Miller, D. L. (2015). A practitioner’s guide to cluster-robust inference. Journal of Human Resources, 50, 317–372.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Houghton Mifflin.
Daw, J. R., & Hatfield, L. A. (2018). Matching and regression-to-the-mean in difference-in-differences analysis. Health Services Research, 53(6), 4138–4156. https://doi.org/10.1111/1475-6773.12993
de Chaisemartin, C., & D’Haultfœuille, X. (2020). Two-way fixed effects estimators with heterogeneous treatment effects. American Economic Review, 110(9), 2964–2996.
Donald, S. G., & Lang, K. L. (2007). Inference with difference-in-differences and other panel data. Review of Economics and Statistics, 89, 221–233.
Donner, A., & Klar, N. (2000). Design and analysis of cluster randomization trials in health research. Arnold.
Ferron, J., & Rendina-Gobioff, G. (2014). Interrupted time series design. Wiley StatsRef: Statistics Reference Online, 1–6. https://doi.org/10.1002/9781118445112.stat06764
Freedman, D. (2008). On regression adjustments to experimental data. Advances in Applied Mathematics, 40, 180–193.
Frison, L., & Pocock, S. J. (1992). Repeated measures in clinical trials: Analysis using mean summary statistics and its implications for design. Statistics in Medicine, 11(13), 1685–1704.
Goodman-Bacon, A. (2018). Difference-in-differences with variation in treatment timing (Working Paper 25018). National Bureau of Economic Research.
Hawley, S., Ali, M. S., Berencsi, K., Judge, A., & Prieto-Alhambra, D. (2019). Sample size and power considerations for ordinary least squares interrupted time series analysis: A simulation study. Clinical Epidemiology, 11, 197–205. https://doi.org/10.2147/CLEP.S176723
Heckman, J., Ichimura, H., & Todd, P. E. (1997). Matching as an econometric evaluation estimator: Evidence from evaluating a job training programme. Review of Economic Studies, 64(4), 605–654.
Hedges, L. V. (2007). Effect sizes in cluster-randomized designs. Journal of Educational and Behavioral Statistics, 32(4), 341–370.
Hedges, L. V., & Hedberg, E. C. (2007). Intraclass correlation values for planning group-randomized trials in education. Educational Evaluation and Policy Analysis, 29, 60–87.
Hill, J., Bloom, H., Black, R., & Lipsey, M. (2008). Empirical benchmarks for interpreting effect sizes in research. Child Development Perspectives, 2(3), 172–177.
Holland, P. W. (1986). Statistics and causal inference. Journal of the American Statistical Association, 81, 945–960.
Imbens, G., & Rubin, D. (2015). Causal inference for statistics, social, and biomedical sciences: An introduction. Cambridge University Press.
Kish, L. (1995). Survey sampling. John Wiley and Sons.
Kontopantelis, E. (2018). ITSPOWER: Stata module for simulation-based power calculations for linear interrupted time series (ITS) designs (Statistical Software Components S458492). Boston College Department of Economics.
Kontopantelis, E., Doran, T., Springate, D. A., Buchan, I., & Reeves, D. (2015). Regression based quasi-experimental approach when randomization is not an option: Interrupted time series analysis. The British Medical Journal, 350, h2750.
Liang, K., & Zeger, S. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73, 13–22.
Linden, A. (2015). Conducting interrupted time-series analysis for single- and multiple-group comparisons. The Stata Journal, 15(2), 480–500.
Lipsey, M. W., Yun, P. K., Hebert, M. A., Steinka-Fry, K., Cole, M. W., Roberts, M., Anthony, K. S., & Busick, M. D. (2012). Translating the statistical representation of the effects of education interventions into more readily interpretable forms (NCSER 2013-3000). U.S. Department of Education, Institute of Education Sciences, National Center for Special Education Research.
Liu, W., Ye, S., Barton, B. A., Fischer, M. A., Lawrence, C., Rahn, E. J., Danila, M. I., Saag, K. G., Harris, P. A., Lemon, S. C., Allison, J. J., & Zhang, B. (2019). Simulation-based power and sample size calculation for designing interrupted time series analyses of count outcomes in evaluation of health policy interventions. Contemporary Clinical Trials Communications, 17, 100474. https://doi.org/10.1016/j.conctc.2019.100474
McKenzie, D. (2012). Beyond baseline and follow-up: The case for more T in experiments. Journal of Development Economics, 99(2), 210–221.
Murray, D. (1998). Design and analysis of group-randomized trials. Oxford University Press.
Raudenbush, S. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2(2), 173–185.
Redding, C., & Nguyen, T. (2020). The relationship between school turnaround and student outcomes: A meta-analysis. Educational Evaluation and Policy Analysis, 42(4), 493–519.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. B. (1986). Which ifs have causal answers? Discussion of Holland’s “Statistics and causal inference.” Journal of the American Statistical Association, 81, 961–962.
Rubin, D. B. (2005). Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association, 100, 322–331.
Schochet, P. Z. (2008). Statistical power for random assignment evaluations of education programs. Journal of Educational and Behavioral Statistics, 33, 62–87.
Schochet, P. Z. (2009). Statistical power for regression discontinuity designs in education evaluations. Journal of Educational and Behavioral Statistics, 34(2), 238–266.
Schochet, P. Z. (2010). Is regression adjustment supported by the Neyman model for causal inference? Journal of Statistical Planning and Inference, 140, 246–259.
Schochet, P. Z. (2013). Estimators for clustered education RCTs using the Neyman model for causal inference. Journal of Educational and Behavioral Statistics, 38, 219–238.
Schochet, P. Z. (2016). Statistical theory for the RCT-YES software: Design-based causal inference for RCTs (NCEE 2015–4011). Institute of Education Sciences, U.S. Department of Education.
Schochet, P. Z. (2020). Analyzing grouped administrative data for RCTs using design-based methods. Journal of Educational and Behavioral Statistics, 45, 32–57.
Schochet, P. Z., Pashley, N. E., Miratrix, L. W., & Kautz, T. (2021). Design-based ratio estimators and central limit theorems for clustered, blocked RCTs. Journal of the American Statistical Association. https://doi.org/10.1080/01621459.2021.1906685
Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Houghton Mifflin College Division.
Somers, M., Zhu, P., Jacob, R., & Bloom, H. (2013). The validity and precision of the comparative interrupted time series design and the difference-in-difference design in educational evaluation (Working Paper). Manpower Demonstration Research Corporation.
St. Clair, T., Hallberg, K., & Cook, T. D. (2016). The validity and precision of the comparative interrupted time-series design: Three within-study comparisons. Journal of Educational and Behavioral Statistics, 41(3), 269–299.
Sun, L., & Abraham, S. (2020). Estimating dynamic treatment effects in event studies with heterogeneous treatment effects. Journal of Econometrics, 225, 175–199. https://doi.org/10.1016/j.jeconom.2020.09.006
Zhang, F., Wagner, A. K., & Ross-Degnan, D. (2011). Simulation-based power calculation for designing interrupted time series analyses of health policy interventions. Journal of Clinical Epidemiology, 64(11), 1252–1261.
Zimmer, R., Henry, G., & Kho, A. (2017). The effects of school turnaround in Tennessee’s achievement school district and innovation zones. Educational Evaluation and Policy Analysis, 39(4), 670–696.