Abstract
We explore the conditions under which short, comparative interrupted time-series (CITS) designs represent valid alternatives to randomized experiments in educational evaluations. To do so, we conduct three within-study comparisons, each of which uses a unique data set to test the validity of the CITS design by comparing its causal estimates to those from a randomized controlled trial (RCT) that shares the same treatment group. The degree of correspondence between RCT and CITS estimates depends on the observed pretest time trend differences and how they are modeled. Where the trend differences are clear and can be easily modeled, no bias results; where the trend differences are more volatile and cannot be easily modeled, the degree of correspondence is more mixed, and the best results come from matching comparison units on both pretest and demographic covariates.
Introduction
When randomized controlled trials (RCTs) are not feasible, and time-series data are available for the outcome of interest, interrupted time-series (ITS) designs represent an alternative approach to estimating treatment effects. In the simplest ITS, researchers compare the pretest and posttest values of a treatment group in order to assess the impact of a treatment. However, this simple design makes it difficult to account for confounding factors, such as historical events or changes in instrumentation that co-occur with treatment (Cook & Campbell, 1979). As a result, it is common to add time-series data from a nonequivalent comparison group over the same period, thus creating a comparative ITS (CITS) design. The simplest CITS analysis entails a difference-in-difference estimate where the difference between the pre- and postintervention means in the comparison group is used as the counterfactual against which the mean difference in the treatment group is evaluated. In more complex CITS analyses, the means and slopes of the pretreatment values are used to assess not only changes in mean levels but also changes in trend, in the variation around these trends, or in the pattern of temporal variability.
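The simplest CITS analysis described above can be sketched in a few lines. This is a minimal illustration rather than the estimator used in the study: the function name and the standardized scores are made up, and the full analysis would also adjust for covariates and clustering.

```python
from statistics import mean


def did_estimate(treat_pre, treat_post, comp_pre, comp_post):
    """Difference-in-difference: treatment-group change minus comparison-group change."""
    treat_change = mean(treat_post) - mean(treat_pre)
    comp_change = mean(comp_post) - mean(comp_pre)  # serves as the counterfactual change
    return treat_change - comp_change


# Hypothetical yearly group means in standard deviation units (illustration only).
effect = did_estimate(
    treat_pre=[-0.30, -0.28, -0.32],
    treat_post=[-0.10],
    comp_pre=[0.00, 0.02, -0.02],
    comp_post=[0.05],
)
print(round(effect, 3))
```

Here the treatment group improves by about .20 and the comparison group by about .05, so the estimate attributes the remaining .15 to treatment, provided the two groups would otherwise have moved in parallel.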
One approach to examining the validity of quasi-experimental designs such as CITS is to conduct a within-study comparison (WSC), also known as a design experiment. WSC studies estimate the extent to which a given type of quasi-experimental study reproduces the results of an RCT when both share the same treatment group. A few WSC studies on ITS have previously appeared, mostly in the medical literature, and as discussed below, they have generally concluded that CITS and ITS designs are able to produce estimates that are concordant with experimental findings. However, these results cannot always be easily generalized to the social sciences, and in this article, we test the internal validity of CITS across three social science data sets. Testing the design across three different data sets increases the external validity of our findings and allows us to better assess the ability of CITS to reliably mimic RCT results. Fortuitously, the three data sets we examine also vary in how the pretest time trends differ between the treatment and the comparison groups, enabling us to explore the impact of various analytic choices across different patterns of pretreatment data.
Schneeweiss, Maclure, Carleton, Glynn, and Avorn (2004) conducted the first WSC on CITS that we are aware of, comparing results from a CITS design to an RCT that determined how restricting insurance reimbursement affects spending on medication. Their CITS results were initially not close to the RCT benchmark, leading the authors to reexamine their experiment, whereupon they determined that a mistake in the experimental protocol had led to low rates of compliance in the control group. The corrected RCT results were more comparable to the CITS results, but this ex post facto analysis raises questions about the RCT benchmark.
More recently, Fretheim et al. (2015) reanalyzed nine RCTs from the medical literature and compared the results to simple ITS estimates with no comparison group. They found that in eight of the nine cases, the estimates had overlapping confidence intervals. The focus was on studies that had time-series data before and after the intervention so that a simple ITS design was possible. However, the length of the time-series data and the focus on simple ITS designs, which suffer from many more internal validity threats than the CITS design, limit the generalizability of these results to other social science settings.
Ferraro and Miranda (2014) conducted a design replication study in environmental policy in which they evaluated the fixed effects estimator by comparing its results to an RCT that tested the effectiveness of conservation messages on water usage. Although not referred to as a CITS study, their analysis involved 13 months of pretreatment data and a treatment that occurred simultaneously for all treated units and so is comparable to the CITS designs discussed here. When they combined the fixed effects estimator with matching, they successfully replicated the RCT benchmark, though their matching results were somewhat sensitive to the choice of matching variables; matching on both pretreatment outcomes and time-invariant observable characteristics worked best, while leaving out one or the other sets of covariates led to a lack of correspondence with the RCT. When they did not preprocess the data by matching, or when they used trimming or incomplete matching, their estimates were biased.
Somers, Zhu, Jacob, and Bloom (2013) is the only WSC we are aware of that uses educational data. Somers et al. evaluated the effect of the federal Reading First program using a sharp regression discontinuity design (RDD), comparing the results to those from a CITS design and also comparing various forms of matching in terms of bias reduction and precision. They found not only that CITS produces valid causal estimates but also that matching resulted in lower standard errors relative to a comparison group of all state schools; their most precise results came from radius matching. However, the benchmark that they relied on was from an RDD, forcing them to show that the local average treatment effect estimate from the RDD was generalizable to schools further away from the cutoff and thus that the two designs were estimating the same causal estimand. Also, most relevant to this study, the pretest trend in test scores was similar for the treatment and comparison schools, leaving unresolved the question of whether the CITS design can produce valid estimates when the difference in pretreatment trends is less easily modeled.
In an earlier study (St.Clair, Cook, & Hallberg, 2014), we found that correspondence with the RCT results requires correctly accounting for pretreatment trend differences in the treatment and comparison groups and that adding more pretest time points can increase bias if the pretreatment trends are incorrectly modeled. In the case in question, they were assumed to be parallel—a frequent assumption in applications of the difference-in-difference design—but were in fact systematically differing from each other across the pretest time period. So a baseline trend model that assumes systematically growing linear differences was more appropriate than a baseline mean model that assumes constant pretest differences. Only when the analysis explicitly accounted for this readily observed pretest group difference was the bias eliminated.
This result led us to the current study, where we use a WSC methodology to examine the CITS design across three data sets, two of which are new and one of which we examined in a more limited fashion in St.Clair, Cook, and Hallberg (2014). Each data set evaluates an educational intervention with respect to its effects on achievement. Each includes treatment and control schools from an RCT, time series pretest data on academic achievement outcomes and other covariates, and a large pool of untreated schools from the same state from which comparison schools could be selected.
We hypothesized that if we appropriately modeled the pretreatment trends in the treatment and comparison groups, we would obtain causal estimates close to the RCT benchmarks. In the first new data set we examined, we discovered that the pretreatment trends were clearly parallel. However, the second data set presented more of a conundrum since the pretreatment trend differences were less clear and not easily modeled as a baseline mean or slope difference.
Since there is especially little clarity in the CITS literature around how to analyze CITS data when pretest trend differences cannot be easily modeled, we considered two alternative approaches: (a) matching on pretest measures of the outcome and (b) matching on both pretest measures and demographic covariates. Our goal was to match treatment and comparison cases at all pretest time points so as to reduce reliance on modeling. While Somers et al. (2013) provided some evidence that matching on pretest measures alone is sufficient, Ferraro and Miranda (2014) found that matching should include other observable characteristics, and so the extent to which demographic covariates may improve matching remains unsettled, particularly when pretreatment time series exhibit considerable volatility.
Our purposes are threefold. First, we examine how closely we can reproduce the RCT results across the three CITS data sets so as to increase the external validity of past findings indicating little bias in CITS studies. Second, we compare the standard errors of the CITS and RCT designs. This is important because standard errors might well differ by analysis mode and hence obscure causal interpretation for those who rely on statistical significance testing to validate causal claims. Finally, we compare and contrast the performance of different analytic approaches: two modeling approaches and two matching approaches. On this last point, our article is unique. The pretest trends differ across our data sets, enabling us to compare analytic strategies for the CITS design under three conditions: (1) when the treatment and comparison trends are parallel, (2) when they are systematically diverging but can be easily modeled, and (3) when one or both time series exhibit volatility and so parametric analysis might not work well. We pay particular attention to this last case, as there is little guidance on it in the CITS literature.
Conceptual Frame
To produce unbiased estimates of causal effects, CITS designs must contend with uncertainties about the pretreatment functional form of the outcome. There are two primary approaches to dealing with this—properly modeling the pretreatment trend or matching the treatment and comparison cases so as to minimize pretreatment group differences and reduce reliance on modeling. This section provides an overview of these and the other analytic strategies we examine.
Bloom (2003) outlines two modeling approaches in the ITS context. The baseline mean model is the simplest. As shown in Model 1, it assumes a fixed difference between outcomes in the treatment and comparison groups, that is, that the groups follow parallel trends. The average pretreatment performance is then projected into the posttreatment period as the best estimate of the counterfactual—what performance would have been in the absence of treatment. The difference in the actual posttreatment performance from mean past performance in the treatment schools, less this same difference in the comparison schools, serves as the estimate of treatment effects and can be formulated as follows:
where Y_it is the outcome for unit i at time t; β0 is a constant term showing average outcomes in comparison units before the intervention; trt_i is an indicator for whether a unit received the intervention of interest; β1 shows the average difference in performance between treatment and comparison schools in the preintervention time period; T_t is a vector of indicators for each postintervention time period t; α is a vector showing the difference in average outcomes between the preintervention time period and each postintervention time period t for comparison units; γ is a vector showing the change in the difference in average performance between treatment schools and comparison schools at each time t after the intervention was implemented; X_it is a vector of time-varying covariates; u_i is a unit-level random error term, with an assumed normal distribution with mean zero and variance φ²; and e_it is an individual-level error term at time t, also assumed to have a normal distribution with mean zero and variance σ². In data with higher level structures, such as grades or classrooms within schools, additional error terms may also be included.
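Read together, the term definitions above imply a specification along the following lines. This is a reconstruction from those definitions rather than a quotation of Model 1, and θ is a hypothetical label for the covariate coefficients, which the text does not name.

```latex
Y_{it} = \beta_0 + \beta_1\,\mathrm{trt}_i + T_t\,\alpha + \mathrm{trt}_i\,T_t\,\gamma
       + X_{it}\,\theta + u_i + e_{it},
\qquad u_i \sim N(0, \varphi^2), \quad e_{it} \sim N(0, \sigma^2)
```

Under this reading, the elements of γ carry the treatment effect estimates, one for each postintervention period.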
While the baseline mean model accounts for fixed differences between the treatment and the comparison cases, this assumption is not always appropriate, as when selection into treatment is due to declining performance over time. Then, different slope values characterize the treatment and comparison schools, and effect estimates predicated on a fixed pretest difference will be biased. The linear baseline trend model seeks to account for pretreatment trend differences by including both a linear term for time (β1 time_t) and an interaction of this term with the treatment indicator (β3 time_t trt_i):
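A specification consistent with the two terms just named, with the remaining terms carried over from the baseline mean model, might look like the following. Only β1 and β3 are named in the text, so the placement of β2 on the treatment indicator and the label θ for the covariate coefficients are assumptions. Note that β1 here denotes the time slope rather than the pretest group difference.

```latex
Y_{it} = \beta_0 + \beta_1\,\mathrm{time}_t + \beta_2\,\mathrm{trt}_i
       + \beta_3\,\mathrm{time}_t\,\mathrm{trt}_i
       + T_t\,\alpha + \mathrm{trt}_i\,T_t\,\gamma + X_{it}\,\theta + u_i + e_{it}
```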
This formulation assumes that all units within the treatment and comparison groups share the same trend, though this trend can differ between the two groups. However, the assumption of within-group homogeneity could be relaxed by modeling the trends as random effects. Further, if investigators are substantively interested in whether the slope of performance changed after the intervention, the basic model could be modified by including a three-way interaction between the treatment indicator (trt_i), a postintervention indicator (T_t), and the linear time trend (time_t). While the baseline trend model is appropriate with differential trends, if the treatment and comparison groups are actually characterized by parallel trends, fitting a baseline trend model could lead to overfitting and a loss of precision.
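To make the contrast between the two models concrete, the sketch below works with yearly group gaps (treatment mean minus comparison mean) instead of fitting the full mixed models. It is a simplification under stated assumptions: the function names and numbers are hypothetical, there are no covariates or random effects, and there is a single posttest year.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def baseline_mean_effect(pre_gap, post_gap):
    """Counterfactual gap = average pretest gap (parallel trends assumed)."""
    return post_gap - mean(pre_gap)


def baseline_trend_effect(pre_gap, post_gap):
    """Counterfactual gap = linear pretest trend in the gap, projected one year ahead."""
    times = list(range(len(pre_gap)))
    intercept, slope = fit_line(times, pre_gap)
    projected = intercept + slope * len(pre_gap)
    return post_gap - projected


# Hypothetical gaps: the treatment group falls .05 SD further behind every year,
# and the posttest gap lies exactly on that trend, so the true effect is zero.
pre_gap = [-0.10, -0.15, -0.20, -0.25, -0.30]
post_gap = -0.35

mean_est = baseline_mean_effect(pre_gap, post_gap)
trend_est = baseline_trend_effect(pre_gap, post_gap)
print(round(mean_est, 3), round(trend_est, 3))
```

With diverging pretest trends, the baseline mean projection reports a spurious negative effect of about .15 standard deviations, while the trend projection recovers an effect of essentially zero. When the pretest gaps are flat, both projections agree, but the trend version spends an extra parameter to get there.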
The foregoing discussion assumes that the functional form of the between-group pretest trend difference is clear and so can be modeled well. When this situation does not hold, matching is another plausible approach. The aim is to select treatment and comparison cases that are similar in pretest outcomes, thus reducing the need for modeling the group differences in functional form. Matching has the potential to reduce bias by equating nonequivalent groups on observables, but it may also reduce precision due to the reduction in the number of comparison cases. Yet matching need not reduce precision, as it may be possible to sample many comparison schools for each treatment one, and the matching process decreases the variability in the outcome measure for the comparison group.
We employ two sets of variables in matching, one using only pretest measures of the outcome and the other supplementing them with demographic covariates. When pretest measures do not show a stable pretreatment trend, perhaps as the result of annual changes in school composition, the added demographic information might improve the quality of the matches.
Matching can be implemented in many ways—for example, exact case matching, propensity score matching, optimal radius matching, Mahalanobis distance matching, or synthetic matching—though the selection of covariates and their reliable measurement is thought to be more important than the choice of matching method (Steiner & Cook, 2013). We follow the recommendation of Somers et al. (2013) who found that radius matching with replacement increased the precision of their estimates more so than other matching methods. Hence, we implement a radius matching strategy, whereby the number of matches for each treatment school is determined by the number of comparison schools that fall within a prespecified distance of the treatment school on pretreatment measures across all time points. To determine the optimal caliper for each outcome measure, we select the distance that minimized the mean-squared error of the estimated effect in the last baseline year prior to treatment. Since the treatment effect in the year prior to treatment is zero, any deviation from zero in the estimated treatment effect represents bias. We also considered synthetic matching (Abadie, Diamond, & Hainmueller, 2010), but it has primarily been used with a single treatment unit (e.g., a state) and offers no obvious advantages when, as in this study, there are over 1,000 comparison units available.
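The radius matching step can be sketched as follows. This is an illustration under assumptions, not the study's implementation: the text does not specify the distance metric, so Euclidean distance over the pretest series is assumed here, and the school identifiers and scores are invented.

```python
import math


def radius_match(treated, pool, caliper):
    """Return, for each treated school, every comparison school within the caliper.

    `treated` and `pool` map school ids to pretest outcome series of equal length.
    Distance is Euclidean across all pretest time points (an assumption).
    Matching is with replacement: one comparison school may match several
    treated schools, and a treated school may end up with no matches.
    """
    matches = {}
    for t_id, t_series in treated.items():
        matches[t_id] = [
            c_id
            for c_id, c_series in pool.items()
            if math.dist(t_series, c_series) <= caliper
        ]
    return matches


# Hypothetical standardized pretest scores for three pretest years.
treated = {"T1": [-0.30, -0.28, -0.32]}
pool = {
    "C1": [-0.29, -0.30, -0.31],  # close at every time point
    "C2": [0.40, 0.42, 0.41],     # far away
    "C3": [-0.35, -0.25, -0.30],  # moderately close
}

tight = radius_match(treated, pool, caliper=0.05)
loose = radius_match(treated, pool, caliper=0.10)
print(tight, loose)
```

A wider caliper admits more comparison schools per treated school, which tends to help precision at the cost of looser matches. Choosing the caliper as described above would wrap this function in a loop over candidate distances and keep the one with the smallest squared placebo effect in the last baseline year.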
Although matching is desirable because it reduces the sensitivity to model specification, we follow the recommendation that matching be followed by parametric adjustment for whatever small differences might remain (Ho, Imai, King, & Stuart, 2007; Rubin, 1979). So we implement matching in combination with both the baseline mean and baseline trend models mentioned above to adjust for any differences in observed covariates after matching treatment and comparison pairs.
All of these analytic approaches aim to account for pretest differences. However, it is important to note that selection and selection maturation are not the only plausible internal validity threats in CITS studies. The treatment and comparison groups can also differ in (a) the local historical events they experience at or soon after treatment, (b) whether the outcome assessment procedure changes at treatment onset, or (c) whether differential statistical regression occurs as a result of the treatment being assigned to one group because its pretreatment behavior deviates from trend immediately before treatment onset. The analytic techniques that we explore cannot account for these three additional internal validity threats. Nonetheless, comparing the RCT and CITS estimates allows us to examine whether they collectively operated to a significant degree, in which case the RCT and CITS results will not correspond whatever the initial pretreatment differences.
Method and Data Sets
Our empirical approach is to examine the validity of the CITS design through the use of WSCs, a method of evaluating nonexperimental methods that involves comparing the effect size from a quasi-experiment to the “benchmark” estimate from an RCT that shares the same treatment group. The goal is to identify the extent to which various design elements or statistical adjustments compensate for the bias due to the quasi-experimental comparison group being formed systematically rather than randomly.
To date, WSCs have been used to test the validity of many approaches to causal inference, including ITS (Fretheim et al., 2015), comparative regression discontinuity (Wing & Cook, 2013), matching (Diaz & Handa, 2006; Michalopoulos, Bloom, & Hill, 2004; Smith & Todd, 2005), and covariate measurement and selection (Cook & Steiner, 2010; Steiner, Cook, & Shadish, 2011). While early WSCs in the job training context raised concerns about the ability of quasi-experiments to yield valid causal estimates (Fraker & Maynard, 1987; Friedlander & Robins, 1995; Glazerman, Levy, & Meyers, 2003; Lalonde, 1986), more recent research has suggested that the careful design and analysis of quasi-experiments can yield effect estimates very close to those from RCTs. In fact, for many types of design, the focus has shifted from asking whether quasi-experiments can replicate RCT benchmarks to seeking to identify the conditions under which they more closely reproduce the benchmark.
WSC studies are not without pitfalls. Cook, Shadish, and Wong (2008) enumerate the standards for conducting them. One major challenge is to establish clear a priori standards for assessing when RCT and adjusted quasi-experimental estimates are considered to be correspondent. To address this concern, we compare our CITS and RCT estimates in a number of ways. First, we use a difference of .20 standard deviation units as an informal test, since this (otherwise arbitrary) criterion is currently used in educational evaluations for determining effects that are substantively important and whether pretest imbalances are small enough to ignore (Cohen, 1988). Second, we compute bootstrapped standard errors for the difference between our CITS and RCT results, enabling us to evaluate the statistical significance of the difference. Third, we present all of our estimates along with their standard errors side by side, so that readers can judge for themselves the size of the differences. None of these tests is perfect, but together they give us a reasonable basis for comparison.
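The first two correspondence checks can be sketched as below. This is a hedged illustration: plain differences in school means stand in for the fully adjusted RCT and CITS models, schools are resampled independently within each group, and all names and data are hypothetical.

```python
import random
from statistics import mean, stdev


def within_threshold(qe_est, rct_est, threshold=0.20):
    """Informal check: do the two estimates differ by less than .20 SD?"""
    return abs(qe_est - rct_est) < threshold


def bootstrap_se_of_difference(treat, control, comparison, n_reps=1000, seed=0):
    """Bootstrap standard error of (quasi-experimental estimate - RCT estimate).

    Schools are resampled with replacement within each group. The same
    resampled treatment group feeds both estimates, mirroring the shared
    treatment group of a within-study comparison.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_reps):
        t = rng.choices(treat, k=len(treat))
        c = rng.choices(control, k=len(control))
        q = rng.choices(comparison, k=len(comparison))
        rct_est = mean(t) - mean(c)  # stand-in for the adjusted RCT model
        qe_est = mean(t) - mean(q)   # stand-in for the CITS model
        diffs.append(qe_est - rct_est)
    return stdev(diffs)


# Hypothetical school-level outcomes in standard deviation units.
treat = [0.1, 0.3, 0.2, 0.4]
control = [0.0, 0.2, 0.1, 0.3]
comparison = [0.05, 0.15, 0.25, 0.10, 0.20]

se = bootstrap_se_of_difference(treat, control, comparison, n_reps=200, seed=1)
print(within_threshold(0.10, 0.05), round(se, 3))
```

Because the treatment group is common to both estimates, its sampling variation cancels in the difference, so the standard error of the difference reflects only the control and comparison groups in this simplified setup.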
A second concern with WSCs is that the RCT estimate serves as the causal benchmark but is subject to sampling error. The temptation is to consider the RCT estimate as a true “point effect” when it is logically more like an effect range. We follow Rubin’s advice and compute RCT estimates using all the available preintervention covariates so as to control for any imbalances in the RCT as implemented (Rubin, 2008).
A final desideratum of WSCs is that analysts of the RCT and of the CITS should be blind to each other’s results so as not to inadvertently bias the analyses. That did not occur here; our analysis consists in part of retroactively applying new analytical approaches to studies that were previously known to us.
The three RCTs we examine are all in education. The first two study the impact of Indiana’s Diagnostic Assessment Intervention on student performance as measured by the state’s annual Indiana Statewide Testing for Educational Progress-Plus. In Year 1 of the study, 59 K–8 schools volunteered to implement the formative assessment system in the 2009–2010 school year. Thirty-five of these schools were randomly assigned to implement the state’s formative assessment system, as part of which teachers administered regular formative assessments to students. From these assessments, teachers received immediate feedback on student performance that could be disaggregated in a variety of ways to inform instruction. The remaining 24 schools were assigned to the control condition. Because statewide testing in Indiana does not include Grades K–2, our sample was limited to 34 treatment and 23 control schools in Year 1.
The second wave of the Indiana study involved a modified version of the same program and outcome. Researchers assigned 36 elementary and middle schools to treatment and 34 to the control status. Again, because statewide testing does not include Grades K–2, our sample was limited to 32 treatment and 31 control schools. As in the first Indiana study, our pool of comparison schools consisted of all other schools in the state, though we excluded all schools from the first wave of the experiment.
The third experiment examines the effectiveness of P-SELL, a full-year science curriculum and professional development program that provides strong supports for English-Language Learners (ELLs) to enhance their understanding of science and also to improve their English-language acquisition and literacy. Sixty-four elementary schools in the Miami region with high proportions of ELL students participated in the experiment, with 32 in the treatment group and 32 in the control. The treatment occurred only in fifth grade classrooms, with student performance measured using the Florida Comprehensive Assessment Test in math, science, and English-Language Arts (ELA).
One of the outcome measures in all three experiments was student performance on state accountability tests. These outcomes enable us to draw on archival data from the same grade cohort in prior years. Thus, the design involves repeated cross sections rather than longitudinal data on the same students. For the first year of the Indiana study, we were able to obtain 5 years of pretest data; for the modified replication in that same state, we obtained 6 prior years; and for the Florida experiment, 7 years of prior achievement data. Hence, all the data sets might be considered to have “short” time series. Nonetheless, in each case, we expected to have enough time points to enable us to draw conclusions about the form of the pretreatment trends in the treatment and comparison time series, especially for concluding whether they were parallel, systematically divergent, or indeterminate.
Since archival student performance measures are typically available only at the aggregate level, our outcome measure is always at the school or grade level rather than at the individual student level. Even in the unlikely case that the RCT had collected multiple waves of student-level pretest data, there would still not be enough time-series data on individual students after Grade 2 to enable the construction of quasi-experimental comparison groups. Hence, we analyze both the CITS and the RCT at the aggregate level, even though the original experimenters collected data on individual student outcomes. Nonetheless, aggregation should not materially affect the bias or precision of our estimates, and it is representative of the data typically available to researchers evaluating school-level reforms (Jacob, Goddard, & Kim, 2014).
For each of the three data sets referenced above, we use two different analytic models to analyze the program impact: (1) a baseline mean model and (2) a baseline trend model. The first model is most appropriate when the pretest trends are parallel; the second when they systematically diverge. Both models also include demographic covariates, including measures of race, socioeconomic status, and native English proficiency, though the specific variables differ across the Indiana and Florida studies. In addition to the two modeling approaches, we apply two different matching strategies: (1) matching on pretest measures of the outcome and (2) matching on pretest measures of the outcome along with a set of demographic covariates. When we implement the matching, we do so in conjunction with the two analytic models. Thus, for each outcome, we present six CITS estimates: two unmatched estimates (baseline mean and baseline trend), two estimates with matching on pretest measures, and two estimates with matching on pretest measures and demographic covariates. Our hypothesis is that matching schools on pretreatment covariates will reduce the sensitivity of our results to model specification and help most where the pretest trends do not show a clear pattern of difference and so cannot be easily modeled.
Findings
Quality of the Benchmark RCTs
Given the modest sample size in each RCT and the fact that RCT results are to be used as presumptively bias-free causal parameters, it is important to establish that the experimental treatment and control groups are balanced over the pretest time period. Appendix Figure A1 shows the unconditional treatment and control group means in all three RCTs. There are no significant mean or trend differences at baseline, with the exception of a slight trend difference in the ELA results in the second Indiana study. To account for any small imbalances, the RCT treatment estimates referenced below come from fully adjusted models that include all pretest and demographic covariates (Rubin, 2008). For further details on the analysis of the RCTs, see St.Clair et al. (2014) and Konstantopoulos, Miller, and van der Ploeg (2013).
The First Indiana Study
Figure 1 shows the math and ELA test scores for the treatment schools from the first Indiana RCT and from the all-state comparison group. The data consist of test scores for Grades 3 through 8, standardized by grade and by year. It is also possible to model the raw data, but changes in scaling procedures and the timing of test administration (visible in Figure 1 immediately prior to treatment) lead us to prefer the standardized scores.
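Standardizing by grade and by year amounts to computing z scores within each grade-by-year cell. The sketch below assumes school-level records and the population standard deviation; the text does not say which standard deviation was used, and the field names and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_cell(records):
    """Add a z score to each record, computed within its (grade, year) cell.

    Assumes every cell has at least two distinct scores; a cell with zero
    variance would cause a division by zero.
    """
    cells = defaultdict(list)
    for r in records:
        cells[(r["grade"], r["year"])].append(r["score"])
    stats = {cell: (mean(v), pstdev(v)) for cell, v in cells.items()}

    standardized = []
    for r in records:
        cell_mean, cell_sd = stats[(r["grade"], r["year"])]
        standardized.append({**r, "z": (r["score"] - cell_mean) / cell_sd})
    return standardized


# Hypothetical scale scores; the Grade 4 scale is shifted upward relative to Grade 3.
records = [
    {"school": "A", "grade": 3, "year": 2009, "score": 400},
    {"school": "B", "grade": 3, "year": 2009, "score": 420},
    {"school": "C", "grade": 3, "year": 2009, "score": 440},
    {"school": "A", "grade": 4, "year": 2009, "score": 500},
    {"school": "B", "grade": 4, "year": 2009, "score": 520},
]
z_scores = [r["z"] for r in standardize_by_cell(records)]
print([round(z, 3) for z in z_scores])
```

Because each cell is rescaled separately, a change in the test's scale or administration date in one year shifts every school in that cell together and drops out of the standardized series.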

Figure 1. First Indiana study treatment schools versus all other schools in the state. (A) English-Language Arts scores. (B) Math scores.
Figure 1 shows how the RCT treatment group performed relative to the rest of the state. In both math and ELA, it performed worse than the state average, though this difference is quite stable over the pretest time period. A test of the pretreatment data confirms that there is no statistically significant difference in the slopes of the two groups, with p values of .438 and .334 for the slope difference in ELA and math, respectively. The parallel trends and the absence of slope difference or year-specific shocks lead us to prefer the baseline mean model.
Table 1 presents the effect estimates from the baseline mean model and the baseline trend models. For each model, there are three sets of estimates: one from the entire sample of comparison schools, one from the comparison group formed through matching on pretest data alone, and one from matching on both pretest and demographic characteristics. Table 1 shows that the baseline mean model clearly outperforms the baseline trend model, an unsurprising result, given that the pretest trends are parallel. In this case, the baseline trend model illustrates the danger of overfitting in small samples—applying functional form assumptions that the data do not warrant. No trend differences are visible in Figure 1, a sure sign that a baseline trend model is likely to be inappropriate.
First Indiana Study—The Difference Between Experimental and Quasi-Experimental Results
Note. Bootstrapped standard errors in parentheses. ELA = English-Language Arts; RCT = randomized controlled trial.
ᵃ The number of schools represents an average since the number of matched comparison schools varies by outcome measure.
*p < .05. **p < .01.
Among the baseline mean estimates, the best estimates come from the comparison group formed through matching on both pretest and demographic data, within .021 standard deviations for ELA and .024 standard deviations for math. However, all of the baseline mean estimates show fairly strong correspondence, as even the worst estimate is within .11 standard deviations. The standard errors are uniformly smaller in the CITS results than in the RCT. Among the baseline trend estimates, with one exception, matching does seem to mitigate the overfitting that occurs in the unmatched sample.
It is difficult to pronounce one model better than the others in the face of such minor differences between estimates. To clarify the differences between modeling approaches, we draw 1,000 bootstrap replicates from our data and compare the percentage of replications in which each model shows the least bias. The bootstrap results confirm that the estimates from the comparison group matched on both pretest and demographic data outperform the unmatched sample but that the difference is negligible; in side-by-side comparisons, the matched estimates outperform the unmatched estimates 55% to 45% in ELA and perform as well as the unmatched estimates in math (50% to 50%).
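The side-by-side comparison described above reduces to counting, across replicates, how often one approach lands closer to the benchmark than the other. The sketch below assumes that each replicate supplies an estimate from both approaches plus a benchmark value; the function name and the numbers are hypothetical, and ties are counted against the first approach.

```python
def share_closer(est_a, est_b, benchmarks):
    """Fraction of replicates in which approach A is strictly closer to the benchmark."""
    wins_a = sum(
        abs(a - rct) < abs(b - rct)
        for a, b, rct in zip(est_a, est_b, benchmarks)
    )
    return wins_a / len(est_a)


# Four hypothetical replicates (a real run would use 1,000).
matched = [0.10, 0.30, 0.12, 0.05]
unmatched = [0.20, 0.11, 0.25, 0.06]
benchmarks = [0.10, 0.10, 0.10, 0.10]

share = share_closer(matched, unmatched, benchmarks)
print(share)
```

A share near one half, as in the 55% to 45% and 50% to 50% splits reported above, indicates that neither approach is reliably closer to the benchmark.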
The Second Indiana Study
Figure 2 presents the descriptive results from the second Indiana study with a new treatment sample and the statewide set of comparison schools. One difference between Figure 2 and Figure 1 is immediately apparent. The ELA results indicate that the pretest performance of the schools in the treatment group is declining relative to the rest of the state. Slight evidence of the same trend difference is also apparent for math, though the pattern is much less clear and the trends are more parallel than for ELA.

Figure 2. Second Indiana study treatment schools versus all other schools in the state. (A) English-Language Arts scores. (B) Math scores.
It seems that selection into the RCT was correlated with declining performance in ELA and possibly in math, clearly violating the parallel trends assumption of the baseline mean model. Specification tests confirm that the difference in group slopes prior to treatment is statistically significant for ELA (with a p value of .009), though for math, the difference is not significant. So our hypothesis was that the baseline trend model with linear trends is the most appropriate model for ELA; for math, the choice between baseline mean and baseline trend is less clear.
Table 2 presents the effect estimates. For ELA, the unmatched baseline trend model clearly performs better than the unmatched baseline mean model in terms of bias reduction, confirming that slope terms were needed. With slope terms in the model, the estimate is within .065 standard deviations of the benchmark. Without slope terms, the estimate is .189 standard deviations away. Once again, the matched estimates offer some protection against incorrect modeling, as both of the matched ELA estimates from the baseline mean model are far superior to the unmatched estimate (.005 and .016 standard deviations vs. .189).
Second Indiana Study—The Difference Between Experimental and Quasi-Experimental Results
Note. Bootstrapped standard errors in parentheses. ELA = English-Language Art; RCT = randomized controlled trial.
a The number of schools represents an average since the number of matched comparison schools varies by outcome measure.
*p < .05. **p < .01.
For math, the difference between the baseline mean and baseline trend models is less stark, reflecting the fact that the pretest slope differences are less clearly linear. Both estimates are close to the benchmark, though the baseline trend model is superior (.006 standard deviations vs. .056 standard deviations). Nevertheless, the bootstrap replicates suggest that there is little difference between the estimates, as the baseline mean model outperforms the baseline trend model in 55% of the replicates in a side-by-side comparison. Matching offers little improvement in the case of the math results, as both unmatched estimates are already within .06 standard deviations of the benchmark, and thus there is little to be gained in terms of bias.
The Florida Study
Figure 3 shows the results for Florida, once again comparing the performance of the experimental treatment schools to all other schools in the state. Now no obvious pattern of pretest trend differences is evident except that the smaller sample of treatment schools exhibits more intertemporal volatility than the larger sample of comparison schools. It is hard to examine Figure 3 and have any confidence about the viability of a baseline mean or baseline trend model. This uncertainty motivates the use of matching strategies. For each outcome, Figure 4 shows the extent of correspondence between the two groups after matching on pretest achievement alone. The groups correspond more closely after matching, but not perfectly so, thus justifying the analytic strategy of following matching with regression using either the baseline mean or baseline trend model.

Florida study treatment schools versus all schools in the state. (A) Science scores. (B) Math scores. (C) Reading scores.

Florida study matching treatment and comparison schools using pretest data. (A) Science scores. (B) Math scores. (C) Reading scores.
We also tested the quality of our matching by conducting specification tests. Since there can be no true treatment effect in the pretreatment data, testing for a treatment effect in the most recent baseline year gives an indication of the extent to which the earlier pretest information successfully predicts more recent pretest information in our matched sample. These placebo tests, conducted just using the pretest information, lead to estimates of a “treatment effect” in the last baseline year with an average bias of .169 standard deviations across the three outcomes. This is substantially larger than in the Indiana studies, where similar specification tests showed an average bias of .067 standard deviations (see Table A1 for more details). Thus, despite the relatively strong correspondence shown in Figure 4, early indications suggested that we might not be able to successfully extrapolate the counterfactual trajectory of the treatment group based on the pretreatment data.
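The logic of such a placebo test can be sketched as follows. This is a simplified illustration using group-by-year means and the linear trend projection; the function names and toy series are hypothetical and do not reproduce the matched, school-level analysis. The last baseline year is held out, a linear trend is fit to the earlier years in each group, and the difference between the two groups' deviations from their projections is the placebo "effect," which should be near zero if earlier pretests predict the latest one well.

```python
def fit_line(xs, ys):
    """Ordinary least squares intercept and slope for a single predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - slope * mean_x, slope


def placebo_effect(years, treat, comp):
    """Pseudo treatment effect in the last baseline year, using only the
    earlier baseline years to project each group's trajectory."""
    held_out_year = years[-1]

    def deviation(series):
        intercept, slope = fit_line(years[:-1], series[:-1])
        return series[-1] - (intercept + slope * held_out_year)

    return deviation(treat) - deviation(comp)


# Hypothetical pretest series (standard deviation units).
years = [1, 2, 3, 4, 5]
comp = [0.00, 0.05, 0.10, 0.15, 0.20]
stable_treat = [0.0, 0.1, 0.2, 0.3, 0.4]    # smooth linear trend
volatile_treat = [0.0, 0.3, 0.1, 0.4, 0.1]  # large year-to-year swings

stable_placebo = placebo_effect(years, stable_treat, comp)
volatile_placebo = placebo_effect(years, volatile_treat, comp)
print(round(abs(stable_placebo), 3), round(abs(volatile_placebo), 3))
```

In this toy example the stable series yields a placebo bias of essentially zero, while the volatile series yields a bias of about .35, illustrating why instability in the treatment group's pretest series undermines extrapolation of the counterfactual.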
Table 3 shows the results for both models with and without matching. Only for science do the estimates consistently meet our informal correspondence standard of .20 standard deviations. For the other two outcomes, the baseline mean model combined with matching on both pretests and demographics performs best, producing estimates within .216 and .126 standard deviations of the benchmark, but the correspondence is still weak, particularly for the math outcome. The estimates are more biased than when the pretest trend differences are less volatile and more easily modeled, as they were in the previous studies.
Florida Study—The Difference Between Experimental and Quasi-Experimental Results
Note. Bootstrapped standard errors in parentheses. RCT = randomized controlled trial.
a The number of schools represents an average since the number of matched comparison schools varies by outcome measure.
*p < .05. **p < .01.
Table 4 compares the bootstrap results from all three of the Florida outcomes after matching. It shows the percentage of replications in which each model won out over the others. For the science outcome, which showed the best correspondence with the RCT results, the baseline trend model appears superior to the baseline mean model, but there is no difference between the two types of matching. For the math and reading outcomes, where correspondence with the RCT results was poor, matching on both pretest and demographic variables was clearly superior to matching on pretest information alone, winning out in 78% of the comparisons for math and 93% for reading.
Comparison of Bootstrap Replications for Florida Outcomes
Combining Bias and Precision to Determine Design Quality
Table 5 summarizes our findings with respect to bias reduction for all three studies. For both Indiana studies, when the appropriate parametric model is used, all of the RCT and CITS estimates are within .11 standard deviations of each other. Matching adds little in these cases, but it does help when an incorrect model is used, bringing the CITS estimates closer to the RCT estimates. For the Florida study, where the treatment group exhibits volatility and the pattern of pretest differences is less stable, all the CITS estimates perform well for science. But for the two other study outcomes, the estimates do not fall consistently within .2 standard deviations of the RCT; the baseline mean model in conjunction with matching on pretest and demographic information performs best.
Bias Results by Different Modeling Approaches
Note. Bootstrapped standard errors in parentheses. ELA = English-Language Art.
*p < .05.
When prospectively planning a research design, researchers must consider statistical power as well as bias. In order to take statistical power into account in assessing the different models, Tables 6 and 7 present the standard errors and the root mean square error (RMSE), respectively, of the estimates.
Standard Errors by Different Modeling Approaches
Note. Standard errors from 1,000 bootstrap samples. ELA = English-Language Art; RCT = randomized controlled trial.
Root Mean Square Error (RMSE) by Different Modeling Approaches
Note. ELA = English-Language Art.
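The way RMSE trades off bias against precision can be illustrated with a short calculation. The sketch below assumes the usual decomposition, in which RMSE is the square root of the squared bias plus the squared standard error; the numbers are hypothetical and are not taken from Tables 5 through 7.

```python
import math


def rmse(bias, standard_error):
    """Root mean square error under the assumed decomposition
    RMSE = sqrt(bias^2 + SE^2)."""
    return math.sqrt(bias ** 2 + standard_error ** 2)


# Hypothetical values: a matched estimate with less bias but a larger
# standard error, and an unmatched estimate with more bias but more precision.
matched_rmse = rmse(bias=0.02, standard_error=0.08)
unmatched_rmse = rmse(bias=0.04, standard_error=0.05)
print(round(matched_rmse, 3), round(unmatched_rmse, 3))
```

Under these illustrative values the unmatched estimate has the lower RMSE despite its larger bias, which is the kind of reversal that can occur when matching shrinks the comparison sample.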
For the first Indiana study, the baseline mean model remains clearly preferable to the baseline trend model when precision is also taken into account because it has fewer parameters. However, whereas the baseline mean model with matching on pretest and demographic covariates is slightly preferred with respect to bias over the unmatched baseline mean model, the unmatched model has lower standard errors and, for the math outcome, lower RMSE, leading us to conclude that neither model is clearly preferred over the other. Matching reduces the number of comparison schools, which in this case results in a loss of precision. Nevertheless, every CITS estimate has lower standard errors than the RCT.
For the second Indiana study, the baseline trend model continues to perform better than the baseline mean for the ELA outcome when precision is added to bias. For math, where the pretest trend difference is less clear and so the choice between the baseline mean model and baseline trend model less obvious, the RMSE results in Table 7 indicate that the baseline mean model is preferred for reasons of precision. Once again, matching appears less desirable when precision is added as a criterion and offers no significant advantage when the correct analytic model is used.
In Florida, where the trend differences are less easily modeled, the matched models again produce higher standard errors than the unmatched models. Nevertheless, for two of the three outcomes, bootstrap replications and RMSE calculations show that matching on pretest and demographic variables is superior to both the unmatched model and matching on pretest variables alone.
Figure 5 provides a graphical summary of our results across the three WSCs. The estimates shown are from the models we preferred a priori: baseline mean for Indiana 1, baseline trend for Indiana 2, and pretest matched with baseline trend for Florida. For five of the seven outcomes, CITS results are within the 68% confidence interval of the corresponding RCT. The average bias across the seven outcomes, weighted by sample size, is .13 standard deviations.

Comparison of randomized controlled trial and comparative interrupted time series (CITS) estimates with 68% confidence intervals. The CITS estimates shown are from the models preferred a priori: baseline mean for Indiana 1, baseline trend for Indiana 2, and pretest matched with baseline trend for Florida. The average bias across the seven outcomes, weighted by sample size, is .13 standard deviations.
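The two summary statistics reported for Figure 5 are simple to compute. The sketch below assumes that a 68% confidence interval is approximated by the RCT estimate plus or minus one standard error; the function names are hypothetical, and all numbers are placeholders rather than the study's estimates.

```python
def within_rct_interval(cits_est, rct_est, rct_se, z=1.0):
    """True if the CITS estimate lies within rct_est +/- z * rct_se.
    z = 1.0 approximates a 68% confidence interval (assumption)."""
    return abs(cits_est - rct_est) <= z * rct_se


def weighted_mean_abs_bias(biases, weights):
    """Average absolute bias, weighted (for example) by sample size."""
    return sum(w * abs(b) for b, w in zip(biases, weights)) / sum(weights)


# Hypothetical rows: (CITS estimate, RCT estimate, RCT standard error, n).
results = [
    (0.10, 0.12, 0.05, 40),
    (0.30, 0.10, 0.08, 60),
]
covered = [within_rct_interval(c, r, se) for c, r, se, _ in results]
avg_bias = weighted_mean_abs_bias(
    [c - r for c, r, _, _ in results],
    [n for _, _, _, n in results],
)
print(covered, round(avg_bias, 3))
```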
Summary and Discussion
In an earlier study (St. Clair et al., 2014), we examined the validity and precision of the CITS design by comparing its results to those of an educational intervention evaluated by means of an RCT. We now add results from two additional WSCs in an attempt to increase external validity and attest to reproducibility. Two of the three data sets showed clear but different patterns in the pretreatment functional forms, and the correspondence with the RCT results was quite strong. Applying the appropriate analytical model in these instances produced results within .07 standard deviations of the benchmark, even without matching. In the third data set, where the time series of the experimental treatment group showed much greater volatility and so the pattern of preintervention group differences was less clear, the results were more mixed. For two of the three outcomes, the correspondence between the CITS and RCT estimates was quite weak, with even the best model (baseline mean preceded by matching on both pretest and demographic information) producing estimates that were .22 and .13 standard deviations from the benchmark.
Matching should reduce bias when functional form differences cannot be easily modeled. There are some indications that it did. When the incorrect model was applied—as with the baseline trend model for Indiana 1 and the baseline mean model for ELA in Indiana 2—matching reduced the bias in five of the six cases. In the Florida study, where the treatment group exhibited greater volatility, matching on pretest and demographic data reduced the bias in four of the six cases but still did not result in what many would consider to be an acceptable level of bias, and the results of matching on pretest information alone were even worse.
We also examined the standard errors and RMSE of our models in order to take both bias and precision into account. In all three studies, a large number of comparison cases were available, and every unmatched CITS estimate but one had lower standard errors than the RCT. Our matching estimators did not perform as well after taking precision into account due to the far smaller number of comparison schools relative to the intact (unmatched) comparison group.
Some Implications
In the applications presented, we had access to five to seven pretest time points. This is a number many educational researchers will be able to attain in the current era of “big school data,” though it is perhaps not as many as may be available in other settings, such as medicine. Moreover, the data points used in this article came from annual cross-sectional school cohorts whose composition changed from year to year, thus increasing unreliability relative to a true longitudinal design that follows the same individuals over time. Nevertheless, five of the seven outcomes that we examined showed little bias.
Perhaps the most important implication is that our CITS and RCT studies generally corresponded, despite our analyses explicitly attending to only two of the five internal validity threats commonly associated with CITS (Cook & Campbell, 1979; Shadish, Cook, & Campbell, 2002). We controlled for (a) a simple selection difference between the treatment and nonequivalent comparison schools and for (b) a time-varying difference in these schools. However, we did not explicitly control for group differences at intervention that might affect academic achievement such as (c) historical events, (d) changes in testing practices, or (e) statistical regression due to the treatment being introduced as a response to a sudden change in performance immediately prior to treatment. Nonetheless, CITS results still closely approximated those of the RCT in five of the seven outcomes, suggesting that these other threats were not operating to any significant degree. Nor did they operate in any of the previously published WSCs of CITS, all of which obtained comparable RCT and CITS results. Although these internal validity threats are possible, they may be rare in the real world of educational research.
Although this and other studies have shown that CITS results can mimic RCT results, the Florida study is something of an exception. Matching seemed to offer the best promise as an analytic technique due to the uncertainty in functional forms and the especially high volatility in the pretest time trend of the treatment group. After matching, the pretest means and slopes seemed comparable, if not perfectly so. Nevertheless, an uncomfortable amount of bias remained, particularly when matching on pretest information alone. Only matching on both pretest and demographic information offered improvements for the two outcomes that performed poorly in the unmatched analysis, but even then the degree of correspondence was weak.
It is not clear why the Florida results show a lack of correspondence. The lack of stability in the pretreatment time series suggests that variability in cohorts may be confounding identification of the treatment effect. It may also be that matching failed to include all of the variables correlated with selection into treatment. Both Morgan and Winship (2015) and Heckman and Hotz (1989) point to the ways in which pretest information can be misleading, with Heckman and Hotz (1989) emphasizing caution when more recent pretest information cannot be well predicted from earlier years. Indeed, specification tests confirmed that the earlier pretest information in the Indiana studies was much more helpful in predicting the more recent pretest information than in the Florida data, casting doubt on our ability to recover unbiased estimates from the Florida study. It is also possible that the RCT estimates deviate from their unknown parameter values due to unobservable baseline differences in the treatment and control groups. Whatever the reason, these results suggest that the CITS design may perform poorly when there are high levels of variation over time in the study outcome.
Some caveats are necessary. Two of the three RCTs we draw on are from the same state and examine the same intervention. Thus, although they are independent samples and show different pretest trend differences, it is possible that they face similar sources of confounding and thus do not represent two completely independent tests of the CITS design. Furthermore, based on the results of the Florida study, we draw conclusions about the validity of the CITS design in situations where the functional form of the pretest time trend differences renders modeling difficult; however, with such large cohort-to-cohort variability, the performance of the treatment group is likely to have significant variance, and it is possible that our standard errors do not fully capture these cohort effects, thereby overstating the difference between CITS and RCT results. Thus, an alternative interpretation is that WSC results will be highly variable in the face of such treatment group instability and that it is not possible in this instance to draw conclusions about study design.
In light of these caveats, what do we recommend to researchers prospectively planning a short CITS study? Data collection is extremely important; with more reliable data, it is easier to recognize functional forms and thus to make decisions around model selection. A minimum of three pretreatment time points is necessary to fit linear slopes, but we would urge researchers to collect at least five if possible. Specification tests demonstrating that earlier pretest data successfully predict more recent preintervention outcomes represent a useful exercise for validating analytic choices. Instances where the treatment group does not exhibit a stable trend are trickiest. The evidence presented here suggests that in such a case, matching on both pretest and demographic information performs best, though this result may not be generalizable to all settings, and more research is needed on time series where there is a high level of variability in the study outcome. However, when a large group of comparison cases is available, and there are stable trends in pretest group differences, CITS designs can produce unbiased estimates that are as precise as those from RCTs.
Appendix
Placebo Tests
| Study | Outcome | Without Matching: Baseline Mean | Without Matching: Baseline Trend | Matching Pretest: Baseline Mean | Matching Pretest: Baseline Trend | Matching Pretest and Demographics: Baseline Mean | Matching Pretest and Demographics: Baseline Trend |
|---|---|---|---|---|---|---|---|
| IN 1 | ELA | 0.016 | 0.077 | 0.016 | 0.003 | 0.037 | 0.046 |
| IN 1 | Math | 0.073 | 0.000 | 0.074 | 0.024 | 0.035 | 0.045 |
| IN 2 | ELA | 0.247 | 0.079 | 0.181 | 0.154 | 0.213 | 0.153 |
| IN 2 | Math | 0.088 | 0.101 | 0.050 | 0.092 | 0.121 | 0.118 |
| FL | Science | 0.235 | 0.212 | 0.262 | 0.276 | 0.212 | 0.199 |
| FL | Math | 0.010 | 0.133 | 0.014 | 0.104 | 0.132 | 0.050 |
| FL | Reading | 0.112 | 0.143 | 0.118 | 0.126 | 0.087 | 0.123 |
Note. This table reports the absolute value (bias) of the estimated “treatment effect” in the last baseline year prior to treatment. The values reported in the text refer to the average bias from the models preferred a priori: baseline mean for Indiana 1, baseline trend for Indiana 2, and pretest matched with baseline trend for Florida. ELA = English-Language Art.
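As a check on the averages cited in the text, the following short calculation reads the relevant cells from the table above. It assumes that the Indiana average pools the four outcomes from both Indiana studies under their preferred unmatched models; that reading reproduces the reported values of .067 and .169.

```python
# Placebo biases from the preferred models in the table above.
# Indiana 1: baseline mean, without matching (ELA, math).
# Indiana 2: baseline trend, without matching (ELA, math).
indiana = [0.016, 0.073, 0.079, 0.101]
# Florida: baseline trend after matching on pretest (science, math, reading).
florida = [0.276, 0.104, 0.126]

indiana_avg = round(sum(indiana) / len(indiana), 3)
florida_avg = round(sum(florida) / len(florida), 3)
print(indiana_avg, florida_avg)  # 0.067 0.169
```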
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors gratefully acknowledge funding from National Science Foundation PRIME Grant DRL-1228866.
