Abstract
Compared to the randomized experiment (RE), the regression discontinuity design (RDD) has three main limitations: (1) in expectation, its results are unbiased only at the treatment cutoff and not for the entire study population; (2) it is less efficient than the RE and so requires more cases for the same statistical power; and (3) it requires correctly specifying the functional form that relates the assignment and outcome variables. One way to overcome these limitations might be to add a no-treatment functional form to the basic RDD and to include it in the outcome analysis as a comparison function, rather than as a covariate to increase power. Doing this creates a comparative regression discontinuity design (CRD). It has three untreated regression lines. Two are in the untreated segment of the RDD (the usual RDD one and the added untreated comparison function), while the third is in the treated RDD segment. Also observed is the treated regression line in the treated segment. Recent studies comparing RE, RDD, and CRD causal estimates have found that CRD reduces imprecision compared to RDD and produces valid causal estimates both at the treatment cutoff and along the rest of the assignment variable. The present study seeks to replicate these results, but with considerably smaller sample sizes. The power difference between RDD and CRD is replicated, but not the bias results either at the treatment cutoff or away from it. We conclude that CRD without large samples can be dangerous.
In expectation, randomized experiments (REs) justify valid causal conclusions because the treatment and control groups are equivalent on all observed and unobserved covariates, thus meeting the “strong ignorability” assumption (G. W. Imbens & Rubin, 2015; Rosenbaum & Rubin, 1983). However, it is difficult to assign national, state, or local laws and policies at random, and the same is true of some social programs. Hence, quasi-experimental methods will always be needed. Among the best of these is the regression discontinuity design (RDD). First described in Thistlethwaite and Campbell (1960), RDD requires that all individuals scoring on one side of a cutoff determining treatment assignment receive treatment while all those on the other side do not. The selection process into treatment is therefore fully known and can be easily modeled; it depends only on a cutoff score on an assignment variable.
RDD needs an empirical as well as a theoretical rationale because its practice is beset with many implementation shortfalls and analysis decisions that can threaten the internal validity. A meta-analysis of 15 quite heterogeneous studies compared RE and RDD estimates at the treatment cutoff in the context where each design shared the same treatment and measurement details (Chaplin et al., in press). The average design difference in impact at the cutoff was .01 standard deviations (SDs), and individual study differences were tightly distributed around this effectively zero mean. These results conform with theoretical prediction and add support to the relevance of RDD as a causal design. It is also a practical design. In education alone, it has been used to evaluate interventions as diverse as (1) a remedial mathematics program administered to students scoring below a specific value on a placement test (Lesik, 2006), (2) a kindergarten program for which eligibility was determined by date of birth (Elder, 2010), and (3) an educational funding and school lunch program that was only provided to school districts scoring below a certain score on an economic advantage scale (Guryan, 2001; McEvan, 2013). RDD is now included in almost every compendium of evidence-based practices and rightly so.
Nonetheless, it has limitations. First, the functional form of the relationship between the assignment variable and outcome variable has to be correctly modeled. This is because a counterfactual potential control outcome mean and slope are needed to compare with the observed treated data, and this counterfactual is constructed by taking the observed regression slope in RDD’s untreated segment and extrapolating it into the treated segment.
A second limitation of basic RDD is that it is less efficient than RE by a factor of about 2.75 (Goldberger, 1972), thus requiring more cases for the same statistical power. This problem arises because fully modeling the selection process into treatment requires that both the assignment variable and the binary treatment variable are in the impact model. Yet, each is a product of the same assignment variable, and so they are highly collinear, thus reducing power when jointly used (Goldberger, 1972; Schochet, 2009). Simulation studies show that the size of this reduction varies with many factors. With sample size held constant, RDD and RE have equivalent statistical power only in the rare circumstance where the assignment variable and outcome are uncorrelated (Cappelleri, Darlington, & Trochim, 1994; Klerman, Olsho, & Barlett, 2015).
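The factor of about 2.75 can be reproduced under the conditions usually attached to it: a normally distributed assignment variable, a cutoff at its mean, and a linear impact model fitted with the same N in both designs. The sketch below is illustrative only; these conditions and the function name are assumptions made for the example, not features of the present data.

```python
import math

def rdd_design_effect_normal_cutoff_at_mean():
    """Variance inflation of the RDD impact estimate relative to an RE.

    Assumptions (for illustration only): assignment variable X ~ N(0, 1),
    cutoff at the mean so that T = 1 if X > 0, and the linear model
    Y = a + b*T + c*X + e fitted in both designs with equal N.
    """
    cov_xt = 1.0 / math.sqrt(2.0 * math.pi)  # E[X * 1{X > 0}] = phi(0)
    sd_x = 1.0
    sd_t = 0.5                               # sqrt(0.5 * 0.5) for a 50/50 split
    rho = cov_xt / (sd_x * sd_t)             # corr(X, T), about 0.80
    # In the RE, T is independent of X, so there is no inflation. In the RDD,
    # the variance of the coefficient on T is inflated by 1 / (1 - rho^2).
    return rho, 1.0 / (1.0 - rho ** 2)

rho, design_effect = rdd_design_effect_normal_cutoff_at_mean()
print(f"corr(X, T) = {rho:.3f}; design effect = {design_effect:.2f}")
```

Under these assumptions the squared correlation is 2/pi, so the design effect is 1 / (1 - 2/pi), or about 2.75. Other distributions of the assignment variable or other cutoff locations would give different values.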
While the first two limitations concern the average treatment effect (ATE) or ATE above the cutoff, a third limitation applies to the local average treatment effect (LATE). This third, and arguably greatest, limitation of RDD is that it generates a causal inference that applies at the cutoff and nowhere else. This local ATE at the cutoff is less general than the RE’s ATE that applies to the entire study population, whether treated or not. It is also less general than any treatment on treated (TOT) effect applicable to all those cases in the treated segment of the assignment variable. The LATE restriction follows from the cutoff being the only point where the predicted regression lines on each side of the cutoff overlap. The cost of such understandable caution is that basic RDD causal estimates do not generalize to the study population or to its subpopulation that actually experiences treatment.
Relevant Theory
This article explores how functional form estimation, statistical power, and causal generalization away from the cutoff are facilitated by taking the basic RDD structure and adding to it untreated observations that span all or most of the assignment variable. Two untreated slopes created by the data of two untreated groups are now available in the untreated RD segment instead of one; and for the first time, untreated observations are now available in the treated segment. Also novel is that untreated observations are now found across the cutoff determining treatment assignment, instead of being restricted to one side of it as in basic RDD. The design that results we call comparative regression discontinuity (CRD). Its untreated observations can come from many sources. The two examined to date add data from a pretest measure of the study outcome (CRD-Pre), as in Wing and Cook (2013), or from a nonequivalent comparison group (CRD-CG), as in Tang, Cook, Kisbu-Sakarya, Hock, and Chiang (2017). However, data could be used from more than one preintervention time point, from more than one nonequivalent comparison group, or from both pretest and comparison group data used together within the same application. For ease of exposition, we limit ourselves here to the simplest CRD-Pre and CRD-CG designs with one pretest data wave and one comparison group.
Figure 1 illustrates CRD using hypothetical linear data. The solid line represents the usual treated and untreated groups in RDD, and the depicted difference in means at the cutoff indicates a positive treatment effect there. The dashed line illustrates the added untreated observations within the treated and untreated RD segments and also across them. Since no intervention is involved, adding the extra untreated data should not change the cutoff mean or slope, and so none is portrayed. Taking the dashed and solid slopes together, Figure 1 depicts a mean difference at the cutoff in the basic RDD that is different from the no-difference observed with the added untreated comparison data. Unique to CRD are the three untreated regression lines and thus the ability to move from a null hypothesis predicated on a single difference, as in basic RDD, to a null hypothesis predicated on a difference in differences.

Figure 1. Observed data in comparative regression discontinuity.
Missing from the observed means and slopes in Figure 1 is the counterfactual: how the unobserved data would have looked along the assignment variable if all the treated cases had not been treated. Figure 2 illustrates three hypothetical scenarios for this potential outcome function, and, for clarity of exposition only, they are all presented as linear. One extrapolation takes the normal RDD untreated slope into the treated segment and shows no change in either mean or slope at the cutoff (Scenario 1). If this were the true counterfactual, we would correctly conclude from the observed data that there is a mean treatment effect, not just at the cutoff but also across all the treated side of the assignment variable, thus suggesting a more general TOT causal conclusion than RDD’s LATE. In the second scenario in Figure 2, the unobserved counterfactual is presented as a change in the cutoff mean to the level observed in the treated group. This implies that some factor other than the treatment raised scores around the cutoff in the treatment group without also affecting the comparison pretest or the comparison population data where no corresponding change is observed. Such an unobserved counterfactual would confound any attempt to use the observed data to claim an internally valid effect either at the cutoff or away from it. The third scenario in Figure 2 depicts a cutoff-based change in the unobserved counterfactual slope, but not in the unobserved mean. In this case, CRD would not lead to a biased estimate at the cutoff if the relevant functional form were correctly modeled. But the estimate of a more general TOT inference away from the cutoff would be biased since it depends on summing impact estimates across all points in the treated segment of the assignment variable. In doing this with the third scenario, the counterfactual and observed data have different functional forms and so would sum differently.
Figure 2 indicates that we do not know the untreated form of the actually treated data; yet knowing this is crucial for unbiased causal estimates away from the cutoff.

Figure 2. Three hypothetical values of the unobserved potential outcome slope in comparative regression discontinuity.
The problem for CRD is that the added untreated regression lines do not exactly correspond to the counterfactual. CRD-Pre leads to a period difference between what is observed and what is required, while CRD-CG leads to a population difference between what is available and what is required. Nonetheless, the added untreated data are what is at hand for constructing a (necessarily imperfect) argument about bias away from the cutoff. The key to this argument is in Angrist and Rokkanen (2015), who contend that causal inference away from the cutoff is justified when the assignment variable is not related to the study outcome, namely, when the untreated regression lines have zero slope, indicating that the treatment assignment process and outcome are independent, this being the very mechanism that justifies valid causal inference in experiments with random assignment. Sometimes, the independence is unconditional because the assignment variable and outcome just happen not to be related; for an example, see the CRD-CG results in Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017). At other times, the claim is that the relationship can be made conditionally independent by the judicious use of covariates (Angrist & Rokkanen, 2015). With CRD, the claim is that the relationship can be made independent, not by some constellation of covariates but by knowledge of the functional forms of the added comparison data.
Past studies of CRD have assumed a difference in differences model. Generically, such a model takes any slope difference observed in the untreated segment of RDD and extrapolates it into the treated segment where it becomes the operational counterfactual against which the observed data are compared. Consider, first, the case where the added untreated data show no change across the cutoff and the observed slopes in the untreated RDD segment are not parallel. Thus, there is a nonzero (and contingent) correlation between the assignment variable and outcome. To reduce this pattern of difference to conditionally zero requires, in essence, stably estimating each slope, differencing at each point on the assignment variable to achieve a single slope value, and then statistically controlling for this value in the impact model. Much can go wrong in such a process, especially with small samples.
That is why Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017) make two assumptions about CRD that limit what type of difference in difference model should be used to justify causal inferences away from the cutoff. One is that the relationship between the assignment variable and potential control outcomes does not change across the cutoff. This excludes from CRD the hypothetical Scenario 2 in Figure 2 because it assumes a shift in the unobserved means at the cutoff, and it also excludes Scenario 3 because it assumes a shift in the unobserved slopes. However, since the assumption pertains to potential control outcomes, it cannot be directly observed. One can only test whether a cutoff-based change in means or slopes is observed at pretest time (CRD-Pre) or in the comparison population (CRD-CG). But as useful as these steps are, they do not directly test whether the same relationship would hold in the treated group if it were observed without treatment.
The second assumption is that the assignment variable is independent of period differences for CRD-Pre and of population differences for CRD-CG. This assumption limits CRD to difference of difference contexts characterized by a constant difference between the untreated curves in the untreated segment of RDD; they have to be parallel, therefore. Moreover, the assumption of no change around the cutoff when CRD data are added means that the untreated data in the treated segment will have the same slope value along at least part of the assignment variable. The assumption of parallel slopes in the untreated segment can be partially tested by examining how parallel they are, though a comparable test is not possible in the treated segment because only one untreated slope is observed there.
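The logic of a constant difference between parallel untreated curves can be made concrete with a deliberately simplified CRD-Pre sketch. It assumes paired pretest and posttest scores per case, treatment for cases at or above the cutoff, and noise-free hypothetical data; the function names and numbers are invented for illustration and this is not the impact model estimated in this article.

```python
def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

def crd_pre_tot(x, pre, post, cutoff=0.0):
    """Difference-in-differences TOT under a constant untreated offset.

    Assumption: in the untreated segment, posttest minus pretest is a
    constant (parallel curves), and that constant carries over unchanged
    into the treated segment.
    """
    unt = [(xi, po - pr) for xi, pr, po in zip(x, pre, post) if xi < cutoff]
    trt = [po - pr for xi, pr, po in zip(x, pre, post) if xi >= cutoff]
    offset = sum(d for _, d in unt) / len(unt)
    # Partial check of parallelism: the pre/post gap should not trend with
    # the assignment variable in the untreated segment (slope near zero).
    gap_slope = slope([xi for xi, _ in unt], [d for _, d in unt])
    tot = sum(trt) / len(trt) - offset
    return tot, offset, gap_slope

# Hypothetical noise-free data: pretest line 2 + 0.5x, a period shift of 1,
# and a true effect of 3 for every treated case.
x = [i / 10 for i in range(-20, 21)]
pre = [2.0 + 0.5 * xi for xi in x]
post = [pr + 1.0 + (3.0 if xi >= 0 else 0.0) for xi, pr in zip(x, pre)]
tot, offset, gap_slope = crd_pre_tot(x, pre, post)
print(tot, offset, gap_slope)
```

The check on `gap_slope` mirrors the partial test described above: it can be run only in the untreated segment, because only there are two untreated curves observed.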
Why is the assumption of parallel functional forms so important? The claim has been made that the strong ignorability assumption is met with RDD when the assignment variable and study outcome are independent (Angrist & Rokkanen, 2015). When regression curves are plausibly parallel, they permit simple impact models that control for the constant slope value and hence for all the factors responsible for the assignment variable and outcome being related. Testing the conditional independence assumption is more difficult with more general versions of a difference in differences model that allow for more complex functional form differences. All three untreated curves might then have unique values and be less stably estimated when compared to parallel curves with the same slope value.
The key to bias control away from the cutoff is trustworthy knowledge that the untreated regression lines in the untreated RD segment are plausibly parallel, and an added hope is that the untreated slope in the treated segment will also have the same value. Either situation provides support (but not proof) for the crucial but untestable assumption that the potential outcome slope of the treated group would have had the same value. However, the more complex obtained curves are in form, the more difficult CRD analysis will be. Such forms are usually estimated less well, they are less well controlled in the impact analysis, and they often take the data into parts of the assignment variable where estimation is more difficult because cases are less dense and scale intervals are less likely to be equal. While transformations of the data can reduce some of these complexities of estimation, our belief is that parallel and linear slopes offer more security about the data quality needed for strong tests of CRD when compared to tests predicated on more complex functional forms—even plausibly parallel ones.
The main internal validity threat with CRD is very narrow and depends on two things. One is causally irrelevant forces that operate across the cutoff and the other depends on the form of the CRD variant. In CRD-Pre, alternative interpretations are further limited to those that differentially affect the pretest and posttest, while in CRD-CG they are further limited to those that differentially affect the intervention and population groups. In CRD-Pre, the chance of such a unique force operating in this way is presumably low and even lower as the pretest/posttest correlation increases—for example, with shorter rather than longer intervals between measures. In CRD-CG, the chance of a unique outcome-correlated force operating contemporaneously across the cutoff is presumably less the more similar the untreated group means and functions are to start with, perhaps due to local matching instead of choosing a comparison population that is very different from the treatment one.
Our speculation is that adding a comparison function to basic RDD will reveal most of the nontreatment forces that affect the outcome across the cutoff and that relatively few alternative interpretations will be unique to the period between pretest and posttest or to specific differences between the intervention and comparison populations. However, we cannot definitively assert that no period or population differences operate across the cutoff. Hence, the rationale for CRD is (1) that such internal validity threats are usually implausible and (2) that estimation issues are less problematic when the observed but untreated regression lines are parallel and perhaps also linear.
Prior Literature
Wing and Cook (2013) examined CRD-Pre results away from the cutoff, using a within-study comparison (WSC) design to do this. WSCs usually compare effects from an RE that serves as a causal benchmark with those from a nonexperiment. To do this for CRD-Pre, the authors took the data from an RE on how Medicaid expenditures changed when the families of disabled persons were allowed to use the money allocated to them for services versus having a government official determine the services. Family members spent more of their allotment than did nonfamily professionals. Using data on about 1,000 households from each of three separate states, Wing and Cook (2013) created a CRD-Pre design by using a third variable from the RE as the assignment variable and determining three cutoff values on it per state. They then eliminated all the treated cases in the untreated segment and all untreated cases on the treated side, thus resulting in a basic RDD. They then added pretest data on expenditures per household and used them in the outcome model as part of a difference in differences analysis. Adding the pretest data increased power over basic RDD; the untreated functional forms were plausibly linear and parallel in two of the three states examined, and little bias was obtained away from the cutoff in these two states relative to the RE results. Where the untreated regression lines were not plausibly parallel, bias away from the cutoff was greatest. But even here it might be interpreted as “modest” for some purposes: between .10 and .15 SD units.
Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017) modified and extended this work with another large data set, but still creating the synthetic RDD out of an RE on how Head Start affected social behavior and academic performance in both math and language arts. They examined both CRD-Pre and CRD-CG, creating the nonequivalent comparison group from students who served in the RE control group but who were 1-year older on average than the treated 3-year-olds. This study found that (1) all the untreated regression segments were plausibly linear and parallel, (2) standard errors (SEs) were lower in each CRD than in basic RDD, (3) little meaningful difference was obtained between the RDD and CRD-Pre or CRD-CG treatment effect estimates at the cutoff, and (4) little meaningful design difference was obtained along all the treated side of the assignment variable. However, the CRD-CG results have to be interpreted with caution because the assignment variable and outcome were not correlated with each other even before adding the untreated comparison group data. This means that the added data were not needed to set the conditional correlation between the assignment variable and outcome to zero; it was already zero. Fortunately, this was not the case with most of the CRD-Pre analyses, and here the results closely corresponded with those in Wing and Cook (2013). It seems, then, that CRD can produce unbiased causal estimates away from the cutoff and that it can also mitigate the other limitations of basic RDD. But the evidence is stronger for CRD-Pre, not just because of the replication across two such different data sets, but also because the assignment variable and outcome were initially correlated in CRD-Pre but not in CRD-CG. Indeed, they have yet to be initially correlated in any CRD-CG example.
Prior research on CRD also deals with statistical power. Adding pretest data reduces the overall correlation between the assignment variable score and the binary treatment dummy because there is now less variation in the treatment dummy due to the constant control status at the pretest. Including the pretest data in the analysis will therefore reduce the collinearity between the two that is endemic to basic RDD and so will increase power. The same process is involved when nonequivalent comparison cases are added in CRD-CG to serve as the untreated comparison function. CRD involves more data than RDD, and it is not a surprise that statistical power increases because of this. The conditions under which each CRD design differentially affects power are laid out in Tang, Cook, and Kisbu-Sakarya (2017) and Tang and Cook (under review), and evidence that each does indeed increase power is in both Wing and Cook (2013) and Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017).
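The drop in collinearity can be illustrated with a small simulation. The sketch assumes a standard normal assignment variable centered at the cutoff, treatment above the cutoff, and one pretest row stacked per case; the sample size, seed, and variable names are arbitrary choices made for the example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)
# Hypothetical assignment scores, centered so that the cutoff is 0.
x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
t_post = [1.0 if xi > 0 else 0.0 for xi in x]  # posttest rows: treated above cutoff

# Basic RDD: correlation between assignment score and treatment dummy.
corr_rdd = pearson(x, t_post)

# CRD-Pre: stack one pretest row per case; nobody is treated at pretest,
# so the treatment dummy is 0 on every added row.
x_stacked = x + x
t_stacked = t_post + [0.0] * len(x)
corr_crd_pre = pearson(x_stacked, t_stacked)

print(f"RDD: {corr_rdd:.2f}  CRD-Pre (stacked): {corr_crd_pre:.2f}")
```

Under these assumptions the correlation falls from roughly .80 to roughly .46. The sketch covers only the raw correlation between the two predictors; the power gain in any real application also depends on the other terms in the impact model.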
Small Samples and Present Study Purposes
The current study is a small-sample stress test of how well CRD-Pre and CRD-CG overcome the limitations of simple RDD, especially as concerns causal generalization away from the cutoff. WSCs require comparing causal estimates from an RE, RDD, and CRD when each design has the same treatment, the same measurements, and the same estimand. This last requirement leads to comparing LATE estimates for all three designs, whereas TOT estimates can only be compared for RE and CRD.
In a WSC, the RE treatment effect estimates are used as the causal benchmark, the best approximation to the true causal parameter that is not directly observed. However, with smaller sample sizes, the sampling error will be larger and treatment point estimates will be more variable. The RE is used as part of the effort to index how much bias there is in the CRD, but given sampling error some of the obtained difference may be due to variation in the RE estimate and not to bias in the contrast. Realizing this, Rubin (2008) has counseled use of covariates to reduce such sampling error in WSCs, and we will do this. Nonetheless, smaller samples entail more sampling variance, and an advantage of large sample REs is that their results are more efficient.
Small samples also tend to reduce how precisely functional forms and functional form differences are estimated. In difference in difference analyses, this influences how precisely the difference in the untreated segment of the assignment variable is estimated, all the more so since that difference is itself the product of two unreliable slopes. Estimation quality is also an issue when constant slopes (parallel functional forms) are assumed. How well this assumption can be tested in the data becomes problematic, and while averaging three slopes assumed to be equal increases reliability, the increase can still be marginal where very small numbers are concerned. Smaller samples entail less precise estimation of slopes or slope differences, all other things being equal. Even if the constructed counterfactual in a difference of difference or constant slope scenario is correct, estimation of the necessary parameters will be less precise as sample sizes decrease.
In RDD, it is now common to avoid uncertainty about functional forms by resorting to nonparametric analyses such as local linear regression. However, such analyses require larger samples than when analysts are willing to postulate broader functional forms than those within narrow bandwidths. Indeed, the sample size in the analyses we present precludes all use of nonparametric or even semiparametric tests. Another disadvantage of smaller samples, then, is their inability to support nonparametric analysis or adequately powerful nonparametric analysis.
Small samples have a further disadvantage. While CRD may indeed increase power over RDD thanks to the added information, the power gain in CRD might still not be to a level that makes null hypothesis significance testing meaningful. The consequence of this is important because it would be desirable to test diagnostic assumptions about, say, how parallel untreated functional forms are. Such diagnostic tests are necessary to the logic of this article, but they are underpowered, and thus less meaningful, with smaller sample sizes that make it difficult to reject the null hypothesis. This same point also applies to the outcome data, that is, to tests of the difference between RE and RDD estimates away from the treatment cutoff. Smaller samples entail larger SEs and hence a bias toward falsely interpreting a statistically nonsignificant difference as evidence of comparable causal effects between the RE and any form of RD estimates. To use null hypothesis testing criteria to this end requires a conventionally efficient test of design differences. But smaller samples reduce the chances of this.
This article uses the WSC method (for a discussion of it, see Cook, Shadish, & Wong, 2008) with small samples in order to compare causal impact estimates and SEs at and beyond the cutoff. It does this for basic RDD, CRD-Pre, CRD-CG, and RE. We hypothesize that the added CRD data will increase power over basic RDD. While theory suggests that bias away from the cutoff will be reduced if the untreated regression curves form a clear pattern, and especially when they are linear and parallel, smaller samples becloud strong tests of how well these conditions are met. This might be termed our estimation problem. Also, the missing counterfactual data might not be of the form attributed to them in either a difference in difference model or a model based on stable and parallel regression curves. This might be termed our identification dilemma.
Method
WSC Design
WSCs test the correspondence between estimates from a presumptively valid causal benchmark from an RE and a quasi-experimental design (QED) when each has identical treatment content and measurement characteristics. In the present case, the QED is of three kinds: basic RDD, CRD-Pre, and CRD-CG. The benchmark is an RE that has been vetted for adherence to the usual assumptions—use of a correct randomization procedure, no differential attrition, no violation of the stable unit treatment value assumption, and pretest balance on all observables. In most WSCs, effects are separately estimated for the RE and RDD, and they are then differenced to index the amount of bias remaining in the nonexperiment after whatever steps were taken to control for the initial selection bias. Numerically identical estimates are not expected from each design, given sampling error in both the RE and RDD.
To date, 15 WSCs have tested the effects of basic RDD at the cutoff (Chaplin et al., in press). Nine used a tiebreaker experiment to contrast RE and RD estimates. This design requires conducting an RE within some part of the assignment variable and an RDD along the rest. Thus, there can be two cutoffs: one at the RE’s lower bound on the assignment variable and the other at its upper bound. Five of the WSCs used the same “synthetic” design as here, creating the RDD from the RE data by designating a continuous assignment variable in the RE as the running variable and a specific score on it as the cutoff. Then, all control cases above the cutoff are deleted, plus all treatment cases below it. The result is an RDD with the same treatment and measurement details as in the RE, but with only about half as many cases. To move to a WSC on CRD requires these same procedures, plus adding a pretest score for CRD-Pre and nonequivalent group data for CRD-CG. The meta-analysis discovered that the amount of RD bias did not depend on how the RDD was created and that results were consistent with the theoretical prediction of no RD bias at the treatment cutoff and with less efficiency in the RDD, whose SE estimates were consistently larger.
The Data Set
To examine bias away from the cutoff in CRD with sparse data, we used information from Shadish, Clark, and Steiner (2008) who examined how training in math or vocabulary affected performance in these domains. Preintervention tests of general math and vocabulary performance were administered, as were posttest measures of math and vocabulary closely aligned to the training materials. Students were also tested on a large battery of other tests assessing motivational, psychological, and demographic attributes. For those exposed to math training, posttest math was the treatment-linked outcome and was compared to the math performance of those exposed to the vocabulary training. Likewise, vocabulary scores served as the treatment-linked outcome for those exposed to vocabulary training, and the vocabulary scores of those exposed to math training served as the controls. All participants were introductory psychology students in the same term in the same class. The training lasted less than an hour in total and so is laboratory-like, promising high internal validity as a WSC but more limited external validity.
The study involved 445 undergraduates. They were first randomly assigned to the RE (N = 235) or to a nonequivalent comparison group (N = 210). Those in the RE arm were then randomly assigned to mathematics or vocabulary training (Ns = 119 and 116, respectively), while those from the quasi-experiment were asked to self-select into the curriculum of their choice. More chose vocabulary training (N = 131) than math training (N = 79).
The RDD and CRD Data
In the WSC we report, preintervention American College Testing Exam (ACT) scores were used as the assignment variable, and the median ACT score served as the treatment cutoff. Treatment group participants in the RE who scored below the ACT cutoff were systematically deleted, as were control participants scoring above it. This resulted in an RDD data set with N = 123 for the mathematics outcome and N = 112 for the vocabulary outcome, about half of the RE sample size. The cases on each side of the cutoff were about half of these numbers. Such sample sizes are quite small for regression analyses with individual data.
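The deletion rule that turns RE data into a synthetic RDD can be sketched as follows. The records are randomly generated stand-ins, the field names `act` and `treated` are hypothetical, and the handling of cases scoring exactly at the median (kept on the treated side here) is an assumption, since the text does not state it.

```python
import random
import statistics

def make_synthetic_rdd(cases, cutoff=None):
    """Keep treated cases at or above the cutoff and controls below it.

    `cases` is a list of dicts with hypothetical keys "act" (assignment
    score) and "treated" (bool). If no cutoff is given, the sample median
    is used. Placing ties at the cutoff on the treated side is an
    assumption made for this sketch.
    """
    if cutoff is None:
        cutoff = statistics.median(c["act"] for c in cases)
    kept = [
        c for c in cases
        if (c["treated"] and c["act"] >= cutoff)
        or (not c["treated"] and c["act"] < cutoff)
    ]
    return cutoff, kept

# Randomly generated stand-in for an RE of 235 cases (not the study data).
rng = random.Random(1)
re_cases = [
    {"act": rng.randint(15, 35), "treated": rng.random() < 0.5}
    for _ in range(235)
]
cutoff, rdd_cases = make_synthetic_rdd(re_cases)
print(f"cutoff = {cutoff}, RE N = {len(re_cases)}, RDD N = {len(rdd_cases)}")
```

Because treatment status is independent of the score in an RE, roughly half of the cases survive the deletion, which is why the synthetic RDD has about half the RE sample size.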
The CRD-Pre data were created by adding pretest math or vocabulary measures to the RDD data set, the choice depending on the outcome being tested. Strictly speaking, this is a proxy pretest version of CRD-Pre, since the pretest and posttest items were not identical, one tapping into general math or vocabulary and the other into the specific math or vocabulary content taught.
The CRD-CG data set was created by adding to the basic RDD the posttest scores of the untreated group in the quasi-experimental arm of the original WSC study. They differed from RE students by virtue of self-selecting into the math or vocabulary curriculum and on all details correlated with this. They are nonetheless matched with treatment students on many other attributes, such as the university and classes attended and the year of attendance. Adding these nonequivalent controls leads to a total CRD-CG sample size of 254 for the vocabulary outcome and 191 for math, the difference being due to more students self-selecting into vocabulary training, an imbalance that was not possible with the experimenter-controlled process of random assignment. Again, about half of these numbers served on each side of the cutoff.
Analyses
To minimize human biases due to the RE results being known before the CRD results were analyzed, RE treatment effects and SEs were computed after the RDD and CRD analyses were completed. However, such blinding is only partial, since the RE’s ATE estimates were known from Shadish et al. (2008). Although this same estimate might not hold for the cutoff LATE, or even for the TOT representing the whole treated area, the published ATE results nonetheless constitute a ballpark estimate of what the CRD LATE and TOT results might be.
We computed LATE estimates for the basic RDD, each CRD design, and the RE to eliminate confounding between the estimands and research designs. However, this reduces the RE’s precision and generates an estimate that would rarely, if ever, be chosen for theory or policy purposes. The LATE estimate is also irrelevant to the CRD question of whether unbiased effects can be detected for all treated cases, including at points away from the cutoff. We computed TOT estimates in the RE, CRD-Pre, and CRD-CG designs, excluding basic RDD since TOT estimates are widely regarded as inappropriate for it. We compare LATE estimates at the cutoff to examine bias there, and TOT estimates to examine bias away from the cutoff.
Schochet et al. (2010) recommend fitting parametric and nonparametric models to RDD data, while others advocate the use of nonparametric local regression with empirically determined bandwidths (Hahn, Todd, & Van der Klaauw, 2001; G. Imbens & Kalyanaraman, 2011). However, nonparametric methods require correctly specified bandwidths and, for the same sample size, produce larger SEs than carefully chosen parametric tests. That limits their utility in the small-sample context, and our exploratory analyses showed they could not be used at all with the data we had.
Parametric analyses require identifying the true functional forms and so are increasingly poor choices as sample sizes become small (Gelman & Imbens, 2014). Nonetheless, to estimate parametric models, we follow Wing and Cook (2013) in preferring a parameter selection protocol based on least-square cross-validation methods. Models with linear, quadratic, cubic, and quartic polynomials of the centered assignment variable were tested, as well as models that interacted these polynomials with the treatment indicator dummy. The model with the smallest mean square error was retained as the main model to be compared to causal benchmarks. We also report as exploratory the results for the models with linear, quadratic, cubic, and quartic polynomials.
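To make the selection protocol concrete, the following Python sketch scores each candidate polynomial specification by its cross-validated mean square error and keeps the smallest. Leave-one-out folds, the omission of baseline covariates, the function names, and the simulated data are all assumptions of this illustration; the study's exact cross-validation scheme is not described in enough detail here to reproduce it.

```python
import numpy as np

def design_matrix(a, t, order, interacted):
    """Columns: intercept, treatment dummy, A^1..A^order, optionally T*A^k."""
    cols = [np.ones_like(a), t]
    cols += [a ** k for k in range(1, order + 1)]
    if interacted:
        cols += [t * a ** k for k in range(1, order + 1)]
    return np.column_stack(cols)

def loo_mse(X, y):
    """Leave-one-out mean squared prediction error of an OLS fit."""
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errors[i] = y[i] - X[i] @ beta
    return float(np.mean(errors ** 2))

def select_model(a, t, y, max_order=4):
    """Score every (order, interacted) specification; return the best one."""
    results = {}
    for order in range(1, max_order + 1):
        for interacted in (False, True):
            X = design_matrix(a, t, order, interacted)
            results[(order, interacted)] = loo_mse(X, y)
    best = min(results, key=results.get)
    return best, results

# Simulated data (hypothetical): centered assignment variable, sharp cutoff at 0.
rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, 120)
t = (a >= 0).astype(float)
y = 1.0 + 0.8 * t + 0.5 * a + rng.normal(0.0, 0.3, 120)
best, results = select_model(a, t, y)
```

With samples of this size the cross-validated errors of neighboring specifications are often close, which is one way to see why functional form is hard to pin down in small samples.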
All SEs were estimated using nonparametric bootstrap methods with 1,000 replications. That is, bias estimates were computed 1,000 times, and the SD of these 1,000 estimates was used as the SE. For CRD-Pre, individuals were resampled rather than observations due to the within-person correlated error structure.
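A minimal Python sketch of this bootstrap follows. It resamples whole individuals, so that each person's pretest and posttest rows stay together, and takes the SD of the replicated estimates as the SE. The estimator used in the example (a mean pretest-to-posttest gain) and the simulated scores are hypothetical stand-ins, not the bias estimator used in the study.

```python
import numpy as np

def bootstrap_se(estimator, data_by_person, n_boot=1000, seed=0):
    """SD of an estimator over bootstrap resamples of individuals.

    data_by_person has one row per person; all of a person's measurements
    are resampled together, preserving within-person correlation.
    """
    rng = np.random.default_rng(seed)
    n = data_by_person.shape[0]
    draws = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample persons with replacement
        draws[b] = estimator(data_by_person[idx])
    return float(np.std(draws, ddof=1))

def mean_gain(scores):
    """Hypothetical estimator: average posttest minus pretest."""
    return float(np.mean(scores[:, 1] - scores[:, 0]))

# Simulated (pretest, posttest) scores for 100 persons.
rng = np.random.default_rng(42)
pre = rng.normal(0.0, 1.0, 100)
post = pre + 0.5 + rng.normal(0.0, 0.5, 100)
scores = np.column_stack([pre, post])

se_boot = bootstrap_se(mean_gain, scores)
gain = scores[:, 1] - scores[:, 0]
se_analytic = float(np.std(gain, ddof=1) / np.sqrt(len(gain)))
```

For this simple estimator the bootstrap SE should land close to the analytic SE of a mean, which offers a quick sanity check before applying the same routine to a more complex estimator.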
Estimation
RE
We conducted t tests on all baseline covariates to identify those to be included in the RE estimates. Only one group difference was significant at baseline, fewer than would be expected by chance. It was for college credit hours taken to date, t(233) = 2.12, p < .05. This variable was therefore included in all the RE, RDD, and CRD analyses together with the treatment indicator, the ACT assignment variable, and the pretest score of the outcome variable. The following model was estimated for the RE:
where Yi is the posttest outcome variable for individual i, T is the binary treatment indicator, and X is the covariates vector (i.e., college credit hours taken, the ACT assignment variable, pretest of the outcome variable). ATE is estimated by β. The RE estimate at the cutoff may be different from the ATE if treatment effects vary along the assignment variable. Thus, we have estimated separate linear regression functions of the outcome variable on the assignment variable for the RE treatment and control groups. We then computed LATE as the point-wise differences in the RE treatment and control group regression functions at the cutoff value of the assignment variable.
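The RE specification and the cutoff contrast described in this paragraph can be written out as follows. Only Y, T, X, and β come from the text; the intercept α, the coefficient vector γ, the error term ε, the cutoff value c, and the arm-specific linear coefficients a and b are notation assumed for this sketch and should not be read as the authors' own symbols.

$$
Y_i = \alpha + \beta T_i + X_i^{\prime}\gamma + \varepsilon_i
$$

$$
\widehat{\mathrm{LATE}}_{RE} = \left(\hat{a}_1 + \hat{b}_1 c\right) - \left(\hat{a}_0 + \hat{b}_0 c\right)
$$

Here the second line assumes that the separate linear regressions of the outcome on the assignment variable A take the form $Y_i = a_g + b_g A_i + e_i$ within the treatment arm ($g = 1$) and the control arm ($g = 0$), so that the cutoff estimate is the difference in fitted values at $A = c$.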
The RE TOT estimate above the cutoff was computed as follows:
where A is the assignment variable and m is the observed value of A.
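One form consistent with the symbols defined here, and with the frequency-weighted averaging described for the CRD estimates, is sketched below. The set $\mathcal{M}_{+}$ of observed assignment values in the treated segment, the cutoff symbol c, and the exact weighting are assumptions of this sketch rather than a transcription of the authors' formula.

$$
\widehat{\mathrm{TOT}}_{RE} = \sum_{m \in \mathcal{M}_{+}} \hat{P}\left(A = m \mid A \ge c\right)\left[\hat{E}\left(Y \mid T = 1, A = m\right) - \hat{E}\left(Y \mid T = 0, A = m\right)\right]
$$

Under this reading, each observed value m above the cutoff contributes the treatment-control difference in fitted outcomes at that value, weighted by how often m occurs among cases above the cutoff.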
CRD-Pre and CRD-CG
To estimate treatment effect at the cutoff with CRD-Pre, the treated and untreated outcome functions need to be estimated separately. To approximate the unknown smoothing function, using the data points forming the three untreated segments in CRD-Pre, the untreated regression function for person i at time t is estimated using Kth order polynomial series:
where A represents the assignment variable. The Kth order polynomial series represents the terms to create the linear, quadratic, cubic, or quartic functional forms of the assignment variable (i.e., $r_1 A$, $r_2 A^2$, $r_3 A^3$, or $r_4 A^4$) as required by the least-square cross-validation method.
Then, the treated outcome function is estimated by regressing the outcome variable on polynomials of the assignment variable and the same covariates using posttest data from treated individuals above the cutoff score:
After estimating the untreated and treated outcome functions, we then compute CRD-Pre LATE by taking the difference in predicted values of untreated outcomes and treated outcomes at the cutoff value of the assignment variable.
Estimates above the cutoff were computed as the weighted averages of differences between the treated and untreated outcome functions at each value above the cutoff, with weights again equal to the relative frequency of each assignment variable value. So, for a cutoff value of 0, the treatment effect (TE) above the cutoff is:
Treatment effects at the cutoff and above the cutoff were computed in the same way for CRD-Pre and CRD-CG, but in CRD-CG the untreated comparison regression comes from an independent group rather than from pretest scores from those assessed at posttest. Thus, the dummy variable for the untreated regression differs by CRD variant, being the treatment versus comparison group contrast for CRD-CG and the within-person pretest versus posttest contrast for CRD-Pre. The untreated regression function for CRD-CG is estimated using the data points forming the three untreated segments in CRD-CG:
where A represents the assignment variable. The Kth order polynomial series represents the terms to create the linear, quadratic, cubic, or quartic functional forms of the assignment variable.
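The two-step CRD logic described in this section can be illustrated with a short Python sketch. It fits one untreated function to the three untreated segments (an intercept, a dummy for the target RDD sample, and a polynomial in A), fits a treated function to treated posttest cases, and then takes the difference at the cutoff and the frequency-weighted difference above it. Baseline covariates are omitted, the untreated curves are constrained to differ only by a constant, and the function names and simulated data are hypothetical; this is a simplified stand-in, not the authors' estimation code.

```python
import numpy as np

def poly(a, order):
    """Polynomial terms A^1..A^order as columns."""
    return np.column_stack([a ** k for k in range(1, order + 1)])

def fit_ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def crd_effects(a_unt, d_unt, y_unt, a_trt, y_trt, k_unt=1, k_trt=1, cutoff=0.0):
    """Return (effect at the cutoff, weighted effect above the cutoff)."""
    # Untreated function: intercept + target-sample dummy + polynomial in A.
    Xu = np.column_stack([np.ones_like(a_unt), d_unt, poly(a_unt, k_unt)])
    bu = fit_ols(Xu, y_unt)
    # Treated function: intercept + polynomial in A, treated cases only.
    Xt = np.column_stack([np.ones_like(a_trt), poly(a_trt, k_trt)])
    bt = fit_ols(Xt, y_trt)

    def y0(a):  # predicted untreated outcome for the target sample (dummy = 1)
        a = np.atleast_1d(np.asarray(a, dtype=float))
        return np.column_stack([np.ones_like(a), np.ones_like(a), poly(a, k_unt)]) @ bu

    def y1(a):  # predicted treated outcome
        a = np.atleast_1d(np.asarray(a, dtype=float))
        return np.column_stack([np.ones_like(a), poly(a, k_trt)]) @ bt

    late = float((y1(cutoff) - y0(cutoff))[0])
    values, counts = np.unique(a_trt, return_counts=True)
    weights = counts / counts.sum()  # relative frequency of each A value
    tot = float(np.sum(weights * (y1(values) - y0(values))))
    return late, tot

# Simulated data (hypothetical): parallel linear untreated lines, true effect 1.0.
rng = np.random.default_rng(1)
a_cg = rng.integers(-5, 6, size=150).astype(float)   # comparison group
y_cg = 2.0 + 0.3 * a_cg + rng.normal(0.0, 0.2, 150)
a_rd = rng.integers(-5, 6, size=150).astype(float)   # target RDD sample
below = a_rd < 0
y_rd = 2.5 + 0.3 * a_rd + 1.0 * (~below) + rng.normal(0.0, 0.2, 150)

a_unt = np.concatenate([a_cg, a_rd[below]])
d_unt = np.concatenate([np.zeros(150), np.ones(int(below.sum()))])
y_unt = np.concatenate([y_cg, y_rd[below]])
late, tot = crd_effects(a_unt, d_unt, y_unt, a_rd[~below], y_rd[~below])
```

Because the simulated untreated lines really are parallel and linear, both estimates recover the true effect here. The sketch says nothing about how the estimator behaves when that assumption fails.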
Basic RDD
For the basic RDD, treatment effects were only estimated at the cutoff. This was done by regressing the math or vocabulary outcome on the treatment indicator dummy, a polynomial series of the centered assignment variable, interactions between the polynomial series and the treatment indicator, and the baseline covariates. The polynomial functional form was chosen based on the least mean square error criterion described above. Treatment effects above the cutoff were not estimated.
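A minimal Python sketch of such a parametric RDD regression is given below. With the assignment variable centered at the cutoff, the coefficient on the treatment dummy is the estimated discontinuity. The function name, the optional covariate argument, and the noise-free example data are assumptions made for illustration.

```python
import numpy as np

def rdd_cutoff_effect(a, y, cutoff, order=1, covariates=None):
    """Estimate the discontinuity at the cutoff from an interacted polynomial fit."""
    a_c = np.asarray(a, dtype=float) - cutoff   # center at the cutoff
    t = (a_c >= 0).astype(float)                # sharp assignment rule
    cols = [np.ones_like(a_c), t]
    for k in range(1, order + 1):
        cols += [a_c ** k, t * a_c ** k]        # polynomial and its interaction
    if covariates is not None:
        cols.append(np.asarray(covariates, dtype=float))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return float(beta[1])                       # coefficient on the treatment dummy

# Noise-free example (hypothetical): true jump of 0.7 at a cutoff of 20.
a = np.linspace(10.0, 30.0, 41)
a_c = a - 20.0
t = (a_c >= 0).astype(float)
y = 1.0 + 0.5 * a_c + 0.7 * t + 0.2 * t * a_c
effect = rdd_cutoff_effect(a, y, cutoff=20.0, order=1)
```

Centering matters: without it, the coefficient on the dummy would describe the gap between the two fitted lines at A = 0 rather than at the cutoff.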
Measures of final bias in causal estimates
We compared the differences between RE, RDD, CRD-Pre, and CRD-CG results by computing the treatment effects for each and then testing the statistical significance of the difference between each RDD variant and the RE (Shadish, Galindo, Wong, Steiner, & Cook, 2011). We used t statistics to test the hypothesis of no difference. Thus, to examine the difference between the RE and RDD estimates at the cutoff, we computed the difference between the respective cutoff impacts and then created the t statistic by dividing that difference by its bootstrapped SE. Following Shadish, Galindo, Wong, Steiner, and Cook (2011), we computed the effect size of the final bias as the difference between the RE and RDD estimates divided by the pooled SD of the RE treatment and control groups. Differences in the bootstrapped SEs between designs are interpreted as differences in statistical precision and thus power.
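The two bias measures described above reduce to simple arithmetic, sketched here in Python. The sign convention (quasi-experimental estimate minus RE estimate), the function names, and all numbers in the example are assumptions for illustration and are not results from the study.

```python
import numpy as np

def pooled_sd(y_treat, y_ctrl):
    """Pooled SD of the RE treatment and control group outcomes."""
    n_t, n_c = len(y_treat), len(y_ctrl)
    var = ((n_t - 1) * np.var(y_treat, ddof=1)
           + (n_c - 1) * np.var(y_ctrl, ddof=1)) / (n_t + n_c - 2)
    return float(np.sqrt(var))

def bias_summary(re_est, qe_est, bias_draws, y_treat, y_ctrl):
    """Bias, its t statistic, and its standardized effect size.

    bias_draws holds the bias recomputed in each bootstrap replication;
    its SD serves as the SE of the bias.
    """
    bias = qe_est - re_est  # assumed sign convention
    se = float(np.std(bias_draws, ddof=1))
    return {"bias": bias, "t": bias / se, "std_bias": bias / pooled_sd(y_treat, y_ctrl)}

# Hypothetical numbers chosen so the arithmetic is easy to follow.
y_treat = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_ctrl = np.array([2.0, 3.0, 4.0, 5.0, 6.0])
bias_draws = np.array([0.1, 0.3, 0.5])
summary = bias_summary(1.0, 1.3, bias_draws, y_treat, y_ctrl)
```

Keeping the t statistic and the standardized bias side by side is useful in small samples, where a bias can be large in SD units yet fail to reach statistical significance.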
Results
The RE
RE causal estimates at the cutoff and their SEs are presented in Tables 1 and 2, as are ATE estimates for comparison purposes only. Above-the-cutoff estimates for the RE are provided in Tables 3 and 4. For math, the LATE estimate at the cutoff is .99 (SE = .28), the estimate above the cutoff is .79 (SE = .16), and the usual ATE is .97 (SE = .10). For vocabulary, the corresponding values are 1.37 (SE = .25), 1.65 (SE = .10), and 1.56 (SE = .07). Noteworthy is that SEs are highest for LATE estimates at the cutoff where sample sizes are smallest, and they are lowest for the usual ATE where the number of cases is highest. They are thus intermediate for all RE estimates away from the cutoff.
Table 1. Mean Effects and Standard Errors at the Cutoff for Mathematics.
Note. All ATE and LATE point estimates are significant at p < .05 unless otherwise stated. ns = nonsignificant; ATE = average treatment effect; LATE = local average treatment effect; RDD = regression discontinuity design; RE = randomized experiment; CRD = comparative regression discontinuity design.
a The functional form with the least mean square error was quadratic. b The functional form with the least mean square error was cubic for the untreated sample and quadratic for the treated sample. c The functional form with the least mean square error was quartic for the untreated sample and quadratic for the treated sample.
*p < .05. **p < .01. ***p < .001.
Table 2. Mean Effects and Standard Errors at the Cutoff for Vocabulary.
Note. All ATE and LATE point estimates are significant at p < .05 unless otherwise stated. ns = nonsignificant; ATE = average treatment effect; LATE = local average treatment effect; RDD = regression discontinuity design; RE = randomized experiment; CRD = comparative regression discontinuity design.
a The functional form with the least mean square error was cubic. b The functional form with the least mean square error was linear for both the untreated and treated samples.
*p < .05. **p < .01. ***p < .001.
Table 3. Mean Effects and Standard Errors Above the Cutoff (Treatment on Treated Estimates) for Mathematics.
Note. All RE and CRD point estimates are significant at p < .05 unless otherwise stated. ns = nonsignificant; ATE = average treatment effect; RDD = regression discontinuity design; RE = randomized experiment; CRD = comparative regression discontinuity design.
a The functional form with the least mean square error was cubic for the untreated sample and quadratic for the treated sample. b The functional form with the least mean square error was quartic for the untreated sample and quadratic for the treated sample.
*p < .05. **p < .01. ***p < .001.
Table 4. Mean Effects and Standard Errors Above the Cutoff (Treatment on Treated Estimates) for Vocabulary.
Note. All RE and CRD point estimates are significant at p < .05 unless otherwise stated. ns = nonsignificant; ATE = average treatment effect; RDD = regression discontinuity design; RE = randomized experiment; CRD = comparative regression discontinuity design.
a The functional form with the least mean square error was linear for both the untreated and treated samples. b The functional form with the least mean square error was linear for both the untreated and treated samples.
*p < .05. **p < .01. ***p < .001.
Figures 3a and 3b show the RE treatment effect arrayed by the ACT assignment variable scores used in the RDD and CRD analyses. A main effect of treatment is clearly visible with both outcomes. For the vocabulary outcome, no interaction between the treatment and assignment variable is visually apparent or statistically significant. For math, the RE treatment and control slopes are less obviously parallel, but the interaction is still far from statistically significant. Thus, there is no consistent evidence from the RE that the treatment and assignment variable systematically interact with each other.

Figure 3. (a) Plot of the randomized experiment (RE) for the mathematics outcome with fitted LOWESS curves. (b) Plot of the RE for the vocabulary outcome with fitted LOWESS curves.
Tests of Assumptions
Figure 4 uses LOWESS plots to describe how the assignment variable is related to pretest and posttest outcomes in the RE control group where no treatment effect is possible. For math, no discontinuity is evident at the cutoff in Figure 4a, implying no role for irrelevant causal forces operating at the cutoff. Moreover, the cross-sectional relationship between the assignment variable and outcome is similar at both pretest and posttest. Figure 4b likewise suggests no discontinuity at the cutoff for vocabulary, whether in mean or slope. However, the ACT assignment variable seems to be differently related to the pretest and posttest, a data pattern that normally indicates selection-maturation (Cook & Campbell, 1979), though here the pretest and posttest are separated by less than an hour. There is, though, no evidence with either measure of unanticipated discontinuities in the control group data.

Figure 4. (a) Plot of mathematics outcome against the assignment variable ACT in the randomized experiment (RE) control group. (b) Plot of vocabulary outcome against the assignment variable ACT in the RE control group. Gray regions indicate 95% confidence intervals.
Figures 5 and 6 present evidence about functional forms. The smaller samples preclude powerful statistical tests of slope differences, but visual inspection suggests that the three untreated segments are plausibly linear and parallel for math with CRD-Pre and for vocabulary with CRD-CG. They are clearly not parallel for vocabulary with CRD-Pre, and for math the difference in functional form is somewhat uncertain with CRD-CG. Such functional forms suggest that bias away from the cutoff should be least for CRD-Pre math and CRD-CG vocabulary and should be most for CRD-Pre vocabulary. In expectation, all cutoff effects are unbiased with all forms of RD design including CRD.

Figure 5. (a) Local linear regression of mathematics outcome against the assignment variable in the comparative regression discontinuity-Pre design. Gray regions indicate 95% confidence intervals. (b) Local linear regression of mathematics outcome against the assignment variable in the comparative regression discontinuity-CG design. Gray regions indicate 95% confidence intervals.

Figure 6. (a) Local linear regression of vocabulary outcome against the assignment variable in the comparative regression discontinuity-Pre design. Gray regions indicate 95% confidence intervals. (b) Local linear regression of vocabulary outcome against the assignment variable in the CRD-CG design.
Bias and precision at the cutoff
Row 1 of Table 1 compares RDD LATE to RE ATE math estimates for exploratory purposes. Row 2 is more important because it compares only LATE estimates at the cutoff. For math, none of the LATEs for RDD, CRD-Pre, and CRD-CG differ statistically from the RE LATE, but the sample sizes are small, and the bias still exceeds .20 with the CRD-Pre design. For vocabulary, Table 2 reveals a reliable difference between the RE and CRD-Pre LATE estimates, and all bias estimates are manifestly large, exceeding .30 SDs. For no design is there convincing evidence of a bias-free estimate at the cutoff.
To examine statistical power at the cutoff, we compared the SEs of the causal estimates just described. Table 1 shows consistently lower SEs for either CRD version when compared to basic RDD, indicating greater precision from adding pretest data or extra subjects. SEs are slightly larger for each CRD variant than for the RE, although the latter is based on more cases.
Bias and precision away from the cutoff
Tables 3 and 4 compare estimates away from the cutoff. For math, there is no evidence of statistically significant differences from the RE, but the standardized effect differences are nonetheless large—.33 for CRD-CG and .46 for CRD-Pre. Table 4 provides the corresponding vocabulary results and now the RE estimate reliably differs from both the CRD-Pre and the CRD-CG estimates. Thus, for neither math nor vocabulary is there any evidence of bias-free generalization beyond the cutoff, irrespective of how close some analyses came to being based on parallel and linear regression slopes. Table 3 presents the SEs of the causal estimates just described. For each version of CRD, they tend to be slightly higher than in the RE, though the latter has more cases than either CRD variant.
Discussion
The statistical power results reported here replicate those in Wing and Cook (2013) and Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017). Adding pretest or nonequivalent comparison group data increases power at the cutoff in RDD. It also makes the power obtained in either CRD design close to what the RE achieved. When Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017) equated RE and CRD sample size, they also found no difference between the SEs of estimates away from the cutoff. The conditions under which the power of CRD-CG approximates that of the RE are elaborated in Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017) and the same is done for CRD-Pre in Tang and Cook (under review).
The bias estimates we obtained here differ from those in prior research on bias in CRD (Tang, Cook, Kisbu-Sakarya, Hock, et al., 2017; Wing & Cook, 2013). At the cutoff, the math evidence indicates little bias for CRD-CG but considerable bias for CRD-Pre, while considerable bias is evident for vocabulary in both CRD variants. Away from the cutoff, bias is evident for both math and vocabulary in both the CRD-Pre and CRD-CG variants.
The failure to replicate past findings of minimal bias at the cutoff is surprising. In expectation, there should be no bias; and 14 past WSCs have consistently shown RDD bias estimates smaller than the .09 for math and the .41 for vocabulary found here (Chaplin et al., in press). The fact that the theoretically expected unbiased effects were not consistently found at the cutoff suggests that functional forms were not well estimated in the current study, despite the statistical steps taken to optimize the choice of form. Of course, all things being equal, poorer estimation is expected with smaller samples.
Another pointer to the role of sample size is that the present study was included in the meta-analysis of RDD’s internal validity in Chaplin et al. (in press). For each WSC, average RDD bias was calculated with and without shrinking. Essentially, the shrinkage procedure adjusted each study’s treatment estimate toward the average of all treatment estimates, with the size of the adjustment depending on that study’s sample size. The other WSC studies tended to be much larger, and so the RDD bias at the cutoff in the present study shrank from .256 SD units when math and vocabulary were combined to .010, suggesting that modest sample sizes and the variability associated with them led to the unusually poor estimate of cause at the cutoff.
Statistical theory is much less developed for drawing unbiased causal inferences away from the cutoff than at it. So long as the functional form of the unobserved potential outcome slope is unknown, there can be no certainty that an obtained TOT value is unbiased. Instead, we are forced to assume the treatment group’s control outcome slope from some form of difference-in-differences model that utilizes CRD’s three observed and untreated regression curves, even though each is inevitably estimated with error and that error is larger with smaller samples. None of the results we presented showed bias away from the cutoff that was less than .10 SDs. The same lack of correspondence between RE and CRD estimates was found even in the fortuitous cases where the three untreated regression curves seemed parallel to the eye, a visual judgment we relied on because statistical tests had little power to detect differences in functional form. In our view, parallel linear untreated slopes are most likely to lead to unbiased causal estimates away from the cutoff because their slope value is constant and raises the odds that the untreated potential outcome slope would have had the same value. Moreover, averaging three slopes better estimates this presumed counterfactual than relying on three different slope values, as in the most general version of a difference-in-differences model. But the results we obtained were not appreciably better when we concluded from visual tests only that the relevant slopes were parallel or that they were both parallel and linear. Future work is also needed to incorporate uncertainty regarding parallelism into SE estimates.
Why were the current estimates of RDD and CRD differences away from the cutoff so different from those in past studies? One explanation has to do with sample sizes. The current CRD-Pre and CRD-CG samples were noticeably smaller than in Wing and Cook (2013) and Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017), where samples numbered in the thousands instead of less than 200. Regression slopes with modest error are indispensable to CRD analysis and, with much error, it is difficult to know their true value or how similar or different they are. In addition, nonparametric RDD impact tests are usually preferred to the parametric tests we had to conduct because of the even larger sample size requirements of nonparametric work. But even our fallback analysis could not be implemented well with the sample sizes available. CRD is a large sample method.
Although we prefer an explanation based on sample size, other reasons for the failure to replicate past findings can be invoked. One is the use of a proxy-pretest measure in CRD-Pre, as opposed to using the exact same measure at pretest and posttest. Compared to a true pretest, proxy pretests are likely to attenuate pretest–posttest correlations and generally induce more uncertainty. Yet the pretest and posttest measures in this study tapped into the same conceptual domains, and anyway pretests did not apply in the same way in CRD-CG where the results were equally disappointing.
Another reason for the lack of correspondence may be the constrained difference-in-differences model that we used. It was predicated on parallel untreated slopes, but even where these seemed to hold we still failed to achieve unbiased results away from the cutoff. The implication is that this model failed to capture the unknown counterfactual slope that operated in this particular application. Adding untreated data on each side of the cutoff and across it allows us to rule out internal validity threats that are judged to operate at pretest or with a somewhat different population. But it cannot directly inform us of how the treated group would have behaved across and beyond the cutoff if they had been untreated. CRD seeks to support inferences about what this slope might have been by “borrowing” information from the added untreated slopes, and we judge that it is more likely to achieve this successfully when the borrowed slopes are parallel and perhaps even more so when they are both parallel and linear. But CRD cannot guarantee what this unobserved potential control slope really is. Would the students in this application who self-selected into the math or vocabulary treatments have achieved differently even without treatment? And would they have done so for irrelevant reasons that are correlated with scoring on one side of the ACT cutoff versus the other?
The CRD-CG results away from the cutoff are particularly disappointing because there is still no empirical evidence that CRD-CG promotes causal generalization away from the cutoff in the usual case where the assignment variable and outcome are initially correlated. CRD-CG was not part of Wing and Cook (2013). It was a component of Tang, Cook, Kisbu-Sakarya, Hock, et al. (2017), who found little bias away from the cutoff. However, their assignment variable and outcome were not correlated initially, so in that case adding data from another population could not contribute to reducing to zero a correlation that was already zero. WSC tests of CRD-CG’s effectiveness are still needed, but this article indicates they should be large-sample tests.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this article was supported by the National Science Foundation PRIME Grant DRL-1228866.
