Abstract
Much of the currently available evidence on the causal effects of public prekindergarten programs on school readiness outcomes comes from studies that use a regression-discontinuity design (RDD) with the age cutoff to enter a program in a given year as the basis for assignment to treatment and control conditions. Because the RDD has high internal validity when its key assumptions are met, these studies appear to provide strong evidence for the effectiveness of these programs. However, there are overlooked methodological problems in the way this design has typically been applied that have the potential to produce biased effect estimates. We describe these problems, argue that they deserve more attention from researchers using this design than they have received, and offer suggestions for improving future studies.
A primary rationale for prekindergarten programs is their ability to increase school readiness, that is, to prepare children, especially disadvantaged children, for constructive participation in the learning opportunities provided in kindergarten and beyond (Pianta, Cox, & Snow, 2007). The generally recognized domains of school readiness include language and literacy skills, mathematics, cognition and general knowledge, approaches to learning, physical well-being and motor development, and social and emotional development (Britto, 2012). Evidence about the effects of prekindergarten on these school readiness constructs is therefore central to the justification of policies that support public funding for prekindergarten.
A number of non-experimental studies conducted prior to 2005 investigated the effects of enrollment in public prekindergarten programs on some of these school readiness outcomes (Henry, Gordon, Mashburn, & Ponder, 2001; Magnuson, Meyers, Ruhm, & Waldfogel, 2003; Xiang & Schweinhart, 2002). However, in 2005, the first such study to use a design with strong internal validity was reported by Gormley, Gayer, Phillips, and Dawson (2005). The Gormley et al. study used a regression-discontinuity design (RDD) that took advantage of the strict age cutoff for prekindergarten eligibility in Tulsa, Oklahoma, that naturally divided children into a treatment group (age 4 by the cutoff date and enrolled in prekindergarten) and a control group (age 4 after the cutoff and not enrolled until the next school year).
Following the Tulsa study, and no doubt largely inspired by it, the age-cutoff RDD has been widely adopted for studying the effects of publicly funded prekindergarten programs on school readiness outcomes and, indeed, has virtually become the standard for that purpose. It has been used by different research teams to evaluate the prekindergarten programs in Boston (Weiland & Yoshikawa, 2013), Kalamazoo County, Michigan (Bartik, 2013), San Francisco (Applied Survey Research, 2013), Georgia (Peisner-Feinberg, Schaaf, LaForett, Hildebrandt, & Sideris, 2014), North Carolina (Peisner-Feinberg & Schaaf, 2011), and Tennessee (Coburn, 2009; Lipsey, Farran, Bilbrey, Hofer, & Dong, 2011). In addition, researchers at the National Institute for Early Education Research (NIEER) have applied it to evaluate the state prekindergarten programs in Arkansas (Hustedt, Barnett, Jung, & Thomas, 2007), Michigan (Lamy, Barnett, & Jung, 2005a), New Jersey (Frede, 2005; Frede, Jung, Barnett, Lamy, & Figueras, 2007), New Mexico (Hustedt, Barnett, & Jung, 2007; Hustedt, Barnett, Jung, & Figueras, 2008, 2009; Hustedt, Barnett, Jung, & Friedman, 2010), South Carolina (Lamy, Barnett, & Jung, 2005b), and West Virginia (Lamy, Barnett, & Jung, 2005c), with applications in five of these states analyzed altogether by Wong, Cook, Barnett, and Jung (2008).
The age-cutoff RDD has also been used to investigate the effects of prekindergarten programs for student subgroups, for example, children with special needs (Phillips & Meloy, 2012) and Hispanic students (Gormley, 2008); the effects of an Early Reading First program (Wilson, Dickinson, & Rowe, 2013); and the comparative effects of a state-funded program and Head Start (Gormley, Phillips, Adelstein, & Shaw, 2010). Beyond school readiness outcomes, age-cutoff RDD studies have been the basis for projecting adult earning benefits of prekindergarten (Bartik, Gormley, & Adelstein, 2012) and estimating its effects on mothers’ participation in the labor market (Berlinski, Galiani, & McEwan, 2011; Fitzpatrick, 2010).
These age-cutoff RDD studies now constitute the primary body of research supporting the immediate benefits of public prekindergarten programs. The credibility of their results stems from the strong internal validity of the RDD, which, after the randomized controlled trial (RCT), is generally considered the next-best design for obtaining unbiased causal estimates. However, as with any research design, the internal validity of the prekindergarten age-cutoff RDD rests on the degree to which its assumptions are met. There are a number of ways age-cutoff RDD studies, as usually implemented, fall short (or could potentially fall short) of the requirements for internally valid effect estimates. Some of these issues create distinct problems that may have resulted in biased treatment effect estimates for all the implementations of this design reported to date. Others may not be a problem in all instances but have the potential to affect the treatment estimates and thus warrant consideration when age-cutoff RDD studies are being planned or their results are being assessed.
In what follows, we first describe the prekindergarten age-cutoff RDD as it is typically implemented and review the requirements that must be met for any RDD to yield internally valid effect estimates. We then discuss several methodological issues that deserve more attention in the implementation of this design and the interpretation of its results than they have received heretofore. The first of these relates to the impact estimate inherent in the age-cutoff RDD, especially the nature of the counterfactual condition. Another set of issues involves the comparability of the treatment and control groups from which outcome data are collected and analyzed. The final issue we consider relates to the equivalence of the outcome measures for the treatment and control conditions and the circumstances under which they are collected. We then offer some suggestions for mitigating these problems. To illustrate both problems and possible responses, we have drawn examples from three age-cutoff RDD studies involving one or more of the authors—one study in Boston and two in Tennessee. Given its many attractive features for assessing the effects of public prekindergarten programs, our intent is not so much to critique the age-cutoff RDD as to provide an analysis that will help improve future applications.
Basic Structure and Implementation of the Prekindergarten Age-Cutoff RDD
The distinguishing feature of the RDD is a distinct cutoff value on an observed variable that assigns participants to conditions. Participants on one side of the cutoff receive the treatment; those on the other side do not (Imbens & Lemieux, 2008; Shadish, Cook, & Campbell, 2002; Thistlethwaite & Campbell, 1960; Trochim, 1984). Members of the two groups sufficiently near the cutoff are considered “equivalent in expectation” (Imbens & Lemieux, 2008; Trochim, 1984), and comparing their outcomes yields an unbiased causal effect. Prekindergarten age-cutoff RDD studies use age and the age-based eligibility requirement for a particular program in a particular year as the basis for assignment. Within the catchment area for a prekindergarten program, a strict age cutoff assigns two birth cohorts of children to enter the program in different years. The cohort that turns 4 before the cutoff date is allowed to enter the program in Year 1 (the treatment group); the cohort that turns 4 during the year after the cutoff must wait until Year 2 to enter the program (the control group). In the year that the control group is too young to enter the prekindergarten program, they may or may not be enrolled in other child care or preschool programs, but they are not enrolled in the program being evaluated.
A key feature of this design as it has been implemented in virtually all applications to public prekindergarten programs is the timing of the outcome measurement. For both cohorts, outcomes are measured at the beginning of Year 2, when the treatment children have completed the prekindergarten program and are starting kindergarten and the control children have completed their year of waiting and are starting the prekindergarten program (Figure 1). This timing is dictated by practical considerations. It would be preferable to obtain outcome data from both groups of children at the end of Year 1, when the treatment group has finished prekindergarten and the control group has not yet begun. However, until the control children arrive in prekindergarten in the fall of the next school year, it is difficult to identify and locate them.

The two-cohort comparison and test timing in the typical prekindergarten age-cutoff regression-discontinuity design.
Figure 2 depicts the hypothesized relationship between the assignment variable (age) and a child-level outcome when there is a positive impact of the prekindergarten program. Children’s age has been centered on the cutoff, and the vertical distance between the two regression lines at the cutoff corresponds to a positive treatment effect. The most comparable treatment and control groups consist of those children immediately on either side of the cutoff. However, because there are typically relatively few children born very close to the cutoff, data from children further away from the cutoff are generally included in the analysis as well.

Plot of the relationship between the assignment variable (age) and an outcome (assuming a linear relationship).
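The estimate depicted in Figure 2 can be sketched as a regression with a treatment indicator and separate age slopes on each side of the cutoff. The Python sketch below assumes a hypothetical coding of age as days relative to the cutoff (0 or more meaning the child turned 4 by the cutoff) and uses simulated rather than study data; the coefficient on the indicator is the vertical gap between the two fitted lines at the cutoff.

```python
import numpy as np


def rdd_linear_estimate(age_days, outcome, bandwidth):
    """Sharp RDD estimate from separate linear fits on each side of the cutoff.

    age_days:  age on the cutoff date minus 4 years, in days (assumed coding);
               values >= 0 mean the child turned 4 by the cutoff (treatment cohort).
    bandwidth: only children with |age_days| <= bandwidth are used.
    """
    age = np.asarray(age_days, dtype=float)
    y = np.asarray(outcome, dtype=float)
    keep = np.abs(age) <= bandwidth
    age, y = age[keep], y[keep]
    treat = (age >= 0).astype(float)
    # Columns: intercept, jump at cutoff, common slope, extra slope on treatment side
    X = np.column_stack([np.ones_like(age), treat, age, treat * age])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]  # vertical gap between the two fitted lines at age_days == 0


# Simulated illustration (not real study data): true effect of 5 points.
rng = np.random.default_rng(0)
age = rng.uniform(-365, 365, 2000)
score = 50 + 0.02 * age + 5.0 * (age >= 0) + rng.normal(0, 3, age.size)
print(round(rdd_linear_estimate(age, score, bandwidth=365), 2))
```

With a correctly specified linear relationship, the estimate recovers the simulated effect up to sampling noise.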
The internal validity of the estimated treatment effect in prekindergarten age-cutoff RDD studies rests on a number of critical assumptions distinctive to RDD. Because other authors have explained these requirements and how to evaluate them in detail (Imbens & Kalyanaraman, 2009; Imbens & Lemieux, 2008; Lee & Lemieux, 2010; Ludwig & Miller, 2007; Shadish et al., 2002; Trochim, 1984), we present them only briefly here. First, the cutoff must be exogenously imposed; participants should not be able to manipulate their position in relation to the cutoff. Second, the statistical model used to determine program impacts must include the assignment variable (age in the age-cutoff RDD), and the functional form of the relationship between the assignment variable and the outcome variable must be correctly specified. Third, all observed and unobserved characteristics of participants should vary smoothly and continuously across the cutoff; discontinuities indicate that some irregularity has occurred that threatens the internal validity of the design. Finally, the bandwidth (span around the cutoff on the assignment variable across which cases are analyzed) must not be chosen in a way that skews impact estimates in one direction or another.
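The functional-form and bandwidth requirements interact: a linear model that fits well near the cutoff can be misspecified over a wider span. A minimal sketch of a bandwidth-sensitivity check follows, assuming local linear regression with triangular kernel weights (one common choice, not necessarily the one used in the studies cited) and a noise-free simulated outcome that contains a cubic term the linear model ignores.

```python
import numpy as np


def local_linear_rdd(age_days, outcome, bandwidth):
    """Local linear RDD estimate with triangular kernel weights (assumed kernel).

    age_days is centered on the cutoff; values >= 0 are treatment-eligible.
    """
    age = np.asarray(age_days, dtype=float)
    y = np.asarray(outcome, dtype=float)
    w = np.clip(1.0 - np.abs(age) / bandwidth, 0.0, None)  # zero weight beyond bandwidth
    keep = w > 0
    age, y, w = age[keep], y[keep], w[keep]
    treat = (age >= 0).astype(float)
    X = np.column_stack([np.ones_like(age), treat, age, treat * age])
    sw = np.sqrt(w)
    # Weighted least squares: rescale each row by sqrt(weight), then ordinary lstsq
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[1]


def bandwidth_sensitivity(age_days, outcome, bandwidths):
    """Re-estimate the discontinuity at each bandwidth in `bandwidths`."""
    return {h: local_linear_rdd(age_days, outcome, h) for h in bandwidths}


# Noise-free illustration: true jump of 4 plus a cubic term.
grid = np.arange(-365, 366, dtype=float)
curved = 50 + 0.02 * grid + 1e-7 * grid**3 + 4.0 * (grid >= 0)
for h, est in bandwidth_sensitivity(grid, curved, [30, 90, 180, 365]).items():
    print(h, round(est, 3))
```

The estimate is essentially 4 at the narrowest bandwidth and moves away from 4 as the bandwidth widens. In real data, estimates that shift systematically with bandwidth are a signal to revisit the functional form.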
Prekindergarten age-cutoff RDD studies have generally met these core assumptions. Gormley et al. (2005), Gormley, Phillips, and Gayer (2008), and Wong et al. (2008), for example, presented evidence that the cutoff dates in their studies were exogenously imposed. They also conducted careful checks of the functional form of the relationship between the age variable and each outcome and examined the robustness of their results to bandwidth choice. In addition, Gormley et al. (2008) provided evidence that a small set of student characteristics varied smoothly at the cutoff, and other researchers have made some attempt to check that such characteristics were similar on both sides of the cutoff in their studies.
There are additional requirements of the age-cutoff RDD, however, that have not been well recognized in the existing studies. These are requirements that apply generally to designs in which active assignment to conditions is determined as an inherent part of the design itself (RCTs as well as RDDs). First, a well-defined sample should be identified prior to assignment to conditions. That is, the specific units that constitute the initial research sample are defined and known, and all members of that sample are then assigned to the research conditions according to the procedure prescribed by the design. The known basis for the assignment of every individual in that initial intent-to-treat (ITT) sample, and the ability to fully account for it in the analysis of the outcome data, are the sources of the strong internal validity of these designs. If there is ambiguity about the composition of the initial research sample, there can be no assurance that all its members, and only its members, are accounted for in the final analytic sample.
Other considerations are not so fundamental but are distinctively problematic in the age-cutoff RDD. For instance, there should be no biasing attrition from outcome measurement for the initially defined and assigned sample. The unknown basis for any attrition potentially introduces differences between treatment and control group outcomes that are not treatment effects and are not fully accounted for by the known basis for assignment. Also, outcome measures must be operationalized the same way for treatment and control groups. That is, there should be no systematic differences that influence the resulting scores between the groups in the measures used, how they are administered, the measurement situation, and the like. Any such differences that relate only to the measurement procedure and not to the constructs being measured potentially bias the treatment effect estimates based on those measures.
In the next section, we examine the prekindergarten age-cutoff RDD with regard to the nature of the impact estimates it produces and threats to its internal validity. Doing so identifies problems with the design as it is typically implemented and other potential problems that may occur in some implementations.
Potentially Problematic Features of the Age-Cutoff RDD
Nature of the Impact Estimate
In an RCT, comparison of the outcomes for all the participants assigned to the treatment and control conditions, respectively, produces an ITT estimate of the intervention effect. It is the ITT effect estimate that preserves the randomization and has the strongest internal validity for causal inference. In the age-cutoff RDD, the initial ITT sample consists of all the children in the catchment area at the beginning of a given school year who are eligible for the prekindergarten program in that year (treatment) or the following year (control). Preserving the integrity of that initial ITT sample requires that outcome data be obtained for all these children irrespective of whether they go on to participate in the prekindergarten program or not.
In the typically implemented age-cutoff RDD, however, outcome data are not obtained from all the children in the initial ITT sample. Outcomes are obtained only for the subset of children in the ITT treatment sample who actually participate in the prekindergarten program and can be identified and assessed at the beginning of the next (kindergarten) school year. Similarly, outcomes are obtained only for the subset of children in the ITT control sample who appear in the prekindergarten program at the beginning of that next school year and can be assessed at that time. The resulting treatment impact estimates, therefore, are based on outcome data from the subsets of children who participate in the prekindergarten program rather than the initial ITT treatment and control samples that are defined by their age-eligibility for the program.
The authors of the seminal Tulsa age-cutoff RDD study recognized this situation and referred to their treatment effects as being akin to treatment-on-the-treated (TOT) estimates (Gormley et al., 2005; Gormley et al., 2008). TOT effect estimates are derived from ITT samples and are relevant in circumstances such as these where nontrivial portions of the ITT samples do not actually participate in the conditions to which they are assigned. In this context, TOT estimates attempt to characterize the effect of participation in the prekindergarten program for the subgroup of eligible children in the initial ITT sample who then actually participated in it.
Internally valid TOT estimates require information about who from the initial ITT treatment sample experienced the treatment and some basis for accounting for any selection bias between them and the individuals in the control group with whom they are compared. The most credible TOT estimates result from instrumental variable techniques applied to data from the full ITT sample (Angrist & Pischke, 2009; Gennetian, Morris, Bos, & Bloom, 2005). However, that approach requires information about (a) the assignment each individual in the initial ITT sample received, (b) whether they complied with that assignment, and (c) the outcome for each individual whether or not they complied. Because the full ITT sample is not identified in the typical prekindergarten age-cutoff RDD study and outcome data are not collected from that full ITT sample, none of these conditions can be met.
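The instrumental variable logic can be illustrated with its simplest form, the Wald ratio: the ITT difference in outcomes divided by the ITT difference in participation rates. The sketch below uses a toy ITT sample and ignores the age variable (a fuzzy RDD would estimate both the numerator and the denominator as discontinuities at the cutoff). It needs outcomes for assigned children who did not participate, which is precisely what the typical age-cutoff RDD does not collect.

```python
import numpy as np


def wald_tot(assigned, participated, outcome):
    """Treatment-on-the-treated (TOT) effect via the Wald / IV ratio.

    assigned:     1 if the child was assigned to treatment (age-eligible in Year 1)
    participated: 1 if the child actually attended the program in Year 1
    outcome:      must be measured for EVERY member of the ITT sample
    """
    z = np.asarray(assigned, dtype=bool)
    d = np.asarray(participated, dtype=float)
    y = np.asarray(outcome, dtype=float)
    itt_outcome = y[z].mean() - y[~z].mean()        # ITT effect on the outcome
    itt_participation = d[z].mean() - d[~z].mean()  # effect of assignment on take-up
    if itt_participation == 0:
        raise ValueError("assignment does not change participation; TOT undefined")
    return itt_outcome / itt_participation


# Toy ITT sample: half of the assigned children attend; attending adds 10 points.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
participated = [1, 1, 0, 0, 0, 0, 0, 0]
outcome = [60, 60, 50, 50, 50, 50, 50, 50]
print(wald_tot(assigned, participated, outcome))  # 5 / 0.5 = 10.0
```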
Another approach, albeit one with less inherent internal validity, is to incorporate baseline variables that address the expected selection bias in the analysis of the outcomes for the prekindergarten participants and control children. This might involve matching strategies, use of the baseline variables as covariates, or propensity scores derived from the baseline variables—any of which would adjust the effect estimates to some extent for biasing baseline differences. However, because data are collected only at posttest in the usual age-cutoff RDD, baseline variables are limited to those that can be obtained from school records or parent reports and which, further, are unchanging enough to support the presumption that they characterized the sample at the earlier time of assignment to conditions. Those variables will rarely be sufficient for valid estimation of TOT effects. Thus, although the effect estimates from the age-cutoff RDD are not ITT estimates, they will rarely be convincing TOT estimates either.
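Why a few record-based covariates rarely suffice can be shown with a small simulation under assumed parameters: participation depends on an observed baseline variable `x` and an unobserved family characteristic `u`, and both also affect the outcome. Adjusting for `x` removes only the part of the selection bias that runs through `x`. This is a generic illustration, not a model of any particular study.

```python
import numpy as np


def selection_bias_demo(n=20000, true_effect=2.0, seed=1):
    """Compare naive, covariate-adjusted, and 'oracle' estimates under selection.

    x: a baseline variable obtainable from records (hypothetical)
    u: an unobserved family characteristic (e.g., parental interest) that
       affects both participation and the outcome (hypothetical)
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    u = rng.normal(size=n)
    d = (0.8 * x + 0.8 * u + rng.normal(size=n) > 0).astype(float)  # participation
    y = true_effect * d + x + u + rng.normal(size=n)

    def ols_coef_on_d(*covariates):
        X = np.column_stack([np.ones(n), d, *covariates])
        return np.linalg.lstsq(X, y, rcond=None)[0][1]

    naive = y[d == 1].mean() - y[d == 0].mean()
    adjusted = ols_coef_on_d(x)   # adjusts only for the observed covariate
    oracle = ols_coef_on_d(x, u)  # infeasible in practice: u is unobserved
    return naive, adjusted, oracle


print([round(float(v), 2) for v in selection_bias_demo()])
```

The adjusted estimate is closer to the true effect of 2 than the naive contrast but remains biased upward, because the unobserved `u` still differs between participants and nonparticipants.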
The Counterfactual Condition
Another distinctive feature of the impact estimate obtained from an age-cutoff RDD is the nature of the counterfactual condition with which the outcomes for the prekindergarten participants are contrasted. In an analogous RCT, the children in the control group would be in the same age cohort as those in the treatment group. The parents of the control children, without the opportunity to have their children attend the prekindergarten program, would turn to the alternatives available for children of that age. The impact estimate produced by such an RCT, therefore, describes the additional benefits conferred by the prekindergarten program above and beyond those expected from the alternatives parents would choose in the absence of the program. That estimate addresses a key question many policymakers and stakeholders have about state-funded prekindergarten, namely what added value it has for the participating children relative to a scenario in which there is no state-funded program that augments the otherwise available options.
The counterfactual condition in the age-cutoff RDD, however, is different from its counterpart in the analogous RCT (Bartik, 2013; Whitehurst & Armor, 2013). In this design, the counterfactual condition is children’s experiences during the year before they are eligible for the prekindergarten program. These children are too young to enroll in kindergarten the next school year and have not lost the opportunity to attend the public prekindergarten program. In this situation, parents may make different child care decisions than they would if their children were assigned to the control condition in the analogous RCT. The impact estimates from the age-cutoff RDD, therefore, do not describe the value added of a public prekindergarten program relative to the other options available to children during the year before kindergarten. Instead, they describe the benefits of the prekindergarten program relative to the child care options used by parents 2 years before their children are eligible for kindergarten. The policy relevance of these latter impact estimates is not clear.
The Comparability of the Age-Cutoff RDD Treatment and Control Analysis Samples
As discussed above, the age-cutoff RDD does not typically start out with the initial ITT sample identified and the member children known but, rather, assembles the sample that will contribute to the analysis after the fact at the time the outcomes are measured. Between the time when the ITT sample forms in the catchment area for a given school year and the time when the subset of that sample that provides outcome data is identified, there is potential for various events to affect the nature and composition of the final treatment and control analysis groups in ways that can compromise the internal validity of the effect estimates. The events with the most serious implications for age-cutoff RDD studies are described below.
Differential Attrition From the ITT Samples
If outcome data are not obtained for all the participants initially assigned to the treatment and control conditions in an RCT, internal validity is compromised in rough proportion to the extent of the presumptively nonrandom loss of cases from those original ITT samples. Similarly, in an RDD, any significant loss of outcome data from the ITT sample will compromise the internal validity of the effect estimates. Figure 3 depicts the initial ITT treatment and control samples implicit in the structure of the age-cutoff RDD, that is, all the children in the catchment area at the beginning of the school year who are eligible for the prekindergarten program either that school year or the next. Figure 3 then tracks each of those groups through to the point where the analysis samples are defined and the outcome measures are collected in the typical age-cutoff RDD study.

Children included or potentially included in the age-cutoff regression-discontinuity design.
As Figure 3 reveals, there are several places where attrition from the ITT treatment and control samples can occur and, in most instances, can be assumed to occur. Children assigned to the control group may leave the area prior to outcome assessment (Box A) or remain in the area but not enroll in prekindergarten (Box B), and thus not be present for outcome assessment at the beginning of that year. Children assigned to the treatment group, in turn, may leave the area before kindergarten and miss the outcome assessment session at the beginning of that year (Boxes C and D). Furthermore, the treatment cases chosen for assessment at the beginning of the kindergarten year are typically identified from school records indicating enrollment in the prekindergarten program the year before. Thus, children in the initial age-eligible ITT treatment sample who did not attend the program would not have their outcomes assessed at the beginning of kindergarten and would not be included in the analysis sample (Box E).
The absence of outcome data for these children is analogous to attrition after the point of randomization in an RCT. Attrition from the ITT samples has not been estimated and reported in age-cutoff RDD studies, but could be substantial given the potential for many parents of children eligible for public prekindergarten programs to decide not to enroll their children and the mobility of the economically disadvantaged populations often targeted by the public programs.
Some indication of the proportion of eligible children who actually participate in public prekindergarten programs can be gleaned from data on the enrollment of 4-year-olds in the universal state-funded prekindergarten programs. The estimates reported by the NIEER for states with universal programs in which age-cutoff RDD studies have been reported show proportions that range from 59% to 74% (Barnett et al., 2012, Table 2, p. 15). Enrollment rates are not generally available for states with more restrictive eligibility requirements, but one local study of the Tennessee program reported that it enrolls about 42% of the children meeting its income-based eligibility requirements (Grehan et al., 2011). Eligible children who do not participate in the public prekindergarten program are not included in the analysis sample for virtually all applications of the age-cutoff RDD (Boxes B and E in Figure 3) and thus represent a loss of cases from the initial ITT sample that can be considerable. The attrition that results from families moving out of the catchment area before outcome assessment could also be appreciable (Boxes A, C, and D in Figure 3). Data from the nationally representative Current Population Survey indicate that approximately 23% of families with children below age 6 were residentially mobile in 2010 (U.S. Census, 2010).
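To give a rough sense of how these figures could combine, the sketch below multiplies each participation rate by the share of families not moving. It is a hypothetical simplification that treats non-enrollment and mobility as independent and applies the one-year mobility rate only once, so it should be read as an order-of-magnitude illustration rather than an estimate.

```python
def retained_share(participation_rate, annual_mobility_rate):
    """Rough share of an age-eligible cohort present for outcome assessment.

    Hypothetical simplification: non-enrollment and moving are independent,
    and a single year's mobility rate applies once.
    """
    return participation_rate * (1 - annual_mobility_rate)


MOBILITY = 0.23  # families with children under 6 who moved in 2010 (CPS figure cited above)
for label, rate in [("universal, lowest", 0.59),
                    ("universal, highest", 0.74),
                    ("Tennessee, local study", 0.42)]:
    print(f"{label}: {retained_share(rate, MOBILITY):.2f}")
```

Under these assumptions, roughly a third to somewhat over half of the initial ITT sample would remain observable (0.45, 0.57, and 0.32 for the three rates).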
Because of the different ways in which attrition can occur from the initial ITT treatment and control samples, there is no basis for assuming that their attrition rates will be similar or that the children lost from each group will be similar. This potential for differential attrition thus undermines the integrity of the initial ITT samples and compromises the internal validity of effect estimates based only on the children who appear for outcome assessment.
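The consequence of differential attrition can be illustrated with a simulation in which the program has no effect at all. In this sketch (all parameters are assumptions chosen for illustration, and age is ignored), a latent family characteristic raises both test scores and the probability of being observed, but the filter that determines which children are observed differs between the treatment and control cohorts.

```python
import numpy as np


def attrition_demo(n=50000, seed=2):
    """Two ITT cohorts with NO true program effect: compare the full-ITT contrast
    with the contrast among children who remain observable."""
    rng = np.random.default_rng(seed)
    u_t = rng.normal(size=n)  # latent family characteristic, treatment cohort
    u_c = rng.normal(size=n)  # same characteristic, control cohort
    y_t = 100 + 5 * u_t + rng.normal(0, 10, n)  # outcomes depend on u only
    y_c = 100 + 5 * u_c + rng.normal(0, 10, n)

    def logistic(v):
        return 1 / (1 + np.exp(-v))

    # Treatment cohort: observed only if enrolled in Year 1 and did not move away
    # before kindergarten; enrollment assumed to depend strongly on u.
    obs_t = (rng.random(n) < logistic(0.3 + 1.0 * u_t)) & (rng.random(n) > 0.20)
    # Control cohort: observed only if it appears in pre-K in Year 2; selection
    # on u is assumed weaker here, so the two filters differ.
    obs_c = (rng.random(n) < logistic(0.3 + 0.2 * u_c)) & (rng.random(n) > 0.20)

    itt_diff = y_t.mean() - y_c.mean()
    analyzed_diff = y_t[obs_t].mean() - y_c[obs_c].mean()
    return itt_diff, analyzed_diff


itt, analyzed = attrition_demo()
print(f"full ITT contrast: {itt:.2f}; analyzed contrast: {analyzed:.2f}")
```

The full-ITT contrast is close to zero, as it should be, while the contrast among the children who remain observable shows a spurious positive "effect."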
Differential Entry Into the ITT Samples
Differential entry of children into the treatment and control groups providing the outcome data used in the analysis is likewise a threat to the integrity of the age-cutoff RDD. The initial ITT control group may be augmented by children who move into the catchment area and enroll in the prekindergarten program, where they will provide outcome data unless they are screened out by the researchers (Box F). Similarly, age-eligible children who move into the catchment area, enroll in the prekindergarten program during the school year, and continue into kindergarten will be included in the outcome data collection unless screened out (Box G). Very few of the reports of age-cutoff RDD studies indicate that such screening is done. The inclusion of these additional children in the analytic sample is akin to new respondents joining an RCT after randomization. With no assurance that the rates of entry or the characteristics of the entering children are the same for the treatment and control groups, inclusion of those children can further compromise the internal validity of the effect estimates.
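Screening out in-movers requires knowing when each child arrived in the catchment area relative to the date on which the ITT sample is defined. The sketch below assumes a hypothetical record field, `moved_in`, and a hypothetical reference date for the start of Year 1; the class and function names are illustrative only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChildRecord:
    child_id: str
    cohort: str     # "treatment" or "control"
    moved_in: date  # hypothetical field: when the family began residing in the catchment area


def screen_entrants(records, itt_reference_date):
    """Drop children who entered the catchment area after the ITT sample was
    defined (the in-movers of Boxes F and G); everyone else is retained."""
    return [r for r in records if r.moved_in <= itt_reference_date]


records = [
    ChildRecord("c1", "treatment", date(2010, 3, 1)),
    ChildRecord("c2", "treatment", date(2011, 1, 15)),  # arrived during Year 1
    ChildRecord("c3", "control", date(2009, 6, 30)),
    ChildRecord("c4", "control", date(2011, 7, 1)),     # arrived before Year 2
]
kept = screen_entrants(records, itt_reference_date=date(2010, 8, 15))
print([r.child_id for r in kept])  # ['c1', 'c3']
```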
Differential Propensity to Participate in Prekindergarten
As noted, the children who constitute the treatment and control groups whose outcomes are analyzed in the age-cutoff RDD are not identified until the time of outcome measurement. The treatment group consists of children who participated in the prekindergarten program the year before and are then found in kindergarten at the beginning of the next school year. Researchers reporting age-cutoff RDD studies have not generally been specific about how participation in the prekindergarten program the previous year was defined other than indicating that the requisite information came from school records. Some of the children who initially enrolled in the program may have withdrawn before the end of the school year or attended only intermittently; others may have joined the program after the beginning of the year. The definition of the treatment sample may or may not take these differences in the pattern and extent of participation into account.
Unless the treatment group is defined solely in terms of attendance in the prekindergarten program during the early weeks of the school year, however, some of the comparability of the treatment and control groups with regard to the propensity for program participation will be lost. This is because the control group is defined on the basis of enrollment in the prekindergarten program at the beginning of the school year. The appearance of the control children in the prekindergarten program affirms some similarity with the treatment group with regard to parents’ interest in having their children participate in the program. Nonetheless, identifying them by their presence in prekindergarten only at the beginning of the year means that any children who will drop out early or attend irregularly are also included. Similarly, any children who will enter the program late are excluded. Those children may be systematically different from the children who participate more fully throughout the school year. If the same inclusion and exclusion criteria are not applied to both the treatment and control groups, those differences will diminish their comparability and potentially exacerbate selection bias in the effect estimates.
Cohort Differences
The span of the age-cutoff RDD across two successive school-year birth cohorts allows for the possibility of intervening events that create different circumstances for one cohort than for the other. Such cohort-level differences, which by their nature do not vary smoothly across the age cutoff, could bias the effect estimates if they are related to the outcomes of interest, whether directly or through other factors, observed or not, that they influence. Because the analysis samples for the age-cutoff RDD are defined by participation in the prekindergarten program, a particular concern is any cohort difference that affects that participation or the characteristics of the children who participate.
One possible source of such differences is a change in the preschool options in the catchment area. For example, if additional community preschool programs or new Head Start sites opened during Year 1 of an age-cutoff RDD, that might alter the enrollment pattern of the younger cohort in public prekindergarten in the fall of Year 2. Policy changes between Years 1 and 2 in the catchment area schools might also have such effects. Changes in income-eligibility requirements for the prekindergarten program, for instance, could result in differences in the characteristics of the children enrolling in each of the successive school years. Changes in school configurations might also have such effects, for example, revised school zones that change the neighborhoods served in ways that draw different children into the prekindergarten program in Year 2 than in Year 1. Another possible source of cohort differences would be changes in the population of young children in the catchment area, for example, a rapid influx of immigrant families.
The potential for substantial cohort differences is not merely hypothetical. In an age-cutoff RDD study conducted by one of the authors, the Tennessee Early Reading First study (Wilson et al., 2013), changes in the school configuration produced substantial cohort differences on key demographic variables. Two schools were added to the catchment area between Years 1 and 2 that were located in neighborhoods populated by recent immigrant families and included classrooms for the prekindergarten program. The control group that resulted included disproportionate numbers of English-language learners, Hispanics, and recent immigrants, whereas the treatment group included a larger proportion of African American children. The proportion of children contributing to the outcome data whose native language was not English, for instance, was 22% for the first (treatment) cohort and 44% for the second (control) cohort. The proportion of African American children, in contrast, fell from 67% to 48%. Differences of this magnitude have ample potential to bias the treatment effect estimates.
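One way to gauge the size of these cohort differences is a standardized difference between proportions, dividing the difference by the pooled binary standard deviation. The sketch below applies that formula (one of several conventions for binary covariates) to the percentages reported above.

```python
import math


def smd_proportions(p1, p2):
    """Standardized difference between two proportions (pooled binary SD)."""
    pooled_sd = math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    return (p1 - p2) / pooled_sd


# Treatment cohort first, control cohort second, as reported above.
print(round(smd_proportions(0.22, 0.44), 2))  # native language not English: -0.48
print(round(smd_proportions(0.67, 0.48), 2))  # African American: 0.39
```

Both values (about 0.48 and 0.39 in absolute value) are well above the 0.1 and 0.25 thresholds often used as rules of thumb for covariate imbalance.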
Differential Outcome Measurement
As with all designs in which outcomes for treatment and control groups are compared, the outcome measures in the age-cutoff RDD must be operationalized the same way for both groups. That is, there should be no differences between the groups in the measures used or the way they are administered that might create differences in the observed values. Because of the age difference between the treatment and control groups in prekindergarten age-cutoff RDD studies, however, there are some potential problems with outcome measurement that could compromise internal validity.
Different Start Rules
Many early childhood assessments have different start rules for children of different ages. For instance, the Peabody Picture Vocabulary Test (PPVT; Dunn & Dunn, 2007), which has been frequently used in age-cutoff RDD studies, has 5-year-old children (concentrated in the treatment group) begin 12 items further into the test than 4-year-old children (concentrated in the control group). The final score is determined by the highest item reached before the stop rule applies minus the number of errors. This scoring thus assumes that every treatment child who is 5 years old would make no errors on the base set of items given to 4-year-olds. If that assumption is not correct for some children in the treatment group, they will receive a higher score than they would get if they were tested the same way as the children in the control group. Moreover, there may be some children in the treatment group who are 6 years old by the time of testing; they start the PPVT in yet a different place from the 4- and 5-year-olds.
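A simplified ceiling-minus-errors rule makes this scoring assumption concrete. In the Python sketch below, items before the start point are never administered but are credited as correct. The start item of 13 mirrors the 12-item offset described above. The fixed ceiling item and the response pattern are hypothetical, and the PPVT's actual basal and ceiling rules are more involved than this.

```python
def ceiling_minus_errors(responses: dict[int, bool], start_item: int, ceiling_item: int) -> int:
    """Simplified raw score: highest item reached minus errors made.

    Items numbered below start_item are never administered and are implicitly
    credited as correct. responses maps item number (1-based) to True for a
    correct answer and False for an error.
    """
    errors = sum(1 for item in range(start_item, ceiling_item + 1) if not responses[item])
    return ceiling_item - errors


# Hypothetical child who would miss items 3, 8, 20, and 25 if given every item.
missed = {3, 8, 20, 25}
responses = {item: item not in missed for item in range(1, 41)}

younger_rule = ceiling_minus_errors(responses, start_item=1, ceiling_item=30)
older_rule = ceiling_minus_errors(responses, start_item=13, ceiling_item=30)
print(younger_rule, older_rule)  # 26 28
```

The same hypothetical response pattern earns two more points under the later start point, because the two errors below item 13 are never observed.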
Other measures that have been used in age-cutoff RDD studies also have different start rules by age. The Woodcock–Johnson III Spelling and the Picture Vocabulary subtests specify different start rules based on school grade (Woodcock, McGrew, & Mather, 2001). The Brigance II, used in an age-cutoff RDD by Coburn (2009), has different forms for prekindergarten and kindergarten that produce scores that have to be converted to be presumptively equivalent to each other (Brigance, 2005; Curriculum Associates, n.d.). Researchers using any measures of this sort face a trade-off when deciding whether to start both treatment and control children at the same point or follow the age-graded start rules. Using different start rules may create spurious differences in the observed outcome scores that bias the estimates of treatment effects. However, asking older children to respond to a series of relatively easy items intended for younger children before they get to those more indicative of their actual ability may produce test fatigue that artificially diminishes their performance.
In the Boston prekindergarten study, Weiland and Yoshikawa (2013) investigated the influence of different start rules on PPVT scores. In one analysis, they artificially shifted children’s starting points among the sets of basal items that establish the floor for a given child. For example, if a 4- or 5-year-old child passed the basal for the 6-year-old start set, any errors on items before that start set were ignored. This rescoring raised the scores of 15% of the sample by between 1 and 10 points. Among the children whose scores changed, the average increase was 1.78 correct items in the treatment group and 1.93 items in the control group. In this instance, the influence of the different scoring procedures was similar for the treatment and control groups, but it is not safe to assume that the effect will always be so benign.
Another example comes from the Woodcock–Johnson Picture Vocabulary test used in the Tennessee Prekindergarten study (Lipsey et al., 2011). For that test, the start items vary by grade level—kindergarten children start eight items ahead of prekindergarten children. However, these researchers started all children at Item 1. They then computed the scores the kindergarten children (treatment group) would have received if scored using the start rule for their grade and found that 6.5% of the scores increased. The change in the W scores (the Woodcock–Johnson Item-Response Theory [IRT] scaled, but not age-normed scores) for those children ranged from 3 to 6 points with a mean of 3.6, a magnitude that rivals that of the overall treatment effects found on this measure.
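A treatment-control comparison is affected by the shift in each group's mean. That shift depends on both the share of children whose scores change and the size of the change among them. The Python sketch below carries out this arithmetic with the figures reported above. For Boston, only the overall 15% is reported, so the equal shares in the two groups are an assumption made purely for illustration.

```python
def mean_shift(share_changed: float, mean_change_if_changed: float) -> float:
    """Change in a group's mean score produced by a scoring artifact.

    If only a fraction of children are affected, the group mean moves by that
    fraction times the average change among the affected children.
    """
    return share_changed * mean_change_if_changed


def artifact_in_effect(treat_share, treat_change, control_share, control_change):
    """Portion of a treatment-control mean difference attributable to the artifact."""
    return mean_shift(treat_share, treat_change) - mean_shift(control_share, control_change)


# Tennessee Woodcock-Johnson Picture Vocabulary: only treatment (kindergarten)
# scores would change under the grade-based start rule.
tn = artifact_in_effect(0.065, 3.6, 0.0, 0.0)

# Boston PPVT: 15% of the whole sample changed. Equal shares per group is an
# ASSUMPTION made here for illustration only.
boston = artifact_in_effect(0.15, 1.78, 0.15, 1.93)

print(f"Tennessee: {tn:+.3f} W points; Boston (assumed shares): {boston:+.4f} items")
```

With these inputs, the Tennessee artifact would shift the treatment-group mean by about 0.23 W points. The assumed Boston shares give a net difference of about -0.02 items. The per-child change among affected children and the group-level shift answer different questions, and both depend on how many children in each group are affected.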
Test developers strive to make the scores from different forms or for children of different ages equivalent, all else equal. The success of those efforts for any given sample in an age-cutoff RDD study will depend on the similarity of the responses of the children in that study to those on which the psychometric work for the test was done. Where the outcome measures are not exactly parallel for younger and older children, the potential for bias exists in age-cutoff RDD studies where treatment and control groups are created explicitly to differ on age.
Floor and Ceiling Effects
One reason many tests for younger children have different start rules and sometimes different forms for children of different ages is that performance can vary widely over relatively small age differences in the early years. The school readiness measures of most interest in the age-cutoff RDD studies are especially susceptible to this because of the rapid gains children can make on the respective skills during the prekindergarten years. One implication of this situation is that some measures otherwise appropriate for young children might not have sufficient range to properly scale the lower performance of the youngest children entering prekindergarten and the oldest children entering kindergarten. The result would be floor or ceiling effects that could bias the treatment effects by overestimating the performance of the control group or underestimating that of the treatment group.
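A few hypothetical scores illustrate how floor and ceiling limits attenuate the estimated effect. The Python sketch below censors both groups at an assumed floor of 6 and ceiling of 18. The scores themselves are invented for illustration.

```python
from statistics import mean


def clip(scores, floor, ceiling):
    """Censor scores at the lowest and highest values a measure can record."""
    return [min(max(s, floor), ceiling) for s in scores]


# Hypothetical "true" ability scores on a common scale.
control = [2, 5, 8, 11, 14]       # younger children at prekindergarten entry
treatment = [10, 13, 16, 19, 22]  # older children at kindergarten entry

true_diff = mean(treatment) - mean(control)
observed_diff = mean(clip(treatment, 6, 18)) - mean(clip(control, 6, 18))
print(true_diff, observed_diff)  # 8 6
```

Censoring raises the control mean from 8 to 9 and lowers the treatment mean from 16 to 15. The observed difference of 6 therefore understates the true difference of 8.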
Reactivity to Testing
The age-cutoff RDD procedure of assessing control children at the beginning of prekindergarten and treatment children at the beginning of kindergarten also raises questions about another source of possible bias. The control children in this comparison may well have considerably less experience than the treatment children with testing and test-like situations, the school context in general, and interactions with unfamiliar adults. These children thus may be less comfortable with the testing situation and the general context within which it occurs in ways that cause them to perform below their actual ability. This may especially be the case for children with no prior experience with preschool or day care outside the home. Weiland and Yoshikawa (2013) found that roughly one third of the children in their control sample had no prior out-of-home care; for Lipsey et al. (2011), this group comprised more than half the sample.
Strengthening the Age-Cutoff RDD
Full Application of the Age-Cutoff RDD
Fully applying the age-cutoff RDD is one very direct way of strengthening it. An uncompromised RDD has inherently strong internal validity (Cook & Wong, 2008). With age as the assignment variable and an exogenously applied cutoff determining eligibility, the prekindergarten age-cutoff RDD can be expected to have strong internal validity when applied to the entire initial ITT sample. In a full application, the initial ITT sample would be identified at the beginning of the school year and followed prospectively as the treatment and control conditions occur during that year. Outcomes would then be assessed with age-equated measures for all the children in the initial sample, regardless of whether they attended the program or remained in the catchment area.
This application of the age-cutoff RDD should yield unbiased estimates of the effects on the selected outcomes of being eligible for prekindergarten—the ITT impact estimates. With the initial sample well defined, information about which children actually attended the prekindergarten program, and outcome data for all the children in the initial sample, instrumental variable techniques could be used to produce estimates of TOT effects. Those TOT estimates would describe the effects of actually participating in the prekindergarten program rather than merely being eligible to participate.
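Under a full application, the TOT estimate can be obtained with the standard fuzzy RDD (Wald) ratio. This is the discontinuity in the outcome at the cutoff divided by the discontinuity in program attendance. The Python sketch below estimates each discontinuity with separate straight-line fits on either side of the cutoff within a bandwidth. Several elements are assumptions for illustration: the variable names, the uniform weighting, and the synthetic data (80% attendance among eligible children and a true attendance effect of 5 points). A real analysis would also need bandwidth selection and standard errors.

```python
def _value_at_cutoff(x, y):
    """OLS fit of y on x; returns the fitted value at x = 0 (the intercept)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar


def fuzzy_rdd(running, attended, outcome, bandwidth):
    """ITT, first-stage, and TOT (Wald) estimates at a cutoff of 0.

    running:  days of age relative to the eligibility cutoff (>= 0 is eligible)
    attended: 1 if the child actually attended prekindergarten, else 0
    outcome:  school readiness score
    """
    left = [i for i, r in enumerate(running) if -bandwidth <= r < 0]
    right = [i for i, r in enumerate(running) if 0 <= r <= bandwidth]

    def jump(values):
        below = _value_at_cutoff([running[i] for i in left], [values[i] for i in left])
        above = _value_at_cutoff([running[i] for i in right], [values[i] for i in right])
        return above - below

    itt = jump(outcome)           # effect of being eligible
    first_stage = jump(attended)  # change in attendance rate at the cutoff
    return itt, first_stage, itt / first_stage


# Synthetic ITT sample: five children at each day of age around the cutoff.
running, attended, outcome = [], [], []
for r in range(-30, 31):
    for k in range(5):
        d = 1 if (r >= 0 and k < 4) else 0     # 80% of eligible children attend
        running.append(r)
        attended.append(d)
        outcome.append(10 + 0.01 * r + 5 * d)  # true effect of attending = 5

itt, fs, tot = fuzzy_rdd(running, attended, outcome, bandwidth=30)
print(round(itt, 3), round(fs, 3), round(tot, 3))  # 4.0 0.8 5.0
```

With these inputs, the ITT jump is 4 and the attendance jump is 0.8. Their ratio recovers the TOT effect of 5. The ratio is interpretable in this way only when outcomes are observed for the whole ITT sample.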
The effect estimates resulting from this full application of the age-cutoff RDD, however, would still represent the benefits of the prekindergarten program relative to a counterfactual condition consisting of the child care and preschool experiences parents choose for children in the year prior to age-eligibility for prekindergarten. For reasons described earlier, these impact estimates may not adequately describe the added value of a public prekindergarten program relative to the alternatives that would be chosen by parents during the prekindergarten year itself if the public program was not available.
Given the inherent difficulty of identifying the initial ITT sample, none of whom are in school at the point where that sample is defined, and the effort required to track and assess outcomes for all the children, including those who have left the catchment area, no age-cutoff RDD study has attempted full implementation of these procedures. Some studies, however, have used extensive school samples (statewide, for example) that reduce the chance that a mobile child would not be enrolled in a school where testing was conducted (e.g., Hustedt, Barnett, Jung, & Goetze, 2009; Lipsey et al., 2011; Wong et al., 2008).
Typical Application of the Age-Cutoff RDD
The full application of the age-cutoff RDD described above imposes formidable practical requirements that entail considerable effort and expense, though it is not entirely unrealistic. It is for such practical reasons that the typical application of this design restricts data collection to children who can be found and assessed in prekindergarten and kindergarten classrooms in the catchment area. In doing so, it focuses on the effects of participation in the prekindergarten program, essentially TOT estimates, without embedding them in data on the full ITT sample. Absent that ITT context, as we have shown, the resulting TOT-like estimates are subject to a number of threats to internal validity, virtually all of which represent some form of selection bias.
For the age-cutoff RDD, being free of selection bias means that all variables, observed or unobserved, that characterize the sample prior to intervention and are related to the outcome vary smoothly across the cut-point and thus show no discontinuities that are confounded with the treatment conditions. We next describe several ways to reduce, or at least assess, the potential for selection bias in the typical application of the age-cutoff RDD while retaining much of its practicality.
Matching on history in the catchment area
One straightforward step is to better match the treatment and control samples with regard to their participation in prekindergarten and their presence in the catchment area during the span of the two birth cohorts represented in the ITT sample. This could be done by identifying children in both cohorts at the beginning of their prekindergarten year and tracking them to the beginning of their kindergarten year. School records and parent reports could be used to identify children who were in the catchment area prior to the beginning of each school year, to describe their prekindergarten participation, and to determine which children remained through to the beginning of the kindergarten year after the prekindergarten year. This information would allow the treatment and control groups used in the analysis to be matched for residence in the catchment area prior to prekindergarten, attendance and dropout in prekindergarten, and enrollment in the catchment area schools in kindergarten. Although this procedure restricts the treatment effect estimates to children who remain in the catchment area for at least two school years, and thus may seem to limit external validity, it is more closely aligned with the 2-year span of the ITT sample than the usual procedure. More important, matching on these aspects of the history of the treatment and control samples will help equate them on their propensity for participation in the prekindergarten program, differential attrition from the ITT samples, and differential entry into those samples. Alternatively, the availability of this information will allow sensitivity tests to determine whether the treatment effect estimates differ when the analysis sample includes children with different opportunities to drop out of prekindergarten or different residential histories in the catchment area.
Weiland and Yoshikawa (2013), for example, used administrative records to follow the control children into their kindergarten year and identify those who left the school district before beginning kindergarten. As a sensitivity check, they dropped the children who did not appear in kindergarten from their analysis and re-estimated the treatment effects. On the PPVT, they found that this made the effect estimates slightly larger than those found with the full sample. Wilson (2011) made a similar comparison in the Tennessee Early Reading First study, also with the PPVT. As with the Boston sample, treatment effect estimates were larger with the control children who did not continue through to kindergarten dropped from the analysis. Weiland and Yoshikawa (2013) also identified and dropped late enrollee treatment children from the analysis and found that the estimate of the effect on PPVT scores was somewhat smaller than with the full sample. Wilson et al. (2013) also made this adjustment and likewise found that the effect estimates for the PPVT were smaller when the late enrollee children were excluded.
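Sensitivity checks of this kind re-run the same estimator on analysis samples defined by different exclusion rules. The Python sketch below organizes that loop. The record fields, the rules, and the six children are hypothetical. The simple difference in means stands in for the RDD model that a real analysis would re-estimate.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Child:
    treated: bool        # True for the older (treatment) cohort
    score: float         # outcome score
    reached_k: bool      # control child later found enrolled in kindergarten
    late_enrollee: bool  # treatment child who entered prekindergarten late


def diff_in_means(sample):
    """Stand-in estimator; a real analysis would re-fit the RDD model."""
    return (mean(c.score for c in sample if c.treated)
            - mean(c.score for c in sample if not c.treated))


RULES = {
    "full sample": lambda c: True,
    "drop controls not found in kindergarten": lambda c: c.treated or c.reached_k,
    "drop late enrollees": lambda c: not (c.treated and c.late_enrollee),
}


def sensitivity(children, estimator=diff_in_means, rules=RULES):
    """Re-run the estimator on each restricted analysis sample."""
    return {name: estimator([c for c in children if keep(c)])
            for name, keep in rules.items()}


children = [
    Child(True, 20, True, False), Child(True, 18, True, False), Child(True, 22, True, True),
    Child(False, 14, True, False), Child(False, 12, True, False), Child(False, 19, False, False),
]
for name, estimate in sensitivity(children).items():
    print(f"{name}: {estimate}")
```

The invented scores were chosen so that the estimates move in the directions reported above. The estimate is larger when control children who did not reach kindergarten are dropped (7 versus 5) and smaller when late enrollees are dropped (4 versus 5).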
Baseline covariates
A substantial limitation of the typical age-cutoff RDD study is the absence of a well-chosen array of baseline covariates that can be used to assess initial differences between the treatment and control groups in the analysis sample, to examine continuity on such variables across age and, especially, across the cut-point on age, and to provide data for statistical adjustments as needed to reduce selection bias. Obtaining such baseline data is especially difficult for the control group. The baseline for the treatment group is at the beginning of the prekindergarten program, which allows relatively good access for data collection. The baseline for the control group, however, is at the beginning of that same school year when the control children are still in the community and not in any regular school program. Identifying and locating them for data collection, therefore, would be a considerable challenge.
Nonetheless, it would be advantageous to obtain whatever baseline information is feasible. This can be done after the fact for variables that are stable over time (as has been done in many applications of the age-cutoff RDD). The gender, age, and race/ethnicity of the children, for instance, could be ascertained from school records, observation, and/or parent reports when the outcome measures are collected at the beginning of Year 2. Parent reports of family and child characteristics at the baseline time could also be obtained, for example, mother’s education and general socioeconomic status, books in the home, amount of reading to the child, home language, and the like. These would be retrospective reports, which are known to have problems of validity and reliability. If they are collected at the time of outcome measurement, however, the memory burden would be equal for the parents of the treatment and control children, so any recall errors would not necessarily bias the comparison. Moreover, even if these variables are not wholly trustworthy for inclusion in the main analysis, they could be used for sensitivity tests to assess the extent to which the effect estimates change when adjusted for differences on them.
The entry of the treatment and control groups into prekindergarten is an especially convenient time for collecting descriptive data. As noted above, this is the baseline for the treatment sample, but it comes at the time of outcome assessment for the control sample. Nonetheless, such data can help assess the potential for biasing attrition and cohort differences. The control group in the age-cutoff RDD is essentially wait-listed for the prekindergarten program, so both groups should look very similar at the time they enter the program. In addition to the demographic and family background measures mentioned above, one or more key outcome measures might be administered to the treatment group when they enter prekindergarten. Well-matched treatment and control groups should show similar performance levels at prekindergarten entry, even though that entry occurs a year apart for the two groups. If there are notable differences, such data can be used to better match the treatment and control groups.
Equivalent measures
Finally, in all age-cutoff RDD applications with young children, researchers should ensure that, to the extent possible, the outcome measures are operationalized the same way for the treatment and control groups and free of floor and ceiling effects. For measures administered differently for children of different ages, an effort should be made to ensure that those differences do not bias the treatment effect estimates. If the start points for the younger and older children do not differ by very many items, for example, all the children can be started at the same point to keep the testing procedure uniform. If that would require giving the older children a very large number of items that are easy for them, it may be best to conduct a pilot test with children like those to be included in the research sample to empirically identify the items that can be safely skipped because few of the older children miss them and then adjust the start rule if necessary. Also, an exploration of the comparability of the results produced by different procedures could be embedded in the study itself by assessing a random subset of the older children using the procedures for the younger children and vice versa. The results can then be used to identify any testing artifacts and, if necessary, adjust for them in the analysis.
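The embedded comparison described above can be sketched as a randomization within one age group. In the Python sketch below, a random subset of one age group is tested with the other group's procedure. The difference in mean scores between the two procedures then estimates the testing artifact. The function names, the 20% share, and the toy scores (in which the standard procedure adds 2 points) are hypothetical.

```python
import random
from statistics import mean


def assign_crossover(child_ids, share, seed=0):
    """Randomly pick a subset of children to be tested with the other age group's procedure."""
    rng = random.Random(seed)
    k = round(share * len(child_ids))
    return set(rng.sample(child_ids, k))


def procedure_artifact(scores, crossover):
    """Mean score under the standard procedure minus mean under the alternative.

    Because the subset is chosen at random within one age group, this
    difference estimates the artifact introduced by the procedure itself.
    """
    standard = [s for cid, s in scores.items() if cid not in crossover]
    alternative = [s for cid, s in scores.items() if cid in crossover]
    return mean(standard) - mean(alternative)


# Toy data for the older group: the standard (older-child) start rule adds 2 points.
older_ids = list(range(100))
crossover = assign_crossover(older_ids, 0.2, seed=42)
scores = {cid: 30 + (0 if cid in crossover else 2) for cid in older_ids}
print(procedure_artifact(scores, crossover))  # 2
```

Procedures are assigned at random within a single age group, so the difference cannot reflect age or treatment status. With real data, the estimate would also carry sampling error. The same functions apply when younger children are tested with the older children's procedure.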
Problematic floor and ceiling effects appear as a bunching of scores in the lower range for the younger control children or the upper range for the older treatment children. Diagnosing those effects will thus be relatively straightforward, but, if problems are discovered, there is little that can be done to correct them after the fact. The better strategy is to use measures for which an appropriate response range for the study sample has already been demonstrated or, if there is any doubt, to try out the measures in advance with similar children to ensure that they adequately differentiate performance at both the low and high end of the relevant age range.
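Bunching can be summarized by the share of each group's scores at the lowest or highest value the measure can record. The Python sketch below computes those shares separately for each group. The score range of 0 to 40 and both score lists are hypothetical.

```python
def bunching(scores, min_score, max_score):
    """Share of scores sitting at the measure's lowest and highest values."""
    n = len(scores)
    at_floor = sum(1 for s in scores if s <= min_score) / n
    at_ceiling = sum(1 for s in scores if s >= max_score) / n
    return at_floor, at_ceiling


# Hypothetical scores on a measure that can record values from 0 to 40.
control_scores = [0, 0, 0, 3, 5, 7, 9, 12, 15, 20]        # younger control children
treatment_scores = [22, 30, 35, 40, 40, 40, 38, 27, 40, 33]  # older treatment children

print(bunching(control_scores, 0, 40))    # (0.3, 0.0)
print(bunching(treatment_scores, 0, 40))  # (0.0, 0.4)
```

Here 30% of the control scores sit at the floor and 40% of the treatment scores sit at the ceiling. That pattern would call into question whether the measure's range is adequate for this sample.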
The possibility that effect estimates will be inflated by the greater familiarity of children in the treatment group with the testing situation can also be investigated, though this is an area where further research is needed. As shown by Lipsey et al. (2011) and Weiland and Yoshikawa (2013), documenting the type of child care control children experienced in the year prior to prekindergarten entry will yield insight about their experience with center-based preschool environments and perhaps also with assessment procedures. If a large proportion of the control group experienced home-based care in that prior year, the risk of inflation of treatment effect estimates due to testing familiarity artifacts may be larger than if most of the children participated in some center-based program.
There are also instruments that may help assess this problem. The Task Orientation Questionnaire (Smith-Donald, Raver, Hayes, & Richardson, 2007), for example, allows assessors to rate a child’s behavior and self-regulation during the testing session. In the Boston study (Weiland & Yoshikawa, 2013), this was the only measure on which the treatment and control samples did not differ significantly, suggesting little likelihood of greater reactivity to testing by the control group. However, about two thirds of the Boston control group was enrolled in non-parental care settings in the year before they were assessed; there is no assurance that comparable results would be found for children with less experience with out-of-home care.
Conclusion
The key feature of an RCT that supports its strong internal validity for estimating treatment effects is the process by which the research sample is divided into treatment and control groups. With randomization, only chance differences are expected between these groups prior to intervention on all characteristics, observed or unobserved. When these groups are kept intact through to the collection of outcome data, their initial equivalence engenders confidence that any outcome differences are due to the treatment. Similarly, the key feature of an RDD is the process by which a research sample is divided into treatment and control groups. When that process involves dividing a defined sample with a cut-point exogenously imposed on a continuous observed variable and the resulting groups are kept intact, we expect all other characteristics of the sample, observed or unobserved, to vary smoothly across that cut-point. It is that expectation (provided other assumptions are met) that creates confidence that a discontinuity at the cut-point on an outcome variable is a treatment effect.
The main conclusion of the analysis presented here is that the prekindergarten age-cutoff RDD, as typically implemented, lacks the inherent internal validity expected of a well-executed RDD. This application of the design is structured around two cohorts of children in a catchment area divided by age of eligibility for a prekindergarten program. But that initial sample is not explicitly identified and followed prospectively. Instead, the treatment and control groups on which the analysis is based are identified post hoc at the point of outcome measurement for only subsets of the children in the respective cohorts. As described above, there are many ways in which the resulting treatment and control analysis samples might differ that could bias the treatment effect estimates. In addition, those effect estimates are not based on the same treatment-control contrast represented in an analogous RCT but, rather, involve a different counterfactual condition consisting of child care arrangements for children younger than those of prekindergarten age.
The likely direction and magnitude of the potential bias in applications of the age-cutoff RDD vary across the threats to validity we have discussed and, in addition, would depend on the circumstances of each particular application. It is thus not possible to estimate the extent to which the age-cutoff RDD studies conducted to date may have misestimated the effects of public prekindergarten programs on the school readiness outcomes investigated. The most serious of the threats to the validity of this design stem from the large amount of attrition that is possible between the initial ITT sample and the analysis samples from which outcome data are obtained. For the common situation of public prekindergarten programs that target economically disadvantaged children, it is often the lower-performing children who disproportionately fail to enroll in the program, enroll but then drop out, or move from the catchment area. Because the treatment group is assessed only after a full prekindergarten year, whereas the control group is assessed at program entry, there is greater opportunity for this to occur in the treatment group than in the control group. This opens the door to differential attrition that would upwardly bias the treatment effects.
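Two hypothetical cohorts of identical baseline ability can illustrate the mechanism behind this bias. For illustration only, the Python sketch below assumes that attrition removes the lowest performers and that the treatment analysis sample loses more of them than the control analysis sample. The scores, the true effect of 4 points, and the attrition counts are invented.

```python
from statistics import mean

# Hypothetical cohorts with identical baseline ability; the true effect of
# prekindergarten is 4 points for every treated child.
baseline = list(range(10, 30))          # 20 children per cohort, scores 10..29
treatment_full = [b + 4 for b in baseline]
control_full = list(baseline)


def drop_lowest(scores, k):
    """Remove the k lowest-scoring children (attrition concentrated among low performers)."""
    return sorted(scores)[k:]


true_effect = mean(treatment_full) - mean(control_full)
# ASSUMPTION: the treatment analysis sample loses four low performers and the
# control analysis sample loses one.
observed = mean(drop_lowest(treatment_full, 4)) - mean(drop_lowest(control_full, 1))
print(true_effect, observed)  # 4.0 5.5
```

The program effect is the same for every child. Even so, dropping four low performers from the treatment cohort and one from the control cohort inflates the estimate from the true 4.0 points to 5.5 points.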
Despite its limitations, the practical advantages of the age-cutoff RDD make it worth considering where applicable. Because of the minimal disruption it causes for school systems, it is feasible to implement in situations where an RCT would not be possible. It can also be applied with relative ease to a large catchment area, even statewide, and thus allows assessment of the effects of public prekindergarten implemented at scale. Moreover, the division of two successive cohorts of children into those eligible for prekindergarten in the current versus the next school year based on an exogenously imposed age cutoff does, at that level, constitute a credible RDD with its intrinsic advantages for estimating treatment effects. Although we have shown that the age-cutoff RDD is typically applied in a way that compromises this inherent internal validity, it is arguably better to begin with a strong design than a weak one.
In this regard, it is worth considering the alternatives to the age-cutoff RDD for investigating the effects of participation in public prekindergarten programs. A randomized design would, of course, be preferable as a source of valid effect estimates, but is difficult to implement. An appropriate RCT requires that some eligible applicants be randomly denied admission. Done broadly across a large-scale program, this is not likely to be acceptable to any of the parties affected. As a practical matter, randomization in such circumstances has only been feasible when there are more applicants than the program can accommodate (e.g., Lipsey, Hofer, Dong, Farran, & Bilbrey, 2013; Puma, Bell, Cook, Heid, & Lopez, 2005). This typically occurs, if at all, only in particular oversubscribed program sites that may well be unrepresentative of the overall program.
Much more feasible for large-scale application are nonrandomized designs that compare outcomes for age-eligible children who attend prekindergarten with those for children who do not attend (e.g., Henry et al., 2001). However, in most contexts, there are many possible differences between these groups of children that are confounded with their different prekindergarten participation. This design is thus rife with potential for selection bias and presents great challenges to researchers attempting to disentangle program effects from all the other group differences that may influence the outcomes.
Under these circumstances, the age-cutoff RDD is an attractive option for studying public prekindergarten programs despite the limitations and potential problems we have identified. It is thus especially important to critically examine the strengths and weaknesses of any planned application of this design and find ways to shore up the weaknesses. Ideally, researchers would identify the ITT sample at baseline and follow children prospectively, assessing their outcomes regardless of their prekindergarten enrollment decisions. For the many circumstances in which that will not be feasible, we have offered suggestions for strengthening this design and for sensitivity analysis to identify potential problems. The most important of these involves gathering what information is possible on relevant aspects of the prior history of the children in the analysis sample and their characteristics at baseline, and at the beginning and end of the respective prekindergarten enrollment periods. That information can then be used to better match the children in the treatment and control analysis samples and identify differences with potential to bias the effect estimates. Even with such enhancements, however, it must be recognized that the typical age-cutoff RDD implementation will not generally support strong claims about the internal validity of the resulting treatment effect estimates.
A methodologically sound and rigorous evidence base is essential for informing policy and practice as public prekindergarten programs expand across the United States. Our hope is that this article will help improve future studies that add to that evidence base and deepen our understanding of the effects of public prekindergarten programs implemented at scale.
Footnotes
Acknowledgements
We would like to thank the Institute of Education Sciences (IES) for funding a Working Group meeting on Regression-Discontinuity Methods in Age-Cutoff Designs at the Harvard Graduate School of Education on the issues discussed here. Particular thanks to the following Working Group members: W. Steven Barnett, Howard Bloom, Hans Bos, William Gormley, Guido Imbens, Stephanie Jones, Kwanghee Jung, Thomas Lemieux, Jens Ludwig, Douglas Miller, Pamela Morris, Richard Murnane, Peter Schochet, John Willett, and Vivian Wong.
Authors’ Note
The first two authors contributed equally to this study and are listed alphabetically.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work on this study was funded by the U.S. Department of Education, Institute of Education Sciences (IES).
Authors
MARK W. LIPSEY is the director of the Peabody Research Institute at Vanderbilt University with research interests in interventions for at-risk children and youth and program evaluation methodology.
CHRISTINA WEILAND is an assistant professor in the School of Education at the University of Michigan with research interests in the effects of early childhood interventions and public policies on children’s development, particularly among children from low-income families, and on the mechanisms by which such effects occur.
HIROKAZU YOSHIKAWA is the Courtney Sale Ross professor of globalization and education and university professor at New York University Steinhardt focusing on the effects of programs and policies related to early childhood development and poverty on children in the United States and in low- and middle-income countries.
SANDRA JO WILSON is the associate director at the Peabody Research Institute and a research assistant professor in the Department of Special Education at Vanderbilt University. She also serves as education editor for Campbell Systematic Reviews, a peer-reviewed monograph series. Her professional interests are in the areas of research synthesis and meta-analysis, program evaluation, research methodology, and the effectiveness of interventions to support children and families.
KERRY G. HOFER is a senior research associate at the Peabody Research Institute at Vanderbilt University focusing on rigorous evaluations of early education environments and interventions including the effectiveness of a statewide public prekindergarten program, links between pre-k classroom quality and children’s outcomes, and longitudinal relationships among students’ early and later math knowledge and self-regulatory skills.
