Abstract
Background:
Annually, American colleges and universities provide developmental education (DE) to millions of underprepared students; however, evaluation estimates of DE benefits have been mixed.
Objectives:
Using a prototypic exemplar of DE, our primary objective was to investigate the utility of a replicative evaluative framework for assessing program effectiveness.
Research design:
Within the context of the regression discontinuity (RD) design, this research examined the effectiveness of a DE program for five, sequential cohorts of first-time college students. Discontinuity estimates were generated for individual terms and cumulatively, across terms.
Subjects:
Participants were 3,589 first-time community college students.
Measures:
DE program effects were measured by contrasting both college-level English grades and a dichotomous measure of pass/fail, for DE and non-DE students.
Results:
Parametric and nonparametric estimates of overall effect were positive for continuous and dichotomous measures of achievement (grade and pass/fail). The variability of program effects over time was determined by tracking results within individual terms and cumulatively, across terms. Applying this replication strategy, DE’s overall impact was modest (an effect size of approximately .20) but quite consistent, based on parametric and nonparametric estimation approaches. A meta-analysis of five RD results yielded virtually the same estimate as the overall, parametric findings. Subset analysis, though tentative, suggested that males benefited more than females, while academic gains were comparable for different ethnicities.
Conclusion:
The cumulative, within-study comparison, replication approach offers considerable potential for the evaluation of new and existing policies, particularly when effects are relatively small, as is often the case in applied settings.
Introduction
Within American colleges and universities, developmental education’s (DE) effectiveness is widely accepted. To a large extent, the acceptance of DE is based on the presumption that if underprepared students successfully complete prescribed, developmental coursework, their subsequent academic paths will be enhanced (McCabe 2000). In concordance with such favorable sentiments, there exists a long-standing, philosophical belief in the intrinsic merit of DE, an inclination that has received enormous financial support and publicly stated approval. Bailey (2009) summarized this endorsement:
Developmental education is a core part of Achieving the Dream, a $100 million initiative … to improve student success at eighty-four community colleges (www.achievingthedream.org). The U.S. Department of Education’s Institute of Education Sciences has funded the National Center for Postsecondary Research (NCPR, www.postsecondaryresearch.org) whose research is focused mainly on evaluating initiatives (primarily, but not exclusively, in community colleges) to improve outcomes for students with weak academic skills. (pp. 27–28)
While institutional processes vary, students are typically placed into a developmental curriculum track predicated on their documented lack of readiness for college-level coursework. DE students who possess an inadequate set of requisite skills are often restricted from enrolling in college-level courses. Depending on the magnitude of their skill deficiency, DE students may be required to complete several DE-level courses before beginning college-level study. Underlying this practice of sorting and rerouting is the assertion that DE students are otherwise unlikely to achieve academic success. Scholars and nonprofit organizations routinely support this conviction by arguing that student achievement and national graduation rates suffer when academically deficient students either progress inefficiently or fail to complete a compensatory curriculum (American Association of Community Colleges 2000; AchievingtheDream.org; Bailey 2009; Boylan 2001).
In spite of such confidence in DE as a programmatic strategy, after more than three decades of empirical research (see also Trochim 1984, for a history of “compensatory education”), there remains conflicting evidence regarding DE’s effectiveness. Positive effects of DE were found by Bahr (2010) who analyzed the extent of student academic deficiency related to two indicators of long-term, academic attainment. For students who were least academically prepared and for those who possessed deficiencies across multiple content areas, he found that degree or certificate attainment and upward transfer were positively associated with DE completion.
Similar positive findings have been reported by Bettinger and Long (2005), again asserting the effectiveness of remediation efforts in increasing the number of credits earned and the rate of transfer to a 4-year college for developmental mathematics students. Recent evidence from Moss, Yeaton, and Lloyd (2014), who used an embedded research design (a randomized experiment within a regression discontinuity [RD]), indicated that DE enhanced student mathematics achievement. Furthermore, there is sound research evidence that DE significantly increased the odds of passing subsequent gatekeeper courses in mathematics (Lesik 2006), and DE completion is one of the most important forecasters of student persistence (Fike and Fike 2008).
In contrast to these positive findings, two large studies have found little evidence that DE provides benefit. Using the regression discontinuity design (RDD), based on records of entering freshmen at 2- and 4-year Texas public colleges during two academic years, Martorell and McFarlin (2010, 2011) found essentially no relationship between DE and several, long-term, student outcomes including extent of upward transfer, number of academic credit hours attempted, years of college completion, and likelihood of degree completion. These authors, however, did not utilize more immediate outcomes such as grades in college-level courses as a dependent variable (though one analysis from an earlier version of their article was based on grades, as briefly described in a footnote) and, furthermore, “… did not include information on the courses students take” (Martorell and McFarlin 2010, p. 23).
When data were taken from all community colleges in Florida, Calcagno and Long (2008) found that DE had a weak and generally nonsignificant impact on the number of credits completed, probability of transferring to a 4-year institution, or likelihood of passing college-level courses and that DE led to only slight gains in student persistence. Again, individual grades were not systematically examined, though pass/fail results in mathematics were collected and found to be not statistically related to DE. Thus, although the empirical evidence is inconsistent, there remains a pervasive presumption that DE adds value.
Consequently, despite some positive research evidence and a pervasive belief that DE participation results in student success, many policy-relevant questions remain unresolved. More specifically, while the existence of positive DE effects has been demonstrated in a few instances, precise estimates of the magnitude of program effects are far less common. As grade in the first college-level course subsequent to remediation efforts has seldom been reported in these research studies, we have insufficient knowledge about the immediate impact of DE. As argued by Martorell and McFarlin (2010, 2011), given choice, upon completion of DE, students may choose easier or more difficult courses, and this difference in difficulty may confound commonly reported, less immediate, DE versus non-DE differences in dependent variables such as college drop-out rate or number of academic credit hours completed. In addition, evaluators have inconsistently reported differential program effects of DE for specific demographic groups. Finally, to best judge the impact of DE, the consistency of its impact across successive cohorts of students is critical to policy makers. The existing void follows from a typical strategy in which follow-up periods have been brief and a single aggregate of multiple cohorts has been utilized to assess efficacy. These global estimates make it impossible to discern the consistency and permanence of program effects.
Replication as a Framework for Policy Decisions and Program Effectiveness
Across the United States, statewide regulation of the assessment, placement, and delivery of DE is uncommon since nearly all DE policy decisions are formulated locally, within distinct institutions. In a national study of higher education DE, the National Center for Education Statistics (Parsad, Lewis, and Greene 2003) estimated that over 70% of institutions make DE policy decisions at the local level. Administrators and faculty are routinely faced with the challenge of judging the extent, consistency, and duration of program effectiveness, assessing subgroup impact, and establishing resource allocation priorities based on such findings. In this vein, our data were derived from a single college within a state where DE practices were locally controlled, as is the case in the majority of U.S. higher education institutions (Michigan Developmental Education Consortium 2010; Parsad, Lewis, and Greene 2003).
Static measures of benefit, based solely on averages, during moderate durations, may lead to invalid conclusions. Early, no-difference results may outweigh later, more positive findings. Evaluations that reveal the variability of program effects provide an important basis for decision making. Even randomized studies, though not without their own inherent shortcomings (Kaptchuk 2001), typically do not consider such an evolving context. Outcomes for both treatment and comparison groups are averaged over the enrollment period of the experiment, so researchers have no way of knowing the degree to which the treatment’s effectiveness changed during the study’s course. An exception sometimes occurs in the evaluation of new pharmaceutical drugs when trial results are monitored over considerable durations and “stopping rules” are implemented (Hennekens and Buring 1987). Here, the experiment terminates when there is either clear benefit or apparent harm (the drug is substantially better or worse than the placebo). But, as a general rule, interim temporal analyses are not reported in most research.
Similar temporal issues exist in education when changes in programs, curricula, and instructor delivery methods are introduced at regular intervals (e.g., at the beginning of a semester or academic year). New, promising programs are implemented, but the program itself may evolve as staff members make numerous modifications. New cohorts of students arrive each semester or year. Program staff become more adept in delivering its elements. Researchers may compare the program’s effectiveness pre- and postintervention and sometimes to a single comparison group that did not participate in the program. Though benefit may be found, we do not know if the program worked equally well at all periods during its delivery.
Method
Choice of Participants, Assignment to Treatment, and Description of DE
Eligible study participants were first-time college students who attended a large, five-campus, suburban community college during one of the five consecutive terms from fall 2003 through winter 2005 and completed a placement test to determine if they were prepared for college-level English (n = 6,568). To assess program effects, we tracked DE and nondevelopmental education (NDE) students whose final grades were reported for their first, entry-level, college-level English course. During the study’s time period, 3,940 first-time students lacked placement scores and, thus, were excluded. (These missing data are primarily due to the large number of students who took and obtained American College Testing (ACT) test scores that exempted them from the placement test or those who either transferred or dropped out during the one-semester grace period when no placement test was required.) Also, DE students who failed to complete both the DE program and a first, college-level English class, and NDE students who did not finish a first, college-level English class were not included in our analyses. A final sample of 3,589 students participated in this study (1,058 DE students and 2,531 NDE students who completed college-level English).
Demographic characteristics, placement scores, course enrollment data, and course outcomes were extracted from the college’s student information system. Overall, the final sample consisted of approximately 54% female students. Because relatively few students classified themselves in minority racial or ethnic subgroups (18%), students were listed as either White (76.8%) or multiracial (African American, Hispanic, Asian American, or Other); the remaining 5.2% did not report an ethnic classification. On average, the DE group was marginally younger and had modestly higher percentages of multiracial and female students when compared to the NDE group.
Student demographic characteristics remained stable over the time frame of the study. We compared fall 2003 to fall 2004 student characteristics, the largest enrollment periods in our study, and found small, nonsignificant differences in the college-wide, average age or gender, race and ethnicity percentage, as well as full- or part-time enrollment (less than or equal to 1 percentage point difference, for each variable).
Students were assigned to the DE or NDE group after assessing each student’s reading and writing ability using a standardized COMPASS placement test (American College Testing 2003). After completing this test, each student obtained a composite placement score that ranged from 2 to 198; scores at or exceeding 150 classified a student as NDE. Students not assigned to DE were free to enroll in any of the 385 entry-level, first college-level, English course sections offered at five different campuses, during any study semester. Enrollment in any college-level English course was restricted for DE students who scored below the cut score of 150 until they successfully completed developmental English.
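The assignment rule can be stated compactly in code. The following is a minimal sketch in Python; the function and constant names are illustrative assumptions, and only the cut score (150) and the score range (2 to 198) come from the text.

```python
CUT_SCORE = 150         # cut score reported in the text
SCORE_RANGE = (2, 198)  # composite score range reported in the text


def assign_group(placement_score: int) -> str:
    """Return 'NDE' for scores at or above the cut score, otherwise 'DE'."""
    low, high = SCORE_RANGE
    if not low <= placement_score <= high:
        raise ValueError(f"score {placement_score} is outside {low}-{high}")
    return "NDE" if placement_score >= CUT_SCORE else "DE"


print(assign_group(149), assign_group(150))
```

Because assignment depends only on the placement score, students just below and just above 150 differ in treatment but should otherwise be similar, which is the comparison the RD design exploits.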
When this study began, the DE program had been in existence since 1998 and was considered well established. In the half decade leading up to and during our study, there was little change in a curriculum focused on improving DE students’ initially low, English skill levels. Instructional activities were aimed to yield gains in reading and writing, while simultaneously ameliorating skill deficiencies (e.g., grammar and spelling). Strategies for recognizing and correcting writing mistakes were also stressed. Ultimately, course content was designed not only to prepare DE students for the college-level NDE composition course but also to provide a foundation for success in all coursework.
DE instruction mainly involved lecturing, with occasional small group activities. When weaknesses were identified, instructors met individually with students. Each instructor had control over textbook selection, classroom activities, and instructional methods. However, across the five campuses, the English faculty established a uniform set of expectations that primarily emphasized students’ ability to communicate effectively through writing. Enrollment in DE courses was limited to 20 students and, during the time frame of our study, 75% of sections were taught by full-time instructors.
Measures of Achievement
We operationalized student achievement in two ways, using easily obtainable data from commonly found institutional records. First, we collected final, college-level, English grades for both DE and NDE students as our primary measure of achievement. There were 11 possible grades ranging from F (0.0) to A (4.0), including plus and minus grades. Second, we created a secondary, binary measure of achievement using each student’s final grade in college-level English. Consistent with this college’s policy, successful achievement (pass) included students who earned a grade of C (2.0) or better; those who scored less than C were judged to be unsuccessful. Though this dichotomous measure did not produce precisely the same pattern of results as grade, pass or fail has often been utilized as a policy-oriented indicator of program effectiveness within higher education. Typically, a minimum grade of C is required for a course to qualify for credit toward a degree, to satisfy prerequisite requirements for more advanced courses, or to enable students to meet course completion rules at another university or college.
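The secondary measure is a simple recoding of the primary one. As a minimal sketch (the function name is a hypothetical label, and grades are assumed to be stored as grade points on the 0.0 to 4.0 scale described above):

```python
PASS_THRESHOLD = 2.0  # grade of C, per the college policy described above


def passed(grade_points: float) -> int:
    """Return 1 (pass) for a grade of C (2.0) or better, otherwise 0 (fail)."""
    if not 0.0 <= grade_points <= 4.0:
        raise ValueError(f"grade {grade_points} is outside the 0.0-4.0 scale")
    return int(grade_points >= PASS_THRESHOLD)


# Hypothetical grade points, used only to show the recoding
print([passed(g) for g in (0.0, 1.0, 2.0, 3.0, 4.0)])
```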
Finally, we focused on grade as the single most immediate, compelling, and readily available measure of program benefit. Because NDE students could choose when to take college-level English, whereas DE students necessarily delayed enrollment by at least one semester, the time point for assessing grade was not consistent for all students. We address these two issues directly in our attrition analyses and in our analysis of selection by maturation, outlined subsequently.
Assessing the Variability of Program Effects
To evaluate consistency of program effects, we considered each student cohort as an individual RD study. We defined a cohort by grouping students from their initial term (semester) of enrollment. Each of the five, consecutive terms generated a cohort that varied considerably in size based on the overall number of first-time students entering college during that respective term. The fall terms (Terms 1 and 4) represented the two largest cohorts (2003: n = 1,540; and 2004: n = 1,323). The two winter terms (Terms 2 and 5) enrolled approximately the same number of participants (2004: n = 278; and 2005: n = 289). The single summer term (Term 3) contained the smallest cohort (2004: n = 159). Only a student’s first grade in college English was utilized, so no student appeared in more than one cohort. We combined the five, individual cohort discontinuity estimates to calculate an overall estimate of program effect, and we report cohort results for both individual terms and cumulatively, for these consecutive terms.
RD Design Validation
A valid application of the RD design requires that persons assigned to the treatment condition receive the treatment and those assigned to the control group do not. In our study, it was possible that the potential discontinuity in grade or pass/fail was influenced by a difference in the percentage of DE and NDE participants at or near the cut score (i.e., the two percentages should be quite comparable). An asymmetry is especially likely if treatment was viewed as relatively desirable or undesirable. In our research, participants assigned to DE might cross over to the NDE group. Evidence of such migration would likely be reflected in different densities of study participants near the cut score. Fortunately, we were able to provide a graphical representation of the density of study participants near the cut score and to test the significance of possible differences (e.g., McCrary 2008). The graphic also allowed us to determine if there were discontinuities in densities throughout the entire distribution of baseline, placement scores.
A discontinuity in density at the cut score might also occur if administrative testing procedures led to test score manipulation. Standardized placement tests were administered in the controlled setting of campus testing centers. Scores were computer generated and automatically entered into the college’s student information database without alteration by testing center staff. Immediately after completing the test, students were provided a hard copy of their results, and their assignment to either DE or NDE group based on the predetermined cut score. Taken together, these procedures made manipulation implausible.
Though knowledge of the precise value of the cut score was available, we have no knowledge of the degree to which particular students spent additional time preparing for the placement pretest based on their proximity to the cut point. Systematic and successful test preparation by students in the neighborhood near and to the left of the cut point might also alter the density on both sides of the cut score, reducing the density of DE students on the left and increasing the density of NDE students on the right. DE students successfully retaking the placement test could avoid DE and artificially increase the density of NDE students just above the cut score. Reviewing the college’s student information database, we found that of the 3,589 students in the sample, only 56 students completed the placement test more than once. Of the 56 test retakes, only 18 students increased their placement score beyond the cut point (18/3,589 = 0.5%).
In addition to differences in overall density of treatment and control students near the cut score, varying makeup of DE and NDE students near the cut score could potentially compromise an inference of program effectiveness. Thus, we explored the possibility that differences in demographic characteristics (age, gender, and ethnicity) of DE and NDE students, near the cut score, might be linked to study outcomes. We used both graphical displays and statistical tests to assess possible discontinuities in density for each of the three demographic aspects noted previously. A test of significance was conducted within the optimal bandwidth (OBW) using an Epanechnikov kernel, for 50 bootstrapped samples (with replacement), for each demographic variable. Nonsignificant differences between DE and NDE students, for each covariate, would provide evidence against selection bias in proximity to the cut score.
Parametric Estimates of Discontinuity Magnitude
The presence of a discontinuity is fundamental to ruling out many potential threats to internal validity in the RD design (Trochim 1990). For example, given a discontinuity, it is generally implausible that preexisting group differences (selection bias) could explain why there was a substantial change in outcome that “just happened” to occur at the cut score. Thus, the presence of a discontinuity is fundamental to a claim that an effect is attributable to treatment. However, recent advances in RD analytic approaches place a greater burden on the researcher to both demonstrate the existence of a discontinuity and to reliably estimate its magnitude. Explicit standards for exemplary RD research have also begun to emerge (e.g., Imbens and Lemieux 2008; Schochet et al. 2010). These contemporary, analytic yardsticks incorporate several, nonparametric strategies that allow the researcher to determine the degree of consistency between parametric and nonparametric estimates of the size of a discontinuity.
While parametric estimates are typically considered to produce the most reliable, single estimate of discontinuity, nonparametric approaches are meant to corroborate those from parametric analyses (Bloom 2012). Each researcher’s aim is to produce an unbiased estimate of the magnitude of a discontinuity that varies little across the two types of analyses. Ordinarily, these nonparametric estimates are focused near the cut score rather than over the entire domain of the independent, forcing variable.
In this article, our primary and secondary outcomes were based on parametric approaches. We first demonstrated the existence of a discontinuity and estimated its size using the regression approach recommended by Shadish, Cook, and Campbell (2002). We applied a linear model by regressing course grade (yi) on placement score (β1), a group membership (DE or NDE) variable (β2), and the interaction between placement score and group membership (β3) (see the regression model presented subsequently). Prior to the analysis, each student’s placement score was centered by subtracting the placement cut score from it. With the assignment variable centered in this way, the cut score falls at zero, so the coefficient on group membership (β2) estimates the discontinuity between regression lines at the cut score. To detect a possible nonlinear relationship between placement scores and achievement, we overfit the model by adding quadratic (β4) and cubic terms (β6), as well as each nonlinear term’s interaction with the group membership variable (β5 and β7).
In this parametric approach, we also calculated standard errors that were adjusted for clustering by placement score (Lee and Card 2008) and compared the adjusted and unadjusted findings. In this article, we report unadjusted standard errors (without clustering) since they were generally slightly larger and the more conservative of the two estimates. (Using robust, clustered standard errors changed three discontinuity estimates’ p values from significant at the .05 level to below .01. Furthermore, one discontinuity estimate switched from marginally significant to significant at the .01 level, and another estimate’s p value changed from nonsignificant to less than .10.) In its combined form, we tested the following model:
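Written out from the coefficient labels given above, the full (overfit) model can be sketched as follows. The symbols x_i (placement score minus the cut score), D_i (group indicator, assumed here to equal 1 for DE and 0 for NDE), and ε_i (error term) are notational assumptions; only yi and β1 through β7 are named in the text.

```latex
y_i = \beta_0 + \beta_1 x_i + \beta_2 D_i + \beta_3 x_i D_i
      + \beta_4 x_i^2 + \beta_5 x_i^2 D_i
      + \beta_6 x_i^3 + \beta_7 x_i^3 D_i + \varepsilon_i
```

Under this coding, every term except β0 and β2 D_i vanishes at the cut score (x_i = 0), so β2 is the estimated discontinuity in grade.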
Analysis of this linear regression model followed an iterative process. Beginning with the overfit linear model, we sequentially removed nonsignificant, nonlinear terms and their interactions. For example, if the model included nonsignificant cubic terms, we reduced the model by removing both the cubic term and its interaction. Subsequently, we reanalyzed the model and removed nonsignificant quadratic terms. The Group by Placement Score interaction term (β3) was also eliminated from either the individual term or the successive term models, if it was determined to be nonsignificant. The process of omitting or including higher order main effect and interaction terms was intended to produce an unbiased estimate of the treatment effect (Shadish, Cook, and Campbell 2002; Trochim 1984). Nevertheless, to assess the generality of our findings, we also report results of models containing quadratic and cubic variables.
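The reduced, linear form of the model (centered placement score, group indicator, and their interaction) can be illustrated with a short Python sketch. The data are synthetic and noise free, the coding DE = 1 is an assumption, and the small least-squares helper is illustrative only; it omits the significance tests that drove the iterative reduction and is not the software used in the study.

```python
def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


CUT = 150  # cut score reported in the text


def design_row(score):
    x = score - CUT                   # centered placement score
    d = 1.0 if score < CUT else 0.0   # assumed coding: DE = 1, NDE = 0
    return [1.0, x, d, x * d]         # intercept, b1, b2, b3 terms


# Synthetic data with a built-in discontinuity of 0.20 grade points
scores = list(range(100, 199, 2))
true_b = [2.0, 0.01, 0.20, 0.005]
grades = [sum(b * v for b, v in zip(true_b, design_row(s))) for s in scores]

b0, b1, b2, b3 = ols([design_row(s) for s in scores], grades)
print(round(b2, 3))  # estimated discontinuity at the cut score
```

Quadratic and cubic columns (and their products with the group indicator) could be appended to `design_row` to form the overfit model, then dropped when nonsignificant.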
Because we also constructed a binary outcome variable to assess student success (pass/fail), our second analytic approach utilized a logistic regression model. As with the linear regression model, we regressed the outcome variable on the placement score, group variable, nonlinear terms, and all corresponding interactions. The logistic regression model assessed the magnitude of a discontinuity at the cut score as a difference in the log odds that DE and NDE students passed college English.
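In its linear form, the logistic model can be sketched as follows. Here p_i denotes the probability that student i passes college English, x_i the centered placement score, and D_i the group indicator (assumed to equal 1 for DE); this notation is an assumption introduced for illustration, and the nonlinear terms enter exactly as in the linear regression model.

```latex
\ln\!\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_i + \beta_2 D_i + \beta_3 x_i D_i,
\qquad
\mathrm{OR} = e^{\beta_2}
```

At the cut score (x_i = 0), β2 is the difference in log odds between DE and NDE students, and exponentiating it gives the corresponding odds ratio.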
Students who were assigned to DE but opted to not enroll contributed to a “fuzzy” RD design (Bloom 2012). One method to gauge the extent of this noncompliance is to plot the probability of DE enrollment by placement score, for each group. We found a relatively low amount of noncompliance, with assignment to DE resulting in a high, average probability (M = .83) of DE enrollment for those scoring DE and almost zero probability (M = .01) of DE enrollment for students scoring NDE.
As has become common practice, we also assessed the impact of noncompliance by conducting an instrumental variable (IV) analysis (Bloom 2012; Calcagno and Long 2008; Imbens and Lemieux 2008; Lesik 2006). This IV analysis was conducted with both of our dependent variable operationalizations, using two-stage least squares regression. The first stage regressed a binary, DE compliance variable (enrolled in DE or not) onto the dichotomous assignment to DE or to NDE and onto each student’s placement score to obtain a set of predicted values. The predicted values from Stage 1 (in place of the treatment indicator used in the primary parametric model), the placement scores, and the placement score by group interaction variables were used in the second-stage regression model to predict the dependent variables in this study.
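The two-stage logic can be sketched in Python on synthetic, noise-free data. Several simplifications are assumptions made for illustration: the placement score by group interaction is omitted from the second stage, roughly one in six DE-assigned students is made a noncomplier, and the helper functions are hypothetical rather than the routines used in the study.

```python
def solve(a, b):
    """Solve the linear system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


CUT = 150
scores = list(range(100, 199))
x = [s - CUT for s in scores]                         # centered placement score
assigned = [1.0 if s < CUT else 0.0 for s in scores]  # assigned to DE
# Synthetic noncompliance: every sixth DE-assigned student does not enroll
enrolled = [a if i % 6 else 0.0 for i, a in enumerate(assigned)]
# Synthetic outcome: enrolling in DE raises the grade by 0.20
grade = [2.0 + 0.01 * xi + 0.20 * t for xi, t in zip(x, enrolled)]

# Stage 1: regress enrollment on assignment and placement score
g = ols([[1.0, a, xi] for a, xi in zip(assigned, x)], enrolled)
fitted = [g[0] + g[1] * a + g[2] * xi for a, xi in zip(assigned, x)]

# Stage 2: regress the outcome on predicted enrollment and placement score
b = ols([[1.0, f, xi] for f, xi in zip(fitted, x)], grade)
iv_effect = b[1]
print(round(iv_effect, 3))
```

Running the stages by hand like this yields the two-stage least squares point estimate, but the second-stage standard errors would need correction; dedicated IV routines handle that adjustment.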
Prior to analyzing either statistical model, regression diagnostics indicated the relationship between the assignment and outcome variables violated the homoscedasticity and normality assumptions. Inspection of a residual plot for COMPASS and grade suggested that an increase in COMPASS score was accompanied by an increased variance in grades. Additional screening suggested that COMPASS scores had a moderate, negative skew. We reduced this skew by reflecting (adding one to the highest placement score value and subtracting from this sum each student’s placement score) and then taking the square root of that resulting score. Transforming the COMPASS variable successfully reduced skew and met the normality and constant error variance assumptions for analysis of regression models (Fox 2008).
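The reflect-and-square-root transformation can be sketched as below. The function name is illustrative, and taking "the highest placement score value" to be the largest observed score (rather than the scale maximum) is an assumption.

```python
import math


def reflect_sqrt(scores):
    """Reflect scores about (max + 1), then take the square root."""
    top = max(scores)  # assumed: highest observed placement score
    return [math.sqrt(top + 1 - s) for s in scores]


transformed = reflect_sqrt([2, 150, 198])
print(transformed)
```

Reflection reverses the ordering of scores, so higher original placement scores map to smaller transformed values; the sign of any slope on the transformed variable must be read accordingly.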
To summarize, our parametric analyses established an overall estimate of discontinuity across five academic terms and for each individual term. These estimates were calculated for both the primary (grade) and secondary (pass/fail) dependent variables. We also parametrically estimated a discontinuity for each successive aggregate of Terms 1–2, 1–3, 1–4, and 1–5, for both dependent variables, and for policy-relevant subgroups based on gender and ethnicity. In addition, in the Discussion section, we report the cumulated Terms 2–5, 3–5, and 4–5, to further evaluate the consistency of the grade discontinuity estimate.
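The cumulative windows can be made concrete with the cohort sizes reported earlier. The sketch below only tallies how many students enter each aggregate; the variable names are illustrative assumptions.

```python
from itertools import accumulate

# Cohort sizes by term, as reported in the text (Terms 1-5)
COHORT_N = {1: 1540, 2: 278, 3: 159, 4: 1323, 5: 289}

terms = sorted(COHORT_N)
sizes = [COHORT_N[t] for t in terms]

# Forward accumulation, always starting from Term 1
forward = {f"1-{t}": n for t, n in zip(terms[1:], list(accumulate(sizes))[1:])}

# Backward windows, always ending at Term 5
backward = {f"{t}-5": sum(sizes[t - 1:]) for t in (2, 3, 4)}

print(forward)
print(backward)
```

Each window defines the sample on which a separate RD model is estimated, so the precision of the cumulative estimates grows with the window size.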
Nonparametric Estimates of Discontinuity Magnitude
Generating unbiased discontinuity estimates in the RDD relies heavily on correctly modeling the functional form of the relationship between the placement and achievement variables. Though not a substitute for parametric estimates, nonparametric analytic approaches complement parametric strategies since nonparametric techniques do not assume an a priori functional form relating the independent and dependent variables (Lee and Lemieux 2010). One nonparametric analytic technique that has been recommended for use with RD is local linear regression (LLR; Bloom 2012; Imbens and Lemieux 2008; Lee and Lemieux 2010).
Since LLR discontinuity estimates are sensitive to the size of bandwidth used, we utilized procedures developed by Imbens and Kalyanaraman (2009) for identifying an OBW around each COMPASS score. To reduce the degree of any boundary bias near the cut point (which is critical for an unbiased estimate of the discontinuity at the two boundaries on either side of the cut point), an Epanechnikov kernel function (roughly, an inverted U) was utilized. (Kernels are weighting functions used to calculate an average within each band; values at or near the midpoint of each interval are given the greatest weight in this calculation.)
Thus, we established a single OBW that creates a conditional distribution (conditional on the chosen x value) of y values for each focal point (COMPASS score). Within each band, using the Epanechnikov kernel, we conducted an LLR to produce a single estimate for each x value. This OBW minimizes the combination of bias and variability of the estimate at each x. If the number of scores within the band is relatively large, the estimate will be less variable (bigger sample size) but more biased (the set of estimates will look less like the population regression line); if the number of scores within a band is relatively small, each estimate will be more variable (smaller n) but less biased (the smoothed, nonparametric line will look more like the population regression line).
In summary, LLR uses linear regression to produce a single estimate within each band, for each x value. The set of these points establishes a “line” that allows one to visually examine its resulting smoothness and to determine the existence of a discontinuity at the cut score. The actual size of the discontinuity was established from an LLR immediately to the left of the cut score and a second LLR immediately to the right of the cut score; the estimate was based on a calculation of the difference in the intercepts where each of these two LLRs would intersect the vertical line, at the cut score.
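The boundary calculation can be sketched in Python as two kernel-weighted linear fits, one on each side of the cut score, with the discontinuity taken as the difference between their intercepts at the cut. The data are synthetic, the bandwidth of 10 points is a placeholder rather than an optimal bandwidth, and the sign convention (DE side minus NDE side) is an assumption.

```python
def epanechnikov(u):
    """Epanechnikov kernel: 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def boundary_intercept(xs, ys, h):
    """Fit a kernel-weighted line to (xs, ys) and return its value at x = 0.

    xs are placement scores centered at the cut score, so x = 0 is the cut.
    """
    w = [epanechnikov(x / h) for x in xs]
    sw = sum(w)
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    slope = (sw * swxy - swx * swy) / (sw * swxx - swx * swx)
    return (swy - slope * swx) / sw


def llr_discontinuity(xs, ys, h):
    """Left-side (DE) intercept minus right-side (NDE) intercept at the cut."""
    left = [(x, y) for x, y in zip(xs, ys) if x < 0]
    right = [(x, y) for x, y in zip(xs, ys) if x >= 0]
    lx, ly = zip(*left)
    rx, ry = zip(*right)
    return boundary_intercept(lx, ly, h) - boundary_intercept(rx, ry, h)


# Synthetic, noise-free data with a 0.20 jump for students below the cut
xs = list(range(-50, 49))
ys = [2.0 + 0.01 * x + (0.20 if x < 0 else 0.0) for x in xs]

h = 10.0  # placeholder bandwidth, not a data-driven optimum
estimates = {f: llr_discontinuity(xs, ys, f * h) for f in (0.5, 1.0, 2.0)}
print({f: round(v, 3) for f, v in estimates.items()})
```

With real data the three bandwidth multipliers would generally give different estimates; agreement among them is what indicates that the result is not sensitive to the bandwidth choice.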
Using these data-dependent procedures, we report a discontinuity estimate based on this OBW, as well as discontinuity estimates for both larger (200%) and smaller (50%) bandwidths to assess the sensitivity of this optimal choice. These three LLR analyses, with differing bandwidths, permitted us to base the discontinuity estimate on data near the cut score, where treatment and control groups were most similar, and to corroborate parametric estimates that used all data points in the sample.
To test the robustness of the single estimate of discontinuity based on this OBW and the Epanechnikov kernel, Stata-based, bootstrapping procedures were utilized to construct 50 random (with replacement) replicates (Nichols 2007a, 2007b), and a test of significance was calculated. To assess the sensitivity of these findings, we repeated this significance test for both the 50% and 200% bandwidths.
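The bootstrap step can be illustrated as follows. This sketch assumes a simple case-resampling bootstrap and a z-type ratio of the estimate to its bootstrapped standard error; the study's Stata routines may differ in detail. The data are simulated and every name in the block is hypothetical.

```python
import numpy as np

def rd_estimate(x, y, cut, h):
    """LLR discontinuity at the cut (left minus right) with Epanechnikov weights."""
    def intercept(mask):
        xc = x[mask] - cut
        w = np.clip(0.75 * (1.0 - (xc / h) ** 2), 0.0, None)  # zero outside the band
        sw = np.sqrt(w)
        X = np.column_stack([np.ones_like(xc), xc])
        return np.linalg.lstsq(X * sw[:, None], y[mask] * sw, rcond=None)[0][0]
    return intercept(x < cut) - intercept(x >= cut)

def bootstrap_se(x, y, cut, h, reps=50, seed=0):
    """Standard deviation of the estimate across `reps` resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = x.size
    draws = np.empty(reps)
    for r in range(reps):
        idx = rng.integers(0, n, size=n)        # resample students with replacement
        draws[r] = rd_estimate(x[idx], y[idx], cut, h)
    return draws.std(ddof=1)

# Simulated data (hypothetical): true jump of 0.20 at a centered cut score of 0.
rng = np.random.default_rng(1)
score = rng.uniform(-5, 5, 2000)
grade = 2.0 + 0.10 * score + 0.20 * (score < 0) + rng.normal(0, 0.2, score.size)

obw = 1.73
results = {}
for share in (0.5, 1.0, 2.0):                   # 50%, 100%, 200% of the OBW
    h = share * obw
    est = rd_estimate(score, grade, 0.0, h)
    se = bootstrap_se(score, grade, 0.0, h)
    results[share] = (est, se, est / se)        # estimate, bootstrap SE, z-type ratio
    print(share, results[share])
```

Fifty replicates, the number used in the study, yield a fairly coarse standard error; a larger number of repetitions would stabilize it at additional computational cost.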
Results
Parametric Results for Individual and Cumulative Terms
The primary parametric findings for individual and successive terms are presented in column 1 of Table 1. Since the quality of study conclusions is contingent on the existence of a linear relationship between the assignment variable and outcome, for each analysis we also tested the impact of including quadratic and cubic terms in our RD models. For the results reported in column 1 of this table, all nonlinear main effect and interaction terms were found to be nonsignificant and were deleted from the linear model.
Table 1. Discontinuity Estimates, by Term and Cumulative Terms, for RDD of College-Level English Achievement.
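The comparison of linear, quadratic, and cubic specifications can be sketched with ordinary least squares. The column ordering below is an assumption chosen so that the treatment dummy's coefficient sits in position 2, matching the β2 notation; the simulated data and function names are hypothetical and are not drawn from the study.

```python
import numpy as np

def rd_design(score_c, de, degree=1):
    """Columns: intercept, centered score, DE dummy, score x DE,
    then each higher power of the score and its interaction with DE."""
    cols = [np.ones_like(score_c), score_c, de, score_c * de]
    for p in range(2, degree + 1):
        cols += [score_c**p, (score_c**p) * de]
    return np.column_stack(cols)

def fit_rd(score_c, de, grade, degree=1):
    """OLS coefficients and conventional standard errors for the RD model."""
    X = rd_design(score_c, de, degree)
    beta, *_ = np.linalg.lstsq(X, grade, rcond=None)
    resid = grade - X @ beta
    dof = X.shape[0] - X.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

# Simulated data (hypothetical): DE assigned below a centered cut score of 0.
rng = np.random.default_rng(0)
score_c = rng.uniform(-5, 5, 1500)
de = (score_c < 0).astype(float)
grade = 2.0 + 0.10 * score_c + 0.20 * de + rng.normal(0, 1.0, score_c.size)

for degree in (1, 2, 3):                 # linear, quadratic, cubic specifications
    beta, se = fit_rd(score_c, de, grade, degree)
    print(degree, round(beta[2], 3), round(se[2], 3))  # discontinuity and its SE
```

In a sketch of this kind, the higher-order terms would be dropped when their coefficients are nonsignificant, which mirrors the model-trimming rule described in the text.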
Note. RDD = regression discontinuity design; RD-IV = regression discontinuity instrumental variable; OR = odds ratio.
Standard errors are reported in parentheses; 95% confidence intervals for odds ratios are reported in brackets. Column 1 displays results for the primary linear model. Column 2 displays instrumental variable results. Column 3 includes model variables in column 1, plus quadratic variables. Column 4 includes all model variables from column 3, plus cubic variables. Columns 5 through 8 are analogous to columns 1 through 4, except that odds ratios are used.
a Reported to three decimal places to more clearly show differences between columns 1 and 2.
† p < .10. (two-tailed) *p < .05. (two-tailed) **p < .01. (two-tailed)
Analyzing the aggregate of all five terms, we found significant gains in DE grade (β2 = .20, p < .01). Next, we evaluated program effects by sequentially accumulating cohorts, in each case beginning with Term 1. The aggregation of Terms 1 and 2 produced a significant discontinuity (β2 = .26, p < .01), indicating that DE students received a grade benefit from the program. When the first three terms were included in the RD, we also observed a significant gain in grade for DE students (β2 = .23, p < .05). The accumulation of Terms 1 through 4 also resulted in a significant, positive program effect (β2 = .21, p < .01).
Results for the five individual terms are also displayed in column 1 of Table 1. When analyzed by individual term, DE students from Term 1 (β2 = .24, p < .01) and Term 2 (β2 = .54, p < .05) demonstrated a statistically significant grade gain from the DE program. The three remaining individual cohorts did not exhibit a significant benefit in grade after completing the DE program. Thus, despite finding that the majority of terms did not reflect a significant discontinuity, we did note evidence for positive program benefit for DE students when individual terms were cumulated.
In Table 1, columns 3 and 4 include discontinuity estimates for the overfit quadratic and cubic models for each individual and aggregate term combination. With the exception of Term 5, we found that when polynomial terms and their interactions were included, estimates of the discontinuity were similar in direction and comparable in magnitude to those found in the linear model.
As a method of illustrating cohort variation, the discontinuity results for the individual terms are displayed chronologically in Figure 1a. Here, the height of each diamond represents the magnitude of the discontinuity (i.e., the diamond height corresponds to the regression coefficients, β2, from column 1 of Table 1). The top of the diamond was the point at which the DE group’s regression line met the vertical line at the cut score (Δ), whereas the bottom of the diamond was the position where the NDE regression line intersected the vertical line at the cut score (†). Inspection of these data demonstrates that, across terms and different student cohorts, the DE group’s regression line was consistently above the regression line for the NDE group, at the cut score. (For Term 3, the DE group’s regression line was only slightly above [β2 = .004] the NDE group’s regression line, and this small value concealed the diamond.)

Figure 1. (a) Discontinuity by term. (b) Discontinuity by cumulative term.
Figure 1b illustrates estimates of the program effect as RDD studies were accumulated. As we found for each term, there was a consistent pattern of achievement gain for DE students with the addition of each ensuing term. The cumulative approach also reflected a monotonic narrowing of the confidence intervals (CIs), indicating that the estimate of the program effect became increasingly more precise with the addition of each RDD study and larger, aggregate sample size.
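The cumulative strategy can be sketched by refitting the linear RD model each time a cohort is added. The cohort sizes, the true effect, and all names below are hypothetical; the sketch only illustrates why the confidence interval tends to narrow as the aggregate sample grows.

```python
import numpy as np

def rd_coef_and_se(score_c, de, grade):
    """Treatment-dummy coefficient and its OLS standard error from the linear RD model."""
    X = np.column_stack([np.ones_like(score_c), score_c, de, score_c * de])
    beta = np.linalg.lstsq(X, grade, rcond=None)[0]
    resid = grade - X @ beta
    cov = (resid @ resid / (len(grade) - X.shape[1])) * np.linalg.inv(X.T @ X)
    return beta[2], np.sqrt(cov[2, 2])

def cumulative_estimates(cohorts):
    """cohorts: list of (score_c, de, grade) tuples in term order.
    Returns (terms included, estimate, CI lower, CI upper) for Terms 1..t."""
    out = []
    for t in range(1, len(cohorts) + 1):
        score_c, de, grade = (np.concatenate(p) for p in zip(*cohorts[:t]))
        b, se = rd_coef_and_se(score_c, de, grade)
        out.append((t, b, b - 1.96 * se, b + 1.96 * se))
    return out

# Five simulated cohorts (hypothetical sizes), each with a true jump of 0.20.
rng = np.random.default_rng(3)
cohorts = []
for n in (400, 300, 200, 300, 400):
    s = rng.uniform(-5, 5, n)
    d = (s < 0).astype(float)
    g = 2.0 + 0.10 * s + 0.20 * d + rng.normal(0, 1.0, n)
    cohorts.append((s, d, g))

for t, b, lo, hi in cumulative_estimates(cohorts):
    print(f"Terms 1-{t}: estimate={b:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```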
The linear pattern associated with DE and NDE can be seen in Figure 2, which displays plots of average grade, binned by transformed placement scores (width = 0.5). A discontinuity can be clearly observed at the cut score, with data points closely neighboring imagined straight lines. The additional variability to the far left in this graphic reflects the small sample size for low-scoring DE students.
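A binned-means summary of this kind might be computed as below. The sketch assumes bins of width 0.5 aligned so that no bin straddles a centered cut score of 0; the function name and the toy values are hypothetical.

```python
import numpy as np

def binned_means(score_c, grade, width=0.5):
    """Average grade within bins of the centered placement score.
    Bin edges fall on multiples of `width`, so the cut score (0) is always an edge."""
    score_c = np.asarray(score_c, dtype=float)
    grade = np.asarray(grade, dtype=float)
    bins = np.floor(score_c / width).astype(int)
    ids = np.unique(bins)
    mids = (ids + 0.5) * width                       # bin midpoints for plotting
    means = np.array([grade[bins == b].mean() for b in ids])
    counts = np.array([(bins == b).sum() for b in ids])
    return mids, means, counts

# Toy values (hypothetical): two students in each of the first two bins, one in the third.
mids, means, counts = binned_means([-0.4, -0.1, 0.1, 0.3, 0.6], [1, 3, 2, 4, 5])
print(mids, means, counts)
```

The bin counts are returned alongside the means because sparsely populated bins, such as those at the far left of Figure 2, will show more variability.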

Figure 2. Average grade in college-level English by transformed binned placement scores (width = 0.5) and study condition.
Complementary, Nonparametric Results
To minimize any subjectiveness associated with bandwidth selection, we used the procedure outlined previously to identify an OBW based on study data. The procedure determined that the OBW for our data was 1.73 transformed placement score points. Applying the OBW within LLR and using an Epanechnikov kernel, we estimated the discontinuity (D) at the cut score (.18) for Terms 1–5 to be very close to the parametric estimate (.20). Subsequently, we bootstrapped this estimate and, with 50 repetitions, found the discontinuity to represent a marginally significant (p < .10) gain in DE achievement (Figure 3a).

Figure 3. (a) Nonparametric discontinuity estimates (D), by multiple bandwidths (BW), using local linear regression estimates for each transformed placement score. Bootstrapped standard errors (for 50 repetitions) are in parentheses, using Epanechnikov kernel with optimal bandwidth = 1.73. Each “line” uses predicted means within each band, across placement scores. †p < .10. **p < .01. (b) Density plots of the distribution of students based on transformed scores. The sum of observations in binned placement scores is plotted, with standard error lines, against the midpoint of each bin.
As nonparametric regression estimates can vary with different bandwidths, we reestimated the treatment effect, first using half, then twice the OBW, and found very robust estimates of the discontinuity. The LLR for both 200% (D = .19, p < .01) and 50% (D = .23, p < .10) of the OBW resulted in quite comparable, positive achievement gains for DE students. Therefore, the sensitivity analyses substantiated overall program effectiveness, both graphically and statistically, with the nonparametric estimates (range .18 to .23) closely neighboring the parametric estimate (.20). Within each of the three graphs in Figure 3a, we note clear discontinuities that are comparably sized, regardless of bandwidth. Taken together, the nonparametric findings reveal a consistent pattern and magnitude of program benefit that closely mirror the overall, parametric results.
Dependent Variable Consistency: Grade and Pass/Fail
Our secondary operationalization of student achievement allowed us to determine if the pattern of results reported for grades was consistent with the pattern for pass/fail. As with the linear regression model for grade, we systematically confirmed that the inclusion of nonlinear or interaction terms was unnecessary since each higher level component of the model was nonsignificant.
Measures of student success (pass/fail) based on cumulative terms can be found in column 5 of Table 1. When Terms 1–5 were cumulated, the odds that DE students passed college English were 1.56 times the odds expected had they not completed DE (odds ratio [OR] = 1.56, 95% CI [1.08, 2.26]). As with grade, we assessed the successive, additive impact of subsequent terms. Beginning with the cumulation of Term 1 and Term 2, there was a significant discontinuity (OR = 1.70, 95% CI [1.02, 2.82]). When adding the third term (Terms 1–3), we found marginal evidence (p = .056) of increased odds that students passed by completing the program (OR = 1.62, 95% CI [0.99, 2.66]). Significant discontinuities were also found with the addition of Term 4 (OR = 1.57, 95% CI [1.07, 2.30]). Thus, we found the pattern of consistent significance to be similar for pass/fail and for grade.
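The relation between a logistic regression coefficient, its odds ratio, and the reported intervals can be checked with a short calculation. The sketch assumes Wald-type 95% intervals (coefficient plus or minus 1.96 standard errors on the log-odds scale), which is a common default but is not confirmed by the text; the function names are hypothetical.

```python
import math

def odds_ratio_ci(coef, se, z=1.96):
    """Odds ratio and Wald interval from a log-odds coefficient and its standard error."""
    return math.exp(coef), math.exp(coef - z * se), math.exp(coef + z * se)

def coef_and_se_from_ci(odds_ratio, lower, upper, z=1.96):
    """Back out the log-odds coefficient and standard error from a reported OR and CI."""
    return math.log(odds_ratio), (math.log(upper) - math.log(lower)) / (2 * z)

def two_sided_p(z_stat):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z_stat) / math.sqrt(2))

# Terms 1-5 (values from the text): OR = 1.56, 95% CI [1.08, 2.26].
coef, se = coef_and_se_from_ci(1.56, 1.08, 2.26)
print(odds_ratio_ci(coef, se))

# Terms 1-3 (values from the text): OR = 1.62, 95% CI [0.99, 2.66].
coef3, se3 = coef_and_se_from_ci(1.62, 0.99, 2.66)
print(two_sided_p(coef3 / se3))
```

Under the Wald assumption, the interval reconstructed for Terms 1–5 agrees with the reported bounds to rounding, and the p-value recovered for Terms 1–3 lands close to the reported .056, which suggests the reported figures are internally consistent.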
Term-by-term results of the logistic regression are also noted in column 5 of Table 1. None of the five terms was found to have a significant discontinuity at the cut score. With the exception of the nonsignificance of individual Terms 1 and 2, these findings are consistent with the linear regression results for grade. Similar to the previous linear regression results, however, we did note a consistent pattern whereby DE students achieved higher grades at the cut score than NDE students, for all terms.
As with the models that used grade as our dependent variable operationalization, for the logistic regression models we included quadratic and cubic polynomials and their interactions (columns 7 and 8, Table 1). The overall pattern (15 of the 18 estimates had odds greater than 1.00) indicated that DE students’ odds of passing college-level English increased due to DE participation, though only one of the higher order models approached the traditional .05 level of significance.
We also assessed the impact of DE students’ noncompliance with treatment assignment on our discontinuity estimates using IV analyses (Table 1, columns 2 and 6). When contrasted to the primary parametric results (column 1), the IV results (column 2) were quite comparable, yielding almost all slightly larger (only once slightly smaller) discontinuity estimates. All cumulative results were greater than .22 and, when the difference became large (greater than .03), the IV estimate was always larger. The IV results in column 6 for ORs were consistently smaller than those in column 5 (with two exceptions), but the pattern of statistical significance (as was true for the analogous comparisons for grade) was essentially the same. Therefore, even with noncompliance, all IV analyses produced positive discontinuity coefficients, and each of the 10 significant coefficients from the primary parametric results in columns 1 and 5 remained at least marginally significant.
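One common way to adjust an RD estimate for noncompliance is two-stage least squares, with assignment by the cut score serving as the instrument for treatment actually received. The sketch below illustrates that general idea only; the study's exact IV specification is not described in this section, and the data, the 10% noncompliance rate, and the function name are hypothetical.

```python
import numpy as np

def iv_discontinuity(grade, received, assigned, score_c):
    """Two-stage least squares point estimate of the treatment effect.
    Stage 1 predicts treatment received from assignment and the centered score;
    stage 2 regresses grade on the predicted treatment and the centered score.
    Note: standard errors from a naive second-stage OLS would be incorrect,
    so only the point estimate is returned."""
    exog = np.column_stack([np.ones_like(score_c), score_c])
    Z = np.column_stack([exog, assigned])
    first = np.linalg.lstsq(Z, received, rcond=None)[0]
    received_hat = Z @ first
    X2 = np.column_stack([exog, received_hat])
    second = np.linalg.lstsq(X2, grade, rcond=None)[0]
    return second[2]

# Simulated data (hypothetical): about 10% of students assigned to DE do not take it.
rng = np.random.default_rng(2)
score_c = rng.uniform(-5, 5, 1000)
assigned = (score_c < 0).astype(float)
received = assigned * (rng.random(score_c.size) > 0.10)
grade = 2.0 + 0.10 * score_c + 0.25 * received     # noise-free outcome, effect = 0.25

print(round(iv_discontinuity(grade, received, assigned, score_c), 3))
```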
Discontinuities by Gender and Race/Ethnicity
In a set of supplementary analyses, we explored differences in program effects by gender and race/ethnic subgroups. For this analysis, we conducted subgroup analyses using both operationalizations of achievement. In the interest of brevity, we present aggregate results based on all five terms.
The use of linear regression for grade and logistic regression for success produced different findings for DE on male (n = 1,509) and female (n = 1,923) students. Male students had significant achievement gains for both linear (β2 = .36, p < .01) and logistic models (OR = 1.77, 95% CI [1.04, 3.04]). In contrast, female students did not show a significant discontinuity utilizing either operationalization.
Both analytic models produced similar results for race and ethnicity. A marginally significant effect was found for White (n = 2,756) students when applying both linear (β2 = .15, p < .10) and logistic regression (OR = 1.50, 95% CI [0.97, 2.32]). On the other hand, multiracial (n = 646) students’ results reflected no significant benefit in achievement after completing the program, for both kinds of analysis.
Dispelling Additional Threats to Causal Inference
To assess the possibility of placement score manipulation, we first analyzed the smoothness of placement score density. Using the procedure outlined by McCrary (2008), in Figure 3b we display plots of the density functions along with two standard error bands, across the domain of the COMPASS test for the initial (Figure 3b.1) and final samples (Figure 3b.2), including the critical area near the cut score, for both DE and NDE students. As shown, the two density distributions appear quite symmetric relative to the vertical line at the cut score. Furthermore, while the density of DE placement scores was slightly below the density of the NDE students at the cut score for both the initial (log difference in height = .012) and final (log difference in height = .061) samples, neither of these differences was sufficiently large to reject the null hypothesis for a discontinuity between DE and NDE placement score densities (initial sample: t = .21, p > .05; final sample: t = .01, p > .05). Collectively, the objective, administrative testing procedures, the visual symmetry and smoothness of the density plots, and the nonsignificance of tests for a discontinuity in density provided reassuring evidence that manipulation of the placement score had not occurred.
An inferential flaw may occur when groups are maturing at different rates (a threat to internal validity termed “selection by maturation”). Since a DE student cannot take college English as a first-term freshman, selection by maturation is distinctly plausible (before taking college English, DE students may have gained valuable success-in-college skills during their first term, while NDE students were quite likely to take regular English in their first term).
To explore whether selection by maturation could be ruled out, we determined the number of terms between initial college enrollment and completion of college English, for both DE and NDE students. For DE students, we also computed the number of terms between DE completion and finishing the first college-level English course. We expected that if selection by maturation compromised our conclusions, we would have found statistically significant correlations between number of terms in college and college English grade or number of terms between DE and NDE course completion and college English grade. Neither of these temporal associations was significant. Together, these two sets of analyses substantiate the general argument for DE’s effectiveness.
Another potential weakness of the RDD approach might emerge when participants selected for the study do not follow protocol (e.g., some students assigned to DE may not take DE classes) or are withdrawn from the study before receiving a grade in regular English. If this attrition is differential (students who attrit from DE and NDE groups have differing characteristics and the characteristics of these dropouts are related to the dependent variable), study conclusions are suspect.
Differential attrition was evaluated by comparing the characteristics of the DE and NDE students in the initial sample to those in the final sample, and also by exploring possible differences between all eligible DE students and DE completers, across race and ethnic groups, age, gender, and placement scores. Contrasting the initial and final samples, for both the DE and NDE groups, we found no significant difference in average age. Differences in racial composition and average placement scores for the initial and final sample were nonsignificant for NDE students. The final DE sample did include a significantly greater percentage of White (8.2%) and female (7.5%) students when compared to the initial sample. On a range of 196 points, the final DE sample scored significantly higher on the placement test than the initial DE sample, by an average of 6.8 points.
We also compared all eligible DE students to the entire group of DE completers and found no age difference. DE completers were slightly more likely to be White and female than all eligible DE students (both were 2.8 percentage points higher). When compared to DE completers, all eligible DE students’ placement scores were significantly lower by an average of 5.7 points. With regard to any potential impact of overall differential attrition or attrition within DE completers for these demographic variables, when taken together the modest and insignificant differences found in these supplementary analyses suggested a low likelihood of influence on our study inferences.
A potential threat to the validity of the causal inference for DE’s effectiveness may also emerge if characteristics of the DE and the NDE group differ near the cut score. That is, if nonequivalence for many covariates related to the dependent variable occurs near the cut score, inference regarding program effectiveness is in jeopardy. To assess this threat, we applied the nonparametric procedures previously used to ascertain the possible presence of a discontinuity, now substituting student demographic variables age, gender, and ethnicity as outcomes (Bloom 2012).
Both the graphic display and the discontinuity estimates for density of student age (D = .43, p > .05) and sex (D = −.02, p > .05) indicated equivalence between the treatment and control group at the cut score. When race was considered, we noted an asymmetry in density at the cut score (D = .12, p < .001; the DE group had a significantly lower density of multiracial students at the cut score). To determine whether these different racial densities were related to grade, we conducted a sensitivity analysis using the familiar three bandwidths and found no significant differences in mean grade between DE and NDE multiracial students, 50% OBW t(151) = .66, p > .05; OBW t(297) = .02, p > .05; 200% OBW t(522) = .43, p > .05. Also, in separate parametric analyses, we divided the sample by racial categories, estimated discontinuities for each, and compared the discontinuity coefficients between White and multiracial students. We found that the discontinuity coefficient for White students was not significantly different from the discontinuity coefficient for multiracial students (Wald χ2 = .26, p > .05). In concert, these supplementary analyses demonstrated that, even though the DE group contained a smaller proportion of multiracial students at the cut score than did the NDE group, this difference was not related to level of achievement and the discontinuity estimates for the two racial subgroups were comparable; thus, the difference was not likely to affect our estimate of the discontinuity.
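A comparison of two subgroup coefficients of this kind is often carried out with a one-degree-of-freedom Wald statistic. The sketch below assumes the two subgroup estimates are independent, which may not match the exact test used in the study; the coefficient and standard-error inputs in the first call are hypothetical, while the value .26 in the second call is the statistic reported in the text.

```python
import math

def chi2_1df_pvalue(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def wald_difference(b1, se1, b2, se2):
    """Wald chi-square for H0: b1 == b2, treating the two estimates as independent."""
    chi2 = (b1 - b2) ** 2 / (se1**2 + se2**2)
    return chi2, chi2_1df_pvalue(chi2)

# Hypothetical coefficients and standard errors, for illustration only.
print(wald_difference(0.30, 0.10, 0.10, 0.10))

# p-value implied by the Wald statistic reported in the text (chi-square = .26).
print(chi2_1df_pvalue(0.26))
```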
Discussion
In this study, we found modestly positive measures of discontinuity that were consistent across cohorts, based on different outcome measures for both parametric and nonparametric estimates. While findings were variable across individual semesters, a favorable, aggregate impact was apparent using the cumulative approach of individual replicates. Thus, our conclusions favoring DE were based upon multiple analyses using differing levels of inquiry and operationalizations of treatment effectiveness.
Substantive Findings
For the term-by-term RDDs, using parametric estimates, we found a consistent pattern whereby DE students benefited from completing developmental English. During each of the five terms, DE students experienced higher achievement at the cut score when compared to NDE students. The magnitude of the discontinuity was comparable across semesters based on two different operationalizations of program outcomes, each of which was adjusted for noncompliance (grade range: .004 to .54, IV range: .005 to .67; pass/fail OR range: 1.23 to 2.27, IV range: 1.18 to 3.52). Our nonparametric analyses substantiated the magnitude of the discontinuity estimate from parametric procedures when based on all five terms (.20). Overall, we estimated DE increased achievement by approximately one fifth to one half a grade point for those students bordering the cut score. Subset analysis revealed significant or near significant discontinuities for males and White students, on both primary and secondary measures.
The cumulative results also reflected a pattern of positive achievement attributable to DE. As one would expect from increasing sample size and decreasing standard errors, outcomes reported by consecutive terms were associated with successively narrower discontinuity CIs (the range for grade, .20 to .26, was smaller than the range for individual terms; similarly, the OR range for pass/fail was 1.56–1.70, again narrower than for term-by-term results). Thus, using multiple conceptualizations of course achievement, successive cohort samples, parametric and nonparametric analyses, and multiple years of data, we noted quite consistent patterns of program benefit.
A typical, term-by-term approach might consider each semester as an independent RDD study and either accept or reject the null hypothesis. We found that only two (Terms 1 and 2) of the five terms resulted in a significant program effect (column 1 of Table 1 and Figure 1a). Relying solely on a vote-counting strategy from null hypothesis statistical testing (e.g., Howard, Maxwell, and Fleming 2000) would have led to a different conclusion regarding program efficacy, since effects found in three of the five terms were not significant. Policy makers faced with divergent term-by-term RDD data might conclude that the preponderance of evidence indicated program ineffectiveness and would likely recommend programmatic change or elimination. Instead, using a cumulative RD approach provides a more precise estimate of program effects, narrower CIs, and tests of interim and overall significance.
Akin to our approach, Kumar and Geraci (2012) advise education administrators to use cumulative and comparative practices when assessing the quality of distance education. They recommend that policy makers ask questions such as “How did this semester compare to historical data … [and] … to previous semesters?” (p. 13). Answering these questions requires cumulative and time-defined data that permit researchers to “Compare this semester to data from all semesters combined to identify if anything has changed” and to “Analyze semester-by-semester data to discern trends” (p. 13).
Based on visual displays and tests of significance, we found no evidence that a difference in placement score density on either side of the cut score would likely contribute to our findings. For both initial and final samples, the DE and NDE placement score density distributions were symmetric, with essentially no suggestion of a discontinuity. With regard to student demographic characteristics, age and gender exhibited no significant discontinuity near the cut score. A single discontinuity for the density of student ethnicity was shown not to be related to achievement for any of the three bandwidths used, and there was no significant difference between discontinuity coefficients based on racial subgroups. Though noncompliance to the treatment or control group was minimal, the model results using IV analyses were in the same direction and similar in magnitude to the primary, parametric models (columns 1 and 5 of Table 1). Our LLR, nonparametric estimates of the discontinuity were also stable and closely corralled estimates obtained from the primary parametric models, both for an OBW and for bandwidths half and twice as large. Bootstrapping methods were applied within each of these three bins, and they provided further confirmatory evidence of the size of the discontinuity.
Consensual Validation: Comparing RDD Replication and Meta-Analysis Results
In meta-analysis (MA) research, the impact of programs is determined by averaging effects of one or more statistical measures from different studies, with different participants, over different periods of time. This evaluation framework is remarkably similar to that found in many educational contexts, including the current one. While the fit is not seamless, the MA template provides yet another means to corroborate our findings.
Triangulation of this sort has often been termed “critical multiplism” (Shadish, Cook, and Houts 1986) or “pattern matching” (Shadish and Cook 2009). We do not wish to suggest that the MA approach can stand alone, but we do regard an accumulation of evidence approach that relies on the MA framework to be a piece of a story line with converging subplots that combines “multiple probes of a causal hypothesis that inform different threats to internal validity …” (Shadish and Cook 2009, p. 622). Thus, we would regard the MA of RD results as additional evidence for the effectiveness of DE and as a policy strategy from which newly emerging data can be compared to an existing knowledge base.
If the results of our five cohorts had been reported in five, separate, published studies, one might easily imagine that an MA of RDD results could be conducted. Though an aggregation of distinct cohorts to obtain a single estimate is probably the most common strategy used in RD studies (e.g., Ou 2010; Reardon et al. 2010), the authors are not aware of a single, published MA of RDD studies. While MA is typically conducted with independent studies at separate settings, and, while in the present context some instructors may teach more than one section, the degree of intercohort dependence was minimal in our research. To demonstrate the promise of conducting an MA of studies using RDD, we illustrate the strategy with both our primary and secondary outcomes.
Fortunately, as the regression model used to assess the effectiveness of DE was the same in each cohort, we could meaningfully synthesize these five β coefficients. To obtain an effect size (ES; d) measure based on each of the five discontinuities for each RDD using grade as a dependent variable, we divided the regression coefficient for the treatment dummy variable by the overall standard deviation based on both the DE and NDE groups. (We considered using only the standard deviation in the NDE group, but the overall standard deviation and the NDE standard deviation were consistently close.) With these five d’s, we computed weighted means of the ESs and utilized the procedures provided by Borenstein et al. (2009) for a fixed-effect model. In our case, the between-groups (cohorts) variance was estimated to be zero, so the fixed-effect and random effects models yielded the same estimates. Conceptually, the within-site (one college) replication feature of our research argued for the fixed-effect model as outlined by Borenstein et al. (In their example for a fixed-effects MA, results from 10 replicates of the same research question were aggregated, for the same research team, at the same medical site.) And as these authors note, the relatively large sample sizes found in our work made the conversion to the unbiased g from the d statistic unnecessary, since the two estimates would be essentially the same.
The MA, fixed-effects model offers policy makers an alternative way to estimate DE’s effectiveness: as new cohorts become available, each cohort’s ES can be entered to generate an updated summary effect. When an aggregate, meta-analytically derived d value was calculated using a fixed-effect model with weights inversely proportional to the variance of d within each cohort, we found an average d of .197, a value quite close to the overall d of .20 resulting when the five terms were cumulated. The MA summary effect was significant at the p < .001 level (95% CI [.13, .26]).
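The inverse-variance weighting behind a fixed-effect summary, and the way the summary can be updated as cohorts accrue, can be sketched as follows. The per-cohort effect sizes and variances below are hypothetical placeholders, since the cohort-level d values and their variances are not listed in this section.

```python
import math

def fixed_effect_summary(effects, variances):
    """Inverse-variance weighted mean effect, its standard error, and a 95% CI."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    mean = sum(w * d for w, d in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)
    return mean, se, (mean - 1.96 * se, mean + 1.96 * se)

def cumulative_summaries(effects, variances):
    """Summary effect after each additional cohort is entered."""
    return [fixed_effect_summary(effects[:k], variances[:k])
            for k in range(1, len(effects) + 1)]

# Hypothetical cohort-level d values and variances, for illustration only.
d_values = [0.25, 0.50, 0.01, 0.15, 0.10]
d_variances = [0.005, 0.040, 0.030, 0.010, 0.008]

for k, (mean, se, ci) in enumerate(cumulative_summaries(d_values, d_variances), 1):
    print(f"cohorts 1-{k}: d={mean:.3f}, SE={se:.3f}, CI=[{ci[0]:.3f}, {ci[1]:.3f}]")
```

Because each weight is the reciprocal of a cohort's variance, large and precisely estimated cohorts dominate the summary, and the standard error of the summary can only shrink as cohorts are added.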
To further illustrate the viability of meta-analyzing RDD results, we utilized the formulas for converting the log OR to a d and the variance of the d value within each cohort (Borenstein et al. 2009, p. 47), then again followed the meta-analytic procedures outlined by these authors for the binary outcome, pass/fail. In this case, the summary d was .23 (p < .001, 95% CI [.11, .36]), a value only slightly larger than the overall d of .20 found for Terms 1–5. Thus, the consistency of results found in the MA and the parametric and nonparametric approaches further corroborates our earlier estimates.
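The conversion from a log odds ratio to d multiplies the log odds ratio by √3/π, and the variance by 3/π². A minimal sketch follows; the function name is hypothetical, and the variance passed in the example is a placeholder rather than a value from the study.

```python
import math

def log_or_to_d(odds_ratio, var_log_or):
    """Convert an odds ratio and the variance of its log to d and the variance of d."""
    d = math.log(odds_ratio) * math.sqrt(3) / math.pi
    var_d = var_log_or * 3 / math.pi**2
    return d, var_d

# The cumulative OR for Terms 1-5 reported earlier (1.56); the variance is a placeholder.
d, var_d = log_or_to_d(1.56, 0.035)
print(round(d, 3), round(var_d, 4))
```

The converted values for each cohort can then be combined with the same inverse-variance weighting used for the grade outcome.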
As a caution, we note that instructors who taught classes in more than one cohort may have created some level of dependence, which could decrease the size of cohort standard errors and consequently inflate our ES estimates. Therefore, we calculated the intraclass correlation (ICC) of grades by cohort (i.e., the ratio of grade variance between cohorts to the sum of the grade variance within each cohort and the grade variance between cohorts) and found it to be small (ICC = .004). This trivial result indicated that very little of the variability (0.4%) in grades was attributed to differences between cohorts and that between-cohort estimates were not likely correlated (Tabachnick and Fidell 2007). While this finding strongly suggests that our ES estimate was not inflated by dependence, this conclusion should be tempered by the small number of cohorts (an ICC based on only five groups may be biased toward zero).
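The ICC defined above can be estimated from one-way analysis of variance components. The sketch below uses the standard ANOVA estimator with an adjustment for unequal group sizes and truncates a negative between-group variance at zero; whether the study used this particular estimator is an assumption, and the toy grades are hypothetical.

```python
import numpy as np

def icc_oneway(groups):
    """ICC = between-group variance / (between-group + within-group variance),
    with variance components estimated from a one-way ANOVA."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    k = len(groups)
    sizes = np.array([g.size for g in groups])
    n_total = sizes.sum()
    grand_mean = np.concatenate(groups).mean()
    ss_between = sum(n * (g.mean() - grand_mean) ** 2 for n, g in zip(sizes, groups))
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    n0 = (n_total - (sizes**2).sum() / n_total) / (k - 1)   # effective group size
    var_between = max((ms_between - ms_within) / n0, 0.0)   # truncate at zero
    return var_between / (var_between + ms_within)

# Toy grades for two cohorts (hypothetical values).
print(icc_oneway([[1, 3], [5, 7]]))
```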
Conceivably, ES estimates were influenced if an individual instructor taught more than one course within a cohort. In such circumstances, NDE instructors would first need to know which students had completed DE (by having previously taught those particular students or by having accessed the student information database to determine each student’s placement score) and then provide DE students with additional instruction. This possibility was unlikely since only 11% of instructors had taught both DE and NDE courses across cohorts, and the personal cost of differential attention would be prohibitive (an instructor would need to access, record, and track placement scores). Second, if prior DE status was known, instructors might have treated DE and NDE students in different ways which favored DE students (e.g., consistently grade DE students higher or NDE students lower), or perhaps instructors might have more effectively taught DE students. Lacking knowledge of the degree to which these conditions might have occurred, we regarded their prospect to be either implausibly coincidental or simply to be unlikely.
Study Strengths and Weaknesses
If grade inflation was manifest here, this replicate strategy would have controlled for it through the cumulative nature of the RDD approach, since the DE and the NDE groups were directly compared within each of the five time periods of the study. Furthermore, in our research, decisions to examine data were not based on the favorability or unfavorability of the interim results (such data may suffer from what is often called the “multiple looks” problem: as data inspection and analysis increase, so does the chance that statistical significance will occur; see Yusuf et al. [2006] for a discussion in the context of medical interventions). Instead, data were viewed retrospectively and analyzed at predetermined and logical stop points (viz., at the end of each term).
While this study’s strengths were numerous, a number of shortcomings were also present. One cut score was utilized in this research and that threshold was implemented based on the recommendation of the placement test publisher. Though we have no reason to believe that the efficacy of DE necessarily hinges on the choice of that particular cut score, it is preferable that multiple cut scores for different cohorts be established as a necessary condition of sound generalization. Our single-college study did not explore long-term outcomes such as persistence, degree completion, or upward transfer as reported in both the Texas and the Florida studies, so we cannot directly compare our findings to those statewide studies. However, the consistent evidence that DE has an immediate impact on grade for courses at the same level of difficulty leaves open the possibility that the no-difference outcomes for DE programs in Texas and Florida may have been confounded by the level of difficulty of courses chosen by DE students subsequent to remedial education. Future research might more closely investigate this potential confounding.
There were no redundancies of student data in the five replicates (study students who reenrolled in college English with hopes of raising their grade were omitted from the study) but, in assessing the potential applicability of the MA approach, we found a very small ICC for grades by cohort. Given the relatively small number of NDE faculty who also taught DE classes, our findings suggest that since dependence was small, differential treatment of DE and NDE students was unlikely, and the MA estimates of benefit were minimally altered.
In our supplementary analyses, we found little evidence of differential attrition, though the absolute number of variables upon which we based this conclusion was relatively small, and it is possible that some degree of program effect may be explained by such unobserved variables. Certainly, our estimated treatment effects generalize only to DE completers rather than to all students identified as DE.
It is well known that the power to detect differences in an RDD is considerably less than in randomized studies (Cappelleri, Darlington, and Trochim 1994; Goldberger 1972). This inherent weakness and the relatively small sample size found in three of the five academic terms may contribute to some of the nonsignificant, term-by-term results. Fortunately, the cumulative RD approach compensates for this lack of power by accumulating sample size across successive cohorts.
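The way accumulation offsets low power can be illustrated with a minimal simulation. The sketch below is not the analysis reported here: it assumes a sharp, linear RD with a common slope, simulated data, a hypothetical cut score of 50, and hypothetical cohort sizes. The function names (`rd_estimate`, `simulate_cohort`, `cumulative_rd`) are illustrative only. It fits the discontinuity on Term 1 alone, then on Terms 1–2, and so on, and reports how the standard error changes as cohorts are pooled.

```python
import numpy as np

def rd_estimate(score, grade, cut):
    """OLS of grade on centered placement score and a DE indicator (score < cut).

    Returns the estimated discontinuity and its standard error.
    Assumes a sharp design and a single linear slope (a simplification).
    """
    centered = score - cut
    treated = (score < cut).astype(float)
    X = np.column_stack([np.ones_like(centered), centered, treated])
    beta, *_ = np.linalg.lstsq(X, grade, rcond=None)
    resid = grade - X @ beta
    sigma2 = resid @ resid / (len(grade) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta[2], np.sqrt(cov[2, 2])

def simulate_cohort(rng, n, cut=50.0, effect=0.2):
    """Simulated cohort: grades in SD units, true discontinuity = effect."""
    score = rng.uniform(20.0, 80.0, size=n)
    treated = score < cut
    grade = 2.0 + 0.02 * (score - cut) + effect * treated + rng.normal(0.0, 1.0, size=n)
    return score, grade

def cumulative_rd(cohorts, cut):
    """Re-estimate the discontinuity after each additional cohort is pooled."""
    results, scores, grades = [], [], []
    for score, grade in cohorts:
        scores.append(score)
        grades.append(grade)
        s, g = np.concatenate(scores), np.concatenate(grades)
        est, se = rd_estimate(s, g, cut)
        results.append((len(s), est, se))
    return results

rng = np.random.default_rng(0)
sizes = [400, 250, 150, 300, 350]  # hypothetical cohort sizes, not the study's
cohorts = [simulate_cohort(rng, n) for n in sizes]
results = cumulative_rd(cohorts, cut=50.0)
for k, (n, est, se) in enumerate(results, start=1):
    print(f"Terms 1-{k}: n = {n}, estimate = {est:.2f}, SE = {se:.2f}")
```

Because the DE indicator is strongly correlated with the placement score, the standard error of the discontinuity is larger than a simple two-group comparison of the same size would give. This is the power penalty noted above, and pooling cohorts is one way to recover precision.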
Policy Implications and Caveats
An evaluation of program effectiveness can be initiated retrospectively, as when a newly appointed President or Dean uses previously generated data from preceding terms as a basis for contemplating programmatic change. In fact, the assessment of DE in the current research study was completed several years after the program was initiated. Though logistically more difficult, additional assessments could occur at multiple time points (say at midterm or annually).
While attractive in many ways, the cumulative strategy may be problematic if tracking happens to start in the “wrong” term or is prematurely terminated. If we had first accumulated students in the third of these five terms (a term when enrollment was lowest and the discontinuity was not significant), evidence of benefit would have been tenuous. To further investigate this timing quandary, we analyzed cumulative Terms 2–5, 3–5, and 4–5 and found that cumulative Terms 2–5, a time period that did not include Term 1, was significant (D = .22, p < .05). These supplementary findings demonstrated that the overall benefit found from aggregating Terms 1–5 was not simply a “borrowing effect” of the first term when the discontinuity was significant and sample size was large. Additional, nonsignificant terms beyond this five-term period could potentially alter an interim finding of significance from Terms 1–5. In general, the number of terms needed for a more correct decision will depend upon many considerations, some fiscal and some logistical, but, perhaps most critically, upon the size of the discontinuities across terms as well as their standard errors.
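The sensitivity of a cumulative finding to its starting term can be sketched as follows. This is a simplified stand-in rather than the procedure used in this study: it pools term-level estimates with fixed-effect, inverse-variance weights instead of re-fitting the RD to pooled student records, and the effect sizes and standard errors are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical term-level effect sizes and standard errors (NOT the study's values).
term_estimates = [0.30, 0.15, 0.05, 0.25, 0.20]
term_ses = [0.12, 0.16, 0.22, 0.15, 0.14]

def pool(estimates, ses):
    """Fixed-effect (inverse-variance) pooled estimate, its SE, and z statistic."""
    weights = [1.0 / se**2 for se in ses]
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, estimates)) / total
    se = math.sqrt(1.0 / total)
    return pooled, se, pooled / se

def windows(estimates, ses):
    """Pool each window that runs from a given starting term to the last term."""
    n = len(estimates)
    out = {}
    for start in range(n - 1):
        out[f"Terms {start + 1}-{n}"] = pool(estimates[start:], ses[start:])
    return out

results = windows(term_estimates, term_ses)
for label, (d, se, z) in results.items():
    half = 1.96 * se  # half-width of an approximate 95% CI
    print(f"{label}: d = {d:.2f}, 95% CI [{d - half:.2f}, {d + half:.2f}], z = {z:.2f}")
```

Under inverse-variance pooling, a window containing more terms always has a smaller standard error than a shorter window nested within it, so a late start costs precision even when the later terms show a similar effect.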
Context and Conclusions
The procedures illustrated here add to a growing set of methods utilizing within-study comparisons that allow researchers to discern those relatively small effects often germane to applied settings (Boruch 1975; Shadish and Cook 2009). Furthermore, the notion of updating the status of an inference is not new (e.g., see Skinner’s 1969, p. 81, discussion of the cumulative recorder) and also stands firmly within the fundamental principles of Bayesian statistics (Howard, Maxwell, and Fleming 2000). Within this institutional context, our best, single estimate suggests that the overall ES benefit of DE was approximately one fifth of a standard deviation. The widths of the CIs decreased as terms were added, indicating reduced uncertainty in this estimate. When the discontinuity was operationalized with respect to pass/fail results, the odds ratio for cumulative terms was consistently significant (with and without an IV analysis). In each of the five cohorts, DE results, individually and cumulatively, were consistently superior to those in NDE, whether reported as grade or as pass/fail.
With the usual call for caution that conclusions based on subset analyses should be drawn tentatively (Fleming 2010), we did find consistent results (for both grade and pass/fail) that males benefited from DE more than females. We found no evidence that White students were more likely to benefit from DE than multiracial students (the estimate of benefit for multiracial students, .25, would likely have been significant with a multiracial sample size comparable to that of White students).
As noted previously, other researchers have utilized RDD in assessing the merits of programs aimed at underprepared students. Statewide evaluations of DE in Texas and Florida found little benefit for completing DE (Calcagno and Long 2008; Martorell and McFarlin 2011). Although these studies are useful for estimating state-level outcomes, they may conceal beneficial program effects within institutions, where most DE policies materialize and program elements are assessed (Moss, Kelcey, and Showers 2014). To illustrate, the exemplary evaluation of a remedial writing program by Aiken et al. (1998) was conducted at a single university and found a benefit for remedial education using both RDD and a randomized experiment. This study also provides a normative comparison to our RDD findings. Based on d estimates of ES, the overall RDD outcomes in our study were near the center of the range of their d values (.02 to .49), again suggesting the impact of remedial education is modest. Thus, the current study not only adds to the expanding evidence of DE’s beneficial impact but also illustrates a strategy enabling one to track the pattern of programmatic results and to plan midcourse corrections or enhancements.
Footnotes
Authors’ Note
Authorship is equally shared; listing is alphabetical.
Acknowledgment
We would like to thank Joseph Cappelleri, Daniel Kreisman, Daniel Lawson, Christopher Thompson, Beth Tipton, and Bernd Weiss for their feedback on earlier drafts of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
