Regression Discontinuity Design in Criminal Justice Evaluation

Abstract

Background:

Corrections agencies frequently place offenders into risk categories, within which offenders receive different levels of supervision and programming. This supervision strategy is seldom evaluated but often can be through routine use of a regression discontinuity design (RDD). This article argues that RDD provides a rigorous and cost-effective method for correctional agencies to evaluate and improve supervision strategies and advocates for using RDD routinely in corrections administration. The objective is to better employ correctional resources.

Method:

This article uses a Neyman–Rubin counterfactual framework to introduce readers to RDD, to provide intuition for why RDD should be used broadly, and to motivate a deeper reading into the methodology. The article also illustrates an application of RDD to evaluate an intensive supervision program for probationers.

Result:

Application of the RDD, which requires basic knowledge of regressions and some special diagnostic tools, is within the competencies of many criminal justice evaluators. RDD is shown to be an effective strategy to identify the treatment effect in a community corrections agency using supervision that meets the necessary conditions for RDD.

Conclusion:

The article concludes with a critical review of how RDD compares to experimental methods to answer policy questions. The article recommends using RDD to evaluate whether differing levels of control and correction reduce criminal recidivism. It also advocates for routine use of RDD as an administrative tool to determine cut points used to assign offenders into different risk categories based on the offenders’ risk scores.

Keywords

crime and justice (adult and juvenile), content area, methodological development, outcome evaluation (other than economic evaluation), design and evaluation of programs and policies, quasi-experimental design, methodology

Introduction

A growing field of study in community corrections is devoted to developing prediction tools that differentiate offenders based on their risk of reoffending. Corrections officers place offenders whose risk scores exceed an upper threshold into a high-risk category, within which controlling and correctional resources are applied intensively. Officers place other offenders into lower risk categories, within which controlling and correctional resources are applied at lower doses. Two evaluation questions arise: (1) Does the use of intensive supervision (i.e., enhanced application of both controlling and correctional resources) improve supervision outcomes? (2) How would outcomes change if the threshold were altered?

Although this strategy of triage and targeting of controlling and correctional resources has several different names, in this article, we call it the risk–need–responsivity (RNR) model. The RNR approach is based on research showing that people with certain characteristics are more likely to commit crimes, but with appropriate treatment and monitoring those characteristics can be modified or mediated to prevent future recidivism (Lipsey and Cullen 2007; Gaes et al. 1999; Gendreau, Goggin, and Little 1996). RNR gives practitioners a rationale for allocating scarce resources to those who pose the greatest threat to public safety and future criminal justice costs, but agencies often apply the RNR principle without testing whether RNR actually improves outcomes in applied settings and whether the threshold used to assign treatment conditions is optimal.¹

Testing the threshold is practical using a regression discontinuity design (RDD) evaluation framework: RDD provides the means for a community corrections agency to evaluate the effectiveness of RNR on an ongoing basis with minimal costs and with minimal disruption for the agency. The costs are minimal because RDD uses administrative data that are routinely maintained by community corrections agencies and the statistics required to apply the RDD approach are within the competence of local evaluators. The disruption is minimal because RDD may be applied to an ongoing community corrections system that has already implemented triage and targeting, although testing the efficacy of a varying threshold would require experimental manipulation of the threshold.

The authors of this article have worked with probation agencies using an RDD design to assess whether using rules based on risk to assign offenders to regular and intensive supervision has reduced criminal recidivism. We believe that RDD should be used more broadly to evaluate RNR practices in community supervision. We also believe that RDD should be used as an ongoing evaluation tool to inform restructuring of threshold values for offender assignment to risk categories. RDD is a valuable evaluation tool for community supervision agencies struggling to make the best use of their limited controlling and correctional resources.

The statistics behind RDD are established (Hahn, Todd, and van der Klaauw 2001; van der Klaauw 2002; Imbens and Lemieux 2007; Lee and Lemieux 2009), and others have used RDD in criminal justice evaluations (Berk and DeLeeuw 1999; Berk and Rauma 1983; Chen and Shapiro 2007; Berk et al. 2010; Jalbert et al. 2010), but as Berk notes (Berk et al. 2010), use of this powerful tool has been infrequent in criminal justice evaluation. The reason may be that applied researchers are unfamiliar with the RDD framework and with applicable estimation procedures. This article outlines the logic of RDD and discusses some practical aspects of applying the design in a community corrections setting. It illustrates application of RDD in a single probation agency. The authors’ intention is to motivate readers to investigate more technical treatments of RDD and to use the RDD technique in community corrections settings.

This article is organized to introduce readers to the basic assumptions that underlie RDD design. We discuss the treatment effect that is identified by RDD and discuss diagnostics, standard errors, and power. We then move from the abstract to the concrete by using an illustration from part of an evaluation of intensive supervision probation (ISP). Finally, we use the discussion to suggest policy questions best answered by RDD, and how ongoing RDD might be incorporated into corrections administration.

The RDD

The RDD provides a quasi-experimental method for identifying a treatment effect in settings where offenders are triaged based on risk scores and given different doses of controlling and correctional interventions depending on an offender’s risk category. Community corrections is such a setting, but so too are many pretrial supervision settings and jail or prison settings. The first subsection discusses how the RDD identifies a specific treatment effect. Justification of an RDD rests on assumptions about the data generation process, and if those assumptions do not hold, the treatment effect is not identified. A practical aspect of RDD is that assumptions are testable, and the second subsection explains some tests. Finally, RDD comes in two forms: sharp RDD and fuzzy RDD. The latter requires some special steps to compute test statistics and confidence intervals, and a third subsection explains them.

Identifying the Treatment Effect Using an RDD

This section explains the RDD using assignment to intensive supervision as an example. The evaluator observes outcomes for $i = 1 \dots N$ offenders. Avoiding an arrest is a positive outcome; being arrested for a new crime (recidivism) is a negative outcome. As they enter probation, offenders are assessed using a screening tool. Based on the assessment score, some offenders are assigned to intensive supervision (i.e., the treatment) and others are assigned to regular supervision (i.e., the comparison).² The assignment is rule-driven, so offenders at higher risk of recidivism are more likely to be assigned to intensive supervision, and specific assignment rules will be considered in the discussion. The evaluation question is whether intensive supervision reduces recidivism and if so by how much. The size of that reduction is the treatment effect.

To formalize the argument, adopt some definitions:

T_i : This is a dummy variable denoting that the ith offender was assigned to intensive supervision (T = 1) or to regular supervision (T = 0). Intensive supervision is considered the treatment.

Y_i : This is the outcome measure, here considered to reflect recidivism.

R_i : This is a risk score considered to be measured on a continuous scale although practically measured on an integral scale. It is known at the time of assignment and its estimation is not part of the exercise.

X_i : This is a covariate, possibly a vector of covariates.

C: This is a critical value of the risk score, the threshold value that determines the treatment. Its purpose will be clarified but generally offenders are assigned to intensive supervision if $R_{i} \geq C$ and are assigned to regular supervision if $R_{i} < C$ (i.e., a sharp RDD); in some versions of the RDD, the inequalities are not strict (i.e., a fuzzy RDD).

I_i : This is an indicator variable such that $I_{i} = 1$ when $R_{i} \geq C$ and 0 otherwise.

We introduce the RDD using Rubin’s counterfactual framework expressed via three equations: The outcome equation if the offender had been assigned to regular supervision (Equation 1), the outcome equation if the offender had been assigned to intensive supervision (Equation 2), and the selection equation determining whether the offender was assigned to intensive supervision (Equation 5). Greek letters designate parameters in these three equations. Defined for all offenders, the first equation is the outcome equation given regular supervision:

Y_{i} (T = 0) = α_{1} + α_{2} (R_{i} - C) + e_{i} .

Placing T = 0 in parentheses denotes the outcome that would be observed if the offender had been placed on regular supervision; the outcome is actually observed only for offenders placed on regular supervision. R_i − C is a convenient (optional) transformation of the risk score. The transformation emphasizes that the expected outcome is α₁ when R_i = C and the offender is assigned to regular supervision. Readers who are so inclined might approximate a more complex equation using a polynomial of any degree, $Y_{i} (T = 0) = α_{1} + α_{2} (R_{i} - C) + α_{3} {(R_{i} - C)}^{2} + \dots + e_{i}$, or add covariates X to this equation without materially altering the argument. The e is a zero-mean random error term that is independent of R − C.

The second equation is the outcome equation given intensive supervision:

Y_{i} (T = 1) = β_{1} + β_{2} (R_{i} - C) + u_{i} .

Placing T = 1 in parentheses denotes the outcome that would be observed if the offender had actually been placed on intensive supervision. The transformation emphasizes that the expected outcome is β₁ when R_i = C and the offender is assigned to intensive supervision. Again the equation could be written as a polynomial; covariates could be included; and u is a zero-mean random error term independent of R − C. Given Equations 1 and 2, the treatment effect for the ith offender is defined as the reduction in recidivism from applying intensive supervision instead of regular supervision, which is expressed as a difference:

δ_{i} = Y_{i} (T = 1) - Y_{i} (T = 0) = (β_{1} - α_{1}) + (β_{2} - α_{2}) (R_{i} - C) + u_{i} - e_{i} .

Equation 3 is a theoretical construct: This treatment effect is never observed because an evaluator can observe $Y_{i} (T = 1)$ or $Y_{i} (T = 0)$ but not both—no offender is assigned to both intensive and regular supervision. The broad research question is to estimate the average treatment effect conditional on R, although we will see that, with important nuances, the RDD strictly leads to an estimated treatment effect when the risk score equals the critical threshold value (R = C). From Equation 3, the expected value of the treatment effect conditional on R is:

E [δ | R] = (β_{1} - α_{1}) + (β_{2} - α_{2}) (R - C) .

The treatment effect might be homogeneous and thus not depend on R, in which case the treatment effect would be just $β_{1} - α_{1}$ , and even if the treatment effect is heterogeneous and depends on R, the treatment effect is $β_{1} - α_{1}$ when R = C. The estimation problem may seem straightforward: Estimate the parameters in Equation 1 using data from offenders assigned to regular supervision, estimate the parameters in Equation 2 using data from offenders assigned to intensive supervision, and use the parameter estimates to estimate Equation 4. This approach could work if the models conforming to Equations 1 and 2 were truly linear, but this is an extremely strong assumption. Absent this assumption, we cannot reliably identify the parameters of Equation 1 for offenders assigned to regular supervision with risk scores in the range $R \geq C$ and we cannot identify the parameters of Equation 2 for offenders assigned to intensive supervision with risk scores in the range of R for which $R < C$ . Figure 1, discussed later, will clarify this point.

Figure 1.

An illustration of the sharp and fuzzy regression discontinuity design using hypothetical data.

RDD avoids this identification problem by adopting a seemingly innocuous assumption that offenders with a risk score slightly less than C and offenders with a risk score slightly larger than C would have essentially the same outcomes if all had been assigned to regular supervision. This provides the contrast at the heart of RDD: What were the outcomes for offenders with risk scores near C who were assigned to regular supervision compared with the outcomes for offenders with risk scores near C who were assigned to intensive supervision? We have to introduce the selection equation to investigate the solution following from this assumption. The selection equation is written as the probability of being assigned to intensive supervision:

P (R_{i}) = γ_{1} + γ_{2} (R_{i} - C) + γ_{3} I_{i} .

Earlier, we defined a variable $I_{i} = 1$ when $R_{i} \geq C$ and 0 otherwise. The transformation R_i − C emphasizes the discontinuity of this function at R_i = C. As before, this selection equation could be written as a polynomial and might be expressed alternatively in a form suitable for estimation using a logistic regression. The essential feature of RDD is that $γ_{3} \neq 0$, implying that the probability of assignment to intensive supervision has a discontinuity at $R = C$, giving RDD its name. Another essential feature is that the outcome Equations 1 and 2 are continuous at $R = C$.³ Thus, an evaluator cannot simply choose to use an RDD because it is deemed a strong design; rather, an RDD is only applicable when these two conditions hold. Note especially that this selection equation seemingly rules out selection on unobservables,⁴ leading to the following argument that is based on expectations.

To develop the argument, combine Equations 1, 2, and 5 to rewrite the expected outcome as:

\begin{array}{l} E [Y] = [α_{1} + α_{2} (R - C)] (1 - P (R)) + [β_{1} + β_{2} (R - C)] P (R) \\ = [α_{1} + α_{2} (R - C)] (1 - (γ_{1} + γ_{2} (R - C) + γ_{3} I)) + [β_{1} + β_{2} (R - C)] (γ_{1} + γ_{2} (R - C) + γ_{3} I) \\ = α_{1} + α_{2} (R - C) + (γ_{1} + γ_{2} (R - C) + γ_{3} I) [(β_{1} - α_{1}) + (β_{2} - α_{2}) (R - C)] . \end{array}

The first line shows the expected outcome as a mixture that depends on the probability of being assigned to intensive supervision. The second line shows the mixture probability in terms of the parameters appearing in Equation 5. To use RDD to identify the treatment effect, select offenders whose risk scores are slightly less than C and offenders whose risk scores are slightly higher than C. When the scores are sufficiently near C, $R - C \approx 0$ and Equation 6 can be rewritten as:

\begin{aligned} E [Y] \approx [α_{1}] (1 - (γ_{1} + γ_{3} I)) + [β_{1}] (γ_{1} + γ_{3} I) \\ = α_{1} + (β_{1} - α_{1}) (γ_{1} + γ_{3} I) . \end{aligned}

The formal argument is asymptotic (Hahn, Todd, and van der Klaauw 2001) and we will not develop it here. Estimating the parameters of Equation 7 can be seen as a regression of the observed outcome Y on the estimated probability of being assigned to intensive supervision, and the combination of Equations 5 and 7 might be seen as an instrumental variable estimator. The necessity of the condition $γ_{3} \neq 0$ is apparent. If that condition did not hold, the equation would represent a regression on two constants and identification would be impossible. The utility of Equation 7 is that it shows how the RDD approach identifies the treatment effect $β_{1} - α_{1}$ in the vicinity of $R = C$.

The practical problem with this estimator is that there may be few offenders with risk scores just slightly less than C and with scores just equal to C. There may in fact be none because risk instruments are usually scaled as integers $1, 2, \dots, C - 1, C, C + 1, \dots$. It is therefore necessary to enlarge the sample to the cluster of observations that fall within a range $C \pm Δ$, where Δ is the bandwidth. The asymptotic argument no longer holds and so we make a needed assumption: No matter how complicated functions 1 and 2 are in fact (i.e., after allowing for higher level polynomials), the simple linear models represented by Equations 1 and 2 approximate the complex function within the narrow bandwidth. The accuracy of this approximation outside the bandwidth is immaterial because our estimation does not use data outside the bandwidth.

The discontinuity in the selection equation is the identification condition. Given the identification condition, there are many ways to estimate the parameters. In general, rewrite the last line of Equation 6, group the variables, a constant, R − C, (R − C)², I, and I(R − C), and substitute parameters that are estimated by a regression on those grouped variables. The new parameters are ξ, which are functions of the α, β, and γ parameters:

\begin{aligned} E [Y] &= α_{1} + α_{2} (R - C) + \{(β_{1} - α_{1}) + (β_{2} - α_{2}) (R - C)\} (γ_{1} + γ_{2} (R - C) + γ_{3} I) \\ &= ξ_{1} + ξ_{2} (R - C) + ξ_{3} {(R - C)}^{2} + ξ_{4} I + ξ_{5} I (R - C) . \end{aligned}

To interpret the meaning of the $ξ_{4}$ parameter, consider two forms of the selection equation. In a sharp RDD, the probability of selection is written as $P = I$ , so everyone with a risk score of C or higher is assigned to intensive supervision and everybody with a risk score less than C is assigned to regular supervision. Necessarily $γ_{1} = 0$ , $γ_{2} = 0,$ and $γ_{3} = 1$ . Therefore, in a sharp RDD, $ξ_{4} = β_{1} - α_{1}$ , which we have interpreted as the treatment effect at R = C. In the fuzzy RDD, the selection probability is written more generally as Equation 5, meaning practically that there were some exceptions to the rule-based assignment so that $γ_{1} \geq 0$ , $γ_{2} \geq 0,$ and $0 < γ_{3} < 1$ . In a fuzzy RDD, $ξ_{4} = (β_{1} - α_{1}) (γ_{3})$ , so we can estimate the treatment effect at R = C as $(β_{1} - α_{1}) = ξ_{4} / {\hat{γ}}_{3}$ . Note that this final estimate requires some assumptions identified in Note 5 and an estimate of $γ_{3}$ , which comes from a regression based on the selection Equation 5.
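For readers who want the mapping spelled out, multiplying out the last line of Equation 6 and collecting terms suggests the following correspondence between the ξ parameters and the α, β, and γ parameters. This is offered as a sketch under the linear specifications of Equations 1, 2, and 5, with no higher-order polynomial terms and no covariates:

```latex
\begin{aligned}
ξ_{1} &= α_{1} + γ_{1} (β_{1} - α_{1}) \\
ξ_{2} &= α_{2} + γ_{2} (β_{1} - α_{1}) + γ_{1} (β_{2} - α_{2}) \\
ξ_{3} &= γ_{2} (β_{2} - α_{2}) \\
ξ_{4} &= γ_{3} (β_{1} - α_{1}) \\
ξ_{5} &= γ_{3} (β_{2} - α_{2})
\end{aligned}
```

Setting $γ_{1} = 0$, $γ_{2} = 0$, and $γ_{3} = 1$ recovers the sharp case: $ξ_{1} = α_{1}$, $ξ_{2} = α_{2}$, $ξ_{3} = 0$, $ξ_{4} = β_{1} - α_{1}$, and $ξ_{5} = β_{2} - α_{2}$.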

Methodologists writing about RDD commonly use a graphical device to provide intuition for why RDD estimates the treatment effect at R = C. Based on simulated data, created to illustrate points, Figure 1 (sharp RDD) shows three curves: The first curve is a graphical illustration of Equation 1, representing the potential outcome assuming a polynomial and given regular supervision. The second curve is an illustration of Equation 2, representing the potential outcome assuming a polynomial and given intensive supervision. Both these lines are drawn with light curves, which are partially obscured because they are coincident with the third curve, representing the observed outcomes given that all offenders with $R < C$ are assigned to regular supervision while all with $R \geq C$ are assigned to intensive supervision. The threshold is set at 10 in this illustration.

As a function of the risk score, the treatment effect is the vertical distance between the first two curves. Although the treatment effect is defined, it is not in general identified. For scores below the critical risk score of C = 10, we can observe outcomes for offenders assigned to regular supervision, but we cannot observe the counterfactual for comparable offenders who were assigned to intensive supervision. For scores equal to 10 or above, we can observe the outcomes for offenders assigned to intensive supervision, but we cannot observe the outcomes for comparable offenders assigned to regular supervision. The treatment effect is identified only at the margin where $R = C$; there, with the assignment of treatment, the probability of recidivism falls by about .062. The treatment effect is not identified at any other value of R absent some strong assumptions.⁵

Using the same simulated data, Figure 1 (fuzzy RDD) shows the same three curves, but the probability of assignment to intensive supervision is a continuous function of R except at R = C where it is discontinuous. (Figure 2 shows the discontinuity.) The first and second curves are the same as in Figure 1 (sharp RDD), but the third curve, representing the observable outcomes, differs. Arguably, we could estimate the first two curves because, according to the assignment equation, we can always observe some offenders assigned to regular and intensive supervision for every risk score. There are two problems, however: Decreasing numbers of offenders receive intensive supervision as R decreases and decreasing numbers of offenders receive regular supervision as R increases, and we would be increasingly concerned with selection bias. However, given the discontinuity at R = C, we can unambiguously identify the treatment effect at R = C. As shown in Figure 1 (fuzzy RDD), what might be called the intent-to-treat estimator of the treatment effect equals about 0.0289 units at C = 10. According to Figure 2, the probability of treatment jumps from .269 to .731 at C = 10, or by .462. Therefore, an estimate of the size of the treatment effect is roughly 0.0289/0.462 ≈ 0.062; the sharp and fuzzy RDD give the same answers.

Figure 2.

An illustration of the selection equation for a fuzzy RDD using hypothetical data.

We qualify the previous discussion with two technical observations. First, although the intuition is helpful, this equivalency between the estimated treatment effect from sharp RDD and fuzzy RDD will not typically hold. The estimated treatment effect from the RDD is called a local average treatment effect (LATE); the estimated treatment effect from a fuzzy RDD is sometimes distinguished as the LATE at the cut point (LATEC). A general literature describes LATE (Imbens and Angrist 1994; Heckman and Vytlacil 2007; Gennetian et al. 2005) and Bloom discusses LATEC with respect to RDD (Bloom 2013). The crux of the argument is that if treatment effects are heterogeneous at R = C, RDD identifies the treatment effect for a specific subset of offenders with risk scores of R = C,⁶ so given heterogeneous treatment effects the sharp and fuzzy designs lead to estimates of different treatment effects. Second, the discussion implies that the estimated treatment effect is literally the treatment effect when R = C, and while this is often the way that evaluators interpret the effect, an alternative interpretation is that the treatment effect is a weighted average of heterogeneous treatment effects over the entire bandwidth (Lee 2008; Lee and Lemieux 2009; Bloom 2013).⁷ This article does not pursue this argument further: While these qualifications are relevant for interpreting the treatment effect, the weighting is unobservable so the argument has minimal practical value for applied researchers.

This discussion has explained how RDD identifies the treatment effect. Some comments are instructive:

Examining Equation 8, researchers will often assume that $ξ_{3} \approx 0$ within a narrow bandwidth so that the quadratic term does not enter into the regression, or equivalently that a local linear regression is a close approximation to what may be in fact a more complex function. Figure 1 provides some intuition: Although the curves are clearly not linear, they are approximately linear in the vicinity of R = 10. The assumption of local linearity is testable as we will illustrate and possibly it will be necessary to substitute a simple polynomial for the linear equation.

There are many ways to estimate the local linear regression within the bandwidth, some of them very sophisticated, many of them controversial. Our discussion does not go beyond a simple regression but technical discussions and software packages often employ other devices including kernel density estimation (Lee and Lemieux 2009). Simple, traditional estimators are likely to work for most applied researchers.

It is practical to estimate Equation 8 using two regressions, one using data for all observations within the bandwidth where $R < C$, and the other using data for all observations within the bandwidth where $R \geq C$. Given that R − C enters as the regressor, the estimated treatment effect is just the difference in the constants for those two regressions given sharp RDD, and the difference in the constants divided by the size of the discontinuity in the probability of assignment to intensive supervision given fuzzy RDD.

Testing the statistical significance of the treatment effect is a standard test of the null hypothesis that $ξ_{4} = 0$ . For sharp RDD, the standard error used to test the null leads to a straightforward confidence interval for the treatment effect, but for fuzzy RDD the confidence interval depends both on that standard error and on the precision of estimating the probability of treatment at R = C. One way to compute the standard error for the ratio is to use the delta method (discussed in a subsequent section) although other methods have been proposed (Lee and Lemieux 2009).

The bandwidth is a concern. The narrower the bandwidth, the more likely that a local linear regression will fit the data, but the fewer the data points, the higher the standard error for the estimated treatment effect. The wider the bandwidth, the greater the validity challenges to the simple regression specification, but the smaller the standard error. Evaluators will typically expand the bandwidth, progressively examining how a wider bandwidth changes estimates of the treatment effect, and reporting the sequence of estimates. As we discuss later in this article, formal procedures for expanding the bandwidth have been proposed (Lee and Lemieux 2009), but unfortunately these test–retest procedures impart additional uncertainty to the estimates that is difficult to evaluate.

The discussion assumes that the risk score is measured on a continuous scale so that in theory there are offenders with risk scores that are infinitesimally smaller than C. In criminal justice problems, risk scores are discrete. Lee and Card (2006) discuss how using discrete risk scores affects estimation, including the greater risk of specification error and the need to account for clustering based on risk scores. This article will return to these issues with the applied illustration, where risk scores are discrete.
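The estimation comments above can be made concrete with a minimal sketch. It assumes a sharp design, a local linear specification (that is, $ξ_{3} \approx 0$), and simulated data; the function names, the simulated parameter values, and the bandwidth of 4 are our illustrative assumptions, not part of any agency's data or any particular software package. The sketch estimates $ξ_{4}$ once from a pooled regression and once as the difference in constants from two separate regressions.

```python
import numpy as np

def simulate_sharp_rdd(n=200_000, cutoff=10, effect=-0.06, seed=0):
    """Hypothetical data: integer risk scores 0..20 and a binary recidivism outcome."""
    rng = np.random.default_rng(seed)
    risk = rng.integers(0, 21, size=n)
    treated = (risk >= cutoff).astype(float)        # sharp rule: I = 1 when R >= C
    # Assumed linear probability of recidivism in (R - C), shifted by the effect.
    p = 0.40 + 0.02 * (risk - cutoff) + effect * treated
    y = (rng.random(n) < p).astype(float)
    return risk, y

def pooled_estimate(risk, y, cutoff, bandwidth):
    """Regress Y on 1, (R - C), I, and I*(R - C) within the bandwidth; return xi_4."""
    centered = (risk - cutoff).astype(float)
    keep = np.abs(centered) <= bandwidth
    r, yy = centered[keep], y[keep]
    ind = (r >= 0).astype(float)
    design = np.column_stack([np.ones_like(r), r, ind, ind * r])
    coef, *_ = np.linalg.lstsq(design, yy, rcond=None)
    return float(coef[2])

def two_regression_estimate(risk, y, cutoff, bandwidth):
    """Fit separate lines below and at/above C; return the difference in constants."""
    centered = (risk - cutoff).astype(float)
    below = (centered < 0) & (centered >= -bandwidth)
    above = (centered >= 0) & (centered <= bandwidth)
    # np.polyfit with degree 1 returns [slope, intercept].
    _, const_below = np.polyfit(centered[below], y[below], 1)
    _, const_above = np.polyfit(centered[above], y[above], 1)
    return float(const_above - const_below)

risk, y = simulate_sharp_rdd()
print("pooled regression:  ", pooled_estimate(risk, y, cutoff=10, bandwidth=4))
print("two regressions:    ", two_regression_estimate(risk, y, cutoff=10, bandwidth=4))
```

Because the pooled specification interacts every term with I, the two approaches give the same point estimate. For a fuzzy design, the same quantity would be divided by an estimate of $γ_{3}$.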

Diagnostics

An array of designs exists for evaluators seeking to estimate treatment effects (Rosenbaum 2002; Cameron and Trivedi 2005; Lee 2005; Morgan and Winship 2007; Angrist and Pischke 2009), but with the exception of random assignment, all rest on largely untestable assumptions. In contrast, the assumptions that underlie RDD can be partly tested, weakening the need to rely on them. If tested assumptions are rejected, inferences based on RDD are called into question and the evaluator might seek another approach. We discuss some diagnostic tests here.

RDD requires that the probability of treatment be discontinuous at the critical value C. Testing this assumption is a matter of estimating the parameters of Equation 5, perhaps after specifying a more complex structure. The diagnostic test is based on the null hypothesis that $γ_{3} = 0$ . If the null cannot be rejected, the treatment effect is not identified because there is no discontinuity in the selection equation.
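A minimal sketch of this first diagnostic follows. It assumes the linear form of Equation 5 estimated by ordinary least squares (a linear probability model) on simulated data; the helper names and the simulated parameter values are hypothetical. The conventional standard errors used here ignore heteroskedasticity and clustering on discrete risk scores, so the t statistic should be read as indicative only.

```python
import numpy as np

def simulate_fuzzy_assignment(n=50_000, cutoff=10, seed=2):
    """Hypothetical data: treatment probability jumps by 0.45 at the cutoff."""
    rng = np.random.default_rng(seed)
    risk = rng.integers(0, 21, size=n)
    centered = risk - cutoff
    p_treat = 0.25 + 0.02 * centered + 0.45 * (centered >= 0)
    treated = (rng.random(n) < p_treat).astype(float)
    return risk, treated

def selection_jump(risk, treated, cutoff, bandwidth):
    """Regress T on 1, (R - C), and I within the bandwidth.

    Returns the estimate of gamma_3, its conventional standard error, and the t ratio.
    """
    centered = (risk - cutoff).astype(float)
    keep = np.abs(centered) <= bandwidth
    r, t = centered[keep], treated[keep]
    ind = (r >= 0).astype(float)
    design = np.column_stack([np.ones_like(r), r, ind])
    coef, *_ = np.linalg.lstsq(design, t, rcond=None)
    resid = t - design @ coef
    sigma2 = resid @ resid / (len(t) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = float(np.sqrt(cov[2, 2]))
    return float(coef[2]), se, float(coef[2] / se)

risk, treated = simulate_fuzzy_assignment()
gamma3, se, t_ratio = selection_jump(risk, treated, cutoff=10, bandwidth=4)
print(f"gamma_3 = {gamma3:.3f}, se = {se:.3f}, t = {t_ratio:.1f}")
```

A t ratio near zero would indicate that the null $γ_{3} = 0$ cannot be rejected, in which case the treatment effect is not identified.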

The second diagnostic test is that the outcome equation absent treatment is continuous at C. This is not testable directly because the outcome absent treatment is not observable. There are indirect tests, however. One indirect test is to determine whether the risk score R is distributed continuously about C. This test is useful because Y is a function of R, so discontinuity in the distribution of R implies discontinuity in the distribution of Y. The validity of the continuity assumption might be determined using a regression similar to that used to estimate the probability of treatment. Alternatively, one might stratify the R and use a histogram to determine that there are no discontinuities around the value of C. More formal tests are available (McCrary 2007). The test is less useful when the risk score is discrete because there is some natural lumpiness to discrete data.
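The histogram check can be sketched as follows. This is an informal tabulation on simulated integer scores, not the formal test cited above; the function names, the 30% manipulation rate, and the use of a simple count ratio at the boundary are our illustrative assumptions.

```python
import numpy as np

def score_counts(risk, cutoff, bandwidth):
    """Count offenders at each integer score within the bandwidth around C."""
    scores = np.arange(cutoff - bandwidth, cutoff + bandwidth + 1)
    counts = np.array([int(np.sum(risk == s)) for s in scores])
    return scores, counts

def boundary_ratio(risk, cutoff):
    """Count at C divided by count at C - 1; values far from 1 hint at bunching."""
    return float(np.sum(risk == cutoff) / np.sum(risk == cutoff - 1))

def simulate_scores(n=210_000, seed=4):
    """Hypothetical data: integer risk scores drawn uniformly from 0..20."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 21, size=n)

def manipulate_scores(risk, cutoff, share=0.3, seed=5):
    """Hypothetical manipulation: move a share of offenders from C down to C - 1."""
    rng = np.random.default_rng(seed)
    at_cut = np.flatnonzero(risk == cutoff)
    moved = rng.choice(at_cut, size=int(share * at_cut.size), replace=False)
    shifted = risk.copy()
    shifted[moved] = cutoff - 1
    return shifted

clean = simulate_scores()
shifted = manipulate_scores(clean, cutoff=10)
for label, data in [("no manipulation", clean), ("manipulated", shifted)]:
    scores, counts = score_counts(data, cutoff=10, bandwidth=3)
    print(label, dict(zip(scores.tolist(), counts.tolist())),
          "ratio at C:", round(boundary_ratio(data, 10), 3))
```

With real risk instruments, the distribution of scores is rarely uniform, so the counts should be judged against the trend in neighboring scores rather than against a ratio of exactly one.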

Another complementary indirect test examines the distribution of covariates X even if these do not appear in the regression. The concern is that the risk score may be manipulated either by probation officers (POs) in order to shift an offender into more intensive supervision or by the offender (through self-reported behaviors) to avoid being shifted into more intensive supervision. Manipulation of R will have an incidental effect on the distribution of X. If there is no manipulation of the risk score, then an evaluator would expect to see a reasonably smooth distribution of the X variables about the critical value C. If the data fail to pass that test, the validity of the RDD is called into question.
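One way to operationalize this covariate check is to treat each covariate as the dependent variable in the same kind of local regression and look for a jump at C. The sketch below assumes a single continuous covariate (a hypothetical "age" variable), simulated data, and conventional standard errors; all names and values are illustrative.

```python
import numpy as np

def covariate_jump(risk, covariate, cutoff, bandwidth):
    """Regress a covariate on 1, (R - C), and I within the bandwidth.

    Returns the estimated jump at C, its conventional standard error, and the t ratio.
    """
    centered = (risk - cutoff).astype(float)
    keep = np.abs(centered) <= bandwidth
    r, x = centered[keep], covariate[keep].astype(float)
    ind = (r >= 0).astype(float)
    design = np.column_stack([np.ones_like(r), r, ind])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    resid = x - design @ coef
    sigma2 = resid @ resid / (len(x) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    se = float(np.sqrt(cov[2, 2]))
    return float(coef[2]), se, float(coef[2] / se)

def simulate_covariate(n=50_000, cutoff=10, jump=0.0, seed=3):
    """Hypothetical data: age declines smoothly with risk, plus an optional jump at C."""
    rng = np.random.default_rng(seed)
    risk = rng.integers(0, 21, size=n)
    age = 35.0 - 0.5 * (risk - cutoff) + rng.normal(0.0, 8.0, size=n)
    age = age + jump * (risk >= cutoff)
    return risk, age

for label, jump in [("smooth covariate", 0.0), ("covariate with jump", 3.0)]:
    risk, age = simulate_covariate(jump=jump)
    est, se, t_ratio = covariate_jump(risk, age, cutoff=10, bandwidth=4)
    print(f"{label}: jump = {est:+.2f}, se = {se:.2f}, t = {t_ratio:+.1f}")
```

A large t ratio for any covariate would be a warning sign that the risk score may have been manipulated near the cut point.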

Still another diagnostic comes from regressing the outcome Y on the risk score. If there is no discontinuity at C, then there is no treatment effect. This might be done by estimating robust regressions on both sides of C. Typically, the regression would be linear, and if a polynomial were used, it would not be of high degree. The purpose of this diagnostic is to test the specification of the regressions on the right and left of C. A visual test might suffice; Lee and Lemieux (2009) and Lee and Card (2006) discuss formal tests. When the risk score is discrete, the test requires projecting the regression to the left of C to C, adding some uncertainty to the test.

This last diagnostic raises a practical issue. The objective is to estimate $ξ_{4}$, or alternatively $γ_{3} (β_{1} - α_{1})$. A high-degree polynomial can do a very good job of fitting the data but a very poor job of estimating $ξ_{4}$. The purpose of the diagnostic is not just to determine how well the model fits the data, but rather, how well the model fits the data close to C. This can be especially difficult, given the need to project the left-hand-side regression to R = C, that is, projecting the regression outside its support.

Estimation and diagnostics assume a bandwidth, so testing the sensitivity of results to the size of the bandwidth seems prudent. Recall that the estimated treatment effect has the greatest validity when the bandwidth is narrow, but it has the greatest efficiency when the bandwidth is wide. There are some formal tests for the optimal bandwidth (Imbens and Kalyanaraman 2009), but these approaches can be complicated, and a practical alternative is to experiment by expanding bandwidth and assessing how the expansion affects estimates (Angrist and Pischke 2009).
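The informal sensitivity exercise can be sketched as a loop over bandwidths. The example assumes a sharp design, a local linear specification, simulated data with mild curvature in the outcome, and conventional standard errors that ignore clustering on discrete scores; the function names and the grid of bandwidths are illustrative assumptions rather than a recommended procedure.

```python
import numpy as np

def simulate_curved_outcome(n=100_000, cutoff=10, effect=-0.06, seed=1):
    """Hypothetical data: the outcome is mildly quadratic in (R - C) with a jump at C."""
    rng = np.random.default_rng(seed)
    risk = rng.integers(0, 21, size=n)
    r = risk - cutoff
    p = 0.40 + 0.02 * r + 0.0015 * r**2 + effect * (r >= 0)
    y = (rng.random(n) < p).astype(float)
    return risk, y

def local_linear_jump(risk, y, cutoff, bandwidth):
    """Estimate xi_4 and its conventional standard error within one bandwidth."""
    centered = (risk - cutoff).astype(float)
    keep = np.abs(centered) <= bandwidth
    r, yy = centered[keep], y[keep]
    ind = (r >= 0).astype(float)
    design = np.column_stack([np.ones_like(r), r, ind, ind * r])
    coef, *_ = np.linalg.lstsq(design, yy, rcond=None)
    resid = yy - design @ coef
    sigma2 = resid @ resid / (len(yy) - design.shape[1])
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return float(coef[2]), float(np.sqrt(cov[2, 2]))

def bandwidth_sweep(risk, y, cutoff, bandwidths):
    """Return (bandwidth, estimate, standard error) for each candidate bandwidth."""
    return [(h, *local_linear_jump(risk, y, cutoff, h)) for h in bandwidths]

risk, y = simulate_curved_outcome()
for h, est, se in bandwidth_sweep(risk, y, cutoff=10, bandwidths=[2, 4, 6, 8, 10]):
    print(f"bandwidth = {h:2d}   estimate = {est:+.4f}   se = {se:.4f}")
```

Reporting the whole sequence, rather than a single preferred bandwidth, lets reviewers judge how much the estimate moves as the standard error shrinks.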

Diagnostics are straightforward and lend credibility to RDD. A necessary argument for RDD is that the probability of treatment is a discontinuous function of R at C. An evaluator who stops at this necessary argument has failed to make a compelling case that he or she has identified the treatment effect. Responsible evaluators will offer their reviewers the opportunity to consider diagnostics before passing judgment on the validity of the estimated treatment effect.

Standard Errors and Efficiency

Recall from the discussion surrounding Equation 8 that the estimated treatment effect at R = C is ${\hat{ξ}}_{4} / {\hat{γ}}_{3}$. For sharp RDD, $γ_{3} = 1$, so the treatment effect is estimated as ${\hat{ξ}}_{4}$. The null hypothesis is $ξ_{4} = 0$, and this null can be tested using standard test procedures based on a regression. Typically, ${\hat{ξ}}_{4}$ is treated as having a normal distribution, which leads directly to a confidence interval for the treatment effect. Nothing about sharp RDD requires novel thinking when testing the hypothesis of no treatment effect and when constructing confidence intervals for treatment effects.

The simplicity disappears when using fuzzy RDD because the variance when estimating $γ_{3}$ enters the estimation. A test of the null hypothesis of no treatment effect is the same for sharp and fuzzy RDD. A sharp RDD will have more power than its fuzzy RDD counterpart, but otherwise the nature of the statistical test does not change. However, constructing a confidence interval for the estimated treatment effect for a fuzzy RDD requires special attention. There are several approaches, but one approach to estimating the standard error for ${\hat{ξ}}_{4} / {\hat{γ}}_{3}$ uses the delta method, according to which:

$var \left( \frac{{\hat{ξ}}_{4}}{{\hat{γ}}_{3}} \right) \approx \frac{var ({\hat{ξ}}_{4})}{{({\hat{γ}}_{3})}^{2}} + \frac{var ({\hat{γ}}_{3}) {({\hat{ξ}}_{4})}^{2}}{{({\hat{γ}}_{3})}^{4}} - 2 cov ({\hat{ξ}}_{4}, {\hat{γ}}_{3}) \frac{{\hat{ξ}}_{4}}{{({\hat{γ}}_{3})}^{3}} .$

The problem is estimating the covariance terms. With Stata, this can be done with the seemingly unrelated regression (sureg) command. Sometimes researchers use instrumental variable programs. Alternatively, it can be done mechanically by an evaluator with programming skills.
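For an evaluator taking the mechanical route, the delta-method approximation can be coded directly once the two variances and the covariance are in hand. The sketch below assumes those quantities have already been estimated elsewhere; the function names and the numerical inputs are hypothetical, and the interval relies on the usual normal approximation.

```python
import math


def ratio_variance(xi, gamma, var_xi, var_gamma, cov_xi_gamma):
    """Delta-method variance of xi / gamma."""
    return (var_xi / gamma ** 2
            + var_gamma * xi ** 2 / gamma ** 4
            - 2.0 * cov_xi_gamma * xi / gamma ** 3)


def ratio_confidence_interval(xi, gamma, var_xi, var_gamma, cov_xi_gamma, z=1.96):
    """Normal-approximation interval for xi / gamma (z = 1.96 gives about 95%)."""
    estimate = xi / gamma
    se = math.sqrt(ratio_variance(xi, gamma, var_xi, var_gamma, cov_xi_gamma))
    return estimate - z * se, estimate + z * se


# Hypothetical inputs: outcome jump of -0.08, treatment-probability jump of 0.8.
variance = ratio_variance(-0.08, 0.8, 0.0016, 0.0025, 0.0)
low, high = ratio_confidence_interval(-0.08, 0.8, 0.0016, 0.0025, 0.0)
```

Setting `gamma` to 1 with zero variance and covariance recovers the sharp-RDD case, in which the variance of the ratio is simply the variance of the estimated jump in the outcome.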

Although RDD identifies the treatment effect, the sampling variance can be high, necessitating the use of large samples. Some special cases are often considered in the evaluation literature (Schochet 2009; Bloom 2013; Bloom et al. 2005). In summary, this literature suggests that an RDD with a large sample may have less power than might be supposed.

As expected, the sampling variance for the estimated treatment effect should vary inversely with the bandwidth.

Assuming a uniform distribution of R about C leads to a sampling variance that is 4 times as large as it would have been if we had been able to randomly assign the same study subjects to the treatment or no treatment condition. Put another way, the RDD requires a sample size that is 4 times as large as that required by a random assignment design in order to have the same power.

Assuming a normal distribution about C requires a sample size 2.7 times as large as that required by random assignment.

Lee and Card (2006) show that discrete risk scores will inflate standard errors. Schochet (2009) argues that clustering will inflate standard errors; of course, this would also be true for random design experiments.
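The factors of 4 and 2.7 quoted above can be checked numerically. The sketch below rests on assumptions that the text does not spell out: a sharp design, a correctly specified linear regression of the outcome on the treatment indicator and R, and a cutoff at the center of the score distribution. Under those assumptions the variance inflation relative to random assignment is $1 / (1 - ρ^{2})$, where ρ is the correlation between the treatment indicator and R. The function `design_effect` and the quantile grids are illustrative.

```python
from statistics import NormalDist


def design_effect(scores, cutoff):
    """1 / (1 - rho^2), where rho is the correlation between the
    treatment indicator (score >= cutoff) and the score itself."""
    n = len(scores)
    treat = [1.0 if r >= cutoff else 0.0 for r in scores]
    mean_r = sum(scores) / n
    mean_t = sum(treat) / n
    cov = sum((r - mean_r) * (t - mean_t) for r, t in zip(scores, treat)) / n
    var_r = sum((r - mean_r) ** 2 for r in scores) / n
    var_t = sum((t - mean_t) ** 2 for t in treat) / n
    rho_squared = cov ** 2 / (var_r * var_t)
    return 1.0 / (1.0 - rho_squared)


# Deterministic quantile grids stand in for large samples.
n = 20000
probs = [(i + 0.5) / n for i in range(n)]
uniform_scores = [2.0 * p - 1.0 for p in probs]               # uniform on (-1, 1)
normal_scores = [NormalDist().inv_cdf(p) for p in probs]      # standard normal

de_uniform = design_effect(uniform_scores, cutoff=0.0)
de_normal = design_effect(normal_scores, cutoff=0.0)
```

The uniform grid gives a factor very close to 4 and the normal grid a factor close to 2.75, in line with the figures cited from the literature. A cutoff away from the center, or a nonlinear specification, would change these factors.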

Covariates are not required for identification provided the continuity assumptions hold. However, the standard error for the estimated treatment effect is proportional to the square root of the residual variance in the estimated regression. Covariates may be useful for reducing standard errors, but their use can raise questions about model specification for those covariates.

An Illustration

This section uses data from an evaluation of an ISP program to illustrate points made previously. It provides circumscribed data and less detail than would be expected of a report on evaluation findings because the intent is illustration, not dissemination of evaluation findings. More information about the full evaluation can be found in Jalbert et al. (2011).⁸

In theory, ISP allows POs to provide enhanced control and correctional interventions to high-risk offenders who otherwise would receive inadequate supervision and support because of large caseloads. Most previous experiments with ISP were failures: Some programs failed to deliver increased interaction or treatment despite smaller caseloads; others increased supervision intensity that increased technical violations for behaviors that would not be criminal except for an offender’s status on probation (Petersilia and Turner 1993). However, there have been exceptions (Pearson 1990; Byrne and Kelly 1989; Paparozzi and Gendreau 2005) in agencies that use programming responsive to specific offender needs, suggesting ISP can be effective if it is employed in an agency using RNR supervision to allocate treatment and supervision resources using a validated risk/need assessment instrument.

The principal null hypothesis in this illustration is that criminal recidivism is the same for high-risk offenders receiving intensive supervision from low-caseload officers as it is for high-risk offenders supervised under normal (nonintensive) supervision from officers with regular caseloads in an agency using RNR. The alternative hypothesis is that criminal recidivism is lower for offenders supervised under intensive supervision. Based on the literature review, we use a one-tailed test of statistical significance to test this null hypothesis.

Data

Data come from Polk County, Iowa, a midsize county that lies within Iowa’s Fifth Judicial District. We limited our analysis to offenders supervised by officers in the Des Moines location. The Fifth Judicial District was an early adopter of RNR-style supervision and has substantially implemented components since 1997. In 2000, the agency began implementing standardized case planning. In 2002, it began implementing training in RNR-style practices that were fully implemented by 2004. Because the study period begins late in 2001, prior to full implementation, estimation may understate the full effectiveness of reduced caseloads in an RNR environment.

Although we do not present the evidence here, a longer report (Jalbert et al. 2011) shows that (1) POs who supervised ISP have smaller caseloads (about 30 offenders per PO) than POs who provide regular supervision⁹ (about 50 offenders per PO), (2) both control and correctional interventions are more frequent for the ISP caseload, and (3) ISP lasts sufficiently long (about 1 year on average) so that the dose of treatment is meaningful. Polk County uses a risk score (R from above) based on the Iowa Risk Assessment tool to classify offenders and assign them to ISP.

Data comprised 8,878 probationers who entered supervision between the years 2001 and 2007. Eighteen percent were placed on regular supervision and 37% were placed on ISP (other probationers were assigned to administrative or low supervision status). All male ISP offenders take part in a special treatment protocol but female offenders are assigned to other programming; consequently, we limit the analysis to males. Offenders assigned to special caseloads, for example sex offenders and offenders with serious mental illness, were also excluded from this analysis, as were offenders who were assigned to jail diversion or similar programming.

Overrides are allowed and we discovered that overrides almost always occur for specific conditions:

The offender is assaultive.

The offense was very serious.

The case plan indicated high needs.

We excluded offenders meeting these conditions from the analysis. We also excluded offenders when the Parole Board required intensive supervision and when the offender was not available for active supervision. We also excluded offenders who had risk scores of less than 18 because these offenders are routinely assigned to a lower level of supervision. Offenders had to be between the ages of 18 and 65. Because these classes of individuals were excluded from the analysis, we treat the treatment effect as generalizing to offenders who were included in the analysis, and even then, the generalization extends to those whose risk scores were at the margin.

Finally, we limited the analysis to offenders who had a valid secondary risk score (Level of Service Inventory–Revised [LSI-R]) that is not used for assigning offenders to intensive supervision and, surprisingly, is only modestly correlated with the Iowa Risk Score. We added this final limitation to demonstrate how the addition of covariates affected findings.

Estimation

RDD is a design for identifying a treatment effect. It is not an estimator. Most applications of RDD use least squares regression as the estimation procedure, but given our concern with criminal recidivism, we use partially parametric survival analysis (Cox Proportional Hazard models) to study time until criminal recidivism, subject to right-hand censoring. Recidivism is equated to an arrest for a new offense; sensitivity testing will define a new offense variously. Censoring arises from one of three causes: data collection ends, the sentence ends, or there is a probation revocation for a technical violation of the conditions of supervision exclusive of revocations imposed because of an arrest for a new crime. The third form of censoring is sometimes known as a competing event. We assume that the competing event is independent of criminal recidivism. Rhodes (1986) provides some justification for this assumption, but if it does not hold, whatever bias it introduces into parameter estimates is likely to have the same effect to the left and right of C.¹⁰ Diagnostic testing (not reported here) failed to reject the null hypothesis of proportional hazards but rejected the null that the survival distribution was Weibull (and hence possibly exponential).
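The censoring rules just described translate into a simple construction of the survival record for each offender. The sketch below is an illustration under stated assumptions: times are measured in days from the start of supervision, the argument names are hypothetical rather than fields from the Iowa data, and an arrest on the same day as a censoring event is counted as an arrest.

```python
def survival_record(days_to_arrest, days_to_technical_revocation,
                    days_to_sentence_end, days_to_data_end):
    """Return (time, event) for one offender.

    event = 1 if an arrest for a new offense is observed before any
    censoring event; event = 0 if the record is right-censored.
    Pass None for events that never occurred; days_to_data_end is
    always known.
    """
    censoring_times = [t for t in (days_to_technical_revocation,
                                   days_to_sentence_end,
                                   days_to_data_end) if t is not None]
    censored_at = min(censoring_times)
    if days_to_arrest is not None and days_to_arrest <= censored_at:
        return days_to_arrest, 1
    return censored_at, 0


arrested = survival_record(120, None, 365, 700)       # arrest before any censoring
sentence_over = survival_record(400, None, 365, 700)  # sentence ends first
revoked = survival_record(None, 90, 365, 700)         # technical revocation (competing event)
```

Treating the technical revocation as ordinary censoring is exactly where the independence assumption discussed above enters.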

Immediately below we will demonstrate that a sharp RDD is suitable for analyzing data from Iowa. We estimate two basic models. One includes R (the risk score) in the regression but no other covariates. The other model includes R and the LSI-R.

Diagnostics

There is some concern that the LSI-R has systematic missing patterns. However, when we regress a missing value indicator for the LSI-R on the risk score R and its square R², the regression has no significant explanatory power. We believe that the LSI-R can be treated as missing completely at random. Data imputation might be of some value but adding the additional complexity from imputing data would not add to the illustration.

According to Polk County’s classification policy, offenders with an Iowa Risk score of 21 or higher should be assigned to ISP (scores have been rescaled so all values are above zero). There are override criteria, but after excluding those overrides as discussed earlier, Figure 3 shows that the classification policy is broadly followed and that a sharp RDD is appropriate.

Figure 3.

Probability of selection into intensive and regular supervision as a function of risk score using Iowa data.

Next we determine whether the risk scores vary continuously around the critical threshold. In fact, there is a modest discontinuity in the distribution of risk scores between 20 and 21, but given the overall variation in risk scores, this appears to be attributable to the fact that the risk scores are discrete. That is, the distribution of risk scores appears to increase to a peak at 21 and to decline thereafter. There is no evidence of manipulating risk scores. See Figure 4. We conclude that RDD passes this diagnostic test.

Figure 4.

Distribution of the risk scores using Iowa data.

Another diagnostic test is to assure that other likely predictors of criminal recidivism vary continuously around the critical value. Figure 5 plots four variables that are likely predictors of criminal recidivism: the LSI-R score, the number of prior arrests, age, and a history of arrests for violent crimes. According to Figure 5, these four variables vary continuously about the critical threshold, suggesting that the risk score is not being manipulated to shift offenders into different supervision categories. On the other hand, offenders with risk scores lower than 21 are systematically different from offenders with risk scores of 21 or higher, as would be expected given that the variables appearing in the graph are associated with recidivism. The assumption of the RDD is that the offenders are equivalent as the risk score approaches 21 from below and from above, and this appears to be true as far as we can tell given the lumpiness of the risk scores.

Figure 5.

Trends in four variables associated with recidivism using Iowa data.

Lee and Lemieux (2009) recommend another simple diagnostic: Form bins based on the risk scores, compute the average outcome for each bin, and then plot the average outcomes against the midpoints of the bin scores. The important observation is how the average outcomes vary just to the left and to the right of the critical value C. (The first bin on the right is inclusive of the critical value.) If treatment is effective, one would expect the outcomes to improve sharply at the threshold. The graph also provides insight into the structural relationship between the risk score and the outcome.
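The binning step can be sketched as follows. This is a minimal illustration with hypothetical data; `binned_means` is not a routine from Lee and Lemieux, and the choice of bin width is left to the evaluator. Bins are aligned so that one bin starts exactly at the critical value, matching the convention that the first bin on the right includes C.

```python
from collections import defaultdict


def binned_means(scores, outcomes, cutoff, width=1):
    """Average outcome per bin, keyed by bin midpoint.

    Bin k covers [cutoff + k*width, cutoff + (k+1)*width), so bin 0
    starts at the cutoff and no bin straddles it.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r, y in zip(scores, outcomes):
        k = (r - cutoff) // width  # floor division handles scores below the cutoff
        sums[k] += y
        counts[k] += 1
    return {cutoff + (k + 0.5) * width: sums[k] / counts[k] for k in sorted(sums)}


# Hypothetical scores and 0/1 recidivism outcomes, with bins two points wide.
scores = [17, 18, 19, 20, 21, 22, 23, 24]
outcomes = [0, 1, 1, 1, 0, 1, 0, 0]
means = binned_means(scores, outcomes, cutoff=21, width=2)
```

Plotting the values in `means` against their keys gives the kind of graph the diagnostic calls for; the comparison of interest is between the two bins adjacent to the critical value.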

Lee and Lemieux provide guidance for selecting bin widths, but for the current application, this is straightforward because the risk scores are discrete. Figure 6 identifies the risk scores on the horizontal axis. The average outcomes are not so easily computed because of right-hand censoring. Our approach was to define recidivism as an arrest for a new offense within 6 months, within 1 year, within 18 months, and within 2 years. We used a Cox hazard estimator that controls for the LSI-R score, using results to estimate the rate of recidivism within 6 months, within 1 year, within 18 months, and within 2 years. The estimates are reported in the figure, which also provides a linear projection of the recidivism rates based on an ordinary least squares regression.

Figure 6.

Local linear regressions of outcomes (any arrest) and risk score based on a Cox hazard model using Iowa data.

To facilitate interpretation, we have drawn a linear regression through the estimates to the right of the critical risk score of 21 and separately to the left of 21. The line to the left of 21 includes a projection for a risk score of 21. The interpretation seems straightforward. First, projecting the regressions based on the regular supervision data to the risk score value of 21, there is a sharp break in the average outcome at the value of 21. Second, to the left of the critical threshold, criminal recidivism is an increasing function of the risk score; to the right, there is no strong relationship between the risk score and criminal recidivism. Most importantly, there is no compelling reason to believe that the relationship between the outcome and the risk score is anything but linear. However, there is considerable noise in these estimates and apparently some sensitivity to the length of the follow-up period.

Our analysis used three definitions of recidivism: an arrest for any new crime; an arrest for a property, drug, or violent crime; or an arrest for a property or violent crime. Figure 6 only shows plots for recidivism for any new crime, but plots for recidivism defined otherwise show comparable breaks in the outcome measure. Hereafter, we will only consider recidivism within a 6-month period and a 2-year period. Most offenders complete supervision at 1 or 2 years.

The earlier discussion of RDD methodology ignored issues that arise when the risk score is discrete. We postponed that discussion because Figure 6 provides an illustration that would have had little meaning had it been introduced earlier. Lee and Card (2006) give an extended treatment; this article summarizes. When the risk score is discrete, application of RDD requires projections outside the support for the regression, as shown clearly in the two figures. The lines in Figure 6 assume a linear relationship between the probability of recidivism and the risk score, but if that assumption is wrong, the projection of the left-hand-side regression to the risk score of 21 would be suspect. Additionally, Lee and Card argue that estimation should use cluster-consistent standard errors, where the cluster is determined by the value of the risk score. The analysis reported later adjusts standard errors for clustering.

Recidivism, Revocation, and the Effects of ISP

Offenders supervised under ISP and regular caseloads have high rates of recidivism; over two thirds are arrested for some new charge during or after supervision; nearly half are arrested within 6 months of the start of their supervision period. Most new charges are for public order offenses (65%), including traffic violations, that are often not punishable by lengthy incarcerations or indeed by any criminal sanction. The other third are for more serious matters: drug-law violations (8%), property crimes (12%), and violent crimes (15%). The majority (71%) of offenders with a new arrest during their supervision period also have their probation revoked, although the revocation does not always immediately follow the new arrest, and sometimes it is for a technical violation of the conditions of supervision. See Note 9 for a discussion of this issue.

We defined recidivism as an arrest for a new charge during or after the probation supervision period. Because the effectiveness of ISP may differ depending on the nature of the crime, we alternatively defined a new arrest for:

Public order, drug-law, property, or violent crime

Drug-law, property, or violent crime

Property or violent crime

We estimated treatment effects using several bandwidths:

20 to 21

19 to 22

18 to 23

18 to 24 and higher

The lower bound of the bandwidth is never below 18. This lower limit on the bandwidth is to prevent confusing offenders supervised under regular supervision with offenders supervised under lower levels of supervision. We put different constraints on the follow-up period for the survival analysis:

6 months

2 years

One reason for examining follow-up periods of varying length is that all offenders are likely under supervision for 6 months but some may have moved off active supervision (or off any supervision) before the end of 2 years. Treatment effects are reported as relative hazards. For example, a relative hazard of 0.75 implies that the hazard is reduced by 0.25, or 25%.

Table 1 reports the number of cases entering the analysis. We only used cases that had a valid LSI-R score, regardless of whether the LSI-R entered the regression as a covariate, because our intention is to compare results with and without using a covariate. Given that the minimum risk score is 18, the effective sample size does not grow much beyond the bandwidth of 18–24.

Table 1.

Number of Cases Entering the Analysis as a Function of Bandwidth.

Bandwidth	High-Normal Cases	ISP Cases
20–21	327	332
19–22	603	651
18–23	862	904
18–24+	862	1,043

Note. ISP = intensive supervision probation.

Table 2 reports the estimated relative hazard, the standard error for the estimated relative hazard, and a one-sided probability value for a test of the null hypothesis that the relative hazard equals 1. A relative hazard of less than 1 implies that intensive supervision decreased recidivism and a relative hazard of greater than 1 implies that intensive supervision increased recidivism. Estimation was performed using a Cox proportional hazard model with shared frailty to account for clustering by risk scores. When modeling used the LSI-R as a covariate, the model used the LSI-R and its squared value.
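For readers who wish to check how a relative hazard and its standard error map to a one-sided probability value, the sketch below shows one common construction. It is an assumption, not a statement of the procedure used to produce the tables: a Wald test on the log-hazard scale, with the standard error of the log relative hazard approximated by the delta method. The function name `one_sided_p` is hypothetical, and the inputs are the first row of Table 2.

```python
import math
from statistics import NormalDist


def one_sided_p(relative_hazard, std_error):
    """P-value for H0: relative hazard = 1 against H1: relative hazard < 1.

    Assumes a Wald test on the log scale, with
    se(log RH) approximated by se(RH) / RH (delta method).
    """
    log_rh = math.log(relative_hazard)
    se_log = std_error / relative_hazard
    return NormalDist().cdf(log_rh / se_log)


# First row of Table 2, no covariates: relative hazard 0.841, standard error 0.082.
p_value = one_sided_p(0.841, 0.082)
```

With these inputs the function returns about .038. Other software conventions, such as testing on the hazard-ratio scale itself, would give somewhat different values.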

Table 2.

Estimated Relative Hazards for a Cox Model With Shared Frailty: 2-Year Follow-Up.

Bandwidth	No Covariates			LSI-R as a Covariate
Bandwidth	Relative Hazard	Standard Error	One-Sided p Value	Relative Hazard	Standard Error	One-Sided p Value
Recidivism for any crime
20–21	0.841	0.082	.038	0.825	0.081	.025
19–22	0.622	0.106	.003	0.622	0.106	.003
18–23	0.779	0.098	.024	0.783	0.099	.027
18–23+	0.776	0.092	.017	0.783	0.093	.020
Recidivism for property, drug, or violent crime
20–21	1.005	0.166	.999	0.939	0.156	.353
19–22	0.671	0.204	.095	0.667	0.202	.091
18–23	0.770	0.174	.123	0.779	0.176	.135
18–23+	0.678	0.146	.036	0.688	0.148	.041
Recidivism for property or violent crime
20–21	0.951	0.174	.392	0.880	0.162	.244
19–22	0.537	0.182	.034	0.528	0.179	.030
18–23	0.726	0.182	.101	0.731	0.184	.107
18–23+	0.657	0.157	.040	0.664	0.159	.044

Note. LSI-R = Level of Service Inventory–Revised.

Evidence appears strong that intensive supervision has reduced recidivism when the outcome measure is defined as recidivism for any crime. The evidence is weaker when recidivism is defined otherwise, as the statistical significance is sensitive to the bandwidth.

One objection to using a 2-year follow-up period is that supervision ends after 1 year for many offenders. Additionally, the first 6 months of supervision are often seen as the period when offenders are at elevated risks (National Research Council 2008). Table 3 is the same as Table 2 except that Table 3 reports the estimated relative hazard when the follow-up period is limited to 6 months.

Table 3.

Estimated Relative Hazard for the Cox Model With Shared Frailty: 6-Month Follow-Up.

Bandwidth	No Covariates			LSI-R as a Covariate
Bandwidth	Relative Hazard	Standard Error	One-Sided p Value	Relative Hazard	Standard Error	One-Sided p Value
Recidivism for any crime
20–21	0.827	0.107	.072	0.854	0.110	.110
19–22	0.626	0.143	.020	0.615	0.140	.017
18–23	0.701	0.119	.018	0.704	0.120	.020
18–23+	0.706	0.113	.015	0.712	0.114	.017
Recidivism for property, drug, or violent crime
20–21	0.911	0.193	.331	0.836	0.179	.201
19–22	0.454	0.182	.024	0.433	0.173	.019
18–23	0.579	0.173	.034	0.574	0.172	.032
18–23+	0.518	0.147	.011	0.516	0.147	.010
Recidivism for property or violent crime
20–21	0.832	0.191	.212	0.746	0.173	.104
19–22	0.372	0.162	.012	0.349	0.152	.008
18–23	0.532	0.171	.025	0.523	0.169	.023
18–23+	0.488	0.149	.010	0.482	0.148	.009

Note. LSI-R = Level of Service Inventory–Revised.

The evidence is stronger that intensive supervision reduces recidivism, however defined, during the first 6 months of supervision. In contrast to results reported in Table 2, the standard errors in Table 3 tend to get smaller as the bandwidth expands. They also tend to get smaller as covariates are included in the Cox regression.

Inspecting the standard errors, our interpretation is that the analysis is only sufficiently powered to find what we perceive to be very large treatment effects, despite the fact that the samples are large. One might conclude that we were “lucky” to find significant treatment effects. Readers considering an RDD approach for other jurisdictions are cautioned that Polk County barely provided an adequate sample size for performing the analysis. Using RDD in even smaller jurisdictions may be ill advised. Offsetting this warning, given that intensive supervision is expensive, only a large treatment effect would justify implementing or continuing such a program.

Note that in this application the use of a covariate in the form of the LSI-R risk score made little difference. Because a covariate is unnecessary to identify the treatment effect, there is little surprise that including and excluding the covariate made little difference to the relative hazards. More interesting, perhaps, is that the use of a covariate had little or no effect on the estimated standard errors.

As noted, RDD is an identification strategy. A survival model was selected as an estimation strategy, given that the data were censored, a not uncommon way of estimating recidivism models. A further sensitivity test comes from treating the outcome variable as a dichotomous outcome—recidivism or no recidivism. There are costs to using a linear model with a dichotomous dependent variable. Perhaps the most important is that the regression cannot account for censoring resulting from variable length follow-up periods. Because of this limitation, we restricted the follow-up period to 6 months, and we added a covariate to control for the possible time at risk. This covariate was the smaller of (1) 6 months and (2) the last date observed minus the date entered supervision. Another limitation is that the linear model is heteroscedastic, and we dealt with this by using feasible generalized least squares.¹¹ Finally, the data are clustered, and we used a cluster-robust variance estimator. Table 4 reports results.
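The two variables that the linear probability model needs, the dichotomous outcome and the time-at-risk covariate, can be constructed as sketched below. The sketch makes assumptions the text leaves open: "6 months" is approximated as 183 days, and the function names and dates are hypothetical.

```python
from datetime import date

SIX_MONTHS_DAYS = 183  # assumption: 6 months approximated as 183 days


def time_at_risk(entered, last_observed, cap_days=SIX_MONTHS_DAYS):
    """Smaller of the cap and the days between entry and last observation."""
    return min(cap_days, (last_observed - entered).days)


def arrested_within_cap(entered, arrest_date, cap_days=SIX_MONTHS_DAYS):
    """1 if a new arrest occurred within the cap, else 0 (None = no arrest)."""
    if arrest_date is None:
        return 0
    return int((arrest_date - entered).days <= cap_days)


# Hypothetical offender observed for only two months, with no arrest.
short_exposure = time_at_risk(date(2003, 1, 1), date(2003, 3, 1))
# Hypothetical offender observed for a full year, arrested after 100 days.
full_exposure = time_at_risk(date(2003, 1, 1), date(2004, 1, 1))
outcome = arrested_within_cap(date(2003, 1, 1), date(2003, 4, 11))
```

The exposure covariate is only a partial substitute for a proper treatment of censoring, which is why the survival model remains the primary specification.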

Table 4.

Estimated Treatment Effects Using a Linear Probability Model With Robust Standard Errors: 6-Month Follow-Up.

Bandwidth	No Covariates			LSI-R as a Covariate
Bandwidth	Treatment Effect	Standard Error	One-Sided p Value	Treatment Effect	Standard Error	One-Sided p Value
Recidivism for any crime
20–21	−0.017	0.002	.036	−0.059	0.002	.009
19–22	−0.078	0.001	.000	−0.161	0.004	.000
18–23	−0.061	0.024	.026	−0.100	0.040	.027
18–23+	−0.067	0.024	.009	−0.092	0.037	.015
Recidivism for property, drug, or violent crime
20–21	−0.004	0.003	.208	−0.031	0.000	.001
19–22	−0.053	0.001	.000	−0.096	0.002	.000
18–23	−0.017	0.017	.190	−0.035	0.031	.150
18–23+	−0.027	0.018	.084	−0.041	0.031	.102
Recidivism for property or violent crime
20–21	−0.012	0.002	.054	−0.040	0.000	.003
19–22	−0.059	0.001	.000	−0.102	0.002	.000
18–23	−0.023	0.016	.108	−0.034	0.033	.170
18–23+	−0.032	0.017	.043	−0.039	0.032	.126

Note. LSI-R = Level of Service Inventory–Revised.

Generally, the findings based on Table 4 are consistent with the findings based on Table 3. Evidence is strongest that intensive supervision reduced recidivism when recidivism is defined as an arrest for any crime. The evidence is weaker when recidivism is defined as an arrest for a limited range of crimes because the estimates are sensitive to the bandwidth.

Based on the literature review, we are concerned that ISP may increase the rate of revocations for technical violations. Although we do not report details here, summarizing the results is straightforward: There is no strong evidence that ISP increased the hazard rate for revocations for technical violations. Still, conclusions require caution. The minimum detectable effects are large. There is not much power to detect moderate effects on increasing the rate of revocations for technical violations.

Discussion

The National Institute of Justice is a frequent sponsor of criminal justice program evaluations. Many of its solicitations advise applicants that “… funding priority will be given to experimental research designs that use random selection and assignment of participants to experimental and control conditions. When random designs are not feasible, priority will be given to quasi-experimental designs that include Regression Discontinuity Design to address selection bias in evaluating outcomes and impacts.” If NIJ is a barometer, RDD is becoming mainstream as a preferred evaluation tool. Indeed, some researchers assert that RDD is almost as good as random assignment. While the authors believe in the utility of RDD, a critical review ends this introduction.

Inference based on random assignment designs requires no strong assumptions beyond the integrity of the random assignment. Inference based on RDD requires some stronger assumptions. Granted, in contrast to other research designs, the assumptions supporting use of RDD are partly testable. However, while an evaluator can reject a null that the assumptions hold, and therefore reject use of RDD, the alternative is to accept the null and maintain the assumptions. This does not mean that the assumptions are correct—merely that they are consistent with the evidence and not rejected. While adequate data strengthen conclusions about assumptions, the need to make some assumptions means that inferences based on RDD are not on the same plane as inferences based on random assignment.

Unless the treatment effect is homogeneous, RDD and a randomized controlled trial (RCT) estimate different things. RDD estimates the treatment effect at the margin as that term was defined earlier. In contrast, RCT estimates the average treatment effect over study subjects within a range equivalent to the bandwidth. If Polk County wanted to know if the selection rule for ISP assured that offenders at the margin benefited, then RDD provides an answer. However, it is possible that intensive supervision is relatively ineffective at the margin where R = C but very effective for higher risk offenders, or of course, intensive supervision may be very effective at the margin but comparatively ineffective at higher risk scores. (Figure 6 informally implies that intensive supervision is increasingly effective as the risk score increases.) The authors’ opinion is that jurisdictions should routinely change the threshold values and test whether higher or lower thresholds might be adopted.

The marginal treatment effect may not be the policy question, however. Program interventions sometimes set selection rules so that the benefit from treatment is modest at the margin but increases with R. For example, the federally funded school lunch program may have little benefit for children whose family income is at the threshold for eligibility, but it may have great benefit for children whose family income is much lower than the threshold. In this case, RDD might not answer the most interesting policy question: Is the school lunch program cost effective? Still, the value of RDD depends on the context. School lunch programs are unlikely to be eliminated based on any program evaluation. They are likely to be expanded or contracted, and making a decision about marginal program adjustments requires an estimate of marginal program effects.

Clearly, RDD and random assignment experiments are not the same with respect to statistical power. The treatment effect estimated by a random assignment experiment has a smaller sampling variance than that of the treatment effect estimated by an RDD given equivalent sample sizes.

There is a risk of placing too much weight on the advantages of an RCT and the disadvantages of an RDD, so some perspective may be useful. First, an RCT can be difficult to implement in a community supervision setting. Polk County was one of three sites studied by the authors. In one of our sites, we randomly assigned POs to regular supervision or low-caseload intensive supervision, and we randomly assigned probationers to the regular supervision officers or the intensive supervision officers. The low-caseload officers were content with their assignments and stayed in their supervision status, while the higher caseload officers accepted opportunities to move into other assignments. The experiment collapsed. An advantage of RDD is that it is not disruptive of ongoing community corrections activities and the treatment effect can be estimated with retrospective data while an RCT requires estimation with prospective data. Community supervision agencies may be unwilling or unable to wait for the outcome from an RCT.

Second, RCT has greater power than RDD for the same sample size. One should not place too much weight on this apparent advantage, however. RDD typically uses institutional administrative data, so that the sample available for RDD may be much larger than the sample available for a randomized experiment. Furthermore, RDD can be repeated over time at little additional cost. In practice, RDD may provide estimates that are much more precise than those from random assignment experiments, at lower cost to the evaluator or agency.

Third, RCT estimates the average treatment effect and by itself says nothing about the distribution of the treatment effect, so RCT does not lead to an inference about treatment at the margin. Yet a community corrections agency employing RNR supervision has to set the threshold value, so the treatment effect at the margin is an important consideration. It is possible to partition the random sample by risk scores either after or before random assignment and estimate the treatment effect within each partition. This presumes that an RCT is adequately powered to identify average treatment effects within partitions and in practice RCT designs are seldom powered to estimate treatment effects for subgroups.

RDD is unquestionably valuable as an evaluation tool and has only begun to be used in criminal justice program evaluation. Evaluation techniques are developing rapidly, and our understanding of RDD will only grow. As more evaluators become comfortable using RDD, the cutting-edge diagnostics and estimation routines that this article has only mentioned may become conventional. This article is meant as a starting point for evaluators interested in applying the design, not the final word.

Footnotes

Acknowledgment

The authors wish to thank Abt Associates’ Journal Author Support Group for many helpful suggestions. The authors also wish to thank the Iowa Fifth Judicial District and Iowa Department of Corrections for their participation in the original evaluation.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The data used for this article were collected under Grant #2006-IJ-CX-0011, awarded to Abt Associates by the Office of Justice Programs, National Institute of Justice.
