Abstract
Background:
Corrections agencies frequently place offenders into risk categories, within which offenders receive different levels of supervision and programming. This supervision strategy is seldom evaluated, but it often can be evaluated through routine use of a regression discontinuity design (RDD). This article argues that RDD provides a rigorous and cost-effective method for correctional agencies to evaluate and improve supervision strategies and advocates for using RDD routinely in corrections administration. The objective is to better employ correctional resources.
Method:
This article uses a Neyman–Rubin counterfactual framework to introduce readers to RDD, to provide intuition for why RDD should be used broadly, and to motivate a deeper reading into the methodology. The article also illustrates an application of RDD to evaluate an intensive supervision program for probationers.
Result:
Application of the RDD, which requires basic knowledge of regressions and some special diagnostic tools, is within the competencies of many criminal justice evaluators. RDD is shown to be an effective strategy to identify the treatment effect in a community corrections agency using supervision that meets the necessary conditions for RDD.
Conclusion:
The article concludes with a critical review of how RDD compares to experimental methods to answer policy questions. The article recommends using RDD to evaluate whether differing levels of control and correction reduce criminal recidivism. It also advocates for routine use of RDD as an administrative tool to determine cut points used to assign offenders into different risk categories based on the offenders’ risk scores.
Introduction
A growing field of study in community corrections is devoted to developing prediction tools that differentiate offenders based on their risk of reoffending. Corrections officers place offenders whose risk scores exceed an upper threshold into a high-risk category, within which controlling and correctional resources are applied intensively. Officers place other offenders into lower risk categories, within which controlling and correctional resources are applied at lower doses. Two evaluation questions arise: (1) Does the use of intensive supervision (i.e., enhanced application of both controlling and correctional resources) improve supervision outcomes? (2) How would outcomes change if the threshold were altered?
Although this strategy of triage and targeting of controlling and correctional resources has several different names, in this article, we call it the risk–need–responsivity (RNR) model. The RNR approach is based on research showing that people with certain characteristics are more likely to commit crimes, but with appropriate treatment and monitoring those characteristics can be modified or mediated to prevent future recidivism (Lipsey and Cullen 2007; Gaes et al. 1999; Gendreau, Goggin, and Little 1996). RNR gives practitioners a rationale for allocating scarce resources to those who pose the greatest threat to public safety and future criminal justice costs, but agencies often apply the RNR principle without testing whether RNR actually improves outcomes in applied settings and whether the threshold used to assign treatment conditions is optimal. 1
Testing the threshold is practical using a regression discontinuity design (RDD) evaluation framework: RDD provides the means for a community corrections agency to evaluate the effectiveness of RNR on an ongoing basis with minimal costs and with minimal disruption for the agency. The costs are minimal because RDD uses administrative data that are routinely maintained by community corrections agencies and the statistics required to apply the RDD approach are within the competence of local evaluators. The disruption is minimal because RDD may be applied to an ongoing community corrections system that has already implemented triage and targeting, although testing the efficacy of a varying threshold would require experimental manipulation of the threshold.
The authors of this article have worked with probation agencies using an RDD design to assess whether using rules based on risk to assign offenders to regular and intensive supervision has reduced criminal recidivism. We believe that RDD should be used more broadly to evaluate RNR practices in community supervision. We also believe that RDD should be used as an ongoing evaluation tool to inform restructuring of threshold values for offender assignment to risk categories. RDD is a valuable evaluation tool for community supervision agencies struggling to make the best use of their limited controlling and correctional resources.
The statistics behind RDD are established (Hahn, Todd, and van der Klaauw 2001; van der Klaauw 2002; Imbens and Lemieux 2007; Lee and Lemieux 2009), and others have used RDD in criminal justice evaluations (Berk and DeLeeuw 1999; Berk and Rauma 1983; Chen and Shapiro 2007; Berk et al. 2010; Jalbert et al. 2010), but as Berk notes (Berk et al. 2010), use of this powerful tool has been infrequent in criminal justice evaluation. The reason may be that applied researchers are unfamiliar with the RDD framework and with applicable estimation procedures. This article outlines the logic of RDD and discusses some practical aspects of applying the design in a community corrections setting. It illustrates application of RDD in a single probation agency. The authors’ intention is to motivate readers to investigate more technical treatments of RDD and to use the RDD technique in community corrections settings.
This article is organized to introduce readers to the basic assumptions that underlie RDD design. We discuss the treatment effect that is identified by RDD and discuss diagnostics, standard errors, and power. We then move from the abstract to the concrete by using an illustration from part of an evaluation of intensive supervision probation (ISP). Finally, we use the discussion to suggest policy questions best answered by RDD, and how ongoing RDD might be incorporated into corrections administration.
The RDD
The RDD provides a quasi-experimental method for identifying a treatment effect in settings where offenders are triaged based on risk scores and given different doses of controlling and correctional interventions depending on an offender’s risk category. Community corrections is such a setting but so too are many pretrial supervision settings and jail or prison settings. The first subsection discusses how the RDD identifies a specific treatment effect. Justification of an RDD rests on assumptions about the data generation process, and if those assumptions do not hold, the treatment effect is not identified. A practical aspect of RDD is that assumptions are testable, and the second subsection explains some tests. Finally, RDD comes in two forms—sharp RDD and fuzzy RDD. The latter requires some special steps to compute test statistics and confidence intervals and a third subsection explains.
Identifying the Treatment Effect Using an RDD
This section explains the RDD using assignment to intensive supervision as an example. The evaluator observes outcomes for
To formalize the argument, adopt some definitions:
$T_i$: This is a dummy variable denoting that the ith offender was assigned to intensive supervision (T = 1) or to regular supervision (T = 0). Intensive supervision is considered the treatment.
$Y_i$: This is the outcome measure, here considered to reflect recidivism.
$R_i$: This is a risk score considered to be measured on a continuous scale although practically measured on an integer scale. It is known at the time of assignment and its estimation is not part of the exercise.
$X_i$: This is a covariate, possibly a vector of covariates.
$C$: This is a critical value of the risk score, the threshold value that determines the treatment. Its purpose will be clarified but generally offenders are assigned to intensive supervision if $R_i \ge C$.
$I_i$: This is an indicator variable such that $I_i = 1$ if $R_i \ge C$ and $I_i = 0$ otherwise.
We introduce the RDD using Rubin’s counterfactual framework expressed via three equations: The outcome equation if the offender had been assigned to regular supervision (Equation 1), the outcome equation if the offender had been assigned to intensive supervision (Equation 2), and the selection equation determining whether the offender was assigned to intensive supervision (Equation 5). Greek letters designate parameters in these three equations. Defined for all offenders, the first equation is the outcome equation given regular supervision:
Placing T = 0 in parentheses denotes the outcome that would be observed if the offender had been placed on regular supervision; the outcome is actually observed only for offenders placed on regular supervision. $R_i - C$ is a convenient (optional) transformation of the risk score. The transformation emphasizes that the expected outcome is $\alpha_1$ when $R_i = C$ and the offender is assigned to regular supervision. Readers who are so inclined might approximate a more complex equation using a polynomial of any degree
The second equation is the outcome equation given intensive supervision:
Placing T = 1 in parentheses denotes the outcome that would be observed if the offender had actually been placed on intensive supervision. The transformation emphasizes that the expected outcome is $\beta_1$ when $R_i = C$ and the offender is assigned to intensive supervision. Again the equation could be written as a polynomial; covariates could be included; and u is a zero-mean random error term independent of R − C. Given Equations 1 and 2, the treatment effect for the ith offender is defined as the reduction in recidivism from applying intensive supervision instead of regular supervision, which is expressed as a difference:
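As a reading aid, one way to write the two outcome equations and their difference is sketched here, assuming linear forms without covariates. The slope symbols $\alpha_2$ and $\beta_2$, the error symbol $e_i$ for regular supervision, and the sign convention (intensive minus regular) are assumptions made for illustration; only $\alpha_1$, $\beta_1$, and $u$ are named in the text.

$$
\begin{aligned}
Y_i(T=0) &= \alpha_1 + \alpha_2 (R_i - C) + e_i \\
Y_i(T=1) &= \beta_1 + \beta_2 (R_i - C) + u_i \\
Y_i(T=1) - Y_i(T=0) &= (\beta_1 - \alpha_1) + (\beta_2 - \alpha_2)(R_i - C) + (u_i - e_i)
\end{aligned}
$$

Under these assumptions the expected difference at $R_i = C$ reduces to $\beta_1 - \alpha_1$, and it does not depend on the risk score at all when $\beta_2 = \alpha_2$.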
Equation 3 is a theoretical construct: This treatment effect is never observed because an evaluator can observe
The treatment effect might be homogeneous and thus not depend on R, in which case the treatment effect would be just

Figure 1. An illustration of the sharp and fuzzy regression discontinuity design using hypothetical data.
RDD avoids this identification problem by adopting a seemingly innocuous assumption that offenders with a risk score slightly less than C and offenders with a risk score slightly larger than C would have essentially the same outcomes if all had been assigned to regular supervision. This provides the contrast at the heart of RDD: What were the outcomes for offenders with risk scores near C who were assigned to regular supervision compared with the outcomes for offenders with risk scores near C who were assigned to intensive supervision? We have to introduce the selection equation to investigate the solution following from this assumption. The selection equation is written as the probability of being assigned to intensive supervision:
Earlier, we defined a variable
To develop the argument, combine Equations 1, 2, and 5 to rewrite the expected outcome as:
The first line shows the expected outcome as a mixture that depends on the probability of being assigned to intensive supervision. The second line shows the mixture probability in terms of the parameters appearing in Equation 5. To use RDD to identify the treatment effect, select offenders whose risk scores are slightly less than C and offenders whose risk scores are slightly higher than C. When the scores are sufficiently near C,
The formal argument is asymptotic (Hahn, Todd, and van der Klaauw 2001) and we will not develop it here. Estimating the parameters of Equation 7 can be seen as a regression of the observed outcome Y on the estimated probability of being assigned to intensive supervision and the combination of Equations 5 and 7 might be seen as an instrumental variable estimator. The necessity of the condition
The practical problem with this estimator is that there may be few offenders with risk scores just slightly less than C and with scores just equal to C. There may in fact be none because risk instruments are usually scaled as integers
The discontinuity in the selection equation is the identification condition. Given the identification condition, there are many ways to estimate the parameters. In general, rewrite the last line of Equation 6, group the variables, a constant, R − C, (R − C)², I, and I(R − C), and substitute parameters that are estimated by a regression on those grouped variables. The new parameters are ξ, which are functions of the α, β, and γ parameters:
To interpret the meaning of the
Methodologists writing about RDD commonly use a graphical device to provide intuition for why RDD estimates the treatment effect at R = C. Based on simulated data, created to illustrate points, Figure 1 (sharp RDD) shows three curves: The first curve is a graphical illustration of Equation 1, representing the potential outcome assuming a polynomial and given regular supervision. The second curve is an illustration of Equation 2, representing the potential outcome assuming a polynomial and given intensive supervision. Both these lines are drawn with light curves, which are partially obscured because they are coincident with the third curve, representing the observed outcomes given that all offenders with
As a function of the risk score, the treatment effect is the vertical distance between the first two curves. Although the treatment effect is defined, it is not in general identified. For scores below the critical risk score of C = 10, we can observe outcomes for offenders assigned to regular supervision, but we cannot observe the counterfactual for comparable offenders who were assigned to intensive supervision. For scores equal to 10 or above, we can observe the outcomes for offenders assigned to intensive supervision, but we cannot observe the outcomes for comparable offenders assigned to regular supervision. The treatment effect is only identified at the margin where
Using the same simulated data, Figure 1 (fuzzy RDD) shows the same three curves, but the probability of assignment to intensive supervision is a continuous function of R except at R = C where it is discontinuous. (Figure 2 shows the discontinuity.) The first and second curves are the same as in Figure 1 (sharp RDD) but the third curve, representing the observable outcomes, differs. Arguably, we could estimate the first two curves because, according to the assignment equation, we can always observe some offenders assigned to regular and intensive supervision for every risk score. There are two problems, however: Decreasing numbers of offenders receive intensive supervision as R decreases and decreasing numbers of offenders receive regular supervision as R increases, and we would be increasingly concerned with selection bias. However, given the discontinuity at R = C, we can unambiguously identify the treatment effect at R = C. As shown in Figure 1 (fuzzy RDD), what might be called the intent-to-treat estimator of the treatment effect equals about 0.0289 units at C = 10. According to Figure 2, the probability of treatment jumps from .269 to .731 at C = 10, or by .462. Therefore, an estimate of the size of the treatment effect is 0.0289/0.462 ≈ 0.0626; the sharp and fuzzy RDD give the same answers.
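The arithmetic in this paragraph can be checked with a few lines of Python. This is a minimal sketch using only the three values quoted in the text for the hypothetical data; the function and variable names are illustrative assumptions, not part of the original analysis.

```python
# Values quoted in the text for the hypothetical fuzzy RDD at C = 10.
itt_jump = 0.0289   # jump in the expected outcome at the cut point
p_below = 0.269     # probability of intensive supervision just below C
p_above = 0.731     # probability of intensive supervision at C


def fuzzy_rdd_effect(outcome_jump, p_low, p_high):
    """Scale the outcome jump by the jump in treatment probability."""
    prob_jump = p_high - p_low
    if prob_jump == 0:
        raise ValueError("no discontinuity in treatment probability")
    return outcome_jump / prob_jump


effect = fuzzy_rdd_effect(itt_jump, p_below, p_above)
print(round(p_above - p_below, 3), round(effect, 4))  # prints 0.462 0.0626
```

The guard on a zero probability jump mirrors the identification condition: without a discontinuity in the selection equation the ratio is undefined.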

Figure 2. An illustration of the selection equation for a fuzzy RDD using hypothetical data.
We qualify the previous discussion with two technical observations. First, although the intuition is helpful, this equivalency between the estimated treatment effect from sharp RDD and fuzzy RDD will not typically hold. The estimated treatment effect from the RDD is called a local average treatment effect (LATE); the estimated treatment effect from a fuzzy RDD is sometimes distinguished as the LATE at the cut point (LATEC). A general literature describes LATE (Imbens and Angrist 1994; Heckman and Vytlacil 2007; Gennetian et al. 2005) and Bloom discusses LATEC with respect to RDD (Bloom 2013). The crux of the argument is that if treatment effects are heterogeneous at R = C, RDD identifies the treatment effect for a specific subset of offenders with risk scores of R = C, 6 so given heterogeneous treatment effects the sharp and fuzzy designs lead to estimates of different treatment effects. Second, the discussion implies that the estimated treatment effect is literally the treatment effect when R = C, and while this is often the way that evaluators interpret the effect, an alternative interpretation is that the treatment effect is a weighted average of heterogeneous treatment effects over the entire bandwidth (Lee 2008; Lee and Lemieux 2009; Bloom 2013). 7 This article does not pursue this argument further: While these qualifications are relevant for interpreting the treatment effect, the weighting is unobservable so the argument has minimal practical value for applied researchers.
This discussion has explained how RDD identifies the treatment effect. Some comments are instructive:
Examining Equation 8, researchers will often assume that
There are many ways to estimate the local linear regression within the bandwidth, some of them very sophisticated, many of them controversial. Our discussion does not go beyond a simple regression but technical discussions and software packages often employ other devices including kernel density estimation (Lee and Lemieux 2009). Simple, traditional estimators are likely to work for most applied researchers.
It is practical to estimate Equation 8 using two regressions, one using data for all observations within the bandwidth where
Testing the statistical significance of the treatment effect is a standard test of the null hypothesis that
The bandwidth is a concern. The narrower the bandwidth, the more likely that a local linear regression will fit the data, but the fewer the data points, the higher the standard error for the estimated treatment effect. The wider the bandwidth, the greater the validity challenges to the simple regression specification, but the smaller the standard error. Evaluators will typically expand the bandwidth, progressively examining how a wider bandwidth changes estimates of the treatment effect, and reporting the sequence of estimates. As we discuss later in this article, formal procedures for expanding the bandwidth have been proposed (Lee and Lemieux 2009), but unfortunately these test–retest procedures impart additional uncertainty to the estimates that is difficult to evaluate.
The discussion assumes that the risk score is measured on a continuous scale so that in theory there are offenders with risk scores that are infinitesimally smaller than C. In criminal justice problems, risk scores are discrete.
Lee and Card (2006) discuss how using discrete risk scores affects estimation, including the greater risk of specification error and the need to account for clustering based on risk scores. This article will return to these issues with the applied illustration, where risk scores are discrete.
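The two-regression approach mentioned above can be sketched in Python as follows. This is a minimal illustration on simulated data, assuming a sharp design, a continuous risk score, a linear specification on each side of the cut point, and no covariates; the function names, the simulated effect of -0.06, and the bandwidth of 3 are all hypothetical choices, not values from the evaluation.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def sharp_rdd_estimate(scores, outcomes, cutoff, bandwidth):
    """Difference between the two fitted intercepts at the cut point."""
    left = [(r - cutoff, y) for r, y in zip(scores, outcomes)
            if cutoff - bandwidth <= r < cutoff]
    right = [(r - cutoff, y) for r, y in zip(scores, outcomes)
             if cutoff <= r <= cutoff + bandwidth]
    a_left, _ = fit_line(*zip(*left))    # regular supervision side
    a_right, _ = fit_line(*zip(*right))  # intensive supervision side
    return a_right - a_left


# Simulated data (an assumption for illustration): treatment lowers the outcome by 0.06.
random.seed(0)
C, TRUE_EFFECT = 10.0, -0.06
scores = [random.uniform(5, 15) for _ in range(4000)]
outcomes = [0.5 + 0.02 * (r - C) + (TRUE_EFFECT if r >= C else 0.0)
            + random.gauss(0, 0.05) for r in scores]
estimate = sharp_rdd_estimate(scores, outcomes, C, bandwidth=3.0)
print(round(estimate, 3))
```

Because the risk score is centered at C before fitting, each intercept is the fitted outcome at the cut point, so their difference is the estimated jump. With discrete risk scores the left-hand regression has to be projected to C, which this sketch does implicitly.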
Diagnostics
An array of designs exists for evaluators seeking to estimate treatment effects (Rosenbaum 2002; Cameron and Trivedi 2005; Lee 2005; Morgan and Winship 2007; Angrist and Pischke 2009), but with the exception of random assignment, all rest on largely untestable assumptions. In contrast, the assumptions that underlie RDD can be partly tested, weakening the need to rely on them. If tested assumptions are rejected, inferences based on RDD are called into question and the evaluator might seek another approach. We discuss some diagnostic tests here.
RDD requires that the probability of treatment be discontinuous at the critical value C. Testing this assumption is a matter of estimating the parameters of Equation 5, perhaps after specifying a more complex structure. The diagnostic test is based on the null hypothesis that
The second diagnostic test is that the outcome equation absent treatment is continuous at C. This is not testable directly because the outcome absent treatment is not observable. There are indirect tests, however. One indirect test is to determine whether the risk score R is distributed continuously about C. This test is useful because Y is a function of R, so discontinuity in the distribution of R implies discontinuity in the distribution of Y. The validity of the continuity assumption might be determined using a regression similar to that used to estimate the probability of treatment. Alternatively, one might stratify the R and use a histogram to determine that there are no discontinuities around the value of C. More formal tests are available (McCrary 2007). The test is less useful when the risk score is discrete because there is some natural lumpiness to discrete data.
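A crude version of the histogram check for discrete scores might look like the following Python sketch. It is not the formal test cited above; the function names, the "heaping ratio" summary, and the example counts are hypothetical and serve only to show the idea of comparing the count just below C with the counts around it.

```python
from collections import Counter


def counts_near_cutoff(scores, cutoff, width):
    """Return {score: count} for integer scores within `width` of the cutoff."""
    tally = Counter(scores)
    return {s: tally.get(s, 0) for s in range(cutoff - width, cutoff + width + 1)}


def heaping_ratio(scores, cutoff):
    """Count at cutoff - 1 relative to the average count of its two neighbours."""
    tally = Counter(scores)
    neighbours = (tally.get(cutoff - 2, 0) + tally.get(cutoff, 0)) / 2
    return tally.get(cutoff - 1, 0) / neighbours if neighbours else float("nan")


# Hypothetical integer risk scores with a smooth rise to a peak at 21.
example_scores = [18] * 40 + [19] * 50 + [20] * 55 + [21] * 60 + [22] * 50
print(counts_near_cutoff(example_scores, 21, 1))
print(heaping_ratio(example_scores, 21))
```

A ratio well above 1 would suggest scores piling up just below the threshold, which is the pattern one would expect if scores were being adjusted to avoid intensive supervision. Natural lumpiness in discrete scores means the ratio should be read informally.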
Another complementary indirect test examines the distribution of covariates X even if these do not appear in the regression. The concern is that the risk score may be manipulated either by probation officers (POs) in order to shift an offender into more intensive supervision or by the offender (through self-reported behaviors) to avoid being shifted into more intensive supervision. Manipulation of R will have an incidental effect on the distribution of X. If there is no manipulation of the risk score, then an evaluator would expect to see a reasonably smooth distribution of the X variables about the critical value C. If the data fail to pass that test, the validity of the RDD is called into question.
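The covariate check can be sketched as a comparison of covariate means immediately below and at or above C. The Python example below is a simplified illustration under stated assumptions: a narrow bandwidth so that the underlying trend in the covariate is negligible, independent observations, and hypothetical names and data. With a wider bandwidth the same local regression used for the outcome would be more appropriate.

```python
from math import sqrt
from statistics import mean, stdev


def covariate_gap(scores, covariate, cutoff, bandwidth):
    """Mean covariate difference (at/above minus below) and a z-statistic."""
    below = [x for r, x in zip(scores, covariate)
             if cutoff - bandwidth <= r < cutoff]
    above = [x for r, x in zip(scores, covariate)
             if cutoff <= r < cutoff + bandwidth]
    gap = mean(above) - mean(below)
    se = sqrt(stdev(above) ** 2 / len(above) + stdev(below) ** 2 / len(below))
    return gap, gap / se


# Hypothetical ages for offenders scoring 20 (below C = 21) and 21 (at C).
scores = [20, 20, 20, 21, 21, 21]
ages = [30, 32, 34, 31, 32, 33]
gap, z = covariate_gap(scores, ages, cutoff=21, bandwidth=1)
print(gap, z)
```

A large z-statistic for a covariate that should vary smoothly across C would be a warning sign of manipulation.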
Still another diagnostic comes from regressing the outcome Y on the risk score. If there is no discontinuity at C, then there is no treatment effect. This might be done by estimating robust regressions on both sides of C. Typically, the regression would be linear, and if a polynomial were used, it would not be of high degree. The purpose of this diagnostic is to test the specification of the regressions on the right and left of C. A visual test might suffice; Lee and Lemieux (2009) and Lee and Card (2006) discuss formal tests. When the risk score is discrete, the test requires projecting the regression on the left of C to C, adding some uncertainty to the test.
This last diagnostic raises a practical issue. The objective is to estimate the
Estimation and diagnostics assume a bandwidth, so testing the sensitivity of results to the size of the bandwidth seems prudent. Recall that the estimated treatment effect has the greatest validity when the bandwidth is narrow, but it has the greatest efficiency when the bandwidth is wide. There are some formal tests for the optimal bandwidth (Imbens and Kalyanaraman 2009), but these approaches can be complicated, and a practical alternative is to experiment by expanding bandwidth and assessing how the expansion affects estimates (Angrist and Pischke 2009).
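The practical alternative of expanding the bandwidth and reporting the sequence of estimates can be sketched as a simple loop. The Python example below is an illustration on simulated data under assumptions chosen here: a sharp design, a linear fit on each side of C, a simulated effect of -0.06, and a cubic term in the true outcome so that wide bandwidths strain the linear specification. All names and values are hypothetical.

```python
import random


def intercept_at_zero(points):
    """OLS intercept for y = a + b*x fitted to (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return my - (sxy / sxx) * mx


def rdd_by_bandwidth(scores, outcomes, cutoff, bandwidths):
    """Return [(bandwidth, n_used, estimate)] for each candidate bandwidth."""
    rows = []
    for h in bandwidths:
        left = [(r - cutoff, y) for r, y in zip(scores, outcomes)
                if cutoff - h <= r < cutoff]
        right = [(r - cutoff, y) for r, y in zip(scores, outcomes)
                 if cutoff <= r <= cutoff + h]
        estimate = intercept_at_zero(right) - intercept_at_zero(left)
        rows.append((h, len(left) + len(right), estimate))
    return rows


random.seed(1)
C = 10.0
scores = [random.uniform(0, 20) for _ in range(6000)]
# Cubic term: the linear fit is a good local approximation but a poor global one.
outcomes = [0.4 + 0.01 * (r - C) + 0.0001 * (r - C) ** 3
            - (0.06 if r >= C else 0.0) + random.gauss(0, 0.05) for r in scores]
table = rdd_by_bandwidth(scores, outcomes, C, [1, 2, 4, 8])
for h, n, est in table:
    print(h, n, round(est, 3))
```

Reading down the printed table shows the trade-off described in the text: narrow bandwidths use few observations and give noisy estimates, while the widest bandwidth uses the most data but drifts away from the simulated effect because the linear specification no longer fits.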
Diagnostics are straightforward and lend credibility to RDD. A necessary argument for RDD is that the probability of treatment is a discontinuous function of R at C. An evaluator who stops at this necessary argument has failed to make a compelling case that he or she has identified the treatment effect. Responsible evaluators will offer their reviewers the opportunity to consider diagnostics before passing judgment on the validity of the estimated treatment effect.
Standard Errors and Efficiency
Recall from the discussion surrounding Equation 8 that the estimated treatment effect at R = C is
The simplicity disappears when using fuzzy RDD because the variance when estimating
The problem is estimating the covariance terms. With Stata, this can be done with the seemingly unrelated regression (sureg) command. Sometimes researchers use instrumental variable programs. Alternatively, it can be done mechanically by an evaluator with programming skills.
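To see why the covariance term matters, a first-order (delta method) approximation for the variance of a ratio is sketched here. It assumes the fuzzy RDD estimate is the ratio $\hat{\tau} = \hat{\theta}/\hat{\delta}$, where $\hat{\theta}$ is the estimated jump in the outcome at C and $\hat{\delta}$ is the estimated jump in the probability of treatment at C; these symbols are introduced for illustration and are not the notation of the omitted equations.

$$
\operatorname{Var}(\hat{\tau}) \approx \frac{1}{\delta^{2}}\operatorname{Var}(\hat{\theta}) + \frac{\theta^{2}}{\delta^{4}}\operatorname{Var}(\hat{\delta}) - \frac{2\theta}{\delta^{3}}\operatorname{Cov}(\hat{\theta}, \hat{\delta})
$$

The two variance terms come from separate regressions, but the covariance term requires that the outcome and selection equations be estimated jointly, which is what a seemingly unrelated regression routine provides.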
Although RDD identifies the treatment effect, the sampling variance can be high, necessitating the use of large samples. Some special cases are often considered in the evaluation literature (Schochet 2009; Bloom 2013; Bloom et al. 2005). A summary from this literature points toward a conclusion that when using an RDD, a large sample may have less power than might be supposed. As expected, the sampling variance for the estimated treatment effect varies inversely with the bandwidth. Assuming a uniform distribution of R about C leads to a sampling variance that is 4 times as large as it would have been if we had been able to randomly assign the same study subjects to the treatment or no treatment condition. Put another way, the RDD requires a sample size that is 4 times as large as that required by a random assignment design in order to have the same power. Assuming a normal distribution about C requires a sample size 2.7 times as large as that required by random assignment.
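The factors of 4 and roughly 2.7 can be reproduced with a short calculation. The Python sketch below assumes a sharp design, a correctly specified linear regression, and a cut point that splits the sample in half; under those assumptions the variance inflation relative to random assignment is 1/(1 − ρ²), where ρ is the correlation between the treatment dummy and the risk score. The function name is hypothetical.

```python
from math import pi


def design_effect(rho_squared):
    """Variance inflation of RDD relative to random assignment: 1 / (1 - rho^2)."""
    return 1.0 / (1.0 - rho_squared)


# Uniform scores on [-1, 1], cut at 0:
# Cov(T, R) = 1/4, Var(T) = 1/4, Var(R) = 1/3, so rho^2 = 3/4.
rho2_uniform = (1 / 4) ** 2 / ((1 / 4) * (1 / 3))

# Standard normal scores, cut at 0:
# Cov(T, R) = 1/sqrt(2*pi), Var(T) = 1/4, Var(R) = 1, so rho^2 = 2/pi.
rho2_normal = (1 / (2 * pi)) / (1 / 4)

print(round(design_effect(rho2_uniform), 2), round(design_effect(rho2_normal), 2))
# prints 4.0 2.75
```

The intuition is that the treatment dummy is strongly collinear with the risk score, so once the regression controls for the score, relatively little independent variation in treatment remains.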
Lee and Card (2006) show that discrete risk scores will inflate standard errors. Schochet (2009) argues that clustering will inflate standard errors; of course, this would also be true for randomized experiments. Covariates are not required for identification provided the continuity assumptions hold. However, the standard error for the estimated treatment effect is proportional to the square root of the residual variance in the estimated regression. Covariates may be useful for reducing standard errors but their use can raise questions about model specification for those covariates.
An Illustration
This section uses data from an evaluation of an ISP program to illustrate points made previously. It provides circumscribed data and less detail than would be expected of a report on evaluation findings because the intent is illustration, not dissemination of evaluation findings. More information about the full evaluation can be found in Jalbert et al. (2011). 8
In theory, ISP allows POs to provide enhanced control and correctional interventions to high-risk offenders who otherwise would receive inadequate supervision and support because of large caseloads. Most previous experiments with ISP were failures: Some programs failed to deliver increased interaction or treatment despite smaller caseloads; others increased supervision intensity that increased technical violations for behaviors that would not be criminal except for an offender’s status on probation (Petersilia and Turner 1993). However, there have been exceptions (Pearson 1990; Byrne and Kelly 1989; Paparozzi and Gendreau 2005) in agencies that use programming responsive to specific offender needs, suggesting ISP can be effective if it is employed in an agency using RNR supervision to allocate treatment and supervision resources using a validated risk/need assessment instrument.
The principal null hypothesis in this illustration is that criminal recidivism is the same for high-risk offenders receiving intensive supervision from low-caseload officers as it is for high-risk offenders supervised under normal (nonintensive) supervision from officers with regular caseloads in an agency using RNR. The alternative hypothesis is that criminal recidivism is lower for offenders supervised under intensive supervision. Based on the literature review, we use a one-tailed test of statistical significance to test this null hypothesis.
Data
Data come from Polk County, Iowa, a midsize county that lies within Iowa’s Fifth Judicial District. We limited our analysis to offenders supervised by officers in the Des Moines location. The Fifth Judicial District was an early adopter of RNR-style supervision and has substantially implemented components since 1997. In 2000, the agency began implementing standardized case planning. In 2002, it began implementing training in RNR-style practices that were fully implemented by 2004. Because the study period begins late in 2001 prior to full implementation, estimation may understate the full effectiveness of reduced caseloads in an RNR environment.
Although we do not present the evidence here, a longer report (Jalbert et al. 2011) shows that (1) POs who supervised ISP have smaller caseloads (about 30 offenders per PO) than POs who provide regular supervision 9 (about 50 offenders per PO), (2) both control and correctional interventions are more frequent for the ISP caseload, and (3) ISP lasts sufficiently long (about 1 year on average) so that the dose of treatment is meaningful. Polk County uses a risk score (R from above) based on the Iowa Risk Assessment tool to classify offenders and assign them to ISP.
Data comprised 8,878 probationers who entered supervision between the years 2001 and 2007. Eighteen percent were placed on regular supervision and 37% were placed on ISP (other probationers were assigned to administrative or low supervision status). All male ISP offenders take part in a special treatment protocol but female offenders are assigned to other programming; consequently, we limit the analysis to males. Offenders assigned to special caseloads, for example sex offenders and offenders with serious mental illness, were also excluded from this analysis, as were offenders who were assigned to jail diversion or similar programming.
Overrides are allowed and we discovered that overrides almost always occur for specific conditions: The offender is assaultive. The offense was very serious. The case plan indicated high needs.
We excluded offenders meeting these conditions from the analysis. We also excluded offenders when the Parole Board required intensive supervision and when the offender was not available for active supervision. We also excluded offenders who had risk scores of less than 18 because these offenders are routinely assigned to a lower level of supervision. Offenders had to be between the ages of 18 and 65. Because these classes of individuals were excluded from the analysis, we treat the treatment effect as generalizing to offenders who were included in the analysis, and even then, the generalization extends to those whose risk scores were at the margin.
Finally, we limited the analysis to offenders who had a valid secondary risk score (Level of Service Inventory–Revised [LSI-R]) that is not used for assigning offenders to intensive supervision and, surprisingly, is only modestly correlated with the Iowa Risk Score. We added this final limitation to demonstrate how the addition of covariates affected findings.
Estimation
RDD is a design for identifying a treatment effect. It is not an estimator. Most applications of RDD use least squares regression as the estimation procedure, but given our concern with criminal recidivism, we use semiparametric survival analysis (Cox Proportional Hazard models) to study time until criminal recidivism, subject to right censoring. Recidivism is equated to an arrest for a new offense; sensitivity testing will define a new offense variously. Censoring arises from one of three causes: data collection ends, the sentence ends, or there is a probation revocation for a technical violation of the conditions of supervision exclusive of revocations imposed because of an arrest for a new crime. The third form of censoring is sometimes known as a competing event. We assume that the competing event is independent of criminal recidivism. Rhodes (1986) provides some justification for this assumption, but if it does not hold, whatever bias it introduces into parameter estimates is likely to have the same effect to the left and right of C. 10 Diagnostic testing (not reported here) failed to reject the null hypothesis of proportional hazards but rejected the null that the survival distribution was Weibull (and hence possibly exponential).
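To make the censoring scheme concrete, the Python sketch below codes each offender as a duration plus an event indicator and computes a Kaplan–Meier survival curve. The Kaplan–Meier estimator is swapped in here as a simpler nonparametric illustration; it is not the Cox model used in the evaluation, and the function name and the example data are hypothetical. All three censoring causes (end of data collection, end of sentence, technical revocation) are coded as event = 0, which relies on the independence assumption stated above.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival)] at each distinct event time.

    events[i] is 1 for an arrest for a new offense and 0 for any censoring cause.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        arrests = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        if arrests:
            survival *= 1.0 - arrests / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # arrested and censored cases both leave the risk set
    return curve


# Hypothetical months under supervision: arrest, censored, arrest.
months = [1, 2, 3]
arrested = [1, 0, 1]
curve = kaplan_meier(months, arrested)
print(curve)
```

Comparing curves estimated separately for offenders just below and just above C gives a visual counterpart to the hazard-based comparison described in the text.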
Immediately below we will demonstrate that a sharp RDD is suitable for analyzing data from Iowa. We estimate two basic models. One includes R (the risk score) in the regression but no other covariates. The other model includes R and the LSI-R.
Diagnostics
There is some concern that the LSI-R has systematic missing patterns. However, when we regress a missing value indicator for the LSI-R on the risk score R and its square R², the regression has no significant explanatory power. We believe that the LSI-R can be treated as missing completely at random. Data imputation might be of some value but adding the additional complexity from imputing data would not add to the illustration.
According to Polk County’s classification policy, offenders with an Iowa Risk score of 21 or higher should be assigned to ISP (scores have been rescaled so all values are above zero). There are override criteria, but after excluding those overrides as discussed earlier, Figure 3 shows that the classification policy is broadly followed and that a sharp RDD is appropriate.

Figure 3. Probability of selection into intensive and regular supervision as a function of risk score using Iowa data.
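The check behind Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the record layout, and the toy records are hypothetical, and only the cutoff of 21 comes from the text.

```python
from collections import defaultdict

CUTOFF = 21  # Polk County policy threshold stated in the text


def isp_share_by_score(records):
    """Return {risk_score: share assigned to ISP} from (score, in_isp) pairs."""
    totals = defaultdict(int)
    treated = defaultdict(int)
    for score, in_isp in records:
        totals[score] += 1
        treated[score] += int(in_isp)
    return {s: treated[s] / totals[s] for s in sorted(totals)}


def is_sharp(shares, cutoff=CUTOFF):
    """True when the ISP share is 0 below the cutoff and 1 at or above it."""
    return all(
        share == (1.0 if score >= cutoff else 0.0)
        for score, share in shares.items()
    )


# Hypothetical toy records: (risk score, assigned to ISP?)
toy = [(19, False), (20, False), (20, False), (21, True), (22, True)]
shares = isp_share_by_score(toy)
sharp = is_sharp(shares)
```

With real data that still contain overrides, the shares would be close to but not exactly 0 or 1, so an exact test like `is_sharp` would fail and a tolerance (or a fuzzy design) would have to be considered.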
Next we determine whether the risk scores vary continuously around the critical threshold. In fact, there is a modest discontinuity in the distribution of risk scores between 20 and 21, but given the overall variation in risk scores, this appears to be attributable to the fact that the risk scores are discrete. That is, the distribution of risk scores appears to increase to a peak at 21 and to decline thereafter. There is no evidence of manipulating risk scores. See Figure 4. We conclude that RDD passes this diagnostic test.

Figure 4. Distribution of the risk scores using Iowa data.
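An informal version of the inspection behind Figure 4 might look like the sketch below. It only tabulates counts at each discrete score and compares the count at the cutoff with the count just below it; it is not a formal density test, and the function names and toy scores are hypothetical.

```python
from collections import Counter


def score_counts(scores):
    """Count offenders at each discrete risk score, ordered by score."""
    return dict(sorted(Counter(scores).items()))


def jump_at_cutoff(counts, cutoff=21):
    """Ratio of the count at the cutoff to the count just below it."""
    return counts[cutoff] / counts[cutoff - 1]


# Hypothetical toy scores that rise to a peak at 21 and decline afterward.
toy_scores = [18, 19, 19, 20, 20, 20, 21, 21, 21, 21, 22, 22, 23]
counts = score_counts(toy_scores)
ratio = jump_at_cutoff(counts)
```

A ratio far from the pattern implied by neighboring scores would be a reason to look more closely for manipulation; a ratio in line with a smooth rise and fall, as in the toy data, would not.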
Another diagnostic test is to verify that other likely predictors of criminal recidivism vary continuously around the critical value. Figure 5 plots four variables that are likely predictors of criminal recidivism: the LSI-R score, the number of prior arrests, age, and a history of arrests for violent crimes. According to Figure 5, these four variables vary continuously about the critical threshold, suggesting that the risk score is not being manipulated to shift offenders into different supervision categories. On the other hand, offenders with risk scores lower than 21 are systematically different from offenders with risk scores of 21 or higher, as would be expected given that the variables appearing in the graph are associated with recidivism. The assumption of the RDD is that the offenders are equivalent as the risk score approaches 21 from below and from above, and this appears to be true as far as we can tell given the lumpiness of the risk scores.

Figure 5. Trends in four variables associated with recidivism using Iowa data.
Lee and Lemieux (2009) recommend another simple diagnostic: Form bins based on the risk scores, compute the average outcome for each bin, and then plot the average outcomes against the midpoints of the bin scores. The important observation is how the average outcomes vary just to the left and to the right of the critical value C. (The first bin on the right is inclusive of the critical value.) If treatment is effective, one would expect the outcomes to improve sharply at the threshold. The graph also provides insight into the structural relationship between the risk score and the outcome.
Lee and Lemieux provide guidance for selecting bin widths, but for the current application, this is straightforward because the risk scores are discrete. Figure 6 identifies the risk scores on the horizontal axis. The average outcomes are not so easily computed because of right censoring. Our approach was to define recidivism as an arrest for a new offense within 6 months, within 1 year, within 18 months, and within 2 years. We used a Cox hazard estimator that controls for the LSI-R score, using results to estimate the rate of recidivism within each of those four periods. The estimates are reported in the figure, which also provides a linear projection of the recidivism rates based on an ordinary least squares regression.

Figure 6. Local linear regressions of outcomes (any arrest) and risk score based on a Cox hazard model using Iowa data.
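As a rough, covariate-free stand-in for the Cox-based rates plotted in Figure 6, the sketch below computes a Kaplan-Meier estimate of the probability of arrest by a given horizon within each risk score. This is a swapped-in technique, not the authors' method: it does not adjust for the LSI-R, and the function names and toy records are hypothetical.

```python
from collections import defaultdict


def km_failure_prob(times, events, horizon):
    """Kaplan-Meier estimate of P(arrest by horizon) under right censoring.

    times: follow-up time per offender; events: True if an arrest was
    observed at that time, False if the offender was censored.
    """
    surv = 1.0
    event_times = sorted({t for t, e in zip(times, events) if e and t <= horizon})
    for t in event_times:
        at_risk = sum(1 for u in times if u >= t)
        arrests = sum(1 for u, e in zip(times, events) if e and u == t)
        surv *= 1.0 - arrests / at_risk
    return 1.0 - surv


def failure_by_score(rows, horizon):
    """Apply the estimator separately to each discrete risk score."""
    groups = defaultdict(lambda: ([], []))
    for score, t, e in rows:
        groups[score][0].append(t)
        groups[score][1].append(e)
    return {s: km_failure_prob(ts, es, horizon) for s, (ts, es) in sorted(groups.items())}


# Hypothetical toy rows: (risk score, months observed, arrest observed?)
toy_rows = [
    (20, 3, True), (20, 8, False), (20, 12, True), (20, 12, False),
    (21, 2, True), (21, 5, True), (21, 6, False),
]
rates_6m = failure_by_score(toy_rows, horizon=6)
rates_12m = failure_by_score(toy_rows, horizon=12)
```

Plotting such per-score rates against the risk score, for each follow-up horizon, gives a picture of the kind Lee and Lemieux recommend, with the discrete scores serving as bins.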
To facilitate interpretation, we have drawn a linear regression through the estimates to the right of the critical risk score of 21 and separately to the left of 21. The line to the left of 21 includes a projection for a risk score of 21. The interpretation seems straightforward. First, projecting the regressions based on the regular supervision data to the risk score value of 21, there is a sharp break in the average outcome at the value of 21. Second, to the left of the critical threshold, criminal recidivism is an increasing function of the risk score; to the right, there is no strong relationship between the risk score and criminal recidivism. Most importantly, there is no compelling reason to believe that the relationship between the outcome and the risk score is anything but linear. However, there is considerable noise in these estimates and apparently some sensitivity to the length of the follow-up period.
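The two-line construction just described can be sketched in a few lines. The sketch fits separate ordinary least squares lines to the per-score rates on each side of 21 and compares them at the cutoff, where the left-hand value is a projection. The function names and the toy rates are hypothetical and are not the Iowa estimates.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def gap_at_cutoff(rates, cutoff=21):
    """Right-side fit minus left-side projection, both evaluated at the cutoff."""
    left = [(s, r) for s, r in rates.items() if s < cutoff]
    right = [(s, r) for s, r in rates.items() if s >= cutoff]
    a_l, b_l = fit_line(*zip(*left))
    a_r, b_r = fit_line(*zip(*right))
    return (a_r + b_r * cutoff) - (a_l + b_l * cutoff)


# Hypothetical recidivism rates by risk score (rising below 21, flat above).
toy_rates = {18: 0.40, 19: 0.45, 20: 0.50, 21: 0.35, 22: 0.35, 23: 0.35}
gap = gap_at_cutoff(toy_rates)
```

In the toy data the left-hand line projects to 0.55 at a score of 21 while the right-hand line sits at 0.35, so the estimated break is about -0.20. A negative gap corresponds to lower recidivism under intensive supervision at the margin.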
Our analysis used three definitions of recidivism: an arrest for any new crime; an arrest for a property, drug, or violent crime; or an arrest for a property or violent crime. Figure 6 only shows plots for recidivism for any new crime, but plots for recidivism defined otherwise show comparable breaks in the outcome measure. Hereafter, we will only consider recidivism within a 6-month period and a 2-year period. Most offenders complete supervision at 1 or 2 years.
The earlier discussion of RDD methodology ignored issues that arise when the risk score is discrete. We postponed that discussion until Figure 6 could serve as an illustration, because the issues would have had little meaning had they been introduced earlier. Lee and Card (2006) give an extended treatment; this article summarizes. When the risk score is discrete, application of RDD requires projections outside the support for the regression, as the two figures show clearly. The lines in Figure 6 assume a linear relationship between the probability of recidivism and the risk score, but if that assumption is wrong, the projection of the left-hand-side regression to the risk score of 21 would be suspect. Additionally, Lee and Card argue that estimation should use cluster-consistent standard errors, where the cluster is determined by the value of the risk score. The analysis reported later adjusts standard errors for clustering.
Recidivism, Revocation, and the Effects of ISP
Offenders supervised under ISP and regular caseloads have high rates of recidivism; over two thirds are arrested for some new charge during or after supervision; nearly half are arrested within 6 months of the start of their supervision period. Most new charges are for public order offenses (65%), including traffic violations, that are often not punishable by lengthy incarcerations or indeed by any criminal sanction. The other third are for more serious matters: drug-law violations (8%), property crimes (12%), and violent crimes (15%). The majority (71%) of offenders with a new arrest during their supervision period also have their probation revoked, although the revocation does not always immediately follow the new arrest, and sometimes it is for a technical violation of the conditions of supervision. See Note 9 for a discussion of this issue.
We defined recidivism as an arrest for a new charge during or after the probation supervision period. Because the effectiveness of ISP may differ depending on the nature of the crime, we alternatively defined a new arrest as an arrest for (1) a public order, drug-law, property, or violent crime; (2) a drug-law, property, or violent crime; or (3) a property or violent crime.
We estimated treatment effects using several bandwidths: (1) 20 to 21; (2) 19 to 22; (3) 18 to 23; and (4) 18 to 24 and higher.
The lower bound of the bandwidth is never below 18. This lower limit is to prevent confusing offenders supervised under regular supervision with offenders supervised under lower levels of supervision. We put different constraints on the follow-up period for the survival analysis: (1) 6 months and (2) 2 years.
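Selecting the analysis sample for each bandwidth can be sketched as below. The bandwidth limits are those listed above; the function names and the toy scores are hypothetical.

```python
# Bandwidths from the text; None means no upper limit ("18 to 24 and higher").
BANDWIDTHS = {
    "20 to 21": (20, 21),
    "19 to 22": (19, 22),
    "18 to 23": (18, 23),
    "18 to 24 and higher": (18, None),
}


def in_bandwidth(score, low, high):
    """True when the score lies inside the (inclusive) bandwidth."""
    return score >= low and (high is None or score <= high)


def sample_sizes(scores):
    """Number of cases entering the analysis under each bandwidth."""
    return {
        label: sum(in_bandwidth(s, low, high) for s in scores)
        for label, (low, high) in BANDWIDTHS.items()
    }


# Hypothetical toy scores; the score of 17 falls outside every bandwidth.
toy_scores = [17, 18, 19, 20, 20, 21, 21, 22, 23, 24, 27]
sizes = sample_sizes(toy_scores)
```

Wider bandwidths admit more cases, which tends to reduce sampling variance, but they also lean more heavily on the assumed functional form away from the threshold.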
One reason for examining follow-up periods of varying length is that all offenders are likely under supervision for 6 months but some may have moved off active supervision (or off any supervision) before the end of 2 years. Treatment effects are reported as relative hazards. For example, a relative hazard of 0.75 implies that the hazard is reduced by 0.25, or 25%.
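For intuition, a sharp-RDD Cox specification consistent with this description might be written as follows. The notation (h_0, D_i, beta, gamma) is introduced here for illustration only; the text does not spell out the exact model, and the version with the LSI-R would add further terms.

```latex
\begin{align*}
h_i(t) &= h_0(t)\,\exp\bigl(\beta D_i + \gamma R_i\bigr),
  \qquad D_i = \mathbf{1}\{R_i \ge C\} \\[4pt]
\text{relative hazard} &= \frac{h_i(t \mid D_i = 1, R_i)}{h_i(t \mid D_i = 0, R_i)} = e^{\beta} \\[4pt]
e^{\beta} = 0.75 &\;\Longrightarrow\; \text{hazard reduced by } 1 - 0.75 = 0.25 \text{ (25\%)}
\end{align*}
```

Under this sketch the baseline hazard h_0(t) cancels in the ratio, which is why the relative hazard can be reported without specifying the shape of the survival distribution.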
Table 1 reports the number of cases entering the analysis. We only used cases that had a valid LSI-R score, regardless of whether the LSI-R entered the regression as a covariate, because our intention is to compare results with and without using a covariate. Given that the minimum risk score is 18, the effective sample size does not grow much beyond the bandwidth of 18–24.
Table 1. Number of Cases Entering the Analysis as a Function of Bandwidth.
Note. ISP = intensive supervision probation.
Table 2 reports the estimated relative hazard, the standard error for the estimated relative hazard, and a one-sided probability value for a test of the null hypothesis that the relative hazard equals 1. A relative hazard of less than 1 implies that intensive supervision decreased recidivism and a relative hazard of greater than 1 implies that intensive supervision increased recidivism. Estimation was performed using a Cox proportional hazard model with shared frailty to account for clustering by risk scores. When modeling used the LSI-R as a covariate, the model used the LSI-R and its squared value.
Table 2. Estimated Relative Hazards for a Cox Model With Shared Frailty: 2-Year Follow-Up.
Note. LSR = Level of Service Inventory–Revised.
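One way a one-sided probability value of the kind reported in Table 2 could be obtained from a relative hazard and its standard error is sketched below. The normal approximation on the relative-hazard scale is an assumption made here for illustration (software often tests on the log-hazard scale instead), and the input numbers are hypothetical, not values from the table.

```python
from statistics import NormalDist


def one_sided_p(relative_hazard, std_error):
    """P-value for H0: relative hazard = 1 against H1: relative hazard < 1.

    Assumes the estimate is approximately normal on the relative-hazard scale.
    """
    z = (relative_hazard - 1.0) / std_error
    return NormalDist().cdf(z)


# Hypothetical inputs: relative hazard 0.75 with standard error 0.125 (z = -2).
p_example = one_sided_p(0.75, 0.125)
p_null = one_sided_p(1.0, 0.125)
```

With these hypothetical inputs the estimate lies two standard errors below 1, giving a one-sided probability value of roughly 0.023; an estimate exactly equal to 1 gives 0.5.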
Evidence appears strong that intensive supervision has reduced recidivism when the outcome measure is defined as recidivism for any crime. The evidence is weaker when recidivism is defined otherwise, as the statistical significance is sensitive to the bandwidth.
One objection to using a 2-year follow-up period is that supervision ends after 1 year for many offenders. Additionally, the first 6 months of supervision are often seen as the period when offenders are at elevated risk (National Research Council 2008). Table 3 is the same as Table 2 except that Table 3 reports the estimated relative hazard when the follow-up period is limited to 6 months.
Table 3. Estimated Relative Hazard for the Cox Model With Shared Frailty: 6-Month Follow-Up.
Note. LSR = Level of Service Inventory–Revised.
The evidence is stronger that intensive supervision reduces recidivism, however defined, during the first 6 months of supervision. In contrast to results reported in Table 2, the standard errors in Table 3 tend to get smaller as the bandwidth expands. They also tend to get smaller as covariates are included in the Cox regression.
Inspecting the standard errors, our interpretation is that the analysis is only sufficiently powered to find what we perceive to be very large treatment effects, despite the fact that the samples are large. One might conclude that we were “lucky” to find significant treatment effects. Readers considering an RDD approach for other jurisdictions are cautioned that Polk County barely provided an adequate sample size for performing the analysis. Using RDD in even smaller jurisdictions may be ill advised. Offsetting this warning, given that intensive supervision is expensive, only a large treatment effect would justify implementing or continuing such a program.
Note that in this application the use of a covariate in the form of the LSI-R risk score made little difference. Because a covariate is unnecessary to identify the treatment effect, there is little surprise that including and excluding the covariate made little difference to the relative hazards. More interesting, perhaps, is that the use of a covariate had little or no effect on the estimated standard errors.
As noted, RDD is an identification strategy. We selected a survival model as the estimation strategy because the data were censored; survival models are a common way of estimating recidivism models. A further sensitivity test comes from treating the outcome as dichotomous: recidivism or no recidivism. There are costs to using a linear model with a dichotomous dependent variable. Perhaps the most important is that the regression cannot account for censoring resulting from variable-length follow-up periods. Because of this limitation, we restricted the follow-up period to 6 months, and we added a covariate to control for the possible time at risk. This covariate was the smaller of (1) 6 months and (2) the last date observed minus the date entered supervision. Another limitation is that the linear model is heteroscedastic, and we dealt with this by using feasible generalized least squares (see Note 11). Finally, the data are clustered, and we used a cluster-robust variance estimator. Table 4 reports results.
Table 4. Estimated Treatment Effects Using a Linear Probability Model With Robust Standard Errors: 6-Month Follow-Up.
Note. LSR = Level of Service Inventory–Revised.
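The time-at-risk covariate used in the linear probability model can be sketched as follows. Approximating 6 months as 183 days is an assumption made here for illustration, and the function name and example dates are hypothetical.

```python
from datetime import date

SIX_MONTHS_DAYS = 183  # assumption: 6 months approximated as 183 days


def time_at_risk(entered, last_observed, cap_days=SIX_MONTHS_DAYS):
    """Days at risk: the smaller of the cap and the observed follow-up."""
    return min(cap_days, (last_observed - entered).days)


# Hypothetical offenders: one observed for two months, one for a full year.
short_follow_up = time_at_risk(date(2007, 1, 1), date(2007, 3, 1))
long_follow_up = time_at_risk(date(2007, 1, 1), date(2008, 1, 1))
```

Offenders observed for the full window all receive the same capped value, so the covariate varies only for those whose follow-up was cut short.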
Generally, the findings based on Table 4 are consistent with the findings based on Table 3. Evidence is strongest that intensive supervision reduced recidivism when recidivism is defined as an arrest for any crime. The evidence is weaker when recidivism is defined as an arrest for a limited range of crimes because the estimates are sensitive to the bandwidth.
Based on the literature review, we are concerned that ISP may increase the rate of revocations for technical violations. Although we do not report details here, summarizing the results is straightforward: There is no strong evidence that ISP increased the hazard rate for revocations for technical violations. Still, conclusions require caution. The minimum detectable effects are large. There is not much power to detect moderate effects on increasing the rate of revocations for technical violations.
Discussion
The National Institute of Justice is a frequent sponsor of criminal justice program evaluations. Many of its solicitations advise applicants that “… funding priority will be given to experimental research designs that use random selection and assignment of participants to experimental and control conditions. When random designs are not feasible, priority will be given to quasi-experimental designs that include Regression Discontinuity Design to address selection bias in evaluating outcomes and impacts.” If NIJ is a barometer, RDD is becoming mainstream as a preferred evaluation tool. Indeed, some researchers assert that RDD is almost as good as random assignment. While the authors believe in the utility of RDD, a critical review ends this introduction.
Inference based on random assignment designs requires no strong assumptions beyond the integrity of the random assignment. Inference based on RDD requires some stronger assumptions. Granted, in contrast to other research designs, the assumptions supporting use of RDD are partly testable. However, while an evaluator can reject a null that the assumptions hold, and therefore reject use of RDD, the alternative is to accept the null and maintain the assumptions. This does not mean that the assumptions are correct—merely that they are consistent with the evidence and not rejected. While adequate data strengthen conclusions about assumptions, the need to make some assumptions means that inferences based on RDD are not on the same plane as inferences based on random assignment.
Unless the treatment effect is homogeneous, RDD and a randomized controlled trial (RCT) estimate different things. RDD estimates the treatment effect at the margin as that term was defined earlier. In contrast, an RCT estimates the average treatment effect over study subjects within a range equivalent to the bandwidth. If Polk County wanted to know if the selection rule for ISP assured that offenders at the margin benefited, then RDD provides an answer. However, it is possible that intensive supervision is relatively ineffective at the margin where R = C but very effective for higher risk offenders, or of course, intensive supervision may be very effective at the margin but comparatively ineffective at higher risk scores. (Figure 6 informally implies that intensive supervision is increasingly effective as the risk score increases.) The authors’ opinion is that jurisdictions should routinely change the threshold values and test whether higher or lower thresholds might be adopted.
The marginal treatment effect may not be the policy question, however. Program interventions sometimes set selection rules so that the benefit from treatment is modest at the margin but increases with R. For example, the federally funded school lunch program may have little benefit for children whose family income is at the threshold for eligibility, but it may have great benefit for children whose family income is much lower than the threshold. In this case, RDD might not answer the most interesting policy question: Is the school lunch program cost effective? Still, the value of RDD depends on the context. School lunch programs are unlikely to be eliminated based on any program evaluation. They are likely to be expanded or contracted, and making a decision about marginal program adjustments requires an estimate of marginal program effects.
Clearly, RDD and random assignment experiments are not the same with respect to statistical power. The treatment effect estimated by a random assignment experiment has a smaller sampling variance than that of the treatment effect estimated by an RDD given equivalent sample sizes.
There is a risk of placing too much weight on the advantages of an RCT and the disadvantages of an RDD, so some perspective may be useful. First, an RCT can be difficult to implement in a community supervision setting. Polk County was one of three sites studied by the authors. In one of our sites, we randomly assigned POs to regular supervision or low-caseload intensive supervision, and we randomly assigned probationers to the regular supervision officers or the intensive supervision officers. The low-caseload officers were content with their assignments and stayed in their supervision status, while the higher caseload officers accepted opportunities to move into other assignments. The experiment collapsed. An advantage of RDD is that it does not disrupt ongoing community corrections activities, and the treatment effect can be estimated with retrospective data, while an RCT requires estimation with prospective data. Community supervision agencies may be unwilling or unable to wait for the outcome from an RCT.
Second, RCT has greater power than RDD for the same sample size. One should not place too much weight on this apparent advantage, however. RDD typically uses institutional administrative data, so that the sample available for RDD may be much larger than the sample available for a randomized experiment. Furthermore, RDD can be repeated over time at little additional cost. In practice, RDD may provide estimates that are much more precise than those from random assignment experiments, at lower cost to the evaluator or agency.
Third, an RCT estimates the average treatment effect and by itself says nothing about the distribution of the treatment effect, so an RCT does not lead to an inference about treatment at the margin. Yet a community corrections agency employing RNR supervision has to set the threshold value, so the treatment effect at the margin is an important consideration. It is possible to partition the random sample by risk scores either after or before random assignment and estimate the treatment effect within each partition. This presumes that an RCT is adequately powered to identify average treatment effects within partitions, and in practice RCT designs are seldom powered to estimate treatment effects for subgroups.
RDD is unquestionably valuable as an evaluation tool and has only begun to be used in criminal justice program evaluation. Developments in evaluation techniques are occurring rapidly and our understanding of RDD will only grow. As more evaluators become comfortable using RDD, the cutting-edge diagnostics and estimation routines that this article has only mentioned may become conventional. This article is meant as a starting point for evaluators interested in applying the design, not the final word.
Footnotes
Acknowledgment
The authors wish to thank Abt Associates’ Journal Author Support Group for many helpful suggestions. The authors also wish to thank the Iowa Fifth Judicial District and Iowa Department of Corrections for their participation in the original evaluation.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The data used for this article were collected under Grant #2006-IJ-CX-0011, awarded to Abt Associates by the Office of Justice Programs, National Institute of Justice.
