Abstract
Project Greenlight (GL) was an innovative and intensive prison-based intervention delivered to inmates during the 8-week period immediately prior to their release from prison. Using a relatively rigorous research design, evaluators reported significant negative effects associated with the treatment at one year after release. We reassess the GL intervention over a longer follow-up period and focus on differences by risk level. Findings indicate that although the bulk of the negative effects dissipate compared with one control group, significant negative effects remain when compared with a second. More importantly, however, we find that program effects are differentially distributed by risk level in a counterintuitive direction.
Introduction
Over the past several decades, a substantial and growing body of empirical evidence has emerged that substantiates the efficacy of correctional interventions (see, for example, Cullen, 2005; Cullen & Gendreau, 2000; Lipsey & Cullen, 2007; Lipsey, Landenberger, & Wilson, 2007; MacKenzie, 2006; Marlowe, 2011). The evidence also shows that such interventions not only reduce recidivism but, when well implemented, can produce substantial cost savings as well (e.g., Aos, Miller, & Drake, 2006; Farrington, Petrosino, & Welsh, 2001; Welsh, Farrington, & Sherman, 2001). One of the major vulnerabilities in this literature, however, is the great variability of outcomes often seen in evaluation research (Lipsey & Cullen, 2007). This variability indicates that the same or similar interventions associated with reductions in recidivism in one instance may have no effect on recidivism outcomes in a second instance. In some cases, these same interventions may even be associated with increases in criminal behavior, what McCord (2003) has so eloquently labeled “Cures that Harm” (also see Barnoski, 2004; Lilienfeld, 2007; Lowenkamp & Latessa, 2005; Wilson & Davis, 2006). It is especially important, then, to define not only the parameters of effective correctional interventions but also the conditions under which these same interventions are ineffective or even harmful.
Our understanding of what makes for effective correctional interventions has increased considerably. Explanations for why the same or similar programs fail to work, or in some cases produce negative outcomes, are often less developed. The most commonly cited rationales generally include poor program implementation or delivering inappropriate programming to the wrong population (e.g., Lowenkamp, Latessa, & Smith, 2006; Rhine, Mawhorr, & Parks, 2006). And because many evaluations fail to open the “black box,” it is often unclear why theoretically sound programs expected to reduce recidivism produce negative outcomes. Program evaluations are typically designed to detect and understand positive program effects, and when negative outcomes are the result, evaluators must typically “sift the ashes” to try to understand the outcome. In this longer term and more detailed follow-up of an innovative “reentry” program (Project Greenlight [GL]) in which negative program effects were observed, we contribute to the literature on correctional programming with a specific focus on offender risk.
Although the literature recommends the most intensive programming for higher risk offenders, we find that under some circumstances, more “intensive” programs may result in negative outcomes for individuals classified as moderate and high risk. We consider the possibility that this outcome may be related to notions of individual learning styles and responsivity to specific forms of correctional programming that is likely tied to offender risk level. Our analysis also raises questions about the wisdom of prison transfers that occur immediately before release from prison.
A Brief Review
The “What Works” literature in correctional programming is well grounded theoretically and includes numerous empirical assessments of programs and program types that work, those that do not, and those that still require more explanation (e.g., Andrews & Bonta, 2006, 2010; Andrews et al., 1990; MacKenzie, 2006; Office of Justice Programs, 2011; Sherman et al., 1997). Effective programs tend to have solid theoretical foundations that explain how the intervention is expected to influence behavior and outcomes. Evidence supports the notion that effective programs are firmly rooted in cognitive-behavioral or behavioral foundations based on social learning principles and that target offender characteristics that are changeable. The theoretical rationales of such cognitive programs can be expressed fairly succinctly: Individuals who are undersocialized or poorly socialized exhibit a general deficit in “prosocial” cognitions, behaviors, attitudes and/or beliefs, and are therefore more prone to be involved in criminal behavior. Because such prosocial attributes are learned, then it follows that the knowledge, thinking and behavior associated with poor socialization can be altered—that is, prosocial skills can be taught.
In terms of correctional interventions specifically, Andrews (2006, p. 596; also see Andrews & Bonta, 2006; Andrews, Bonta, & Wormith, 2011) refers to the RNR (risk-needs-responsivity) model in which the central principles are to “treat moderate and higher risk cases, target criminogenic needs, and use powerful cognitive social learning influence strategies.” Offender habilitation can be most effectively achieved when interventions target the factors directly implicated in criminal behavior that are the most amenable to change. The “dynamic” risk factors most often targeted include antisocial attitudes, impulsive behavior, poor social skills, peer associations, and substance use. Focusing treatment on moderate and higher risk individuals makes both theoretical and practical sense. As low-risk offenders are by definition less likely to reoffend than other individuals, it is more cost effective to devote limited resources to groups at a higher risk of reoffending. The general notion is that devoting significant resources to lower-risk offenders is unlikely to significantly reduce their already low propensity for engaging in criminal conduct. Likewise, moderate- and higher-risk individuals are characterized by attributes that are most predictive of recidivism and are most amenable to change and are therefore likely to show the greatest reductions in recidivism when targeted by intensive interventions.
Project GL was an intensive prison-based program that used a strong cognitive behavioral foundation. There were high hopes for GL among its various stakeholders; it was considered comprehensive with a short delivery period, it was developed in conjunction with the New York State Department of Correctional Services (DOCS) and the Division of Parole, and it was designed to be easily turned over operationally after a pilot period. Program staff, developers, collaborators, and participants all had exceedingly positive views of the program (Wilson & Davis, 2006). Despite initial hopes, findings were generally disappointing, and at the extreme, pointed to potentially harmful consequences associated with the intervention.
Project Greenlight Overview
The GL intervention operated in the Queensboro Correctional Facility in Long Island City, New York from February, 2002 until February, 2003. 1 GL was designed to be an intensive, prison-based “reentry” program delivered in 8 weeks immediately prior to release. In reality, however, the program is better referred to as a “prerelease” program for reasons we discuss shortly. The foundation of the GL intervention was a restructured version of the cognitive-behavioral Reasoning and Rehabilitation (R&R) program, a program with substantial empirical support in reducing offender recidivism (e.g., Tong & Farrington, 2006). The R&R program was restructured to be delivered over 8 weeks rather than the suggested 4-to-6-month period, and classes held 26 individuals rather than the generally recommended 8 to 12 (Porporino & Fabiano, 2000). The general appeal of such changes is self-evident: Potential reductions in recidivism might be achieved at a substantially reduced cost.
A number of other program elements supplemented GL’s cognitive-behavioral foundation: life skills education, employment assistance and job readiness training, housing services, drug relapse prevention and drug education, linking inmates to community-based services, and working with parole officers to review parole conditions and responsibilities before release. An already intensive program had therefore compressed its delivery into a much shorter period of time, delivered content to larger classes, and added a number of other program elements as part of the total intervention.
Some of the limitations associated with the GL intervention may amount to true design flaws. Although the program incorporated elements that either had anecdotal or empirical support in the research literature, it often failed to adhere to the underlying principles of why we think certain programs might work. First, risk assessment and need for programming played no role in program assignment (Andrews, 2006; Wilson & Davis, 2006). The program was essentially a “one-size-fits-all” design such that every participant received virtually all components of the program—in the most obvious example of a misdirected effort, Wilson and Davis (2006) indicated that individuals with no history of drug use were required to attend drug education classes. Also, some of the program elements, such as the family counseling and reintegration sessions, despite their intuitive appeal, have little, if any empirical basis in reducing recidivism (Marlowe, 2006). 2 Another potential issue is that although the program aimed to connect participants to community-based service providers, it did not provide any community aftercare services after release from prison. Such services have solid empirical support in terms of assisting in successful transition from prison to community (Petersilia, 2003). Finally, it is also important to consider whether the restructured program may have compromised key elements of the program’s foundation that make it effective. In this case, although the restructured program may have been more intensive in delivering a wider variety of services in a shorter period of time, it is not necessarily the case that “more intensive” equates to “appropriate.” Effective interventions deliver clinically appropriate programming to populations defined as needing those programs.
In terms of outcomes, Wilson and Davis (2006) reported that for nearly all interim outcomes (e.g., employment, housing stability, drug use, meeting parole requirements), there was virtually no difference between the GL group and the main comparison group. Recidivism results, however, were a different matter. At 12 months after release, the GL participants were significantly more likely to be arrested than members of the control groups (see Wilson & Davis, 2006, table 3). GL participants also exhibited lower cumulative survival probabilities and lower mean survival times for both parole revocations and felony arrests. In sum, the data suggested that the GL intervention not only failed to improve either interim or recidivism outcomes but was instead associated with negative program effects. Controlling for individual-level characteristics in Cox regressions did not alter the relationships. Wilson and Davis (2006) reviewed a number of possible explanations for the negative findings, with the most plausible having to do with poor program design and poor program implementation.
In this analysis, we reexamine the GL program over a longer postrelease period (a minimum of 30 months) and reanalyze the data by the risk level of the study participants. Although negative program effects were shown in the original analysis, it is reasonable to think that the effects might be differentially distributed across risk levels of the study population, as the literature suggests that intensive programming should have the most success with higher-risk offenders. Contrary to what one might predict from the evidence-based literature on effective correctional interventions, however, we find that moderate- and higher-risk participants fare much worse from their association with the program, whereas low-risk participants exhibit small benefits.
Data and Method
The data and research design for the GL evaluation have been extensively described elsewhere, and, for that reason, we highlight only the most important elements of the evaluation, as well as any factors that differentiate this analysis from prior work (Wilson & Davis, 2006; Wilson et al., 2005). The most important differences in this analysis are that we focus our analysis on arrests (misdemeanor and felony), the 1-year postrelease data of the original analysis are bolstered by additional data on arrest outcomes for a minimum of 30 months after release, and our analyses take into account the risk level of the participants. 3
All individuals incarcerated in a New York State prison who met the criteria for study inclusion between February, 2002 and February, 2003 are included in this analysis with the participants divided into three distinct study groups. The GL participants (N = 345) were transferred from upstate prisons to Queensboro, where the pilot program operated. The Transitional Services Programming (TSP) participants (N = 278) were also transferred from upstate prisons to Queensboro but received DOCS standard TSP. 4 Finally, all individuals who met the criteria for study inclusion but were not transferred to Queensboro due to space limitations at the facility constitute the Upstate (UPS; N = 113) group.
TSP constitutes the primary control group as both TSP and GL were transferred and assigned to the same facility where the pilot program was operational. Although GL participants were not totally isolated from the rest of the institutional population, they were housed in a separate wing and did not encounter other inmates as part of the programming. GL participants were in the mandatory segments of the programming for the entire 8 weeks. Although participants were not required to take part in the research, those who refused to participate in mandated programming were subject to sanctions per DOCS policy, including loss of good-time release. Only one individual refused to participate in the program.
The UPS group varies from the GL and TSP groups in two important respects. First, the UPS group received no prerelease programming. To the degree that such programming serves an habilitative function, we might expect that UPS would fare worse than our other two groups. However, the UPS group did not experience an institutional transfer and coerced program participation immediately before release from prison. To the degree that inmates form social bonds and networks, are embedded within a specific community and a stable institutional life, and have some semblance of control over their lives, an involuntary transfer to another facility with coerced programming may be disruptive and counterproductive and might lead us to expect the UPS group to perform better. Although there has been little work done on the impact of involuntary institutional transfers and what it means for individuals, there is a diverse literature that suggests that situations and events that create stress, especially those that create a sense of powerlessness, such as involuntary moves, can negatively affect a host of life outcomes, including recidivism (Agnew, 2001; Mazerolle & Maahs, 2000; Zamble & Quinsey, 1997).
As indicated in Wilson et al. (2005; also see Wilson & Davis, 2006), the GL evaluation used a fairly rigorous quasi-experimental design during the first 5 months of operation—the sequential “haphazard assignment” procedure that was in place between February and June, 2002 is among the stronger quasi-experimental designs (Shadish, Cook, & Campbell, 2002). In July, 2002, as the program increased to its full capacity of 104 participants, a trickle-process random assignment procedure was implemented. This random assignment process remained in effect through the termination of the program (7 months) with few deviations from the assignment protocols. In short, the assignment of individuals to GL and TSP was accomplished through a relatively strong research design. Assignment to the UPS group was by default—individuals who were identified as meeting the criteria, but who were not transferred in time to participate in the program due to DOCS transportation schedules, were summarily allocated to this group. Individuals were transferred to Queensboro at least 8 weeks before release to participate in the full program (in other words, every individual assigned to GL received the full 8-week program). If they were not transferred before they had less than 8 weeks remaining to release, they automatically became part of the UPS study group.
Criminal history and arrest data come from New York State’s Department of Criminal Justice Services’ (DCJS) criminal history database and are supplemented by information obtained from New York State DOCS and Parole. We first analyze the data using survival and Kaplan-Meier (KM) analyses. Although the maximum potential time at risk in the community is 42 months, all participants have a minimum of close to 30 months at risk. As the number of participants declines steeply after the 30-month period, we censor all individuals at 30 months for these analyses, and all subsequent discussions are based on a time at risk of 30 months. Given that our sample sizes are sometimes relatively small, especially when disaggregated by risk levels, tests of statistical significance may fail to capture meaningful relationships as well as the magnitude of those relationships; for these reasons we compute and report effect sizes (ES; Cohen’s d) to complement our other analyses. 5
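As a point of reference for the effect size named above, the following is a minimal sketch of the textbook pooled-standard-deviation form of Cohen's d. The function name and the toy values are illustrative assumptions, and the exact computation behind the effect sizes reported in this article may differ.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical months-to-arrest values, not data from the study.
toy_a = [2, 4, 6]
toy_b = [1, 3, 5]
print(cohens_d(toy_a, toy_b))
```

A positive value means the first group has the larger mean; by Cohen's conventional benchmarks, values near 0.2, 0.5, and 0.8 are often read as small, medium, and large.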
In addition, the prior analysis by Wilson and Davis (2006, table 1, p. 315) suggests that differences in the attributes of the study groups are largely trivial, and for that reason we do not present a set of descriptive statistics here. Nevertheless, we conduct several multivariate analyses to examine the impact of our control variables on longer term outcomes because simple descriptive statistics do not provide a view of how the controls might be interrelated. As we standardize time at risk by censoring all cases at 30 months, and because both rates of failure and time to failure may have different predictors and meanings, we conduct both multivariate logistic and Cox regression to better understand the outcomes.
Table 1. Mean Survival Times for Any Arrest With Confidence Intervals and Test of Equality of Survival Distributions (Kaplan-Meier) by Program Group—30 Months After Release; GL-TSP-UPS
Note: All comparisons of statistical significance are with the GL group. GL = Greenlight; TSP = Transitional Services Programming; UPS = Upstate.
*p = .10. **p = .05. ***p = .01. ****p = .001.
The risk instrument is developed from data on individual attributes that have been associated with criminal recidivism. These include criminal history measures such as types of offenses, including numbers of arrests and convictions for misdemeanor and felony offenses, bench warrants, and indicators for drugs, weapons, and firearm offenses (e.g., Andrews et al., 1990; Andrews & Bonta, 2006; Gendreau, Little, & Goggin, 1996). Basic demographic data such as age, race/ethnicity, and educational level, as well as some information on prior substance use are also contained in the data files. In estimating the various models for constructing a risk scale, we were not only cognizant of the literature on the predictors of recidivism but also considered the variables available to us as well as their potential meaning for respondent outcomes.
Following Gottfredson and Snyder (2005), we use logistic regression to obtain unstandardized coefficients for variables that are predictive of new arrests (data available from authors). Variables that were statistically predictive of new arrests included prior revocations, prior felony arrests, bench warrant indicators, substance use indicator, release age and borough of release. Educational level and race/ethnicity were not included as they were not predictive in any of the models tested and because the use of race/ethnicity variables in such scales raises ethical concerns (Gottfredson & Snyder, 2005). We include the borough of release as it could potentially indicate opportunities and networks available to individuals recently released from prison (Mears, Wang, Hay, & Bales, 2008). Given the lack of dynamic risk predictors available, geographic location may be the next best thing as it suggests such neighborhood characteristics as employment and work opportunities, living arrangements, and exposure to prosocial peers. Once our scale was constructed, rather than simply dividing the scale into thirds, we selected the bottom 30% as “low risk,” the top 30% as “high risk,” and the middle 40% as “medium risk.” 6
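The final step described above, cutting a continuous risk score into a bottom 30%, middle 40%, and top 30%, can be sketched as follows. This is an illustration only: the function name, the toy scores, and the handling of ties (broken here by sort order) are assumptions, since the article does not describe how scores at the cut points were treated.

```python
from collections import Counter


def classify_risk(scores, low_pct=30, high_pct=30):
    """Label each score 'low', 'medium', or 'high' by its rank within the sample."""
    n = len(scores)
    n_low = n * low_pct // 100    # size of the bottom group
    n_high = n * high_pct // 100  # size of the top group
    order = sorted(range(n), key=lambda i: scores[i])
    labels = [None] * n
    for rank, idx in enumerate(order):
        if rank < n_low:
            labels[idx] = "low"
        elif rank >= n - n_high:
            labels[idx] = "high"
        else:
            labels[idx] = "medium"
    return labels


# Hypothetical predicted probabilities of rearrest, not data from the study.
toy_scores = [0.12, 0.85, 0.40, 0.33, 0.91, 0.05, 0.57, 0.62, 0.48, 0.77]
labels = classify_risk(toy_scores)
print(Counter(labels))
```

In practice the scores would be the predicted values from the fitted logistic model; with many tied scores, group sizes can deviate somewhat from an exact 30/40/30 split.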
Figures 1a and 1b address our risk measure. Figure 1a illustrates the distribution of risk levels by program group—there are 220 individuals across the three study groups classified as low risk, 294 as medium risk, and 222 individuals classified as high risk. Although some differences in level of risk by group are apparent, a simple chi-square test indicates that these differences are trivial (χ2 = 5.968, ns). Slightly higher proportions of the GL group (33%) are categorized as high risk compared with the TSP (29%) and UPS (25%) groups, and higher proportions of the UPS group are classified as low risk (38%). These findings are largely consistent with the descriptive statistics reported by Wilson and Davis (2006, table 1) suggesting no significant difference in the various criminal history and demographic attributes across study groups.

Figure 1. Assessing the risk scale: (a) Distribution of risk by program group; (b) Rearrests by risk level of study participants
Figure 1b highlights the degree of discrimination within the risk scale showing the distribution of arrests by risk level. For low-risk individuals specifically, only 22.3% experience an arrest of any kind. In contrast, nearly 72% of high-risk individuals experience an arrest within 30 months of release, and 49% of those categorized as medium risk are rearrested in that 30-month period. The degree of discrimination across risk categories for any arrest is statistically significant (χ2 = 110.79, p < .001).
One of the weaknesses of our risk scale is that its post hoc construction means that we do not have a good sense of how our population compares with all others being released from New York State prisons. However, existing data suggest that New York’s 3-year recidivism rate is similar to the 67% rearrested nationally within 3 years after release from prison (see Independent Committee on Reentry and Employment, 2006; Langan & Levin, 2002). Figure 1b indicates that within 2 ½ years, 48% of our study sample had been rearrested, which suggests that our population overall is at a lower risk for rearrest than New York State prison releases in general. 7 Although, to our knowledge, no one has established specific criteria for what constitutes a low-, medium-, or high-risk offender, we believe that our scale represents these conceptual categories reasonably well (e.g., Wilson, 2005).
Findings
Figure 2 shows that the overall survival results for arrests closely mirror the original analysis by Wilson and Davis (2006). The longer term analysis, however, shows that the difference between the GL and TSP groups in the proportion surviving narrows over time, with 48% and 52% of participants, respectively, not experiencing a new arrest after 30 months. This contrasts with 66% of UPS participants, a substantial difference. Although these survival curves provide an excellent overview of the general trends, KM analyses are somewhat more informative as they test for statistically significant differences in the survival distributions.

Figure 2. Cumulative survival (life-table estimates) for all arrests and Ns by risk interval (N @ risk in interval)
Table 1 presents KM results for the entire population and is also our first examination of differences by risk level. KM generates three χ2 statistics: the log-rank, the Breslow, and the Tarone–Ware. These tests compare the number of terminal events with the expected number for each time interval in the study, but each uses a different weighting scheme to account for differences in time at risk. As we control for time at risk by censoring all cases at 30 months, all three produce similar estimates—thus, we show only the log-rank χ2 value and significance level. The survival distribution is a function of both the time to failure and the proportion failing and is given in the column labeled mean survival estimate (months). The column labeled censored includes the number and percentage of participants still at risk in the community without experiencing a new arrest.
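To make the two quantities described above concrete, the sketch below computes a Kaplan-Meier survival curve and the mean survival time restricted to a 30-month horizon for a small set of made-up cases. The function names and the toy data are assumptions for illustration; this is not the software or the data used for Table 1, and it omits the confidence intervals and log-rank test.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each distinct event time.

    durations: months until arrest or censoring; events: 1 = arrest, 0 = censored.
    """
    data = sorted(zip(durations, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures = 0
        leaving = 0
        # Gather everyone whose observation ends at time t.
        while i < len(data) and data[i][0] == t:
            failures += data[i][1]
            leaving += 1
            i += 1
        if failures:
            survival *= 1 - failures / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def restricted_mean_survival(curve, horizon):
    """Area under the step-shaped survival curve from 0 to the horizon."""
    area, last_t, last_s = 0.0, 0.0, 1.0
    for t, s in curve:
        if t > horizon:
            break
        area += last_s * (t - last_t)
        last_t, last_s = t, s
    return area + last_s * (horizon - last_t)


# Hypothetical cases: three arrests and two people censored at 30 months.
toy_months = [5, 12, 30, 30, 20]
toy_arrested = [1, 1, 0, 0, 1]
curve = kaplan_meier(toy_months, toy_arrested)
print(curve)
print(restricted_mean_survival(curve, 30))
```

In this toy example, 40% of cases are censored without an arrest and the mean survival time is 19.4 months, which illustrates how a group can differ from another on one of these quantities more visibly than on the other.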
Table 1 shows that at 30 months after release, and similar to the survival curves, 47.5% of all GL participants (mean survival time = 20.63 months), 51.8% of TSP participants (22.18 months), and 66.4% of UPS individuals (23.46 months) were still in the community without a new arrest. KM statistics indicate that the significant negative relationships in mean survival times between GL and TSP shown in the original analysis have largely dissipated. However, the UPS group still outperforms the GL participants to a statistically significant degree in terms of mean time to arrest.
Our analyses by risk suggest that the results are not only unevenly distributed by the risk level of the population but also trend in fundamentally different directions. Focusing on the low-risk comparisons first (and despite the small N), a marginally significant positive difference in mean survival times exists between the GL and TSP groups (27.51 vs. 25.48 months; χ2 = 2.77, p < .10) indicating that despite the trend toward poorer performance by the entire GL group, there is a small positive program effect for the lowest risk group. Furthermore, despite being marginally significant, the 10 percentage point difference in the percentage surviving without an arrest after 30 months (80.4 compared with 70.0) is a notable difference in recidivism. The difference in survival times between GL and UPS (27.51 months vs. 27.21 months) for low-risk individuals is not statistically significant although there is a 6 percentage point difference in the percentage without an arrest after 30 months.
Consistent with the classification of medium risk, both the survival times and percentage without a new arrest are less than those from the low-risk group. GL performs slightly worse than TSP, although the differences are not significant. Compared with the UPS group, GL’s moderate-risk offenders show a significantly shorter mean survival time (20.98 vs. 25.29 months, p <.01) and a 25 percentage point difference in the percentage without a new arrest (44.0% to 69.0%).
For those classified as high risk, the GL participants show a mean survival time that is significantly shorter, by 4 months, than that of the TSP participants (14.35 vs. 18.35 months). Similar to the low-risk participants, there is again a 10 percentage point difference in the percentage without a new arrest after 30 months, although in this case, the difference favors the TSP group. When comparing GL to the UPS group, mean survival times are roughly similar, but the percentage without a new arrest differs by about 8 percentage points (23.7% vs. 32.1%).
The effects of the GL program appear to operate most strongly in a counterintuitive direction with individuals of different risk levels. Moderate- and high-risk offenders who participated in the more intensive programming of Project GL generally performed worse in terms of arrests than comparable higher-risk offenders in the other two groups. And for low-risk participants, there appear to be small benefits associated with the GL intervention.
Multivariate Analyses
The KM tests and survival data provide an understanding of simple differences between study groups without the benefit of including other covariates. However, multivariate analyses allow us to model the controls simultaneously and thus provide a better understanding of whether any potential differences may account for, or moderate to some extent, the observed group differences just shown. Furthermore, GL has slightly more high-risk participants, and UPS has slightly more low-risk participants. In other words, GL may perform worse (and UPS better) than the other groups because the three groups vary enough on characteristics related to criminal recidivism to account for the group differences. Cox regression is a standard method for modeling time-to-event data with a set of control variables. We model the data for the full sample (rather than by risk level) because doing so indicates whether the differences in the controls (and, by definition, in risk levels) might account for the differences we have illustrated.
Table 2 shows three separate models: the first includes only the program groups (in which TSP is the comparison), the second shows demographic and criminal history covariates, and the third combines the first two to see if the covariates mediate the estimates between the study groups. Covariates were chosen based on widely recognized predictors of adult reoffending and include the variables we used in constructing the risk scales. Like many studies, the variables available to us are primarily “static variables,” or characteristics such as prior criminal history or age, that cannot be addressed through programming but that still have strong predictive utility (Andrews & Bonta, 2006; Gendreau et al., 1996). With the exception of the substance use indicator, we have no indicators of mutable characteristics such as antisocial attitudes, peer associations, or other dynamic predictors that might explain the differences.
Table 2. Cox Regression of Total Arrests Within 30 Months After Release (N = 736)
Note: UPS = Upstate; TSP = Transitional Services Programming.
*p = .10. **p = .05. ***p = .01. ****p = .001.
Each model presents the unstandardized coefficients (b) and the exponentiated coefficient, Exp (B), also known as the hazard ratio. The sign of the b coefficient indicates the direction of the effect on the hazard rate; a positive value means an increased hazard for arrests, and a negative value suggests a reduction in the hazard and increased survival times. We show the hazard ratio, Exp (B), because Luke and Homan (1998) indicate that it
provides the most useful information for estimating the magnitude of the effects of the covariates. . . (and). . .can be interpreted as a measure of ES, as a relative risk, and as the ratio of the relapse hazards of two types of persons. (p. 369)
A straightforward interpretation of the Exp(B) of 1.071 for prior felony arrests in Model 2 for example, indicates that a one-unit increase in prior arrests (each additional arrest) results in a 7.1% increase in the hazard rate, or the propensity for failure ([1.071 – 1.00] × 100 = 7.1%) (Luke & Homan, 1998).
In Model 1, and consistent with the prior analyses, the GL intervention is associated with a statistically nonsignificant 16.6% increase in recidivism compared with the TSP group, and the UPS group is associated with a nontrivial 32.3% reduction in recidivism compared with the TSP group. Model 2 indicates that traditional predictors such as release age, number of prior arrests, revocations, and bench warrants are statistically associated with mean survival time. Although release age seems to have a small relationship with recidivism (each additional year results in a 3.3% reduction in the hazard rate), understanding the net effect is also essential. In this case, using the b coefficient (−.033) to compute the change for a 10-year increase in age yields a 28.11% decrease in the hazard for failure. In addition, the borough of release also appears to be strongly related to mean survival time, with all four of the other New York City boroughs showing substantial reductions compared with Manhattan. Although the variables do not achieve statistical significance, the 29.9% increase associated with being Black compared with White/Other and the 24.6% increase for drug crimes compared with robbery crimes are still large by most standards.
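The percentage figures quoted above follow from two small conversions, sketched here with the coefficients reported in the text. The function names are illustrative assumptions, not part of any statistical package.

```python
import math


def pct_change_from_hazard_ratio(exp_b):
    """Percentage change in the hazard implied by a hazard ratio, Exp(B)."""
    return (exp_b - 1.0) * 100.0


def pct_change_from_coefficient(b, units=1):
    """Percentage change in the hazard for a multi-unit change in a covariate.

    The coefficient is scaled first and then exponentiated, which is why a
    10-year change is not simply 10 times the 1-year change.
    """
    return (math.exp(b * units) - 1.0) * 100.0


# Values taken from the text: prior felony arrests, GL and UPS in Model 1,
# and release age over a 10-year span.
print(round(pct_change_from_hazard_ratio(1.071), 1))
print(round(pct_change_from_hazard_ratio(1.166), 1))
print(round(pct_change_from_hazard_ratio(0.677), 1))
print(round(pct_change_from_coefficient(-0.033, units=10), 2))
```

The last line reproduces the 28.11% decrease for a 10-year increase in release age; the compounding on the log scale is what keeps it below a naive 33%.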
When the controls are combined with the program groups in Model 3, the analyses suggest sufficient differences to reverse the statistical significance of the relationships, with GL now showing a statistically significant 27.4% increase in the hazard for failure. Although nonsignificant, the UPS effect is only somewhat smaller in magnitude in the opposite direction, showing a 22.4% decrease in the hazard rate compared with TSP. The shifts in the values of the exponentiated coefficients between Models 1 and 3 for the GL (1.166 to 1.274) and UPS (0.677 to 0.776) groups with the addition of the controls indicate that differences in the individual attributes between groups do account for some of the differences in the findings for the KM analyses. The addition of the controls moderates the difference between the TSP and the UPS group and increases to some degree the difference between the TSP and the GL group. However, it does not erase those differences. It seems, then, that despite the lack of statistically significant differences in the descriptive statistics, GL was somewhat higher risk than TSP, and UPS was somewhat lower risk than TSP, in terms of the association with mean survival time. 8
In Table 3, we reproduce the same analyses as a series of logistic regressions. There are several reasons for this. First, the differences in mean survival times as reported in Table 1 are often a matter of only 2 or 3 months and a difference of this magnitude may not be viewed as very compelling. The percentage differences surviving at 30 months often appear more substantial. As an example, the statistically significant finding in Table 1 that the difference in mean survival time between GL and UPS is 2.8 months may not be as compelling as the 18.9 percentage point difference in new arrests at 30 months. In addition, control variables may show a different set of relationships with the binary outcome of a logistic analysis in contrast to the survival distribution. In this sense, the logistic analysis may provide further insights. As we have censored all cases at 30 months, differences in time at risk in the community do not affect our failure rates.
Table 3. Logistic Regression of Total Arrests Within 30 Months After Release (N = 736)
Note: UPS = Upstate; TSP = Transitional Services Programming.
*p = .10. **p = .05. ***p = .01. ****p = .001.
Model 1 in Table 3 indicates that the net effect of the GL intervention is similar to that shown in the Cox models in Table 2: participation in the GL program is associated with an 18.6% increase in the odds of rearrest. The UPS value of 0.544 suggests that being released directly from an upstate prison with no prerelease programming is associated with a 45.6% decrease in the odds of arrest as compared with the TSP group. Model 2 shows that the bulk of the predictor variables that were statistically significant in Table 2 remain so, although some of the estimates have shifted slightly in terms of their predictive utility. For instance, the effect of prior revocations on rearrest is much stronger in the logistic model (49.1% increase) than in the Cox model (26.2% increase). In the third model, the coefficients and significance values for the control variables remain virtually unchanged, as do those for the UPS and GL groups. Unlike the Cox analyses, then, the control variables appear to have little effect on the probabilities of rearrest at 30 months for the three different groups.
Table 4. Cohen’s d, by Risk Level and Recidivism Measure
Note: Cells that are highlighted indicate Cohen’s d values of 0.20 or greater. GL = Greenlight; TSP = Transitional Services Programming; UPS = Upstate.
*p = .10. **p = .05.
The final aspect of our analysis, as presented in Table 4, explores ES. We focus on the binary outcomes at 30 months rather than computations for the differences in mean survival times for several reasons. As other studies are more likely to have binary outcomes (which are more easily obtainable than survival distributions), our findings are more readily comparable. In addition, our logistic analysis indicated that binary outcomes were largely unaffected by the addition of the control variables whereas some small effects were noted in our Cox models. As we censored all cases at 30 months, differences in the binary outcomes are not likely due to differences in time at risk.
Cohen’s (1988) generalizations about ES have long been the standard in understanding the magnitude of differences in research and evaluation outcomes. ES of 0.20 or less have generally been considered small, those of about 0.50 moderate, and those of 0.80 or greater large. Lipsey and Wilson (1993, 2001), however, note that these estimates are not based on systematic evidence. Generating a distribution based on an analysis of over 300 meta-analyses from psychological, behavioral, and education research, they indicate more specifically that ES of ≤ 0.30 fall into the bottom quartile, the median ES is 0.47, and ES of ≥ 0.67 define the top quartile. We employ the conventions set by Cohen (1988) and Lipsey and Wilson (1993) as comparative referents. We compute an odds ratio as the first indicator of ES, and then convert that to the more widely recognized Cohen’s d. Table 4, then, presents the Cohen’s d for all arrests, revocations, and felony arrests. 9 Although not a focus of our discussion, largely because most of the contrasts are not of a sufficient N for adequate statistical power, we also show levels of statistical significance for the ES shown (Kraemer & Thiemann, 1987; Lipsey, 1990). And although we have not discussed recidivism in terms of revocations or felony arrests due to space limitations, we provide similar statistics here for those outcome measures for interested readers.
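The text does not state which formula was used to convert odds ratios to Cohen’s d. One widely used option, assumed here purely for illustration, is the logit method, d = ln(OR) × √3 / π. The Python sketch below applies it to a hypothetical pair of success rates; the function names and the example proportions are illustrative assumptions and are not values taken from Table 4.

```python
import math


def odds_ratio(p_first, p_second):
    """Odds ratio comparing the success rate of the first group with
    that of the second group (both given as proportions in (0, 1))."""
    return (p_first / (1.0 - p_first)) / (p_second / (1.0 - p_second))


def odds_ratio_to_cohens_d(or_value):
    """Convert an odds ratio to Cohen's d using the logit method
    (an assumed conversion; the article does not name its formula)."""
    return math.log(or_value) * math.sqrt(3.0) / math.pi


# Hypothetical example: 40% success in the first group, 25% in the second.
example_or = odds_ratio(0.40, 0.25)
print(round(example_or, 3))                          # 2.0
print(round(odds_ratio_to_cohens_d(example_or), 3))  # 0.382
```

Under this conversion an odds ratio of 1 maps to d = 0, and reversing the order of the two groups flips the sign of d without changing its magnitude, which matches the sign convention used for the row contrasts.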
In each contrast, a positive Cohen’s d indicates an increased probability of success for the first study group shown in each row, and a negative value indicates the same for the second group. Thus, the value of −0.09 for GL-TSP arrests in Table 4 indicates that TSP has proportionally fewer arrests than GL at 30 months, but this appears to be a fairly minor effect at best and the difference is not statistically significant. The next comparison, between GL and UPS, with a value of −0.43, indicates that the UPS group has proportionally fewer arrests at 30 months than the GL group (specifically, a standardized mean difference between the two of 0.43), with the difference achieving statistical significance. Based on the estimates of Cohen (1988) and Lipsey and Wilson (1993), this approaches a moderate or median ES. Also in this table, we have highlighted contrasts with a Cohen’s d of 0.20 or greater. We use this as our cutoff value because even small effects can have important practical consequences (e.g., Kirk, 1996).
For the full analysis, the directions of the effects are consistent with our prior results and indicate that GL performs worse than both control groups on every measure of recidivism although the ES differences with TSP tend to be quite small and nonsignificant. The UPS group performs better than both GL and TSP across all measures, and all differences are at least marginally statistically significant with estimates of the Cohen’s d ranging between small and moderate.
As we have already noted, the size and the direction of some of these effects are differentially distributed in important ways by the risk levels of the populations examined. For those scored as low risk, GL participants reverse the overall trend and perform better than the TSP group, with a d value of 0.31 that falls between small and moderate. In comparison with the UPS group, GL performs worse, with small to moderate effects exhibited. The UPS group performs better than the TSP group with at least moderate-sized effects.
When the medium-risk participants are the focus, the same general pattern exists as for the full analysis: GL performs worst and UPS performs best. However, the ES are larger for every contrast other than the TSP-UPS difference for felony arrests. This is especially true for arrests and revocations; for revocations, the ES more than double. Thus, the Cohen’s d values for many of the contrasts for medium-risk individuals are moderate to large, and 6 of the 9 contrasts are at least marginally statistically significant.
High-risk contrasts show differences that are relatively smaller compared with the low- and medium-risk participants, and the effects are a bit more mixed in terms of direction. ES of magnitudes less than 0.20 characterize most of the contrasts. GL performs worse than both TSP and UPS for arrests and felony arrests, where ES range between −0.18 and −0.29. In contrast to the overall analysis, UPS actually performs worse than TSP for arrests and revocations, and worse than GL for revocations, although these differences are negligible.
Discussion and Conclusion
Our basic analyses are largely consistent with the original findings by Wilson and Davis (2006). The current research, however, provides valuable new insights that suggest the importance of tailoring programming to a target population, both in design and implementation. In the full analyses, we find that GL consistently performs worse than both of our comparison groups, and UPS consistently performs better. Our analyses by risk level of the participants, however, show that there are fundamental differences in the distribution of effects by risk level that give us a better understanding of the overall relationships. We show that intensive programming may result in negative outcomes for moderate- and high-risk offenders when the design and implementation do not address the needs of these populations.
The multivariate analyses suggest that our control variables explain some differences in the survival distributions, but contribute little to our understanding of the study group differences in the dichotomous outcomes. The various individual attributes combine to make GL slightly higher risk than TSP, and TSP slightly higher risk than UPS. However, the individual differences, and consequently differences in risk, do not explain away the variation between study groups. These findings give us greater confidence that flaws in the assignment procedures and research design do not account for the differences in the findings. Despite our post hoc construction of the risk scale, we are somewhat reassured in our construction of risk groups, given the differences in rearrest rates across groups.
Although GL performs worse than TSP (our primary control group) in the full analysis, the effects of the intervention are differentially distributed across risk levels, with low-risk GL participants exhibiting positive program effects and moderate and higher risk participants showing small to large negative effects. This finding initially appears to contradict certain principles of correctional interventions suggesting that the most intensive programming should be reserved for higher-risk offenders.
Wilson and Davis (2006) suggested that poor program implementation, including the mismatch between program structure and the population to whom it was delivered, was a likely explanation for the negative program effects in the original analysis. The findings presented here point toward a more nuanced discussion of the relationship between risk, program design, and program implementation. If program implementation were the sole explanation for negative outcomes of the GL intervention, then we would not likely see positive program effects associated with low-risk offenders.
The risk principle holds that the most intensive programming should be reserved for those who are moderate to higher risk. However, the responsivity principle holds that treatment programs should be delivered in a style and mode consistent with the ability and learning style of the offender (Andrews & Bonta, 2006, p. 283). The GL intervention’s intensive programming may have been inappropriate for moderate- and high-risk offenders given the compressed delivery time, increased class sizes, and additional program elements. Low-risk offenders tend to be less impulsive, have better attention and social skills, higher verbal skills, greater cognitive maturity and other individual attributes that make them low risk by definition. Consistent with the responsivity principle, they may be better able to process the more intensive GL program in the shorter time period than the moderate- and higher-risk individuals.
The literature also argues that when effective programs are well implemented, the greatest reductions in recidivism can be achieved with moderate- and high-risk offenders. In the same sense, when programs are poorly implemented or targeted, the greatest negative impact seems likely to occur among low- and moderate-risk individuals. Just as low-risk individuals are unlikely to see huge reductions in recidivism even when programs are effective, poorly implemented programs may not increase the probability of reoffending of high-risk groups, as they are already at high risk to reoffend. Poorly implemented programming, however, could increase the probability of reoffending for those in low- and moderate-risk categories. In our comparisons between GL and TSP, the data are largely consistent with this interpretation. There are generally small positive effects for the low-risk GL individuals, who may be better able to absorb the “very” intensive GL programming, and for moderate-risk offenders, there are small to large program effects that are negative.
The UPS group poses a separate problem of interpretation altogether, especially because UPS does better across almost every contrast presented, with the exception of high-risk participants. Our findings suggest that the inmates’ subjective experience of the programming and/or the involuntary relocation right before release may be related to negative outcomes. Both the TSP and GL groups were involuntarily placed in programming and transferred to another facility immediately prior to release. GL program designers assumed that transferring individuals to an institution in their home community right before release would help them in the prerelease planning process. It may be, however, that forcibly uprooting individuals from an environment with which they are familiar to place them in an environment they do not know, without sufficient time to adapt, has deleterious consequences. Our data therefore raise the possibility that prison transfers or coerced programming (or a combination of the two) immediately before release may be counterproductive. At the very least, these issues warrant a harder look.
There is little doubt that designers of the GL program failed to consider a number of important aspects of effective correctional programming, and that the GL program might have been better structured and more appropriately targeted. Our analysis suggests that some intensive programming, compressed in a short period of time, may have beneficial effects for low-risk individuals. Delivering the GL program in a compressed “dose” to large classes of low-risk inmates may be worthwhile, especially if significant reductions in recidivism can be achieved at a small cost, one of the major appeals of the GL program.
Prisons, probation, and parole are often tasked with changing maladaptive behavior and deficits in “prosocial” character, skills, and attributes. With “rehabilitative” interventions, correctional agencies attempt, in what is often a very short period of time, to correct for a lifetime of less than adequate positive socialization. We have illustrated that for some higher-risk individuals, intensive programming may be inappropriate if delivered over too short a period of time and that such programming may be beneficial for low-risk individuals. We have also raised questions about the wisdom of transferring inmates and/or placing inmates in coerced programming immediately before release. Future work might be well served by considering the effects of prison transfers, as they may exacerbate antisocial behaviors. Our findings raise important questions about program structure and delivery and contribute in a fundamental way to a continued evolution in understanding what correctional programs work for whom and why.
Footnotes
A prior version of this article was presented at the Academy of Criminal Justice Sciences meeting in March 2009.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The authors received no financial support for the research, authorship, and/or publication of this article.
