Abstract
We provide a comparison of analyses used to estimate predictive validity, across fixed (logistic regression and area under the curve receiver operating characteristic [AUC-ROC]) and variable (Cox regression and Harrell’s C) lengths of follow-up. This study adds to research demonstrating a relationship between time at risk offense free and recidivism in two ways. First, reoffending hazard rates were calculated across levels of general offending risk to better understand how failure relates to time at risk. Second, this research compared validity estimates derived from Cox and logistic regression analyses to examine the importance of variable versus fixed follow-up periods. Results show that risk declines as a function of time offense free for all but low risk offenders. In addition, findings demonstrate remarkable stability in estimates of validity after just 7 months of follow-up. Finally, comparisons of Cox and logistic regression analyses, along with their related Harrell’s C and AUC-ROC validity estimates, revealed little substantive difference in prediction.
The assessment and prediction of recidivism has become an integral part of modern correctional practice, wherein agencies must grapple with seemingly contradictory responsibilities that include punishment and confinement, as well as rehabilitation and reentry. In addition, institutional overcrowding, unmanageable community supervision caseloads, and limited intervention options have all become obstacles to achieving the correctional goals of population management, recidivism reduction and, ultimately, public safety. Offender risk assessment has the ability to offer some relief in this regard. The assessment and classification of offenders into risk-based categories can facilitate better informed decisions concerning the use of precious correctional resources.
Methods used to assess and predict recidivism have evolved considerably over the last century. From the humble beginnings of simply summing a few historical offender characteristics to the complex assessment of a variety of historical and currently occurring factors that both predict recidivism and serve as the basis for case management, offender assessment has become synonymous with effective correctional practice (Andrews & Bonta, 2010). In fact, research demonstrating the reliability and validity of numerous assessment tools across a variety of outcomes is relatively easy to come by in the existing literature. As a result, contemporary research can focus less on what and how to assess, and more on innovations and statistical issues related to predicting recidivism.
Examining the effect that the time spent at risk without committing a new offense may have on recidivism risk is one recent innovation to emerge in the area of recidivism prediction. Initial research into this topic was conducted from the policy perspective of background checks and employment eligibility for ex-offenders. Specifically, several studies have investigated the relevance of criminal history in predicting recidivism within the context of time at risk offense free (Blumstein & Nakamura, 2009; Bushway, Nieuwbeerta, & Blokland, 2011; Hanson, Harris, Helmus, & Thornton, 2014; Harris & Rice, 2007; Kurlychek, Brame, & Bushway, 2006, 2007; Soothill & Francis, 2009). One of the first studies to examine this question was conducted by Kurlychek et al. (2006). Using a large dataset from the Second Philadelphia Birth Cohort Study, the authors compared offenders’ risk of any recidivism with nonoffenders’ onset risk, and found that although the two groups never converged, they did become very risk-similar after a period of 6 to 7 years. Similar research has replicated this effect, finding that offender and nonoffender risk to commit an offense converges after offenders were offense free in the community for 7 to 10 years (Kurlychek et al., 2007; Soothill & Francis, 2009).
Blumstein and Nakamura (2009) coined this effect “time to redemption,” defined as “[t]he process of going straight and being released from bearing the mark of crime” (p. 328). Alternatively, redemption can be thought of as the time at which offender recidivism risk has declined to where it is nearly indistinguishable from that of nonoffenders. Results of their analyses indicated that the redemption effect differed across age at first offense and type of crime. Time to redemption has also been found to be influenced by age at first conviction and number of prior offenses (Bushway et al., 2011). Specifically, risk of recidivism declined as a function of time offense free, and the redemption period was the shortest for those offenders who were older and had less extensive criminal histories (Blumstein & Nakamura, 2009; Bushway et al., 2011).
Although the studies discussed above looked at the effect that time offense free had on recidivism risk while also controlling for initial risk differences (e.g., age at first offense, number of prior offenses), only two existing studies have used a validated risk assessment score and resulting risk categorization (e.g., low risk, moderate risk, high risk) in this respect. Harris and Rice (2007) examined the effects of age, time since first offense, time incarcerated, and time in the community offense free on violent recidivism across the nine risk levels of the Violence Risk Appraisal Guide (VRAG). Their findings indicated that time in the community offense free affected violent recidivism to such an extent that it can be used to adjust actuarial risk reassessment scores. Interestingly, however, this pattern did not hold for the three highest VRAG risk categories of offenders.
The most recent investigation into the effect of time offense free on recidivism risk involved an examination of sex offenders across risk level categorization (as determined by the STATIC-99/R). Hanson et al. (2014) studied a large sample (N = 7,740) of sex offenders over a 20-year follow-up period to calculate sexual recidivism rates using survival analysis. The analysis in this study was innovative in that the authors used life table survival analysis to estimate recidivism rates, which allowed them to state the relative reduction of recidivism that was attributable to time sex offense free in the community. Their results suggested that, after the first few years of follow-up, the risk of sexual reoffending was reduced by approximately half for each 5 years the offender remained sex offense free in the community (Hanson et al., 2014). This finding was especially pronounced for high risk offenders who were tracked for 10 years. The percentage of these offenders sexually recidivating, for example, declined from 29% at the time of release to 13% for those who remained offense free for 5 years and to 6% for those who remained offense free for 10 years (Hanson et al., 2014).
The body of research discussed above used several different statistical approaches in estimating the relationship between time offense free and recidivism. A common theme in recidivism prediction is that of fixed versus variable outcome follow-up periods. Specifically, logistic regression analysis essentially ignores the length of follow-up in an outcome study and instead examines whether or not an event has occurred (Menard, 2010). This makes it particularly well suited to datasets with fixed follow-up periods. By contrast, survival analysis (including Cox regression) is focused on understanding the time to recidivism, as well as on understanding how other variables might also simultaneously relate to time to recidivism (Allison, 2010).
Relatedly, Harrell’s C and area under the curve receiver operating characteristic (AUC-ROC) analyses of predictive accuracy correspond to Cox regression and logistic regression, respectively. These indices are often used to compare outcomes across models. The AUC-ROC is generally used when using logistic regression models to classify cases with dichotomous outcomes and, while generally not concerned with time, has been used to discriminate recidivists from nonrecidivists (Hilbe, 2009). Specifically, it is a plot of sensitivity (vertical axis) by the inverse specificity or 1 − specificity (horizontal axis) across various cut-points used to classify offenders into one of two groups (i.e., recidivists vs. nonrecidivists; Kleinbaum & Klein, 2012). Stated another way, the AUC-ROC is a plot of the true positive rate by the false positive rate. This statistic basically provides an assessment of how well the model predicts who will have a particular outcome by providing information on the probability that a randomly chosen recidivist (i.e., true case) will have a higher score than a randomly chosen nonrecidivist (i.e., true noncase; Kleinbaum & Klein, 2012).
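The probabilistic reading of the AUC-ROC described above can be illustrated with a short calculation. The following is a minimal sketch, not the procedure used in this study: the function name `auc_pairwise` and the scores and rearrest flags are hypothetical, and tied scores are counted as half a concordant pair (a common convention that is assumed here).

```python
from itertools import product


def auc_pairwise(scores, recidivated):
    """Share of (recidivist, nonrecidivist) pairs in which the recidivist
    has the higher score; tied scores count as half a pair."""
    cases = [s for s, r in zip(scores, recidivated) if r]
    noncases = [s for s, r in zip(scores, recidivated) if not r]
    if not cases or not noncases:
        raise ValueError("need at least one recidivist and one nonrecidivist")
    wins = 0.0
    for case_score, noncase_score in product(cases, noncases):
        if case_score > noncase_score:
            wins += 1.0
        elif case_score == noncase_score:
            wins += 0.5
    return wins / (len(cases) * len(noncases))


# Hypothetical risk scores and rearrest flags, for illustration only.
scores = [3, 5, 6, 8, 9, 12]
rearrested = [0, 0, 1, 0, 1, 1]
auc = auc_pairwise(scores, rearrested)
```

In this toy example, 8 of the 9 recidivist/nonrecidivist pairs are ordered correctly, so the value is 8/9. Note that the time at which a rearrest occurred plays no role in the calculation.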
Although the AUC-ROC has become a commonly used metric when examining dichotomous outcomes, it has limitations when evaluating censored outcomes (Kang, Chen, Petrick, & Gallas, 2015). Specifically, survival outcomes are typically continuous rather than binary, necessitating an extension of the AUC-ROC concept to survival analysis through the use of the Harrell’s C index. As stated by Harrell, Lee, and Mark (1996), “The [Harrell’s] C index is defined as the proportion of all useable patient pairs in which the predictions and outcomes are concordant” (p. 370). Hence, in predicting time until an offender recidivates, the Harrell’s C would be calculated by considering all possible pairs of offenders, at least one of whom recidivated. If the predicted survival time was longer for the offender who remained offense free for a longer period of time, “the predictions for that pair are said to be concordant with the outcomes” (Harrell et al., 1996, p. 370). Various scholars have concluded that the Harrell’s C is the most appropriate metric for capturing the discriminating ability of a predictive variable to separate those with longer event-free survival from those with shorter event-free survival within some time horizon of interest (Kang et al., 2015).
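The pairwise logic behind Harrell’s C can be sketched in a few lines. This is an illustrative simplification rather than the estimator used in this study: the function name `harrell_c` and all data are hypothetical, a higher risk score is assumed to imply a shorter predicted survival time, pairs with tied times are skipped, and tied scores count as half a concordant pair.

```python
def harrell_c(times, events, scores):
    """Proportion of usable pairs whose ordering by risk score agrees with
    the observed ordering of failure times.

    A pair is usable when the offender with the shorter time was observed
    to fail (events == 1); otherwise the true ordering is unknown.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Offender i fails first, and that failure was actually observed.
            if times[i] < times[j] and events[i]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    if usable == 0:
        raise ValueError("no usable pairs")
    return concordant / usable


# Hypothetical months at risk, rearrest flags (0 = censored), and risk scores.
months = [4, 10, 15, 30, 60, 60]
events = [1, 1, 0, 1, 0, 0]
scores = [12, 9, 10, 6, 7, 3]
c_index = harrell_c(months, events, scores)
```

Here 11 pairs are usable and 9 of them are concordant. Unlike the AUC-ROC, the calculation uses the timing of each failure and sets aside pairs whose ordering cannot be determined because of censoring.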
Although the two statistics are calculated differently and, hence, are not directly comparable, their interpretation as effect sizes is the same. For both Harrell’s C and AUC-ROC, a value of .50 indicates no ability to discriminate, while a value of 1.0 (or 0.0) indicates perfect separation of recidivists from nonrecidivists. AUC-ROC scores and Harrell’s C values above the .50 threshold indicate that the instrument has positive predictive accuracy (i.e., higher scores predict recidivism), while scores below .50 mean that the instrument has negative predictive accuracy (i.e., lower scores predict recidivism). It is important to note that when AUC-ROC scores or Harrell’s C values are calculated from Cox or logistic regression models, the values will always range from .50 to 1.0 because they are based on predicted probabilities; as a result, the effect size values must be flipped in cases where the instrument has a negative predictive direction. Generally, Harrell’s C and AUC-ROC values of .70 or higher indicate that the model performs well at prediction (Rice & Harris, 2005). Finally, it is worth noting that several researchers recommend the AUC-ROC as an index for correctional prediction research (Babchishin & Helmus, 2015; Rice & Harris, 2005).
The current study adds to this growing body of research by examining the time offense free effect for general offenders and by using (and comparing) multiple statistical techniques. To do so, this research used a large sample (N = 27,156) and a 10-year follow-up period. The large sample and lengthy follow-up used in this research were conducive to providing a comparison of analyses used to estimate validity, across both fixed (logistic regression and AUC-ROC) and variable (Cox regression and Harrell’s C) lengths of follow-up.
Method
Participants
This research used a subsample of offenders that was part of the original Post Conviction Risk Assessment (PCRA) construction and validation project (see Johnson, Lowenkamp, VanBenschoten, & Robinson, 2011; Lowenkamp, Johnson, Holsinger, VanBenschoten, & Robinson, 2013). Individual offenders placed on community supervision between October 1, 2004 and December 1, 2005, were selected for inclusion in this study (N = 53,077). This number was further reduced to the final sample size of 27,156 once all records missing the necessary data to create a PCRA score were eliminated from the dataset.
Sixty percent of the sample was categorized as White while 33% of the sample was categorized as Black. Much smaller percentages of the sample were identified as Asian (3%), Native American (4%), Other (< 1%), or unknown (< 1%). Approximately 78% of the sample was male, and just more than 22% of the sample was female. The average offender age at the time supervision was initiated was 37 years (SD = 11.4). On average, the offenders were followed for about 9.6 years (SD = 0.34; range: 9-10.2 years).
Measure of Risk
The PCRA (Johnson et al., 2011; Lowenkamp et al., 2013) was used as the measure of risk in this study. PCRA scores were developed based on administrative data and include 15 items related to criminal history, education and employment, alcohol and drug use, social networks, and cognitions. The items are summed together, generating a score that ranges in value from 0 to 18 and represents four categories: low (0-5), low/moderate (6-9), moderate (10-11), and high risk (12-18). The PCRA has shown strong or excellent validity in predicting any rearrest (Lowenkamp, Holsinger, & Cohen, 2015; Lowenkamp et al., 2013) and rearrest for violent offending (Harris, Lowenkamp, & Hilton, 2015; Lowenkamp et al., 2015; Skeem & Lowenkamp, 2015). It has also been demonstrated that the PCRA is a valid measure over time and that changes in PCRA scores predict changes in the likelihood of recidivism (Cohen, Lowenkamp, & VanBenschoten, 2016; Cohen & VanBenschoten, 2014), that it has good interrater agreement (Lowenkamp et al., 2013), and that the functional form of the relationship between the PCRA and rearrest is the same across categories of race (Skeem & Lowenkamp, 2015) and gender (Skeem, Monahan, & Lowenkamp, 2016).
Outcome Measures
Record checks contained criminal history data from the National Crime Information Center (NCIC) and Access to Law Enforcement System. The dates of rearrests that occurred after the date of PCRA (intake to supervision) were coded from these data. Overall, 50% of the sample was rearrested for a new criminal offense after intake to supervision (see Table 1). This 50% recidivism rate used the variable follow-up time. Measures of rearrest from fixed periods of 1 month through 12 months (not included in Table 1) were created, as were rearrest rates for each of the 9 follow-up years. We also randomly selected a cutoff time for each offender. This process involved using a random sampling approach to place the study population into 10 groups whose recidivism activity could be tracked for periods ranging from 1 to 10 years. For example, we randomly sampled 10% of the study population for the 1-year follow-up group, and then randomly sampled another 10% of the study population for the 2-year follow-up group and so on. Through this approach, we were able to generate a random follow-up sample consisting of 10 groups whose recidivism activity was limited to the varying follow-up times. Rearrests were then developed based on the randomly selected follow-up time. The failure rate based on this last approach was 36%.
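The random follow-up procedure described above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this study: the function names `assign_random_followup` and `censor_at_cutoff`, the seed, and the example values are hypothetical, and the population is assumed to divide into 10 groups of roughly equal size.

```python
import random


def assign_random_followup(offender_ids, n_groups=10, seed=0):
    """Randomly place each offender in one of n_groups groups of roughly
    equal size; group k is followed for k years."""
    rng = random.Random(seed)
    ids = list(offender_ids)
    rng.shuffle(ids)
    # Dealing the shuffled ids out in turn gives each group about 1/n_groups
    # of the population.
    return {oid: (position % n_groups) + 1 for position, oid in enumerate(ids)}


def censor_at_cutoff(years_to_rearrest, cutoff_years):
    """Return (observed time, rearrest flag) within the assigned window.

    years_to_rearrest is None when no rearrest was recorded.
    """
    if years_to_rearrest is not None and years_to_rearrest <= cutoff_years:
        return years_to_rearrest, 1
    return cutoff_years, 0


# Hypothetical population of 100 offenders identified by 0..99.
followup = assign_random_followup(range(100), seed=42)
```

Under this sketch, an offender rearrested after 4 years who is assigned a 3-year window is counted as a nonrecidivist, which is why a failure rate based on random follow-up times would be expected to fall below the rate based on the full follow-up.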
Table 1. Descriptive Statistics (N = 27,156)
Note. PCRA = Post Conviction Risk Assessment.
Analyses
Multivariate Cox and logistic regression models, and the associated Harrell’s C and AUC-ROC values, were estimated for each of the nine yearly follow-up periods and for the outcome measure based on randomly selected follow-up times. AUC-ROC values, associated standard errors, and 95% confidence intervals were estimated for 21 different time periods (1-11 months, 1-9 years, and the maximum follow-up time). This examination of the relationship between failure and time also involved comparing the patterns in the hazard and odds ratios generated from both models over the specified time periods. It should be noted that an exact comparison of odds ratios and hazard ratios (HRs) should be avoided because the formulas and interpretations for the two effect sizes differ. Specifically, the denominator used in calculating odds ratios is the number of nonrecidivists, while HRs were developed by dividing the number of recidivists by the total sample size (Kleinbaum & Klein, 2012). Given these differences in calculating odds ratios and HRs, odds ratios will generally be larger than HRs because the denominator in the odds ratio is smaller. Although the two are not directly comparable, we examined the patterns in the odds ratios and HRs to ascertain whether both models evidenced similar patterns in the relationship between the raw PCRA score and recidivism over time. Moreover, we converted the odds ratios from the logistic regression models into estimated HRs, allowing for direct comparisons between the two effect size metrics.
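The point about denominators can be made concrete with a small worked example. The sketch below follows the simplified description given in the text (recidivists divided by nonrecidivists versus recidivists divided by the total), so the second quantity is a ratio of proportions rather than a Cox hazard ratio, which also depends on the timing of failures. The counts and function names are hypothetical, and the example does not reproduce the odds-ratio-to-HR conversion used in this study.

```python
def odds(recidivists, total):
    """Recidivists divided by nonrecidivists."""
    return recidivists / (total - recidivists)


def proportion(recidivists, total):
    """Recidivists divided by the total group size."""
    return recidivists / total


# Hypothetical groups of 100 offenders each: 40 rearrests versus 20 rearrests.
odds_ratio = odds(40, 100) / odds(20, 100)               # (40/60) / (20/80)
rate_ratio = proportion(40, 100) / proportion(20, 100)   # (40/100) / (20/100)
```

With these counts the odds ratio is about 2.67 while the ratio of proportions is 2.0. The gap between the two widens as the base rate of failure rises, which is relevant for high base rate outcomes such as general rearrest.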
An important assumption when employing Cox regression models, also known as proportional hazard models, is the assumption that the effects of each covariate are the same at each time point. In other words, the effects are independent of time. If this assumption is violated, then the returned coefficient is akin to an average effect over time. There are a number of ways to test for and address a violation of this assumption. Given the key interest in this paper (that the HRs might change over time), the assumption of proportionality was tested for and addressed by adding an interaction term between the total PCRA score and time. Aside from testing the assumption of proportionality, this interaction term also tests the notion that the effect of the total PCRA score varies over time (Allison, 2010).
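The interaction approach described above can be written compactly. The following is a sketch only: the function of time g(t) is an assumption, because the text does not state which form was used (t and log t are common choices), and the race and sex covariates are omitted for readability.

```latex
h(t \mid \mathrm{PCRA}) = h_0(t)\,\exp\bigl(\beta_1\,\mathrm{PCRA} + \beta_2\,\mathrm{PCRA}\cdot g(t)\bigr)
\qquad\Longrightarrow\qquad
\mathrm{HR}(t) = \exp\bigl(\beta_1 + \beta_2\,g(t)\bigr)
```

Here HR(t) is the hazard ratio associated with a one-point increase in the total PCRA score at time t. The proportional hazards assumption corresponds to β2 = 0, in which case the hazard ratio does not depend on time; a negative β2 would correspond to an effect that weakens as time at risk accumulates.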
To better understand how failure relates to time at risk, a life table by year and risk category was calculated. The hazard rates, by year, for the entire sample and for each category of risk were calculated and graphed. Finally, the cumulative failure rate (CFR) and hazard rate by year were calculated for each risk category (see Table 4).
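A yearly life table of the kind described above can be sketched as follows. This is a minimal illustration, not the procedure used in this study: the function name `life_table` and the data are hypothetical, and the standard actuarial adjustment (cases censored during a year count as at risk for half of that year) is an assumption.

```python
def life_table(times, events, n_years):
    """Yearly actuarial life table.

    times:  years to rearrest, or to the end of follow-up if never rearrested
    events: 1 if rearrested, 0 if censored
    Returns one row per year with the hazard rate (proportion of those still
    at risk who fail during the year) and the cumulative failure rate.
    """
    rows = []
    surviving = 1.0
    at_risk = len(times)
    for year in range(n_years):
        in_year = [e for t, e in zip(times, events) if year <= t < year + 1]
        failed = sum(in_year)
        censored = len(in_year) - failed
        # Actuarial adjustment: censored cases are at risk for half the year.
        effective = at_risk - censored / 2
        hazard = failed / effective if effective > 0 else 0.0
        surviving *= 1 - hazard
        rows.append({"year": year, "at_risk": at_risk, "failed": failed,
                     "hazard": hazard, "cfr": 1 - surviving})
        at_risk -= failed + censored
    return rows


# Hypothetical follow-up data for 10 offenders; five are still offense free
# when the 3-year window ends.
times = [0.5, 0.8, 1.2, 1.5, 2.5, 3.0, 3.0, 3.0, 3.0, 3.0]
events = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
table = life_table(times, events, n_years=3)
```

Running the same function separately on the cases in each risk category would yield hazard rates and CFRs by year and risk level.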
Results
Table 1 presented the descriptive statistics for the sample (N = 27,156). Offenders were mostly White (60%), male (78%), and low to low/moderate risk on the PCRA (87%). On average, these offenders received raw PCRA scores of 6.2 (SD = 2.8) and, thus, were in the low/moderate range of the PCRA. The average time to failure was about 76 months (SD = 44.6). The percentage of offenders rearrested by follow-up period ranged from 13% for the 1-year follow-up to 51% for the 9-year follow-up.
Table 2 presented the results from a series of Cox regression and logistic regression models with several iterations of the dependent variable (i.e., new arrest). Specifically, the dependent variable was limited by each year of a 9-year follow-up period. Models were also presented using all years together for the dependent variable, which included those followed for 9 and 10 years (“All Years”), as well as one where cases were randomly assigned a variable follow-up time. This random assignment of follow-up time allowed for the estimation of models with variable rather than constant follow-up time periods. We chose to control for race and sex in the multivariate models given the relationship between race, sex, and rearrest. The decision to control for an offender’s race and sex means, by implication, that the effect sizes presented throughout this study were not solely representative of the PCRA’s predictive accuracy. Rather, these effect sizes were representative of a combination of the PCRA’s risk scales and an offender’s demographic characteristics.
Table 2. Results of Logistic Regression and Cox Regression Models Predicting Rearrest and Time to Failure With Total PCRA Score
Note. All hazard ratios and odds ratios are significant at p < .001. Harrell’s C based on Cox Regression Models without interaction between time and PCRA total score. Harrell’s C and AUC-ROC are based on models with race, sex, and PCRA total score. PCRA = Post Conviction Risk Assessment; CI = confidence interval; AUC-ROC = area under the curve–receiver operating characteristic.
Outcome was developed based on randomly assigning 1 of 10 follow-up time periods.
p < .001.
We attempted to address two issues via the results presented in Table 2. First was an examination of the two different procedures. Both of these multivariate procedures (Cox regression and logistic regression) have been commonly used in recidivism studies generally, and risk assessment validation studies in particular, with some purporting one method may hold more value than the other. In actuality, according to these data, the results were quite similar for each model when comparing the relative statistics from one procedure (Cox regression) to the next (logistic regression). An examination of the hazard and odds ratios showed both sets of models (Cox regression and logistic regression) revealing a moderate amount of stability in the likelihood of failure occurring for each length of follow-up period, regardless of which procedure was used. For example, the odds ratios for the fixed follow-up periods ranged from 1.40 to 1.44; only for the randomized follow-up period did the odds ratio dip below the 1.40 mark to 1.38. Conversely, the HRs produced from the Cox regressions ranged from 1.33 when the study period included all years to 1.39 for the study periods covering the 2- and 5-year follow-ups. Moreover, the results remained relatively consistent for the randomized follow-up period relative to each varying length of fixed follow-up time (including “All Years”) showing relative stability regardless of whether a fixed or variable follow-up was used.
In addition, there was not a great deal of difference when comparing the HRs for the Cox regression models to the HRs that were generated via converting the logistic regression odds ratios into estimated HRs, a pattern that holds true for each length of follow-up period used. With the exception of the 1-year follow-up, the actual HRs were consistently higher than the estimated HRs but not markedly so. Even when the difference was most stark, for example, the model that utilized the random follow-up period (Actual HR = 1.34; Estimated HR = 1.25), this difference would not likely translate into a difference of great magnitude, if the ultimate objective was to determine real impact on actual recidivism rates.
The interaction term between the total PCRA score and time indicated that the PCRA’s predictive accuracy declined slightly as time went on. We also found that the accuracy of the total PCRA score diminished at a greater rate for higher risk offenders than for lower risk offenders. The Cox regression model reiterated what was observed in Table 4 and Figure 1: the effect of the total PCRA score was not constant across time, and it seemed to vary to a greater extent for higher risk offenders.

Figure 1. Hazard rates (as percentage) by year and risk
Similar results were revealed when comparing the Harrell’s C values to the AUC-ROC values for each iteration of the follow-up period. First, there was very little difference between the respective methods, for each length of follow-up period. As with the comparison of odds ratios to HRs above, AUC-ROC values were consistently higher than their Harrell’s C counterparts, but not markedly so. In all cases, these respective values indicated an acceptable level of predictive accuracy (.70 or higher). Moreover, there was very little difference when examining these measures of predictive accuracy within the statistical techniques applied for this study (i.e., Cox and logistic regression). In short, the Harrell’s C values indicated adequate predictive power regardless of the length of time utilized for a follow-up period, as do the AUC-ROC values. It is important to note that the maximum values of both Harrell’s C and AUC-ROC were not affected by varying sample size, allowing for consistent comparison regardless of the number of cases that may have been utilized in each set of models.
At least two preliminary conclusions can be drawn from the results presented immediately above. First, both Cox regression and logistic regression revealed such similar results that they may be of equal value regarding recidivism and risk-prediction research. Second, and perhaps of more utility, was the consistency for each method, when comparing like models across varying lengths of follow-up periods. Although as noted above the predictive accuracy did not appear to vary much across varying length of follow-up time, the value was at or near the highest level for the shortest amount of time (1 year). Generally, when it comes to the amount of follow-up time, “more” is regarded as “better” when designing and executing any study involving recidivism or the predictive validity of a risk scale, for example. However, (depending on other research conditions and/or data limitations), perhaps the raw amount of time that is necessary is actually shorter than what conventional wisdom to date has dictated, at least for high base rate outcomes such as general recidivism (Jones, 1996). Jones (1996), for example, recommended an average follow-up period of 2 years for recidivism prediction research. In the current instance, it did not really matter how long a case was followed, as the same statistics indicating the same predictive accuracy resulted. One notable exception to this would be if there was a need to establish base rates of failure (i.e., an estimate of the actual failure rate for a particular group of offenders), and/or there was interest in attempting to calibrate an instrument via the definition of specific cutoff scores.
A more granular examination of follow-up time was presented in Table 3. In an effort to determine the shortest amount of follow-up time necessary to observe stable and valuable results, the bivariate AUC-ROC values were calculated for each month from 1 through 11 months, as well as for each year (1 year through 9 years), and using all years together. Perhaps unsurprisingly the AUC-ROC value was the lowest (.65) for the 1-month follow-up period, indicating the weakest capacity in the scale’s ability to distinguish successes from failures (defined as any recidivism). However, AUC-ROC values reached a level above .70 at 3 months (.71), and topped out at the 7-month mark (AUC-ROC = .74; 95% CI = .73 to .75). Although AUC-ROC values appeared to increase, slightly, at the 1-year mark, they topped out and remained stable after that point, as observed in Table 2. Such small statistical increases were unlikely to be worth the additional necessary follow-up time. Furthermore, these results may indicate that researchers seeking to develop and test the validity of risk assessment instruments predicting general recidivism need not engage in follow-up assessments covering multiple years. Stated differently, it might be feasible to construct and validate the predictive accuracy of risk instruments with large samples and high reoffending base rates using relatively short coverage periods (e.g., 7-12 months). Of course, research attempting to examine the amount of time an offender remained offense free and, hence, no longer represented a danger to the community would continue to require extensive follow-up periods.
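The fixed-window comparison described above amounts to re-dichotomizing the outcome at each follow-up length and recomputing the AUC-ROC. The sketch below illustrates the idea under stated assumptions: the function names and data are hypothetical, every offender is assumed to have been followed at least as long as the longest window, and the AUC-ROC is computed by simple pairwise comparison rather than by the statistical software used in this study.

```python
def auc(scores, labels):
    """Pairwise AUC-ROC; tied scores count as half a concordant pair."""
    cases = [s for s, y in zip(scores, labels) if y]
    noncases = [s for s, y in zip(scores, labels) if not y]
    if not cases or not noncases:
        raise ValueError("window contains only one outcome class")
    wins = sum(1.0 if c > n else 0.5 if c == n else 0.0
               for c in cases for n in noncases)
    return wins / (len(cases) * len(noncases))


def auc_by_window(scores, months_to_rearrest, windows):
    """AUC-ROC for each fixed follow-up window, given in months.

    months_to_rearrest holds None for offenders who were never rearrested.
    """
    results = {}
    for window in windows:
        # A case counts as a failure only if rearrested within the window.
        labels = [1 if (m is not None and m <= window) else 0
                  for m in months_to_rearrest]
        results[window] = auc(scores, labels)
    return results


# Hypothetical risk scores and months to first rearrest.
scores = [2, 4, 5, 7, 9, 11, 12, 14]
months = [None, 30, None, 10, 40, 5, 2, 8]
by_window = auc_by_window(scores, months, windows=[3, 12, 36])
```

Because the risk scores are fixed and only the outcome labels change with the window, this kind of comparison isolates the effect of follow-up length on the validity estimate.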
Table 3. Bivariate AUC-ROC Values Predicting Any Rearrest by Follow-Up Times (N = 27,156)
Note. AUC-ROC = area under the curve–receiver operating characteristic; CI = confidence interval.
Further analyses were undertaken to determine the extent to which hazard rates may change over time, and/or if they change over time by risk category. Table 4 presented a life table, for all cases together, and for the cases disaggregated by risk level according to the PCRA (Low, Low/Moderate, Moderate, and High). Furthermore, the results were presented by follow-up period, including those cases where the follow-up period was less than a year (0) up to 10 years. The hazard rates were also expressed graphically by year and risk (see Figure 1).
Table 4. Life Table for Any Rearrest by Risk
Note. CFR = cumulative failure rate; HR = hazard rate, presented as a proportion.
According to Table 4 (and as expected), the CFRs increased for each year that passed. Likewise, rates of failure appeared to be lowest overall for “Low Risk” offenders, and increased with each increase in risk level. Furthermore, the hazard rates (i.e., of the offenders who had not yet failed, the percentage who failed during the time period under consideration) declined over time in a fairly consistent pattern, with some anomalies. This was especially the case for low risk offenders, for whom the hazard rates were relatively stable across the years. For example, among low risk cases, the hazard rate began at 2.20% (Year 0), increased to 4.29% (Year 1), decreased to 4.22% (Year 2), and continued to decrease for the duration (to Year 10). Although the hazard rates overall were expectedly higher for low/moderate risk cases (compared with low risk cases), they followed the same pattern as the low risk cases. Although moderate and high risk offenders had more variation in their hazard rates than the lower risk offenders, they manifested similar patterns of declining hazard rates over the study period. Moderate risk cases, for example, approximated the same pattern (with overall higher failure rates still, relative to low and low/moderate cases), but had an increase between Year 7 and Year 8 before continuing to decrease. The hazard rates for high risk cases appeared to display a bounce of sorts, increasing from Year 0 to 1, decreasing from Year 1 to 2, increasing from Year 2 to 3, then decreasing consistently until Year 6, after which the rate increased slightly (Year 7), then increased markedly (Year 8) before decreasing again. Although the number of high risk cases involved in the analysis was low at that point in time, the high risk cases appeared at least somewhat similar to moderate cases regarding the overall trajectory of their hazard rates over time.
Likewise (and perhaps expectedly), the moderate and high risk cases together were markedly different overall from the low and low/moderate cases, which adds further support to the PCRA’s utility in meaningfully distinguishing between risk levels.
Another crucial aspect of this examination of hazard rates over time involved findings in support of offender redemption. Specifically, although the hazard rates differed substantially across the risk categories at the onset of the follow-up period, they became similar, though never quite converging, once the follow-up time reached 7 years. For example, high risk offenders had hazard rates (.54) that were nearly 4 times higher than those of their low/moderate risk counterparts (.15) within the first follow-up year. By the seventh follow-up year, the hazard rate for high risk offenders who remained offense free during this time period (.06) was essentially the same as that for low/moderate risk offenders with no reoffending activity during this period (.07), indicating that the time to redemption was approximately 7 years. Moderate risk offenders manifested a similar pattern, with their hazard rates declining from .34 in the first follow-up year to .08 by the seventh follow-up year. Although the hazard rates increased slightly for moderate and high risk offenders by the eighth follow-up year, they then declined to rates close to those of low risk offenders. In sum, the pattern of converging hazard rates across the PCRA risk categories provided evidence supporting the concept of redemption in the recidivism literature.
Discussion
Actuarial risk assessments are generally used to create groups or classifications that are associated with a specific probability of an event occurring. The predicted event is typically some form of “failure” such as revocation from supervision, new arrest, new conviction, or commitment to a secure facility of some sort, within what should be a standardized fixed period of time. The creation or allocation of an adequate fixed follow-up time (e.g., a 2-year follow-up) presents distinct challenges for research aimed at developing actuarial risk-prediction scales and/or testing their predictive validity. Conventional wisdom would dictate that when generating recidivism estimates, fixed follow-up periods are superior to variable follow-up approaches, that longer follow-up periods are better than shorter ones, and that the longest follow-up periods are indeed “the best” regardless of what outcome is being predicted (Jones, 1996). This conventional wisdom rests on the notion that time needs to pass (or be allocated/accounted for retrospectively) to “allow” the outcome to be revealed.
The results presented above demonstrated that the coverage period utilized in risk-prediction research might not need to be standardized through the use of fixed follow-up periods (e.g., 2-year fixed follow-up time). For example, in Table 2, the randomized follow-up period revealed effective results in terms of distinguishing between successful and failing cases. These findings likewise hold implications for future research, particularly in instances where the development of a retrospective follow-up period is not possible. We are in no way advocating the abandonment of a standardized follow-up period whenever one is feasible. However, in instances where potentially beneficial data are otherwise unavailable, informative results for the generation of recidivism estimates may still be gleaned with variable follow-up times within the same sample.
When it comes to which procedure is best when testing the predictive validity of an actuarially derived scale, the logistic regression models with the observation of AUC-ROC produced results that approximated those of the Cox regression models with the observation of Harrell’s C. It is important to keep in mind that base rates of failure will change over time, so specific issues related to instrument calibration will remain and are not addressed in the current article. However, as the question of predictive validity (particularly with newly developed scales) is typically of most concern, logistic regression-derived AUC-ROC values based on a 7-month (or longer) follow-up period may suffice in giving researchers and practitioners the information they need (especially when working with large samples containing adequate base rates of failure for outcomes that can be reliably measured, such as general rearrest rates).
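To make the comparison concrete, the sketch below computes both statistics from their pairwise definitions. AUC-ROC ignores timing and asks how often a failure outscores a success. Harrell’s C uses time to failure and asks how often, among usable pairs, the case that failed earlier has the higher score. The function names and toy data are hypothetical, and the sketch applies the statistics directly to a risk score rather than to fitted model predictions, so it is an illustration under those assumptions rather than the procedure used in this study.

```python
def auc_roc(scores, failed):
    """Probability that a random failure outscores a random success (ties count 0.5)."""
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def harrell_c(scores, times, failed):
    """Share of usable pairs in which the earlier failure has the higher score."""
    concordant = 0.0
    usable = 0
    n = len(scores)
    for i in range(n):
        for j in range(n):
            # A pair is usable when case i fails before case j's observed time.
            if failed[i] and times[i] < times[j]:
                usable += 1
                if scores[i] > scores[j]:
                    concordant += 1.0
                elif scores[i] == scores[j]:
                    concordant += 0.5
    return concordant / usable


# Hypothetical toy data: risk score, years observed, rearrested or not.
scores = [9, 7, 6, 4, 3, 2]
times = [0.5, 1.0, 3.0, 2.0, 3.0, 3.0]
failed = [True, True, False, True, False, False]

auc = auc_roc(scores, failed)
c_index = harrell_c(scores, times, failed)
print(round(auc, 3), round(c_index, 3))
```

Because both statistics are built from the same kind of pairwise ranking, it is not surprising that they can land close together when the score orders cases similarly whether or not timing is considered.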
Importantly, the analyses presented above may also serve to challenge the idea that a long follow-up period is really necessary when working with actuarial prediction scales and models for high base rate events (e.g., predicting general recidivism). First, as presented in Table 3, we essentially asked the question “how low could we go” while still observing reliable results when it came to a scale differentiating between successes and failures. As noted above, according to our analyses, once the follow-up period reached 7 months, we had observed a level of predictive power that was not enhanced much at all by adding several more months and even years. Although one could argue that perhaps the power of the scale itself simply “maxed out,” the AUC-ROC values and their confidence intervals indicate an effectively predictive scale, relative to most if not all other actuarial risk-prediction scales. In short, while it is feasible to entertain the possibility that a different risk-prediction scale (including one that perhaps is not yet developed) might show increased predictive power with a follow-up period that exceeds 7 months, we have essentially demonstrated that, at least in this sample and with this base rate of recidivism, the PCRA was effective with a very short follow-up time. These results have implications for both research (testing) and development. With the right conditions and data availability, scholars may be able to produce and/or test different risk-prediction scales and models with greatly increased frequency. Increasing the rate at which scales and models can be developed, validated, implemented, and revalidated has implications for supervision policies and practices (e.g., risk-related decisions such as the level of supervision, case plan development, the measurement of progress via programming over time).
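One plausible way to probe how short a follow-up window can be is to recompute AUC-ROC at a series of fixed windows and watch where the estimates level off. The sketch below assumes hypothetical inputs (a risk score, months observed, and a rearrest flag) and one particular handling rule: a case is kept for a given window only if it failed within the window or was observed offense free for the whole window. The function names, the toy data, and that rule are illustrative assumptions and are not taken from the study.

```python
def auc_roc(scores, failed):
    """Probability that a random failure outscores a random success (ties count 0.5)."""
    pos = [s for s, f in zip(scores, failed) if f]
    neg = [s for s, f in zip(scores, failed) if not f]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_window(scores, months, failed, windows):
    """AUC-ROC recomputed for each fixed follow-up window (in months)."""
    results = {}
    for w in windows:
        kept_scores, kept_outcomes = [], []
        for s, m, f in zip(scores, months, failed):
            failed_in_window = f and m <= w
            # Keep cases that failed inside the window or were observed for
            # the full window; drop cases censored before the window ended.
            if failed_in_window or m >= w:
                kept_scores.append(s)
                kept_outcomes.append(failed_in_window)
        results[w] = auc_roc(kept_scores, kept_outcomes)
    return results


# Hypothetical toy data (illustrative only).
scores = [9, 3, 7, 6, 5, 8, 4, 2, 1]
months = [2, 5, 9, 30, 14, 36, 36, 36, 10]
failed = [True, True, True, False, True, False, False, False, False]

auc_table = auc_by_window(scores, months, failed, windows=[3, 7, 12, 24])
for window, value in auc_table.items():
    print(window, round(value, 3))
```

With a toy sample this small the values jump around from window to window; the stability reported above depends on a large sample with a high base rate of failure.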
In addition, the results of the test of nonproportionality in the Cox regression models yield three important findings. First, the significant interaction terms between the PCRA and time indicated that the relationship between the PCRA and recidivism varied over time. This would certainly call into question the notion that risk was best estimated using instruments that were completely static in nature. Second, and related to the first point, the results of a varying relationship between the PCRA and recidivism dependent on time indicated that a correctional system relying on the results of risk assessment might need to reassess offenders at regular intervals. Third, the results of the test and solution for nonproportionality in the Cox regression models might suggest, as others have (see Harris & Rice, 2007), that crime free time or the concept of offender redemption needs to be accounted for in scoring algorithms for risk assessment scales.
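As a generic illustration of what a score-by-time interaction implies, a Cox model with a time-varying effect can be written as follows. Here x stands for the PCRA score and g(t) for some function of follow-up time (for example t or log t); the symbols and the choice of g(t) are assumptions made for illustration, since the exact specification is not restated in this section.

```latex
h(t \mid x) = h_0(t)\,\exp\bigl(\beta_1 x + \beta_2\, x\, g(t)\bigr),
\qquad
\mathrm{HR}(t) = \exp\bigl(\beta_1 + \beta_2\, g(t)\bigr)
```

Under proportional hazards, \beta_2 = 0 and the hazard ratio per one-point increase in the score is constant over time. A nonzero \beta_2 means the ratio changes with time at risk; a negative value would shrink it as offense-free time accumulates, which would be in line with the converging hazard rates described earlier.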
Finally, this study provided crucial support for the contention that offenders who remain crime free over time become similar in terms of their likelihood of recidivism regardless of their initial risk levels. To reiterate, redemption studies show that the likelihood of recidivism between offenders and nonoffenders begins to converge within a period ranging from 6 to 10 years (Blumstein & Nakamura, 2009; Kurlychek et al., 2007; Soothill & Francis, 2009). In other words, if an offender remains offense free for several years, their probability of reoffending will begin to approximate that of nonoffenders (Bushway et al., 2011). The current study showed the hazard rates for offenders in different risk groups converged after a period of about 7 years. Although it is important to note that these hazard rates never exactly converged and that the high and moderate risk offenders saw slight increases in their hazard rates during the eighth year of follow-up, this study provided support for the contention that even high risk offenders who abstain from criminal activity over a period of about seven years manifest recidivism levels similar to their lower risk counterparts.
These findings have important implications for risk assessment development and implementation. Risk scoring algorithms might take into account the concept of redemption by adjusting an offender’s risk level downward if they remain recidivism free in the community for a specific time period. Offenders initially designated high risk, for example, could be moved downward on the risk scale to a substantially lower risk level if they evidenced no reoffending behavior within a certain time frame. Although risk instruments such as the PCRA allow for the reclassification of offenders at subsequent assessments, it is often difficult to reclassify high risk offenders to the lowest risk category because of the importance of criminal history in the scoring algorithm (Cohen et al., 2016). Risk instruments taking into account the redemption findings could allow even the highest risk offenders to be moved into a relatively low risk category or to be considered for termination from supervision.
The current research utilized a large dataset which included cases that had been followed for a long time (10 years at a maximum). Regardless of the large sample size, however, it is possible that the high risk cases were underrepresented because this sample was of federal probationers, though the high risk cases did evidence the highest failure rates. Future research should include additional analyses utilizing more high risk cases to support the current results.
Although the multivariate models controlled for two important demographic factors (specifically sex and race), it is possible that other unmeasured factors contribute to the likelihood of failure (new arrest). Any number of other unmeasured responsivity and/or criminogenic need factors may influence the likelihood of failure. Future research should make efforts to take these additional factors into account, provided the data are available to do so.
We found that the length of follow-up time that may be necessary to reveal adequate predictive validity may be as short as 7 months. These findings, however, should be tempered in part because they were based on just one actuarial risk scale. It is possible that despite the adequate results observed via the current research, other risk assessments (developed or not) may reveal increasing predictive validity with longer follow-up periods. We also had a reliable source of recidivism information. Some recidivism sources may take longer to note recidivism events. Additional research testing varying amounts of follow-up time needs to be conducted using other risk assessments to determine whether the 7-month follow-up period or longer time frames are needed when assessing an instrument’s predictive validity.
Moreover, it is important to note that the stability of the AUC-ROC metrics shown within the 7-month time frame can partially be attributed to the large sample size and high recidivism base rates used for this study. Researchers working with smaller sample sizes and/or lower recidivism base rates might lack the capacity to replicate this study’s results. For example, we analyzed the odds ratios, with accompanying confidence intervals, generated from logistic regression models examining the relationship between PCRA risk scores and rearrest outcomes using study populations with decreasing sample sizes from about 27,000 offenders to 40 offenders (see Appendix Table A1). Although the odds ratios produced from the logistic regression models were fairly stable across the different sample sizes, the confidence intervals widened considerably once the study populations decreased to fewer than 200 offenders. For example, the confidence interval using the full sample of nearly 27,000 offenders ranged from 1.42 to 1.46. However, the confidence interval when the study population was limited to 100 offenders ranged from 1.15 to 1.62 and became even wider for smaller populations. These results indicate that the stability in the odds ratios, Harrell’s C, and AUC-ROC values produced by these models will probably not hold for researchers having to work with smaller study populations or to model recidivism outcomes involving low base rates, such as sexual recidivism. For these researchers, a 7-month follow-up time frame may not provide sufficient time to construct, validate, and test a risk assessment instrument.
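The widening of the intervals can be put on a common scale by backing out the standard error of the log odds ratio that each reported interval implies. The sketch below assumes the intervals are 95% Wald intervals (the confidence level and interval type are assumptions, not stated in this section) and uses only the rounded endpoints quoted above, so the results are approximate. The function name `se_from_ci` is hypothetical.

```python
import math


def se_from_ci(lower, upper, z=1.96):
    """Standard error of the log odds ratio implied by a Wald confidence interval.

    Assumes the interval was built as exp(beta +/- z * SE); z = 1.96
    corresponds to an assumed 95% confidence level.
    """
    return (math.log(upper) - math.log(lower)) / (2 * z)


# Rounded interval endpoints quoted in the text.
se_full = se_from_ci(1.42, 1.46)  # full sample of nearly 27,000 offenders
se_100 = se_from_ci(1.15, 1.62)   # study population limited to 100 offenders

# How much larger the implied standard error is in the small sample,
# next to the rough square-root-of-n scaling one would expect.
observed_ratio = se_100 / se_full
sqrt_n_ratio = math.sqrt(27000 / 100)

print(round(se_full, 4), round(se_100, 4))
print(round(observed_ratio, 1), round(sqrt_n_ratio, 1))
```

Because the full-sample endpoints are rounded to two decimals, the implied standard error there is only approximate, so the two ratios should be compared in order of magnitude rather than exactly.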
Footnotes
Appendix
An anonymous reviewer indicated that another interesting aspect of prediction research would be the effect of the number of recidivists on the odds ratios and referenced Vergouwe, Steyerberg, Eijkemans, and Habbema (2005), who, after conducting Monte Carlo simulations, recommended that there be a minimum of 100 events and 100 nonevents to detect model differences in external validations. To test the effect of the number of events and nonevents on the odds ratios, a series of logistic regression models were run with decreasing sample size. The results of those analyses are contained in Appendix Table A1 and indicate fairly stable odds ratios regardless of sample size. However, this does not mean that the number of events and nonevents is irrelevant to validation studies, only that no model differences were detected in these data regardless of the number of events and nonevents. If model differences were present, Vergouwe et al. (2005) would argue that at least 100 events and 100 nonevents are needed to detect them.
