Better Practices in the Development and Validation of Recidivism Risk Assessments: The Minnesota Sex Offender Screening Tool

Abstract

This study examines the development and validation of the Minnesota Sex Offender Screening Tool–4 (MnSOST-4) on a dataset consisting of 5,745 sex offenders released from Minnesota prisons between 2003 and 2012. Bootstrap resampling was used to select predictors, and k-fold and split-sample methods were used to internally validate the MnSOST-4. Using sex offense reconviction within 4 years of release from prison as the failure criterion, the data showed that 130 (2.3%) offenders in the overall sample were recidivists. Multiple classification methods and performance metrics were used to develop the MnSOST-4 and evaluate its predictive performance on the test set. The results from the regularized logistic regression algorithm showed that the MnSOST-4 performed well in predicting sexual recidivism in the test set, achieving an area under the curve (AUC) of 0.835. Additional analyses on the test set revealed that the MnSOST-4 outperformed the Minnesota Sex Offender Screening Tool–3 (MnSOST-3), Minnesota Sex Offender Screening Tool–Revised (MnSOST-R), and Static-99 in predicting sexual reoffending.

Keywords

risk assessment, sexual recidivism, sex offender, MnSOST-4

Introduction

As the size of the “what works” literature has continued to grow in the wake of the 1970s claim that “nothing works,” the corrections field has increasingly embraced the idea of using policies and practices that have been found to reduce the likelihood of recidivism. Drawing on lessons learned from this body of research, the principles of effective correctional intervention provide a guide for the implementation of evidence-based practices (EBPs). Included among the eight principles are (a) assess risk and needs, (b) enhance intrinsic motivation, (c) target interventions, (d) use cognitive-behavioral treatment methods, (e) increase positive reinforcement, (f) engage ongoing support in communities, (g) measure relevant processes and practices, and (h) provide measurement feedback (Crime and Justice Institute at Community Resources for Justice, 2009). Not only is the use of validated risk and needs assessments central to EBPs, but the effective delivery of programming is also contingent, at least to some extent, on assessments that perform well in predicting recidivism.

Research has consistently demonstrated that actuarial assessments outperform professional judgment in making accurate classification decisions (Dawes, Faust, & Meehl, 1989; Hanson & Morton-Bourgon, 2009). Still, the evidence generally suggests that risk assessment tools have been, at best, only modestly successful in distinguishing recidivists from nonrecidivists (Singh & Fazel, 2010). While there are numerous examples in which risk assessment instruments have performed well in predicting recidivism (Hamilton et al., 2016; Olver, Stockdale, & Wormith, 2014; P. Smith, Cullen, & Latessa, 2009), there are also instances in which predictive performance has been relatively poor (Boccaccini, Murrie, Caperton, & Hawes, 2009; Duwe & Rocque, 2016; Giguere & Lussier, 2016). This uneven but relatively pedestrian level of overall performance applies not only to “broad-band” assessments that predict general recidivism but also to “narrow-band” instruments that assess risk of certain types of offending such as sexual recidivism. For example, in their evaluation of sexual recidivism assessment tools, Knight and Thornton (2007) reported that these tools had, on the whole, weak predictive validity (an average area under the curve [AUC] value of 0.64), although there were a few that achieved adequate predictive discrimination.

The Minnesota Sex Offender Screening Tool (MnSOST) series of instruments has been one of the most widely used sexual recidivism assessment tools (Langton, Barbaree, Harkins, Peacock, & Arenovich, 2008). Shortly after the MnSOST made its debut in 1996, Epperson and colleagues (2003) began working on a revised version of the instrument. In 1999, they introduced the Minnesota Sex Offender Screening Tool–Revised (MnSOST-R), an instrument that has been used to assess sexual recidivism risk for sex offenders both within and outside Minnesota. More recently, Duwe and Freske (2012) significantly revised the MnSOST-R with their development of the Minnesota Sex Offender Screening Tool–3 (MnSOST-3), which the Minnesota Department of Corrections (MnDOC) began using in January 2012.

Given that the MnSOST-3 estimates the probability of sexual recidivism over a 4-year follow-up period and sex offenders scored on this instrument first began getting released from prison in 2012, it is now possible to evaluate how the tool has performed on a cohort of sex offenders released from Minnesota prisons. Perhaps more importantly, these MnSOST-3 assessment data also provide a baseline that can be used to help determine whether a better performing sexual recidivism assessment can be developed. The present study details the efforts to update the MnSOST-3 by examining the development and validation of the Minnesota Sex Offender Screening Tool–4 (MnSOST-4).

In the following section, this study describes the use of risk assessment instruments, especially the MnSOST-3, for sex offenders in Minnesota prisons. Next, just as the principles of effective correctional intervention have been drawn from the “what works” literature, this study reviews existing research so as to identify proven or promising practices in the development and validation of risk assessment instruments. As discussed in the Data and Method section, these principles are then applied to the development of the MnSOST-4. Following a presentation of the results, which include predictive performance comparisons between the MnSOST-4 and three other assessments (MnSOST-3, MnSOST-R, and Static-99), this study concludes by discussing the implications of the findings for recidivism risk assessment.

Risk Assessment for Minnesota Sex Offenders

Since their debut more than 20 years ago, the MnSOST series of instruments has been used to prioritize sex offenders for prison-based programming, especially sex offender treatment. In addition to helping determine the appropriate levels of postrelease community supervision, the MnSOST tools have been used to screen sex offenders for involuntary civil commitment. Perhaps most notably, the MnSOST assessments have been used to help assign risk levels for sex offenders leaving prison, which are then used to determine the extent to which the community will be notified. Over the last 20 years, approximately 55% of sex offenders released from Minnesota prisons have received a Level 1 assignment (lower risk), 30% a Level 2 assignment (moderate risk), and 15% a Level 3 assignment (higher risk).

Sex offenders are assessed on the MnSOST prior to receiving a risk-level assignment from an End of Confinement Review Committee (ECRC), which is composed of the prison warden or treatment facility head where the offender is confined, a law enforcement officer, a sex offender treatment professional, a caseworker experienced in supervising sex offenders, and a victim services professional. While the MnSOST score provides a presumptive risk level, ECRCs may override it as they consider the MnSOST score as well as additional factors that presumably increase or decrease the risk of reoffense (e.g., an offender’s stated intention to reoffend following release, or a debilitating illness or physical condition that mitigates the risk of reoffense).

When the MnDOC transitioned from the MnSOST-R to the MnSOST-3 in January 2012, the presumptive risk-level cut points changed due to the marked differences between the two instruments. The sample used to develop the MnSOST-3 consisted of 2,535 male sex offenders, most of whom had been released from Minnesota prisons between 2003 and 2006. In contrast to the first two versions of the MnSOST, which relied on Burgess methodology as the classification method, Duwe and Freske (2012) used multiple logistic regression to create the MnSOST-3. Relying on sex offense reconviction within 4 years as the outcome measure, they used bootstrap resampling for item selection and model validation. The MnSOST-3 had an AUC value of 0.796, which suggests that the instrument had good predictive discrimination.

The MnSOST-3 contained 11 predictors—nine main effects and two interaction effects. Of the nine main effects, only three were items derived from the MnSOST-R (public place, completion of chemical dependency and sex offender treatment, and age at release). The other items are as follows: male victims, predatory offenses, felony offenses, violations of orders for protection (VOFP)/stalking/harassment, disorderly conduct convictions (last 3 years), and unsupervised release.

The MnSOST-3, which is scored in a Microsoft Excel application, provides several measures of sexual recidivism risk. The MnSOST-3 value an offender receives represents his predicted probability of sexual recidivism within 4 years, which varies from a low of 0% to a high of 100%. To provide a range in which the true risk of sexual recidivism likely falls, 95% confidence intervals (CIs) were calculated around MnSOST-3 estimates. While the MnSOST-3 value and the accompanying 95% CIs offer measures of absolute sexual recidivism risk, percentile ranking was also included to provide a measure of relative risk (Duwe & Freske, 2012).

Shortly after implementing the MnSOST-3 in January 2012, the MnDOC began using the MnSOST-3.1 the following month. The MnSOST-3.1 differs from the MnSOST-3, in that it excludes the two interaction effects. Although the AUC for the MnSOST-3.1 is the same (0.796) as it is for the MnSOST-3, use of the 3.1 resulted in a slight change to the cut points for presumptive risk levels. Under the MnSOST-3, the cut scores were 5.50% and higher for Presumptive Level 3, between 2.30% and 5.49% for Presumptive Level 2, and below 2.30% for Presumptive Level 1. Under the 3.1, the cut scores were 5.80% and higher for Presumptive Level 3, and between 2.30% and 5.79% for Presumptive Level 2 (Duwe & Freske, 2016).
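The cut scores described above can be expressed as a simple lookup. The following Python sketch is illustrative only: the function name and the `version` argument are hypothetical, and the Level 1 boundary for the MnSOST-3.1 (below 2.30%) is assumed to carry over from the MnSOST-3 because the text reports only the Level 2 and Level 3 ranges for that version.

```python
def presumptive_level(prob_pct: float, version: str = "3.1") -> int:
    """Map a predicted 4-year sexual recidivism probability (in percent)
    to a presumptive risk level using the cut scores reported in the text."""
    # Level 3 threshold differs by version; Level 2 threshold is 2.30% for both.
    level3_cut = {"3": 5.50, "3.1": 5.80}[version]
    level2_cut = 2.30  # assumed to apply to Level 1/2 boundary in both versions
    if prob_pct >= level3_cut:
        return 3
    if prob_pct >= level2_cut:
        return 2
    return 1


# A score of 5.6% falls in Level 3 under the MnSOST-3 but Level 2 under the 3.1.
print(presumptive_level(5.6, version="3"), presumptive_level(5.6, version="3.1"))
```

The example shows why the change in cut points matters in practice: the same predicted probability can yield a different presumptive level depending on the version in use.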

More recently, in 2013 the MnDOC began using the MnSOST-3.1.2. The main difference between the MnSOST-3.1.2 and its predecessor, the MnSOST-3.1, is that the probability distribution used to determine an offender’s percentile rank (i.e., his risk relative to other Minnesota sex offenders) is based on 1,517 cases scored on the MnSOST-3.1 by the MnDOC during 2012. Previously, the probability distribution used to determine percentile rank for both the MnSOST-3 and MnSOST-3.1 had been based on the 2,535 cases in the development sample.

The Minnesota Screening Tool Assessing Recidivism Risk (MnSTARR)

The MnDOC’s transition to the MnSOST-3.1.2 in 2013 coincided with the debut of MnSTARR, a “multiple-band” instrument that assesses risk separately for male and female prisoners for five different types of recidivism—nonviolent, felony, nonsexual violent, first-time sexual offending, and repeat sexual offending—over a 4-year follow-up period (Duwe, 2014a). Because a sex offender’s likelihood of sexual recidivism is derived directly from his MnSOST-3.1.2 score, the MnSOST-3.1.2 is fully integrated within the broader MnSTARR assessment. Given that the MnSTARR is strictly a risk assessment tool, its noncriminal history items, which are mostly dynamic factors, do not identify which needs areas should be targeted for programming. Yet, because the noncriminal history/dynamic items measure observable behavior in prison such as misconduct or completion of programming, the MnSTARR indicates which needs areas have improved or grown worse while an offender is incarcerated.

In November 2016, the MnDOC transitioned from the MnSTARR—an assessment manually scored by correctional staff—to the MnSTARR 2.0—a fully automated assessment (Duwe & Rocque, 2017). The MnSTARR 2.0 extracts data from the state’s criminal history repository to populate the criminal history items on the instrument, while data from the Correctional Operations Management System (COMS)—the MnDOC’s centralized database—are pulled to populate items pertaining to demographic characteristics (e.g., gender, age, and marital status), institutional behavior (e.g., discipline convictions and gang affiliation), and participation in programming (e.g., earning a postsecondary degree in prison, completing chemical dependency treatment, and completing cognitive-behavioral therapy). The only MnSTARR 2.0 items that are not autopopulated are those for the MnSOST-3.1.2, which continue to be scored manually by correctional staff. Still, after a MnSOST-3.1.2 assessment has been completed on an offender, the MnSOST-3.1.2 score is extracted from COMS and uploaded within the MnSTARR 2.0 assessment.

In addition to using an automated scoring process, there are several other notable differences between the original MnSTARR and the MnSTARR 2.0: First, the dataset was expanded from 2003 to 2006 releases to include all offenders released between 2003 and 2010. The training set for the MnSTARR 2.0 included offenders released during the 2003-2008 period, whereas the test set comprised releases from 2009 to 2010. Second, due to the automated scoring process, the MnSTARR 2.0 includes nearly 50 items (for both male and female offenders), which is more than double the number of items that were on the MnSTARR. Finally, whereas logistic regression with interaction terms was used to develop the original MnSTARR, regularized logistic regression (RLR) was the classification method used to create the MnSTARR 2.0.

Promising and Proven Practices in Risk Assessment

The promising and proven practices in the development of risk assessment instruments cover the following seven areas: (a) scoring method, (b) classification method, (c) item selection and measurement, (d) internal validation method, (e) evaluating predictive performance, (f) bias testing, and (g) feedback loop. The practices within each of these areas are not mutually exclusive but are, to a large extent, interdependent. For example, the choice of which scoring method to use can affect the classification method, which can, in turn, affect how items are selected. Table 1 displays prior practices used in the development of the MnSOST and MnSTARR assessments.¹

Table 1.

MnSOST and MnSTARR Risk Assessment Practices and Design Features.

	MnSOST			MnSTARR
	MnSOST-R	MnSOST-3	MnSOST-4	MnSTARR	MnSTARR 2.0
Risk assessment practices
Scoring method	Manual	Manual	Manual	Mostly manual	Automated
Classification method	Burgess	Logistic regression	Regularized logistic regression	Logistic regression	Regularized logistic regression
Item selection	Hierarchical log. regression	Stepwise/bootstrap	Stepwise/bootstrap	Stepwise/bootstrap	Stepwise/bootstrap
Internal validation	Split sample	Bootstrap resampling	Split sample	Bootstrap resampling	Split sample
Predictive performance	AUC	AUC	Multiple metrics	AUC	Multiple metrics
Bias testing	No	No	Yes	No	Yes
Feedback loop	None	Other assessment comparisons	Other assessment comparisons	Other assessment comparisons	Other assessment comparisons
Additional design features
Release window	1990-1992	1990-1992 and 2003-2006	2003-2012	2003-2006	2006-2011
Recidivism assessed	Sexual recidivism	Sexual recidivism	Sexual recidivism	Felony, nonviolent, violent, first-time and repeat sexual offending (MnSOST-3)	Felony, nonviolent, violent, first-time and repeat sexual offending (MnSOST-3)
Recidivism follow-up period (years)	6	4	4	4	3 and 4
Risk factors	Static/dynamic	Static/dynamic	Static/dynamic	Static/dynamic	Static/dynamic
Overall sample size	476	2,535	5,745	12,475	39,360

Note. MnSOST = Minnesota Sex Offender Screening Tool; MnSTARR = The Minnesota Screening Tool Assessing Recidivism Risk; MnSOST-3 = Minnesota Sex Offender Screening Tool–3; MnSOST-4 = Minnesota Sex Offender Screening Tool–4; AUC = area under the curve.

Scoring method

The scoring method refers to the process by which the items on a risk assessment instrument are populated, with the total score being used to determine an individual’s risk level (e.g., low, moderate, high). Items on an instrument can be scored manually, usually by correctional staff, or they can be populated through an automated process. Despite the extensive training and quality assurance that often accompanies the use of manually scored tools, differences among staff in how they score manual assessments are inevitable due to factors such as the subjectivity of the items, the extent to which staff have been trained, staff workloads, the amount of time it takes to complete an assessment, and data entry errors.

Interrater reliability (IRR) examines the consistency between raters in scoring an instrument, and the differences among staff, or raters, in how they score manual assessments constitute what is known as interrater disagreement. Although it has been assumed that assessments must have reliability to have predictive validity (Austin, 2006), relatively few risk assessment studies have measured IRR, and only one has examined the impact that interrater disagreement has on predictive performance. In their study comparing manual and automated scoring methods, Duwe and Rocque (2017) found that interrater disagreement can affect the performance of an instrument in predicting recidivism. As interrater disagreement increased in the manually scored assessments, predictive performance decreased. Conversely, the automated scoring process achieved better predictive validity because it removed interrater disagreement. By eliminating the time it takes for correctional staff to score manual assessments, an automated scoring method is also more efficient and cost-effective. Duwe and Rocque estimated, for example, that every dollar spent on automating the MnSTARR 2.0 will yield a return of nearly US$22 within 5 years, totaling US$2.8 million.

The vast majority of recidivism risk assessment instruments, including those designed to predict sexual recidivism, are scored by hand. Although automated scoring is a promising practice, it is still a relatively new idea within corrections. Whether automated scoring can, or should, be used more widely remains to be seen. For example, there may be circumstances in which manual scoring is preferable or an automated process is simply not feasible.

But if an instrument must be scored manually, then it is necessary to demonstrate that it can be scored consistently. To achieve adequate reliability, Duwe and Rocque (2017) argued that using objective items (as opposed to more subjective items that require more interpretation) is critical. For example, in assessing a domain such as antisocial peers, it may be better to measure whether an individual has received any visits in prison than asking whether she or he is a “social isolate” in a one-on-one interview. Moreover, to demonstrate reliability, manually scored instruments should include an IRR assessment.

Classification method

The classification method refers to the process by which the values for items on an instrument are calculated into a risk score. The main types of classification methods that have been used to develop recidivism risk assessment tools are Burgess (i.e., a simple summation of item values), statistical techniques (e.g., logistic regression), and machine learning algorithms (e.g., Random Forests [RF], artificial neural networks, support vector machines, etc.). Except for the MnSOST-3 (Duwe & Freske, 2012) and the Vermont Assessment of Sex Offender Risk–2 (VASOR-2; McGrath, Lasher, Cumming, Langton, & Hoke, 2014), which both used multiple logistic regression, sexual recidivism assessment instruments have relied on Burgess methodology.
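To make the distinction concrete, the sketch below contrasts a Burgess-style score (an unweighted sum of item values) with a logistic regression score (a weighted sum passed through the logistic function to give a probability). The item values, weights, and intercept are hypothetical and are not taken from any MnSOST instrument.

```python
import math


def burgess_score(items):
    """Burgess method: simple summation of item values."""
    return sum(items)


def logistic_probability(items, weights, intercept):
    """Logistic regression: weighted sum converted to a predicted probability."""
    z = intercept + sum(w * x for w, x in zip(weights, items))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical offender scored on three binary items.
items = [1, 0, 1]
weights = [0.8, 0.3, 1.2]   # hypothetical coefficients
intercept = -4.0            # hypothetical intercept

print(burgess_score(items))                          # total points
print(logistic_probability(items, weights, intercept))  # predicted probability
```

Under the Burgess approach every endorsed item contributes equally, whereas the regression weights allow some items to count for more than others.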

A number of recent studies have examined the impact of the classification method on performance in predicting recidivism. Much of this research has focused on comparing statistical methods such as logistic regression with machine learning algorithms. The results from these studies have been mixed. Some have found that machine learning algorithms outperformed statistical methods (Berk & Bleich, 2013; Breitenbach, Dieterich, Brennan, & Fan, 2009; Duwe & Kim, 2015; Hess & Turner, 2013), whereas others have found little or no difference (Hamilton, Neuilly, Lee, & Barnoski, 2015; Liu, Yang, Ramsey, Li, & Cold, 2011; Tollenaar & van der Heijden, 2013). In the only study that has directly compared predictive performance between Burgess methods and statistical and machine learning algorithms, Duwe and Kim (2016) found that statistical and machine learning methods significantly outperformed scales developed with Burgess methodology.

In addition to predictive performance, classification methods differ in how transparent and interpretable they are. At one end of the spectrum, Burgess methodology is very transparent and interpretable. At the other end, machine learning algorithms are much more opaque, having been described as black boxes or, at best, gray boxes. Linear regression occupies the middle ground between both ends. But the extent to which linear regression is transparent and interpretable depends on how complicated, or dimensional, the model is. Models with interaction effects and nonlinear terms will operate more like machine learning algorithms and will, thus, be less decipherable.

There are a few broad conclusions that can be drawn from research on the classification method: First, the evidence suggests that there is a trade-off between transparency and performance. Although Burgess methodology offers greater simplicity and transparency than statistical and machine learning algorithms, there is a price to pay in the form of reduced predictive performance (Duwe & Kim, 2016). Second, consistent with the “no free lunch” theorem (Caruana & Niculescu-Mizil, 2006; Wolpert, 1996), there is no single best method or algorithm that yields the best performance in every situation. To be sure, the evidence suggests that machine learning algorithms such as RF or boosted trees will, on the whole, achieve the best performance (Caruana & Niculescu-Mizil, 2006). Still, there are instances in which relatively poor classifiers outperform higher performing algorithms (Caruana & Niculescu-Mizil, 2006; Duwe & Kim, 2015).

As a result, recent research has advocated evaluating multiple classification methods in the development of risk assessment instruments (Duwe & Kim, 2016; Ridgeway, 2013). Testing multiple classification methods improves the chances of finding the best predictive model. Moreover, it makes it possible to determine whether a gain in transparency sufficiently offsets a loss of predictive performance, and vice versa (Duwe & Kim, 2016).

Item measurement and selection

Due largely to the popularity of Burgess methodology in developing recidivism risk assessment tools, items have generally been nominal- and ordinal-level measures that have been treated as main effects. One exception to this approach, at least among instruments designed to predict sexual recidivism, is the MnSOST-3, which contains continuous measures (e.g., age at release) as well as interaction effects (Duwe & Freske, 2012). Interaction effects can be challenging to interpret in practice, however, which is why Duwe and Freske (2016) revised the MnSOST-3 by removing the two interaction effects. Rather than modeling interaction effects within a statistical model, this study explores a simpler approach by creating combined measures of multiple predictors. For example, instead of entering multiple prior sex offenses (one prior sex offense = 0; multiple prior sex offenses = 1) and multiple victim age groups (one age group = 0; multiple age groups = 1) as an interaction effect in the model, this study combines these two measures into a single item (multiple prior sex offenses and multiple age groups = 1; any other combination = 0).
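The combined-measure coding in this example can be sketched as follows. The function and argument names are hypothetical. Numerically, the combined item equals the product of the two binary indicators; the difference from a conventional interaction effect is that the item is scored as a single yes/no question rather than as a product term in the model.

```python
def combined_item(multiple_prior_sex_offenses: int,
                  multiple_victim_age_groups: int) -> int:
    """Return 1 only when both binary indicators equal 1; otherwise 0."""
    return int(multiple_prior_sex_offenses == 1 and multiple_victim_age_groups == 1)


# Enumerate all four combinations of the two indicators.
for priors in (0, 1):
    for age_groups in (0, 1):
        print(priors, age_groups, combined_item(priors, age_groups))
```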

Item selection is guided, to a large extent, by theory. Items should be considered for inclusion on an assessment if they are theoretically relevant and if the direction of their impact on the outcome being predicted (e.g., recidivism) is consistent with existing theory. Moreover, prior research suggests that rather than including a large number of items related to both risk and needs, assessments designed to predict recidivism should include only items that enhance predictive performance (Baird, 2009; Barnoski & Aos, 2003; Duwe, 2014a).

Among theoretically relevant predictors, however, the literature has not delivered unequivocal evidence on the best method to use for item selection. Item selection methods include procedures such as the gain ratio, information gain, chi-square, principal components analysis, and stepwise selection. Stepwise routines have been criticized on a number of grounds, especially for producing biased regression coefficients (Tibshirani, 1996) and for capitalizing on chance features of the data (Judd, McClelland, & Ryan, 2008). In an effort to mitigate these concerns, several studies have used bootstrap variable selection procedures in the development of recidivism risk assessment instruments (Duwe & Freske, 2012; Hamilton et al., 2016).

Internal validation method

Internal validity examines the extent to which the tool’s performance can be reproduced on the population underlying the sample. Several internal validation methods have been developed to determine the reproducibility of a prediction model. The split-sample method involves using a portion (e.g., one half or two thirds) of the sample to develop the prediction tool (i.e., the development sample or training set), which is then applied to the remaining portion of the sample to test the internal validity of the model (i.e., the validation sample or test set). As Harrell, Lee, and Mark (1996) pointed out, this approach wastes data. Yet, by having a test set, the split-sample method provides a means for making a direct comparison between the prediction instrument being validated and other existing assessments that have been administered to individuals in the test set.
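A minimal sketch of the split-sample method appears below. It assumes a random two-thirds/one-third partition; this is an illustrative choice rather than a description of how any particular instrument was split (a split by release year, for instance, is also possible). The function name and seed are hypothetical.

```python
import random


def split_sample(n_cases, train_fraction=2 / 3, seed=0):
    """Randomly partition case indices into a training set and a test set."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_train = round(n_cases * train_fraction)
    return indices[:n_train], indices[n_train:]


# Illustration using the overall sample size reported in this study.
train_idx, test_idx = split_sample(5745)
print(len(train_idx), len(test_idx))
```

The model would be fit only on the cases in `train_idx`, and every performance metric would then be computed only on the cases in `test_idx`.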

Although cross-validation, or k-fold validation, is more efficient than the split-sample method because it involves repeated data splitting, research has shown that bootstrap resampling is the most efficient internal validation technique (Steyerberg, Bleeker, Moll, Grobbee, & Moons, 2003; Steyerberg et al., 2001). Bootstrap resampling involves repeatedly drawing samples, with replacement, from the overall sample to generate estimates of error (Efron, 1979). In doing so, it makes full use of the dataset while also providing error estimates that have relatively low variability and minimal bias (Steyerberg et al., 2001). With the exception of the MnSOST-3 (Duwe & Freske, 2012), which used bootstrap resampling, and the VASOR-2 (McGrath et al., 2014), which used k-fold validation, sexual recidivism assessments have preferred the split-sample method.
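The two resampling schemes can be sketched as index generators. The code below is a simplified illustration with hypothetical function names and default settings. It assumes the common convention in which each bootstrap sample is drawn with replacement and has the same size as the original sample.

```python
import random


def k_fold_indices(n_cases, k=5, seed=0):
    """Yield (train, test) index lists; each case is held out exactly once."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal folds
    for held_out in range(k):
        test = folds[held_out]
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, test


def bootstrap_indices(n_cases, n_resamples=200, seed=0):
    """Yield index lists drawn with replacement, each of size n_cases."""
    rng = random.Random(seed)
    for _ in range(n_resamples):
        yield [rng.randrange(n_cases) for _ in range(n_cases)]


folds = list(k_fold_indices(10, k=5))
resamples = list(bootstrap_indices(10, n_resamples=3))
print(len(folds), len(resamples))
```

In k-fold validation every case serves in a test fold once, while in bootstrap resampling a given case may appear several times in one resample and not at all in another.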

Evaluating predictive performance

The sex offender risk assessment literature has relied almost exclusively on the AUC as the metric by which to evaluate predictive performance. While the AUC has its strengths and weaknesses, which are noted below, it measures only one dimension of predictive validity: discrimination. Predictive validity, however, also consists of accuracy and calibration.

Predictive accuracy assesses how well a model makes correct classification decisions. One of the most commonly used metrics is accuracy (ACC), a threshold-based measure that looks at the extent to which an assessment correctly classifies offenders as recidivists or nonrecidivists. For example, if a recidivist had a predicted recidivism probability less than 50%, then this offender would be incorrectly classified as a nonrecidivist (i.e., false negative). If this offender had not recidivated, then she or he would be accurately classified (i.e., true negative). The main limitation with using threshold-based predictive accuracy metrics is that they can be less informative and meaningful for highly imbalanced datasets in which few, if any, of the individuals have predicted probabilities of 50% or higher.
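The limitation of threshold-based accuracy for imbalanced data can be illustrated with the base rate reported in this study (130 recidivists among 5,745 offenders). In the sketch below, the predicted probabilities are hypothetical: every offender is assigned the same low probability, so no one crosses the 50% threshold.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Share of cases classified correctly when probabilities at or above
    the threshold are classified as recidivists."""
    correct = sum((p >= threshold) == bool(y) for y, p in zip(y_true, y_prob))
    return correct / len(y_true)


n_total, n_recidivists = 5745, 130
y_true = [1] * n_recidivists + [0] * (n_total - n_recidivists)
y_prob = [0.02] * n_total   # hypothetical: identical low risk for everyone

acc = accuracy(y_true, y_prob)
print(round(acc, 4))
```

A model that distinguishes no one from anyone else still classifies nearly 98% of cases correctly, which is why ACC alone says little about performance when the base rate is this low.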

The second dimension of predictive validity, discrimination, measures the degree to which an assessment separates the recidivists from the nonrecidivists. The AUC is relatively robust across different recidivism base rates and selection ratios (W. Smith, 1996). With values that range from 0 to 1, the AUC statistic is interpreted as the probability that a randomly selected recidivist has a higher score on a risk assessment instrument than a randomly selected nonrecidivist. Values at either end of the spectrum (0 or 1) reflect perfect prediction, whereas a value of 0.50 indicates that the prediction tool does no better than chance. According to the literature, an AUC between 0.90 and 1.00 is considered excellent, between 0.80 and 0.89 is good, between 0.70 and 0.79 is fair, between 0.60 and 0.69 is poor, and between 0.50 and 0.59 represents a failure to achieve predictive discrimination (Baird et al., 2013; Thornton & Laws, 2009).
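The probabilistic interpretation of the AUC can be computed directly by comparing every recidivist with every nonrecidivist. The sketch below uses a small set of hypothetical scores and counts ties as one half, which is the usual convention; it is meant to illustrate the definition, not to serve as an efficient implementation.

```python
def auc(y_true, scores):
    """Probability that a randomly chosen recidivist has a higher score
    than a randomly chosen nonrecidivist (ties count as one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: two recidivists and three nonrecidivists.
y_true = [1, 1, 0, 0, 0]
scores = [0.30, 0.05, 0.10, 0.04, 0.02]
print(auc(y_true, scores))
```

Because only the ordering of scores matters, multiplying every score by a constant leaves the AUC unchanged, which is consistent with the AUC being a measure of relative rather than absolute risk.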

As Davis and Goadrich (2006) pointed out, the AUC can provide an overly optimistic estimate of predictive discrimination for imbalanced datasets. In addition, as Hand (2009) demonstrated, the AUC uses different misclassification cost distributions for dissimilar classifiers and can provide misleading results if receiver operating characteristic (ROC) curves cross. Given that the AUC evaluates different classifiers using different metrics, Hand (2009) developed a predictive discrimination metric, the H-measure, that uses a common cost distribution for all classifiers, with higher values indicating better performance. The H-measure has seldom been used in existing research on recidivism prediction, although previous studies have reported H values that ranged from 0.07 to 0.40 (Duwe & Rocque, 2017; Hamilton et al., 2015). Another alternative to the AUC is the precision-recall curve (PRC), which uses the precision and recall values to assess predictive discrimination. The PRC has been found to be a better metric than the AUC for highly imbalanced datasets (Davis & Goadrich, 2006), although it has yet to be used much, if at all, in prior validation research. Like the H-measure, PRC values range from 0 to 1, with higher values denoting better performance.
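One common way to summarize a precision-recall curve in a single number is average precision, the mean of the precision values obtained at the rank of each recidivist. The sketch below is an illustration under that assumption; it is not necessarily the PRC summary used in this study, and the data are hypothetical with no tied scores.

```python
def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive case,
    with cases ranked from highest to lowest score."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank   # precision among the top `rank` cases
    return total / n_pos


# Hypothetical data: recidivists are ranked 1st and 4th out of 8 cases.
y_true = [1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
print(average_precision(y_true, scores))
```

Unlike the AUC, this summary depends on how many nonrecidivists are ranked above each recidivist relative to the number of recidivists found so far, so it is more sensitive to false positives among the highest-scoring cases.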

Calibration measures how well the predicted probabilities from a model correspond with the observed outcome being predicted. Whereas predictive discrimination assesses relative risk, calibration taps into absolute risk. For a prediction instrument to make accurate absolute assessments of risk, the model’s predicted probabilities must be calibrated with the observed recidivism outcomes. Two of the more commonly used calibration metrics are the Brier score and root mean square error (RMSE). For both metrics, the closer the value to 0, the better the calibration.
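Both calibration metrics can be computed from the squared differences between predicted probabilities and observed outcomes (coded 0 or 1). In the sketch below, which uses hypothetical data, the Brier score is the mean squared difference and the RMSE is its square root.

```python
import math


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def rmse(y_true, y_prob):
    """Root mean square error: the square root of the Brier score."""
    return math.sqrt(brier_score(y_true, y_prob))


# Hypothetical data: one recidivist among four offenders.
y_true = [1, 0, 0, 0]
y_prob = [0.5, 0.1, 0.1, 0.1]
print(brier_score(y_true, y_prob), rmse(y_true, y_prob))
```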

Amid the growing recognition that comprehensive evaluations of predictive performance involve more than calculating the AUC, recent studies have begun reporting multiple metrics that address each of these dimensions of predictive validity (Duwe & Kim, 2016; Hamilton et al., 2016; Tollenaar & van der Heijden, 2013). In particular, these studies included the squared error, accuracy, and ROC area (SAR), a statistic developed by Caruana, Niculescu-Mizil, Crew, and Ksikes (2004) that combines discrimination, accuracy, and calibration. The formula for SAR is [ACC + AUC + (1 − RMSE)] / 3. The SAR value in these studies ranged from 0.63 to 0.83, with higher values indicating better predictive performance (Duwe & Kim, 2016; Hamilton et al., 2016; Tollenaar & van der Heijden, 2013).
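The SAR formula can be applied directly once the three component metrics are available. The values in the sketch below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def sar(acc, auc, rmse):
    """SAR = [ACC + AUC + (1 - RMSE)] / 3."""
    return (acc + auc + (1.0 - rmse)) / 3.0


# Hypothetical component values: (0.90 + 0.80 + 0.70) / 3
print(sar(acc=0.90, auc=0.80, rmse=0.30))
```

Subtracting the RMSE from 1 places all three components on the same footing, so that a higher value is better for each of them.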

Testing for bias

As the use of risk and needs assessments has become increasingly embedded within correctional policy and practice, there have been recent concerns over the disparate and adverse effects that use of these tools may have on racial and ethnic minorities (Skeem & Lowenkamp, 2016). Whether and to what extent the effects may be adverse, however, depends on the various ways in which correctional authorities use risk assessments. In “low stakes” risk assessment, instruments are used to help determine institutional custody levels, prioritization for programming, and the type of community supervision. In “high stakes” risk assessment, where an individual’s liberty hangs in the balance, tools are used to inform decisions related to pretrial release, sentencing, whether individuals should be paroled from prison, and, more narrowly, whether persons convicted of a sex offense should be civilly committed after serving their sentence. While risk assessments should be routinely tested for racial and ethnic bias, as Skeem and Lowenkamp (2016) recommended, bias testing should be imperative for instruments used in high-stakes decisions.

Feedback loop: Updating a risk assessment instrument

The literature offers little, if any, explicit guidance when it comes to revising a risk assessment instrument. The only criterion, albeit an implicit one, seems to be that the challenger—the new, updated assessment—should be able to outperform the incumbent—the older, existing version of the tool. For example, studies examining revisions to the Static-99 (Helmus, Thornton, Hanson, & Babchishin, 2012), MnSOST (Duwe & Freske, 2012), and VASOR (McGrath et al., 2014) each touted the fact that the challenger outperformed the incumbent. Meanwhile, even less has been said about the frequency with which instruments should be updated, although the period of time between revisions can be lengthy. Indeed, more than a decade passed between revisions for the Static-99 (Helmus, Hanson, Thornton, Babchishin, & Harris, 2012), MnSOST (Duwe & Freske, 2012), and VASOR (McGrath et al., 2014).

Consistent with the final two principles (measuring relevant processes and providing measurement feedback) from the principles of effective correctional interventions, this study explores whether the use of a measurement feedback loop can help improve predictive performance. Because the MnSOST-3 was first implemented in 2012 and uses a 4-year follow-up period for recidivism, complete predictive performance data for the assessment first became available at the end of 2016. These data are used to help determine whether the MnSOST-3 can be updated with a better-performing assessment. This study not only compares the performance of the challenger (the MnSOST-4) with the incumbent (the MnSOST-3), but it also compares predictive performance for two other widely used sexual recidivism assessments: the MnSOST-R and the Static-99.

Data and Method

The overall sample used to develop and validate the MnSOST-4 consists of 5,745 sex offenders released from Minnesota prisons between 2003 and 2012. The predicted outcome is sexual recidivism, which was operationalized as a reconviction for a new sex offense within 4 years of release from prison. Similar to the MnSOST-3 (Duwe & Freske, 2012), sex crimes were measured as hands-on sex offenses. Accordingly, noncontact, sex-related offenses such as possession of child pornography or indecent exposure were excluded.

Reconviction was the recidivism measure because it reduced the likelihood of including instances of “pseudo recidivism” in which the offender was rearrested and/or reconvicted for a sex offense that was committed prior to his release from prison. Information on the date(s) when the reoffense occurred was seldom available in the arrest data but was consistently present in the conviction data. Reconviction data were obtained from the Minnesota Bureau of Criminal Apprehension (BCA), which include only convictions that occur in Minnesota. As in any study that measures recidivism with official criminal history data, these data will underreport the actual extent to which the sex offenders examined here recidivated.

Internal Validation Methods

As discussed below, k-fold validation was used to optimize the parameters for the classification methods evaluated, while the split-sample method was used for internal validation. The training set was made up of 3,973 sex offenders released from Minnesota prisons between 2003 and 2009, whereas the test set contained 1,772 sex offenders released from prison between 2010 and 2012. Of the 3,973 offenders in the training set, 97 (2.4%) were recidivists. Of the 1,772 offenders in the test set, 33 (1.9%) were recidivists. For the overall sample, the base rate was 2.3%.
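The split-sample design described above partitions offenders by release year rather than at random. The sketch below illustrates that kind of temporal split and the base-rate arithmetic; the record layout and the field names `release_year` and `recidivist` are hypothetical and are not the study's actual data structure.

```python
def split_by_release_year(records, last_training_year=2009):
    """Put releases through `last_training_year` in the training set, later ones in the test set."""
    train = [r for r in records if r["release_year"] <= last_training_year]
    test = [r for r in records if r["release_year"] > last_training_year]
    return train, test

def base_rate(records):
    """Share of records flagged as recidivists (a 0/1 field, assumed here)."""
    return sum(r["recidivist"] for r in records) / len(records)

# Hypothetical toy records.
toy = [
    {"release_year": 2005, "recidivist": 0},
    {"release_year": 2009, "recidivist": 1},
    {"release_year": 2011, "recidivist": 0},
]
train, test = split_by_release_year(toy)
print(len(train), len(test), base_rate(train))

# The reported base rates follow directly from the counts given above.
print(round(100 * 97 / 3973, 1), round(100 * 33 / 1772, 1), round(100 * 130 / 5745, 1))
```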

As shown later, the performance of the MnSOST-4 will be compared with the MnSOST-R, Static-99, and MnSOST-3. The advantage with the split-sample approach is that the performance of the MnSOST-4 can be directly compared with the other assessments on the same group of cases. For example, each of the 1,772 offenders in the test set had been administered a MnSOST-R prior to their release from prison. As a result, the MnSOST-4 and MnSOST-R will be compared on the full test set. Given that 168 offenders in the test set had not been assessed on the Static-99, the comparison between the MnSOST-4 and Static-99 focuses on the 1,604 offenders in the test set who had been scored on the Static-99. Finally, because the MnSOST-3 was not implemented until 2012, the MnSOST-3 and MnSOST-4 will be compared only for the 541 offenders in the test set who were released in 2012.

Selecting Predictors

The overall dataset for this sample contained more than 60 theoretically relevant predictors, including detailed criminal history measures, each item from the MnSOST-R and MnSOST-3, demographic characteristics, behavior in prison, and participation in prison programming. Some of these items, particularly those relating to prior sexual offending, must be scored manually through a detailed file review. Because the MnSOST-4 must be scored manually by correctional staff, the length of time it takes to administer and the number of items that must be scored are important considerations. In an effort to pare down the number of items so that only the significant and robust predictors of sexual recidivism were included, a bootstrap variable selection method developed by Efron and Gong (1983) was applied to the dataset.

Multiple logistic regression models were estimated in which predictors were added one at a time until no further single addition achieved significance level α = .10. Among the 18 predictors that had a significant effect (p < .10) on sexual recidivism, bootstrap resampling was used to refine the selection of predictors included in the MnSOST-4. Consistent with the MnSOST-3 (Duwe & Freske, 2012), predictors were retained as long as they were statistically significant at the .05 level in at least 70% of the 1,000 bootstrap samples. After removing five predictors that did not achieve statistical significance in at least 70% of the samples, another 1,000 bootstrap samples were estimated. As shown in Table 2, there were 13 predictors that were statistically significant at the .05 level in at least 70% of the bootstrap samples.²
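The retention rule described above (keep a predictor if it is significant in at least 70% of 1,000 bootstrap samples) can be sketched as follows. This is not the author's code: the model-fitting step is abstracted into a hypothetical callback, `significant_in`, which is assumed to fit a logistic regression to one resample and return the names of the predictors significant at the .05 level.

```python
import random

def bootstrap_retention(rows, candidates, significant_in,
                        n_boot=1000, threshold=0.70, seed=0):
    """Return (kept, shares): predictors significant in >= `threshold` of resamples."""
    rng = random.Random(seed)
    counts = {name: 0 for name in candidates}
    n = len(rows)
    for _ in range(n_boot):
        # Draw n rows with replacement to form one bootstrap sample.
        sample = [rows[rng.randrange(n)] for _ in range(n)]
        for name in significant_in(sample, candidates):
            if name in counts:
                counts[name] += 1
    shares = {name: counts[name] / n_boot for name in candidates}
    kept = [name for name in candidates if shares[name] >= threshold]
    return kept, shares

def stub_significant(sample, candidates):
    """Placeholder only: always flags the first candidate. A real callback would fit a model."""
    return candidates[:1]

kept, shares = bootstrap_retention(list(range(100)), ["public_location", "other_item"],
                                   stub_significant, n_boot=50)
print(kept, shares)
```

Separating the resampling loop from the significance test keeps the 70% rule visible regardless of which regression routine is plugged in.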

Table 2.

Descriptive Statistics for MnSOST-4 Items.

Predictors	Description	M	SD	B	OR
Total violent offenses	Total number of violent offense convictions	2.524	1.931	0.085	1.089
Felony/violent offending ratio	Ratio of felony offenses to violent offenses	1.799	1.766	0.027	1.027
Violations of orders for protection	Number of violations of orders for protection	0.245	0.741	0.151	1.164
Stranger victims	Has committed sex offense against a stranger victim	0.175	0.380	0.254	1.290
Public location	Has committed sex offense in a public location	0.128	0.334	0.510	1.665
Male victim specialization	Specialization in sex offenses involving male victims	0.964	0.177	−0.731	0.482
Multiple predatory offenses/age groups	Has multiple predatory offenses and victim age groups	0.102	0.302	0.219	1.245
Suicidal tendencies	History of suicidal thoughts, gestures, and attempts	0.185	0.388	0.273	1.314
Complete SO/CD treatment	Completed SO/CD treatment in prison	0.072	0.258	−0.384	0.681
Stake in conformity index	Marriage, employment, and postsecondary degree	1.844	2.649	−0.031	0.970
Age at release	Age at release (years)	35.875	11.086	−0.012	0.989
Release violator	Current prison admission is release/parole violator	0.314	0.464	0.371	1.450
Unsupervised release	Discharged or released to no supervision	0.115	0.319	0.428	1.534
n		5,745	5,745	3,973	3,973

Note. MnSOST-4 = Minnesota Sex Offender Screening Tool–4; OR = odds ratio; SO = sex offender; CD = chemical dependency.

Description of MnSOST-4 Items

Among the 13 items on the MnSOST-4, seven are criminal history measures that each increase sexual recidivism risk. Of these, four pertain specifically to prior sexual offending. As with the MnSOST-R and MnSOST-3, having committed a sex offense in a public location was a significant predictor of sexual recidivism. The number of predatory offenses with male victims was an item on the MnSOST-3, and a similar item can be found on the Static-99 (Hanson & Thornton, 1999) and VASOR-2 (McGrath et al., 2014). For the MnSOST-4, a better predictor was the extent to which sex offenders specialized in committing predatory offenses involving male victims. The formula used to measure male victim specialization was 1 − [(Number of predatory offenses with male victims / Total number of predatory offenses) × (Number of predatory offenses with male victims / Total number of predatory offenses)]. A value of “0” for this item indicates that the offender has complete specialization in sexual offending against males, whereas a value of “1” indicates complete specialization in sexual offending against females. As the coefficient in Table 2 indicates, sexual recidivism risk increases as specialization in male victimization increases (i.e., as the value moves from 1 to 0).
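The specialization formula above is one minus the squared share of predatory offenses that involved male victims. A minimal sketch follows; the function name is illustrative, and the handling of offenders with no predatory offenses is an assumption, because the text does not say how such cases were scored.

```python
def male_victim_specialization(male_predatory, total_predatory):
    """1 - (male / total)^2: 0 = all predatory offenses had male victims, 1 = none did."""
    if total_predatory <= 0:
        # Assumption: the source does not describe scoring when there are no predatory offenses.
        raise ValueError("total_predatory must be positive")
    if not 0 <= male_predatory <= total_predatory:
        raise ValueError("male_predatory must be between 0 and total_predatory")
    share = male_predatory / total_predatory
    return 1.0 - share * share

print(male_victim_specialization(0, 3))  # no male victims
print(male_victim_specialization(1, 2))  # half of the offenses involved male victims
print(male_victim_specialization(3, 3))  # all offenses involved male victims
```

Squaring the share means that mixed offending histories score closer to 1 than a simple proportion would, so low values are reserved for offenders whose predatory offenses predominantly involved male victims.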

Sex offenses committed against a stranger were included on the MnSOST-R but not on the MnSOST-3. Consistent with the MnSOST-R and Static-99, the results show that a prior sex offense committed against a stranger increases the risk of sexual reoffending. Predatory offenses and victims’ age groups, which were items on the MnSOST-3 and MnSOST-R, respectively, were not significant predictors of sexual recidivism. When these two items were combined into a single item, however, it became a significant predictor. More specifically, sexual recidivism risk was significantly greater when sex offenders had multiple prior sex offenses and had victims from more than one age group (per the MnSOST-R).

Although predatory offenses were a significant predictor for the MnSOST-3, they were not for the MnSOST-4. Instead, the total number of violent offense convictions, which is a broader measure that includes both sexual and nonsexual violent offenses, significantly predicted sexual recidivism. Likewise, felony offense conviction was an item on the MnSOST-3, but it was not a significant predictor for the MnSOST-4. However, when the number of felony offense convictions was considered in relation to the number of violent offense convictions, the ratio of felony to violent offense convictions (i.e., felony offense convictions/violent offense convictions) was a significant predictor. That is, as the value for the felony/violent offending ratio increased, so did sexual recidivism risk. The item that measures the number of convictions for violations of orders for protection (VOFP), stalking, and harassment also significantly predicted sexual recidivism. This item, which was also included on the MnSOST-3, likely taps into intimacy deficits, which have been found to be predictive of future sexual offending (Hanson & Morton-Bourgon, 2005). Although recent disorderly conduct conviction was an item on the MnSOST-3, it did not significantly predict sexual recidivism in the MnSOST-4 sample.

The MnSOST-4 contains three “dynamic” items—suicidal tendencies, successful participation in sex offender and chemical dependency (SO/CD) treatment, and the stake in conformity index—whose values can change while an offender is in prison. Instruments assessing the risk of first-time sexual offending (Duwe, 2012), repeat sexual offending (Boer, Hart, Kropp, & Webster, 1997), and general recidivism (Duwe, 2014a) have identified suicidal thoughts or tendencies as a risk factor. Consistent with this research, the results showed that a history of suicidal tendencies significantly raised the likelihood of a sex reoffense.

Participation in prison-based SO and CD treatment was measured as two separate items on the MnSOST-R. These items were combined to form a single item on the MnSOST-3 in which offenders received a value of “1” only if they completed SO treatment and CD treatment during their current prison term. As with the MnSOST-3, completion of SO and CD treatment significantly reduced the risk.

The stake in conformity index is based on three dichotomous measures: marital status (married = 1; unmarried = 0), employment (employment in previous year = 1; unemployed in previous year = 0), and educational achievement (postsecondary degree or certificate = 1; less than postsecondary degree/certificate = 0). The employment measure is the same item that appears on the MnSOST-R. Individually, none of the three measures were significant, robust predictors of sexual recidivism. Combined, however, this index emerged as a significant predictor. The score for the stake in conformity index ranges from 0 to 10. Reflecting their association with sexual recidivism, marital status is given 5 points, employment is assigned 4 points, and educational achievement is given 1 point.³ Thus, an offender with at least a postsecondary degree who is married would receive 6 points on the stake in conformity index. The higher the stake in conformity index score, the lower the sexual recidivism risk.
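The point scheme described above (5 points for marriage, 4 for employment, 1 for a postsecondary degree or certificate) amounts to a weighted sum of three yes/no measures. The sketch below uses illustrative argument names.

```python
def stake_in_conformity(married, employed_prev_year, postsecondary):
    """Weighted index from 0 to 10: marriage = 5, employment = 4, postsecondary = 1."""
    return 5 * int(bool(married)) + 4 * int(bool(employed_prev_year)) + 1 * int(bool(postsecondary))

# The example from the text: married with a postsecondary degree, not employed.
print(stake_in_conformity(married=True, employed_prev_year=False, postsecondary=True))
```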

As with the MnSOST-3 and other instruments such as the MnSOST-R and Static-99, age at release was a significant predictor. In general, the younger the offender at the time of release, the greater the sexual recidivism risk. Unsupervised release is an item on the MnSOST-3, and it remained a significant predictor for the MnSOST-4 sample. The risk was also significantly greater for offenders who entered prison as parole violators; that is, for offenders released to correctional supervision, those who returned to prison for violating their conditions of parole (either for a new offense or for a technical violation) had a higher sexual recidivism risk.

Classification Methods

The six different classification methods that were trained and tested on the dataset containing the outcome measure (sex offense reconviction within 4 years) and the 13 predictors were RLR, artificial neural networks, support vector machines, bagged trees, RF, and LogitBoost. The main advantages with RLR, which is the classification method used for the MnSTARR 2.0, are that it reduces the chances of overfitting by shrinking overly large coefficients, and the interpretation is much the same as it is for an unpenalized logistic regression model. Because penalization estimation procedures such as RLR introduce bias by reducing the variance of coefficients, its main drawback is that estimating CIs around the predicted probabilities is problematic (Goeman, Meijer, & Chatuverdi, 2016). Still, by evaluating predictive performance for RLR and the other five algorithms, this study compares a relatively transparent and interpretable method—RLR—with several newer, promising machine learning algorithms that are generally more opaque.⁴

Models were optimized on the training set by varying parameter values, and then using a 10-fold cross-validation procedure to determine how each model would likely perform on the test set.⁵ After identifying the parameters that yielded the best performance on the training set, the optimized models for each algorithm were then applied to the 1,772 sex offenders in the test set. To estimate the potential trade-off between transparency and predictive performance, the results for the most transparent algorithm (RLR) are presented below alongside those for the algorithm with the best overall performance (RF).
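The 10-fold procedure mentioned above repeatedly holds out one tenth of the training set while fitting on the rest. The following is a minimal sketch of the fold construction only; it assumes simple random (unstratified) folds, which the text does not specify, and it leaves model fitting and parameter tuning to the caller.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (fit_indices, held_out_indices) pairs for k roughly equal random folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # every k-th shuffled index goes to the same fold
    for i in range(k):
        held_out = folds[i]
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, held_out

# Fold sizes for a training set of 3,973 cases.
sizes = [len(held_out) for _, held_out in k_fold_indices(3973, k=10)]
print(sizes)
```

With a base rate near 2%, some folds may contain very few recidivists, which is one reason an analyst might prefer stratified folds in practice.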

Predictive Performance Metrics

Six different metrics were used to evaluate predictive performance: one for accuracy (ACC), one for calibration (RMSE), three for discrimination (AUC, H-measure, and PRC), and one consolidated statistic introduced in this study, SHARP (squared error, H-measure, ACC, ROC, and PRC), which is designed specifically for assessing overall predictive performance within highly imbalanced datasets. The SHARP metric is similar to the SAR statistic developed by Caruana et al. (2004), except that it weights predictive discrimination more heavily by including the H and PRC statistics. The formula for SHARP is as follows: [H-measure + AUC + PRC + ACC + (1 − RMSE)] / 5. As with SAR, the SHARP value ranges from 0 to 1, with higher values signifying better predictive performance.
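The SHARP formula can be checked directly against the values reported in Table 3. The sketch below is only the arithmetic from the formula above, applied to the Table 3 rows for the MnSOST-4 (RLR) and the MnSOST-R on the full test set.

```python
def sharp(h, auc, prc, acc, rmse):
    """SHARP = [H-measure + AUC + PRC + ACC + (1 - RMSE)] / 5."""
    return (h + auc + prc + acc + (1.0 - rmse)) / 5.0

# Component values as reported in Table 3 (2010-2012 test set, N = 1,772).
mnsost4_rlr = sharp(h=0.180, auc=0.835, prc=0.224, acc=0.981, rmse=0.131)
mnsost_r = sharp(h=0.012, auc=0.626, prc=0.033, acc=0.981, rmse=0.133)
print(round(mnsost4_rlr, 3), round(mnsost_r, 3))
```

Three of the five components measure discrimination, so two instruments with identical ACC and similar RMSE can still differ substantially on SHARP.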

Bias Testing

The MnSOST assessments have been used in both “low stakes” and “high stakes” risk assessments, which underscores the need to test for bias. In Minnesota, the vast majority of prisoners fall into one of the following five racial and ethnic categories: White, African American, American Indian, Hispanic, and Asian. Because the number of offenders in some of these categories was relatively small, the 1,772 offenders in the test set were separated into two groups (White and non-White). Hispanic offenders were included in the non-White group. To test for racial/ethnic bias, the MnSOST-4’s predictive performance was evaluated separately for these two groups.
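Evaluating performance separately by group can be illustrated with the AUC, computed here as the share of recidivist/non-recidivist pairs in which the recidivist has the higher score (ties count one half). This is a minimal sketch with hypothetical toy data; it does not implement the DeLong et al. (1988) significance test, and the function names are illustrative.

```python
def auc(scores, outcomes):
    """Probability that a randomly chosen recidivist outscores a randomly chosen non-recidivist."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one recidivist and one non-recidivist")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_group(scores, outcomes, groups):
    """Compute the AUC separately within each group label."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, label in enumerate(groups) if label == g]
        result[g] = auc([scores[i] for i in idx], [outcomes[i] for i in idx])
    return result

# Hypothetical scores, outcomes, and group labels for eight cases.
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.6, 0.5, 0.9]
toy_outcomes = [0, 0, 1, 1, 0, 1, 0, 1]
toy_groups = ["White"] * 4 + ["non-White"] * 4
print(auc_by_group(toy_scores, toy_outcomes, toy_groups))
```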

IRR Assessment

An IRR assessment was performed by MnDOC staff in the Risk Assessment/Community Notification (RACN) unit, which is responsible for scoring MnSOST assessments. Of the 541 sex offenders in the test set who were released from prison in 2012, 10 were randomly selected for scoring by eight raters from the RACN unit. Each of the eight raters in this study had prior training and experience in scoring sex offender risk assessment instruments.

The intraclass correlation coefficient (ICC) was used to estimate IRR. Because the eight raters represent the population of raters for the MnDOC, a two-way mixed-effects model was used (Shrout & Fleiss, 1979). In their recent study, Duwe and Rocque (2017) proposed the following ICC thresholds for assessing IRR within the context of manually scored recidivism risk assessment tools: 0.95 and above is excellent; 0.85 to 0.94 is good; 0.75 to 0.84 is adequate; and below 0.75 is poor.
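The thresholds proposed by Duwe and Rocque (2017) can be expressed as a simple lookup. The sketch below treats the lower bound of each band as the cutoff, which is an assumption about how values falling between the stated bands (e.g., 0.945) would be classified.

```python
def irr_category(icc):
    """Map an ICC value to the Duwe and Rocque (2017) interrater reliability bands."""
    if icc >= 0.95:
        return "excellent"
    if icc >= 0.85:
        return "good"
    if icc >= 0.75:
        return "adequate"
    return "poor"

for value in (0.96, 0.90, 0.80, 0.70):
    print(value, irr_category(value))
```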

Results

As shown in Table 3, the ACC value for the MnSOST-4 developed with RLR was 0.981. The high ACC value is somewhat misleading, however, as none of the 1,772 offenders in the test set had a predicted probability that exceeded 50%. As a result, all of the offenders were classified as “negatives.” The 1.9% error rate reflects the base rate for the test set. The AUC was 0.835, which indicates that the MnSOST-4 had good predictive discrimination on the test set. The other discrimination metrics indicate that the instrument had an H value of 0.180 and a PRC of 0.224. The RMSE value was 0.131, whereas the SHARP value was 0.618.
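The point about accuracy can be reproduced with the counts reported for the test set. In the sketch below, the predicted probabilities are hypothetical placeholders; the only property that matters is that none exceeds the 50% threshold, so every case is classified as a negative.

```python
def accuracy_at_threshold(probs, outcomes, threshold=0.5):
    """Share of cases classified correctly when probabilities above `threshold` count as positive."""
    predictions = [1 if p > threshold else 0 for p in probs]
    correct = sum(1 for yhat, y in zip(predictions, outcomes) if yhat == y)
    return correct / len(outcomes)

# 33 recidivists among 1,772 offenders, as reported for the test set.
outcomes = [1] * 33 + [0] * 1739
probs = [0.10] * 1772  # hypothetical: every predicted probability is below 0.5
print(round(accuracy_at_threshold(probs, outcomes), 3))
```

Any model that never predicts a positive would reach the same accuracy on these data, which is why ACC alone says little about performance when the base rate is this low.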

Table 3.

Predictive Performance Results.

	Predictive performance metrics
	ACC	PRC	H	AUC	RMSE	SHARP
2010-2012 test set (N = 1,772)
MnSOST-4
RLR	0.981	0.224	0.180	0.835	0.131	0.618
Random Forest	0.981	0.216	0.163	0.851	0.132	0.616
MnSOST-R	0.981	0.033	0.012	0.626	0.133	0.504
2010-2012 test set (N = 1,604)
MnSOST-4
RLR	0.981	0.224	0.174	0.847	0.134	0.618
Random Forest	0.981	0.202	0.158	0.858	0.135	0.613
Static-99	0.981	0.047	0.022	0.654	0.137	0.513
2012 test set (N = 541)
MnSOST-4
RLR	0.976	0.154	0.117	0.765	0.149	0.573
Random Forest	0.976	0.150	0.117	0.786	0.151	0.576
MnSOST-3	0.975	0.118	0.099	0.716	0.138	0.554
Bias testing
MnSOST-4: RLR
White	0.983	0.241	0.206	0.838	0.126	0.628
Non-White	0.983	0.225	0.192	0.833	0.137	0.619

Note. ACC = accuracy; PRC = precision-recall curve; H = Hand’s H-measure; AUC = area under the curve; RMSE = root mean square error; SHARP = squared error, H-measure, accuracy, receiver operating characteristic, and PRC; MnSOST-4 = Minnesota Sex Offender Screening Tool–4; MnSOST-R = Minnesota Sex Offender Screening Tool–Revised; MnSOST-3 = Minnesota Sex Offender Screening Tool–3; RLR = regularized logistic regression.

When we compare the MnSOST-4 with the MnSOST-R, we see that the two instruments had the same ACC value, while the MnSOST-4 had slightly better calibration. We see large differences, however, for the three predictive discrimination metrics. The AUC for the MnSOST-4 was 0.209 higher, which was a statistically significant difference at the .05 level using DeLong, DeLong, and Clarke-Pearson’s (1988) method. The H was 0.168 higher, and the PRC was 0.191 higher. As a result, the SHARP value was 0.114 higher.

To a large extent, we see the same pattern of results when comparing the Static-99 and MnSOST-4 among 1,604 offenders in the test set. Both instruments had the same ACC (0.981), and the MnSOST-4 was better calibrated. The AUC for the MnSOST-4 was 0.193 higher, a statistically significant difference using the DeLong et al. (1988) approach. The H-measure was 0.152 higher, and the PRC was 0.177 higher. Furthermore, the SHARP value was 0.105 higher.

The differences are not quite as pronounced when we compare the MnSOST-3 and MnSOST-4 among 541 offenders released in 2012. The results reveal that the MnSOST-3 had better calibration, whereas the MnSOST-4 performed slightly better for accuracy. When we focus on the predictive discrimination metrics, we see that the MnSOST-4 was 0.049 higher for the AUC, 0.018 higher for the H, and 0.036 higher for the PRC. In addition, the MnSOST-4 was 0.019 higher for the SHARP.

It is worth noting that while the MnSOST-3’s AUC value (0.716) is adequate, it is quite a bit lower than that reported (0.796) during its development (Duwe & Freske, 2012). Because these results are based on only 1 year’s worth of releases, they are somewhat preliminary. Even though Duwe and Freske (2012) corrected for optimism, it appears that the AUC they reported for the MnSOST-3 was still too optimistic.

The overall results also reveal that the MnSOST-4 based on the RLR algorithm performed approximately the same as the assessment developed with the RF algorithm. For all three test sets, both algorithms had the same ACC values and the RLR model had slightly better RMSE values. Among the three discrimination metrics, the RLR model had a higher H-measure value for two of the test sets and a higher PRC value in all three test sets. Conversely, the RF model had a higher AUC in all three test sets. Overall, the RLR algorithm had a slightly higher SHARP value for the two 2010-2012 test sets, whereas RF had a slightly better SHARP value for the 2012 test set.

As noted above, the 1,772 offenders in the test set were divided into two groups (White and non-White) to test for bias. The MnSOST-4 based on the RLR algorithm performed better among White offenders but not by much. Although both groups had the same ACC value, the White offender test set had slightly better performance for the other four metrics. The DeLong et al. (1988) test revealed that the difference in AUC values (0.838 vs. 0.833) was not statistically significant at the .05 level. Overall, the SHARP value for the White offender test set (0.628) was 0.009 higher than that for the non-White offender test set (0.619).

The results presented in Table 3 assess how well the classification methods performed overall among the different test sets, but it is also important to determine how well they would perform in practice. Risk assessment instruments, in particular the MnSOST, are used to help guide risk-level assignments for sex offenders in Minnesota. Level 3 is the highest risk level, and historical data indicate that approximately 15% are given this risk-level designation.⁶ If we used the various instruments examined here to identify the riskiest 15%, how well would the assessments do?

Table 4.

Predictive Performance Among Highest Risk Offenders.

	N	Recidivists	Rate (%)	Recidivists captured (%)
2010-2012 test set (N = 266)
MnSOST-4
Regularized logistic regression	266	19	7.1	57.6
Random Forest	266	18	6.8	54.5
MnSOST-R	266	10	3.8	30.3
2010-2012 test set (N = 241)
MnSOST-4
Regularized logistic regression	241	18	7.5	58.1
Random Forest	241	18	7.5	58.1
Static-99	241	13	5.4	41.9
2012 test set (N = 81)
MnSOST-4
Regularized logistic regression	81	6	7.4	46.2
Random Forest	81	6	7.4	46.2
MnSOST-3	81	5	6.2	38.5

Note. MnSOST-4 = Minnesota Sex Offender Screening Tool–4; MnSOST-R = Minnesota Sex Offender Screening Tool–Revised; MnSOST-3 = Minnesota Sex Offender Screening Tool–3.

For the full test set, there were 1,772 offenders, and MnSOST-R and MnSOST-4 scores were available for each offender. If we focus on the 266 (i.e., 15% of 1,772) with the highest MnSOST-R scores versus the 266 with the highest MnSOST-4 scores, we see that the MnSOST-R would capture 10 recidivists compared with 19 for RLR and 18 for RF. Put another way, the MnSOST-R would capture 30% of the 33 recidivists versus 58% for RLR and 55% for RF. If we focus on the 241 (i.e., 15% of 1,604) with the highest Static-99 scores, the instrument would capture 13 recidivists (42% of 31). By comparison, the RLR and RF models would each capture 18 recidivists, or 58% of the total. If we focus on the 81 (15% of 541) in the 2012 test set with the highest MnSOST-3 scores, five were recidivists (or 38% of 13). Likewise, when we focus on the 81 offenders released in 2012 with the highest MnSOST-4 scores, there would be six recidivists (46% of the total) apiece for both the RLR and RF models.
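The highest-risk comparison amounts to ranking offenders by score, taking the top 15%, and counting the recidivists in that group. A minimal sketch follows; rounding the group size to the nearest whole case and breaking tied scores by input order are assumptions, and the toy data are hypothetical.

```python
def top_share_capture(scores, outcomes, share=0.15):
    """Return (n_top, captured, rate_in_top, share_of_all_recidivists) for the top `share` by score."""
    n_top = round(share * len(scores))
    total_recidivists = sum(outcomes)
    if n_top == 0 or total_recidivists == 0:
        raise ValueError("need a non-empty top group and at least one recidivist")
    # Sort by score, highest first; ties keep their original order (an assumption).
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    captured = sum(y for _, y in ranked[:n_top])
    return n_top, captured, captured / n_top, captured / total_recidivists

# The group sizes used in Table 4 follow from 15% of each test set.
print(round(0.15 * 1772), round(0.15 * 1604), round(0.15 * 541))

# Hypothetical toy data: 20 cases, recidivists at positions 5, 17, and 19.
toy_scores = [i / 20 for i in range(20)]
toy_outcomes = [1 if i in (5, 17, 19) else 0 for i in range(20)]
print(top_share_capture(toy_scores, toy_outcomes))
```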

IRR Assessment

The results showed that the ICC for the eight raters was .901 for consistency and .896 for absolute agreement of ratings for the 10 cases (see Table 5). The overall IRR of the MnSOST-4 is, therefore, in the “good” range according to the criteria proposed by Duwe and Rocque (2017).

Table 5.

MnSOST-4 Interrater Reliability Assessment.

	Consistency			Absolute agreement
MnSOST-4 items	ICC	Lower bound	Upper bound	ICC	Lower bound	Upper bound
MnSOST-4 score	.901	.793	.969	.896	.786	.968
Total violent offenses	.828	.666	.944	.807	.635	.936
Felony offenses	.880	.756	.962	.881	.758	.962
VOFP/stalking/harassment	.909	.809	.972	.909	.810	.972
Predatory offenses	.953	.896	.986	.953	.898	.986
Stranger victims	1.000	1.000	1.000	1.000	1.000	1.000
Male victims	1.000	1.000	1.000	1.000	1.000	1.000
Public place	.857	.715	.954	.857	.717	.954
Multiple age groups	.847	.518	.993	.856	.463	.993
Suicidal tendencies	1.000	1.000	1.000	1.000	1.000	1.000
Employment	.758	.560	.917	.769	.577	.922
Married	.857	.715	.954	.857	.717	.954
Postsecondary degree	.815	.479	.974	.811	.473	.973
SO/CD treatment	1.000	1.000	1.000	1.000	1.000	1.000
Release violator	.897	.787	.968	.886	.767	.964
Release age (years)	1.000	1.000	1.000	1.000	1.000	1.000
Unsupervised release	.857	.715	.954	.857	.717	.954

Note. All coefficients were statistically significant at the .05 level. MnSOST-4 = Minnesota Sex Offender Screening Tool–4; ICC = intraclass correlation coefficient; VOFP = violations of orders for protection; SO = sex offender; CD = chemical dependency.

The item-level data indicate an absence of any interrater disagreement for the following five items: stranger victims, male victims, suicidal tendencies, SO/CD treatment, and age at release. With ICC values of .758 for consistency and .769 for absolute agreement, the employment item had the highest level of interrater disagreement.

Conclusion

Based on a review of the existing literature, this study identified promising and proven practices in the development and validation of recidivism risk assessments. While it may be debatable whether these are “best practices,” the results suggest that they are, at a minimum, “better practices.” Indeed, the findings indicate that the MnSOST-4 achieved a relatively high level of predictive performance across multiple metrics. The MnSOST-4 outperformed both the MnSOST-R and Static-99 by relatively wide margins, although its improvement over the MnSOST-3 was more modest. Still, the differences in predictive performance are sufficient to warrant the MnSOST-4’s replacement of the MnSOST-3 as the instrument the MnDOC uses to assess sexual recidivism risk. The analyses did not reveal significant racial/ethnic bias, and the IRR assessment indicated that the MnSOST-4 can be scored consistently by raters.

These findings are encouraging, although a few caveats are worth mentioning: First, the MnSOST-3’s AUC in this study (0.716) is quite a bit lower than what Duwe and Freske (2012) reported in the development and validation study (0.796). Accordingly, when the MnDOC begins using the MnSOST-4 in practice, its predictive performance may fall short of the results presented here. Second, the MnSOST-4 was developed and validated on a population consisting of male sex offenders released from Minnesota prisons. It cannot be assumed that this instrument would deliver a similar level of predictive performance if it were applied to another sex offender population. This is true, however, even for global, widely used assessments such as the Static-99. For example, the AUC for the Static-99 reported in this study (0.654) is well below the AUC reported (0.705) by Helmus et al. (2012) in their meta-analysis. In short, prior to using an assessment on an offender population, the assessment’s predictive performance must be evaluated on that population.

These limitations notwithstanding, there are several broad lessons that can be drawn from the development of the MnSOST-4: First, using a feedback loop with an accelerated revision schedule may be helpful in further improving the performance of risk assessment instruments. More specifically, efforts to revise an assessment should begin as soon as complete performance data become available, which suggests that risk assessment instruments should have shorter shelf lives than what they have had in the past. In addition to measuring the performance of the “incumbent,” these data can be used to help determine whether a “challenger” can achieve greater predictive validity. If this feedback loop in which challengers are compared against incumbents is performed iteratively, it will presumably become progressively more difficult for a challenger to unseat an incumbent. In doing so, however, it is anticipated this process will help facilitate greater improvement in predictive performance.

Second, the development of the MnSOST-4 suggests that there may be advantages to using alternative measures for existing data. As evidenced by the stake in conformity index and multiple predatory offenses and victim age groups items, this study created single predictors out of multiple items that were not, on their own, significantly predictive of sexual recidivism. Creating consolidated measures not only helps identify significant predictors but it also provides what is arguably a simpler, more transparent approach for modeling interaction effects. In addition, male victim specialization was more predictive than an item that simply measured the number of sex offenses with male victims. Items that measure the extent to which offenders have specialized in sex offenses involving stranger victims or public places may also be more predictive than those that just record whether the offender has ever committed a sex offense against a stranger victim or in a public location. Although data were not available on the number of sex offenses committed against stranger victims or in public locations, future research should examine whether the use of more refined measures of prior sexual offending further enhances the prediction of sexual recidivism.

Third, determining how well an instrument performs should entail more than just reporting the AUC. As this study showed, manually scored tools should be accompanied by an IRR assessment, research should begin to routinely test for bias, and multiple performance metrics in addition to the AUC should be used. The SHARP metric, in particular, may be especially relevant for validations of tools designed to predict sexual recidivism as it taps into all three dimensions of predictive validity—accuracy, discrimination, and calibration—and is designed for highly imbalanced datasets.

Finally, testing multiple classifiers is critical to the development of a risk assessment instrument that is an optimal fit for the offender population and context to which it will be applied. As evidenced by the removal of the two interaction terms from the MnSOST-3 to the 3.1, the ability of practitioners to understand and explain, at least on a basic level, how an instrument assesses risk for individual offenders is paramount. Of the algorithms tested in this study, RLR is arguably the most transparent and interpretable. Given that its predictive performance was on par with the best performer (RF) among the other classifiers evaluated, the decision to base the MnSOST-4 on the RLR algorithm is relatively clear-cut. If, however, there had been a greater gap in performance between RF and RLR (e.g., if RF had, say, an AUC of 0.900 vs. 0.835 for RLR), then sacrificing this much predictive discrimination would be difficult to ignore and a decision to use the RLR algorithm more difficult to defend.

Despite RLR’s performance in this study, the “no free lunch” theorem bears repeating. There is no single best algorithm, for even the most promising ones, such as boosted trees or RF, can sometimes perform poorly. Just because RLR performed strongly on the MnSOST-4 dataset does not necessarily mean that it would be a top performer in predicting sexual recidivism in other contexts. Indeed, future efforts to revise the MnSOST may show that other algorithms outperform RLR in predicting sexual recidivism for Minnesota sex offenders. Because it is not possible to know beforehand which classifier will perform best on a dataset, testing multiple classification methods is necessary. Evaluating multiple classifiers not only improves the odds of identifying the optimal algorithm for a dataset but also enables a determination of the size of the potential trade-off between transparency and performance.
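The size of that trade-off can be expressed as a simple difference in test-set AUC. The sketch below is illustrative only: the helper name `transparency_tradeoff` is hypothetical, and the AUC values are the hypothetical gap used as an example in the text (0.900 for RF vs. 0.835 for RLR), not results reported for this study.

```python
from typing import Dict, Tuple


def transparency_tradeoff(test_auc: Dict[str, float], transparent: str) -> Tuple[str, float]:
    """Return the top-scoring classifier and the AUC forgone by choosing `transparent`."""
    best = max(test_auc, key=lambda name: test_auc[name])
    return best, test_auc[best] - test_auc[transparent]


# Hypothetical values taken from the illustration in the text, not observed results.
hypothetical_auc = {"RLR": 0.835, "RF": 0.900}
best_model, auc_gap = transparency_tradeoff(hypothetical_auc, transparent="RLR")
print(best_model, round(auc_gap, 3))
```

How large a gap is tolerable remains a judgment for the agency adopting the instrument; the sketch only quantifies it and sets no threshold.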

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Author Biography

Grant Duwe is the director of Research and Evaluation for the Minnesota Department of Corrections, where he evaluates correctional programs, develops risk assessment instruments, and forecasts the state’s prison population. His recent work has been published in Criminology & Public Policy, Criminal Justice Policy Review, The Prison Journal, The Journal of Offender Rehabilitation, and International Journal of Offender Therapy and Comparative Criminology. He is a nonresident senior fellow with Baylor University’s Institute for Studies of Religion and, along with Michael Hallett, Joshua Hays, Byron Johnson, and Sung Joon Jang, a coauthor of the forthcoming book, The Angola Prison Seminary: Effects of Faith-Based Ministry on Identity Transformation, Desistance and Rehabilitation (Routledge).
