Abstract
Recidivism risk assessment tools have been used for decades. Although their implementation and use have the potential to touch nearly every aspect of the correctional system, the creation and examination of optimal development methods have been restricted to a small group of instrument developers. Furthermore, the methodological variation among common instruments used nationally is substantial. The current study examines this variation by reviewing the methodologies used to develop several existing assessments and then testing a variety of design variations to isolate and select those that provide improved content and predictive performance, using a large sample (N = 44,010) of reentering offenders in Washington State. These tests were designed to isolate and identify potential incremental gains in performance. Findings identify a methodology for improved prediction model performance and, in turn, describe the development and introduction of the Washington State Department of Corrections’ recidivism prediction instrument, the Static Risk Offender Need Guide for Recidivism (STRONG-R).
Over the past 30 years, the criminal justice system has witnessed an increase in the use of actuarial risk assessments to predict recidivism and to allow for structured organizational decision making. Offender risk assessments now assist in determining custody levels, guide contact standards, and determine intervention priority/eligibility. Actuarial risk tools are now a standard part of how criminal justice professionals make decisions, occurring in both the adult and juvenile system (Bushway, 2013; National Center for Juvenile Justice, 2006). In the United States and Canada, it is becoming improbable that an offender would evade an assessment of risk following conviction. Despite their influence, due to the relatively small group of researchers involved in creating recidivism assessments (e.g., Andrews & Bonta, 1995; Baird, 1981; Barnoski & Drake, 2007; Brennan & Oliver, 2000; Duwe, 2014; Hare, 1991; Latessa, Smith, Lemke, Makarios, & Lowenkamp, 2009), examinations of development methods and procedures are limited. Kroner, Mills, and Reddon (2005) argued that this is likely due to the near decade of research and development needed to take an instrument from inception to validation. We contend that this has unfortunately led to restricted knowledge and a relatively limited critique. In addition, others have suggested (Bushway, 2013; Ridgeway, 2013) that there is a lack of comparative testing of important sources of variation, which limits practitioners’ knowledge base when adopting a prospective tool.
Generally speaking, risk assessments consist of algorithms of varying complexity that use empirically predictive indicators of behavior (Falzer, 2013). Offender assessment has been classified into four generations (see Andrews, Bonta, & Wormith, 2006): clinical judgment (1G), static only (2G), static and dynamic (3G), and responsive (4G) instruments. 1 Each generation adds a nuanced dimension that improves an instrument’s functionality (Baird, 2009). However, classifying assessments is not cut-and-dried, as a newer generation does not always provide increased performance. Furthermore, some instruments claim robust applicability to a variety of populations (e.g., Level of Service Inventory−Revised [LSI-R], Ohio Risk Assessment System [ORAS]) and others target specific jurisdictions or special populations (e.g., Minnesota Screening Tool Assessing Recidivism Risk [MnSTARR]). Despite decades of use and study, the evaluation of instrument performance is still misunderstood by many practitioners, policy makers, and researchers. Furthermore, when the criteria of validation can be achieved by demonstrating predictive performance that is slightly better than a coin flip (more commonly known as betting the base rate), 2 possessing a validated instrument can be a relatively low bar for an agency to hang its hat on. The field is moving forward, and translational research is needed to guide practitioners away from underperforming models and toward those with greater predictive strength (Gottfredson & Moriarty, 2006).
The current study sought to examine offender risk assessment development variations. Our efforts attempted to address three objectives. First, we identify the concepts and issues that vary the core provisions of risk assessment modeling and prediction. Specifically, we discuss issues of contemporary recidivism risk assessments commonly used in North America. Although not an exhaustive list, the issues described are the subject of current discussion and testing. Next, we use data from the Washington State Department of Corrections (WADOC) to empirically examine how predictive performance is influenced by these issues. Finally, we isolate and compare instruments and development design decisions, extending the discussion of methodological variation. The empirical comparisons were made in an effort to develop an offender recidivism prediction instrument that takes advantage of the identified development methods, which resulted in improved predictive performance.
Methodological Design Issues of Current Risk Assessments
Practitioners often adopt an existing risk assessment off-the-shelf, one that may have been developed in another state or country. That is, one has to start somewhere, and starting with an existing instrument often makes the most sense for an agency. Alternatively, a state or agency may develop its own assessment, gathering items that are tailored to the populations and outcomes it intends to predict. In this scenario, the agency must work with an experienced research team and cast a wide net of items to include in an assessment pool and to be gathered on its offenders. The implementing agency must then collect the detailed assessment data and allow research partners to craft an instrument that is tailored to suit the needs of its assessors and practitioners. Conceivably, if proper methods of development are adhered to at each stage, the instrument should perform better than an off-the-shelf instrument, which lacks the item and outcome tailoring and the localized context provided by agency-crafted prediction models (Wright, Clear, & Dickson, 1984). Currently, many states’ correctional systems utilize customized instruments, including Minnesota (Duwe, 2014), Georgia (Meredith, Speir, & Johnson, 2007), and Texas (http://www.tdcj.state.tx.us/), to highlight a few.
Although a jurisdiction-specific or customized assessment has great appeal for an agency, there are many developmental design decisions that must be made during the development process, which have the potential to alter, or potentially improve, the prediction of the recidivistic outcomes. In 2008, Washington State began the data collection process to develop its own 4G model risk assessment. In 2012, we began the design and development process, attempting to utilize modern statistical techniques and agency input to maximize prediction strength and functionality. When developing a tool for Washington State, we aimed to achieve two primary goals: (a) select highly predictive items and (b) create predictive models that will be stable over time.
Because customized assessments often do not receive the same attention and advocacy as more nationally renowned tools, there is a tendency to label them as less than state of the art and as possessing limited functionality. The current research describes our risk assessment development process, which created a state-of-the-art offender assessment customized to meet the needs of Washington State. Using Washington State as an example, an intended purpose of this study was to describe methodological variations that may influence the performance of all offender assessments, in an effort to provide translational research or, as Bushway (2013) suggested, to give attention to the methodological issues that can affect criminal justice practice. Although more are likely to exist, we sought to describe and test five potential performance-impacting issues that can be observed in a variety of instruments used today: (a) static versus dynamic items, (b) item selection, (c) item weighting, (d) gender responsivity/specificity, and (e) specified outcomes.
Static Versus Dynamic Items
As indicated, the movement from 2G to 3G instruments added dynamic predictors of recidivism with an emphasis on offender needs. Conceptually, when static and dynamic items are utilized in the same instrument, these tools are often referred to as risk-needs assessments, in which static risks (e.g., prior number of convictions) are combined with dynamic needs (e.g., employed in the prior 6 months). As previously discussed by Baird (2009), a distinction can be observed between the objectives of correctional practitioners and psychometricians. When assessing an objectively measurable concept like recidivism, a manifest outcome exists, and prediction items of any and all types may be used in an effort to predict that outcome. Prediction items may be assessed for their ability to predict recidivism via simple bivariate methods, regression approaches, or, more recently, machine learning models (Hamilton, Neuilly, Lee, & Barnoski, 2014; Oliver, Dieterich, & Brennan, 2014).
However, when attempting to predict a latent concept like substance abuse needs, a variety of psychometric approaches are often used, such as factor analysis, structural equation modeling, and item response theory, to name a few. These analyses are completed in an effort to assess intervention and treatment prioritization for a given needs area/domain. Instruments, such as the LSI-R, have made attempts to combine these efforts, using items to predict manifest recidivism outcomes, while also dividing items into latent domains that may be used for needs assessment. Although rarely discussed, it is imperative that a distinction be drawn between risk and needs assessments: risk assessments make use of any and all item types to predict an observed recidivism outcome and may do so without the need for latent variable approaches. Accordingly, static and dynamic risk and needs items can jointly assist in recidivism prediction. The current study focuses exclusively on risk prediction.
When included in a multivariate model, however, static criminal history items often reduce the impact of dynamic needs items, resulting from issues related to shared variance (see Barnoski & Aos, 2003). Nevertheless, prior findings have generally indicated that dynamic items provide a unique contribution, improve prediction strength, and allow for the possibility of offender risk to decrease over time (Cottle, Lee, & Heilbrun, 2001; Jung & Rawana, 1999; Loeber & Farrington, 1998). The inclusion of dynamic items is therefore necessary if agencies wish to identify reductions in risk.
Item Selection
Instruments are to be composed of items that predict an outcome of interest. The Risk–Need–Responsivity (RNR) model indicates that items are to possess an empirical relationship with recidivism and are thus criminogenic (Andrews & Bonta, 2010). While the importance of psychometrics and latent properties of domains and subscales can be debated, agencies are most concerned with the instrument’s strength of prediction. Therefore, when predicting risk of recidivism, all item types are fair game, including static, dynamic, and any other ethically and theoretically relevant measure. The field has witnessed the use of a mix of clinical experience, bivariate, and multivariate techniques for determining risk scale item inclusion. At a minimum, a significant bivariate association with recidivism is needed to identify an empirical relationship, or to make a determination that the item is a criminogenic predictor. Items lacking this distinction will not add to the prediction of recidivism and may reduce the instrument’s strength, creating prediction noise (Baird, 2009). Multivariate assessments provide a more stringent criterion for item inclusion. Accounting for issues related to shared variance and multicollinearity, regression and other multivariate techniques utilize model assumptions to establish items of importance, removing those which may have a bivariate relationship but fail to affect prediction after accounting for other included measures. A debate within the field has suggested that liberal selection criteria allow for the inclusion of tertiary items and domains (i.e., free time/leisure activities) that divert attention from the core criminogenic measures driving recidivism prediction (Baird, 2009; Wooditch, Tang, & Taxman, 2014).
Item Weighting
A debate among instrument developers concerns the utility of multivariate models to provide greater weight to important items and, in turn, improve predictive performance. Researchers using analytic weights seek to improve prediction performance by ranking variables by relative importance. As studies have indicated (J. Austin, Coleman, Peyton, & Johnson, 2003; Barnoski & Aos, 2003), when predicting recidivism, measures such as criminal history and age are strong predictors, whereas measures such as alcohol use and education attained, although still important, are relatively weaker. Unweighted, or Burgess weighted, 3 models commonly utilize bivariate significance to identify variable importance and treat measures equally, resulting in a simple summation of predictor scores. Burgess weighted tools sum a series of dichotomous items (0/1), and the more generic unweighted tools provide single-unit increases for each increasing risk response (0, 1, 2, etc.). Although unweighted methods may assure that items predict in a theoretically consistent direction at a bivariate level, the direction of effects represents a black box on a multivariate level, a perceived disadvantage. Furthermore, these methods create redundancy, potentially overweighting items’ importance as a result of shared variance.
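The scoring schemes described above can be contrasted in a few lines. The following Python sketch is illustrative only: the item names, responses, and weights are hypothetical and are not drawn from any instrument discussed here, and dichotomizing a count item for the Burgess score is an assumption made for the example.

```python
# Hypothetical item responses for one offender (higher = riskier).
responses = {"prior_felonies": 3, "age_under_25": 1, "alcohol_problem": 1}

# Hypothetical analytic weights, e.g., scaled regression coefficients.
weights = {"prior_felonies": 4, "age_under_25": 3, "alcohol_problem": 1}


def burgess_score(items):
    """Sum of dichotomous (0/1) items: each item counts once if present."""
    return sum(1 for value in items.values() if value > 0)


def unweighted_score(items):
    """Single-unit increase for each increasing risk response (0, 1, 2, ...)."""
    return sum(items.values())


def weighted_score(items, item_weights):
    """Each response is multiplied by its analytic weight before summing."""
    return sum(item_weights[name] * value for name, value in items.items())


burgess = burgess_score(responses)             # 1 + 1 + 1
unweighted = unweighted_score(responses)       # 3 + 1 + 1
weighted = weighted_score(responses, weights)  # 3*4 + 1*3 + 1*1
print(burgess, unweighted, weighted)
```

Under the two unweighted schemes, the alcohol item moves the total as much as a prior felony does; only the weighted score lets criminal history dominate.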
The importance of weights has been debated, with some suggesting that weights provide little performance improvement and are more susceptible to performance shrinkage (Dawes, 1979; Grann & Långström, 2007; Harris, Rice, & Quinsey, 1993; Wainer, 1976), whereas others suggest that substantial improvement is gained when samples are sufficiently large (Einhorn & Hogarth, 1975; Silver, Smith, & Banks, 2000). The use of bivariate selection procedures over multivariate methodologies increases the likelihood of including items that are either weakly, or even negatively, associated with the target outcome. Therefore, without an optimal weighting method, it is possible that many items utilized in unweighted tools dilute accuracy and create prediction noise (Baird, 2009). One of the best illustrations of this concept is Kroner and colleagues’ (2005) use of randomly selected items (drawn from a coffee can) from four unweighted instruments, in which randomly formed models provided performance nearly equivalent to that of their more established counterparts. Additional examinations have demonstrated that 2G assessments using multivariate item selection and analytically weighted items outperform 4G Burgess scored instruments (J. Austin et al., 2003; Barnoski & Aos, 2003).
The practical argument against weighting is that it complicates scoring and face validity for practitioners, which often makes computer automation necessary. However, with the increased use of automation and agency data system integration, this once pragmatic argument is losing ground.
Gender Specificity
Van Voorhis, Wright, Salisbury, and Bauman (2010) were instrumental in identifying the theoretical need for assessments to be separated by gender. Aside from the fact that certain items have been shown to be more predictive for female than male offenders (Andrews et al., 2012; Smith, Cullen, & Latessa, 2009), there is a logical argument and empirical evidence that the two genders represent separate populations (Else-Quest, Higgins, Allison, & Morton, 2012). Creating gender-neutral assessments restricts considerations of gendered distinctions as they relate to system rehabilitative practices and institutional culture. Gender-neutral risk assessments may further limit a clinician’s ability to develop individual treatment plans (Hannah-Moffat, 2009). Indeed, practitioners prioritize management in a gendered manner (Britton, 2003; Freiburger & Hilinski, 2010; Frohmann, 1997; Kruttschnitt & McCarthy, 1985; Miller, 1999; Spohn, Beichner, & Davis-Frenzel, 2001) due to variant pathways men and women take toward criminality (Blanchette & Brown, 2006; Brennan, Breitenbach, Dieterich, Salisbury, & Van Voorhis, 2012; Daly, 1992, 1994; Salisbury & Van Voorhis, 2009; Sampson & Laub, 1993). Segregated by gender, incarceration is an obvious way the criminal justice system deals with females differently than males.
Three methods to make an instrument gender-specific are described here. First, an instrument can be created and scored as gender neutral, with risk category cut points manually adjusted so that fewer female offenders score as high risk. Second, an instrument may utilize gender as a predictor, or risk assessment item, encapsulating all gender variations in a single measure. A third method, discussed here as gender-specific, selects and weights prediction items for the separate gender subsamples. Beyond the potential improved predictive performance, gender-specific assessments provide item context and description that can assist in case management and, in turn, improve face validity and responsivity. Finally, gender-specific assessments start with women in mind, using items and scales that are formatted specifically to address the criminal pathways and needs of female offenders.
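The three approaches can be made concrete with a small Python sketch. It is a schematic illustration rather than any instrument's actual scoring: every item, weight, and cut point below is hypothetical and chosen only to show where gender enters the calculation in each method.

```python
def risk_level(score, cut_points):
    """Return the first label whose threshold the score meets, else 'Low'."""
    for threshold, label in cut_points:
        if score >= threshold:
            return label
    return "Low"


def total(responses, weights):
    """Weighted sum; items absent from the weight table contribute nothing."""
    return sum(weights.get(name, 0) * value for name, value in responses.items())


# Hypothetical weights and cut points (thresholds listed from highest to lowest).
NEUTRAL_WEIGHTS = {"prior_felonies": 3, "unemployed": 2}
COMMON_CUTS = [(10, "High"), (5, "Moderate")]
CUTS_BY_GENDER = {"male": [(8, "High"), (4, "Moderate")],
                  "female": [(10, "High"), (5, "Moderate")]}
GENDER_ITEM_WEIGHT = 2  # points added for male offenders in method 2
WEIGHTS_BY_GENDER = {"male": {"prior_felonies": 4, "unemployed": 1},
                     "female": {"prior_felonies": 2, "unemployed": 3}}


def method_1(responses, gender):
    """Gender-neutral score; only the cut points differ by gender."""
    return risk_level(total(responses, NEUTRAL_WEIGHTS), CUTS_BY_GENDER[gender])


def method_2(responses, gender):
    """Gender is one more weighted item; a single set of cut points."""
    bonus = GENDER_ITEM_WEIGHT if gender == "male" else 0
    return risk_level(total(responses, NEUTRAL_WEIGHTS) + bonus, COMMON_CUTS)


def method_3(responses, gender):
    """Items are selected and weighted separately within each gender."""
    score = total(responses, WEIGHTS_BY_GENDER[gender])
    return risk_level(score, CUTS_BY_GENDER[gender])


responses = {"prior_felonies": 2, "unemployed": 1}
for method in (method_1, method_2, method_3):
    print(method.__name__, method(responses, "male"), method(responses, "female"))
```

Only the third method allows an item to matter more, less, or not at all for one gender, which is the sense in which the text reserves the term gender-specific for it.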
Specified Outcomes
One of the biggest practical problems in designing a risk assessment is measuring and defining the recidivistic outcome to be predicted, as there may be more than one, or it may differ by jurisdiction. First, one must identify a source of data to evaluate the recidivism outcome of interest. While some states provide recidivism outcomes at the state level, some states do not have an integrated system, making risk assessment development more difficult.
Beyond this basic data gathering issue, all types of recidivism are not created equal. Kroner and colleagues (2005) argued that greater performance will be realized when selected items are more directly linked to offense-specific outcomes, where identifying violent recidivists may be more of a concern than identifying those whose recidivism consists solely of drug or property offenses. Thus, improved predictive performance may potentially be achieved through offense-specific outcomes related to crime severity, creating separate models for violent and nonviolent offenses.
Methodological Variations among Six Contemporary Recidivism Risk Assessment Instruments
Although there are many more examples of methodological design variations in practice, the five described are the focus of the current discussion and are observed to vary among prominent tools used today. To provide noteworthy examples of these design variations, we reviewed six contemporary assessment instruments. Table 1 summarizes the descriptive variations among the selected instruments as well as the instrument developed for Washington State—The Static Risk Offender Need Guide for Recidivism (the STRONG-R)—followed by a brief description of each instrument to supplement the table summary.
Table 1. Development Design Characteristics of Six Contemporary Recidivism Risk Assessment Instruments
Note. LSI-R = Level of Service Inventory–Revised; LS/CMI = Level of Service/Case Management Inventory; COMPAS = Correctional Offender Management Profiling for Alternative Sanctions; SRA = Static Risk Assessment; ORAS = Ohio Risk Assessment System; WRNA = Women’s Risk Needs Assessment; MnSTARR = Minnesota Screening Tool Assessing Recidivism Risk; STRONG-R = Static Risk Offender Need Guide for Recidivism; X = model contains characteristic.
Gender responsive scales for female offenders are available; however, the COMPAS does not select and weight all model items separately for male and female offenders.
The ORAS does not select and weight items separately for male and female offenders but instead alters the risk category cut points to include fewer females in higher risk categories.
The WRNA does select items specific for females but does not provide a model to predict recidivism for male offenders.
LSI-R and the Level of Service/Case Management Inventory (LS/CMI)
The LSI was developed by Don Andrews and consists of a “single-sheet inventory with 62 ‘zero–one’ items which would fit in officers’ case-books” (Andrews, 1982, p. 3). Considered a 3G instrument, the revised LSI-R made theoretical improvements, classifying static and dynamic items into 10 psychometric domains. The 4G LS/CMI was developed to assist in case management, adding responsivity components (Andrews, Bonta, & Wormith, 2004). All three of these instrument variations were developed similarly, predicting general recidivism, using bivariate item selection, unweighted items, and designed to be gender neutral. Although widely used, the instrument is not weighted or customized by jurisdiction. Notably, Smith et al. (2009) meta-analyzed 27 effect sizes and found moderate effects of predictive validity for males (area under the curve [AUC] = .64) and females (AUC = .66). 4 Similarly moderate-to-weak effects (AUC = .615-.646) were found in a recent examination of the LSI-R in New Jersey, with generally weaker recidivism prediction effects identified for female parolees (Ostermann & Herrschaft, 2013).
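Because AUC values recur throughout these instrument descriptions, a small worked example may help with interpretation. The sketch below, in Python, uses the standard rank-based reading of the AUC: the probability that a randomly chosen recidivist received a higher risk score than a randomly chosen nonrecidivist, with ties counted as one half, so that .50 corresponds to chance. The scores and outcomes are hypothetical.

```python
def auc(scores, outcomes):
    """Rank-based AUC: share of recidivist/nonrecidivist pairs ordered correctly."""
    recidivists = [s for s, o in zip(scores, outcomes) if o == 1]
    others = [s for s, o in zip(scores, outcomes) if o == 0]
    wins = 0.0
    for r in recidivists:
        for n in others:
            if r > n:
                wins += 1.0   # recidivist scored higher: correct ordering
            elif r == n:
                wins += 0.5   # tie: counted as half
    return wins / (len(recidivists) * len(others))


# Hypothetical risk scores and 0/1 recidivism outcomes for six offenders.
scores = [1, 2, 3, 4, 5, 6]
outcomes = [0, 0, 1, 0, 1, 1]
example_auc = auc(scores, outcomes)   # 8 of 9 pairs are ordered correctly
constant_auc = auc([1, 1, 1, 1], [0, 1, 0, 1])  # an uninformative score
print(round(example_auc, 3), constant_auc)
```

Read this way, an AUC of .64 means that roughly 64 of every 100 recidivist and nonrecidivist pairs are ranked in the expected order.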
Correctional Offender Management Profiling for Alternative Sanctions (COMPAS)
The COMPAS, a 4G assessment, was developed by Northpointe, Inc. (Brennan & Oliver, 2000). The COMPAS claims greater specificity by making use of gender-specific scales for female offenders 5 ; however, it does not select and weight items separately for male and female offenders (Brennan, Dieterich, & Ehret, 2009). Utilized in several states, the tool is calibrated and normed for each jurisdiction. The COMPAS addresses potential limitations of previous generations, including multivariate item selection, analytical weighting of predictors, bootstrap validation procedures, 6 general and offense-specific (e.g., violent) recidivism outcome models, and software integration to provide automation with agency information systems (Brennan, Dieterich, Breitenbach, & Mattson, 2009). Evaluations of predictive validity indicate moderate-to-strong effects. A noted concern of the tool is that, although subjects are assessed on 100 plus items, roughly 18 to 20 items are used to predict recidivism.
Women’s Risk Needs Assessment (WRNA)
Developed in 1999, the WRNA has undergone validation studies and subsequent revisions—currently Version 6. Recognizing an empirical gap in addressing the differing needs of men and women offenders, the WRNA was created with a focus on gender-specific need scales, measuring factors such as limited self-efficacy, adult victimization, and parental stress, which have been identified as uniquely predictive factors for women (Salisbury, Van Voorhis, & Spiropoulos, 2009; Van Voorhis & Presser, 2001). It also contains scales of items more prevalent and predictive among female offenders, including relationships, mental health issues, safety, poverty, abuse and victimization, and educational attainment (Salisbury et al., 2009). Developed as a 4G general recidivism assessment instrument, the WRNA made use of bivariate item selection, unweighted items and split sample validation. The instruments have been built as a system of tools, making predictions for pretrial, institutions, prerelease, and community supervision. Although considered gender-specific for female offenders, a male version/component of the WRNA has not been developed, restricting its utility as a general population assessment.
ORAS
The ORAS, developed by Latessa and colleagues (2009), is a 4G instrument designed to predict recidivism risk at four stages in the system—pretrial, prison, reentry, and community supervision. The ORAS was initially developed for Ohio’s correctional system and is currently being tested elsewhere (e.g., Indiana, Florida, and Texas); however, it is not weighted or normed by jurisdiction. Although more data and analyses have been collected since the initial implementation, there were several notable development concerns, including relatively small development samples, 7 less robust validation efforts, 8 nonfixed and relatively short follow-up durations, and the use of unweighted predictor items. Designed as gender neutral, the ORAS provides a gender-specific modeling component, making use of a manual adjustment of the female cut points. Based on initial findings, each model possesses small-to-moderate effect sizes for predicting rearrest. 9
MnSTARR
Duwe (2014) developed the 4G MnSTARR. The MnSTARR uses gender-specific models and classifies risk among four reconviction types. Using multivariate selection and bootstrap validation techniques to model more than 100 identifiers, the MnSTARR reports strong predictive performance (AUCs range from 0.73 to 0.81). However, the MnSTARR makes use of a purely data-driven weighting scheme, potentially impacting the theoretical/face validity with which risk scores are calculated.
Washington State Static Risk Assessment (SRA)
Prior to 1999, assessing risk in the Washington State adult offender population was not an overarching system goal. The Offender Accountability Act of 1999 (OAA) was enacted by the state legislature, explicitly adding sentencing policy goals to (a) classify felony offenders according to their risk for future offending and (b) deploy a greater amount of resources to high-risk offenders. Initially, the WADOC adopted the LSI-R. Through a legislative directive to the Washington State Institute for Public Policy (WSIPP), Barnoski evaluated the LSI-R’s use in Washington and found several issues regarding its predictive performance (Barnoski & Drake, 2007).
Following these discoveries, Barnoski created a new assessment—The Washington State SRA. Using analytic weighting of age, gender, and 25 criminal history measures, the SRA created a four-category risk scale (High Violent, High Nonviolent, Moderate, and Low) using both general felony and offense-specific outcome models (Barnoski & Drake, 2007). Considered a 2G instrument, the SRA vastly improved predictive discrimination for its Washington State offender sample. Ultimately, the instrument was adopted based on its increased predictive accuracy, prediction of three types of high-risk offenders (drug, property, and violent), increased objectivity, and less time to completion. For 3 years, the initial and primarily data-driven, rather than interview-based, version of the SRA was used to determine levels of supervision in the community, intervention eligibility, and guide community contact standards.
During the SRA’s initial rollout in 2007, the WADOC internally assembled a team to develop a needs assessment instrument, the Offender Needs Assessment (ONA). Disappointed with the item construction of nationally recognized instruments, the WADOC sought to make the ONA more objective and reliable, focusing item content on an offender’s exhibited behaviors/characteristics. The items were designed to inhibit inaccurate reporting by offenders by constructing responses with the potential to be confirmed by record review (e.g., “The offender respects personal property but not public/business property”), rather than relying on an offender’s self-selection of philosophical-based responses (e.g., “A hungry person has the right to steal”). The instrument consists of eight domains and 56 primary questions, of which 40 are of the select-all-that-apply type, allowing for multiple responses to a given item. In all, the ONA allows for 358 unique responses. The ONA was implemented in August of 2008, shortly after the SRA, and has been completed by WADOC staff for all supervised offenders. The initial goal of the ONA was to assist case managers’ program referrals after the offender’s risk level was established by the SRA. A secondary goal of the WADOC was to collect data for a sufficient duration of time with the intent of using items more strategically at a later date. Based on this collect-and-see approach, a large pool of cases and variables was collected and made available for risk assessment development purposes.
During the first years of the SRA, two key policy changes were implemented. First, in an effort to reduce correctional costs and better align supervision with RNR principles, low and moderate risk offenders were removed from community supervision. Second, the legislature enacted the policy termed Swift and Certain, which changed (i.e., increased) the frequency of recorded technical violations, a key SRA predictor. In addition, scoring modifications were needed to improve the SRA’s face validity, where negatively weighted items (e.g., homicide, sex offenses) were removed. With these considerations in mind, Barnoski (2010) created the Static Risk Assessment, Version 2 (SRA2).
In 2012, the WADOC restructured and expanded the scope of its evidence-based programming. A noted impediment for prioritization was the lack of a validated risk-needs instrument to guide this process. In 2013, the WADOC partnered with Washington State University to initiate the development of the STRONG assessment system, a collection of instruments for the purposes of increasing prediction performance, measuring reductions in risk via the inclusion of dynamic items, and creating greater gender specificity and better prioritization of programming. The current study describes the development of one element of the STRONG assessment system, specifically, the models and instrument created to predict felony recidivism. The instrument was designed to adjust for many of the methodological issues described previously: it includes both static and dynamic items, uses multivariate selection of measures, analytically weights items, provides gender-specific scoring, incorporates methods to ensure face validity, uses a 2-year fixed follow-up, provides both general felony and offense-specific models of recidivism, and utilizes state-of-the-art validation procedures.
Method
The primary intent of the study is to examine offender risk assessment model development variations. This was completed in an attempt to isolate the potential incremental improvements in performance. To examine these variations, the current section describes the development of the STRONG-R and the prospective predictive modeling methods considered.
Sample
The study sample includes subjects who (a) were convicted of a felony, 10 (b) were supervised by the WADOC, 11 (c) received an SRA and ONA assessment, 12 and (d) possessed a minimum 24-month follow-up in the community. Reconvictions were operationalized as an outcome following the event resulting in WADOC supervision. 13 Ample duration following an offender’s placement in the community was needed to adequately measure recidivism. As the vast majority of offenders recidivate within 18 months in the community (Hamilton & Campbell, 2013; Taxman & Thanner, 2006) and dynamic needs have been demonstrated to have the greatest impact on recidivism at roughly 12 months (Wooditch et al., 2014), a 24-month fixed follow-up was used, as it provided sufficient outcome events, yet retained an adequate sample needed for modeling purposes. This definition also conforms to standards developed for the Washington State legislature by WSIPP (Barnoski, 1997). Given that the ONA was implemented in late 2008, the sample was restricted to releases between August 2008 and December 2010. The total sample size for the study was 44,010.
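The fixed follow-up logic can be sketched as follows. This Python example is a simplified illustration, not the study's data-processing code: the function names, the data cutoff date, and the use of a single reconviction date per person are assumptions, and the remaining eligibility criteria (felony conviction, WADOC supervision, completed SRA and ONA) are omitted.

```python
from datetime import date

WINDOW_START = date(2008, 8, 1)    # releases from August 2008 ...
WINDOW_END = date(2010, 12, 31)    # ... through December 2010


def follow_up_end(release):
    """Date 24 months after release (a Feb 29 release falls back to Feb 28)."""
    try:
        return release.replace(year=release.year + 2)
    except ValueError:
        return release.replace(year=release.year + 2, day=28)


def in_sample(release, data_end):
    """Release is inside the study window and a full 24 months is observed."""
    return WINDOW_START <= release <= WINDOW_END and follow_up_end(release) <= data_end


def recidivated(release, reconviction):
    """1 if a reconviction falls inside the fixed 24-month window, else 0."""
    if reconviction is None:
        return 0
    return int(release < reconviction <= follow_up_end(release))


data_end = date(2012, 12, 31)  # hypothetical last date covered by the records
early = recidivated(date(2009, 3, 15), date(2010, 1, 10))  # inside the window
late = recidivated(date(2009, 3, 15), date(2011, 6, 1))    # after 24 months
print(in_sample(date(2009, 3, 15), data_end), early, late)
```

A fixed window treats every subject identically: a reconviction at month 27 is scored as a nonevent for all cases, so outcome rates are not inflated for those released earlier.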
Predictors
The acronym STRONG-R references the combination of two data sources, the Static Risk (STR) and Offender Needs Guide (ONG)—for Recidivism (R). The STR references the inclusion of SRA instrument items, while the ONG represents items gathered from the ONA. SRA items are retrieved from an offender’s criminal history via software automation. WADOC correctional events were added to the automated item pool, including: infractions, violations, visitations, and interventions received. Second, needs items were collected from ONAs, which require an interview designed to be completed within 45 to 60 min. The ONA has been included as part of the WADOC training academy since its initial implementation. In prison, it is provided by case managers while community corrections officers provide the assessment for community supervised cases. 14
Item Selection Criteria
A series of multiple binary logistic regression models were used to select and weight predictors. However, as previously discussed, item selection procedures that are purely data-driven can be problematic. That is, items may predict in an unanticipated direction, causing an illogical scoring schematic (Wainer, 1976). To adjust for this potential result, modified item selection procedures were constructed to identify highly predictive items, while eliminating those predicting in an illogical direction. First, each predictor item was initially examined for theoretical/logical directionality. Items in which a consensus of prior findings indicated a likely reduction in predicted recidivism were reverse coded to enable all measures to weight in a consistent (positive) direction. All model predictors are described in Table 2, where reverse coded measures are indicated with an R following the item label.
STRONG-R Descriptive Statistics of Selected Items (N = 44,010)
Note. R indicates items reverse coded. STRONG-R = Static Risk Offender Need Guide for Recidivism.
We prevented the inclusion of illogically weighted items via a software solution. Using the R programming language, a selection procedure was created to prevent items possessing a negative logit value from being included. We feel this is a novel solution for a common instrument development need—the prevention of illogical weighting. In addition, based on Steyerberg, Eijkemans, and Habbema’s (1999) discussion of underfitting, items were selected based on model improvement identified via the Akaike Information Criterion (AIC), 15 as removing predictors based on a more arbitrary threshold (p < .05) can lead to a loss of predictive performance and create multicollinearity issues. Using these two criteria—positive logit and AIC value improvement—item selection procedures were completed using a forward stepwise method, and items which failed to reach the predefined criteria were removed. 16
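The two selection criteria can be illustrated with a minimal Python sketch. It is not the authors' R procedure: the Newton-Raphson fitter, the item names (`prior_felony`, `employed`, `noise`), and the simulated data are all hypothetical, and the sketch assumes that every slope in a candidate model, not only the newly added one, must remain positive.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def prob(beta, row):
    """Predicted probability from the logistic model."""
    return 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, row))))

def fit_logit(columns, y, iters=25):
    """Fit a logistic regression by Newton-Raphson.
    Returns (coefficients with the intercept first, AIC)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    beta = [0.0] * p
    for _ in range(iters):
        mu = [prob(beta, r) for r in rows]
        grad = [sum(r[j] * (yi - m) for r, yi, m in zip(rows, y, mu)) for j in range(p)]
        hess = [[sum(r[j] * r[k] * m * (1.0 - m) for r, m in zip(rows, mu))
                 + (1e-9 if j == k else 0.0) for k in range(p)] for j in range(p)]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    loglik = sum(math.log(prob(beta, r)) if yi else math.log(1.0 - prob(beta, r))
                 for r, yi in zip(rows, y))
    return beta, 2 * p - 2 * loglik

def forward_select(items, y):
    """Forward stepwise selection: add the item that lowers AIC the most,
    but only if all slopes in the candidate model are positive (assumption)."""
    selected = []
    _, best_aic = fit_logit([], y)  # intercept-only baseline
    while True:
        best = None
        for name in items:
            if name in selected:
                continue
            beta, aic = fit_logit([items[s] for s in selected + [name]], y)
            if min(beta[1:]) > 0 and aic < best_aic and (best is None or aic < best[1]):
                best = (name, aic)
        if best is None:
            return selected
        selected.append(best[0])
        best_aic = best[1]

# Simulated (hypothetical) data: one risk item, one protective item left
# un-reversed, and one item unrelated to the outcome.
rng = random.Random(0)
n = 600
items = {name: [rng.randint(0, 1) for _ in range(n)]
         for name in ("prior_felony", "employed", "noise")}
y = []
for i in range(n):
    z = -0.5 + 1.2 * items["prior_felony"][i] - 1.2 * items["employed"][i]
    y.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-z)) else 0)

selected = forward_select(items, y)
print(selected)
```

In this toy setting the protective item is excluded because its logit is negative, which is the situation the reverse coding described earlier is meant to resolve before selection.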
Item Selection and Validation Procedures
Readers should note that we selected STRONG-R measures from a large pool of potential items. A list of selected items and responses used in the STRONG-R is provided in Appendix I (see http://wsicj.wsu.edu). For a full list of SRA items and responses, refer to Barnoski and Drake (2007, Appendix A), and similarly, for a complete list of ONA items and responses, see Appendix II (see http://wsicj.wsu.edu). Furthermore, due to the large sample size and extensive list of potential items, the models required substantial Random Access Memory (RAM) to compute. To improve computational speed, we uploaded and processed models using Amazon Web Services (AWS). Utilizing 10 servers for each model, the run times ranged from 3 to 13 days.
For item selection, all measures were included in a forward stepwise logistic regression model. To create a more stable prediction model, bootstrapping procedures were used to select items and assess predictive performance (P. C. Austin & Tu, 2004). Bootstrapping procedures select cases from the full data set with replacement until the number of cases selected for the bootstrap draw reaches the total number of subjects in the full sample. Because draws are completed with replacement, some subjects are selected once, others more than once, and others not at all, which makes created samples similar to the full sample but differences are substantial enough to prevent the influence of anomalies and outliers on the selection procedure. To complete the item selection procedure, 100 bootstrap samples were drawn. Again, items were selected in each stepwise computation if they possessed a positive logit and improved the model AIC. 17 Items that met the two selection criteria in more than half (51%) of the bootstrap samples were retained. 18
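The retention rule can be sketched as follows, under stated assumptions. The function `toy_select` is a hypothetical stand-in for the stepwise procedure (it keeps an item when its mean is higher among recidivists), and the data and item names are invented for illustration. Only the resampling and the "more than half of the draws" rule are taken from the text.

```python
import random
from collections import Counter

def retained_items(select_fn, n_cases, n_boot=100, threshold=0.5, seed=0):
    """Return items selected in more than `threshold` of the bootstrap samples."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_boot):
        # One bootstrap sample: n_cases indices drawn with replacement.
        idx = [rng.randrange(n_cases) for _ in range(n_cases)]
        counts.update(set(select_fn(idx)))
    return sorted(name for name, c in counts.items() if c / n_boot > threshold)

# Hypothetical toy data: the outcome alternates 0, 1, 0, 1, ...
n = 200
outcome = [i % 2 for i in range(n)]
items = {
    "item_a": outcome[:],                        # rises with the outcome
    "item_b": [1 - v for v in outcome],          # falls with the outcome
    "item_c": [(i // 2) % 2 for i in range(n)],  # unrelated to the outcome
}

def toy_select(idx):
    """Stand-in selector: keep items whose mean is clearly higher for recidivists."""
    pos = [i for i in idx if outcome[i] == 1]
    neg = [i for i in idx if outcome[i] == 0]
    if not pos or not neg:
        return []
    chosen = []
    for name, col in items.items():
        diff = sum(col[i] for i in pos) / len(pos) - sum(col[i] for i in neg) / len(neg)
        if diff > 0.1:
            chosen.append(name)
    return chosen

retained = retained_items(toy_select, n)
print(retained)
```

An item that happens to pass the selector in a few draws, as `item_c` may, is unlikely to clear the majority threshold, which is how the rule guards against selections driven by anomalies in a single sample.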
Finally, to ensure face validity, models and their associated items were reviewed by subject matter experts (SMEs) at the WADOC. The development team of SMEs was assembled from what was described as a cross-section of WADOC employees, including community corrections officers, case managers, supervisors, information technology staff, and upper management. The SMEs were tasked with identifying items that were absent from a given model but that theoretical and practical experience indicated should be present. These additional items were manually added (forced) into construction samples and were no longer required to meet the AIC improvement criterion for selection. They were included in the final models if they retained a positive logit, despite having a potentially low item weight. 19
Predictive validity of the created models was assessed using a process similar to those identified in prior studies (see Duwe, 2014; Harrell, Lee, & Mark, 1996; Steyerberg, Harrell, Borsboom, Vergouwe, & Habbema, 2001). As described previously, subjects were selected with replacement to form the 100 development samples. These subjects are considered the in boot sample. These samples were used to select and weight items for each model. Those subjects not selected in a given draw were retained to form the out of boot samples and, on average, comprised roughly one third of the total sample. Utilizing these leftover cases from each bootstrap draw, 100 samples of unused cases were created. For our sampling procedure, the in boot subjects represent the construction sample while the out of boot subjects represent the validation sample for each bootstrap iteration. Predictive performance was assessed using out of boot samples, summarizing performance of the created models across these 100 samples. Model performance criteria were computed on all four outcomes and across each gender, for a total of eight models. It should be noted that out of boot subjects were not sampled with replacement and were used only once in a given validation sample iteration. This validation procedure is not unique to our study design and was developed previously to assess prediction model performance (see Steyerberg et al., 2001).
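The split into construction and validation cases can be sketched in a few lines of Python. The sample size and seed below are arbitrary illustrations. The expected out of boot share follows from the sampling itself: a case is missed by all n draws with probability (1 − 1/n)^n, which approaches 1/e, or about 0.368, for large n.

```python
import random

def in_and_out_of_boot(n, rng):
    """Draw one bootstrap sample of size n and return (in_boot, out_of_boot) indices."""
    in_boot = [rng.randrange(n) for _ in range(n)]            # with replacement
    drawn = set(in_boot)
    out_of_boot = [i for i in range(n) if i not in drawn]     # each unused case once
    return in_boot, out_of_boot

rng = random.Random(42)
n = 5000  # illustrative size, far smaller than the study sample
fractions = []
for _ in range(100):
    in_boot, out_of_boot = in_and_out_of_boot(n, rng)
    fractions.append(len(out_of_boot) / n)

mean_oob = sum(fractions) / len(fractions)
print(round(mean_oob, 3))
```

In a full pipeline, items would be selected and weighted on `in_boot` and the performance statistics computed on `out_of_boot` for each of the 100 iterations.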
Comparison of Methodological Variations
The STRONG-R is designed to predict recidivism, using static and dynamic items to predict felony reconvictions. Although dynamic items that assess offender needs are included, we have not developed a needs assessment and thus do not use factor analysis or psychometric scaling to produce subscales or need domains. Instead, we sought to achieve greater predictive performance by selecting methods for optimal recidivism risk assessment development. The STRONG-R instrument was created after an assessment of potential optimal designs, examining each of the five methodological considerations: (a) gender specificity, (b) analytic item weighting, (c) multivariate item selection, (d) inclusion of dynamic items, and (e) offense-specific outcome modeling.
Gender-Neutral Versus Gender-Specific Models
First, gender-specific modeling was used in the development of the STRONG-R and was pursued to improve predictive performance via the development of models using separate male and female samples. To examine the incremental improvement gained, we created a Gender-Neutral model, selecting and weighting predictive items from the STRONG-R using the combined gendered sample (N = 44,010). This model was then used to predict felony conviction outcomes for the male and female samples.
Unweighted Versus Analytic Item Weighting
The STRONG-R makes use of analytic weights, converting model logits into response multipliers. To examine comparative performance by analytic weighting of items, the STRONG-R General Felony model was stripped of its weights and recreated through a simple summing of selected items’ raw values. This unweighted (or Burgess-style) model was used to predict felony outcomes for our male and female samples.
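The contrast between the two scoring approaches can be shown with a small Python sketch. The item names, logit values, and the rule for converting logits to multipliers (multiply by 10 and round) are assumptions for illustration only; the text does not specify how the STRONG-R conversion is carried out.

```python
# Hypothetical logits for three binary items (not STRONG-R estimates).
logits = {"prior_violent_felony": 0.82, "age_under_25": 0.37, "unstable_housing": 0.12}

# Assumed conversion of logits into integer response multipliers.
weights = {name: round(value * 10) for name, value in logits.items()}

def burgess_score(responses):
    """Unweighted (Burgess-style) score: a simple sum of raw item values."""
    return sum(responses.values())

def weighted_score(responses, item_weights):
    """Analytically weighted score: each response times its item multiplier."""
    return sum(item_weights[name] * value for name, value in responses.items())

offender_a = {"prior_violent_felony": 1, "age_under_25": 0, "unstable_housing": 0}
offender_b = {"prior_violent_felony": 0, "age_under_25": 0, "unstable_housing": 1}

print(burgess_score(offender_a), burgess_score(offender_b))
print(weighted_score(offender_a, weights), weighted_score(offender_b, weights))
```

The two hypothetical offenders receive the same unweighted score but different weighted scores, which is the distinction the comparison model is designed to test.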
Bivariate Versus Multivariate Item Selection
Multivariate item selection was also used to develop the STRONG-R. As described, this design variation was anticipated to improve predictive performance over bivariate selection procedures through the removal of prediction noise. We compared multivariate item selection with bivariate selection by returning to the Assessment Item Pool 20 and creating a model by selecting items using bivariate significance (p < .05) as the only selection criterion.
Static Versus Dynamic Items
Through the addition of dynamic needs items, the STRONG-R was built with the intention of improving recidivism prediction compared with the WADOC’s SRA2. The SRA2 makes use of analytic weighting; however, it was designed to be gender neutral and utilized only static criminal history items (plus age at assessment). We compared the STRONG-R with the SRA2 to identify incremental performance differences based on the inclusion of dynamic items.
Combined Incremental Effects
As described, two commonly used instruments (i.e., the LSI-R and ORAS) utilize both static and dynamic items but make use of models that are unweighted and gender neutral, with items selected based on bivariate (not multivariate) significance. We created an additional model using these specifications, which was intended to provide an external check as compared to these commonly utilized development methods.
In addition, we sought to examine the full range of incremental improvements in performance. Here, we created a model that used only static criminal history items (and age at assessment), bivariate selection, unweighted scoring, and was gender neutral. It was anticipated that this model would have the weakest comparative performance.
Felony Versus Offense-Specific Outcomes
A final design variation examined the use of outcome-specific modeling. As several of the reviewed instruments make use of only a general assessment of recidivism, improved predictive performance was assumed to be achieved via STRONG-R’s item selection and weighting of violent, property, and drug felony reconvictions. Using the STRONG-R General Felony model as a reference, we compared model AUC performance of the three STRONG-R specified Violent, Property, and Drug models.
Validation Statistics
To assess comparative model performance on discrimination, calibration, and accuracy, we provide validation statistics and, where appropriate, compare performance findings between the instruments developed by the various methods described. Discrimination is a model’s ability to separate recidivists from nonrecidivists. Calibration is the degree of agreement between estimated and observed events. Accuracy (ACC) is the proportion of cases correctly classified. Initially, a model creates a risk score on a continuous scale, on which validity is assessed. For the current study, the AUC is used as a measure of the continuous risk score’s global discrimination. 21 Calibration was examined via the overall error (CALerr), or slope, which is the difference between the expected probability and the proportion of the observed outcome (see Tollenaar & van der Heijden, 2013). Entropy is a further measure of calibration and assesses the amount of disorder in the prediction, or how mixed the data set is with regard to the target variable values, where lower values indicate improved performance. A combined measure of discrimination, calibration, and accuracy was also computed. Termed the SAR (squared error, accuracy, and Receiver Operating Characteristic [ROC] area), it consists of (AUC + ACC + (1 − RMSE)) / 3, where RMSE stands for Root Mean Square Error (Caruana & Niculescu-Mizil, 2006). Although AUC discrimination is commonly given the lion’s share of consideration when evaluating a model’s predictive performance, each of these additional metrics is included to describe the broader array of performance elements considered.
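The combined SAR measure can be computed as in the following Python sketch. The predicted probabilities and outcomes are invented for illustration, and the 0.5 classification cut point used for accuracy is an assumption, as the text does not state the threshold applied.

```python
import math

def auc(scores, labels):
    """Chance that a random recidivist outscores a random nonrecidivist (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def accuracy(probs, labels, cut=0.5):
    """Proportion of cases correctly classified at the assumed cut point."""
    return sum((p >= cut) == (y == 1) for p, y in zip(probs, labels)) / len(labels)

def rmse(probs, labels):
    """Root mean square error between predicted probabilities and outcomes."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels))

def sar(probs, labels, cut=0.5):
    """SAR = (AUC + ACC + (1 - RMSE)) / 3."""
    return (auc(probs, labels) + accuracy(probs, labels, cut) + (1 - rmse(probs, labels))) / 3

# Hypothetical predictions for six cases (1 = reconvicted).
probs = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]

sar_value = sar(probs, labels)
print(round(auc(probs, labels), 3), round(accuracy(probs, labels), 3), round(sar_value, 3))
```

Because the three components are averaged with equal weight, a model cannot reach a high SAR through discrimination alone; poor calibration raises the RMSE and lowers the combined value.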
Results
Descriptive statistics are presented in Table 2 for the instrument’s 140 items, separated by gender. Twenty-eight of the items are collected from agency records and are auto-populated via software integration with the WADOC’s information system. As findings demonstrate, not all items were selected in every model, but items included in the table were found to meet selection criteria in at least one of the eight models created. Cells without values identify items that were not selected and thus not scored for a given gender’s model.
Multivariate stepwise regression analyses results using the STRONG-R methodology are displayed in Table 3. Average AIC values are provided as an assessment of overall model effect. In addition, a calculation of outcome Events per Variable (EPV) is presented, as Steyerberg and colleagues (2001) identified that reliable estimates of bootstrap resampling occur when a model’s EPV is 10 or greater. Standardized coefficient values are presented to allow readers to compare variables’ relative prediction strength.
STRONG-R Standardized Coefficient Estimates for Multiple Logistic Regression Models
Note. R indicates items reverse coded. STRONG-R = Static Risk Offender Need Guide for Recidivism; STG = Security threat group; EPV = events per variable; AIC = Akaike Information Criterion.
Indicates the item was manually added without the mentioned AIC improvement criterion.
Table 4 presents the STRONG-R’s predictive validity measured by the model’s discrimination, calibration, and accuracy. Findings indicate strong effect sizes for validation AUC values (see Rice & Harris, 2005), ranging from 0.72 to 0.78. Accuracy (ACC) ranges from 72% to 94% of cases correctly classified. Entropy calibration measures range from 0.15 to 0.53, with comparatively better performance identified for the offense-specific outcome models compared to the General Felony models. The slope demonstrates greater calibration values of felony models. All STRONG-R models demonstrated a satisfactory level of calibration and indicate comparatively high values for the Male Property and Female Violent models. Finally, SAR values range from 0.68 to 0.82, which are in line with prior recidivism model findings (see Tollenaar & van der Heijden, 2013). When examining this combined measure, felony models demonstrate weaker performance compared with the offense-specific outcome models, where the Violent model showed the best performance for females and the Property model was indicated as the top model for males.
Measures of Predictive Validity for STRONG-R Models
Note. STRONG-R = Static Risk Offender Need Guide for Recidivism; AUC = area under the curve; ACC = accuracy; SAR = squared error, accuracy, and ROC area.
Comparing Predictive Validity of Different Development Methodologies
Table 5 presents validation model performance values comparing the STRONG-R to six model variations in development methodologies for both male and female samples. Because comparison findings were similar across specified model outcomes, for brevity we only compared instruments on General Felony model outcomes. The STRONG-R (Model 0) and the six comparison models are as follows:
0. STRONG-R—Created using multivariate selection, dynamic and static, analytically weighted items, and gender-specific models.
1. Gender Neutral—A version of STRONG-R, selecting and weighting items using the combined gendered sample.
2. Unweighted—The STRONG-R General Felony model stripped of weights and a simple sum of selected item values.
3. Bivariate Selection—Models selecting items from the entire item pool using only bivariate significance (p < .05).
4. Static Only and Gender Neutral—Representing the SRA2, these models consisted of only static criminal history items (and age) using a Gender-Neutral model.
5. Bivariate Selection, Unweighted, and Gender Neutral—Combining Design Variations 1, 2, and 3, these models were similar to the development methods of several nationally recognized risk assessments (i.e., LSI-R and ORAS).
6. Static Only, Bivariate Selection, Unweighted, and Gender Neutral—These models combine Design Variations 1, 2, 3, and 4.
Comparing Predictive Validity of Instrument Design Variations Using Felony Outcomes
Note. Bolded figures indicate the top performing model for a given metric; the STRONG-R utilized multivariate item selection, dynamic and static items, gender-specific models, and analytically weighted items. H+ refers to the Hand statistic provided for the benefit of future researchers to make direct comparisons with different study samples. AUC = area under the curve; ACC = accuracy; SAR = squared error, accuracy, and ROC area; STRONG-R = Static Risk Offender Need Guide for Recidivism.
*p < .05. **p < .01. ***p < .001.
Significance tests of AUC comparisons were computed for correlated ROCs (see DeLong, DeLong, & Clarke-Pearson, 1988). However, given that AUCs use different distributions to evaluate classifiers, Hand (2009) likened restrictions of making comparisons across samples to contrasting different metrics (e.g., meters to yards). Thus, Hand created a standardized metric deftly termed H-units. We provide our models’ H-units here for future researchers to make direct comparisons of differing study samples. We again provided estimates of accuracy, entropy, slope, and the combined SAR measure.
For males, the STRONG-R model provided significant predictive improvement (p < .05) over four of the six comparisons, including the following: 2. Unweighted; 4. Static Only and Gender Neutral (SRA2); 5. Bivariate Selection, Unweighted, and Gender Neutral; and 6. Static Only, Bivariate Selection, Unweighted, and Gender Neutral. The STRONG-R model did not provide significant AUC predictive improvement over 1. Gender Neutral, and 3. Bivariate Selection.
Similar results were obtained for the female sample, where the STRONG-R model provided significant predictive improvement (p < .05) over four of the six comparisons, including the following: 3. Bivariate Selection; 4. Static Only and Gender Neutral (SRA2); 5. Bivariate Selection, Unweighted, and Gender Neutral; and 6. Static Only, Bivariate Selection, Unweighted, and Gender Neutral. The STRONG-R female model did not provide significant predictive improvement over 1. Gender Neutral, and 2. Unweighted. Thus, the STRONG-R consistently provided improved performance over Models 4, 5, and 6 and mixed performance with regard to Models 1, 2, and 3. However, as the last three models are closest in methodology to some of the more common instruments used today, our findings are noteworthy.
Furthermore, H values, Entropy, ACC, slope, and SAR all generally favored the STRONG-R, demonstrating comparatively equal or improved performance for all but one model (Female Gender Neutral, ACC). For these additional metrics, similar small variations were found between the STRONG-R and Design Variations 1, 2, and 3, while more substantial differences were found for Variations 4, 5, and 6.
General Felony Versus Offense-Specific Recidivism Outcomes
Next, comparisons were made between STRONG-R offense-specific prediction models and the General Felony model for males and females. For these comparisons, the item weights constructed for the General Felony models were tasked to predict each of the three offense-specific outcomes—violent, property, and drug felony reconviction. AUC comparisons were then examined between the specific outcome models (previously displayed in Table 3). The findings are provided in Table 6. For all three male comparisons, significant improvement was identified (p < .001) for the specific outcome models, with AUC differences ranging from 5% to 7%. Specific outcome female models also demonstrated improved AUC values; however, the drug model comparison did not reach significance. Given the greater prevalence of drug reconvictions for females, the items and weights of the drug models are most consistently aligned with the General Felony model and likely contributed to the nonsignificant finding. Generally, outcome model comparison findings provide substantial support for the use of specific outcome models to aid predictive discrimination.
AUC Comparisons of the STRONG-R Offense-Specific Versus General Felony Outcomes
Note. Bolded figures indicate the top performing model for a given metric. AUC = area under the curve; STRONG-R = Static Risk Offender Need Guide for Recidivism.
*p < .05. **p < .01. ***p < .001.
Six-Point Risk Scale
An important design feature of the SRA2 was assembling ordinal risk categories incorporating multiband modeling, supported by prior findings of offender specialization (see Baker, Metcalfe, & Jennings, 2013; McGloin, Schreck, Stewart, & Ousey, 2011; McGloin & Stickly, 2011; Nieuwbeerta, Blokland, Piquero, & Sweeten, 2011). To combine the effects of the three specified risk scales and the one general risk scale, a hierarchy of risk of reconviction severity was established for the STRONG-R to operationalize risk category assignment as follows: (a) Low, (b) Moderate, (c) High Drug, (d) High Property, (e) High Violent, and (f) Criminally Diverse. The rules governing risk assessment category placement are illustrated in Figure 1.

Hierarchical Risk Categories
Classification rules are described as follows. Offenders identified as high risk in the Violent model are placed into Level 5—High Violent. Those not identified as High Violent but identified as high risk in the Property model are categorized as Level 4—High Property. Offenders not identified as High Violent or High Property but identified as high risk in the Drug model are categorized as Level 3—High Drug. Those identified as high risk in all three offense-specific outcome models are categorized as Level 6—Criminally Diverse. Offenders not classified as high risk in any offense-specific outcome model are assessed in the Felony model, which identifies offenders who are Low Risk (Level 1). Those not classified as either high or low risk in the previous models are identified as Level 2—Moderate. Because categories are assembled into a single categorical scale, an offender may fall into the high-risk category in an offense-specific outcome model (e.g., High Drug) but may also be eligible for Moderate Felony category placement. Based on the hierarchy of categories, such offenders are placed in the higher prioritized category (i.e., High Drug).
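The classification rules above can be expressed as a short decision function. This is a sketch of the hierarchy as described in the text, not the WADOC implementation. The function name and its boolean inputs are hypothetical, and the cut points that would produce those flags are outside the scope of the study.

```python
def risk_category(high_violent, high_property, high_drug, felony_low):
    """Assign the six-point risk level from four model flags.

    high_violent / high_property / high_drug: high risk in that offense-specific model.
    felony_low: classified as low risk in the General Felony model.
    """
    if high_violent and high_property and high_drug:
        return 6, "Criminally Diverse"
    if high_violent:
        return 5, "High Violent"
    if high_property:
        return 4, "High Property"
    if high_drug:
        return 3, "High Drug"
    if felony_low:
        return 1, "Low"
    return 2, "Moderate"

# Hypothetical cases.
print(risk_category(True, True, True, False))     # high in all three models
print(risk_category(False, True, True, False))    # property outranks drug
print(risk_category(False, False, True, False))   # high drug only
print(risk_category(False, False, False, True))   # low in the felony model
print(risk_category(False, False, False, False))  # neither high nor low
```

Checking the Criminally Diverse condition first keeps the rule set unambiguous, since such offenders would otherwise satisfy the High Violent condition as well.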
We feel this classification system is quite novel as compared with the more traditional unidimensional assessment of risk. In this hierarchical design, probability of recidivism is appropriately less important than severity of risk to public safety. While a discussion of cut-point placement methods is beyond the scope of the current study, we thought it important to conclude our description of the STRONG-R’s utilization of offense-specific models in determining risk category placement.
Discussion
The variation in risk assessment methodology indicates a need for translational research that identifies best practice techniques to assist criminal justice researchers in developing, and practitioners in selecting, prospective risk tools (Bushway, 2013). Based on our review of the literature, we find that, for many commonly used risk assessment instruments, the development methods used to create the tools go largely unpublished, are claimed as proprietary information, or are placed in an agency report that may be publicly available but is difficult to locate. Given the effort needed to identify and describe tool creation methods, our study sought to examine common methods and compare their predictive validity when predicting recidivism.
The STRONG-R was the product of this examination, developed using techniques identified to be best for the WADOC population and serves as a baseline for improvements in predictive accuracy over other techniques. In particular, specified and general outcome models are used to increase the detail of information gathered and, in turn, increase performance. Separate male and female models were created to increase specificity of contextual information gathered on instrument predictors and the overall gender specificity of the assessment. Bootstrap techniques to both select predictors and validate models were used to make modeling consistent with state of the art methods. The use of multivariate models and analytic weights further improved levels of predictive discrimination, accuracy, and calibration. However, in contrast to purely data-driven methods, we make a strategic departure by inhibiting illogical prediction patterns. We feel this procedure increases face validity and, in turn, will bolster stakeholder trust in the instrument. Cut-point placements were also provided in an attempt to optimize category discrimination, while accounting for WADOC policy and resource allocation.
In an effort to isolate and demonstrate the incremental improvement provided by the aforementioned methodological design decisions, we used WADOC data to compare five other potential instrument design variations, all of which were scored for both male and female samples. Results generally indicated improved performance of the STRONG-R methodology as an assessment of felony recidivism prediction for Washington State offenders across the dimensions examined. However, it is notable that smaller improvements were found over the Gender Neutral and Unweighted models. These points have been debated previously (Andrews et al., 2012; Dawes, 1979; Einhorn & Hogarth, 1975; Van Voorhis et al., 2010; Wainer, 1976) and the current findings lack a decisive conclusion. We find that when these design variations were combined together, the previously identified performance differences become additive and the STRONG-R methodology provides substantial and significant predictive performance improvement. Furthermore, we chose to retain gender specificity and analytic weighting in our model construction—not strictly based on performance—but conceptually these two instrument design considerations provide improved comprehension of model scoring for practitioners, where an understanding of variable importance is more likely to retain stakeholder buy-in. It will be interesting to monitor these comparisons to identify whether the observed minor performance differences remain stable or grow over time.
Two important comparisons were found with regard to predictive performance differences between the STRONG-R, the SRA2, and the alternate model which used bivariate item selection, unweighted (or Burgess-style) scoring, and/or gender neutrality. The former was used to demonstrate improvement over the WADOC’s SRA2 and to provide confidence in the WADOC’s decision to implement the STRONG-R. The latter comparison was important as it provided a proxy model, demonstrating the STRONG-R’s likely improvement over commonly used and competitive instruments in Washington State (i.e., the LSI-R, ORAS, and WRNA), which make use of similar methodologies. Both model comparisons were found to significantly and substantially favor the STRONG-R’s design and construction, evidenced by improved predictive performance.
Although comparative performance findings are decisively improved for the STRONG-R methodology, they were also anticipated due to aforementioned methodological variations. As the review of the literature described, the initial SRA demonstrated improved predictive discrimination over the LSI-R for Washington State offenders (Barnoski & Drake, 2007). It was established that improvements were a result of three instrument design modifications: an increase in static criminal history item specificity, multivariate item selection procedures, as well as weighting items to the WADOC population. The current study sought to isolate the impact of these and two other design variations—gender specificity and the inclusion of dynamic items. Our hope is that policy makers take note of the described methodological/design issues when examining, and potentially customizing, current instruments or considering adoption of a new/updated assessment and, as Bushway (2013) suggested, let this study serve to extend the list of credible techniques for practitioners looking to identify prospective tools.
Limitations
The current study is not without limitations. First, when comparing predictive validity of the STRONG-R versus the six design options, statistical significance of AUC values was identified. Due to the large size of our sample, it is possible that significance may appear inflated. Given that the range of AUCs is 10% for males and 9% for females, roughly one fifth of the scale’s range, and all additional performance metrics demonstrate improved strength of the STRONG-R’s design, we feel confident that the comparisons identified are both substantial and meaningful.
Although there is demonstrated performance strength for the STRONG-R methodology for recidivism prediction, the other 4G instruments discussed do not solely focus on risk to society (i.e., recidivism). Person-centered instruments are used to adapt supervision, treatments, and services to offender needs, and many assessments utilize a needs assessment. Researchers and practitioners seldom distinguish how risk assessments and needs assessments are operationalized. Despite the use of dynamic risk and protective items, the STRONG-R is not a needs assessment. Strategically designed needs assessments should highlight the prevalence of offender issues to be used in identifying correctional system resourcing for available interventions. Currently, development efforts are underway to pair a needs assessment with the current STRONG-R prediction efforts for case management purposes.
A noted advantage of the STRONG-R for Washington State is the tailored nature of the instrument, where prediction performance is couched in the jurisdiction it serves. A drawback of this procedure is the potential restriction related to external validity both over time and with non-Washington State–based populations. As the STRONG-R is still in the early stages of implementation in Washington, we are currently not able to provide estimates of external validity. In the first years following implementation efforts, we intend to examine external validation performance. However, due to the extensive size of our sample, it is less likely that external validation of future Washington State–based samples will differ substantially from the internal estimates reported (see Bleeker et al., 2003).
Relatedly, although we would anticipate a comparable level of predictive performance with non-Washington State–based populations, due to the tailored nature of the instrument, there is a potential for performance shrinkage if weights obtained from Washington were applied to another state or jurisdiction. We contend that, although predictors measure general aspects of recidivism, the STRONG-R is designed for item weights and risk category cut points to be modified, or customized, when applied in a new population. This norming/reweighting procedure is a known technique and utilized by other analytically weighted instruments (e.g., the COMPAS). Again, our hope is that policy makers and practitioners will take note of this variation in risk assessment development when considering the strength of current assessments, where many off-the-shelf instruments should be adjusted when utilized for a population or jurisdiction that differs from the initial development sample (Wright et al., 1984).
Furthermore, although the STRONG Assessment Item Pool was large, it was not exhaustive and we do not rule out the possibility of extending its predictive performance. For instance, there are domains/items that are considered by other assessments that may further specify elements of risk to recidivate (i.e., risk to society). Another advantage of tailoring an instrument within a jurisdiction is the ability to collect, or beta-test, items for future inclusion. Using a development team of practitioners, efforts are currently underway to identify gaps to be filled with additional item content for future instrument versions.
One unexpected finding was identified in our Gender-Neutral model comparison. Based on prior gender-specific risk assessment findings (Brennan et al., 2012; Van Voorhis et al., 2010), a larger improvement in predictive performance was anticipated. This limited demarcation between the two instrument design strategies may be due to a lack of gender-specific scales and items (e.g., trauma, parental stress, social support) included in the current STRONG Assessment Item Pool. We anticipate selecting, adding, scoring, and evaluating additional gender-specific items, with the potential for inclusion in later versions of the STRONG-R.
As mentioned, it was a design decision to restrict the selection of items to those that predict in a theoretically consistent direction of effect. This decision was grounded in prior research (Wainer, 1976) and practical experience applying item weights for the SRA2. Developing an instrument with the best performance values does not guarantee that it will have the trust of those using it. We do not fault others who have designed their instrument weights to maximize validation findings. Using a purely data-driven selection and weighting procedure would likely have increased our model performance. Forgoing larger performance values was an acceptable consequence to retain face validity, and one that we feel bridges the gap between the advantages and disadvantages of Burgess-style weighting systems and those of instruments derived from other analytic weighting schemes.
Relatedly, others have noted that machine learning techniques have the potential to outperform the regression-based models utilized here (Berk & Bleich, 2013). Our prior findings with Washington State offender risk assessment data did not find machine learning models (e.g., random forests and neural networks) to provide improved performance (Hamilton et al., 2014). Generally, we found comparable, and often improved, performance from regression methods, which ultimately guided our decision to avoid machine learning techniques for STRONG-R model creation. However, this debate over the best methods for prediction model creation is not settled and we do not rule out the possibility that predictions may be improved with machine learning or other methods not mentioned.
Finally, we made efforts to compare the predictive performance of the STRONG-R to the current WADOC assessment (the SRA2). It would have been additionally insightful to make comparisons with all of the aforementioned instruments, including the COMPAS and MnSTARR. However, comparisons of these instruments were not feasible for previously mentioned reasons and the underlying intent was not to indicate the STRONG-R to be superior to these nationally recognized instruments but to highlight the issues surrounding incremental predictive performance using alternative methodological design decisions.
Conclusion
Following Andrews and colleagues’ (2006) description of offender assessment generations, there has been an assumption within the field that generational upgrades provide greater recidivism prediction performance. In a response to critique, Brennan, Dieterich, Breitenbach, and Mattson (2009) described how all 3G/4G instruments are not created equal. There are noted design issues in commonly used 4G instruments. These issues were so apparent that Washington State’s 2G instrument, the SRA, demonstrated greater predictive validity than the (3G) LSI-R (Barnoski & Drake, 2007). Through the creation of the STRONG-R, additional improvements have been demonstrated.
To create an optimal estimation of offender risk, one requires a systematic method of organizing items and their values, specified to the outcome of interest (Einhorn, 1986). Given the current reach of offender risk assessment in the criminal justice system, we concur with Kroner and colleagues (2005) who argued that developing and refining risk assessments is critical to improve the system’s effect on antisocial behavior and ultimately recidivism. The noted advancements of the STRONG-R can be conceived of as design decisions, where selected methods were developed based on our experimentation with methods used in a variety of contemporary instruments.
These observed advancements would not be possible without our ability to stand on the shoulders of prior offender risk assessment developers. In particular, we are cognizant that purely data-driven item selection procedures can be problematic. Although a regression model might identify a homicide conviction to be negatively related to reconvictions, structuring the linear weights of an instrument in this way is tantamount to giving someone credit (or reducing their risk score) for committing additional homicides. Instances of illogical prediction patterns such as these became the rationale for modifications from the SRA to the SRA2, and similar face validity issues appear in more contemporary instruments created with greater methodological rigor. Based on noted critiques, we restricted our selection to only those items that presented a logical and theoretically consistent direction of effects.
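The direction-of-effect restriction described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions and is not the procedure used to build the STRONG-R: the function name `select_consistent_items`, the item names, and the coefficient values are all hypothetical. The sketch assumes that a regression model has already been fitted and that each candidate item carries an expected direction (+1 for a risk item, -1 for a protective item).

```python
from typing import Dict, List


def select_consistent_items(
    coefficients: Dict[str, float],
    expected_direction: Dict[str, int],
) -> List[str]:
    """Keep items whose fitted coefficient sign matches the expected direction.

    expected_direction maps an item name to +1 (risk item) or -1 (protective item).
    """
    kept = []
    for item, coef in coefficients.items():
        direction = expected_direction.get(item)
        if direction is None:
            continue  # no theoretical expectation recorded, so the item is left out
        if coef * direction > 0:
            kept.append(item)  # sign agrees with theory
    return kept


# Hypothetical fitted coefficients (illustrative values only, not study results).
coefs = {
    "prior_felonies": 0.42,
    "homicide_conviction": -0.31,  # negative weight would "credit" a homicide
    "stable_employment": -0.18,
}
expected = {
    "prior_felonies": +1,
    "homicide_conviction": +1,
    "stable_employment": -1,
}

selected = select_consistent_items(coefs, expected)
print(selected)
```

In this toy example the homicide item is dropped because its fitted sign contradicts its expected direction, while the protective employment item is retained because a negative weight is what theory anticipates.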
Second, prior findings have noted the importance of offender specialization (see Baker et al., 2013; McGloin et al., 2011; Nieuwbeerta et al., 2011). Although prior instruments have parsed out violent from general recidivism, we further specified crime types known to impact supervision and intervention provisions. We identified these four primary crime types for which offenders may be provided priority for specialized interventions and created four models to identify those who are higher risk for violent, property, drug, and general felony recidivism. An additional model focused on sexual reoffending is currently in development to add to the STRONG assessment suite.
Finally, we created the STRONG-R to be customized to the risk and need items that are paramount to the WADOC system. It is imperative that agencies seek out assessment developers to provide instruments tailored to the characteristics of their jurisdiction. Grabbing an instrument off-the-shelf, such as the LSI-R or LS/CMI, ORAS, COMPAS, WRNA, MnSTARR, or even the STRONG-R, and asking it to predict with the same level of performance as in the jurisdiction/population in which it was developed is a policy decision that may prove costly over time. We were fortunate in that it was possible to conduct the current study, as the WADOC had the foresight to implement a comprehensive needs interview with a future goal of determining if its items/responses could improve prediction performance. This two-pronged approach, implementing a more comprehensive but not yet validated needs interview along with a more limited yet validated risk assessment, allowed the WADOC to categorize offenders into risk categories (i.e., High Violent, Moderate, Low Risk) as accurately as currently possible, all while ensuring the capacity to improve assessments following sufficient data collection. This stepwise process can be repeated for further improvements by adding potential items whose predictive capacity can be assessed in future validation efforts, a process referred to here as item-beta-testing.
Furthermore, an agency wanting to implement a predictive tool has to start somewhere, by implementing an assessment even though it has not yet been normed or validated for their jurisdiction. Implementing an assessment off-the-shelf that has been validated elsewhere runs an initial risk of bias and reduced performance. It is understandable that many jurisdictions may not have the resources or agency support to follow the same development path described here; however, it is very important that an instrument be tailored to the population supervised. At the very least, we recommend that if an instrument is used/adopted off-the-shelf, the risk level distribution be determined prior to implementation to assess how its use may affect resources. The assumption is that the risk levels will correspond to differing recidivism rates. Once sufficient data are collected, the predictive performance of the risk score should be assessed, the instrument normed/reweighted, and the risk category cut points constructed for the specific jurisdiction.
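The checks recommended above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the WADOC procedure: the level labels, cut points, scores, and outcomes are hypothetical, and the helper names (`assign_level`, `level_distribution`, `recidivism_rate_by_level`, `auc`) are introduced here for illustration only. The sketch shows (a) the risk level distribution implied by a set of cut points, (b) the observed recidivism rate within each level, and (c) the area under the ROC curve, computed as the share of recidivist/non-recidivist pairs in which the recidivist has the higher score.

```python
from bisect import bisect_right
from collections import Counter
from typing import Dict, Sequence

LEVELS = ["Low", "Moderate", "High"]  # hypothetical category labels


def assign_level(score: float, cut_points: Sequence[float]) -> str:
    """Map a score to a level; a score equal to a cut point falls in the higher level."""
    return LEVELS[bisect_right(cut_points, score)]


def level_distribution(scores: Sequence[float], cut_points: Sequence[float]) -> Dict[str, float]:
    """Proportion of the assessed population falling in each risk level."""
    counts = Counter(assign_level(s, cut_points) for s in scores)
    return {lvl: counts[lvl] / len(scores) for lvl in LEVELS}


def recidivism_rate_by_level(scores, outcomes, cut_points) -> Dict[str, float]:
    """Observed recidivism rate (outcomes coded 0/1) within each populated level."""
    totals, events = Counter(), Counter()
    for s, y in zip(scores, outcomes):
        lvl = assign_level(s, cut_points)
        totals[lvl] += 1
        events[lvl] += y
    return {lvl: events[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}


def auc(scores, outcomes) -> float:
    """AUC via pairwise comparison; ties between a pair count as one half."""
    pos = [s for s, y in zip(scores, outcomes) if y == 1]
    neg = [s for s, y in zip(scores, outcomes) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical local data (illustrative values only, not study results).
scores = [3, 8, 12, 15, 22, 27, 9, 18]
outcomes = [0, 1, 0, 1, 1, 1, 0, 0]
cut_points = [10, 20]  # assumed off-the-shelf cut points

distribution = level_distribution(scores, cut_points)
rates = recidivism_rate_by_level(scores, outcomes, cut_points)
area = auc(scores, outcomes)
print(distribution, rates, area)
```

In this toy data the Low and Moderate levels show the same observed recidivism rate, which is the kind of pattern that would suggest the borrowed cut points need to be reconstructed for the local population.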
In conclusion, implementing an offender risk assessment is an arduous task whose consequences affect nearly every aspect of a supervision system. The development of a new assessment better informs correctional practice when its methods are clearly defined and item content is tailored to the needs of the agency. The instrument should incorporate predictors with a multitude of theoretical, supervisory, and rehabilitative considerations. Often overlooked in discussions of risk assessment creation and implementation is the necessary connection to the practitioner community when developing best practices (Bushway, 2013). There is a need for an agency development team consisting of SMEs with the requisite analytic skills and experience to inform data collection, interrater reliability, efficient model utility, and stakeholder buy-in. We hope that this study will add to the knowledge base of the field as we continue to usher in new and ever-improving methods to increase prediction performance and assessment usability.
Footnotes
A technical report of a portion of this work was presented to the Washington State Department of Corrections (Hamilton, Neuilly, Lee, & Barnoski, 2014). The findings and discussion are those of the authors and may not represent the position of the Washington State Department of Corrections.
Notes
, to read the reports written by him while at the Institute.
