Abstract
This study presents the results from the development and validation of a fully automated, gender-specific risk assessment system designed to predict severe and frequent prison misconduct on a recurring, semiannual basis. K-fold and split-population methods were applied to train and test the predictive models. Regularized logistic regression was the classifier used on the training and test sets that contained 35,506 males and 3,849 females who were released from Minnesota prisons between 2006 and 2011. Using multiple metrics, the results showed the models achieved a relatively high level of predictive performance. For example, the average area under the curve (AUC) was 0.832 for the female prisoner models and 0.836 for the male prisoner models. The findings provide support for the notion that better predictive performance can be obtained by developing assessments that are customized to the population on which they will be used.
Introduction
Over time, correctional authorities have increasingly relied on risk assessment instruments in an effort to optimize limited resources and foster greater safety in both prison and the community. These instruments have been utilized to help determine institutional custody levels, the type of community supervision, and whether individuals should be paroled or released to the community prior to the adjudication of their criminal cases. Because institutional and community programming resources are often scarce, they have also been used to identify which offenders to prioritize for programming.
As the application of risk assessment instruments within corrections has grown, so has the body of research on the development and validation of these tools. Much of this research, however, has focused on the prediction of recidivism, the most widely used outcome measure for correctional populations. By contrast, the existing literature has paid relatively little attention to risk assessment instruments that predict institutional misconduct. Defined as the failure by inmates to follow institutional rules and regulations (Camp et al., 2003), prison misconduct encompasses behavior that ranges from disobeying orders and possession of “contraband” (e.g., alcohol and drugs) to assaults against staff and other inmates.
When individuals (re)enter a jail or prison, correctional systems typically make classification decisions regarding the security or custody levels at which inmates should be confined. To promote better safety and security for both inmates and staff, individuals thought to be at a higher risk of institutional misconduct are often placed at more restrictive custody levels such as close or maximum. In contrast, lower risk inmates are more likely to be placed at minimum or medium custody levels in which they have fewer restrictions and greater freedom of movement within the facility.
The determination of risk is often based on an assessment of whether inmates will engage in any misconduct. As shown later, however, more than one quarter of male prisoners and 43% of female inmates in Minnesota’s prison system had at least one discipline conviction during their confinement. If maintaining institutional safety through effective risk management is a key objective of prison classification systems, then simply predicting who will have any misconduct is not especially meaningful. Instead, what is more important from a risk management perspective is identifying who will have serious, violent misconduct and/or a lot of discipline convictions. In other words, which inmates are most likely to compromise facility safety and consume a great deal of staff time?
The career criminal literature has long documented that a relatively small proportion of offenders account for a disproportionate share of crime (Chaiken & Chaiken, 1984; Wolfgang et al., 1972; Wright & Rossi, 1986). Similarly, a relatively small segment of the inmate population is responsible for much of the misconduct in prison. As shown later, approximately 10% of the male prisoners in Minnesota accounted for 70% of all discipline convictions, 79% of all misconduct resulting in a segregation or restrictive housing penalty (i.e., more serious misconduct), and 100% of assaults against other inmates and staff (i.e., violent misconduct). Likewise, about 10% of the female inmate population in Minnesota was responsible for 62% of all discipline convictions, 71% of misconduct resulting in segregation, and 100% of all violent misconduct. If the highest risk inmates (the top 10%) can be accurately identified, then prison classification systems can be used to further improve institutional safety. For example, in addition to custody level, prison systems can apply other measures, such as the delivery of programming, to reduce the likelihood that higher risk inmates will engage in misconduct.
Present Study
This study develops and validates a fully automated, gender-specific risk assessment for inmates in Minnesota that is designed to predict serious and/or frequent misconduct (SFM) in 6-month intervals. Recent research has shown that, compared with a manual scoring method, a fully automated assessment is more reliable, efficient, and cost-effective (Duwe & Rocque, 2017). In particular, a fully automated assessment eliminates interrater disagreement, which leads to better predictive performance. Moreover, Duwe and Rocque (2017) reported that automation of the Minnesota Screening Tool Assessing Recidivism Risk (MnSTARR) 2.0 would yield a return on investment (ROI) of more than US$20 after 5 years, generating close to US$5 million in staff time saved.
In addition to the fact that male and female prisoners are housed in separate facilities in Minnesota, the factors that increase and decrease the risk of SFM may vary by gender. It was important, therefore, to create separate assessments for male and female prisoners. It was also important to design an assessment that predicts SFM in 6-month intervals. As discussed later in more detail, the Minnesota Department of Corrections (MnDOC) reassesses inmates every 6 months they are in prison. Rather than developing an intake assessment that predicts misconduct over the entirety of an inmate’s confinement, it was necessary to design an instrument that predicts SFM for each individual offender every 6 months they are in prison.
Prior Research on the Predictors of Prison Misconduct
The two main perspectives that have been utilized to explain prison misconduct are importation and deprivation. Pitched largely at the individual level, importation argues that misconduct occurs as a result of the characteristics and experiences that offenders bring with them into prison. Deprivation, situated more at the institution level, holds that situational factors within the prison environment influence offender misconduct (Tewksbury et al., 2014).
Even though prison misconduct is not synonymous with criminal offending, both represent rule-violating behavior. Moreover, prison misconduct has been found to be a significant predictor of recidivism (Duwe, 2014; Gendreau et al., 1996). Furthermore, existing research suggests that prison misconduct and recidivism share many of the same risk and protective factors. Indeed, as with recidivism, the strongest predictors of misconduct tend to be static factors such as criminal history, age, and race (Caudy et al., 2013; Gendreau et al., 1997).
Reflecting the findings reported by Gendreau et al. (1997) that antisocial companions increase the likelihood of misconduct, several studies have indicated that gang membership (i.e., identification as a member of a security threat group [STG]) is positively associated with rule violations (Gaes et al., 2002; Griffin & Hepburn, 2006; Tewksbury et al., 2014). Gendreau et al. (1997) also noted that social achievement (e.g., education, employment, marital status) and early family factors had modest associations with disciplinary infractions.
Consistent with the deprivation perspective, Gendreau et al. (1997) also noted that institutional factors have an effect on misconduct. Existing research has demonstrated that prisons vary in their effect on individual prisoners’ likelihood of engaging in misconduct (Camp et al., 2003). Indeed, previous studies suggest misconduct is affected by institution-level factors such as size, location, and security level (Huebner, 2003; Steiner & Wooldredge, 2014). Other research indicates disciplinary infractions are influenced by the overall characteristics of the inmates as well as the staff (Camp et al., 2003). Although earlier work has established that custody levels have a minimal impact on misconduct (Berk et al., 2003; Camp & Gaes, 2005), a more recent study of Texas prisoners by Worrall and Morris (2011) found that increases in custody levels (i.e., higher levels of risk) were associated with a greater likelihood of rule violations.
In another meta-analysis focusing on what works to reduce prison misconduct, French and Gendreau (2006) concluded that the most effective interventions for curbing disciplinary infractions were cognitive–behavioral treatment programs, especially when they were implemented with fidelity and targeted multiple criminogenic needs. Cognitive–behavioral therapy has also been shown to be one of the most effective interventions for reducing recidivism (Lipsey et al., 2007). Education and employment programming have likewise been found to reduce misconduct (Duwe et al., 2015; Gover et al., 2008; Steiner & Wooldredge, 2014), although their effectiveness has been more modest and sometimes inconsistent.
The MnDOC’s Current Classification System
In the late 1990s, the MnDOC implemented what was, at that time, a new classification system. Due to a lack of documentation, it is not clear how the MnDOC developed the classification assessment or whether it was ever validated. What is known, however, is that the MnDOC—like other states—had received assistance in developing and implementing the new assessments from the National Institute of Corrections.
The classification assessment the MnDOC has used contains a total of six items—current offense, history of assault, institutional adjustment, history of escapes, age, and custody level at most recent release. The assessment is based on a modified Burgess weighting scheme, which is a simple, summative approach wherein items on a risk assessment tool are assigned a value. The value for each item is then summed across all items to produce a total score.
For example, with the current offense item, an individual can receive between 0 and 18 points (no bodily harm = 0, threat of bodily harm = 3, attempted bodily harm = 6, weapon = 7, involved bodily harm = 18). MnDOC staff conduct a case file review to determine the most appropriate response for each of the six items. The values for each of the six items are added up to form a total score that ranges from 0 to 67. Inmates with a life sentence have a minimum score of 43, whereas those with a life without parole sentence have a minimum score of 50. Although there are some exceptions, inmates are assigned custody levels based on their classification score. Maximum custody is a score of 50 or higher, close is between 19 and 49, medium is between 6 and 18, and minimum is 5 or lower.
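The summative scoring and the custody cut points described above can be sketched in a few lines. This is an illustration only: the item keys and function names are hypothetical, the point values for the five items other than current offense are not given in the text, and the documented exceptions to score-based placement are ignored. The sketch assumes integer scores, so every score below the medium band maps to minimum custody.

```python
# Hypothetical item keys; the text names the six items but not their codings.
ITEMS = (
    "current_offense",
    "history_of_assault",
    "institutional_adjustment",
    "history_of_escapes",
    "age",
    "custody_level_at_last_release",
)

# Point values for the current offense item, as listed in the text.
CURRENT_OFFENSE_POINTS = {
    "no bodily harm": 0,
    "threat of bodily harm": 3,
    "attempted bodily harm": 6,
    "weapon": 7,
    "involved bodily harm": 18,
}

# Minimum scores tied to sentence type, as described in the text.
SENTENCE_FLOORS = {"life": 43, "life_without_parole": 50}


def classification_score(item_points, sentence=None):
    """Sum the six item values, then apply any sentence-based minimum score."""
    total = sum(item_points[name] for name in ITEMS)
    return max(total, SENTENCE_FLOORS.get(sentence, 0))


def custody_level(score):
    """Map a total score to a custody level using the cut points in the text."""
    if score >= 50:
        return "maximum"
    if score >= 19:
        return "close"
    if score >= 6:
        return "medium"
    return "minimum"  # assumption: integer scores, so this covers 5 or lower
```

Under these assumptions, an inmate whose only points come from an offense that involved bodily harm would score 18 and be placed at medium custody, whereas the same inmate serving a life sentence would be raised to the floor of 43 and placed at close custody.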
Inmates are assessed for classification as long as they are admitted to prison for a new felony sentence (either as a new court commitment or probation violator). Offenders are not assessed for classification if they reentered prison as a parole violator. Following an initial assessment, inmates are reassessed for classification every 6 months they are in prison. If inmates have not had any discipline convictions for 6 months, their classification score drops by 3 points. If they go a full year without any discipline convictions, their classification score drops by 4 points.
Data and Method
The data set used to develop and validate this assessment is similar to what was used to create the MnSTARR 2.0. The overall data set contains 39,355 male and female inmates released from Minnesota prisons between 2006 and 2011. The male sample consists of 35,506 offenders, whereas the female sample contains 3,849 offenders. Because the data include all releases from prison and some prisoners were released multiple times during this 6-year period, the data set contains 24,322 individual offenders (21,648 males and 2,674 females).
Operationalizing the Outcome Measure: Severe and/or Frequent Misconduct
The assessment developed in this study is designed to predict which prisoners would engage in severe and/or frequent misconduct (SFM) over the ensuing 6 months. For the male prisoners, SFM was operationalized as (a) five or more discipline convictions, (b) three or more discipline convictions resulting in a restrictive housing sanction, or (c) a discipline conviction for assaulting either staff or other inmates within a 6-month period. For the female prisoners, SFM was defined as (a) 10 or more discipline convictions, (b) three or more discipline convictions resulting in a restrictive housing sanction, or (c) a discipline conviction for assaulting either staff or other inmates within a 6-month period. This misconduct measure thus captures the prisoners who are not only violent toward staff and other inmates but also engage in high rates of severe misconduct that disrupt the safety of the institution.
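Because the SFM definition is a disjunction of three criteria, it can be expressed as a single Boolean rule. The sketch below uses the thresholds stated above; the function and argument names are hypothetical, and the counts are assumed to refer to one 6-month window.

```python
# Thresholds for criterion (a), total discipline convictions, as stated in the text.
CONVICTION_THRESHOLD = {"male": 5, "female": 10}

# Threshold for criterion (b), convictions resulting in restrictive housing.
SEGREGATION_THRESHOLD = 3


def meets_sfm(sex, convictions, segregation_convictions, assault_convictions):
    """Return True if any of the three SFM criteria is met within the window."""
    return (
        convictions >= CONVICTION_THRESHOLD[sex]
        or segregation_convictions >= SEGREGATION_THRESHOLD
        or assault_convictions >= 1
    )
```

For instance, a male prisoner with four discipline convictions, two of which led to restrictive housing, would not meet the criteria, whereas a single assault conviction would be sufficient on its own.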
Rather than developing separate predictive models for all misconduct, segregation misconduct, and violent misconduct, using this consolidated SFM measure is a more streamlined, parsimonious approach toward identifying the highest risk inmates. As shown in Table 5, roughly 10% of male and female prisoners meet the criteria for SFM designation in every 6-month period that was examined. Moreover, as indicated earlier, even though only one tenth of the prison population meets the criteria for SFM designation, these prisoners are responsible for much of the misconduct in prison. If the inmates at greater risk of SFM can be prospectively identified with a relatively high degree of accuracy, it may be possible to improve the safety of correctional institutions by mitigating the risk of misconduct among these prisoners.
Predictors of Severe and Frequent Misconduct
In addition to relying on the same sample used to develop and validate the MnSTARR 2.0, this study leverages the work performed in implementing that instrument by applying many of the same predictors. The predictors used to develop the predictive models for females and males are displayed in Tables 1 and 2, respectively. Both tables also identify which predictors are drawn from the MnSTARR 2.0 and which ones are not. The MnSTARR 2.0 items cover predictors relating to criminal history, demographic characteristics such as age at release and marital status, type of prison admission, offense type, gang affiliation (i.e., STG criteria), a history of suicidal tendencies, educational achievement, and whether a prisoner will be released to supervision.
Table 1. Descriptive Statistics for Female Prisoner Sample.
Note. MnSTARR = Minnesota Screening Tool Assessing Recidivism Risk; SFM = serious and/or frequent misconduct; GED = general educational development.
Table 2. Descriptive Statistics for Male Prisoner Sample.
Note. MnSTARR = Minnesota Screening Tool Assessing Recidivism Risk; SFM = serious and/or frequent misconduct; GED = general educational development.
The MnSTARR 2.0 also contains items relating to participation in programs that have been found to reduce recidivism. However, given the need to assess for SFM every 6 months, including all the correctional programming items on the MnSTARR 2.0 was problematic. Research indicates that participation in programming improves behavior both in prison and following release (L. M. Davis et al., 2013; French & Gendreau, 2006; Landenberger & Lipsey, 2005; Mitchell et al., 2007). It also shows that prisoners are at a greater risk of recidivism and postprison unemployment when they are “warehoused” (Duwe & Clark, 2017). Accordingly, rather than utilizing a number of items that measure participation in individual programs, this study includes items that measure whether inmates had been idle and, thus, not participating in any programming.
Because past behavior is one of the best predictors of future behavior, the female and male data sets include a number of items relating to prior prison misconduct for those who had previously been to prison in Minnesota. More specifically, items were included that measure total number of discipline convictions, total number of convictions resulting in segregation sentences, total number of violent discipline convictions, as well as more specific types of misconduct, such as threatening others, disobeying direct orders, and abuse/harassment.
To further improve predictive performance, the postintake predictive models (i.e., assessments every 6 months after intake) also include measures of recent behavior in prison. In particular, the postintake models include items for unauthorized idle (UI) status (i.e., “warehousing”), total misconduct, misconduct resulting in a segregation penalty, violent misconduct, and whether an offender met the criteria for SFM classification. For example, let us assume a male prisoner will be in prison for 11 months. This individual would receive an intake assessment, which would predict his likelihood of SFM over the next 6 months. At the 6-month mark, he would be reassessed for SFM over his final 5 months in prison. Therefore, the 6-month assessment would include items that measure behavior during his first 6 months in prison relating to the number of discipline convictions, the number of discipline convictions resulting in segregation, the number of discipline convictions for violent misconduct, whether he met the criteria for SFM, and whether he was in UI status.
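The reassessment schedule in this example can be made concrete with a small helper that lists the prediction window covered by each assessment. The function name is hypothetical, and the sketch simply assumes an assessment at intake and at every 6-month mark reached before release.

```python
def assessment_windows(length_of_stay_months, interval=6):
    """Return one (start, end) month pair per assessment during confinement."""
    windows = []
    start = 0
    while start < length_of_stay_months:
        # The final window is cut short by the release date.
        end = min(start + interval, length_of_stay_months)
        windows.append((start, end))
        start += interval
    return windows


print(assessment_windows(11))  # [(0, 6), (6, 11)]
```

For the 11-month stay described above, the intake assessment covers months 0 to 6 and the 6-month assessment covers the final 5 months.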
Validation Method
To develop and validate the prediction models, both data splitting (i.e., split population) and K-fold validation methods were used. Initially, the male and female offender samples were split into training sets and test sets. The training set consisted of offenders released between 2006 and 2009, whereas the test set included offenders released in 2010 and 2011. For the male offenders, there were 23,838 in the training set and 11,668 in the test set. For the female offenders, there were 2,546 in the training set and 1,303 in the test set.
Consistent with the need to predict SFM in 6-month intervals, the data set was further split into 6-month increments for both males and females. For example, with the male prisoner sample, all 35,506 offenders (23,838 in the training set and 11,668 in the test set) were included for the initial classification. However, of these offenders, only 19,359 had a length of stay greater than 6 months. Thus, for the assessment performed at the 6-month mark (i.e., the one that predicts SFM until the 12-month mark), the sample included the 19,359 (12,481 in the training set and 6,875 in the test set) with confinement periods greater than 6 months. Similarly, for the assessment performed at the 12-month mark (which predicts SFM from the 1-year mark until the 18-month mark), the sample included 12,246 who were in prison for more than a year (7,778 in the training set and 4,468 in the test set).
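The two partitioning steps described above, a temporal split by release year followed by restriction to prisoners still confined at each assessment point, can be sketched as follows. The record layout and function names are hypothetical and are not drawn from the MnDOC data.

```python
from dataclasses import dataclass


@dataclass
class Release:
    release_year: int
    months_served: float


def temporal_split(releases):
    """Training set: 2006-2009 releases. Test set: 2010-2011 releases."""
    train = [r for r in releases if 2006 <= r.release_year <= 2009]
    test = [r for r in releases if r.release_year in (2010, 2011)]
    return train, test


def cohort_for_assessment(releases, assessment_month):
    """Keep only those whose length of stay exceeds the assessment point."""
    return [r for r in releases if r.months_served > assessment_month]


releases = [
    Release(2007, 4.0),
    Release(2008, 14.0),
    Release(2010, 7.5),
    Release(2011, 30.0),
]
train, test = temporal_split(releases)
six_month_test = cohort_for_assessment(test, 6)
twelve_month_test = cohort_for_assessment(test, 12)
```

Splitting on release year rather than at random means the test set postdates the training set, which is closer to how the models would be used in practice.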
Due to attrition in the size of the sample, models were developed up to the 4-year mark for males. For females, prediction models were developed up to the 2-year mark due to a smaller sample size and lengths of stay that tend to be shorter in comparison with males. For example, all 3,849 female offenders (2,546 in the training set and 1,303 in the test set) were included for the initial classification. For the assessment performed at the 6-month mark, the sample included 1,668 offenders (1,076 in the training set and 592 in the test set) who had a length of stay longer than 6 months. Likewise, the 12-month assessment was based on a sample of 914 offenders (562 in the training set and 352 in the test set), whereas the 18-month assessment was based on 523 offenders (312 in the training set and 211 in the test set).
To further assess performance, this study uses additional test sets that contain male and female prisoners admitted to prison during 2017. The training and test sets contain offenders released between 2006 and 2011. Although most of these prisoners had relatively short lengths of stay in prison, some had been confined since the 1990s or even as early as the 1980s. What this means is that the training and test sets included some prisoners whose first 6 months in prison (or months 6–12, 12–18, etc.) were more than 20 years ago or even 30 years ago. The advantage to using a test set of 2017 admissions is that it provides an additional look at whether the models are predictive for a more recent cohort of prisoners. The 2017 test set is used for the intake assessment model as well as the 6-month assessment model.
Selecting Predictors
In an effort to identify the items that were significant and robust predictors of SFM for each of the 12 models, a bootstrap variable selection method developed by Efron and Gong (1983) was applied to the data set. Multiple logistic regression models were estimated in which predictors were added one at a time until no further single addition achieved the significance level, α = .10. Among the predictors that had a significant effect (p < .10) on SFM, bootstrap resampling was used to refine the selection of predictors for each model.
Consistent with prior research (Duwe, 2019; Duwe & Freske, 2012), predictors were retained as long as they were statistically significant at the .05 level in at least 70% of the 1,000 bootstrap samples. After removing predictors that did not achieve statistical significance in at least 70% of the samples, another 1,000 bootstrap samples were estimated. This process was carried out for each of the 12 models.
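The retention rule can be illustrated with a short sketch. It assumes that a logistic regression has already been refit on each bootstrap sample and that the resulting p values are available as one dictionary per sample; the refitting step itself is not shown, and the function names and predictor names are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement to form one bootstrap sample."""
    return rng.choices(rows, k=len(rows))


def retained_predictors(pvalues_by_sample, alpha=0.05, min_share=0.70):
    """Keep predictors significant at alpha in at least min_share of samples.

    pvalues_by_sample holds one {predictor: p value} dict per bootstrap sample.
    """
    n_samples = len(pvalues_by_sample)
    kept = []
    for name in pvalues_by_sample[0]:
        hits = sum(1 for pvals in pvalues_by_sample if pvals[name] < alpha)
        if hits / n_samples >= min_share:
            kept.append(name)
    return kept
```

With 1,000 bootstrap samples, a predictor would therefore need a p value below .05 in at least 700 of them to be retained for the next round.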
Model Development
As with the MnSTARR 2.0 (Duwe & Rocque, 2017) and the Minnesota Sex Offender Screening Tool–4 (MnSOST-4; Duwe, 2019), regularized logistic regression (RLR) was the classification algorithm used for all 12 models that were trained and tested—eight for males and four for females. RLR can effectively handle data sets with a large number of predictors—and the accompanying problems with collinearity—by shrinking overly large parameter estimates. In doing so, it also helps mitigate overfitting.
In using RLR as the classifier to develop the 12 prediction models, 10-fold cross-validation was used. More specifically, for each of the 12 models, the ridge estimator value (1.0E−8 to 500) was varied on the training set in an effort to optimize predictive performance. The 10-fold cross-validation procedure was used to determine how each model would perform on the test set. To identify the best RLR algorithm for each of the 12 prediction models, the area under the curve (AUC) was used. After identifying the parameters that yielded the highest AUC for each algorithm, predictive performance was then evaluated on the test sets.
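The tuning procedure can be sketched with a small, dependency-free implementation. This is not the software used in the study: the gradient-descent fit, the pooling of out-of-fold predictions into a single AUC, the three-value grid, and the synthetic data are all assumptions made for illustration, and the function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))


def fit_ridge_logistic(X, y, ridge, lr=0.5, epochs=150):
    """Fit a logistic regression with an L2 (ridge) penalty by gradient descent."""
    n, p = len(X), len(X[0])
    w, b = [0.0] * p, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * p, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            grad_b += err
            for j in range(p):
                grad_w[j] += err * xi[j]
        # Gradient step on the loss, then shrink the coefficients toward zero.
        # Dividing by (1 + lr * ridge / n) keeps the update stable for large
        # ridge values; the intercept is left unpenalized.
        shrink = 1.0 + lr * ridge / n
        w = [(wj - lr * grad_w[j] / n) / shrink for j, wj in enumerate(w)]
        b -= lr * grad_b / n
    return w, b


def auc(scores, labels):
    """Share of positive-negative pairs in which the positive case scores higher."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(X, y, ridge, k=10, seed=0):
    """Pool out-of-fold predicted probabilities over k folds and score them once."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    probs = [0.0] * len(X)
    for f in range(k):
        held_out = set(order[f::k])
        train = [i for i in order if i not in held_out]
        w, b = fit_ridge_logistic([X[i] for i in train], [y[i] for i in train], ridge)
        for i in held_out:
            probs[i] = sigmoid(b + sum(wj * xj for wj, xj in zip(w, X[i])))
    return auc(probs, y)


def select_ridge(X, y, grid, k=10):
    """Return the ridge value with the highest cross-validated AUC, and that AUC."""
    results = [(cross_validated_auc(X, y, ridge, k), ridge) for ridge in grid]
    best_auc, best_ridge = max(results)
    return best_ridge, best_auc


# Synthetic illustration only; these are not the study's data.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(120)]
y = [1 if x[0] + 0.5 * rng.gauss(0, 1) > 1.0 else 0 for x in X]
best_ridge, best_auc = select_ridge(X, y, grid=[1e-8, 1.0, 500.0])
```

The same folds are reused for every ridge value, so differences in cross-validated AUC reflect the penalty rather than the partition.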
Predictive Performance Metrics
Consistent with recent research that has applied multiple metrics to evaluate the predictive performance of risk assessment instruments (Duwe, 2019; Duwe & Kim, 2016; Duwe & Rocque, 2017; Hamilton et al., 2015; Tollenaar & van der Heijden, 2013), this study used seven different statistics to assess predictive validity. There are three main dimensions of predictive validity: (a) accuracy, (b) discrimination, and (c) calibration. Predictive accuracy assesses how well a model makes correct classification decisions. One of the more commonly used metrics is accuracy (ACC), a threshold-based measure that examines the extent to which an assessment correctly classifies offenders as recidivists or nonrecidivists. For example, if a recidivist had a predicted recidivism probability less than 50%, then this offender would be incorrectly classified as a nonrecidivist (i.e., false negative). Conversely, if this offender had not recidivated, then she or he would be accurately classified (i.e., true negative). The ACC value ranges from 0% to 100%, and higher ACC values reflect greater accuracy in making correct classification decisions.
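The threshold-based logic behind ACC can be written out directly. The sketch below uses the 50% cut point from the example above; the function names are hypothetical.

```python
def confusion_counts(probs, outcomes, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    tp = fp = tn = fn = 0
    for p, y in zip(probs, outcomes):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1  # e.g., a recidivist with a predicted probability below 50%
    return tp, fp, tn, fn


def accuracy(probs, outcomes, threshold=0.5):
    """Share of cases classified correctly at the threshold."""
    tp, fp, tn, fn = confusion_counts(probs, outcomes, threshold)
    return (tp + tn) / (tp + fp + tn + fn)


print(accuracy([0.9, 0.4, 0.2, 0.6], [1, 1, 0, 0]))  # 0.5
```

Because ACC depends on a single threshold, it can look high for a rare outcome even when few positive cases are identified, which is one reason to report it alongside other metrics.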
The second dimension of predictive validity, discrimination, measures the degree to which an assessment separates—in this instance—the recidivists from the nonrecidivists. Three metrics—the AUC, the H measure developed by Hand (2009), and the precision–recall curve (PRC)—were used to assess predictive discrimination. The AUC is relatively robust across different recidivism base rates and selection ratios (Smith, 1996), and it has arguably been the most widely used metric to assess recidivism prediction performance. With values that range from 0 to 1, the AUC statistic is interpreted as the probability that a randomly selected recidivist has a higher score on a risk assessment instrument than a randomly selected nonrecidivist. According to the literature, an AUC between 0.90 and 1.00 is considered excellent, between 0.80 and 0.89 is good, between 0.70 and 0.79 is fair, between 0.60 and 0.69 is poor, and between 0.50 and 0.59 represents a failure to achieve predictive discrimination (Baird et al., 2013; Thornton & Laws, 2009).
The AUC can provide an overly optimistic estimate of predictive discrimination for imbalanced data sets (J. Davis & Goadrich, 2006), and it can provide misleading results if receiver operating characteristic (ROC) curves cross because it uses different misclassification cost distributions for dissimilar classifiers (Hand, 2009). Accordingly, this study employs the metric developed by Hand (2009), the H measure, which uses a common cost distribution for all classifiers. With higher values indicating better performance, previous studies have reported H values that ranged from 0.02 to 0.40 (Duwe, 2017; Duwe & Rocque, 2017; Hamilton et al., 2015). The PRC, which uses the precision and recall values to assess predictive discrimination, is another alternative to the AUC. Like the H measure, PRC values range from 0 to 1, with higher values denoting better performance. In the only recidivism study that has applied the PRC, the values ranged from 0.05 to 0.24 (Duwe, 2019).
Calibration measures how well the predicted probabilities from a model correspond with the observed outcome being predicted. Whereas predictive discrimination assesses relative risk, calibration taps into absolute risk. For a prediction instrument to make accurate absolute assessments of risk, the model’s predicted probabilities must be calibrated with the observed recidivism outcomes. With values that range from 0 to 1, root mean square error (RMSE) measures the square root of the average squared difference between observed recidivism and predicted probabilities. The closer the RMSE value is to zero, the better the calibration.
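As a minimal sketch of this definition, RMSE can be computed from predicted probabilities and binary (0/1) outcomes as follows; the function name is hypothetical.

```python
import math


def rmse(probs, outcomes):
    """Square root of the mean squared gap between outcomes and probabilities."""
    squared_errors = [(p - y) ** 2 for p, y in zip(probs, outcomes)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(rmse([0.5, 0.5], [1, 0]))  # 0.5
```

Predictions that match every outcome exactly give 0, and predictions that are wrong with complete certainty give 1, which is why the statistic is bounded by 0 and 1 for binary outcomes.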
In addition to these metrics, two consolidated statistics assessed overall predictive performance. The SAR (squared error, accuracy, ROC) is a combined measure of discrimination, accuracy, and calibration, and its formula is (ACC + AUC + (1 − RMSE)) / 3 (Caruana et al., 2004). In previous recidivism prediction research using the SAR, values have ranged from 0.63 to 0.83 (Hamilton et al., 2016; Tollenaar & van der Heijden, 2013). The SHARP (squared error, H measure, ACC, ROC, and PRC) statistic is similar to SAR except that it weights predictive discrimination more heavily by including the H and PRC statistics. Designed specifically for assessing overall predictive performance within imbalanced data sets, SHARP’s formula is (H measure + AUC + PRC + ACC + (1 − RMSE)) / 5 (Duwe, 2019). The values for SHARP range from 0 to 1, with higher values signifying better predictive performance. In the only study that has used the SHARP metric, the values ranged from 0.50 to 0.63 (Duwe, 2019).
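Both consolidated statistics are simple averages, so they can be checked by hand. The sketch below implements the two formulas and applies them to the average component values reported in the Results for the female and male models; the function names are hypothetical.

```python
def sar(acc, auc, rmse):
    """SAR = (ACC + AUC + (1 - RMSE)) / 3."""
    return (acc + auc + (1 - rmse)) / 3


def sharp(h, auc, prc, acc, rmse):
    """SHARP = (H + AUC + PRC + ACC + (1 - RMSE)) / 5."""
    return (h + auc + prc + acc + (1 - rmse)) / 5


# Average values reported in the Results section.
female = {"h": 0.379, "auc": 0.832, "prc": 0.405, "acc": 0.904, "rmse": 0.279}
male = {"h": 0.377, "auc": 0.836, "prc": 0.432, "acc": 0.898, "rmse": 0.294}

female_sar = round(sar(female["acc"], female["auc"], female["rmse"]), 3)  # 0.819
female_sharp = round(sharp(**female), 3)                                  # 0.648
male_sar = round(sar(male["acc"], male["auc"], male["rmse"]), 3)          # 0.813
male_sharp = round(sharp(**male), 3)                                      # 0.65
```

Because both formulas are linear in their components, applying them to the average component values gives the same result as averaging the statistic across test sets, and the values above match the reported averages.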
Results
The predictive performance results of the models developed for female and male prisoners are presented in Tables 3 and 4, respectively. Both tables show how each of the 12 models (four for females and eight for males) performed on the test sets. As noted earlier, female and male data sets consisting of admissions to prison during 2017 are used as additional test sets for the initial assessment and 6-month assessment models. Tables 3 and 4 thus present predictive performance results for 16 test sets (six for females and 10 for males). These tables also report the AUCs for the MnDOC’s current classification system for all 16 test sets to provide a means of comparison with the models developed for this study.
Table 3. Predictive Performance Results: Female Prisoners.
Note. AUC = area under the curve; H = Hand’s H measure; PRC = precision–recall curve; ACC = accuracy; RMSE = root mean square error; SAR = squared error, accuracy, receiver operating characteristic; SHARP = squared error, H measure, accuracy, receiver operating characteristic, and PRC.
Table 4. Predictive Performance Results: Male Prisoners.
Note. AUC = area under the curve; H = Hand’s H measure; PRC = precision–recall curve; ACC = accuracy; RMSE = root mean square error; SAR = squared error, accuracy, receiver operating characteristic; SHARP = squared error, H measure, accuracy, receiver operating characteristic, and PRC.
As shown in Table 3, the average AUC for the MnDOC’s current classification assessment across all six test sets was 0.653. With an average AUC of 0.832 across the six test sets, the four female prisoner models developed in this study had much better predictive discrimination. For the other two predictive discrimination metrics, the average H value was 0.379 and the average PRC value was 0.405. The average ACC was 0.904, indicating the models made correct classification decisions 90% of the time. The average value for RMSE, the calibration metric, was 0.279. Meanwhile, the averages for SAR and SHARP, the two consolidated metrics, were 0.819 and 0.648, respectively.
The results suggest that predictive performance generally improved after the initial classification assessment. For example, if we focus on the AUC, we see that values were 0.759 and 0.731 in the initial classification assessment. In comparison, the AUC values were 0.854 and 0.922 for the 6-month assessment, 0.909 for the 12-month assessment, and 0.819 for the 18-month assessment. The improved performance among the later assessments reflects the fact these models were able to draw on more recent behavioral indicators such as misconduct and UI status.
In Table 4, the results show the average AUC for the MnDOC’s current classification assessment across all 10 test sets for males was 0.665. In comparison, the average AUC was 0.836 across the 10 test sets for the eight predictive models that were developed. The average H value was 0.377, whereas the average PRC value was 0.432. The average ACC was 0.898 among the 10 test sets, and the average RMSE was 0.294. For the two consolidated metrics, the averages were 0.813 for SAR and 0.650 for SHARP.
As with the results for female prisoners, predictive performance generally improved after the intake assessment. Whereas the intake assessment AUCs were 0.768 and 0.747 for the two test sets, they were 0.828 and 0.800 in the two test sets for the 6-month assessment. After the 6-month assessment, the AUC values ranged between 0.840 and 0.888 for the remaining assessments, which reflects the ability to capitalize on measures of recent behavior in prison.
The results for males also indicate predictive performance was slightly worse for the two test sets based on 2017 admissions. This is not unexpected, given that the other test sets were more similar to the training set (2006–2009 releases) insofar as they consisted of inmates released from prison in 2010 and 2011. Still, the results for the 2017 admission test sets suggest that the models would have good predictive performance if they were applied to the Minnesota male prisoner population in the future.
In Table 5, we take a more detailed look at the predictive performance results for females and males. For each of the 16 test sets, this table provides the SFM base rate (the number of prisoners who met the SFM criteria divided by the number of prisoners in the test set), the number of prisoners making up the top 10%, the number of “true positives” (i.e., the number of prisoners in the top 10% who met the criteria for SFM), the true positive rate, the SFM capture rate, and the amount of base rate improvement.
Female and Male Predictive Performance Results for High-Risk Prisoners.
Note. SFM = serious and/or frequent misconduct.
To illustrate, let us consider the 12-month assessment results for male prisoners. Table 5 shows there were 4,468 male prisoners in the test set, of whom 570 met the SFM criteria (five or more discipline convictions, three or more segregation sentences, and/or violent misconduct). As a result, the SFM base rate for the male 12-month assessment was 12.8%. The top 10% consisted of the 447 prisoners with the highest predicted probabilities for SFM at the 12-month assessment. Of the 447, there were 262 who were “true positives” insofar as they met the SFM criteria (a) after the 12-month assessment and (b) prior to their release from prison or the 18-month assessment. The true positive rate for the top 10% is therefore 58.6%, or roughly 59%. Given that there were a total of 570 SFM prisoners in the test set, the 262 true positives yield an overall SFM capture rate of 46%. The true positive rate of 58.6% among the top 10% is also about 4.6 times the base rate of 12.8%, which signifies the amount of improvement over the base rate.
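The arithmetic in this example can be reproduced with a short sketch. The function and argument names below are hypothetical and are not taken from the study; the counts are the ones given in the text for the male 12-month assessment. Note that the “true positive rate” as used here is the share of the top 10% who met the SFM criteria.

```python
def top_decile_metrics(n_total, n_sfm, n_top, n_true_pos):
    """Summary metrics for the highest-risk group in one test set.

    n_total:    prisoners in the test set
    n_sfm:      prisoners who met the SFM criteria
    n_top:      prisoners in the top 10% by predicted probability
    n_true_pos: prisoners in the top 10% who met the SFM criteria
    """
    base_rate = n_sfm / n_total
    true_positive_rate = n_true_pos / n_top
    capture_rate = n_true_pos / n_sfm
    improvement = true_positive_rate / base_rate
    return {
        "base_rate": base_rate,
        "true_positive_rate": true_positive_rate,
        "capture_rate": capture_rate,
        "improvement": improvement,
    }


# Counts reported in the text for the male 12-month assessment.
m = top_decile_metrics(n_total=4468, n_sfm=570, n_top=447, n_true_pos=262)

print(f"Base rate:             {m['base_rate']:.1%}")           # 12.8%
print(f"True positive rate:    {m['true_positive_rate']:.1%}")  # 58.6%
print(f"SFM capture rate:      {m['capture_rate']:.1%}")        # 46.0%
print(f"Base rate improvement: {m['improvement']:.1f}")         # 4.6
```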
When we examine the results for females and males, the true positive rate among the top 10% ranged from 20% to 59% across the 16 test sets (the lowest value for males was 31%, and the highest value for females was 54%). Isolating the top 10% with the highest risk of SFM yielded an overall SFM capture rate that ranged from 26% to 61% (31% to 59% for males). The base rate improvement ranged from a low of 2.5 to a high of 6.2 (3.1 to 6.0 for males). The averages among the 16 test sets were 45% for the true positive rate, 44% for the SFM capture rate, and 4.4 for base rate improvement.
These results imply that if the MnDOC began using these predictive models for classification purposes, then about 45% of the highest risk prisoners (the top 10%) would eventually meet the criteria for SFM, which is 4.4 times higher than the base rate. The true positives among the top 10% would also account for roughly 44% of all prisoners who engaged in SFM. Given these findings, if the risk of SFM could be effectively mitigated among the top 10%, it could lead to a notable drop in misconduct overall.
Conclusion
The results indicate that the SFM assessment had a relatively high level of predictive performance, likely for several reasons. There are performance advantages to having a fully automated assessment that is customized to the population on which it will be used. A fully automated assessment system is more efficient and cost-effective; moreover, automation implies customization. All things being equal, assessments that have been customized to the populations on which they are used have a “home-field advantage” relative to global, off-the-shelf assessments that are applied to populations on which they have not been validated.
The results also suggest that recent behavioral indicators are, not surprisingly, influential in improving predictive performance. Follow-up research will examine the cost-effectiveness of the fully automated system and continue to assess its performance. In addition, rather than developing separate predictive models for all misconduct, segregation misconduct, and violent misconduct, using a consolidated SFM measure is a more streamlined, parsimonious approach.
The approach described here is not presented as the only way to develop and validate a prison misconduct prediction assessment. Other prison systems may have different needs. Still, it represents one approach that could be helpful, particularly because relatively little research has been published on prison misconduct prediction instruments.
The main point here is that a prison system with an accurate classification assessment should not simply admire its accuracy. Instead, steps should be taken to mitigate risk among those at highest risk so as to make facilities safer for both inmates and staff. Moreover, the literature shows that prison misconduct is predictive of recidivism. As noted earlier, there is overlap in what predicts misconduct and recidivism. The items identified as predictive of SFM may also help predict recidivism. Reducing misconduct among those at highest risk of SFM may also be helpful in reducing recidivism risk.
An accurate classification system can be employed to improve institutional safety by not only identifying those at greatest risk of SFM but also taking steps to mitigate that risk. For example, although many acknowledge that reentry begins at the start of confinement, programming tends to be back-loaded toward the end of confinement because doing so has a greater impact on recidivism. Those with a higher risk of SFM, however, should perhaps be prioritized for programming regardless of their length of stay. An intervention such as cognitive–behavioral therapy (CBT), for instance, may be helpful in reducing the risk of SFM.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
