Abstract
We estimated the predictive power of the dynamic items in the Finnish Risk and Needs Assessment Form (Riski- ja tarvearvio [RITA]), assessed by caseworkers, for predicting recidivism. These 52 items were compared to static predictors including crime(s) committed, prison history, and age. We used two machine learning methods (elastic net and random forest) for this purpose and compared them with logistic regression. Participants were 748 men who had and 748 who had not reoffended during matched follow-up periods from 0.5 to 5.8 years. Both RITA items and static predictors predicted general and violent recidivism well (area under the curve [AUC] = .74-.78), but combining them increased discrimination only slightly over static predictors alone (ΔAUC = .01-.03). Calibration was good for all models. We argue that the results show strong potential for the RITA items, but that development is best focused on improving usability for identifying treatment targets and for updating risk assessments.
Risk assessment in correctional settings has two important and, in part, competing roles: to assess the risk of new crimes and thereby assign prisoners to the appropriate level of supervision and to identify needs of the prisoner to provide treatment that can facilitate successful reintegration into society (Monahan & Skeem, 2016). For an instrument to be informative for decisions regarding supervision level, it has to show predictive validity for the unwanted events that assigning higher security levels aims to avoid. If needs are identified as treatment targets, they not only have to predict the success of reintegration but also explain this success. A textbook case of instrument development would start with creating theory-driven measurements, then establish the construct validity for these measurements, and finally estimate their predictive validity. However, data with measurements of differing levels of theoretical justification and construct validity are often readily available for analysis. In these cases, including the present study, we argue that an explicit examination of predictive power is a fruitful place to start. We focused on so-called dynamic risk factors and aimed to show that two machine learning methods (elastic net logistic regression and random forest) are well suited for this task.
In the risk assessment literature, there is a distinction between dynamic risk factors that can change and be changed during the prison sentence (e.g., alcohol use or employment problems) and static risk factors that cannot be changed through interventions (e.g., number of previous sentences or age; see Douglas & Skeem, 2005). Instruments that assess dynamic risk factors have potential benefits over only assessing static predictors. They hold the promise of identifying treatment targets at the same time as the current risk level is assessed. Variables that can change can also be used to update risk assessments. The Finnish Risk and Needs Assessment Form (Riski- ja tarvearvio [RITA]) is an assessment tool that puts focus on variables that are at least potentially dynamic. In the present study, we compared the predictive power of the dynamic items in RITA to available static predictors.
The Value of Quantifying Predictive Power
Focusing on predictive power, instead of construct validity, risks resulting in multidimensional composite constructs rather than construct valid and causal constructs (Cording, Beggs Christofferson, & Grace, 2016; Ward, 2016). However, examining predictive power has many uses, among them quantifying how well an outcome can be predicted by the information at hand (Shmueli, 2010). Variables that do not predict a relevant outcome should be redesigned, reoperationalized, or abandoned. Doing so can in the long run help meet criteria for construct validity and develop causal constructs.
Predicting and explaining a phenomenon, such as recidivism, are tasks that put different demands on the statistical methods used (Shmueli, 2010). The goals of accurate prediction and interpretable explanation may be incompatible as maximizing one often entails compromises on the other (Kuhn & Johnson, 2013). When we try to explain what predicts recidivism, we focus on the function that links the predictors to the outcome. This function may be complicated, and we often need to simplify the model to be able to describe it. An alternative is to ease our requirement to understand the model in favor of maximizing how well it duplicates the observed outcomes (Breiman, 2001b). Arguably, to treat the relationship between predictors and outcome as a black box is unsatisfactory (Zeng, Ustun, & Rudin, 2017), but we can ultimately choose a more transparent scoring technique. In that case, we are interested in the loss of predictive power (if any). A focus on prediction rather than explanation allows a more explicit test of the predictive power of dynamic risk factors compared to static predictors before introducing constraints that will improve interpretability but reduce predictive power.
Risk Assessment in Finnish Prisons
In Finland, per the Finnish Imprisonment Act (Vankeuslaki, 767, 2005), a plan is drawn up in the beginning of the sentence for how the person should serve the sentence. The plan includes placement and planned activities during the sentence. In this sentence plan, considerations include available information about previous sentences, working and functional ability, criminality, and other circumstances. In some cases (typically for persons with sentences longer than 1 year), the assessment unit uses a structured assessment form, here termed the Finnish Risk and Needs Assessment Form (Riski- ja tarvearvio [RITA]). It is based on the Offender Assessment System (OASys) used in England and Wales (Debidin, 2009). It was translated and adapted for the Finnish circumstances in 2004 (Lilja, 2014) and taken into common use by the Criminal Sanctions Agency in Finland in 2006. The OASys was developed in the late 1990s building on existing evidence and a strategy termed “What works” (McGuire, 1995). The risk factors in OASys, and consequently in RITA, have considerable overlap with the “central eight” cataloged by Andrews, Bonta, and Wormith (2006).
After two opening sections on the committed crimes for the current sentence, RITA contains thematic sections covering dynamic variables. Using confirmatory factor analysis, Salo, Laaksonen, and Santtila (2016) suggested that the domains underlying the responses on the RITA items are best described as problems managing one’s economy, alcohol problems, resistance to change, drug abuse and associated behavior, aggressiveness, and employment problems. The form has no formal status in any process of correctional decision making. For example, there are no regulations as to what kind of RITA profile an inmate should have to be placed in an open institution (Criminal Sanctions Agency in Finland, 2004).
Value of Dynamic Risk Factors
An important reason for considering dynamic risk factors is that as they can change, they can be used to update assessments of risk (Clarke, Peterson-Badali, & Skilling, 2017; De Vries Robbé, de Vogel, Douglas, & Nijman, 2015; Howard & Dixon, 2013; Lewis, Olver, & Wong, 2013; Olver, Beggs Christofferson, Grace, & Wong, 2014). If there is change due to an intervention, and if this change can be linked to changes in recidivism rates, then assessing dynamic risk factors can even help us establish causal explanations for recidivism (Cording et al., 2016; Ward & Beech, 2015). Repeated measurements of the same variables are needed for this potential to come into play, and for now, this is not a feature of the Finnish risk assessment system. The RITA could be characterized as a “third-generation risk assessment instrument” (Andrews et al., 2006; Campbell, French, & Gendreau, 2009) that includes dynamic variables, but is not integrated into the assessment of change or of how successful the interventions are.
Without evidence that the variables change or that this change has any effect on the probability of recidivism, they may cautiously be called variable markers (Kraemer, 2003) or potentially dynamic risk factors (Hanson & Harris, 2000). In the present study, we aim to establish whether previously identified RITA domains meet this first step of showing a statistical relationship to recidivism. Arguably, as long as static and dynamic predictors show the same level of predictive power, dynamic variables should be preferred as they can serve the additional purpose of identifying risk-reducing interventions. However, regarding the narrower task of aiding early decisions about security level, dynamic variables should preferably show incremental predictive power over easily assessed static predictors.
Predictive Validity of Risk Assessment Instruments
There is ample evidence that actuarial assessment that uses standardized questions and scoring schemes generally has higher predictive power than clinical judgment that uses neither (Ægisdóttir et al., 2006; Andrews et al., 1990; Dawes, Faust, & Meehl, 1989; Grove, Zald, Lebow, Snitz, & Nelson, 2000; Quinsey, 2009). However, among these structured assessments, previous research does not show a clear advantage for a particular set of predictors. In meta-analyses, risk assessment instruments with dynamic risk factors have shown similar predictive performance to those without (Campbell et al., 2009; Yang, Wong, & Coid, 2010). Furthermore, there seems to be considerable overlap in the predictive information that items within an instrument provide. Kroner, Mills, and Reddon (2005) constructed new measures by randomly picking 13 items from four well-established risk assessment instruments. The randomly assembled instruments predicted recidivism at the same level as the original instruments. Coid et al. (2011) showed that a subset of items within three prominent risk assessment instruments accounts for the predictive power of the full instruments.
There is more explicit evidence that the methodological aspects of a validation study play a role in the resulting predictive power. These factors include reliability of predictor measures, quality of the outcome measure, base rate of recidivism, and the length of follow-up (Andrews et al., 2011; Olver, Stockdale, & Wormith, 2014; Yang et al., 2010). The fact that results are dependent on these controlled parameters raises the question of whether risk assessment instruments, especially those with dynamic predictors, perform as well in applied settings as in research settings (Cording et al., 2016). Flores, Lowenkamp, Holsinger, and Latessa (2006) showed that assessments had stronger predictive validity when they were done by a person trained specially in the use of the instrument or by a person with more assessment experience. In contrast, Jones, Brown, and Zamble (2010) found that ratings by parole officer and by researcher had similar predictive validity. However, the models in their study, derived from stepwise Cox regression, were allowed to be different for different types of raters. Thus, it is possible that any differences in rating quality were offset by a customized weighting of the model predictors. Weighting that is customized to the setting where the instrument is used may have positive effects on predictive validity even if flexible scoring rules might hurt construct validity.
Methodological Considerations for Examining Predictive Power
There has been considerable development of so-called machine learning methods over the last few decades (Hastie, Tibshirani, & Friedman, 2009). For the present study, the subclass of methods for supervised learning is pertinent. There has been some, but not widespread, use of advanced supervised learning methods for risk assessment (Zeng et al., 2017) with some researchers finding limited benefit from the methods (Hamilton, Neuilly, Lee, & Barnoski, 2015; Tollenaar & van der Heijden, 2013). However, it is not given that the relative performance of different supervised learning methods in a particular domain can be established once and for all. The predictive performance of various methods depends on the nature of the predictors, the outcome, and the relationship between these variables—the actual performance is unknowable before model training (Berk & Bleich, 2013; Kuhn & Johnson, 2013). To reveal predictive performance, it is thus worth considering more than one single statistical prediction method.
Two popular methods are elastic net logistic regression (Zou & Hastie, 2005) and random forest (Breiman, 2001a). They represent a minimal subset of available supervised learning methods, but have different strengths regarding what types of predictor–outcome relationships they can detect. Trying these two methods is thus likely to capture much of the prediction potential while limiting the search for the optimal model. Both methods have the strength that noninformative predictors do not diminish their predictive power. This strength is a very suitable feature for exploring the predictive power of a set of predictors collectively as it does not require variable selection. Every method is, however, susceptible to overfitting, and the use of multiple methods increases the requirement of thorough cross-validation.
We compare these methods to a more traditional method: logistic regression. Some readers will see our use of logistic regression as a straw man and argue that logistic regression would never be used in this manner. We do not expect logistic regression to perform well when using the numerous variables in our data but include it as a baseline measure.
Controlling for Change in Risk Via Prison Term Variables
Assessments at the beginning of the sentence give a snapshot at a given time point. Over time, both static and dynamic predictors should lose some of their predictive power due to spontaneous or induced changes in risk (Andrews et al., 2011). It is, however, possible that dynamic risk factors lose their predictive power faster than static risk factors and that dynamic risk factors would show higher predictive power if updated closer to release. In the absence of repeated measurements, we can only implicitly assess the potential of updated risk assessment. We can do this via indicators of changes in risk that become available after the initial assessment. Some of this information is captured in prison term variables such as placement decisions or crime during the sentence. We do not consider these variables as risk assessment items, but we are interested in the added predictive information that they provide. Including all variables gives us an estimate of the maximum predictive power of all information available in our data. To the extent that the initial assessment fails to capture risk at release, it should show up as incremental predictive power of the prison term variables.
Evaluating Predictive Performance: Discrimination and Calibration
There are two parts to the predictive performance of a model: discrimination and calibration (Helmus & Babchishin, 2017). Discrimination is the ability of the test to assign higher estimated probabilities of recidivism to individuals who do reoffend than to individuals who do not. Calibration, in contrast, concerns absolute levels: a test is well calibrated if the estimated probabilities it assigns to a group of individuals correspond to the actual recidivism rate in that group. A test can have good discrimination without having proper calibration. We evaluate both forms of predictive performance in the present study.
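The distinction can be made concrete with a small Python sketch. It is illustrative only: the data are invented, the function names are hypothetical, and the calibration check is the simplest possible one (mean estimated probability versus observed rate in a single group). Halving every estimated probability leaves the ranking of individuals, and therefore discrimination, untouched while calibration deteriorates.

```python
def auc(probs, outcomes):
    """Share of (reoffender, non-reoffender) pairs in which the reoffender
    received the higher estimated probability; ties count as half."""
    pos = [p for p, y in zip(probs, outcomes) if y == 1]
    neg = [p for p, y in zip(probs, outcomes) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def calibration_gap(probs, outcomes):
    """Mean estimated probability minus observed recidivism rate."""
    return sum(probs) / len(probs) - sum(outcomes) / len(outcomes)


# Invented outcomes (1 = reoffended) and estimated probabilities.
outcomes = [0, 0, 0, 1, 0, 1, 1, 1]
calibrated = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
# Same ranking, but every probability is halved, so risk is understated.
shrunk = [p / 2 for p in calibrated]

for name, probs in [("calibrated", calibrated), ("shrunk", shrunk)]:
    print(name, auc(probs, outcomes), round(calibration_gap(probs, outcomes), 3))
```

Both sets of estimates yield the same AUC, but only the first has a mean estimated probability that matches the observed rate.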
Research Questions
The present study examined the predictive power of dynamic items in the RITA and aimed to contribute to the discussion of the role of dynamic predictors of recidivism in two ways. For one, we performed an explicit comparison of static and dynamic predictors through methods that are designed to optimize predictive performance rather than interpretability. In addition, we examined the predictive potential of dynamic variables when they are assessed in an applied setting rather than under the strict control of scale developers. This latter element is in contrast to much of the research on dynamic risk factors (Cording et al., 2016). We considered predictions of both general and of violent recidivism. Specifically, we asked the following questions:
Do the prediction methods elastic net and random forest result in better predictions than a more traditional method, logistic regression?
Do the RITA items as assessed in an applied setting in the beginning of a sentence predict recidivism?
Do RITA domains show a statistical relationship with recidivism?
Do the RITA items have incremental predictive validity over available static predictors?
Does considering prison term variables as proxies for updated risk have incremental predictive validity over the assessment in the beginning of the sentence?
Are the models good enough, in terms of discrimination and calibration, to be applied?
Method
Sample
Information used in the present study was retrieved from the National Prisoner Database (in Finnish: Vankitietojärjestelmä), maintained by the Criminal Sanctions Agency in Finland. The database contains records of current and former inmates, including information about the offenses they have been convicted of, temporary releases or parole, possible disciplinary reports, as well as assessment reports (if completed). This study used a case-control sample consisting of individuals who had been sentenced to a new prison term, each matched with an individual released around the same date but with no new sentence. Subjects had to have the Finnish Risk and Needs Assessment Form (RITA) fully completed and to have been released from a Finnish prison between 2007 and 2011. The resulting sample consisted of 748 male cases and 748 male controls. Women were excluded because of their low number (n = 34 + 34). During the same period, on average 4,400 prisoners were released from Finnish prisons each year (Blomster, Linderborg, Muiluvuori, Salo, & Tyni, 2012).
Outcomes
In this study, recidivism was operationalized as a new prison sentence. There had to be a new sentence recorded in the prisoner database for an individual to be coded as having reoffended. Therefore, the outcome could be called reimprisonment as the extraction from the database does not capture convictions with more lenient sentences. We coded a new sentence of any kind (including violent offenses) as general recidivism and considered a crime as violent recidivism if it included homicide or assault.
The base rate for general recidivism was 50%, as a result of the sampling procedure. Violent recidivism, a subset of new crimes, had a base rate in this study of 18.6%. The median time to the new sentence was 102 days (max = 1,321 days). Data were collected during 2012, and the median follow-up time was 3.7 years (range = 0.5-5.8 years). By design, follow-up time for general recidivism is close to identical between reimprisoned individuals and individuals not reimprisoned. The distributions of follow-up time for individuals reimprisoned for a violent crime (M = 3.65 years, SD = 1.04) and other included individuals (M = 3.70 years, SD = 1.05) were also very similar as displayed by a quantile–quantile plot available in the online repository.
Predictors
Static Predictors
The static predictors included categories for the current crime committed, prison history, and age. The offenses related to the current sentence were grouped into 15 categories all listed in Table 1. These categories were not mutually exclusive. For example, violent resistance toward a police officer was coded both as an offense against official authorities and as assault.
Mean (and Standard Deviation) or Frequency (and Percentage) for Select Predictors by Reoffense Category
Note. RITA = Finnish Risk and Needs Assessment Form.
a Missing values for 426 individuals excluded. b Missing values for 45 individuals excluded. c Categories are not mutually exclusive. See https://github.com/bennysalo/predict-recidivism for all predictors.
We included four variables regarding the number of previous sentences: number of prison terms, community service terms, remand terms, and terms as a substitute for unpaid fines. For 45 men, this information was missing. For these, we coded the variable indicating the number of previous sentences as 0 and added a variable that indicated missing information regarding previous sentences. Other variables about prison history were a variable indicating any escape, unlawful absence or attempt thereof during any of the noted terms, and the person’s age when the first term was noted in the prison database. Current age was defined as the age at release.
Age at first term was missing for 426 individuals. In these cases, we treated the current sentence as a first sentence. As sentence length was not available in the data, we used age at release, rounded down to the closest integer, as age at first term. As with the number of previous sentences, we added a variable that indicated missingness. This method for replacing missing values captures the predictive information without introducing inferred values. It is also a method that is easy to implement in an applied setting. If the goal was to explain rather than predict recidivism, other methods for handling missing data (e.g., imputation) would be more appropriate. For prediction purposes, the information about missingness may in itself contain implicit information that is worth retaining (Shmueli, 2010).
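The coding of missing values described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the analyses (which were run in R); the function names and the use of None for a missing value are assumptions made for the example.

```python
import math


def code_previous_sentences(n_previous):
    """Return (value, missing_indicator) for a count of previous sentences.
    A missing count (None) is coded as 0 and flagged with indicator 1."""
    if n_previous is None:
        return 0, 1
    return n_previous, 0


def code_age_at_first_term(age_at_first_term, age_at_release):
    """Return (value, missing_indicator) for age at first term.
    If missing, the current sentence is treated as the first one and the
    age at release, rounded down to an integer, is used instead."""
    if age_at_first_term is None:
        return math.floor(age_at_release), 1
    return age_at_first_term, 0


print(code_previous_sentences(None))         # missing count
print(code_previous_sentences(3))            # observed count
print(code_age_at_first_term(None, 34.8))    # missing age at first term
print(code_age_at_first_term(21, 34.8))      # observed age at first term
```

The indicator lets a prediction model assign its own weight to the fact that information was missing, separately from the substituted value.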
Dynamic Predictors
The 52 dynamic RITA items were all coded on a 3-point scale with 0 = risk factor not present, 1 = the risk factor is somewhat present, or 2 = the risk factor is evidently present. To study the predictive power of RITA domains, we calculated sum scores based on the domains suggested in Salo et al. (2016). These domains were based on a subset of the original items, and some items belonged to two domains. They are Alcohol problems (seven items, Cronbach’s α = .804), Resistance to change (12 items, α = .793), Employment problems (six items, α = .774), Problems managing one’s economy (six items, α = .731), Aggressiveness (seven items, α = .685), and Drug abuse and associated behavior (six items, α = .590). Because of low reliability, the last domain was not used and instead replaced by the domain Current drug use and its effects (seven items, α = .926) that is a subpart of that domain. The original items and the domains can be found in Salo et al. (2016).
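As an illustration of how a domain sum score and its reliability can be computed, the following Python sketch applies the standard formula for Cronbach's α, α = k / (k − 1) × (1 − Σ item variances / variance of the sum score), to invented ratings on the 0 to 2 scale. The ratings, the three-item domain, and the function names are hypothetical and do not come from the RITA data.

```python
from statistics import variance


def sum_score(item_ratings):
    """Domain score for one individual: the sum of the item ratings."""
    return sum(item_ratings)


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of individuals, each a list of k ratings."""
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum_score(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Invented ratings for five individuals on a three-item domain
# (0 = not present, 1 = somewhat present, 2 = evidently present).
ratings = [
    [0, 0, 1],
    [1, 1, 1],
    [2, 1, 2],
    [0, 1, 0],
    [2, 2, 2],
]

print([sum_score(row) for row in ratings])
print(round(cronbach_alpha(ratings), 3))
```

The sketch requires at least two items and some variation in the sum scores; otherwise the formula is undefined.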
Variables Related to the Prison Term
The variables we used to estimate the predictive power of information available after the initial assessment were placement in open prison, granting conditional release, supervision of parole, and crime during the sentence. After considering the risk of absconding or crime, an individual can be placed to serve part of the sentence in an open institution with minimum security. In addition to practical considerations, the individual needs to agree to abstain from drug use and, if necessary, take recurring drug tests. A person placed in an open institution can be returned to a closed institution, and the placement at release was what determined the value on the predictor. This variable is thus a combination of the prison personnel’s view of risk and the willingness and ability of the imprisoned individual to meet requirements for placement in an open institution. The same is true for granting conditional release, where after similar considerations, the last 6 months of a sentence can be served in the community while under technical surveillance. For this variable, we had information available on whether a conditional release was granted and whether it was eventually revoked. This information was dummy coded as two separate predictors.
In the majority of cases, prisoners are released on parole. This parole is generally supervised if the remaining sentence is longer than 18 months and the supervision may take several forms. Our variable, supervision of parole, includes minimal judgment from caseworkers and is primarily an indicator of sentence length. It is nevertheless related to the unfolding of the prison term and adds information potentially not captured in the static and dynamic predictors. Finally, the variable crime during sentence indicates whether the individual committed or was a suspect of a crime during any lawful or unlawful absence from the prison. Descriptive statistics for select predictors per reoffense category are presented in Table 1.
Machine Learning Methods
We employed two often used methods for supervised learning: elastic net logistic regression (Zou & Hastie, 2005) and random forest (Breiman, 2001a). These methods have different strengths. Elastic net adds a penalty parameter to the estimation of a logistic regression model to reduce overfitting and thereby improve predictive power. It performs well when there are relationships between predictors and the outcome that are parsimoniously captured by logistic regression coefficients. The elastic net is also less susceptible than regular logistic regression to multicollinearity, thanks to the penalty imposed on the coefficients (Dormann et al., 2013).
Random forest is an aggregation of multiple classification trees built on bootstrapped samples and with random selections of the predictors to consider. Its strength lies in its flexibility. By growing trees in multiple layers, it can accommodate complex interactions between predictors. However, while it splits predictors at optimal points, it does not capture linear relationships as well as logistic regression. From an interpretability standpoint, it is also inferior to logistic regression methods (Hastie et al., 2009; Kuhn & Johnson, 2013). We compared both models to logistic regressions without penalty. For models with a single RITA domain as a predictor, we only used logistic regression as the two other methods are superfluous when there is no need to weigh multiple predictors against each other.
For both methods, the exact model depends on so-called tuning parameters that are essential parts of the algorithms and affect predictive power. For elastic net, there are the nature and strength of the penalty parameter, and for random forest, there is the number of variables to select randomly for consideration in growing the classification trees. The best tuning parameter values were chosen based on cross-validation in the training set. This procedure is documented on https://github.com/bennysalo/predict-recidivism.
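To show what the "nature and strength" of the elastic net penalty refer to, the following Python sketch computes the penalty term in one common parameterization, the one used by the R package glmnet: strength × (mix × Σ|β| + (1 − mix) / 2 × Σβ²). Whether the analyses used exactly this parameterization is an assumption; the documented procedure is in the repository linked above. The function name, the coefficient values, and the candidate grid are hypothetical.

```python
def elastic_net_penalty(coefs, strength, mix):
    """Penalty added to the (negative) log likelihood.
    strength: overall size of the penalty (often called lambda).
    mix: nature of the penalty, from 0 (pure ridge, squared coefficients)
         to 1 (pure lasso, absolute coefficients)."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return strength * (mix * l1 + (1 - mix) / 2 * l2)


# Hypothetical coefficients and a small grid of candidate tuning values.
coefs = [1.0, -2.0]
grid = [(strength, mix) for strength in (0.01, 0.1, 0.5) for mix in (0.0, 0.5, 1.0)]

for strength, mix in grid:
    print(strength, mix, round(elastic_net_penalty(coefs, strength, mix), 4))
```

In tuning, each candidate pair would be evaluated by cross-validation and the pair with the best out-of-sample performance retained. For random forest, the analogous grid runs over the number of predictors sampled at each split.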
Assessment of Predictive Performance and Incremental Validity
In all analyses, we evaluated the performance of the prediction models in a validation set that was separate from the data used to train the models. This approach has several advantages. First, it gives an estimate of the predictive performance that is not inflated by overfitting. Second, regarding tests of incremental validity, we do not need to rely on assumptions of distribution of the performance measure or its degrees of freedom. This flexibility is especially helpful when comparing models trained with different prediction methods. Third, we aggregate the predictors and estimate the predictive validity of this aggregate prediction in the same step. For each separate case, the predictors are combined in a way that renders the best out-of-sample predictive performance. We argue that this reveals the predictive potential of predictor sets better than relying on an intermediate aggregation step such as factor analysis.
Resampling Methods
We validated all models through repeated cross-validation within a training set and later validated selected models through a single validation in a separate test set. This is a recommended approach when developing an applied prediction tool and avoids capitalizing on chance (Kuhn & Johnson, 2013). All validation sets had either 299 or 300 observations (equivalent to 20% of the original sample) with the same base rate of the outcome in every sample. We first sampled 300 observations to form the test set. The remaining 1,196 observations formed the training set. Cross-validation within the training set consisted of 250 repeats of fourfold cross-validation, resulting in 1,000 resamples. In each resample, 299 observations were set aside for validation, and the remaining 897 served as training data. The variance of model performance in the test set was estimated through bootstrapping the test set 2,000 times. For both validation methods, the confidence intervals and p values reported in the “Results” section are based on percentiles in the respective resample sets.
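The resampling scheme can be sketched in Python as follows. The sketch only reproduces the structure of the splits (250 repeats of stratified fourfold cross-validation on 1,196 observations, and a percentile interval from 2,000 bootstrap samples of 300 observations); it is not the authors' R code. The function names, the seeds, the nearest-rank percentile rule, and the placeholder outcome vectors with an exact 50% base rate are assumptions made for the example.

```python
import random


def stratified_folds(outcomes, n_folds, rng):
    """Shuffle indices within each outcome class, then deal them out in
    turn so that every fold has nearly the same base rate."""
    by_class = {}
    for i, y in enumerate(outcomes):
        by_class.setdefault(y, []).append(i)
    ordered = []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        ordered.extend(idx)
    folds = [[] for _ in range(n_folds)]
    for position, i in enumerate(ordered):
        folds[position % n_folds].append(i)
    return folds


def repeated_cv(outcomes, n_folds=4, n_repeats=250, seed=1):
    """Yield (training indices, validation indices) for every resample."""
    rng = random.Random(seed)
    for _ in range(n_repeats):
        for held_out in stratified_folds(outcomes, n_folds, rng):
            held = set(held_out)
            train = [i for i in range(len(outcomes)) if i not in held]
            yield train, held_out


def bootstrap_percentile_ci(values, statistic, n_boot=2000, seed=1):
    """95% percentile interval of a statistic over bootstrap samples."""
    rng = random.Random(seed)
    stats = sorted(statistic([rng.choice(values) for _ in values])
                   for _ in range(n_boot))
    return stats[round(0.025 * n_boot)], stats[round(0.975 * n_boot) - 1]


# Placeholder outcomes: training set of 1,196 and test set of 300.
training_outcomes = [0] * 598 + [1] * 598
test_outcomes = [0] * 150 + [1] * 150

splits = list(repeated_cv(training_outcomes))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
print(bootstrap_percentile_ci(test_outcomes, lambda xs: sum(xs) / len(xs)))
```

In an actual analysis, a model would be fitted on each set of training indices and scored on the corresponding validation indices, and the bootstrapped statistic would be a performance measure rather than the base rate.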
The two validation methods produce measures that have slightly different interpretations. The cross-validation in the training set measures the expected prediction performance as the observations in both the training set and the validation set are allowed to vary. This method allowed us to compare models without the results being overly dependent on the random split. After having settled on the prediction method and tuning parameters, we estimated the performance of the final models trained on the full training set. We could do this by validating the models in the 300 observations that were first set aside. The performance in the single test set is highly dependent on the particular random split, which makes model comparisons less clear, but is a more straightforward test of the models we propose. Statistical power can be increased by allocating more individuals to validation sets. However, as this automatically decreases the size of the training set, it diminishes the accuracy of the trained model. In prediction modeling, the balance is often struck by allocating the majority of observations to the training set (Hastie et al., 2009; Kuhn & Johnson, 2013).
Measures of Predictive Performance
We evaluated the prediction models on discrimination, calibration, and overall performance. We used overall performance as the criterion for selecting the best prediction methods and for making inferences of incremental validity of predictor sets. Here, we used McFadden’s pseudo-R2 (McFadden, 1974). This measure has a direct relationship to the log likelihood, commonly used as a measure of model fit in logistic regression, but is an attempt to express the predictive performance as an effect size measure. Models that have predictions closer to the observed values have higher log likelihood and consequently higher pseudo-R2.
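McFadden's pseudo-R2 is defined as 1 minus the ratio of the model's log likelihood to the log likelihood of a null model. The Python sketch below computes it for out-of-sample predictions. Taking the null model to be a constant prediction at the observed base rate of the validation data is an assumption made for this illustration, as are the function names and the invented data.

```python
import math


def log_likelihood(probs, outcomes):
    """Bernoulli log likelihood of observed outcomes (1 = reoffended)
    given the estimated probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1 - p)
               for p, y in zip(probs, outcomes))


def mcfadden_r2(probs, outcomes):
    """1 - LL(model) / LL(null), with the null model predicting the base
    rate for everyone. Requires a base rate strictly between 0 and 1."""
    base_rate = sum(outcomes) / len(outcomes)
    ll_null = log_likelihood([base_rate] * len(outcomes), outcomes)
    return 1 - log_likelihood(probs, outcomes) / ll_null


outcomes = [1, 1, 0, 0]
print(mcfadden_r2([0.8, 0.7, 0.3, 0.2], outcomes))  # informative predictions
print(mcfadden_r2([0.5, 0.5, 0.5, 0.5], outcomes))  # no better than base rate
print(mcfadden_r2([0.1, 0.9, 0.9, 0.1], outcomes))  # confident but often wrong
```

When predictions are evaluated on data not used for training, the measure can fall below zero: an overfitted model that is confidently wrong does worse than simply predicting the base rate.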
A common metric for discrimination is the area under the curve (AUC). In the present case, the AUC measures the extent to which individuals who commit new crimes are ranked as having a higher risk than individuals who do not reoffend. In contrast to pseudo-R2, the AUC is not explicitly affected by the absolute value of estimated probabilities (Helmus & Babchishin, 2017). We find Cohen’s d easier to interpret when comparing models. Cohen’s d is a linear metric, whereas for pseudo-R2 and AUC, the same numerical difference can represent differences in performance of different sizes depending on where on the scale it occurs. For differences between models, we therefore first converted pseudo-R2 and AUC values to equivalent values on Cohen’s d using the formulae in Table 1 of Ruscio (2008), assuming equal group sizes. (Pseudo-R2 was treated as the square of r.) The assumption of a base rate of 50% does not hold for violent recidivism but makes the comparison between types of recidivism easier. As a rule of thumb, d = 0.2, 0.5, and 0.8 are characterized as small, medium, and large effects (Cohen, 1992). Converted by the above-mentioned formulae, these values correspond to AUC = .56, .64, and .71 and pseudo-R2 = .01, .06, and .14.
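The conversions can be sketched in Python with the standard equal-group-size relationships AUC = Φ(d / √2) and r = d / √(d² + 4), where Φ is the standard normal distribution function. That these are the formulae taken from Ruscio (2008) is an assumption, although they reproduce the benchmark values given in the text. The function names are hypothetical, and the conversion from pseudo-R2 only applies to values between 0 and 1.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()


def d_to_auc(d):
    return STANDARD_NORMAL.cdf(d / math.sqrt(2))


def auc_to_d(auc):
    return math.sqrt(2) * STANDARD_NORMAL.inv_cdf(auc)


def d_to_r2(d):
    # r = d / sqrt(d^2 + 4), so r^2 = d^2 / (d^2 + 4)
    return d * d / (d * d + 4)


def r2_to_d(r2):
    # Inverse of the above: d = 2r / sqrt(1 - r^2); needs 0 <= r2 < 1.
    return 2 * math.sqrt(r2) / math.sqrt(1 - r2)


for d in (0.2, 0.5, 0.8):
    print(d, round(d_to_auc(d), 2), round(d_to_r2(d), 2))

# Difference between two models expressed in d equivalents,
# here for two hypothetical AUC values.
print(round(auc_to_d(0.77) - auc_to_d(0.74), 2))
```

Expressing a model difference as a difference between two converted values, as in the last line, puts differences found at different points of the AUC or pseudo-R2 scale on a common linear metric.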
Calibration statistics in the risk assessment literature and the methods for examining calibration vary (Hanson, 2017; Helmus & Babchishin, 2017). Commonly, calibration is evaluated in some form by comparing the expected rate of recidivism to the observed rate. One illustrative form of doing this is a calibration plot. In our calibration plot, we divided individuals into quintiles based on the estimated probability of recidivism, calculated the average estimated probability in that group, and plotted that against the de facto recidivism rate of the group. A well-calibrated model follows a diagonal line where the expected and the observed recidivism rates are equal. We examined the calibration of the test set predictions of all multiple predictor models.
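The numbers behind such a calibration plot can be computed as in the following Python sketch: individuals are sorted by estimated probability, divided into equally sized groups, and the mean estimated probability of each group is paired with its observed recidivism rate. The function name and the data are invented for the example; the demonstration uses two groups instead of quintiles to keep the example small.

```python
def calibration_table(probs, outcomes, n_groups=5):
    """Return one (mean estimated probability, observed rate) pair per
    group, with groups formed by ranking on the estimated probability.
    Assumes at least n_groups individuals."""
    ranked = sorted(zip(probs, outcomes))
    n = len(ranked)
    rows = []
    for g in range(n_groups):
        group = ranked[g * n // n_groups:(g + 1) * n // n_groups]
        mean_estimated = sum(p for p, _ in group) / len(group)
        observed_rate = sum(y for _, y in group) / len(group)
        rows.append((mean_estimated, observed_rate))
    return rows


# Invented example: eight individuals, two risk groups.
probs = [0.25, 0.25, 0.25, 0.25, 0.75, 0.75, 0.75, 0.75]
outcomes = [1, 0, 0, 0, 1, 1, 1, 0]

for mean_estimated, observed_rate in calibration_table(probs, outcomes, n_groups=2):
    print(mean_estimated, observed_rate)
```

Plotting the pairs against each other gives the calibration plot; points on the diagonal indicate groups whose estimated and observed rates agree, as they do in this constructed example.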
Statistical Software
We used the software environment R (Version 3.5.2; R Core Team, 2018) for analyses. These analyses, including the R packages that we used, can be found at https://github.com/bennysalo/predict-recidivism. The material in the repository includes comprehensive descriptive statistics and results on a per resample basis.
Results
Training Set Analyses
The overall performance (pseudo-R2) and discriminative power (AUC) of the predictive models are presented in Figure 1. The error bars show that there was considerable variance in the performance of a model depending on the random split between training and validation set. However, as all models were tested on the same resamples, we could compare models per resample and thereby control for variance in performance that is due to the variation in splits.
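Why comparing models per resample controls for split-related variance can be shown with a small simulation. The sketch below uses invented numbers: each resample contributes a shared split effect to both models, and the paired difference removes it. The percentile interval is an illustrative choice and not necessarily the procedure used in the analyses.

```python
import random
import statistics

def paired_summary(perf_a, perf_b):
    """Median and 95% percentile interval of per-resample differences (a - b)."""
    diffs = sorted(a - b for a, b in zip(perf_a, perf_b))
    n = len(diffs)
    lower = diffs[int(0.025 * (n - 1))]
    upper = diffs[int(0.975 * (n - 1))]
    return statistics.median(diffs), lower, upper

# Simulated performance in 1,000 resamples (all values are invented)
rng = random.Random(1)
split_effect = [rng.gauss(0, 0.03) for _ in range(1000)]            # shared by both models
model_a = [0.12 + s + rng.gauss(0, 0.005) for s in split_effect]
model_b = [0.10 + s + rng.gauss(0, 0.005) for s in split_effect]

median_diff, lower, upper = paired_summary(model_a, model_b)
```

In this simulation, the spread of each model's performance is dominated by the split effect, whereas the spread of the paired differences reflects only the model-specific noise.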

Figure 1. Error Bar Plots of Model Performance for Predictive Models in 1,000 Resamples of 299 Observations
Comparison of Prediction Methods
As expected, elastic net and random forest consistently performed better than logistic regression. Three of the logistic regression models for predicting violent recidivism even resulted in a negative median pseudo-R2. Importantly, the relative performance between sets of predictors was different when using logistic regression than when using the two other methods, which makes comparisons between predictor sets using logistic regression unsuitable. A likely explanation is that, as logistic regression has no mechanism for handling noninformative items, it easily overfits as the number of predictors grows. Logistic regression handled the set of 24 static predictors relatively well but performed worse with the 52 RITA items or when the sets were combined.
Elastic net models tended to result in the best overall predictive performance. This model type had the highest pseudo-R2 in six of the eight cases. However, the differences between elastic net and random forest were small and not statistically significant, with the median differences, in d equivalents, ranging from d = 0.11 in favor of elastic net to d = 0.03 in favor of random forest.
Regarding discrimination, the pattern was similar to that seen for overall performance, but not identical. Notably, the random forest model predicting violent recidivism from static predictors had narrowly better discrimination than the elastic net model (equivalent to d = 0.07) even though overall performance was better for the elastic net model. Logistic regression models also had acceptable discrimination even with inferior pseudo-R2 figures. Good AUC figures combined with low pseudo-R2 figures are an indication that the models rank individuals according to risk effectively but that the estimated probabilities do not reflect the observed rates of recidivism, thus suggesting poor calibration. Although the better performance of elastic net models over random forest models was not decisive, elastic net models had a lower variance in performance. To facilitate comparisons between predictor sets, we therefore chose the elastic net model for all such comparisons.
Incremental Power of Predictor Sets
Overall predictive performance for elastic net models and their incremental predictive performance are presented in Table 2. Models using static predictors performed better than models using RITA items. Combined predictor sets showed some incremental improvements over using static predictors alone, and adding term variables improved the predictions even further. Improvements in pseudo-R2 ranged from small to inconsequential in size. The incremental improvement of term variables over predictors available at the start of the sentence for predicting general recidivism was statistically significant even after Bonferroni correction for six tests. The 95% confidence interval for the difference overlapped zero in all other tests.
Table 2. Overall Predictive Performance in Training Set Per Outcome and Predictor Set
Note. AUC = area under the curve; LL = lower limit; UL = upper limit; RITA = Finnish Risk and Needs Assessment Form.
Incremental difference over the following model on the line below the line of the corresponding model. Cohen’s d calculated from AUC assuming equal group sizes using formula from Table 1 in Ruscio (2008). 95% CI = 95% confidence intervals, uncorrected for multiple tests. These are based on performance in 1,000 resamples of 299 observations and with the model trained on the remaining 897 observations in the training set. Models with multiple predictors are all elastic net models. Models with an RITA domain as a single predictor are logistic regression models.
Discrimination values in cross-validated samples are presented in the left columns of Table 3. Models using RITA items, static predictors, and combined predictor sets all showed acceptable levels of discrimination for both general and violent recidivism. The differences between the models were small or minimal. For general recidivism, the difference between models using static predictors and models using RITA items was equivalent to d = 0.18. The incremental improvement of combining RITA items and static predictors over static predictors alone was d = 0.06, and the additional incremental improvement of adding term variables was d = 0.18. For violent recidivism, the same improvements were d = 0.12, 0.13, and 0.07.
Table 3. Discrimination of the Elastic Net Models in Training Set Per Outcome and Predictor Set
Note. For the training set, these are based on performance in 1,000 resamples of 299 observations and with the model trained on the remaining 897 observations in the training set. For the test set, the CI is based on 2,000 bootstrap samples of the performance in the test set of 300 observations when the model is trained on the full training set of 1,196 observations. d = Cohen’s d calculated from AUC assuming equal group sizes using formula from Table 1 in Ruscio (2008). CI = confidence interval; AUC = area under the curve; LL = lower limit; UL = upper limit.
RITA and Its Domains
Overall predictive performances of the models using single RITA domains as a predictor are also presented in Table 2. RITA domains predicted both general and violent recidivism at levels equivalent to small- and medium-sized effects. Problems managing one’s economy stood out as a domain with relatively strong predictive power for general recidivism, and aggressiveness was the domain that predicted violent recidivism best. These two domains were statistically significant predictors in one-tailed tests with Bonferroni correction for six tests. The two domains stood out regarding discrimination as well (see Table 3).
Test Set Analyses
Discrimination values in the test set are presented in the right columns of Table 3. The 95% confidence intervals for discrimination in the independent test set overlap the estimated discrimination in the training set, adding credence to the results. The two confidence intervals are of similar breadth, which is unsurprising as the validation samples were of nearly the same size in both cases. However, it shows that the validation sets of 300 observations had limited statistical power and that performance in such validation sets depended considerably on random sampling.
Figure 2 shows the calibration, in the test set, of the predictive models that could be considered for applied use (i.e., excluding models using term variables), including all prediction methods. The relative performance regarding calibration corresponded roughly to the relative performance in discrimination. Elastic net and random forest models showed good calibration with observed recidivism rates closely corresponding to the average expected recidivism rate in the respective quintile of estimated probabilities. The differences between models were mainly with respect to the range of the estimated probabilities. The range of estimated probabilities for violent recidivism was markedly narrower with few estimated probabilities over 50%. This observation can partly be attributed to the lower base rate of violent recidivism. For both general and violent recidivism, the range of predictions was a little narrower when using only RITA items. In both cases, a more conservative range of predictions seemed to be warranted. There was a lower average estimation of recidivism risk for the top quintile, but this corresponded to the fact that the top quintile did include more individuals who did not reoffend in these cases.
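How a bootstrap confidence interval for the AUC in a test set can be obtained is sketched below. The pairwise definition of the AUC is standard, but the percentile method, the function names, and the simulated data are assumptions made for illustration; the example uses fewer bootstrap samples than the 2,000 reported in the note to Table 3 to keep the run time short.

```python
import random

def auc(y_true, scores):
    """Share of (reoffender, non-reoffender) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(y_true, scores, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the AUC (resampling individuals with replacement)."""
    rng = random.Random(seed)
    n = len(y_true)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lack one of the two outcomes
            estimates.append(auc(ys, [scores[i] for i in idx]))
    estimates.sort()
    alpha = (1 - level) / 2
    return estimates[int(alpha * (n_boot - 1))], estimates[int((1 - alpha) * (n_boot - 1))]

# Simulated test set (invented data): 120 individuals, half of whom reoffend
rng = random.Random(42)
outcomes = [i % 2 for i in range(120)]
risk_scores = [rng.gauss(1.0 if y else 0.0, 1.0) for y in outcomes]

point_estimate = auc(outcomes, risk_scores)
ci_lower, ci_upper = bootstrap_auc_ci(outcomes, risk_scores, n_boot=500)
```

With a validation sample of this size, the resulting interval is wide, which mirrors the limited statistical power noted above.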

Figure 2. Calibration Plot for Prediction Models
Discussion
Performance of Elastic Net and Random Forest
Elastic net and random forest both produced considerably better predictions than logistic regression. As a traditional logistic regression is susceptible to overfitting (Hastie et al., 2009), this was expected. When comparing the predictive power of predictor sets, the most significant drawback is that predictor sets are affected differently by the shortcomings of logistic regression. Differences in the predictive power of the models do not only depend on the information contained in the predictors, but also on the predictors’ suitability for multiple logistic regression. Using a large number of individual items makes logistic regression especially sensitive to multicollinearity, and our use of the method can be considered something of a straw man. There are other ways to use logistic regression that would be better. For example, we could have selected a subset of uncorrelated items or used a dimension reduction technique to create fewer predictors more suitable for use in logistic regression (Dormann et al., 2013; Hastie et al., 2009). However, doing so would have added an extra step in our analyses that might have introduced subjective decisions, loss of information, or both.
Random forest is expected to outperform the elastic net when the relationship between the predictors and the outcome involves interactions among predictors. In this case, elastic net logistic regression performed as well as or better than random forest. An interpretation of this is that multiple items make small univariate contributions and are best used by simply adding them together (which an elastic net without interaction terms does effectively). In larger training samples, subtle interactions might become more evident, favoring random forest.
Predictive Power of RITA Items and RITA Domains
RITA items showed a robust relationship with both general and violent recidivism. The discrimination is on par with instruments with high discrimination in their validation studies (see Table 1 in Cording et al., 2016) and above the average reported in meta-analyses (Campbell et al., 2009; Gendreau, Little, & Goggin, 1996; Katsiyannis, Whitford, Zhang, & Gage, 2018; Olver, Stockdale, & Wormith, 2014; Walters, 2003; Yang et al., 2010). This result is notable as the RITA is administered with considerable freedom of interpretation of the caseworker and has not gone through the process of thorough construct validation that many other instruments have. These results suggest that dynamic risk factors assessed in an applied setting can reach the level of discrimination previously seen in research settings.
The relatively high discrimination can be attributed to two factors working in tandem: customized weighting of items and thorough cross-validation. No predefined scoring scheme was used to sum up the item scores. Instead, the prediction methods were used to find the best solution, separately for violent and general recidivism. This approach means that the items are used in a way that best links the items as interpreted by the caseworkers to recidivism in this Finnish sample. The elastic net and random forest methods are both well suited for this task. However, the weighting should be customized to this particular setting but not to this particular sample. Cross-validation helps avoid this overfitting. The models were tuned to favor the parameters that maximized discrimination in subsamples that were not used to train the model. Using this criterion meant that when the models were tested on entirely new data (the test set), they performed at the level expected from the results in the training data.
RITA seems to draw on several domains for its predictive power. None can be concluded to lack meaning in the prediction of recidivism. Results provide strong evidence for two domains: problems managing one’s economy for general and aggressiveness for violent recidivism.
Incremental Predictive Power of RITA Items Over Static Predictors
The same factors that benefit the predictive power of RITA items also benefit the static predictors, and they performed at the same level as, or better than, RITA items. Adding RITA items to the static predictors gave very small improvements in discrimination and overall performance. One interpretation of this is that using static predictors alone comes close to the potential predictive power of the information available at the beginning of the sentence, and dynamic items, therefore, have little to add even if they also are reliable predictors. This interpretation fits with the idea of a “glass ceiling” suggested by Coid et al. (2011). The glass ceiling in the present study, however, was around AUC = .80 in comparison to the ceiling of .72 for prospective studies of violent recidivism that Coid et al. identified.
Incremental Predictive Power of Term Variables
Adding variables related to the unfolding of the prison term resulted in a small improvement in the prediction of general recidivism, but a negligible improvement for predicting violent recidivism. This lack of increase in discrimination does not mean that what happens during the sentence does not affect or inform the risk of violence. No change in discrimination means no change in the ranking according to estimated recidivism risk. Thus, better than a claim of no treatment effect is the interpretation that extra care is put in the supervision decisions regarding violent offenders and that high-risk individuals are treated in line with readily available information about risk. For example, even if placement in open prison would be a protective factor against recidivism, adding placement as a predictor in the prediction model does not increase the discriminative power if placement in open prison is primarily granted to individuals with correctly assessed low risk of recidivism. Let us be clear: the present study cannot be used to assess the efficacy of any specific intervention or prison treatment in general. How the prison term unfolds may have important consequences. The high discriminative power of static items merely implies that those who start with the highest risk remain the ones with the highest risk.
Generalizable Discrimination and Calibration of the Models
Validation of the chosen models in an independent test set shows that discrimination is strong enough for the models to be applied, especially any model that includes static predictors. However, effect sizes should be interpreted in context. The interpretation of, for example, AUC = .78 is that, if we randomly pick one reimprisoned individual and one individual not reimprisoned, the estimated risk will be higher for the reimprisoned individual in 78% of cases. This level of discrimination is helpful, but not close to perfect.
While AUC is a metric robust to differences in base rate (Ruscio, 2008), the prediction of violent recidivism is harder just because there is less information about violent reoffenders. In this study, the 278 individuals with new violent crimes provided enough information to reach a level of discrimination for violent recidivism on par with discrimination for general recidivism. Base rate differences had a more significant influence on calibration. Calibration was good for all the models, but the range of estimated probabilities was more limited for violent recidivism. The models were not able to identify any individual with a probability of committing a new violent crime much higher than 50%. This level of certainty may also correspond to the actual predictability of violent offenses. Violent crimes are rarer than general crimes, and the models avoided big errors by not assigning high probabilities of new violent crimes to any individual. Using only RITA items also offered a narrower range of estimated probabilities for both general and violent recidivism, making models that include the static predictors, alone or in combination with dynamic items, a little more useful.
The Role of RITA in Prediction
The static predictors in the present study performed a little better relative to the dynamic predictors than in meta-analyses on the prediction of general recidivism (Gendreau et al., 1996) and violent recidivism (Campbell et al., 2009). The results are, however, in line with previous conclusions that dynamic and static predictors are nearly interchangeable for the limited task of predicting reoffending (Campbell et al., 2009; Gendreau et al., 1996; Yang et al., 2010). In the introduction, we claimed that to justify the cost of assessing dynamic risk variables, they should preferably display incremental predictive validity over static predictors. As the easy gains in prediction have already been made, and considering the impact risk assessment has on the individual, one could argue that even small gains should be satisfactory. For example, adding RITA items to static predictors to predict violent recidivism results in an increase in AUC of .027. Leveraged in impactful decisions, that may potentially be considered meaningful. However, the results in the present study suggest that one should look for the incremental gains in other places than in the dynamic items that are found in RITA.
One place to look for incremental gains is in dynamic risk variables with stronger evidence of construct validity. There have been several calls for a stronger focus on developing risk assessment tools that would allow causal inferences about recidivism (Cording et al., 2016; Monahan & Skeem, 2016; Ward, 2016). For this, the measured constructs need to be truly dynamic, and that is an assumption that is not verified for the RITA items.
Another place to look is in changes in risk. We could not examine this as the RITA items were only administered at a single time point. The limited incremental predictive power of our term variables suggests that there is limited change in risk over the prison term. Repeated measures may show different results. That changes in risk can be measured and that they have incremental predictive power has been demonstrated using the Violence Risk Scale (Lewis et al., 2013; Olver, Beggs Christofferson, & Wong, 2015), Historical Clinical Risk Management-20, and Structured Assessment of Protective Factors for Violence Risk (De Vries Robbé et al., 2015).
The predictive power of RITA items suggests a potential for further development of RITA. At the same time, the relatively strong predictive power of the static predictors suggests that predictive power does not have to be a primary concern in this development. Development of the RITA can focus on establishing strong construct validity. Development in this direction would strengthen its usability for its current use of identifying treatment targets. Repeated measures open up for measurements of changes in risk that might provide stronger incremental predictive power. However, the initial assessment can rely primarily on static predictors. In other words, the development of RITA should focus on ensuring that the measures are fair, informative, and reliable.
Predictions using the current predictors can be made using a computer algorithm or converted into an analog tool. Customization of weights might be worthwhile, allowing different scoring rules when identifying treatment targets and when assessing risk. Because elastic net performed very well and produced coefficients that are interpretable in the same way as those in a traditional logistic regression, an analog tool can quickly be developed. This conversion might entail simplifying coefficients, deleting low information items, and selecting items to create a parsimonious tool with minimal loss in predictive power.
The Role of Dynamic Predictors in Risk Assessment in General
Prediction is not the be-all and end-all of risk assessments in correctional settings. It can help make decisions about security level and early release more accurate and fair, but correctional systems should have higher ambitions than that. For understanding the causes of recidivism and mitigating risk, assessment of static variables is a dead end. Here, only truly dynamic and valid risk factors can help us. This fact is an argument for making dynamic items a part of prediction tools. We claim, however, that the point we make specifically for RITA pertains to dynamic variables in general: prediction is not the domain where dynamic variables have most to offer. They should be developed to serve purposes that static predictors cannot serve. If gains in prediction follow from that work, it would be helpful, but not required.
Other Limitations and Further Research
It is worth reminding ourselves that predictions work better on a group level than on an individual level. The calibration plots show that predictions are accurate when averaged over the members of a particular risk group. However, the models can allocate an individual to the wrong risk category without severely damaging the performance on a group level.
This study examines only the static and dynamic variables available to us. Other static items, and especially other dynamic items, might have given added predictive power. Even more so, the items that aim to capture what happens during the sentence were limited to only four variables. A strength of this study, however, is in its applied setting. With the information already at hand, we could go beyond discriminative power and also investigate the effect of the practices of placement in open prison and conditional release on the risk of recidivism. By matching groups on their propensity for being placed in open prison or granted conditional release, we could examine whether the level of supervision changes the level of risk within a risk group.
Footnotes
Authors’ Note:
The authors would like to thank reviewers for helpful comments that have significantly helped improve this article. Analyses are documented, and detailed results are provided in the online repository given under Statistical Software. This work was supported by the Criminal Sanctions Agency in Finland. It was also supported by personal funding for the first author by the National Doctoral Program of Psychology in Finland, the Finnish Cultural Foundation, the Åbo Akademi University Foundation, Svensk-Österbottniska Samfundet, and Waldemar von Frenckells stiftelse and for the third author by the Academy of Finland Project 287800.
