Abstract
Most violence risk assessment tools have been validated predominantly in males. In this multicenter study, the Historical, Clinical, Risk Management–20 (HCR-20), Historical, Clinical, Risk Management–20 Version 3 (HCR-20V3), Female Additional Manual (FAM), Short-Term Assessment of Risk and Treatability (START), Structured Assessment of Protective Factors for violence risk (SAPROF), and Psychopathy Checklist–Revised (PCL-R) were coded on file information of 78 female forensic psychiatric patients discharged between 1993 and 2012 with a mean follow-up period of 11.8 years from one of four Dutch forensic psychiatric hospitals. Notable was the high rate of mortality (17.9%) and readmission to psychiatric settings (11.5%) after discharge. Official reconviction data could be retrieved from the Ministry of Justice and Security for 71 women. Twenty-four women (33.8%) were reconvicted after discharge, including 13 for violent offenses (18.3%). Overall, predictive validity was moderate for all types of recidivism, but low for violence. The START Vulnerability scores, HCR-20V3, and FAM showed the highest predictive accuracy for all recidivism. With respect to violent recidivism, only the START Vulnerability scores and the Clinical scale of the HCR-20V3 demonstrated significant predictive accuracy.
Females represent only a minority in the criminal justice system. However, there is widespread agreement that female offending is on the rise, especially violent offending and particularly among young women (Miller, Malone, & Dodge, 2010; Moretti, Catchpole, & Odgers, 2005). Hence, a disproportionate increase in the number of females entering the penitentiary system and forensic mental health care has been observed in many countries (for reviews, see Nicholls, Cruise, Greig, & Hinz, 2015; Odgers, Moretti, & Reppucci, 2005; Walmsley, 2015). In addition, it should be noted that the official prevalence rates of female offending might constitute an underestimation, as women usually commit less reported and less visible offenses, such as intimate partner violence (Desmarais, Reeves, Nicholls, Telford, & Fiebert, 2012) and child abuse (May-Chahal & Cawson, 2005). Considering the rising numbers of females entering forensic mental health care, it is important to know whether commonly applied tools and treatment methods are applicable to females, as the majority were developed for male forensic populations.
The use of violence risk assessment tools has become common practice in forensic mental health care. Structured violence risk assessment provides insight into risk and protective factors and offers guidelines for risk management and treatment and is thus of great importance for both society and the individual patient/offender. Prevention of (violent) offending by women is of utmost importance, also for the next generations. Children of violent or antisocial mothers have elevated risks of problems relating to mental and physical health, substance use, school, and offending behavior (Kim, Capaldi, Pears, Kerr, & Owen, 2009). In the past three decades, major progress has been made in the area of structured violence risk assessment, and multiple instruments have become available to guide mental health professionals in this process. However, the majority of risk assessment instruments are based on violence risk research conducted primarily in male samples. Moreover, research into the psychometric properties of these tools has been carried out mostly with men.
There is an ongoing debate on whether risk assessment procedures and tools are gender-neutral or gender-responsive (Salisbury, Boppre, & Kelly, 2016; Yesberg, Scanlan, Hanby, Serin, & Polaschek, 2015). Some scholars have taken the position that there is no reason to assume that male-based instruments do not apply to women because most risk factors are considered valid for both genders (see Rettinger & Andrews, 2010). Previous violent behavior, young age at first violent offense, and substance abuse have all been found to be valid risk factors for both females and males (Andrews et al., 2012). Other scholars have advocated the acknowledgment of unique risk factors and pathways to offending for females, for instance, relating to traumatic experiences and mental health issues (Blanchette & Brown, 2006; Brennan, Breitenbach, Dieterich, Salisbury, & van Voorhis, 2012; Davidson & Chesney-Lind, 2009; Salisbury et al., 2016). Some risk factors that have been found specifically in female populations are pregnancy at a young age (Messer, Maughan, Quinton, & Taylor, 2004), prostitution (Morgan & Patton, 2002), and self-harm (Völlm & Dolan, 2009), although the predictive accuracy of these factors needs to be demonstrated more firmly. Mental health professionals who work with women on a daily basis recognize gender differences in violent offending and have called for more gender-sensitive assessment of violence risk factors (Adams & Freeman, 2002; Odgers et al., 2005). Despite the ongoing debate and the need for more research, it appears that the majority of risk factors are valid for both genders, but some risk factors seem to have a stronger or different impact on females compared with males. 
A distinction can be made between factors to which women are exposed more often (e.g., sexual victimization) and factors to which women are more sensitive, that is, factors that have a stronger effect on later violent or criminal behavior on women than men (e.g., disruptions in social relationships; de Vogel & Nicholls, 2016).
Furthermore, it has been suggested that girls and women respond differently to protective factors compared with boys and men. For example, close family ties, positive social relationships, sound finances, and being religious appeared to have a stronger protective effect on females than males (Hart, O’Toole, Price-Sharps, & Shaffer, 2007; Rodermond, Kruttschnitt, Slotboom, & Bijleveld, 2016). Another factor that is considered as having potentially strong protective influences in females is self-efficacy (van Voorhis, Wright, Salisbury, & Bauman, 2010).
In summary, although female and male offenders share many similarities, several important gender differences in risk and protective factors for (violent) offending have been found. It can therefore be questioned whether commonly used risk assessment tools are sufficiently valid and useful in female populations.
The Value of Risk Assessment Tools in Females
Research has revealed that unstructured clinical judgment of violence risk is sensitive to sex-based biases. Mental health professionals of both genders tend to underestimate the risk of violence in female psychiatric patients (Skeem et al., 2005), and the use of structured tools is recommended. Most commonly used structured risk assessment tools can be divided into actuarial tools and tools according to the Structured Professional Judgment (SPJ) method. An important distinction between actuarial and SPJ tools is how the final judgment is established. In actuarial tools, the risk conclusion is obtained via an algorithm, whereas in SPJ tools information should be integrated, combined, and weighed by the assessor to come to a structured, individualized final judgment. An example of a widely used actuarial tool for general offending is the Level of Service Inventory (LSI, or revised versions; see Andrews & Bonta, 2000). A large body of research has demonstrated good predictive accuracy for the LSI in both male and female populations (Olver, Stockdale, & Wormith, 2014; though for a critical discussion about the use of the LSI in females, see Salisbury et al., 2016). One of the most widely used SPJ risk assessment instruments for violence in forensic psychiatry is the Historical, Clinical, Risk Management–20 (HCR-20; Webster, Douglas, Eaves, & Hart, 1997). Several studies found lower predictive validity for females than males (see the reviews of Garcia-Mansilla, Rosenfeld, & Nicholls, 2009; McKeown, 2010). Yet, a meta-analysis showed that the HCR-20 had the best predictive efficacy among samples containing higher proportions of women, patients with schizophrenia, and Caucasians (O’Shea, Mitchell, Picchioni, & Dickens, 2013). The Short-Term Assessment of Risk and Treatability (START; Webster, Martin, Brink, Nicholls, & Desmarais, 2009) is another commonly used SPJ tool, which is fully dynamic and assesses different forms of risk. 
Not many studies have yet been conducted on the value of the START in female populations, but the results so far have shown good predictive accuracy in women (O’Shea & Dickens, 2015; Viljoen, Nicholls, Greaves, de Ruiter, & Brink, 2011). Predictive accuracy of SPJ tools assessing protective factors was found to be lower in females compared with males (Viljoen et al., 2016). In the Structured Assessment of Protective Factors for violence risk (SAPROF; de Vogel, de Ruiter, Bouman, & de Vries Robbé, 2012), the most accurate predictors for abstention from violence differed. For men, the protective factors self-control, work, and attitudes toward authority were the strongest predictors for not committing violent incidents during treatment, whereas for women, leisure activities, coping, and intelligence were the strongest (de Vries Robbé, de Vogel, Wever, Douglas, & Nijman, 2016).
Psychopathy is considered to be a strong predictor of (violent) recidivism and is therefore often incorporated in risk assessment tools, such as the HCR-20 (Leistico, Salekin, DeCoster, & Rogers, 2008). The Psychopathy Checklist–Revised (PCL-R; Hare, 2003) is widely used for the assessment of psychopathy, and although it is not a risk assessment instrument, PCL-R total scores have proven to be predictive of violence (Leistico et al., 2008). Most studies into the value of the PCL-R have been conducted in male samples. Studies in female samples have yielded mixed results, with some finding results similar to those for males, whereas others found lower predictive accuracy in females compared with males (for discussions, see Logan & Weizmann-Henelius, 2012; Nicholls, Ogloff, Brink, & Spidel, 2005). Overall, the PCL-R is assumed to have relevance in violence risk assessment in female offenders (Nicholls et al., 2005), yet concerns have been expressed about whether the PCL-R satisfactorily captures the construct of psychopathy in women (Forouzan & Cooke, 2005).
Garcia-Mansilla et al. (2009) reviewed the literature on different methods of violence risk assessment in a range of female populations. The conclusion of this review was that although structured methods of risk assessment are more accurate in predicting violent behavior than unstructured ones, the research supporting the applicability of violence risk assessment tools in female populations remains equivocal. McKeown (2010) also conducted a literature review on violence risk assessment and psychopathy in women and concluded that more research is needed with a particular focus on additional risk factors for women. More recently, a systematic review of nine risk assessment tools used in 15 studies including adult female offenders was carried out by Geraghty and Woodhams (2015). The findings from this review indicate that none of the measures demonstrated strong predictive validity in female populations. The LSI was found to be the most effective tool for assessing both violent and general recidivism in women.
Gender-Sensitive Risk Assessment Tools
To date, only a few risk assessment instruments are available that take gender into account. Some actuarial risk assessment tools have been developed or adapted for use with female offenders. For example, the Women’s Risk Needs Assessments (WRNA; van Voorhis et al., 2010) are actuarial tools that assess both gender-neutral (e.g., substance abuse, criminal history, financial problems) and gender-responsive (e.g., sexual trauma, mental health issues, relationship conflict) factors in female prisoners. Research found support for the predictive accuracy of the WRNA in female prison populations in the United States (van Voorhis, Bauman, & Brushett, 2013). To our knowledge, these tools have not yet been validated for forensic psychiatric patients. Furthermore, two SPJ tools specifically for females are available. The Early Assessment Risk List for Girls (EARL-21G; Levene et al., 2001) is an SPJ instrument for antisocial behavior in girls aged between 6 and 12 years. The Female Additional Manual (FAM; de Vogel, de Vries Robbé, van Kalmthout, & Place, 2014) is a gender-sensitive risk assessment guideline for adult female (forensic) psychiatric patients that has been developed as an addition to the HCR-20, or its revision, the Historical, Clinical, Risk Management–20 Version 3 (HCR-20V3; Douglas, Hart, Webster, & Belfrage, 2013). The FAM contains eight additional risk factors (e.g., prostitution, low self-esteem) and three additional final risk judgments (self-destructive behavior, victimization, and nonviolent criminal behavior). There is still relatively little empirical evidence for these gender-sensitive SPJ tools. It should be noted that research on predicting violent behavior in female offenders, specifically forensic psychiatric patients, is challenging because of small sample sizes and relatively high rates of chronic psychiatric admission. 
Consequently, such studies require collaboration between settings and a longer research period (see also Burman, Batchelor, & Brown, 2001).
The Current Study
To improve current risk assessment and risk management practices, our knowledge and understanding of female offenders and their risk of recidivism need to be expanded. The current study aims to evaluate the predictive accuracy of multiple tools for recidivism, most importantly violent recidivism, in a female forensic population. The predictive validity of five SPJ risk assessment instruments (the HCR-20, HCR-20V3, FAM, START, and SAPROF) as well as the PCL-R for violent and general recidivism was examined in a sample of 78 women discharged from four Dutch forensic psychiatric settings. In addition, we examined discriminant validity by comparing risk scores between recidivists and nonrecidivists, and convergent validity by analyzing Pearson correlation coefficients between the different instruments and their subscales. This study was mainly explorative, as the literature is still inconclusive on risk assessment in female forensic psychiatric patients. As the FAM has been developed as an addition to the HCR-20/HCR-20V3 specifically for female offenders, it is hypothesized that the FAM will show higher predictive accuracy (area under the curve [AUC] values) for violence than these two tools.
Method
Participants
The majority of the 78 women included in the present study were of Dutch descent (n = 68, 87.2%). The mean age at the time of admission to the forensic psychiatric hospitals was 33.7 years (SD = 9.6, range = 20-65) and at discharge, 38.6 years (SD = 9.7, range = 22-69). Most of the women had been involuntarily admitted by a court order (n = 71, 91.0%), known as a TBS-order (terbeschikkingstelling: translated as “detained under a treatment order”). The TBS-order is imposed by a court on offenders who have committed a serious violent offense, are considered to be at high risk of re-offending, and have diminished responsibility for the offense because of severe psychopathology. Offenses for which the women were admitted to forensic psychiatry were homicide offenses (n = 42, 53.8%), arson (n = 18, 23.1%), violent offenses (n = 11, 14.1%), property offenses (n = 5, 6.4%), and sexual offenses (n = 2, 2.6%). The mean treatment duration of the women was 62.4 months (SD = 41.9, range = 1-208). The majority had been physically, sexually, and/or emotionally abused in their childhood (n = 57, 73.1%) and/or adult life (n = 50, 64.1%). Furthermore, substance abuse (n = 51, 65.4%), borderline personality disorder (BPD; n = 40, 51.3%), and traits of BPD (n = 19, 24.4%) were highly prevalent in this sample (for more, see de Vogel, Stam, Bouman, Ter Horst, & Lancel, 2016).
Measures
HCR-20/HCR-20V3
The HCR-20 contains 10 risk factors from the past (Historical scale), five from the present (Clinical scale), and five relating to the future (Risk Management scale). After coding the presence of the 20 items on a 3-point scale (0, 1, 2), the risk factors should not simply be summed, but interpreted, integrated, and weighed by the assessor to make a final risk judgment (Low, Moderate, High; in the present study coded on a 5-point scale: 1 = low; 2 = low to moderate; 3 = moderate; 4 = moderate to high; 5 = high). Most studies with the HCR-20 found acceptable interrater reliability and validity (Douglas et al., 2017). However, the majority of these studies were conducted in male populations, and research in female populations revealed mixed results (McKeown, 2010). The HCR-20V3 is the revised version of the HCR-20 and still contains 20 items divided into the Historical, Clinical, and Risk Management scales, but the content of items has changed and several items include subitems. The items should be rated on a three-level response format (No, Partially, Yes; in the present study converted to numerical scores 0, 1, 2). To date, little is known about the predictive validity of the HCR-20V3 for women. A small-scale study with insanity acquittees showed that the relationship between scale scores and violence was stronger among men than women, although gender was not a significant moderator in logistic regression analyses predicting likelihood of violence (Green et al., 2016).
FAM
The FAM was developed as an additional guideline to the HCR-20 for assessing risk of violence among women. In 2014, it was adapted for use with the HCR-20V3. The FAM contains additional guidelines to two historical items of the HCR-20V3 (personality disorder and traumatic experiences) and eight new items with specific relevance to women (prostitution, parenting difficulties, pregnancy at young age, suicide attempt/self-harm, covert/manipulative behavior, low self-esteem, problematic child care responsibility, and problematic future intimate relationship). Furthermore, three additional final risk judgments can be coded with the FAM (the risk for self-destructive behavior, victimization, and nonviolent criminal behavior). With respect to the psychometric properties of the FAM, only preliminary results from small samples are available. They show good interrater reliability and moderate to good predictive validity for inpatient violence, threatening behavior, and self-harm during treatment (Campbell & Beech, 2018; de Vogel et al., 2014; Greig, 2014; Griswold et al., 2016).
START
The START is an SPJ tool for short-term assessment of violence risk and treatability. It consists of 20 dynamic factors that can be coded both as vulnerability and as strength on a 3-point scale (0, 1, 2). In the START, multiple risks can be assessed next to the risk of violence to others: risk of suicide, self-harm, self-neglect, unauthorized absence, substance use, and risk of being victimized. O’Shea and Dickens (2015) examined the predictive validity of the START in a sample of secure psychiatric patients using START codings by multidisciplinary teams of mental health professionals. They found the START to be a stronger predictor of aggression and self-harm in women than men. In a Canadian study, significant predictive accuracy was found for both the Vulnerability and Strength scores of the START in a group of 48 female forensic patients (Viljoen et al., 2011).
SAPROF/Protective Factors
The SAPROF is an instrument to assess protective factors for violence risk in adults that should always be used in addition to risk-focused risk assessment tools, such as the HCR-20/HCR-20V3. The tool includes three scales: Internal (five items), Motivational (seven items), and External (five items). Research so far has demonstrated good psychometric properties for the SAPROF (see de Vries Robbé et al., 2016), although mainly examined in male patients. Results for female patients were less strong. Viljoen et al. (2016) examined protective factors measured with different tools (SAPROF, START) in civil psychiatric patients and concluded that SPJ tools utilizing both risk and protective factors performed better. Gender was found to be a moderator in predicting severe aggression when using risk assessment tools. Both the SAPROF and the START Strength scores demonstrated superior predictive validity for men compared with women.
PCL-R
The PCL-R was developed to assess psychopathy and comprises 20 factors coded on a 3-point scale (0, 1, 2) that are divided into four facets: Interpersonal (e.g., conning, manipulative, glibness), Affective (e.g., shallow affect, callous), Impulsive (e.g., lack of goals, irresponsibility), and Antisocial (e.g., criminal versatility, poor behavioral control). Several studies have been conducted on predictive accuracy in female samples (e.g., Richards, Casey, & Lucente, 2003), and the results have been mixed (for discussions, see Logan & Weizmann-Henelius, 2012; Nicholls et al., 2005). In a previous study within the current research project, the predictive validity of the PCL-R for physical violence during treatment was found to be good for men and moderate for women. When verbal violence was included in the definition of violence during treatment, the predictive validity of the PCL-R was good for both genders (de Vogel & Lancel, 2016).
Procedure
In 2012, a multicenter research project was started on gender differences in forensic psychiatry (for more, see de Vogel et al., 2016). The historical items of the HCR-20, HCR-20V3, and FAM were coded on file information for all women who had been admitted to one of four different forensic psychiatric facilities in the Netherlands between 1984 and 2013 (N = 297). This sample represents nearly all female forensic psychiatric patients in the Netherlands in that period. In addition, an extensive list of criminal, demographic, psychiatric, and treatment characteristics was coded. Overall, file information was extensive and contained abundant collateral information (e.g., police reports, psychological reports, treatment plans, and evaluations). Each researcher rated the quality of the file information on a scale from 0 (insufficient) to 100 (excellent), based on the availability of reliable information about the entire lifespan and from multiple sources. When two or more of the historical items of the risk assessment instruments could not be coded because of lack of information, the quality of the file was coded as insufficient (i.e., below 50). All files with a score below 50 (n = 17) were excluded from the analysis, resulting in a sample of 280 women.
All female patients who had been discharged with a follow-up period of at least 3 years in society were included in the present study (N = 78, 27.9%, discharged between 1993 and 2012). The quality of these 78 files was generally judged as good, with a mean score of 75.6 (SD = 12.3, range = 50-100). The remaining cases (n = 202, 72.1%) were still in forensic or general psychiatric care or were only recently discharged. We chose the period of 3 years because it usually takes quite some time before an offender is officially convicted and before these convictions can be retrieved from the official judicial documentation register. For these 78 cases, we also coded the dynamic items of the HCR-20, HCR-20V3, and FAM, including the final risk judgments, as well as the SAPROF and START based on the most recent information available in the files. The six tools were chosen because these are the standardly used and validated instruments in the participating settings. From 2005, the HCR-20, including the PCL-R score, was one of two instruments mandated in the Netherlands for use with TBS patients; in 2013, it was replaced by the HCR-20V3. The majority of the PCL-R ratings were retrieved from the hospital files, meaning that they were usually coded based on interviews plus file information by diagnosticians in the participating hospitals (see de Vogel & Lancel, 2016). The other tools were coded by a group of 10 trained and experienced researchers (psychologists and criminologists) who were all blind to outcome data. When four or more items could not be coded, the quality was considered insufficient. Two cases had more than four missing items on the START and were left out of the predictive validity analyses for the START. For the other cases, the maximum number of missing items across the SPJ tools was three. The missing items were prorated by substituting the total group's mean score on that item.
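The handling of missing items described above can be illustrated with a minimal sketch. It assumes a hypothetical data layout (one list of item scores per case, with None marking an item that could not be coded) and hypothetical function names; the study's own data processing is not documented at this level of detail.

```python
from statistics import mean

MAX_MISSING = 3  # four or more missing items is treated as insufficient


def has_sufficient_data(case, max_missing=MAX_MISSING):
    """Return True when the case has at most `max_missing` uncoded items."""
    return sum(score is None for score in case) <= max_missing


def prorate_missing(cases):
    """Replace each missing item (None) with the group mean of that item."""
    n_items = len(cases[0])
    item_means = []
    for i in range(n_items):
        observed = [case[i] for case in cases if case[i] is not None]
        item_means.append(mean(observed))
    return [
        [item_means[i] if score is None else score for i, score in enumerate(case)]
        for case in cases
    ]


# Hypothetical example: three cases rated on three items (0, 1, 2).
example_cases = [[2, 1, 0], [1, None, 2], [0, 2, None]]
kept = [case for case in example_cases if has_sufficient_data(case)]
filled = prorate_missing(kept)
totals = [sum(case) for case in filled]
```

In this toy example the second case receives 1.5 for its missing item (the mean of the two observed scores 1 and 2), so its total becomes 4.5.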
Interrater reliability of the HCR-20/FAM Historical items was previously established for 25 of the 275 women in this research project and found to be good (intraclass correlation [ICC] = .93, p < .001, see de Vogel et al., 2016). For the other tools, it was not possible to test interrater reliability due to practical reasons, but previous studies in Dutch forensic psychiatric samples have found good interrater reliability for the PCL-R (Hildebrand, de Ruiter, de Vogel, & van der Wolf, 2002), SAPROF (de Vries Robbé et al., 2016), HCR-20 (de Vogel, 2005), and HCR-20V3 (de Vogel, van den Broek, & de Vries Robbé, 2014).
Official reconviction data were retrieved in December 2016 from the Judicial Documentation register of the Ministry of Justice and Security. For seven women, recidivism data could not be obtained, and these cases were excluded from the predictive validity analyses. Of these seven women, four were deceased and three could not be found in the system, possibly because they initially had conditional convictions for minor index offenses. In total, 14 of the 78 women (17.9%) were deceased after discharge. As 10 of them had had a follow-up period of at least 3 years in society, these cases were not excluded from predictive analyses. The total sample for the analyses of predictive validity for any and violent recidivism consisted of 71 women. The mean follow-up period of these women was 11.8 years (SD = 4.9, range = 5.8-23.0). Predictive validity was analyzed both for a preset follow-up period of 3 years and for the total mean follow-up period of 11.8 years.
Data Analysis
Statistical analyses were carried out using SPSS Version 25. Receiver operating characteristic (ROC) analyses were conducted to assess predictive validity of the six tools for violent or general recidivism after discharge. ROC analyses result in AUC values. ROC analyses are reasonably unaffected by base rates, and although there are limitations, most research into risk assessment applies ROC analyses, which facilitates comparison (Flores, Holsinger, Lowenkamp, & Cohen, 2017; Mossman, 2013). Rice and Harris (2005) provided guidelines for interpreting AUC values and facilitated comparison across studies by relating them to other effect sizes. AUC values between .56 and .64 can be compared with Cohen’s d of .20 and are interpreted as low effect; AUC values between .64 and .71 with Cohen’s d of .50, medium effect; and AUC values of .71 and above with Cohen’s d of .80, high effect.
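To make the AUC and its interpretation concrete, the following is a minimal sketch rather than the procedure actually run in SPSS. It computes the AUC as the proportion of recidivist/nonrecidivist pairs in which the recidivist has the higher score (ties count half), and labels the result using the bands quoted above. The function names, the example scores, and the choice to assign a value of exactly .64 to the medium band are assumptions for illustration. The conversion from Cohen's d assumes normally distributed scores with equal variance in both groups, under which d values of .20, .50, and .80 correspond to AUC values of about .56, .64, and .71.

```python
from math import erf


def auc(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def d_to_auc(d):
    """AUC implied by Cohen's d under equal-variance normality: Phi(d / sqrt(2))."""
    return 0.5 * (1.0 + erf(d / 2.0))


def effect_label(auc_value):
    """Label an AUC using the bands quoted in the text (boundary handling assumed)."""
    if auc_value >= 0.71:
        return "high"
    if auc_value >= 0.64:
        return "medium"
    if auc_value >= 0.56:
        return "low"
    return "below low"


# Hypothetical total scores, not data from the study.
recidivists = [30, 25, 22]
nonrecidivists = [20, 25, 18, 15]
example_auc = auc(recidivists, nonrecidivists)  # 10.5 of 12 pairs
example_label = effect_label(example_auc)
```

With these made-up scores, 10.5 of the 12 pairs are ordered correctly, giving an AUC of .875, which falls in the high band.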
Results
Outcome
Nine of the 78 women (11.5%) were readmitted to civil psychiatry after having lived in society for a period of time. Fourteen women (17.9%) were deceased after discharge at a mean age of 44.6 years (SD = 9.2, range = 29-59). We were not able to retrieve the causes of death for all of these women, but at least three women died by suicide. Twenty-four of the 71 women whose recidivism data could be retrieved (33.8%) were officially reconvicted for one or more offenses. The mean number of reconvictions for these 24 women was 5.5 (SD = 4.7, range = 1-18). Thirteen women (18.3%) were reconvicted for one or more violent offenses, including attempted homicide (n = 2), arson (n = 5), threat (n = 6), assault (n = 6), and stalking (n = 1). Eleven women (15.5%) were reconvicted for nonviolent offenses only, such as property offenses and fraud. Nine women (12.7%) were reconvicted for both violent and nonviolent offenses.
Discriminant Validity
To examine discriminant validity, we compared the mean scores on the instruments for the recidivists and nonrecidivists for all recidivism and for violent recidivism. With respect to all recidivism, Table 1 shows that the mean scores on the FAM, HCR-20V3 (and their combination), and START differed significantly between recidivists and nonrecidivists. With respect to violent recidivism, Table 1 shows that only the mean START Vulnerability scores differed significantly between women who recidivated with a violent offense and women who did not.
Mean Total Scores (SD) for Those With and Without Recidivism
Note. The follow-up period was 11.8 years. FAM = Female Additional Manual; HCR = Historical Clinical Risk; PCL-R = Psychopathy Checklist–Revised; SAPROF = Structured Assessment of Protective Factors; START = Short-Term Assessment of Risk and Treatability.
Convergent Validity
Table 2 presents the Pearson correlation coefficients between the different instruments, including the subscales and facets. All instruments’ total scores correlated significantly with each other, suggesting that they measure a common construct of risk and have shared variance. In addition, almost all subscale and facet scores correlated significantly with each other and with the total scores. Exceptions were the correlation between PCL-R Facet 2 and the HCR-20V3 Risk Management scale, and most of the correlations involving PCL-R Facet 1 Interpersonal and the SAPROF External scale.
Pearson Correlation Coefficients for Total Scores and Subscale Scores of the Six Instruments
Note. FAM = Female Additional Manual; HCR = Historical Clinical Risk; H-scale = Historical scale; C-scale = Clinical scale; R-scale = Risk Management scale; PCL-R = Psychopathy Checklist–Revised; F1 = Interpersonal; F2 = Affective; F3 = Impulsive; F4 = Antisocial; SAPROF = Structured Assessment of Protective Factors; START = Short-Term Assessment of Risk and Treatability; V = Vulnerability; S = Strength.
*p < .05. **p < .01.
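As an illustration of the convergent validity analysis, the sketch below computes a Pearson correlation coefficient from two lists of total scores. The function name and the example scores are hypothetical and are not taken from the study; significance testing is omitted.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(ss_x * ss_y)


# Hypothetical totals: a risk scale and a protective scale for four cases.
risk_total = [10, 20, 30, 40]
protective_total = [20, 16, 12, 8]
r_risk_protective = pearson_r(risk_total, protective_total)
```

A risk scale and a protective scale would be expected to correlate negatively; in this constructed example the relationship is perfectly linear, so the coefficient is -1.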
Predictive Validity Recidivism
Table 3 shows that the total scores of the FAM, HCR-20V3, PCL-R, and START Vulnerability demonstrated significant predictive accuracy for all recidivism (including violence) for the preset follow-up period of 3 years. In addition, the combination FAM and HCR-20V3, the HCR-20 Clinical scale, the HCR-20V3 Historical and Clinical scales, the PCL-R Facet 1 Interpersonal, and the FAM final risk judgment nonviolent criminal behavior were significant predictors for all recidivism. With respect to the longer follow-up period of 11.8 years, the FAM, the HCR-20V3, the combination FAM and HCR-20V3, the HCR-20 and HCR-20V3 Clinical scales, the FAM final risk judgment nonviolent criminal behavior, and START Vulnerability remained significant predictors. Predictive validity was also analyzed specifically for nonviolent recidivism (n = 11) for the FAM final risk judgment nonviolent criminal behavior: AUC = .86, SE = .061, 95% confidence interval [CI] = [.743, .981], p = .001 for the 3-year follow-up period and AUC = .64, SE = .109, 95% CI = [.420, .849], p = .157 for the 11.8-year follow-up period.
Predictive Validity for All Recidivism After Discharge (N = 71)
Note. The AUC values of the SAPROF and START Strength are reversed as these contain protective factors that aim to predict nonoffending. AUC = area under the curve; CI = confidence interval; FAM = Female Additional Manual; HCR = Historical Clinical Risk; FRJ = Final Risk Judgment; PCL-R = Psychopathy Checklist–Revised; SAPROF = Structured Assessment of Protective factors; START = Short-Term Assessment of Risk and Treatability.
Two cases were left out of the analyses for the START Vulnerability and Strength because they had too many missing items (n = 69).
*p < .05. **p < .01.
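The confidence intervals reported for the AUC values are consistent with a normal approximation around the estimate, and the reversal mentioned in the table note can be read as taking the complement of the AUC. The sketch below shows both steps under those assumptions; the function names are hypothetical, and the exact method used by the statistical software is not stated in the text.

```python
def auc_confidence_interval(auc_value, se, z=1.96):
    """Normal-approximation interval AUC +/- z * SE, clipped to [0, 1]."""
    lower = max(0.0, auc_value - z * se)
    upper = min(1.0, auc_value + z * se)
    return lower, upper


def reverse_auc(auc_value):
    """Complement of an AUC, so a protective scale is scored for predicting nonoffending."""
    return 1.0 - auc_value


# Rounded values reported for the FAM final risk judgment at 3 years.
lower, upper = auc_confidence_interval(0.86, 0.061)
```

With the rounded inputs AUC = .86 and SE = .061, this gives roughly [.740, .980], close to the reported [.743, .981]; the small difference is what one would expect from rounding of the AUC itself.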
Table 4 shows that none of the instruments were significant predictors for violent recidivism for the preset follow-up period of 3 years. The START Vulnerability and the HCR-20V3 Clinical scale demonstrated significant predictive validity for violent recidivism for the follow-up period of 11.8 years.
Predictive Validity for Violent Recidivism After Discharge (N = 71)
Note. The AUC values of the SAPROF and START Strength are reversed as these contain protective factors that aim to predict nonoffending. AUC = area under the curve; CI = confidence interval; FAM = Female Additional Manual; HCR = Historical Clinical Risk; FRJ = Final Risk Judgment; PCL-R = Psychopathy Checklist–Revised; SAPROF = Structured Assessment of Protective Factors; START = Short-Term Assessment of Risk and Treatability.
Two cases were left out of the analyses for the START Vulnerability and Strength because they had too many missing items (n = 69).
*p < .05.
Predictive Validity Mortality
Considering the unexpectedly high mortality rate at a relatively young mean age, we decided to conduct post hoc analyses to explore predictive validity of the six tools for mortality for the total follow-up period (mean = 10.9 years, N = 78). It was not possible to retrieve data on the causes of death, but early death may be related to self-destructive or suicidal behavior and psychopathology, especially BPD and substance use disorders (see Chesney, Goodwin, & Fazel, 2014). Although the instruments were not developed to assess risk of self-destructive or suicidal behavior, there is overlap in risk factors for violence to others and to the self (see de Vogel et al., 2014; Völlm & Dolan, 2009; Webster et al., 2009). The post hoc analyses showed that none of the instruments and subscales were significant predictors of mortality, with one exception, that is, PCL-R Facet 1 Interpersonal (see Table 5). This facet was a significant negative predictor of mortality, or in other words, a protective factor.
Predictive Validity for Mortality After Discharge (N = 78)
Note. AUC = area under the curve; CI = confidence interval; FAM = Female Additional Manual; HCR = Historical Clinical Risk; FRJ = Final Risk Judgment; PCL-R = Psychopathy Checklist–Revised; SAPROF = Structured Assessment of Protective Factors; START = Short-Term Assessment of Risk and Treatability.
Two cases were left out of the analyses for the START Vulnerability and Strength because they had too many missing items (n = 76).
*p < .05.
Discussion
The present multicenter study is one of the first studies comparing the predictive validity of different SPJ risk assessment tools and the PCL-R for recidivism in a sample of discharged female forensic psychiatric patients. Overall, it can be concluded that the predictive validity for all types of recidivism was moderate, but low for violence. This is in accordance with previous studies (Coid et al., 2009), the conclusion of the systematic review of Geraghty and Woodhams (2015), and the reviews of McKeown (2010) and Garcia-Mansilla et al. (2009). In our female population, dynamic risk factors were the strongest predictors. The START Vulnerability, HCR-20V3, especially the Clinical scale, and FAM total score and final risk judgment nonviolent criminal behavior demonstrated the strongest predictive accuracy for all recidivism for both medium- and long-term follow-up. For violent recidivism, only the START Vulnerability scores and the HCR-20V3 Clinical scale were significant predictors. The most important results and implications of this study will be discussed below.
First, notable in this sample was the high rate of mortality and of readmission to psychiatry after having lived in society for a period of time. Of the 280 women included in the multicenter research project, only 78 were discharged into society with a follow-up period of at least 3 years. Of those 78, 14 had died and nine were readmitted to civil psychiatry, leaving a total of 55 discharged women currently living in society. Moreover, recidivism data could not be retrieved for seven of these women. These findings indicate that female forensic psychiatric patients form a difficult sample to follow up, which is in line with previous studies conducted in female forensic samples (Davies, Clarke, Hollin, & Duggan, 2007; Sahota et al., 2010). Still, in our study, four of the 14 (28.6%) women who had died within the follow-up period had committed a new offense before their death. These 14 women had died at a relatively young mean age of 44.6 years. We were not able to retrieve the causes of death for all of these women, but at least three women died by suicide. This is not surprising considering that BPD and substance use disorders, which were highly prevalent in the present sample, have been found to be disorders with the highest suicide risk (Chesney et al., 2014). Possible explanations for early deaths besides suicide are the long-term physical consequences of severe self-harm and substance abuse (see, for example, Binswanger et al., 2007) and adverse childhood experiences, which have been demonstrated to increase the risk of premature mortality (e.g., Brown et al., 2009). In the post hoc analyses, we found that PCL-R Facet 1 Interpersonal was a significant protective factor for mortality. An explanation could be that interpersonal psychopathic features such as conning and manipulative behavior, glibness, and a grandiose sense of self-worth make these women less vulnerable to early death through severe self-destructive behavior, including suicide.
It should be emphasized that these results should be interpreted very carefully, as we did not have data on the causes of death. Not much research has been published on the relation between psychopathy and mortality, but a recent study in a Finnish male sample found a significant positive relationship between high PCL-R scores and premature death (Vaurio, Repo-Tiihonen, Kautiainen, & Tiihonen, 2018). The causes of death in the group with high PCL-R scores were mainly of a violent nature or related to a dangerous, antisocial lifestyle, for example, homicides or accidents. The authors did not report results for the four facets. Our results contrast with these findings in men. An explanation may be the different manifestation of psychopathy in women, which is more subtle, with less physical violence and a less antisocial, dangerous lifestyle compared with men. Our results are in line, though, with findings from a previous Dutch study, in which we found lower rates of self-destructive behavior during treatment for women scoring high on psychopathy than for women scoring low on psychopathy (de Vogel & Lancel, 2016). That study also found a lower incidence of physical violence during treatment for women scoring high on psychopathy than for women scoring low and men scoring high on psychopathy, which may support the suggestion above that psychopathy in females is manifested in a more subtle and less physically violent way.
Second, the recidivism rates of 34% for all offenses and 18% for violent offenses found in the present study are comparable to international findings in female prisoners (Coid et al., 2009; Rettinger & Andrews, 2010). It is more difficult to compare the recidivism rates with those of other forensic psychiatric samples, as not much has been published about recidivism in female forensic psychiatric samples. The scores of the recidivists and nonrecidivists differed significantly only for START Vulnerability, HCR-20V3, and FAM.
Third, predictive validity was not very strong for most tools, with moderate accuracy for all recidivism, but low accuracy for violence. The observed AUC values for violent recidivism are lower than those found internationally in studies with male samples (see, for example, Otto & Douglas, 2010). In previous studies in the Dutch forensic psychiatric population, predictive accuracy for male patients was considerably better for the PCL-R and HCR-20 (de Vogel & de Ruiter, 2005), the SAPROF (de Vries Robbé et al., 2016), and the HCR-20V3 (de Vogel et al., 2014) compared with the AUC values found for females in the current study. The START Vulnerability score was a good predictor for all recidivism, especially for the 3-year follow-up period, but was also significant for long-term general and violent recidivism. This is notable because the START is fully dynamic and was developed to assess short-term risks. Yet, these results are in line with previous studies that found good predictive accuracy of the START in female populations (O’Shea & Dickens, 2015; Viljoen et al., 2011). Strikingly, none of the other instruments showed significant predictive accuracy for violent recidivism; only the Clinical scale of the HCR-20V3, which also consists solely of dynamic factors, was a significant predictor of violence. This significant predictive accuracy of the dynamic START and Clinical scale over a follow-up period of almost 12 years is remarkable, as predictive validity typically decreases over long follow-up periods. A possible explanation for the relatively strong predictive accuracy of dynamic factors in these female forensic psychiatric patients is their psychiatric background, particularly the high prevalence of BPD, often characterized by emotional instability and impulsivity. With respect to the PCL-R, the findings are in accordance with previous studies (see Logan & Weizmann-Henelius, 2012; McKeown, 2010).
Good predictive validity was found for the PCL-R total score for all recidivism, but only for the 3-year follow-up period, and predictive accuracy for violence was low. A possible explanation is that female psychopaths do not rely on violent, intimidating behavior, but instead on more subtle, manipulative behavior (see Logan & Weizmann-Henelius, 2012). This is also in line with our finding that PCL-R Facet 1 Interpersonal was the only facet with significant predictive accuracy for all recidivism at 3-year follow-up, whereas in mainly male samples, PCL-R Facet 4 Antisocial is usually the most predictive of recidivism (see Walters, Knight, Grann, & Dahle, 2008). Our results are also in line with Richards et al. (2003), who found PCL-R Factor 1 to be most predictive in a sample of incarcerated female substance abusers.
The HCR-20V3 yielded more significant AUC values than the HCR-20. Several aspects of the HCR-20V3 are deemed better for violence risk assessment in women compared with the HCR-20 Version 2. For example, victimization, which is an important risk factor for females in both childhood and adulthood (Benda, 2005; van Voorhis et al., 2010), should now be considered for the entire lifespan instead of only during childhood. Furthermore, in the HCR-20V3, the psychiatric variables, Items H6 Major Mental Disorder and C3 Symptoms of Major Mental Disorder, are formulated more broadly and are better specified than in the HCR-20. Finally, the HCR-20V3 distinguishes between problematic behavior during different developmental stages and traumatic experiences, which is an important distinction for females.
The SAPROF and the START Strength scores did not demonstrate significant predictive validity in this female sample. In a previous study, the SAPROF showed better results for women (de Vries Robbé et al., 2016). However, in that study, the SAPROF was examined for inpatient violence and not for recidivism after discharge. It should also be noted that both tools were designed to predict short- to medium-term relapse, whereas in the present study, the overall follow-up period was substantially longer. Still, the START Vulnerability scores were significant predictors at both the medium- and long-term follow-up. Hence, the value of protective factors for recidivism after discharge for female forensic patients could not be confirmed in the current study.
Contrary to our hypothesis, the FAM total scores did not yield higher AUC values than the HCR-20V3 total scores for violent recidivism, and not substantially higher values than the HCR-20 total scores. In fact, the FAM performed slightly worse than the HCR-20V3. Still, both the FAM total score and the final risk judgment nonviolent criminal behavior were significant predictors for all recidivism. The good predictive validity for the final risk judgment suggests that the process of coding risk assessment tools may help the assessor to deliberate more thoroughly about risks and thus provides support for the SPJ method. This is in line with a previous Dutch study (de Vogel & de Ruiter, 2005), although it should be noted that there was an overlap of 15 women in the samples. Nevertheless, the predictive accuracy of the FAM total score and the final risk judgment Violence for violent recidivism, for which the FAM was primarily designed, was low. The FAM was originally developed as an addition to the HCR-20 Version 2. The findings from the current study suggest that the HCR-20V3 is improved with respect to use with women and that the FAM is less useful as an addition to the HCR-20V3 than to the HCR-20. A possible explanation for the low predictive accuracy of the FAM for violent recidivism is that, for some of the factors, there is clear empirical support for the relation with general criminal offending, but not specifically with violence. Furthermore, for a number of items a correlation was found with previous violent behavior, but this does not necessarily mean that the factor is also related to future violent behavior. Still, mental health professionals have reported that they consider the FAM a valuable addition to the HCR-20 and HCR-20V3.
Most importantly, they state that using the FAM—which does not take much extra time to code in addition to the HCR-20/HCR-20V3—is valuable because it raises awareness of possible important gender issues in treatment and risk management (de Vogel & Louppen, 2017; Griswold et al., 2016). Furthermore, the FAM may be a valuable tool to predict risk outcomes other than convictions for violence after discharge, such as inpatient violence, self-destructive behavior, and victimization, which are also crucial to prevent.
Limitations
Several limitations to the present study should be acknowledged. First, the sample in this study was small and selective, as it only included forensic psychiatric patients, who are characterized by severe offenses and complex psychopathology. Only discharged patients were included in the study, and this might have caused a restricted range of scores on the instruments. It should also be noted that nine women from the sample were readmitted to civil psychiatry, where reconviction is probably less likely to occur because of the institutionalization. The base rate of offending was rather low, especially for violent recidivism at the short follow-up period of 3 years (n = 6). Although ROC analyses are assumed to be insensitive to base rates, this low base rate and small sample could have lowered the estimates of predictive validity (see Flores et al., 2017; Mossman, 2013). Thus, the small and selective sample may have affected the results and may have resulted in an underestimation of reconvictions and of the predictive accuracy of the instruments. Still, the sample is representative of Dutch forensic psychiatry, as it included almost all of the discharged female forensic psychiatric patients in the Netherlands between 1993 and 2012. Considering the fact that females are a minority in forensic psychiatry, we believe that the current study adds to the scarce literature on (violence) risk assessment in women. However, the generalizability to more general gender-sensitive violence risk assessment is limited. Ideally, this study will be replicated in other samples, for instance, from the penitentiary system or general psychiatry.
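The way a small number of recidivists limits the precision of an AUC estimate can be illustrated with a rough sketch. It uses the Hanley-McNeil approximation for the standard error of an AUC, which is an assumption chosen for illustration and not necessarily the estimator used in this study. The AUC of .70 is a hypothetical value; the group sizes are the 6 violent recidivists at 3 years noted above and the 24 women reconvicted for any offense over the total follow-up, each out of 71.

```python
import math


def hanley_mcneil_se(auc: float, n_pos: int, n_neg: int) -> float:
    """Approximate standard error of an AUC (Hanley-McNeil formula).

    n_pos is the number of cases with the outcome (recidivists),
    n_neg the number without it (nonrecidivists).
    """
    q1 = auc / (2.0 - auc)
    q2 = 2.0 * auc ** 2 / (1.0 + auc)
    variance = (
        auc * (1.0 - auc)
        + (n_pos - 1) * (q1 - auc ** 2)
        + (n_neg - 1) * (q2 - auc ** 2)
    ) / (n_pos * n_neg)
    return math.sqrt(variance)


HYPOTHETICAL_AUC = 0.70  # illustrative value, not a study result
TOTAL_N = 71

for n_recidivists in (6, 24):
    se = hanley_mcneil_se(HYPOTHETICAL_AUC, n_recidivists, TOTAL_N - n_recidivists)
    lower = HYPOTHETICAL_AUC - 1.96 * se
    upper = HYPOTHETICAL_AUC + 1.96 * se
    print(f"recidivists = {n_recidivists:2d}: SE = {se:.3f}, "
          f"approx. 95% CI = [{lower:.3f}, {upper:.3f}]")
```

Under these assumptions, the same AUC has a considerably larger standard error with 6 recidivists than with 24, so an AUC of this size is much harder to distinguish from chance (.50) when the base rate is low.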
Second, this study was retrospective, and although the files were extensive, not all items of the tools could be coded, especially not on the dynamic tool START, for which two cases had to be excluded from the predictive validity analyses. For all other tools, the number of missing items was negligible. Prospective studies into the predictive validity of risk assessment tools in women would be valuable. However, considering that females form a minority and have relatively low discharge rates and high mortality and readmission rates, this will take time. Another limitation is that we did not examine interrater reliability for all tools in the present study, but only for the HCR-20/FAM Historical items for a small sample from the total sample in this Dutch multicenter project into gender differences. Finally, an important limitation is that the outcome measure used in this study was restricted to official reconviction data. The reconviction rate may be an underestimation of the actual recidivism because not all offenders are reported, apprehended, and arrested. Moreover, the outcome was only related to recidivism, while the FAM and START are also used to assess other types of risks, and both tools have previously been found to predict self-harm and inpatient violence (Campbell & Beech, 2018; O’Shea & Dickens, 2015). For future studies, it would be important to include other outcome measures, for instance, unofficial data on violence, data on victimization, and data on self-harm.
Future Research
More research is definitely needed with multiple risk assessment tools (both SPJ tools and actuarial tools) in correctional and psychiatric samples of female offenders, with several outcome measures, such as incidents during treatment, incidents of self-harm, victimization, or other negative outcomes. More specific studies, for example, into individual risk factors or more specific groups of female offenders, like sexual offenders, could yield valuable new insights. A direct comparison with a matched male sample would also be worthwhile. The present study comprised adult women. As it has been reported that the group of adolescent female offenders is growing, there is an urgent need to conduct more studies into the value of risk assessment and management in girls and young women (see also Emeka & Sorensen, 2009; Moretti et al., 2005). A recent meta-analysis found good results for an actuarial tool, the Youth Level of Service (YLS), for both male and female juvenile offenders (Pusch & Holtfreter, 2017). This corresponds with the review by Geraghty and Woodhams (2015), who found the strongest results for the adult version, the LSI (see also the results of the meta-analysis of Olver et al., 2014). More research applying both SPJ instruments and actuarial tools, especially those that include dynamic risk factors, would be valuable. In the present study, we only examined SPJ risk assessment tools, next to the PCL-R, as their use is mandated in Dutch forensic psychiatry. Finally, considering the results on PCL-R Facet 1 Interpersonal in particular, we strongly recommend conducting more studies into the different psychopathic features in women.
Clinical Implications
Overall, this study shows that SPJ tools such as the START, the HCR-20V3, and the FAM as well as the PCL-R have some, but not very convincing, predictive value in adult female forensic psychiatric populations. The assessor should thus exert caution in the interpretation of the results of risk assessments, especially when important decisions need to be made, for example, regarding mandatory admission to a hospital. We strongly recommend more research, but based on the results of the present study, we cautiously recommend using the HCR-20V3 for violence risk assessment in forensic mental health care and the START for more general risk assessment, especially when short- to medium-term assessments are needed or when only recent information to code dynamic risk factors is available. The FAM may be a useful addition to the HCR-20V3 for more gender-sensitive risk assessment and management, mainly for clinical purposes but not for improving predictive power. Furthermore, for clinical practice, it is important not only to consider the risk of violence or general recidivism in society but also to pay attention to other risks, such as self-destructive behavior, victimization, or substance use.
Footnotes
Authors’ Note:
The authors wish to thank Gerjonne Akkerman-Bouwsema, Anouk Bohle, Yvonne Bouman, Nienke Epskamp, Susanne de Haas, Loes Hagenauw, Paul ter Horst, Stéphanie Klein Tuente, Jeantine Stam, Eva de Spa, Nienke Verstegen, Michiel de Vries Robbé, and all other colleagues from FPC Oldenkotte, FPC Woenselse Poort, FPK Assen, and the Van der Hoeven Kliniek who contributed to this study. This study was conducted with official permission from the directors of the four settings.
