Abstract
The Static-99 is the most commonly used risk assessment instrument for sexual violence in North America and its results can affect highly consequential decisions made in the criminal and civil justice systems. Despite its influence, few studies have systematically examined how the Static-99 is used by clinicians in practice. The current study compares the Static-99 ratings of clinicians to those of researchers for 100 adult males who completed an outpatient sex offender treatment program and were followed up over an average of about 4 years. Results showed good agreement between the ratings of clinicians and researchers for total scores on the Static-99, as well as for most individual items. Ratings by clinicians tended to be slightly lower than those made by researchers. The predictive validity of ratings made by clinicians and researchers was very similar and moderate in terms of effect size. In 30 cases, clinicians used discretion to “override” or adjust the Static-99 ratings when making final risk judgments, but the predictive validity of the clinically adjusted ratings was worse than that of the original Static-99 ratings made by clinicians. The need for quality assurance and training is discussed, along with the need for clear, empirically supported guidelines regarding overrides.
The most commonly used assessment instrument for evaluating the risk for sexual violence in both Canada and the United States is an actuarial risk assessment instrument called the Static-99 (Archer, Buffington-Vollum, Stredny, & Handel, 2006). According to the Interstate Commission for Adult Offender Supervision (2007), the majority of states in the United States use actuarial risk assessment instruments at some point during the time in which a sex offender is supervised by the criminal justice system and at least 30 states use the Static-99.
The Static-99 was developed by Hanson and Thornton (1999) and is designed to predict the risk that a convicted sexual offender will commit an act of sexual or general violence in the future. As such, the Static-99 items were selected based on their observed relationship with those outcomes and explicit rules are provided in the manual regarding how items should be combined to produce an overall level of risk. The Static-99 includes 10 items (see Table 1) that can be coded from file information alone. Each item is scored and summed to produce a total score out of 12. Total scores can be categorized into one of four risk groups (0-1 = low risk, 2-3 = moderate-low risk, 4-5 = moderate-high risk, and 6+ = high risk) and correspond to predicted probabilities of future recidivism. For example, a total score of 3 indicates moderate-low risk and a 12% likelihood of sexual recidivism over 5 years.
Table 1. Research and Clinical Intermethod Agreement
Consistent with the authors of other actuarial instruments, Anderson and Hanson (2010) “recommend against adjusting the Static-99 findings” (p. 263). For example, they state that scores should not be overridden (i.e., changed) even if the offender receives a low risk score but has committed very violent offences. This recommendation follows directly from research evaluating the use of overrides by supervising officers, which found poor agreement regarding when overrides should be used and a slight decrease in the predictive accuracy of the instrument (Hanson, Harris, Scott, & Helmus, 2007). However, when officers considered risk factors from another risk assessment instrument to override their score (Hanson & Harris, 2000), predictive accuracy increased. Thus, in contrast to other actuarial instruments and the original rules of the Static-99, Anderson and Hanson (2010) state that overrides can be made to the Static-99 when the user incorporates empirically based risk factors such as those in the Stable 2007 and Acute 2007 (Hanson et al., 2007).
Numerous studies have examined the reliability and validity of the Static-99. Excellent levels of rater agreement have been found in both research and in applied settings (Hanson & Morton-Bourgon, 2009). However, atypically low levels of rater agreement have also been found (e.g., ICC1 = .63; Ducro & Pham, 2006). Researchers have suggested that low levels of rater agreement may be due to inadequate records, lack of training, complexity of the items, and the absence of quality assurance procedures (Anderson & Hanson, 2010). The consequence of rater disagreement is that the instrument will not provide the maximum potential level of predictive accuracy within the sample it is being applied to (Anderson & Hanson, 2010).
On average, the predictive accuracy of the Static-99 has been somewhat lower (d = .70) than that found in the development samples (d = .78, for a review see Hanson & Morton-Bourgon, 2009). The differences in predictive validity found across samples are more than would be expected by chance, and Anderson and Hanson (2010) state that it is not clear what the cause of the differences is. Possible causes include differences in the populations under examination, problems with the instrument itself, or problems with the way in which the instrument is being used by raters. For example, raters may be careless in their use of the instrument or may not be following the rules as set out in the manual.
Despite the large volume of research conducted on the reliability and validity of risk assessment instruments, including the Static-99, only a small proportion of that research has focused on the use of these instruments by clinicians. Examining the clinical use of risk assessment instruments in practice is important due to the significant implications of the decisions made on the basis of these instruments. The Static-99 has implications for many decisions related to sex offenders such as community registration, treatment determinations, community notification determinations, civil commitment, sentencing recommendations, and community supervision (Doren, 2004). Perhaps the most influential of those decisions is in sexually violent predator (SVP) proceedings. These proceedings are significant because they can result in the indefinite loss of freedom for the individual under review. Actuarial risk assessment instruments are commonly used by clinicians in their evaluations for SVP proceedings and have been found to be strongly associated with clinicians’ opinions about whether to civilly commit offenders (Jackson & Hess, 2007; Levenson, 2004).
Boccaccini, Murrie, and colleagues examined the use of actuarial instruments, including the Static-99, in SVP proceedings in the United States. They found lower rates of reliability (ICC1 = .58-.64; Murrie et al., 2009) and predictive validity (AUC = .55 for sexual violence; Boccaccini, Murrie, Caperton, & Hawes, 2009) than had previously been found in research settings.
It is evident that the Static-99 has been widely accepted for use in decision making and predicting sexual recidivism. The popularity of this actuarial risk assessment instrument is likely due to the structured nature of the scoring, the standardization and transparency of the process, and the improvement on unstructured professional judgment (Anderson & Hanson, 2010; Hanson & Morton-Bourgon, 2007; Janus & Prentky, 2003). The majority of research on the Static-99 focuses on its use by researchers in a wide variety of settings. Relatively little research has examined how the Static-99 is used by clinicians in practice, where it can have substantial implications for the individuals being assessed. Furthermore, as described above, the research that does exist tends to find lower rater agreement and predictive validity for clinicians compared with researchers. This trend suggests that the differences may be the result of systematic flaws in the way in which the Static-99 is implemented in practice as opposed to flaws in the Static-99 itself. For example, systematic differences in scoring in adversarial contexts may indicate evaluator partisanship with the party retaining their services. Risk scores calculated by treatment providers posttreatment might also show systematic biases, where evaluators lean toward reduced risk.
The current study examines the way in which the Static-99 is used in practice by comparing clinicians using the Static-99 to evaluate offenders attending an outpatient forensic treatment facility to researchers evaluating the same cases. Our research questions relate to two aspects of risk assessment: first, the utilization of a risk assessment instrument, and second, the consequences of the way in which the instrument was implemented. With respect to how the Static-99 was used in practice, the following research questions will be examined. First, what is the intermethod agreement between clinicians and researchers? Second, what could account for differences in scores? The consequences of clinicians’ use of the Static-99 will be explored by examining two additional research questions. Third, what consequences do the differences in scores between clinicians and researchers have on the risk determination made in the case? And fourth, what consequences do the differences in scores between clinicians and researchers have for the prediction of sexual violence?
Method
Offenders
The present sample included the file review of 100 male offenders who were required, as part of their sentence for a sexual offence conviction, to participate in an outpatient treatment program. The majority of the offenders were serving a provincial sentence (2 years less a day) whereas some were serving a federal sentence (≥ 2 years). At the time of treatment, the average age of the offenders was 41 years (SD = 12.41, range = 19-77 years). The majority of the offenders were White (66%), 15% were native North Americans, and 19% were classified as other. During their lifetime, more than half of the offenders (78%) had committed a sexual offence against a child (45% of offences were intrafamilial and 48% were extrafamilial) and 53% had offended against an adult (28% against a stranger and 29% against an acquaintance).
Procedure
All 100 files were coded by two graduate students, both of whom were trained on the Static-99 and blind to the recidivism data. At intervals of 10 files, the two researchers made consensus ratings for each file. Consensus ratings were made through discussion and a reexamination of the file where disagreement persisted. Interrater agreement was indexed using the intraclass correlation coefficient for single ratings, calculated using a two-way mixed effects model with absolute agreement. The intraclass correlation coefficient for single ratings, or ICC1, was .88; the estimated interrater reliability of consensus ratings, indexed using ICC2, was .94. Consensus ratings were used for subsequent analyses. After making their own ratings, the researchers coded the Static-99 ratings made by the clinicians who evaluated the offenders at the clinic. Researchers also took note of whether overrides were made to the Static-99 scores and what other instruments were used in the clinicians’ evaluations. (As noted previously, an override is when clinicians report risk estimates for a Static-99 score that differ from the standard rates, due to the presence of case-specific moderating factors.) All of the clinicians had received extensive training in the use of the Static-99 and the Sex Offender Need Assessment Rating (SONAR) prior to making their ratings.
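For readers unfamiliar with the agreement index, the following is a minimal sketch of a single-rating, absolute-agreement ICC computed from two-way ANOVA mean squares, together with the Spearman-Brown step-up from a single rating to the mean of k ratings. It is not the authors' analysis code (the software used is not reported); the function names and the toy ratings are assumptions made for illustration only. With a single-rating value of .88 and two raters, the step-up yields about .94, which is in line with the ICC2 reported above.

```python
from statistics import mean


def icc_a1(ratings):
    """Single-rating, absolute-agreement ICC for a files-by-raters layout.

    ratings: list of rows, one row per file, one column per rater.
    """
    n = len(ratings)       # number of files rated
    k = len(ratings[0])    # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares (files, raters, residual).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: rater mean differences count against agreement.
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def spearman_brown(icc_single, k):
    """Reliability of the mean of k ratings, given single-rating reliability."""
    return k * icc_single / (1 + (k - 1) * icc_single)


# Hypothetical total scores from two raters on five files (not study data).
toy = [[0, 0], [2, 3], [4, 4], [6, 5], [9, 9]]
print(round(icc_a1(toy), 3))
print(round(spearman_brown(0.88, 2), 2))  # 0.94
```

Because the index is computed for absolute agreement, a rater who is consistently one point lower than another would reduce the coefficient even if the two rank the files identically.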
Police records indicate that only a minority of reported sexual offences result in arrest or conviction (Anderson & Hanson, 2010). For this reason we chose to define recidivism as any police contact related to a sexual offence, following the offender’s completion of the outpatient sex offender treatment program. Police contact included a police investigation, a new charge, or a new conviction. Recidivism data were gathered by a police constable from federal, provincial, and municipal databases, blind to all the Static-99 ratings. The follow-up time period began after the completion of treatment (between November 2002 and April 2004) and ended on June 1, 2007. As offenders did not complete treatment at the same time, the follow-up period varied between offenders, but was on average 3.69 years (SD = .47, range = 3-4 years). The rate of sexual recidivism was 18% over the follow-up period.
Analysis
Agreement between Static-99 ratings made by researchers and clinicians also was indexed using ICC1, calculated for absolute agreement using a two-way mixed effects model. In general, ICCs > .75 are considered excellent, ICCs between .60 and .75 are considered good, ICCs between .40 and .60 are considered moderate, and ICCs < .40 are considered poor (Fleiss, 1986).
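The descriptive bands above can be expressed as a small lookup. This is only a sketch: the function name is hypothetical, and the treatment of values falling exactly on .40, .60, or .75 is an assumption, because the text gives the bands without specifying how boundary values are assigned.

```python
def interpret_icc(icc):
    """Map an ICC to the descriptive bands quoted from Fleiss (1986).

    Boundary handling is an assumption: exactly .75 is treated as "good",
    exactly .60 as "good", and exactly .40 as "moderate".
    """
    if icc > 0.75:
        return "excellent"
    if icc >= 0.60:
        return "good"
    if icc >= 0.40:
        return "moderate"
    return "poor"


# Values taken from figures reported elsewhere in the article.
for value in (0.92, 0.88, 0.63, 0.56):
    print(value, interpret_icc(value))
```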
A Cox regression survival analysis was conducted to compare the predictive validity of research and clinical Static-99 total scores. This method is appropriate for the data because the follow-up time (or time at risk) differed across offenders and some of the offenders did not recidivate. For the regression analysis, the total scores were grouped into the four risk categories indicated in the Static-99 manual. Scores of 0 or 1 were low risk and coded as 1, scores of 2 or 3 were moderate-low risk and coded as 2, scores of 4 or 5 were moderate-high risk and coded as 3, and scores of 6 or more were high risk and coded as 4.
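The grouping rule described above can be sketched as follows. The function and constant names are hypothetical; the cut-offs and the 1 to 4 coding are those stated in the text. The example also shows why a one-point scoring discrepancy matters only when it straddles a cut-off.

```python
from bisect import bisect_right

# Lowest total score belonging to categories 2, 3, and 4, respectively.
CUTOFFS = [2, 4, 6]
LABELS = ["low", "moderate-low", "moderate-high", "high"]


def risk_category(total_score):
    """Return (code, label) for a Static-99 total score between 0 and 12."""
    if not 0 <= total_score <= 12:
        raise ValueError("Static-99 total scores range from 0 to 12")
    idx = bisect_right(CUTOFFS, total_score)
    return idx + 1, LABELS[idx]


print(risk_category(3))  # (2, 'moderate-low')

# A one-point discrepancy changes the category only across a cut-off.
print(risk_category(2)[0] == risk_category(3)[0])  # True: same category
print(risk_category(3)[0] == risk_category(4)[0])  # False: different category
```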
Results
Intermethod Agreement Between Clinicians and Researchers
Intermethod item agreement for the Static-99 ranged from moderate to excellent (see Table 1), with a median ICC1 of .77. The lowest agreement was for Item 3, “index nonsexual violence,” ICC1 = .56, and the highest agreement was for “prior sex offences,” ICC1 = .89. Intermethod agreement for total scores was excellent, ICC1 = .92 (see Table 1).
The mean clinical Static-99 score was 2.89 (SD = 2.21, range 0-9), and the mean research score was 3.20 (SD = 2.45, range 0-10). Total scores were identical in 57 cases and different in 43 cases. The majority of the differences in total scores involved researchers providing higher total scores (n = 31, 72%) as a consequence of endorsing more or higher levels of risk factors. In these cases, research scores were between one and four points higher than clinical scores. Clinical total scores were higher in 12 cases (28%) but were only ever higher by one point.
Possible Reasons for Differences in Scores
Sufficient information was available in 75 cases to examine whether clinicians made calculation errors when summing their Static-99 scores. Calculation errors were made in 6 (8%) of the 75 cases.
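A check of this kind can be automated. The sketch below recomputes a total from the item ratings and compares it with the total recorded in a report. The function name and the example ratings are hypothetical; the only facts taken from the article are that the instrument has 10 items and a maximum total of 12.

```python
def check_total(item_scores, reported_total, n_items=10, max_total=12):
    """Recompute a total score and flag a mismatch with the reported total.

    Returns (computed_total, matches_report).
    """
    if len(item_scores) != n_items:
        raise ValueError(f"expected {n_items} item scores")
    if any(score < 0 for score in item_scores):
        raise ValueError("item scores cannot be negative")
    computed = sum(item_scores)
    if computed > max_total:
        raise ValueError(f"computed total exceeds the maximum of {max_total}")
    return computed, computed == reported_total


# Hypothetical ratings for one file; the report lists a total of 3.
ratings = [1, 0, 0, 1, 2, 0, 0, 0, 0, 0]
print(check_total(ratings, reported_total=3))  # (4, False)
```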
The necessary file information to determine if a clinical override was made was available in 91 of the 100 cases. An override is when a clinician chooses to assign a risk to the case that does not match the numerical score the person received on the Static-99. For example, an individual might score 1 on the Static-99, which corresponds to a determination of low risk but the clinician would report them as being moderate-low risk. Clinicians chose to override Static-99 scores in 30 cases (33%). In 17 (57%) of those cases, the override increased the risk rating and in 13 (43%) cases, it decreased the rating. Clinicians may have been using the SONAR to override Static-99 scores in 10 (33%) cases, as they indicated that they completed the SONAR in their report. Conversely, 17 (57%) of the clinicians who chose to override the Static-99 did not use the SONAR in their risk assessment; information was missing for 3 individuals.
It should be noted that the Static-99 manual allows overrides to be made when an offender has been in the community for an extended period of time (e.g., 5-10 years) without committing a sexual offence. This rule does not account for the overrides made in this study as all of the individuals had recently (within the past year) been convicted of a sexual offence.
Consequences of the Differences in Scores Between Clinicians and Researchers on the Risk Determinations Made in the Cases
Of the 43 cases in which there was an intermethod discrepancy regarding the total score, 18 (42%) of those differences also resulted in disagreements regarding the risk level (i.e., low, moderate-low, moderate-high, and high) in the case. In other words, the differences in the numerical scores corresponded to differences in the risk level associated with that score. Discrepancies between scores also resulted in differences in the percent likelihood of sexual recidivism to which the scores corresponded in 36 (84%) of the 43 cases.
Consequences of the Differences in Scores Between Clinicians and Researchers on the Prediction of Sexual Violence
Survival analysis of time to reconviction revealed moderate predictive validity for the consensus research Static-99 scores, χ2 (1) = 7.87, p = .005, Hazard Ratio = 1.91, 95% CI for Hazard Ratio = [1.20, 3.05]. Very similar results were found for the clinical ratings of the Static-99, χ2 (1) = 7.39, p = .007, Hazard Ratio = 1.87, 95% CI for Hazard Ratio = [1.18, 2.95]. When both the research and clinical Static-99 scores were entered into the regression analysis together as covariates, they did not differ significantly in their ability to predict sexual recidivism.
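To help interpret these hazard ratios, the sketch below converts a reported hazard ratio and its confidence interval to the log scale and shows what the fitted model implies across several risk-category steps. It assumes that the interval is a Wald-type interval, symmetric on the log scale, and that the hazard ratio applies to a one-step increase in the four-level risk category used in the regression; neither assumption is confirmed by the article, and the function names are hypothetical.

```python
import math


def log_hr_and_se(hr, ci_low, ci_high, z=1.96):
    """Recover the log hazard ratio and its standard error from a 95% CI.

    Assumes a Wald interval of the form exp(beta +/- z * se).
    """
    beta = math.log(hr)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
    return beta, se


def hr_over_steps(hr, steps):
    """Hazard ratio implied by the model across `steps` category steps."""
    return hr ** steps


beta, se = log_hr_and_se(1.91, 1.20, 3.05)
print(round(beta, 3), round(se, 3))

# Under proportional hazards, the model-implied contrast between the
# highest and lowest categories (three steps apart) is the cube of the HR.
print(round(hr_over_steps(1.91, 3), 2))  # 6.97
```

The three-step figure is a property of the fitted model, which imposes the same multiplicative increase at every step. It does not show that the observed recidivism rates actually rose evenly across adjacent categories.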
Although both research and clinical scores were moderate predictors of sexual recidivism, the observed recidivism rates for the first three risk categories (low, moderate-low, and moderate-high) were very similar. This is apparent in Table 2 (first six columns), which presents the number and proportion of recidivists in each risk category, along with the 95% CI for each proportion. The 95% CIs for the lowest three risk categories overlapped substantially. This indicates that the moderate predictive accuracy of the Static-99 was primarily due to the fact that the highest risk category had a recidivism rate different from that of the lowest three risk categories.
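As an illustration of how a 95% CI for a proportion of recidivists can be obtained, the sketch below uses the Wilson score interval. The article does not state which interval method was used for Table 2, so the choice of method is an assumption, and the function name is hypothetical. The example uses the overall recidivism rate reported in the Method section (18 of 100 offenders) rather than any category-level figure.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = events / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(18, 100)
print(round(low, 3), round(high, 3))
```

With category sizes much smaller than 100, intervals of this kind become wide, which is one reason the intervals for adjacent risk categories can overlap substantially.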
Table 2. Proportion of Recidivists in Static-99 Categories According to Research, Clinical, and Clinical Override Scoring
In addition to examining the clinical Static-99 total scores, the final risk ratings as reported by the clinicians were also examined. These final ratings were made by clinicians on the nominal 4-point scale (low, moderate-low, moderate-high, and high) and, as previously described, were subject to clinical overrides in 30 cases. The final clinical override risk ratings were predictive of sexual recidivism, χ2 (1) = 4.39, p = .036, Hazard Ratio = 1.62. However, the clinical override scores were less predictive of sexual recidivism than the scores without overrides.
To determine the value of the clinical overrides, we modeled time to recidivism using both the regular Static-99 scores obtained by clinicians and the scores they made with overrides. This model had significant predictive validity, χ2 (2) = 8.00, p = .018. This model was not significantly better than that of the regular clinical Static-99 ratings on their own,
Discussion
Intermethod Agreement
Agreement was excellent for Static-99 total scores. However, clinical total scores matched research scores in only just over half (57) of the cases. The differences were most often due to higher researcher ratings, meaning that researchers were endorsing more items or higher item values. Many of the Static-99 items also showed excellent intermethod agreement, though others had surprisingly low rates of agreement given how straightforward they appear. For example, Item 3, “index nonsexual violence,” had the lowest agreement, being just above chance. Given that their index offence was what brought the offenders into the justice system, it seems reasonable to assume that a lack of information about the index offence was not the cause of the low agreement. Other items with low agreement included “Any unrelated victims” and “Prior sentencing dates.” Low levels of agreement for these items were surprising given that the scoring rules are not complex and raters should be able to score prior sentencing dates based on an official criminal record alone.
Although the results indicate that these discrepancies exist, it is unclear why they exist. Possibilities include that the Static-99 item definitions are unclear, that raters are not being sufficiently careful when making ratings, or that raters forget the coding rules and do not refer back to the manual when making ratings. This latter possibility coincides with Anderson and Hanson’s (2010) emphasis on the importance of consulting an instrument’s manual when making ratings. Researchers were more likely to endorse higher scores, which may indicate that clinicians failed to pick up on file information that was detected by researchers. Again, this could be due to several reasons including carelessness, time constraints, or because clinicians failed to refer to the manual. Discrepancies cannot be attributed to missing or incomplete information as both researchers and clinicians had access to the same file information. We also found no evidence of systematic bias among the clinicians. Ratings were made prior to treatment and the clinicians had no vested interest in the value of the ratings.
It should be noted that we cannot say with certainty who erred in their scoring when discrepancies were present between the scores of clinicians and researchers. However, research ratings were made by each researcher and then consensus ratings were made which involved a discussion of the items and a reexamination of the file information where necessary. Clinical ratings were made by lone clinicians on their own case files. As such, it seems reasonable to assume that where discrepancies exist, research ratings are likely to be more accurate than clinical ratings.
Although it is possible that a combination of unconscious errors, forgetfulness, and misinterpretations caused some of the discrepancies, others were due to conscious choices made by clinicians to override Static-99 total scores. Some of the overrides may have been made using the SONAR. However, more than half were not, indicating a deviation by clinicians from the recommended use of the Static-99.
Implications of Intermethod Differences
The discrepancies found were not numerically large; however, results showed that discrepancies due to both unconscious error and conscious choice were substantial enough to alter the risk level and percent likelihood of recidivism to which the scores corresponded in a majority of the cases where discrepancies were present. These alterations to the risk determinations are important because risk level and percent likelihood of recidivism are the means by which clinicians communicate their findings to the criminal justice system. As such these differences could potentially translate to real differences in how offenders are perceived and dealt with by the criminal justice system. For example, determinations where risk is increased could result in longer sentences or make someone a potential candidate for sexually violent predator proceedings which can lead to indefinite incarceration.
Implications for Prediction
In addition to altering risk determinations, clinical overrides also decreased the predictive validity of Static-99 scores. This decrease was to be expected based on the recommendations of most proponents of actuarial assessment instruments not to override scores, as well as the premise on which actuarial instruments are based, namely that optimal empirically based algorithms, as opposed to unstructured clinical judgment, are the best predictors of future violence. In addition, other studies have also found that clinical overrides made to actuarial scores decrease predictive validity (Gore, 2007; Hanson, 2007; Vrana, Sroga, & Guzzo, 2008). Other proponents of actuarial methods permit clinical overrides based on the use of empirically based risk factors (Anderson & Hanson, 2010). In this sample, less than half of the overrides were made in accordance with those requirements (i.e., by using the SONAR). The results suggest that, despite what authors and user manuals advise, pure actuarial or judgment-free risk assessment is not always what occurs in practice.
New recidivism rates as well as new methods of determining and reporting recidivism risk using the Static-99 were recently proposed by Helmus, Hanson, and Thornton (2009). On the basis of updated samples, new recidivism rates were calculated for two different types of offender samples, routine Canadian correctional samples and preselected high risk samples, thereby resulting in a range of recidivism risk estimates. For example, for a score of 5 on the Static-99, the estimated sexual violence recidivism risk is 10.2% over 5 years for Canadian correctional samples and 23.1% over 5 years for high risk samples. When calculating final risk scores, evaluators are now encouraged to use the new recidivism estimates and indicate where they think the individual they evaluated falls within the estimated range. Things to consider when making this determination are “the risk-relevant characteristics of the population from which the offender is selected” and “risk-relevant characteristics of individual offenders” (Helmus et al., 2009, pp. 3-4). As the revised Static-99 recidivism rates were published after our data were collected, they were not available for use by the clinicians or the researchers in our study. It is clear, however, that interpretation of Static-99 scores now requires considerably more judgment than it did previously. This may be problematic, given that clinical judgment reduced the predictive accuracy of the Static-99 in our study. Direct evaluation of clinical use of the revised recidivism rates would seem to be a high priority for future research.
Another finding related to predictive validity was that the difference between high risk ratings and the lower three ratings accounted for most of the predictive accuracy of the Static-99 in this sample. In other words, offenders in the low, moderate-low, and moderate-high risk categories did not differ from one another in their recidivism rates. These results call into question the utility of the graded risk categories. Furthermore, they suggest that users should be wary of making distinctions between offenders in the lower three risk groups as they may not differ substantially in their risk for reoffending in a sexually violent manner. Having said this, it should be noted that larger samples have found distinctions between the four risk categories, which may mean that our findings were the result of sampling error. In addition, our findings may reflect greater treatment efficacy for the intermediate risk groups, making them indistinguishable from the lowest risk group. As such, the incremental validity of the risk categories should be examined in future studies.
Our study was limited by the fact that we had to rely on incidents of sexual violence being reported to police. It is generally acknowledged that recidivism rates based on official reports are likely to underestimate the true recidivism rate. To minimize underestimation, we used police contacts as a measure of recidivism. This avoids potential “false negative” errors that occur when an offender is investigated for alleged sexual violence but police were unable to obtain sufficient evidence to proceed with arrest, charge, or conviction.
Finally, our study was limited in that we were unable to confer directly with clinicians concerning why they chose to override scores.
Implications for Practice
Few studies have examined how actuarial models can be best incorporated into clinical practice (Elbogen, 2002). Our results suggest that even with training we cannot assume that actuarial risk assessment instruments will always provide simple and error free assessments of risk. Although they offer a more objective and unbiased approach to risk assessment, they are, in fact, not free of clinical judgment when they are used in practice. Furthermore, when such judgment is imposed it is detrimental to the predictive validity of the instrument.
Borum (1996) suggests that proper implementation of risk assessment in the field requires more than an assessment instrument; clinical practice guidelines as well as training programs and curricula must also be present. Guidelines have been provided regarding when clinical overrides can be made (Anderson & Hanson, 2010; Hanson & Thornton, 1999; Helmus et al., 2009). However, the guidance provided is minimal. For example, the circumstances under which overrides are permitted are not always clear. Overrides can be made to the Static-99 risk category by incorporating empirically based risk factors (Anderson & Hanson, 2010). However, the pool of empirically tested risk factors is large, and conflicting evidence exists for some risk factors. Furthermore, there is no indication of how big of a change can be made to risk ratings, how the change influences confidence ratings or the predicted likelihood of recidivism, and how this type of override should be communicated in a clinical report. The lack of clarity surrounding overrides may account for the decrease in predictive validity when overrides were used in this study. On the basis of our findings, additional and more detailed guidelines regarding the appropriate use of overrides should be tested empirically and provided to clinicians. Alternatively, clinicians should be discouraged from overriding Static-99 scores under any circumstances.
To further improve the performance of the Static-99, we concur with the recommendations of Anderson and Hanson (2010) that raters should routinely refer back to the instrument’s manual when making ratings. Occasional checks on rater agreement could also be done to ensure that shifts in scoring practices do not occur. Monahan (1993) suggests that large institutions should engage in continuing education with respect to risk assessment. For instance, a risk educator could be appointed to remain abreast of new research in the area and inform others. This would be of value for those using the Static-99 and the Stable and Acute 2000 as both have undergone revisions since they were first published.
Implementing any risk assessment instrument in practice comes with challenges, and our study indicates that we cannot assume that raters will comply with all of the rules and recommendations set forth by the authors of risk assessment instruments. Even though the Static-99 manual is available online, reading it is not equivalent to training. Training should be strongly recommended and it should be made clear to users that, despite how simple the instrument may seem to code, there are in fact many rules and subtleties that they must be aware of. Both training and the Static-99 manual should emphasize the need to be cautious when scoring and adding items. Automated scoring procedures could also be used to eliminate addition errors. Whether or not overrides are permitted, they should be discussed at length in training, either by delineating clear guidelines as to when they are permitted or by describing how they should not be used given the negative consequences they can have on predictive validity.
More broadly, the authors of all actuarial instruments may need to clarify whether overrides are appropriate for their instruments and, if so, under what specific circumstances. Initially, there appeared to be a clear rule against clinical overrides; however, that rule seems to have changed for some actuarial instruments and not others.
Future studies should examine the use of other risk assessment instruments by clinicians to determine if similar issues arise when those instruments are used in practice. Researchers should also investigate whether there are other factors that affect the reliability and validity of instruments when they are used in practice, such as the amount of training the user has or the type of the file information they have available. Clinical reports regarding risk assessment can have a major impact on the determinations made regarding an offender and the safety of current and potential future victims. As such, it is important to ensure that instruments are used properly and in the way they were designed to be used.
Acknowledgements
The authors gratefully acknowledge the assistance of Jay Healey in data collection and helpful feedback from R. Karl Hanson on an earlier draft of the article.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Jennifer Storey was supported by a doctoral studentship from the Social Sciences and Humanities Research Council of Canada.
