Abstract
Actuarial scales provide a relatively objective and reliable assessment of individuals’ risk of recidivism. Recent research has explored how graphs can improve quantitative risk communication. We tested whether graphs can improve understanding and perception of sexual violence risk when matched with the risk metric. Participants (N = 676) were recruited from Amazon’s MTurk platform and read a brief description of a man convicted of a sexual offense, including results of a fictional sexual recidivism risk scale. In Study 1, absolute risk of recidivism enabled participants to distinguish between individuals with relatively high and low risk of sexual recidivism. In Study 2, this distinction was enhanced by adding a graph, especially when percentiles were communicated. Risk ratios increased perceived risk. Objective numeracy increased understanding and reduced perceived risk. We recommend that risk communication assume limited statistical numeracy and that further research with practitioners test the effect of graphs and risk metrics on forensic/judicial decisions.
More than 20 years have passed since actuarial scales became widely available for assessing the risk of interpersonal violence (e.g., Harris et al., 1993) and sexual recidivism (e.g., Hanson & Thornton, 2000; Olver et al., 2007). Actuarial scales are now widely used (e.g., Kelley et al., 2018; Neal & Grisso, 2014). They are at least as accurate as structured clinical judgment methods (e.g., Campbell et al., 2009; Hanson & Morton-Bourgon, 2009; Singh et al., 2011; Yang et al., 2010). Unlike other methods, actuarial data are useful for evaluating the fairness and cost-effectiveness of forensic decision-making and related policies (e.g., Blasko et al., 2010; Hanson, Bourgon, et al., 2017; Harris & Rice, 2013; Lindsay et al., 2010). At the level of the individual, risk is most accurately conceptualized as a prognostic dimension and communicated statistically (e.g., Helmus & Babchishin, 2017). Actuarial scales provide an objective and reliable basis for communicating individuals’ risk of recidivism (e.g., Harris et al., 2015), especially given concerns with more imprecise and therefore problematic risk communication metrics such as categories (e.g., “high risk”; Hanson, Babchishin, et al., 2017; Scurich, 2018) or dichotomous predictions (e.g., “dangerous”; Helmus & Babchishin, 2017).
However, an international survey of more than 2,000 risk assessors revealed a predominant preference for communicating risk using categories (e.g., “high risk,” “low risk”) followed by a dichotomy (e.g., “is” or “is not” likely to engage in violence; Heilbrun et al., 2016). Only 14% of assessors reported using quantitative metrics such as percentage likelihood of violent recidivism, although the use of these metrics is notably higher in surveys of assessors conducting court evaluations for indeterminate incarceration (Blais & Forth, 2014; Chevalier et al., 2015). This preference seems unchanged since the 1990s, when researchers first asked forensic clinicians to explain their choice of risk communication methods, revealing discomfort with numbers and statistical reasoning (e.g., “I don’t know how to go from base rates to single cases”; Heilbrun et al., 1999, p. 403). Recent research has explored how visual aids can improve understanding of risk and optimize forensic decision-making. The present study tests the benefit of both different risk communication metrics and accompanying graphs on perceived risk and prediction of a subsequent reoffense.
Risk Communication Metrics
Actuarial scales are unique in being able to provide at least three quantitative risk communication metrics, in addition to nominal risk categories (e.g., “low,” “high”) that can be based on these metrics. Each metric has strengths and weaknesses. Absolute probabilities of recidivism (e.g., “individuals with this score have been found to reoffend at a rate of 22% over 5 years”) have become the most frequently provided metric for actuarial scales and most relied on by assessors (e.g., Blais & Forth, 2014; Chevalier et al., 2015), but they are the least stable metric, demonstrating considerable variability across samples (for review, see Helmus, 2018). Given the difficulty of specifying reliable and generalizable recidivism estimates, there have been calls to reduce reliance on this metric in favor of relative risk indices such as percentiles and risk ratios (Harris et al., 2015; Helmus, 2018).
Percentiles (e.g., “this individual is among the top 10% of highest risk men with a history of sexual offending”) are frequently used in psychology and any field involving norm-referenced assessment, and as an index of relative risk they provide particularly helpful information for resource allocation (Harris et al., 2015). However, there is no direct link between differences in percentile ranks and different likelihoods of recidivism, which can complicate differentiations among individuals at different risk levels (Hanson et al., 2012).
Risk ratios (e.g., “individuals with this score are twice as likely to reoffend as the typical man with a history of sexual offending, defined by the median score”) provide relative risk information in a way that focuses on the magnitude of increasing risk with ascending risk scores. Although they have the benefit of considerable stability across diverse samples and settings (Helmus et al., 2012), they can potentially contribute to overestimation of risk if not contextualized with base rate information (for discussion, see Hanson et al., 2013). For example, a risk ratio showing that an individual is roughly 5 times more likely to reoffend than the typical person convicted of a sex offense sounds like recidivism is nearly certain, but when the base rate is 6%, this still means they are more likely to not reoffend than to reoffend (e.g., 30%). 1
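The arithmetic behind this example can be sketched in a few lines. This is a minimal illustration that assumes the implied absolute risk is simply the base rate multiplied by the risk ratio, mirroring the worked example above; the function name `implied_absolute_risk` is hypothetical, and actual scales may derive their recidivism estimates differently.

```python
def implied_absolute_risk(base_rate: float, risk_ratio: float) -> float:
    """Approximate the absolute recidivism probability implied by a risk ratio.

    Assumption (for illustration only): absolute risk = base rate x risk ratio,
    capped at 1.0 so the result remains a valid probability.
    """
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must be a probability between 0 and 1")
    if risk_ratio < 0:
        raise ValueError("risk_ratio must be non-negative")
    return min(1.0, base_rate * risk_ratio)


# Values from the example in the text: base rate of 6%, risk ratio of 5.
risk = implied_absolute_risk(base_rate=0.06, risk_ratio=5.0)
print(f"Implied probability of reoffending: {risk:.0%}")          # 30%
print(f"Implied probability of not reoffending: {1 - risk:.0%}")  # 70%
```

Even with a risk ratio of 5, the implied probability of not reoffending remains more than twice the probability of reoffending, which is the context that base rate information supplies.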
One risk communication study compared nominal risk levels, risk ratios, and absolute recidivism estimates among prospective jurors in a case study of a man serving a prison sentence for a sex offense (Varela et al., 2014). Although there was not a significant main effect of risk communication metric, there was an interaction between communication metric and risk level of the vignette (i.e., a high or low score on Static-99R). Participants perceived the low scoring case as significantly lower risk than the high scoring case only when the nominal risk category was presented, not for risk ratios or absolute probability estimates. The pattern of findings also suggested higher perceptions of risk overall for the high-risk condition when a nominal risk category was used compared with risk ratios or absolute probability estimates. The evidence also suggested that participants either did not understand or ignored the risk ratio information. When asked whether the individual was more likely or less likely to reoffend than other offenders, participants who were given risk ratio information showed no difference in their response depending on whether the risk ratio communicated above or below average risk (overall, 80% of participants reported the individual was more likely than others to reoffend).
Krauss et al. (2018) attempted to replicate some of Varela and colleagues’ (2014) findings. They compared nominal labels and absolute recidivism estimates in a sample obtained from Amazon’s Mechanical Turk (MTurk), examining four risk levels instead of two. Similar to Varela and colleagues, they obtained risk categories and recidivism estimates from Static-99R; however, instead of measuring perceived risk on a Likert-type scale, they used a dichotomous measure of whether a participant believed beyond a reasonable doubt that the offender was more likely than not to commit a future sexually violent offense (i.e., their verdict for a Sexually Violent Predator commitment decision). They did not conduct statistical tests for interactions, and analyses of the effect of risk level were particularly low-powered because the four risk levels were not treated as ordinal. Nevertheless, chi-square test results appeared generally consistent with the comparable interaction found by Varela et al. (2014): risk level was significantly related to verdicts only when risk was communicated categorically. This study additionally found that vengeance attitudes predicted verdicts, and participants given categorical risk levels estimated that they were associated with recidivism probabilities considerably higher than what is empirically established for the Static-99R.
Numeracy Effects
Statistical illiteracy or, more specifically, “risk illiteracy” (Garcia-Retamero & Cokely, 2013) is a common limitation of human reasoning. Even among well-educated adults, performance on tests of numeracy and statistical reasoning can be poor (e.g., Hilton et al., 2017; Lipkus et al., 2001; Taylor & Byrne-Davis, 2017). Shuster (2011) complained that a medical screening test that requires Bayesian reasoning to interpret is “bad,” and a test with a “straight yes or no answer” is needed (p. 342). A clear dichotomous answer cannot be provided from a risk assessment, which offers a prognosis requiring probabilistic information, rather than a diagnosis (Helmus & Babchishin, 2017). Furthermore, professional practice guidelines preclude making dichotomous predictions (Association for the Treatment of Sexual Abusers, 2014). Yet to be effective, risk communicators must be sensitive to the needs of those receiving the message, a key aspect of which is numeracy.
People with relatively good numeric comprehension deliberate and evaluate decisions more than others (Ghazal et al., 2014); apply base rates, relative risks, and percentage information more effectively (e.g., Bodemer et al., 2014); and process numbers in graphs rather than overall impressions (e.g., Kreuzmair et al., 2016). They are less influenced by inessential information such as framing (e.g., being told that outcomes are 75% positive or 25% negative; e.g., Peters et al., 2006), anecdotal details about the assessment procedure (e.g., Scurich, 2015), or the listing out of individual items from a risk assessment (e.g., Scurich et al., 2012). Effective risk communication, therefore, should provide the minimal information necessary, in consistent terms, using statistical information, and aid the receiver’s understanding of the statistics.
Visual Aids in Violence Risk Communication
Forensic researchers have drawn on studies of medical and other personal risks to guide recommendations for violence risk communication (e.g., Hilton et al., 2015). Icon arrays that illustrate the n-in-x proportion of individuals with disease relative to the referent population have been found to aid risk perception, especially in more challenging tasks (e.g., Leonhardt & Robin Keller, 2018). Bar graphs have been used to compare probabilities across risk categories (e.g., Wilhelms & Reyna, 2013), and simple bar graphs with brief explanatory labels can aid risk communication among less numerate people (e.g., Okan et al., 2017). Both types of graphs have been used to aid comprehension and communication for some actuarial scales for general violence (e.g., Harris et al., 2015) and domestic violence recidivism (e.g., Hilton et al., 2010). However, there has been little uptake of lessons learned from the medical risk communication literature, and only two previous studies have tested the use of graphs in forensic risk communication.
Hilton and colleagues (2017) compared the effect of four different visual aids on college students’ perceptions of violent recidivism risk and hypothetical decisions about security level. They used two kinds of bar graphs, showing either the absolute probability of violent recidivism or the percentile rank associated with each of nine categories of risk on an actuarial risk assessment tool. They also tested thermometer-style graphs showing an individual’s percentile and a series of pie charts that illustrated the absolute probability of recidivism across the nine categories. Participants read case vignettes of two individuals with different levels of assessed risk, twice each. The first time, no statistical risk information was presented. The second time, statistical risk information was provided. Some participants received statistics only (i.e., risk category, percentile rank, percentage likelihood of violent recidivism in 10 years), and others were given one of the four graphs in addition to the statistical information. Having statistical information improved participants’ ability to distinguish the higher and lower risk case, and the probability bar graph was significantly more effective than statistics only. However, when this bar graph was tested with a sample of forensic clinicians, it did not confer a benefit over communicating statistics only (Hilton et al., 2017).
A similar study was conducted with college students, faculty, and staff by Batastini and colleagues (2019). Participants read a case vignette of one individual that included actuarial statistics (i.e., risk assessment score, absolute probability of recidivism in 7 and 10 years, and a risk ratio stated as “one half of the overall expected base rate”). One group of participants also received an explanation of base rate (defined as “how often a particular event occurs” given as a proportion and percentage, along with an example from medical risks). Another group was given this explanation plus a pie chart of the proportion of smokers with and without cancer, and a probability bar graph similar to that used by Hilton and colleagues (2017) but displaying both 7- and 10-year probabilities of violent recidivism. Having the graph had no effect overall on ratings of risk or danger.
Purpose of Current Study
Batastini and colleagues (2019) did not include a measure of numeracy. Hilton and colleagues (2017) used only three items modified from Lipkus and colleagues’ (2001) numeracy scale and found no numeracy effects on forensic clinicians’ use of risk information. In addition, although these previous studies used graphs designed to illustrate absolute probability and/or percentiles, the visual aids were not selected for their actual effectiveness as aids to communication of these statistics. Furthermore, no visual aids for risk ratios were included. The present research sought to address these gaps in three ways. First, we used alternative illustrations of absolute probability, testing their effects on adults’ perceived risk of sexual recidivism in a hypothetical case. Second, we then presented a new sample with the hypothetical case and provided absolute probability, percentile, or risk ratio statistics, and tested the additional effect of a graph suited to illustrating the specific statistical information. Third, we included both an eight-item subjective numeracy scale and an eight-item objective numeracy test, and examined the effect of numeracy overall and in interaction with the experimental statistics and graph conditions.
In this study, our hypothetical case examined a sexual offense as opposed to a violent offense. This was selected because the three quantitative risk communication metrics (percentiles, risk ratios, absolute recidivism estimates) are more frequently available for sexual offending risk assessment scales (e.g., Lehmann et al., 2016; Phenix et al., 2016) and are more frequently used by evaluators (e.g., Blais & Forth, 2014; Chevalier et al., 2015). In addition, our materials (e.g., phrasing of outcome variables) were based on those used by Varela and colleagues (2014), who also used a sexual offense. Like Krauss et al. (2018), this study is a replication and extension of Varela et al. (2014), but rather than extending it to different risk levels in a Sexually Violent Predator verdict decision, we extended it to different risk communication metrics (by adding percentiles) and additionally examined the value of graphs and measured numeracy. We also did not restrict the context to Sexually Violent Predator verdicts.
Both of our studies were reviewed and approved by the institutional research ethics board of the second author’s institution. Both studies recruited samples through MTurk, which has been used with success in previous studies of violence risk communication (e.g., Scurich et al., 2012). Despite some concerns about crowdsourcing for research participants, including their participation in multiple studies, potential inattentive responding, and researchers’ post hoc exclusion of cases, it has been found to increase sample diversity compared with traditional recruitment methods (e.g., Chandler et al., 2014). Participants in MTurk studies have been found to reflect sociodemographic and political characteristics of respondents in population-based research (e.g., Levay et al., 2016). No systematic differences have been found between patterns of findings using commercial online samples and those using more conventional samples (Walter et al., 2019). We included predetermined attention checks.
Study 1
The purpose of Study 1 was to collect preliminary data to ensure study feasibility and to determine whether we should use a bar graph or an icon array to communicate the absolute recidivism risk information. We expected that participants with higher numeracy scores would find the actuarial risk assessment results easier to understand (Hypothesis 1), that actuarially higher risk cases would be perceived as higher risk than lower risk cases (Hypothesis 2), and that adding a graph would increase this difference in perceived risk (Hypothesis 3). We did not have a priori expectations about the relative benefits of the bar graph or icon array.
Method
Participants were recruited from Amazon’s MTurk platform. We aimed to meet the guideline of at least 30 participants per cell for adequate power (Van Voorhis & Morgan, 2007). Our total sample for Study 1 included 153 participants completing surveys that passed all checks (55 cases were deleted after manipulation checks). 2 The survey took an average of 10.6 min to complete and participants were paid US$1 for their time.
Participants were equally male and female (each n = 76, 50%, <1% “other”) with a mean age of 36 years (SD = 11.8, range from 20 to 72). Most were college graduates (n = 72, 47%), or had some college education (n = 45, 29%), high school diploma or equivalent (n = 21, 14%), or a graduate degree (n = 15, 10%). Most identified as White non-Hispanic (n = 127, 83%), followed by Black (n = 13, 9%), Asian (n = 11, 7%), Hispanic or Latino/Latina (n = 6, 4%), or other (n = 2, 1%; percentages add up to more than 100 because participants could select multiple options).
Measures
We used the Numeracy Understanding in Medicine Instrument Short Form (Schapira et al., 2014) as an objective test of participant numeracy. It was selected based on its higher participant satisfaction ratings compared with other measures of objective numeracy (Dolan et al., 2016). Although developed in the medical field using health information, it consists of questions examining understanding of numerical information such as comparisons (e.g., determining which value represents a higher score or falls within a particular range) and data presented in percentages, frequencies, fractions, or multiplication. Total scores range from 0 to 8, reflecting the number of questions answered correctly (in the current study, M = 6.27, SD = 1.74). Proposed normative data for the scale suggest scores of 0 to 3 could be considered low numeracy, 4 to 6 as average numeracy, and 7 to 8 as high numeracy.
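The proposed normative bands described above can be expressed as a simple lookup. The following is an illustrative sketch only; the function name `numeracy_band` is hypothetical and is not part of the instrument or of the analyses reported here.

```python
def numeracy_band(total_correct: int) -> str:
    """Map a total score (0 to 8 items correct) to the proposed normative band.

    Bands follow the cutoffs stated in the text: 0-3 low, 4-6 average, 7-8 high.
    """
    if not 0 <= total_correct <= 8:
        raise ValueError("total_correct must be an integer from 0 to 8")
    if total_correct <= 3:
        return "low"
    if total_correct <= 6:
        return "average"
    return "high"


# Boundary scores on either side of each cutoff.
for score in (3, 4, 6, 7):
    print(score, numeracy_band(score))
```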
We used the Subjective Numeracy Scale (Fagerlin et al., 2007) to measure subjective self-assessments of numeracy skills. In addition to participants finding the scale faster and easier to complete than objective numeracy scales (Fagerlin et al., 2007), it is possible that self-assessments of numeracy may reflect both numeracy and anxiety about numeracy. Four items assess perceived skill with math (e.g., “How good are you at calculating a 15% tip?”), and four items assess preference for numerical data (e.g., preference for probabilities vs. words in weather predictions). Questions are answered on a Likert-type scale from 1 to 6 (scale anchors vary based on question wording). Total scores range from 1 through 6 (averaged across items), with higher scores reflecting higher subjective numeracy (in the current study, M = 4.60, SD = 0.99).
Procedure
Vignette materials (including primary outcome measures) were developed based on the study materials used by Varela and colleagues (2014) and are available upon request. Some modifications were made, including using a fictional risk scale instead of Static-99R. All participants read some introductory paragraphs explaining that criminal justice systems have limited resources and that measures to treat, manage, or monitor people who commit criminal offenses should be targeted toward the highest risk individuals. Basic information was provided to explain how risk scales are developed based on research. Participants read about “Mr. Donaldson,” who was convicted of sexual assault for raping an acquaintance. They were told that he was scored on a fictional risk scale called the Offender Risk Assessment Guide (ORAG), which research has shown predicts reoffending among men similar to Mr. Donaldson.
Participants were told that Mr. Donaldson’s score was either low or high on the ORAG (scores of 3 and 10 were used). Scores were explained with absolute recidivism probability estimates using language similar to the Static-99R sample report templates (Phenix et al., 2016), which included contextual information about overall 5-year base rates of sexual reoffending (indicating that they usually average between 5% and 15% across routine samples). Low ORAG scores were associated with a 4% rearrest rate in 5 years, and high scores were associated with a 20% rearrest rate. Participants read this information with no graph, an icon array graph, or a bar graph (see Figure 1 for samples).

Figure 1. Sample icon array graph and bar graphs used for Study 1.
After reading the vignettes, participants were asked a series of follow-up questions, including those from the above measures. Primary outcome variables included the following questions: “How likely is it that Mr. Donaldson will commit a sexual offense in the next 5 years?” “How dangerous is Mr. Donaldson to members of the community?” and “How much of a threat does Mr. Donaldson pose to members of the community?” (questions were answered on a Likert-type scale from 1 to 6, with higher scores reflecting higher risk). Participants were asked whether Mr. Donaldson was more likely or less likely to commit a new sexual offense compared with the “typical sex offender.” They were also asked (on a Likert-type scale from 1 to 6) how much the ORAG influenced their rating and how difficult it was to understand the ORAG results.
Results
All analyses were conducted independently by both authors to verify accuracy. The mean subjective numeracy score was M = 4.60 (SD = 0.99, Cronbach’s α = .86), objective numeracy was M = 6.27 (SD = 1.74, Cronbach’s α = .71), and the two were positively correlated, r (152) = .376, p < .001. The mean rating for ease of understanding the ORAG test results was 2.03 (SD = 1.09), indicating that participants found the results easy to understand. In a two (risk level) × two (graph vs. no graph) analysis of variance with objective and subjective numeracy as covariates, having a graph was a significant factor in the ease of understanding the ORAG test results, and objective numeracy was also significant (total model adjusted R2 = .201; Table 1). Higher objective numeracy was associated with less difficulty in understanding the results, r = −.451, p < .001, consistent with Hypothesis 1. There was no effect of subjective numeracy or risk level and no significant interaction of risk level and having a graph (Table 1). Participants with a graph rated the results easier to understand than those without a graph, M = 1.90 (SD = 1.02) vs. M = 2.31 (SD = 1.19), Cohen’s d = 0.38, 95% confidence interval (CI) = [0.04, 0.72]. Among only participants who received a graph, the type of graph was not a significant contributor to ease of understanding test results, nor were subjective numeracy, risk level, or the interaction of risk level and graph type. Objective numeracy was the only significant factor (total model adjusted R2 = .207; Table 1).
Table 1. Analysis of Covariance Results for Ease of Understanding Actuarial Risk Assessment Results (Study 1).
Note. Model 1: R2 = .228 (adjusted R2 = .201); Model 2: R2 = .245 (adjusted R2 = .207).
Ratings of likelihood to reoffend, danger to the community, and threat to the community showed strong evidence of consistency in measuring perceived risk in this sample (Cronbach’s α = .94). Therefore, we combined these ratings into an average measure of perceived risk, M = 3.64 (SD = 1.40). In a two (risk level) × two (graph vs. no graph) analysis of variance, risk level was a significant factor in the combined measure of perceived risk, and objective numeracy was also significant (total model adjusted R2 = .192; Table 2). Higher objective numeracy was associated with lower perceived risk (r = −.400, p < .001). Perceived risk was greater in the actuarially higher risk case than the lower risk case, M = 4.05 (SD = 1.25) vs. M = 3.28 (SD = 1.43), Cohen’s d = 0.57, 95% CI = [0.25, 0.89], consistent with Hypothesis 2. Subjective numeracy, graph, and the interaction of risk level and graph had no significant effect (Table 2; detailed results in Table 3). The lack of interaction effect fails to support Hypothesis 3. The same pattern of results was obtained among only participants who received a graph (total model adjusted R2 = .132; Table 2). Graph type (bar graph or icon array) did not have a significant effect. Post hoc examinations of means and 95% CIs across the graph conditions indicated that, in the absence of a graph, participants did not significantly discriminate between the higher and lower risk cases, but those with either graph did (Figure 2).
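The effect size reported above for the higher versus lower risk case can be approximately reproduced from the summary statistics. The following is a minimal sketch that assumes equal group sizes (cell sizes are not given in this section), so the pooled standard deviation is taken as the root mean square of the two SDs. The helper name `cohens_d_from_summary` is illustrative, and the exact computation used in the analyses may differ.

```python
import math


def cohens_d_from_summary(mean_a: float, sd_a: float,
                          mean_b: float, sd_b: float) -> float:
    """Cohen's d from group means and SDs.

    Assumption (for illustration only): the two groups are of equal size,
    so the pooled SD is sqrt((sd_a^2 + sd_b^2) / 2).
    """
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Perceived risk: higher risk case (M = 4.05, SD = 1.25)
# versus lower risk case (M = 3.28, SD = 1.43).
d = cohens_d_from_summary(4.05, 1.25, 3.28, 1.43)
print(round(d, 2))  # 0.57
```

Under this equal-n assumption the sketch matches the reported d = 0.57. Confidence intervals are not reproduced here because they depend on the group sizes.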
Table 2. Analysis of Covariance Results for Averaged Perceived Risk (Study 1).
Note. Model 1: R2 = .219 (adjusted R2 = .192); Model 2: R2 = .175 (adjusted R2 = .132).
Table 3. Perceived Risk as a Function of Risk Level and Graph (Study 1).
Note. Cohen’s d is comparing high versus low risk in each graph condition. 95% confidence interval in brackets.

Figure 2. Perceived risk of relatively low-risk (shaded squares) and high-risk (open squares) cases as a function of type of graph used in Study 1.
Discussion
Participants with higher objective numeracy scores rated the actuarial results easier to understand, consistent with Hypothesis 1. Quantitative risk metrics enabled participants overall to perceive relatively high- and low-risk cases differently, consistent with Hypothesis 2. However, graphs did not significantly increase this perceived difference, failing to support Hypothesis 3. Furthermore, we did not detect an overall difference in perceived risk between the bar graph and icon array illustrations of the absolute probability of recidivism. Nevertheless, Study 1 was conducted to determine whether a bar graph or an icon array was the better choice for communicating the absolute recidivism risk information. Therefore, in the absence of any statistical finding favoring one graph over the other, we chose the bar graph for this purpose in Study 2, based on its slightly higher effect size (see Table 3).
Study 2
The purpose of Study 2 was to examine the effect of communicating the actuarial risk assessment results as percentile, absolute probability, or risk ratio statistics, and to test the additional effect of a graph suited to illustrating the specific risk metric. We again expected that participants with higher numeracy scores would find the actuarial risk assessment results easier to understand (Hypothesis 4) and that actuarially higher risk cases would be perceived as higher risk than lower risk cases (Hypothesis 5). In line with Varela et al.’s (2014) findings, we expected this perceived difference to differ between risk metrics, with the smallest difference for absolute recidivism rates (Hypothesis 6). We continued to expect an effect of including a graph, although our results with the absolute probability metric in Study 1 led us to expect that any graph effect would differ by risk metric (Hypothesis 7).
Method
Participants were recruited from Amazon’s MTurk platform and were nonoverlapping with participants from Study 1 (completion of Study 1 was used as an automatic exclusion criterion in MTurk). The survey took an average of 13.5 min to complete, and participants were paid US$1 for their time. Our total sample for Study 2 included 523 participants completing surveys that passed all checks (84 cases were deleted based on the same manipulation checks as Study 1).
Participants were roughly equally female (n = 283, 54%) and male (n = 238, 46%; <1% “other”) with a mean age of 36 years (SD = 11). Most were college graduates (n = 244, 47%) or had some college education (n = 151, 29%), a graduate degree (n = 74, 14%), or a high school diploma or equivalent (n = 53, 10%); one (<1%) did not complete high school. Participants identified as White non-Hispanic (n = 394, 75%), followed by Black (n = 59, 11%), Hispanic or Latino/Latina (n = 46, 9%), Asian (n = 40, 8%), Native American, Indigenous, or Māori (n = 3, < 1%), or other (n = 9, 2%; percentages add up to more than 100 because participants could select multiple options).
Procedure
We used the same measures, vignettes, and procedure as Study 1, but with expanded independent variables. Risk level (two conditions: low and high) was the same as Study 1. Risk communication metric included absolute recidivism probability estimates, risk ratios, or percentiles (again, largely following reporting templates for Static-99R but modified for the fictional ORAG scale). In addition, participants either received no graph or a graph to assist them in interpreting the information. For those in the graph condition, we used a type of graph designed to communicate that risk communication metric. Those in the absolute probability condition received the same bar graph as Study 1, those in the risk ratio condition received a thermometer graph displaying relative risk, and those in the percentile condition received a type of icon array graph displaying how many out of 10 people would score the same or lower risk, versus higher risk (see Figure 3 for examples).
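The resulting factorial design can be enumerated to make the conditions concrete. This is an illustrative sketch only; the constant and function names are hypothetical, and the condition labels paraphrase the description above rather than reproduce the study materials.

```python
from itertools import product

RISK_LEVELS = ("low", "high")
GRAPH_CONDITIONS = ("no graph", "graph")
RISK_METRICS = ("absolute probability", "risk ratio", "percentile")

# Graph type matched to each risk metric, as described in the text.
GRAPH_FOR_METRIC = {
    "absolute probability": "bar graph",
    "risk ratio": "thermometer graph",
    "percentile": "icon array",
}


def design_cells() -> list[dict]:
    """List every cell of the 2 (risk level) x 2 (graph) x 3 (metric) design."""
    cells = []
    for risk, graph, metric in product(RISK_LEVELS, GRAPH_CONDITIONS, RISK_METRICS):
        # Only participants in the graph condition see a visual aid.
        visual = GRAPH_FOR_METRIC[metric] if graph == "graph" else None
        cells.append({"risk_level": risk, "metric": metric, "visual_aid": visual})
    return cells


cells = design_cells()
print(len(cells))  # 12
for cell in cells:
    print(cell)
```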

Figure 3. Icon array graph representing percentile rank, thermometer-style graph representing risk ratio, and bar graph of absolute probability of recidivism, for the relatively high-risk case, used in Study 2.
Results
All analyses were conducted independently by both authors to verify accuracy. The mean subjective numeracy score was M = 4.57 (SD = 0.93, Cronbach’s α = .85), objective numeracy was M = 6.42 (SD = 1.58, Cronbach’s α = .67), and the two were positively correlated, r (520) = .256, p < .001. The mean rating for ease of understanding the ORAG test results was 2.16 (SD = 1.13), indicating that participants found the results easy to understand. Ease of understanding the ORAG test results was examined in a two (risk level) × two (graph vs. no graph) × three (risk metric) analysis of variance with objective and subjective numeracy as covariates (total model adjusted R2 = .083; Table 4). There was a significant interaction of risk level with risk metric. Mean ratings show that, for the lower risk case, the ORAG was rated easiest to understand when presented as percentiles (M = 1.94, SD = 1.14, 95% CI = [1.69, 2.19]) and hardest as risk ratios (M = 2.40, SD = 1.23, 95% CI = [2.14, 2.67]), but for the higher risk case, the ORAG was rated equally easy to understand regardless of risk metric (Ms = 2.11–2.18, SDs = 0.99–1.15). Objective numeracy was the only other significant factor (r = −.292, p < .001, n = 523), indicating that more numerate participants found results less difficult to understand (consistent with Hypothesis 4). We continued to treat numeracy as a covariate in remaining analyses.
Table 4. Ease of Understanding Actuarial Risk Assessment Results (Study 2).
Note. Model R² = .106 (adjusted R² = .083).
Ratings of likelihood to reoffend, danger to the community, and threat to the community again showed strong evidence of consistency in measuring perceived risk in this sample (Cronbach’s α = .94). Therefore, we combined them into an average rating for analysis. In a two (risk level) × two (graph vs. no graph) × three (percentile, risk ratio, or absolute risk) analysis of variance, objective numeracy was a significant factor (Table 5). There was a three-way interaction of risk level, risk metric, and having a graph, as well as two-way interactions of risk level with graph and with risk metric, and significant main effects of all three independent variables (total model adjusted R² = .222; Table 5). Figure 4 illustrates the three-way interaction. There was a perceived difference in risk between the relatively low- and high-risk cases. This difference was larger when a graph was provided (except for the absolute risk metric), and the effect of the graph (in terms of enhancing the distinction between the low-risk and high-risk cases) was largest for the percentile risk metric. The perceived difference between the relatively low- and high-risk cases was significant and in the expected direction (Hypothesis 5). The interaction of risk level and risk metric is consistent with Hypothesis 6. The interaction expected between graph and risk metric was not supported (Hypothesis 7); however, the three-way interaction is consistent with the more nuanced interpretation that the effect of graph on the perceived difference between risk levels differed across risk metrics. Perceived risk of the actuarially lower risk case was generally lower when there was a graph (except when it accompanied absolute risk statistics), whereas perceived risk of the higher risk case varied as a function of both the graph and the risk metric.
For ease of interpretation, Figure 4 does not include CIs; for closer scrutiny of results, we also provide an omnibus table of perceived risk across risk level, graph, and risk metrics (Table 6).
Table 5. Analysis of Variance Results for Averaged Perceived Risk Ratings (Study 2).
Note. Model R² = .241 (adjusted R² = .222).

Figure 4. Perceived risk ratings for relatively low-risk and high-risk cases, by participants presented with percentile, risk ratio, and absolute risk statistics, as a function of whether a graph was presented (shaded markers) or was not presented (open markers) in Study 2.
Table 6. Perceived Risk as a Function of Risk Level, Graph, and Risk Metric (Study 2).
Overall, more participants endorsed the forced-choice option that Mr. Donaldson “is more likely than the typical sex offender to commit a new sexual offense” (290, 55%) than the alternative option that he is less likely (233, 45%). However, the rate of endorsement was significantly different between risk metric conditions, χ²(2, N = 523) = 17.19, p < .001. Participants were evenly split when presented with percentiles (47% chose “more likely”) or absolute risk (51% chose “more likely”), whereas when presented with risk ratios, 68% chose “more likely.” Given the similarity of these findings to Varela et al. (2014), we explored this further in a binary logistic regression of this dichotomous outcome measure (more likely to reoffend than the typical offender) with risk communication metric (categorical), risk level, presence or absence of a graph, and objective and subjective numeracy as predictors (see Table 7). After controlling for objective and subjective numeracy, risk level, and presence of graphs, participants were less likely to label the individual as riskier than the typical case when risk was presented as percentiles and absolute recidivism estimates compared with risk ratios (odds ratios [ORs] < 0.50). Risk level was the only other significant incremental contributor (OR = 8.5) to the model (Nagelkerke R² = .319, 74% correct classification; Table 7). Because Varela and colleagues (2014) reported no effect of actuarial risk level in their participants’ responses about whether the individual was more likely or less likely to reoffend than typical cases when given risk ratio information, we repeated this analysis with only participants in the risk ratio condition. Risk level was the only significant incremental contributor (OR = 9.8) to the model (Nagelkerke R² = .320, 78% correct classification; Table 8).
Table 7. Logistic Regression Results for Forced Choice More or Less Likely Than “Typical Sex Offender” to Commit a New Sexual Offense (Study 2, n = 521).
Note. CI = confidence interval.
Table 8. Logistic Regression Results for Forced Choice More or Less Likely Than Typical Person Convicted of a Sex Offense to Commit a New Sexual Offense Among Participants Presented With Risk Ratios (Study 2, n = 173).
Note. CI = confidence interval.
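Because odds ratios such as those reported for risk level (OR = 8.5) are easy to misread as probability ratios, a short sketch may help show what an odds ratio of this size implies. The helper name `apply_odds_ratio` and the baseline probability of .30 are assumptions chosen for illustration; the baseline is not a value from the study, and the calculation ignores the other predictors in the model.

```python
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability obtained by multiplying the baseline odds by an odds ratio."""
    if not 0 < baseline_prob < 1:
        raise ValueError("baseline_prob must be strictly between 0 and 1")
    baseline_odds = baseline_prob / (1 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    # Convert the resulting odds back to a probability.
    return new_odds / (1 + new_odds)


# Hypothetical baseline: 30% choose "more likely" in the reference condition.
baseline = 0.30
print(f"{apply_odds_ratio(baseline, 8.5):.3f}")
```

Under this assumed baseline, multiplying the odds by 8.5 raises the probability from .30 to roughly .78, not to 8.5 times .30. The same odds ratio implies a different change in probability at a different baseline.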
General Discussion
The present study, with a total of 676 participants, investigated how laypersons perceived risk of sexual recidivism, and how different risk metrics and graphs affected risk communication using a fictional actuarial risk scale similar to Static-99R. We demonstrated differences in participants’ ratings of risk between hypothetical cases with actuarial risk of 4% and 20% likelihood of sexual recidivism. Of the risk communication characteristics that we manipulated (risk level, graphs, and risk metric), the largest, and sometimes the only significant, effect was obtained for risk level, which shows that, at a minimum, risk assessment results did influence perceived risk to a fairly strong extent (main effect in Study 2, d = 0.78, 95% CI = [0.60, 0.96]). Previous studies that have demonstrated lay and clinician participants’ ability to distinguish between cases with different actuarial risks have used within-participants designs (e.g., Hilton et al., 2008, 2017), which may be subject to experimental demand characteristics and socially desirable responding. The present research replicated this general finding using a between-participants design, demonstrating that participants’ risk judgments can be affected by actuarial risk communication in the absence of obvious comparison stimuli.
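For readers who want to see how a confidence interval around a standardized mean difference such as d = 0.78 can be approximated, here is a minimal sketch using a common large-sample variance formula for Cohen's d. The function name `cohen_d_ci` and the near-equal split of the 523 Study 2 participants across the two risk levels are assumptions; the article does not report the method or the exact group sizes used for this interval.

```python
import math


def cohen_d_ci(d: float, n1: int, n2: int, z: float = 1.959964) -> tuple[float, float]:
    """Approximate 95% confidence interval for Cohen's d (two independent groups)."""
    # Large-sample approximation to the sampling variance of d.
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    half_width = z * math.sqrt(var_d)
    return d - half_width, d + half_width


# Assumption: 523 participants split almost evenly between low- and high-risk cases.
low, high = cohen_d_ci(0.78, 261, 262)
print(f"d = 0.78, 95% CI = [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval rounds to [0.60, 0.96], which agrees with the reported values, although that agreement does not establish that this was the computation used.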
Although participants rated the assessment results easy to understand overall, averaging a rating of 2 on a scale from 1 (very easy) to 5 (very difficult), their scores on an objective numeracy test (but not their subjective numeracy) had a strong and consistent effect on ease of understanding. Specifically, more numerate people found the results easier to understand, regardless of graph inclusion. In addition, participants with higher objective numeracy reported lower perceived risk, suggesting they are perhaps less likely to overestimate risk than less numerate people. In Study 1, providing a graph improved self-reported understanding of absolute risk of recidivism, whereas in Study 2, which compared different risk metrics, adding a graph did not improve self-reported understanding. This discrepancy could be due to sampling error or could reflect the different risk metrics used in Study 2. Specifically, percentiles were considered easier to understand, and risk ratios harder, in the actuarially lower risk cases, whereas all metrics were rated equally understandable in the higher risk case. Varela and colleagues (2014) found that their lay sample appeared more receptive to risk statistics that showed a high risk of recidivism. Our study replicates this finding for risk ratios but suggests this effect may not occur for percentiles. These findings support the further investigation of ways to help communicate quantitative risk metrics, especially when the communication recipients are not expected to be highly numerate.
A key concern in offense risk communication is that the public (and clinicians) tend to overestimate risk (Kahneman, 2011; Mills et al., 2011), and particularly to overestimate risk of sexual recidivism (Helmus, 2016; Krauss et al., 2018; Levenson et al., 2007). In Study 2, we found that graphs helped reduce perceived risk overall, especially for relatively low-risk cases. Thus, the use of graphs in sexual offense risk communication could help bring public perceptions and policy decisions into closer alignment with observed sexual recidivism rates. There was no strong evidence supporting one type of graph over another, and the choice may depend on the most suitable visual aid to illustrate the risk metric used. When graphs were used, risk metrics attenuated or exaggerated their effect. For example, perceived risk of the actuarially lower risk case was generally lower when there was a graph, except when absolute risk was presented. The combination of a graph with percentile information produced the largest distinction between the actuarially higher and lower risk cases.
Risk metrics also had a main effect on perceived risk. When presented with risk ratios, a substantial majority of participants perceived the individual to be more likely than the “typical sex offender” to commit a new sexual offense. This pattern of higher perceived risk when risk ratios are presented mirrors findings from Varela et al. (2014) and suggests that, consistent with the recommendations of Hanson et al. (2013), it is preferable to provide clear base rate information alongside risk ratios to mitigate a possible tendency for risk ratios to inflate perceived risk.
Studies 1 and 2 differed in that only Study 2 found a significant main effect of graph on perceived risk level. This difference could be sampling error, but this is also not necessarily an inconsistency given that Study 1 only examined graphs when communicating absolute recidivism information. In Study 2, the significant interaction suggested that graphs were least effective for absolute risk conditions compared with percentile or risk ratio conditions. This is also consistent with previous research, in which Hilton and colleagues (2017) found a significant effect in a study examining different types of graphs, whereas Batastini and colleagues (2019) found no effect for including a graph in absolute probability risk communication. On the basis of the current and previous studies, graphs may be helpful for some risk communication metrics but not for absolute probabilities. Although a possibility, such a conclusion is premature for two reasons. First, Study 1 was underpowered. Although we aimed to have at least 30 participants per condition, after data cleaning, there were between 18 and 33 participants per condition. In contrast, all conditions in Study 2 had a minimum of 40 participants. In addition, although the interaction plot for Study 2 suggested that graphs were less effective in distinguishing high- and low-risk cases in the absolute probability condition, this effect approached statistical significance (d = 0.28 [−0.02, 0.58]). Future research, and particularly aggregation of findings, may demonstrate a significant benefit of graphs when communicating absolute risk information (although the current study does suggest the effects are greater for other risk metrics).
Limitations
Our study had several limitations, especially those associated with online sampling, such as unrepresentativeness of the target population. Despite some reported concerns with MTurk versus other platforms (e.g., Heen et al., 2014), we obtained virtually equal proportions of male and female respondents, and participants’ ethnicity approximated the racial distribution of the United States (Heen et al., 2014), which suggests good gender and racial representativeness. Our samples were highly educated, although this may approximate the education of most criminal justice practitioners and decision makers. We removed 139 (17%) participants across the two studies due to failure to respond correctly to embedded quality check questions. In future research, prescreening for suitable demographics, as well as motivation, may improve the generalizability of results (e.g., Chandler et al., 2014). At a minimum, however, we believe that MTurk data are more representative of the general population than data from university student populations, which are another common data source.
Another important limitation is our reliance on samples drawn from the general population in an experimental paradigm. Replication of our study with forensic clinicians as participants, and research in real-life settings rather than under experimental survey conditions, is essential for verifying the utility of using graphs and alternative risk metrics in offense risk communication and forensic decision-making. Such research should investigate the effect of multiple pieces of statistical information, which is the recommended practice for actuarial risk assessment reports, rather than comparing metrics used one at a time, as in the present study. Perhaps there is an optimal level of information for eliciting a perceived distinction between relatively high- and low-risk cases, beyond which the high/low-risk distinction plateaus and the communication of additional details adds no benefit. Meanwhile, our participants may reflect the characteristics and responses of potential jurors, as well as the general public, whose perceptions of men who have committed a sexual offense indirectly influence public policy regarding the detainment of persons convicted of sex offenses and reactions to their rehabilitation in the community.
Our case descriptions and questions to participants used the term “sex offender” rather than person-first terms such as “a person who has been convicted of a sexual offense.” We chose our wording to be consistent with materials used in previous research and to increase readability for our lay participants. However, we acknowledge that our choice of wording may have perpetuated biases, including inflated perceptions of risk of sexual recidivism.
Our bar graph y-axis representing percent who reoffend was truncated at 30% rather than presenting the whole possible range of recidivism up to 100% (or truncating at other levels, such as 50%). Truncating the axis resulted in larger visual differences between the heights of the bars and could have artificially magnified the perceived differences between groups. This may have given the bar graph an advantage in our study, compared with a bar graph that portrayed the full range; it is conceivable that stronger differences between graphs could have been observed had we used a full-scale bar graph. On the other hand, 100% sexual recidivism rates are not a realistic outcome (Hanson & Morton-Bourgon, 2009; Helmus et al., 2012). The maximum of 30% is more in line with the higher range of actuarial risk identified by existing sexual recidivism risk assessment scales. Ultimately, identifying the most appropriate y-axis range to communicate results is not obvious and arguments can be found for different approaches, each introducing its own set of possible biases. Actuarial tools serve a useful function in discriminating among individuals with respect to risk, and a useful graph needs to illustrate those differences effectively. Further research could explore additional refinements and manipulations of graphs or test how graphs may enhance the communication of standardized risk levels (e.g., Hanson, Bourgon, et al., 2017).
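The effect of the y-axis maximum on bar heights can be worked through numerically. The sketch below assumes the bars depict the 4% and 20% recidivism estimates of the two cases and that the axis starts at zero; the function name `bar_height_fraction` is hypothetical, and the 50% and 100% maxima are the alternatives mentioned above rather than graphs that were tested.

```python
def bar_height_fraction(value_pct: float, axis_max_pct: float) -> float:
    """Fraction of the plot height a bar fills when the y-axis runs from 0 to axis_max_pct."""
    if axis_max_pct <= 0:
        raise ValueError("axis_max_pct must be positive")
    # A bar cannot extend past the top of the axis.
    return min(value_pct, axis_max_pct) / axis_max_pct


# Assumed bar values: the 4% (lower risk) and 20% (higher risk) estimates.
for axis_max in (30, 50, 100):
    low = bar_height_fraction(4, axis_max)
    high = bar_height_fraction(20, axis_max)
    print(f"axis max {axis_max}%: low = {low:.2f}, high = {high:.2f}, gap = {high - low:.2f}")
```

With a 30% maximum the gap between the bars spans about 53% of the plot height, compared with 16% on a full 0 to 100% axis. The ratio of the two bar heights (5 to 1) is the same in every case, so what changes is the absolute visual separation, which is the feature that might magnify perceived differences.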
Implications for Risk Communication Practice and Research
Using risk information to differentiate among individuals posing different levels of risk is a key outcome of risk communication and has been used in other research as an index of good forensic decision-making (e.g., Hilton et al., 2008, 2017). Given the impact on ease of understanding and distinction of perceived risk, this study supports the value of providing percentile information in risk communication. However, it is hard to know what outcomes are desirable without using a measure of decision-making. Future research should extend the present study to include other outcome measures such as whether participants would sentence the individual to imprisonment, impose registration/notification, commit the person to indeterminate custody, impose higher security, or release on parole.
Objective numeracy appears to affect people’s understanding and perception of risk information more than subjective numeracy does. Our test of objective numeracy was more similar to the risk communication task than the subjective numeracy rating was, perhaps contributing to its stronger association. Furthermore, subjective numeracy may be more influenced by anxiety with numbers in general rather than by the ability to interpret probability and other risk metrics per se. Our findings suggest that attempts to improve risk communication by increasing participants’ confidence in their numeracy would have limited benefit. Instead, efforts should focus on building numeracy skills and developing risk communication methods that assume limited statistical numeracy. Further research will be needed to test such methods and evaluate their utility in field studies with forensic clinicians.
We also found a strong relationship between objective numeracy and self-reported ease of understanding actuarial risk information. We explored alternative analyses examining either objective numeracy or ease of understanding as either covariates or moderators of the analyses, but a full report of these findings is beyond the scope of the current article and is subject to meaningful reductions in statistical power. These additional exploratory analyses did not change the core findings of the present article, but did suggest that objective numeracy and self-reported ease of understanding were related but not interchangeable constructs in understanding risk communication. We plan to aggregate this data set with related ongoing projects using the same moderators and outcome variables, for more detailed research on the role of numeracy and ease of understanding in risk communication.
We measured participants’ perception of risk using their ratings of likelihood to reoffend, danger to the community, and threat to the community, to capture different aspects of offending risk. Responses on these items were closely related, according to internal reliability analyses, and we treated them as a single scale. From a practical perspective, future experimental studies could simplify their assessment of perceived risk by using a single item or a single score on a multi-item measure. Conceptually, offending risk has been thought to be a multidimensional construct, and researchers have begun to explore separate dimensions and their theoretical and practical implications (e.g., Brouillette-Alarie et al., 2017). Future research could examine these and other risk measures used in the experimental literature and in clinical practice of risk communication, using factor analysis to explore dimensionality in the concept of offending risk.
In addition, our outcome of risk perception as a dimension is similar to Varela and colleagues (2014), although other studies have examined dichotomous outcome decisions (Krauss et al., 2018). Risk itself exists along a continuum (Hanson et al., 2013; Helmus & Babchishin, 2017) so a rich understanding of the construct and how it is perceived should similarly be dimensional. Overall, most treatment and supervision decisions should be proportional to risk (Bonta & Andrews, 2017) and the Justice Center’s standardized risk/needs framework groups risk into five different levels of service dosage and intensity (Hanson, Bourgon, et al., 2017). In the real world, however, dimensional understandings of risk sometimes must be converted to dichotomous decisions (e.g., parole, civil commitment) so there is also value in examining how risk perception is ultimately linked to dichotomous decisions. Future research should continue to explore both dichotomous and continuous measures.
Conclusion
The present study adds to the emerging literature on using graphs to aid violence risk communication by matching graphical aids to the risk metrics currently used in sexual recidivism risk assessment. Graphs helped to reduce lay participants’ perceptions of risk in hypothetical cases, especially for relatively low-risk cases, bringing perceptions into closer alignment with sexual recidivism rates observed in existing recidivism research. The effects of graphs also interacted with the risk metric, such that the largest perceived difference between the actuarially higher and lower risk cases was achieved by presenting percentiles illustrated by an icon array graph. Risk ratios were rated the hardest metric to understand, and they were associated with the highest perceived risk, suggesting that caution is required when communicating sexual recidivism risk assessment results using risk ratios. We recommend further research with practitioners as participants, testing additional case examples and graphs, and evaluating the benefits of graphs and risk metrics on forensic and judicial decisions.
Footnotes
Acknowledgements
We thank Annalisa Hughes, Kirsten Brinck, and Kelsey Gushue for assistance with Qualtrics and the data.
Authors’ Note
The authors take responsibility for the integrity of the data and the accuracy of the data analyses, and have made every effort to avoid inflating statistically significant results.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
