Abstract
Do people think about offending risk in verbal or numerical terms? Does the elicitation method affect reported subjective probabilities? Rational choice models require potential outcomes (e.g., benefits/costs) to be weighted by their probability of occurrence. Indeed, the subjective likelihood of being apprehended is the central construct in criminological deterrence theory—the so-called certainty principle. Yet, extant literature has measured the construct inconsistently and with little attention to potential consequences. Using a series of randomized experiments conducted with nationwide samples of American adults (aged 18 and over), this study examines the degree of correspondence between verbal and numeric measures of apprehension risk, assesses the durability of numeric estimates specifically, and attempts to elicit how respondents naturally think about apprehension risk. The findings suggest that laypeople are somewhat inconsistent in their use of both verbal and numeric descriptors of probability, their numeric estimates of probability are unlikely to be precise or durable, and many seem to prefer thinking of risk in verbal terms (compared to numeric terms). Researchers should consider including both verbal and numeric measures of probability and explore alternative measurement strategies, including anchoring vignettes, which have been valuable in standardizing verbal responses in other disciplines.
The perceived likelihood of apprehension, or less frequently, of punishment, is a leading variable in the literature on offender decision making and is the key construct in deterrence theory (Apel, 2013; Pratt et al., 2006). However, a review of the recent literature suggests there is little consensus on the measurement of this construct. There are two primary ways to elicit perceptions of probability in self-report surveys: verbal questions (e.g., “How likely is it that you will be caught if you commit [CRIME]?”) with a Likert-type response scale (e.g., 1 = very unlikely to 7 = very likely), and numeric questions (e.g., “What is the percent chance that you will be caught if you commit [CRIME]?”) with a numerical scale (e.g., 0%–100%) and responses at the ratio level.
The measures employed in the extant literature vary extensively and often appear without justification. Some studies used verbal measures, others used numeric scales ranging from 0 to 100, and still others have used a mixed 11-point scale whereby each category signals a 10% increase in probability from 0 to 100, despite verbal labels. Some of the literature’s diversity of measurement may stem from its variety of data sources. Many studies of apprehension risk have examined American college students (Kamerdze et al., 2014; Loughran et al., 2014; McGloin & Thomas, 2016; Paternoster et al., 2017; Thomas et al., 2018), while another substantial portion have used more general samples from European nations, particularly Russia and Ukraine (Averdijk et al., 2016; Kroneberg et al., 2010; Tittle et al., 2011). Nevertheless, the single most influential source of data in this literature is the Pathways to Desistance study, a longitudinal study of serious adolescent offenders transitioning from adolescence into early adulthood in Maricopa County, Arizona and Philadelphia County, Pennsylvania. Table 1 provides a summary of recent measures of apprehension risk in the criminological literature. Nearly one third of the studies use the Pathways data.
Table 1. Operationalizations of Apprehension Risk in Top Criminology Journals, 2010–2018.
Note. Articles come from a search of Criminology, Justice Quarterly, the Journal of Research in Crime and Delinquency, and the Journal of Quantitative Criminology.
a Data from the Pathways to Desistance study. b Items concerned informal revenge rather than formal arrest.
Studies that have employed numeric measures (e.g., Kamerdze et al., 2014; Paternoster et al., 2017; Pogarsky et al., 2017) typically assume respondents’ reported perceived probabilities of apprehension are precise, literal, and durable numeric estimates (see Thomas et al., 2018, p. 60). If this assumption holds true, these estimates would have several beneficial properties. They would be situated on a well-defined absolute scale (0–100), allow a respondent’s answer on Event A to be compared to their answer to Event B (i.e., internal consistency), and also allow comparison to other respondents’ answers on the probability of Events A and B (i.e., interpersonal comparability; Manski, 2004, p. 1339; see also Thomas et al., 2018). Since most perceptual deterrence research is fundamentally concerned with comparisons within and between individuals, it is difficult to overstate the importance of this assumption.
If numeric responses are not the natural decision-making metric for many individuals (that is, the metric that people naturally use when thinking about apprehension risk), then respondents may be forced into a more complex response process like intensity matching (Kahneman, 2011). This may in turn increase measurement error (see Holbrook et al., 2000; Sweitzer & Shulman, 2018). Such a process might vary across respondents in accordance with individual differences in intelligence or numeracy, creating systematic biases, so that measurement errors are larger for some groups of respondents than others. It may also vary within respondents over time, if situational factors influence how respondents think about and estimate sanction risk. If systematic measurement errors occur, their existence may help explain why research suggests perceived risk of apprehension appears to be a weak predictor of intentions to offend overall (Nagin, 1998; Paternoster, 1987, 2010; Pratt et al., 2006) and why perceived sanction risk is weakly correlated with objective risk (Kleck et al., 2005).
Several recent experiments in our field have examined the substantive qualities of subjective beliefs about apprehension risk, finding these beliefs are intuitive but still coherent within individuals (Pickett et al., 2018; Pogarsky et al., 2017; Thomas et al., 2018). A separate, and heretofore overlooked, research question is how best to elicit those beliefs in order to minimize measurement error and artificiality. This line of inquiry is important because research in other fields suggests that laypeople may have only a tenuous grasp of probability (Alberini et al., 2004; Hacking, 1990, 2006) and may struggle to express their perceptions of likelihood in surveys (Bruine de Bruin et al., 2000; Fischhoff & Bruine De Bruin, 1999). Unfortunately, there is little research in our field that assesses verbal and numerical risk estimates simultaneously, nor is there work exploring whether respondents naturally think about risk in verbal or numerical terms.
Such a study is needed, however, since scholars have often neglected a key potential trade-off of using subjective perceptions of certainty and severity of punishment rather than objective properties of sanction regimes. Objective sanction regime properties (e.g., police clearance rates, sentencing guidelines) may be unknown to or misunderstood by potential offenders, making subjective perceptions of risk more salient to decision making. Nevertheless, what perceived risk measures gain in salience, they may lose in durability over time, precision within respondents, and comparability across respondents. There is evidence to suggest this is the case. Decision makers have limited information, time, and cognitive abilities (e.g., “bounded rationality”; see Simon, 1955), are more sensitive to changes in, than to absolute levels of, the properties of their environment (e.g., “status quo bias” and “hedonic adaptation”; see Kahneman et al., 2006), and employ both analytical, cognitively intensive decision making and intuitive heuristic-based decision making (e.g., “dual-process decision making”; see Mamayek et al., 2015; van Gelder & de Vries, 2014). All of these factors are likely to affect subjective probabilities as well as self-reports of them in surveys.
Findings from behavioral economics indicate a broad need to relax or reconsider many traditional tenets of the deterrence and rational choice perspectives and show that laypeople have difficulty judging the probability of uncertain events (Pickett et al., 2018; Pogarsky et al., 2017, 2018). Several distinct avenues in the criminological rational choice literature suggest individuals may have difficulty providing person-specific information related to arrest risk that can be compared across people. Because respondents rely on intuitive reasoning, their probability judgments can be biased by extraneous and irrelevant information via priming or anchoring (Ariely et al., 2003; Pickett et al., 2018; Pogarsky et al., 2017; Tversky & Kahneman, 1974). As well, there is the role of personality traits in judging risk (Paternoster & Pogarsky, 2009; Pickett & Bushway, 2015) and an interplay between emotions like fear and anger and cognitive assessments of risk (Barnum & Solomon, 2019; Jacobs & Cherbonneau, 2017; Pickett et al., 2018).
Still, in a recent study, Thomas and colleagues (2018) found that although framing and anchoring can bias subjective probabilities, respondents are still “locally coherent” in their rank ordering of arrest risk by crime type. This suggests that within-person changes over time in risk perceptions may be meaningful, even if those perceptions have arbitrary initial levels. Nevertheless, Thomas et al. (2018) used one elicitation method (numerical) with one response scale (0%–100%) and samples of college students. They also did not directly evaluate the process by which respondents think about risk. Additionally, even if, as Thomas and colleagues (2018) recommend, rational choice research emphasizes longitudinal analyses examining within-individual change in risk perceptions, measurement error will still be consequential. Indeed, it may attenuate any relationship between perceived risk and other variables, such as offending or arrest experiences.
Our current research seeks to improve our understanding of the measurement of subjective probabilities of apprehension. It has three principal aims: (1) to assess the correspondence between respondents’ reported verbal and numeric assessments of their likelihood of being apprehended after committing a crime, (2) to examine the internal coherence of respondents’ numeric assessments of apprehension risk, and (3) to assess the manner in which respondents privately think about apprehension risk, either verbally or numerically. If verbal and numeric assessments correspond well with one another, and numeric assessments are coherent, then the current literature’s scattershot approach to measurement is not of great consequence. If, on the other hand, verbal and numeric responses do not correspond well, and numeric responses are not as durable or precise as we would expect, then it behooves us to better understand what metrics individuals prefer and to seek out alternative measurement methods that might incorporate both approaches.
Methods Overview
Data
The current article utilizes experimental data from an online survey of adult (aged 18 and older) U.S. residents recruited from Amazon’s Mechanical Turk (MTurk) in the summer of 2016. MTurk is a crowdsourcing internet marketplace, where “workers” sign up for common information manipulation jobs (i.e., “human intelligence tasks” or “HITs”), such as reviewing advertisements or transcribing interviews. Following accepted best practices for recruiting respondents from MTurk (see Peer et al., 2014), we limited participation to U.S. residents who had completed at least 50 prior HITs and had an approval rating of at least 95% in their prior HITs.
The survey was advertised as the “2016 National Survey on Legal Decision Making,” and 99% of respondents who began the survey successfully completed it. Once respondents began the survey, they were randomly assigned to certain subsets of questions (Studies 1–3), and then some details of the vignettes and the questions in those subsets were also randomized. Specifically, Studies 2 and 3 are split-ballot experiments, which are used in a variety of fields to evaluate measurement, such as testing the effect of the use of neutral categories, “don’t know” options, question wording, and question ordering (see Applegate & Sanborn, 2011; Smith, 1987; Van De Walle & Van Ryzin, 2011). Each sample is mutually exclusive—that is, each respondent was randomly assigned to participate in only one study. Respondents were provided a small monetary reward for their participation. Table 2 provides descriptive statistics for each of the three study samples.
Table 2. Descriptive Statistics for Samples.
Note. SD = standard deviation.
Researchers increasingly use online convenience samples (e.g., Pogarsky et al., 2017; van Gelder & de Vries, 2012), which permit larger and more diverse groups of respondents than more traditional convenience samples, such as college students (Berinsky et al., 2012). Research suggests MTurk respondents are less likely to fail comprehension checks, speed through questionnaires, or engage in item nonresponse (Weinberg et al., 2014). Research has also found evidence supporting the external validity of experimental findings from MTurk samples (Mullinix et al., 2015; Weinberg et al., 2014), and MTurk data have been used in articles published in leading journals across several disciplines (e.g., Hahl et al., 2017; Orvell et al., 2017; Pickett et al., 2018). That said, MTurk is not representative, and effect heterogeneity could be a threat to the generalizability of our findings (see Thompson & Pickett, 2019).
Measurement
Our research design follows recent studies of offender decision making (e.g., McGloin & Thomas, 2016; Paternoster et al., 2017; Pogarsky et al., 2017; van Gelder & de Vries, 2012, 2014) by using hypothetical survey vignettes. The use of such scenarios assumes that respondents know how they would likely act or think in a given situation (Kahneman & Tversky, 1979, p. 265). While respondents were not asked to report how they would personally act in the future, they were asked about their perceptions of risk in particular situations. If responses to such hypothetical situations bear little resemblance to actual behavior, they would provide little useful information. Fortunately, strong evidence exists that choices and perceptions reported in hypothetical scenarios, even scenarios describing dangerous or risky situations, closely correspond to those made in real life (Pogarsky, 2004; Thaler, 2015). In the next section, we describe each study in detail and discuss the associated findings.
Study 1
This first study addresses the degree of correspondence between verbal and numeric measures of apprehension risk (n = 301). In Study 1, respondents provided both verbal and numerical representations of their perceived likelihood of getting caught for a single offense. We first asked respondents to imagine a scenario where they spend a significant amount of time (30 min) driving on the highway to work each morning. Respondents were told they were running late one particular day and would have to drive over the speed limit to reach work on time (see Online Appendix A for exact wording). Respondents were then asked to describe their risk of apprehension (e.g., being pulled over by the police). One measure was a Likert-type question with five categories (1 = very unlikely, 5 = very likely). The other question asked the respondent to type their response in numeric terms using a fill-in-the-blank text box (percent chance, 0%–100%). In order to prevent question ordering effects (Hart, 1998; McFarland, 1981), we randomized the order of the two questions. Table 3 shows respondents’ numeric estimates nested within their Likert-type responses, that is, descriptive statistics for all the numeric estimates for each Likert “group” (e.g., all respondents who stated their risk was “very unlikely”).
Table 3. Descriptive Results of Numeric Probabilities by Verbal Category in Study 1.
Note. n = 301.
Unsurprisingly, the mean values of the numeric estimates within each Likert-type category are logically coherent—the mean for “very unlikely” is the smallest (5.79), and the mean increases for each group. Overall, the correlation coefficient for the numeric and verbal measures is .713. Several findings are noteworthy, however. First, the correspondence between verbal and numeric responses appears asymmetric—there is a much larger numeric difference between “very unlikely” and “unlikely” (11 percentage points) than between “likely” and “very likely” (3 percentage points). Second, the range of numeric probabilities that are mapped onto the Likert-type categories is wide. This is not limited to the “neither likely nor unlikely” group, whose respondents could perhaps be particularly unsure about numeric estimates. Instead, each of the five Likert-type categories includes a broad range of numerical responses and correspondingly large standard deviations. To one respondent, “likely” corresponded to a “60%” probability, but to another, it was “85%.”
Even if we recognize that different people might vary substantially in their interpretations of, for instance, “likely” versus “very likely,” we would like to assume that valence would be consistent across the Likert and numeric measures for nearly all respondents. That is, respondents who report “unlikely” should not report numeric values above 50%, and those that report “likely” should not report numeric values below 50%. In our sample, 38 respondents (13%) violated valence. For example, five respondents said “unlikely” but gave numeric responses at or above 50%, with the highest at 85%, and 17 respondents said “likely” but gave numeric responses below 50%. Four respondents who answered that the risk was “very likely” gave numeric responses at or below 50%. As well, while not strictly a violation of valence, nearly half of the respondents who indicated “neither likely nor unlikely” reported numeric values of 60% or above, or 40% or below, with a distinct skew toward lower values.
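The valence check described above can be expressed as a small rule. The following is a minimal sketch, not the procedure used to produce the counts reported here: the function name, the category coding, and the `flag_midpoint` option are illustrative assumptions. The text states the rule as "above 50%" and "below 50%" but also counts some responses at exactly 50%, so the sketch leaves the treatment of the midpoint as an explicit choice.

```python
# Hypothetical helper: the names, the handling of exactly 50%, and the
# flag_midpoint option are illustrative assumptions, not the authors' code.

# Verbal categories on the five-point scale described for Study 1.
LIKERT_LABELS = {
    1: "very unlikely",
    2: "unlikely",
    3: "neither likely nor unlikely",
    4: "likely",
    5: "very likely",
}


def violates_valence(likert: int, percent: float, flag_midpoint: bool = False) -> bool:
    """Return True when the verbal and numeric answers point in opposite directions.

    likert        -- verbal category coded 1 (very unlikely) to 5 (very likely)
    percent       -- numeric estimate on the 0-100 "percent chance" scale
    flag_midpoint -- whether an answer of exactly 50 counts as a violation
    """
    if likert not in LIKERT_LABELS:
        raise ValueError(f"unknown Likert category: {likert}")
    if not 0 <= percent <= 100:
        raise ValueError(f"percent chance must be in [0, 100]: {percent}")
    if likert == 3:
        return False  # the neutral category has no valence to violate
    if percent == 50:
        return flag_midpoint
    says_likely = likert >= 4
    number_says_likely = percent > 50
    return says_likely != number_says_likely
```

For instance, `violates_valence(2, 85)` flags the "unlikely" respondent with an 85% estimate mentioned above, while `violates_valence(4, 60)` does not flag a "likely" respondent who answered 60%.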
It should also be noted that a small proportion of respondents (n = 16 or 5% of the sample) gave answers of less than 1, using decimal points (e.g., .05, .30). While respondents were instructed to give their answer as “chances out of 100,” they were also told a decimal could be included (see Online Appendix A, Table A1). It is possible that some responses were misinterpretations of the instructions; for instance, a response of .4 may have been intended to denote 40% and not .4%. Nevertheless, the majority of the respondents who used decimals also answered with verbal categories of “very unlikely” and “unlikely” (10 or 62.5%), which would seem to indicate substantive responses. Two respondents who reported “neither likely nor unlikely” and four respondents who reported “likely” gave responses of less than 1 using decimal points. We may be most inclined to consider these responses as errors. Nevertheless, neither dropping these cases nor recoding them to the values we can only speculate the respondents intended substantively alters the results in Table 3.
There are three potential explanations for this pattern of findings, which include a seeming overall correspondence between numeric and verbal measures, but substantial variation in numeric estimates within verbal categories, and 13% of the sample violating valence. First, some respondents may be inconsistent in their verbal responses but consistent numerically (Manski, 2004). Here, the numeric responses would represent truthful and precise estimates of the respondents’ anticipated apprehension risk of speeding, while the verbal responses would demonstrate vagueness. Second, respondents may be inconsistent in their numeric responses—owing perhaps to unfamiliarity with numbers or difficulty translating intuitions into a mathematical expression of probability—but may be consistent verbally. This would suggest that verbal estimates are more accurate measures of likelihood and that numeric estimates lend false precision, constituting mainly measurement error. The third potential explanation is that some or all respondents may be doubly inconsistent. They may have difficulty forming abstract numeric estimates, and they may also vary considerably in their verbal descriptions of the same level of apprehension risk. In order to adjudicate between these explanations, Study 2 uses a separate sample to examine the degree to which respondents’ numeric assessments are internally coherent.
Study 2
Study 2 assesses the mathematical properties of numeric estimates of apprehension risk (n = 687). We test this by presenting the respondents with a speeding vignette (see Online Appendix A, Table A2) and then asking them about the numeric likelihood of being apprehended while speeding. All respondents received the same vignette, but we varied the denominator of the question, such that respondents were randomly assigned to one of the five possible versions. The respondents were asked to answer the questions as (1) “percent chance,” (2) “chances out of 10,” (3) “chances out of 20,” (4) “chances out of 100,” or (5) “chances out of 1,000.” Respondents were also specifically instructed that they had the ability to use decimals, so respondents who received the “chances out of 10” and “chances out of 20” versions had the ability to be as precise as respondents in the other groups.
Three of the denominator choices (“percent chance,” “chances out of 10,” and “chances out of 100”) were adapted from the extant literature. Several of the studies featured in Table 1 use “percent chance” or “chances out of 100” (e.g., Kamerdze et al., 2014; Loughran et al., 2013; Pogarsky et al., 2017; Schulz, 2014). Likewise, although phrased somewhat differently, “chances out of 10” is similar to the 11-category item featured in the Pathways to Desistance study, which has had an outsized influence on the recent offender decision-making literature (e.g., Anwar & Loughran, 2011; Loughran et al., 2012, 2016; Wilson et al., 2017). The other denominators (“out of 20” and “out of 1,000”) are uncommon in our field but have been used elsewhere (e.g., Loomes, 1998; Slovic & Monahan, 1995; Slovic et al., 2000). Lastly, we selected all of the denominators because they were mathematically transferable or generalizable to one another.
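Because the five question versions differ only in their denominator, answers can be placed on a common 0–100 scale before groups are compared. The sketch below illustrates that rescaling under the assumption that each answer is a count of "chances" no larger than the stated denominator; the dictionary keys and the function name are hypothetical labels, not identifiers from the study materials.

```python
# Hypothetical rescaling helper; labels and names are illustrative assumptions.

# Denominator implied by each of the five question versions in Study 2.
DENOMINATORS = {
    "percent chance": 100,
    "chances out of 10": 10,
    "chances out of 20": 20,
    "chances out of 100": 100,
    "chances out of 1,000": 1000,
}


def to_percent(answer: float, version: str) -> float:
    """Convert a raw answer under a given question version to a 0-100 percent scale."""
    denominator = DENOMINATORS[version]
    if not 0 <= answer <= denominator:
        raise ValueError(f"answer {answer} is outside 0..{denominator}")
    return 100.0 * answer / denominator
```

Under this conversion, an answer of 1 under "chances out of 10," 2 under "chances out of 20," and 100 under "chances out of 1,000" all map to the same 10%, which is the sense in which the denominators are interchangeable on paper.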
If these numeric estimates of arrest risk have a precise mathematical meaning, as numbers typically do, then there should be no more than chance differences in perceived risk across the five groups. Results for this experiment are presented in Table 4.
Table 4. Regression Predicting Denominator Effects on Numeric Probabilities in Study 2.
Note. n = 687. b = unstandardized coefficient; p = probability value; DV = dependent variable.
*p < .05. **p < .01. ***p < .001 (two-tailed).
Results from Study 2 suggest that respondents’ numeric responses of subjective risk are indeed sensitive to the denominator. Using the “out of 100” group as the reference category, Table 4 shows that respondents in lower denominator groups (e.g., “out of 10” and “out of 20”) were significantly more likely to report higher values. Respondents in the higher denominator group, “out of 1,000,” were more likely to report lower values, although this difference is slightly above the p < .05 standard for null hypothesis significance testing (p = .053). Surprisingly, the “percent chance” group exhibited the largest difference from the “out of 100” reference group, with respondents in the “percent chance” group being significantly more likely to report higher values. The effect size is large—13 percentage points.
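To make the reference-category logic concrete: in a linear regression whose only predictors are group indicators, the intercept equals the mean of the reference group, and each coefficient equals that group's mean minus the reference mean. The sketch below shows this equivalence, assuming a model with no covariates and answers already rescaled to 0–100. The function name is hypothetical, the toy values are invented purely for illustration and are not the study's data, and no standard errors or p values are computed.

```python
from statistics import mean

# Hypothetical helper; assumes a dummy-variable regression with no covariates.


def dummy_coefficients(groups: dict, reference: str = "chances out of 100") -> dict:
    """Return the intercept and group coefficients implied by dummy coding.

    groups    -- mapping from group label to a list of responses on a 0-100 scale
    reference -- label of the omitted (reference) category
    """
    reference_mean = mean(groups[reference])
    coefficients = {"intercept": reference_mean}
    for label, values in groups.items():
        if label != reference:
            # Each coefficient is the difference from the reference-group mean.
            coefficients[label] = mean(values) - reference_mean
    return coefficients


# Invented toy values for illustration only; these are NOT the study's data.
toy = {
    "chances out of 100": [20.0, 30.0, 40.0],
    "chances out of 10": [30.0, 40.0, 50.0],
    "chances out of 1,000": [10.0, 20.0, 30.0],
}
result = dummy_coefficients(toy)
```

With these toy values the intercept is 30, the "chances out of 10" coefficient is 10, and the "chances out of 1,000" coefficient is -10, mirroring the direction (not the magnitude) of the pattern described above.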
These findings imply that numerical responses are perhaps “pseudo-mathematical”: respondents are answering using numbers, but their responses lack the precise mathematical meaning of real numbers. In math, 10% = 10/100 = 100/1,000. In surveys, these responses mean different things. This insight helps explain Thomas and colleagues’ (2018) finding that framing and anchoring can bias numerical subjective probabilities (though respondents’ rank ordering of arrest risk by crime type is maintained). Together, these findings pose concerns for Bayesian updating models of deterrence (Anwar & Loughran, 2011), and the expected utility paradigm of rational choice research more broadly (Becker, 1968), because these models assume that individuals rigidly adhere to a set of mathematical axioms: completeness, transitivity, continuity, and independence (von Neumann & Morgenstern, 1944).
Study 3
As we have seen, there are apparent inconsistencies in respondents’ use of both verbal and numeric scales. Individuals attach different verbal labels to similar numeric probabilities and vice versa (see Study 1). When people assign numeric probabilities, they seem to lack a precise mathematical interpretation (see Study 2). Study 3 assesses how respondents actually prefer to think about likelihood when making decisions. While some scholars have argued that respondents’ preferences are irrelevant as long as they are willing and able to use the scale provided (Manski, 2004), we contend that understanding how they think about risk is important for designing elicitation procedures and analyzing decision making (Windschitl & Wells, 1996). There is substantial evidence that survey questions asking respondents to use language or metrics that they do not normally use, or that impose higher cognitive burden, affect metacognitive experiences and reduce data quality (Sweitzer & Shulman, 2018).
From a purely measurement standpoint, presenting information in ways the respondents prefer can reduce cognitive burden and thus decrease the likelihood of item nonresponse and breakoffs in self-report surveys (Dillman et al., 2014; Holbrook et al., 2000; Sweitzer & Shulman, 2018). The vast majority of perceptual deterrence research has relied on data from such self-report surveys and will likely continue to do so in the future. Moreover, from a substantive standpoint, forcing respondents to think about risk using a different mode than they prefer may substantively change their decision-making process, perhaps making it less like real-world decisions. For instance, a respondent asked to think about risk in broader verbal terms may make more intuitive decisions than they would in a real situation, while a respondent asked to frame their response in a numeric format might engage in more deliberative, analytical decision making than they otherwise would (“dual-process decision making”; see van Gelder & de Vries, 2014). To our knowledge, ours is the first study to question individuals on whether they think in verbal or numeric terms when considering risk, specifically sanction risk.
In Study 3, we first asked all respondents (n = 597) the same general prompt—to think about the risk of drunk driving in a city. We then presented them with sentences that had one or more blank spaces (e.g., “If someone drove drunk in this city, the ______ that he or she would get arrested is _____”) and asked how they would fill in those blanks if they were thinking about the risk of apprehension. In order to avoid priming respondents, within each group, the ordering of the response options was randomized where applicable. However, because any single “direct ask” question may be biasing, respondents were randomly assigned to one of the four different sets of response options in order to ensure that our findings were robust to the elicitation procedure (see Online Appendix A, Table A3). We designate these as Groups 1–4.
Group 1 (n = 147) was provided specific examples of numbers versus words as response options. Two response options provided specific terms associated with numeric probability (e.g., “percent chance”) and specific numbers. Two other response options provided terms associated with verbal risk (e.g., “likelihood”) and specific words. Finally, respondents were given an option to fill in their own answer. The goal here was to ask respondents which mode they preferred using, while including the terminology most commonly used in criminological research to measure apprehension risk—“likelihood” and “percent chance.”
It is still possible that we were unduly prejudicing Group 1’s respondents by using specific terminology because some terms might seem more scientific (i.e., “percent chance”) than others, and thus, respondents might select them in order to appear sophisticated. On the other hand, perhaps not providing specific examples would bias respondents. To overcome both concerns, respondents in Group 2 (n = 161) were provided general terms of numbers and words for response options. Specifically, respondents in this group were given a question with three response options: “There would be a number (e.g., ‘a 33 percent chance,’ or ‘a .33 probability’),” “There would just be words (e.g., ‘a low likelihood’ or ‘little chance’),” or a final response category (“other”), where they could fill in their own answer.
In Group 3 (n = 156), we provided respondents with the same general prompt of thinking about drunk driving but asked them a fully open-ended question (i.e., “If you had a thought like this, what would normally be in the blank spaces?”). We coded all open-ended responses for Group 3 into “numbers,” “words,” or “other.” For example, respondents who provided answers such as “likelihood; 1 in 3,” “chance; 50%,” and “probability; 100%” were coded as preferring “numbers.” On the other hand, respondents who answered with responses like “chance; a roll of the dice,” “chances; slim,” and “chances; high” were coded as preferring “words.” Finally, some respondents answered the prompt in ways that did not pertain to risk and likelihood in either numeric or word format (e.g., “place; going home,” “fact; justifiable,” and “thought; comforting”). These responses were coded as “other.”
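The three-way coding of open-ended answers can be approximated by a simple rule, shown here only to make the categories concrete. This is a keyword heuristic substituted for human coding, not the coding procedure used in the study: the list of risk terms, the "term; value" input format, and the digit test are all assumptions. It would, for example, miscode a spelled-out answer such as "one in three" as words.

```python
import re

# Hypothetical keyword heuristic; NOT the authors' coding scheme.
# Assumed set of first-blank terms that indicate a statement about risk.
RISK_TERMS = {"chance", "chances", "likelihood", "probability", "odds", "risk"}


def code_response(answer: str) -> str:
    """Code a 'first blank; second blank' answer as 'numbers', 'words', or 'other'."""
    parts = [part.strip().lower() for part in answer.split(";")]
    if len(parts) != 2:
        return "other"
    term, value = parts
    if term not in RISK_TERMS or not value:
        return "other"  # the answer does not describe risk or likelihood
    if re.search(r"\d", value):
        return "numbers"  # any digit in the second blank counts as numeric
    return "words"
```

Applied to the examples quoted above, this rule codes “chance; 50%” as numbers, “chances; slim” as words, and “place; going home” as other.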
Finally, respondents in Group 4 (n = 133) were provided the general prompt and were also given a completely open-ended question. However, respondents in Group 4 were also primed with specific numeric information about the risk of apprehension for drunk driving in the city. Our aim here was to see whether priming respondents with specific numeric information influenced their preference for thinking in numbers versus words. Results for all groups are provided in Figure 1.

Figure 1. Language preference across four vignettes in Study 3 (n = 597). Note. See Online Appendix A, Table A3 for exact question wording for each category.
In Group 1, where the response options were specifically aimed at common exemplars of verbal and numeric terms, 70% of respondents said they prefer to think about risk using verbal terms. A similar preference for thinking in words (64%) emerged in Group 2, where more generalized options were provided (e.g., “there would just be words [e.g., ‘a low likelihood’ or ‘little chance’]”). In Group 3, when an open-ended question was used, this preference persisted, with 65% of respondents reporting they would prefer to think in verbal terms. Only in Group 4, when explicit numeric information was presented in the form of a base rate of apprehension risk of drunk driving in the city (i.e., 400 of 1,200 reported drunk drivers were arrested), did the majority of respondents indicate they would prefer to think in numeric terms. Yet even when primed with numeric data about the risk of apprehension, more than one third of respondents (36%) stated that they would still prefer to think in words, not numbers.
Discussion
Taken together, our results suggest that individuals prefer to think about risk in verbal terms but also differ in their interpretation of verbal labels. When asked to respond numerically, respondents appear to map their assessments of likelihood, felt and described diffusely and verbally (Kahneman, 2011), onto whatever numeric scale is provided, producing answers that appear to be numbers but that may lack any precise mathematical meaning. A 10% chance reported in a survey does not necessarily equal 1/10, or 2/20, or 10/100, or 100/1,000. This conclusion is supported by other research, which shows that respondents react differently to, and are more emotionally affected by, information presented in frequency (e.g., 10/100) versus probability (10%) format, even though the two formats are mathematically equivalent (Kahneman, 2011).
These findings are notable because MTurk respondents may be especially familiar with numeric scales and may be more educated or more numerate than the general population (Chandler et al., 2014; Levay et al., 2016; Weinberg et al., 2014). While the use of a convenience sample is a limitation of our study that future research should address, the likely numeracy of this sample also suggests that our results underestimate the proportion of individuals in the population who think about probability in verbal terms.
Our findings imply that numerical scales, whatever their advantages, are probably less natural for respondents than verbal scales and probably differ more from how they actually think about risk when making decisions. Still, verbal scales present their own difficulties. Because probability is a complex concept, individual respondents may interpret verbal response options in systematically different ways (Brady, 1985), a problem known as “differential item functioning” (DIF; King et al., 2004, p. 191; see also Ward et al., 2017). DIF often occurs in questions on subjects that are significant, yet abstract, and which are commonly explained with reference to examples (e.g., political efficacy). A key issue then is how scholars may increase the comparability of verbal responses to subjective probability questions.
An increasingly prevalent method to account for DIF is the use of anchoring vignettes (Hopkins & King, 2010; King et al., 2004; King & Wand, 2007). Applying the anchoring vignette method to perceived arrest risk would proceed in the following steps. First, before measuring perceived arrest risk, respondents would be presented with three anchoring vignettes about a crime that include numerical information about the objective risk of arrest (see Online Appendix B, Table B1). The respondents would map verbal assessments of probability onto the numeric information provided in each vignette. Then, respondents would be asked how likely or unlikely (1 = very unlikely; 5 = very likely) it is that they would be caught if they committed different crimes (e.g., “stole something like a video game from a store”; “drove drunk”). Normally, the raw scores for these answers would be combined into a scale and used in any analysis (e.g., to predict offending). The problem is that this scale could be subject to DIF across respondents. Accordingly, for each respondent, a new “vignette-corrected” version of their perceived risk would be created by recoding the original response relative to that respondent’s answers to the three anchoring vignettes. Respondents are thus given a new 7-point score (see Online Appendix, Figure B1). The resulting variable is theoretically DIF-free (King et al., 2004). Future research should consider employing anchoring vignettes to account for potential DIF issues in verbal measures of subjective probabilities.
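The recoding step described above can be sketched in Python. The article does not spell out the exact recoding rule, so this sketch assumes a simple ordinal recode in the spirit of the nonparametric approach in King et al. (2004): a respondent's self-rating is located below, at, between, or above that respondent's own ratings of the three vignettes, which yields seven possible positions. The function name `vignette_corrected_score` is hypothetical, and the sketch deliberately does not handle respondents who rate the vignettes as tied or out of order.

```python
from typing import Optional, Sequence


def vignette_corrected_score(
    self_rating: int, vignette_ratings: Sequence[int]
) -> Optional[int]:
    """Recode a self-rating relative to the respondent's own vignette ratings.

    vignette_ratings must be listed from the lowest-risk vignette to the
    highest-risk vignette. With J vignettes the result ranges from 1 to
    2J + 1 (7 points for three vignettes). Returns None when the vignette
    ratings are tied or out of order (an assumption of this sketch).
    """
    pairs = zip(vignette_ratings, vignette_ratings[1:])
    if any(lower >= higher for lower, higher in pairs):
        return None

    score = 1
    for rating in vignette_ratings:
        if self_rating < rating:
            return score          # strictly below this vignette
        if self_rating == rating:
            return score + 1      # equal to this vignette
        score += 2                # above this vignette; move up two steps
    return score                  # above every vignette


# Two hypothetical respondents give the same raw answer (3 on the 1-5 scale)
# but use the verbal labels differently when rating the vignettes.
respondent_a = vignette_corrected_score(3, (2, 3, 4))
respondent_b = vignette_corrected_score(3, (3, 4, 5))
```

In this illustration the two identical raw answers receive different corrected scores, because each answer is interpreted against how that respondent used the verbal scale on the vignettes.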
Limitations
Our experiments have several limitations. First, while our results provide evidence of a preference for thinking about apprehension risk verbally instead of numerically, we cannot claim that these preferences are indicative of stable, binary patterns or styles of thinking. Indeed, the results of Group 4 in Study 3, where we primed respondents to think in numeric terms about apprehension risk, suggest that respondents can be flexible in their use of either verbal or numeric thought processes. While literature on numeracy suggests that there may be somewhat stable differences in individuals’ capacity to engage with numeric information, this is likely a continuous attribute, not a binary one (Nelson et al., 2013; Peters et al., 2007), and pronouncing definitively on this subject is far beyond the current investigation.
Second, Study 3 relied on classifying some terms as distinctively numeric (e.g., “percent chance”) or verbal (e.g., “likelihood”). Such classifications may be overly determinative or inaccurate. We attempted to account for this by providing several varieties of prompts (see Groups 1–3), but nevertheless, all of our modes of elicitation could be similarly compromised. Thus, future research is needed that explores alternative methods for studying whether respondents think in verbal or numerical terms about risk. Third, and relatedly, Study 3’s modes of elicitation are novel. While this methodology has some parallels to prior literature (Fagerlin et al., 2007; Zikmund-Fisher et al., 2007), it is of our own devising. Future research should therefore assess the validity of the method.
Conclusions
Researchers should be attentive to the differences in the measurement of subjective probability, given that Studies 1 and 2 imply that neither verbal nor numeric answers are optimal for eliciting subjective assessments of apprehension risk. Providing stronger explanations for measurement choices should be a priority in future work (see Cullen et al., 2019). Considering the results of Study 3, it may be wise to use verbal measures in situations where numerical comparability (e.g., to objective probabilities derived from clearance rates) is not the goal (i.e., correlation studies) and to avoid calibration studies that attempt such a direct comparison until better elicitation methods are devised (e.g., Erickson & Gibbs, 1978; Kleck et al., 2005; Quillian & Pager, 2010). Our results suggest that subjective numeric responses cannot be taken at face value for a calibration study (that is, one cannot compare 50% reported subjectively by respondents to a 50% objective rate of apprehension) because these subjective numeric estimates are pseudo-mathematical and depend on the denominator provided. If such calibration studies are to be attempted, a much more sophisticated underlying mathematical model should be employed in conjunction with them (e.g., Ferrell & McGoey, 1980).
Supplemental Material
Supplemental Material, sj-pdf-1-cjr-10.1177_0734016820978827 for On the Measurement of Subjective Apprehension Risk by Sean Patrick Roche, Justin T. Pickett, Jonathan Intravia and Andrew J. Thompson in Criminal Justice Review
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by funding from the University at Albany Faculty Research Awards Program (FRAP)—Category B and from the Hindelang Criminal Justice Research Center at the University at Albany, State University of New York.