Abstract
In this study, we examined the extent to which computerized linguistic analysis of natural language data from chat transcripts of Internet child sex stings predicted recidivism among 334 convicted offenders. Using the Linguistic Inquiry and Word Count (LIWC) program, we found that reoffenders (including simultaneous and previous offenders) differed significantly from nonreoffenders in measures of clout (a composite measure of social dominance) and percentage of words used in the following linguistic categories: cognitive processes, personal pronoun use, insight, time, and ingestion. In contrast, total word count and percentage of sexual words, two categories that might be assumed to be predictive of recidivism, were not significantly different between these two groups. These analyses help to develop a typology for an Internet sex reoffender as one who is dominant, nonequivocating, and likely to discuss meeting with their target and/or parents' schedules. Moreover, they highlight the importance of examining the functional aspects of language in forensic linguistic analysis, and exemplify the utility of computerized linguistic analyses in the courtroom.
Introduction
A prime example wherein content analyses may be of great benefit is in the realm of online child solicitation sex stings. Recently, two separate studies have shown that the psychological hallmarks of grooming strategies (i.e., the luring language used by sexual predators 5 ) are evident in chat transcripts from online child solicitation sting cases,6,7 and that some dimensions of offenders' language were associated with length of jail time assigned during sentencing. 7 In this study, we build upon the content analyses of previous work to examine whether offenders' language patterns from online chat transcripts are predictive of recidivism among convicted Internet sex sting offenders.
Online child sex sting offenders
As the Internet gained popularity in American households, the general public and law enforcement agencies (LEA) began to recognize potential dangers lurking for children online. Consequently, the Department of Justice, in conjunction with the National Center for Missing and Exploited Children (NCMEC), has been investigating online child victimization for almost two decades.8–10 Although the number of children reporting unwanted online sexual solicitation has dropped significantly from 2000 to 2010 (from 19 to 9 percent), concerns remain that Internet and computer-mediated communication technologies may still facilitate child victimization.9,11 In answer to such concerns, law enforcement agents across the United States regularly conduct online sex stings to apprehend potential child predators before they are able to commit a contact offense with an actual child.12–15
With regard to recidivism, numerous studies conducted in various contexts have consistently found that Internet sex offenders have a much lower rate of recidivism than contact offenders, 16 and that Internet sex offenders without a history of previous contact offense(s) have a lower rate of recidivism than those with such a history. 17 For example, in a comprehensive meta-analysis of more than 2,500 Internet sex offenders, only 4.6 percent were found to have had a sexual reoffense, and only 2 percent had a new contact offense. 16 Thus, research suggests that Internet-only sexual offenders may comprise a distinct category of offenders who are at a relatively low risk of recidivism.
Importantly, however, most of the offenders in previous studies had been convicted of child pornography possession or distribution; offenders from Internet sex stings comprised only a small portion of those samples. Child solicitation offenders may be distinct from other Internet-only offenders in that, although they have not committed a contact offense, they are contact-driven (i.e., they have, in most circumstances, made a substantial step toward committing a contact offense by traveling to meet a minor for sexual activity) rather than fantasy-driven. 18 In light of this difference, there is a strong need for more specific studies with Internet sex sting offenders to determine whether there are any markers that predict recidivism in this subsample of online offenders. In other words, disambiguating those factors indicative of a propensity to reoffend is crucial from both a psychological perspective and a legal perspective.
Language as a predictor of recidivism among internet sex offenders
For decades, researchers have been examining the relationships between natural language data and individual personality and dyadic communication characteristics.19–25 Computerized text analysis software, such as the Linguistic Inquiry and Word Count (LIWC) program,26–28 has revolutionized this research by allowing researchers to compare natural data sets across contexts using meaningful, validated linguistic categories. Importantly, work with software such as LIWC has found that the very foundations of an individual's psychology, such as emotions, motivations, and attentional processes, are embedded (and quantifiable) within the words people use to communicate. Simply put, much research has been conducted to establish and develop language-based psychological analyses into a reliable and valid mode of assessing psychological and social processes.
As digital evidence has become more common in the courtroom, language analysis has gained attention from the legal community as a useful forensic tool.4,29 With regard to Internet sex stings specifically, these analyses may be particularly beneficial because the prosecution of these cases often includes chat transcripts containing extensive natural language data (i.e., interactions between offenders and undercover stings), with word counts numbering in the thousands to tens of thousands. Although it is common for these chat transcripts to be admitted as forensic evidence, in the absence of testimony from a forensic linguist to contextualize the language used in the chats, the trier of fact must rely upon their own, likely naïve, understanding of language and grooming strategies to make conviction and sentencing decisions.
Recent computerized linguistic analyses from these chat transcripts have revealed important trends that could prove helpful to the trier of fact.6,7 Black et al. 6 were the first to map these natural language features onto sexual predation indicators, analyzing grooming strategies within transcripts of 44 convicted online sex sting offenders. They found grooming strategies were prevalent throughout their transcripts; however, their frequency and order of appearance did not conform with O'Connell's 30 proposed stage model of online grooming. O'Connell 30 suggested that relationships would be established (first as friendships and then as more exclusive relationships) before offenders engaged in risk assessment and sexual talk (including plans to meet), whereas sex sting offenders engaged in these talks early and throughout the transcripts. 6
In a more recent study that used the same offender chat database, Drouin et al. 7 found that multiple psychological dimensions of offenders' language (i.e., total word count, sexual words used, and clout) were associated with length of jail time adjudged during sentencing (i.e., offenders with lower rates of word usage in these categories than their undercover agents received less jail time). This suggests that the trier of fact may already be using some linguistic features of these transcripts in their decision making. However, currently, there is no empirical evidence that these language markers are related to increased risk of past, concurrent, or subsequent offenses, which calls into question the validity of relying upon these cues in language for the purpose of forming and rendering sentencing decisions.
Current study
In this study, we sought to extend previous lines of inquiry to determine whether there were natural language patterns that predicted recidivism among online child sex offenders. In line with Drouin et al., 7 we predicted that those who reoffended would be more overtly predatory, using more overall words, more sexual words, and displaying higher clout than nonreoffenders. Additionally, we focused on more nuanced, functional aspects of language. (See Table 1 for the linguistic categories we included and representative text samples from the chat transcripts.) As shown in Table 1, cognitive processes and insight are categories that contain words like “think,” “know,” and “ought.” We predicted that reoffenders would use fewer words than nonreoffenders in these categories because they would be goal-oriented, less reflective, and nonequivocating, having already determined their desired course of action and expressing little doubt.31,32 Additionally, we predicted that nonreoffenders would use higher rates of first-person singular pronouns; previous work has repeatedly found that greater use of this class of pronouns is indicative of lower levels of social status and dominance, and increased self-reflection.21,33–35 Finally, a manual inspection of more than 200 Perverted Justice Foundation Inc. (PJFI) transcripts showed that many offenders used time words to arrange meetings, verify parents' schedules (corresponding to both the sexual and risk assessment categories of online grooming6,30), and otherwise coordinate behaviors toward end goals of contact. Additionally, many undercover agents engaged in talk about food as an engagement tactic when offenders were not forthcoming or dominant with regard to conversational topic. Therefore, we predicted that reoffenders would use higher rates of words in the time category and lower rates of words in the ingestion category.
Note: “//” indicates a line break and/or conversational turn.
LIWC, Linguistic Inquiry and Word Count.
Methods
Data collection and preparation
Chat transcripts
PJFIa is a nonprofit organization focused on the apprehension of sexual predators using online sting operation tactics. The PJFI liaises with local LEA in conducting operations that primarily take place in regional online chatrooms. The PJFI makes transcripts of all chat logs resulting in conviction publicly availableb; this archive served as the primary source of natural language data. As part of a larger study, 7 data were collected from the PJFI archives in January 2016. Collected data included complete transcripts of all interactions between stings and offendersc and various metadata, such as geographical data and sting demographics (i.e., fictional sting gender, fictional sting age). A total of 590 full transcript collections and associated metadata were collected. For the current study, we included only those individuals who had chatted online with fictional female stings (n = 538)d and were registered as sex offenders in their respective state, resulting in a final N of 334 transcripts. All offenders in this final sample were male (M age = 33.36, SD = 10.14).
Recidivism data
Offenders were searched for by name via the U.S. Department of Justice National Sex Offender Public Website (NSOPW) online portal.e In cases where multiple individuals with the same name were found, the registry entry was matched with the conviction information, demographic characteristics (e.g., age and city of residence), and offender photo (if available) to ensure that the correct person was identified. After matching the initial sex sting conviction information to the conviction record, the records were inspected for any prior, concurrent, or subsequent offenses. All offenses were recorded; however, only offenses related to minors (e.g., child pornography, child solicitation, use of computer to harm a minor), sexual offenses (e.g., sexual assault of elderly or disabled person), or unspecified subsequent felonies were counted as recidivism. We included both contact and noncontact offenses because either would be a significant violation of probation and/or other restrictions placed upon sex offenders, and because doing so allowed us to cast the widest net possible for the identification of potential reoffenders. Additionally, concurrent or prior offenses reported by PJFI but not included in the sex registries were also recorded. Of the offenders located in the registry, 291 (87 percent) had no prior, concurrent, or subsequent offenses (nonreoffenders), and 43 (13 percent) were categorized as reoffenders. Among reoffenders, 12 (3.6 percent of the total sample) had reoffended after the sex sting conviction (usually child pornography or solicitation offenses; only 2 were contact offenses), 18 (5.4 percent of the total sample) had prior offenses, and 9 (2.7 percent of the total sample) had simultaneous offenses. Additionally, 4 (1.2 percent of the total sample) had multiple offenses (e.g., prior and simultaneous).
Language preparation/analysis
Following data collection, all transcripts were preprocessed in multiple stages. Within each transcript, natural language data were separated by speaker, isolating the words written by offenders from the fictional sting characters. To ensure between-individual consistency, spelling standardization procedures were used to correct common misspellings (e.g., “teh” instead of “the”), netspeak (e.g., “ty” for “thank you”), and other common idiosyncrasies (e.g., elongation of “oh” to “ohhh”). Texts were manually inspected to ensure the general accuracy of the spelling standardization process.
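The preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the replacement table, the lowercasing step, the elongation rule, and the (speaker, text) turn format are all assumptions made for illustration, using only the examples given in the text.

```python
import re

# Hypothetical lookup table built from the examples in the text; the study's
# actual standardization list is not reported.
REPLACEMENTS = {"teh": "the", "ty": "thank you"}

# Collapse any letter repeated three or more times to a single letter
# (e.g., "ohhh" -> "oh"). This rule is an assumption, not the authors' rule.
ELONGATION = re.compile(r"([a-z])\1{2,}")


def standardize(text):
    """Lowercase, collapse elongations, and expand known misspellings/netspeak."""
    tokens = []
    for token in text.lower().split():
        token = ELONGATION.sub(r"\1", token)
        tokens.append(REPLACEMENTS.get(token, token))
    return " ".join(tokens)


def offender_text(turns, offender):
    """Keep only the offender's turns from a list of (speaker, text) pairs."""
    return " ".join(standardize(text) for speaker, text in turns if speaker == offender)
```

A sketch like this handles only whitespace-separated tokens, which is one reason a manual accuracy check of the kind the authors describe would still be needed.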
Offender language samples were subsequently analyzed using the LIWC2015 software. 26
The LIWC2015 software quantifies natural language data into objective measures of psychological processes. Using LIWC2015, language is quantified using a dictionary-based word counting approach, wherein a body of text is quantified along ∼80 psychological dimensions as a function of word use. For example, if 1 out of every 10 words in a language sample belongs to the “positive affect” LIWC category, the text will receive a score of 10 percent for positive emotionality; this approach has been extensively validated across hundreds of studies (see Tausczik and Pennebaker 25 ). LIWC2015 also provides a handful of “summary” measures, such as the “clout” score, that represent population-normed psychological measures based on the word-counting method just described. These LIWC-based measures can be used in traditional statistical models in the same way as other types of quantified psychological measures—our statistical analyses are reported in the following section. For the current analyses, all texts were of adequate length for inclusion; descriptive statistics for the measures under consideration for the current study are presented in Table 2.
Note: With the exception of the “Word Count” and “Clout” categories, which are computed as a sum and internally normalized by the LIWC2015 software (Pennebaker et al. 26 ), respectively, all categories reflect the percent of words in a given text that are indicative of each psychological process.
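To make the dictionary-based word counting approach concrete, the following Python sketch computes the percentage of words in a text that fall into each category. The category word lists here are toy, hypothetical examples (the LIWC2015 dictionary itself is proprietary and far larger), and the tokenizer is a simplifying assumption. Summary measures such as clout are computed internally by LIWC2015 and are not reproduced here.

```python
import re

# Toy, hypothetical word lists for illustration only. A trailing "*" marks a
# stem that matches any word beginning with those letters.
CATEGORIES = {
    "time": {"when", "tomorrow", "tonight", "hour*"},
    "ingest": {"eat*", "dinner", "pizza"},
}


def matches(word, entry):
    """Return True if a word matches a dictionary entry (exact or stem)."""
    if entry.endswith("*"):
        return word.startswith(entry[:-1])
    return word == entry


def category_percentages(text):
    """Percent of all words in the text that belong to each category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words)
    scores = {}
    for name, entries in CATEGORIES.items():
        hits = sum(any(matches(w, e) for e in entries) for w in words)
        scores[name] = 100.0 * hits / total if total else 0.0
    return scores
```

For a 10-word sample containing two time words and one ingestion word, this yields 20 percent for time and 10 percent for ingestion, mirroring the "1 out of every 10 words" example in the text.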
Results
Results from all analytic procedures are presented in Table 3. For all language-based psychological measures included in the current study, we performed independent-samples t tests to investigate differences between reoffenders and nonreoffenders. Both groups met assumptions for homogeneity of variance for all models (Levene's test ps ≥ 0.15), with the exception of the ingestion category. As such, standard t tests were performed for all but this category, wherein we performed a t test accounting for the violated equality of variances assumption. In the current analysis, results for all measures showed statistically significant differences between reoffenders and nonreoffenders with the exception of the “sexual” and word count measures. Reoffenders were more likely to exhibit higher clout in their language in addition to a greater use of time words; nonreoffenders tended to use words from the cognitive processes, insight, ingestion, and first person singular pronoun categories at higher rates.
Notes: Results from independent-samples t tests and Δ recidivism probabilities from a binomial logistic regression model. For ease of interpretation, mean scores for each group are bolded to indicate significantly higher use of language from a given language-based measure of psychological processes.
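The two kinds of group comparison described above can be sketched in Python as follows. This is an illustration only, assuming the unequal-variance test was a Welch-type t test (the text does not name the correction used). It computes the test statistic and degrees of freedom but not p values, which in practice would come from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Standard independent-samples t test (assumes equal variances)."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2


def welch_t(a, b):
    """t test that does not assume equal variances (Welch-Satterthwaite df)."""
    na, nb = len(a), len(b)
    va, vb = variance(a) / na, variance(b) / nb
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df
```

With equal group sizes the two statistics coincide and only the degrees of freedom differ; with groups as unbalanced as 291 and 43, the choice matters more.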
For psychological measures that demonstrated significant between-group differences, we performed follow-up binomial logistic regressions, which were used to model the probability of reoffense as a function of psychological processes measured via language.f Log odds ratios were converted to probability scores for ease of interpretation. Estimates of changes in recidivism probabilities were calculated for an increase of 1 standard deviation (SD) for each measure. 36 For example, an increase of 1 SD in an offender's use of words from the “time” language category (i.e., an increase of 1.13; Table 2) corresponded to a 59.57 percent increase in their probability of reoffending.
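The text does not spell out how log odds were converted to changes in probability, so the following Python sketch shows one plausible reading rather than the authors' procedure: compare the model-predicted probability at the mean of a measure with the predicted probability 1 SD above the mean. The function names and any coefficients passed to them are hypothetical.

```python
from math import exp


def predicted_probability(intercept, slope, x):
    """Logistic model: convert log odds (intercept + slope * x) to a probability."""
    return 1.0 / (1.0 + exp(-(intercept + slope * x)))


def probability_change_per_sd(intercept, slope, mean_x, sd_x):
    """Predicted probability at the mean and at mean + 1 SD, plus the
    relative change between them expressed as a percentage."""
    p0 = predicted_probability(intercept, slope, mean_x)
    p1 = predicted_probability(intercept, slope, mean_x + sd_x)
    return p0, p1, 100.0 * (p1 - p0) / p0
```

Under this reading, a reported percentage increase is relative to the baseline probability at the mean, so a large relative increase can still correspond to a modest absolute probability when the base rate is low.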
Discussion
The Department of Justice has long suggested that content analyses of digital evidence may be valuable for many types of criminal cases where online communication takes place.2,4 At the same time, psychologists have been conducting computerized natural language analyses, showing that language patterns exhibit trait-like psychometric properties.24,37,38 However, although forensic linguistic researchers have used linguistic analyses in the courtroom for years,39,40 only a handful of known studies have used computerized linguistic programs for these analyses.6,7,19,35 Furthermore, little work has been done with explicitly psychological language analyses, which can provide not only objective measures for use in statistical modeling but also interpretable metrics that facilitate valuable psychological insights. 41 Accordingly, forensic computerized linguistic analyses are just beginning to gain attention from the legal community as potential sources of evidentiary support.
Our study extended previous work in this area by examining whether patterns in natural language data are predictive of recidivism within a large sample of Internet sex sting offenders. Previous work has shown that the word count, sexual words, and clout are associated with jail time in sentencing decisions. 7 Of those measures, however, only the clout category was predictive of recidivism in our current analysis; offenders scoring 1 SD above the mean were 60 percent more likely to recidivate. This is a notable finding, as it shows that the most visible, intuitive language categories that one might assume to be linked to reoffending (i.e., word count and sexual words) are not predictive of recidivism. In contrast, clout, along with some of the other more functional language dimensions (e.g., use of personal pronouns, cognitive processes, and insight), were predictive of recidivism. This is also important to note because these latter language dimensions may not be immediately apparent to the human eye, and current stage models of online grooming 30 do not differentiate between typologies of online offenders in terms of least and most likely to reoffend. These findings, coupled with the higher rates of time category words and lower rates of ingestion related words among reoffenders (as compared to nonreoffenders), allow us to develop a relatively clear picture of Internet sex sting offenders who recidivate. Those Internet sex sting offenders most likely to reoffend are more predatory in their language; they dominate the chat conversations with their fictitious underage targets and show little equivocation (using less “I think I might” language, and more “We are going to” language). They are not often sidetracked by conversations of what they are eating for dinner but would rather engage in conversations about when they are going to meet up for sexual activity and when the target's parents will be away (i.e., sexual and risk assessment stages).
Limitations and conclusion
Our study has limitations that need mention. First, all transcripts were drawn from the Perverted Justice online archive. Although the cases were prosecuted in different jurisdictions, it could be that sting protocols that employ different conversational tactics (e.g., avoiding talk about food when there is a lull in conversation) may have different results with regard to the ingestion category, specifically. Additionally, we were able to locate only 334/538 (62 percent) of offenders in the sex offender registries; thus, our reported rate of recidivism may be lower or higher than the actual rate, as the missing records may not be missing at random. However, the rate of reoffense after the sting conviction reported here (3.6 percent) is comparable to that found in other studies17,18; therefore, we expect that our rates are generalizable.
Overall, our findings align somewhat with Black et al., 6 who found that sexual and risk assessment stages of online grooming occur early in chats, and with Drouin et al., 7 who suggested that clout is an important factor to consider in prosecution of these cases. However, they also extend these studies in an important way by showing how these linguistic categories could be used to estimate probability of recidivism. From a practical standpoint, these analyses could be used during sentencing to help direct the trier of fact away from linguistic distractors, like word count and sexual words, and toward the linguistic predictors of recidivism (i.e., use of “we,” often checking schedules in attempts to meet up). More importantly, from an empirical standpoint, this study provides a model for how computerized linguistic analysis can be used to estimate and understand recidivism in forensic settings.
Notes
a. Perverted Justice Foundation Incorporated.
b. The PJFI transcript archives (
c. According to the PJFI, the perpetrator initiated interactions in all cases (
d. Previous analyses (Drouin et al. 7 ) showed that the chat transcripts for male perpetrators with male stings were qualitatively and quantitatively different from those with male perpetrators and female stings.
e. The NSOPW links to individual jurisdictions; however, the Department of Justice “does not guarantee the accuracy, completeness, or timeliness of the information contained in Jurisdiction Websites.” Moderate rates (62 percent) of offender database matching may be attributable to incomplete Jurisdiction Websites combined with high numbers of incarcerated, deceased, or deported individuals.
f. Note that while we do not present significance tests from the binomial logistic regressions here, the results are parallel to those of the t tests.
Acknowledgments
This work was supported in part by grants from the National Science Foundation (IIS-1344257) and the John Templeton Foundation (#48503). The views, opinions, and findings contained in this report are those of the author(s) and should not be construed as a position, policy, or decision of the aforementioned agencies, unless so designated by other documents.
Author Disclosure Statement
No competing financial interests exist.
