Abstract
Successful social interactions are assumed to depend on theory of mind—the ability to represent others’ mental states—yet most studies of the relation between theory of mind and social-interactive success rely on non-interactive tasks that do not adequately capture the spontaneous engagement of theory of mind, a crucial component of everyday social interactions. We addressed this gap by establishing a novel observational rating scale to measure the spontaneous use of theory of mind (or lack thereof) within naturalistic conversations (conversational ToM; cToM). In 50 age- and gender-matched dyads of autistic and typically developing youth aged 8–16 years (three dyad types: autistic–typically developing, typically developing–typically developing, autistic–autistic), we assessed cToM during 5-min unstructured conversations. We found that ratings on the cToM Negative scale, reflecting theory-of-mind-related violations of neurotypical conversational norms, were negatively associated with two forms of non-interactive theory of mind: visual-affective and spontaneous. In contrast, the cToM Positive scale, reflecting explicit mental state language and perspective-taking, was not associated with non-interactive theory of mind. Furthermore, autistic youth were rated higher than typically developing youth on cToM Negative, but the groups were rated similarly on cToM Positive. Together, these findings provide insight into multiple aspects of theory of mind in conversation and reveal a nuanced picture of the relative strengths and difficulties among autistic youth.
Lay abstract
Conversation is a key part of everyday social interactions. Previous studies have suggested that conversational skills are related to theory of mind, the ability to think about other people’s mental states, such as beliefs, knowledge, and emotions. Both theory of mind and conversation are common areas of difficulty for autistic people, yet few studies have investigated how people, including autistic people, use theory of mind during conversation. We developed a new way of measuring theory of mind in conversation (conversational theory of mind, or cToM) using two rating scales: cToM Positive captures behaviors that show consideration of a conversation partner’s mental states, such as referring to their thoughts or feelings, whereas cToM Negative captures behaviors that show a lack of theory of mind through violations of neurotypical conversational norms, such as providing too much, too little, or irrelevant information. We measured cToM in 50 pairs of autistic and typically developing children (ages 8–16 years) during 5-min “getting to know you” conversations. Compared to typically developing children, autistic children displayed more frequent cToM Negative behaviors but very similar rates of cToM Positive behaviors. Across both groups, cToM Negative (but not Positive) ratings were related to difficulties in recognizing emotions from facial expressions and a lower tendency to talk about others’ mental states spontaneously (i.e., without being instructed to do so), which suggests that both abilities are important for theory of mind in conversation. Altogether, this study highlights both strengths and difficulties among autistic individuals, and it suggests possible avenues for further research and for improving conversational skills.
Introduction
Everyday life is saturated with social interactions, and leveraging these interactions to form high-quality relationships is critical for mental and physical well-being (Hedley et al., 2018; Holt-Lunstad et al., 2010). It is widely assumed that social-interactive success depends on theory of mind (ToM)—the ability to represent the mental states of others and oneself—an assumption that drives the influential “mindblindness” account of social impairment in autism (Baron-Cohen et al., 1985). While empirical findings indicate broad, distal relations between ToM and social outcomes (Bosacki & Astington, 1999; Caputi et al., 2017; Devine & Hughes, 2013; Peterson et al., 2016), the specific mechanisms driving these relations are unclear. That is, how does ToM ability or propensity impact real-world behaviors that in turn influence social-interactive success?
Answering this question with a useful degree of specificity is hampered by the discrepancy between standard ways of measuring ToM in the laboratory and the nature of everyday social interactions. Standard ToM tasks range widely in format, but they share a common feature: the subject plays the role of an observer rather than a participant in a social exchange (Begeer et al., 2010; Schilbach et al., 2013). As such, the behaviors directly elicited by observational tasks—labeling mental states, or verbally predicting or explaining a character’s actions, based on passive observation—do not easily map onto the behaviors that typically occur throughout everyday social interactions. Furthermore, most laboratory tasks include explicit prompts to make ToM inferences, for example, asking participants to reason about a fictional character’s motivations in a social vignette (e.g. Happé, 1994). Such prompts rarely occur in everyday interactions; thus, performance on these standard tasks may not correspond with the tendency to represent the mental states of one’s real-world interaction partners.
Reliance on explicit, non-interactive measures is especially problematic when it comes to characterizing ToM abilities in autism. Autistic (AUT) individuals who pass first-order false belief tasks nevertheless struggle with everyday mindreading (U. Frith, 1994; Peterson et al., 2009), a finding that has prompted the development of several more advanced ToM measures, such as the Strange Stories (Happé, 1994) and the Reading the Mind in the Eyes Test (Baron-Cohen et al., 2001). Yet, even performance on advanced ToM tasks is not consistently related to social impairment in autism (Alkire et al., 2021; Barendse et al., 2018; Usher et al., 2015), suggesting that AUT individuals may solve explicit ToM tasks through compensatory strategies (i.e. verbal or domain-general reasoning) that do not serve them in naturalistic, interactive contexts (Happé, 1995; Scheeren et al., 2013). Furthermore, the unstructured nature of everyday social situations (as opposed to highly structured laboratory tasks) appears to exacerbate ToM difficulties in autism (Ponnet et al., 2008; Roeyers et al., 2001). Thus, to more accurately characterize real-world ToM and its role in interactive success, particularly for AUT individuals, there is a need for more direct measures of the application of ToM within unstructured social interactions.
Naturalistic conversation is an ideal unstructured format in which to measure applied ToM. Conversation becomes an increasingly prominent feature of peer interactions during middle childhood to adolescence, an important period for social development in general (Raffaelli & Duckett, 1989; Rubin et al., 2007). Crucially, successfully navigating a conversation inherently requires ToM (Roth & Spekman, 1984; Sperber & Wilson, 1987). As such, the application of ToM within conversation, hereafter termed conversational ToM (cToM), represents a key aspect of interactive social cognition and, potentially, its impairment in autism.
cToM is apparent in a variety of communicative acts. While some of these acts are overt indications of ToM, such as explicitly referring to a conversation partner’s mental states, other speech acts imply that the speaker has or has not taken the listener’s perspective into account. Failing to engage in this perspective-taking can result in statements that are under- or over-informative, irrelevant, ambiguous, or disorganized (Grice, 1975), at least from a neurotypical perspective. Conversely, speakers engage in cToM when they clarify potential ambiguities, distinguish between common and privileged ground (i.e. information that is or is not mutually shared; H. H. Clark & Marshall, 1981; Raczaszek-Leonardi et al., 2014), and repair misunderstandings (Schegloff, 2004). These properties of everyday conversation offer a promising avenue for investigating how ToM supports social interaction. However, as we review below, there is a dearth of measures specifically designed to capture cToM.
Behaviors pertaining to cToM are primarily assessed within the discipline of pragmatic language, which concerns the use of language in social contexts and the ability to infer meanings that depart from the literal meaning of an utterance. Pragmatic impairment has been associated with poor social outcomes (Mok et al., 2014; Whitehouse et al., 2009) and is common in autism (Kalandadze et al., 2018; Ying Sng et al., 2018). However, two characteristics of the pragmatics literature limit its applicability to studying ToM in real-time social interactions. First, the broad category of pragmatics includes a range of heterogeneous skills that vary in the extent to which they depend on ToM. Several authors have proposed a theoretical division between aspects that require taking the perspective of one’s conversation partner and those that do not, instead depending solely on structural language (i.e. vocabulary and syntax) and contextual reasoning from the listener’s egocentric perspective (Andrés-Roqueta & Katsos, 2017; Kissine, 2012; O’Neill, 2012). However, many existing measures of pragmatic ability conflate these heterogeneous aspects (Matthews et al., 2018). A second limitation of previous pragmatics studies is that, much like ToM studies, they often employ non-interactive measures. Because these measures may simply reflect knowledge of social norms (e.g. Carrow-Woolfolk, 1999) and because pragmatic language use is highly context-dependent (Ying Sng et al., 2018), standard laboratory tasks may not adequately capture one’s true capacity to use ToM within everyday conversations.
Given the limited ecological validity of task-based measures of pragmatic competence, observational measures of natural language samples are better suited to identifying behaviors that reflect real-world ToM. For example, the Pragmatic Rating Scale (PRS; Landa et al., 1992) has been used to characterize pragmatic difficulties in autism that are observed during interactions with a clinician or experimenter (Greenslade et al., 2019; Lam & Yeung, 2012; Paul et al., 2009) and during free-play among preschoolers (Bauminger-Zviely et al., 2013). However, because the PRS and similar rating scales are designed to quantify only pragmatic deficits, they are ill-suited to measuring positive indicators of cToM, such as clarifying ambiguities or explicit perspective-taking. Other observational studies have quantified discrete ToM-relevant features of conversation, such as the number of relevant and appropriately informative statements (Capps et al., 1998; Nadig et al., 2010; Tager-Flusberg & Anderson, 1991). Though focused on limited aspects of discourse, these studies establish the feasibility of analyzing naturalistic conversation for behavioral indicators of cToM.
We argued above that non-interactive ToM tasks are limited in their capacity to measure individuals’ ability and propensity to apply ToM during real-world interactions. Nevertheless, understanding how performance on established tasks relates to cToM-related pragmatic abilities could shed light on the cognitive underpinnings of these abilities. ToM is multifaceted, and different non-interactive tasks likely tap into several distinct underlying abilities (Altschuler et al., 2018; Hayward & Homer, 2017; Schaafsma et al., 2015; Warnell & Redcay, 2019). As measuring every facet of ToM is beyond the scope of a single study, here we focus on two aspects that are likely instrumental to conversation. Face-to-face interactions, in particular, require sensitivity to multimodal cues from one’s interaction partner, including facial expressions conveying affective information relevant to the conversation (e.g. interest or confusion). Tasks measuring visual-affective ToM (vaToM), the ability to infer emotional states from facial expressions, are widespread in the general ToM literature and show positive correlations with social functioning in autism (Trevisan & Birmingham, 2016), yet their connection with conversational skills is underexplored. vaToM tasks provide participants with a set of emotion words from which to select the most appropriate label for each facial expression. However, as mentioned above, real-world social interaction lacks such explicit prompts to reason about mental states. In contrast, spontaneous ToM (sToM), the propensity to attribute mental states in the absence of explicit prompts, is commonly measured using social animation tasks in which participants are asked to describe scenes depicting interactions between geometric shapes; responses are then coded for the degree of spontaneous mental state attribution (Abell et al., 2000; Castelli et al., 2002; Klin, 2000). sToM may be another key mechanism for cToM, allowing one to actively make inferences or predictions about a conversation partner’s mental states to guide and adjust one’s own communication.
This study aims to demonstrate the instrumental role of ToM in social interaction through observational coding of unstructured, face-to-face conversations among AUT and typically developing (TD) children and adolescents. Guided by theoretical arguments (e.g. the “double empathy problem”; Milton, 2012) and empirical work suggesting that dyadic interactions differ depending on the characteristics of each member, particularly autism status (Crompton et al., 2020; Morrison et al., 2020), our study includes dyads that were either matched (both TD or both AUT) or mismatched (one TD, one AUT).
Within this ecologically valid context, we established a novel coding framework for cToM that builds on prior work by focusing on ToM-relevant behaviors rather than general pragmatic competence and by considering both positive and negative indicators of cToM rather than a deficit-only approach (M. Clark & Adams, 2020; McCrimmon & Montgomery, 2014). After examining differences between AUT and TD participants on cToM, we examined cToM in relation to the two aforementioned facets of ToM (measured with non-interactive tasks) that are especially relevant to conversation: vaToM and sToM. We hypothesized that both vaToM and sToM would show positive associations with the tendency to apply ToM in conversation (cToM Positive) and negative associations with ToM-related violations of conversational norms (cToM Negative) (Hypothesis 1). Finally, in line with the widespread assumption that ToM plays an instrumental role in social interactions, we hypothesized that cToM Positive would positively predict interaction success, as rated by interaction partners, and cToM Negative would negatively predict success (Hypothesis 2).
Methods
Participants
This study was part of a larger ongoing study approved by the University of Maryland’s Institutional Review Board (Approval No. 733144). Participants were recruited largely based on participation in previous studies in our laboratory. Additional AUT participants were recruited through the Interactive Autism Network, Simons Foundation Powering Autism Research (SPARK), Facebook advertisements, and flyering at local events. We appreciate obtaining access to recruit participants through SPARK research match on SFARI Base. Additional TD participants were recruited through the University of Maryland’s Infant and Child Studies database. Autism diagnoses were confirmed using the Autism Diagnostic Observation Schedule, Second Edition (ADOS-2; Lord et al., 2012) administered by a research-reliable clinician. All participants were native English speakers with a full-scale IQ of at least 80 on the Kaufman Brief Intelligence Test, second edition (KBIT-2; Kaufman & Kaufman, 2004). TD participants had no diagnosed neurological or psychiatric disorders and no first-degree relatives with autism or schizophrenia. Additional diagnoses of the AUT participants are reported in the Supplementary Materials.
The sample size was based on a power analysis using the Actor–Partner Interdependence Model (APIM) Power Analysis Shiny Application (Ackerman et al., n.d.), which indicated that a sample of 50 dyads provides 80% power to detect effects with standardized beta weights of at least 0.3. See Supplementary Materials for additional details.
Dyad members were matched on gender and age (within 1 year or grade level) and were arranged into three dyad types: AUT–TD (one AUT, one TD participant), TD–TD, and AUT–AUT. The final sample of 50 dyads included 25 TD–TD (8 female dyads), 18 AUT–TD (4 female dyads), and 7 AUT–AUT (1 female dyad); across dyad types, the sample comprises 68 (20 female) TD and 32 (6 female) AUT individuals, for a total of 100 individuals. Each participant was included in only one dyad. The average age across the whole sample was 13.34 years, with a range of 8.72–16.91 and a standard deviation of 1.85 years. Participants’ gender was assessed via parent report: parents were asked to select between the options of “male” or “female” to indicate their child’s gender. Race and ethnicity distributions for the whole sample are as follows: 2% Asian, 11% Black or African American, 65% White, 11% more than one race, 11% not reported; 9% Hispanic or Latino, 78% not Hispanic or Latino, 13% not reported. Additional participant demographics, including indicators of socioeconomic status, are reported in Supplementary Table 1.
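The individual-level counts follow directly from the dyad counts reported above. As a quick arithmetic check, the sketch below rebuilds them; the dictionary layout and function name are illustrative assumptions, and only the numbers come from the text.

```python
# dyad type -> (number of dyads, female dyads, TD members per dyad, AUT members per dyad)
# The counts are those reported in the text; the layout is illustrative only.
DYADS = {
    "TD-TD": (25, 8, 2, 0),
    "AUT-TD": (18, 4, 1, 1),
    "AUT-AUT": (7, 1, 0, 2),
}

def totals(dyads):
    td = sum(n * td_per for n, _, td_per, _ in dyads.values())
    aut = sum(n * aut_per for n, _, _, aut_per in dyads.values())
    # Dyads are gender-matched, so a female dyad contributes only female members.
    td_f = sum(f * td_per for _, f, td_per, _ in dyads.values())
    aut_f = sum(f * aut_per for _, f, _, aut_per in dyads.values())
    return td, aut, td_f, aut_f

td_total, aut_total, td_female, aut_female = totals(DYADS)
print(td_total, aut_total, td_female, aut_female)  # 68 32 20 6
```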
Face-to-face peer interaction task
The unstructured conversations took place within a peer interaction paradigm developed by Usher et al. (2015, 2018). After separately providing informed written consent along with their caregivers, participants were seated across from each other in a behavioral testing room in which three video cameras recorded the interaction from multiple angles. The experimenter told participants they would engage in simple activities together for about 20 min, then said, “Before I explain your task, why don’t you get to know each other? I’ll be back in about 5 min.” This 5-min “Get to Know You” period was followed by two semi-structured activities. Given that this study focused on unstructured conversation, only the Get to Know You portion was coded for cToM. Following these activities, participants were placed in separate rooms, where they individually completed questionnaires and additional behavioral tasks.
cToM Rating Scale
Conversations were rated along two scales, Positive and Negative, based on whether behaviors reflected the presence (Positive) or absence (Negative) of ToM. Categories of behaviors considered for the cToM rating are described in Supplementary Table 3; where applicable, real examples from the observed sample are provided. Among other categories, the Positive scale included explicit references to the partner’s mental state, whereas the Negative scale included violations of neurotypical conversational norms (e.g. over- or under-informative statements). The Positive and Negative scales were 6 points each (0–5), with higher numbers representing higher frequencies of behaviors.
The Get to Know You task was coded for cToM by a team of three trained research assistants supervised by the first author. Coders were masked to participant diagnosis except for three participants who self-disclosed their autism diagnosis and one who self-disclosed his lack of autism diagnosis during the conversation task. Furthermore, due to concern that coders may be biased by participants’ physical appearance or mannerisms such that they could speculate on diagnosis or form a negative impression (Sasson et al., 2017), coding was based on extracted audio files rather than video. Prior to coding, conversations were transcribed from the videos using the Computerized Language Analysis (CLAN) program (MacWhinney, 2000). Transcripts assisted coding by indicating meaningful nonverbal gestures (e.g. nodding the head yes or no, shaking hands), which provided additional information by which raters could interpret the conversation, and by highlighting mental state words, which were taken into consideration for the explicit ToM category. All transcripts were checked for accuracy by a second research assistant or the first author.
Additional details on the coding system, training and coding procedures, and interrater reliability are reported in the Supplementary Materials. Reliability coefficients for the final dataset were as follows: Positive scale, Krippendorff’s alpha = 0.81; Negative scale, Finn’s r = 0.94. Coefficients of 0.7 or above are considered acceptable for both Krippendorff’s alpha and Finn’s r (Hayes & Krippendorff, 2007; Heyman et al., 2014).
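For readers unfamiliar with Finn's r, the sketch below shows one common formulation: one minus the ratio of the observed within-target variance to the variance expected if coders picked scale points uniformly at random, which for k equally likely categories is (k² − 1)/12. This is an illustration under assumptions, not the computation used in the study: the one-way variant, the function name, and the toy ratings are all hypothetical.

```python
def finn_r(ratings, n_categories):
    """One-way Finn's r (illustrative sketch, not the study's own procedure).

    ratings: list of per-target lists, each holding one rating per coder.
    n_categories: number of scale points (6 for a 0-5 scale).
    """
    n_raters = len(ratings[0])
    ss_within = 0.0
    for target in ratings:
        target_mean = sum(target) / n_raters
        ss_within += sum((x - target_mean) ** 2 for x in target)
    # Within-target mean square from a one-way layout.
    ms_within = ss_within / (len(ratings) * (n_raters - 1))
    # Variance of a discrete uniform distribution over n_categories points.
    expected_variance = (n_categories ** 2 - 1) / 12
    return 1 - ms_within / expected_variance

# Hypothetical toy ratings from two coders on a 6-point (0-5) scale.
perfect = finn_r([[0, 0], [1, 1], [3, 3]], n_categories=6)
one_disagreement = finn_r([[0, 1], [2, 2]], n_categories=6)
```

Because the chance benchmark depends only on the number of scale points and not on how scores are spread across targets, a coefficient of this kind stays interpretable when most scores cluster at a few values.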
Control variables
For both hypotheses, analyses included the following variables as covariates of no interest. Verbal IQ, consistently linked to performance on ToM tasks (Ronald et al., 2006; Scheeren et al., 2013), was measured by the KBIT-2 Verbal Intelligence score. Executive functioning (EF), also known to correlate with ToM (Jones et al., 2018; Lecce et al., 2017) and pragmatic abilities (Matthews et al., 2018), was measured by the Global Executive Composite on the Behavior Rating Inventory of Executive Function, second edition (Gioia et al., 2015). Higher scores on this measure correspond to greater impairment. Age, gender, and autism diagnosis are also associated with variability in ToM skills (Bal et al., 2013; Kirkland et al., 2013). Finally, given that cToM scores were derived from speech, we controlled for the amount of speech produced by each individual to ensure that any effects of interest were specific to cToM and not simply driven by how much an individual talked. Language productivity was operationalized as the total number of words per minute (Scott & Windsor, 2000) spoken by each child during the 5-min interaction, as calculated by CLAN. For Hypothesis 1 only, this total was averaged together with the average word count of responses to the sToM task (described below) to create a language productivity composite.
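To make the words-per-minute measure concrete, here is a minimal sketch. It assumes simple whitespace tokenization, which will not match CLAN's word-counting rules exactly; the function name and sample utterances are hypothetical.

```python
def words_per_minute(utterances, duration_min=5.0):
    """Total words spoken by one child divided by the conversation length.

    Whitespace tokenization is an assumption; CLAN applies its own rules
    for what counts as a word.
    """
    total_words = sum(len(utterance.split()) for utterance in utterances)
    return total_words / duration_min

# Hypothetical utterances from one child in a 5-min conversation.
sample = ["hi there", "do you like soccer"]
wpm = words_per_minute(sample)  # 6 words over 5 minutes
```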
Hypothesis 1: non-interactive ToM predicts cToM
After the peer interaction, participants individually completed two non-interactive ToM tasks: vaToM and sToM.
vaToM was assessed using the Face task from the Cambridge Mindreading Face-Voice Battery for Children (Golan et al., 2015). Children viewed videos of actors expressing complex emotions (emotions involving cognitive states) and chose the appropriate label from among four choices. The task consisted of 27 items representing nine emotional concepts: unfriendly, disappointed, embarrassed, jealous, loving, nervous, bothered, amused, and undecided. The measure of interest was accuracy (% correct). See Supplementary Materials for details on stimulus presentation and task administration.
sToM was assessed using the Frith–Happé Triangles task (Abell et al., 2000). Children viewed four short animations in which two triangles move in a manner that suggests a ToM-related interaction. After viewing each video, children were asked to describe the cartoon while their verbal responses were audio-recorded. Responses were coded for the frequency of unique and appropriate internal state attributions (mental states, intentions, and emotions) following guidelines adapted from Rice and Redcay (2015); see Supplementary Materials. Responses were independently coded by two trained research assistants, with 24% of cases double-coded and discussed with the coding supervisor (a third research assistant) during weekly meetings until consensus was reached. Interrater reliability was excellent (Krippendorff’s alpha = 0.85). Scores for each of the three items (excluding a practice item) were averaged for an overall sToM score.
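The two task scores described above reduce to simple summaries, sketched below. The function names, the sample data, and the assumption that the practice animation is presented first are all hypothetical; the text specifies only percent correct for vaToM and the mean of three scored items for sToM.

```python
def vatom_accuracy(responses, answer_key):
    """Percent of face items labeled correctly (27 items in the full task)."""
    correct = sum(r == k for r, k in zip(responses, answer_key))
    return 100.0 * correct / len(answer_key)

def stom_score(attribution_counts, practice_index=0):
    """Mean number of coded internal state attributions across scored items.

    practice_index marks the practice animation to drop; treating it as the
    first animation is an assumption made for this sketch.
    """
    scored = [c for i, c in enumerate(attribution_counts) if i != practice_index]
    return sum(scored) / len(scored)

# Hypothetical data: three face items and four animations (first = practice).
accuracy = vatom_accuracy(["amused", "amused", "amused"],
                          ["amused", "nervous", "amused"])
stom = stom_score([1, 2, 3, 4])
```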
Analysis strategy
Due to the dyadic data structure and thus the possibility of non-independence among variables, APIMs (Kenny et al., 2006) were used to estimate the effects of interest (vaToM and sToM) on cToM. The APIM simultaneously estimates actor effects (e.g. the effect of Person 1’s vaToM on Person 1’s cToM) and partner effects (e.g. the effect of Person 1’s vaToM on Person 2’s cToM); see Figure 1. We ran separate models with each of the cToM scales as outcomes. Each model included actor and partner effects of the non-interactive ToM variables, diagnosis (AUT or TD), verbal IQ, EF, and the language productivity composite; and age (averaged between dyad members) and gender as dyad-level covariates. The hypothesized effects were the actor effects of each of the non-interactive ToM variables. In addition, we ran reduced versions of these models without partner effects of the individual-level predictors. Results and interpretations for the reduced models were similar to the full models and are reported in the Supplementary Materials, along with details on the modeling approach used for each cToM scale. Because the Negative scale had a highly skewed distribution, leading to violations of the assumptions of linear models, for this model, we instead applied generalized estimating equations, which make no distributional assumptions and are recommended for estimating APIMs with non-Gaussian outcomes (Loeys et al., 2014; Loeys & Molenberghs, 2013). All APIM analyses (both hypotheses) were conducted using SPSS Version 26; the data and syntax used for these analyses are available in the Open Science Framework (https://osf.io/b586n/?view_only=3b8331f29ed040e883e2c4e5312d88d6).

Figure 1. Schematic of the actor–partner interdependence model.
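APIMs are usually fit to a "pairwise" dataset in which each person contributes one row holding their own predictor (actor), their partner's predictor (partner), and their own outcome. The sketch below shows that restructuring for a single predictor. It is an illustration only: the field names and values are hypothetical, and the study's actual models were run in SPSS with additional covariates.

```python
def to_pairwise(dyads):
    """Turn a list of (person_a, person_b) records into one row per person.

    Each record is a dict with hypothetical keys 'id', 'vaToM', 'cToM_neg'.
    """
    rows = []
    for dyad_id, (a, b) in enumerate(dyads):
        # Each member appears once as the actor, with the other as partner.
        for actor, partner in ((a, b), (b, a)):
            rows.append({
                "dyad": dyad_id,
                "actor_id": actor["id"],
                "actor_vaToM": actor["vaToM"],      # predictor of the actor effect
                "partner_vaToM": partner["vaToM"],  # predictor of the partner effect
                "outcome_cToM_neg": actor["cToM_neg"],
            })
    return rows

# Hypothetical dyad with made-up scores.
rows = to_pairwise([
    ({"id": "A", "vaToM": 70.0, "cToM_neg": 1},
     {"id": "B", "vaToM": 85.0, "cToM_neg": 0}),
])
```

With the data in this shape, regressing the outcome on the actor and partner columns while accounting for the shared dyad identifier gives the two kinds of effects described above.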
Hypothesis 2: cToM predicts interaction success
Following the peer interaction, participants individually completed six items from the Social Interaction Evaluation Measure (Berry et al., 1996) assessing interaction quality, that is, how much the participant enjoyed the interaction and would like to interact again (see Supplementary Materials for specific items). The summed total of responses to these items from each participant’s partner served as the operational definition of interaction success. We predicted that each cToM scale would have a significant effect on interaction success (positive effect for Positive, negative effect for Negative).
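Because interaction success is defined from the partner's ratings rather than the participant's own, the scores are effectively swapped within each dyad. A minimal sketch, with hypothetical names and made-up item responses (the response scale of the items is not given here):

```python
def interaction_success(item_ratings, partner_of):
    """Each person's interaction success is the sum of their partner's item ratings."""
    return {person: sum(item_ratings[partner])
            for person, partner in partner_of.items()}

# Hypothetical six-item responses for one dyad.
ratings = {"A": [4, 5, 4, 3, 5, 4], "B": [2, 3, 3, 2, 4, 3]}
partners = {"A": "B", "B": "A"}
success = interaction_success(ratings, partners)
```

Summing a person's own ratings instead would give the "interaction quality (participant's own rating)" variable that appears in the Results.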
Analysis strategy
For each of the cToM scales, an APIM was estimated using linear mixed effects modeling in SPSS, with actor and partner cToM as the predictors and interaction success as the outcome. Covariates included actor and partner effects of verbal IQ, language productivity, and EF; and dyad-level age, gender, and dyad type (TD–TD, AUT–TD, and AUT–AUT). We included dyad type rather than individual diagnosis as a covariate based on dyadic research suggesting that perceptions may depend on the match or mismatch between partners’ diagnosis rather than an individual’s diagnosis alone (Morrison et al., 2020).
We also conducted a planned follow-up analysis examining the effect of vaToM and sToM on interaction success (see Supplementary Materials).
Community involvement statement
Community members were not involved in the design, implementation, or interpretation of the study.
Results
Preliminary analyses
Descriptive statistics, group comparisons, and zero-order correlations among all variables are shown in Supplementary Tables 4–7. Notably, Positive and Negative scales were uncorrelated (r = 0.00). Figure 2 depicts the distributions of and group differences in each cToM scale. The Negative scale showed a marked group difference in the expected direction based on previous literature, with AUT participants rated higher (meaning they displayed more frequent negative cToM behaviors; t(35.14) = 3.25, p = 0.003). In contrast, TD and AUT participants showed very similar distributions on the Positive scale (t(59.18) = 0.02, p = 0.98).

Figure 2. Distributions of the cToM scales among the full sample (leftmost graphs) and by diagnostic group.
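The fractional degrees of freedom in the group comparisons above are what an unequal-variance (Welch) t-test produces, although the text does not name the test. Under that assumption, the sketch below shows how the statistic and its Welch–Satterthwaite degrees of freedom are computed; the function name and sample values are hypothetical.

```python
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of the mean, group x
    vy = variance(y) / len(y)  # squared standard error of the mean, group y
    t = (mean(x) - mean(y)) / (vx + vy) ** 0.5
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Hypothetical scores; with equal group sizes and equal variances,
# the degrees of freedom reduce to n1 + n2 - 2.
t_equal, df_equal = welch_t([1, 2, 3], [2, 3, 4])
t_unequal, df_unequal = welch_t([1, 2, 3, 4], [2, 4, 6, 8, 10])
```

When group sizes or variances differ, the degrees of freedom fall below n1 + n2 − 2 and are generally not whole numbers, matching the form of values such as t(35.14).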
We also examined the frequency of each category of behavior within the cToM coding system. Figure 3 depicts these within-category scores and their averages across the full sample and for each group separately. The most frequent category on the Positive scale was explicit ToM, followed by distinguishing common versus privileged ground. The other Positive categories were relatively infrequent, with only a handful of participants scoring 1 or above. Categories on the Negative scale were more comparable in their frequency, with most participants showing no or very few instances across categories (as reflected in the skewed distribution of the overall Negative score).

Figure 3. Category-level analysis of cToM Positive and Negative scales. Within each scale, sample-wide averages are displayed (with individual data points superimposed) above boxplots comparing the distributions within diagnostic group. On the Positive scale, “Following maxims” refers to consistently following the maxims of quantity, relevance, and manner throughout the conversation; each child was assigned a global score of either 1 (consistent throughout), 0.5 (followed maxims for about half the conversation), or 0 (rarely followed maxims).
AUT and TD participants showed comparable rates across all Positive categories. On the Negative scale, higher rates within each category were represented almost exclusively by a minority of AUT participants. On average, AUT participants were more likely than TD participants to violate maxims of quantity, relevance, and manner and to interrupt their partners in a manner suggesting they were either not attending to their partner’s mental state or were attending but struggled to integrate this information into their responses.
The distributions of the other key variables in Hypotheses 1 and 2 are plotted by diagnostic group in Figure 4. Groups did not significantly differ on sToM (t(45.83) = −0.22, p = 0.83), but accuracy on the vaToM task was significantly lower in the AUT group (t(42.21) = −3.72, p < 0.001). Groups did not significantly differ on interaction quality (participant’s own rating; t(50.23) = 0.63, p = 0.53) or interaction success (partner rating; t(55.51) = 0.56, p = 0.58). Interaction success did not significantly differ across dyad types (F(2, 97) = 1.07, p = 0.35; AUT–TD vs TD–TD: t(80.65) = −0.44, p = 0.66; AUT–TD vs AUT–AUT: t(17.19) = 1.17, p = 0.26; AUT–AUT vs TD–TD: t(16.94) = 0.95, p = 0.36). The distributions of all additional variables are plotted by group in Supplementary Figure 1.

Figure 4. Distributions plotted by diagnostic group for (a) non-interactive ToM and (b) interaction quality (participant’s own rating) and interaction success (partner rating). Interaction success is also plotted by dyad type.
Hypothesis 1 results
For cToM Negative, there was a significant negative actor effect of vaToM (B = −0.24, SE = 0.09, p = 0.01) and a significant negative actor effect of sToM (B = −0.30, SE = 0.13, p = 0.02), supporting our hypothesis that non-interactive ToM predicts cToM (Figure 5). However, this hypothesis was not supported for cToM Positive, for which there were no significant actor effects of either vaToM or sToM. Unstandardized regression coefficients for all effects are reported in Table 1.

Figure 5. Hypothesis 1 results for cToM Negative.
Table 1. Hypothesis 1: parameter estimates of non-interactive ToM predicting cToM.
B: unstandardized beta coefficient; SE: standard error; CI: confidence interval; LangProd: language productivity composite (see “Methods”); EF: executive functioning (higher scores correspond to greater impairment); Dx: diagnosis.
Hypothesized effects in bold.
Hypothesis 2 results
To test the hypothesis that cToM predicts interaction success, we estimated linear mixed effects models with each of the cToM scales as predictors. Contrary to our hypothesis, there were no significant effects of either cToM scale predicting interaction success (Supplementary Table 11).
Discussion
This study introduced the cToM coding system, a novel approach to measuring the use of ToM in social interaction that capitalizes on properties of naturalistic conversation that inherently involve ToM. We found partial support for our hypotheses in that cToM (Negative) was correlated with non-interactive ToM; however, cToM was not related to interactive success. We also demonstrated dissociations between cToM Positive and Negative scales, and this divergence is relevant to our understanding of cToM in autism.
Divergence between cToM Positive and Negative scales
The Positive and Negative cToM scales—reflecting the presence or absence, respectively, of ToM-related behaviors—were decidedly uncorrelated (r = 0.00). Given that they appear to capture orthogonal dimensions, it is not surprising that we found divergent results for the two scales in Hypothesis 1, and in the between-group comparison.
Comparing TD and AUT participants on cToM revealed a few noteworthy patterns, especially considering the prominent mindblindness theory of autism. As expected, at the group level, the AUT participants were rated significantly higher than TD participants on the Negative scale (reflecting more frequent negative cToM behaviors). Thus, the Negative scale seems to capture ToM-relevant pragmatic difficulties that have been previously documented in the autism literature, particularly violations of Gricean maxims (Landa, 2000; Paul et al., 2009; Ying Sng et al., 2018). At the same time, AUT and TD participants showed highly similar distributions on the Positive scale, and this was also true at the category level (Figure 3). Consistent with previous findings (reviewed in Bang et al., 2013), AUT participants used mental state language (“explicit ToM” in our categorization) to the same extent as TD participants. Furthermore, while explicit ToM was the most frequent category within the Positive scale, the AUT participants’ scores do not solely reflect superficial knowledge of mental state concepts; they were also comparable to TD participants on the second-most frequent category of distinguishing between common and privileged ground. Behaviors coded in this category include preemptively sharing information that would plausibly be unknown to one’s partner, or asking whether one’s partner shares certain knowledge, both of which reflect an active awareness that others can possess different knowledge states from oneself. Thus, our findings converge with other evidence that AUT individuals’ ToM abilities are often underestimated (Heasman & Gillespie, 2018), and that they are aware of common ground and modify their discourse in response (De Marchena & Eigsti, 2016). 
Together with the lack of correlation between the Positive and Negative scales, this pattern of between-group difference and similarity implies that even those AUT individuals who violate certain neurotypical conversational norms may be comparable to their TD peers on other aspects of cToM. As such, the abilities captured by the cToM Positive scale could be recognized as an area of strength to be leveraged in social skills interventions (McCrimmon & Montgomery, 2014).
The discrepancy between the Positive and Negative scales also raises the question of how they may differ in their cognitive demands. Focusing on the dominant category of explicit ToM, simply referencing a partner’s mental state does not necessarily require switching between two perspectives (self vs other) beyond the basic recognition that others have their own mental states. Thus, it is possible that the behaviors captured by the Positive scale do not reflect the degree of perspective-taking necessary for scoring low on the Negative scale.
cToM Negative relates to non-interactive ToM
Non-interactive ToM—specifically, vaToM and sToM—was negatively associated with cToM Negative but not Positive. The association between vaToM and cToM Negative suggests that children who struggle to identify complex emotions based on facial expressions also tend to violate neurotypical conversational norms, such as providing too little, too much, or irrelevant information, or expressing themselves in a confusing manner. It is easy to imagine how this association might play out in a conversation. For example, a child who does not pick up on her partner’s confused expression would be unaware that her utterances come across as irrelevant or unclear (in the absence of verbal feedback to this effect), and thus would not adjust her behavior. In contrast, explicitly referencing a partner’s mental state could reflect a general (rather than partner-specific) understanding of diverse mental states, as argued above, and thus might not necessitate “checking in” with a partner to dynamically integrate feedback from their facial expressions—hence the lack of association between vaToM and cToM Positive.
The negative association between sToM and cToM Negative is consistent with the notion that successfully navigating a conversation, and social interaction more broadly, involves spontaneous mental state attribution. Specifically, participants who described social animations in more mentalistic terms were rated lower on cToM Negative, suggesting that these children spontaneously attributed mental states to their partners during the conversation, and that this allowed them to follow neurotypical conversational norms. The lack of association between sToM and cToM Positive is somewhat surprising given that our sToM measure was the frequency of mental state attributions within a verbal description, and cToM Positive scores were largely driven by explicit mental state language. However, it is important to consider that while the sToM task did not explicitly prompt children to attribute mental states, it did require them to give verbal descriptions of the scenes. In contrast, most mental state attributions made during a naturalistic conversation are almost certainly not vocalized. Whereas cToM Negative may be related to the general tendency to make mental state attributions (which may be unvocalized unless prompted), cToM Positive, specifically the explicit ToM category, indexes the subset of mental state attributions that happen to be vocalized during the conversation.
Altogether, these results add to the relatively sparse literature linking diverse aspects of ToM with conversational skills.
cToM does not predict partner-reported interaction success
Contrary to our hypothesis, neither cToM scale was a significant predictor of interaction success, as measured by partner ratings of interaction quality. Two aspects of our methodology must be considered when interpreting these null findings.
First, this study focused on the specific context of a single interaction with an unfamiliar peer. This setup was meant to represent the type of real-world scenario that, if successful, would lead to further interactions and eventually the formation of a long-term relationship. AUT individuals often find interacting with unfamiliar peers particularly challenging, as they struggle to apply their social knowledge (including skills learned through explicit training) to novel situations (Bauminger-Zviely et al., 2013; Usher et al., 2015). As such, we expected our paradigm to be more effective at eliciting cToM difficulties than an interaction with a familiar individual would be. However, it is possible that cToM plays a more important role in the perceived quality of interactions between familiar partners than between unfamiliar ones. That is, as two people gain more experience with each other, each may come to expect the other to make more accurate mental state attributions about them, and thus a failure to do so would be more striking. More generally, while this study suggests that cToM is not a significant factor in how individuals judge the quality of a single interaction, the full impact of ToM on relationship quality likely plays out over repeated interactions. Future studies could explore this possibility by following dyads longitudinally and tracking their interaction patterns, including cToM, and relationship outcomes over time.
Second, our choice of partner self-report as the measure of interaction success may have limited our ability to detect a true effect of cToM, as social desirability bias may have led some children to underreport any dissatisfaction with the interaction. However, we do not believe this was a major concern, as scores on this measure were reasonably well spread (Figure 4). Furthermore, in a post hoc exploration of a subset of participants’ future partner preferences (i.e. whether they would prefer the same or a different partner if they were to participate in another session; see Supplementary Materials), which they completed at the end of the session while the experimenter was out of the room, the extent to which participants preferred the same partner correlated positively with their earlier report of interaction quality (r = 0.42, p < 0.001). This suggests that our measure of interaction success is a reasonable proxy for the participants’ motivation to interact with their partners again, a prerequisite for long-term relationship formation. Nevertheless, future studies could incorporate alternative measures of peer acceptance, such as classroom sociometric ratings, that may be stronger indicators of real-world interaction success.
General limitations and future directions
A few additional caveats apply to our discussion of the overall study. First, while our sample of 50 dyads allowed sufficient power to test our hypotheses, a larger sample would likely produce more robust and reproducible results. Furthermore, the relatively low proportion of AUT participants (32% of the total sample) likely limited the variability we observed in the cToM Negative scale, on which only a handful of AUT participants received high ratings. This in turn may have limited our ability to detect the predicted effects. The generalizability of our findings to the wider AUT population is also limited, as we excluded individuals with IQs below 80 or who are minimally verbal, and our AUT sample was 84% White. It is also possible that our findings would not generalize to age groups outside of the middle childhood to adolescent range. In addition, while there is growing recognition that AUT females often differ from their male counterparts in their experience and behavioral presentation of autism, including social behaviors (Dean et al., 2017; Mandy & Lai, 2017; Wood-Downie et al., 2021), our sample was predominantly male. Further research with larger and more balanced samples should assess whether gender moderates the relations under study, and whether gender interacts with autism diagnosis to predict cToM or interaction success. We also recognize the limitation that we assessed gender via parent- and not self-report and provided only binary options. In line with growing recognition of a broader spectrum of gender identities, particularly among the AUT population (Corbett et al., 2022; Strang et al., 2020), future studies should include nonbinary and other options for gender and pose this question to the participants themselves.
Future studies should also further explore AUT–AUT interactions, which constituted only a small portion of our sample. Recent work has suggested that AUT–AUT interactions may qualitatively differ from TD–TD or TD–AUT interactions, including features relevant to cToM, such as a generous assumption of common ground (Heasman & Gillespie, 2019), and in ways that impact interaction success (Granieri et al., 2020; Morrison et al., 2020). Such findings support the idea of the double empathy problem (Milton, 2012), in which social interaction difficulties are attributed not solely to the AUT individual, but to a breakdown in communication resulting from the different experiences and expectations of AUT and neurotypical people (Crompton et al., 2021). Thus, rather than being a static trait of an individual, cToM may be intrinsically tied to the dyadic context created between an individual and a particular partner, and an important aspect of this context may be the match or mismatch in autism status. As such, future studies of cToM should be sufficiently powered to investigate the effect of dyad type (i.e. match or mismatch) on both cToM and interaction success.
Other limitations relate to the cToM measure itself. Because coding was based on the audio and transcripts of the conversations, we were unable to capture potentially relevant features, such as facial expressions and eye gaze; future iterations of the cToM system could be expanded to include the visual modality. Furthermore, any measure based on third-party ratings has the limitation of not directly accessing an individual’s subjective experience. We encourage future studies using the cToM coding system to collect self-report measures immediately following the conversation to provide complementary information about how people perceive their own use of ToM in conversation. Ratings about an individual’s cToM from their conversation partner would be similarly valuable and may be more likely to uncover associations between cToM and partner-rated interaction success.
Further research is also needed to validate and refine the cToM rating scales. Beyond the above-discussed divergence between the Positive and Negative scales, there may be heterogeneity within each scale in terms of the cognitive processes they reflect. Future research using a larger sample, and perhaps a slightly more structured interactive task that elicits higher rates of the categories that were infrequent in this study (e.g. misunderstandings, explicit perspective-taking, and irony comprehension), could employ factor analysis to characterize the pattern of correlations among the categories. Uncovering the latent dimensional structure of cToM could enable more precise assessments of its relations with non-interactive ToM, interaction success, and other constructs.
Finally, our cToM coding system was developed without input from AUT perspectives and thus is biased to reflect the norms of the dominant neurotypical culture; these norms may not be as valued by AUT individuals. Future studies could involve AUT individuals in adapting the cToM system to better capture how AUT individuals consider (or do not consider) each other’s perspectives while interacting, and how important this is to their perceptions of interaction quality.
Conclusion
We introduced the cToM coding system, addressing the need for interactive, ecologically valid measures of ToM. While we did not find evidence that this novel measure improves on standard ToM measures in predicting interaction success, the cToM coding system is a useful framework for characterizing the various ways in which ToM-related difficulties show up in naturalistic conversations between TD and AUT individuals. Crucially, the divergence between the Positive and Negative scales reveals the multidimensionality of ToM in conversation. This finding is valuable in two respects. First, it adds to mounting evidence of divergence between distinct components of ToM in general (Schaafsma et al., 2015; Warnell & Redcay, 2019). Second, it refines our understanding of ToM and pragmatic difficulties among AUT individuals, as even those who struggle with ToM-related violations of neurotypical conversational norms can nevertheless display typical levels of other forms of mental state representation, such as those indexed by the cToM Positive scale. Furthermore, our finding that cToM Negative relates to two forms of non-interactive ToM can inform future studies of the more basic processes that support the application of ToM in social interactions. Altogether, this study provides a springboard for further investigation into the mechanisms and consequences of ToM-related behavior within naturalistic conversations.
Supplemental Material
Supplemental material, sj-docx-1-aut-10.1177_13623613221103699 for Theory of mind in naturalistic conversations between autistic and typically developing children and adolescents by Diana Alkire, Kathryn A McNaughton, Heather A Yarger, Deena Shariq and Elizabeth Redcay in Autism
Footnotes
Acknowledgements
The authors thank the members of the cToM coding team: Ryan Regars, Aliceann Trostle, Daniel Friedman-Brown, and Ming Yuan; Katie Beverley, Hema Clarence, Diana Grant, Sarah Gray, Alex Kalomiris, Dahye Kang, and Manasvinee Mayil Vahanan for their assistance with transcription and coding; Jacqueline Thomas, Alexandra Hickey, Tina Nguyen, Micah Plotkin, Kathryn Bouvier-Weinberg, Matthew Kiely, Ryan Stadler, Aranje Sripanjalingam, Dominic Smith-DiLeo, Avi Warshawsky, Maddie Reiter, Bess Bloomer, Miranda Sapoznik, Nicole Chapman, and Aiste Cechaviciute for their assistance with data collection; Dr Lauren Usher for conceiving of the Get to Know You task; and Drs Peter Carruthers, Jude Cassidy, and Edward Lemay for their advice on study design, methods, and data analysis. Finally, they thank the participants and their families for making this research possible.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Research reported in this publication was supported by the National Institute of Mental Health of the National Institutes of Health under Award No. R01MH107441. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Supplemental material
Supplemental material for this article is available online.
