Artificial Intelligence and the Illusion of Understanding: A Systematic Review of Theory of Mind and Large Language Models

Abstract

The development of Large Language Models (LLMs) has sparked significant debate regarding their capacity for Theory of Mind (ToM)—the ability to attribute mental states to oneself and others. This systematic review examines the extent to which LLMs exhibit Artificial ToM (AToM) by evaluating their performance on ToM tasks and comparing it with human responses. While LLMs, particularly GPT-4, perform well on first-order false belief tasks, they struggle with more complex reasoning, such as second-order beliefs and recursive inferences, where humans consistently outperform them. Moreover, the review underscores the variability in ToM assessments, as many studies adapt classical tasks for LLMs, raising concerns about comparability with human ToM. Most evaluations remain constrained to text-based tasks, overlooking embodied and multimodal dimensions crucial to human social cognition. This review discusses the “illusion of understanding” in LLMs for two primary reasons: First, their lack of the developmental and cognitive mechanisms necessary for genuine ToM, and second, methodological biases in test designs that favor LLMs’ strengths, limiting direct comparisons with human performance. The findings highlight the need for more ecologically valid assessments and interdisciplinary research to better delineate the limitations and potential of AToM. This set of issues is highly relevant to psychology, as language is generally considered just one component in the broader development of human ToM, a perspective that contrasts with the dominant approach in AToM studies. This discrepancy raises critical questions about the extent to which human ToM and AToM are comparable.

Introduction

Large Language Models (LLMs), particularly those based on transformer architectures such as GPT, have marked a paradigm shift in Artificial Intelligence (AI) technologies, profoundly transforming the field of natural language processing and impacting social and economic sectors.¹ The influence of these models has been compared to disruptive technological revolutions, such as the introduction of the printing press or the advent of the Internet, sparking interdisciplinary debates across philosophy, psychology, sociology, anthropology, and economics.² However, beyond the technological advancements, one of the most debated issues revolves around whether these models can replicate advanced social competencies, specifically Theory of Mind (ToM)—the ability to attribute mental states such as intentions, beliefs, and desires to oneself and others.^3–5

Recent studies suggest that LLMs can successfully complete tasks traditionally used to assess ToM in humans. Their ability to generate highly realistic human language raises fundamental questions: To what extent do these models understand humans’ intentions and beliefs through language? How comparable is their performance to that of humans? More critically, does their success reflect genuine social cognition, or is it an “illusion of understanding,” a case where LLMs mimic ToM reasoning without true comprehension of social dynamics?

These issues are particularly relevant to psychology, as language is widely recognized as just one element within the broader framework of human ToM development. This perspective stands in contrast to the dominant view in Artificial ToM (AToM) studies, raising crucial questions about the comparability (or lack thereof) between human ToM and AToM.

In this context, significant theoretical and practical implications arise. If LLMs could demonstrate ToM abilities, the applications in practical contexts would be massive. For instance, in settings involving mental health, education, or therapeutic interactions, an AI chatbot or assistant capable of understanding and responding to users’ mental states could become a powerful tool for support and care. This raises key ethical concerns, such as responsibility: Who is accountable for the responses generated by an LLM—its programmers, the model itself, or the users who interact with it through specific prompts? These are all issues that are the focus of a significant and growing body of research.

In this review, we will focus on the debate over the similarities and differences between human ToM and AToM. We will refer to AToM as the ability of LLMs to produce responses that align with human performance on ToM tasks established by psychological research.

ToM and LLMs

ToM is an essential socio-cognitive skill that allows individuals to attribute mental states—such as emotions, intentions, desires, and beliefs—to themselves and others, recognizing that these states can differ from objective reality.^4,6,7 This ability is central to social cognition, enabling individuals to interpret and predict others’ behavior based on their mental states, even if those mental states are different from one’s own and, in the case of beliefs, divergent from reality.⁸ Successfully passing false belief tests is often considered a key indicator of ToM acquisition,^9,10 with the development of ToM reaching increasingly complex levels of recursivity, such as the understanding of third-order false beliefs.¹¹
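The logic of a first-order false belief task can be made concrete with a small simulation. The following is only an illustrative sketch of an unexpected-transfer scenario in the style of the Sally-Anne task, not a procedure taken from any of the reviewed studies; the class `Room` and its methods are hypothetical names, and the sketch assumes that agents update their beliefs only about changes they witness.

```python
from dataclasses import dataclass, field


@dataclass
class Room:
    """Toy world: tracks where objects really are and where each agent thinks they are."""
    location: dict = field(default_factory=dict)  # object -> real location
    present: set = field(default_factory=set)     # agents currently in the room
    beliefs: dict = field(default_factory=dict)   # agent -> {object: believed location}

    def enter(self, agent: str) -> None:
        # Assumption: objects are hidden in containers, so entering reveals nothing new.
        self.present.add(agent)
        self.beliefs.setdefault(agent, {})

    def leave(self, agent: str) -> None:
        self.present.discard(agent)

    def move(self, obj: str, place: str) -> None:
        # Reality always changes; only agents who witness the move update their belief.
        self.location[obj] = place
        for agent in self.present:
            self.beliefs[agent][obj] = place


room = Room()
room.enter("Sally")
room.enter("Anne")
room.move("marble", "basket")  # both agents see the marble placed in the basket
room.leave("Sally")
room.move("marble", "box")     # only Anne witnesses the transfer

reality = room.location["marble"]
sally_belief = room.beliefs["Sally"]["marble"]
print(f"Reality: {reality}; Sally will look in: {sally_belief}")
```

Passing the task means answering with Sally's (now false) belief, "basket", rather than with the marble's real location, "box". A second-order version would additionally require tracking what one agent believes about another agent's belief.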

ToM is not exclusively a cognitive competence: There are affective and socio-relational capacities that are operationalized through specific tasks, such as Reading the Mind in the Eyes (RME),¹² Strange Stories,¹³ and the Faux Pas Test.¹⁴ These tests measure the ability to perceive and interpret not only cognitive mental states but also emotions and the social context that modulates interactions. The distinction between cognitive and affective ToM reflects the complexity of the human mind, capable of managing not only epistemic states but also the emotional component of social interactions.^15,16

The relationship between ToM and language is especially critical and has been extensively investigated.^17–19 Linguistic narratives not only make mental states explicit but also provide a context where these states can be discussed, examined, and understood more profoundly.^20,21 This relationship between ToM and language is particularly relevant to the study of LLMs, as these models operate primarily through language-based processes.

The implications are twofold: First, ToM learning in LLMs is constrained to linguistic interactions, whereas in humans, ToM development is deeply rooted in experiential and social learning. Second, the human capacity to handle more complex linguistic forms (such as conversational implicatures and complex communication) reflects the deeper cognitive and affective understanding required for ToM. Early studies^22,23 have shown that LLMs perform relatively well mainly on simpler tasks, which is consistent with the notion that these models may struggle with more socially nuanced and complex interactions.

This systematic review aims to examine studies that tested the ability of LLMs to simulate ToM, analyzing whether these simulations can be considered a form of AToM. The experiments conducted, methodologies, and results will be discussed, with the goal of clarifying one of the most fascinating and controversial topics in the field of AI.

Methods

Protocol

This systematic review was conducted in alignment with the guidelines outlined by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), aiming to create a comprehensive overview of developments at the intersection of ToM and LLMs. Data extraction occurred on October 1, 2024, focusing on peer-reviewed articles from journals and conference proceedings written in English that (a) compared AToM performances in LLMs with the performances of human participants and (b) evaluated AToM capabilities in LLMs without direct human comparisons. Both experimental and quasi-experimental designs were considered, while theoretical and position papers, as well as qualitative papers without clear assessment criteria, were excluded (see “Inclusion and Exclusion Criteria” section for details). Relevant articles were identified through searches across multiple bibliographic databases, including Scopus, Web of Science, ACM Digital Library, and IEEE Xplore. Initial search strategies were developed by F.M. and refined collaboratively through team discussions, with search results exported into Rayyan for systematic removal of duplicates by F.M. The search strategy involved the query: (“theory of mind” OR ToM) AND (“Large Language Models” OR LLMs OR ChatGPT), adapted as necessary for each database, specifically including “ChatGPT” to capture the most relevant studies. A two-phase screening process was used for references generated by the search. In the first phase, two authors independently reviewed titles and abstracts to exclude studies that did not meet the eligibility criteria, resolving any disagreements through discussion. In the second phase, the same authors independently reviewed the full texts of articles that passed the initial screening. After the reviews were completed, any conflicts were addressed through consensus. The initial search yielded 419 articles, of which 252 were removed as duplicates, and 137 were excluded during the first screening phase based on title and abstract review. Ultimately, 30 articles were retrieved in full text. Following a detailed assessment of the full texts, 20 articles met the inclusion criteria and were incorporated into the final review (see Table 1).
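The screening counts reported above can be checked with simple arithmetic. The sketch below is only a consistency check, not part of the review protocol: the function name `prisma_flow` and the dictionary keys are hypothetical, and the intermediate values (records screened, full-text exclusions) are derived from the reported figures rather than stated in the text.

```python
def prisma_flow(identified: int, duplicates: int,
                excluded_screening: int, included: int) -> dict:
    """Derive the intermediate counts of a simple PRISMA-style flow."""
    screened = identified - duplicates           # records left after de-duplication
    full_text = screened - excluded_screening    # records assessed in full text
    excluded_full_text = full_text - included    # records excluded at full-text stage
    return {
        "identified": identified,
        "screened": screened,
        "full_text": full_text,
        "excluded_full_text": excluded_full_text,
        "included": included,
    }


# Figures reported in the Protocol section.
flow = prisma_flow(identified=419, duplicates=252,
                   excluded_screening=137, included=20)
for stage, count in flow.items():
    print(f"{stage}: {count}")
```

Under these assumptions, the derived full-text count (30) matches the number of articles reported as retrieved in full text, which implies 10 exclusions at the full-text stage.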

Table 1.

Overall Information of the Included Studies

Paper (first authors)	LLMs	LLMs vs. Human	Samples	ToM tasks	Modification level of ToM tasks^a	Ad hoc dataset
Brunet-Gouet et al.²²	ChatGPT-3.5, ChatGPT-4	No	N/A	Hinting Task, False Belief, Strange Stories	3	No
Marchetti et al.²³	ChatGPT-3.5	No	N/A	Sally-Anne, Ice-Cream-Van task, Third-order false belief	1	No
Ünlütabak et al.²⁴	GPT-3.5, GPT-4	Yes	N/A	Unexpected Contents, Unexpected Location Transfer, Second-Order ToM	3	No
Trott et al.²⁵	GPT-3	Yes	1,156 participants (age not specified)	False Belief	2	No
Jones et al.²⁶	GPT-3 (different versions)	Yes	Independent groups were used for each task (age not specified, except for three groups, which are reported to consist of undergraduate students)	False Belief Task, Recursive Mindreading, Short Story Task, Strange Stories, Indirect Request, Scalar Implicature	2	No
Strachan et al.²⁷	GPT-4, GPT-3.5, LLaMA2 (three versions)	Yes	1,907 participants (18–70 years)	False Belief, Faux Pas, Hinting Task, Strange Stories, Irony	2	No
Li et al.²⁸	ChatGPT-3.5, GPT-4	No	N/A	Sally-Anne task (ad hoc stories inspired by Sally-Anne for multiagent interactions)	4	No
van Duijn et al.²⁹	Falcon (7B), LLaMA (30B), GPT-davinci (175B), BLOOM (176B), GPT-3, GPT-3.5-Turbo, PaLM2 (175-340B), GPT-4	Yes	73 children (7–10 years)	Sally-Anne task (first and second order), Strange Stories, Imposing Memory Test	3	No
Jin et al.³⁰	GPT-3.5, GPT-4V, GPT-4, LLaMA2	Yes	180 participants (age not specified)	MMToM-QA (ad hoc stories from videos)	4	Yes
Verma et al.³¹	GPT-4, GPT-3.5-Turbo	Yes	125 participants (18–60 years)	ToM-PROBE task (ad hoc social and nonsocial situations)	4	Yes
Gandhi et al.³²	GPT-4, GPT-3.5 (different versions), Claude (different versions), LLaMA	No	N/A	BigToM (ad hoc dataset of stories based on false belief stories logical structure)	4	Yes
He et al.³³	GPT-3.5, GPT-4, Claude, Guanaco	No	N/A	HI-TOM (ad hoc dataset of stories based on false belief stories logical structure)	4	Yes
Ma et al.³⁴	GPT-3.5 (different versions), GPT-4	No	N/A	ToMChallenges (ad hoc dataset of stories based on Sally-Anne and Smarties tasks logical structure)	3	Yes
Sap et al.³⁵	GPT-3 (different versions)	No	N/A	SOCIALIQA, TOMI	3	Yes
Wilf et al.³⁶	LLaMA2 (two versions), GPT-3.5-Turbo, GPT-4	No	N/A	ToMI, BigToM (from Gandhi et al.³²)	3	Yes
Zhu et al.³⁷	Mistral, DeepSeek	No	N/A	BigToM (from Gandhi et al.³²)	4	Yes
Shapira et al.³⁸	Flan-T5 (different versions), Flan-UL2, GPT-3 (different versions), GPT-3.5, GPT-4, Jurassic2 (different versions)	No	N/A	ToMi′ and ToM-k (ad hoc dataset of stories based on false belief stories logical structure)	3	Yes
Ni et al.³⁹	GPT-4, GPT-3.5, ChatGLM, ERNIE-Bot 4.0, SparkDesk	No	N/A	Ad hoc sets of games: Guess Number, Auction, Who Is Chameleon	4	Yes
Xu et al.⁴⁰	LLaMA2, Mixtral, GPT-3.5, GPT-4	No	N/A	OpenToM (stories generated by LLMs)	4	Yes
Kim et al.⁴¹	GPT-4, GPT-3.5-Turbo, InstructGPT, Flan-T5 (XL, XXL), Flan-UL2, Falcon (7B, 40B), Mistral (7B), Zephyr (7B), LLaMA-2 Chat (70B)	Yes	Not specified	FANTOM (ad hoc dataset of social stories)	4	Yes

^a Modification level of ToM tasks: 1 = No modification: ToM tasks are identical to the classical versions from the literature; 2 = Minimal modification: ToM tasks have minor adjustments that do not significantly change their structure; 3 = Substantial modification: ToM tasks have major changes, altering their format or content considerably; 4 = Ad hoc tasks: ToM tasks were created specifically for the study, with no classical equivalent.

LLMs, Large Language Models; ToM, Theory of Mind.
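The coding scheme in Table 1 lends itself to a quick tally. The following sketch transcribes only three columns of the table (reference number, modification level, and whether an ad hoc dataset was used); the variable names are hypothetical and the tally is offered as a reading aid under that transcription, not as an analysis performed in the review.

```python
from collections import Counter

# (reference number, modification level, ad hoc dataset?) transcribed from Table 1.
STUDIES = [
    (22, 3, False), (23, 1, False), (24, 3, False), (25, 2, False),
    (26, 2, False), (27, 2, False), (28, 4, False), (29, 3, False),
    (30, 4, True), (31, 4, True), (32, 4, True), (33, 4, True),
    (34, 3, True), (35, 3, True), (36, 3, True), (37, 4, True),
    (38, 3, True), (39, 4, True), (40, 4, True), (41, 4, True),
]

LEVEL_LABELS = {
    1: "No modification",
    2: "Minimal modification",
    3: "Substantial modification",
    4: "Ad hoc tasks",
}

# Count studies per modification level and per dataset type.
by_level = Counter(level for _, level, _ in STUDIES)
ad_hoc_count = sum(1 for _, _, ad_hoc in STUDIES if ad_hoc)
classical_count = len(STUDIES) - ad_hoc_count

for level in sorted(LEVEL_LABELS):
    print(f"{level} ({LEVEL_LABELS[level]}): {by_level[level]} studies")
print(f"Ad hoc datasets: {ad_hoc_count}; classical tests: {classical_count}")
```

On this transcription, only one of the 20 studies used unmodified classical tasks, while 16 used tasks coded as substantially modified or ad hoc.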

Inclusion and exclusion criteria

The aim of this systematic review was to examine the AToM capabilities of LLMs by synthesizing studies that assessed their performance on ToM tasks. Selected studies had to (1) assess the performance of LLMs on ToM tasks and/or (2) compare the performance of LLMs with human participants. Studies were required to include experimental or quasi-experimental methodologies, allowing for assessment and comparison of results. Inclusion criteria were as follows:

AToM assessment in LLMs: Studies had to assess the performance of LLMs on ToM tasks, regardless of the LLMs tested.

Comparative studies between LLMs and human participants: Studies that directly compared human and LLM performance on ToM tasks.

Experimental or quasi-experimental design: Studies had to present results that allow for systematic analysis of the performance of LLMs on ToM tasks.

After the initial screening, articles were excluded based on the following criteria:

Theoretical and position articles: Studies that discussed ToM in LLMs from a conceptual perspective without providing empirical validation (e.g., Ma et al.⁴²).

Studies that referred to ToM without an assessment of the psychological construct: Articles that mentioned ToM as a theoretical framework but did not test it through ToM tasks (e.g., Yongsatianchot et al.⁴³).

Qualitative studies and case studies without clear assessment criteria for LLM performance on ToM tasks: Studies that relied on case-based observations without systematic methods to assess LLMs’ performance (e.g., Milička et al.⁴⁴ and Sileo and Lernould⁴⁵).

A detailed flowchart outlining the study selection process, in accordance with PRISMA guidelines, is shown in Figure 1.

FIG. 1.

PRISMA 2020 flow diagram. PRISMA, Preferred Reporting Items for Systematic Reviews and Meta-Analyses.

Results

As this interdisciplinary research area is recent, the results are structured to provide both a descriptive and an analytical overview. The first part presents an outline of methodological aspects, highlighting significant trends that characterize the procedures adopted in this research area. The second part provides a summary of the main findings, focusing in particular on (1) the AToM performance of LLMs through ToM evaluation and (2) direct comparisons of performance between LLMs and human participants in ToM assessment.

Evaluating ToM in LLMs: Classical tests vs. Ad hoc datasets

To evaluate ToM capabilities in LLMs, researchers have adopted two main distinct methodological approaches: Eight studies used well-established tests from the ToM psychological research, and 12 studies developed new datasets inspired by the logical structure of classical tests but tailored to the specific characteristics of LLMs.

Testing through the classical ToM tests

All eight studies analyzed False Belief Tasks, with a predominant focus on first-order reasoning.^22–29 Among them, three studies also assessed second-order false belief,^23,24,29 and one study included third-order false belief tasks,²³ highlighting a specific interest in higher-order recursive thinking. Beyond false belief tasks, four studies incorporated advanced ToM assessments.^22,26,27,29 Specifically, two studies evaluated tasks requiring the understanding of intentions behind indirect comments (i.e., the Hinting Task).^22,27 Moreover, four studies incorporated assessments that evaluate sarcasm, irony, bluffing, and double bluffing, providing a more nuanced perspective on AToM. These include Strange Stories,^22,26,27,29 the Faux Pas Test,²⁷ and an irony recognition task,²⁷ all of which require processing advanced ToM components, where mental state inference relies on complex and context-dependent social logic. One study also used the Short Story Task, Recursive Mindreading, Indirect Request, and Scalar Implicature tasks,²⁶ which assess different forms of inferential reasoning.

Ad hoc datasets for LLMs

Twelve studies focused on the development of ad hoc datasets to assess AToM in LLMs. Of these, only two studies compared them with humans.^30,31 Eight studies adopted false belief logic, adapting Sally-Anne or Smarties paradigms to higher-order reasoning based on unexpected transfer.^32–39 One study extended its dataset (i.e., HI-TOM) to fourth-order recursive reasoning.³³ Generally, these datasets introduce narrative variations to avoid repetition and test robustness across different scenarios, as in ToMChallenges.³⁴ Among them, BigToM,³² a large-scale benchmark for first- and second-order reasoning, was used in two other studies.^36,37 One study extended the assessment with longer narratives and various ToM reasoning tasks with OpenToM.³⁹ One study used the ToMi dataset to refine how first- and second-order prompts were presented to LLMs, leading to the development of ToMi′.³⁸

In addition to traditional false belief logic, four studies expanded AToM assessment into new domains. One study applied false belief reasoning to robot behavior in domains such as urban search and rescue.³¹ One study used a multimodal dataset, the MMToM-QA, with video and text to assess the attribution of mental states in domestic contexts.³⁰ One study designed a game-based dataset to test multi-agent social reasoning,⁴⁰ while one study developed FANTOM that focuses on the identification of “illusory ToM,” in which LLMs simulate comprehension without true inferential reasoning.⁴¹

Comparison of LLMs

Comparisons of different versions of GPT models

Sixteen studies showed that GPT-4 performed better than GPT-3, mainly in second-order belief tasks, pragmatic reasoning, and complex social inferences. In first-order false belief tasks, four studies showed similar performance between GPT-3 and GPT-4,^24,28,29,34 whereas three studies reported that GPT-4 outperformed GPT-3 in tasks requiring higher inferential reasoning.^22,27,38 Even in ad hoc datasets, four studies found GPT-4 performing better than GPT-3 in higher-order belief inferences^32,36 and in tracking mental states.^34,39 In implicit inference tasks, one study showed GPT-4 outperforming GPT-3 in scenarios based on social games (i.e., not directly using ToM tasks).⁴⁰ In pragmatic reasoning and indirect inference tasks, two studies found GPT-4 performing similarly to GPT-3, especially in tasks characterized by contextual ambiguity.^26,31 Although GPT-4 showed improvements in intentional explanations and belief tracking,³¹ two studies highlighted its limitations in tasks requiring long-term belief tracking.^30,41 This trend was particularly marked in one study testing deception and multi-agent interactions,³³ where the GPT-4 performance decreased beyond second-order beliefs.

Comparisons of different versions of non-GPT models

When comparing models beyond GPT, three studies showed that, although performance can vary across tasks for the LLaMA2 family, versions with more parameters generally proved more robust in most ToM tests.^27,36,39 A similar trend was observed in three studies comparing Flan and Jurassic,³⁸ as well as Falcon and Claude,^32,41 where larger parameter size significantly enhanced performance on ToM tasks. However, for Flan, there was a discrepancy between the findings of two studies,^38,41 possibly due to the use of the ad hoc datasets or variations in the Flan model versions tested.

Comparisons of different LLM families

In 10 studies, GPT-4 generally outperformed other LLMs, such as GPT-3, LLaMA2, Claude, Mistral, Falcon, Guanaco, Jurassic, ERNIE-Bot, SparkDesk, PaLM, and Flan, especially on tasks involving false belief.^27,29,30,32,33,36,38–41 Two studies, however, showed that fine-tuning of the models can improve performance, as in the case of GPT-3³⁶ and Flan.⁴¹ While one study found that Flan performed similarly to GPT-4 on simpler ToM tasks,³⁸ Claude and Falcon performed worse than both GPT models.⁴¹ Some models, including Flan and Jurassic, showed moderate success but were unable to match the performance of GPT-4 in more complex reasoning tasks.³⁸ Studies testing Mistral and LLaMA showed that while Mistral handled first-order false beliefs better, both models struggled with second-order tasks.³⁹ Interestingly, GPT-4V, a multimodal model, approached the GPT-4 performance in some tasks but still showed limitations in tracking higher-order beliefs and goal inference.³⁰

Comparative performance: Humans vs. LLMs

Six studies directly compared the performance of LLMs with humans. A key distinction emerges between studies focusing on false belief tasks and those assessing pragmatic and advanced ToM components. Three studies found that LLMs and humans performed similarly on first-order false belief tasks, but LLMs struggled with second-order attributions.^29,30,41 One study showed that LLMs matched the performance of 7- to 10-year-olds in first-order tasks but were outperformed in second-order tasks and Strange Stories.²⁹ Beyond false belief tasks, two studies found that, compared to humans, LLMs performed poorly in attributing beliefs in tasks characterized by ambiguous and inconsistent reasoning.^26,31 Interestingly, one study found that LLaMA2 performed better than humans in the Faux Pas Test, probably due to a heuristic bias to detect social violations, whereas GPT-4 showed better performance in understanding irony and Strange Stories.²⁷

Discussion

This review underscores the significant gap between human ToM and the AToM simulated by LLMs. While LLMs, particularly GPT-4, demonstrate good performance in simpler tasks (e.g., first-order false belief), they encounter significant challenges in more complex social reasoning tasks (e.g., second-order false belief), where humans consistently outperform them. Furthermore, adaptations to classical ToM tests and the creation of ad hoc datasets tailored to the specific LLMs’ capabilities may neglect their inherent limitations in social understanding.

The crucial role of ToM tests and the risk of their adaptations

A key issue is the nature of the ToM tests used to evaluate LLMs. Many studies have adapted classical tests,^22–24,26,27 such as the Sally-Anne Test or the Faux Pas Test, or developed new datasets specifically tailored to the capabilities of LLMs.^30–33,38,39,41 While these adaptations are often necessary to make the tasks compatible with LLM frameworks, they risk oversimplifying the tasks by eliminating the inherent ambiguities present in human social interactions. Consequently, LLMs may seem to perform well on these tasks, but their success largely reflects an ability to recognize and replicate linguistic patterns rather than an understanding of mental states. For instance, in the study by van Duijn et al.,²⁹ GPT-4 outperformed children in the Imposing Memory Test, but this was a task originally designed for adolescents and adapted for children without proper validation. Such modifications can artificially inflate the performance of LLMs, creating an “illusion of understanding.” This epistemological concern is critical, as the apparent accuracy of LLMs in such tests may mislead researchers into believing these models possess human-like inferential capabilities when, in fact, they do not.

Evolution and limitations of LLMs: A comparison with human development

LLMs exhibit significant variability in their architecture, training data, and the refinements introduced across successive versions. Models like GPT-4, Claude, and LLaMA each rely on distinct training methods, leading to incremental improvements in their linguistic capabilities. This progression can be loosely compared to human ToM development, where experiential learning shapes mental abilities. While humans develop through rich, real-world interactions (both social and physical), LLMs are confined to text-based learning. Recent advancements in multimodal LLMs, such as GPT-4V, have integrated visual and spatial information, improving performance on tasks that require contextual reasoning.³⁰ Although this model performed better on certain tasks, its AToM still falls short of human performance. In particular, despite processing richer inputs, GPT-4V does not outperform the text-based GPT-4 in tasks requiring complex social inferences, suggesting that additional input modalities do not yet equate to a deeper understanding of mental states. Multimodal models are marking progress in expanding AToM evaluations beyond purely linguistic tasks. As research progresses, the illusory similarity between ToM and AToM will diminish, as advancements will enable the integration into a single LLM of multimodal sensory information, cognitive inferences, reasoning heuristics, and the ability to grasp implicit contextual information, key components of human ToM. However, it is important to emphasize that the current state of research in this field has not yet achieved this goal.

Mimicking vs. Understanding: The “illusion of comprehension” in LLMs

A central issue in the evaluation of AToM in LLMs lies in distinguishing between understanding of mental states and mere linguistic pattern recognition. LLMs like GPT-4 succeed mainly in linguistically structured tasks, such as false belief tests, but these do not capture the full complexity of human mental state reasoning. The strong performance of LLMs in such tasks can be attributed to their ability to recognize and replicate patterns from their training data, rather than any internal comprehension of the opacity of the mind, the possible differences between perspectives, and the subjectivity of mental states. This success in linguistic tasks can create the illusion that LLMs are capable of AToM, but their abilities are largely confined to tasks that are inherently easier for models trained on language data. Humans, by contrast, demonstrate a more robust and flexible understanding of mental states, particularly when they are depicted in socially complex and emotionally ambiguous scenarios. In these contexts, humans—both adults and children—outperform LLMs, particularly in tasks that involve recursive beliefs or the interpretation of social cues.^25–27,29 This performance gap highlights that, while LLMs perform well at basic, language-driven tasks, they struggle significantly with the real-world social interactions that require more than pattern recognition. Compounding this issue is the fact that many of the tests used to evaluate LLMs are often adapted or simplified to fit their capabilities. Tasks that are originally designed to assess human ToM are frequently modified to suit the limitations of LLMs, thereby reducing the complexity and ambiguity that are critical to social cognition. In some cases, entirely new datasets have been created specifically for LLM evaluation, which are based on the structure of false belief tests but tailored to the models’ strengths and constraints. 
While these ad hoc datasets allow for greater experimental flexibility, they can deviate from the original purpose of the tests and dilute the depth of social reasoning they are meant to assess. These adaptations compromise both the experimental and ecological validity of the tests, leading to results that suggest LLMs are more capable than they truly are. In essence, researchers may unintentionally contribute to the “illusion of understanding” by designing tasks that LLMs can solve, rather than maintaining the original complexity of human social interaction and finding the way to test LLMs on it. This phenomenon reflects a more fundamental aspect of the “illusion of comprehension”: Not only are LLMs mimicking mentalistic understanding without truly engaging in mental state reasoning, but the tests used to assess them are often tailored to align with their inherent strengths. Thus, their success on these tests does not reflect AToM abilities, but rather an aptitude for navigating simplified, linguistically structured tasks, reinforcing the illusion that they possess human-like mental state comprehension.

Another critical aspect in the comparison between ToM and AToM is that human ToM is a developmental skill, typically studied in children as it evolves over time. However, most AToM studies are conducted with adult participants,^26,27,30,31 overlooking the dynamic and developing nature of ToM. Moreover, many studies fail to report detailed demographic information, such as the age or individual characteristics of human participants, making comparisons with LLMs even more challenging. This issue is compounded by the fact that many datasets used for LLM evaluation are ad hoc and specifically designed for these models, which may not align well with human ToM development. Even more concerning is that, despite the creation of new datasets, only two studies have compared LLM performance with humans, which limits the validity of the tests.^30,31 Without these direct human comparisons, the assessments risk overestimating the AToM capabilities of LLMs and further neglecting the differences between artificial and human social reasoning.

It is not all bad: A cross-fertilization between domains

The study of ToM from a linguistic perspective raises epistemological questions about the nature of social cognition and its assessment. Although important connections between ToM development and language development have been highlighted (e.g., Siegal and Varley⁴⁶ and De Villiers and De Villiers⁴⁷), ToM is not reducible to mere linguistic processing.^10,48 Other cognitive dimensions (e.g., executive functions^48,49) are also involved in reasoning about mental states, and cold-ToM (i.e., recursive thinking) represents only a part of the story, since desires and emotions are elements of ToM as well.^10,49 The key issue is that LLMs, solving ToM tasks, do not engage in the social experience that underpins human ToM. As demonstrated extensively in research, ToM is deeply rooted in embodied interactions that allow humans to navigate the social world.⁵⁰ In this sense, false belief tests widely used in AToM should be conceived not as capturing the overall sense of mentalistic functioning but rather as representing a structured and synthetic way of analyzing the explicit inferential process. In fact, ToM research in psychology has long moved beyond false belief classical tasks, incorporating more naturalistic paradigms and jointly considering cognitive and affective features.¹⁰ It seems that the AToM studies are moving along a trajectory already covered by the human ToM studies. In fact, AToM research is beginning to use more naturalistic and dynamic tests. Despite their textual nature, LLMs can be tested in contexts that require multimodal integration, extended narratives, or interactive settings.^30,40 This shift highlights a shared need among psychology and AI research to refine ToM assessments in ways that differentiate authentic social reasoning from pattern-based responses. 
Moving forward, interdisciplinary efforts could improve both fields: AI research could benefit from more psychologically grounded benchmarks, while ToM research could expand its methodological approaches to better distinguish linguistic heuristics from the whole mental state inference process.

A final terminological note

A small concluding note, particularly relevant when discussing language: The calls for caution in comparing ToM and AToM, and the attention to terminological choices when making such comparisons, should extend beyond specialist circles to the broader media context. This is crucial if we want to prevent the “illusion of comprehension” from becoming “illusionism.” As Crawford⁵¹ illustrated through the story of the horse Hans, which seemed to know how to count but was merely responding to external cues, we must be vigilant in how we communicate the capabilities of LLMs to avoid misleading narratives.

Conclusion and Limitations

This review highlights that while LLMs generally perform well on first-order false belief tasks, they struggle with more complex reasoning, such as higher-order recursive inferences and advanced ToM assessments (e.g., irony comprehension). Furthermore, the variability in AToM assessments raises concerns about comparability with ToM. Most AToM assessments remain text-based, neglecting the embodied and multimodal aspects that are fundamental to ToM. Across studies, human participants consistently outperform LLMs, reinforcing the gap between ToM and AToM. Recent advances in multimodal LLMs suggest a promising direction for AToM development. Future research should investigate whether multimodal integration improves AToM or simply sharpens response patterns without addressing the gap between ToM and AToM.

More broadly, we propose a cross-fertilization between AI and ToM research. ToM studies can offer computer science greater methodological and interpretive rigor for the assessment of AToM in LLMs. In turn, the study of AToM prompts psychologists and cognitive scientists to update ToM studies by expanding knowledge beyond classical boundaries, and it provides a rare epistemological opportunity to simulate psychological functioning. This two-way exchange, if fostered, has the potential to redefine the way social cognition is studied in both humans and artificial systems, offering new perspectives on the nature and measurement of ToM.

A limitation of this review stems from the novelty of this research field, with the first empirical study published in 2023.²² Consequently, the number of analyzed studies remains small. Future research should incorporate more recent publications and track the ongoing development of this rapidly evolving area.

Footnotes

Authors’ Contributions

A.M., F.M., and D.M. conceived the review. F.M. extracted the data. F.M. and D.M. conducted the screening. A.M., F.M., G.R., A.G., and D.M. contributed to the writing of the article.

Author Disclosure Statement

The authors declare no competing interests.

Funding Information

This research and its publication are supported by the research line (funds for research and publication) of the Università Cattolica del Sacro Cuore of Milan and for F.M. by “PON REACT EU DM 1062/21 57-I-999-1: Artificial agents, humanoid robots and human-robot interactions” funding of the Università Cattolica del Sacro Cuore of Milan. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the article.

References

1. Filippucci, Gal, Jona-Lasinio, et al. The Impact of Artificial Intelligence on Productivity, Distribution and Growth: Key Mechanisms, Initial Evidence and Policy Challenges. OECD Artificial Intelligence Papers. OECD Artificial Intelligence Papers; 2024; doi: 10.1787/8d900037-en

2. Floridi. On the future of content in the age of artificial intelligence: Some implications and directions. Philos Technol, 2024; 37(3):112; doi: 10.1007/s13347-024-00806-z

3. Premack, Woodruff. Does the chimpanzee have a theory of mind? Behav Brain Sci, 1978; 1(4):515–526; doi: 10.1017/S0140525X00076512

4. Wimmer, Perner. Beliefs about beliefs: Representation and constraining function of wrong beliefs in young children’s understanding of deception. Cognition, 1983; 13(1):103–128; doi: 10.1016/0010-0277(83)90004-5

5. Frith, Frith. The neural basis of mentalizing. Neuron, 2006; 50(4):531–534; doi: 10.1016/j.neuron.2006.05.001

6. Perner, Wimmer. “John thinks that Mary thinks that…” attribution of second-order beliefs by 5- to 10-year-old children. Journal of Experimental Child Psychology, 1985; 39(3):437–471; doi: 10.1016/0022-0965(85)90051-7

7. Frith, Frith. Development and neurophysiology of mentalizing. Frith CD, Wolpert DM, eds. Phil Trans R Soc Lond B, 2003; 358(1431):459–473; doi: 10.1098/rstb.2002.1218

8. Hughes, Leekam. What are the links between theory of mind and social relations? Review, reflections and new directions for studies of typical and atypical development. Social Development, 2004; 13(4):590–619; doi: 10.1111/j.1467-9507.2004.00285.x

9. Wellman, Cross, Watson. Meta-analysis of theory-of-mind development: The truth about false belief. Child Dev, 2001; 72(3):655–684; doi: 10.1111/1467-8624.00304

10. C-L, Wellman. A meta-analysis of sequences in theory-of-mind understandings: Theory of mind scale findings across different cultural contexts. Developmental Review, 2024; 74:101162; doi: 10.1016/j.dr.2024.101162

11. Valle, Massaro, Castelli, et al. Theory of mind development in adolescence and early adulthood: The growing complexity of recursive thinking ability. Eur J Psychol, 2015; 11(1):112–124; doi: 10.5964/ejop.v11i1.829

12. Baron‐Cohen, Wheelwright, Hill, et al. The “Reading the Mind in the Eyes” test revised version: A study with normal adults, and adults with Asperger syndrome or high‐functioning autism. Child Psychology Psychiatry, 2001; 42(2):241–251; doi: 10.1111/1469-7610.00715

13. Happé FGE. An advanced test of theory of mind: Understanding of story characters’ thoughts and feelings by able autistic, mentally handicapped, and normal children and adults. J Autism Dev Disord, 1994; 24(2):129–154; doi: 10.1007/BF02172093

14. Baron-Cohen, Jolliffe, Mortimore, et al. Another advanced test of theory of mind: Evidence from very high functioning adults with autism or Asperger syndrome. J Child Psychol Psychiatry, 1997; 38(7):813–822; doi: 10.1111/j.1469-7610.1997.tb01599.x

15. Brothers, Ring. A neuroethological framework for the representation of minds. J Cogn Neurosci, 1992; 4(2):107–118; doi: 10.1162/jocn.1992.4.2.107

16. Paal, Bereczkei. Adult theory of mind, cooperation, Machiavellianism: The effect of mindreading on social relations. Personality and Individual Differences, 2007; 43(3):541–551; doi: 10.1016/j.paid.2006.12.021

17. Giovanelli, Di Dio, Lombardi, et al. Exploring the relation between maternal mind‐mindedness and children’s symbolic play: A longitudinal study from 6 to 18 months. Infancy, 2020; 25(1):67–83; doi: 10.1111/infa.12317

18. Meins, Fernyhough, Wainwright, et al. Maternal mind–mindedness and attachment security as predictors of theory of mind understanding. Child Dev, 2002; 73(6):1715–1726; doi: 10.1111/1467-8624.00501

19. Milligan, Astington, Dack. Language and theory of mind: Meta‐analysis of the relation between language ability and false‐belief understanding. Child Dev, 2007; 78(2):622–646; doi: 10.1111/j.1467-8624.2007.01018.x

20. Astington, Baird, eds. Why Language Matters for Theory of Mind. Oxford University Press; 2005; doi: 10.1093/acprof:oso/9780195159912.001.0001

21. De Villiers. The interface of language and theory of mind. Lingua, 2007; 117(11):1858–1878; doi: 10.1016/j.lingua.2006.11.006

22. Brunet-Gouet, Vidal, Roux. Can a Conversational Agent Pass Theory-of-Mind Tasks? A Case Study of ChatGPT with the Hinting, False Beliefs, and Strange Stories Paradigms. In: International Conference on Human and Artificial Rationalities. Springer Nature Switzerland: Cham; 2023; pp. 107–126.

23. Marchetti, Di Dio, Cangelosi, et al. Developing ChatGPT’s theory of mind. Front Robot AI, 2023; 10:1189525; doi: 10.3389/frobt.2023.1189525

24. Ünlütabak, Bal. Theory of mind performance of large language models: A comparative analysis of Turkish and English. Computer Speech & Language, 2025; 89:101698; doi: 10.1016/j.csl.2024.101698

25. Trott, Jones, Chang, et al. Do large language models know what humans know? Cogn Sci, 2023; 47(7):e13309; doi: 10.1111/cogs.13309

26. Jones, Trott, Bergen. Comparing humans and large language models on an Experimental Protocol Inventory for Theory of Mind Evaluation (EPITOME). Transactions of the Association for Computational Linguistics, 2024; 12:803–819; doi: 10.1162/tacl_a_00674

27. Strachan JWA, Albergo, Borghini, et al. Testing theory of mind in large language models and humans. Nat Hum Behav, 2024; 8(7):1285–1295; doi: 10.1038/s41562-024-01882-z

28. , Chong, Stepputtis, et al. Theory of Mind for Multi-Agent Collaboration via Large Language Models. In: EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings; 2023; pp. 180–192.

29. van Duijn, van Dijk, Kouwenhoven, et al. Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art Models vs. Children Aged 7-10 on Advanced Tests. In: CoNLL 2023 - 27th Conference on Computational Natural Language Learning, Proceedings; 2023; pp. 389–402.

30. Jin, Wu, Cao, et al. MMToM-QA: Multimodal Theory of Mind Question Answering. In: Findings of the Association for Computational Linguistics ACL; n.d.; pp. 16077–16102.

31. Verma, Bhambri, Kambhampati. Theory of Mind Abilities of Large Language Models in Human-Robot Interaction: An Illusion? In: ACM/IEEE International Conference on Human-Robot Interaction; 2024; pp. 36–45; doi: 10.1145/3610978.3640767

32. Gandhi, Fränken, Gerstenberg, et al. Understanding social reasoning in language models with language models. Advances in Neural Information Processing Systems, 2023:36.

33. , Wu, Jia, et al. HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models. In: Findings of the Association for Computational Linguistics: EMNLP 2023. 2023; pp. 10691–10706.

34. , Gao, Xu. ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind. In: CoNLL 2023 - 27th Conference on Computational Natural Language Learning, Proceedings. 2023; pp. 15–26.

35. Sap, LeBras, Fried, et al. Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs. In: Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing ACL; 2022; pp. 3762–3780.

36. Wilf, Lee, Liang, et al. Think Twice: Perspective-Taking Improves Large Language Models’ Theory-of-Mind Capabilities. In: ACL; 2023; pp. 8292–8308.

37. Zhu, Zhang, Wang. Language Models Represent Beliefs of Self and Others. In: Proceedings of Machine Learning Research. ML Research Press; 2024; pp. 62638–62681.

38. Shapira, Levy, Alavi, et al. Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models. In: EACL 2024 - 18th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference. 2024; pp. 2257–2273.

39. , Zhao, Zhu, et al. OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics ACL; 2024; pp. 8593–8623.

40. , Yu, Ma, et al. The social cognition ability evaluation of LLMs: A dynamic gamified assessment and hierarchical social learning measurement approach. ACM Trans Intell Syst Technol, 2024:3673238; doi: 10.1145/3673238

41. Kim, Sclar, Zhou, et al. FANTOM: A Benchmark for Stress-Testing Machine Theory of Mind in Interactions. In: EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings. 2023; pp. 14397–14413.

42. , Sansom, Peng, et al. Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models. In: Findings of the Association for Computational Linguistics: EMNLP 2023. 2023; pp. 1011–1031.

43. Yongsatianchot, Thejll-Madsen, Marsella. What’s Next in Affective Modeling? Large Language Models. In: 2023 11th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos. ACIIW; 2023; doi: 10.1109/ACIIW59127.2023.10388124

44. Milička, Marklová, VanSlambrouck, et al. Large language models are able to downplay their cognitive abilities to fit the persona they simulate. Hassani H, ed. PLoS One, 2024; 19(3):e0298522; doi: 10.1371/journal.pone.0298522

45. Sileo, Lernould. MindGames: Targeting Theory of Mind in Large Language Models with Dynamic Epistemic Modal Logic. In: Findings of the Association for Computational Linguistics: EMNLP 2023. 2023; pp. 4570–4577.

46. Siegal, Varley. Neural systems involved in “theory of mind.” Nat Rev Neurosci, 2002; 3(6):463–471; doi: 10.1038/nrn844

47. De Villiers, De Villiers. The role of language in theory of mind development. Topics in Language Disorders, 2014; 34(4):313–328; doi: 10.1097/TLD.0000000000000037

48. Rakoczy. Foundations of theory of mind and its development in early childhood. Nat Rev Psychol, 2022; 1(4):223–235; doi: 10.1038/s44159-022-00037-z

49. Perner, Lang. Development of theory of mind and executive control. Trends Cogn Sci, 1999; 3(9):337–344; doi: 10.1016/S1364-6613(99)01362-5

50. Gallese. Before and below ‘theory of mind’: embodied simulation and the neural correlates of social cognition. Philos Trans R Soc Lond B Biol Sci, 2007; 362(1480):659–669; doi: 10.1098/rstb.2006.2002

51. Crawford. Atlas of AI: Power, Politics, and the Planetary Costs of Artificial Intelligence. Yale University Press: New Haven; 2021.