Abstract
The Reading the Mind in the Eyes Test (RMET) is the most popular adult measure of individual differences in theory of mind. We present a meta-analytic investigation of the test’s psychometric properties (k = 119 effect sizes, 61 studies, ntotal = 8,611 persons). Using random effects models, we found the internal consistency of the test was acceptable (α = .73). However, the RMET was more strongly related to emotion perception (r = .33, ρ = .48) than to alternative theory of mind measures (r = .29, ρ = .39), and weakly to moderately related to vocabulary (r = .25, ρ = .32), cognitive empathy (r = .14, ρ = .20), and affective empathy (r = .13, ρ = .19). Overall, we conclude that the RMET operates as an emotion perception measure rather than as a theory of mind measure, which challenges the interpretation of RMET results.
“The eyes are the window to the soul.”
The Reading the Mind in the Eyes Test (RMET; Baron-Cohen et al., 2001) is a popular measure of individual differences in theory of mind (ToM; 7,110 citations according to Google Scholar as of August 2020). This test is frequently used in clinical contexts (e.g., 6,170 citations in the area of autism, 3,210 citations with regard to schizophrenia), and researchers have found that the test discriminates persons on the autism spectrum (Baron-Cohen et al., 2001), schizophrenic patients (Bora et al., 2009), and alcoholics (Maurage et al., 2011) from controls.
However, despite its immense popularity, a large-scale evaluation of the test’s psychometric properties has not yet been conducted. Such an evaluation is particularly important given recent concerns about the test’s internal consistency and factor structure (e.g., Olderbak et al., 2015), as well as its construct validity (e.g., Oakley et al., 2016). To this end, we present the first large-scale evaluation of the test’s internal consistency and its construct validity, comparing performance on the test with performance on other measures of ToM (to evaluate its convergent validity), as well as measures of emotion perception ability, vocabulary, cognitive empathy, and affective empathy (to evaluate its discriminant validity).
Theory of Mind Measurement
To evaluate the design of the RMET, we will briefly review the definition of ToM and the general design of ToM measures.
ToM has been researched in a variety of populations, including chimpanzees (Premack & Woodruff, 1978), children (Wellman et al., 2001), and clinical populations (Baron-Cohen et al., 2001). Because ToM is researched across a variety of disciplines, there is no clear model of its structure (Quesque & Rossetti, 2020). And while there are numerous terms for ToM, and idiosyncratic definitions, researchers tend to generally agree on a single definition. ToM is the “ability to reason about mental states, such as beliefs, desires, and intentions, and to understand how mental states feature in everyday explanations and predictions of people’s behavior” (Apperly, 2012, p. 836). Thus, ToM is conceptually closely connected with cognitive empathy and often considered a similar or equivalent construct (Lawrence et al., 2004) since both constructs involve inferring the mental state of others.
Given the complex definition of ToM, and that it is a complex cognitive ability (Schaafsma et al., 2015), many tests have been designed to assess specific components of ToM. We have provided a brief description of some of the more popular measures in Table 1, but see Achim et al. (2013) or Quesque and Rossetti (2020) for a more comprehensive list. Each measure, including the RMET, is considered to be a maximal effort measure with participants asked to perform their best, and performance is compared with a veridical response standard (Cronbach, 1949).
Table 1. Overview of Common Theory of Mind Tests (in Order of Publication Date).
Note. Additional theory of mind tests include (in order of publication date): False Belief Task (Frith & Corcoran, 1996), Comic strips (Character Intention Task; Sarfati et al., 1997), Theory of Mind Stories (Happé et al., 1999), Picture Sequencing Task (Langdon & Coltheart, 1999), The Awareness of Social Inference Test (TASIT Part 2: Social Inference-Minimal, TASIT Part 3: Social Inference-Enriched; S. McDonald et al., 2003), Cartoon Task (Brüne, 2003), Versailles-Situational Intention Reading (V-SIR; Bazin et al., 2009), social cognition tasks (mentalistic interpretation and social problem-solving; Channon & Crawford, 2010).
However, the tasks are very different in their design, instructions, choice of stimuli, and question format (Achim et al., 2013). Thus, it is probably not surprising that there is weak convergence among these measures, as was illustrated in a recent meta-analysis (Kirkland et al., 2012). For example, the RMET as well as the Movie for the Assessment of Social Cognition (MASC; Dziobek et al., 2006) differ conceptually from other ToM tests. Both require viewing emotional faces, presupposing face cognition and perceptual abilities that are based in fluid intelligence, whereas other measures typically consist of longer reading passages, requiring reading comprehension skills that are based in crystallized intelligence (Sternberg & Kaufman, 1998). Additionally, the tests represent different conceptualizations of ToM. For example, the Faux Pas test assesses awareness and understanding of faux pas, while the MASC assesses an understanding of social situations.
On a more critical note, a recent review argued that many ToM tests assess lower level abilities that support ToM but also support other abilities (e.g., emotion understanding). As a result, many tests do not require participants to recognize a mental state and do not necessitate individuals distinguishing their own mental state from that of the target (Quesque & Rossetti, 2020). Accordingly, there are general calls for developing stronger ToM tests (e.g., Olderbak & Wilhelm, 2020).
Reading the Mind in the Eyes Test
For the design of the RMET, Baron-Cohen et al. (2001) proposed two stages of ToM. The first, mental state decoding, is the ability to detect mental states based on immediately observable information. The second stage, mental state reasoning, involves thinking or reasoning about mental states to explain or predict actions by the target. Information acquired in both stages is combined for an assessment of others’ mental states.
Mental state decoding (also called mental state identification) involves a spontaneous appraisal of one’s immediate environment, including the facial expression and gaze direction of other people, to perceive what another person is thinking or feeling (Harkness et al., 2005). For example, a smiling target with direct gaze in the presence of a birthday cake could be quickly appraised as happy. Mental state decoding is thus considered to be a rudimentary skill that does not require more complex reasoning skills (Harkness et al., 2005).
In contrast, mental state reasoning involves reasoning about why a target could feel the way one perceives them to feel. Reasoning about why someone may feel a certain way requires more information than is necessary for mental state decoding, such as additional knowledge about the person or the current context. For example, with the earlier example, it would be important to know whose birthday it is, and depending on the stage of the event, whether the target is perhaps happy about the cake, the presents, or the presence of friends or family. On the basis of this information, assumptions about the target’s mental state can be drawn. Actions can be explained or predicted, and false beliefs can be identified. Therefore, mental state reasoning is a rather complex skill, since a lot of information has to be sorted, evaluated, and combined (Harkness et al., 2005).
As a consequence, the two stages not only differ in their assumed complexity but are also different kinds of skills.
Most ToM tests supposedly measure the second stage, mental state reasoning, to assess whether one can accurately reason about the mental state of others. For example, the Faux Pas test attempts to measure awareness and understanding of faux pas (see also the Hinting Task and Strange Stories, which are described in Table 1). In contrast, the RMET was designed to measure the first stage of ToM, mental state decoding. In particular, the test was developed to focus on the capacity to identify what a target is thinking or feeling, partly based on the target’s eye gaze (Baron-Cohen et al., 2001). For example, earlier studies have shown that eye gaze affects the accuracy with which certain emotions are perceived. The perception of approach-oriented emotions, such as anger and joy, is facilitated by direct gaze, while the perception of avoidance-oriented emotions, such as fear and sadness, is facilitated by averted gaze (R. B. Adams & Kleck, 2003; Sander et al., 2007). Thus, the RMET presumes that ToM rests partly on information inferred from the perception of another person’s eye gaze.
RMET Version 1
The first version of the RMET was developed in 1997 and had 25 items with two response options each. The visual stimuli were black and white photographs of the eye region of the face, and were pulled from magazines. Four judges, described as two men and two women, agreed in an open discussion on the mental state of the person in the photograph, in addition to a second mental state that would act as a foil; these served as the two response options for that picture. The response options were then presented to a separate group of eight judges, four men and four women, who also agreed on the correct mental state for each photograph (Baron-Cohen et al., 1997; Baron-Cohen et al., 2010).
However, this test had many limitations: each item had a 50% guessing probability, foil items were just the semantic opposite of the solutions, and some items were too easy (e.g., questions concerning gaze direction, basic emotions were presented as response options instead of advanced mental states; Baron-Cohen et al., 2001). Because of these limitations, the authors revised the measure and created the RMET Version 2.
RMET Version 2
The second version of the RMET, or RMET V2, addressed the limitations of Version 1. Specifically, Version 2 has four response options, lowering the guessing probability to 25%. The response options present complex mental states, which are also accompanied by a document defining each mental state. The test is also longer, with a practice trial, which provides the participant with feedback, followed by 36 trials. The trials again present pictures of a person’s eyes and participants are again asked to identify which mental state best describes what the target is thinking or feeling. The picture set included the original 25 in addition to other photographs, which are roughly balanced between men and women. The response options were newly selected by the two authors and then given to a group of judges. Five of the eight judges had to agree on which response option is correct; additionally, at least 50% of the tested sample of healthy adults and adult students had to choose the target word and no more than 25% could choose any one foil option (Baron-Cohen et al., 2001).
Evaluation of the RMET
Psychometric Evaluations of the RMET
Several researchers have reported that the RMET has poor internal consistency (e.g., Vellante et al., 2013; Voracek & Dressler, 2006; cf. Dehning et al., 2012; Girli, 2014). Some have reported that test data do not meet assumptions of normality (Söderstrand & Almkvist, 2012; Vellante et al., 2013), and several groups have shown that the test lacks a unidimensional structure. Instead, other factor structures are discussed (e.g., Harkness et al., 2005; Olderbak et al., 2015). Likewise, performance on the test differs depending on whether the targets can be categorized as positive or negative (Hudson et al., 2020).
However, the test typically has good test-retest reliability (e.g., Hallerbäck et al., 2009; Vellante et al., 2013) and scores on the test differentiate between healthy controls and individuals with autism or schizophrenia, with the latter groups performing worse, as expected (Baron-Cohen et al., 2015; Baron-Cohen et al., 2001; Bora et al., 2009). Additionally, women typically perform better than men (Kirkland et al., 2013).
The RMET May Be an Emotion Perception Test
While the RMET was designed to measure the first stage of ToM, the test design strongly resembles that of an emotion perception test (see also Oakley et al., 2016). For example, the DANVA (Diagnostic Analysis on Nonverbal Accuracy; Nowicki & Duke, 1994) presents participants with a picture of someone who has an emotional expression on their face and asks them to select among four response options that best describe what the person in the picture is feeling. Thus, in both the DANVA and the RMET, participants are presented with an emotional face and asked to select among four response options that best describe what the person in the picture is feeling.
A notable difference between most emotion perception tasks, like the DANVA, and the RMET is that the RMET focuses only on the eyes. However, one could argue that limiting the portion of the face that is viewable is a method for reducing ceiling effects of emotion perception tasks. For example, the Identification of Emotion Expressions from Composite Faces task (Wilhelm et al., 2014) presents participants with a photograph where the top half and the bottom half of the face (divided at the nose) are from two different emotional expressions by the same person. Participants are then asked to identify which emotion is expressed in the top or bottom half of the face.
One could also argue that the response items of the RMET are more complex emotional states relative to the six basic emotions (anger, disgust, fear, happiness, sadness, surprise) used in most emotion perception tasks. However, some emotion perception tests (e.g., Geneva Emotion Recognition Test; Schlegel et al., 2014; Multimodal Emotion Recognition Test; Bänziger et al., 2009) include more complex emotions. Actually, many researchers have identified over 62.6% of the RMET terms as emotions in their research (see Table 2).
Table 2. Use of RMET response options as emotion labels.
Note. “x” indicates that the term is used by the respective researcher. “t” indicates that the term is used in a slightly different way, for example, amusement instead of amused. 33.7% of the terms of the RMET (Reading the Mind in the Eyes Test) are not used by these researchers.
Given the strong similarities between the RMET and emotion perception tests, one could instead argue that emotion perception tests measure ToM. In fact, many consider emotion perception to be an indicator of ToM. However, this assumption is limited to ToM research with humans, whereas with nonhuman animals, emotion perception abilities are considered to be distinct from ToM (Quesque & Rossetti, 2020). Moreover, it is generally argued that emotion perception is a lower level cognitive process relative to ToM, involving basic visual discrimination or basic learning. Thus, under the rule of Occam’s razor, we should attribute performance to emotion perception (Olderbak & Wilhelm, 2020; Quesque & Rossetti, 2020). Under the requirements for construct validity, it is not clear that ToM is the attribute that causes variation in performance; rather, emotion perception is the more likely ability (Borsboom et al., 2004). Thus, we could conclude that the RMET essentially operates as an emotion perception test.
Current Study
Because of the many psychometric limitations presented earlier, we hypothesize that the RMET will have poor internal consistency, especially if unpublished estimates can be retrieved (Hypothesis 1).
The RMET design differs from the other ToM tests in several ways. First, it asks participants to visually perceive static and limited facial information, while others present facial expressions within a social context (e.g., MASC) or present no person stimuli and instead focus on reading comprehension (e.g., Faux Pas, Strange Stories). Thus, the RMET should be more strongly related with visual skills and face processing abilities, while others require working memory, emotion memory (e.g., MASC), and reading comprehension (e.g., Faux Pas, Strange Stories). Second, the test utilizes a multiple-choice format with single word response options, which differs from other tests that use longer response options (e.g., MASC) or short answer items. The completion of multiple-choice items requires basic crystallized intelligence skills (Hartung et al., 2017), while the longer response options and short answer items are heavily reliant on verbal abilities. Thus, the RMET should require different cognitive processes, relative to the other ToM tests, and hence we expect only a moderate relation between the measures (Hypothesis 2; see Supplemental Figure S1 [available online] for an illustration of expectations).
Because the RMET test design is so similar to that of an emotion perception test with facial stimuli (e.g., Wilhelm et al., 2014), we expect a strong relation with measures of emotion perception (Hypothesis 3). For the purposes of this article, emotion perception is defined as the ability to accurately observe and decode facial expressions of emotion. Although emotions can be expressed through the voice or body movements, here only emotions expressed through the face are relevant since the RMET uses solely pictures of faces.
In addition, because the commonness of the response options correlates with the likelihood of an option being selected (Olderbak et al., 2015), we hypothesize that the RMET will be moderately correlated with vocabulary abilities (Hypothesis 4). Verbal abilities are defined as the degree of acquisition and the availability of the relational system of language. They are part of most intelligence models, for instance, as one of three contents, besides figural and numeric, in the second hierarchical level (with a general intelligence factor on the first level) of the Berliner Intelligenz-Struktur Modell (Berlin Intelligence-Structure Model [BIS]; Jaeger et al., 1997).
Since the RMET is designed as a maximal effort measure, we hypothesize that it will be weakly related with typical behavior measures (i.e., everyday behavior, personal tendencies; Cronbach, 1949), even though they are conceptually similar. This is because maximal effort measures are only weakly related with their typical behavior counterparts (e.g., Di Fabio & Saklofske, 2014). To this end, we examined relations with affective and cognitive empathy. Affective empathy is defined as “an emotional response to . . . emotional responses of others” (Lawrence et al., 2004, p. 911), and cognitive empathy is the extent to which individuals think they can deduce the inner mental state of another person (Blair, 2005); it is seen as an equivalent of ToM by some researchers (e.g., Batson, 2009; Blair, 2005; Lawrence et al., 2004). Specifically, we expect weak relations with cognitive and affective empathy (Hypotheses 5 and 6).
In the present study, we additionally examined whether the diagnosis of schizophrenia or autism moderated relations, since for both diagnoses, researchers assume deficits in ToM or socioemotional traits (Ahmed & Miller, 2011; Baron-Cohen et al., 2015; Bora et al., 2009; Grove et al., 2014). The presence of these conditions may impact the heterogeneity of the participants, thus impacting the variance in RMET scores and their relations with other constructs. This investigation, however, was exploratory. Autism spectrum disorder (henceforth autism) is a neurodevelopmental disorder characterized by impairments in social development and communication, restrictive interests, and repetitive behaviors, which can be heterogeneous concerning the severity or type of symptoms (American Psychiatric Association, 2013). Schizophrenia, on the other hand, is a mental disorder characterized by symptoms like hallucinations, delusions, or disorganized speech (American Psychiatric Association, 2013).
Method
Literature Search
A keyword-based literature search was conducted in May 2017 via Ovid, an aggregator, which we used to search multiple databases simultaneously, specifically Embase, the Cochrane Libraries, PsycINFO, PsycARTICLES, and Medline (see Figure 1 for the flowchart). The keywords used were (“mind in the eye*” or “eye* test”) AND (“emotion percept*” or “emotion recog*” or “empathy” or “vocab*” or “verbal ability*” or “verbal intelligence” or “Faux Pas Test” or “Strange Stories” or “Hint* Task” or “Stories from Everyday Life” or “False Belief Stories” or “false belief task*” or “False Belief Picture Sequencing” or “Character Intention Inference” or “MASC” or “Movie for Assess* of Social Cogn*” or “attribution of intention*” or “Theory of mind stories”). We searched for these terms in the title, abstract, keywords, measures, and full text and we screened English- and German-language manuscripts (the final set of estimates came from English-language articles only). In addition, we searched the reference lists of three existing RMET meta-analyses (Baker et al., 2014; Bora et al., 2009; Kirkland et al., 2012) for additional studies meeting the inclusion criteria. Finally, emails were sent to all authors whose studies met the inclusion criteria but did not report the relevant statistical information (i.e., correlations of the RMET), as well as to authors of conference abstracts that mentioned the RMET.

Figure 1. Flowchart of the literature search progress.
Inclusion and Exclusion Criteria
Our inclusion criteria were as follows: (1) full text is available in English or German, (2) the exact sample size is reported, (3) the mean age of the participants is between 18 and 60 years, (4) effect sizes are based on the full RMET Version 2, and (5) participants are neurotypical or had Autism Spectrum Disorder, schizophrenia, or schizotypy. We excluded studies if participants had a condition other than those listed because there were too few studies to allow us to meaningfully test those conditions as moderators. We additionally excluded studies if the RMET was administered following an intervention, since administration without a preceding intervention is the most generalizable condition. Postintervention measures of the RMET could be biased by memory effects or the effects of the intervention.
Coding
We coded 27 variables in total (see Supplemental Material [available online] for the coding manual). We coded the article’s bibliographic information including the full reference, year of publication, authors, and the document type (e.g., peer-reviewed paper or dissertation). We coded characteristics of the sample including the sample size, the proportion of females, proportion of Caucasians, the mean age, whether the sample was only students, testing language, and the percentage of the sample that was schizophrenic, schizotypic, or on the Autism Spectrum. Finally, we coded the reliability of the RMET, mean RMET scores, the reliability of the covariates, the covariate test name, the relation between the RMET and the covariate, and the type of effect size (e.g., latent correlation). Some studies had multiple effect sizes, which were coded on separate rows (Cooper et al., 2009).
All studies were double coded; coders were three master’s students studying psychology. On average, the proportion agreement was .86, ranging from .71 to 1.00 (see Supplemental Material Table S1 [available online] for details). The lower values were for continuous variables and were most often due to rounding errors, for example, effect sizes with multiple decimal places. The differences were discussed between the coders and resolved.
Data Analytic Strategy
We conducted analyses within R version 3.2.2 (R Development Core Team, 2016) with the package psych for basic descriptive statistics. We meta-analyzed raw correlation coefficients. We then corrected all correlation coefficients for attenuation to account for measurement error (Hunter & Schmidt, 2004), and pooled correlations are presented without and with this correction. When the sample-specific reliability was not reported, the median score from the present data was used, followed by published reliability estimates (Card, 2012; see Supplemental Material Table S2 [available online] for details). Correction is a common practice, although there are some concerns about the interpretation of these effect sizes (Schmidt & Hunter, 2015). One reason is that more imprecise measures with lower reliability yield bigger corrected values. Although the corrected values should be treated with caution, we decided to present both uncorrected as well as corrected values to illustrate the generalizability of the effects.
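The correction for attenuation described above can be illustrated with a minimal Python sketch. It assumes the classical formula in which the observed correlation is divided by the square root of the product of the two reliabilities; the function name and all input values are hypothetical and are not taken from the coded studies.

```python
import math


def disattenuate(r_xy: float, rel_x: float, rel_y: float) -> float:
    """Correct an observed correlation for unreliability in both measures.

    Classical correction: rho = r_xy / sqrt(rel_x * rel_y).
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical study: observed r = .30, reliabilities of .70 and .80.
rho = disattenuate(0.30, 0.70, 0.80)
print(round(rho, 3))  # 0.401

# A less reliable measure yields a larger corrected value for the same r.
rho_low_rel = disattenuate(0.30, 0.50, 0.80)

# With low (e.g., imputed) reliabilities the corrected value can exceed 1.
rho_inadmissible = disattenuate(0.60, 0.50, 0.60)
```

The last two calls show why corrected values call for caution: the same observed correlation grows as reliability drops, and an inadmissible value above 1 is possible.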
We then transformed correlations to Fisher’s z scale to stabilize the variances (Borenstein et al., 2011). Estimates of the RMET’s internal consistency were limited to Cronbach’s alpha. For these scores, we took their square roots, which were then meta-analyzed (Thompson & Vacha-Haase, 2000). The meta-analytic results were then back-transformed to their original metric for easy interpretation.
When there were no dependencies in the effects (i.e., all effects came from independent samples), we used the package metafor to get a pooled estimate (Viechtbauer, 2016). Alternatively, when effects were dependent, the package robumeta was used, which applies robust variance estimation (RVE; Fisher & Tipton, 2014). RVE is a new approach for handling nonindependent effect sizes when the within-study correlation is unknown (Hedges et al., 2010). The within-study correlation value ρ was set to the default of 0.8, which we examined afterward via sensitivity analyses (Fisher & Tipton, 2014). To investigate the heterogeneity of the effects, we estimated the dispersion (τ²), that is, the variance of the “true” effect sizes, and Higgins’s I² (Borenstein et al., 2011). Given the heterogeneity in our samples, which is also indicated in the heterogeneity statistics below, we estimated a random-effects model throughout. For all analyses, alpha was set at .05 and we required a minimum of 10 studies for any meta-analytic model (Borenstein et al., 2011). For visualization, forest plots are presented.
We identified outliers, defined as values more than 3.5 standard deviations from the mean, and double-checked them through stem-and-leaf plots. Additionally, we examined publication bias with funnel plots, Egger’s test (Sterne & Egger, 2005), and trim-and-fill analysis (Duval & Tweedie, 2000).
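As an illustration of the two transformations, the following Python sketch shows the Fisher z transformation with its usual sampling variance of 1/(n − 3), and the square-root transformation of alpha with its back-transformation. The function names and the example values are assumptions made for illustration only.

```python
import math


def r_to_z(r: float) -> float:
    """Fisher's z transformation of a correlation (inverse hyperbolic tangent)."""
    return math.atanh(r)


def z_to_r(z: float) -> float:
    """Back-transform a Fisher z value to the correlation metric."""
    return math.tanh(z)


def z_variance(n: int) -> float:
    """Approximate sampling variance of Fisher's z for a sample of size n."""
    return 1.0 / (n - 3)


def alpha_to_sqrt(alpha: float) -> float:
    """Square-root transformation applied to Cronbach's alpha before pooling."""
    return math.sqrt(alpha)


def sqrt_to_alpha(value: float) -> float:
    """Back-transform a pooled square-root value to the alpha metric."""
    return value ** 2


# Hypothetical study: r = .33 observed in a sample of n = 103.
z = r_to_z(0.33)      # about 0.343
v = z_variance(103)   # 1 / 100 = 0.01
```

Note that the variance of z depends only on the sample size, not on the size of the correlation, which is what makes the z scale convenient for weighting.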
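To make the heterogeneity statistics concrete, the sketch below pools independent effect sizes under a random-effects model and reports τ² and I². It uses the DerSimonian-Laird estimator as a simple stand-in; which estimator the packages named above applied is not stated here, and the sketch does not implement RVE for dependent effects. The function name and the input data are hypothetical.

```python
def random_effects_pool(effects, variances):
    """Pool independent effect sizes with a DerSimonian-Laird random-effects model.

    effects   : effect sizes (e.g., Fisher z values), one per sample
    variances : their sampling variances
    """
    k = len(effects)
    w = [1.0 / v for v in variances]  # fixed-effect (inverse-variance) weights
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum_w

    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = k - 1

    # Between-study variance (tau^2), truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - df) / c)

    # I^2: percentage of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects weights add tau^2 to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return {"pooled": pooled, "se": se, "tau2": tau2, "I2": i2, "Q": q}


# Hypothetical data: three Fisher z values with equal sampling variances.
result = random_effects_pool([0.1, 0.3, 0.5], [0.01, 0.01, 0.01])
```

With equal sampling variances, as in this toy example, the pooled estimate is simply the mean of the effects, while τ² and I² summarize how much the effects vary beyond sampling error.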
We investigated the incremental predictive validity of the RMET through a two-stage meta-analytic structural equation model using random effects (TSSEM; Cheung & Chan, 2005), incorporating new modifications to the tool that allow us to effectively account for missing correlation coefficients at the study level (Jak & Cheung, 2018). TSSEM is an inclusive tool with the pooled correlation matrix created in Stage 1, and a structural equation model estimated based on that matrix in Stage 2. These analyses were conducted using metaSEM and semPlot to visualize the results (Jak & Cheung, 2017). The TSSEM could only be estimated with attenuated coefficients.
Results
Descriptives
We coded 121 effect sizes representing correlations between the RMET and several covariates. Of these effect sizes, we identified one outlier that was more than three standard deviations from the mean for both transformed and untransformed values, and was hence excluded (see Thoma et al., 2013). An estimate from Balogh et al. (2014) was also excluded because the imputed reliabilities led to a corrected correlation greater than 1.
The final sample included a total of 61 studies (57 peer-reviewed, 4 dissertations) with 62 independent samples and k = 119 effect sizes based on a total sample size of n = 8,611 (see Table 3 and see Supplemental Material [available online] for a complete list of studies). Of these samples, eight included participants with schizophrenia or schizotypy (n = 744) while eight included participants with autism or Asperger’s syndrome (n = 519; all diagnoses were according to the Diagnostic and Statistical Manual of Mental Disorders–Fourth edition [DSM-IV], Fifth edition [DSM-5], or International Classification of Diseases–Tenth revision). The weighted mean age of the participants was 28.2 years (ranging between 18.4 and 56.3 years). The majority of participants were women (weighted mean 61%, ranging from 5% to 100%) and Caucasian (weighted mean 77%, ranging from 0% to 100%). Sixteen samples were based exclusively on students, 22 used nonstudent samples, and 24 were mixed. Also, 32 samples used English test materials, while 30 applied the tests in other languages.
Table 3. Characteristics of Included Studies With RMET Cronbach’s α and Correlations of RMET With Measures of ToM, Emotion Perception (EP), Affective Empathy (AE), Cognitive Empathy (CE), and Vocabulary (Voc).
Note. Effect sizes r: AE = affective empathy; CE = cognitive empathy; ToM = theory of mind; EP = emotion perception or recognition; Voc = vocabulary; Autism = Autism spectrum condition, also Asperger; RMET = Reading the Mind in the Eyes Test. Tests (ordered by covariates): ToM: Faux = Faux Pas Test (Stone et al., 1998), Hinting Task (Corcoran et al., 1995), Strange Stories (Happé, 1994), False Belief Task (Frith & Corcoran, 1996); IMT = Imposing Memory Test (Kinderman et al., 1998); StoryC = Story Comprehension Test (Channon & Crawford, 2000); ToM Stories = Theory of Mind Stories (Happé et al., 1999); MASC = Movie for the Assessment of Social Cognition (Dziobek et al., 2006); KDEF = Karolinska Directed Emotional Faces Test (Lundqvist et al., 1998); TASIT2 = The Awareness of Social Inference Test (S. McDonald et al., 2003) Part 2: Social-Inference-Minimal; TASIT 3 = TASIT Part 3: Social Inference-Enriched (S. McDonald et al., 2003); EP: DANVA-2 = Diagnostic Analysis on Nonverbal Accuracy Version 2 (Nowicki & Duke, 1994); BLERT = Bell-Lysaker Emotional Recognition Task (Bell et al., 1997); CAM_visual = Visual Task of Cambridge Face-Voice Battery (Golan et al., 2006); EIS-F = Emotional Intelligence Scale-Faces (Matczak et al., 2005); FEEST = Facial Expressions of Emotions–Stimuli and Tests (Ekman & Friesen, 1976); Faces Test (Baron-Cohen et al., 1997); TASIT1 = The Awareness of Social Inference Test Part 1 emotion evaluation test (S. McDonald et al., 2003); ER-40 = Penn Emotion Recognition Test (Kohler et al., 2003). 
Voc: WASI = Wechsler Abbreviated Scale of Intelligence (Wechsler, 1999); NART = National Adult Reading Test (Nelson, 1982); WAIS-Vocab = Wechsler Adult Intelligence Scale Vocabulary (Wechsler, 1997); Shipley = Shipley Vocabulary Test (Shipley, 1946); WAIS-IV Similarities (subtest; Wechsler, 2008); Gf/Gc = Vocabulary Test (Gf/Gc Quickie Test Battery; Stankov, 1997); 4-Choice = 4-Choice synonym vocabulary test (Ekstrom et al., 1976); AAIQ = Army Alpha IQ Test (Yerkes, 1921); WAIS-R information subscale (Wechsler, 1981); WTAR = Wechsler Test of Adult Reading (Wechsler, 2001); CE: IRI = Interpersonal Reactivity Index–Perspective Taking (Davis, 1980); ESE = Emotion Specific Empathy Questionnaire–Cognitive Empathy (Olderbak et al., 2014). AE: IRI = Interpersonal Reactivity Index–Empathic Concern (Davis, 1980); QMEE = Mehrabian and Epstein Questionnaire Measure of Emotional Empathy (Mehrabian & Epstein, 1972); ESE = Emotion Specific Empathy Questionnaire–Affective Empathy (Olderbak et al., 2014).
a Different sample. b Split-half reliability. c Internal consistency ω. d Dissertation/thesis. e Effect size sent by author. f Adapted/revised. g False belief. h Subscale. i German, 7 of 12 vignettes.
Because it was unclear whether it was appropriate to combine the clinical and nonclinical samples, analyses were initially based on the 46 healthy samples (k = 84). Clinical samples were only included when there was enough data to test the diagnosis as a moderator.
Internal Consistency
We initially coded 24 effect sizes representing the internal consistency of the RMET, with 6 effect sizes not published. We also excluded an estimate of test-retest reliability, two split-half reliability coefficients, and an internal consistency omega estimate, to decrease the heterogeneity in our pooled correlation. The final sample included 21 effect sizes based exclusively on Cronbach’s alphas with a total sample size of n = 4,305.
On average, internal consistency was acceptable (α = .73, 95% confidence interval [CI] [.65, .79], p < .001; I² = 94.90, τ² = .10, k = 21, n = 4,305, participants with and without mental disorder). The internal consistency of the RMET, as estimated by Cronbach’s α, ranged from .45 to .96, with 50% of the samples reporting estimates below .70.
Yet, because the test is long (36 items), the internal consistency is likely an artifact of test length. Applying the Spearman–Brown prophecy formula supports this: it yields an average interitem correlation of .07, the reliability expected if the test were shortened to a single item. Published reliability estimates were on average higher (α = .71) than those not reported (i.e., 6 effect sizes retrieved through email: α = .67); however, in contrast to our prediction, this difference was not significant (intercept α = .72, 95% CI [.58, .82], p < .001, reliability source α = 0, 95% CI [−.15, .23], p = .80; model I² = 97.52, τ² = .23, k = 21). In addition, age, sex, language, and clinical diagnosis did not significantly moderate the internal consistency. Fail-safe N with Orwin’s (1983) method showed that nine further studies would be needed for the internal consistency to no longer differ significantly from .5.
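The Spearman–Brown step can be checked with a few lines of arithmetic. The sketch below is illustrative and not the authors' analysis code (their analyses are in the supplemental R Markdown file). It assumes the pooled α of .73 and the test length of 36 items reported above, and the function names are hypothetical.

```python
def spearman_brown(r_single: float, n_items: int) -> float:
    """Reliability of a test of n_items parallel items, each with reliability r_single."""
    return n_items * r_single / (1 + (n_items - 1) * r_single)


def single_item_reliability(alpha: float, n_items: int) -> float:
    """Invert Spearman-Brown: reliability implied for one item (the average interitem correlation)."""
    return alpha / (n_items - (n_items - 1) * alpha)


# Values taken from the text: pooled alpha = .73, RMET length = 36 items.
r_bar = single_item_reliability(0.73, 36)
print(f"average interitem correlation: {r_bar:.2f}")

# Round trip: lengthening the single item back to 36 items recovers the pooled alpha.
print(f"alpha for 36 items: {spearman_brown(r_bar, 36):.2f}")
```

A value this low means that an acceptable α at 36 items is compatible with items that barely correlate with one another.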
Convergent and Discriminant Validity
Next, we estimated relations of RMET with several covariates using meta-regression, limiting the analyses to samples with healthy participants. The observable pattern partly supports our hypotheses (see Table 4).
Random Effects Model of the Convergent and Discriminant Validity of the RMET for Healthy Participants.
Note. The relations were estimated with RVE to account for multiple effect sizes from a single study. nhealthy total = 6,419. k = number of effect sizes; RMET = Reading the Mind in the Eyes Test; CI = confidence interval; RVE = robust variance estimation.
The strongest relation was with measures of emotion perception, both according to the pooled correlations (r = .33) and to the reliability-corrected pooled correlations (ρ = .48). The strongest raw observed correlations were between the RMET and emotion perception tests (rs ranged from .12 to .72). However, the reliability-corrected pooled correlation was still only .48 in magnitude. Thus, RMET and emotion perception share 23% of their variance, leaving 77% nonshared, which indicates that the two remain largely distinct.
The next strongest relation was with other measures of ToM, which was medium in magnitude, both according to the raw correlations (r = .29), and to the reliability-corrected pooled correlations (ρ = .39). Thus, there is up to 15% shared variance between RMET and other measures of ToM. The raw observed correlations were weaker than those observed with emotion perception, even negative in some cases, and ranged from −.19 to .56 across the samples.
RMET was moderately related with vocabulary, both according to the raw correlations (r = .25), and to the reliability-corrected pooled correlations (ρ = .32). Thus, on average, there is up to 10% shared variance with measures of vocabulary. The raw observed correlations ranged from negative (r = −.14) to moderately positive (r = .37) across the samples.
Finally, RMET was weakly related with cognitive (r = .12, ρ = .20) and affective empathy (r = .13, ρ = .19). Thus, RMET shared up to 4% variance with cognitive empathy and 4% variance with affective empathy. The raw correlations ranged from −.04 to .39 for affective empathy, and −.43 to .30 for cognitive empathy. Across all meta-regressions, there was no significant moderation by sex ratio, mean age, or whether or not these estimates were published.
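The shared-variance percentages reported above are the squared reliability-corrected correlations. A minimal sketch of that arithmetic, using only the ρ values stated in the text (the dictionary and variable names are illustrative):

```python
# Reliability-corrected pooled correlations (rho) with the RMET, as reported in the text.
rho = {
    "emotion perception": 0.48,
    "theory of mind": 0.39,
    "vocabulary": 0.32,
    "cognitive empathy": 0.20,
    "affective empathy": 0.19,
}

# Shared variance is the squared correlation, expressed here as a whole percentage.
shared_variance_pct = {name: round(100 * r ** 2) for name, r in rho.items()}

for name, pct in shared_variance_pct.items():
    print(f"{name}: {pct}% shared, {100 - pct}% nonshared")
```

Because the corrected correlations are upper-bound estimates, these percentages are best read as "up to" values, as in the text.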
Two-Stage Meta-Analytic SEM
Emotion perception and vocabulary were the strongest non-ToM predictors of performance on the RMET; however, both abilities are also moderately correlated with one another. Thus, we looked at their individual incremental predictive validity using two-stage random effects meta-analytic structural equation modeling based on the attenuated correlations. Correlations between emotion perception, specifically tasks with facial stimuli, and vocabulary were pulled from a recent meta-analysis and can be found in the supplemental material (k = 7, ntotal = 1,932; Olderbak et al., 2019; see Supplemental Table S3, available online). This model showed that emotion perception and vocabulary combined explained 9% of the variance in RMET (see Figure 2). Unfortunately, there were insufficient data to include theory of mind in the model.

Two-stage meta-analytic structural equations model with random effects modeling.
Moderators by Participant Characteristics
Next, we examined whether the proportion of participants with a psychiatric diagnosis moderated the strength of relations. This was done only in instances where there were at least three samples with individuals on the Autism Spectrum (Theory of Mind: k = 3; Vocabulary: k = 6) or with Schizophrenia (Emotion Perception: k = 5; Theory of Mind: k = 10). Given the recommendation that 10 studies are needed per group, these results should be interpreted with caution (Borenstein et al., 2011). Samples with individuals on the Autism Spectrum (M = 21.9, range: 17.4-25.4) or with Schizophrenia (M = 20.5, range: 17.9-24.3) performed slightly worse on the RMET than developmentally normal samples (M = 25.5, range: 15.9-28.6).
A meta-regression revealed that the relation with emotion perception was significantly stronger for individuals with schizophrenia (see Table 5 and Supplemental Figure S2, available online): if, for example, 10% of the sample had a diagnosis of schizophrenia, the relation of the RMET with emotion perception would increase from ρ = .41 to ρ = .47. This result was stable when controlling for multiple testing (still p < .001). Yet, we found no moderation by schizophrenia for measures of ToM. In contrast, we found indications of a slight moderation by autism for relations with vocabulary, indicating that for individuals with autism, the relation of vocabulary with the RMET is lower (see Table 5 and Supplemental Figure S3, available online). Yet, it failed to reach significance. A similar pattern was observable for ToM; however, it should be noted that only 3 of the 26 samples had participants with autism.
Results of the Moderator Analysis: How the Percentage of a Specific Diagnosis Moderates the Relations With the RMET.
Note. The relations were estimated with RVE to account for multiple effect sizes from a single study. k = number of effect sizes. The effect sizes for the diagnoses autism or schizophrenia indicate the change of the relation of the covariate (e.g., vocabulary) with the RMET when 1% of the sample has the diagnosis. We modeled the health status “healthy” as the intercept. All significant coefficients remained significant (p < .001) after controlling for false discovery rate. RMET = Reading the Mind in the Eyes Test; CI = confidence interval; RVE = robust variance estimation.
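To make the interpretation of the moderator coefficients concrete, the sketch below applies the intercept-plus-slope reading described in the note to the schizophrenia example from the text. It rests on two assumptions that are not taken from Table 5: that the moderation is linear on the correlation metric, and that the slope equals the value back-calculated from the worked example (ρ rising from .41 to .47 at 10% diagnosed). The function and constant names are hypothetical.

```python
def predicted_relation(intercept: float, slope_per_percent: float, percent_diagnosed: float) -> float:
    """Predicted RMET-covariate relation for a sample with a given percentage diagnosed."""
    return intercept + slope_per_percent * percent_diagnosed


# Intercept: relation with emotion perception in fully healthy samples (from the text).
INTERCEPT_HEALTHY = 0.41

# Assumption: slope back-calculated from the text's example, not read from Table 5.
SLOPE_PER_PERCENT = (0.47 - 0.41) / 10

for pct in (0, 10):
    rho_hat = predicted_relation(INTERCEPT_HEALTHY, SLOPE_PER_PERCENT, pct)
    print(f"{pct:>3}% with schizophrenia: predicted rho = {rho_hat:.2f}")
```

Extrapolating far beyond the percentages observed in the included samples would lean on the linearity assumption more heavily than the small number of clinical samples supports.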
Moderation by Methodology
Next, we examined whether the RMET was more strongly related with particular measures of ToM (only in instances where at least three samples completed that particular measure). This was done separately for all samples and for those with just healthy participants (see Table 6 and Supplemental Material Figure S4, available online). Overall, we found that methodology significantly moderated relations in both models: the RMET was more strongly related with the MASC (Dziobek et al., 2006) than with other ToM tasks like the Faux Pas test (Stone et al., 1998) or the Hinting Task (Corcoran et al., 1995), for healthy participants as well as participants with a diagnosis.
Comparison of the Relation With the RMET for Varying Theory of Mind Tests.
Note. The relations were estimated with RVE to account for multiple effect sizes from a single study. RMET = Reading the Mind in the Eyes Test; CI = confidence interval; k = number of effect sizes; Faux = Faux Pas Test (Stone et al., 1998); Hinting = Hinting Task (Corcoran et al., 1995); MASC = Movie for Assessment of Social Cognition (Dziobek et al., 2006).
Publication Bias
Of the 121 effect sizes in total, 51 were obtained from authors. Publication bias was analyzed using funnel plots, Egger’s test (Sterne & Egger, 2005), which tests for funnel plot asymmetry, and trim and fill analyses, which also analyze funnel plot symmetry (Duval & Tweedie, 2000). There was no apparent publication bias according to the funnel plots (see Supplemental Figure S5, available online). This was also supported by the nonsignificant findings for Egger’s test (ToM: t = −1.95, p = .09; emotion perception: t = 2.60, p = .06; vocabulary: t = 0.09, p = .93; cognitive empathy: t = −0.21, p = .84; affective empathy: t = −0.16, p = .88). Likewise, trim and fill analysis revealed no indications of asymmetry for emotion perception, that is, zero estimated missing studies on the left. However, trim and fill analysis revealed indications of asymmetry for vocabulary, where the analysis suggested five studies missing on the right (r_observed = .25; r_adjusted = .29, 95% CI [.16, .41]; ρ_observed = .32; ρ_adjusted = .37, 95% CI [.24, .49]). This indicates that the relation between RMET and vocabulary might be slightly underestimated. However, the adjusted estimates should be interpreted with caution because they ignore among-study heterogeneity that is unrelated to publication bias (e.g., due to the observed moderations by diagnosis for both emotion perception and vocabulary) and the corrections they provide are imprecise (Cooper et al., 2009).
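For readers unfamiliar with Egger’s test, the following sketch shows the regression behind it: each effect’s standard normal deviate (effect divided by its standard error) is regressed on its precision, and an intercept far from zero signals funnel plot asymmetry. This is an illustration in Python rather than the authors’ R code. The correlations and sample sizes are hypothetical placeholders, not data from this meta-analysis, and the use of Fisher’s z with standard error 1/sqrt(n − 3) is an assumption about the effect size metric.

```python
import numpy as np


def egger_test(effects, std_errors):
    """Egger regression: returns (intercept, t statistic of the intercept, degrees of freedom)."""
    y = np.asarray(effects, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    snd = y / se                    # standard normal deviates
    precision = 1.0 / se
    X = np.column_stack([np.ones_like(precision), precision])
    beta, *_ = np.linalg.lstsq(X, snd, rcond=None)
    resid = snd - X @ beta
    df = len(y) - 2
    sigma2 = resid @ resid / df     # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    t_intercept = beta[0] / np.sqrt(cov[0, 0])
    return beta[0], t_intercept, df


# Hypothetical correlations and sample sizes, for illustration only.
r = np.array([0.30, 0.25, 0.35, 0.28, 0.32, 0.31])
n = np.array([120, 60, 45, 200, 80, 150])

z = np.arctanh(r)                   # Fisher z transformation
se = 1.0 / np.sqrt(n - 3)           # standard error of Fisher z

intercept, t_stat, df = egger_test(z, se)
print(f"Egger intercept = {intercept:.3f}, t({df}) = {t_stat:.2f}")
```

The t statistic is compared against a t distribution with k − 2 degrees of freedom; with few effect sizes the test has little power, which is one reason it is paired with funnel plots and trim and fill.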
We could not estimate trim and fill analyses for theory of mind, cognitive empathy, or affective empathy because trim and fill analyses ignore the nonindependence among effect sizes, which could lead to Type I errors (Rodgers & Pustejovsky, 2020).
For studies with nonindependent effect sizes, we used RVE analysis (Fisher & Tipton, 2014). However, RVE requires an assumption about within-study covariances, and we modeled the within-study correlation ρ at .8. We used sensitivity analyses to test whether the effect sizes and standard errors in our models would change under a different within-study correlation assumption, specifically 0, .2, .4, .6, or 1. The results were robust (see supplemental R markdown, available online).
Discussion
We conducted the first large-scale evaluation of the RMET, examining its internal consistency and construct validity. Overall, the meta-analysis revealed that the internal consistency of the RMET, as calculated through Cronbach’s alpha, was acceptable and performance on the test was strongly related with emotion perception, moderately related with other ToM tests and vocabulary, and weakly related with cognitive and affective empathy.
Internal Consistency of RMET
We found that unpublished internal consistency estimates were comparable to those that were published. Internal consistency was also not moderated by the mean age or gender distribution of the sample, the testing language, or clinical diagnosis. Yet, because the test is rather long with 36 items, and other studies have found weak or even negative interitem correlations (see Olderbak et al., 2015), the acceptable internal consistency seems to be an artifact of test length. This is also supported by the Spearman–Brown prophecy formula, which yields an average interitem correlation of only .07. Likewise, we should note that despite its common use, Cronbach’s alpha is not an indicator of test unidimensionality (Schmidt-Atzert & Amelang, 2012). Detailed analyses of the test’s factor structure have shown that several factors can be identified (e.g., Olderbak et al., 2015).
RMET Design Limitations
The RMET V2, despite being an improvement over Version 1, still has many issues that limit its precision and validity. First, the picture stimuli are unstandardized with regard to the use of shadows and head angles, which convey additional mood information beyond the eye gaze that the test is designed to focus on. Researchers found that changing the brightness of a picture changes response patterns, with participants being more likely to choose the correct response option for a brightened photo (Hallerbäck et al., 2009). Likewise, there may be limited cross-cultural generalizability of the test, with Caucasians having an intercultural advantage because only Caucasian pictures were used as stimuli (J. Adams et al., 2010; Elfenbein & Ambady, 2002).
Second, the response options differ in how common the terms are in the English language. The commonness of the response options, as based on the Corpus of Contemporary American English (Davies, 2015), is weakly, but positively correlated with the frequency by which that option was selected in other samples (Olderbak et al., 2015). Overall, this causes the test to have a somewhat high reliance on vocabulary abilities, which could reduce the extent to which it measures ToM (see also Burnel et al., 2017).
Third, and most critical, trials do not have true veridical responses, as can be identified in traditional intelligence tests (e.g., tests of mathematical ability). The persons in the magazine articles were not contacted as to what they were thinking or feeling. The test is scored based on the consensus of five out of eight judges, but there is no information regarding their qualifications. It should be noted that it is not certain that a consensus agreement actually and always indicates the correct response (e.g., 41% of Americans think dinosaurs and humans coexisted; Moore, 2015). Consensus scoring limits the extent to which difficult items can be developed and it rewards individuals for identifying the most common response, and not perhaps the correct response (although see Barchard et al., 2013 regarding two-stage proportion consensus scoring). This is a general critique of ToM tests, but can also be extended to most measures of social and emotional abilities (Olderbak & Wilhelm, 2020).
Finally, consensus scoring is biased by the abilities and individual biases of the persons on which the consensus score is based, a bias which is then transferred to the participant’s score (Barchard & Russell, 2006). There are some tools for adjusting the consensus score to account for these biases (e.g., cultural consensus theory; Anders et al., 2018); however, those tools can only be applied if the original ratings from the judges are known.
RMET Illustrates a Jangle Fallacy
We found that the RMET had the highest meta-analytic relation with emotion perception, with this relation higher in schizophrenic samples, whereas we found only a medium relation with ToM. These results are in line with and extend a previous meta-analysis (Kirkland et al., 2012). Additionally, we found that the MASC, which uses a video of people having emotional interactions, had stronger relations with the RMET than the other ToM tests did. Some have suggested that the MASC resembles an emotion perception test, and others have found the measure is strongly related with emotion perception tests (e.g., Dziobek et al., 2006 found a correlation of r = .715 with a basic emotion recognition test). Therefore, these results support the perspective that the RMET measures emotion perception rather than ToM. This represents a jangle fallacy, in which two tests are claimed to measure different constructs but in practice measure the same construct (Kelley, 1927).
A review of the literature suggests the RMET has a similar nomological network as emotion perception tests. RMET and emotion perception tests correlate comparably with vocabulary. A recent meta-analysis found emotion perception tests with exclusively person stimuli are moderately related with crystallized intelligence (ρ = .27). We found a comparable relation between RMET and vocabulary (ρ = .32), which is a strong indicator of crystallized intelligence (Wilhelm & Schroeders, 2019).
Likewise, RMET and emotion perception tests relate comparably with empathy. We found only weak relations between the RMET with cognitive (ρ = .20) and affective empathy (ρ = .19). A mini-meta-analysis also found weak relations between emotion perception with cognitive (ρ = .13) and affective empathy (ρ = .13; Olderbak & Wilhelm, 2017). Some researchers assume that ToM and cognitive empathy are closely connected or equivalents (e.g., Batson, 2009; Blair, 2005; Lawrence et al., 2004). However, because of the difference in measurement approach, namely that the RMET is a maximal effort measure and affective and cognitive empathy are measured as typical behavior, they are weakly related.
Additionally, RMET has predictive validity similar to that of emotion perception tests. Performance on the RMET and emotion perception tests improves following the intranasal administration of oxytocin (Domes et al., 2007; Shahrestani et al., 2013). Additionally, individuals on the autism spectrum perform worse on the RMET and emotion perception tests (Baron-Cohen et al., 2001; Hudepohl et al., 2015).
Overall, these findings support our evaluation that the psychometric design of the RMET is closer to that of an emotion perception test, rather than a measure of ToM.
In a similar vein, the results of this meta-analysis may help improve our understanding of the close conceptual connection between mental state decoding and emotion perception. Mental state decoding tasks necessitate basic discrimination processes of spontaneous information, like facial expressions or eye gaze (Harkness et al., 2005). Consequently, one could suggest that emotion perception is an aspect of mental state decoding. Furthermore, just like mental state decoding, emotion perception is based on multimodal cues (Connolly et al., 2020; Harkness et al., 2005; Young & Bruce, 2011). However, typical measures of mental state decoding, such as the RMET, have severe methodological flaws some of which were highlighted in this meta-analysis. Therefore, it might be useful to take mental state decoding as an aspect of emotion perception (see also Oakley et al., 2016). Emotion perception as well as more general person perception has solid theoretical foundations in comparison to mental state decoding. There is a vast amount of data in support of popular models of person perception including the underlying neural framework (see Duchaine & Yovel, 2015). The perception of unfamiliar faces (neutral or emotional) can be seen as part of an emerging ability we might label emotional intelligence (Hildebrandt et al., 2015; Mayer et al., 2016). Additionally, perception of neutral or emotional faces incorporates processing eye-gaze and therefore key aspects of what the RMET is supposed to measure (Young & Bruce, 2011). Clearly, by virtue of surface features of RMET and popular emotion perception measures (Hildebrandt et al., 2015) the RMET by and large qualifies as an indicator of emotion perception. The RMET does so with two flaws. First, responses to items can hardly be evaluated based on a veridical response standard. Second, restriction to displays of the eye-region might be counterproductive for yielding a comprehensive ability estimate.
Instead of risking a jingle-jangle fallacy by contrasting two abilities (one being an essential facet of ToM, the other being key for emotional intelligence), we suggest that it is one and the same ability that happens to be studied in insufficiently connected fields. For the time being, we conclude that the more parsimonious explanation is sufficient and we suggest further considering the RMET within Young’s modified model of emotion perception (Young & Bruce, 2011).
However, it is clear that RMET is also not purely a measure of emotion perception. First, the RMET focuses only on the eyes, whereas all prototypical expressions are expressed across the full face (Ekman & Friesen, 1975). Second, the pictures are unstandardized in regard to the shadows and head angles, which influences responses (Hallerbäck et al., 2009). Third, the response options vary in regard to their commonness (Olderbak et al., 2015), artificially increasing the reliance of performance on vocabulary abilities. Finally and most critical, the trials do not have veridical answers. Thus, we would argue that while RMET is designed similarly to a measure of emotion perception, limitations in its design prevent us from recommending it be used as a measure of emotion perception.
RMET Illustrates a Jingle Fallacy
The results of our meta-analysis also point to a jingle fallacy in the study of ToM. A jingle fallacy is the inaccurate assumption that measurement instruments that are purported to measure the same construct actually measure the same construct (Thorndike, 1904; see also Olderbak & Wilhelm, 2020). A reading of the ToM literature suggests a lack of consensus about how ToM is defined and how frequently used instruments accomplish these definitions (Quesque & Rossetti, 2020; Schaafsma et al., 2015). Some claim typical ToM instruments measure different underlying processes (e.g., see Quesque & Rossetti, 2020).
Emotion perception is a lower level ability and it does not necessitate mentalizing. Facial expressions of emotion are not direct indicators of how one feels (Fernandez-Dols & Crivelli, 2013; Reisenzein et al., 2013), and instead may be socially shaped and important for social communication (Barrett et al., 2011). Performance on facial emotion expression tasks may be due to knowledge about culturally acceptable ways to express certain emotions. Thus, the ability to complete emotion perception tests may not require any ability to infer the thoughts and feelings of another person, but rather the ability to label stereotypical facial expressions of emotion.
In a similar vein, our meta-analysis showed weak agreement between the RMET and the Faux Pas Test. This illustrates how jingle fallacies can occur when tests differ in their design. For example, ToM tests differ in the extent to which they rely on reading comprehension skills (i.e., the Strange Stories Test and the Faux Pas Test rely strongly on them; Table 1). Thus, not surprisingly, there is little convergence between these measures and measures that rely on perceptual abilities like the RMET. Moreover, measures relying on reading comprehension skills are also based on a different definition of ToM: some aim to measure the understanding of faux pas, whereas others measure understanding of social situations. Therefore, there is a need for more research about the nomological network of ToM. A revision of the ToM construct has been called for: ToM could rather be seen as a conceptual framework consisting of different psychological processes, which can be distinguished by neuroimaging (Schaafsma et al., 2015). Several distinctions hierarchically below the construct ToM are discussed: some examples are implicit versus explicit, cognitive versus affective, comparing one’s own mental state with those of other people, perceptual discrimination of socially relevant stimuli, and understanding of causality (see Schaafsma et al., 2015, for a more comprehensive discussion). Further studies could investigate the convergent or discriminant validity of various ToM tests and how they relate to these basic processes of ToM.
Measuring Individual Differences in ToM
Given the jingle fallacies in ToM, and the jangle fallacies for the RMET, it is reasonable to discuss what is the best way to measure individual differences in ToM as an ability. We would argue that this may be an impossible task.
A fundamental design requirement of a cognitive test is that the test’s trials have a veridical response. This means that for any ToM test, where participants are asked to infer the thoughts or emotions of the target, that the researchers know the thoughts and emotions of the target. However, at present, it is impossible to truly know the thoughts and emotions of another person. At best, we can know the target’s self-report of what they are thinking and feeling. However, that self-report is several stages removed from the veridical truth. Targets have individual differences in their awareness (Lane et al., 1990) and understanding of what they feel (Mayer et al., 2016). Targets have personal motivations for hiding thoughts or feelings they feel may be inappropriate (Snyder, 1974). There are also cultural aspects influencing how thoughts and feelings are conveyed (Gendron et al., 2018). Thus, we argue that ToM can only be measured as a typical behavior construct, and not as a maximal effort ability construct.
We bring this concern to other socioemotional ability constructs that also rely on the self-report of a target for the veridical response. The Empathic Accuracy Paradigm (Ickes, 2001) asks two persons to sit together and discuss a topic, while being videotaped. Following the conversation, both persons watch the video and periodically record what they were thinking and feeling. They then rewatch the video and report on what their partner was thinking and feeling. Agreement between the two self-reports is taken as an indicator of empathic accuracy, with more agreement indicating better empathic accuracy ability. Deeper psychometric analyses, however, have revealed that performance in this paradigm is heavily moderated by whether the participants were verbally saying how they were truly thinking and feeling (Lieber & Hodges, 2019). Thus, it is not clear that an additional mindreading capacity is needed.
In naturalistic paradigms, a meta-analysis showed that there is only a weak correlation between how individuals report feeling and the emotion that they express (Fernandez-Dols & Crivelli, 2013). In experimental paradigms, there is strong coherence between self-reported amusement and smiling. Otherwise, there is weak coherence between expected expressions and self-reported happiness, disgust, fear, anger, and surprise (Reisenzein et al., 2013). This suggests that it is largely wise not to rely on the facial expressions of another person to infer how they are thinking and feeling. As many have shown, context is very important in shaping emotion perception, and can even overpower the perception of high intensity prototypical expressions (Barrett et al., 2011). Thus, the idea of presenting an emotional face, or in the case of RMET, a pair of eyes, as sufficient information to allow someone to accurately infer the thoughts and feelings of the target is fundamentally flawed.
Given that the RMET or similar ToM tests (e.g., Strange Stories) are scored through consensus scoring, or through knowing what instructions were given to actors, we should instead infer that the expressions and corresponding “veridical” response are correct from the perspective of the target’s culture. Strong performance on RMET does not indicate that one can correctly infer the thoughts or emotions of another person. Instead, it indicates that one can accurately determine what expressions other people in one’s culture believe indicate a certain thought or feeling. Thus, the ability measured is closer to cultural knowledge than to a ToM ability.
With that being said, there is one noteworthy example of a perspective taking test with a veridical response that is not influenced by culture and does not rely on consensus scoring. Samson et al. (2010) designed a perspective taking task where participants view an avatar in a room clearly facing a specific direction, and participants are asked to judge what the avatar can see. Given the task design, this test has a clear veridical answer, and there is evidence it operates as an individual difference measure (e.g., Drayton et al., 2018).
Conclusion
The results of this meta-analysis indicate that the construct validity of the RMET is questionable since it shares only 15% of the variance with other ToM measures (the shared variance is somewhat larger with the MASC and smaller with the Faux Pas Test and the Hinting Task). In contrast, RMET is more strongly related with emotion perception measures, especially for schizophrenic samples, which becomes even more apparent when correcting for unreliability. We provide several reasons for why we interpret the strong similarity between RMET and emotion perception as an indicator that RMET operates more as a measure of emotion perception than of ToM, primarily because one only needs the more cognitively basic emotion perception abilities to complete the test, rather than more advanced ToM abilities (Quesque & Rossetti, 2020). Nevertheless, we recommend against the use of RMET as an emotion perception test, due to the design limitations mentioned above. Instead, we would recommend emotion perception tests with better psychometric properties like the Geneva Emotion Recognition Test (Schlegel et al., 2014) or the Berlin Emotion Perception and Recognition Task Battery (Wilhelm et al., 2014).
For measuring individual differences in ToM, more work is needed. Given the challenges to identify a veridical response, typical behavior measures of ToM may be more appropriate.
Overall, the lack of construct validity of the RMET is troubling given that the test is frequently used in clinical settings where poor performance is taken as an indicator of certain disorders such as Autism Spectrum Disorder or Schizophrenia. Thus, we recommend researchers exercise caution when administering RMET and interpreting data based on the test.
Supplemental Material
Supplemental material, sj-pdf-1-asm-10.1177_1073191121996469 for Sty in the Mind’s Eye: A Meta-Analytic Investigation of the Nomological Network and Internal Consistency of the “Reading the Mind in the Eyes” Test by Anne Frieda Doris Kittel, Sally Olderbak and Oliver Wilhelm in Assessment
Footnotes
Acknowledgements
We would like to thank Julia Meixner for her help with data collection and all researchers who provided additional nonpublished effect sizes.
Authors’ Note
The coding manual, interrater agreement estimates per coded variable, imputed reliability estimates, a complete list of studies in the meta-analysis, correlations used for the two-stage meta-analytic structural equation model, the hypothesized relations, and three forest plots and funnel plots that illustrate the results in more depth are available in the online supplemental material. Additionally, we have uploaded our analysis with results as an HTML R Markdown file for transparency.
Public Significance Statement
This meta-analysis provides an evaluation of the internal consistency and convergent and discriminant validity for the frequently used Reading the Mind in the Eyes Test (RMET; Baron-Cohen et al., 2001). It shows that the RMET, designed to measure theory of mind, measures emotion perception rather than theory of mind, but the test also falls short of current standards for an emotion perception test.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received the following financial support for the research, authorship, and/or publication of this article: The analysis and publication of these results are supported by a grant from the Margarete von Wrangell-Habilitationsprogramm für Frauen awarded to Sally Olderbak.
Supplemental Material
Supplemental material for this article is available online.
References
