Abstract
We studied the fundamental issue of whether children evaluate the reliability of their language interpretation, that is, their confidence in understanding words. In two experiments, 2-year-olds (Experiment 1: N = 50; Experiment 2: N = 60) saw two objects and heard one of them being named; both objects were then hidden behind screens and children were asked to look toward the named object, which was eventually revealed. When children knew the label used, they showed increased postdecision persistence after a correct compared with an incorrect anticipatory look, a marker of confidence in word comprehension (Experiment 1). When interacting with an unreliable speaker, children showed accurate word comprehension but reduced confidence in the accuracy of their own choice, indicating that children’s confidence estimates are influenced by social information (Experiment 2). Thus, by the age of 2 years, children can estimate their confidence during language comprehension, long before they can talk about their linguistic skills.
When we use language to communicate, we are doing more than processing the words we hear; we are trying to infer the speaker’s intended meaning given the context we are in (e.g., Clark, 1996; Clayards et al., 2008; Frank & Goodman, 2012; Gibson et al., 2013; Grice, 1975; Levy, 2008; Sperber & Wilson, 1986). Under this account, the ability to estimate confidence—that is, the likelihood that an interpretation is correct—is thus a central component of language comprehension. A rich body of research now attests that infants and toddlers can recognize words quickly and accurately, much as adults can (e.g., Fernald et al., 2006). But it is far less clear how children develop the capacity to evaluate confidence in these interpretations. Here, we provide novel evidence that toddlers display behavioral markers of confidence in whether they have accurately understood a word and that their confidence is context sensitive.
The idea that toddlers may be able to estimate their linguistic confidence contrasts with a broader consensus that metalinguistic skills do not develop until quite late (Gleitman & Gleitman, 1979; Hakes, 2012; Levelt et al., 1978). For example, although children know the meanings of many words before their first birthday (Bergelson & Swingley, 2012; Tincoff & Jusczyk, 1999), typically they cannot provide reliable verbal reports on whether a word is familiar or even whether they know an object’s name until they are about 4 years old (Marazita & Merriman, 2004).
Although traditional accounts have long assumed that metacognition (i.e., the ability to evaluate one’s own cognitive representation) is limited in children younger than 4 years (e.g., Flavell, 1999), recent research outside the language domain has suggested that certain core aspects of metacognition develop much earlier, long before children can talk about their own cognition (Balcomb & Gerken, 2008; Geurten & Bastin, 2019; Ghetti et al., 2013; Goupil & Kouider, 2019). For example, there is evidence that infants are able to estimate confidence (i.e., the likelihood that a decision is correct; Kepecs et al., 2008; Pouget et al., 2016) years before they can provide metacognitive verbal reports (Balcomb & Gerken, 2008; Geurten & Bastin, 2019; Goupil & Kouider, 2016; Hembacher & Ghetti, 2014; Kuzyk et al., 2019; Vo et al., 2014). In one recent study, Goupil and Kouider (2016) adapted a nonverbal equivalent of the postdecision-wagering paradigm (Kepecs et al., 2008) to show that 12-month-old infants can monitor the accuracy of their perceptual decisions. Infants were presented with masked faces that appeared for brief durations on the left or right side of a screen and then reappeared a few seconds later as a fully visible reward. Having performed their initial choice (looking either right or left following the prime), infants maintained their gaze longer (i.e., waited longer for the rewarding face) when their initial choice was correct compared with when it was incorrect. Thus, infants’ postdecision persistence primarily varied with the accuracy of their decision, in the absence of any external feedback indexing their performance. 
This specific pattern of postdecision persistence has been argued to reflect confidence monitoring (i.e., the ability to internally monitor the reliability of one’s own decisions), with lower persistence times suggestive of lower confidence in a decision and higher persistence times reflecting greater confidence, a capacity that can also be found in nonhuman animals (Hampton, 2009; Kepecs et al., 2008; Miyamoto et al., 2017). Postdecision persistence has also been shown to correlate with confidence reports in adults (Kepecs & Mainen, 2012).
These considerations thus raise the possibility that similar behavioral tasks that do not require a verbal report might reveal that young children can evaluate their confidence in their understanding of language—for instance, in whether they have correctly identified what referent or meaning a speaker intends for a word.
Statement of Relevance
Humans have the ability to internally monitor the reliability of their decisions, beliefs, and memories. This ability, termed metacognition, is crucial for learning in adults and school-age children, as it allows learners to seek information when they realize that they have incomplete knowledge. In two studies, we explored the emergence of this metacognitive ability in the language domain. By tracking eye movements in a language-comprehension task, we found that, by 2 years of age, children can evaluate their confidence in understanding words (i.e., whether they know or do not know that they understood a word correctly). These findings provide novel evidence that the capacity to internally monitor one’s own decisions, beliefs, and memories can be measured behaviorally and exists before children are old enough to talk about what they know.
We developed a novel paradigm to assess whether young children’s understanding and recognition of words incorporate evaluations of confidence. Our procedure is composed of three distinct phases: (a) looking while listening, (b) anticipation, and (c) reward. The looking-while-listening phase is based on a well-validated eye-tracking paradigm of the same name that has frequently been used to assess children’s understanding of word meanings (e.g., Bergelson & Swingley, 2012; Fernald et al., 2008; Golinkoff et al., 1987). Participants saw two pictures on a screen and heard the label of one picture (e.g., “Where is the dog?”). To measure children’s word-recognition accuracy, we recorded their fixations to the named picture over time. The looking-while-listening phase was directly followed by the anticipation phase, a modified anticipatory looking paradigm: The pictures were occluded, and participants were asked again to look at the target picture (e.g., “Where was the dog?”) before it reappeared a few seconds later (the final-reward phase). The anticipation task provided a second discrete measure of how the word was understood while the objects were occluded (their first-look decision to look toward the left or right side of the screen in anticipation of the reappearance of the target object), alongside their confidence in that understanding (indexed by postdecision persistence: how long they persisted in gazing toward the hidden object after their first look, in the absence of any further information that could influence their decision). If children can internally evaluate their accuracy in recognizing the target word, then they should show longer persistence times after a correct first look compared with an incorrect first look, but only when they actually know the meaning of the word. 
Critically, the postdecision-persistence measure is taken in the absence of any information on the screen about object locations, which ensures that persistence is driven by children’s internal evaluation of their first-look accuracy (i.e., their confidence).
Experiment 1
Experiment 1 tested whether children’s objective word knowledge modulated their confidence in understanding those words. This was a preregistered replication of a pilot experiment reported in the Supplemental Material available online.
Method
The preregistration, material, data, and analysis script are available at https://osf.io/9fapj/.
Participants
Fifty English-speaking children were included in the final analysis (age: M = 23 months 8 days, SD = 122 days, minimum = 18 months 5 days, maximum = 29 months 19 days; 25 girls). Our sample size was based on Goupil and Kouider’s (2016) Experiment 3. They tested 50 12-month-olds in a postdecision-persistence wagering paradigm; a power analysis based on this effect suggested that we should test 70 children to have a power of 80% at the .05 α level. However, because our participants were older, we limited the number of participants to 50 (preregistered). An additional seven children were tested but excluded from the analysis because they did not provide sufficient trials (n = 4; see exclusion criteria below), because their caregiver interfered (n = 1), or because they were born at less than 37 weeks gestational age (n = 1). Participants were recruited in the Edinburgh area via social media and the participants’ database of the Wee Science Lab.
Procedure and experimental design
The procedure and protocol were approved by the ethics committee of The University of Edinburgh prior to study participation. Before coming to the lab, parents completed a child vocabulary questionnaire so that we could check that their child knew the familiar words used in the experiment. During the experiment, children sat on their caregiver’s lap in front of a monitor. Caregivers wore opaque glasses and were asked not to interact with the child during the procedure.
We adapted a version of the postdecision-persistence wagering paradigm (for rats, see Kepecs et al., 2008; for infants, see Goupil & Kouider, 2016) with an anticipation eye-movement paradigm using an eye tracker. The experiment consisted of a series of test trials whose time course is depicted in Figure 1. Each trial started with a looking-while-listening phase: Children saw two pictures on the screen depicting either two known objects (known-word trials; e.g., a dog and a banana) or two unknown objects (unknown-word trials; e.g., a DNA double helix and a 3D virus shape) and were prompted to look at one of the objects (the target) using its label (e.g., “Where is the dog?” for known words or “Where is the blicket?” for unknown words). The objects were then covered by animated curtains (ending the looking-while-listening phase; 5 s including 1 s of curtain motion). A fixation point (a green circle changing size) then appeared at the center of the screen between the two curtains and flickered until children looked at it. After children fixated the fixation point for at least 100 ms, the fixation point stopped flickering and the audio started prompting children to find the object labeled during the looking-while-listening phase (e.g., “Where was the dog?”). The anticipation phase started as soon as children initiated a look toward one of the sides (target curtain, distractor curtain) and lasted for 2.5 s of silence with no visual change. The target object then reappeared at the same location as in the looking-while-listening phase, along with a rewarding animation and a cheering sound (the reward phase; 2.5 s). The reward phase also occurred if the child did not initiate a look in the 4 s following the target word offset.

Fig. 1. Design of Experiment 1. Children’s gaze position on a screen was recorded as they completed up to 40 test trials. The figure shows an example of the time course of a test trial in which children were tested on the known word “dog.”
Trials were separated by a 1-s pause. No immediately consecutive trials presented the same pictures or words. Target and distractor pictures appeared the same number of times on the right and the left sides of the screen. The target side did not repeat more than two times on consecutive trials.
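The sequential constraints on trial order can be expressed as a simple check. The following Python sketch is illustrative only and is not the script used to build the trial lists; the `Trial` fields and the function name are hypothetical, and it assumes that "did not repeat more than two times" means at most two consecutive trials with the target on the same side.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trial:
    pair: str         # picture pair shown on screen, e.g. "banana/dog"
    target: str       # word used to label the target
    target_side: str  # "left" or "right"

def order_is_valid(trials, max_side_run=2):
    """Return True if no adjacent trials share a picture pair or target word
    and the target side never repeats more than max_side_run times in a row
    (assumed reading of the constraint in the text)."""
    run = 0
    for i, trial in enumerate(trials):
        if i == 0:
            run = 1
        else:
            prev = trials[i - 1]
            if trial.pair == prev.pair or trial.target == prev.target:
                return False
            # Extend the current same-side run, or start a new one.
            run = run + 1 if trial.target_side == prev.target_side else 1
        if run > max_side_run:
            return False
    return True
```

A candidate order would be reshuffled until this check passes; the left/right counterbalancing of each picture would need a separate count.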
Test trials were presented in blocks of 10: five known-word trials and five unknown-word trials. Blocks of trials were repeated as long as children did not show any sign of boredom, to a maximum of four repetitions (40 trials). Children received on average 13.16 trials (minimum = 4, maximum = 32) after applying the criteria for trial rejection.
The test trials were preceded by two practice trials, designed to familiarize children with the procedure. The first trial consisted of the looking-while-listening phase followed directly by the reward phase, with no anticipation phase. The second trial included a short anticipation phase of only 500 ms.
Materials
Picture stimuli were drawings or photographs of objects on a light gray background. Pictures were always yoked in pairs: five pairs for known words (banana/dog, cat/boat, car/bird, shoe/book, hat/ball) and five pairs of objects that did not have obvious names in English for unknown words. The practice trials used the pairs star/tree and duck/apple. Parental reports indicated that seven children did not know one to three of the target known words. Removing these items from the analysis does not change the pattern of results.
For the unknown-word trials, five novel labels were created: “nurmy,” “toma,” “blicket,” “meb,” and “dax.” Each novel label was presented with the same pair of unknown objects across participants. Half of the participants saw the novel label associated with the first object of the pair, and the other half with the second object.
The audio stimuli consisted of one sentence played during the looking-while-listening phase (“Where is the [target]?”) and one sentence played just before the anticipation phase (“Where was the [target]?”). All sentences were recorded by a native speaker of English in a child-friendly way.
Criteria for trial and participant exclusion
Trials were rejected if they met any of the following preregistered rejection criteria: (a) children did not look at either image (target or distractor) for at least half of the looking-while-listening-phase time window (n = 199), (b) the delay between the picture-display (looking-while-listening) phase and the anticipation phase was more than 3 s (to ensure that the memory of the objects and their locations was comparable across trials within and across children; n = 53), (c) participants did not initiate a look to one of the regions of interest (target or distractor) during the anticipation phase (n = 16), (d) this initial look lasted less than 100 ms (to avoid implausibly brief responses; n = 47), or (e) children did not look at either region of interest (target or distractor) for at least half of the anticipation-phase time window (n = 81). These criteria resulted in the removal of 37% of the total number of trials collected.
Participants were excluded if they had fewer than two trials per word type (known, unknown) after we applied the above criteria (a–e), they were premature (born before Week 37 of gestation), or they were exposed to less than 50% English input on a weekly basis per parental estimate.
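The trial-level criteria (a–e) and the minimum-trial rule for participants can be sketched as follows. This Python sketch is an illustration, not the analysis script available on OSF: the field names are hypothetical, the criteria are assumed to be applied in the order listed, and the looking-while-listening window is assumed to be the full 5 s. The prematurity and language-exposure rules are not covered.

```python
from dataclasses import dataclass
from typing import Optional

LWL_WINDOW_MS = 5000           # looking-while-listening window (assumed: full 5 s)
ANTICIPATION_WINDOW_MS = 2500  # anticipation window (2.5 s, from the text)

@dataclass
class TrialSummary:
    lwl_on_image_ms: float          # time on target or distractor during looking while listening
    gap_ms: float                   # delay between picture display and anticipation phase
    first_look_ms: Optional[float]  # duration of the first look; None if no look was initiated
    antic_on_aoi_ms: float          # time on either region of interest during anticipation

def rejection_reason(t):
    """Return the letter of the first criterion the trial fails, or None if it is kept."""
    if t.lwl_on_image_ms < LWL_WINDOW_MS / 2:
        return "a"
    if t.gap_ms > 3000:
        return "b"
    if t.first_look_ms is None:
        return "c"
    if t.first_look_ms < 100:
        return "d"
    if t.antic_on_aoi_ms < ANTICIPATION_WINDOW_MS / 2:
        return "e"
    return None

def participant_included(kept_trial_types, min_per_type=2):
    """kept_trial_types lists 'known'/'unknown' for each retained trial."""
    return all(kept_trial_types.count(k) >= min_per_type for k in ("known", "unknown"))
```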
Measurement and analysis
Gaze position on each trial was recorded via an EyeLink 1000 eye tracker (SR Research, Kanata, ON, Canada), sampling every 2 ms (500 Hz). All mixed model analyses were conducted using the lme4 package in R (Bates et al., 2015). For mixed models, we used the maximal random-effects structure supported by the data. We based p values for main fixed effects on likelihood-ratio tests; simple effects are reported from the summary table of the model. To check whether the age of children modulated the effects of interest, we added age (in months) as a predictor in all of our analyses. Including age as a predictor was suggested during the review process and was therefore not preregistered; omitting age as a predictor does not change other results.
Analysis 1: word-recognition performance
Recognition during the looking-while-listening phase (preregistered)
We inspected the time course of eye movements during the looking-while-listening phase (5 s). We used the proportion of fixations toward the target image as a dependent variable. We conducted three cluster-based permutation analyses (Maris & Oostenveld, 2007) as used previously in eye-tracking studies (e.g., Dautriche et al., 2015) on binned data (bins of 50 ms excluding away looks) using a custom Python script. Two analyses compared each level of word knowledge (known, unknown) with chance by comparing the average proportion of looks toward the target picture with 50% (the chance level), and the third analysis compared the looking proportions between word types.
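The logic of a cluster-based permutation test against chance can be sketched as below. This is not the custom script used in the study. It is a minimal one-sided illustration that assumes the data are already binned into a participants × bins table of target-looking proportions, uses an arbitrary t threshold of 2.0, and builds the null distribution by flipping the sign of each participant's deviation from chance. All function names are hypothetical.

```python
import math
import random
from statistics import mean, stdev

def t_values(data, chance=0.5):
    """One-sample t statistic per time bin; data[p][b] is participant p, bin b."""
    n = len(data)
    out = []
    for b in range(len(data[0])):
        col = [row[b] - chance for row in data]
        sd = stdev(col)
        # Simplification: a bin with zero variance is treated as uninformative.
        out.append(mean(col) / (sd / math.sqrt(n)) if sd > 0 else 0.0)
    return out

def largest_cluster(tvals, threshold):
    """Return (mass, first_bin, last_bin) for the run of adjacent bins with
    t > threshold that has the largest summed t value."""
    best = (0.0, None, None)
    mass, start = 0.0, None
    for b, t in enumerate(tvals + [float("-inf")]):  # sentinel closes the last run
        if t > threshold:
            if start is None:
                start, mass = b, 0.0
            mass += t
        elif start is not None:
            if mass > best[0]:
                best = (mass, start, b - 1)
            start = None
    return best

def cluster_permutation_test(data, chance=0.5, threshold=2.0, n_perm=1000, seed=0):
    """Compare the observed largest cluster mass with a sign-flip null distribution."""
    rng = random.Random(seed)
    observed = largest_cluster(t_values(data, chance), threshold)
    exceed = 0
    for _ in range(n_perm):
        flipped = []
        for row in data:
            sign = rng.choice((1, -1))  # one flip per participant, whole time course
            flipped.append([chance + sign * (x - chance) for x in row])
        if largest_cluster(t_values(flipped, chance), threshold)[0] >= observed[0]:
            exceed += 1
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

The between-condition comparison would follow the same scheme, with condition labels shuffled within participants instead of signs flipped.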
First-look responses during the anticipation phase (preregistered)
By looking toward one of the sides, toddlers commit to one alternative, which we conceptualize here as a decision, in line with evidence accumulation models of decision making (e.g., Kiani & Shadlen, 2009; Pleskac & Busemeyer, 2010) and many previous studies in children (e.g., Goupil & Kouider, 2016), adults (e.g., Nieuwenhuis et al., 2001), and animals (e.g., Kiani & Shadlen, 2009) and more generally in anticipation designs (e.g., Kovacs & Mehler, 2009).
We modeled participants’ first looks to one of the hidden objects during the anticipation phase (coded as 1 when the participant initiated a look toward the target and 0 when the look was toward the distractor) using a mixed logit model specified as Target First Look ~ Trial Type × Age + (1 | Participant); age was coded in months and scaled to avoid convergence issues.
Note that first-look responses were, on average, initiated 213 ms after target word onset. This is shorter than the minimum latency expected during looking-while-listening tasks (367 ms; Swingley & Aslin, 2000; Swingley et al., 1999), but this is expected given the anticipatory nature of the paradigm: On reaching the anticipation phase, children already know the location of the target object, even before hearing the target word label, because they heard the label during the immediately preceding looking-while-listening phase (Fig. 1).
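Written out, the model formula above corresponds to the following random-intercept logistic regression. The symbols are our own notation rather than the authors': y_ij is 1 when participant j's first look on trial i is toward the target, and the coding of trial type is left unspecified, as in the text.

```latex
\operatorname{logit} P(y_{ij} = 1)
  = \beta_0
  + \beta_1\,\mathrm{TrialType}_{ij}
  + \beta_2\,\mathrm{Age}_{j}
  + \beta_3\,\mathrm{TrialType}_{ij}\,\mathrm{Age}_{j}
  + u_j,
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2)
```

Here u_j is the by-participant random intercept, that is, the (1 | Participant) term.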
Analysis 2: word-recognition confidence
Persistence times (preregistered)
To index word-recognition confidence, we measured persistence times—that is, how long participants fixated toward the direction of their first look—during the anticipation phase window. Note that we did not have any a priori hypothesis about how persistence might vary overall between the different stimuli (known objects or words vs. unknown objects or words); rather, we focused on whether persistence times, within each stimulus type, depend on first-look accuracy. We used the following mixed model: Persistence ~ Trial Type × First Look × Age + (1 | Participant). We had to simplify the preregistered analysis, which included a random slope for Trial Type × First Look, because the model turned out to be singular. Persistence times were log transformed to respect the assumption of normality (nontransformed data are used for display purposes in the figures). Age was coded in months and scaled to avoid convergence issues. We did not include random effects for items because our number of items was low, but additional analyses did not reveal important item-specific variation. In particular, there were no significant differences in persistence times among test trials that feature pairs of inanimate objects (e.g., shoe/book) versus pairs that include one animate object and one inanimate object (e.g., banana/dog).
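The persistence measure can be illustrated with a short Python sketch. It assumes one plausible reading of the definition: persistence is the duration of the uninterrupted look that opens the anticipation window. It also assumes that each 2-ms gaze sample has already been labeled with a region of interest. The function names and sample labels are hypothetical, and brief tracker loss is not handled.

```python
import math

SAMPLE_MS = 2  # one gaze sample every 2 ms, as reported for the eye tracker

def persistence_ms(aoi_samples, sample_ms=SAMPLE_MS):
    """Duration (ms) of the uninterrupted look that opens the anticipation window.

    aoi_samples holds one label per gaze sample ("target", "distractor", or None),
    starting at the onset of the first look.
    """
    if not aoi_samples or aoi_samples[0] is None:
        return 0
    first_side = aoi_samples[0]
    n = 0
    for aoi in aoi_samples:
        if aoi != first_side:
            break  # gaze left the side of the first look
        n += 1
    return n * sample_ms

def log_persistence(ms):
    """Natural-log transform applied before model fitting."""
    return math.log(ms)
```

For example, 500 consecutive samples on the target curtain followed by a switch to the distractor would give a persistence time of 1,000 ms.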
Gaze-shift frequency (post hoc)
It was suggested during the review process that we also conduct a gaze-shift frequency (GSF) analysis. Several studies have reported that GSF, how frequently participants saccade between options presented on the screen, reflects explicitly reported confidence (Folke et al., 2016; Sepulveda et al., 2020): Adult participants who shift their gaze more often between visually presented options are more likely to report lower confidence in their choice than participants who shift their gaze less (for children, see also Leckey et al., 2020). We analyzed GSF during the anticipation phase. We expected children to switch gaze between options more frequently when they were less confident in their response (i.e., in the unknown-word trials) compared with when they were confident in their response (i.e., in the known-word trials). GSF was calculated as the number of times participants shifted their gaze from one area of interest (target position, distractor position) to the other during the anticipation phase. We used the following mixed model: GSF ~ Trial Type × Age + (1 | Participant).
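The GSF count described above can be sketched as follows, again assuming that gaze samples from the anticipation phase have been labeled by region of interest. The function name and labels are hypothetical. Samples falling outside both regions are skipped, which is an assumption about how such samples are treated.

```python
def gaze_shift_count(aoi_samples):
    """Count shifts from one region of interest to the other.

    aoi_samples holds one label per gaze sample: "target", "distractor",
    or anything else (e.g., None) for gaze outside both regions.
    """
    shifts = 0
    last = None
    for aoi in aoi_samples:
        if aoi not in ("target", "distractor"):
            continue  # gaze outside both regions is skipped, not counted as a shift
        if last is not None and aoi != last:
            shifts += 1
        last = aoi
    return shifts
```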
Results
Analysis 1: word-recognition performance
Recognition during the looking-while-listening phase
During the looking-while-listening phase, children hearing known words looked toward the target significantly above chance (from 1,400 ms to the end of the trial; p < .001) and, as expected, did not show any preference for the target object when hearing unknown words (p > .3; see Fig. 2a). There was a significant difference between known and unknown words (from 2,100 ms to 3,350 ms; p = .006). Further analyses revealed that this effect was more robust in the older participants than in the younger ones (see details in the Supplemental Material).

Fig. 2. Word-recognition performance (Experiment 1). (a) Mean proportion of target looks during the looking-while-listening phase for known words and unknown words. The purple shaded region surrounding the asterisk represents the time range where the proportion of target looks for known words was significantly above chance level (.5). The ribbon surrounding each curve represents the standard error of the mean obtained at each time bin for each condition. (b) Mean proportion of target first looks during the anticipation phase depending on word knowledge (known, unknown). Large dots represent group means, and small dots represent individual means; error bars represent standard errors. Asterisk indicates where the proportion of target first looks was statistically above chance (.5).
First-look responses during anticipation phase
There was a main effect of trial type (known words vs. unknown words), χ2(1) = 7.49, p = .006 (see Fig. 2b). For known words, children were significantly more likely than chance to initiate a first look toward the hidden target (M = 0.58, SE = 0.03, β = 0.11, z = 2.45, p = .01) but not for unknown words (M = 0.45, SE = 0.03, β = −0.15, z = −1.22, p = .22). There was also a main effect of age, χ2(1) = 6.58, p = .01; older children were more likely to initiate a first look toward the hidden target than younger children (see figure in the Supplemental Material). No other effects or interactions were significant.
Additional analyses reported in the Supplemental Material show that first-look responses were more robust when children took more time to initiate their first-look responses for known words, whereas response latencies did not affect accuracy for unknown words (see details in the Supplemental Material). This suggests that first-look decisions are a mixture of informed responses, in which children fully process the target word and retrieve the most probable location of the target referent (longer latencies), and early responses initiated before children are able to fully process the target word—that is, potential mistakes reflecting a variety of additional factors (stimulus preference or stimulus complexity, side biases, etc.).
Analysis 2: word-recognition confidence
Persistence times
Persistence times during the anticipation phase were affected by first-look accuracy, χ2(1) = 4.30, p = .04; participants looked longer after a correct compared with an incorrect first look (see Fig. 3a). Yet this pattern depended on whether the words were known, as would be expected if persistence indexes the confidence associated with children’s decisions about what each word meant: The interaction between word knowledge and first-look accuracy was significant, χ2(1) = 11.21, p < .001. When tested on known words, participants showed longer persistence times after making a correct first look compared with an incorrect first look (correct: M = 1,228 ms, SE = 78 ms; incorrect: M = 823 ms, SE = 62 ms; β = 0.33, t = 3.92, p < .001), but accuracy did not affect persistence when children were tested on unknown words (correct: M = 1,061 ms, SE = 98 ms; incorrect: M = 1,145 ms, SE = 70 ms; β = −0.07, t = −0.85, p = .39).

Fig. 3. Word-recognition confidence (Experiment 1). (a) Relationship between persistence times and first-look accuracy depending on word knowledge (known, unknown). Persistence times were averaged separately for correct and incorrect first looks for each level of word knowledge. (b) Mean number of gaze switches between areas of interest (target, distractor) during the anticipation phase. In both panels, large dots represent group means, and small dots represent individual means; error bars represent standard errors.
There was also a significant interaction between age and first look, χ2(1) = 8.14, p = .004, indicating that older children were more likely than younger children to display higher persistence after a correct first look than after an incorrect first look. No other main effects or interactions were significant.
Four further analyses provided evidence that this key finding reflects confidence rather than low-level alternative explanations. First, if this pattern of persistence reflects confidence, then we may expect the difference between correct and incorrect looks to be larger on trials in which participants trade off speed for accuracy, echoing the rich literature showing a correlation between performance and confidence in human adults (for a review, see Fleming & Lau, 2014). Consistent with this prediction, postdecision persistence differed between correct and incorrect responses for both slow and fast responses (see related analysis of accuracy above), and this difference was stronger for slow responses (see details in the Supplemental Material).
Second, although the effect of accuracy was larger for slow response times (see the Supplemental Material), we did not find any evidence of a simple correlation between latency to first-look and persistence times (p = .8), which rules out the possibility that children’s persistence times can be explained by a low-level association between persistence times and response times (see details in the Supplemental Material).
Third, we also did not find any evidence that first-look responses (correct vs. incorrect) reflected different degrees of word-referent knowledge activation. For instance, it could be that children who initiated an incorrect first look did so because they knew the tested word less well than children who initiated a correct first look (despite parental reports being similar) or were just less motivated by the task and thus did not reactivate their word-referent knowledge as well as motivated children. Yet children’s target-looking behavior during the looking-while-listening phase (which reflects their word-referent knowledge as well as their motivation for the task) did not differ between trials that led to a correct versus an incorrect first look (no cluster found; see details in the Supplemental Material).
Finally, we did not find evidence that children were more likely to persist for longer toward the object they had favored during the looking-while-listening phase (see details in the Supplemental Material). This suggests that persistence times do not simply reflect the aftereffects of low-level attentional processes operating during the looking-while-listening phase (i.e., side or object).
In sum, because children did not receive external feedback indexing their performance during the anticipation phase, the difference in persistence times suggests that they were using internal evidence to evaluate whether they had made the correct decision, that is, that they were monitoring the confidence associated with their understanding of the words. This effect was most visible by 2 years of age, when participants showed reliable look-to-target performance for known words during the looking-while-listening task and on first looks during the anticipation phase. The weaker effect in younger children could be the result of weaker language-processing skills, weaker confidence monitoring, or both.
GSF (post hoc)
The number of gaze shifts during the anticipation phase was modulated by word knowledge (Fig. 3b). Children shifted their gaze more often when tested on unknown words (M = 2.45, SE = 0.07) than when tested on known words (M = 2.18, SE = 0.08), χ2(1) = 4.08, p = .04. No other effect or interaction was significant. This is suggestive of low confidence; children actively shifted their gaze between the two possible options when they knew that they did not know (Folke et al., 2016; Leckey et al., 2020).
Our results thus show that 2-year-olds can monitor their own performance in a word-recognition task.
Experiment 2
Whereas Experiment 1 showed that persistence times index children’s confidence about what a word is likely to refer to, Experiment 2 aimed to establish that these confidence estimates reflect a child’s confidence that they understand what a word is intended to mean.
Our method draws on evidence that, by age 2 years, children can account for speakers who use words idiosyncratically, such as labeling a ball as “dog.” If an unreliable, idiosyncratic speaker teaches a 2-year-old a new word, for example, that a novel object is called a “wug,” then that child will restrict the domain of that word to that specific individual and will not generalize its use to other individuals (Dautriche et al., 2021; Koenig & Woodward, 2010). This suggests that the reliability of a speaker may impact children’s confidence in how words are used even when children show similar accuracy levels. To wit, if an unreliable speaker tells the child to “look at the cat” on a trial in which both a cat and a boat are hidden, then the child may infer that “cat” probably refers to the cat, as a best guess. But the child may not be confident in that response because the speaker has been unreliable in the past, and the child would thus show a reduced difference between postdecision persistence times following correct versus incorrect responses.
In Experiment 2, 2-year-olds first watched a video in which a confederate demonstrated themselves to be either a reliable or unreliable speaker and then taught the child two new words. Then, participants completed a word-recognition task as in Experiment 1, in which the same speaker used a combination of familiar words and the newly taught novel words. For both novel and known words, we predicted that children would show accurate recognition but with low confidence when the speaker is unreliable (Fig. 4).

Fig. 4. Design of Experiment 2. The experiment consisted of three phases. The first was the speaker-exposure phase, in which a speaker labeled familiar objects using either correct labels (e.g., calling a ball a “ball”; the reliable condition) or incorrect labels (e.g., calling a ball a “dog”; the unreliable condition). The second phase was the teaching phase, in which the speaker taught two novel words (“danu” and “modi”) for two novel objects. The third phase was the testing phase, similar to Experiment 1, which tested recognition and confidence in both known words (as pictured; different from the labels used during the exposure phase) and novel words (with the two novel objects displayed on the screen). The test trials used the same speaker as in the exposure phase.
Method
Participants
Sixty English-speaking children were included in the final analysis: 30 in the reliable condition (age: M = 30 months 19 days, SD = 53 days, minimum = 27 months 26 days, maximum = 34 months 14 days; 12 boys) and 30 in the unreliable condition (age: M = 29 months 28 days, SD = 80 days, minimum = 24 months 14 days, maximum = 35 months 29 days; 14 boys). We tested children who were older, on average, than those in Experiment 1 because past literature using a similar design mostly focused on older children (a single study tested children younger than 2 years; Luchkina et al., 2018). The number of participants was estimated using Experiment 1’s data on the results of the first 16 trials and by considering the experiment as a between-subjects design. A power analysis based on this effect suggested that we should test at least 40 children per condition to have a power of 80% at the .05 α level. Because we tested children who were, on average, older than those in Experiment 1, we decided to limit the number of participants to 30 per condition (preregistered). An additional 20 children were tested but excluded from the analysis because they did not provide sufficient trials (n = 11; see exclusion criteria below), because they did not want to participate in the experiment (n = 5), because of sibling or caregiver interference (n = 3), or because of technical issues (n = 1). Participants were recruited in the Edinburgh area.
Procedure, experimental design, and material
The experiment was composed of three phases as described in Figure 4.
Speaker-exposure phase
Participants saw a video of a native English female speaker playing with five objects and labeling them. Each object was taken out of a box individually, labeled three times, and put back into the box. The same five objects were used across the two conditions: a tiger puppet, a banana, a ball, a shoe, and glasses. In the reliable condition, the speaker used the correct label to refer to the objects. In the unreliable condition, the speaker used incorrect labels that did not refer to any other objects seen in the video (flower, car, dog, book, star).
Teaching phase
Participants saw two 30-s videos, each teaching them one novel word. In each video, the speaker (of Phase 1) showed a novel object and labeled it five times using one of two novel words (“danu” or “modi”). The novel objects were two unfamiliar animals (see pictures in the Supplemental Material).
Testing phase
Test trials matched the procedure of Experiment 1 (Fig. 1) but used new audio stimuli recorded by the reliable/unreliable speaker. We implemented two changes to the trial time course. First, the looking-while-listening phase started with the two pictures presented simultaneously in silence (2 s). This gave children sufficient time to explore the pictures before hearing the target word and was intended to improve their performance during this phase. Second, both pictures reappeared on the screen during the reward phase. This was done to maintain the unreliability of the speaker for children in the unreliable condition but was implemented in both conditions. Importantly, this did not affect children’s motivation to look at the target object during the reward phase (see details in the Supplemental Material).
The testing phase was composed of 16 test trials: eight known-word trials and eight novel-word trials. The known-word trials used eight objects that did not appear during the speaker exposure phase (orange/butterfly, spoon/duck, cat/boat, hat/fish). According to parental reports, five children did not know one to three of these known words. Removing these items from the analysis does not change the pattern of results. Each pair was shown twice, and each referent named once. The novel trials showed the two newly learned objects, with each being named four times. The smaller number of trials in this study matched the average number of trials completed in Experiment 1.
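The composition of the testing phase can be sketched in a few lines of Python. This is only an illustration of the counts given above: the function name `build_test_trials`, the dictionary layout, and the shuffling of trial order are assumptions made for the example and are not taken from the experimental scripts.

```python
import random

# Known-word pairs listed in the text; each pair is shown twice,
# with each member of the pair named once.
KNOWN_PAIRS = [("orange", "butterfly"), ("spoon", "duck"),
               ("cat", "boat"), ("hat", "fish")]
NOVEL_WORDS = ("danu", "modi")  # each novel word is named four times


def build_test_trials(seed=0):
    """Return 16 trial dicts: 8 known-word and 8 novel-word trials."""
    rng = random.Random(seed)
    trials = []
    for a, b in KNOWN_PAIRS:
        trials.append({"type": "known", "target": a, "distractor": b})
        trials.append({"type": "known", "target": b, "distractor": a})
    first, second = NOVEL_WORDS
    for _ in range(4):
        trials.append({"type": "novel", "target": first, "distractor": second})
        trials.append({"type": "novel", "target": second, "distractor": first})
    # Assumption: trial order is randomized; the text does not specify the ordering.
    rng.shuffle(trials)
    return trials


trials = build_test_trials(seed=1)
print(len(trials), "trials;", sum(t["type"] == "known" for t in trials), "known-word")
```

Side of presentation (left vs. right) is deliberately left out because it is not described in this section.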
Criteria for trial and participant exclusion
The criteria were the same as in Experiment 1. Following these criteria, we removed 43% of the total number of trials collected.
Analyses
We conducted the same analyses as in Experiment 1. Because neither age nor any of its interactions with other predictors were significant, we removed it from the model. Note that in our preregistration, we discarded the first-look analysis as a measure of word knowledge because it seemed to be too noisy following Experiment 1, and previous research showed that first looks are a more variable index of word understanding than fixation proportions in the looking-while-listening paradigm (L. G. Naigles & Gelman, 1995; L. R. Naigles, 1996). However, for the sake of completeness, we report these results here.
Preliminary results
Because there was no learning difference between the specific novel word being tested (“danu” vs. “modi”; pmin = .20 using a cluster-based permutation analysis on the proportion of target looks during the looking-while-listening phase), we compared participants’ behavior across conditions (reliable vs. unreliable), collapsing looking behavior for all trials testing novel words. As in Experiment 1, the animacy of the target did not significantly affect participants’ persistence times.
Results
Analysis 1: word-recognition performance
Recognition during the looking-while-listening phase
As in Experiment 1, the looking-while-listening phase of the test trials showed that children readily recognized known words (see Figs. 5a and 5c). They looked toward the target significantly above chance in the reliable condition (from 600 ms to 4,700 ms; p < .001) and in the unreliable condition (from 550 ms to 4,350 ms; p < .001), with no difference between these conditions. For the newly taught novel words, we observed a similar pattern: Children looked toward the target significantly above chance in both conditions (reliable: from 850 ms to 2,050 ms, p = .007, and from 2,450 ms to 3,600 ms, p = .001; unreliable: from 2,900 ms to 3,550 ms, p = .036), again with no difference between conditions.
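For readers unfamiliar with cluster-based permutation analysis, the logic of testing target looks against chance can be sketched as follows. This is a generic sign-flip variant written for illustration, not a reproduction of the analysis reported here: the function names, the t threshold of 2.0, the use of summed t values as the cluster statistic, and the data layout (one list of per-bin proportions per child) are all assumptions.

```python
import math
import random


def t_values(data, chance=0.5):
    """One-sample t statistic against chance for each time bin.

    data: list of per-child lists, one proportion of target looks per bin.
    """
    n = len(data)
    out = []
    for b in range(len(data[0])):
        col = [row[b] - chance for row in data]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        # Simplification: a bin with zero variance is given t = 0.
        out.append(mean / math.sqrt(var / n) if var > 0 else 0.0)
    return out


def clusters(tvals, threshold):
    """Runs of adjacent bins with t > threshold, as (start, end, summed t)."""
    found, start, mass = [], None, 0.0
    for i, t in enumerate(tvals):
        if t > threshold:
            if start is None:
                start, mass = i, 0.0
            mass += t
        elif start is not None:
            found.append((start, i - 1, mass))
            start = None
    if start is not None:
        found.append((start, len(tvals) - 1, mass))
    return found


def cluster_permutation_test(data, chance=0.5, threshold=2.0, n_perm=1000, seed=0):
    """Return (start_bin, end_bin, cluster_mass, p) for each observed cluster."""
    rng = random.Random(seed)
    observed = clusters(t_values(data, chance), threshold)
    null_max = []
    for _ in range(n_perm):
        flipped = []
        for row in data:
            # Under the null, each child's deviation from chance is sign-symmetric.
            sign = rng.choice((1, -1))
            flipped.append([chance + sign * (x - chance) for x in row])
        perm = clusters(t_values(flipped, chance), threshold)
        null_max.append(max((m for _, _, m in perm), default=0.0))
    return [
        (s, e, m, (1 + sum(nm >= m for nm in null_max)) / (n_perm + 1))
        for s, e, m in observed
    ]
```

Each returned cluster corresponds to a contiguous time window of above-chance looking, which is the form in which the windows above are reported. Comparing each observed cluster with the largest cluster found in every permutation is what controls for testing many time bins at once.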

Fig. 5. Word-recognition performance (Experiment 2). For known (a) and novel (c) words: proportion of looks toward the target picture, time-locked to the beginning of the target word for the reliable condition and for the unreliable condition. The shaded regions surrounding the asterisks represent the time range where the proportion of target looks was significantly above chance level (.5). The ribbon surrounding each curve represents the standard error of the mean obtained at each time bin for each condition. Children looked to the target significantly above chance (.5) in the reliable condition and in the unreliable condition. For known (b) and novel (d) words: mean first-look accuracy during the anticipation phase in the reliable condition and in the unreliable condition. In both (b) and (d), large dots represent group means, and small dots represent individual means; error bars represent standard errors.
First-look responses during the anticipation phase (post hoc)
Overall, participants looked toward the target above chance when tested on known words (β = 0.30, z = 2.58, p = .01; see Figs. 5b and 5d). They were significantly more likely than chance to initiate a first look toward the target in the unreliable condition (M = 0.60, SE = 0.03, β = 0.39, z = 2.34, p = .02). Performance was not significantly above chance in the reliable condition (M = 0.55, SE = 0.03, β = 0.21, z = 1.29, p = .20), possibly because the average latency of first looks in this condition was 174 ms, lower than in the unreliable condition (200 ms) or than the latency in the known-word condition of Experiment 1 (213 ms), and thus may include more early, preemptive looks than the other condition (see details in the Supplemental Material). However, there was no difference between conditions, χ2(1) = 0.09, p = .76. For novel words, participants were not more likely than chance to look at the target in either the reliable condition (M = 0.52, SE = 0.05, β = 0.03, z = 0.17, p = .86) or the unreliable condition (M = 0.55, SE = 0.04, β = 0.14, z = 0.75, p = .45).
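The coefficients above can be related to the reported proportions with a short calculation. The sketch below assumes that each β is an intercept on the log-odds scale from a logistic model, so that β = 0 corresponds to chance (.5); the model specification is not restated in this section, so this reading is an assumption, and `inv_logit` is simply a helper name chosen for the example.

```python
import math


def inv_logit(beta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-beta))


# Known-word coefficients copied from the text.
reported = {"overall": 0.30, "unreliable": 0.39, "reliable": 0.21}

for label, beta in reported.items():
    print(f"{label}: beta = {beta:.2f} -> probability = {inv_logit(beta):.3f}")
```

Under this assumption, β = 0.39 corresponds to a probability of about .60 and β = 0.21 to about .55, in line with the condition means reported above. Model-based estimates need not match raw means exactly when the model includes random effects.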
As a whole, our results show that children recognize familiar words whether they are tested by a reliable or an unreliable speaker. Their display-phase responses also show that children learned the novel words in both conditions, replicating previous studies (Koenig & Woodward, 2010). Following Experiment 1, the first-look accuracy was not high (because it may not always correspond to a response selection; see the Supplemental Material) but, critically, was comparable across conditions for both word types, allowing us to analyze how word-recognition confidence may vary across conditions while controlling for accuracy.
Analysis 2: word-recognition confidence
Persistence times
Following our preregistered plan, we analyzed persistence times separately for known and novel words (a combined analysis can be found in the Supplemental Material). For known words, children’s persistence was influenced not only by their first-look accuracy but also by the reliability of the speaker, leading to a significant interaction between these factors, χ2(1) = 4.24, p = .04 (Figs. 6a and 6c). Overall, children persisted longer after a correct first look than an incorrect first look (main effect of accuracy), χ2(1) = 20.35, p < .001, but they did so more when the speaker was reliable rather than unreliable. For the reliable speaker, persistence times were significantly longer after correct rather than incorrect first looks (correct: M = 1,526 ms, SE = 101 ms; incorrect: M = 878 ms, SE = 92 ms; β = 0.542, t = 4.70, p < .001), whereas for the unreliable speaker, this difference was marginally significant (correct: M = 985 ms, SE = 85 ms; incorrect: M = 813 ms, SE = 90 ms; β = 0.205, t = 1.76, p = .08). The main effect of speaker reliability on persistence times was marginal, χ2(1) = 3.56, p = .06.
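The interaction can be made concrete by computing the accuracy effect in each condition from the raw means above and then taking the difference between the two. This is plain arithmetic on the reported means, offered as an illustration; it is not the model-based test, and the variable names are chosen for the example.

```python
# Mean persistence times (ms) for known words, copied from the text.
means = {
    ("reliable", "correct"): 1526,
    ("reliable", "incorrect"): 878,
    ("unreliable", "correct"): 985,
    ("unreliable", "incorrect"): 813,
}


def accuracy_effect(condition):
    """Extra persistence (ms) after a correct versus an incorrect first look."""
    return means[(condition, "correct")] - means[(condition, "incorrect")]


reliable_gap = accuracy_effect("reliable")            # 1526 - 878 = 648 ms
unreliable_gap = accuracy_effect("unreliable")        # 985 - 813 = 172 ms
interaction_contrast = reliable_gap - unreliable_gap  # 648 - 172 = 476 ms

print(reliable_gap, unreliable_gap, interaction_contrast)
```

In raw terms, the persistence advantage for correct first looks was 648 ms with the reliable speaker and 172 ms with the unreliable speaker, a difference of 476 ms.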

Fig. 6. Word-recognition confidence (Experiment 2). For known (a) and novel (c) words: relationship between persistence times and first-look accuracy depending on condition (reliable, unreliable). Persistence times were averaged separately for correct and incorrect first looks for each condition. For known (b) and novel (d) words: mean number of gaze switches between areas of interest (target, distractor) during the anticipation phase. In all panels, large dots represent group means, and small dots represent individual means; error bars represent standard errors.
For the novel words, however, persistence times were not modulated by either first-look accuracy or condition (all ps > .11), despite children having shown that they could recognize these novel words during the looking-while-listening phase. This suggests that children were able to recognize the referents of the novel words but that they were not yet confident in their lexical decisions (at least as indexed by persistence times) or that they could not yet evaluate their confidence, presumably because the words were newly learned.
Further analyses again ruled out low-level counterexplanations. First, as in Experiment 1, there was no evidence that children’s persistence times can be explained solely by low-level associations between persistence times and response times, or between persistence times and object preference during the looking-while-listening phase (see the Supplemental Material).
Moreover, analysis of behavior during the reward phase indicated that participants’ memory for the target words and target word locations was matched across the reliable and unreliable speaker conditions. Specifically, children in both conditions looked at the target above chance, with no difference between conditions, even though linguistic stimuli were absent. Thus, the idiosyncrasy of the speaker did not affect memory reinstatement processes. This also rules out the possibility that persistence times may index children’s confidence in remembering the location of the object rather than their linguistic confidence because memory is unaffected by speaker reliability (see more details in the Supplemental Material).
GSF (post hoc)
We tested whether the number of gaze shifts during the anticipation phase was influenced by speaker reliability, word type, and the interaction between these two factors. Children shifted their gaze more when the speaker was unreliable than when she was reliable, χ2(1) = 3.99, p = .05 (Figs. 6b and 6d). This was significant for known words (reliable: M = 2.10, SE = 0.09; unreliable: M = 2.69, SE = 0.13; β = 0.54, t = 3.19, p = .002) but not for novel words (reliable: M = 2.39, SE = 0.10; unreliable: M = 2.49, SE = 0.14; β = 0.29, t = 0.23, p = .79). The interaction between word type and speaker condition was significant, χ2(1) = 8.13, p = .004. This result is congruent with the postdecision-persistence analysis and confirms that children in the unreliable condition were less confident than children in the reliable condition, in particular when tested on known words. There was no main effect of word type, χ2(1) = 0.10, p = .75, but in the reliable condition, children generated more gaze shifts when tested on novel words than when tested on known words (β = 0.27, t = 2.25, p = .02), suggestive of greater uncertainty in novel-word trials in the reliable condition.
Our results thus show that children’s confidence estimates are influenced by social information.
General Discussion
These two experiments show that, by 24 months, children’s looking behavior reveals their confidence in understanding a word: They persist more in recognition decisions when they have reasons to be sure about a word’s meaning.
Critically, the finding that children’s confidence appeared to be affected by the reliability of the speaker suggests that children were evaluating not only the meaning of the words they heard but also what they thought the speaker intended the words to mean. This is important because it is consistent with pragmatic accounts of language comprehension (Clark, 1996; Grice, 1975; Sperber & Wilson, 1986) as well as with modern noisy-channel models of adult language processing (e.g., Clayards et al., 2008; Levy, 2008), which highlight that sentence comprehension involves both decoding the current signal and integrating that signal with prior knowledge about what meanings a speaker is likely to express, in order to derive the most probable interpretation. Children’s context-relative confidence estimates suggest that they can already integrate their processing of a signal with their prior knowledge of a speaker (e.g., the speaker’s reliability) and thus imply that, by age 2 years, they are already able to process words and sentences using an active, noisy-channel strategy.
The ability to estimate confidence during language comprehension could play an important role throughout language development. For instance, confidence estimates could be used by children to optimize how they allocate attention during learning (e.g., attending to situations in which they have low confidence in their interpretation of words; see Zettersten & Saffran, 2019). Moreover, confidence estimates could also guide children’s interrogative behaviors: Low confidence in having understood a word would be a signal for children to request clarification from their caregivers, either behaviorally or verbally (Bazhydai et al., 2020; Butler et al., 2020; Hembacher et al., 2020; Jimenez et al., 2018). Our findings also suggest that confidence estimates are responsive to the social context. Specifically, children exposed to an idiosyncratic speaker who used labels unconventionally show reduced confidence in their interpretations. Such recalibration of confidence may have important implications for learning: Underconfidence may lead children to be overly receptive to any additional information they may receive about a word meaning; overconfidence, on the other hand, could make them indifferent to it (Rollwage & Fleming, 2021; Rollwage et al., 2020).
Throughout this discussion, we have interpreted our participants’ eye-movement behavior in terms of decision accuracy (for the first location of eye movements) and confidence (for the persistence following initial choices). But might these data also be accounted for by simpler mechanisms? For instance, it has been previously suggested that simpler interpretations in terms of first-order processes such as attention or memory could explain both decision accuracy and postdecision persistence in similar paradigms (Carruthers, 2020; Gliga & Southgate, 2016). These alternative interpretations, however, find little support in our results. First, and most importantly, our results show for the first time that young children’s postdecision persistence can be dissociated from their ability to perform a task and varies depending on the social context, as is the case in adults (Jacquot et al., 2015). Such a dissociation would not be expected if accuracy and persistence were driven by a single mechanism. Second, neither participants’ memory for the target words or target word locations nor attention during the display phase predicted persistence patterns. This, taken as a whole, represents the strongest evidence to date that young children’s postdecision persistence truly reflects confidence, rather than performance, attention, or memory. It may be that confidence directly reflects properties of the decision-making process (e.g., the distance between accumulated evidence and a decision bound, or an evaluation of decision time; Kiani & Shadlen, 2009; Pereira et al., 2021). Alternatively, it could be that confidence reflects core metacognitive monitoring even in young children (Goupil & Kouider, 2019). The finding that speaker idiosyncrasy can differentially impact first-look accuracy and postdecision persistence favors this latter interpretation.
Finally, our results show that one of the most widely used methods in infant language research, the looking-while-listening paradigm, can elide very different states of label-referent understanding. For instance, Experiment 2 found highly similar looking-while-listening performance for recognizing known words uttered by reliable versus unreliable speakers, but our persistence measure revealed differences in confidence levels. We suggest that our paradigm could be an important new tool for more precisely evaluating the interpretations that infants give to words and sentences, especially to assess children’s emerging pragmatic skills.
In sum, our work converges with a growing body of evidence suggesting that monitoring confidence is a fundamental ability that enables humans to actively and adaptively respond to their environment from a very young age (Ghetti et al., 2013; Goupil & Kouider, 2019). It extends previous results by showing that toddlers’ capacity for confidence monitoring is not restricted to the evaluation of simple perceptual decisions but extends to socially informed conventional knowledge. Certainly, while our work extends our knowledge of the development of core metacognition in Western, educated, industrialized, rich, and democratic (WEIRD) populations, it is open to question whether such a development is impacted by cultural and socioeconomic factors. The influence that monitoring confidence has on early lexical development is currently unknown, but we hope that these results will stimulate interest in characterizing the role that confidence monitoring plays in supporting active and adaptive language learning.
Supplemental Material
Supplemental material, sj-pdf-1-pss-10.1177_09567976221105208 for Two-Year-Olds’ Eye Movements Reflect Confidence in Their Understanding of Words by Isabelle Dautriche, Louise Goupil, Kenny Smith and Hugh Rabagliati in Psychological Science
Footnotes
Acknowledgements
We thank Jessica Brough, Jenny Chim, Anna Hall, Rachel Kindellan, and Rebekah Oakley for data collection and thank the actor and voice of our stimuli, Emma Healey.
Transparency
Action Editor: Vladimir Sloutsky
Editor: Patricia J. Bauer
Author Contributions
I. Dautriche developed the study concept. All the authors contributed to the study design. I. Dautriche conducted testing, data collection, and analysis. All the authors contributed to the writing of the manuscript and approved the final manuscript for submission.
References