Abstract
Children successfully learn words through overhearing others engaged in verbal interactions. The current studies investigated the degree to which four-year-old overhearers are influenced by the response behaviors of addressees and by the interactional pattern of the speakers and addressees. It was found that while addressee responses on their own did not influence vicarious word learning (Experiment 1), they did influence vicarious word learning when they were part of the shared process of dialogue construction (Experiment 2). When speaker responses reinforced the label after disagreeing addressee behaviors, vicarious word learning was boosted. This suggests that, by the age of four, vicarious word learning involves dialogue comprehension and a pragmatic understanding of the joint construction of talk by conversational partners.
Children live in a complex social world that provides a variety of informational sources for language acquisition. Much of the research on early word learning has focused on the role of child-directed speech (Lieven & Stoll, 2010), and the positive relationship between child-directed speech and children’s vocabulary size (Hoff, 2003; Rowe, 2008). Direct interactions with children are not, however, the only contexts in which word learning takes place. In some communities children are expected to learn through observation, and children readily acquire language despite minimal child-directed speech (Crago, 1992; Ochs & Schieffelin, 1984; Rogoff, Mistry, Göncü, & Mosier, 1993). Thus, observing others’ interactions is also fertile ground for child word learning (Akhtar, 2005; Akhtar, Jipson, & Callanan, 2001; O’Doherty et al., 2011).
How is it that young children are able to learn words through observing others engaged in conversation? One explanation is that they simply associate any label-like word they hear spoken, to anyone, to the object of the speaker’s attention. In this sense they might be imagining themselves in the place of the addressee, acting as second listener and thereby receiving information from the speaker’s labeling (Herold & Akhtar, 2008). However, comprehending third-party dialogue is not the same as either comprehending monologue or directly participating in dialogue (Fox Tree, 1999; Schober & Clark, 1989). Observing dialogue involves coordinating information from both the speaker and the addressee, the individual actively engaged in the conversation but not currently speaking. The present study explores children’s sensitivity to addressee behaviors within the context of vicarious word learning.
Vicarious word learning
Observing others engaged in interaction is a common context in which learning both words and actions occurs (Akhtar, 2005; Rogoff, Paradise, Mejia-Arauz, Correa-Chavez, & Angelillo, 2003; Shimpi, Akhtar, & Moore, 2013). In such a context, the child is present within the perceptual and physical sphere of a given interaction, but does not actively participate. This type of third-party listening is generally referred to as overhearing (Goffman, 1981).
In one of the earliest studies on vicarious word learning in children, two-year-olds’ ability to learn novel words when taking part in an interaction was compared with their ability to learn when overhearing as a third-party listener (Akhtar et al., 2001). In each condition, the same object-labeling interaction took place. The speaker introduced a label for a novel object, handing it to the addressee after producing the label. In the direct addressee condition, the child was the clear recipient of the speaker’s talk. In the overhearer condition, the speaker followed the same script but interacted with a confederate addressee, with both ignoring the co-present child. Comprehension tests demonstrated that children learned labels for the target objects equally well in the two conditions (Akhtar et al., 2001), illustrating the efficacy of overhearing in language acquisition.
This type of vicarious learning through overhearing appears to be a robust phenomenon. Overhearing leads to learning in contexts in which children observe interactions involving both familiar and unfamiliar addressees, as well as in contexts involving both child and adult addressees (Shimpi et al., 2013). Children learn novel object labels when observing them presented in direct labeling, but learn equally well when the labels are embedded in more complex, indirect contexts (Gampe, Liebal, & Tomasello, 2012), and when embedded in directive requests (Akhtar, 2005).
Important to this ability to learn through observation of others is the child’s sensitivity to socio-pragmatic information. Children rely on socio-pragmatic information, and especially joint attention, for appropriate mapping of words to referents, whether directly engaged in or simply observing object-oriented activity (Baldwin et al., 1996). Indeed, children as young as 16 months are sensitive to social and attentional cues to speaker reliability (Koenig & Echols, 2003). When resolving referential ambiguities present in the interactions of others, children take into account pragmatic information on what could or could not be considered a part of joint attention of the observed interactants (Martin & Vouloumanos, 2013). This ability to take into account cues from multiple speakers during a labeling event develops with experience. Children with more experience with multi-party interactions (in which they are more likely to take on the role of overhearer) pay more attention to third-party interactions (Shneidman, Buresh, Shimpi, Knight-Schwarz, & Woodward, 2009).
Grounding conversations
Conversational partners negotiate mutually shared knowledge, or common ground, which forms the basis for their coordination within some joint activity (Brennan, Galati, & Kuhlen, 2010). In dialogue, common ground is created and updated through a process called grounding (Clark & Brennan, 1991; Clark & Wilkes-Gibbs, 1986). Grounding is a collaborative moment-to-moment process that occurs in three steps (Bavelas, De Jong, Korman, & Smock Jordan, 2012, p. 1; italics in original):
1. The speaker presents information.
2. The addressee displays that she has understood the information (or has not understood or is not certain).
3. The speaker acknowledges that the addressee has understood (or not).
One way Step 2 can be achieved is through addressees’ use of backchannels. Backchannels are responses that are used to display continued attention, understanding, or agreement (Goodwin, 1986; Yngve, 1970), and can take the form of nonverbal expressions (eyebrow raises, smiles, nods), verbal expressions (mm-hmm, yeah, oh really), or more complex behaviors such as turn completions.
In collaborative language theory, all contributions to a conversation must be grounded (Brennan et al., 2010). The experimental paradigm used to explore these processes involves people describing novel objects, which requires the construction of novel labels and agreement on which labels to rely on (Brennan & Clark, 1996; Clark & Wilkes-Gibbs, 1986; Fay, Garrod, & Roberts, 2008). Novel labels are not always successfully grounded at first. But over repeated turns people do develop reliable referring expressions. The development of these referring expressions is a collaborative process shaped by both speaker attempts and listener responses, including backchannels (Tolins, Zeamer, & Fox Tree, 2017).
Listening in on conversations does not afford overhearers the same advantages as direct addressees. Unlike direct addressees, overhearers cannot guide a conversation to meet their needs (Schober & Clark, 1989). However, listening to a collaborative dialogue is more informative for overhearers than listening to a single speaker (Fox Tree, 1999). Further, the more contributions made by an addressee or multiple addressees to a conversation, the more all addressees understood (Fox Tree & Clark, 2013). This benefit is likely due to the role that addressees play in shaping the talk produced (Fox Tree & Mayer, 2008; Tolins et al., 2017). Within the vicarious word-learning paradigm, children witness a successful grounding within an interaction, yet the contributions to this process made by the addressee have thus far been unexamined.
Addressee backchannels in dialogue comprehension
In adults, dialogue comprehension involves understanding interactions holistically – not as two individuals’ interleaved contributions, but as turns that build up conversational participants’ common ground. This is true even for backchannels. The way a conversation proceeds varies depending on the types of backchannels addressees provide (Tolins & Fox Tree, 2014). Because of this, adult overhearers use these different backchannel responses from addressees to predict what types of turns speakers are likely to take next (Tolins & Fox Tree, 2016).
Comprehension also differs depending on whether overhearers hear an addressee respond positively or negatively to a speaker’s talk. In Tolins and Fox Tree (2015), overhearers who listened to an addressee responding positively or negatively to the same speaker talk remembered what was said differently. When the listener responses were included in the back-and-forth of the conversation, overhearer recall and judgment were biased by the way the listener responded: overhearers had more negative opinions and were more likely to remember negative aspects of the original description after watching the interaction containing negative addressee responses (Tolins & Fox Tree, 2015).
Together, these studies suggest that for adult overhearers, backchannel responses from addressees are informative as part of the collaborative process of understanding others talking together. Conversational meaning is collaboratively constructed (Clark, 1996; Linell, 2008), with active participation from multiple interlocutors. As such, the meaning of addressee backchannels for overhearers relates not to information conveyed by the addressee, but rather to how a given backchannel shapes the unfolding dialogue.
Current studies
Although backchannels are critical to the success and continuation of coordinated verbal activity, not much is known about how addressees’ responses affect vicarious learning in children. Across studies testing vicarious learning in children, addressee behavior has been similar, and children’s attention to the addressee has not been analyzed. The confederate addressee maintains attention on the speaker, and smiles and nods when presented with the label and receiving the target objects. In one study, the researcher measured children’s attention to the conversational participants, but did not separate looks to the addressee from looks to the speaker (Akhtar, 2005). In another study, the researchers coded visual attention to speaker and addressee separately, but then collapsed these measures in the analyses (Shneidman et al., 2009). Researchers have also examined effects of the identity of the addressee or the relationship between the addressee and the child overhearer (Shimpi et al., 2013), but not the relationship between the speaker and the addressee. To date, no research has examined whether children engaged in the context of overhearing treat addressee behaviors as informative, or to what extent variation in the type of addressee response may influence vicarious word learning.
In one exception, observing a reciprocal social interaction was demonstrated to be critical to vicarious word learning in both in-person contexts and in learning from video recorded interactions (O’Doherty et al., 2011). Regardless of whether the observed interaction involved previously recorded or live actors, when the object being labeled was not handed over from the speaker to the addressee in the observed interaction, learning was reduced. As reciprocity is a feature of conversation that cannot be reduced to the actions of any one individual, dialogue comprehension in children, as with adults, requires an understanding of joint interactivity.
Given that addressees play a critical role in the co-construction of talk, it is important to understand the role of addressees in child word learning through overhearing. That children successfully deploy such backchannels when they are addressees themselves (Dittman, 1972; Hess & Johnston, 1988) suggests they may have some grasp of the function of backchannels within dialogue – but this does not preclude the possibility that understanding the responses of others in overheard conversations may develop later. We consider three possibilities: the speaker only hypothesis, the distinct sources hypothesis, and the collaborative grounding hypothesis.
The speaker only hypothesis is that children engaged in overhearing project themselves into the conversation, understanding the speakers’ talk as though they were the direct addressees and therefore ignoring any addressee backchannels. This hypothesis suggests that engaging in direct listening and overhearing are essentially equivalent.
The distinct sources hypothesis is that child overhearers pay attention to addressee responses as well as speaker talk, but take backchannels at face value as an informational resource distinct from the speaker. We consider two contrasting versions of this hypothesis, the reliability cue hypothesis and the surprise hypothesis. In the reliability cue hypothesis, positive addressee backchannels in response to speaker labeling may support word learning by acting as independent cues to speaker reliability. Backchannels are one of a number of cues that children use within selective social learning (Sobel & Kushnir, 2013), which also include bystander attention (Chudek, Heller, Birch, & Henrich, 2012) and consensus across speakers (Bernard, Proust, & Clément, 2015). In the surprise hypothesis, negative addressee responses, which are less common and therefore more surprising (Atkinson & Heritage, 1984), might draw child overhearers’ attention. Negative responses might boost learning above the unmarked interaction involving positive addressee responses. Both of these hypotheses are versions of the distinct sources hypothesis because in both it is addressee backchannels on their own that affect learning.
The collaborative grounding hypothesis is that backchannels are understood as part of the collaborative process of dialogue, and thus are informative not on their own but only in relation to speaker response and uptake. This hypothesis draws on findings in the adult literature suggesting that adult overhearers take the interactional role of backchannels into account when engaged in third-party listening (Tolins & Fox Tree, 2015, 2016).
The present study adjudicates among these hypotheses by testing the extent to which addressee backchannels affect children’s vicarious word learning. The extension of the vicarious learning paradigm to include recorded interactions improves experimental control. The exact same interaction was observed by multiple children without requiring live actors to faithfully reenact a script multiple times (O’Doherty et al., 2011). Our vicarious learners observed recorded reciprocal interactions. In the video recordings for each experiment, two actors, a speaker and an addressee, engaged in a labeling task involving the introduction of novel labels for four novel objects. Addressees in the observed interactions displayed either agreeing or disagreeing backchannels in the form of nonverbal facial expressions and head movements.
The three hypotheses make distinct predictions regarding vicarious word learning. The speaker only hypothesis predicts that children will learn words equally well regardless of what type of backchannels are present in the observed interaction. The distinct sources hypothesis predicts that word learning should be influenced whenever an addressee produces a backchannel, with the reliability cue version predicting increased learning following agreeing backchannels and the surprise version predicting increased learning following disagreeing backchannels. The collaborative grounding hypothesis also predicts that addressee backchannels will influence word learning, but only in contexts where it is possible for the backchannel to be informative in relation to a speaker’s continued actions. Therefore, this hypothesis predicts no effect when backchannels are presented as the final contribution in an observed dialogue. Rather, it predicts that the speaker talk following a backchannel (Step 3 in Bavelas et al.’s [2012] process) will be interpreted in relation to addressee feedback, with pragmatic understanding of the interaction as a whole influencing vicarious word learning.
In Experiment 1 we used the standard interactional script of the vicarious word-learning paradigm and included a condition with disagreeing addressee responses. By contrasting the two types of addressee responses following label presentation, we tested both versions of the distinct sources hypothesis. In Experiment 2 we presented the addressee backchannels as interwoven within continued speaker labeling, and thus tested the collaborative grounding hypothesis. In Experiment 2 we also included a rough measure of gaze direction. Less gaze towards an observed addressee could be taken as support for the speaker only hypothesis. Similarly, distinct patterns of gaze allocation across interactions involving agreeing and disagreeing addressee responses could be informative regarding children’s sensitivity to listener behaviors. In both experiments, the speaker only hypothesis predicts no differences in learning across the agreeing and disagreeing conditions.
Experiment 1
Children observed object-labeling interactions in which the addressee provided either agreeing or disagreeing responses. While prior studies have shown pragmatic sensitivities at an earlier age (O’Doherty et al., 2011), understanding the role of addressee backchannels may require more fully developed social processing skills. We therefore tested four-year-olds, an age at which theory of mind is more fully developed (Flavell, 2004). We made use of the standard procedure within the vicarious word-learning paradigm, in which the speaker introduces the label before presenting the object and handing it to the addressee (Akhtar, 2005; Akhtar et al., 2001; Floor & Akhtar, 2006). We replicated this paradigm and extended it to compare agreeing and disagreeing addressee response behavior.
Method
Participants
Twenty-six four-year-olds were recruited from a database of families who had indicated interest in participating in studies of child development. Twenty-one were included in the analyses (M = 53 months, SD = 2.98 months, range: 48–59 months, 10 male), two were outside the age range of the study, and three had missing data (e.g., they did not respond to all test questions). Seventeen participants were monolingual English speakers; in addition to English, four children were exposed to the following languages: Spanish (3, one of whom had some knowledge of American Sign Language) and Mandarin (1). The children were from Caucasian (15), mixed ethnicity (4), Asian (1), and Native Hawaiian/Pacific Islander (1) families. Their parents were primarily college educated or higher; two reported that their highest education level was high school.
Design
There were two within-subjects conditions: (1) agreeing backchannels and (2) disagreeing backchannels. The agreeing backchannel condition was a replication of the overhearing condition from Akhtar et al. (2001). In the disagreeing condition addressee responses consisted of negatively valenced nonverbal backchannels (i.e., shaking head and frowning). Children observed two trials in each condition. The four trials, two with agreeing and two with disagreeing addressees, were presented in two counterbalanced orders. In one order, the first experimental trial was in the agreeing condition and in the other, the first experimental trial was in the disagreeing condition.
Materials
Four familiar items were used for the warm-up trials (see Appendix A). For the experimental trials, there were four sets of four novel objects, two animate and two inanimate (see Appendix B). In each set, one toy served as the target and the remaining three were distractors. The target object in each set was randomly selected. The same four targets (and four sets) were used in both videos. Across the two videos, each target was presented in both conditions. Toys were placed in opaque containers such that they were invisible to the participant until the speaker presented them.
Procedure
Children’s parents completed the informed consent form and demographic survey. Children were seated individually at a table in front of a laptop on which the stimuli were presented. The experimenter, who sat adjacent to the child, let the child know that they would be watching videos of people playing a game. After completing the experimental trials, children were thanked and given a small gift. The entire session took about 10 minutes.
Each trial consisted of the speaker removing the objects one at a time from the containers and presenting them to the addressee. For the target (labeled) object the speaker first placed her hand on the lid of the container and said, ‘I’m going to show you the [label]. Do you want to see the [label]? I’m going to show you the [label].’ The target object was always the first in the array, maximizing the interval between presentation and test. See Figure 1(a) for a schematic depiction of the labeling interaction. For the subsequent distractor objects, the speaker followed the same practice but said, ‘I’m going to show you this one. Do you want to see this one? I’m going to show you this one.’ Following these utterances, the speaker then removed the object, held it in front of her, and passed it to the addressee. Addressee responses were provided at three points: when the speaker first pulled the object out of the container, when the addressee was holding the object, and when the speaker took the object back from the addressee before returning it to its container. In the two agreeing test trials, the addressee displayed agreement with the speaker’s label by smiling and nodding. In the two disagreeing test trials, the addressee displayed disagreement by frowning and shaking her head (see Figure 2). In all other respects, the procedure was identical across the two conditions. For the distractor object presentations, addressee responses were the same as in the agreeing backchannel trials.

Figure 1. Labeling interaction script for (a) Experiment 1 and (b) Experiment 2. S represents the speaker and A represents the addressee. Contributions by speaker and addressee are indicated by the aligned text bubbles, with the timing of the reciprocal physical actions surrounding the exchange of the target object in the middle columns.

Figure 2. Video stills demonstrating (a) agreeing addressee backchannel response and (b) disagreeing addressee backchannel response after speaker presentation of label and object.
After each trial, children were shown a picture containing the four objects presented in the trial and were asked a comprehension question (‘Which one is the [label]?’) and a preference question (‘Which one is your favorite one?’), with the experimenter distracting the child for a time between the two questions. The order of comprehension and preference questions was counterbalanced within each video order and across participants. An on-line observer behind a one-way mirror noted which object the child selected, scoring each comprehension and preference question 1 if the child selected the target object and 0 if the child selected a distractor object. Thus, a child’s score could range from 0 to 2 in the agreeing and disagreeing addressee response conditions for both the comprehension and preference questions.
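The scoring rule can be illustrated with a minimal sketch. The function name, the record keys, and the example selections are hypothetical and invented for illustration; they are not the study's data or the authors' scoring code.

```python
from collections import defaultdict

def score_child(responses):
    """Sum target selections per (condition, question) for one child.

    `responses` is a list of dicts with hypothetical keys:
    'condition' ('agreeing' or 'disagreeing'),
    'question' ('comprehension' or 'preference'),
    'chose_target' (True if the child picked the target object).
    """
    scores = defaultdict(int)
    for r in responses:
        # 1 point for selecting the target, 0 for selecting a distractor
        scores[(r["condition"], r["question"])] += 1 if r["chose_target"] else 0
    return dict(scores)

# Invented example: four trials, each with one question of each type
example_responses = [
    {"condition": "agreeing", "question": "comprehension", "chose_target": True},
    {"condition": "agreeing", "question": "preference", "chose_target": False},
    {"condition": "agreeing", "question": "comprehension", "chose_target": True},
    {"condition": "agreeing", "question": "preference", "chose_target": True},
    {"condition": "disagreeing", "question": "comprehension", "chose_target": False},
    {"condition": "disagreeing", "question": "preference", "chose_target": False},
    {"condition": "disagreeing", "question": "comprehension", "chose_target": True},
    {"condition": "disagreeing", "question": "preference", "chose_target": False},
]
example_scores = score_child(example_responses)
```

With two trials per condition, every score in `example_scores` falls between 0 and 2, as described in the text.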
Results
Above-chance performance required at least 14 of the 21 children to select the target object on at least one of the two comprehension trials within a condition (binomial test, p < .05). The chance probability of selecting the target object at least once within a condition is .4375: the probability of picking one of the three non-target objects on both of a condition’s two trials is .75 × .75 = .5625, so the probability of picking the target at least once is 1 – .5625 = .4375 (see Floor & Akhtar, 2006, for a similar analysis). The number of children who selected the target object on at least one comprehension trial met or exceeded this criterion in both the agreeing (n = 19; 90%) and disagreeing (n = 17; 81%) conditions. In both conditions, the number of children who selected the target object on at least one preference trial did not exceed chance performance (see Table 1).
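The chance criterion can be checked with a short sketch using only the Python standard library. The text does not say whether the binomial test was one- or two-tailed; the sketch assumes a one-tailed test, which is the assumption under which the 14-of-21 figure is recovered. Function names are hypothetical.

```python
from math import comb

def binomial_upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def min_children_above_chance(n, p, alpha=0.05):
    """Smallest k such that P(X >= k) < alpha (one-tailed; an assumption)."""
    for k in range(n + 1):
        if binomial_upper_tail(n, k, p) < alpha:
            return k
    return None

# Chance of picking a non-target object on a single four-object trial
p_miss_once = 3 / 4
# Chance of picking the target at least once across a condition's two trials
p_chance = 1 - p_miss_once**2

# Minimum number of children (out of 21) needed for above-chance performance
threshold = min_children_above_chance(21, p_chance)
```

Here `p_chance` evaluates to 0.4375 and `threshold` to 14: the upper-tail probability is about .029 for 14 or more children and about .073 for 13 or more.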
Table 1. Number of children (out of 21) who selected the target object in comprehension and preference trials never (0), once (1), or twice (2) in each condition of Experiment 1.
Preliminary analyses on the comprehension trial data revealed no effects or interactions of trial order, question order, gender, age, or language exposure (monolinguals as compared to bilinguals); these factors were not analyzed further (all ps > .05). A 2 × 2 repeated measures analysis of variance examining the effects of question type (comprehension, preference) and backchannel type (agreeing, disagreeing) on children’s word learning revealed an effect of question type, F(1, 20) = 27.92, p < .001, ηp² = .58, but no effect of backchannel type, F(1, 20) = 0.05, p = .83, ηp² = .002, and no interaction, F(1, 20) = 0.26, p = .61, ηp² = .013. Children were more likely to select the target object when asked the comprehension question (M = 1.14, SD = 0.45) than the preference question (M = 0.36, SD = 0.42), MDiff = 0.79, paired samples t-test t(20) = 5.28, p < .001, d = 1.45.
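As a consistency check on the reported effect sizes, partial eta squared can be recovered from an F statistic and its degrees of freedom with the standard identity ηp² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies that identity to the values reported above; it is an illustration, not the authors' analysis code, and the function name is hypothetical.

```python
from math import sqrt

def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared computed from an F statistic and its dfs."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)

# F values reported in the text, each with df = (1, 20)
eta_question = partial_eta_squared(27.92, 1, 20)
eta_backchannel = partial_eta_squared(0.05, 1, 20)
eta_interaction = partial_eta_squared(0.26, 1, 20)

# For a two-level within-subjects factor, the paired t equals sqrt(F)
t_from_f = sqrt(27.92)
```

Rounded, these values agree with the reported .58, .002, and .013, and `t_from_f` agrees with the reported t(20) = 5.28.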
In order to rule out children’s selecting the target object during the comprehension question simply because they liked it best, we also conducted a more conservative analysis in which all trials on which a participant selected the target object for both the comprehension and preference questions were dropped. A paired-samples t-test revealed no significant difference in comprehension across the agreeing backchannel (M = 0.95, SD = 0.59) and disagreeing backchannel conditions (M = 0.95, SD = 0.74), MDiff = 0.0, t(20) = 0.0, p = 1.0.
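The exclusion rule in this conservative analysis can be sketched as follows. The text does not specify how dropped trials entered the condition means; the sketch simply leaves them out of the sum, which is an assumption. The record keys and the example trials are hypothetical.

```python
def conservative_comprehension_scores(trials):
    """Comprehension score per condition after dropping ambiguous trials.

    `trials` is a list of dicts with hypothetical keys 'condition',
    'comprehension_target', and 'preference_target' (both booleans).
    """
    scores = {"agreeing": 0, "disagreeing": 0}
    for t in trials:
        # Drop trials where the target was chosen for BOTH questions,
        # since the comprehension choice may only reflect preference.
        if t["comprehension_target"] and t["preference_target"]:
            continue
        scores[t["condition"]] += int(t["comprehension_target"])
    return scores

# Invented example for one child
example_trials = [
    {"condition": "agreeing", "comprehension_target": True, "preference_target": True},
    {"condition": "agreeing", "comprehension_target": True, "preference_target": False},
    {"condition": "disagreeing", "comprehension_target": False, "preference_target": True},
    {"condition": "disagreeing", "comprehension_target": True, "preference_target": False},
]
conservative = conservative_comprehension_scores(example_trials)
```

In this invented example the first agreeing trial is dropped, so the child scores 1 in each condition.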
Discussion
Children learned the label for the target object through watching the videotaped interaction, replicating previous studies. The valence of addressee backchannels in response to the presentation of a labeled object, either agreeing or disagreeing, did not influence the degree to which children learned the words. These findings are the first to demonstrate successful vicarious word learning in a condition involving disagreeing responses from the addressee.
The findings in Experiment 1 cast doubt on both versions of the distinct sources hypothesis, which predicted that addressee backchannels would be treated as a cue for social learning, independent of the actions produced by the speaker. According to the reliability cue version, children should have learned better in the agreeing addressee backchannel condition, but they did not. According to the surprise version, children should have learned better in the disagreeing addressee backchannel condition, but they did not.
The current findings cannot adjudicate between the two remaining hypotheses: speaker only and collaborative grounding. The speaker only hypothesis predicts that addressee responses will be ignored by children engaged in dialogue comprehension, and as such their valence, whether agreeing or disagreeing with the label, will have no effect on vicarious word learning. The collaborative grounding hypothesis predicts that addressee responses alone are uninformative without the presence of subsequent speaker uptake, which completes the grounding of a contribution within the context of collaborative dialogue, as in the three-part structure of successful grounding outlined above (Bavelas et al., 2012; Clark & Schaefer, 1989). In Experiment 1, following standard procedure for the paradigm, the speaker provided the label three times before the addressee responded. After the addressee responses, the object was simply put away and the speaker moved on without any verbal acknowledgment of the addressee’s behavior. It thus remains possible that addressee backchannels do play a role in vicarious word learning, but only when speaker and addressee contributions are seen as mutually co-constructive within a collaborative interaction.
In Experiment 2 we teased apart the speaker only hypothesis and the collaborative grounding hypothesis. By interweaving the contributions of the two conversational partners, we provided a context in which the backchannels could be used to guide interpretation of the speaker’s continued presentation of the label.
Experiment 2
In order to better exemplify the typical grounding sequence and tease apart the two remaining hypotheses, we restructured the typical vicarious word-learning script such that speaker and addressee contributions were interwoven. As in Experiment 1, the speaker introduced the object label three times. However, in the video stimuli for Experiment 2, the introduction of the label co-occurred with the presentation of the target object and was followed directly by addressee response. Similarly, because of the interwoven nature of the speaker and addressee contributions, in this new dialogue the second and third labeling by the speaker followed addressee backchannel responses, and could therefore be interpreted as speaker uptake (see Figure 1(b)).
As in Experiment 1, the speaker only hypothesis predicts that addressee backchannels will not influence vicarious word learning. The collaborative grounding hypothesis predicts that addressee backchannels will influence learning, as the interaction makes it possible for the backchannels to be informative in relation to a speaker’s continued actions. The critical difference across experiments is not the backchannels themselves, but how the backchannels change the interpretation of subsequent speaker talk (Tolins & Fox Tree, 2014, 2015, 2016).
The collaborative grounding hypothesis reframes the two conditions such that the agreeing condition is reinterpreted as the continuing speaker condition: Addressee agreement with the label and the speaker’s subsequent repetition would be seen as a successful continuation of the grounding of the speaker’s label (Bavelas et al., 2012). In contrast, the disagreeing condition is reinterpreted as the reinforcing speaker condition: Addressee disagreement with the label and the speaker’s subsequent repetition would be seen as the speaker’s rejection of the addressee’s disagreement and a continued effort to insist upon the introduced label.
By proposing that both speaker and addressee actions are interpreted within the context of the dynamics of the conversation, the collaborative grounding hypothesis makes a prediction that may at first seem counterintuitive – increased learning when addressee disagreement is followed by speaker repetition of the label. The interactional pattern of a speaker repeating a label following disagreeing addressee responses should be taken as the speaker’s rejection of the addressee’s negative feedback, and a continued effort on the part of the speaker to ground (and insist upon) the introduced label in the face of negative uptake. If four-year-olds are sensitive to these interactive grounding processes, they should show improved learning in the disagreeing (speaker reinforcing) condition. This hypothesized improvement would not stem from the backchannels themselves, but from an understanding of the interaction as a whole.
Finally, in Experiment 2 we added a measure of gaze direction to investigate children’s attention in the two conditions. This was done to document whether child overhearers focus on addressees during the course of an interaction involving backchannel responses, as a lack of gaze directed towards the addressee would support the speaker only hypothesis. Further, this measure could indicate whether agreeing versus disagreeing interactions draw distinct patterns of attention.
Method
Participants
Twenty-four four-year-olds were recruited from the same database as in Experiment 1. Twenty participants were included in the analyses (M = 52.5 months, SD = 3.5 months, range: 47–59 months, 10 male); the remaining four were excluded because of missing data (e.g., they did not respond to test questions). Thirteen participants were monolingual English speakers, four were bilingual in English and Spanish, and three were regularly exposed to one other language (Turkish, Hebrew, Japanese). The children were from Caucasian (15), mixed ethnicity (4), and Asian American (1) families. Their parents were primarily college educated or higher; six reported their highest education level was high school and two did not report their education level.
Design, materials, and procedure
The same materials, design, and procedure were used as in Experiment 1, with two modifications to the materials: (1) two new models were recorded interacting with the objects and presenting the labels and responses, and (2) label presentations and backchannels were interwoven as opposed to the speaker presenting the label three times prior to presenting each target object (see Figure 1(b)). In these videos, the speaker began each trial by saying, ‘Let’s see what’s in here,’ before removing the object. Upon retrieving the object, the speaker said, ‘Oh look, we found the [label],’ passed the object to the addressee and stated, ‘Look at the [label],’ and then took the object back from the addressee and stated, ‘We found the [label].’ The addressee provided either agreeing or disagreeing backchannels, either smiling and nodding or frowning and shaking her head, after each label presentation. By interweaving the label and response, speaker uptake of the addressee’s response was made available.
Gaze coding
Participants were recorded with the built-in camera on the computer on which the stimuli were presented. Gaze was coded for the labeling interaction, starting at the beginning of ‘Let’s see what’s in here’ and ending just before the start of the next object introduction (the following ‘Let’s see what’s in here’). Trained research assistants coded gaze direction and duration for four categories: participant looks to the right half of the screen were coded as looks to the addressee, looks to the left half of the screen were coded as looks to the speaker, other looks were coded as looks off, and segments in which the child’s eyes were closed or in which they had dipped their head below the camera were categorized as uncodeable. Given this coarse granularity, it was not possible to distinguish gaze directed towards the object from gaze directed towards the individual holding the object. Two independent coders blind to experimental conditions and hypotheses coded a subset of the data (20 trials). Inter-rater agreement for these trials was 90.25%, and so the rest of the data were coded individually.
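As an illustration of how an agreement figure of this kind can be obtained, the following is a minimal sketch. It assumes that both coders' judgments are compared over the same fixed time bins and that agreement is the simple percentage of matching bins; the text does not state how agreement was computed, and the function name and the example codes are hypothetical.

```python
from typing import Sequence

# The four gaze categories described in the coding scheme.
CATEGORIES = ("addressee", "speaker", "off", "uncodeable")


def percent_agreement(coder_a: Sequence[str], coder_b: Sequence[str]) -> float:
    """Percentage of time bins on which two coders assigned the same category.

    Assumption: both sequences cover the same trial, split into identical bins.
    """
    if len(coder_a) != len(coder_b):
        raise ValueError("Both coders must code the same number of bins.")
    if not coder_a:
        raise ValueError("At least one coded bin is required.")
    for code in list(coder_a) + list(coder_b):
        if code not in CATEGORIES:
            raise ValueError(f"Unknown gaze category: {code}")
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100.0 * matches / len(coder_a)


# Hypothetical codes for one trial split into eight bins (not data from the study).
example_a = ["speaker", "addressee", "addressee", "off",
             "addressee", "speaker", "uncodeable", "addressee"]
example_b = ["speaker", "addressee", "addressee", "off",
             "addressee", "addressee", "uncodeable", "addressee"]

print(percent_agreement(example_a, example_b))
```

Simple percent agreement does not correct for agreement expected by chance; a chance-corrected index such as Cohen's kappa would be a stricter alternative.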
Results
Above-chance performance (binomial p < .05, with chance = .44) required a minimum of 13 out of 20 children to select the target object on at least one of the comprehension trials. The number of children who selected the target object on at least one comprehension trial met or exceeded this threshold in both the agreeing (n = 13; 65%) and disagreeing (n = 18; 90%) conditions. In both conditions, the number of children who selected the target object on at least one preference trial did not exceed chance performance (see Table 2).
Table 2. Number of children (out of 20) who selected the target object in comprehension and preference trials never (0), once (1), or twice (2) in each condition of Experiment 2.
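The above-chance threshold can be checked with an exact binomial calculation. The sketch below assumes a one-tailed test (the probability of observing at least k successes) with the chance level of .44 and the sample of 20 children reported in the text; the function names are illustrative.

```python
from math import comb


def upper_tail(k: int, n: int, p: float) -> float:
    """Exact binomial probability of k or more successes in n trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def min_successes(n: int, p: float, alpha: float = 0.05):
    """Smallest k whose upper-tail probability falls below alpha, or None."""
    for k in range(n + 1):
        if upper_tail(k, n, p) < alpha:
            return k
    return None


threshold = min_successes(20, 0.44)
print(threshold, round(upper_tail(threshold, 20, 0.44), 3))
```

Under these assumptions the smallest qualifying count is 13, which matches the minimum stated in the text.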
Preliminary analyses on the comprehension trial data revealed no effects or interactions of trial order, question order, gender, age in months, or language exposure (monolingual or bilingual; all ps > .05), so these factors were not analyzed further. A 2 × 2 repeated measures analysis of variance examining the effects of question type (comprehension, preference) and backchannel type (interwoven agreeing, interwoven disagreeing) on children’s word learning revealed an effect of question type, F(1, 19) = 9.91, p < .01, ηp2 = .34, no effect of backchannel type, F(1, 19) = 0.43, p = .34, ηp2 = .02, and an interaction between backchannel type and question type, F(1, 19) = 7.95, p < .05, ηp2 = .29. Given the interaction, we tested for simple main effects of question type. In the interwoven agreeing backchannels condition there was no effect of question type, such that children selected the target object in response to comprehension questions, M = 0.9, SD = 0.79, and preference questions, M = 0.7, SD = 0.73, to a similar extent, MDiff = 0.20, t(19) = 0.74, p = .46, d = 0.26. However, in the interwoven disagreeing backchannels condition, there was an effect of question type, MDiff = 1.0, t(19) = 4.87, p < .001, d = 1.56, with children selecting the labeled object more often when asked the comprehension question (M = 1.4, SD = 0.68) than the preference question (M = 0.4, SD = 0.60).
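The effect sizes reported above can be related to the test statistics with two standard formulas. This is a sketch under stated assumptions: partial eta squared is recovered from F and its degrees of freedom, and Cohen's d is computed with the root mean square of the two condition standard deviations as the standardizer. The text does not say which formula for d was used; the reported values of 0.26 and 1.56 happen to be consistent with this choice. Function names are illustrative.

```python
from math import sqrt


def partial_eta_squared(f_value: float, df_effect: int, df_error: int) -> float:
    """Partial eta squared recovered from an F statistic and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


def cohens_d_avg_sd(mean_1: float, sd_1: float, mean_2: float, sd_2: float) -> float:
    """Cohen's d using the root mean square of the two SDs (an assumed standardizer)."""
    standardizer = sqrt((sd_1**2 + sd_2**2) / 2)
    return (mean_1 - mean_2) / standardizer


# Values taken from the analysis of variance reported in the text.
print(round(partial_eta_squared(9.91, 1, 19), 2))   # question type
print(round(partial_eta_squared(7.95, 1, 19), 2))   # interaction

# Comprehension versus preference means and SDs in each condition.
print(round(cohens_d_avg_sd(0.9, 0.79, 0.7, 0.73), 2))  # agreeing
print(round(cohens_d_avg_sd(1.4, 0.68, 0.4, 0.60), 2))  # disagreeing
```

Other standardizers for paired designs, such as the standard deviation of the difference scores, would give different values, so this sketch should be read as one plausible reconstruction.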
As in Experiment 1, we also conducted a more conservative analysis. After dropping all trials on which a participant selected the target object for both the comprehension and preference questions, we found that children learned words better in the interwoven disagreeing backchannels condition (M = 1.1, SD = 0.71) than in the interwoven agreeing backchannels condition (M = 0.70, SD = 0.65), MDiff = 0.4, paired t(19) = −2.37, p < .05, d = 0.58.
Gaze direction
Given variations in the length of the object labeling interactions (M = 16.50 s, SD = 0.95 s), introduced by small timing differences in speed of talk and exchange of the target object, we analyzed the proportion of total time in which gaze was directed towards the half of the screen with the addressee or the half of the screen with the speaker. These proportions were entered into a 2 × 2 repeated measures ANOVA with gaze recipient (speaker or addressee) and backchannel type (agreeing or disagreeing backchannels) as factors. We found an effect of gaze recipient, F(1, 19) = 35.48, p < .001, ηp2 = .65, but no effect of backchannel type (p = .70) and no interaction (p = .95). Participants spent more time directing gaze at the addressee (M = 40% of total interaction time, SD = 16%) than at the speaker (M = 12%, SD = 7%), MDiff = 27%, paired-sample t-test t(19) = 5.96, p < .001.
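The normalization step described above, converting raw looking times into proportions of total interaction time, can be sketched as follows. The segment format, the function name, and the example durations are hypothetical, and the sketch assumes that the denominator is the full interaction time, including looks off and uncodeable segments.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def gaze_proportions(segments: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Proportion of total interaction time spent in each gaze category.

    Each segment is a (category, duration_in_seconds) pair.
    """
    totals: Dict[str, float] = defaultdict(float)
    for category, duration in segments:
        if duration < 0:
            raise ValueError("Segment durations must be non-negative.")
        totals[category] += duration
    total_time = sum(totals.values())
    if total_time == 0:
        raise ValueError("Total coded time must be greater than zero.")
    return {category: time / total_time for category, time in totals.items()}


# Hypothetical coded segments for one 16.5 s trial (not data from the study).
trial = [("speaker", 2.0), ("addressee", 4.0), ("off", 6.0),
         ("addressee", 2.5), ("uncodeable", 2.0)]

for category, proportion in gaze_proportions(trial).items():
    print(f"{category}: {proportion:.2f}")
```

Working with proportions rather than raw seconds keeps trials of slightly different lengths comparable, which is the reason given in the text for this analysis.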
Discussion
Although comprehension tests revealed that children learned the target label in both conditions, the significant interaction between backchannel type and question type suggests children learned better in the disagreeing addressee (reinforcing speaker) condition than the agreeing addressee (continuing speaker) condition. Thus we reject the speaker only hypothesis that children would not be influenced by observed addressee backchannels when engaged in vicarious word learning.
Instead, the results support the collaborative grounding hypothesis. Speaker repetition of the label following disagreeing backchannels by the addressee solidified an object’s label in comparison to speaker repetition of the label following agreeing backchannels. This suggests that the role of addressee backchannels in vicarious word learning is not due to any inherent meaning of the cues, as with the previously rejected distinct sources hypothesis, but instead they are understood to play a role in shaping the back-and-forth collaborative dialogue. It is further unlikely that the surprise hypothesis could be revived in order to explain these results, as any attempt to do so would require an explanation of why disagreement is surprising when it is interwoven within the turn taking but not when it is presented following the labeling as in Experiment 1.
What was informative for the child overhearers was not the addressee response, but the addressee response in connection with the subsequent speaker talk. By repeating the label after witnessing the addressee frown and shake her head, the speaker is understood as attempting to override the addressee’s disagreement, emphasizing or reinforcing her confidence that the label was indeed correct. Given that the speaker is making the same contributions across the two different addressee response conditions – repetition of the label using the same script – we take these results as a demonstration of how addressee backchannels can change the way in which a speaker’s contributions are interpreted pragmatically by overhearers. It is the sequential interactional pattern, taken as a whole, which influences vicarious word learning. That is, we are not arguing that perceiving disagreement across two conversational partners boosts comprehension. Rather, the particular interactive patterns across our two conditions change the pragmatic interpretation of the continued use of the label. If the speaker had responded to the disagreement in some other way, without insisting upon a particular label, we believe that vicarious word learning would not have been enhanced in the disagreeing condition.
The gaze data provide further information about what children are doing when overhearing conversations. While children pay attention to the interactions of others even when engaged with a distractor toy (Akhtar, 2005), how that attention is allocated across speakers was unknown. Previous researchers measured the distribution of visual attention across observed interactions (Shneidman et al., 2009), but no previous researchers analyzed attention paid to speakers as compared to addressees, nor the possibility of variation in attention across different interactional patterns.
Participants spent a larger proportion of the labeling interactions with gaze directed towards the addressee, regardless of whether the addressee was providing agreeing or disagreeing responses. Increased gaze towards the addressee may have resulted from the addressee behaviors, which were nonverbal and therefore required visual attention. This finding emphasizes the importance of the addressee in vicarious word learning, suggesting, against the speaker only hypothesis, that addressee contributions play a role in vicarious word learning.
The gaze direction and duration data indicate that the increased learning in the reinforcing speaker condition was not driven by increased attention to either the speaker or the addressee, again emphasizing that it is the nature of the interaction as a whole that is driving the effect. The lack of gaze differences across conditions also suggests that the disagreeing backchannels were not inherently more interesting or surprising than the agreeing backchannels.
General discussion
Four-year-old children are sensitive to addressee backchannels when engaged in vicarious word learning. While backchannels did not influence word learning when they were perceived without subsequent response from the speaker (Experiment 1), they did influence learning when they were followed by speaker uptake (Experiment 2). This influence was not, however, based on the particular valence of the backchannels themselves, but on how these backchannels contextualized the subsequent contributions of the speaker. That is, children were sensitive to the collaborative processes found within dialogue, which require both an addressee response and speaker uptake of this response.
The collaborative grounding hypothesis within vicarious word learning mirrors results observed with adults engaged in overhearing. Across experiments that varied addressee backchannels to the same speaker talk, overhearers understood the speakers’ talk surrounding different backchannels differently (Tolins & Fox Tree, 2014, 2015, 2016). Along with the current findings, these effects make a case against any interpretation of dialogue comprehension as involving the comprehension of two distinct sources of information produced by the individual conversational partners. Instead, overhearers as young as four years of age understand talk as an interactional process.
The current experiments represent the first attempt to modify the typical script of the vicarious word-learning paradigm, both in terms of the organization of the interaction and the responses produced by the observed addressee. Across prior experiments (Akhtar, 2005; Akhtar et al., 2001; Floor & Akhtar, 2006), and in Experiment 1 of the current studies, addressee responses always followed multiple labelings of the target object. Similarly, in all prior studies, addressee responses were always positive, with the addressee smiling and nodding, appearing to agree with the speaker. Experiment 2 represents a departure from prior methods by highlighting the give-and-take of a conversation and the accretion of common ground. In varying the script of the observed interaction, the current studies expand our understanding of what exactly children are doing when they are engaged in vicarious learning.
The current work also contrasts with prior studies of children’s sensitivities to consensus when engaged in social learning. Children appear to use a number of different cues when learning from others, including reliability (Scofield & Behrend, 2008), familiarity (Corriveau & Harris, 2009), and accent (Corriveau, Kinzler, & Harris, 2013). When observing the actions of multiple individuals, children more readily learned words that appeared to be supported by a consensus (Bernard et al., 2015). They were also more likely to learn words from individuals who were part of a consensus previously, compared to speakers who were the odd ones out (Corriveau, Fusaro, & Harris, 2009). In contrast with this previous work, we demonstrate that agreement across two interactants, as a type of consensus, is not a requirement for learning through observation. Instead, when the particular sequential organization of the conversation supports a particular pragmatic interpretation – in the current case, the reinforcement of a label as an expression of speaker confidence – disagreement can support rather than hinder word learning. This suggests that sensitivity to the pragmatic and interactional patterns within conversation plays a role in observational learning.
The children in our studies did not simply place themselves in the role of the addressee, as though the speaker was producing the label for them (Herold & Akhtar, 2008). If this were the case, addressee responses would not have influenced word learning. Children also did not treat the speaker and addressee as two distinct sources of information. If this were the case, results would have been similar across experiments, with the particular valence of the backchannels displayed influencing comprehension. Instead, children attended to the interactive processes between the speaker and the addressee, changing how they learned words as a consequence. Such sensitivities are likely built on children’s prior experiences engaging in direct social interactions at an earlier age (Reddy, 2008), and are also likely influenced by the opportunities they have to engage in observational learning (Shneidman et al., 2009).
Direct comparison across Experiments 1 and 2 is problematic given that they involved different stimuli and participants. Inspection of the comprehension means, however, suggests that there may have been reduced learning in Experiment 2, as the children in the agreeing (continuing speaker) condition of Experiment 2 selected the correct target only 65% of the time, but children in the agreeing condition of Experiment 1 selected the correct target 90% of the time. While the target objects were the same across experiments, a number of differences could have led to a general reduction in comprehension. The interwoven nature of the script, with the addressee responses following each labeling, likely required a higher ability to switch attention across the observed interactants – an increase in cognitive load that may not have served comprehension. More mundanely, the actors in the videos were different, as the two undergraduate researchers who modeled the first experiment had graduated before the creation of the second. It is possible that these differences worked to reduce learning generally, although this does not negate the fact that even in this context the speaker-reinforcing condition boosted learning above the speaker-continuing condition.
Future studies can build on the current findings in numerous ways. Researchers could combine the two experiments into a single repeated-measures design that includes trials in which the addressee responses are interwoven with the labeling as well as trials in which they are produced only after it, which would both remove differences in effects of the models used and provide a stronger test of the direction of the effect. Researchers could also measure gaze direction more precisely, to compare looking time and gaze patterns across the conversation. And researchers could investigate other types of addressee contributions, such as verbal backchannels.
More broadly, researchers could test how socio-cultural factors affect learning from overhearing collaborative dialogue, such as cultural background (in a number of communities, children do not participate in many direct teaching interactions, but they closely observe and learn from third-party interactions; de Haan, 1999; Gaskins, 1999; Rogoff et al., 1993), family size (children with older siblings more easily learn personal pronouns, likely due to the increased frequency of these words in triadic contexts involving speech not directed to them; Oshima-Takane, Goodz, & Derevensky, 1996), and the degree to which a child engages in multi-party social interactions on a daily basis (which correlates with their ability to successfully distribute attentional resources that support observational learning; Shneidman et al., 2009).
Conclusion
Given the ubiquity of overhearing as a context of learning across development, understanding the processes involved in dialogue comprehension remains a critical endeavor. Following previous research that demonstrated the importance of observed interactivity in children’s vicarious word learning (O’Doherty et al., 2011), the current study extends our understanding of vicarious word learning as a form of dialogue comprehension. Rather than associating any presented label with a co-present object, children learn labels based on the pragmatics of the observed interaction. Further studies are needed to fully illuminate the developmental processes by which these sensitivities emerge.
Acknowledgements
Parts of this work were presented at the Ninth Biennial Meeting of the Cognitive Development Society; we thank the attendees for their informative feedback. We also thank the many research assistants who worked as models and helped in data collection, including Alina Crom, Kelly Ann Kelso, Naba Khan, and Margaux Schindler.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by faculty research funds granted by the University of California, Santa Cruz.
