Abstract
This article reviews research findings involving visual input in speech processing in the form of facial cues and co-speech gestures for second-language (L2) learners, and provides pedagogical implications for the teaching of listening and speaking. It traces the foundations of auditory–visual speech research and explores the role of a speaker’s facial cues in L2 perception training and of gestural cues in listening comprehension. There is a strong role for pedagogy to maximize the salience of multimodal cues for L2 learners. Visible articulatory gestures that precede the acoustic signal and the preparation phase of a hand gesture that precedes the acoustic onset of a word provide a priming effect on perceivers’ attention to signal upcoming information and facilitate processing, and visible gestures that co-occur with speech aid ongoing processing and comprehension. L2 learners benefit from an awareness of these visual cues and exposure to input.
Introduction
Speech is typically a multimodal phenomenon. Speakers employ facial and bodily gestures to complement and strengthen the meaning and force of their words and prosody (Bolinger, 1986; McNeill, 1992), and visible articulatory gestures such as lip, tongue, and jaw movements provide valuable information for identifying the consonants (Miller and Nicely, 1955) and vowels (Hagiwara, 1995) that a person is articulating and so are essential to lip-reading. Visual information from the head, face, hands, and arms can also aid comprehension when a listener is not able to fully interpret information from the audio channel, as may be the case in second-language (L2) comprehension.
The complex of facial expression, visible articulatory gestures, and bodily gestures operates in tandem with lexical and grammatical information and with both segmental and suprasegmental or prosodic aspects of pronunciation to produce a mutually reinforcing complex of cues to meaning. In addition, polyrhythmic sequences of vocal stress patterns, head movements, and co-occurring gestures serve as informational highlighters. The available articulatory and gestural cues in the visual channel moreover help to prepare listeners for speech by focusing their attention on the speaker just prior to receiving the acoustic signal. L2 learners can benefit from an awareness of these visual cues and exposure to input.
The value of seeing a speaker’s facial cues has become far more apparent today when face-to-face interaction has been replaced by mask-to-mask communication in many settings as one means of trying to reduce the spread of Covid-19. Face masks may muffle speech while also blocking sources of information that can be gleaned from watching a speaker’s mouth and surrounding area of the face to interpret a speaker’s words and affect. In addition, the tight connection and focusing effect of physical expression and gesturing to vocal information during speech have become more obvious in the current context of online communication, as facial expression often lags slightly behind the audio signal in video calls, and as hand and arm movements are restricted – because the tight focus of some small-screen video interactions both inhibits and obscures them in whole or in part.
As Shams and Seitz (2008) noted, ‘It is likely that the human brain has evolved to develop, learn and operate optimally in multisensory environments’ (p.411). Despite this assertion, research on the contribution of facial and gestural cues to speech processing for L2 learners has lagged behind research involving these cues in many areas, including: infant speech development (e.g. Dodd, 1977; Volterra and Erting, 2002); speech comprehension by hearing-impaired adults (e.g. Goldin-Meadow and Alibali, 2013; Walden et al., 1977) and non-hearing-impaired listeners (Clark and Paivio, 1991); and comprehension of conceptually difficult messages or accented speech (Reisberg et al., 1987) and speech in noise-added conditions (e.g. for French, see Benoît et al., 1994; for English, see Summerfield, 1979). This article reviews major research findings for L2 learners involving visual input in the form of facial cues and bodily gestures that co-occur with speech, with implications for the teaching of pronunciation. Although facial cues and co-speech gestures are hardly addressed in L2 pedagogy, research suggests that they are an important aspect of communication and so deserve attention in language teaching.
Foundations of L2 Auditory–Visual Speech Research
The focus of research on speech perception for many years was the auditory channel of input, and an awareness of the contribution of visual cues to speech intelligibility developed slowly. Sumby and Pollack (1954) discovered that speech intelligibility was enhanced by observation of the speaker, especially in noisy conditions. Their study, unlike most research conducted since that time, involved a live (vs recorded) speaker, who was positioned about five feet from the participants and produced isolated words. Similar conditions with live interlocutors and ambient noise often exist in daily face-to-face communication, where visual cues from a speaker’s face (e.g. lip gestures, brow raise) and co-speech gestures (especially the shaping and movement of the arms and hands) contribute to intelligibility.
An intriguing finding by Harry McGurk helped to spur research on the impact of the visual channel in speech perception. McGurk was reviewing recordings that a technician had prepared for a study on whether infants could differentiate the auditory and visual components of speech. In these recordings, a speaker’s auditory /bɑ/ had been dubbed onto a video recording showing her mouth and articulatory gesture (i.e. showing the shape and movement of lips, tongue, and jaw) for /ɡɑ/ (McGurk, 1988/1998). McGurk was surprised to experience a unique perceptual outcome from the auditory /bɑ/ presented concurrently with the visual information for articulation of /ɡɑ/: /dɑ/. In other words, he perceived a consonant that had the same manner of articulation and voicing as /b/ and /ɡ/ (voiced stop) but was physically midway between the position of /b/ (bilabial) and /ɡ/ (velar) – thus, alveolar. Then, in a study involving first-language (L1) speakers of British English, McGurk and MacDonald (1976) used similar stimuli with incongruent audio and visual (AV) cues, consisting of the English bilabial and velar voiced and voiceless stop consonants followed by a vowel (i.e. /bɑ, ɡɑ/ and /pɑ, kɑ/). For combinations of the visual velar /kɑ/ and auditory bilabial /pɑ/, 81 percent of the participants reported hearing /tɑ/ – the voiceless stop consonant that is essentially midway between the voiceless velar and bilabial stops. In contrast, the visual bilabial /pɑ/ paired with the auditory velar /kɑ/ resulted in 44 percent combination responses of /pɑkɑ/ and 37 percent /pɑ/. Since then, numerous studies have explored what came to be known as the ‘McGurk effect’ – a perceptual illusion in which incongruent AV cues result in a perception that does not match either cue but is rather a compromise between them or a combination of both. Experiments using incongruent AV information have served as a means of studying the relative contributions of the visual and auditory channels to speech perception (Welch and Warren, 1980).
Subsequent studies pointed to variability in the magnitude of the McGurk effect. Different perceptual responses were reported by MacDonald and McGurk (1978) than in their earlier study (McGurk and MacDonald, 1976). For example, in the later study, the combination of visual /kɑ/ and auditory /pɑ/ produced primarily correct /pɑ/ responses (70 percent) versus so-called fused /tɑ/ responses. In other research, visual influence on perception varied according to the salience of the articulatory gesture for a specific consonant in the context of a given vowel (Green et al., 1988). It was found that the lip gesture for a rounded vowel obscured the cues that distinguish a preceding consonant. Studies such as these determined that the contribution of information from a given modality, visual or auditory, to the perceptual outcome is subject to factors such as the adjacent vowel, the presence of noise, and the degree of discordance or conflict between the auditory and visual cues. The perception of discordance, in turn, is influenced by the perceivers’ ability to identify cues accurately, which is a challenge for some L2 learners.
To explore the McGurk effect in another language, Japanese listeners were presented with stimuli consisting of Japanese consonant–vowel (CV) syllables in noise-free and noise-added conditions (Sekiyama and Tohkura, 1991). Visual effects occurred for consonants which had shown less than 100 percent auditory-only intelligibility, and presentation in noise enhanced those effects. In a follow-up study, Japanese speakers listened to English syllables, and English speakers listened to Japanese syllables (Sekiyama and Tohkura, 1993). For the Japanese listeners who had studied English in the past, visual influence was inconsistent. For example, visual /ɡ/ had significant effects on perception when combined with an auditory bilabial consonant such as /p/, but identification accuracy of /r/ with visual input did not improve at all over the auditory-only level of 67 percent. The authors argued that perceptual processing by Japanese listeners was ‘vision-independent’ (Sekiyama and Tohkura, 1993: 441) compared to that of English listeners. In contrast, the McGurk effect did occur for speakers of Japanese, Spanish, and English with the use of synthesized AV stimuli using a /bɑ/–/dɑ/ continuum and an animated face (Massaro et al., 1993), indicating that specific experimental conditions affected the perceptual results.
The McGurk effect was also found for Japanese, Korean, Spanish, and Malay speakers learning L2 English in the US (Hardison, 1996). In the first experiment, when stimuli included American English (AE) CV syllables with a variety of consonants /p, f, w, r, t, k/ combined with /ɑ/ in both congruent and incongruent AV conditions, the Japanese and Korean learners’ identification accuracy of /f/ and /r/ improved compared to their audio-only scores. Visual nonlabials /t, k/, which were not problematic for these learners, contributed more to perception when combined with the more confusable auditory labial sounds such as /p, f/. For L1 speakers of English, the McGurk effect occurred only with stimuli in noisy conditions. However, in the second experiment, when stimuli involved only the stops /p, t, k/, responses from both the L1 and L2 English speakers showed significant visual effects.
Further evidence of L1 influence emerged in a study by Wang et al. (2009) using audio-only, video-only, and both congruent and incongruent AV presentation of English CV syllables to L1 Korean, Mandarin, and English speakers. Stimuli contained fricatives at different points of articulation: labiodental (/f, v/: nonexistent in Korean), interdental (/θ, ð/: nonexistent in Korean and Mandarin), and alveolar (/s, z/: present in all three L1s) combined with the vowels /i, ɑ, u/. Results indicated lower video-only identification accuracy of labiodentals for Korean speakers, and lower audio-only identification accuracy of interdentals for Korean and Mandarin speakers than for their performance in combined AV conditions. The results of these studies underscore the role of L1 influence and L2 exposure on the information value of the cues and on their integration in perception and production.
The relative weight given to auditory and visual cues in perception is also influenced by individual speaker characteristics and varies across perceivers (Hazan et al., 2010). CV syllables /ba/, /da/, and /ɡa/ as produced by Australian English and Mandarin speakers were presented to Australian English, British English, and Mandarin listeners in several stimulus conditions: audio-only, video-only, congruent and incongruent AV, and presented either in the clear, in noise, with visual blurring, or with combined AV degradations. Findings indicated that speech produced by an L2 speaker and the perceiver’s L1 background influenced the weighting of auditory and visual cues in perception.
Visual Cues in Perception Training
The McGurk effect findings for the Japanese and Korean learners of L2 English suggested a need to explore how visual cues could be enhanced through focused perception training to improve learners’ identification accuracy of AE /r/ and /l/ – two sounds which had dominated the L2 speech literature because of their acoustic variability and perceptual challenge for those learners based on their L1 phonology (Hardison, 2003).
Earlier comments by Japanese researchers had suggested the potential benefit of visual cues for L2 learners. With a particular focus on Japanese speakers’ challenges in identifying AE /r/ and /l/, Goto (1971) had related his own problems with these sounds, which he attributed to ‘the disadvantage of not being able to read the lips of the speaker’ (p.321). In addition, Hattori (1987) reported survey results from Japanese students who indicated they used facial cues more as a result of having lived in the US for at least two years due to both American custom and the need to ‘catch as much information as possible from their interlocutor’ (p.115). Building on the success of high-variability perception training (HVPT), which ‘emphasizes the use of multiple speakers and diverse phonetic contents to increase learners’ awareness and tolerance of variation’ (Pennington and Rogerson-Revell, 2019: 200), and which had improved the auditory perception of AE /r/ and /l/ for Japanese speakers (e.g. Lively et al., 1993), a series of AV training studies was conducted with Japanese and Korean intermediate-level L2 English learners in the US (Hardison, 2003).
The sources of variability found by Lively et al. (1993) to impact the auditory perception of /r/ and /l/ included the target sound’s position in a word and the use of multiple voices (vs just one voice) producing natural speech (vs synthesized speech). To these sources of variability, Hardison (2003) added two others: visual input that also captured the facial cues of multiple speakers and the vocalic context of /r/ and /l/, in recognition of the fact that, for example, the high vowels /i/ and /u/ generally constitute more difficult contexts for perceptual accuracy compared to lower unrounded vowels (Hagiwara, 1995). Following three weeks of HVPT, results for the Japanese speakers revealed significant effects of training type (greater improvement with AV vs auditory-only input), speaker’s face and voice, word position, and adjacent vowel. With AV input, visual cues contributed strongly to perception in the most challenging phonetic environment for the Japanese speakers, word-initial clusters, and they became more informative overall following training. Stimuli with a relatively open vowel (/ɑ/ or /aɪ/) improved earlier in the training process than those with a rounded vowel (/u/ or /o/). In contrast to the pattern of results for the Japanese, identification accuracy scores for the Korean speakers were higher for /r/ and /l/ in word-final position, especially following /i/. Perceptual accuracy for /r/ and /l/ in intervocalic position improved for both L1 groups. Similar to the results of Lively et al. (1993), in the Hardison (2003) study, improved identification accuracy generalized to novel stimuli and those produced by a new speaker. These skills also transferred to significant production improvement even in the absence of explicit production training, suggesting a link between perception and production. The contexts in which production improved the most were similar to those for perception.
This research has demonstrated that seeing lip gestures provides a significant advantage for L2 learners’ perceptual identification and pronunciation accuracy. This area of research also demonstrated the context- and talker-dependent nature of speech processing, supporting the view that the details and sources of variability in articulation and the context of speech are not lost or ignored in processing but are encoded in long-term memory traces, consistent with episodic views of learning as involving ‘large clusters of remembered episodes of individual experiences’ (Välimaa-Blum, 2009: 2). The elements that comprise these perceptual representations depend on the attention given to the auditory and visual attributes of the stimulus that are relevant to the task. In training studies, the use of multiple exemplars with immediate feedback enhances the learning process by promoting awareness of within-category similarities and between-category distinctions across contexts and speakers (see Hardison, 2012, for review).
Priming Effects of Visual Cues in Word Identification
Further research investigated whether the benefits of perceptual training extended to the accurate identification of words in connected speech (Hardison, 2018a). In this research, successively increasing amounts of a speech stimulus were presented until it was correctly identified. Results indicated that seeing a speaker’s face facilitated word identification by both L1 English speakers and L2 learners of English (L1 Japanese and Korean) compared to the audio-only presentation of words in isolation and in sentence contexts. The AV benefit in speech processing may stem, in part, from the natural temporal precedence of visual speech cues over the associated acoustic cue (Munhall and Tohkura, 1998). In the word identification process, the temporal precedence of a speaker’s articulatory gesture over the associated acoustic cue means that the visual cue is dominant in the early stage of determining what word is being spoken (Skipper et al., 2007).
Co-Speech Gesture
Expanding on studies focused on facial cues in spoken language processing, Sueyoshi and Hardison (2005) compared the contributions of a speaker’s facial cues (e.g. lip movements) and co-speech hand gestures (e.g. the hands forming a cup or sphere as if holding or containing something) to L2 English listening comprehension when participants were presented with a lecture on an unfamiliar topic. For lower-proficiency learners, seeing the lecturer’s face and hand gestures while listening to the lecture produced the highest comprehension scores. At a more advanced proficiency level, seeing only the face, which provided pronunciation cues, while listening produced the highest scores. The lowest scores for both proficiency levels were found in the audio-only condition. Questionnaire responses revealed very positive attitudes on the part of lower- and higher-proficiency learners toward both sources of visual cues.
Two training studies led by Hirata (Hirata and Kelly, 2010; Hirata et al., 2014) tested the contribution of gestures constructed to model vowel length and rhythm. The first of these studies (Hirata and Kelly, 2010) involved L1 speakers of English who had no prior exposure to Japanese and who were trained in one of four conditions: audio, audio + mouth, audio + hand (the talker’s face was obscured but the hand was visible), and audio + mouth + hand (all movements were visible). Participants in the third and fourth groups were told that short vertical and long horizontal hand movements corresponded to short and long Japanese vowels, respectively. Although all training groups improved, only the audio + mouth condition was significantly better than the audio-only, in contrast to the findings of Sueyoshi and Hardison (2005). In the Hirata et al. (2014) study, L1 English speakers with no knowledge of Japanese were again trained on Japanese vowel duration contrasts, with a focus on the potential benefit to perceptual accuracy of seeing speakers’ hand movements accompanying syllable or mora rhythm. Training involved auditory input with gestures designed to represent either syllable or mora rhythm, which participants would either observe or produce together with the instructor. Results indicated improvement in auditory perception of vowel length for all training types; however, observing the gesture that was associated with syllable rhythm (closer to participants’ L1) produced the most improvement in their ability to determine vowel length.
The relationship between Japanese vowel duration and co-speech gesture was the focus of a recent two-part study (Hardison, 2019). The first experiment, conducted in Japan, investigated the temporal coordination of naturally occurring gestures (e.g. head nods) by three L1 Japanese-speaking classroom teachers of beginning-level Japanese, in conjunction with two speech components: pitch movement and vowel duration. A significantly greater number of head nods occurred with a long (vs short) vowel for all teachers. The apex (most extended position) of the head movement coincided with the peak of the pitch and amplitude of the syllable containing the long vowel. The second experiment, conducted in the US, explored the influence of these gestures on the perceptual accuracy of vowel duration by second-year L2 learners of Japanese (L1 English). Accuracy was greatest when both head movement and facial cues were present.
In addition to research on segmental length contrasts, Morett and Chung (2015) investigated whether hand gestures could facilitate English speakers’ ability to discriminate between Mandarin words differing only in lexical tone. Participants with no prior knowledge of Mandarin were assigned to one of three learning conditions involving the video of a Mandarin speaker whose gesture use varied as follows: pitch gesture (motion conveyed pitch contour), semantic gesture (motion conveyed word meaning), and no gesture. Participants viewed a series of video clips, and following each, they repeated the word the speaker said and its English translation while re-enacting any gesture they saw. Tone identification accuracy increased significantly in both the pitch gesture and no gesture conditions, but not in the semantic gesture condition. It is notable that whereas a pitch gesture has a relatively direct, iconic interpretation in terms of pitch height or contour shape, a semantic gesture indicating word meaning may be a more metaphorical or abstract, and often also a more culture-specific, way of representing meaning. This group of studies with both constructed and authentic co-speech gestures reinforces the fact that gestures linked to prosodic features such as duration and pitch contours can provide valuable clues to these features of pronunciation for L2 speakers. Speakers manage the multiple channels of communication to be mutually reinforcing and temporally coordinated, as their physically prominent gestures co-occur with prosodically prominent features of stress and pitch, to achieve a high degree of self-synchrony in speech (Condon, 1982).
Consideration of visual cues in speech communication has further explored the relationship between hand gestures, head movement, brow raise, and prosodically prominent features (e.g. Bolinger, 1986; Kendon, 1972; McNeill, 1992; Tuite, 1993). In a recent study, annotations from Praat (Boersma and Weenink, 2014), a phonetic analysis tool, and Anvil (Kipp, 2001), a video annotation tool, were combined to provide a time-aligned display of visual (gestural) and acoustic beats in the natural speech of university instructors (both L1 and L2 speakers of English) (Hardison, 2018b). A frame-by-frame analysis revealed several points of temporal convergence with pitch-accented vowels, including maximum brow raise and upright head position. Analysis also revealed that the temporal intervals between gestural beats were longer when they occurred with the pitch-accented vowels produced in key information than in the rest of the discourse.
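The kind of time alignment described above can be illustrated with a minimal sketch. It is not the procedure used in the study cited: it assumes that the annotations have already been exported as times in seconds, and every name and value in it (Interval, converging_beats, inter_beat_intervals, the sample times) is hypothetical. It pairs each gestural apex with the pitch-accented vowel interval it falls within and computes the intervals between successive gestural beats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A labeled time span in seconds, e.g. one pitch-accented vowel."""
    start: float
    end: float
    label: str


def converging_beats(gesture_apexes, accented_vowels, tolerance=0.0):
    """Pair each gesture apex time with the accented vowel it falls within.

    `tolerance` (seconds) widens each vowel interval on both sides.
    Apexes that fall within no vowel interval are left out.
    """
    pairs = []
    for apex in gesture_apexes:
        for vowel in accented_vowels:
            if vowel.start - tolerance <= apex <= vowel.end + tolerance:
                pairs.append((apex, vowel.label))
                break  # one vowel per apex is enough for this sketch
    return pairs


def inter_beat_intervals(times):
    """Return the gaps between consecutive beat times, rounded to 1 ms."""
    ordered = sorted(times)
    return [round(later - earlier, 3) for earlier, later in zip(ordered, ordered[1:])]


# Hypothetical values for illustration only, not data from any study.
vowels = [Interval(0.40, 0.55, "vowel_1"), Interval(1.60, 1.78, "vowel_2")]
apexes = [0.48, 1.02, 1.65]

print(converging_beats(apexes, vowels))
print(inter_beat_intervals(apexes))
```

With real annotations, the two lists would be read from the exported annotation files, and the tolerance would be chosen to reflect the video frame rate.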
The Differential Value of Visual Cues
Although research has pointed to the benefits of visual cues from a speaker’s face, it has also raised the question of whether cultural differences or bias might influence speech processing. One study investigated the effect of seeing a speaker’s face on L1 English listeners’ ability to identify keywords in English sentences produced by two L1 English and two L2 English (L1 Korean) speakers in noisy conditions (Yi et al., 2013). Keywords were more correctly identified in the AV versus audio-only condition, consistent with other research; yet visual cues enhanced the intelligibility of the speech produced by the L1 English speakers more than the speech produced by the Korean speakers. The variability of articulatory gestures produced by L1 and L2 English speakers might account for this finding; however, in a separate accentedness rating task, the Korean speakers were rated as more strongly accented in the AV versus audio-only condition. Yi et al. (2013) suggested that the findings represented a possible bias, which could have implications for the efficacy of the integration of AV cues in speech processing in some contexts.
Cultural differences may also influence the focus of attention and information processing in face-to-face communication. Taking a neurocognitive perspective, researchers presented English and Japanese speakers with the stimuli /bɑ/ and /ɡɑ/ recorded in their respective L1s for a syllable-identification task in the following conditions: AV and audio-only to collect event-related brain potentials (ERP) data, and AV for eye-tracking data (Hisanaga et al., 2016). The ERP data revealed that English speakers processed AV speech more efficiently than audio-only speech, whereas Japanese speakers showed the opposite pattern. Eye-tracking data revealed a gaze bias to the mouth for English speakers, especially before the acoustic onset, but not for the Japanese. Hisanaga et al. argued that the findings were consistent with the influence of linguistic and cultural background on eye-gaze behavior and speech processing, and compatible with the earlier vision-independent hypothesis proposed for Japanese perceivers (Sekiyama and Tohkura, 1993). However, the stimuli in the Hisanaga et al. (2016) study were presented in the perceivers’ L1, and the findings are in contrast to those of studies showing that Japanese learners of L2 English attended to the speaker’s mouth when experiencing the McGurk effect (Hardison, 1996) and during perceptual accuracy training and word identification (e.g. Hardison, 2003, 2018a).
These studies might suggest that perceivers strategically shift attention and eye gaze to the perceptual details that are relevant for a given task, as Hattori (1987) had suggested. Such a pattern was found for L1 English speakers’ eye-tracking data while looking at speakers’ faces producing English and L2 French words in an identification task under three conditions: AV, AV with noise added, and video-only (Hardison and Inceoglu, 2019). For both languages, fixation durations on the mouth (i.e. the information source) increased and those on the eyes decreased with noise or no audio. Fixations were often made to the nose (a central position on the face) with shifts to other areas, especially the mouth, when there was the slightest visible articulation-related movement. Such attentional shifts in processing are beneficial to perceivers but might require language learners to set aside the more comfortable practices of their L1 culture in order to maximize information in face-to-face communication with L2 speakers.
Some gestures are relatively universal, though, as noted by Pennington (1989: 31), ‘[t]here are significant cultural differences in gestural complexes and their application during interaction’. Consequently, even though there will be some positive transfer from the first language and culture, the specific components and multimodal complexes of prosody, articulation, facial expression, bodily gestures, and lexicogrammatical structures, and how these are deployed in specific communicational situations, must be learned when acquiring L2 speech (Pennington, forthcoming).
Pedagogical Recommendations
As the foregoing review makes clear, the visual channel and its many physical embodiments contribute significantly to meaning and communicative impact, suggesting that there is a strong role for pedagogy to maximize the salience of multimodal cues for L2 learners, in both perception and production. The knowledge base of language teachers should, therefore, include an understanding of how speech sounds are produced and the visible features that might accompany segmental and prosodic aspects of speech, so they can provide information to learners on the facial cues and co-speech gestures that learners will encounter in the context of communication (Hardison, 2014). In addition, as the authors explored in video recordings of lectures by university professors who had won teaching excellence awards, good teachers make use of the multiple resources and channels available to a speaker for creating meaning and impact, and coordinate these in presenting information to students. It can, therefore, be of value for language teachers to raise their awareness of the visual channel and its many physical embodiments, what these contribute to meaning and communicative impact, and how they can be managed to maximize their own and their students’ communicative effectiveness.
From the perspective of segmental-level training, increasing learner proficiency in identifying L2 contrasts in one modality, visual or auditory, is linked to an increased proficiency in the other (Hazan et al., 2006). Work on visible articulatory gestures can, therefore, be recommended as a way to improve learners’ auditory perception of L2 sound contrasts. A sequence of activities for getting students to focus on lip shapes is provided in Pennington (1996: 124–127). It is suggested that the teacher begin by modeling the lip shapes of vowels using nonsense syllables and interjections:
open and neutral or loose, uh;
wide open, ah;
rounded and protruded, oh;
strongly rounded and protruded, ooh;
spread, eek.
The teacher next contrastively models one-syllable words that offer examples of spread and rounded lip shapes, for both vowels and consonants (e.g. Al vs all, tea vs too, vine vs wine, led vs read). The teacher then dictates a series of one-syllable words, and the students are to place them in the correct column according to whether an underlined sound in each word has spread, rounded, or neutral lips. The final activity in the series is that of ‘silent dictation’, in which the teacher and later individual students working with a partner mouth words silently for dictation (see Pennington, 1996: 124–125, for the word lists), as a way ‘to focus attention on the visible settings and movements which produce the sounds of a language’ (Pennington, 1989: 30). Using this series of techniques as a model, similar activities can be devised for helping students focus on the visual dimension of phoneme contrasts such as for stop consonants in initial, medial, and final positions; /r/ and /l/ in all positions; and the tense and lax vowel pairs /i, ɪ/, /eɪ, ε/, /æ, a/, and /u, ʊ/. It can also be of value to focus on the visible articulatory differences between the L2 and the learners’ L1.
The technique of ‘mouthing’ silently can be further expanded to apply to longer stretches of speech, such as a dialogue that students memorize and then perform silently in front of the class (Davis and Rinvolucri, 1990, cited in Celce-Murcia et al., 2010: 342). Students can be put in pairs facing each other, one doing silent dictation or mouthing of longer material, and the other imitating the partner’s mouth gestures in real time, as it were ‘shadowing’ them. This activity can be expanded into a game in which the ‘shadowing’ student tries to guess the words that the partner is mouthing or what a longer stretch of speech is about. The same pair technique might be used for one student to imitate the gestures of the other, or the combined silent mouthing and acting out of a dialogue through gestures and movement (for other imitative games focused on the gestures that accompany speech, see Pennycook, 1985: 275).
The close linkage and cue value of gestures to prosodic features such as vowel length, stress, and pitch contours, which signal important distinctions in word meaning, information structure, and utterance pragmatics, can usefully be included in instruction that is focused on pronunciation and overall communicative performance. Acton (1984) proposed some time ago that gestures and body movements should be taught together with prosodic features since they are all coordinated in communication, a point echoed in our own work (e.g. Hardison, 2018b; Pennington, 1989, 2019, forthcoming; Sueyoshi and Hardison, 2005). Two approaches to pronunciation that incorporate co-speech gestures are the ‘Dramatic Imitative Approach Using Video Clips’, which Celce-Murcia et al. (2010: 489–490) describe in their Appendix 21, and the ‘Mirroring Project’ technique described by Tarone and Meyers (2018) and LaScotte et al. (2021). In the former, students view an authentic video clip first without sound and second with sound to observe, analyze, and notate the prosody and the facial expressions and gestures on a transcript, and to consider how these affect meaning. Students then practice and later perform the original discourse and also a similar role play in pairs, videotaping their performance in both for post-task self-evaluation and teacher evaluation of their pronunciation. In the Mirroring Project, whose rationale and steps are described in this issue (LaScotte et al., 2021), the L2 learner selects an L1 or L2 speaker as model and then analyzes a video segment of the person’s speech, focusing on prosodic features and the accompanying gestures in detail, as a way to learn to imitate them as closely as possible. In the best case, the learner is able to ‘channel’ the model speaker’s voice and gestures very closely, with carry-over to free speech.
Much as the auditory speech signal exhibits considerable variability across speakers, speech rates, and styles, so too does the visual modality, as some speakers have highly animated facial expressions, gestures, or both, while others are less physically expressive or demonstrative. Moreover, since speakers from different cultures and L1 backgrounds may differ substantially in both the quality and the quantity of their facial expressiveness and gesturing during speech, L2 speakers can benefit from instruction on the facial and gestural behaviors that occur in combination with speech in the L2 (Pennington, 1989: 31). It is therefore advisable for teachers to seek out a range of video materials which show social interactions between members of the L2 culture and different cultures and which illustrate the bodily gestures and facial cues that accompany the speech of a variety of speakers, including those from the same L1 background(s) as the students. These video recordings can be used to analyze, compare, and contrast the articulatory and other gestures that accompany speech and as a basis for instruction, such as the imitative and mirroring activities described above, to help learners acquire multimodal communicative competence for spoken language.
Pennington (forthcoming) offers some suggestions for ‘top-down’ approaches which start with the sort of general iconic and universal meanings that prosody in combination with facial cues and gestures might convey, such as rhythm, emphasis, or positive and negative emotion, and then progress to those meanings that are specific and contrasting in the communicational modalities of different languages and cultures. She also suggests teaching students to take a ‘strategic’ approach to difficulties in communication by use of pronunciation and gestural features that convey empathy – such as varied pitch, smiling, and eye contact – and help to engender feelings of solidarity and a sense of connection in the audience and so to establish a positive basis for communication. In addition, Pennington (forthcoming) recommends incorporating the whole gestural complex in pronunciation instruction at all levels of proficiency as well as additional specific instruction on the facial cues to pronunciation for advanced learners, who, according to the research of Sueyoshi and Hardison (2005), seem to benefit especially from this narrowly focused connection of auditory and visual perception.
Conclusion
As the world struggles to contain the spread of Covid-19 through the use of mitigation measures such as the wearing of face masks and the use of video for contact at a distance, our awareness of the linguistic and cultural value of facial and gestural information has been heightened. While video conferencing tools eliminate the need for a mask, AV synchrony is often a challenge that can interfere with the processing of speech, and video communication often reduces the amount of information conveyed in gestures. Based on the preponderance of evidence that visual cues facilitate oral communication for L2 learners, it is easy to add them to the list of populations who are disadvantaged when those cues are not available.
Although it is not possible for language teachers to control the loss of visual information when face masks or video communication are used, nor, in general, when co-present communication is restricted, they can, nonetheless, help prepare themselves and their students for comprehending and producing speech as a multimodal communicational event. In so doing, they will be treating language with a recognition of speaking as a ‘full-context, full-body experience’ (Pennington, forthcoming). For pronunciation teachers in particular, connecting articulation and prosody to kinesic and visual sources of input in the mouth, the head and face, the arms and hands, and the body more generally can make learning more interesting, as it also makes it more communicatively real. Tying pronunciation to this larger communicative context can thus give a new rationale and direction for the teaching of pronunciation, and breathe new life into an area of the language curriculum that has so often been sidelined as boring or unimportant.
