Abstract
Computer-assisted pronunciation teaching (CAPT) for second-language learners has made great strides in recent years. While once small in number and confined to desktop computers, a great variety of CAPT resources are available online today. These resources provide instructional activities of various types, utilize different sensory modalities, and give different types of feedback. After discussing the many, long-promised, potential benefits of CAPT, this article outlines the basic principles of seven categories of CAPT resources and provides examples of each category.
Introduction
A decade ago, Derwing spoke of ‘Utopian Goals’ for the field of second-language (L2) pronunciation teaching. Among those goals was ‘more development of easy-to-use and useful software’ (Derwing, 2010: 30). Since then, huge strides have been made in the development of computer-assisted pronunciation teaching (CAPT). Early forms of CAPT operated on integrated hardware-software workstations and desktop computers. Today, however, ever-expanding numbers of teachers and learners depend on websites or mobile apps to assist them in improving their L2 skills—including pronunciation. Accordingly, the focus of this article is restricted to online pronunciation teaching/learning resources that are accessible by means of personal computers, tablets, or smartphones. These online resources vary considerably in content, instructional approach, and quality. Some come much closer to technological and pedagogical ideals than others.
CAPT Advantages and Ideals
Early promoters of CAPT predicted that it would be beneficial to L2 learners for various reasons (Levis, 2007): the provision of individualized instruction in an environment that is private and stress-free, the availability of virtually unlimited L2 input, learning activities that involve both aural discrimination and focused repetition, learners’ ability to practise at their own rate, visual support, and instantaneous and individualized feedback based on automatic speech recognition (ASR). Additionally, current aspirations for ideal CAPT software include the following:
attention not only on segmentals (i.e. consonant and vowel sounds) but also on suprasegmentals (i.e. stress, intonation, and rhythm), particularly those features that most affect intelligibility (i.e. that carry the greatest functional load);
multiple-speaker exemplars (female, male, old, young, various regional and social dialects);
content and practice materials that involve realistic language presented in full sentences (or longer stretches of discourse);
adaptability and flexibility that allows learners to follow a variety of learning paths based on their individual pronunciation challenges and learning styles;
activities that build intrinsic motivation because they are fun, interesting, encouraging, or game-like.
Unfortunately, critics have noted that most CAPT programs show minimal understanding of phonology, acceptable variation in pronunciation, or the proper application of this knowledge in instructional contexts (Pennington, 1999: 434). For instance, a critical factor to consider in CAPT software is the sort of feedback it provides: targeted or merely binary. Binary feedback tells learners only whether their pronunciation is correct or incorrect. If it is wrong, the software merely tells them to try again. Targeted feedback, on the other hand, supplies information about learners’ pronunciation that is both specific and actionable. Based on ideas from the focus-on-form school of L2 acquisition, the noticing hypothesis, and explicit learning, targeted feedback tells learners precisely where their pronunciation went wrong and what they can do to improve it. Targeted feedback has been shown to lead to noteworthy L2-pronunciation improvements in a comparatively brief period. Nevertheless, targeted feedback is rather uncommon in current CAPT software. For example, Bajorek (2017) evaluated four popular online language software programs and found that three of them provided only binary feedback, and the fourth provided no feedback at all.
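The contrast between binary and targeted feedback can be made concrete with a minimal Python sketch. It is an illustration only, not a description of how any program reviewed here works. It assumes that the target and the learner's production are already available as aligned phoneme lists of equal length (a real system would need speech recognition and alignment to get that far), and the function names and the `ARTICULATION_TIPS` table are hypothetical.

```python
# Hypothetical illustration: the phoneme lists and tips are invented for this sketch.
ARTICULATION_TIPS = {
    ("ɪ", "iː"): "Relax the tongue and shorten the vowel.",
    ("θ", "t"): "Place the tongue tip between the teeth and let air flow.",
}


def binary_feedback(target, produced):
    """Report only whether the whole attempt matched the target."""
    return "Correct" if target == produced else "Incorrect. Try again."


def targeted_feedback(target, produced):
    """Report where the attempt diverged and what to change.

    Assumes both sequences are already aligned and of equal length.
    """
    messages = []
    for position, (expected, actual) in enumerate(zip(target, produced), start=1):
        if expected != actual:
            tip = ARTICULATION_TIPS.get((expected, actual), "Listen to the model again.")
            messages.append(
                f"Sound {position}: expected /{expected}/, heard /{actual}/. {tip}"
            )
    return messages or ["Correct"]


target = ["ʃ", "ɪ", "p"]      # 'ship'
produced = ["ʃ", "iː", "p"]   # the learner's attempt sounded like 'sheep'

print(binary_feedback(target, produced))
for line in targeted_feedback(target, produced):
    print(line)
```

The binary function discards exactly the information (which sound, and what to do about it) that the targeted function passes on to the learner.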
CAPT Resource Types and Realities
The description of online CAPT resources that follows will be structured based on the sensory modality each resource uses for explaining and modelling pronunciation. Another evaluative factor is the quality (binary or targeted) and quantity of feedback provided by the instructional program. In connection with these characteristics, the teaching and learning activities that each online pronunciation-improvement tool employs will also be considered.
Text and Audio Only
A fundamental tenet of Flege’s Speech Learning Model is ‘that many L2 production errors have a perceptual basis’ (1995: 238). In other words, learners’ skill in correctly producing L2 sounds is connected to their competence in correctly perceiving those same sounds. Based on this concept, the simplest type of CAPT software provides extensive audio input and listening discrimination exercises, in the expectation that these perception exercises will lead to more accurate output.
Thus, with some online CAPT products, learners simply listen to authentic, natural speech models. These speech samples may be downloaded from the Internet (e.g. TED talks) or captured by the teacher using digital recording programs built into most smartphones, tablets, and laptops. If instructors desire to modify the speech that L2 learners listen to, they may use audio-editing programs that not only record and play back but also allow modifications of various sorts (e.g. variation in playback speed or adjustments in pitch) that make target-language comprehension either easier or more challenging.
Some websites and mobile apps created to help learners improve their L2 pronunciation (and related language skills) are built on the premise that accurate perception precedes correct production—for example, Mango Languages (http://mangolanguages.com). The website for English Accent Coach (https://www.englishaccentcoach.com/index.aspx) proclaims that it ‘works because it trains the brain to recognize new sounds—an essential foundation for improved pronunciation’.
A few CAPT programs that rely on a listening-only approach offer an advantageous feature called High Variability Pronunciation Training (HVPT). To illustrate, English Accent Coach plays slightly different audio versions voiced by various people whose pronunciations vary in natural ways (e.g. some nasalize their vowels or use an onglide). Research (Wang and Munro, 2004) has shown that as advanced-level L2 learners listen to words spoken naturally by multiple speakers with the crucial sounds in varying word positions, the learners’ abilities to differentiate between and comprehend (and eventually pronounce) L2 vowels improve.
Listen and Repeat
A listen-and-repeat instructional model underlies many websites and mobile apps intended to improve learners’ L2 pronunciation. The software plays a recorded word, phrase, or sentence. Then the learner tries to repeat it accurately. These model phrases or sentences are sometimes ‘loaded’ with the target sound (e.g. ‘Why was Woodrow Wilson thought of as a wise and wonderful world leader?’ or ‘That lather is more soothing than this lather’). Regrettably, many such applications do not provide feedback on how accurate the learners’ repetition is, leaving them to wonder whether and in what areas they need to improve. Kaiser (2017) examined 30 iPhone apps marketed for L2 pronunciation improvement, and discovered that 73.3% (22) relied heavily on this listen-and-repeat instructional model but did not provide accuracy feedback to learners.
The Listen and Repeat Machine portion of the English Pronunciation Practice section of ManyThings.org (http://www.manythings.org/e/pronunciation.html) provides an example of this instructional approach. So does the Pronunciator website and mobile app at http://www.pronunciator.com.
Using a slightly different approach, learners listening to an audio model may also practise using tracking or shadowing (repeating along with or immediately after the speaker).
Recording the learner’s voice repeating the model phrase and then playing it back for comparison with the model is a feature offered by some CAPT programs. For instance, at http://www.shiporsheep.com, students may record themselves and compare their pronunciation with the original recordings. Mango Languages (http://mangolanguages.com) also offers such a voice comparison option. Unfortunately, L2 learners who cannot accurately perceive the target language sounds may not benefit from such comparisons because they are unable to tell how closely their pronunciation approaches the model.
Listening Discrimination (Minimal Pairs)
L2 acquisition researchers and L2 pronunciation experts generally agree that an elemental step in mastering the different sound contrasts in a foreign language is the development in learners’ minds of new phonetic categories. In general, learners need to be able to perceive the difference between sounds in these new categories before they can produce them correctly. For example, before they can produce the medial vowels in ship and sheep correctly, English language learners must learn to distinguish [
Listening discrimination exercises are designed to help L2 learners develop the ability to hear the difference between sounds that are phonetically similar but phonemically distinctive in the new language. The definitive listening-discrimination exercise employs minimal pairs (two words that are exactly the same phonetically, with the exception of one phoneme, e.g. rock/lock) in different sorts of contexts (single-word or full-sentence). Sadly, minimal-pair sentences are often contrived and artificial (e.g. He ate it with veal/zeal, or It was a live/lithe tree). Also, in minimal-pair activities, student responses (i.e. rejoinders) are often merely mechanical in nature (‘Number one’ or ‘Number two’) rather than being meaningful.
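The definition of a minimal pair given above can be expressed as a short check. The following Python sketch assumes that each word has already been transcribed by hand as a list of phonemes; the function name and the broad transcriptions are hypothetical and serve only as an illustration.

```python
def is_minimal_pair(word_a, word_b):
    """Return True if two phoneme sequences differ in exactly one position."""
    if len(word_a) != len(word_b):
        return False
    differences = sum(1 for a, b in zip(word_a, word_b) if a != b)
    return differences == 1


# Hypothetical broad transcriptions, written by hand for this sketch.
rock = ["r", "ɑ", "k"]
lock = ["l", "ɑ", "k"]
ship = ["ʃ", "ɪ", "p"]

print(is_minimal_pair(rock, lock))  # the two words differ only in the first phoneme
print(is_minimal_pair(rock, ship))  # the two words differ in every phoneme
```

Note that the comparison operates on phonemes rather than spelling, so pairs such as rock/lock and ship/sheep qualify regardless of how many letters differ.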
Websites that use minimal pairs heavily can be found at http://www.shiporsheep.com, as well as in the Minimal Pair Practice and Quizzes section at http://www.manythings.org/e/pronunciation.html. Some programs take a slightly different approach and use sentences that both contain and contrast the members of a minimal pair (e.g. He said thanks for the tanks). In addition, some minimal-pair-based CAPT programs (e.g. www.pronunciationmatters.com) use story contexts that are longer than a single sentence and employ meaningful rejoinders. Sadly, regardless of the size of the linguistic context and the type of rejoinder used, minimal-pair practice activities typically provide only binary feedback. The only information that learners receive after attempting to distinguish between the two members of a pair is ‘Correct’ or ‘Incorrect’. After L2 learners choose the wrong rejoinder, they often do not know what they did incorrectly or how to improve their listening discrimination skills.
Visual: Articulatory Displays
For many decades, sagittal section diagrams of the human vocal apparatus and still or video pictures of a speaker’s mouth and lip movements have been utilized by professors of articulatory phonetics and L2 pronunciation teachers. Similarly, computer-based animations and videos in CAPT can display the positions of vocal articulators and their movements in the process of producing speech sounds.
One website and mobile app that does this very well is Sounds of Speech (http://soundsofspeech.uiowa.edu/index.html), which shows, models, and explains the articulations of vowels and consonants in English, Spanish, and German. The IPA Phonetics app for iPhone and iPad (https://www.uvic.ca/humanities/linguistics/resources/software/ipaphonetics/index.php) goes beyond simple line drawings and uses laryngeal ultrasound videos (in three different orientations) to show how various sound segments are articulated. The SPAN (Speech Production and Articulation Knowledge) Group website (https://sail.usc.edu/span/rtmri_ipa/index.html) goes even further and presents real-time magnetic resonance imaging videos of five different individuals’ vocal tracts while they speak the sounds corresponding to all the vowel and consonant symbols in the International Phonetic Alphabet, as well as some single words, sentences, and longer passages.
Of course, simply seeing the movements of the human articulatory apparatus is not, by itself, sufficient. Viewing articulatory illustrations is only the beginning of the pronunciation-learning process. To produce the target sounds accurately and naturally, learners need to practise the corresponding movements until they become automatic. Nevertheless, these visuals may be helpful for raising linguistically-oriented learners’ awareness of what they need to do to pronounce certain speech segments correctly. Still, critics have expressed doubt that ‘ordinary language learners’ benefit from such graphic depictions ‘without proper linguistic or phonetic training’ (Bajorek, 2017: 33).
American Accent on the Go (http://AmericanAccentOnTheGo.com) is an entertaining, pedagogically-oriented iOS app that uses visual articulatory displays to improve ESL learners’ pronunciation. It features a friendly, cartoon-like character named Mimo, whose body parts consist of only eyes, teeth, lips, a tongue, and the upper throat. Mimo depicts what happens inside the mouth when producing American English vowels and consonants. Other user-friendly programs that provide visual articulatory displays can be found at http://www.englishcentral.com and http://funeasyenglish.com/new-american-english-pronunciation-introduction.htm.
Visual: Acoustic Displays
Various L2 pronunciation websites and mobile apps offer visual feedback based on the acoustics of utterances. Such feedback takes various forms, as described in the following two subsections.
Waveforms, Spectrograms, and Formant Data
The electronic technology associated with acoustic phonetics can be used for L2 pronunciation instruction. Programs like Praat® (http://www.fon.hum.uva.nl/praat/), which was designed for research purposes, can also be used for instruction in the hands of a competent instructor. L2 learners’ speech can be recorded and then analysed (and compared with a model) using spectrographic and other images that include pitch (in Hz) and formant data. Lamentably, a Praat® mobile app does not exist, nor can Praat® be operated online. It must be downloaded to and operated on a desktop or laptop computer. Further, operating and understanding Praat® requires a considerable degree of linguistic and technical sophistication and training. Mobile apps and online programs like Mango (http://www.mangolanguages.com/) and Rosetta Stone (www.rosettastone.com) offer a more user-friendly presentation of waveforms and allow learners to compare their own waveforms with those of native speakers. Nevertheless, the only feedback these programs provide is the visual images themselves. Learners and teachers must learn to understand and analyse those images. As with many minimal-pair programs (see the preceding section), the tool is only as useful as the users’ own analytical phonetic abilities, which, in many cases, are quite low.
Pitch Contours
The CAPT products described thus far have focused primarily on vowels and consonants. Much recent research, however, has shown that suprasegmentals like pitch and stress are crucial to intelligibility and are as important as, if not more important than, individual sound segments.
Visi-pitch®, an early CAPT and speech therapy program, dazzled users in the 1980s with its ability to provide them with real-time, visual feedback on their intonation in the form of a display that showed a contour based on the rising and falling pitch in their spoken utterances.
Some online CAPT resources today do similar things. For example, Better Accent Tutor (http://www.betteraccent.com/index.html) offers ‘instant audio-visual feedback’ that ‘analyses and visualizes intonation, intensity and rhythm patterns of recorded utterances’ and ‘allows users to visually compare the user’s and native speaker’s intonation, intensity, and rhythm patterns’. Praat® (http://www.fon.hum.uva.nl/praat/) (described above) also produces rising and falling visual images (as well as actual pitch frequencies in Hz) of a speaker’s intonation.
Despite the technological impressiveness of all these acoustic tools, however, their pedagogical value is debatable. Some researchers argue that unless the visualizations are accompanied by instructional explanations and feedback, they do not lead to improved pronunciation.
ASR
For decades, the developers of CAPT software have anticipated the benefits of ASR. Ideally, ASR would ‘recognize everything the user says, point out those areas that are most problematic (depending on the user’s priorities, be it intelligibility, comprehensibility or accuracy), and then offer explicit feedback indicating how to improve’ (Fouz-González, 2015: 324). Unfortunately, although ASR is moving toward that ideal, CAPT apps are still far from reaching it. A decade ago, the reliability of ASR software for teaching English pronunciation was only ‘mediocre’ (r = 0.56) (Kim, 2006: 327). Despite these limitations, both CAPT software and speech-to-text dictation tools have made use of ASR. Each of these will be discussed in its own section below.
CAPT Software
For years, experts in L2 pronunciation-teaching have argued for increased learner responsibility, autonomy, and self-monitoring. Online CAPT software with ASR promises to encourage such autonomy by permitting learners to practise outside of class without a teacher and still receive immediate feedback on their intelligibility. Sadly, few online CAPT programs have been able to provide these ASR-based capabilities validly and reliably. After reviewing a large number of L2 pronunciation teaching/learning apps, Kaiser (2017) reported that the few that used ASR provided only simplistic binary feedback that was often not accurate.
Most current ASR-CAPT software still cannot accurately and reliably recognize and process natural, spontaneous speech from different speakers—especially if those speakers are non-natives. Foreign accents typically produce frequent false alarms and have low rates of correct detection. In addition, ASR ratings of L2 learners’ utterances often differ from those made by human raters. An acceptable level of reliability can be achieved only when utterances are simple and restricted. The experience of using an app with ASR can be frustrating for L2 learners if their mistakes are either not detected or are detected incorrectly. L2 learners’ frustration may also grow as they repeatedly try to meet the software’s standards when their pronunciation is already acceptable. The CAPT device is supposed to be an ‘expert’ on which users can rely, but once they suspect it is not reliable, they lose confidence in it. In this way, inadequate CAPT-ASR software, rather than being helpful, can be frustrating and counter-productive.
Fortunately, there is hope on the horizon. The speech-to-text application programming interfaces (APIs) underlying commercial voice-recognition software used for dictation programs (for example, IBM Watson, Google Voice, and CMU Sphinx), as well as artificial intelligence interfaces such as Siri, Alexa, Google Home, and Amazon Echo, are receiving heavy attention from developers and strong support from industry. Consequently, these APIs are becoming more and more powerful and accurate. Built on robust neural-network models, these tools can be used to create mobile apps that convert speech into text more accurately than ever before, and in over 100 languages. As CAPT app developers implement these newer speech-recognition APIs, the quality of ASR in CAPT apps will increase accordingly. In fact, programs like Babbel (https://www.babbel.com) now boast, ‘Interactive dialogues will give you the confidence to speak, and our speech recognition technology will help you get it right’. This is not an idle boast. Babbel’s speech recognition engine uses a fine-grained rating scale from 0 to 100. This more exacting evaluation allows the program to give feedback that is generous enough that students don’t get discouraged, but stringent enough to make sure their speech is intelligible.
Speech-to-Text Dictation Software
Widely used speech dictation programs like Dragon Naturally Speaking (https://www.nuance.com/dragon.html), Macintosh Dictation (https://support.apple.com/en-us/HT202584), Google Voice Typing (GVT; https://support.google.com/docs/answer/4492226?hl=en), and Windows Speech Recognition (https://support.microsoft.com/en-us/help/14213/windows-how-to-use-speech-recognition) were developed for native speakers and are, therefore, strictly speaking, not dedicated CAPT products. Nevertheless, they can be useful for language learners. When learners speak into a computer or smartphone, the software recognizes the oral input and converts it to text. The potential uses and advantages of ASR dictation software for L2 learners are many. As the software converts L2 learners’ speech into text, the output clearly displays to L2 learners what the computer understood and what it did not. In this way, it provides learners with valuable, targeted feedback regarding their pronunciation.
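How dictation output can serve as targeted feedback may be illustrated with a minimal Python sketch. It assumes that the sentence the learner intended to say is known in advance and that the dictation program's output has been copied in as plain text; no real dictation API is called. The function name and example sentences are hypothetical, and the comparison relies on `difflib.SequenceMatcher` from the Python standard library.

```python
import difflib


def dictation_feedback(intended, recognized):
    """List the words the recognizer heard differently from what was intended."""
    intended_words = intended.lower().split()
    recognized_words = recognized.lower().split()
    matcher = difflib.SequenceMatcher(
        a=intended_words, b=recognized_words, autojunk=False
    )
    notes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # these words were recognized as intended
        said = " ".join(intended_words[i1:i2]) or "(nothing)"
        heard = " ".join(recognized_words[j1:j2]) or "(nothing)"
        notes.append(f"intended '{said}' -> recognized '{heard}'")
    return notes


# Hypothetical example: the learner meant 'ship' but the software typed 'sheep'.
for note in dictation_feedback(
    "I saw a ship in the harbour", "I saw a sheep in the harbour"
):
    print(note)
```

A mismatch of this kind points the learner to a specific word, and often to a specific sound contrast, although a recognition error may also reflect the software's limitations rather than the learner's pronunciation.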
Further, as explained above, the speech-recognition APIs used in these modern speech-to-text dictation products have been improving in reliability and accuracy. McCrocklin et al. (2019) measured the transcription accuracy of several commercial dictation products. They then compared their accuracy readings with those of earlier researchers. Derwing et al. (2000) found only 72% accuracy for non-native speech, but the McCrocklin et al. (2019: 196) study found that the accuracy rates for GVT were now much higher for non-native speakers (88.6% for controlled reading and 93.5% for free speech). With native speakers, GVT was even more accurate (92% with controlled readings and 98% with free speech). Other studies of L2 learners’ use of ASR dictation on their smartphones have shown several other advantages beyond improved accuracy, including visual representations of their oral output that made them more aware of their pronunciation, improved self-confidence, lower anxiety, and increased willingness to talk in the L2. In sum, despite the rather cloudy past of ASR in CAPT, it still ‘holds great promise’ (Levis and Suvorov, 2020: 154) and the future looks bright.
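Transcription accuracy figures of the kind reported above can be approximated in several ways. The Python sketch below shows one common approach, word-level accuracy derived from edit distance; it is an assumption made for illustration and is not necessarily the measure used in the studies cited. The function names and example sentences are hypothetical.

```python
def word_edit_distance(reference, hypothesis):
    """Levenshtein distance between two lists of words."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # word missing from the transcript
                    current[j - 1] + 1,      # extra word in the transcript
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1]


def word_accuracy(reference_text, recognized_text):
    """Share of reference words transcribed correctly, floored at zero."""
    reference = reference_text.lower().split()
    hypothesis = recognized_text.lower().split()
    if not reference:
        raise ValueError("reference text must contain at least one word")
    errors = word_edit_distance(reference, hypothesis)
    return max(0.0, 1.0 - errors / len(reference))


# Hypothetical example: one of ten words was transcribed incorrectly.
reference = "please call stella and ask her to bring these things"
recognized = "please call stella and ask her to bring this things"
print(f"{word_accuracy(reference, recognized):.1%}")
```

Because the score depends on how insertions, deletions, and substitutions are counted, accuracy rates from different studies are comparable only when the scoring procedure is the same.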
Corpora (for Research and Teaching)
Online corpora created for linguistic research can also be used for L2 pronunciation teaching. Some corpora include downloadable sound files, which can be used by L2 learners for speech perception and production practice. Since these sound files come from a wide variety of speakers, registers, and regional accents, they offer the advantage of HVPT (discussed above). Some of the more noteworthy corpora of this type are the following:
Speech Accent Archive (http://accent.gmu.edu). This corpus includes 2778 samples of native and non-native speakers of English from various accent areas reading the same elicitation paragraph. The archive contains demographic information on each speaker. In many cases, phonetic transcriptions of each downloadable sound file are available.
LANGSNAP (Languages of Social Networks Abroad Project; http://langsnap.soton.ac.uk) contains two corpora. One is in French (http://www.flloc.soton.ac.uk/list.html) and the other in Spanish (http://www.splloc.soton.ac.uk). Each corpus consists of audio recordings and transcriptions of 25 UK university students spending their third year abroad in Spain, Mexico, or France (plus 10 native speakers) per language.
Berlin Map Task Corpus (https://www.linguistik.hu-berlin.de/en/institut-en/professuren-en/korpuslinguistik/research/bematac). This corpus of spoken German comprises recordings of 12 native speakers and five advanced speakers of German as a foreign language made after they were instructed to give spoken directions about a route on a map.
You[En]Glish (Youglish.com) is a web-based pronunciation resource designed for English language learners. (A French-language version also exists.) The YouGlish website can be used to locate and listen to authentic pronunciations of English words in the larger context of their surrounding discourse. To use YouGlish, users merely type in a word or phrase whose pronunciation they are unsure of. The site then takes the users to segments in YouTube® videos. Each video comes from an original source (such as a TED Talk or television interview) that uses the word in context, as spoken naturally by real people (in US, UK, or Australian English). The interface allows users to pause the video, jump back five seconds, or return to the start. Typically, YouGlish offers multiple video examples (sometimes hundreds). In addition to the online video models, YouGlish provides a transcript of the speech, a phonetic transcription of the target word, a list of ‘nearby words’ (that have similar pronunciations), and a list of relevant ‘tips to improve your English pronunciation’. In addition, users can slow down the playback speed. Because different speakers appear in the videos, they also offer the advantage of HVPT.
Summary and Conclusion
As this overview has illustrated, a substantial variety of CAPT options currently exists, in terms of both instructional approach and technological basis. These options employ a number of different sensory modalities and give feedback of different types. For these reasons, their instructional value also varies. The taxonomy presented in this article is intended to help L2 pronunciation teachers and learners understand the breadth and variety that exist in CAPT software and leave them better prepared to choose and recommend online CAPT learning tools. The principles and categories underlying this taxonomy may also help sharpen our thinking and communication in discussions of CAPT.
To sum up, over the last decade, impressive progress has been made in CAPT. Nevertheless, much remains to be done. Future progress in CAPT will depend on (a) ongoing advances in artificial intelligence, ASR, and other computer technology; and (b) the development and application of instructional approaches based on research in L2 pronunciation teaching and learning.
