Abstract
Over the past several years, the field of bilingual speech perception has seen a substantial increase in both the number of publications and the amount of interest directed at its findings. Consequently, the time is ripe to assess the state of the field: what we have accomplished and where we have yet to go. Although we cannot capture the full state of the field in the space of this paper, we hope to summarize the major trends that have led to the current state and take stock of its future directions. To that end, we focus our review on the relative merits of single phonemes versus whole words and phrases when investigating bilingual speech, on the efficacy of the different training paradigms that have been attempted and, in particular, on the role of individual differences in predicting learning outcomes. We conclude our review by highlighting recent developments demonstrating that identifying individual differences in ability pre-training can result in more efficacious training paradigms. Goals for future research are also discussed.
The past several years have seen an increased interest in second-language learning and bilingual language processing. This includes second-language and bilingual speech perception. We present here a review of some of the recent developments in non-native speech perception, focusing on isolated contrasts, training efforts and individual differences between learners. We note that throughout the literature, there is extensive evidence of individual variability, seen both in the ability to differentiate difficult non-native contrasts and in the ability to improve with training. Although the traditional approach has been to look at bilingual performance in the aggregate (e.g., Skehan, 1991), the existence of such extensive individual variability suggests that future efforts should be directed at understanding the loci of individual differences and how to capitalize on them to develop the most efficacious training paradigms.
The perception of isolated contrasts and whole-word perception
Much of the work in second-language speech perception has focused on non-native listeners’ ability to identify and discriminate isolated contrasts that are not found in the native inventory. Indeed, entire reviews are available for single contrasts (e.g., Yamada, 1995). We present here a review of two contrasts: perception of the English /r–l/ by native Japanese (NJ) speakers and perception of the Catalan /ε–e/ by native Spanish speakers in order to demonstrate some of the challenges faced by early and late bilinguals. We focus on these two contrasts because they are two of the most written-about contrasts, they allow for the comparison of consonant and vowel perception, and they are instances of two second-language sounds heard as a single first-language sound by both late (Japanese) and early (Spanish) bilinguals.
In general, when learning to discriminate new sounds, language learners must learn the acoustic features that distinguish them from other sounds, either in their L1 or L2 (see Ettlinger & Johnson, 2009; Johnson, 1997; Pierrehumbert, 2001 for an alternate view). The perception of more challenging sounds often requires the perception of acoustic features or cues that the learners do not use in their L1.
Perception of the English /r–l/ by native Japanese speakers
No review of second-language speech perception would be complete without a discussion of NJ listeners’ perception of the English /r–l/. As first documented by Goto (1971), NJ listeners struggle to differentiate /r/ from /l/, perceiving both to be instances of a single Japanese category /ɾ/, often described as a tap or a flap. In the time since, a tremendous effort has been put into understanding both the locus of the difficulty and the extent to which it can be overcome via training. We focus on the former here.
As noted by O’Connor, Gerstman, Liberman, Delattre, and Cooper (1957), the onset frequency of the third formant, F3, is sufficient to differentiate instances of /r/ and /l/ for native English (henceforth NE) listeners. Miyawaki et al. (1975) noted that Japanese productions of /ɾ/ vary unsystematically in F3 onset frequency, suggesting that a lack of sensitivity to F3 onset frequency in the /r–l/ context may be the source of NJ listeners’ perceptual difficulty. Indeed, they found that NJ listeners performed at chance level on a /r–l/ discrimination task where stimuli differed only on the basis of F3 onset frequency. Yamada and Tohkura (1990) and Iverson et al. (2003) have since demonstrated that not only are NJ listeners relatively insensitive to F3 onset frequency in the /r–l/ context, but they also appear to make great use of the second formant, F2, when making /r–l/ category judgments.
It was initially thought that both /r/ and /l/ were assimilated to a single Japanese category, /ɾ/, but there is evidence to indicate that there are asymmetries in the assimilation and confusion patterns indicating that /l/ is assimilated better (Aoyama, Flege, Guion, Akahane-Yamada, & Yamada, 2004; Guion, Flege, Akahane-Yamada, & Pruitt, 2000; Iverson et al., 2003; Takagi, 1993). However, this unequal assimilation does not appear to impact identification or discrimination performance. Instead, identification performance is best predicted by listeners’ reliance on the F3 onset cue, indicating that it is not perceived distance from the native category but listeners’ sensitivity to the native cue that best predicts performance (Hattori & Iverson, 2009). Ingvalson, McClelland, & Holt (2011) also found that reliance on the F3 onset cue was the best predictor of natural speech /r–l/ identification to the exclusion of traditional predictors of L2 fluency, such as length of residency, years of English-based education and proportion of language use that was Japanese. Clearly, then, some individuals are able to make use of the F3 cue in the /r–l/ context and these individuals show the most NE-like identification performance. Understanding what distinguishes these individuals from those who are unable to utilize F3 would give us the opportunity to predict learning outcomes and, more importantly, to develop training paradigms that can show a greater benefit to listeners than previous attempts.
Perception of the Catalan /ε–e/ by native Spanish speakers
While the case of NJ listeners perceiving the English /r–l/ is a classic example of late-learned speech perception, much of the world learns two or more languages from an early age (e.g., Petitto et al., 2001). The Catalonia region of Spain houses such a population. Spanish and Catalan are both official languages of the region and Catalan is predominant in early kindergarten, while both languages are taught in later grades, resulting in many highly proficient early bilinguals (Pallier, Bosch, & Sebastián-Gallés, 1997). Catalan contains several contrasts that Spanish does not, the most studied being /ε–e/, both of which are heard as instances of the single Spanish sound /e/.
One complaint that has been leveled against research on bilingual and non-native speech perception is that explicit identification and discrimination of phonemes—such as by asking participants to judge whether stimuli are instances of a particular category or if they sound different from a particular category—require a certain amount of metalinguistic knowledge (Cutler, Weber, & Otake, 2006; Sebastián-Gallés, 2005). These researchers have noted that the non-native listeners, who are aware they have difficulty with the tested contrast, may adopt decision criteria different from those of the native listeners. The non-native listeners’ decision criteria may in turn mask sensitivity to the phonetic distinction, making it appear that sensitivity is lacking when in fact it exists. Therefore, the aim of the experiments investigating /ε–e/ perception has been to develop sufficiently sensitive measures to determine whether sensitivity exists but has gone unnoticed by traditional measures (e.g., Pallier et al., 1997). As shall be seen, the results from these more implicit measures are largely in line with those seen using explicit measures.
These experiments utilize two types of bilinguals: adults whose first language was Catalan with an early exposure to Spanish (called Catalan dominant, or CD) and adults whose first language was Spanish with an early exposure to Catalan (called Spanish dominant, or SD); exposure to the L2 was typically before age 3. Although the contrast of interest is always /ε–e/, which is phonemic in Catalan but not in Spanish, listeners are generally not asked explicitly to identify or discriminate the phones or minimal pairs containing them. Instead, listeners are asked to identify the first syllable of a disyllabic non-word (Navarra, Sebastián-Gallés, & Soto-Faraco, 2005), make word–non-word judgments (Pallier, Colomé, & Sebastián-Gallés, 2001; Sebastián-Gallés, Echeverría, & Bosch, 2005) or identify words in a gating task (Sebastián-Gallés & Soto-Faraco, 1999). In all cases SD bilinguals showed a lack of sensitivity to the /ε–e/ contrast that was not seen in the CD bilinguals. Detailing these findings further, Navarra et al. (2005) found that CD bilinguals were slower to identify the initial syllable in lists where the second syllable changed between /ke/ and /kε/ relative to when the second syllable was constant, whereas no such slowing was found for SD bilinguals. SD bilinguals incorrectly classified /ε–e/ non-words as words at a much higher rate than CD bilinguals (Sebastián-Gallés et al., 2005) and showed a repetition-priming benefit from /ε–e/ minimal pairs that CD bilinguals did not (Pallier, Colomé, & Sebastián-Gallés, 2001). In the gating task, SD bilinguals required significantly more gates to correctly identify the word than did CD bilinguals (Sebastián-Gallés & Soto-Faraco, 1999).
These results have been taken as evidence of the fundamental importance of early language exposure in the development of speech sound categories (Werker & Tees, 1984). However, it is worth noting that although the SD bilinguals as a group express less sensitivity to the /ε–e/ contrast than do CD bilinguals, not all SD bilinguals are equally impaired. As noted by Sebastián-Gallés et al. (2005), there is much greater individual variability amongst the SD bilinguals than amongst the CD bilinguals, who are performing the tasks in their first language. Again, as was the case with the NJ listeners, there is a need to understand what distinguishes those SD bilinguals who are able to perceive the contrast from those who cannot. This is actually the case throughout much of the non-native speech perception literature, where a population of listeners struggles to differentiate a particular contrast when viewed in the aggregate, but where there is actually considerable variability when perception is examined at the level of the individual (e.g., Flege, MacKay, & Meador, 1999). Without an understanding of what makes a given listener good at perceiving non-native contrasts relative to other learners, it is impossible to fully characterize bilingual speech sound learning or to develop optimal training paradigms. As we will see below, although training has been shown to result in improved perception over a population of listeners, there is still extensive variability post-training, indicating that training is not optimal for all listeners. Optimal training might be developed by gaining a better sense of what characterizes a good learner.
Whole-word perception
The studies above, as well as the many related studies we did not discuss, have provided valuable insights into non-native speech perception. However, phonemes rarely occur in isolation outside of a laboratory, occurring instead in the context of whole words, leading to the question of how difficulties in phoneme perception might influence lexical activation. Some insight into these effects can be seen in the above studies using lexical tasks in lieu of traditional identification and discrimination tasks. Pallier et al. (2001) found a repetition-priming effect for /ε–e/ minimal pairs in SD, but not CD, bilinguals, indicating that instances of both sounds activate the same lexical entry (see also Cutler & Otake, 2004). Cutler et al. (2006) used an eyetracking paradigm to determine the extent to which difficulty differentiating phonemes impacts lexical access. NJ listeners were instructed to look at a target that on some trials started with /r/ or /l/. In these cases there would also be a distracter present with an initial syllable that was a minimal pair of the target; for example, if the target were “rocket” the distracter would be “locker” because “rock–lock” are a minimal pair. NJ listeners looked initially at both the target and distracter when the target started with /r/, but not when the target started with /l/, in which case they only looked at the target. A similar asymmetry was found for native Dutch speakers listening to English vowels, where an English-only phone was sufficient to invoke looks to both target and distracter, but a shared Dutch-English phone was not (Weber & Cutler, 2004). In a lexical decision task, Sebastián-Gallés et al. (2005) found that CD bilinguals, who showed better accuracy overall than their SD counterparts, also showed an asymmetry; CD bilinguals were more accurate when the word contained /e/ than when it contained /ε/.
Looking explicitly at lexical abilities in non-native listeners, Bradlow and Pisoni (1999) created lists of easy and hard words, where easy words were higher in frequency and had fewer neighbors, and those neighbors were relatively low in frequency. While both native and non-native listeners found words on the easy list to be more intelligible than words on the hard list, this benefit was greater for non-native listeners. This suggests that non-native listeners do struggle with whole-word perception, quite possibly due to difficulty distinguishing the phonemes of a word, as would be necessary when a word has many neighbors (Newman, Sawusch, & Luce, 1997).
Clearly, difficulties with phoneme differentiation impact lexical access in second-language speech perception. Also interesting from these data are the asymmetries seen in lexical performance. In the above cases, the sound that is common between the bilinguals’ two languages (/e/ for Catalan) or that is best assimilated to the native category (/l/ for Japanese) shows the greatest lexical accuracy, whether in fewest looks to distracters or in lowest lexical decision errors. Although these data are too few to state definitively that words containing the phoneme common to the bilinguals’ two languages receive a benefit in word recognition, current models of bilingual word recognition hypothesize simultaneous activation of both lexicons (e.g., Kroll & Stewart, 1994), suggesting that if a bilingual’s speech sound categories are also shared (e.g., Flege, 2003; alternatively, Best & Tyler, 2007) there may be a benefit to words containing shared phonemes. Further research is warranted to determine exactly how asymmetric speech sound assimilation might influence lexical access. There is also a need for further research to ascertain how individual differences in second-language speech perception result in individual differences in lexical representations and/or lexical recognition.
Training non-native speech perception
Characterizing the difficulties non-native listeners have with second-language speech perception is only one goal of research. A second, arguably more common, goal is to test hypotheses of language learning via efforts to train listeners to better perceive non-native speech. The initial efforts in this domain utilized primarily synthetic speech, and so we will begin our review by discussing training with these types of stimuli. We will then discuss those studies that trained on unaltered natural speech tokens.
Training using synthetic speech
One of the classic studies in the speech-perception training literature was performed by Jamieson and Morosan (1986). Gains following training were small but significant: listeners identified trained stimuli on average 11% better than before training. This result is typical of the studies that used highly unnatural speech stimuli; the improvements were generally significant, but very small, and ultimate performance was not equivalent to that of native listeners. Reliable, but generally unimpressive, improvement was seen in the /r–l/ context (McCandliss, Fiez, Protopapas, Conway, & McClelland, 2002; Strange & Dittman, 1984), as well as in the perception of final stops (Flege, 1989). None of these studies resulted in large gains or improved performance on untrained sounds. Given the unimpressive results using synthetic speech as training stimuli, it is unsurprising the vast majority of the training literature has focused on natural speech training stimuli, to which we now turn.
Training using natural speech
If training on natural speech tokens were to be exemplified by any one particular paradigm, it would be the High Variability Phonetic Training (HVPT) paradigm (Lively, Logan, & Pisoni, 1993; Lively, Pisoni, Yamada, Tohkura, & Yamada, 1994; Logan, Lively, & Pisoni, 1991). In this paradigm, listeners hear multiple exemplars of the training phonemes produced by more than one native talker and in more than one phonetic context. Listeners identify each word from a set of response options and are given feedback on each trial. The duration of training and the particular details of each study may vary, but the general framework is very consistent.
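The structure of an HVPT session can be made concrete with a minimal sketch. It assumes a fully crossed design and a two-alternative identification response; the talker labels, context labels and function names are hypothetical and are not taken from any of the cited studies, whose details vary.

```python
import itertools
import random


def build_hvpt_trials(talkers, contexts, categories, seed=0):
    """Cross every talker with every phonetic context and target category,
    then shuffle so that talker and context vary from trial to trial."""
    trials = [
        {"talker": talker, "context": context, "target": category}
        for talker, context, category in itertools.product(
            talkers, contexts, categories
        )
    ]
    random.Random(seed).shuffle(trials)
    return trials


def run_trial(trial, response):
    """Score one identification response and build the trial-level feedback."""
    correct = response == trial["target"]
    if correct:
        feedback = "correct"
    else:
        feedback = f"incorrect, the word contained /{trial['target']}/"
    return correct, feedback


# Hypothetical stimulus set: three talkers, four contexts, two categories.
trials = build_hvpt_trials(
    talkers=["talker_1", "talker_2", "talker_3"],
    contexts=["initial", "cluster", "medial", "final"],
    categories=["r", "l"],
)
```

Holding one talker or one context out of the list passed to `build_hvpt_trials` would give the untrained-talker and untrained-context test sets used to assess generalization.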
The initial use of this paradigm was to train NJ listeners to perceptually differentiate /r–l/. Similar to the training studies using synthetic speech, listeners showed a significant improvement on the trained talkers and trained contexts. However, in contrast to the studies using synthetic speech, listeners showed a significant improvement on trained talkers producing untrained contexts, untrained talkers producing trained contexts and untrained talkers producing untrained contexts, evidence of a significant generalization effect (Lively et al., 1993, 1994; Logan et al., 1991). Perceptual training also improved NJ trainees’ /r–l/ productions (Bradlow, Pisoni, Akahane-Yamada, & Tohkura, 1997) and the effects lasted at least three months post-training (Bradlow, Akahane-Yamada, Pisoni, & Tohkura, 1999).
In the time since, the paradigm has been applied to teaching Mandarin tones to NE listeners (Wang, Spence, Jongman, & Sereno, 1999), English vowels to native Spanish and German listeners (Iverson & Evans, 2007; Kingston, 2003), English fricatives to native Danish listeners (Trapp & Bohn, 2000) and English vowels to NJ and Korean listeners (Nishi & Kewley-Port, 2007, 2008). It should be noted that although all training studies have found improvement, none have found mean performance that is equivalent to native listeners. Expanding the paradigm beyond /r–l/ has demonstrated that although training generalizes to novel contexts and talkers, it does not appear to generalize to novel contrasts with similar features (Trapp & Bohn, 2000). The studies training English vowels have found better training outcomes when using a full set of English monophthongs relative to focusing on those vowels known to be difficult for non-native listeners (Nishi & Kewley-Port, 2007, 2008) and have discovered that listeners with larger vowel inventories show more improvement over training than those with smaller vowel inventories (Iverson & Evans, 2007).
This last point is particularly interesting because it demonstrates again that not all non-native listeners are created equal, and differences among learners can result not only from differences in individual abilities, but also from differences in the language backgrounds of the listeners. As we saw in our above discussion on the perception of non-native contrasts, there is extensive variability amongst listeners, where some non-native listeners are able to perform the task quite well and are able to use the same cues as native listeners, whereas others struggle to differentiate non-native sounds, even after years of experience with the language. The same is true for the training studies. Although improvement is seen at the aggregate level, when performance is examined at the level of the individual, extensive variation is seen both when using synthetic speech (McCandliss et al., 2002; Strange & Dittman, 1984) and when employing the HVPT paradigm (Bradlow et al., 1997, 1999; Nishi & Kewley-Port, 2007, 2008; Trapp & Bohn, 2000). Thus, while training with natural speech is preferable to training with synthetic speech, and more phonetic variability is preferable to less, it appears that even the most efficacious training paradigm to date is limited by the individual abilities and language backgrounds of the learners. Again, we suggest that a better grasp on what characterizes a good non-native speech perceiver relative to a poor one will allow for the development of better training paradigms and perhaps eliminate the variability seen in all training studies to date.
Training changes in cue weighting
In recent years there has been an interest in determining the extent to which training can alter the acoustic cues listeners use when making category judgments, with the hope that cue weightings can become more native-like. Francis, Nusbaum, & Baldwin (2000; Francis, Kaganovich, & Driscoll-Huber, 2008) first examined this issue in native listeners who rely on multiple co-varying acoustic cues when perceiving speech. Stimuli were altered to make one cue a more reliable indicator of category membership; listeners were trained to identify these altered stimuli. Following training, listeners placed greater weight on the trained cue than the untrained cue when identifying speech sounds, demonstrating a significant shift in cue weightings.
Iverson, Hazan, and Bannister (2005) attempted to train NJ listeners to rely more heavily on the F3 cue in the /r–l/ context using a variation on the HVPT paradigm. In addition to the multiple tokens typical of the paradigm, stimuli were altered to either emphasize the salience of the F3 cue or to de-emphasize the salience of competing acoustic cues. While they saw overall improvement, they saw no changes in cue weighting following training. Iverson and Evans (2009) used the HVPT paradigm to train native Spanish and native German listeners’ perception of English vowels. They too saw significant improvement following training, but no change in cue weightings. However, they did find that those listeners who showed more English-like acoustic weightings were more accurate at natural speech vowel identification. Ingvalson, Holt, & McClelland (2012) trained NJ listeners on stimuli that differed only in F3 onset frequency and found no effect of training at the group level, although a few listeners did show improved identification following training. Identifying predictors of sensitivity to non-native acoustic cues might make it possible to develop training that would result in more native-like cue weightings for all learners. Alternatively, it may be the case that only those learners who already show an aptitude for non-native cue weight learning will benefit from cue weight training; however, a way to identify these individuals pre-training remains necessary.
Individual differences in outcome performance
There is always extensive individual variation in second-language speech perception and speech learning. We have seen some evidence of this variation in our discussion of perception of non-native contrasts, noting that some individuals perform quite well on the contrast, while others struggle to make the distinction. We have seen additional evidence in our discussion of training paradigms, noting that even in the most efficacious training paradigms there are some individuals who benefit from training more than others. Recent efforts have looked to determine if it is possible to predict which individuals will receive the greatest benefit from training. These efforts have developed into customized training paradigms that take individual variability into account.
Predicting individual learning outcomes
An obvious place to look for potential predictors of speech sound learning outcomes is in the pre-training data. Wong and Perrachione (2007) trained listeners to perceive Mandarin lexical tones; prior to training, listeners identified non-lexical pitch patterns. Those listeners who were best able to identify non-lexical tones showed the greatest success in lexical tone learning. Similarly, Chandrasekaran, Sampath, and Wong (2010) found that listeners who best identified non-speech pitch movements pre-training benefited the most from lexical tone training. Golestani and Zatorre (2009) asked listeners to identify synthetic instances of the Hindi dental-retroflex contrast before and after training; training was a perceptual fading paradigm and listeners trained to criterion (defined as reaching criterion performance on the most difficult pair of stimuli or a total of 200 training trials, whichever was reached first). Pre-training identification performance correlated both with the ultimate attainment level and with the number of total training blocks, indicating that listeners with good pre-training identification learned faster and were more likely to show optimal performance at the end of training. Similarly, Lengeris and Hazan (2010) found that pre-training synthetic speech discrimination, non-speech frequency discrimination and natural speech identification all predicted post-training natural speech identification scores.
While these efforts do allow us to predict who will perform better following training, the essential message is that those who are better before training will receive the most benefit from training and will be better after training. What is lacking from this dataset is any indication of what makes a good learner good. Recent neuroimaging work has provided this insight.
Wong, Perrachione, and Parrish (2007) used functional neuroimaging to examine the participants from Wong and Perrachione (2007). Relative to less successful learners, successful learners showed increased activation in the left posterior superior temporal gyrus (pSTG) and in the left transverse temporal gyrus (TTG) after training. Before training, relative to less successful learners, successful learners showed increased activation in bilateral superior and middle temporal regions and in the right inferior temporal gyrus (ITG). In tasks highly similar to those of Golestani and Zatorre (2009), Golestani, Paus, and Zatorre (2002; Golestani, Molko, Dehaene, LeBihan, & Pallier, 2007) have found that learning rate correlates with the white matter density bilaterally in parietooccipital sulcus such that greater density corresponds to faster learning; the sulcus appeared more posterior in fast relative to slow learners. Fast learners were also found to have greater white matter density in the left hemisphere relative to the right. Finally, fast learners had larger volumes of the left Heschl’s gyrus and more white matter density in the left Heschl’s gyrus than slow learners. A larger volume for the left Heschl’s gyrus was also implicated in more successful tone learning (Wong et al., 2008). Together, these data indicate that the differences between good and poor learners may arise from neurophysiological and neuroanatomical variations.
Developing training paradigms based on individual differences
One problem with most training methods is that training occurs for a prescribed duration, which may not be sufficient for learners to truly master the trained contrast. Some listeners may require more training than that provided in the limited time span, giving the appearance that learning was incomplete when in reality optimal learning could have been achieved if sufficient training time had been allotted. Thus, what appears to be variability in training outcomes could actually be an artifact of highly constrained training methods. Training listeners to criterion, however, ensures that learners have made the maximum amount of progress possible within the training set. Knowing that listeners have made the maximum amount of possible progress ensures that whatever variation might exist in the data is due to variation amongst the listeners and not due to some listeners having received insufficient training relative to other listeners.
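The contrast between fixed-duration training and training to criterion can be illustrated with a short sketch. The stopping rule used here (90% correct on two consecutive blocks, with a cap on the number of blocks) and the two simulated learners are assumptions made for illustration only; the studies reviewed here define criterion in their own ways.

```python
def train_to_criterion(run_block, criterion=90, consecutive=2, max_blocks=50):
    """Run training blocks until percent correct reaches `criterion` on
    `consecutive` blocks in a row, or until `max_blocks` have been run.

    Returns the block-by-block accuracy history and whether criterion was met.
    """
    history = []
    streak = 0
    for block in range(max_blocks):
        accuracy = run_block(block)
        history.append(accuracy)
        streak = streak + 1 if accuracy >= criterion else 0
        if streak >= consecutive:
            return history, True
    return history, False


def make_learner(start, gain):
    """Hypothetical learner: percent correct rises linearly, capped at 95."""
    return lambda block: min(95, start + gain * block)


fast_history, fast_done = train_to_criterion(make_learner(60, 10))
slow_history, slow_done = train_to_criterion(make_learner(55, 3))
```

Under these assumptions both simulated learners reach criterion, but the slower learner needs 14 blocks rather than 5; a fixed five-block schedule would have stopped that learner at 67% correct and recorded the shortfall as a difference in outcome rather than in training time.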
In addition, since learners’ training outcomes appear to hinge on their pre-training abilities and pre-existing neurophysiological and neuroanatomical differences, it is logical to believe that learners will show the greatest benefit from training if training is designed to take advantage of their individual abilities. If it is the case that differential learning outcomes are, in part, the result of pre-existing neurophysiological and neuroanatomical differences that correspond to pre-training differences in sensitivity to relevant category dimensions, we should see that training paradigms that take into account listeners’ unique abilities result in efficacious training. Perrachione, Lee, Ha, and Wong (2011; see also Lee, Perrachione, Dees, & Wong, 2007) have some evidence to this effect. Prior to training, listeners identified non-lexical tones and this information was used to divide listeners into those anticipated to show high training performance and those anticipated to show less native-like performance following training. Listeners from both groups were equally assigned to HVPT or to low-variability (i.e., single-talker or blocked-talkers) training paradigms. Listeners trained to asymptotic performance. Those listeners who performed well on the pre-training identification task showed a training and generalization benefit on the basis of HVPT, consistent with earlier work in this paradigm. Conversely, those listeners who performed poorly on the pre-training task showed the greatest benefit when variability was limited, in either a single-talker or blocked-talker paradigm. These data demonstrate (1) the importance of assigning listeners to optimum training paradigms in order to see maximally efficacious training, and (2) that the HVPT paradigm may not be the optimal training paradigm for all listeners.
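One way such an assignment could be operationalized is sketched below. The median split, the paradigm labels and the listener scores are all hypothetical; the sketch illustrates the idea of routing listeners by pre-training performance and does not reproduce the grouping procedure or the crossed design of the study described above.

```python
from statistics import median


def assign_training(pretest_scores):
    """Assign each listener to a training paradigm with a median split on
    pre-training identification accuracy (an assumed rule): listeners above
    the median receive high-variability training, the rest low-variability."""
    cutoff = median(pretest_scores.values())
    return {
        listener: "high_variability" if score > cutoff else "low_variability"
        for listener, score in pretest_scores.items()
    }


# Hypothetical pre-training proportions correct for four listeners.
scores = {"P01": 0.92, "P02": 0.55, "P03": 0.71, "P04": 0.48}
assignment = assign_training(scores)
```

A median split is the simplest possible rule; a continuous predictor or an adaptive schedule that adds talker variability as accuracy improves would serve the same purpose.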
Conclusions and future directions
Throughout this review we have repeatedly returned to the individual variability present in the perception of non-native speech. Individual variability is present in the perception of individual speech contrasts, the perception of whole words containing difficult contrasts and the extent to which listeners benefit from training. These differences appear to stem from pre-existing aptitudes to utilize relevant aspects of the speech signal, in part, as a result of observed differences in neuroanatomy and neurophysiology. Taking these individual differences into account when constructing training paradigms may result in more efficacious training for all learners.
We suggest that future investigations into non-native speech perception be guided by these recent forays into individual differences. We note several areas for future research. Firstly, as noted above, in real-world interactions, listeners almost never hear phonemes presented in isolation. We therefore advocate the use of methods that investigate non-native phoneme perception in the context of whole-word perception, such as those described by Cutler et al. (2006; Cutler & Otake, 2004) and Pallier et al. (2001). The use of whole-word paradigms eliminates the requirement that listeners must rely on metalinguistic knowledge through comparisons of stimuli to stored phonological categories to perform the task (Sebastián-Gallés, 2005) and avoids the possibility that listeners are able to acoustically differentiate non-native speech sounds but are unable to attach linguistic significance to the differentiated contrast (Werker, Cohen, Lloyd, Casasola, & Stager, 1998). Secondly, in the face of such consistent variability amongst non-native listeners, we believe that further investigation into the loci of individual differences is paramount to the development of the field. Here we have presented data suggesting that differences in training outcomes can be linked to differences in neuroanatomy and neurophysiology. It is also likely that differences can be accounted for by differences in cognitive ability (MacDonald, 2008; Majerus, Poncelet, Van der Linden, & Weekes, 2008) or variations in language background (Iverson & Evans, 2009). Whatever the loci for the differences may be, it is important to understand what distinguishes an individual with an aptitude for speech sound learning from one without such an aptitude if we are to effectively describe the course of speech sound learning. 
Finally, we suggest two changes to traditional training approaches: (1) the discontinued use of time-limited training and its replacement with training to asymptotic performance and (2) the assignment of listeners to training paradigms on the basis of their pre-training performance. The use of asymptotic training will ensure that all learners have received sufficient information to benefit from training as much as possible. Combining training to asymptote with adaptive training (e.g., Iverson & Evans, 2009) would result in training paradigms that are highly customized to the learner’s needs. Having demonstrated that the HVPT paradigm is not equally efficacious for all learners (Perrachione et al., 2011), it seems clear that if we are to give all learners the most efficacious training possible, recognition and optimization of their individual differences is essential (e.g., Iverson & Evans, 2009). This ties in with our second recommendation, understanding individual differences, and we believe that a better understanding of the source of individual variation will allow for more reliable predictions of outcome performance, which can then in turn be used to customize training to give the learner the greatest benefit.
