Abstract
Musical experience has been demonstrated to play a significant role in the perception of non-native speech contrasts. The present study examined whether or not musical experience facilitated the normalization of speaking rate in the perception of non-native phonemic vowel length contrasts. Native English musicians and non-musicians (as well as native Thai control listeners) completed identification and AX (same–different) discrimination tasks with Thai vowels contrasting in phonemic length at three speaking rates. Results revealed facilitative effects of musical experience in the perception of Thai vowel length categories. Specifically, the English musicians patterned similarly to the native Thai listeners, demonstrating higher accuracy at identifying and discriminating between-category vowel length distinctions than at discriminating within-category durational differences due to speaking rate variations. The English musicians also outperformed non-musicians at between-category vowel length discriminations across speaking rates, indicating musicians’ superiority in perceiving categorical phonemic length differences. These results suggest that musicians’ attunement to rhythmic and temporal information in music transferred to facilitating their ability to normalize contextual quantitative variations (due to speaking rate) and perceive non-native temporal phonemic contrasts.
1 Introduction
A variety of experiential factors can affect adult second language (L2) speech learning. L2 speech learning theories and findings point to the determining effect of linguistic experience, in that L2 phonetic category formation can potentially be influenced by existing native (L1) phonetic categories (e.g., Best & Tyler, 2007; Flege, 1995; Werker & Tees, 2002). In addition to L1 background, extralinguistic factors such as musical experience have been found to have a facilitative effect on how listeners perceive non-native contrasts. Enhancements to speech processing as a result of musical training are posited to arise from a multitude of factors (Patel, 2014), including musicians’ enhanced auditory acuity for acoustic features shared by speech and music (e.g., pitch and rhythm). In particular, musicians have demonstrated superior performance relative to non-musicians in perceiving pitch-related suprasegmental speech contrasts, such as lexical tone (e.g., Cooper & Wang, 2012; Delogu, Lampis, & Belardinelli, 2010; Lee & Hung, 2008). However, relatively few studies have examined the effects of musical experience on suprasegmental perception in the temporal domain (e.g., Chobert, François, Velay, & Besson, 2012; Sadakata & Sekiyama, 2011). Moreover, investigations into the role of musical experience in non-native tone and segmental perception have focused predominantly on the perception of isolated words (e.g., Cooper & Wang, 2012; Lee & Hung, 2008; Sadakata & Sekiyama, 2011). However, compared to contour tones in a tone language, whose identification relies more on syllable-intrinsic information, temporal speech information such as vowel length distinctions is more susceptible to contextual effects such as speaking rate (e.g., Abramson, 2001; Hirata, 2004a). Similarly, music varies in the tempo at which it is produced, such that relative note values (e.g., quarter notes and half notes) can change in their absolute durations depending on the speed at which the music is produced. It is thus conceivable that musical experience may affect the perception of temporal speech contrasts, particularly in longer, sentential-level speech contexts. To investigate the influence of musical experience on the perception of temporally-cued contrasts, the present work examines the perception of rate-varying Thai phonemic vowel length contrasts in sentence context by English musicians and non-musicians as well as native Thai listeners.
1.1 Vowel duration and speaking rate
Temporal speech information is frequently used in language and can serve a variety of functions, in that durational changes can, for example, serve to signal prominence, contrast word meaning, or enhance intelligibility. Quantity languages (e.g., Estonian, Finnish, Arabic, Japanese, and Thai) are known for their prevalent use of vowel duration differences (short versus long) to cue phonological distinctions (e.g., Abramson & Ren, 1990; Hirata, 2004a; Lehiste, 1970; Ylinen, Shestakova, Huotilainen, Alku, & Näätänen, 2006). For example, Thai utilizes nine short vowels with nine long counterparts to phonemically contrast different word meanings (Abramson, 2001). Long vowels in Thai have similar acoustic properties to their short counterparts but are sustained for a longer period of time (Svastikula, 1986). This temporal acoustic difference is considered to be the primary perceptual cue for native listeners (Abramson, 1962; Abramson & Ren, 1990).
Moreover, natural speech contains a high degree of variability with respect to rate of speech, which can have a significant effect on temporal contrasts such as vowel duration. Speaking rate variation may lengthen or shorten absolute vowel durations (e.g., Abramson, 2001; Hirata, 2004a; Pind, 1996), and such changes in vowel duration as a function of speaking rate have been found to not be proportional across rates. For example, Svastikula (1986) reported that an increase in speaking rate reduced the absolute duration of Thai long vowels more than short vowels, whereas a slower speaking rate lengthened short vowels more than long vowels. Thus, the absolute duration of phonemically long and short vowels across speaking rates may overlap considerably; for example, the duration of a fast rate-long vowel may be shorter than that of a slow rate-short vowel (Abramson, 2001; Hirata, 2004a; Tajima, Kato, Rothwell, Akahane-Yamada, & Munhall, 2008).
Native listeners of quantity languages have demonstrated an ability to efficiently shift their vowel length category boundary as a function of speaking rate (e.g., Tajima et al., 2008). The fact that native listeners are capable of normalizing rate-induced durational differences indicates that they possess internal phonemic length categories. Given the large variations in absolute vowel duration across speaking rates, native listeners must utilize a measure other than absolute duration to distinguish members of different length categories spoken at different rates of speech. Previous research has pointed to the existence of an invariant, relational acoustic cue for phonemic vowel length that facilitates perception, in that a long-to-short vowel ratio or a vowel-to-rhyme or -word ratio appears to remain largely constant across rates (Hirata, 2004a; Magen & Blumstein, 1993; Pind, 1996; Smiljanic & Bradlow, 2008). Studies have found that native listeners were able to normalize for rate by using relative duration (comparing against overall sentence rate or the adjacent segments) as a major perceptual cue to differentiate vowel lengths across speaking rates. This relational stability of vowel length and its surrounding context has been attested in Thai (Abramson, 1962), Japanese (Hirata, 2004a), Korean (Magen & Blumstein, 1993) and Icelandic (Pind, 1996). Other models of speech perception, such as exemplar-based or episodic models (e.g., Goldinger, 1996) have posited that native listeners accommodate speaking rate variation as a result of retaining minute phonetic details of their speech experiences. These details are posited to be encoded in long-term representations and later drawn upon to aid in normalizing across speaking rates (Nagao & de Jong, 2007).
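The relational-cue account described above can be illustrated with a minimal sketch. It assumes a listener who applies a single category boundary, either to absolute vowel duration or to the vowel-to-word ratio. All durations and the 0.37 ratio boundary are invented placeholders chosen only so that a fast-rate long vowel is shorter than a slow-rate short vowel; they are not measurements from any cited study, and the function names are hypothetical.

```python
# Hypothetical vowel and word durations in ms. These are invented
# placeholders for illustration, NOT measurements from any cited study.
TOKENS = [
    {"rate": "slow", "length": "short", "vowel_ms": 150, "word_ms": 500},
    {"rate": "slow", "length": "long", "vowel_ms": 300, "word_ms": 650},
    {"rate": "fast", "length": "short", "vowel_ms": 70, "word_ms": 230},
    {"rate": "fast", "length": "long", "vowel_ms": 120, "word_ms": 280},
]


def classify_absolute(token, boundary_ms):
    """Label a vowel 'long' if its raw duration exceeds a fixed boundary."""
    return "long" if token["vowel_ms"] > boundary_ms else "short"


def classify_relative(token, boundary_ratio):
    """Label a vowel 'long' if its vowel-to-word ratio exceeds a boundary."""
    ratio = token["vowel_ms"] / token["word_ms"]
    return "long" if ratio > boundary_ratio else "short"


def accuracy(classify, boundary):
    """Proportion of tokens whose predicted label matches the true length."""
    correct = sum(classify(t, boundary) == t["length"] for t in TOKENS)
    return correct / len(TOKENS)


# Best accuracy that any single absolute-duration boundary (0-400 ms) reaches.
best_absolute = max(accuracy(classify_absolute, b) for b in range(0, 401))

# One ratio boundary (a hypothetical value) applied at both speaking rates.
relative_accuracy = accuracy(classify_relative, 0.37)
```

With these placeholder values, no fixed absolute boundary separates the two categories at both rates, because the fast-rate long vowel is shorter than the slow-rate short vowel, whereas a single ratio boundary does. This is only a toy demonstration of the logic, not a model of listener behavior.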
1.2 Vowel duration and linguistic experience
While native listeners have demonstrated their ability to perceive length contrasts in a variety of speaking conditions, non-native listeners with L1s that do not contain phonemic length contrasts have been found to have difficulty with length distinctions, even at normal rates of speech (McAllister, Flege, & Piske, 2002; Tsukada, 2011). Further research reveals that even non-native listeners with L1 experience with phonemic length (e.g., Arabic and Japanese) but with no L2 experience may not always have an advantage over listeners (e.g., English) naïve to length distinctions in both native and target languages (Tsukada, 2011). Studies have also shown that listeners tend to perceive non-native quantity distinctions more continuously, rather than categorically like native listeners (Hayes-Harb, 2005; Ylinen, Shestakova, Alku, & Huotilainen, 2005). Neurophysiological data have provided consistent evidence for the existence of language-specific durational categories, whereby brain responses (as indexed by mismatch negativity) were found to be more robust for native listeners as compared to non-native listeners when perceiving stimuli differing in vowel length (Japanese: Hisagi, Shafer, Strange, & Sussman, 2010; Finnish: Nenonen, Shestakova, Huotilainen, & Näätänen, 2003). Similarly, functional magnetic resonance imaging work comparing Thai and non-native listeners (Chinese and English) in their perception of Thai vowel length and homologous non-speech duration stimuli revealed that, while all groups demonstrated similar activation patterns when listening to non-speech, specific language-related brain regions were activated only for the Thai group listening to native Thai speech contrasts (Gandour et al., 2002a; Gandour et al., 2002b). Together, these findings indicate language-specific perception and processing of phonemic length categories.
However, the difficulty in perceiving non-native length contrasts has been found to be ameliorated with experience and training in the non-native language (Hirata, Whitehurst, & Cullings, 2007; Hirata, 2004b; McAllister et al., 2002). McAllister et al. (2002) reported that for L2 learners of Swedish, those who made use of phonemic vowel durational contrasts in their L1 (e.g., Estonian) were significantly better at distinguishing Swedish vowel length distinctions as compared to the L2 learners whose L1s did not contain vowel quantity distinctions (e.g., English and Spanish). These findings indicate that although L1 phonemic length experience may not immediately facilitate perception of new non-native vowel length contrasts to naive listeners (as shown in Tsukada, 2011), such experience with categorical length distinctions may become advantageous as the listeners learn to form non-native phonemic length categories. Further research shows that non-native listeners not only benefit from L1 experience with length contrasts but also that intensive exposure to length contrasts in the target language enhances their learning of such distinctions. Hirata (2004b) trained native English listeners to perceive Japanese geminate and vowel length distinctions with either isolated words or in sentence contexts and tested them on both word and sentence stimuli. While pre-test scores indicated their ability to identify Japanese length distinctions was low, they did demonstrate significant improvements in their accuracy after training. Furthermore, while both types of training yielded improvements, training and testing with sentence contexts was significantly more difficult for listeners than with isolated words. These results indicate that while listeners’ perception indeed improved after training, they still had difficulty with processing and encoding contextual temporal information when perceiving and learning non-native phonemic length contrasts.
Indeed, given that non-native listeners perceiving non-native quantity distinctions have more difficulty than native listeners, additional contextual temporal variations such as speaking rate differences can compound the challenge of perceiving the target quantity contrast (Hirata, 2004b; Tajima et al., 2008). Tajima et al. (2008) tested English and Japanese listeners’ identification of Japanese vowel length distinctions produced at three speaking rates with isolated words as well as in sentence contexts. Both Japanese and English listeners were significantly affected by rate changes, in that identification accuracy was higher for slow and normal rates and lowest for the fast rate. However, English listeners had significantly lower accuracy at all speaking rates relative to native Japanese listeners. Additionally, for English listeners, accuracy rates were higher at the fast rate in the sentence context relative to the isolated word context. However, at the normal rate, accuracy rates were higher in the word context relative to the sentence context. Performance did not differ between context types at the slow rate. The authors speculated that sentence contexts might have provided additional cues to target length (e.g., the overall tempo) in the case where cues within the target word itself were relatively weak (fast speech rate). When there were sufficiently salient cues (normal speech rate), English listeners may not have relied on contextual information from the carrier sentence, and the additional demands associated with processing sentential information may have actually hindered performance.
Taken together, previous findings suggest that the lack of L1 experience with phonemic durational contrasts results in non-native listeners’ less accurate identification of language-specific durational categories as well as less automatic processing of such categories (i.e., requiring more cognitive resources and attentional focus), as compared to native listeners. Furthermore, non-native listeners are found to be less capable than native listeners at accommodating speech rate variation when perceiving non-native temporal contrasts. It has further been found that the difficulty in perceiving non-native length contrasts can be ameliorated with experience and training in the non-native language (Hirata, 2004b; Hirata et al., 2007; McAllister et al., 2002).
1.3 Musical experience
In addition to linguistic background, musical experience has been found to have a significant facilitative effect on speech perception and learning (e.g., Besson, Schön, Moreno, Santos, & Magne, 2007; Milovanov, Pietilä, Tervaniemi, & Esquef, 2010; Slevc & Miyake, 2006). This enhancement in speech processing is posited to arise from behavioral and neurophysiological consequences of musical experience. Musical experience is uniquely relevant to the perception of prosodic information (such as phonemic tonal and length contrasts), as prosody and music share the primary acoustic correlates of fundamental frequency and duration (e.g., Besson & Schön, 2001). Prolonged musical training has been found to enhance listeners’ sensitivity to these shared acoustic features as well as strengthen auditory attention and working memory (Besson, Chobert, & Marie, 2011; Strait & Kraus, 2011). For example, Brancucci, Anselmo, Martello and Tommasi (2008) reported shared left hemisphere dominance for duration perception of speech and musical stimuli. Musicians also reportedly perceive musical rhythmic patterns categorically (Desain & Honing, 2003), in much the same way that native listeners of quantity languages form categories based on temporal distinctions in speech. Additionally, music and speech perception are posited to share similar learning mechanisms, as listeners need to form meaningful, discrete categories from an auditory stream in both music and speech contexts (Milovanov, Huotilainen, Esquef, Alku, Välimäki, & Tervaniemi, 2009).
To account for the positive transfer from music to speech, Patel (2014) posited the expanded OPERA hypothesis, whereby enhancements in speech processing as a function of musical experience are attributed to the higher demands placed by musical training on the sensory and cognitive processing mechanisms used for processing features of speech and music, in conjunction with the emotionally-rewarding, highly-repetitive and attention-demanding nature of musical training.
The facilitative effect of musical training on speech perception has been demonstrated predominantly with prosodic pitch perception, whereby musically-trained listeners have consistently been found to be more accurate than non-musicians at perceiving non-native lexical tone contrasts (e.g., Alexander, Wong, & Bradlow, 2005; Cooper & Wang, 2012; Delogu et al., 2010; Delogu, Lampis, & Olivetti Belardinelli, 2006; Lee & Hung, 2008; Wong & Perrachione, 2007). For instance, Lee and Hung (2008) reported that native English musicians were significantly faster and more accurate at identifying Mandarin lexical tone contrasts than English non-musicians, suggesting that auditory acuity developed from extensive musical training can facilitate the perception of linguistic pitch stimuli. Similarly, English musicians demonstrated an enhanced ability to perceive non-native Cantonese lexical tones categorically, as they showed higher proficiency relative to non-musicians at acquiring lexical items contrasted by these tones (Cooper & Wang, 2012).
Research has also found evidence of musicians demonstrating superior performance in the perception of temporal speech information, including vowel duration (Milovanov et al., 2009), stop consonants (Chobert et al., 2012; Sadakata & Sekiyama, 2011; Zuk et al., 2013), and metrical structure (Marie, Magne, & Besson, 2011). Milovanov et al. (2009) examined the preattentive processing of speech and music duration in Finnish school-aged children by presenting them with a Finnish vowel (either 250 ms or 150 ms) or violin tones of the same durations. Children with higher musical aptitude had enhanced mismatch negativity (indicating greater sensitivity) to duration changes in the speech stimuli. Similarly, Sadakata and Sekiyama (2011) reported beneficial effects of musicianship in the perception of non-native geminate stop consonants, with native Dutch musicians significantly outperforming Dutch non-musicians at identifying and discriminating this Japanese stop contrast. In addition to findings based on isolated words and segments, musical experience has also been shown to affect the perception of temporal properties in sentential contexts. One study examining the role of musical experience in the perception of metrical structure presented target items in sentence-final position and found that musicians were superior to non-musicians at detecting metrical congruity and incongruity in these sentential contexts (Marie et al., 2011). Together, these findings point to domain-general auditory processing mechanisms that facilitate the transfer of information from music to language to aid speech perception.
1.4 The current study
The research reviewed above has suggested that L1 listeners perceive native temporal quantity distinctions categorically by mapping the speech signal to native language length categories. Non-native listeners, on the other hand, have greater difficulty with non-native length distinctions, perhaps in part due to the lack of stable phonemic quantity representations (e.g., McAllister et al., 2002; Tajima et al., 2008). Furthermore, native listeners have been found to be able to normalize across acoustic variation produced by speaking rate changes and attend to the appropriate acoustic information signalling their native durational categories (e.g., Nagao & de Jong, 2007). In contrast, non-native listeners have been observed to perceive non-native quantity distinctions more continuously rather than categorically; thus, they are more attentive to the irrelevant acoustic variations of within-category differences and speaking rate changes (Hayes-Harb, 2005; Ylinen et al., 2005). However, L2 learning and training experience has been shown to facilitate the perception of non-native length contrasts (Hirata, 2004b; Hirata et al., 2007; McAllister et al., 2002). In addition to the effects of linguistic experience, musical experience has also been demonstrated to play a significant facilitative role in the perception of non-native speech length contrasts (e.g., Sadakata & Sekiyama, 2011).
However, despite the abundance of research on music and linguistic processing, relatively little prior work has dealt with speech materials spanning temporal domains larger than single syllables or words (e.g., Marie et al., 2011). Particularly, research has not investigated the impact of musical experience on the ability to normalize speaking rate in the perception of non-native temporal contrasts that are affected by rate variations. As discussed above, normalizing for rate requires listeners to track the “tempo” of the preceding speech context in order to make judgments about the duration identity (e.g., long vs. short) of the target segment. Since musical training also involves tracking contextual temporal information, it is conceivable that such experience may be particularly relevant in perceiving speech length contrasts in different speaking rate contexts.
Such research examining how musical experience affects the ability to incorporate contextual speech information is also significant theoretically for unravelling the nature of cross-domain transfer from music to language, in terms of whether such transfer results in enhanced sensitivity to fine-grained acoustic speech information or results in a generalized ability to detect categorical speech contrasts. Musically-trained listeners’ enhanced auditory acuity could be facilitative or inhibitory when confronted with rate variability. Although their ability to detect fine-grained acoustic distinctions has been found to be beneficial in discriminating non-native durational speech contrasts with relatively minimal variation (e.g., Sadakata & Sekiyama, 2011), it is conceivable that this ability to perceive minute distinctions could make it more difficult for them to ignore and abstract over variation in perceiving categorical distinctions. However, musicians have extensive experience with normalizing for rate in musical phrases, such that they are able to recognize, for instance, quarter or half notes (proportional musical units identified by relative rather than absolute duration) in musical pieces played at different tempos. This experience might transfer to the linguistic domain and facilitate their ability to extract, for instance, vowel-to-word duration ratios (or vowel-to-larger speech context ratios), as a cue to vowel length, that remain relatively stable across speaking rates.
The present work investigated this issue by comparing how musicians and non-musicians perceive rate-varied vowel length in a non-native language to determine the extent to which musical experience with rate normalization in music may transfer to the linguistic domain to facilitate the perception of non-native vowel length categories. Specifically, the current study compared the performance of musically-trained and untrained native English listeners, along with a native Thai control group, on their perception of rate-varied phonemic vowel length distinctions in Thai. Participants completed a vowel length identification task with three different speaking rate conditions, as well as a between-category and within-category AX (same–different) discrimination task involving either vowel length or speaking rate differences, or both. The target items were embedded in a carrier sentence to provide listeners with contextual cues to speaking rate. We predicted that, if experience with rate normalization in music transfers positively to the linguistic domain, English musicians would outperform non-musicians, showing higher length identification accuracy as well as higher between-category discrimination accuracy at all speaking rates. Alternatively, if musicians’ enhanced auditory acuity, which allows them to detect fine-grained acoustic distinctions more readily, results in greater difficulty in abstracting over acoustic variability, musicians would show higher discrimination accuracy than non-musicians for within-category differences due to speaking rate variation, and no significant difference between their between- and within-category discrimination abilities.
2 Methods
2.1 Participants
Twenty-six American English listeners, with no prior knowledge of Thai or any other language with phonemic length distinctions, were included in this study. They were divided into two groups of listeners: non-musicians and musicians (n = 13 in each), recruited based on previously established criteria (e.g., Cooper & Wang, 2012; Wayland, Herrera, & Kaan, 2010; Wong & Perrachione, 2007). Specifically, non-musicians (ENM) had less than three years of musical training and no experience within the last five years (8 females; mean age = 21 years; mean musical experience = 1 year). Musicians (EM) were defined as having at least seven years of continuous musical training, which ranged from seven to 20 years of experience (9 females; mean age = 22 years; mean musical experience = 13 years), and the current ability to play an instrument. The average age of musical training onset was nine years (range = 5–11 years) and types of dominant training included trombone (n = 2), clarinet (n = 1), flute (n = 1), saxophone (n = 1), trumpet (n = 1), percussion (n = 1), euphonium (n = 1), horn (n = 1), violin (n = 1) and vocal (n = 3), with the majority of musicians receiving training on multiple instruments. A control group of native Thai speakers was also recruited (n = 9). Thai listeners (5 females; mean age = 26 years) had completed at least primary (and for most, secondary) school education in Thai. The average time they had spent in the United States was three years (range = 0.6–10 years). None of the participants reported any hearing impairments. They were paid for their participation in this study.
2.2 Stimuli
Stimulus materials included six minimal pairs of monosyllabic Thai pseudowords with consonant–vowel–consonant structure, contrasting in vowel length (e.g., /fop/ vs. /fo:p/). Words in each pair were matched for lexical tone and contained common Thai/English phonemes to establish common segmental familiarity for both groups. Each item was produced in the carrier sentence /put wa ____ ik ti/ (“I say ____ again”) at slow, normal and fast rates of speech. Rate instructions were taken from Hirata (2004a). The “fast” speech rate was described as the fastest rate possible without making any speech errors. The “normal” rate of speech was defined as a comfortable speaking speed, and the “slow rate” was described as the slowest rate possible without any obvious pauses or breaks in the sentence. The stimuli were recorded by a female native speaker of Standard Thai at a 44.1 kHz sampling rate in a sound-attenuated booth. Table 1 provides mean vowel durations across target items for long and short vowels for each speech rate. Consistent with prior studies (e.g., Abramson & Ren, 1990; Hirata, 2004a; Magen & Blumstein, 1993; Svastikula, 1986), the vowel-to-word ratio (slow: 0.35, normal: 0.35, fast: 0.34) as well as the long-to-short vowel ratio (slow: 2.1, normal: 1.8, fast: 1.6) remained relatively constant across the three rates of speech.
Table 1. Mean vowel durations for target stimuli for long and short vowels at each speaking rate.
From the six minimal pairs described above, three were used for the identification task (/thik/, /sok/, /fut/) and three different ones were used for the discrimination task (/thok/, /sik/, /fop/). “Same” and “different” AX discrimination pairs were constructed in Praat (Boersma & Weenink, 2015), such that two sentences containing target items were placed in succession, separated by 500 milliseconds. For the “same” pairs (same category and same rate), two repetitions of the same sentence containing target items of the same phonemic length category and speaking rate were used. The “different” pairs were constructed in three conditions (see Table 2 for a comprehensive list with example pairs): (1) Between Category–Same Rate; (2) Between Category–Different Rate; and (3) Within Category–Different Rate. For Condition 1 (Between Category–Same Rate), each trial contained a pair at the same speaking rate, differing only in the vowel length of the target item (e.g., slow rate /fop/ and slow rate /fo:p/). Condition 2 (Between Category–Different Rate) contained trials where target items differed in vowel length but also in speaking rate. Three length-rate patterns were created: (1) Long–fast + Short–slow; (2) Long–fast + Short–normal; and (3) Long–normal + Short–slow. This condition provided a greater challenge for listeners, as the speaking rate and vowel length pairings result in smaller absolute differences between long and short vowels than in the “Between Category–Same Rate” condition. Particularly, for the Long–fast + Short–slow pair, the absolute duration for a long vowel at a fast rate could be shorter than that for a short vowel at a slow rate (cf. Table 1). This condition enables us to determine whether listeners are attending to vowel length category differences or to differences in rate. Finally, Condition 3 (Within Category–Different Rate) contained target item pairs of the same vowel length but at different speaking rates. The same three rate patterns (fast–slow, fast–normal, and normal–slow) were included for both lengths.
Table 2. List of discrimination conditions by phonemic length category and speaking rate status, involving one “same” pair condition (within category, same rate), and three “different” pair conditions (within category, different rate; between category, same rate; and between category, different rate) along with a list of pairs presented in each condition.
All stimuli were normalized for root mean square amplitude (68 dB). All of the discrimination pairs were counterbalanced for order of presentation, such that each item in a pair would occur both in the first position and second position of the pair (e.g., Long–fast + Short–slow; Short–slow + Long–fast).
2.3 Procedure
The participants completed both identification and discrimination tasks, with stimuli played free-field over Alesis Point 7 speakers at a comfortable listening volume in a sound-attenuated booth. The experiment was programmed using E-Prime 2.0 software. The order of the identification and discrimination tasks was counterbalanced across participants. Both tasks were preceded by a brief familiarization session to acquaint participants with task procedures, including task instructions and practice trials. These sessions also served to familiarize listeners with the carrier sentence and how to identify the embedded target item. Participants were explicitly informed that vowel length was the relevant distinction in these tasks. The familiarization trials were identical in format to the main task trials (with items not used in testing), except that they provided feedback on the accuracy of participants’ responses as well as the correct answer after each trial. Each familiarization phase consisted of four trials, with all items produced at a normal rate of speech so as to limit the amount of practice the English listeners had with rate-varying vowel length distinctions.
The vowel length identification task was a two-alternative forced-choice task, where participants listened to target items presented in the carrier sentence and indicated whether they heard a long-vowel word or a short-vowel word. The screen displayed the carrier sentence in both Thai and English, along with its phonetic transcription, and a choice of two numbered pseudoword response options, each displayed both in Thai script and in a spelling of its pronunciation using English orthography. Participants had two seconds to respond by pressing a number on a computer keyboard corresponding to the numbered pseudoword response options on the screen. Stimuli from all three speaking rates were randomly presented. The task consisted of 72 randomized trials (3 syllables × 3 rates × 2 lengths × 4 repetitions), which were divided into two blocks of 36 trials each.
For the AX discrimination task, participants heard pairs of sentences containing target words and were asked to indicate whether the target word in the second sentence was the same or different from the first target word. They had two seconds to make a response. On the screen, the carrier sentence was displayed in Thai and English, along with its phonetic transcription, as well as a choice of “Same word” or “Different word”. Participants completed 36 trials in the Between Category–Same Rate condition (3 syllables x 3 rates x 2 order counterbalancing x 2 repetitions), 36 trials in the Between Category–Different Rate condition (3 syllables x 3 length-rate patterns x 2 order counterbalancing x 2 repetitions), 72 trials in the Within Category–Different Rate condition (3 syllables x 3 rate patterns x 2 lengths x 2 order counterbalancing x 2 repetitions) and 36 “Same” trials (3 syllables x 3 rates x 2 lengths x 2 repetitions) for a total of 180 trials. The task consisted of six blocks of 30 trials each. Stimuli in each block were randomized across conditions, such that listeners heard trials from all four conditions in each block.
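The trial counts reported for the two tasks follow from fully crossing the listed design factors. The short sketch below is only an arithmetic check of those figures; the variable names are illustrative and do not come from the original experiment scripts.

```python
from math import prod

# Factor levels copied from the task descriptions.
# Identification: syllables x rates x lengths x repetitions
identification = (3, 3, 2, 4)

# Discrimination conditions and their crossed factors.
discrimination = {
    # syllables x rates x orders x repetitions
    "Between Category-Same Rate": (3, 3, 2, 2),
    # syllables x length-rate patterns x orders x repetitions
    "Between Category-Different Rate": (3, 3, 2, 2),
    # syllables x rate patterns x lengths x orders x repetitions
    "Within Category-Different Rate": (3, 3, 2, 2, 2),
    # syllables x rates x lengths x repetitions
    "Same": (3, 3, 2, 2),
}

identification_trials = prod(identification)
discrimination_trials = {name: prod(levels) for name, levels in discrimination.items()}
total_discrimination = sum(discrimination_trials.values())

print(identification_trials)   # 72 trials = 2 blocks of 36
print(discrimination_trials)
print(total_discrimination)    # 180 trials = 6 blocks of 30
```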
3 Results
3.1 Identification task
The proportion of correct responses was tabulated for each length and speaking rate for each group (Table 3). A closer inspection of the responses showed a bias for English listeners to respond “short”, with overall lower accuracy rates for long vowels (61%) relative to short vowels (80%). To account for this potential response bias, the proportion of hit rates (defined as the proportion of short vowels to which participants correctly responded “short”) and of false alarms (defined as the proportion of long vowels to which participants incorrectly responded “short”) were used to compute d-prime values. The results are displayed in Figure 1.
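As an illustration of the bias correction described here, d-prime is conventionally computed as the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The sketch below applies that standard formula to the pooled English-listener accuracies quoted above. It is illustrative only: the study's d’ values were presumably computed per participant and per rate, and the function name is an assumption rather than the authors' code.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Return d' = z(hit rate) - z(false-alarm rate).

    Rates of exactly 0 or 1 have no finite z-score, so inv_cdf raises an
    error for them; a correction would be needed before calling this.
    """
    z = NormalDist().inv_cdf  # inverse of the standard normal CDF
    return z(hit_rate) - z(false_alarm_rate)


# "Short" is treated as the signal response, as in the text.
hit_rate = 0.80               # short vowels correctly labelled "short"
false_alarm_rate = 1 - 0.61   # long vowels incorrectly labelled "short"

print(round(d_prime(hit_rate, false_alarm_rate), 2))  # about 1.12
```

Because d’ separates sensitivity from the tendency to respond "short", two listeners with the same sensitivity but different response biases would receive similar scores, which percent correct alone would not guarantee.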
Table 3. Mean percent correct vowel length identification (standard error in parentheses) for each group by length and rate.

Figure 1. Mean d’ scores (+/− 1 standard error) for each speaking rate by group for the identification task.
These data were submitted to a 2-way mixed design analysis of variance (ANOVA), with Rate (fast, normal, or slow) as a repeated measure and Group (Thai, EM, or ENM) as a between-subjects factor. Significant main effects of Rate, F(2, 32) = 28.966, p < 0.0001, and Group, F(1, 32) = 52.81, p < 0.0001, were obtained, along with a significant Rate x Group interaction, F(4, 32) = 3.595, p = 0.01.
Bonferroni-adjusted pairwise comparisons indicated that EM demonstrated significantly higher d’ scores than ENM at each rate of speech (p < 0.0001); Thai listeners similarly outperformed ENM at each speaking rate (p < 0.0001). Thai listeners were also significantly better than EM at fast and normal rates (p < 0.0001); however, the difference between Thai and EM groups did not reach significance at the slow rate (p < 0.135).
All groups demonstrated sensitivity to rate variation, with performance being significantly better at the slow speaking rate relative to fast and normal rates for both EM (p < 0.0001) and ENM (p < 0.011). No significant difference between fast and normal rates was found for either English group (p > 0.253). Thai listeners’ identification accuracy rates were higher for slow and normal speaking rates as compared to the fast rate (p < 0.006) but remained consistent across slow and normal speaking rates (p > 0.05).
3.2 Discrimination task
The proportions of hit rates (proportion of “different” trials that participants correctly indicated were “different”) and of false alarms (proportion of “same” trials to which participants incorrectly indicated were “different”) were used to calculate d-prime scores for each rate pattern of each condition: (1) Between Category–Same Rate (e.g., long–slow vs. short–slow); (2) Between Category–Different Rate (e.g., long–fast vs. short–slow); and (3) Within Category–Different Rate (e.g., long–slow vs. long–normal).
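A minimal sketch of this per-condition calculation is given below. Several details are assumptions made for illustration and are not stated in the text: the counts are hypothetical, the false-alarm rate is pooled over all 36 "Same" trials, scores are computed per condition rather than per rate pattern, and a log-linear correction (adding 0.5 to each count and 1 to each trial total) is applied so that perfect or zero rates still yield finite scores.

```python
from statistics import NormalDist


def d_prime_from_counts(hits: int, n_different: int,
                        false_alarms: int, n_same: int) -> float:
    """d' from raw counts, using a log-linear correction (an assumption here)."""
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (n_different + 1)
    false_alarm_rate = (false_alarms + 0.5) / (n_same + 1)
    return z(hit_rate) - z(false_alarm_rate)


# Hypothetical counts for one listener:
# condition -> ("different" responses, number of "different" trials)
different_trials = {
    "Between Category-Same Rate": (30, 36),
    "Between Category-Different Rate": (27, 36),
    "Within Category-Different Rate": (20, 72),
}
# "Different" responses given on the 36 "Same" trials (hypothetical).
false_alarms, n_same = 5, 36

scores = {
    condition: d_prime_from_counts(hits, n, false_alarms, n_same)
    for condition, (hits, n) in different_trials.items()
}

for condition, score in scores.items():
    print(f"{condition}: d' = {score:.2f}")
```

Sharing one false-alarm rate across conditions means that differences in d’ between conditions reflect differences in hit rates only, which makes the between-category and within-category scores directly comparable for a given listener.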
For the Between Category–Same Rate condition (Figure 2), a 2-way mixed-design ANOVA was constructed with d’ scores as the dependent variable, Rate (fast, normal, or slow) as a repeated measure and Group (Thai, EM, or ENM) as a between-subjects factor. Significant main effects of Rate, F(2, 32) = 80.76, p < 0.0001, and Group, F(1, 32) = 40.772, p < 0.0001, were obtained, along with a significant Rate x Group interaction, F(4, 32) = 6.995, p < 0.0001. Bonferroni-adjusted pairwise comparisons revealed that EM were significantly better at discriminating short and long vowels relative to ENM at every speaking rate (p < 0.032). Thai controls also demonstrated significantly higher between-category discrimination accuracy than ENM at all rates (p < 0.0001). Significantly higher d’ scores were found for Thai listeners relative to EM at normal and slow speaking rates (p < 0.019), though no difference was found at the fast speaking rate (p = 0.119). Bonferroni-adjusted comparisons of Rate fixing each level of Group indicated that for all groups, between-category discrimination accuracy was significantly higher at normal and slow speaking rates as compared to the fast rate (p < 0.008) but did not differ significantly between normal and slow rates (p > 0.05).

Figure 2. Mean d’ scores (+/− 1 standard error) in discriminating long and short vowels for Between Category–Same Rate condition by rate and group.
For the Between Category–Different Rate condition (Figure 3), d’ scores were submitted to a 2-way mixed ANOVA with Length–Rate pattern (Long–fast + Short–normal, Long–fast + Short–slow, or Long–normal + Short–slow) as a repeated measure and Group (Thai, EM, or ENM) as a between-subjects factor. Significant effects of Length–Rate pattern, F(2, 32) = 8.464, p = 0.001, and Group, F(2, 32) = 7.073, p = 0.003, were obtained as well as a significant Length–Rate pattern x Group interaction, F(4, 32) = 5.321, p = 0.001. Post hoc (Bonferroni) analyses found that across Length–Rate patterns, the difference in discrimination accuracy between EM and ENM was marginally significant (p = 0.07). Thai controls significantly outperformed ENM (p = 0.003) but did not differ significantly from EM (p = 0.394).

Figure 3. Mean d’ scores (+/− 1 standard error) for Between Category–Different Rate condition of the discrimination task by length-rate pattern and group.
To examine the Length–Rate pattern x Group interaction, Bonferroni-adjusted pairwise comparisons of Group were conducted at each Length–Rate pattern. These revealed that EM had significantly higher between-category discrimination accuracy than ENM for Long–normal + Short–slow pairs (p = 0.015) but did not differ significantly from them on the other Length–Rate patterns (p > 0.241). Thai listeners were significantly more accurate than ENM at discriminating Long–normal + Short–slow (p < 0.001) and Long–fast + Short–normal pairs (p = 0.026); however, they did not differ significantly from EM (p > 0.118). It should be noted that neither EM nor Thai listeners showed significantly better discrimination than the ENM listeners in the Long–fast + Short–slow condition where the mean duration for the long vowels (115 ms) was shorter than that for the short vowels (130 ms), indicating that the EM and Thai listeners’ perception was based on category information rather than absolute durational differences. Together, the findings from the Between Category–Different Rate condition indicate that EM and Thai listeners, compared to ENM, were more accurate at discriminating short and long vowels despite the challenge of having to adjust for speaking rate within a single discrimination pair, demonstrating EM as well as Thai listeners’ sensitivity to between-category differences.
Listeners’ within-category discrimination abilities were examined in the Within Category–Different Rate condition, using a 3-way mixed ANOVA with Rate pattern (Fast–normal, Fast–slow, or Normal–slow) and Length (long, short) as repeated measures and Group (Thai, EM, or ENM) as a between-subjects factor. Significant main effects were obtained for Rate pattern, F(2, 32) = 7.194, p = 0.002, Length, F(1, 32) = 32.027, p < 0.001, and Group, F(2, 32) = 5.407, p = 0.009.
The ANOVA also revealed a significant Length x Group interaction, F(2, 32) = 5.199, p = 0.011. Bonferroni-adjusted pairwise comparisons showed that, across rate patterns, Thai listeners were significantly worse than both English groups at within-category discrimination of short vowels (p < 0.004; Figure 4), while all groups performed similarly for the discrimination of long vowels (p > 0.372). A significant Rate pattern x Length interaction was also found, F(2, 32) = 6.907, p = 0.002, with pairwise (Bonferroni) comparisons indicating that across groups, for long vowel pairs, discrimination scores were higher for the fast–slow rate pattern relative to the other rate patterns (p < 0.002), whereas performance was comparable for short vowel pairs at different rate patterns (p > 0.05). The remaining Rate pattern x Group and Rate pattern x Length x Group interactions did not reach significance (p > 0.08).

Figure 4. Mean d’ scores (+/− 1 standard error) for the Within Category–Different Rate condition of the discrimination of short vowels by rate pattern and group.
Finally, to compare overall discrimination performance by condition and group (Figure 5), a 2-way mixed ANOVA was constructed containing Category–Rate condition (Between Category–Same Rate, Between Category–Different Rate, Within Category–Different Rate) as a repeated measure and Group (Thai, EM, and ENM) as a between-subjects factor. Significant main effects of Category–Rate condition, F(2, 32) = 45.799, p < 0.001, and Group, F(2, 32) = 9.059, p = 0.001, were found, along with a significant Category–Rate condition x Group interaction, F(4, 32) = 26.950, p < 0.001. Pairwise (Bonferroni) comparisons of Category–Rate condition fixing each Group were performed. EM demonstrated significantly higher d’ scores in both Between Category conditions relative to the Within Category condition (p < 0.005), with performance in Between Category–Same Rate and Between Category–Different Rate not differing significantly from each other (p = 0.822). Thai controls displayed a similar pattern of discrimination performance, with higher discrimination accuracy in the Between Category conditions than the Within Category condition (p < 0.001) and no difference between the two Between Category conditions (p = 0.748). ENM, on the other hand, showed significantly lower scores in the Between Category–Same Rate condition relative to the other conditions (p < 0.013) and no significant difference between the Between Category–Different Rate and the Within Category–Different Rate conditions (p > 0.05).

Figure 5. Mean d’ scores (+/− 1 standard error) for each Category–Rate condition and group.
Pairwise (Bonferroni) comparisons of Group by Category–Rate condition found that in the Between Category–Same Rate condition, Thai controls had higher overall discrimination accuracy than both English groups (p < 0.004), and the EM group in turn had higher discrimination accuracy than ENM (p < 0.001). For the Between Category–Different Rate condition, the Thai group performed significantly better than ENM (p = 0.003), and EM performed marginally better than ENM (p = 0.074), while the Thai and EM listeners performed similarly (p = 0.394). Finally, for the Within Category–Different Rate condition, Thai listeners demonstrated significantly worse within-category discrimination than both English groups (p < 0.042), who did not differ significantly from each other (p > 0.05).
Taken together, these discrimination results indicate that the native Thai listeners were more sensitive to between-category vowel length distinctions than English listeners, evidenced by their superior identification and between-category (Same and Different Rate conditions) discrimination performance. In contrast, English non-musicians were more sensitive to within-category differences due to speaking rate variations, with significantly higher d’ scores than the Thai listeners in the Within Category–Different Rate discrimination condition. The English musicians’ performance was intermediate between that of the Thai listeners and that of the English non-musicians, showing enhanced sensitivity to categorical distinctions, as illustrated by their significantly higher identification and between-category (Same and Different Rate conditions) discrimination performance relative to the English non-musicians, while still maintaining within-category sensitivity, outperforming Thai listeners in the Within Category–Different Rate condition.
4 Discussion and conclusions
The present study examined whether musical experience would facilitate native English listeners’ ability to normalize for speaking rate variation in the perception of Thai vowel length distinctions. The results of the identification task revealed that all groups, including the native Thai controls, were affected by speaking rate, such that they displayed poorer identification accuracy at fast relative to slower speaking rates. This is in line with prior research on the influence of speaking rate on phonemic length perception, whereby a fast speaking rate was found to yield significantly lower identification accuracy scores than slower rates, even for native speakers of the quantity language being tested (Tajima et al., 2008). A similar pattern of rate sensitivity was found in the discrimination task across groups, with poorer between-category discrimination accuracy at fast as compared to slower speaking rates.
As predicted, the English non-musicians were significantly less accurate overall in the identification task and at between-category discriminations than native Thai listeners, which is in line with previous studies demonstrating that non-native listeners whose L1s do not have phonemic temporal distinctions have difficulty perceiving phonemic vowel length in an L2 (Hisagi et al., 2010; McAllister et al., 2002; Tajima et al., 2008). Moreover, while the English non-musicians showed poor between-category discrimination when the pairs of vowels were at the same rate of speech (Between Category–Same Rate), they showed improved discrimination when the vowel pairs were at different rates of speech (Between Category–Different Rate), with performance in the latter condition being on par with the Within Category–Different Rate condition. The fact that the English non-musicians’ discrimination performance was similar in the two conditions involving pairs of vowels at different speaking rates suggests that they were responding to rate differences rather than to phonemic length differences. It appears that non-musicians were latching onto the non-categorical cue, focusing on speaking rate differences bearing within-category temporal acoustic variations rather than categorical vowel length differences. These results are in line with the findings in non-native tone perception where, compared to native listeners, non-natives are more sensitive to within-category F0 differences, but less accurate in discriminating Chinese tones or classifying tonal exemplars into categories (Hallé, Chang, & Best, 2004; Peng, Zheng, Gong, Yang, & Kong, 2010; Xu, Gandour, & Francis, 2006). Indeed, Thai listeners were significantly worse than English listeners at discriminating within-category differences, though only for the short vowel pairs. 
Within-category differences as a function of speaking rate tend to be smaller for short relative to long vowels (Hirata, 2004a), and, in the present work, the average within-category difference for short vowels across rate patterns was 39 ms versus 100 ms for long vowels. Greater durational variation for the long vowels may have resulted in pairs containing items with sufficiently large differences that Thai listeners would be more likely to false alarm and perceive them as members of distinct length categories. Overall, these findings demonstrate that non-native listeners relative to native listeners are less capable of classifying exemplars bearing quantitative acoustic differences into categories.
Our prediction of the facilitative effects of musical experience in vowel perception was borne out: the English musicians demonstrated superior performance relative to the non-musicians at identifying non-native long and short vowels at all speech rates. Similarly, musicians were significantly more accurate than non-musicians at between-category vowel length discriminations, both at same and different speaking rates. Moreover, in contrast to the non-musicians, English musicians patterned similarly to the native Thai listeners, showing greater accuracy at distinguishing Thai vowel length categories relative to within-category differences due to speaking rate variation. The present results are consistent with previous research demonstrating that musical experience can play a significant facilitative role in the perception of non-native speech categories (e.g., Lee & Hung, 2008; Milovanov et al., 2009; Sadakata & Sekiyama, 2011). While the pattern of results was similar between English musicians and Thai listeners, the Thai group’s native language experience with these contrasts did result in enhanced accuracy rates in most conditions; however, group differences were neutralized in certain conditions. For example, no significant differences were found at the slow rate in the identification task between Thai and English musician groups, which may have resulted from the condition being sufficiently easy for these listeners, with the slow rate providing robust contextual and word-intrinsic cues for both musician and Thai listeners to have reached a performance ceiling. Moreover, the discrimination task saw a neutralization of Thai and English musician group differences at between-category discrimination at the fast speaking rate.
This may have stemmed from Thai listeners performing as poorly as English musicians as a result of relatively small acoustic differences between short and long vowels in this condition (M = 43 ms, a smaller difference even than certain within-category conditions such as short vowels at fast and slow rates, M = 58 ms). This may reflect the existence of absolute duration ranges in the native Thai listeners’ internal system (that is, permissible minimum and maximum durations for short and long vowels). Indeed, Abramson and Ren (1990) noted that sufficiently compressing a long vowel would at some point result in listeners hearing its short counterpart, perhaps as a product of the long vowel being outside of the permissible long vowel category range. Long vowels at a fast speaking rate in the present work may have been compressed to such a degree that it became difficult for native listeners to discriminate them from short vowels.
It is also important to note that, unlike the Thai group, English musicians did maintain a degree of within-category sensitivity, as they outperformed Thai listeners at discriminating within-category differences as a result of speaking rate variation, indicating that their perception of this non-native temporal contrast was not yet completely native-like. As the stimulus pairs were from the same phonemic length category, native listeners normalized for rate differences (i.e., demonstrated less sensitivity to within-category differences). In contrast, English musicians showed enhanced perception of between-category differences (relative to non-musicians) and within-category differences (relative to Thai listeners), reflecting influences of both musical and linguistic experience.
The present findings indicate that the musician group demonstrated a formation of non-native length categories that were relatively robust and could withstand speaking rate variability. They were capable of tracking the speech rate of the carrier sentence and accounting for that rate, while abstracting over the acoustic variation, when considering the vowel length of the target item. Previous research has established that native listeners possess internal phonemic categories that are normalized for extrinsic acoustic variations, such as speaking rate or style, suggesting relational invariance for duration to maintain phonological length contrasts (Boucher, 2002; Hirata, 2004a; Smiljanic & Bradlow, 2008). That the musicians showed near native-like patterns of speaking rate normalization points to the enhancement of domain-general auditory abilities as a result of musical experience. Specifically, musicianship involves tracking the temporal context and normalizing for changes in musical tempo. Such experience with extracting sound units and tracking regularities within a complex auditory environment appeared to enhance their ability to acquire regularities in a speech environment in terms of perceiving phoneme-intrinsic cues to speech sound categories while normalizing for phoneme-extrinsic acoustic variations. Indeed, previous research has shown that musical experience may facilitate the perception of more linguistically-relevant acoustic dimensions of speech sounds. For example, musicians relative to non-musicians were found to be able to better track the F0 contour information that distinguishes phonemic tonal categories (Lee, Lekich, & Zhang, 2014). The current results are in line with these previous findings, revealing that general auditory processing mechanisms such as experience with spectral and temporal categories from musical training may positively transfer to aid perception of speech categories in a non-native language.
The present work provides insight into the nature of music-to-speech transfer in the temporal domain, in that musical experience influences non-native phonemic length perception at the linguistic categorical level rather than the physical quantitative level. While the musicians did outperform non-musicians in detecting phonemic vowel length contrasts, they did not show any advantage over non-musicians in discriminating within-category temporal differences due to speaking rate variations. Indeed, musicians have extensive experience normalizing for changes in musical tempo by extracting the length identities of a sequence of notes in such a way that these length identities remain constant across different tempos. This requires musicians to overlook non-categorical, quantitative variations despite the fact that their musical training also enhances their sensitivity to subtle acoustic differences. These mechanisms involved in musical perception appear to mediate musicians’ perception of length contrasts in a non-native language. As they establish phonemic length categories, they become less sensitive to subtle within-category acoustic durational differences. These patterns implicate common perceptual manifestations for music and linguistic categorical functions, suggesting that long-term exposure to contextual temporal variations may shape fundamental sensory circuitry in a domain-general manner.
Acknowledgements
We would like to thank Akkaporn Cooper, Janpanit Surasin, Ann Bradlow, and members of the NU Speech Communication Research Group and the SFU Language and Brain Laboratory for their assistance, support and feedback on this project. Portions of this research were presented at the 18th International Congress of Phonetic Sciences in Glasgow, 14 August 2015, and at the Society for Music Perception and Cognition Conference in Nashville, 1–5 August 2015.
Funding
We thank the Bienen School of Music, Program in Music Theory/Cognition for both financial and intellectual support.
