Abstract
This article reports a study that aimed to find out whether F0 patterns of L2 English produced by Vietnamese speakers differ from those of native English speakers, whether the non-native F0 patterns are transferred from Vietnamese, and to what extent English and Vietnamese F0 profiles differ. Ten native/L1 Australian English speakers, 20 Vietnamese speakers of English (10 beginners and 10 advanced speakers) and a control group of four native/L1 Vietnamese speakers were included. The F0 profiles (F0 maximum, F0 minimum, F0 range, F0 mean and F0 standard deviation at three levels: utterance, syllable and phoneme) were obtained from a set of 10 English sentences and 20 Vietnamese utterances. The results showed that F0 patterns of beginning-level L2 English are systematically different from those of native English speakers, a difference that may be attributed to transfer from the learners’ native tone language. Nevertheless, the advanced speakers’ ability to produce native-like F0 patterns indicates the effect of language learning experience on prosodic acquisition. The data and results of this study contribute to the understanding of the process and nature of second language acquisition.
I Introduction
Several studies over the past decades have begun to establish that speakers of different languages may use characteristically different ranges and typical values of fundamental frequency (Dolson, 1994; Eady, 1982). In particular, it has been found that speakers of tone languages, such as Chinese, display systematically different fundamental frequency (F0) patterns from speakers of non-tone languages such as English (Eady, 1982). In addition, the learned patterns of articulation of prosody have been found to be transferred from the native language (L1) to the second or foreign language (L2) (Flege and Davidian, 1985; Mennen and de Leeuw, 2014; Rosenberg and Hirschberg, 2010; Visceglia et al., 2010). This study aims to find out (1) whether F0 patterns of L2 English produced by Vietnamese speakers are different from those of English, (2) whether the non-native prosodic patterns are transferred from Vietnamese, and (3) to what extent English and Vietnamese F0 profiles differ.
1 Previous studies on differences in F0 patterns between languages
Not much research has been devoted to the comparison of pitch patterns between languages, particularly tone languages and non-tone languages, because of the complexities of pitch calculation. Among the earliest experimental investigations of language differences in F0 patterns were two studies from the 1960s by Hanley et al. (1966) and Hanley and Snidecor (1967). These two studies compared the medians and standard deviations of the F0 (in semitones) in readings of ‘The Rainbow Passage’ (from Fairbanks 1960) by male native speakers of English, Spanish or Japanese and female native speakers of English, Spanish, Japanese or Tagalog. Results were mixed, with the only clear result showing that the English males had the lowest median F0. However, these English values were unusually low in comparison to other studies of English, as summarized in Baken and Orlikoff (2000). Other studies have compared F0 of Japanese versus English (Loveday, 1981; Ohara, 1992; Todaka, 1993; Yamazawa and Hollien, 1992), British English versus German (Mennen et al., 2012), Polish versus English (Majewski et al., 1972) and Mandarin versus Min (Taiwanese) (Chen, 2005). Chen (1972) compared the mean, standard deviation, and range of F0 in words and sentences read by four English and four Mandarin speakers (two males and two females each, i.e. a very small sample). The Mandarin speakers, especially the women, had wider F0 ranges and larger standard deviations; the Mandarin women’s means were lower, while the men’s were the same as the English. In contrast, Eady (1982) compared several measures of F0 from passages read by male Taiwan Mandarin and English speakers. The mean F0 and measures of F0 fluctuation (dynamic movement) were all greater in Mandarin, but the standard deviation (taken as the measure of F0 range) was the same in the two languages. Nevertheless, the comparison was based on F0 values calculated in Hz rather than a normalized unit (such as semitone).
Xue et al. (2002) compared the F0 mean, standard deviation, minimum, maximum and range of younger and older bilingual speakers (both males and females, but analyzed together), and found that while the older bilinguals had no differences between their Mandarin and English, the younger bilinguals had lower minimum F0 and a larger F0 range in their Mandarin (with no differences in maximum F0, mean F0, or standard deviation). Mang (2001) compared the longitudinal means for speaking and singing for eight pre-school girls who were either monolingual English or bilingual (English–Mandarin or English–Cantonese). The F0 values decreased over time for both language groups, but were lower in Chinese than in English from ages two to five. However, when the girls were five to six years old, the English-speaking F0 dropped below the Chinese. In a recent study, Keating and Kuo (2012) compared the speaking fundamental frequency in English and Mandarin, but no comparison between native English and L2 English was made. They found that the two languages’ F0 profiles sometimes differed, but these differences depended on the particular speech samples being compared. Most notably, the physiological F0 ranges of the speakers, determined from tone sweeps, hardly differed between the two languages, indicating that the English and Mandarin speakers’ voices are comparable. Their use of F0 in single-word utterances was, however, quite different, with the Mandarin speakers having higher maximums and means, and larger ranges, even when only the Mandarin high-falling tone was compared with English. In contrast, for a prose passage, the two languages were more similar, differing only in the mean F0 with Mandarin again being higher. Hirst and Ding (2015) employed 18 metrics for the comparison and found that Chinese English was intermediate between English and Mandarin Chinese. However, these metrics were extracted only from the acoustic signal, without reference to any phonetic information.
In a recent study by Ding et al. (2016), a comparison was made among the fundamental frequency (F0) patterns of continuous speech in English, Mandarin Chinese and L2 English produced by Chinese speakers. Ten adult native Chinese speakers were asked to read narrative text written in both English and Chinese. The comparative analysis was performed in the following aspects: F0 mean, pitch range, pitch change rate and pitch change amount. It was found that in terms of both pitch range at the phoneme level and pitch change amount at the utterance level, L2 English speech by Chinese subjects displayed a significantly larger value than the English speech by native speakers. Moreover, the same Chinese subjects demonstrated a larger value in these two pitch-related variables in their Chinese speech. The authors concluded that the dynamic characteristic of L2 English can be attributed to the negative transfer of L1 Chinese.
In summary, there is some evidence that F0 range is greater in Mandarin than in English, but results about mean F0 are mixed. In addition, the hypothesis that tone languages as such have an overall larger F0 range is not supported by Eady (1982) who found no difference between English and Mandarin F0 standard deviations (the measure of range in that study). The hypothesis that tone languages have an overall higher average F0 likewise receives only limited support from the studies, as reviewed above. Eady (1982) and (for one age group) Mang (2001) found that the mean F0 was higher in Mandarin than in English, but other studies comparing Mandarin and English have indicated the opposite.
2 F0 patterns of Vietnamese and English
Vietnamese is a tone language. In Northern Vietnamese, the tone system consists of six tones: level, falling, curve, broken, rising and dropping (Brunelle, 2009a, 2009b; Đoàn, 1999; Vũ, 1981). In Southern Vietnamese, however, the curve and broken tones have merged and are pronounced similarly. The Southern Vietnamese tone system therefore has just five tones: level, falling, broken-curve, rising and dropping (Brunelle, 2009a, 2009b). According to Vũ (1981) and Phạm (2003), the acoustic and perceptual correlates of Vietnamese tones include the direction of pitch (F0) movement, pitch height and voice quality, which play a more important role than other tonal dimensions, such as duration and intensity, in the identification of Vietnamese tones. According to Brunelle (2009a, 2009b, 2017), Kirby (2010) and Vũ (1981), the Southern Vietnamese tone system is purely pitch-based, while Northern Vietnamese tones are cued by a combination of pitch and voice quality. Southern Vietnamese speakers may mimic Northern Vietnamese voice quality distinctions in certain situations, but colloquial Southern Vietnamese speech does not make contrastive use of voice quality.
Vietnamese and English are two cases of tone vs. non-tone languages: Vietnamese is a typical tone language, whereas English is a typical non-tone language. Vietnamese is a language in which each syllable has its own underlying lexical tone. It has no tone reduction rules, which means that even on the surface, a Vietnamese sentence is heavily specified for tones. That is, the tones occur on all syllables in a sentence. The pitch contour is an interaction of syllable tones and the sentence intonation, while the F0 pattern of English speech is determined by the placement of primary stress on a few of the syllables in a sentence. Furthermore, English has a system of culminative word stress, but Vietnamese, a tone language, has no system of word stress; rather, it has a system of lexically distinctive tones (Nguyễn and Ingram, 2007a, 2007b). Brunelle (2017) found that there is little evidence for word stress in Southern Vietnamese. Stress is different from tone in several ways. First, stress is culminative (head-marking); that is, in stress languages, with few exceptions, every (content) word has at least one stressed syllable. Second, because a prominence hierarchy may occur among multiple stresses (e.g. primary vs. secondary stresses in English), stress is hierarchical. Third, stress can mark edges or boundaries in some systems, for example: some languages prefer iambic feet (stress on the final syllable), but others prefer trochaic ones (stress on the initial syllable). Fourth, stress is rhythmic in systems where stressed and unstressed syllables alternate and where clashes (adjacent stresses) are avoided. Fifth, stress contrasts tend to be enhanced segmentally: Stressed syllables may be lengthened by vowel lengthening or by gemination, and unstressed syllables may be weakened by vowel reduction (Kager, 1996).
In brief, Vietnamese, as a tone language, has no system of culminative word stress but a system of six lexical tones in which pitch is used to contrast individual lexical items or words. Even though Vietnamese also has vowel/syllable duration asymmetry (Brunelle et al., 2015; Nguyễn, 2014, 2015; Nguyễn and Ingram, 2007a, 2007b; Phạm, 2008), the extent of vowel and consonantal duration contrast is smaller in Vietnamese than in English. As a result, English and Vietnamese differ in terms of how they manipulate the acoustic correlates (e.g. F0) at word-level and utterance-level prosody.
As shown in Figure 1, with the help of ProZed, which is a tool designed by Hirst (2012), the difference of F0 patterns between English, L2 English and Vietnamese can be visualized. In Figure 1 the pitch contour is demonstrated in a continuous dotted line, with each circle corresponding to one syllable. The vertical level of the circle represents the pitch, and the diameter represents the duration of the syllable. The unit of pitch has already been normalized to the logarithmic scale log2(Hz/median). The scale is OMe (Octave Median), which was proposed by De Looze and Hirst (2014). As shown in Figure 1, while more fluctuations in the pitch contour with a few stressed syllables (circles with larger diameters) can be observed in the speech of a native English speaker and an advanced speaker of English, the speech of a beginning speaker of English shows a declarative intonation contour, with comparable diameters of the circles as that of the control native Vietnamese speakers. How to describe the differences of pitch change pattern between languages and the transfer of the F0 pattern of L1 to that of L2 speech has been a fascinating task for many prosody experts (Ding et al., 2016). The pitch changes based on syllables may provide more information for rhythmic differences in perception. It is suggested that the pitch patterns of an adult’s L2 can be characterized by the acquired pattern of F0 in L1 (Gut et al., 2007). Therefore, this study aims to compare the F0 patterns based on phonetic information at three levels (utterances, syllables, and phonemes) among English, Vietnamese and L2 English and to determine whether F0 patterns in the interlanguage (L2 English) are indeed systematically different from those of the target language (English), and whether they resemble the characteristics of the source language (Vietnamese).

Figure 1. Prosody display of a native Australian English speaker (first panel), an advanced Vietnamese speaker of English (second panel), a beginning Vietnamese speaker of English (third panel) and a native speaker of Vietnamese (fourth panel).
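The octave-median (OMe) normalization used for the prosody display, log2(Hz/median), can be illustrated with a minimal sketch. The function name `to_octave_median` and the convention that non-positive values mark unvoiced frames are assumptions made for illustration only; this is not the ProZed implementation.

```python
import math
from statistics import median


def to_octave_median(f0_hz):
    """Express F0 samples (Hz) in octaves relative to the speaker's median F0.

    Assumption: non-positive values mark unvoiced frames and are skipped.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        raise ValueError("no voiced F0 samples to normalize")
    med = median(voiced)
    # log2(Hz / median): 0 is the median, +1 is one octave above, -1 one below
    return [math.log2(f / med) for f in voiced]


# Hypothetical contour: the median is 200 Hz, so 100 Hz lies one octave below it
contour_hz = [100.0, 200.0, 400.0]
print(to_octave_median(contour_hz))
```

Because each value is scaled by the speaker's own median, contours from speakers with different habitual pitch levels can be plotted on a common vertical axis.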
3 Vietnamese acquisition of English prosody
There has not been much research that has focused on Vietnamese acquisition of English prosody. Nguyễn and Ingram (2005) examined the transfer of tonal acoustic correlates in Vietnamese learners’ production of English word stress. More specifically, the study examined acoustic features that native and non-native speakers (Vietnamese learners of English) use to differentiate stressed and unstressed syllables in noun–verb pairs (e.g. as in the words record vs. record). The results indicate that Vietnamese learners of English (both beginner and advanced proficiencies) utilized F0 and intensity correlates similarly to native speakers. A major difference was the lack of vowel and syllable duration cues in the beginning learners’ production. In another study on prosodic transfer effects in the production and perception of three English stress patterns (broad-focus noun phrase, narrow-focus noun phrase and compound) at the level of word and phrase prosody by Vietnamese learners of English, Nguyễn et al. (2008) found that Vietnamese speakers had no problem in manipulating contrastive levels of F0 and intensity on accent-bearing syllables. However, they failed to realize the timing contrast between compound words and phrases and the syntagmatic contrast of accent in larger units such as polysyllabic words or phrases, as evidenced by their failure to de-accent the second element of the compound and narrow-focus patterns. Nevertheless, the advanced speakers’ ability to compress the constituents of the compounds and to de-accent the final nouns shows the effect of language learning experience on prosodic acquisition. At the connected speech level, Nguyễn and Ingram (2004) found that the transfer of many segmental, prosodic, timing and syllable structures from the Vietnamese phonological system, such as checked stop, implosive stop, vowel quality, suppression of vowel reduction and checked tones, was also evidenced in advanced Vietnamese speakers of English. 
In particular, the suppression of vowel reduction in unstressed syllables and the lengthening of many unstressed vowels/function words were projected under sustained high tones in unstressed syllables, in spite of an advanced level of English proficiency. In a recent study, Nguyễn (2018) investigated the development of speech rhythm in L2 English by Vietnamese learners using the same 10 native/L1 Australian English speakers, 20 Vietnamese speakers of English (10 beginners and 10 advanced speakers) and control group of four native/L1 Vietnamese speakers as this study. The rhythm metrics were obtained from the same set of 10 English sentences and 20 Vietnamese utterances as this study. The statistical results showed that the variation in duration of vowel and consonant intervals in the L2 is higher at the advanced proficiency level and lower in beginners’ speech. This, on the one hand, indicates that beginners may transfer their L1 rhythmic pattern into English; on the other hand, it suggests that rhythm in L2 English develops from more syllable-timed towards more stress-timed patterns as acquisition progresses, which is consistent with Ordin and Polyanskaya (2015). Languages have traditionally been classified as stress-timed (e.g. German, English, Russian, in which intervals between stressed syllables were thought to be of equal duration) or syllable-timed (e.g. Romance languages, in which syllables were thought to be of equal duration). Nevertheless, these terms are controversial due to unsuccessful attempts to find isochrony in any of the timing dimensions of speech rhythm as a linguistic structure, or to support the claim that languages are divided into rhythmic classes based on periodicity (Dauer, 1983; Roach, 1982; White and Mattys, 2007a, 2007b).
II Methods
1 Participants
In order to investigate the development of F0 patterns in L2 learners whose native language contrasts prosodically with the target language, four groups were recorded: 10 native Australian English speakers (five females and five males) as a control group, 20 Southern Vietnamese speakers of English (10 beginners and 10 advanced speakers) and a control group of four native Southern Vietnamese speakers.
The 10 native Australian English speakers were first-year linguistics students at the University of Queensland, aged 24–48.
The 10 advanced Vietnamese speakers of English were full-time, overseas students at the University of Queensland. Vietnamese was their native language. They were in the age group of 23–41 and ranged in residency in Australia from a period of 0.5 to 6 years (mean: 1.4). All had attained written and spoken English proficiency scores (IELTS) of 6.5 to gain admission to the University of Queensland. They all had been EFL teachers or lecturers in teacher training programs at universities in Vietnam and were doing an MA in TESOL studies. Their English proficiency could be said to be advanced or of a high level.
The 10 beginners were overseas high school students who were in the age group of 15–18 and ranged in residency in Australia from a period of 0.5 to 1 year. The beginners had all started learning English at the age of twelve (in secondary school) with the grammar translation method, which focuses on vocabulary and grammar learning. However, they were exposed to communicative English learning for some time at foreign language centers in Vietnam, and English classes in Australia before entering high school. Their English proficiency can be said to be at a low level.
The control group of native Southern Vietnamese speakers (two males and two females) spoke the Sài Gòn dialect and came from Hồ Chí Minh City. They were either visitors or newly arrived immigrants to Australia and had been in Australia for two weeks to three years. Their age ranged from 38 to 50 years. They only spoke Vietnamese at home with their family members and did not speak English as their primary language. They had a limited ability to read, speak, write or understand English, and thus can be said to have limited English proficiency. The two Southern Vietnamese speakers (one male and one female) who had been in Australia for three years were recruited because they satisfied the above criteria: they only spoke Vietnamese at home with their family members, had limited English proficiency and did not speak English as their primary language.
2 Linguistic materials
Ten English test sentences were selected from a set of 23 sentences that incorporated vocabulary items from a picture-naming pronunciation test that was originally designed to elicit segmental transfer errors of pronunciation by Vietnamese speakers of English. This was used in Nguyễn and Ingram (2004, 2016). The set of sentences were elicited via a grammatical paraphrase task. The grammatical paraphrase task required subjects to transform a sentence presented in spoken and written form (over headphones and a computer screen) into a meaning-equivalent form. The materials were presented via a spoken Language Assessment Program (http://www.languagemap.com). Subjects typed in the paraphrase in response to an initial prompt word and when satisfied with their construction, read out the sentence that they had formed. The linguistic aspects of the task were sufficiently complex to engage the subjects and to deflect their attention from the pronunciation aspects of the task. This yielded quite natural-sounding, careful but unguarded speech. After speaking their paraphrase response into a headset microphone, subjects pressed a button for access to the next item in the set, randomly selected without replacement until all items had been presented. Thanks to the same given prompt words, all the speakers, including all L2 learners, generally gave the same answers in the grammatical paraphrase task. The instruction of the paraphrase task, which showed an example of what the speakers had to do, is presented in Appendix 3.
The 20 Vietnamese sentences used for the four control Vietnamese speakers were taken from excerpts of a short story (Tôi đi học by Thanh Tịnh), which was also used in Nguyễn (2014). The 10 English test sentences and 20 Vietnamese utterances are presented in Appendices 1 and 2 respectively.
3 Segmentation and analysis
In order to measure F0 patterns of the speech corpus, the boundaries of phoneme (vowel and consonant) and syllable intervals were marked in Praat text grids (Boersma and Weenink, 2009), with one tier for each level (utterance, syllable and phoneme) and one text grid per utterance. Segmentation was based on a waveform and a wide-band spectrogram with a 5 ms window and a frequency resolution of 86 Hz, supported by interactive playback. The boundaries of vowels were marked using vowel onset and offset criteria from Peterson and Lehiste (1960); their guidelines were supplemented for difficult boundaries involving /l/, /r/, /w/, /j/, and /h/ by using perceptual cues in combination with rapid changes in formants or energy visible on the spectrogram. In addition, syllabic consonants were treated as vowels, as in Ferragne and Pellegrino (2004). Praat scripts were then used to extract F0 maximum (F0 max), F0 minimum (F0 min), F0 range (F0 max − F0 min), F0 mean, F0 standard deviation (F0 std) and duration for each level (utterance, syllable and phoneme). Speech rate (syllables/second) per utterance and F0 change rate (semitones/millisecond) per 10 ms interval were also calculated.
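The measures listed above can be illustrated with a minimal sketch, assuming the F0 samples of one token (a phoneme, syllable or utterance) are already available as a list of semitone values taken every 10 ms. The function names, the use of the population standard deviation, and the definition of change rate as the mean absolute frame-to-frame difference are assumptions for illustration; the study's own Praat scripts may differ on each point.

```python
from statistics import mean, pstdev


def f0_profile(f0_st):
    """Summary F0 measures (in semitones) for the samples of one token."""
    if not f0_st:
        raise ValueError("token has no voiced F0 samples")
    f0_max, f0_min = max(f0_st), min(f0_st)
    return {
        "max": f0_max,
        "min": f0_min,
        "range": f0_max - f0_min,
        "mean": mean(f0_st),
        "std": pstdev(f0_st),  # assumption: population standard deviation
    }


def speech_rate(n_syllables, duration_s):
    """Speech rate of an utterance in syllables per second."""
    return n_syllables / duration_s


def f0_change_rate(f0_st, frame_ms=10.0):
    """Mean absolute F0 change (st/ms) between consecutive frames.

    Assumption: samples are evenly spaced, frame_ms milliseconds apart.
    """
    steps = [abs(b - a) / frame_ms for a, b in zip(f0_st, f0_st[1:])]
    return mean(steps) if steps else 0.0


# Hypothetical syllable with four F0 samples in semitones
samples = [10.0, 12.0, 11.0, 13.0]
print(f0_profile(samples))
print(speech_rate(12, 3.0))      # 12 syllables spoken in 3 seconds
print(f0_change_rate(samples))
```

Applying `f0_profile` once per interval on each text-grid tier would yield one row of measures per phoneme, syllable or utterance.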
F0 values measured in Hertz were then converted to semitones, with a reference of 100 Hz, in order to normalize across English and Vietnamese and across male and female speakers. The conversion equation proposed by Fant et al. (2002) and used in Ding et al. (2016) was adopted in this study: st = 12 × log2(Hz/100).
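As a minimal sketch of this conversion, assuming the standard semitone definition relative to a 100 Hz reference, st = 12 × log2(Hz/100), the normalization can be written as follows. The function name `hz_to_semitones` is illustrative only.

```python
import math

REFERENCE_HZ = 100.0  # reference frequency stated in the text


def hz_to_semitones(f0_hz, ref_hz=REFERENCE_HZ):
    """Convert an F0 value in Hz to semitones relative to ref_hz."""
    if f0_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 12.0 * math.log2(f0_hz / ref_hz)


# Hypothetical male (100-150 Hz) and female (200-300 Hz) pitch spans:
# different in Hz (50 vs. 100) but identical in semitones.
male_range = hz_to_semitones(150.0) - hz_to_semitones(100.0)
female_range = hz_to_semitones(300.0) - hz_to_semitones(200.0)
print(round(male_range, 2), round(female_range, 2))
```

The example shows why a logarithmic unit helps inter-speaker comparison: equal frequency ratios map to equal semitone distances regardless of the speaker's register.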
With this normalization in frequency level, it is possible to display a qualitative derivation of the essentials of an intonation contour and facilitate inter-speaker comparisons and specifications of group average data. Except for speech rate (syllables/second) and F0 change rate (semitone/millisecond), other F0 variables discussed in this study are presented in semitone (st). All the measures used in this study are summarized in Table 1.
Table 1. F0 measures calculated.
Notes. A token here can be a phoneme, a syllable or an utterance.
4 Statistical analysis
In order to account for the effect of speakers’ differences and the intrinsic segmental and tonal effects, mixed-effects models fitted by restricted maximum likelihood (REML) were applied to the F0 values of the speech corpora. The dependent variable was an F0 value (F0 max, F0 min, F0 range, F0 mean, F0 std, speech rate or F0 change rate). The fixed effects were language/speaker group (native English speakers, beginning and advanced Vietnamese speakers of English, and native Vietnamese speakers) and gender (males versus females). The random effects were speakers (34 speakers) and utterances (380), syllables (5156) or phonemes (11863). A Tukey post-hoc test was then conducted to determine the significant differences among the levels of the main effects. The use of REML overcomes a potentially serious deficiency of ANOVA-based methods, which assume that data are sampled from a random population and are normally distributed. REML also avoids the bias of maximum likelihood estimators, which treat all fixed effects as known without error and consequently tend to bias estimates of variance components downward. Moreover, REML can handle unbalanced data (Corbeil and Searle, 1976). The data analysis was carried out using the SPSS program. The two-way mixed-effect model results are presented in Table 2.
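One way to write down a model of this kind is sketched below. The notation is an assumption introduced here for illustration (it does not appear in the study), and the exact specification used in SPSS may differ, for example in whether the item random effect is crossed with or nested within speakers.

```latex
% y_{ij}: F0 measure for speaker i on item j (utterance, syllable or phoneme)
% g(i): speaker group of speaker i;  s(i): gender of speaker i
y_{ij} = \beta_0
       + \beta^{\mathrm{group}}_{g(i)}
       + \beta^{\mathrm{gender}}_{s(i)}
       + \beta^{\mathrm{group}\times\mathrm{gender}}_{g(i)\,s(i)}
       + u_i + v_j + \varepsilon_{ij},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2),\quad
v_j \sim \mathcal{N}(0, \sigma_v^2),\quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here the β terms are the fixed effects (group, gender and their interaction), while u and v are random intercepts for speakers and items; REML estimates the variance components σ_u², σ_v² and σ² after accounting for the degrees of freedom used by the fixed effects.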
Table 2. Summary of mixed-effect model results.
Notes. EN = native English speakers, AD = advanced speakers of English, BE = beginning speakers of English, Viet = control native speaker of Vietnamese.
III Results
1 F0 patterns at utterance level
As shown in Table 2, at the utterance level there were significant effects for the main factor languages (or speaker groups) (p < 0.001) and the interaction effect of languages × gender.
First, as shown in Table 2 and Figure 2, post-hoc pairwise comparison showed that the utterance F0 maximum value of the native English and advanced speaker groups was significantly greater than that of the beginner group and the control Vietnamese speaker group across both female and male speakers (p < 0.001). There was no significant difference between the native English speakers and the advanced group (p = 0.795), nor between the beginner group and the control Vietnamese group (p = 0.97). Second, similar significant patterns among the speaker groups were found for F0 range, while there was no significant effect for F0 minimum. Third, in terms of F0 mean value, as shown in Figure 2, only the male native English speakers had a significantly higher F0 mean than the other three groups. Finally, the utterance speech rate of the native English group was significantly higher than that of the other three groups (p < 0.001), while no significant difference was found among the three Vietnamese speaker groups.

Figure 2. Mean F0 values (st) and speech rate (syllables/sec) for utterances across four speaker groups: native English speakers, advanced speakers of English, beginning speakers of English and a control group of native speakers of Vietnamese.
2 F0 patterns at syllable level
As shown in Table 2, at the syllable level there were significant effects for the main factor languages (or speaker groups) (p < 0.001), gender (p < 0.001) and the interaction effect of languages × gender.
First, post-hoc pairwise comparison (Table 2 and Figure 3) shows that the syllable F0 maximum value of the native English group was significantly greater than that of the three Vietnamese speaker groups (beginner group, advanced group and the control Vietnamese speaker group) across both female and male speakers (p < 0.001). There was no significant difference between the beginner group and the control Vietnamese group (p = 0.376). Second, the F0 minimum value of the native English group, particularly the male speakers, was significantly higher than that of the three Vietnamese groups, while there were no significant differences among the three Vietnamese groups. Third, the syllable F0 range value of the native English and advanced speaker groups was significantly greater than that of the beginner group and the control Vietnamese speaker group across both female and male speakers (p < 0.001). There was no significant difference between the native English speakers and the advanced group (p = 0.554). Neither was there a significant difference between the beginner group and the control Vietnamese group (p = 0.039). Finally, the F0 mean value of the native English group, particularly the male speakers, was significantly higher than that of the three Vietnamese groups, while there were no significant differences among the three Vietnamese groups.

Figure 3. Mean F0 values (st) of syllables across four speaker groups: native English speakers, advanced speakers of English, beginning speakers of English, and a control group of native speakers of Vietnamese.
3 F0 patterns at phoneme level
As shown in Table 2, at the phoneme level there were significant effects for the main factor languages (or speaker groups) (p < 0.001), gender (p < 0.001) and the interaction effect of languages × gender.
First, post-hoc pairwise comparison (Table 2 and Figure 4) shows that the phoneme F0 maximum value of the native English speakers, particularly the males, was significantly greater than that of the three Vietnamese speaker groups (beginner group, advanced group and the control Vietnamese speaker group) (p < 0.001), while there was no significant difference among the three Vietnamese speaker groups. Second, for F0 minimum, only the male speakers of the native English group had a significantly higher value than the other three groups. Third, the phoneme F0 range value of the native English and advanced speaker groups was significantly greater than that of the beginner group and the control Vietnamese speaker group, across both female and male speakers (p < 0.001). There was no significant difference between the native English speakers and the advanced group (p = 0.11), nor between the beginner group and the control Vietnamese group (p = 0.068). Finally, the F0 mean value of the native English group, particularly the male speakers, was significantly higher than that of the three Vietnamese groups, while there were no significant differences among the three Vietnamese groups.

Figure 4. Mean F0 values (st) of phonemes across four speaker groups: native English speakers, advanced speakers of English, beginning speakers of English, and a control group of native speakers of Vietnamese.
4 F0 change rates
As shown in Table 2 and Figure 5, for F0 change rate there was a significant effect for the main factor languages (or speaker groups) (p < 0.001), while gender (p = 0.043) and the interaction effect of languages

Figure 5. Mean values of F0 change rate (st/ms).
5 F0 standard deviation
As shown in Table 2, for F0 standard deviation there was a significant effect for the main factor languages (or speaker groups) (p < 0.001), while gender and the interaction effect of languages

Mean F0 std across three levels: utterance, syllable and phoneme.
IV Discussion and conclusions
This section summarizes and discusses the results reported above, in the context of the study’s research questions:
First, to what extent do English and Vietnamese F0 profiles differ?
Second, are F0 patterns of L2 English produced by Vietnamese speakers different from those of English, and are the non-native prosodic patterns transferred from Vietnamese?
For the first question, the results show that the F0 patterns (F0 max, F0 range, F0 mean, and F0 std) discriminate well between the two languages: English, a non-tone, stress-timed language, has greater F0 variation than Vietnamese, a tone language. This could be attributed to the fact that English has more stress-induced vowel reduction than Vietnamese, which has less vowel reduction and lower variation of vocalic and consonantal durations. Even though Vietnamese also has vowel/syllable duration asymmetry (Brunelle et al., 2015; Nguyễn and Ingram, 2007a, 2007b; Phạm, 2008), the extent of vowel and consonantal duration contrast is shown to be smaller in Vietnamese than in English. As a result of the contrastive and fluctuating distribution of stressed and unstressed syllables, together with vowel and syllable reduction, English may have greater F0 variation within an utterance than Vietnamese, in which tones occur evenly on all syllables in a sentence. This result is further illustrated in Figure 1, which shows that the control native Vietnamese speaker’s utterance had a flatter, declarative intonation contour with circles of comparable diameter. This suggests a syllable-timed rhythm, consistent with Nguyễn (2018), Sawanakunanon (2013) and Romano et al. (2011), who found Vietnamese to cluster with supposedly syllable-timed languages. This is in contrast to the English speaker’s utterance, which has more fluctuations in the pitch contour, with a few stressed syllables (circles with larger diameters) and a greater variation in duration (several unstressed syllables with much smaller circles).
In addition, the finding that Vietnamese has a narrower F0 range and less F0 variation than English is in line with Eady (1982), who found no difference between English and Mandarin F0 standard deviations (the measure of range in that study), and with Keating and Kuo (2012), who found that for a prose passage, Mandarin and English were more similar in F0 range. As a result, the hypothesis that tone languages have an overall larger F0 range is not supported. Nevertheless, this result contradicts a few previous studies (Chen, 1972; Ding et al., 2016; Xue et al., 2002). This is probably due to differences in the speech samples used. That is, whether two languages appear to have similar or different F0 profiles depends very much on the speech corpus or task, as well as on the acoustic measures. The inconsistency of results across previous studies may be partly due to such methodological differences. In this study, the utterances were elicited via a grammatical paraphrase task in which subjects typed in the paraphrase in response to an initial prompt word and, when satisfied with their construction, read out the sentence that they had formed. Thus, the linguistic aspects of the task were sufficiently complex to engage the subjects and to deflect their attention from the pronunciation aspects of the task. This yielded quite natural-sounding, careful but unguarded speech.
Furthermore, the results on F0 mean consistently show that only the male native English speakers had a significantly higher F0 mean than the other three groups across the three levels (utterance, syllable and phoneme). This is consistent with the mixed results on F0 mean in the literature (Chen, 1972; Keating and Kuo, 2012). In Chen’s (1972) study, the Mandarin women’s mean F0 was lower, while the men’s was the same as that of the English speakers. In Keating and Kuo’s (2012) study, for a prose passage, English and Mandarin differed only in mean F0, with Mandarin being higher. In other words, the findings on F0 mean are inconclusive and inconsistent across studies. Passoni et al.’s (2018) recent results suggest gender specificity within languages: the Japanese–English bilingual females displayed more pitch variation across the different formality settings than did the Japanese–English bilingual males.
For the second question, the results show that across the three levels (utterance, syllable and phoneme), the F0 maximum, F0 range and F0 standard deviation of the advanced speaker group were similar to those of the native English speaker group, and significantly higher than those of the beginner group and the control Vietnamese group, which had comparable F0 maximum, F0 range and F0 std values. In other words, F0 variation in the L2 is higher at the advanced proficiency level and lower in beginner speech. On the one hand, this indicates that beginners may transfer their L1 F0 patterns into English; on the other hand, it suggests an effect of language learning experience on the acquisition of F0 patterns, lending further support to previous findings on L2 prosody acquisition (Nguyễn and Ingram, 2005; Nguyễn et al., 2008; Trofimovich and Baker, 2007). In addition, the results on F0 max, F0 range and F0 std show that Vietnamese English was intermediate between English and Vietnamese, which is consistent with Hirst and Ding (2015).
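To make the five measures compared above concrete, the following minimal sketch computes F0 maximum, minimum, range, mean and standard deviation from a list of F0 samples for one unit (an utterance, syllable or phoneme). It is an illustration only, not the analysis procedure used in this study: the function name `f0_profile`, the use of Hz, the convention that unvoiced frames are coded as 0 or None, and the choice of the sample standard deviation are all assumptions.

```python
import statistics


def f0_profile(f0_hz):
    """Return F0 max, min, range, mean and std (Hz) for one unit.

    Assumption: unvoiced frames are coded as 0 or None and are skipped.
    """
    voiced = [f for f in f0_hz if f is not None and f > 0]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced F0 samples")
    f0_max = max(voiced)
    f0_min = min(voiced)
    return {
        "max": f0_max,
        "min": f0_min,
        "range": f0_max - f0_min,
        "mean": statistics.mean(voiced),
        # Sample standard deviation (n - 1 denominator); an assumption.
        "std": statistics.stdev(voiced),
    }
```

Applying the same function to the samples of each utterance, syllable and phoneme would give one profile per level, which could then be compared across speaker groups.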
Third, both the male and the female native English speakers had a higher articulation rate at the syllable level than the corresponding male and female speakers in the three Vietnamese groups. This mirrors Ding et al.’s (2016) results on Mandarin Chinese, lending further support to the repeatedly reported finding that L2 speakers speak more slowly than native speakers (Baese-Berk and Morrill, 2015; Ding et al., 2012; Guion et al., 2000).
Fourth, the beginner group had significantly lower F0 change rates than the other three groups (native English, advanced and control native Vietnamese). In other words, the F0 change rate is higher in native English speakers and at the advanced proficiency level, and lower in beginners’ speech. A likely reason is that the high articulation rate of the English speakers in this study also increased the pitch change rate. Furthermore, there is evidence that pitch change by speakers of a lexical tone language, such as Chinese, is not significantly faster than that produced by speakers of languages with no lexical tone (Xu and Sun, 2002).
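The link between articulation rate and F0 change rate can be illustrated with a short sketch. The definitions below are assumptions made for illustration and are not necessarily those used in this study: articulation rate is taken as syllables per second of speaking time with pauses excluded, and F0 change rate as the summed absolute F0 change between successive samples divided by the elapsed time (Hz per second). The function names are hypothetical.

```python
def articulation_rate(n_syllables, total_dur_s, pause_dur_s=0.0):
    """Syllables per second of speaking time (assumed: pauses excluded)."""
    speaking_s = total_dur_s - pause_dur_s
    if speaking_s <= 0:
        raise ValueError("speaking time must be positive")
    return n_syllables / speaking_s


def f0_change_rate(times_s, f0_hz):
    """Summed absolute F0 change divided by elapsed time, in Hz/s.

    Assumed definition for illustration; inputs are voiced samples only.
    """
    if len(times_s) != len(f0_hz) or len(f0_hz) < 2:
        raise ValueError("need two or more matching time/F0 samples")
    duration_s = times_s[-1] - times_s[0]
    if duration_s <= 0:
        raise ValueError("times must span a positive duration")
    total_change = sum(abs(b - a) for a, b in zip(f0_hz, f0_hz[1:]))
    return total_change / duration_s
```

Under these assumed definitions, the same pitch movement (for example, a 10 Hz rise followed by a 10 Hz fall) yields twice the F0 change rate when it is produced in half the time, which is one way a faster articulation rate could raise the F0 change rate.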
Fifth, Vietnamese syllable structure is different from that of English. In English syllables, consonant clusters occur in both word-initial and word-final position: clusters of up to three consonants are allowed in the initial position, and clusters of up to four consonants are allowed in the final position. Thus, it is theoretically possible for as many as seven consonants to occur in a sequence at word boundaries (Bowen, 1975). By contrast, it is generally agreed that the Vietnamese language does not have consonant clusters or blends (Seitz, 1986; Thompson, 1987). Differences in syllable structure between English and Vietnamese may therefore affect F0 measurements at the different levels. Nevertheless, in this study the same linguistic material (10 English sentences) was used to compare the three speaker groups (control L1 English speakers, beginners and advanced L2 learners of English). The only exception is the L1 control Vietnamese data, which are in narrative form. In addition, as shown in Appendix 1, most of the words in the English corpus have simple syllable structures (such as CV: they, to; CVC: box, hung; CCVC: strong, green; CVCC: bells, clouds). Studies on non-controlled speech (conversational, spontaneous or narrative speech) are necessarily constrained by the real distribution of words in the speech corpus, a constraint that also applies to the data of this study.
Sixth, the results showed significant differences in F0 values between the native English male speakers and the male speakers of the three Vietnamese groups. Specifically, first, at the utterance level, only the male native English speakers had a significantly higher F0 mean than the three Vietnamese male groups. Second, at the syllable level, the F0 minimum and F0 mean values of the native English male speakers were significantly higher than those of the three Vietnamese male groups, while there was no significant difference among the three Vietnamese groups. Third, at the phoneme level, the F0 maximum, F0 minimum and F0 mean values of the native English male speakers were significantly greater than those of the three Vietnamese male groups. In fact, as shown in Figures 2, 3 and 4, the native English male speakers’ F0 values were more or less comparable to those of the native English female speakers. What differs is the result between the native English male speakers and the three Vietnamese male groups. This result suggests that the Vietnamese male speakers have a lower voice pitch (lower F0 values) than the native English male speakers, which may be due to physiological differences between speakers of different languages. In addition, gender differences in F0 have also been found in previous studies. In Hanley et al. (1966) and Hanley and Snidecor (1967), English males had the lowest median F0. In Chen (1972), the Mandarin women’s means were lower, while the men’s were the same as those of the English speakers. Passoni et al.’s (2018) results on Japanese–English bilinguals also suggest gender specificity within languages.
In conclusion, this study conducted a systematic investigation of the F0 patterns of English, Vietnamese and L2 English produced by Vietnamese speakers, comparing several F0-related variables. The results show that F0 patterns of beginning-level L2 English produced by Vietnamese speakers are systematically different from those of native English speakers, and that these patterns may be transferred from the speakers’ native tone language. Nevertheless, the advanced speakers’ ability to produce native-like F0 patterns indicates the effect of language learning experience on prosodic acquisition. The results also show that the larger pitch range, pitch change rate and pitch variation across the three levels (utterance, syllable and phoneme) in English speakers can be attributed to the fact that English has more stress-induced vowel reduction and a stress-timed rhythm, whereas Vietnamese has less vowel reduction and a syllable-timed rhythm, and is heavily specified for tones. It may also be that these are simply language-intrinsic differences. This study thus contributes to the growing literature showing that languages can differ in their F0 profiles, particularly tone and non-tone languages. The data and results of this study also contribute to the understanding of the process and nature of second language acquisition.
Footnotes
Appendix 1
Appendix 2
Appendix 3
Acknowledgements
I would like to thank the subjects for their voluntary participation in the experiment. I also thank Dr. Esther de Leeuw and the two anonymous reviewers for their constructive comments. Part of the data from the LanguageMap project is gratefully acknowledged.
Declaration of Conflicting Interest
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
