Abstract
Word-initial stops in Mandarin and English show a distinctive phonological categorization but a similar phonetic realization along the VOT (Voice Onset Time) continuum. Previous research reported that native Mandarin adults produce measurably longer long-lag VOTs than native English adults. The present study examined whether and how the difference between Mandarin and English VOTs is manifested in monolingual children and Mandarin–English bilingual children. The participants included 15 five- to six-year-old sequential bilingual children, 24 corresponding monolingual children (15 Mandarin, 9 English), and 22 monolingual adults (12 Mandarin, 10 English). The bilingual children were divided into two groups (Bi-low and Bi-high) based on the amount of experience in English. Each participant was recorded producing 18 Mandarin words and/or 18 English words containing six stops in each language. The VOT values were measured from the beginning of stop burst to the onset of the voicing. The results showed that the language difference in VOT in the monolingual children was manifested in a pattern similar to the monolingual adults. However, Mandarin and English VOTs showed less separable distributions in the two groups of bilingual children. Further analysis suggested that both groups of bilingual children tended to separate Mandarin and English short-lag VOTs but only the Bi-low children showed different long-lag VOTs between the two languages. These results suggested that due to the bilingual effects and L1–L2 (first language – second language) interactions, even though the bilingual children tried to separate the two VOT systems, they implemented the separation in a different manner than the monolingual speakers.
I Introduction
Since the term Voice Onset Time (VOT) was proposed in the 1960s, the acoustic representation of the ‘temporal relationship between the moment of the release of the stop and the onset of glottal pulsing’ (Abramson and Whalen, 2017: 76) has been widely used as the most important measure in identifying the voicing and aspiration features of stop consonants. According to the categorization of Lisker and Abramson (1964), there are three broad ranges in which the VOT values usually reside: voicing-lead (< –75 ms), short-lag (0–25 ms), and long-lag (> 60 ms). The production of VOT is affected by multiple factors including the place of articulation, the vowel environment, and speaking rate (Abramson and Whalen, 2017; Cho and Ladefoged, 1999; Kessinger and Blumstein, 1997). Various combinations of VOT patterns are exploited to manifest the stop contrasts in the world’s languages. For example, both Thai and Korean have a three-way contrast of VOT, but they realize it differently. The former has pre-voicing, short-lag, and long-lag VOTs whereas the latter has short-lag, long-lag, and extremely long-lag VOTs accompanied by f0 distinctions. Over the past few decades, the acquisition of voicing contrast has been extensively investigated in monolingual and bilingual children from various language backgrounds (Bond and Wilson, 1980; Caramazza et al., 1973; Deuchar and Clark, 1996; Engstrand and Williams, 1996; Fabiano-Smith and Bunta, 2012; Gilbert, 1977; Hitchcock and Koenig, 2015; Kehoe et al., 2004; Kewley-Port and Preston, 1974; Lowenstein and Nittrouer, 2008; Macken and Barton, 1980a; Simon, 2010; Whiteside and Marshall, 2001; Zlatin and Koenigsknecht, 1976).
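As a minimal illustration (not part of the original analysis), the three broad VOT ranges cited above can be written as a simple labelling rule. The function name `classify_vot` and the `unclassified` label for values that fall between the cited ranges are assumptions introduced here for illustration only.

```python
def classify_vot(vot_ms: float) -> str:
    """Label a VOT value (in ms) using the three broad ranges cited in the text."""
    if vot_ms < -75:
        return "voicing-lead"
    if 0 <= vot_ms <= 25:
        return "short-lag"
    if vot_ms > 60:
        return "long-lag"
    # Values between the cited ranges are left unlabelled (an assumption of
    # this sketch) rather than being forced into a category.
    return "unclassified"


# Example: a prevoiced stop, an unaspirated stop, and an aspirated stop.
labels = [classify_vot(v) for v in (-100.0, 15.0, 82.0)]
```

Because the cited ranges are not contiguous, a sketch like this deliberately leaves the gaps (for example, 25–60 ms) unassigned.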
While the monolingual studies focused on the acquisition order of different VOT categories, the bilingual studies paid more attention to how bilingual children develop separate phonetic systems for the two languages and to what extent the bilingual children differ from corresponding monolingual speakers of each language. To date, most bilingual studies involved two languages which show a voicing vs. aspirating distinction with one language characterized by a voicing-lead vs. short-lag VOT contrast and the other language characterized by a short-lag vs. long-lag VOT contrast. The present investigation, however, focused on two languages that have a similar VOT contrast of short-lag vs. long-lag for stop consonants. In particular, we wonder how the cross-language similarity/difference between Mandarin and English stops is manifested in young bilingual Mandarin (L1) – English (L2) children.
1 Acquisition of voicing contrast in monolingual children
The acquisition of voicing contrast in English-speaking children has been well documented in previous literature (Hitchcock and Koenig, 2013; Kewley-Port and Preston, 1974; Macken and Barton, 1980a; Zlatin and Koenigsknecht, 1976). English has six stops /b, d, g, p, t, k/ which are phonologically characterized by a voiced/voiceless distinction. However, the phonetic realization of the contrast varies considerably across dialects and speakers. The production of VOTs for voiced and voiceless stops is conditioned by a number of linguistic factors including the phrase position, word position, lexical stress, and the manner and voicing of the neighboring sound (Davidson, 2016; Lisker and Abramson, 1967). In many regional dialects, English voiced stops occurring at an utterance-initial position are usually produced with short-lag VOTs, although sometimes they are produced with voicing-lead VOT values (Keating, 1984). English voiceless stops /p, t, k/ isolated in word-initial position are always produced with long-lag VOT values. Researchers found that English-speaking children generally acquire the short-lag vs. long-lag contrast by three years of age. In particular, short-lag VOTs emerge earlier and approximate adult-like patterns at an earlier time. In contrast, long-lag VOTs emerge at a later time and show a widely dispersed distribution (e.g., Gilbert, 1977; Lowenstein and Nittrouer, 2008; Preston and Yeni-Komshian, 1967). Unlike English, many other languages have a voicing-lead vs. short-lag VOT pattern for the voiced-voiceless contrast. Macken and Barton (1980b) reported that Spanish children predominantly produced short-lag VOTs during the early stage of development but had not acquired the voicing-lead until four years of age.
In addition to the common two-way contrast, some languages have a three-way contrast. For example, Thai is characterized by a three-way contrast of voiced unaspirated, voiceless unaspirated and voiceless aspirated VOT. Gandour et al. (1986) examined the development of voicing features in 3-, 5-, and 7-year-old Thai-speaking children. The authors summarized that the three-way contrast voicing feature was acquired in the order of voiceless unaspirated (zero or short-lag), voiceless aspirated (long-lag), and voiced unaspirated (long-lead). Davis (1995) conducted a cross-language comparison of the acquisition of voicing features between Hindi-speaking and English-speaking children. Hindi stops have unaspirated-aspirated contrast for both voiced and voiceless sounds. Therefore, Hindi stops can be categorized into voiced aspirated, voiced unaspirated, voiceless unaspirated and voiceless aspirated types. The author found that the age of acquisition for contrastive pairs was determined by the magnitude of acoustic differences between the pair members. Greater lag difference (post-release voice onset time) between the contrast pair in the adults’ model indicated earlier acquisition of the contrast by children.
2 Development of voicing contrast in bilingual children
Unlike monolingual children who encounter one language source, bilingual children need to deal with two languages. The formation of new phonetic categories and interaction of the two phonetic systems have long been the center of bilingual research. A number of bilingual studies documented the development of voicing contrasts in bilingual children at various ages with one language having voicing-lead vs. short-lag VOTs and the other language having short-lag vs. long-lag VOTs (e.g. Caramazza et al., 1973; Deuchar and Clark, 1996; Fabiano-Smith and Bunta, 2012; Harada, 2007; Khattab, 2002; Kehoe et al., 2004; Lee and Iverson, 2012; Simon, 2010). Convergently, these studies found that given a certain period of exposure to both languages, most children could successfully acquire distinct VOT patterns for the two languages. However, they tended to separate the two VOT systems in their own manner, not always in a monolingual-like manner. In addition, researchers found that during the process of developing the two phonetic systems, the effect of bilingualism was manifested in different patterns in bilingual children. Kehoe and colleagues (2004) proposed three different patterns: delayed acquisition of the voicing feature in one language, transfer of the voicing feature from one language to another, and no cross-language interference.
In another bilingual study, Deuchar and Clark (1996) tracked the acquisition order of voicing contrasts in a bilingual Spanish–English child who was exposed to Spanish at home from birth and to English in an outside care setting. The authors found that the child acquired the English voicing contrast earlier than the Spanish contrast. Lee and Iverson (2012) examined the separation of stop categories in 5- and 10-year-old Korean–English bilingual children. Unlike other bilingual studies that compared two languages both having two-way phonetic contrasts, this study focused on two languages with one having a three-way contrast of lenis vs. fortis vs. aspirated and the other having a two-way voiced vs. voiceless contrast. The authors found that both groups of bilingual children could make stop distinctions that were reflected in distinctive VOT patterns in both languages, but they produced longer VOTs for Korean stops and shorter VOTs for English stops in comparison to corresponding monolingual peers. In addition, the two age groups of bilingual children differed in whether they used vowel-onset f0 to make stop distinctions, and they adopted different mechanisms to develop phonetic categories in the two languages.
II Theoretical framework
To date, researchers have proposed various models to account for L2 perception and production, among which Kuhl’s Native Language Magnet (NLM) model (Kuhl, 1991), Best’s Perceptual Assimilation Model (PAM; Best, 1995), Flege’s Speech Learning Model (SLM; Flege, 1995), and Escudero’s Second-Language Linguistic Perception (L2LP; Escudero, 2005) model were developed from a phonetic/phonological perspective. NLM emphasizes that infants’ early experience results in the establishment of phonetic prototypes in their native language, which have a magnet effect that facilitates the development of perceptual mapping for native sounds but reduces the perceptual sensitivity to nonnative sounds. PAM proposes that naive listeners perceptually assimilate nonnative sounds into native sound categories with varying degrees of goodness on the basis of acoustic–phonetic similarity between the nonnative and native sounds. Recently, Best and colleagues extended PAM to accommodate perceptual learning by L2 learners (Best and Tyler, 2007). This extended model proposes that perceptual learning happens on certain L2 contrasts and that the magnitude of L2 perceptual learning is affected by the phonological and phonetic relationship between the second language (L2) and first language (L1) phonemes as well as the familiarity with the L2. Compared to the L2 phonemes that have identical or similar counterparts in L1, the L2 phonemes that show a remarkable difference from any L1 phonemes result in greater perceptual learning differences.
Unlike NLM and PAM which mainly focus on the initial state of L2 acquisition, SLM and L2LP put more emphasis on the process of L2 development and the ultimate achievement in L2 perception and production, which are more relevant to the present study. Of these two models, SLM primarily accounts for the formation and development of L2 sounds in speech production while L2LP has been developed to explain the development of L2 perception. SLM assumes that the language learning capacity remains intact over the whole life span and that the phonetic elements of L1 and L2 exist in a ‘common phonological space’ in which the two phonetic systems can interact with each other. According to this model, there are three types of outcomes regarding the formation and learnability of L2 sound categories. If an L2 sound completely overlaps with an L1 sound, L2 learners can acquire this sound with no difficulty. If an L2 sound has no overlap with an L1 sound, the L2 sound is treated as a new sound which can eventually be established with substantial experience and practice in the L2. However, L2 sounds that partially overlap with L1 sounds show phonetic–acoustic similarities to the native sounds without being identical to them, and L2 learners may encounter the greatest difficulty in acquiring and differentiating them.
Consistent with other models, L2LP posits that listeners’ native language plays an important role in the initial perceptual mapping for the L2. As a result, L2 learners usually fully copy the cognitive representation of L1 perception to form the initial state of L2 perception. In the meantime, L2 learners have full access to L1-like learning mechanisms to develop an optimal L2 perceptual mapping. The development of L2 perception is realized through auditory-driven category formation and lexicon-driven category boundary-shifting. This model diverges from SLM in terms of the learnability of new and similar sounds in L2. In particular, L2LP proposes that although L2 sounds which are acoustically similar to certain L1 sounds present challenge to L2 learners, these similar sounds are easier to acquire than the L2 sounds that are entirely ‘new’ to the listeners because the L2 learners only need to shift the existing boundaries of corresponding L1 sounds to achieve an optimal L2 state for the similar sounds. However, for the L2 sounds that do not exist in their native language, the L2 learners have to create new boundaries and categories to differentiate them from the L1 sounds.
In addition to the divergent opinion regarding the learnability of new vs. similar sounds, SLM and L2LP differ in the stability of L2 learners’ native language. L2LP proposes that L2 learners’ L1 perception remains stable and is not affected by L2 development as long as the learners have sufficient L1 experience (Escudero, 2005). By contrast, SLM posits that both L1 and L2 are subject to change due to the L1–L2 interactions that are realized through the mechanisms of assimilation and dissimilation during the process of L2 development (Flege, 1987; Flege et al., 2003). According to this model, experienced L2 learners may assimilate an L1 sound to the similar L2 sound or dissimilate an L1 sound from its original target position to maximize the distinction between the two systems.
III Present study
Unlike previous bilingual studies that compared two languages with distinctive VOT patterns, the present study focused on the comparison of two stop systems that have similar phonetic realization in VOTs in young bilingual Mandarin–English children. Mandarin has six stops /p, t, k, ph, th, kh/ that are all voiceless. The phonetic feature of aspiration separates the six stops into aspirated and unaspirated categories which are phonetically realized as short-lag vs. long-lag VOTs. As stated above, in many dialects, English voiced stops in word-initial positions are manifested as short-lag VOTs and English voiceless stops in isolation at word-initial positions are usually manifested as long-lag VOTs. Therefore, these two languages both have short-lag and long-lag VOTs distributed at similar regions along the VOT continuum. However, previous studies revealed that the long-lag VOTs in Mandarin and English stops exhibit subtle but noticeable differences in monolingual adults of each language (Chao and Chen, 2008; Chen et al., 2007; Rochet and Fei, 1991). For example, Chao and Chen (2008) compared the VOT values of stops in Taiwanese Mandarin produced by 11 female speakers with those in English produced by three female British English speakers. The results demonstrated that the VOTs of Taiwanese unaspirated /p, t, k/ were not significantly different from those of English voiced /b, d, g/ but the VOTs of Taiwanese aspirated /ph, th, kh/ (82 ms, 81 ms, and 92 ms respectively) were longer than those of English voiceless /p, t, k/ (62 ms, 73 ms, and 86 ms respectively). The differences between the long-lag VOTs of these two languages, although subtle, were statistically significant.
The question that follows is whether the distinctive VOT patterns between these two languages can be observed in monolingual children of each language. We also ask whether the subtle difference is manifested in bilingual Mandarin–English children and, if so, to what extent. So far, there have been some sporadic reports that provided a descriptive analysis of Mandarin VOTs by adult speakers (Chao and Chen, 2008; Chen et al., 2007; Li, 2013; Rochet and Fei, 1991). Little data has been published on the comparison of VOT values in Mandarin and English monolingual children or in bilingual Mandarin–English children. Recently, one case study reported by Qi et al. (2012) examined the production of word-initial voiceless stops /p, t, k/ in English and aspirated stops /ph, th, kh/ in Mandarin by two Mandarin–English bilingual children. This study recorded spontaneous speech from one boy and one girl over a 9-month period from the time when they were 5;03 and 2;07 years old, respectively. The results revealed significantly different VOTs for bilabial stops but not for the alveolar or velar stops between the two languages in both children. The authors claimed that the bilingual children tended to organize the two phonetic systems differently to show their awareness of the subtle phonetic difference between Mandarin and English. However, a language difference in VOT was not found for all three tested stops, and the study examined only the production of voiceless/aspirated stops from two bilingual participants. The conclusion of separate VOT systems in bilingual Mandarin–English children should therefore be validated with further evidence.
To address the above questions and to fill the knowledge gap, the present study compared the VOTs of six stops each followed by three corner vowels in Mandarin and English that were produced by two groups of sequential bilingual children and monolingual children and adults.
In the present study, there are two groups of bilingual children: Bi-low (bilingual with less experience in English) and Bi-high (bilingual with more experience in English). As stated above, both Mandarin and English have similar phonetic representations of short-lag and long-lag VOTs for their stop systems. While there is no evident difference of short-lag VOTs between these two languages, the long-lag VOTs in Mandarin are longer than those in English. Therefore, for native Mandarin speakers, English short-lag VOT has an equivalent VOT category in their native language. According to both SLM and L2LP, we predict that both groups of bilingual children may directly apply their L1 model to the L2 and have no difficulty producing native-like short-lag VOTs in English.
For English long-lag VOTs, which are similar but not identical to the bilingual children’s native long-lag VOTs, the two models make different predictions. SLM regards this as the most challenging situation. According to this model, the bilingual children may not produce native-like English long-lag VOTs. In particular, as inexperienced L2 learners usually start with their native language and assimilate a similar L2 category into the L1 counterpart, we could assume that the Bi-low children would assimilate the English long-lag VOTs to Mandarin long-lag VOTs and might not separate the VOT patterns between Mandarin and English. As for the Bi-high children, they might not develop native-like English long-lag VOTs either. Meanwhile, their Mandarin long-lag VOTs may change due to the L1–L2 interactions. During the interaction process, if an assimilatory process occurs, the two VOT systems tend to approximate each other and the subtle differences between the Mandarin and English VOTs may diminish. In this case, the Bi-high children may not show separate VOT distributions for these two languages. Conversely, if a dissimilatory process occurs, the Bi-high children may shift the VOT values away from the target position and enlarge the differences between the two VOT systems. Therefore, they may show greater separation of VOT distributions for these two languages in comparison to monolingual speakers.
Following L2LP, the bilingual children may encounter fewer challenges in acquiring English long-lag VOTs because they can adjust the Mandarin long-lag VOT boundary to match the English target. In this case, the bilingual children especially the Bi-high children may produce native-like long-lag VOTs in English. Meanwhile, as the optimal L1 category can be maintained, the bilingual children would maintain their Mandarin long-lag VOTs. Therefore, the bilingual children especially the Bi-high children might separate the two VOT systems.
IV Methods
1 Participants
The participants of the present study included 39 children aged between 5 and 6 years old (15 Mandarin–English bilinguals, 15 age-matched Mandarin monolinguals, nine age-matched English monolinguals) and 22 adults aged between 20 and 58 years old (12 Mandarin monolinguals and 10 English monolinguals). The 15 bilingual Mandarin–English children were further divided into two groups (Bi-high and Bi-low) based on their experience in English (for details, see Table 1). The Bi-low group included seven children who were born and raised in China. They had lived in the USA (central Ohio) for less than six months when the study was implemented. Some children in this group learned English in kindergarten in China. The Bi-high group included eight children who were born and raised in the USA (central Ohio). These children were raised in a near-monolingual Mandarin-speaking environment until they enrolled in English day-care or kindergarten at around three years of age. Because all Bi-low and Bi-high children started to learn English after the acquisition of Mandarin, we regarded both groups as sequential bilingual children. According to the parents’ report, the Bi-high children had a greater amount of English experience and an earlier age of immersion in an English-speaking environment than the Bi-low children. The 15 Mandarin monolingual children (M = 5.66 years, SD = 0.44 years) were native Mandarin speakers who were born and raised in the Beijing area. The 12 Mandarin monolingual adults (M = 34 years, SD = 12 years) were all recruited from the Beijing area and spoke Mandarin as their first language. The nine English monolingual children (M = 5.56 years, SD = 0.53 years) were native English speakers born and raised in the central Ohio region. The 10 English monolingual adults (M = 31 years, SD = 8 years) were also recruited from the central Ohio region and only spoke English in their daily life.
Table 1. Characteristics of subgroups of bilingual Mandarin–English children.
2 Speech materials
The recording materials were composed of two lists of words: 18 Mandarin disyllabic words and 18 English monosyllabic/disyllabic words (for the complete word lists, see Table 2). The 18 Mandarin words contained six stop consonants /p, t, k, ph, th, kh/ each followed by three vowels /a, i, u/, respectively. Due to the phonotactic constraints in Mandarin, /k, kh/ are not followed by the high front vowel /i/, so the vowel /ɤ/ was used as the alternative vowel context. The tone environment was not controlled. The 18 English words contained six stop consonants /b, d, g, p, t, k/ each followed by three vowels /ɑ, i, u/, respectively. Note that because of the progressive /ɔ/–/ɑ/ merger in American English, and given the vocabulary size of young children and the picturability of the words, the vowels /oʊ, ɪ, ɔ, ʊ/ and the r-colored vowels /ɑ˞/, /ɪ˞/ were used as alternative vowel environments for English words.
Table 2. The word lists used for the collection of speech samples in Mandarin and English.
3 Procedures
Each monolingual speaker was recorded producing a list of words in their native language, and each bilingual speaker was recorded producing the two lists of words in both Mandarin and English. For the bilingual speakers, a questionnaire was completed by their parents prior to the recording activities. The questionnaire asked questions regarding demographic information, the participants’ language background, language learning, and language usage. For each bilingual speaker, Mandarin words were recorded first followed by English words with a 15–20 minute break in between. The experimenter, a fluent Mandarin–English bilingual speaker, interacted with the bilingual children in Mandarin during the Mandarin recording session and in English during the English recording session. To better control the stimulus presentation and target word elicitation, a visual-auditory word repetition task was used to collect speech samples from all speakers (Edwards and Beckman, 2008). Each speaker was recorded separately in a quiet room. Pictures depicting the target words were randomly ordered and displayed on a computer screen. An audio prime for each word produced by a native adult speaker was played to the speakers, who were then asked to repeat the target word immediately after the audio prompt. To control the speech rate as far as possible, all participants were instructed to produce the target words clearly at a normal speed. Each word list was repeated twice for each speaker. All speech samples were collected through a Shure SM10A head-mounted microphone connected via an amplifier to the computer. The recordings were stored directly on a hard disk drive with 16-bit quantization and a 44.1 kHz sampling rate.
4 VOT measurement
A spectrographic analysis program Adobe Audition 1.0 was used to determine the landmarks of onset and offset of the stops and the following vowel on the basis of the waveform with a visual check of the spectrogram. VOT values were measured from the beginning of the stop burst which represented the release of stop closure to the onset of vocal fold vibration. For stops with multiple bursts, the beginning of the first burst was used as the start point for VOT measurement. For the voiceless stop consonants, the onset of voicing was represented as the onset of the following vowel which was defined as the zero crossing of the first glottal pulse of the vowel. For the voiced stop consonants in which the onset of voicing precedes the release of oral closure, the voicing onset was defined as the start of low energy periodicity on the waveform and the beginning of a voice bar on the spectrogram preceding the release burst. It was measured separately from the onset of the vowel following the stop sound. In this case, negative VOT values were obtained due to the presence of pre-voiced stops.
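The measurement rule described above can be sketched in a few lines. This is an illustrative sketch only: the landmarks in the study were placed manually, and the function name `compute_vot_ms`, its arguments, and the example time stamps are hypothetical.

```python
def compute_vot_ms(burst_onsets_s, voicing_onset_s):
    """Return VOT in ms from hand-labelled landmarks given in seconds.

    burst_onsets_s: onset times of all release bursts for one stop token.
    voicing_onset_s: onset time of vocal fold vibration.
    """
    # With multiple bursts, the first burst is the reference point.
    first_burst_s = min(burst_onsets_s)
    # Voicing that starts before the release yields a negative (voicing-lead) VOT.
    return (voicing_onset_s - first_burst_s) * 1000.0


# Hypothetical tokens: an aspirated stop with two bursts, and a prevoiced stop.
aspirated_vot = compute_vot_ms([0.100, 0.105], 0.182)
prevoiced_vot = compute_vot_ms([0.250], 0.170)
```

Taking the sign directly from the subtraction means that lag and lead tokens share one scale, which is what allows voicing-lead values to be separated out later by their sign.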
V Results
1 Language difference within each group of speakers
Figure 1 shows the comparison of distributions of the overall VOT data between Mandarin and English for each group of speakers. For the monolingual adults, most English stops were produced with positive VOT values but some were produced with negative VOT values by the monolingual English adults. All Mandarin stops were produced with positive VOT values in the monolingual Mandarin adults. This observation evidenced the disparate phonetic realization of English voiced stops and the voiceless nature of Mandarin stops. Both monolingual English adults and monolingual Mandarin adults showed a bimodal VOT distribution representing the phonological contrast in their native language. However, the long-lag VOTs in Mandarin were concentrated and distributed in a region with higher VOT values than those in English. The monolingual children of each language had similar bimodal distributions as the corresponding monolingual adults. It is noteworthy that the monolingual Mandarin children showed a wider distribution for the long-lag VOTs than the monolingual Mandarin adults. For the two groups of bilingual children, unlike the monolingual adults or monolingual children, both Bi-low and Bi-high children showed less separable VOT distributions between Mandarin and English. Although both bilingual groups showed sporadic productions of voicing-lead VOTs for English, the occurrence of pre-voiced stops in the bilingual children was much lower than in the monolingual English children and adults. Because of the presence of bimodal distribution patterns, two-sample Kolmogorov–Smirnov (K-S) tests were used to compare the overall VOT data between Mandarin and English in each group of speakers. The K-S test results revealed significantly different VOT distributions between Mandarin and English in the monolingual adults (p = 0.01) and the monolingual children (p < 0.001). 
However, neither Bi-low nor Bi-high children showed significantly different distributions of the overall VOT data between Mandarin and English.
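For readers unfamiliar with the two-sample K-S test, the sketch below shows how the statistic compares two empirical distributions without assuming unimodality. The text does not state which software computed the K-S tests, so this standard-library implementation, its asymptotic p-value approximation, and the VOT values in the example are all assumptions for illustration; an exact or software-specific p-value may differ.

```python
import math
from bisect import bisect_right


def ks_two_sample(x, y):
    """Return (D, approximate p) for a two-sample Kolmogorov-Smirnov test."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # D is the largest vertical gap between the two empirical CDFs,
    # evaluated at every observed value.
    d = max(abs(bisect_right(xs, v) / n - bisect_right(ys, v) / m)
            for v in xs + ys)
    # Asymptotic p-value from the Kolmogorov distribution, with a common
    # small-sample correction to the effective sample size.
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d
    if lam < 0.2:
        # The series below does not converge for very small lam; p is ~1 here.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                  for k in range(1, 101))
    return d, min(max(p, 0.0), 1.0)


# Invented VOT values (ms) for two non-overlapping samples.
sample_a = [float(v) for v in range(10, 30)]
sample_b = [float(v) for v in range(70, 90)]
d_stat, p_value = ks_two_sample(sample_a, sample_b)
```

Because the test compares whole distributions, it is sensitive to differences in either mode of a bimodal VOT distribution, not only to a shift in the mean.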

Figure 1. Distribution of the overall VOT data in Mandarin and English for each group of speakers.
Previous studies suggest that children, at around 5 or 6 years old, have shown an approximation of VOT values towards adult targets and have developed language-appropriate VOT patterns for the phonological contrast in their native languages (Yang et al., 2018; Zlatin and Koenigsknecht, 1976). In the present study, both monolingual and bilingual children were in a similar age range of 5 to 6 years old. We expected that these children had generally developed appropriate VOT distinction corresponding to the phonological contrasts in their native language. Therefore, for further analysis, the VOTs of Mandarin unaspirated stops /p, t, k/ were compared with the VOTs of English voiced stops /b, d, g/; and the VOTs of Mandarin aspirated stops /ph, th, kh/ were compared with the VOTs of English voiceless stops /p, t, k/.
Figure 2 shows the comparison of VOTs between Mandarin unaspirated /p, t, k/ and English voiced /b, d, g/ in each group of speakers. Due to the consistently produced voicing-lead VOTs for the voiced stops by certain monolingual English children and adults, the VOTs of Mandarin unaspirated /p, t, k/ showed a distinctive pattern from English voiced /b, d, g/ in the monolingual speakers of each language. As for the two groups of bilingual children, they showed sporadic voicing-lead VOTs for English voiced stops. Meanwhile, both groups of bilingual children demonstrated a trend of longer VOTs for Mandarin unaspirated stops than for English voiced stops. The VOTs of Mandarin /p, t, k/ and English /b, d, g/ in each group of speakers were fitted with a Linear Mixed-Effects Model in SPSS. In particular, the factors of language, place of articulation, and vowel context were defined as the fixed effects, and the participant effect was defined as the random effect (random intercept was used for each participant). Note that the model improved by adding the two-way and three-way interactions between the fixed effects. Because the significant main effects were of particular interest in the present study, the potential significant interaction effects and random effect were not addressed and reported. Additionally, due to the presence of voicing-lead VOTs, the main effects of place and vowel were not reported for the comparison of English voiced stops and Mandarin unaspirated stops in each group. The results showed a significant difference of VOT between Mandarin unaspirated stops /p, t, k/ and English voiced stops /b, d, g/ in the monolingual adults (F = 4.569, p = .045), the Bi-low (F = 11.512, p = .001), and Bi-high children (F = 16.454, p < .0001). The language difference in the monolingual children was marginally significant (F = 4.026, p = .057).
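The exact SPSS specification is not given in the text. As a sketch under that caveat, a random-intercept model with the fixed effects named above could be written as follows, where $i$ indexes tokens and $j$ indexes participants. The symbols are introduced here for illustration, and each $\beta$ for a multi-level factor (place, vowel) stands for a set of coefficients rather than a single number.

```latex
\mathrm{VOT}_{ij} = \beta_0
  + \beta_{L}\,\mathrm{Language}_{ij}
  + \beta_{P}\,\mathrm{Place}_{ij}
  + \beta_{V}\,\mathrm{Vowel}_{ij}
  + (\text{two- and three-way interactions})
  + u_j + \varepsilon_{ij},
\qquad
u_j \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $u_j$ is the participant-specific random intercept, which absorbs between-speaker differences in overall VOT so that the language effect $\beta_L$ is assessed against within-speaker variation.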

Box plots showing the VOTs of the Mandarin unaspirated stops and English voiced stops in each group of speakers.
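For readers who prefer a formula, the model described above can be sketched as follows. This is an illustrative reconstruction, not the exact SPSS specification, which is not reported in full. The notation is assumed here: i indexes tokens, j indexes participants, and L, P and V stand for the coded factors of language, place of articulation and vowel context (place and vowel each have three levels, so each term abbreviates a set of contrast-coded terms). The normality assumptions are the conventional ones for a random-intercept model.

```latex
\mathrm{VOT}_{ij} = \beta_0 + \beta_L L_{ij} + \beta_P P_{ij} + \beta_V V_{ij}
  + \beta_{LP} L_{ij} P_{ij} + \beta_{LV} L_{ij} V_{ij} + \beta_{PV} P_{ij} V_{ij}
  + \beta_{LPV} L_{ij} P_{ij} V_{ij} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here u_j is the random intercept for participant j, and the F tests reported for language correspond to the terms involving L.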
As previous literature reported that the short-lag VOTs in Mandarin-speaking and English-speaking adults showed no significant difference, in order to compare the language difference on short-lag VOTs in the bilingual and monolingual children, the voicing-lead VOTs were excluded from the dataset for Mandarin unaspirated and English voiced stops (shown in Figure 3). By excluding the voicing-lead VOT values, the average short-lag VOTs were 21.3 and 24.7 ms in the monolingual Mandarin adults and the monolingual English adults, respectively; 26.3 ms and 27.4 ms in the monolingual Mandarin children and the monolingual English children, respectively; 29.7 ms and 23.7 ms for Mandarin and English, respectively, in the Bi-low children; and 27.9 ms and 20.3 ms for Mandarin and English, respectively, in the Bi-high children. A Linear Mixed-Effects Model was applied to the short-lag VOTs for each group. The results yielded no language difference in the monolingual adults or the monolingual children. However, both groups of bilingual children showed a significant difference between Mandarin and English short-lag VOTs (Bi-low: F = 8.716, p = .004; Bi-high: F = 20.462, p < .0001). In addition to the language difference, the effects of place of articulation and vowel context were significant for all groups of speakers (all p < .0001). The pairwise comparison for main effects with Bonferroni adjustments suggested that the short-lag VOTs of bilabial and alveolar stops were significantly different from those of velar stops (all p <.0001). The short-lag VOTs of stops followed by /a/ were significantly different from those followed by the other two vowels (all p ⩽ .005).

Box plots showing the VOTs of Mandarin unaspirated stops and English voiced stops excluding the voicing-lead VOTs in each group of speakers.
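The exclusion step described above can be sketched in a few lines of Python. This is a minimal illustration, assuming that voicing-lead tokens are those with a negative VOT (voicing begins before the burst) and using made-up token values rather than the study data; the function name `short_lag_means` is hypothetical.

```python
from statistics import mean

# Hypothetical tokens as (language, VOT in ms); these are NOT the study data.
# Assumption: a negative VOT marks a voicing-lead token.
tokens = [
    ("Mandarin", 18.0), ("Mandarin", 25.0),
    ("English", -85.0), ("English", 22.0),
    ("English", 26.0), ("English", -60.0),
]

def short_lag_means(tokens):
    """Drop voicing-lead tokens (VOT < 0) and average the rest per language."""
    by_language = {}
    for language, vot_ms in tokens:
        if vot_ms < 0:  # voicing lead: excluded from the short-lag comparison
            continue
        by_language.setdefault(language, []).append(vot_ms)
    return {language: mean(values) for language, values in by_language.items()}

means = short_lag_means(tokens)
```

Averaging only the remaining tokens keeps a few strongly negative values from dominating the language comparison, which is the motivation for the exclusion.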
Figure 4 displays the comparison of VOTs between Mandarin aspirated stops /ph, th, kh/ and English voiceless stops /p, t, k/ in each group of speakers. Note that Mandarin aspirated stops and English voiceless stops are both manifested as long-lag VOTs. It is clearly demonstrated that the long-lag VOTs of Mandarin aspirated stops produced by the monolingual Mandarin adults and children were longer than the long-lag VOTs of English voiceless stops produced by the monolingual English adults and children. As for the two groups of bilingual children, the Bi-low children showed a pattern of longer VOTs for Mandarin aspirated stops than for English voiceless stops. This was consistent with the pattern shown in the monolingual speakers. The Bi-high children did not show observable differences between Mandarin and English. For each group of speakers, the long-lag VOTs of Mandarin aspirated stops and English voiceless stops were fitted with a Linear Mixed-Effects Model. The results yielded a significant language difference in the monolingual adults (F = 4.673, p = .043), the monolingual children (F = 18.370, p < .0001), and the Bi-low children (F = 17.473, p < .0001), but not in the Bi-high children. In addition to the effect of language, the effect of vowel context was significant in all four groups of speakers (all p ⩽ .005). The pairwise comparison for main effects with Bonferroni adjustments suggested that the VOTs of stops followed by /i/ were significantly different from the VOTs of stops followed by /a/ or /u/ (all p < .05).

Box plots showing the VOTs of the Mandarin aspirated stops and English voiceless stops in each group of speakers.
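The Bonferroni adjustment applied to the pairwise comparisons can be illustrated as below. This is a sketch of one common formulation (each raw p-value is multiplied by the number of comparisons and capped at 1); the raw p-values are hypothetical, and the function name `bonferroni_adjust` is introduced only for illustration.

```python
def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return {pair: min(1.0, p * m) for pair, p in p_values.items()}

# Hypothetical raw p-values for the three pairwise vowel comparisons.
raw_p = {("i", "a"): 0.004, ("i", "u"): 0.02, ("a", "u"): 0.5}

adjusted = bonferroni_adjust(raw_p)
# Pairs that remain significant at alpha = .05 after adjustment.
significant = [pair for pair, p in adjusted.items() if p < 0.05]
```

With three comparisons, a raw p of .02 no longer passes the .05 threshold after adjustment, which shows how the correction guards against inflated false positives across multiple pairwise tests.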
2 Group difference within each language
In order to further examine how the bilingual children’s VOT values in Mandarin and English differed from those of the corresponding monolingual speakers, the VOT values were compared among the four groups of speakers within each language. Figure 5 shows the kernel density plots of the VOTs for the Mandarin unaspirated and aspirated stops, respectively, in all four groups of speakers. For the unaspirated stops that were realized as short-lag VOTs, all four groups of speakers showed similar VOT distributions with the majority of VOTs located in a region around 20 ms. For the aspirated stops that were realized as long-lag VOTs, all three groups of children demonstrated a much wider VOT distribution than the adults. The long-lag VOTs in the monolingual Mandarin adults were highly concentrated around 100 ms whereas those in the three groups of children were concentrated in a region with higher VOT values. Meanwhile, the two groups of bilingual children showed distribution patterns distinct from each other and from those of the monolingual Mandarin children and adults. A Linear Mixed-Effects Model was used to analyse the VOTs of Mandarin unaspirated and aspirated stops, respectively. In particular, the participant groups (Bi-low, Bi-high, Mono children, Mono adults), place of articulation, and vowel context were defined as the fixed effects and the participant effect was defined as the random effect (a random intercept was used for each participant). The model fit improved when the two-way and three-way interactions between the fixed effects were added. Because the significant main effects were of particular interest in the present study, the potentially significant interaction effects and the random effect are not reported. The results revealed a significant group difference for the long-lag VOTs of Mandarin aspirated stops (F = 5.631, p = .003) but no group difference for the short-lag VOTs of Mandarin unaspirated stops. 
The pairwise comparison for main effects with Bonferroni adjustments suggested that both Bi-low and monolingual children were significantly different from the monolingual adults in the VOTs for the aspirated stops (both p < .05). In addition, both factors of place and vowel were significant for both unaspirated and aspirated stops (all p < .0001). The pairwise comparison for main effects with Bonferroni adjustments revealed that the VOTs of velar stops were significantly different from those of the alveolar stops (both p < .0001) and the VOTs of stops followed by /a/ were significantly different from those followed by the other two vowels (both p < .005).

Kernel density plots showing the estimated distribution of VOT data for Mandarin unaspirated and aspirated stops, respectively, across the four groups of speakers.
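A kernel density estimate of the kind plotted above can be sketched in plain Python. The kernel and bandwidth used for the figures are not stated, so this sketch assumes a Gaussian kernel with the normal-reference rule of thumb for the bandwidth; the sample values are hypothetical long-lag VOTs, not the study data.

```python
import math
from statistics import stdev

def gaussian_kde(samples, bandwidth=None):
    """Return a function estimating the density of `samples` at a point x."""
    n = len(samples)
    if bandwidth is None:
        # Assumption: normal-reference rule of thumb, h = 1.06 * sd * n^(-1/5).
        bandwidth = 1.06 * stdev(samples) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Sum of Gaussian bumps centred on each observed VOT value.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return density

# Hypothetical long-lag VOTs in ms (illustrative only).
adult_vots = [92, 98, 100, 103, 107, 110, 95, 101]
adult_density = gaussian_kde(adult_vots)

# Evaluate on a 0-200 ms grid in 0.5 ms steps, e.g. for plotting.
grid = [x / 2 for x in range(0, 401)]
curve = [adult_density(x) for x in grid]
```

A narrower, taller curve indicates that productions cluster tightly around one value, while a wider, flatter curve indicates greater dispersion of the kind described for the children.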
Figure 6 displays the kernel density plots of the VOTs for the English voiced and voiceless stops, respectively, in all four groups of speakers. The monolingual English children and adults demonstrated highly similar VOT distributions for both voiced and voiceless stops. The two groups of bilingual children showed a more concentrated distribution of VOTs for the voiced stops due to the lower occurrence of voicing-lead VOTs. When the voicing-lead VOTs were excluded, the four groups of speakers showed highly similar distribution patterns for the short-lag VOTs of the voiced stops. However, for the voiceless stops that were produced with long-lag VOTs, both groups of bilingual children, especially the Bi-low children, showed a much wider distribution than the monolingual English speakers. Moreover, while the Bi-high children had the long-lag VOTs concentrated around 80 ms, similar to the monolingual English children and adults, the Bi-low children had long-lag VOTs concentrated in a region with higher VOT values around 100 ms. A Linear Mixed-Effects Model was used to analyse the VOTs of English voiced stops, voiced stops without voicing-lead cases, and voiceless stops, respectively. The results revealed a significant group difference for the VOTs of English voiceless stops (F = 3.095, p = .042) but no significant group difference for the VOTs of English voiced stops with or without voicing-lead VOTs. The pairwise comparison for main effects with Bonferroni adjustments suggested that the Bi-low children were significantly different from the monolingual English adults (p = .047). The main effects of place of articulation and vowel context were addressed for the voiced stops without voicing-lead VOTs and the voiceless stops. The results revealed that the effect of place of articulation was significant for the VOTs of the voiced stops excluding voicing-lead cases (p < .0001). 
The pairwise comparison for main effects with Bonferroni adjustments suggested that the VOTs of all three places were significantly different from each other (all p < .05). The effect of vowel context was significant for the VOTs of the voiced stops excluding voicing-lead cases and the voiceless stops (both p < .0001). The pairwise comparison revealed that the VOTs of stops followed by /a/ were significantly different from those followed by /i/ (p < .0001).

Kernel density plots showing the estimated distribution of VOT data for English voiced stops with or without voicing-lead VOTs, and voiceless stops across the four groups of speakers.
VI Discussion
English and Mandarin stop consonants have distinctive phonological categorizations but a similar phonetic realization of short-lag vs. long-lag distinction along the VOT continuum. Previous studies revealed that Mandarin-speaking adults produced measurably longer long-lag VOTs than English-speaking adults even though these two languages occupy similar regions on the VOT continuum (Chao and Chen, 2008; Chen et al., 2007). While these phonetic studies mainly focused on adult speakers, the present study compared Mandarin and English VOTs in 5- to 6-year-old Mandarin–English bilingual children and age-matched monolingual children with reference to monolingual adults of each language. This dataset enables us to disentangle the developmental and bilingual effects. In particular, we wondered whether Mandarin and English VOTs were separable in monolingual children as in monolingual adults. Following this question, we were more interested in whether and how the language difference was manifested in the bilingual Mandarin–English children.
1 Language difference of VOT in monolingual children
Regarding the first question, our results suggested that the 5- to 6-year-old monolingual Mandarin children demonstrated different VOT patterns from the age-matched monolingual English children in a similar way to the monolingual adults of each language. As shown in Figure 1, both monolingual adults and monolingual children showed distinctive distributional patterns of the overall VOT data between Mandarin and English. The language differences were manifested in two aspects. First, unlike Mandarin unaspirated stops that were produced with only short-lag VOTs in the adult speakers, English voiced stops were consistently produced with voicing-lead VOTs by certain monolingual English adults and children. Second, both monolingual Mandarin adults and children showed observable differences from the monolingual English adults and children on the long-lag VOTs. As shown in Figures 1 and 4, the long-lag VOTs of English voiceless stops in the English monolingual adults concentrated in a region around 80 ms while the long-lag VOTs of Mandarin aspirated stops in the monolingual Mandarin adults concentrated in a region around 100 ms. This language difference between Mandarin and English long-lag VOTs, although subtle, yielded statistically significant results. The monolingual English children produced English long-lag VOTs for the voiceless stops similarly to the monolingual English adults. But the monolingual Mandarin children produced Mandarin long-lag VOTs for the aspirated stops that were even longer than those of the monolingual Mandarin adults. Therefore, the difference between Mandarin and English long-lag VOTs in the monolingual children was even greater than the difference between the monolingual adults. Cho and Ladefoged (1999) extended Lisker and Abramson’s categorization and proposed that the long-lag VOTs could be further divided to separate slightly aspirated, aspirated and highly aspirated stops. 
According to this categorization, Mandarin /ph, th, kh/ fell into the aspirated or highly aspirated range while English /p, t, k/ were located in the slightly aspirated or aspirated range (Chao and Chen, 2008). The language difference between Mandarin and English was manifested well in the monolingual adults as well as the monolingual children of each language.
2 Separation of Mandarin and English VOT systems in bilingual children
The primary aim of the present study was to address the cross-language similarity and difference in bilingual Mandarin–English children. The comparison of the overall VOT data demonstrated that the bilingual children, regardless of the amount of their language experience in L2, showed less separable VOT distributions between Mandarin and English in comparison to the monolingual speakers. Further analysis was conducted separately to compare the VOTs between Mandarin unaspirated stops and English voiced stops as well as between Mandarin aspirated stops and English voiceless stops. Cross-language comparisons revealed that both Bi-low and Bi-high children showed significant language differences between Mandarin unaspirated stops and English voiced stops. As previous studies reported that the short-lag VOTs produced by Mandarin-speaking adults showed no significant difference from those by English-speaking adults, the present study compared the short-lag VOTs of Mandarin unaspirated and English voiced stops excluding the voicing-lead VOTs. The results yielded no difference between the monolingual Mandarin speakers and the monolingual English speakers, which was consistent with previous findings. However, both groups of bilingual children still showed a significant difference in the short-lag VOTs between these two languages.
As monolingual speakers of each language showed no significant difference between Mandarin and English short-lag VOTs, we regarded English short-lag VOT as the equivalent category of Mandarin short-lag VOT. L2LP proposes that L2 learners usually start with a copy of the L1 phonetic categories to learn a new language. SLM assumes that the acoustic similarity between the L2 and L1 sounds determines the learnability of the L2 sounds. The L2 sound which has an ‘identical’ counterpart in the L1 is the easiest to learn. Therefore, according to both theories, both Bi-low and Bi-high children would acquire English short-lag VOTs with no difficulty. They should produce English short-lag VOTs similarly to the monolingual English speakers. The comparison of short-lag VOTs among the four groups of speakers in each language showed no significant difference. This result suggested that both groups of bilingual children produced English with native-like short-lag VOTs which might be directly transferred from their native language. This result conforms to the predictions of both theories. However, both groups of bilingual children showed a significant language difference for the short-lag VOTs that was not present in the monolingual speakers. As shown in Figures 4 and 5, the two groups of bilingual children tended to produce Mandarin short-lag VOTs longer than the targets produced by the monolingual Mandarin speakers while tending to produce English short-lag VOTs shorter than the targets produced by the monolingual English speakers, even though the group difference between the bilingual children and the monolingual speakers was not significant for either language. In this process, the bilingual children adopted a dissimilatory mechanism to show the differentiation between L1 and L2.
With regard to the comparison of long-lag VOTs, both Mandarin aspirated stops and English voiceless stops are realized as long-lag VOTs in the VOT continuum, but, as produced by monolingual speakers of each language, Mandarin long-lag VOTs are longer than English long-lag VOTs (Chao and Chen, 2008; Chen et al., 2007). We therefore regarded English long-lag VOT as a ‘similar’ category. As introduced above, SLM and L2LP make divergent predictions for this situation. SLM assumes that this type of L2 category would be the most difficult to acquire. If so, both groups of bilingual children, especially the Bi-low children, would be less likely to produce native-like English long-lag VOTs. L2LP proposes that this type should be easier than a new category as the L2 learners could shift their L1 boundary to match the boundary for the L2 target. If so, the bilingual children, at least the Bi-high children who were more experienced in English, should be able to produce native-like English long-lag VOTs and separate the long-lag VOTs in these two languages.
Our results showed that the Bi-low children produced English long-lag VOTs significantly longer than the monolingual English speakers. This result suggested that the Bi-low children, who were at the initial stage of English learning, tended to transfer Mandarin long-lag VOTs to English and had not developed native-like English long-lag VOTs. Meanwhile, the Bi-low children modified their native long-lag VOTs and produced them significantly longer than the monolingual Mandarin adults. This result demonstrated the flexibility of the VOT system in young children. By contrast, the Bi-high children showed no significant difference from the monolingual speakers on the long-lag VOTs for either Mandarin or English. These results indicated that the Bi-high children had developed native-like English long-lag VOTs and had maintained their Mandarin long-lag VOTs. These findings suggest that while the ‘similar’ phonetic category is not as easy as the identical category that can be directly transferred, it can be acquired after a certain period of intensive immersion. In the meantime, bilingual children can maintain relative stability in their L1, which seems in line with the L2LP predictions.
Note that while the Bi-high children produced Mandarin and English long-lag VOTs in the same way as the monolingual speakers of each language, they did not distinguish Mandarin and English long-lag VOTs as their monolingual counterparts did. By contrast, although the Bi-low children differed from the monolingual speakers on both L1 and L2 long-lag VOTs, they showed a significant difference between Mandarin and English long-lag VOTs. A close comparison of long-lag VOTs among the four groups of speakers for each language revealed that the Bi-low children produced longer English long-lag VOTs than the monolingual English speakers as a result of longer long-lag VOTs in their native language. In the meantime, they greatly dissimilated their Mandarin long-lag VOTs away from the target values, which resulted in the apparent separation of long-lag VOTs between L1 and L2. By contrast, the Bi-high children demonstrated a slight assimilatory shift of L1 and L2 boundaries towards each other, which resulted in a negligible difference between L1 and L2 long-lag VOTs.
Taken together, the comparisons between Mandarin and English within each speaker group and the comparisons across speaker groups within each language suggested that while the bilingual children showed a less separable distribution of overall VOT data between Mandarin and English, they were aware of the separation of the two languages and attempted to distinguish Mandarin and English VOTs. However, the separation was implemented in a way different from the monolingual speakers due to the bilingual effects.
3 Other points
It is noteworthy that certain monolingual English children and adults consistently produced voicing-lead VOTs for all voiced stops, whereas the bilingual children produced voicing-lead VOTs only sporadically, for certain English voiced stops. As this VOT mode is absent in Mandarin, it can be regarded as a new VOT category for native Mandarin speakers. For both SLM and L2LP, the new category is hard to acquire. L2LP proposes that a new phonetic category is the most difficult to establish because L2 learners need to create a new prototype instead of adjusting the existing L1 boundary to match the optimal L2 target. As can be observed, the monolingual Mandarin children also produced a few Mandarin unaspirated stops with voicing-lead VOTs. Given that voicing-lead VOT is usually acquired later than short-lag or long-lag VOTs (Davis, 1995; Gandour et al., 1986) and this VOT type does not occur in Mandarin, we assumed that the voicing-lead VOTs in the bilingual children and the monolingual Mandarin children reflected the developing oral–laryngeal muscular coordination for the voicing control rather than a fully developed voicing-lead category in the bilingual children.
A number of studies examined the separation of VOT systems in younger bilingual children who spoke two languages with distinctive VOT patterns along the continuum. For example, Khattab (2002) compared the VOTs in three 5- to 10-year-old bilingual Arabic–English children and three age-matched monolingual children from each language. The author found that the bilingual children successfully separated the VOT systems for each language in their production. However, their productions of both L1 and L2 still differed from those of the corresponding monolingual children. Similar findings were reported in other studies that examined the development of VOTs in bilingual children who were exposed to two languages with an aspirating vs. voicing language distinction (Harada, 2007; Johnson and Wilson, 2002; Simon, 2010). Unlike these studies, the present investigation compared the VOTs of bilingual Mandarin–English children who spoke two languages that are both aspirating languages characterized by a short-lag vs. long-lag VOT contrast. Our results revealed that the Bi-high children who had been intensively exposed to the L2 for a few years had acquired native-like VOT productions in L2 and maintained the VOT features for their L1. That is, they could produce both L1 and L2 like their monolingual counterparts. However, the Bi-low children who had a very short period of immersion in L2 demonstrated great flexibility in both L1 and L2 phonetic systems as they showed a significant difference from both monolingual Mandarin and monolingual English speakers on the long-lag VOTs.
Although Mandarin and English have a similar VOT contrast, the bilingual children still attempted to differentiate the two languages in their production. This finding was consistent with previous bilingual studies. Qi et al. (2012) reported that the two Mandarin–English bilingual children in their study ‘do not automatically equate the VOT values of English and Mandarin voiceless stops even though the two languages may have similar VOT characteristics’ (p. 82). In the present study, our data revealed that even for the short-lag VOTs that show no difference between the monolingual speakers of each language, both Bi-low and Bi-high children still showed a significant language difference. These findings, together, suggest that bilingual children, no matter whether their L1 and L2 present distinctive or similar VOT patterns, tend to differentiate the two languages, albeit not exactly in a monolingual-like manner.
Lee and Iverson (2012) reported that the younger bilingual Korean–English children who had a shorter period of L2 exposure adopted assimilation, whereas the older bilingual Korean–English children who had a longer period of L2 exposure adopted both assimilation and dissimilation processes during the development of contrastive productions. By contrast, our data suggested that the bilingual children who had a short period of L2 immersion (the Bi-low children) seemed to mainly use dissimilation, whereas the bilingual children who had a long period of L2 immersion (the Bi-high children) used both mechanisms in their stop production.
Another point worth mentioning is the adult–child difference in VOT. Previous research has revealed that young children have not developed the adult-like pattern of interarticulatory timing (Koenig, 2000). Children produce articulatory movements more slowly and show greater temporal variability than adults (Nittrouer, 1993; Tingley and Allen, 1975). As shown in Figure 1, the three groups of children exhibited a wider range and a greater dispersion of VOT data in comparison to the adults for both Mandarin and English. This observation provides evidence of the continuing development of temporal features in children. In addition, all three groups of children, regardless of their language experience, showed longer long-lag VOT values than the adults. In contrast, the short-lag VOTs did not show observable differences between the children and the adults. The long-lag VOTs contain aspiration that involves oral–laryngeal coordination and has been used as an important index for interarticulator timing (Löfqvist, 1980, 1992). With inferior speech timing control and laryngeal–supralaryngeal coordination, children may produce longer aspiration to transition from the release burst to the following vowel, which, therefore, results in longer long-lag VOTs in children than in adults.
VII Conclusions
In sum, this study suggested that the 5- to 6-year-old monolingual children demonstrated an evident language difference in VOTs, in particular long-lag VOTs, in a pattern similar to monolingual adults. The bilingual children, regardless of the amount of experience in English, tended to separate the VOT systems in these two languages. However, they realized it in a manner different from the monolingual speakers. Both groups of bilingual children tended to separate the short-lag VOTs in L1 and L2. However, only the bilingual children who had less experience in English showed significant differentiation between Mandarin and English long-lag VOTs. This study has important implications regarding L2 acquisition in children. It informs us that young children have flexible language systems which are subject to quick change during the beginning process of L2 learning. Due to the bilingual effects and L1–L2 interactions, even though the bilingual children try to separate the two VOT systems, they implement the separation in a different manner from the monolingual speakers. It is noteworthy that there was a relatively small number of participants in each group. Therefore, the findings of the present study should be interpreted and generalized with caution. In future studies, more participants should be recruited to ensure greater statistical power. In addition, more children from a wider age range should be recruited to better track the VOT development in bilingual children and corresponding monolingual peers.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported, in part, by the Alumni Grants for Graduate Research and Scholarship at the Ohio State University. I would like to thank the children and their parents for their participation.
