Abstract
Purpose:
The interconnectedness of phonological categories between the two languages of early bilinguals has previously been explored using single-probe speech production and perception data. Our goal was to tap into bilingual phonological representations in another way, namely via monitoring instances of phonetic drift due to changes in language exposure.
Design:
We report a case study of two teenage English–Czech simultaneous bilinguals who live in Canada and spend summers in Czechia (Czech Republic). Voice onset time (VOT) of word-initial voiced and voiceless stops was measured upon the bilinguals’ arrival in Czechia and before their departure at the end of a two-month stay.
Data and Analysis:
Each bilingual read the same set of 71 Czech and 58 English stop-initial target words (and additional fillers) at each time of measurement. The measured VOT values were submitted to linear mixed effects models, assessing the effects of target language, measurement time, and underlying voicing.
Findings/Conclusions:
After the immersion in a Czech-speaking environment, for both speakers the count of voiced stops realized as prevoiced (i.e., having negative VOT) increased and the measured VOT of voiced stops (appearing different for English and Czech initially) drifted towards more negative (more Czech-like) values in both languages, while no change was detected for the voiceless stops of either English (aspirated) or Czech (unaspirated). The results suggest that the bilinguals maintain three-way VOT distinctions, differentiating voiceless aspirated (English), voiceless unaspirated (Czech), and voiced (English–Czech) stops, with connected bilingual representations of the voiced categories.
Originality:
Data on phonetic drift in simultaneous bilinguals proficient in their two languages have not previously been published.
Significance/Implications:
We show that observing phonetic shifts due to changes in the ambient linguistic environment can be revealing about the organization of phonological space in simultaneous bilinguals.
Introduction
The linguistic competence of a bilingual speaker is not the sum of two separate monolingual competences (Grosjean, 1989). Instead, it forms a unified system (Cook, 1992). This is seen in cross-language interactions observed both on the level of linguistic representations (e.g., Barlow et al., 2013; Flege & Eefting, 1987; Sebastián-Gallés et al., 2005) and on the level of online speech processing (e.g., Goldrick et al., 2014; Jacobs et al., 2016; Olson, 2013). Research has documented cross-language influences indicating that a bilingual’s two languages are not isolated from each other even for early sequential and/or simultaneous bilinguals. Even though such bilinguals often differentiate speech sounds of one language from similar sounds of their other language (Barlow et al., 2013; Simonet, 2010; Yusa et al., 2010), they still differ in phonetic realization from monolingual speakers of each language (e.g., Flege, 1987; Fowler et al., 2008; Hazan & Boulakia, 1993).
Reports of a null difference between bilingual and monolingual performance are less common. For example, Antoniou et al. (2010) found that early Greek–Australian-English bilinguals could produce voice onset times (VOTs) of stop consonants, an acoustic correlate of voicing, that were indistinguishable from those of Greek monolinguals or of Australian-English monolinguals if the language context during data elicitation was kept strictly monolingual. However, even when a study finds no difference between bilingual and monolingual performance, that does not guarantee that these bilinguals have independent underlying phonological representations for each language. Sundara and Polka (2008) argue that evidence of a bilingual differentiating similar sounds in production must be matched by evidence from perception. They compared perceptual discrimination of the (French) dental and (English) alveolar voiced stops by Canadian French–English simultaneous bilinguals and early sequential bilinguals, monolingual speakers of each language, and native speakers of Hindi (where dental and retroflex stops contrast). They found the early sequential bilinguals’ performance to be based on a merged category. The simultaneous bilinguals appeared to have a separate stop category for each language, even in perception. Interestingly, their performance was similar to that of the Hindi speakers but differed from that of the English monolinguals.
Another way of tapping into phonological representations is via monitoring phonetic drift, that is, shifts that occur in response to a change of the ambient linguistic environment (see Chang, 2019, for a review of the phenomenon). Phonetic adjustments induced by recent input have been documented for monolinguals (as shown by studies of perceptual adaptation (Cutler, 2012) and phonetic convergence (Pardo, 2013)), for late non-balanced bilinguals in their second language (L2) (Tobin et al., 2017) or in both L2 and their first language (L1) (Sancier & Fowler, 1997), and in L1 even for inexperienced L2 learners exposed to the L2 for a relatively short time (Chang, 2012, 2013). Data revealing phonetic drift in simultaneous or early sequential bilinguals proficient in their two languages are not available to date. The few longitudinal case studies monitoring L2 and L1 development in early L2 learners that attest phonetic shifts due to exposure cover only the early stages of L2 acquisition (the first seven months for Simon’s (2010) learner and the first 20 months for the learner in Yang et al. (2015)). The current study measures phonetic drift in teenage simultaneous bilinguals to investigate their representations of similar sounds. We examine the VOT of stop consonants in English and Czech, languages with different phonetic implementation of the voiced and voiceless phonemes (short vs long positive VOT in English, and negative vs short positive VOT in Czech). Studies with children simultaneously acquiring two languages that differ in the phonetic implementation of stop voicing in a way similar to English and Czech report separate phonetic categories for phonologically voiceless but not for voiced stops (e.g., Johnson & Wilson, 2002). By adulthood, however, bilinguals seem to produce language-specific monolingual-like voiced stops in their two languages (Sundara et al., 2006).
However, as suggested above, having non-identical realizations for equivalent L1 and L2 sounds does not mean that these sounds are equally well separated at a higher level of representation. An L1–L2 link at a phonemic level of representation of /b, d, g/ (and possibly /p, t, k/) may be revealed by an L1–L2 concurrent phonetic drifting in response to a recent change in the input, such as traveling to another country for summer holidays. A change in the linguistic environment inducing a shift in the phonetic implementation of stop voicing in both languages (in the direction of the current input language) can be interpreted as evidence for a connected phonological representation of such sounds.
We measured the VOT of word-initial stops of two teenage Canadian English–Czech simultaneous bilinguals (sisters) upon their arrival in Czechia (Czech Republic) and then before their departure at the end of a two-month stay. Our research questions were: (a) What VOTs do our simultaneous (non-balanced) bilinguals produce for phonologically voiced and voiceless stops in each language? (b) Do any of their VOT values change in response to the change of the ambient linguistic environment? and (c) If so, does the change occur only in the language of the current environment or in both languages, reflecting a connection of phonological representations between the bilinguals’ languages?
Method
Participants
The speakers volunteered to participate. They were two female simultaneous English–Czech bilinguals dominant in English, two sisters labeled here as A and B; A was aged 13 and B aged 16 at the time of the recording. They were born and live in Toronto, Canada, with their Canadian father and their Czech mother. Before entering elementary education at the age of 6, each sister had had extensive exposure to Czech from their mother at home and from their mother’s parents, both during regular 3- to 4-month visits to Czechia every summer and during the grandparents’ three-week visits to Canada every winter. After the age of 6, they continued speaking Czech primarily with their mother, and also with their grandparents, whom they saw in Canada every winter (about three weeks) and in Czechia every summer (two months).
Stimuli
The material consisted of a list of 71 monosyllabic or disyllabic (stress-initial) Czech and 58 monosyllabic English stop-initial words, and 46 Czech and 15 English filler words. The initial stops had different places of articulation (see Table 1 for the exact counts) and were followed by vowels of differing heights (see Table 2).
Table 1. Numbers of stop-initial stimulus words split by places of articulation.
Table 2. Numbers of stop-initial stimulus words split by the height of the first vowel.
Procedure
Upon arrival in Czechia, the speakers were told that the purpose of the recordings was to monitor changes in their pronunciation over the course of their stay in the Czech-speaking environment. No other details were provided. First, the participants were implicitly familiarized with the stimulus words, hearing them in conversation with the data collector (their Czech cousin) one day prior to the first recording session, which took place two days after their arrival in Czechia.
In the first session itself, the speakers were presented with each of the stimulus words once, on the screen of a computer, one by one in random order but in two blocks split by language, at a fixed interval of 2 seconds for the English and 5 seconds for the Czech words, and they read them out loud, in isolation without any carrier phrase. An immediate repetition was elicited by the data collector in case of unclear pronunciations or mistakes. A Zoom H4n portable recorder was used to record the speech as uncompressed audio at 16-bit resolution and a 44.1 kHz sampling rate.
The different fixed trial duration for English versus Czech was motivated by allowing greater time for lexical access in the speakers’ non-dominant language, Czech. Because of this difference, comparisons of measured VOT for Czech versus English have to be made with caution, as the longer trial duration could potentially encourage slower speech tempo and thus lengthen VOT in Czech relative to English. However, the main comparison is between recording sessions within each language and importantly, the same procedure and stimulus words were used again in the second recording session at the end of the speakers’ two-month stay in Czechia (59 days after session 1). The comparability of the two recording sessions within each language allowed for a reliable assessment of the effect of the immersion stay on VOT production in each language.
Measurements and analysis
After excluding noisy or erroneous recordings (eight in total), but including occasional repeated productions of a word, 267 and 268 initial stops could be analyzed for speakers A and B, respectively. Praat (Boersma & Weenink, 2019) was used to manually label their VOT, that is, time from the stop release to the first zero crossing of periodicity detectable in the waveform and broadband spectrogram that either followed (positive VOT) or preceded (negative VOT) the release. The labeling was performed by the data collector and corrected by the first author manually.
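The sign convention in this definition can be made concrete with a minimal sketch. The function names `vot_ms` and `is_prevoiced` and the label times below are hypothetical illustrations, not measurements from this study. The only thing taken from the text is that VOT is the interval from the stop release to the onset of periodicity, and that it is negative when voicing precedes the release.

```python
def vot_ms(release_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds: voicing onset time minus release time."""
    return (voicing_onset_s - release_s) * 1000.0


def is_prevoiced(vot: float) -> bool:
    """A stop counts as prevoiced when its VOT is negative."""
    return vot < 0.0


# Hypothetical label times in seconds (illustration only).
lag_token = vot_ms(release_s=0.512, voicing_onset_s=0.580)   # voicing follows release
lead_token = vot_ms(release_s=0.512, voicing_onset_s=0.430)  # voicing precedes release

print(round(lag_token), is_prevoiced(lag_token))
print(round(lead_token), is_prevoiced(lead_token))
```

With this convention, the prevoicing counts reported in the Results correspond simply to the number of voiced-stop tokens with VOT below zero.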
Per speaker, the measured VOT values were submitted to a linear mixed effects model fitted with the lme4 and lmerTest packages (Bates et al., 2015; Kuznetsova et al., 2017) in R (R Core Team, 2019). Language, Underlying voicing, and Session were fixed factors with orthogonal sum-to-zero contrasts (Czech -0.5 vs English +0.5; voiced -0.5 vs voiceless +0.5; 1st session -0.5 vs 2nd session +0.5), and Word was a random factor. Comparison of estimated means across factor levels was carried out with the emmeans package (Lenth et al., 2018).
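As an illustration of the contrast coding just described, the following Python sketch builds the fixed-effect predictors for every cell of the 2 × 2 × 2 design. It is not the R code used for the analysis. It assumes a full factorial fixed-effects structure, leaves out the random effect of Word, and uses hypothetical names (`CODES`, `design_row`). It only demonstrates that ±0.5 coding yields predictors that sum to zero and are mutually orthogonal over the balanced set of design cells.

```python
from itertools import combinations, product

# Contrast codes as given in the text (-0.5 / +0.5 for each two-level factor).
CODES = {
    "language": {"Czech": -0.5, "English": 0.5},
    "voicing": {"voiced": -0.5, "voiceless": 0.5},
    "session": {"1": -0.5, "2": 0.5},
}


def design_row(language: str, voicing: str, session: str) -> list[float]:
    """Intercept, three main effects, three two-way products, one three-way product."""
    l = CODES["language"][language]
    v = CODES["voicing"][voicing]
    s = CODES["session"][session]
    return [1.0, l, v, s, l * v, l * s, v * s, l * v * s]


# One row per cell of the design (8 cells in total).
cells = list(product(CODES["language"], CODES["voicing"], CODES["session"]))
rows = [design_row(*cell) for cell in cells]
columns = [list(col) for col in zip(*rows)]

# Every non-intercept column sums to zero across the cells ...
column_sums = [sum(col) for col in columns[1:]]
# ... and every pair of columns has a zero dot product (orthogonality).
dot_products = [
    sum(x * y for x, y in zip(a, b)) for a, b in combinations(columns, 2)
]

print(column_sums)
print(set(dot_products))
```

One consequence of this coding is that each main-effect coefficient is read as the estimated difference between the two levels of that factor, averaged over the levels of the other factors.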
Results
Analyzing first the number of occurrences of phonologically voiced stops realized as prevoiced (i.e., having negative VOT), we found that for both speakers and both languages the incidence of prevoiced stops increased between sessions: for speaker B for Czech, the number of prevoicings increased from nine out of 28 stops measured in session 1 to 15 out of 28 stops in session 2 (χ²(1) = 2.63, p = 0.1052), and for English from four out of 32 to 13 out of 33 (χ²(1) = 6.08, p = 0.0136). Similarly for speaker A for Czech, the number of prevoicings rose from 12 out of 29 stops in session 1 to 18 out of 28 (χ²(1) = 3.00, p = 0.0834), and for English from eight out of 31 to 25 out of 32 (χ²(1) = 17.28, p < 0.0001).
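The test statistics above can be checked from the reported counts alone. The sketch below assumes a Pearson chi-square test on a 2 × 2 table without continuity correction. The text does not state which variant was used, so this is an assumption, although under it the statistics agree with the reported values after rounding. The function name `pearson_chi2_2x2` is hypothetical.

```python
import math


def pearson_chi2_2x2(hits_1: int, n_1: int, hits_2: int, n_2: int) -> tuple[float, float]:
    """Pearson chi-square (df = 1, no continuity correction) comparing two proportions."""
    a, b = hits_1, n_1 - hits_1  # session 1: prevoiced, not prevoiced
    c, d = hits_2, n_2 - hits_2  # session 2: prevoiced, not prevoiced
    n = n_1 + n_2
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1 the upper-tail probability equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Prevoiced counts from the text: (session 1 prevoiced, n, session 2 prevoiced, n).
counts = {
    ("B", "Czech"): (9, 28, 15, 28),
    ("B", "English"): (4, 32, 13, 33),
    ("A", "Czech"): (12, 29, 18, 28),
    ("A", "English"): (8, 31, 25, 32),
}
results = {key: pearson_chi2_2x2(*vals) for key, vals in counts.items()}

for (speaker, language), (chi2, p) in results.items():
    print(f"speaker {speaker}, {language}: chi2(1) = {chi2:.3f}, p = {p:.4f}")
```

Using only the standard library keeps the check dependency-free; a statistics package would normally be preferred for anything beyond a 2 × 2 table.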
The measured VOT values for each speaker are shown in Figure 1. For both speakers, the linear mixed effects models detected the following significant effects: Intercept, Language, Underlying voicing, Session, Language * Underlying voicing, Underlying voicing * Session; Table 3 gives their estimated effect sizes, t-values and p-values. Unsurprisingly, the main effects of Language and Underlying voicing show that stops have longer VOT in English than in Czech, and that voiceless stops have longer VOT than voiced ones. The main effect of Session showed that overall VOT was shorter in session 2 than in session 1.

Figure 1. Violin plots of the measured voice onset time (VOT) values (in seconds) of word-initial voiceless and voiced stops per speaker, language, and session. The colored shapes represent density plots and are trimmed at the edges to the range of the data. Filled symbols show means. (Color online.)
Table 3. Modeled effects on voice onset time for each speaker. Estimated values are in seconds.
As for the predictors addressing our research questions, the interaction effects of Language and Underlying voicing and the comparison of estimated means listed in Table 4 show that, for both speakers, voiceless stops were produced with different VOT in Czech versus in English, while for voiced stops any between-language difference was smaller or non-existent. The interaction of Underlying voicing and Session, with the estimated means again in Table 4, demonstrates that VOTs differed between session 1 and session 2 for voiced stops for both speakers but to a smaller extent, or not at all, for voiceless ones.
Table 4. Estimated means and t-statistics for the two-way interactions.
Although the three-way interaction of Language, Session, and Underlying voicing was not significant for either speaker, comparison of estimated means shows that in session 1 the difference in VOT between Czech and English voiced stops was significant for speaker B and of comparable magnitude, though not significant, for speaker A (mean Czech minus English difference for speaker B = -18 milliseconds, t = -2.210, p = 0.028; for speaker A = -16 milliseconds, t = -1.482, p = 0.140). In session 2, the effect was numerically smaller and no longer significant for speaker B, and was even in the opposite direction for speaker A (mean Czech minus English difference for speaker B = -13 milliseconds, t = -1.613, p = 0.108; for speaker A = +7 milliseconds, t = 0.664, p = 0.507). These comparisons suggest a somewhat clearer differentiation of the Czech and English voiced stops for speaker B than for speaker A, at least at session 1. Together with visual inspection of the density plots in Figure 1, they also show that, while the Czech and English voiced stops may have had different VOTs before the immersion period (session 1), such a difference was less likely to exist after the immersion (session 2).
Further, a comparison of means shows that for the voiced stops the difference between sessions 1 and 2 was larger for English than for Czech: for speaker A, 63 milliseconds for English (t = 6.773, p < 0.0001) vs 40 milliseconds for Czech (t = 4.138, p < 0.0001) and for speaker B, 35 milliseconds for English (t = 4.736, p < 0.0001) vs 30 milliseconds for Czech (t = 3.754, p = 0.0002), also cf. Figure 1. No such language-specific between-session differences were detected for the underlyingly voiceless plosives in either speaker (all |t| scores < 0.51, all p values > 0.61).
Finally, Figure 1 suggests somewhat greater cross-linguistic similarity of VOT values for voiced stops for speaker A than for speaker B, especially for session 2. The significant two-way interaction of Underlying voicing and Language suggests that there are language-specific effects on VOT of voiced versus voiceless stops. To inspect those, we carried out pairwise comparisons of the estimated means for each Underlying voicing and Language. The comparisons show that speaker B reliably distinguished the Czech versus English voiced stops (difference = 16 milliseconds, t = -2.524, p = 0.0128), while speaker A did so to a smaller extent, if at all (difference = 5 milliseconds, t = -0.526, p = 0.600).
Discussion
Several observations on the results of this study can be made. First, our two speakers produced language-specific VOT values of word-initial stop consonants, as did simultaneous bilinguals in previous studies (Fowler et al., 2008; Simon, 2010; Sundara et al., 2006; Yusa et al., 2010). Both speakers pronounced phonologically voiceless stops with longer VOT in English than in Czech at both times of measurement (despite the shorter duration of English than of Czech trials, which may potentially have led to a higher speech tempo, and thus a shortening of VOT, in English overall). The production of word-initial voiced stops was more variable in two senses. First, both speakers realized some voiced stops in both languages with positive and others with negative VOT (as prevoiced). Second, unlike for voiceless stops, the production of voiced stops underwent a change after the immersion period: prevoicing became more frequent and the VOT values became more negative. In other words, phonetic drift was observed for our simultaneous bilinguals, corroborating previous findings of drifting in bilinguals of other types (Sancier & Fowler, 1997; Tobin et al., 2017) and in second-language learners (Chang, 2012, 2013). Importantly, for both speakers the production of voiced stops changed towards the more Czech-like values not only for Czech but also for English. In fact, in terms of the number of stops realized as prevoiced, and especially for the younger speaker A also in terms of VOT, the drifting had greater magnitude for English than for Czech.
Since the surface phonetic realizations of English versus Czech voiced stops, which initially appeared to be somewhat different (as they did in a previous single-probe case study of an early English–French bilingual child by Mack, 1990), drifted together (and their initial potential difference largely dropped away), we conclude our bilinguals have English–Czech voiced stop categories integrated at a more abstract level of sound representation. Such integration may be caused by a cross-language “equivalence classification” (Flege, 1995) of the corresponding sounds (Chang, 2019, pp. 192–193). Furthermore, the drifting we observed provides insight into underlying representation in another way. Although at the time of the first recording the VOT values of English phonologically voiced stops and of Czech phonologically voiceless (unaspirated) stops had largely overlapping distributions if outliers are excluded (see Figure 1), it is clear that they represent separate categories because drifting affected only the English voiced stops.
While Tobin et al. (2017) found variation in the extent of phonetic drifting for a heterogeneous group of late bilinguals, our two speakers displayed considerable similarity in their productions both before and after the immersion-induced drifting. This is perhaps to be expected since they are sisters and thus have received similar input in both languages. One indication of a difference between our speakers was that the 16-year-old (speaker B) seemed to maintain a somewhat clearer differentiation of the Czech and English voiced stops than her 13-year-old sister (in other words, for speaker A the phonetic drift in English voiced stops was somewhat larger in magnitude than for her older sister). This is in line with suggestions in the previous literature (Sundara et al., 2006) that a differentiation of voiced stops in bilinguals whose languages have different VOT settings for stops develops slowly.
The current study was designed so that the main comparison, assessing whether phonetic drift occurred, is between the two recording times within each language. Comparisons between English and Czech VOTs are possible (though not necessary for our main research questions) but only with caution. The reason is twofold. First, as stated above, there was a difference between the two language conditions in trial length, Czech trials lasting longer than English (to provide the speakers with more time to access the lexical items in their non-dominant language). The longer duration of Czech trials could potentially have encouraged slower speech tempo, and thus a general lengthening of the VOT. For voiceless stops that would mean an increased similarity between Czech and English VOT and for voiced stops an increased difference. Even though the differences between English and Czech VOTs are in the expected direction suggesting that no large systematic changes were induced by the differing trial duration, it is still possible that the VOTs measured for English versus Czech would differ somewhat if trial duration had been constant. Second, the counts of the stops with different places of articulation measured (given above in Table 1) were only roughly comparable between the two languages and since VOTs differ universally across places of articulation (Cho & Ladefoged, 1999), this could potentially have introduced some difference between the values we measured for each language. Relatedly, future research, carefully controlling for place of articulation and having a larger data set than the one we could obtain from our speakers, could assess the amount of drift across different places of articulation, as it may not necessarily be the same (as it was not for the novice late second-language learners in Chang, 2012). In addition, future research may determine whether comparable phonetic drift occurs for stops in other positions (word-medial and word-final).
This being a case study, a further limitation is that the number of participants recruited was very small. However, previous longitudinal studies of early bilinguals have also been case studies (Simon’s (2010) and Yang et al.’s (2015) studies both have a single participant). This is because early bilinguals available for longitudinal investigations are relatively scarce. It is possible that the patterns of phonetic drift observed in our two speakers (who are siblings) are typical of simultaneous bilinguals. However, further research with a larger, more representative sample is necessary to determine with confidence whether this is the case.
In summary, despite some limitations, our data from two teenage simultaneous English–Czech bilinguals provide evidence of phonetic drift due to a change in the ambient linguistic environment. We show that phonetic drift can reveal cross-language connections (cf. the English and Czech voiced stops in our study), as well as separations for sounds with similar phonetic realizations in both languages (cf. the English voiced and Czech voiceless stops in session 1 in our study). Therefore, observing phonetic drift can be informative about the organization of phonological space in simultaneous bilinguals.
Acknowledgments
We are grateful to two anonymous reviewers for valuable comments on an earlier version of this paper. We thank David Ryška for technical assistance.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Czech Science Foundation (GACR) (Grant Number: 18-01799S).
