Abstract
The present investigation examines English speakers’ ability to identify and discriminate a non-native consonant length contrast. Three groups (L1 English No-Instruction, L1 English Instruction, and L1 Finnish control) performed a speeded forced-choice identification task and a speeded AX discrimination task on Finnish non-words (e.g. /hupo/–/huppo/) which were manipulated for intervocalic consonant duration. The results indicate that basic information, focusing the participants’ attention on a particular contrast, assists novice listeners in processing a non-native contrast. We find support for a phonetic level of processing, intermediate between non-linguistic acoustic processing and phonemic processing, at which the phonetic cue of duration becomes significant. We interpret the results in relation to the Speech Learning Model (Flege 1995, 2003).
I Introduction
Languages differ in the phonemic distinctions they make; for example, languages like Finnish, Arabic, and Japanese maintain a contrast for consonant length, while English does not. In language learning, these kinds of non-native contrasts can be rather difficult for some learners to perceive. However, in many cases learners eventually succeed in perceiving non-native contrasts, despite the fact that they require learners to make use of acoustic cues that are not relevant in their native phonology. We explore the extent to which basic information about phonemic consonant length aids naïve listeners in using the acoustic cue of duration in the perception of a non-native contrast.
Cross-linguistically, for languages containing singleton and geminate consonants, duration is the primary cue to this distinction (Hankamer, Lahiri, and Koreman, 1989; Lahiri and Hankamer, 1988; Lehtonen, 1970; Ylinen, Shestakova, Alku, and Huotilainen, 2005). However, secondary cues such as preceding vowel duration and formant transitions may also exist for length contrasts, as may be the case in Arabic (Obrecht, 1965). Additionally, the ratio between singleton and geminate consonants may be the overriding cue in the face of fluctuating speaking rates, as seen in Japanese (Hirata and Whiton, 2005). While segmental duration does not constitute a phonemic distinction for English speakers, it does play a role in the production and perception of segments and words in English (Klatt, 1977; Lisker, 1957; Oller, 1973; Repp, 1978; Umeda, 1977). Klatt (1977) finds that the average duration of English consonants (necessarily singletons) is approximately 70 ms, though this can vary depending on factors such as phonetic context, position within a word, and syntactic environment. For example, consonants in general are longest in word-initial position and shortest word-medially (Klatt, 1977). Umeda (1977) indicates that, on the whole, consonants in word-internal, post-stress, intervocalic position are all shorter than when in word-initial position. Oller (1973) concludes that lengthening in certain utterance positions is a learned aspect listeners use as a cue for identifying the boundaries of words and phrases. This is evident from the work of Pickett and Decker (1960), who show that English speakers are sensitive to duration across morphological boundaries. When the stop closure of /p/ was manipulated, listeners began to judge the word topic as top pick at a threshold of approximately 175–200 ms. Most durations below 150 ms were perceived as single, and most above 250 ms were perceived as double, though there was a bias toward rating all closure durations as single.
Repp (1978) finds similar results with stop consonants, showing that for VC and CV stimuli played in succession, a closure duration of more than 200 ms is needed to hear VC and CV as separate phonemic events, and for closure durations below this boundary the cues are integrated into a single phonemic percept. Additionally, Lisker (1957) shows that length is the significant cue for voicing of a /p/–/b/ contrast in a post-stress, intervocalic position. More recent work has shown that articulatory strengthening, in particular segment duration, is a significant marker of prosodic boundaries, including word boundaries and in some cases morpheme boundaries, in English (Byrd and Choi, 2010; Byrd, Krivokapić, and Lee, 2006; Fougeron and Keating, 1997; Kaye, 2005; Keating 2006). Therefore, while duration does not constitute a phonemic feature in English, speakers are sensitive to it in many contexts.
Finnish, however, maintains a phonemic distinction for consonant length. Of its 13 consonants, eight occur as geminates in intervocalic position: /p, t, k, s, m, n, l, r/. According to Lehtonen (1970), these geminate consonants are approximately two times the duration of their singleton counterparts, though durations do vary by consonant. The average singleton consonant is approximately 77 ms (with consonant averages ranging from 45 ms for /r/ [trilled] to 102 ms for /p/). The average geminate consonant is approximately 163 ms (with consonant averages ranging from 124 ms for /l/ to 205 ms for /k/). The contrast ratios between singleton and geminate consonants range from 1:1.85 to 1:2.61 (Lehtonen, 1970). Native speakers of Finnish, like speakers of other geminating languages, perceive singleton and geminate consonants categorically (Heeren and Schouten, 2008, 2010; Takeuchi, 2010; Ylinen, Shestakova, Alku, and Huotilainen, 2005).
English speakers are not expected initially to be capable of perceiving this categorical length distinction as they do not maintain this phonemic contrast natively, with segmental duration serving as a cue to other suprasegmental phenomena. When learning a second language (L2), this type of non-native phonemic contrast can be difficult for English speakers to begin to perceive and produce and therefore can be quite problematic. Previous work has shown that the first language (L1), and the degree to which duration is used in its phonology, influences how well inexperienced listeners perceive a phonemic length contrast. McAllister, Flege, and Piske (2002) showed that participants’ success in learning the Swedish vowel length contrast is related to the role of the duration feature in the first language. Estonian speakers, who have a length contrast for both vowels and consonants, performed much like Swedish controls, while English speakers performed differently, though still better than Spanish speakers. The reason for this difference is likely to be that vocalic duration is a cue to the tense–lax distinction in English (House, 1961). Therefore, the phonetic cue of duration, while not a prominent feature of the L1 phonology, can be used during L2 acquisition (McAllister et al., 2002). As for consonant length contrasts, Altmann, Berger, and Braun (2012) found that German speakers (whose native language has a vocalic length contrast) can perceive Italian consonant length, though their perception still differs clearly from that of native Italian speakers, and this ability increases with increased experience.
Numerous studies have shown that inexperienced listeners are not capable of perceiving non-native phonemic length contrast but that this ability to distinguish singleton and geminate consonants generally increases with increased exposure to the language (Hayes, 2002; Heeren and Schouten, 2008; Kato and Tajima, 2002; Takeuchi, 2010; Ylinen, Shestakova, Alku, and Huotilainen, 2005). Additionally, auditory training has been shown to increase detection of non-native contrasts (Hirata, Whitehurst, and Cullings, 2007; Motohashi-Saigo and Hardison, 2009; Sonu, Kato, Tajima, Akahane-Yamada, and Sagisaka, 2013; Tajima, Kato, Rothwell, Akahane-Yamada, and Munhall, 2008). While non-native contrasts may be difficult, it has also been shown that perception is malleable and that different factors can influence it. In language, information of varying degrees of salience, and the resulting attention, can aid listeners in perceptual tasks. In an experiment by Niedzielski (1999), social information (dialect group) was shown to influence speech perception of vowels. Detroit-area residents were presented with sentences in which they were asked to listen to and concentrate on a vowel they heard in a particular word from that sentence. They were then asked to match that vowel to a set of six computer-resynthesized vowels. Half of the participants were informed the sentences were spoken by a fellow Detroit resident and the other half were informed the same speaker was from Canada. Respondents given the Canadian label chose raised-diphthong tokens as those present in the dialect of the speaker, whereas those given the Michigan label did not choose the raised tokens, even though the stimuli were identical for each group of participants.
Schulman (1983) showed that telling Stockholm dialect speakers that they were listening to English rather than Swedish words significantly improved their ability to discriminate between two vowels that were merged in their dialect. Guion and Pederson (2007) found that experimentally orienting listeners’ attention toward the speech signal rather than word meaning increased discrimination of difficult-to-learn non-native contrasts (in this case, Hindi stop consonants). Similarly, Hisagi and Strange (2011) showed that naïve English listeners could perform a categorical discrimination task for three types of Japanese temporally-cued contrasts (vowel length, consonant length and syllable number/length) well above chance when the three contrast types were presented in separate blocks with detailed instructions about what to listen for. However, when the contrast types were presented randomly with no specific instructions about the nature of the contrasts, performance was significantly poorer. This is an indication that listeners’ ability to make phonetic judgments is at least in part due to attention directed to the nature of the contrast. Therefore, attention, that is, (sub)conscious focused awareness, can shape the L2 learning process (Ellis, 2006), following notions of ‘noticing’ and ‘intake’ as per Schmidt (1990). We suspect that attention may play a role in how listeners begin to move toward more native-like perception of a non-native contrast in the process of category formation.
These attentional influences on perception may have to do with the underlying processing strategies that listeners employ. Werker and Logan’s (1985) study provides evidence for the differentiation of phonemic, phonetic, and auditory levels of processing. Specifically, they claim that when listeners perceive stimuli according to the phonological categories of their native-language, they demonstrate ‘phonemic’ perception. However, when they show sensitivity to phonetic cues that are used in another language, they use ‘phonetic’ perception. Throughout this article, we refer to this use of phonetic perception as phonetic processing. Further, children also demonstrate the ability to learn phonetic cues when acquiring an L2. Examining the acquisition of English voice onset time in native Spanish speaking children, Williams (1979) concluded that learning at the phonetic level does occur during L2 acquisition. This may occur by listeners using phonetic processing as a means of beginning to form phonological categories in the L2.
The roles of phonetic processing and attention are crucial to our understanding of language learning and our models of perception during L2 acquisition. Flege’s (1995, 2003) Speech Learning Model proposes that even adult listeners retain the ability to detect non-native phonetic variation and use those acoustic-phonetic cues for the purposes of category formation. However, this model does not address the role of attention and its relation to phonetic processing. The studies indicated above, which have begun to investigate the influences of attention, thus far have only focused on discrimination rather than identification of non-native contrasts. Identification allows us to investigate how listeners identify and categorize the incoming stimulus in the absence of any context, while discrimination allows the listener to make a more direct comparison between two sounds, perhaps by using finer phonetic differences. Examining identification also allows us to examine out-of-context behavior at the earliest stages of category formation.
The current study focuses on English speakers’ ability to categorize and discriminate a non-native consonant length contrast and the effects of instruction (basic information) and thus cue-attention on both tasks. As English does not contain a phonemic consonant length distinction, we predict that English listeners will not discriminate or identify Finnish geminates in a native-like manner. However, due to the established role of length in English, increasing consonant duration is expected to lead to greater categorization of geminate consonants, and increasing contrast ratio between consonant durations is expected to lead to greater discrimination of the length contrast among listeners. Additionally, we predict that drawing attention to the contrast by means of basic instruction will increase listeners’ ability to use duration as a cue to this categorical distinction. While it is not expected that English listeners will achieve a native-like phonemic contrast, information about and attention to consonant duration will allow listeners to use a more phonetic processing strategy, thus enabling them to move in the direction of native Finnish patterns of discrimination and identification. This allows us to examine how naïve listeners begin to carve up phonetic space when they attend to a cue relevant in another language.
II Experiment 1
1 Methodology
We first investigate the above predictions in a speeded forced-choice identification task designed to examine naïve English listeners’ ability to identify consonant length. For this task, we systematically varied the duration of word-medial intervocalic consonants along a continuum which served as the primary manipulation. There were two groups of naïve English listeners; in one the listeners were only informed that they would hear a foreign language, while in the other they were given basic information in the form of instruction regarding the nature of the Finnish consonant length. A third control group consisted of native Finnish speakers. While it was not expected that English speakers would make a phonemic distinction, a task of this sort encourages listeners to compare the stimulus with an existing representation of English consonant duration, thus relying on a more phonological strategy. It also allows us to examine if naïve listeners make use of consonant duration as a cue, and if detection is enhanced in an experimental setting. Again, by including a control group it is possible to compare and contrast participants in both experimental groups to those with native competency.
a Participants
Sixty participants were recruited for participation in this study. Forty were native speakers of English enrolled in a first year linguistics course at the University of Alberta, Canada and received course credit for their participation. These participants had varying degrees of experience with a foreign language, though none had any experience with a language containing a singleton/geminate consonant distinction. Of these native English speakers, half were randomly assigned to the first group (No-instruction) in which they were only told that they would hear words from an unspecified foreign language, while the other half were assigned to the second group (Instruction) in which they were informed that they would be hearing words from Finnish. This second group also received some basic, written instruction regarding the Finnish phonemic length contrast, designed to focus their attention on the contrast. Specifically, the instructions state: Finnish is a language which contains many interesting features. One such feature is that it distinguishes between long and short consonants. For example, the word mato means ‘worm’ and matto means ‘carpet’. These long consonants are approximately twice the length of short consonants.
The remaining 20 participants were native Finnish speakers affiliated with the University of Turku in Finland and had varying degrees of experience with a foreign language. They were told that they would hear Finnish non-words during the experiment. The participants in this group received no compensation for their participation. Finnish participants were run in Finland by the speaker who produced the stimuli. All participants in this study reported having normal hearing. English-speaking participants were asked to report whether or not they were proficient in another language and, if so, in which language. Here proficiency was defined for participants as being comfortable accomplishing daily tasks in a second language. Among the 20 participants in the No-Instruction group, 9 reported proficiency in an L2. Among the 20 participants in the Instruction group, 4 reported proficiency in an L2. Of the 13 who reported proficiency, 9 reported that the L2 was French.
b Materials
For this study we chose 16 disyllabic Finnish non-words, eight CVCV and eight CVCCV, for the consonants /p, t, k, s, m, n, l, r/. The non-words (Table 1) conform to Finnish phonotactics, creating minimal pairs for each of the consonants allowing gemination. Other consonants, such as /b/ and /d/, do occur in Finnish, but primarily in loanwords or across specific morphological boundaries, and were thus excluded from this study. While it was not possible to completely control the vocalic context across singleton/geminate pairs, all experimental minimal pairs have primary stress on the first syllable in accordance with standard Finnish phonology.
Table 1. Finnish non-word stimuli.
A male native Finnish speaker was recorded producing 10 repetitions of each item in isolation. He read 20 different randomized wordlists: 10 containing singletons, 10 containing geminates. The presentation alternated between singleton and geminate wordlists, beginning with a singleton list. In order to avoid list intonation, three filler non-words were placed at both the beginning and end of each randomized list. Additionally, a single repetition of 16 distracter non-words, needed for Experiment 2, was recorded by the same speaker. The recordings were made in a sound-attenuated booth using a unidirectional head-mounted condenser microphone (Countryman E6I) and digital recorder (Korg MR2000). The sampling rate and bit depth were set at 44.1 kHz and 16 bits, respectively, and phantom power was supplied through an Alesis MultiMix 8.
In Praat (Boersma and Weenink, 2011), we segmented the items and consonants by hand. For stops, duration was measured from the offset of voicing of the preceding vowel up to the burst release. For nasals, duration boundaries were identified on the basis of nasal formants and decrease in amplitude. For /r/ duration boundaries were clear and placed at the beginning and end of the trill. The boundaries of /l/ presented difficulty as formant structure from surrounding vowels was present throughout; therefore, decrease in amplitude was used to identify the boundary and was verified auditorily. We extracted the duration measurements and averaged across the 10 repetitions for the singleton and geminate form of each consonant. We then calculated the average difference in duration (i.e. average geminate minus average singleton) and the contrast ratio (i.e. average geminate divided by average singleton) for each consonant pair (Table 2).
Table 2. Average consonant durations, differences and ratios.
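The averaging, difference, and ratio calculations described above can be illustrated with a minimal Python sketch. The function name `summarize_pair` and the duration values are hypothetical and chosen only for illustration; they are not the measurements reported in Table 2.

```python
from statistics import mean


def summarize_pair(singleton_ms, geminate_ms):
    """Summarize one consonant pair from repeated duration measurements (ms).

    Returns (mean singleton, mean geminate, difference, contrast ratio),
    where difference = geminate - singleton and ratio = geminate / singleton.
    """
    avg_singleton = mean(singleton_ms)
    avg_geminate = mean(geminate_ms)
    difference = avg_geminate - avg_singleton
    ratio = avg_geminate / avg_singleton
    return avg_singleton, avg_geminate, difference, ratio


# Hypothetical durations for 10 repetitions of one consonant (illustrative only).
singleton_reps = [98, 102, 100, 101, 99, 103, 97, 100, 100, 100]
geminate_reps = [195, 205, 200, 198, 202, 201, 199, 200, 200, 200]

avg_s, avg_g, diff, ratio = summarize_pair(singleton_reps, geminate_reps)
print(f"singleton={avg_s} ms, geminate={avg_g} ms, diff={diff} ms, ratio=1:{ratio:.2f}")
```

Applied to the overall averages cited from Lehtonen (1970), 77 ms and 163 ms, the same arithmetic gives a difference of 86 ms and a contrast ratio of roughly 1:2.12.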
With the average difference known, we created a continuum of 10 steps for each consonant, representing durations between (and beyond) the values of an average singleton and an average geminate. The values along the continuum were calculated by adding/subtracting multiples of 25% of the average difference (in ms) to the average singleton and geminate values. In doing so, it is possible to synthesize the same continuum (i.e. equal endpoints) from both a naturally produced singleton and a naturally produced geminate (for example, see Figure 1). The goal of producing the two continua was to neutralize the effect of possible secondary cues from neighboring sounds. Using these duration values, we synthesized the experimental stimuli using PSOLA (Pitch Synchronous Overlap and Add) in Praat. Each of the resulting 160 stimuli was coded for its step on the continuum and for whether it was synthesized from a singleton or a geminate. The naturalness of all stimuli was independently evaluated and verified by two native Finnish-speaking linguists.

Figure 1. Example consonant manipulation for /hupo/–/huppo/.
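The construction of the duration continuum can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used in Praat: the function name `duration_continuum` is hypothetical, and the text does not specify which step coincides with the average singleton, so the `singleton_step` default below is an assumption made only so that the example runs.

```python
def duration_continuum(singleton_ms, geminate_ms, n_steps=10,
                       singleton_step=3, fraction=0.25):
    """Return target durations (ms) for continuum steps 1..n_steps.

    Adjacent steps differ by `fraction` (25%) of the average
    geminate-minus-singleton difference. `singleton_step` is the 1-based
    step assumed to equal the average singleton (an assumption, see above).
    """
    step_size = fraction * (geminate_ms - singleton_ms)
    return [singleton_ms + (k - singleton_step) * step_size
            for k in range(1, n_steps + 1)]


# Hypothetical averages: singleton 100 ms, geminate 200 ms.
continuum = duration_continuum(100.0, 200.0)
for step, duration in enumerate(continuum, start=1):
    print(f"step {step}: {duration:.1f} ms")
```

Because the step size depends only on the two averages, the same list of target durations can be imposed on a natural singleton token and on a natural geminate token, which is what allows both synthesis sources to share identical endpoints.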
c Procedure
The stimuli were presented using the software ACTUATE (Westbury, 2007). The experiment was run in a quiet room in which participants were seated at a comfortable distance in front of the computer and were fitted with over-the-ear headphones (MB QUART QP-805 HS) adjusted to a comfortable volume. Written, on-screen instructions, which varied depending on group (i.e. Native, Instruction, or No-Instruction), informed the participants about the task ahead. A block of practice stimuli was provided before proceeding on to the experiment. Participants heard one auditory stimulus at a time and were simultaneously presented with the visual word in question, which contained a blank (e.g. ‘Hu__o’) in place of the consonant. This was done to help the listener focus on the intervocalic consonant. Participants were instructed to decide if the second consonant (i.e. intervocalic) was short or long and respond by pressing one of two buttons (‘S’ for short, ‘L’ for long) on the computer keyboard. Participants were instructed to respond as soon as they made a decision and doing so advanced the program to the next trial. Stimuli were randomized and three short breaks were provided, during which the instruction, if relevant, was reinforced. In addition to response, reaction time was recorded for each trial. All participants completed the forced-choice identification task and the AX discrimination task (Experiment 2) in one session. Half of the participants were randomly assigned to complete Experiment 1 first.
2 Results
We examine perception of consonant length using mixed-effects logistic regression (Baayen et al., 2008; Jaeger, 2008; Morrison, 2007). Mixed-effect regression modeling has a number of advantages over analysis of variance (ANOVA) for data of this type. It allows for the inclusion of participant and item random effects as well as both continuous and factorial variables in one statistical model.
We fitted models with the responses (i.e. Short or Long) as the dependent variable. We present two analyses: the first contains data from all three groups and the second is performed on only the two experimental groups. As it is not expected that either experimental group will perform in a native-like manner, the first analysis examines their performance against the baseline, the native Finnish group. Any possible differences between the performance of the experimental groups are addressed in the second analysis.
In both analyses, Group and Duration are the experimental variables of interest, with Duration taken as a continuous variable as its steps are equally spaced. Additionally, other variables were considered in each model. Item and Participant were included as random effects (the specific random intercepts and slopes are discussed below). Control variables were also included, allowing for item properties (Consonant and Synthetic Original), participant properties (Second Language Proficiency and Log Reaction Time) and experiment properties (Experiment Order and Stimulus Order) to influence the response variable. A description of each variable can be found in Table 3. All analyses were conducted in R using the lme4 package (Bates et al., 2012) and the languageR package (Baayen, 2011) for plotting the resultant models.
Table 3. Variables considered for analysis.
Reaction times were measured from the end of the stimulus. Responses made before the end of the stimulus were removed from the dataset. This resulted in the removal of 72 data points (0.75%). The reaction times were then log transformed to normalize the distribution and 28 outliers (0.29%), evenly spread across listener groups, were removed from the tails (less than 3.75 and greater than 9.5 on the log scale). We fitted a model for each dataset with all possible predictors including random intercepts for both Item and Participant, along with complex random structure (e.g. random slopes for the interaction of Group and Duration by Participant). By-participant random intercepts allow us to account for the possibility that some participants may be better (or worse) at the task. Similarly, by-item random intercepts allow for adjustments as some items may be easier (or harder) to categorize. The same logic applies to random slopes, which allow the response across a particular variable to vary by participant. A backward stepwise elimination procedure was used for fitting each model: starting with all available predictor variables, we removed one by one those variables that did not significantly improve the model, as indicated by likelihood ratio testing.
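The reaction-time trimming described above can be sketched in Python as follows. This is an illustration only (the analyses themselves were run in R). It assumes that reaction times are in milliseconds relative to stimulus offset, so that responses made before the end of the stimulus appear as non-positive values, and that the log scale is the natural logarithm; neither detail is stated explicitly in the text. The function name `preprocess_rts` is hypothetical.

```python
import math

LOG_RT_MIN = 3.75  # lower trimming bound on the log scale (from the text)
LOG_RT_MAX = 9.5   # upper trimming bound on the log scale (from the text)


def preprocess_rts(rts_ms):
    """Drop early responses, log-transform, and trim the tails.

    Assumes RTs are measured in ms from stimulus offset (so early responses
    are <= 0) and that the natural log is used; both are assumptions.
    """
    kept = []
    for rt in rts_ms:
        if rt <= 0:  # response made before the end of the stimulus
            continue
        log_rt = math.log(rt)
        if LOG_RT_MIN <= log_rt <= LOG_RT_MAX:
            kept.append(log_rt)
    return kept


# Hypothetical RTs: one early response, one too fast, one typical, one too slow.
log_rts = preprocess_rts([-50, 30, 500, 20000])
print(log_rts)

# The reported percentages are consistent with 60 participants x 160 stimuli.
total_trials = 60 * 160
print(round(72 / total_trials * 100, 2), round(28 / total_trials * 100, 2))
```

Under the natural-log assumption, the bounds of 3.75 and 9.5 correspond to roughly 43 ms and 13.4 s, respectively.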
a Analysis 1
In fitting a model to this dataset, four input variables – Stimulus Order, Second Language Proficiency, Experiment Order, and Synthetic Original – were eliminated during the backward elimination process. Additionally, a significant interaction between Group and Duration resulted. Table 4 summarizes the final model.
Table 4. Summary of fixed effects for Identification Task model in Analysis 1.
The interaction between Duration and Group can be seen in the summary of fixed effects (Table 4). Additionally, for comparative purposes, Figure 2 presents both aggregated data for Duration by Group along with a plot of the model estimates for this interaction. In comparison to the Native group, the No-Instruction group is less likely to respond ‘Long’ as consonant duration increases (β = −1.21, SE = 0.10, p < 0.001); that is, the slope for the Native group is significantly steeper than that for the No-Instruction group. Likewise, the interaction shows that in comparison to the Native group, the Instruction group is less likely to respond ‘Long’ as consonant duration increases (β = −1.05, SE = 0.10, p < 0.001). Again the coefficient indicates that the slope for the Native group is significantly steeper than the slope for the Instruction group. As it is clear that neither experimental group performed in a native-like fashion, the possible difference between the Instruction and No-Instruction groups is explored separately in Analysis 2.

Figure 2. The left panel presents aggregate data of proportion of ‘Long’ responses along Duration by Group with standard error bars. The right panel presents probability estimates from the model for this interaction when all other variables are held constant at their reference level or median value.
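To make the interaction coefficients concrete, the following sketch shows how, under treatment coding with the Native group as the reference level, each Duration-by-Group coefficient adjusts the Duration slope on the log-odds scale. The two interaction estimates are those reported above; the Native slope itself is not given in this section, so `NATIVE_SLOPE` is a purely hypothetical placeholder, and the names used here are illustrative.

```python
import math

# Hypothetical Duration slope (log-odds per step) for the Native reference group.
NATIVE_SLOPE = 2.0

# Duration-by-Group interaction estimates reported for Analysis 1.
INTERACTION = {"Native": 0.0, "No-Instruction": -1.21, "Instruction": -1.05}


def group_slope(group, native_slope=NATIVE_SLOPE):
    """Duration slope for a group: reference slope plus interaction term."""
    return native_slope + INTERACTION[group]


for group in INTERACTION:
    slope = group_slope(group)
    # exp(slope) is the multiplicative change in the odds of 'Long' per step.
    print(f"{group}: slope={slope:.2f}, odds multiplier per step={math.exp(slope):.2f}")

# The difference between the two experimental groups does not depend on the
# hypothetical Native slope.
slope_gap = group_slope("Instruction") - group_slope("No-Instruction")
print(f"Instruction minus No-Instruction: {slope_gap:.2f}")
```

Whatever the true Native slope, the two interaction terms imply a difference of about 0.16 between the experimental groups, which is in line with the slope difference of 0.17 estimated directly in Analysis 2.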
b Analysis 2
In fitting a model to this subset of data, one input variable – Experiment Order – was eliminated during the backward elimination process. Additionally, interactions between Duration and Group and Group and L2 Proficiency were statistically significant. Table 5 summarizes the final model. For comparative purposes, Figure 3 presents both aggregated data for Duration by Group along with a plot of the model estimates for this interaction. The interaction between Duration and Group indicates that the two experimental groups are indeed performing differently. With the No-Instruction group as the reference level, the Instruction group has a significantly steeper slope (β = 0.17, SE = 0.08, p = 0.03). Thus, as consonant duration increases, the Instruction group is more likely to respond ‘Long’. The overall effect size for the No-Instruction group is 0.57, while the overall effect size for the Instruction group is 0.79.
Table 5. Summary of fixed effects for Identification Task model in Analysis 2.

Figure 3. The left panel presents aggregate data of proportion of ‘Long’ responses along Duration by Group with standard error bars. The right panel presents probability estimates from the model for this interaction when all other variables are held constant at their reference level or median value.
A post-hoc comparison found a significant interaction between Group and L2 Proficiency (Figure 4). While the model does not perform pairwise comparisons, the interaction suggests that participants who are proficient in an L2 provide more ‘Long’ responses when they do not receive instruction.

Figure 4. The left panel presents aggregate data of proportion of ‘Long’ responses by L2 proficiency and Group with standard error bars. The right panel presents probability estimates from the model for this interaction when all other variables are held constant at their reference level or median value.
3 Discussion
The results of this speeded identification task show that native Finnish speakers begin to perceive geminate consonants in the expected range, starting around step four (see Lehtonen, 1970). This replicates and confirms previous results showing a phonological boundary for consonant length in native Finnish speakers (see Heeren and Schouten, 2008, 2010; Ylinen et al., 2005).
Naïve English listeners of Finnish can and do detect increasing consonant duration. Unsurprisingly, they do not perform like native Finnish speakers. Instead, they show a more gradient performance, as predicted. However, in comparison to the No-Instruction group, the Instruction group’s ability to categorize consonants of varying durations as ‘Long’ is significantly enhanced by the simple knowledge that a consonant length distinction exists in the language. Additionally, their performance shifts (albeit slightly) in the direction of native Finnish speakers. This provides evidence that, while native English speakers have no phonological category for consonant length, they can compare the percept to some phonological reference point (presumably that of English consonant duration). The No-Instruction group, not knowing that duration is an important within-word feature, did not respond in the same fashion. This type of attentional effect has not previously been explored in identification tasks with naïve listeners, which sheds light on category assignment at the earliest stages of learning.
Of additional interest is that L2 experience may influence how English speakers begin to deal with fluctuating consonant durations. We interpret this interaction cautiously as it was not a primary manipulation in this study and L2 proficiency was not completely balanced across groups. However, the trend in the data indicates that L2 proficiency may play a role in a listener’s ability to use a particular cue when instructed about its existence. We believe that this deserves attention in future experimentation. This experiment has examined listeners’ ability to use a primarily phonological strategy for identifying consonant length and how information regarding the contrast influences the ability to categorize consonants of varying durations. The following AX discrimination experiment examines listeners’ ability to discriminate pairs of stimuli varying in duration of their intervocalic consonants requiring them to use a more phonetic strategy. Similar issues (e.g. consonant duration, group differences, and L2 proficiency) are explored in Experiment 2 as a comparison with Experiment 1.
III Experiment 2
1 Methodology
In this experiment we used a speeded same-different task to investigate the listeners’ ability to discriminate a consonant length contrast. We augmented Hayes’ (2002) design by creating a speeded AX discrimination task in which the graded contrast ratio between the two stimuli was the primary manipulation. The same participants from Experiment 1 also completed this experiment and were assigned to the same groups, two of which involved naïve listeners of Finnish. In one group the listeners were only informed they would hear a foreign language; in the other, they were given basic information regarding the nature of the Finnish consonant length contrast. Again, the native Finnish speakers served as controls. This task induces listeners to attend more carefully to the acoustic forms of the two items and make a direct comparison. Because it is not expected that English listeners would make a native-like phonemic distinction, a task of this sort encourages listeners to make a direct auditory comparison relying heavily on a phonetic discrimination strategy. It also allowed us to see if naïve listeners use consonant length as a cue, and if detection can be enhanced in an experimental setting. As with Experiment 1, by including a control group, it is possible to compare and contrast participants in both groups against those with native competency.
a Participants
Participants and groups (Native, Instruction, and No-instruction) were the same as in Experiment 1.
b Materials
The same stimuli from Experiment 1 (Steps 1–9) were used to create stimulus pairs for Experiment 2. These pairs were made by matching tokens (Steps 1–9) so that the duration difference between the consonants, which we refer to as Contrast, ranged from 0% average singleton/geminate difference to 200% average singleton/geminate difference in steps of 50% (Table 6). Given that average differences varied by consonant (see Table 2), percentages have been used to describe the general contrast schema. In Table 6, the column labeled Pairing indicates which steps were combined to obtain the corresponding contrast. All pairs contained acoustically manipulated materials, one item synthesized from a geminate and one from a singleton. Pairings were also balanced so that both longer and shorter items occurred in the first position. This resulted in 144 experimental items, 16 of which were of Contrast 1. Additionally, 144 filler pairs were added, which consisted of a filler word (see Table 7) matched with a stimulus word (natural recording) so that they differed either by the word-initial consonant or by the vowel in the first syllable (e.g. the filler jato was matched with vato). Thus, in the event that participants did not perceive a difference in intervocalic consonant duration, these fillers would provide more easily discriminable differences.
Table 6. Description of consonant contrast labels.
Table 7. Filler non-words for Experiment 2.
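The contrast schema described above can be illustrated with a minimal sketch. It assumes only what the Materials section states: Contrast is expressed as a percentage of a consonant's average singleton/geminate duration difference, ranging from 0% to 200% in steps of 50%, and each pairing is presented in both orders. The function names and the 100 ms average difference used in the comment are hypothetical; the actual per-consonant values are those given in Table 2, and the actual pairings are those listed in Table 6.

```python
# Contrast levels stated in the text: 0% to 200% of the average
# singleton/geminate duration difference, in steps of 50%.
CONTRAST_PERCENTAGES = [0, 50, 100, 150, 200]


def contrast_differences_ms(avg_difference_ms):
    """Map each contrast percentage to a duration difference in ms.

    avg_difference_ms is the average singleton/geminate difference for one
    consonant. Any value passed in here is a HYPOTHETICAL placeholder,
    e.g. contrast_differences_ms(100.0); the real values vary by consonant.
    """
    return {pct: avg_difference_ms * pct / 100 for pct in CONTRAST_PERCENTAGES}


def presentation_orders(shorter_item, longer_item):
    """Return both presentation orders for a pair, so that the longer and the
    shorter item each occur in first position (order balancing)."""
    return [(shorter_item, longer_item), (longer_item, shorter_item)]
```

Expressing Contrast as a percentage rather than in milliseconds keeps the scale comparable across consonants whose average singleton/geminate differences are not the same.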
c Procedure
The stimuli for this task were presented using the software ACTUATE (Westbury, 2007). The experiment was run during the same session as Experiment 1 with order of completion balanced across participants. Half of the participants were randomly assigned to complete Experiment 2 first. Written, on-screen instructions, which varied depending on group (i.e. Native, Instruction, or No-Instruction), informed the participants about the task, and a block of practice stimuli was provided. Participants were presented with two successive auditory stimuli (i.e. a stimulus pair) 500 ms apart, after which they responded whether the stimuli were instances of the same word or instances of two different words. They indicated their response by pressing one of two buttons (‘S’ for same, ‘D’ for different) on the computer keyboard. Participants were instructed to respond as soon as they made a decision. Doing so advanced the program to the next trial. Stimuli were randomized and three short breaks were provided, during which the instruction, if relevant, was reinforced. In addition to the response, reaction time was recorded for each stimulus pair.
2 Results
We examine perception of consonant length contrast using mixed-effects logistic regression models (Baayen et al., 2008; Jaeger, 2008; Morrison, 2007). We fitted models with the responses (i.e. Same or Different) as the dependent variable. As in Experiment 1, we present two analyses. The first analysis contained data from all three groups, and the second was performed on only the two experimental groups. As it is not expected that either experimental group will perform in a native-like manner, the first analysis examines their performance against the baseline of the native Finnish group. Any possible differences between the performance of the experimental groups are addressed in the second analysis. The same variables as in Experiment 1 were considered (see Table 3), with the exception of Duration, as Contrast is the variable of interest here. Like Duration, Contrast is taken as a continuous variable as its steps are equally spaced. All analyses were conducted in R using the lme4 package (Bates et al., 2012) and the languageR package (Baayen, 2011) for plotting the resultant models.
Reaction times were measured from the end of the second word in the pair. As with Experiment 1, responses made before the end of the stimulus were removed from the dataset. This resulted in the removal of 35 data points (0.4%), all of which came from the English speaking groups. The reaction times were log transformed to normalize the distribution and 32 outliers (0.37%), evenly spread across listener groups, were removed from the tails (less than 3.8 and greater than 9 on log scale). We fitted a model for each analysis with all possible predictors including random intercepts for both Item and Participant, along with complex random structure (e.g. random slopes for the interaction of Contrast and Group by Participant). Again, a backwards step-wise elimination procedure was used for fitting each model.
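The reaction-time screening just described can be sketched as follows. This is an illustration under stated assumptions rather than the authors' script: it assumes reaction times in milliseconds and a natural-log transform (neither is stated explicitly in the text), under which the cut-offs of 3.8 and 9 correspond to roughly 45 ms and 8103 ms. The function name is hypothetical.

```python
import math

# Cut-offs on the log scale, as reported in the text.
LOG_RT_MIN = 3.8
LOG_RT_MAX = 9.0


def preprocess_rts(rts_ms):
    """Return log-transformed RTs that survive both screening steps.

    ASSUMPTIONS (not stated in the source): RTs are in milliseconds, measured
    from the end of the second word, and the transform is the natural log.
    """
    kept = []
    for rt in rts_ms:
        # Step 1: drop responses made before the end of the stimulus
        # (non-positive RTs, which also cannot be log-transformed).
        if rt <= 0:
            continue
        log_rt = math.log(rt)
        # Step 2: trim the tails of the log-RT distribution.
        if LOG_RT_MIN <= log_rt <= LOG_RT_MAX:
            kept.append(log_rt)
    return kept
```

Under these assumptions, a call such as `preprocess_rts([-50, 10, 500, 20000])` keeps only the 500 ms response: the first value precedes stimulus offset, and the second and fourth fall outside the log-scale bounds.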
a Analysis 1
In fitting a model to this dataset, three input variables – Experiment Order, Stimulus Order, and Second Language Proficiency – were eliminated. Additionally, the interaction between Contrast and Group emerged as significant. Table 8 summarizes the final model.
Table 8. Summary of fixed effects for Discrimination Task model in Analysis 1.
The interaction between Contrast and Group can be seen in the summary of fixed effects (Table 8). Additionally, for comparative purposes, Figure 5 presents both aggregated data for Contrast by Group and a plot of the model estimates for this interaction. In comparison to the Native group, the No-Instruction group is less likely to respond ‘Different’ as the contrast increases (β = −1.27, SE = 0.11, p < 0.001); that is, the slope for the Native group is significantly steeper than that for the No-Instruction group. Likewise, in comparison to the Native group, the Instruction group is less likely to respond ‘Different’ as the contrast increases (β = −0.82, SE = 0.11, p < 0.001), so the slope for the Native group is also significantly steeper than the slope for the Instruction group. Neither experimental group performs comparably to the Native group, but the Instruction group performs in the direction of the Native group. The possible difference between the Instruction and No-Instruction groups is explored separately in Analysis 2.

Figure 5. The left panel presents aggregate data of proportion of ‘Different’ responses along Contrast by Group with standard error bars. The right panel presents probability estimates from the model for this interaction when all other variables are held constant at their reference level or median value.
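To make the interaction coefficients of Analysis 1 concrete, the following sketch shows how a difference in log-odds slope translates into predicted probabilities of a ‘Different’ response. Only the two interaction coefficients (−0.82 and −1.27, relative to the Native group) come from the text. The intercept, the Native slope, and the 0 to 4 coding of Contrast are hypothetical placeholders chosen for illustration, and group main effects and random effects are ignored, so the resulting numbers are not the model's estimates.

```python
import math

# Reported in the text: change in the log-odds slope of Contrast
# relative to the Native reference group.
INTERACTION = {"Native": 0.0, "Instruction": -0.82, "No-Instruction": -1.27}

# HYPOTHETICAL placeholders: the fitted intercept and the Native slope are
# not given in the running text, so these values are illustrative only.
HYPOTHETICAL_INTERCEPT = -2.0
HYPOTHETICAL_NATIVE_SLOPE = 1.5


def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def p_different(group, contrast_step):
    """Illustrative probability of a 'Different' response.

    The group's slope is the (hypothetical) Native slope plus the reported
    interaction coefficient. Group main effects and random effects are
    deliberately omitted; contrast_step is an assumed 0-4 coding.
    """
    slope = HYPOTHETICAL_NATIVE_SLOPE + INTERACTION[group]
    return inv_logit(HYPOTHETICAL_INTERCEPT + slope * contrast_step)
```

With these placeholder values, all three groups have a positive slope, but the predicted probability at the largest contrast is ordered Native > Instruction > No-Instruction, which is the qualitative pattern the interaction describes.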
b Analysis 2
In fitting a model to this subset of data, two input variables – Experiment Order and Stimulus Order – were eliminated during the backward elimination process. Additionally, the interaction between Contrast and Group emerged as statistically significant. Table 9 summarizes the final model. For comparative purposes, Figure 6 presents both aggregated data for Contrast by Group and a plot of the model estimates for this interaction. The interaction between Contrast and Group shows that the two experimental groups perform differently. With the No-Instruction group as the reference level, the Instruction group is more likely to respond ‘Different’ as contrast between consonants increases (β = −0.45, SE = 0.12, p < 0.001). Thus, the slope for the Instruction group is significantly steeper than that of the No-Instruction group. The overall effect size for the No-Instruction group is 0.22, while the overall effect size for the Instruction group is 0.65.
Table 9. Summary of fixed effects for Discrimination Task model in Analysis 2.

Figure 6. The left panel presents aggregate data of proportion of ‘Different’ responses along Contrast by Group with standard error bars. The right panel presents probability estimates from the model for this interaction when all other variables are held constant at their reference level or median value.
In this model, an interaction between Group and L2 Proficiency (Figure 7) approached significance (p = 0.07). Again, this is a post-hoc analysis, and we report it because a similar interaction was significant in Experiment 1. It suggests that when no instruction is provided, listeners with L2 proficiency may provide more ‘Different’ responses.

Figure 7. The left panel presents aggregate data of proportion of ‘Different’ responses by L2 proficiency and Group with standard error bars. The right panel presents probability estimates from the model for this interaction when all other variables are held constant at their reference level or median value.
While Experiment Order was eliminated in the backward elimination process, it was still necessary to investigate the effect of task order on the No-Instruction group. By completing the identification task first, which required listeners to identify consonants as long or short, these listeners were implicitly informed that such categories may exist, and they may have used that knowledge and experience in discriminating the contrasts in this task. A chi-squared analysis revealed that the portion of the No-Instruction group who completed the identification task before the discrimination task did in fact give more ‘Different’ responses (χ² = 5.7543, df = 1, p = .01645).
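A chi-squared test with df = 1 corresponds to a 2 × 2 table, here task order (identification first vs. discrimination first) by response (‘Different’ vs. ‘Same’). The sketch below shows the Pearson statistic for such a table and the conversion of a df = 1 statistic to a p-value. The cell counts are not reported in the text, so any table passed to the first function is hypothetical, and whether a continuity correction was applied is not stated. The p-value conversion, however, can be checked directly against the reported statistic.

```python
import math


def chi2_2x2(table):
    """Pearson chi-squared statistic (no continuity correction) for a 2 x 2
    table given as [[a, b], [c, d]]. Any counts supplied are HYPOTHETICAL:
    the source does not report the cell counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row_total in enumerate(row_totals):
        for j, col_total in enumerate(col_totals):
            expected = row_total * col_total / n
            stat += (table[i][j] - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail p-value of a chi-squared statistic with 1 degree of freedom.

    For df = 1 the statistic is the square of a standard normal variable, so
    P(X > stat) = erfc(sqrt(stat / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))
```

Applying `p_value_df1(5.7543)` gives approximately 0.0164, in line with the reported p = .01645.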
3 Discussion
The results of this speeded AX Discrimination task show that native Finnish speakers increase their recognition of the difference between pairs as the difference between the consonant durations increases. This increase appears to plateau toward the end of the continuum.
As for the native English listeners, they can and do detect increasing differences between intervocalic consonants. They do not perform like native Finnish speakers and, as predicted from previous findings (see Hayes, 2002), their rate of increase in ‘Different’ responses is not the same as native Finnish performance. However, their ability to detect this contrast is significantly enhanced by the simple knowledge that a consonant length distinction is important in the language, shifting their performance in the direction of native Finnish speakers. This provides evidence that, while native English speakers have no phonological category for consonant length, their initial perceptual sensitivity to duration is still quite flexible and can be enhanced with basic information regarding its relevance. This finding is similar to that of Schulman (1983) who found that telling Stockholm dialect speakers that they were listening to English rather than Swedish words helped listeners to discriminate between two vowels which were merged in their dialect. Information about and attention to meaningful features appear to play a role in a listener’s ability to discriminate non-native contrasts. This is also evidenced by the effect of task order in the No-Instruction group. Even if listeners in the No-Instruction group have no other knowledge of the consonant length contrast, implicitly knowing that consonants may differ in length (by completing the Identification task first) is enough to cause listeners to discriminate a difference more often.
As with Experiment 1, L2 experience may influence how English speakers begin to deal with fluctuating consonant durations. Again, because L2 proficiency was not a primary manipulation in this experiment, we interpret these results cautiously. However, the trend in the data indicates that L2 proficiency (or lack thereof) may differentially affect listeners who are not given information about the length contrast. This provides impetus to explore the possible effect of previous, unrelated language learning on the perception of new contrasts.
IV General discussion
The current study aimed to address a number of questions regarding English speakers’ ability to identify and discriminate a non-native phonemic contrast of consonant length with which they have no previous experience. The study also sought to examine how basic contextual information about relevant cues and the resulting attention to those cues play a role in enhancing the perceptual abilities of native English speakers. We predicted that naïve L1 English listeners would not reliably discriminate or identify normal Finnish geminate consonants. The results of the discrimination and identification tasks coincide, revealing that naïve listeners detect the difference in consonant durations, but do not perform like the native Finnish controls. This replicates the findings of Hayes (2002) who showed that naïve English listeners do not achieve a phonemic distinction between short and long consonants. However, as we also predicted, increasing duration and contrast ratio lead to greater identification and discrimination in these same listeners. In both experimental groups, Instruction and No-Instruction, increases in duration and contrast ratio resulted in enhanced detection. This underscores the fact that listeners can detect subtle changes in duration during phonetic comparison (i.e. discrimination) and begin to better categorize (i.e. identification) based on this duration.
Finally, we predicted that basic information and the resulting attention would enhance naïve listeners’ ability to use duration as a cue to length (see Schmidt, 1990). The results of the discrimination and identification tasks align, showing that detection increases when listeners’ attention is drawn to the fact that a particular acoustic feature is important. Similarly, Schulman’s (1983) work indicates that simply informing listeners that the stimuli are from another language provides enough information for listeners to understand that different features may be important in maintaining contrasts. Guion and Pederson (2007) found that generally orienting listeners’ attention to auditory form rather than meaning increased discrimination of Hindi stop consonants in naïve English listeners. In subsequent work, they found that asking listeners to focus on the consonants rather than the vowels during training increased discrimination of the consonant contrast (Pederson and Guion-Anderson, 2010). The results reported here indicate that attention to particular acoustic features involved in the non-native contrast enhances perception, perhaps by allowing listeners to assign importance to the new cue. Because the listeners without instruction still had a positive slope along both continua, it seems that they perceive changes in duration, but perhaps disregard it as an important cue. As indicated by one reviewer, this raises the interesting question of possible differences between heightened cue sensitivity and the application of a pre-existing sensitivity to a particular task; that is to say, a difference between a genuine increase in perceptual ability and shifts in cue-attention strategies. 
The present data in effect replicate the results of Hisagi and Strange (2011), who found that naïve English listeners could discriminate three types of Japanese temporally-cued contrast well above chance when the three contrast types were presented in separate blocks with detailed instructions about what to listen for. The present study goes further in two respects: it shows that the effect of attention is also present in identification, and that completely naïve, uninstructed listeners can detect the phonetic gradience of increasing consonant duration.
The results of this study, taken together, provide insight into how a novice L2 learner begins the process of category formation. The Speech Learning Model proposed by Flege (1995, 2003) indicates that phonetic learning continues across the life span and can lead to category formation even in an L2 acquired later in life. Werker and Tees (1984) proposed a phonetic level of processing which is intermediate to non-linguistic acoustic processing and phonemic processing. It is at this level of processing that a phonetic cue begins to take on discriminative meaning for the listener. While the attentional manipulation employed by Werker and Tees (1984) did not result in successful discrimination of the contrast, the results from the present study indicate that information, even relatively implicit suggestion, and the resulting attention to phonetic features may play a role in a listener’s ability to use the phonetic level of processing. This appears to be the case in both discrimination and identification of the temporal cue of consonant duration. These results are similar to those of both Guion and Pederson (2007) and Hisagi and Strange (2011), and demonstrate the need for elaboration of the Speech Learning Model to account for the role of oriented attention in non-native contrast perception and eventual category formation.
V Conclusions
At the most novice level, the detection of cues to phonological distinctions in an L2 may be facilitated by information about the particular contrast, attention to phonetic detail, and possibly previous language experience. The results provide evidence for how novice learners, who do not initially maintain a phonological contrast, make use of instruction when presented with a non-native contrast. In addition, others have shown that learning of non-native contrasts can be affected by either time in the target language country (MacKain et al., 1981) or auditory-perceptual training (Hirata, Whitehurst, and Cullings, 2007; Motohashi-Saigo and Hardison, 2009; Sonu, Kato, Tajima, Akahane-Yamada, and Sagisaka, 2013; Tajima, Kato, Rothwell, Akahane-Yamada, and Munhall, 2008). Phonetic learning has been shown to occur during L2 acquisition in children (Williams, 1979), and adults have been shown to use the phonetic level of processing in contrast discrimination (Werker and Tees, 1984). Here we find both evidence of phonetic processing and the effect of oriented attention in both discrimination and identification.
The integration of attentional mechanisms as explicit components in the Speech Learning Model will help to better understand the process of acquiring non-native contrasts. That being said, a more complex construct of how this category learning happens may be needed. First, the present results indicate that even in the absence of overt information about a particular cue, previous experience with any foreign language may interact with attention to specific cues; simply having previously learned another language may result in enhanced perceptual ability. In subsequent studies, this can be addressed under more controlled conditions. Second, lexical level information begins to play a role in speech discrimination as proficiency increases (Celata, 2004; Mora, 2005). In order to better understand how phonetic processing progresses toward phonemic processing, it would also be interesting and necessary to compare these data with those of more advanced English-speaking learners of Finnish. Additionally, the analysis presented here does not provide indication of the point at which listeners begin to show differential performance. Such a study including intermediate and proficient learners of Finnish may better indicate the establishment of phonological categories. Here we have only been able to address the effect of attention on the temporal cue of consonant duration; however, similar investigations of this type could also be done for non-temporal cues such as vowel quality or tone to better understand the role of phonetic processing and how it may be used by listeners as a means of beginning to form categorical contrasts in an L2.
Acknowledgements
Thank you to Aki-Juhani Kyröläinen for his assistance in conducting this research. Thank you also to the anonymous reviewers whose comments helped to improve this paper.
Declaration of conflicting interest
The authors declare that there is no conflict of interest.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
