Abstract
This study evaluates the speech reception performance of native (L1) and non-native (L2) normal-hearing young adults in acoustical conditions containing varying amounts of reverberation and background noise. Two metrics were used and compared: the intelligibility score and the response time, taken as a behavioral measure of listening effort. Listening tests were conducted in auralized acoustical environments with L1 and L2 English-speaking university students. Although the two groups achieved the same accuracy, close to the maximum, L2 participants showed longer response times in every acoustical condition, suggesting an increased involvement of cognitive resources in the speech reception process.
Introduction
The speech perception of non-native (L2) listeners under unfavorable acoustical conditions is a multi-faceted problem. It is known to depend on L2 proficiency and age of acquisition, on the amount of contextual information in the speech material, on linguistic factors of the L2 language, and on the type of masking (energetic or informational) produced by the noise.1–3 As a result, the accuracy of speech-item identification in speech-in-noise tests under critical acoustical conditions is lower for L2 listeners than for their age-matched native (L1) counterparts. 4 To highlight the effect of context, comparative studies employed highly predictable or unpredictable sentences and tested a panel of normal-hearing adults with various degrees of L2 proficiency. 5 In this way, L2 speech reception accuracy could be predicted by applying corrections to the normative Speech Transmission Index (STI) scheme 6 for speech intelligibility assessment. The problem of non-native listeners is especially relevant in school settings, for instance in North American countries, where a substantial number of students are L2 listeners. The issue was recently addressed in the context of speech comprehension inside typical classrooms with a panel of university students, 7 and the conclusions pointed to the need to revise the normative limits for heating, ventilation, and air conditioning (HVAC) noise and reverberation in classrooms to avoid disadvantaging L2 listeners.
Current standards are based on the assessment of speech intelligibility, which has to be optimized for learning environments, including university classrooms. 8 However, even at a high level of item-recognition accuracy, L2 listeners may require a different amount of speech processing than L1 listeners, and no study in the literature has so far shed light on this with specific regard to university classrooms.
Whether such a difference exists is the first research question of the present work. In the experiment, the response time (RT) to the auditory speech stimulus, a cognitive-behavioral metric, was taken as a measure of the speech processing rate. Although RT is not used as a clinical measure, in the field of audiology it has been interpreted as a possible correlate of listening effort, since it increases when listening conditions become more difficult9–13 or the stimulus complexity increases.14,15 A more challenging task calls for greater cognitive processing, 16 and RT may provide information on the amount of cognitive resources involved in processing the incoming signal. Thus, it was suggested 17 that RT is an important factor to consider when characterizing the speech communication process. The second research question addressed in this work is the effect of room-acoustic conditions on the listening effort of L1 and L2 listeners. Both points are relevant to assessing whether classroom acoustics meet the needs of L2 listeners.
The study involved a panel of L1 and L2 university students and, in order to focus on basic speech reception, the role of sentence context was eliminated by employing standardized word-test material. Binaural advantage was also excluded by presenting the material monaurally. Furthermore, one optimal condition and two only slightly worse acoustical conditions were created, the latter two having the same objective STI but differing in reverberation and signal-to-noise (S/N) ratio. This was done to reflect realistic sound fields experienced inside university classrooms and, especially, to address the second research question as regards the combination of reverberation and noise.
Experimental procedures
Speech material and listening conditions
The Diagnostic Rhyme Test (DRT) 18 was used for this experiment. The test material consisted of 96 phonetically balanced word pairs (e.g. cat/mat), split into 16 lists of six pairs each. For each word pair, one of the words was recorded embedded in a carrier phrase (e.g. “The next word is cat”). In accordance with the ISO 9921 standard, 19 the test lists were recorded in an anechoic room by five native English speakers (two male and three female). The recorded material was processed so as to obtain the same level across speakers.
The first listening condition (A), with an STI equal to 1, was completely anechoic, without any background noise. Two other conditions were then presented under auralization: one dominated by reverberation (condition B) and one with some reverberation mixed with noise (condition C). The background noise consisted of a classroom field recording of babble and activity noise from an adjacent corridor. The impulse responses used for auralization were collected inside two identical parallelepiped classrooms of volume V = 250 m³, with acoustically untreated (B) and treated (C) ceilings, respectively, which were occupied during the measurements. The receivers were located 3.86 m from a directional source. A detailed description of the noise and its temporal and spectral representations is provided in an earlier work. 11 The reverberation times (averaged across the 500–2000 Hz octave bands) for conditions B and C were 1.08 s and 0.68 s, respectively. The S/N ratio was greater than +15 dB for condition B, where the effect of noise was thus negligible. The resulting STI for condition B was 0.57. To achieve the same STI for condition C, the S/N ratio was set to +7.5 dB. Both conditions B and C corresponded to realistic acoustical scenarios experienced inside classrooms in use.
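The way reverberation and noise trade off at a given STI can be illustrated with the classical modulation-transfer-function approximation for a diffuse, exponentially decaying sound field with stationary noise. The following Python sketch is indicative only. It assumes a single frequency band, equal weighting of the 14 standard modulation frequencies, and none of the corrections of the standardized procedure, whereas the STI values of this study were derived from measured impulse responses. The function names are illustrative, and the sketch is not expected to reproduce the value of 0.57 reported above.

```python
import math

# Standard STI modulation frequencies (Hz), one-third-octave spaced.
MOD_FREQS_HZ = [0.63, 0.8, 1.0, 1.25, 1.6, 2.0, 2.5,
                3.15, 4.0, 5.0, 6.3, 8.0, 10.0, 12.5]


def modulation_transfer(mod_freq_hz, rt60_s, snr_db):
    """Modulation reduction for exponential decay plus stationary noise."""
    reverb_term = 1.0 / math.sqrt(1.0 + (2.0 * math.pi * mod_freq_hz * rt60_s / 13.8) ** 2)
    noise_term = 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
    return reverb_term * noise_term


def simplified_sti(rt60_s, snr_db):
    """Single-band, equally weighted STI estimate (assumption, see text)."""
    indices = []
    for mod_freq in MOD_FREQS_HZ:
        m = modulation_transfer(mod_freq, rt60_s, snr_db)
        # Keep m strictly inside (0, 1) so the logarithm below is defined.
        m = min(max(m, 1e-12), 1.0 - 1e-12)
        apparent_snr_db = 10.0 * math.log10(m / (1.0 - m))
        # Apparent S/N is limited to the range -15 dB ... +15 dB.
        apparent_snr_db = max(-15.0, min(15.0, apparent_snr_db))
        indices.append((apparent_snr_db + 15.0) / 30.0)
    return sum(indices) / len(indices)
```

Calling `simplified_sti(1.08, 15.0)` and `simplified_sti(0.68, 7.5)` gives a rough feel for how a shorter reverberation time can be offset by a lower S/N ratio, within the limits of the stated assumptions.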
Test procedure and participants
All tests took place in an anechoic room with negligible background noise (Leq < 20 dB(A)). The test material was played back through a single KRK Rokit 5 G2 loudspeaker, placed at head height, 1.5 m in front of the seated participant. This playback mode provided only monaural cues but was preferred over headphone presentation to preserve a more natural listening experience. The signal level was calibrated to 66 dB(A) at the listening position. A wireless test bench was used to manage the experiment: the system simultaneously controlled the audio rendering and the presentation of the test to the participants, recording the word choices and the RTs. The participants were given a touchscreen handset on which they selected items by means of a soft pen.
During the test, participants listened to each word with its carrier phrase and then selected the word that they had heard from one of three options (the two words in the word pair, or “none of the above”). This process was repeated for each pair in the list, and the intelligibility scores (IS) were obtained from the responses. First, each participant completed one word list from condition A as a training session; the experiment then proceeded with one list each from conditions A, B, and C (counterbalanced across participants with a Latin-square design). Lists and talkers were balanced across all participants. Following the listening tests, participants were asked to fill out a brief questionnaire, which collected their age of acquisition, length of exposure, and self-rated proficiency in the English language, as well as information on country of origin and other languages spoken.
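The counterbalancing of the condition order can be sketched as follows. The text does not state which Latin square was used or how its rows were assigned to participants, so the cyclic square and the round-robin assignment in this Python example are assumptions, and the function names are illustrative.

```python
def cyclic_latin_square(conditions):
    """Build a cyclic Latin square: each condition occurs once per row and column."""
    n = len(conditions)
    return [[conditions[(row + col) % n] for col in range(n)] for row in range(n)]


def assign_orders(participant_ids, conditions=("A", "B", "C")):
    """Assign presentation orders to participants by cycling through the rows."""
    square = cyclic_latin_square(list(conditions))
    return {pid: square[index % len(square)]
            for index, pid in enumerate(participant_ids)}
```

With three conditions the square has three rows, so a panel whose size is not a multiple of three cannot use every order equally often; the rows would then differ by at most one participant.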
A total of 37 university students aged 18–33 years (mean: 21.6 years, standard deviation (SD): 4.3 years) were included in this study. All participants assessed themselves to have normal hearing ability, with no self-reported hearing loss. Based on the age of first acquisition of the English language, the participants were divided into two groups: 24 individuals (14 male and 10 female) were identified as L2 and 13 (9 male and 4 female) as L1. An individual was defined as L2 if he or she was not exposed to the English language until after 3 years of age. This age cut-off has previously been used to differentiate between “simultaneous bilingual” and “sequential bilingual” children, the former of whom were exposed to both L1 and L2 from birth, and the latter exposed to L2 only after entering the education system (e.g. attending nursery school) in a primarily L2-speaking country. 20 This delay in initiating L2 acquisition has been shown to have consequences for the ability to discriminate between languages at a phonetic level, beginning in infancy (6–12 months of age) and continuing further in childhood. 21 Among the L2 participants included in the experiment, the most common native language was Mandarin (nine participants), followed by Korean (four participants). The other native languages spoken were Cantonese, Spanish, Farsi, and Bengali (two participants each) and Arabic, Portuguese, and Romanian (one participant each). The mean age of first exposure to the English language was 9.3 years (SD: 4.3 years); the amount of time for which L2 participants had lived in a primarily English-speaking environment ranged from 6 months to 15 years.
The experimental procedure is summarized in Figure 1.

Figure 1. Outline of the experimental design, from participant selection to the retrieval of the IS and RT metrics. The panel on the right corresponds to a word pair (e.g. “veal” vs “feel”) presented immediately after the end of the audio playback. The response time is measured from the end of the audio presentation to the participant's selection on the touchscreen. Upon selection, the background of the chosen item turns red.
Results
Prior to analysis, missing data and excessively long RTs (RT > 5 s) were excluded from the data set; the latter criterion was adopted in order to remove slow outliers due to participants’ inattention, as in previous studies on RT. 9 The procedure led to the elimination of 0.6% of all RT data. The statistical analysis was conducted with the R software 22 assuming a significance level of 0.05. Due to the small sample size and concerns regarding the normality of the IS data (which showed a ceiling effect in condition A), a non-parametric analysis was applied to test the main effects and the interaction of the factors. Specifically, the univariate effects of listening condition and listeners’ group were analyzed by applying a permutation approach to a mixed-model analysis of variance (ANOVA) design, with either IS or RT as the dependent variable. The model included one between-subject variable (listeners’ group: L1 vs L2) and one within-subject variable (listening condition: A vs B vs C). In case of significant results, multiple pairwise comparisons were carried out with a permutation approach, using the Bonferroni correction to adjust the significance level.
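The exclusion rule and the general idea of a permutation test can be sketched in a few lines of Python. This is not the analysis performed in R for this study: the permutation mixed-model ANOVA is replaced here by a simpler two-group test on the difference in medians, and the function names, the number of permutations, and the random seed are assumptions made for illustration.

```python
import random
from statistics import median

RT_CUTOFF_S = 5.0  # exclusion threshold stated in the text (RT > 5 s)


def clean_rts(rts):
    """Drop missing responses (None) and RTs longer than the cutoff."""
    return [rt for rt in rts if rt is not None and rt <= RT_CUTOFF_S]


def permutation_test_medians(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for the difference in group medians.

    Group labels are shuffled across the pooled values; the p-value is the
    share of shuffles whose absolute median difference is at least as large
    as the observed one (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(median(group_a) - median(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled_diff = abs(median(pooled[:n_a]) - median(pooled[n_a:]))
        if shuffled_diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)
```

In such a sketch, each element of `group_a` and `group_b` would be one participant's summary RT (for example the median over the cleaned trials), so that the shuffled units are participants rather than single trials.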
In the case of IS, for which results are reported in Table 1, the analysis showed that the main effects of both listeners’ group (p = 0.49) and condition (p = 0.14) were not significant, nor was the interaction between them (p = 0.82).
Table 1. Median values and interquartile ranges (IQR) for intelligibility results (%) in the three listening conditions (A: anechoic, no noise; B: reverberation only; and C: reverberant noise). Results are presented for native (L1) and non-native (L2) English listeners.
Concerning RT, the box plots of the results for each group of participants and for each condition are shown in Figure 2. The statistical analysis revealed significant main effects of listeners’ group (p = 0.02) and listening condition (p < 0.001); no interaction was found between the two factors (p = 0.58). For the main effect of listeners’ group, the post hoc comparison indicated that RTs were significantly shorter for the L1 participants (median = 1.33 s, interquartile range (IQR) = 0.42 s) than for the L2 participants (median = 1.65 s, IQR = 0.60 s). As regards the main effect of condition, a statistically significant increase in RT was found between conditions A and B (A: median = 1.36 s, IQR = 0.41 s; B: median = 1.66 s, IQR = 0.51 s; A vs B: p < 0.001) and between conditions A and C (C: median = 1.66 s, IQR = 0.76 s; A vs C: p < 0.001). In contrast, no difference in RT was found between conditions B and C (p = 0.86).
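For the within-subject pairwise comparisons between conditions, a permutation approach can be illustrated with a paired sign-flip test combined with a Bonferroni-adjusted significance level. The Python sketch below is an assumption about one possible implementation, not the procedure run in R for this study; it uses the mean paired difference as the test statistic, and all names are illustrative.

```python
import random
from statistics import mean


def paired_sign_flip_test(first, second, n_perm=10000, seed=0):
    """Two-sided sign-flip permutation p-value for paired observations.

    Under the null hypothesis the sign of each within-participant difference
    is arbitrary, so signs are flipped at random and the absolute mean
    difference is compared with the observed one.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(first, second)]
    observed = abs(mean(diffs))
    extreme = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


def bonferroni_alpha(alpha, n_comparisons):
    """Significance level per comparison after Bonferroni correction."""
    return alpha / n_comparisons
```

With three pairwise comparisons (A vs B, A vs C, B vs C) and an overall level of 0.05, `bonferroni_alpha(0.05, 3)` gives the threshold against which each pairwise p-value would be compared.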

Figure 2. Box plots of the response times RT (s) of native (L1) and non-native (L2) listeners for the three listening conditions (A: anechoic, no noise; B: reverberation only; and C: reverberant noise). Median values are depicted by the bold, black line; the whiskers extend to 1.5 × IQR from the corresponding hinge. Outliers are plotted as points.
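The box-plot quantities named in the caption (hinges, whiskers at 1.5 × IQR, and outliers) can be computed as in the Python sketch below. The quartile convention is an assumption (linear interpolation on the inclusive range); plotting tools differ in this respect, so the hinges may deviate slightly from those of the published figure. The function name is illustrative.

```python
from statistics import quantiles


def box_plot_stats(values, whisker=1.5):
    """Return hinges, whisker ends, and outliers for a box plot."""
    data = sorted(values)
    # Quartiles by linear interpolation over the inclusive data range.
    q1, med, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    # Whiskers end at the most extreme data points inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    outliers = [v for v in data if v < low_fence or v > high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }
```

Applied to the RTs of one group in one condition, the returned dictionary corresponds to one box of the figure.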
Discussion
The listening conditions presented in the experiment were specifically chosen to ensure a high speech reception accuracy, as expected in classrooms with an appropriate acoustical design. In particular, for the measured STI values and the selected DRT speech material, IS values higher than 95% were expected for the L1 participants in each case. 23 This was done to address the research questions, given the L2 disadvantage reported in the literature under more unfavorable conditions. The choice was confirmed by the analysis of the IS results presented in section “Results”: L1 participants achieved an accuracy statistically indistinguishable from 100%, which did not vary across the listening conditions. Furthermore, no effect of listeners’ group was found on IS, indicating that with a highly positive S/N ratio, both groups achieved the same speech reception accuracy. The absence of differences between L1 and L2 participants could be further explained by the characteristics of the L2 panel. In particular, although there was a reasonably wide variety in the age of acquisition and in the amount of exposure to the English language among them, the variation was constrained by the fact that all of them were already students at an English-speaking university and had passed the minimum language requirements set by the institution. Consistently, all of them reported a high self-rating of proficiency and selected a minimum of 4 points on the 5-point scale given in the questionnaire (mean: 4.73, SD: 0.53), corresponding to “I understand almost everything that is said to me.”
Another implication of the non-significant IS differences is that any differentiation between conditions and listeners can be determined by the behavior of RT alone. In fact, contrary to IS, the presence of a main effect of the listeners’ group on RT provided evidence that, notwithstanding the same high accuracy achieved, the speech processing time of L2 participants increased. In detail, across conditions, non-native listeners required an additional RT of 320 ms. Recalling that a slowing down of RTs can be interpreted as the involvement of more cognitive capacity,16,24 it can be hypothesized that this makes word recognition more effortful for L2 listeners.
Concerning the differences between listening conditions for both groups, it was found that when listeners are faced with more challenging listening conditions, their RT slows down: an average increase of 300 ms was measured passing from condition A (anechoic, no noise) to the other two. Interestingly, no difference was found between the RTs measured in condition B (reverberation only) and condition C (reverberant noise); this finding might imply that acoustic scenarios with a similar STI impair RT results to the same extent. However, the high IQR of the RT results for L2 participants in condition C should be noted in Figure 2: it is almost double the IQRs in the other conditions. The presence of background noise (even with a positive S/N ratio) thus appears to increase the individual variability in RT results. Characteristics such as attention and susceptibility to interference (which come into play in the presence of noise, but not with reverberation alone 25 ) may influence this variability. Further dedicated experiments are needed to address the specific role of room acoustics and the relative effect of noise and reverberation on the listening effort of L2 listeners.
The absence of a significant interaction between the factors listeners’ group and condition suggests that the RT increase in the L2 participants does not depend on the listening condition and that it is simply carried over from condition A to the other two. In fact, looking at Figure 2, it can be seen that differences between L1 and L2 are already present in condition A, where the speech signal is not degraded by the room acoustics. Thus, it is reasonable to trace this result back to a residual and unavoidable difference in the language-processing capabilities between L1 and L2. Specifically, the slowing down of L2 responses could be explained by the basic mechanisms of L2 listening at the lexical level, which also involve interference from the L1 language. In general, based on the acoustical input, the process of word recognition consists of a selection among multiple candidate words that are simultaneously activated and give rise to a competition. 26 In the case of L2 listeners, more phantom words are elicited because their command of the second language is incomplete and because candidates from their L1 vocabulary are also activated. 27 Therefore, since L2 listeners have to manage many more alternative words than L1 listeners, they suffer from longer latencies in resolving such competitions. However, since a component of this disadvantage stems from incomplete lexical access to the L2 language, it is not certain that L2 listeners could influence the RT increase at all by engaging further explicit cognitive resources or attention. A more specific experimental design would be needed to elucidate this point, which is beyond the scope of this work.
It is important to recall that the present experiment involved only energetic masking (no informational masking was present due to the nature of the noise) and did not include contextual information, which is especially useful to L2 listeners. The latter point matters when transposing the above findings from the lexical level to sentence intelligibility and listening comprehension. Thus, a direct analogy cannot be established, and specific experiments should be carried out to investigate this further. However, the relative simplicity of RT measurement in word-choice tests, compared with more complex paradigms, makes this method attractive for investigating how deficits in basic L2 speech reception may result in loss of accuracy and fatigue during more structured listening tasks.
From the point of view of acoustical planning, using RT as a design metric might be appropriate to optimize the reception of speech, especially when intelligibility is near ceiling. In fact, minimizing RT appears to be a viable strategy for choosing between design alternatives whose optimal intelligibility has already been ascertained. In this perspective, further studies are needed to investigate how sensitive the RT metric is to different room-acoustic treatments and to specific types of noise.
Conclusion
This study was the first to investigate the performance of L2 listeners in terms of both RT and accuracy in acoustical conditions characterized by almost fully intelligible words, and to compare it with that of an L1 group having equivalent characteristics. L2 participants presumably had a high proficiency, as they fulfilled the requirements to enter university and reported high self-rated scores. Even so, L2 participants showed longer RTs than L1 participants in each acoustical condition, suggesting an intrinsic and unavoidable increase in the cognitive engagement experienced by non-native listeners, which is carried over across all the listening scenarios. The basic reasons for this can be traced back to the larger number of phantom words elicited in L2 listeners, compared with L1 listeners, during word processing. Further research is, however, needed to highlight the effect of noisy and/or reverberant conditions on L2 speech reception. The implications of the present findings for normal classroom listening activities cannot be firmly established by this initial investigation. It can reasonably be hypothesized that the present disadvantage could deprive L2 listeners of cognitive resources during a sufficiently long lesson.
Footnotes
Acknowledgements
This work was developed in the framework of a collaboration between the University of Ferrara (Italy) and The University of British Columbia. The authors gratefully acknowledge the late Professor Murray Hodgson for his valuable support of this research and for his long-standing and invaluable contributions to the development of room acoustics studies. The authors thank Katie O’Brien for her assistance in organizing and conducting the listening tests and Professor Stefano Bonnini for his advice on the statistical analysis.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
