Effects of Task-Irrelevant Talker Identity and Continuity on Spatial Selective Attention Under Interruption

Abstract

Task-irrelevant features can impact formation of auditory objects and influence the effectiveness of selective attention, including the buildup of attention over time. Using a previously established paradigm exploring the effects of random interruptions on spatial selective attention, this study explores how the task-irrelevant feature of talker identity impacts the buildup of spatial attention and whether it alters the impact of interruptions. Participants performed a sequence recall task in which they were presented with two competing syllable sequences coming from different spatial directions and were asked to report the syllable sequence coming from the target direction. On half of the trials, an unpredictable, novel interrupting sound occurred, disrupting attentional focus. Two experiments explored how talker identity influenced performance, specifically, whether 1) making the two streams come from different talkers facilitates task performance and reduces the impact of interruption compared to when the streams are spoken by the same talker, and 2) talker discontinuity interferes with attention buildup and harms syllable recall performance compared to when the talker is the same from one syllable to the next. Our results showed that distinct talker features, though task-irrelevant in this spatial task, significantly improved syllable recall performance and reduced the impact of interrupters. Further, irrelevant talker discontinuities damaged attention buildup and reduced syllable recall performance. Post hoc analysis also revealed that repeating syllables in sequence substantially improved recall performance, which should be accounted for in future studies using similar paradigms.

Keywords

auditory spatial attention, interruption, distinct talkers, talker continuity

Introduction

When multiple sound sources produce signals simultaneously, a listener needs to segregate the sources and focus attention on the one of interest in order to understand it (Cherry, 1953). Over the last seventy years, many studies have explored the mechanisms by which listeners segregate and select sound sources to solve this “cocktail party problem” (Bee & Micheyl, 2008; Bronkhorst, 2015; Qian et al., 2018). But another, equally common but less well-studied problem occurs in everyday auditory environments: there is always a chance of interruption, because unpredictable sound events may grab our attention involuntarily through automatic bottom-up processes (Huang & Elhilali, 2020; Zhao et al., 2019). The interplay between top-down selective attention and bottom-up interruption helps balance our constant need to focus on known sounds with the need to stay alert to new events.

This study builds on previous online experiments examining factors that influence the interaction between top-down spatial attention and bottom-up interruptions (Liang et al., 2022, 2025). Here, we extend these earlier efforts by focusing on how interruptions interact with task-irrelevant acoustic features of competing streams during spatial selective attention.

In complex acoustic environments, the spectrotemporal structure of natural, independent sounds typically provides sufficient information for the auditory system to segregate different sources, even when they are acoustically similar—for example, two different utterances produced by same-gender talkers. However, perceptually connecting sound across gaps (streaming) can be difficult when sources are too similar, leading to confusions between the competing streams. When spectrotemporal features of simultaneous sounds clearly differ (as in the case of a simultaneous talker and flute), segregation becomes even easier and streaming happens automatically, driven by continuity of these acoustic features (Bressler et al., 2014).

These automatic, bottom-up processes tend to stream together sounds sharing spectrotemporal properties, even when those features are irrelevant to the listener’s goals. For instance, single-neuron recordings in the brainstem of anesthetized animals show signatures of automatic perceptual organization (Pressnitzer et al., 2008). In the auditory cortex, different neurons respond selectively to different features, like pitch and space, providing a neural substrate for source segregation based on those features (Bizley & Cohen, 2013; Middlebrooks & Waters, 2020). Specifically, some “hard-wired” computations automatically process certain spectrotemporal features, which can lead to distinct neural populations encoding competing sources, providing a foundation for perceptual segregation and selection. Consequently, unattended spectrotemporal features can perceptually “bind” with an attended object that was selected based on some other feature, enhancing streaming and improving recall performance for a target source (Best et al., 2008; Fischer et al., 2024). Thus, spectrotemporal features that may seem irrelevant to a particular task nonetheless automatically contribute to auditory streaming and help listeners process simultaneous sounds.

Building on this foundation, we asked how the continuity of a task-irrelevant feature—voice identity—influences spatial selective attention under interruption. As noted above, continuity of a task-irrelevant feature enhances perceptual segregation, but it may also have other effects. Once a listener has latched onto a target stream based on its location, continuity of a talker’s voice may provide an additional cue to guide attention following an interruption, reducing the cost of interruption (e.g., see Bonacci et al., 2020). Other factors are likely at work as well. Spatial selective attention is not static—it builds gradually over time as the listener continues to focus on the target stream. Distinct talkers may facilitate this buildup by reducing competition between sources, allowing spatial focus to become more stable from syllable to syllable. However, these buildup effects depend on attention remaining properly aligned with the target; when attention drifts or locks onto the wrong stream, the listener must effectively start over on the next syllable (see Bressler et al., 2014). An interruption may automatically “break” this buildup process by resetting spatial focus, instantaneously increasing competition between streams. By resetting buildup, interruptions thus may diminish advantages otherwise conferred by talker differences.

Previous experiments exploring spatial selective attention and interruption using a syllable sequence recall task (Liang et al., 2022, 2025) presented two streams of syllables—one target and one distractor—differentiated only by spatial location. A random subset of trials contained an interrupter. Listeners were asked to report the syllables from the target stream while ignoring the distractor stream and any interrupter. This paradigm elicited not only large effects of interruption on recall of the target syllable immediately following an interrupter, but also persistent errors in recall of even later syllables: refocusing spatial attention after an interruption was difficult when competing streams differed only in location.

Here, we expanded this paradigm by manipulating the similarity of competing stream voices and continuity of voice within streams to test how a task-irrelevant feature shapes attention after interruption. In the Continuous Talker (CT) experiment, each stream presented syllables from a single talker, but the talker could either be the same or different across streams. Experiment CT thus tested whether adding another feature differentiating target and masker reduces the cost of interruption. In the Random Talker (RT) experiment, different talkers could appear within the same stream, disrupting talker continuity. Experiment RT tested how discontinuity of the task-irrelevant talker feature influences performance.

Together, these experiments explored how task-irrelevant acoustic features shape performance during spatial selective attention in the presence of interruptions. We predicted that when target and distractor streams were differentiated by talker identity and the talker was continuous within each sequence, the effect of the interrupter that damaged recall performance would persist across fewer syllables. Conversely, when talker continuity within a stream was disrupted, we expected weaker stream formation and a bigger cost on syllable recall performance when interruption occurred. Consistent with these predictions, interruption effects were significantly reduced and short-lived when targets and distractors were spoken by different talkers. Continuity of talker identity also produced a clear buildup of benefit, even though attention was always directed to spatial features, demonstrating the role of voice continuity in supporting streaming even when voice identity is an unreliable, task-irrelevant feature.

Methods

Participants

We tested 45 participants (aged 18-72, mean 31.98, std 11.20; 15 females, 28 males, one non-binary, one preferred not to report) for Experiment CT; 44 participants (aged 20-63, mean 35.23, std 9.74; 23 females, 21 males) for Experiment RT. All participants were native English speakers, reported no hearing loss, and had not taken part in previous experiments in this study series. It should be noted that for this online study, hearing was self-reported and could not be verified with objective audiometric testing, which is a particular consideration given the age range of participants (up to 72 years). However, all comparisons are within-subject, reducing the possible impact of hearing status. All procedures were approved by the university’s Institutional Review Board. Participants provided informed consent and were compensated for their participation.

Stimuli

In both experiments, participants listened to two competing sound streams simulated as coming from different directions: a target stream and a distractor stream. All sounds were presented through headphones with stereo signals. Sounds were spatialized by convolving mono signals with generic head-related impulse responses (HRIRs) measured from a KEMAR manikin (Gardner & Martin, n.d.). One of the streams was simulated as coming from 30 degrees to the left, the other from 30 degrees to the right.
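The spatialization step can be illustrated with a minimal sketch: each ear's signal is the mono source convolved with that ear's impulse response. The impulse responses below are toy placeholders (a pure delay plus attenuation), not the KEMAR measurements used in the study, and the function names `convolve` and `spatialize` are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def spatialize(mono, hrir_left, hrir_right):
    """Return (left, right) ear signals for one mono source."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy HRIRs (assumption, NOT measured data): a source on the left reaches
# the left ear immediately at full level, and the right ear two samples
# later at half level.
toy_hrir_left = [1.0, 0.0, 0.0]
toy_hrir_right = [0.0, 0.0, 0.5]

left, right = spatialize([1.0, -1.0], toy_hrir_left, toy_hrir_right)
```

In this toy case the right-ear signal is a delayed, attenuated copy of the left-ear signal, which is the kind of interaural difference that headphone playback has to preserve for the simulated directions to be heard.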

Each stream consisted of 5 syllables drawn from a set of three consonant-vowel syllables (/ba/, /da/, and /ga/). Two separate syllable sets were recorded, one by a male and one by a female native speaker of North American English. Each syllable was windowed to have a duration of 450 ms and normalized to have the same root mean square (RMS) level. Syllables were concatenated to form isochronous target and masker streams with a 600 ms onset-to-onset interval. The target stream always led, starting 300 ms before the first distractor syllable, producing temporally interleaved target and distractor streams (Figure 1). Although the consistent target-leading temporal structure could theoretically enable task performance without exclusive reliance on spatial cues, the energetic masking introduced by temporally adjacent distractor syllables would make spatial segregation highly advantageous whenever spatial cues were available. Therefore, we expect listeners to engage spatial attention during task performance, even if temporal onset cues may have facilitated performance. Additionally, any potential benefit to target recall introduced by the temporal structure applies to all conditions and does not confound cross-condition comparisons.
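The timing described above can be checked with a short sketch that lays out the syllable onsets and computes how much adjacent target and distractor syllables overlap. The durations come from the text; the names (`stream_onsets`, `overlap_ms`) are hypothetical, and times are measured from the onset of the first target syllable.

```python
SYLLABLE_MS = 450        # windowed syllable duration
ONSET_INTERVAL_MS = 600  # onset-to-onset interval within a stream
DISTRACTOR_LAG_MS = 300  # the target stream leads by this amount
N_SYLLABLES = 5


def stream_onsets(start_ms, n=N_SYLLABLES, interval_ms=ONSET_INTERVAL_MS):
    """Onset times (ms) of the syllables in one isochronous stream."""
    return [start_ms + i * interval_ms for i in range(n)]


def overlap_ms(onset_a, onset_b, duration_ms=SYLLABLE_MS):
    """Temporal overlap (ms) between two equal-length syllables."""
    start = max(onset_a, onset_b)
    end = min(onset_a + duration_ms, onset_b + duration_ms)
    return max(0, end - start)


target_onsets = stream_onsets(0)
distractor_onsets = stream_onsets(DISTRACTOR_LAG_MS)

# Overlap of target syllable 2 with the distractor syllables on either side.
overlap_with_previous = overlap_ms(target_onsets[1], distractor_onsets[0])
overlap_with_next = overlap_ms(target_onsets[1], distractor_onsets[1])
```

Under these values, each interior target syllable overlaps the preceding and the following distractor syllable by 150 ms each, which is why the interleaved streams still mask one another.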

Figure 1.

Experiment paradigms. (A) Timing and spatial layout of target and distractor streams and interrupter. (B) Top: Same Talker and Different Talker conditions in Exp CT; bottom: Same Talker and Random Talker conditions in Exp RT.

Syllables within each stream were quasi-randomly selected, subject to constraints intended to support analysis of error patterns. For Experiment CT, the second target syllable and the temporally adjacent distractor syllables (first and second) were chosen to all differ from one another, so that each of /ba/, /da/, and /ga/ appeared only once across these three syllables. The same was true for target syllable 3 and distractor syllables 2 and 3, and for target syllables 1, 4, and 5. For Experiment RT, target syllables 2, 3, and 4 were chosen to differ from each other so that each of /ba/, /da/, and /ga/ appeared only once in these syllables. In addition, target syllable 1 was chosen to differ from target syllable 2.
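As an illustration of the Experiment RT constraints, the sketch below draws one target sequence and checks it against the stated rules. It is not the authors' stimulus-generation code. The text does not state a constraint on target syllable 5 in Experiment RT, so drawing it freely is an assumption, and the function names are hypothetical.

```python
import random

SYLLABLES = ("ba", "da", "ga")


def draw_rt_target(rng):
    """Draw a 5-syllable target sequence under the Experiment RT rules."""
    middle = list(SYLLABLES)
    rng.shuffle(middle)  # syllables 2-4: each of ba/da/ga exactly once
    first = rng.choice([s for s in SYLLABLES if s != middle[0]])
    last = rng.choice(SYLLABLES)  # assumption: syllable 5 is unconstrained
    return [first, *middle, last]


def satisfies_rt_constraints(seq):
    """True if syllables 2-4 all differ and syllable 1 differs from 2."""
    return len(set(seq[1:4])) == 3 and seq[0] != seq[1]


rng = random.Random(0)
example_sequences = [draw_rt_target(rng) for _ in range(200)]
```

Note that under these rules a syllable can repeat in adjacent positions only between syllables 4 and 5, whereas the Experiment CT rules allow different repetition patterns.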

We imposed these constraints in an attempt to classify error types, intending to determine whether a mistake reflected a random guess, confusion with the distractor, or confusion with adjacent target syllables. However, in practice, the variety of error patterns was too great to support a meaningful error-type analysis. Nonetheless, because the same constraints were applied consistently within each experiment, comparisons within Experiment CT and within Experiment RT remain valid. In post hoc analyses, we found that the different constraints had systematic effects that complicate direct comparisons across experiments. We describe the consequences of these randomization differences in detail below under Differences Across Experiments.

In Experiment CT, the talker was continuous within each stream, but could either be the same or different across streams. Specifically, both target and distractor streams were spoken by the same talker in 50% of the trials; of these, half presented the male talker and half the female talker (Same Talker condition). In the remaining 50% of trials, the talker differed across streams, with one stream spoken by the male and the other spoken by the female (Different Talker condition). In the Different Talker condition, trials were counterbalanced so that half had the male and half the female as the target, and, independently, half had the male on the left and half on the right.

In Experiment RT, half of the trials used the same talker for all syllables (Same Talker condition; half male, half female) as a baseline (identical stimuli to the Same Talker trials in Experiment CT). In the other half of trials, the talker was randomly and independently selected for each individual syllable in both target and distractor streams (Random Talker condition). One spreadsheet was used to control stimulus presentation; thus, the talker randomization was identical across participants.

For both experiments, 50% of the trials in each condition included a 250-ms-long interrupting sound 125 ms before the onset of the third target syllable. The interrupter was presented from 90 degrees azimuth in the same hemifield as the distractor stream. For each interrupted trial, the interrupting sound was randomly selected from a set of 48 natural sounds (e.g., cat meowing, glass shattering), without replacement. None of the interrupters were human speech and all were perceptually distinct from the syllables. The sounds were retrieved from the internet, windowed to be 250 ms in duration, and normalized to have the same RMS level. To ensure their saliency, the level of each interrupter was set 5 dB above the syllable level, before spatialization.
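The interrupter timing and level can be worked through numerically. The sketch below assumes that "125 ms before the onset of the third target syllable" refers to the interrupter's onset, and it measures time from the onset of the first target syllable. The 600 ms onset interval and the 96-trial session (half interrupted) come from elsewhere in the Methods; the function name `draw_interrupters` is hypothetical.

```python
import random

ONSET_INTERVAL_MS = 600  # target onset-to-onset interval
INTERRUPTER_MS = 250     # interrupter duration
LEAD_MS = 125            # assumed: interrupter ONSET precedes syllable 3
LEVEL_BOOST_DB = 5.0     # interrupter level relative to the syllables

third_target_onset = 2 * ONSET_INTERVAL_MS
interrupter_onset = third_target_onset - LEAD_MS
interrupter_offset = interrupter_onset + INTERRUPTER_MS

# A level difference in dB maps to an amplitude ratio of 10 ** (dB / 20).
amplitude_gain = 10 ** (LEVEL_BOOST_DB / 20)


def draw_interrupters(n_interrupted_trials, n_sounds=48, seed=0):
    """Pick interrupter indices without replacement for one session."""
    rng = random.Random(seed)
    return rng.sample(range(n_sounds), n_interrupted_trials)


session_interrupters = draw_interrupters(48)
```

Under these assumptions the interrupter spans 1075 to 1325 ms, so it ends 125 ms into the third target syllable, and the 5 dB boost corresponds to an amplitude gain of roughly 1.78. With 48 interrupted trials per session and 48 sounds drawn without replacement, each sound would be used exactly once.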

Note that the experiments were conducted online, and we were not able to control the overall presentation level. Before the experiments began, the participants were instructed to adjust the sounds to a comfortable level.

Task Procedure

Both experiments were conducted online. They were built on the Gorilla platform (https://gorilla.sc), and data were collected through Prolific (https://prolific.co).

Headphone screening was performed before the formal experiment to ensure the participants’ playback systems preserved interaural differences in the stereo signals. In the headphone screening task, participants listened to brief broadband noise signals and identified which contained a binaural Huggins pitch (Milne et al., 2021). In each trial, three 1000-ms-long noise signals were presented, two of which were diotic and one of which imposed a 180-degree interaural phase shift in a narrow frequency band of otherwise identical noise. This interaural phase manipulation produces a strong pitch percept if and only if the playback system preserves binaural cues. Participants performed six-trial-long blocks of the screening task. Anyone who failed to achieve 100% correct on a block within three tries was rejected (13-14 subjects were rejected by headphone screening in each experiment).

Participants were instructed to maintain their gaze on a fixation cross at the center of the screen during the stimulus presentation. At the beginning of each trial, an auditory cue (/ba/) from either 30 degrees to the left or to the right cued the direction of the target on that trial. The target stream started 500 ms after the offset of the cue. The cuing /ba/ sound was spoken by the same talker as the target stream in the Same Talker and Different Talker conditions, and was spoken by a randomly selected talker (among the male and female talkers) in the Random Talker condition. Participants were instructed to attend to the syllable stream coming from the target direction while ignoring the competing distractor stream and any interrupting sounds. After the stimulus presentation, the participants were asked to click on a graphical user interface on their computer screen to report the five syllables (in order) that they heard from the target direction.

Prior to formal data collection, participants completed 6-trial-long training blocks. The training trials were identical to uninterrupted trials in the experiments and were spoken by the same male talker. Feedback was provided after each trial throughout the training session, indicating the correct target sequence. Participants who failed to report 4 out of 5 syllables correctly in at least 3 out of 6 trials within five tries were rejected.

Participants proceeded to the task session after successfully completing the headphone screening and training session. There were 96 trials in each task session, divided into two blocks of 48 trials. Trials were randomized within blocks. In each block, half of the trials were uninterrupted and half interrupted. Of the 24 uninterrupted trials in a block, half were the Same Talker condition and half the other talker condition (Different Talker in Experiment CT; Random Talker in Experiment RT). Similarly, half of the interrupted trials were Same Talker and half the other talker condition.
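The block structure above implies 12 trials per cell (interruption condition by talker condition) in each block. A minimal sketch of such a trial list follows; the dictionary keys and the function name `build_session` are hypothetical and do not reflect the actual Gorilla spreadsheet.

```python
import random
from collections import Counter


def build_session(other_condition, seed=0):
    """Build two shuffled 48-trial blocks with 12 trials per cell."""
    rng = random.Random(seed)
    session = []
    for block in range(2):
        trials = [
            {"block": block, "interrupted": interrupted, "talker": talker}
            for interrupted in (False, True)
            for talker in ("same", other_condition)
            for _ in range(12)
        ]
        rng.shuffle(trials)  # trials are randomized within each block
        session.extend(trials)
    return session


# "different" for Experiment CT; "random" would be used for Experiment RT.
session = build_session("different")
cell_counts = Counter(
    (t["block"], t["interrupted"], t["talker"]) for t in session
)
```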

Data Analysis

For both experiments, each participant’s percent correct syllable recall performance was computed by averaging scores across trials within each condition, separately for each syllable position. The interruption effect was then computed as the percent correct syllable recall in uninterrupted trials minus that in interrupted trials, computed separately for each participant and then averaged across participants. Similarly, in Experiment CT, we computed the different talker benefit as the percent correct syllable recall for the Different Talker condition minus that for the Same Talker condition; in Experiment RT, we computed the continuous talker benefit as the percent correct syllable recall for the Same Talker condition minus that for the Random Talker condition.
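The difference scores defined above can be sketched as follows. The data layout (a dictionary mapping each participant to a list of trial records) and all names are assumptions made for illustration; the toy data are invented and are not results from the study.

```python
def percent_correct(trials, position):
    """Percent of trials with a correct report at a 0-based position."""
    scores = [t["correct"][position] for t in trials]
    return 100.0 * sum(scores) / len(scores)


def interruption_effect(trials, talker, position):
    """Uninterrupted minus interrupted percent correct, one participant."""
    cond = [t for t in trials if t["talker"] == talker]
    uninterrupted = [t for t in cond if not t["interrupted"]]
    interrupted = [t for t in cond if t["interrupted"]]
    return (percent_correct(uninterrupted, position)
            - percent_correct(interrupted, position))


def mean_interruption_effect(data, talker, position):
    """Average the per-participant effect across participants."""
    effects = [interruption_effect(trials, talker, position)
               for trials in data.values()]
    return sum(effects) / len(effects)


def toy_trial(interrupted, third_correct):
    """Invented trial: only the third syllable's outcome varies."""
    return {"talker": "same", "interrupted": interrupted,
            "correct": [True, True, third_correct, True, True]}


toy_data = {
    "p01": [toy_trial(False, True), toy_trial(False, True),
            toy_trial(True, True), toy_trial(True, False)],
    "p02": [toy_trial(False, True), toy_trial(False, True),
            toy_trial(True, True), toy_trial(True, True)],
}

effect_syllable_3 = mean_interruption_effect(toy_data, "same", position=2)
```

The talker benefits follow the same pattern, with the subtraction taken across talker conditions rather than across interruption conditions.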

For both experiments, statistical tests were conducted only on the interruption effect and the different talker and continuous talker effects. Although we show plots of the raw percent correct syllable recall, we do not conduct statistical tests on the raw data to avoid redundancy. Holm-Bonferroni corrected t-tests were first performed on both the interruption effect and the two talker effects to examine whether these effects were greater than zero at each distinct syllable position. We then ran a repeated-measures ANOVA with main factors of talker condition (same vs. different for Experiment CT; same vs. random for Experiment RT) and syllable position (1-5) on interruption effect data. For significant main effects and interaction effects, post hoc Holm-Bonferroni corrected pairwise comparisons were conducted. Similarly, repeated-measures ANOVA and post hoc comparisons were performed for talker effect, with main factors of interruption condition (uninterrupted, interrupted) and syllable position (1-5).
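For readers unfamiliar with the Holm-Bonferroni procedure, the sketch below computes adjusted p-values with the standard step-down rule. The text does not say which software the authors used, so this is a generic illustration, and the input p-values are invented.

```python
def holm_adjust(p_values):
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on,
        # with each product capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Adjusted values may not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


example_adjusted = holm_adjust([0.01, 0.04, 0.03, 0.005])
capped_adjusted = holm_adjust([0.5, 0.6])
```

For the first example the adjusted values are approximately 0.03, 0.06, 0.06, and 0.02. The second example shows how the cap at 1 arises, which is why some comparisons in the Results are reported as adjusted p = 1.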

For these sequence recall tasks, we also analyzed the effect of past history to determine whether recall performance for a syllable differed depending on whether or not participants correctly reported the previous syllable. Specifically, we looked to see if the likelihood of correctly reporting a syllable was greater when a listener properly reported the previous syllable compared to when they got the previous syllable wrong. Overall, such dependencies are likely to matter: if a participant’s attention strays during a trial, the effect is likely to last beyond one syllable. The question we were most interested in was whether these conditional probabilities depended on the talker condition (same or different in Experiment CT, same or random in Experiment RT).

For both experiments, we first computed the raw percent correct performance for each syllable position, broken down by whether the previous target syllable was reported correctly or incorrectly. We then computed the “benefit of previous syllable correct” as the percent correct performance when the previous syllable was reported correctly minus the percent correct performance when the previous syllable was incorrectly recalled. We then performed Holm-Bonferroni corrected t-tests to determine whether this difference was significantly greater than zero for syllables 2-5. We also fit a linear mixed-effect model to the benefit of previous syllable correct with fixed effects of talker conditions (same vs. different for Experiment CT; same vs. random for Experiment RT) and syllable positions (2-5) and random effect of subject with random intercepts for each interruption condition. Post hoc Tukey-adjusted pairwise comparisons were conducted when significant effects were found.

For Experiment RT, in the Random Talker condition, we further broke down recall performance by whether or not the previous target syllable was spoken by the same talker as the current target syllable (Random Talker - Same: random talker trial, but same talker for the previous and current syllable positions; Random Talker - Different: random talker trial, and different talkers for the previous and current syllable positions). We compared performance for these two subdivided Random Talker cases and performance for the Same Talker condition. We then fit a linear mixed-effect model to account for syllable recall performance with fixed effects of talker conditions (Same Talker, Random Talker - Same, Random Talker - Different) and syllable positions (2-5) and a random effect of subject with random intercepts for each interruption condition. Post hoc Tukey-adjusted pairwise comparisons were conducted to help interpret all significant effects.
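The subdivision of Random Talker trials can be sketched by comparing the talker labels at the previous and current target positions. The field names (`talkers`, `correct`), the function names, and the toy trials are assumptions made for illustration.

```python
SAME = "Random Talker - Same"
DIFFERENT = "Random Talker - Different"


def classify_transition(talkers, position):
    """Label a 0-based target position by talker continuity."""
    return SAME if talkers[position] == talkers[position - 1] else DIFFERENT


def recall_by_transition(trials, position):
    """Percent correct at `position`, split by talker continuity."""
    groups = {SAME: [], DIFFERENT: []}
    for trial in trials:
        label = classify_transition(trial["talkers"], position)
        groups[label].append(trial["correct"][position])
    return {
        label: (100.0 * sum(scores) / len(scores) if scores else None)
        for label, scores in groups.items()
    }


toy_random_trials = [
    {"talkers": ["M", "M", "F", "F", "M"],
     "correct": [True, True, False, True, True]},
    {"talkers": ["F", "M", "M", "F", "F"],
     "correct": [True, False, True, True, False]},
]

split_syllable_2 = recall_by_transition(toy_random_trials, position=1)
```

Because the talker is drawn independently for each syllable, roughly half of the transitions within a Random Talker trial would be expected to fall into each subgroup.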

Experiment CT (Continuous Talker)

Results

This experiment compares syllable recall performance and interruption effects with the same or different talkers for target and distractor streams. Figure 2A shows the raw percent correct syllable recall rate for each syllable position in the sequences for all four task conditions. In most of the syllable positions, listeners performed better in recalling the targets when the target and distractor streams were spoken by different talkers than when they were spoken by the same talker. The interrupter resulted in lower syllable recall performance for interrupted trials than uninterrupted trials for the third target syllable. For the subsequent 4th and 5th target syllables, the interrupter reduced recall performance when the target and masker streams were spoken by the same talker, replicating earlier reports (Liang et al., 2022, 2025); however, these later syllables showed little impact of the interrupter when the streams were spoken by different talkers.

Figure 2.

(A) Mean syllable recall performance for Experiment CT (N=45), averaged across participants. Blue represents uninterrupted trials and red interrupted trials; filled bars represent Same Talker conditions and unfilled bars Different Talker conditions. (B) Interruption effect for the Same Talker and Different Talker conditions, computed as percent correct performance in the uninterrupted trials minus that in the interrupted trials, averaged across participants. (C) Talker effects in uninterrupted and interrupted trials, computed as percent correct performance in the Different Talker condition minus that in the Same Talker condition, averaged across participants. The vertical line between syllable position 2 and 3 marks the interrupter timing in all panels. Error bars show the across-participant standard error of the mean in all panels.

Interruption Effect

To highlight the impact of interrupter on syllable recall, Figure 2B shows the interruption effect, computed as the percent correct syllable recall performance in uninterrupted trials minus that in interrupted trials, plotted as a function of syllable position. The interrupter occurred before the third target, which shows the largest interruption effect (around 20% for the Same Talker condition and 10% for the Different Talker condition). For the Same Talker condition, the interruption effect was smaller but positive for syllables 4 (∼8%) and 5 (∼5%), but for the Different Talker condition, the interruption effect was much smaller.

Statistical analyses confirmed these observations. The interruption effect was not significantly greater than zero for either same or different talker conditions for syllables 1 and 2 (adjusted $p > 0.8$ for all). However, the interruption effect for syllable 3 was significant for both same and different talker conditions ($t_{44} = 7.74$, $p < 0.001$ and $t_{44} = 4.15$, $p < 0.001$, respectively). For the Same Talker condition, the interruption effect was significant for both target syllable 4 ($t_{44} = 3.87$, $p = 0.001$) and syllable 5 ($t_{44} = 2.81$, $p = 0.026$), but for the Different Talker condition, there was no statistically significant impact of the interrupter on these syllables (adjusted $p > 0.8$ for all).

Repeated-measures ANOVA on the interruption effect showed a significant effect of talker condition, confirming that the interruption effect was larger in the Same Talker condition than the Different Talker condition ($F_{1, 44} = 6.962$, $p = 0.011$). Syllable position was also significant ($F_{4, 176} = 17.894$, $p < 0.001$), as was the interaction between talker condition and syllable position ($F_{4, 176} = 3.154$, $p = 0.016$). Post hoc tests showed that the interruption effect was larger for the Same Talker than Different Talker condition for syllable 3 ($t_{44} = 3.648$, $p = 0.003$), but not for other syllable positions (adjusted $p = 1$ for syllables 1 and 2, $p > 0.1$ for syllables 4 and 5). In the Same Talker condition, the interruption effect on syllable 3 was significantly larger than for all other syllable positions (syllable 1: $t_{44} = 5.465$, $p < 0.001$; syllable 2: $t_{44} = 6.178$, $p < 0.001$; syllable 4: $t_{44} = 4.166$, $p = 0.022$; syllable 5: $t_{44} = 5.195$, $p < 0.001$), with no difference for the other syllable position comparisons (adjusted $p = 1$ for all). In contrast, after multiple-comparison correction, the interruption effect did not differ significantly between any of the pairs of syllable positions in the Different Talker condition (syllable 1 vs. 3: $t_{44} = -3.363$, $p = 0.198$; syllable 2 vs. 3: $t_{44} = -3.792$, $p = 0.063$; syllable 3 vs. 4: $t_{44} = 3.094$, $p = 0.350$; syllable 3 vs. 5: $t_{44} = 3.145$, $p = 0.330$; adjusted $p = 1$ for all other comparisons).

Talker Effect

Figure 2C shows the talker effect, computed as percent correct performance in the Different Talker condition minus that in the Same Talker condition, quantifying the benefit of having distractors spoken by a different talker. In general, the talker effect was positive, demonstrating a performance advantage when the streams were spoken by distinct talkers. In the uninterrupted condition, the talker benefit was modest. For syllables 3-5, the benefit of having different talkers was larger when the interrupter was present than in uninterrupted trials. The talker effect was greatest, and the difference between the talker effect in the interrupted and uninterrupted conditions was greatest, for syllable 3, which occurred right after the interrupter.

T-tests showed that the talker effect was significantly greater than zero for all but the first syllable in the interrupted condition (syllable 1: $t_{44} = 1.13$, $p = 0.531$; syllable 2: $t_{44} = 2.99$, $p = 0.016$; syllable 3: $t_{44} = 6.24$, $p < 0.001$; syllable 4: $t_{44} = 3.73$, $p = 0.002$; syllable 5: $t_{44} = 2.47$, $p = 0.044$). In the uninterrupted condition, the talker effect was significant only for syllables 2 and 3 ($t_{44} = 3.43$, $p = 0.005$ and $t_{44} = 2.93$, $p = 0.016$, respectively; $p > 0.5$ for all other syllable positions).

Repeated-measures ANOVA on the talker effect with factors of interruption condition and syllable position found that both main effects were significant ($F_{1, 44} = 6.962$, $p = 0.011$ and $F_{4, 176} = 9.135$, $p < 0.001$, respectively); their interaction was also significant ($F_{4, 176} = 3.154$, $p = 0.016$). Post hoc tests showed that on the third target syllable, the talker effect was larger for interrupted than uninterrupted trials ($t_{44} = 3.648$, $p < 0.001$), but the talker effect was not significantly different between interrupted and uninterrupted trials for any other syllable positions ($p > 0.15$ for all). In the interrupted condition, performance for the third syllable showed a significantly larger talker effect than on the first syllable ($t_{44} = 5.290$, $p < 0.001$); however, the talker effect was not significantly different between any other syllable pairs in the interrupted condition or any syllable pairs in the uninterrupted condition (in interrupted condition, syllable 2 vs. 3: $t_{44} = -3.276$, $p = 0.301$; syllable 3 vs. 5: $t_{44} = 3.636$, $p = 0.117$; adjusted $p = 1$ for all other comparisons in both interruption conditions).

Within-Trial Effect of Performance on Previous Syllable

For both uninterrupted and interrupted trials, and across all syllable positions, there was a strong sequential effect: participants were more likely to correctly recall a target when the previous syllable was correctly recalled than when it was incorrectly recalled (Figure 3 top panels).

Figure 3.

Top panels: Performance with previous syllable correctly vs. incorrectly recalled for Experiment CT (N=45). Bottom panels: Benefit of having previous syllable correctly recalled, computed as the performance with previous syllable correctly recalled minus performance when it was incorrectly recalled. In the right panels, the vertical line between syllables 2 and 3 indicates the interrupter timing. Error bars show standard error of the mean across participants.

There is another consistent effect for both uninterrupted and interrupted trials: when the previous syllable was heard correctly, performance was always better for the Different Talker condition than for the Same Talker condition (the open solid bars are always higher than the corresponding filled solid bars in both panels). However, when listeners get the previous syllable incorrect, performance differs for uninterrupted and interrupted trials. Specifically, when the previous syllable is incorrect in the uninterrupted trials, performance in the Different Talker condition tends to be slightly worse than for the Same Talker condition (in the left panel of Figure 3, the white bars with thin blue stripes are generally lower than the blue bars with thin white stripes); however, for interrupted trials, performance in the Different Talker condition tends to be slightly better than for the Same Talker condition (in the right panel of Figure 3, the white bars with thin reddish stripes are generally higher than or equal to the reddish bars with thin white stripes).

To quantify whether the benefit of correctly reporting a previous syllable varies with talker condition and interruption, we computed the average “benefit of previous syllable correct” (percent correct on a syllable when the previous syllable was correctly recalled minus that when the previous syllable was incorrectly recalled). The bottom panels of Figure 3 show this benefit separately for uninterrupted and interrupted trials, and for same talker and different talker.
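The benefit measure defined above can be illustrated with a minimal Python sketch. It assumes each trial is stored as a list of booleans (one per target syllable, True meaning correctly recalled) for a single participant and condition; the function name and data layout are hypothetical and are not taken from the analysis code used in the study.

```python
def previous_correct_benefit(trials, position):
    """Benefit of previous syllable correct, in percentage points.

    trials: list of per-trial lists of booleans (True = correct recall).
    position: 1-based syllable position; must be 2 or greater.
    Returns None when either conditional accuracy is undefined.
    """
    # Outcomes at `position`, split by the outcome at the previous position.
    after_correct = [t[position - 1] for t in trials if t[position - 2]]
    after_incorrect = [t[position - 1] for t in trials if not t[position - 2]]
    if not after_correct or not after_incorrect:
        return None  # no trials of one kind, so the difference is undefined
    acc_after_correct = sum(after_correct) / len(after_correct)
    acc_after_incorrect = sum(after_incorrect) / len(after_incorrect)
    return 100.0 * (acc_after_correct - acc_after_incorrect)


# Hypothetical two-syllable trials for one participant.
example_trials = [[True, True], [True, True], [False, False], [False, True]]
benefit = previous_correct_benefit(example_trials, position=2)
```

In this toy data set, accuracy on syllable 2 is 100% after a correct syllable 1 and 50% after an incorrect one, so the benefit is 50 percentage points. Group-level values like those plotted in Figure 3 would presumably come from averaging such per-participant benefits.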

In uninterrupted trials, the benefit of getting the previous syllable correct tends to be greater in the Different Talker condition than in the Same Talker condition. However, in interrupted trials, no such pattern appears; instead, the benefit of getting the previous syllable correct varies with syllable position. Specifically, the benefit is small for target syllable 3, which came right after the interrupter, and for the subsequent syllable 4.

T-tests confirmed that the benefit of getting the previous syllable correct was positive for all syllable positions ($p < 0.001$ for all). Linear mixed effect models were fit separately for uninterrupted and interrupted conditions to determine if the previous correct benefit varied across talker conditions and syllable positions.

For uninterrupted trials, there was a significantly greater previous syllable correct benefit in the Different Talker than in the Same Talker condition ($F_{1, 290.56} = 6.511$, $p = 0.011$), but no significant effect of syllable position ($F_{3, 289.07} = 2.427$, $p = 0.066$) and no significant interaction between talker condition and syllable position ($F_{3, 288.04} = 1.612$, $p = 0.187$). Thus, in the uninterrupted trials, the benefit of properly attending and reporting the previous syllable had a larger effect on performance for Different Talker than for Same Talker trials.

For interrupted trials, there was a significant effect of syllable position ($F_{3, 293.66} = 11.539$, $p < 0.001$), but not of talker condition ($F_{1, 293.80} = 0.022$, $p = 0.883$) or of the interaction of syllable position and talker condition ($F_{3, 292.82} = 0.874$, $p = 0.455$). Post hoc pairwise comparisons showed the benefit on target syllable 2 was significantly greater than that on target syllable 3 ($p < 0.001$) and target syllable 4 ($p = 0.001$); also, the benefit on target syllable 5 was significantly greater than that on target syllable 3 and target syllable 4 ($p < 0.001$ for both). There were no significant differences between the benefits for syllables 2 and 5 or for syllables 3 and 4 (adjusted $p > 0.9$ for both). Thus, for interrupted trials, the benefit of properly attending and reporting the previous syllable was smaller for the first and second syllables after the interruption than for the other syllables.

Discussion

Experiment CT investigated how differences in talker identity between target and distractor streams influence spatial selective attention when a salient, task-irrelevant sound interrupts attention. Specifically, we asked whether presenting competing streams with different talkers reduced the disruptive impact of an interrupter and improved recall of target syllables, even though spatial location alone defined the target stream.

Three key findings emerged. First, when the target and distractor were spoken by different talkers, listeners recalled target syllables more accurately overall. Second, the performance decrement due to interruption was significantly smaller in the Different Talker condition than in the Same Talker condition. Third, the interrupter produced a long-lasting decrement in the Same Talker condition; however, when the two streams differed by voice, only recall of the third target syllable was affected by the interrupter.

The observed pattern of results suggests that talker differences benefit performance through three proposed distinct but complementary mechanisms. First, talker differences enhance perceptual segregation and streaming. Second, talker identity differentiates target and masker syllables stored in working memory. Third, talker-based attention augments or even supplants spatial attention following an interruption. Beyond these mechanisms, which likely apply broadly to many spatial attention and working memory tasks, we also observed sequential effects that depended on stimulus characteristics, suggesting that perceptual continuity of the auditory stream further shapes sequence recall performance. These ideas will each be discussed in the subsections below.

Talker Differences Support Perceptual Segregation and Streaming

Even in uninterrupted conditions, talker differences provided small but significant improvements in target recall compared to when the talkers were the same. Although talker identity covaries with spatial location and thus may appear redundant, talker differences nonetheless strengthen perceptual streaming and improve selective attention. In addition, talker differences supply an additional cue for maintaining attentional selection; past work shows that even when a target stream is defined by its location, listeners may use more robust spectrotemporal features to sustain attention to an ongoing target stream (Bonacci et al., 2020). Because male and female voices differ systematically in pitch and spectral composition, their neural representations overlap less in the auditory cortex, making it easier to maintain separate object representations for target and distractor streams. Such improvements can account for the small but consistent advantage observed in the Different Talker condition during uninterrupted trials.

Talker Cues Differentiate Syllables Recalled From Working Memory

The large drop in accuracy for recall of the third target syllable likely reflects a transient disruption of attentional focus: a hijacking of top-down attention by the salient interrupter away from the target stream. When attention is diverted by the interrupter, the third target syllable must often be reconstructed from working memory rather than perceived directly. In working memory, auditory object representations bind together spectrotemporal features that not only define syllable content, but also include pitch, timbre, and other voice-related features (Bizley & Cohen, 2013). Our results suggest that spatial location is only weakly and inconsistently bound to spectrotemporal features of an auditory object in working memory, explaining why the cost of interruption is much larger in the Same Talker than in the Different Talker condition.

Past work helps explain why this could be the case. Spatial attributes of auditory events are represented only weakly in the auditory cortex unless listeners are actively engaged in a task requiring spatial information (Lee & Middlebrooks, 2011). Such auditory spatial tasks strongly recruit a frontoparietal visuospatial cognitive control network that is not inherently auditory (Michalka et al., 2015). Disruptions of top-down spatial attention likely interfere with volitional engagement of the visuospatial network necessary for spatial auditory processing, so that working memory fails to bind spatial to spectrotemporal attributes. Consequently, when target and distractor share the same talker, spatial cues alone cannot robustly guide retrieval of the correct syllable from working memory, increasing confusions and recall errors. In contrast, when voices differ, talker identity acts as an intrinsic retrieval cue, enabling listeners to sort stored syllables by voice and recover the target more accurately.

Talker Identity Provides an Alternative Feature to Guide Top-Down Attention

The effects of the interrupter persisted through the fourth and fifth syllables only in the Same Talker condition. When the target and distractor voices differed, performance on these syllables was statistically indistinguishable from uninterrupted performance, suggesting that listeners could re-establish focus on the target stream more efficiently. This pattern indicates that talker cues not only facilitate streaming but also aid the reorientation of top-down attention following an interrupter. When target and distractor talkers differ, after attention is drawn away, listeners can refocus not only on where the target is located but also on who is speaking—using talker identity as a stable feature to guide attention.

This explanation echoes prior results showing that in spatial selective-attention tasks (where location defines the target), listeners may shift to maintaining attention based on other object-defining features such as pitch, when available (Bonacci et al., 2020). In the absence of such cues, as in the Same Talker condition, refocusing relies solely on spatial information, which appears to be slower and more error-prone. The persistence of the interruption effect in the Same Talker condition through the fourth and fifth syllables likely reflects the gradual buildup of selective attention when a listener maintains focus on the same spatial location over time. For instance, when listeners sustain top-down spatial attention to a stable target location, performance in a spatial selective attention task improves from syllable to syllable (Best et al., 2008). This kind of slow buildup may reflect the need for auditory spatial attention to recruit the non-native visuospatial network that supports spatial target selection. In the Different Talker condition, attention to talker identity, established in the first and second syllables, can bypass this slower spatial-control system, reorienting attention rapidly following an interrupter.

Interruptions Break Perceptual Effects of Stream Continuity

Sequential dependencies in performance across syllables revealed another influence of talker continuity. In general, listeners were more accurate when they had correctly identified a preceding syllable than when they had responded incorrectly on the previous syllable. When uninterrupted, this sequential dependency differed between talker conditions, reflecting differences in whether the two competing streams were perceptually distinct or spectrotemporally similar. However, when an interruption occurred, any influence of stream continuity disappeared; instead, performance depended only on the temporal proximity of the syllable to the interruption itself.

In uninterrupted trials, the benefit of having correctly identified the previous syllable was larger when the two competing streams were spectrotemporally distinct than when they differed only in location. Past studies show that when listeners focus attention on one auditory element, a subsequent element that is spectrotemporally similar is automatically more likely to capture attention (Bressler et al., 2014). This aligns with the current results: when listeners were correct on the previous syllable, automatic streaming, not spatial attention alone, increased the likelihood of maintaining attention on the same spectrotemporal stream in Different Talker trials. Conversely, when listeners misidentified a syllable (often by incorrectly attending to the distractor stream), that same mechanism may have reinforced attention to the wrong voice, reducing accuracy on the next syllable. Together, results from the uninterrupted trials suggest that continuity of distinct target and masker streams contributes strongly to listeners’ ability to maintain focus and resist interference.

A salient interruption, however, broke perceptual continuity. The interruption effectively “reset” the buildup of streaming, causing the benefit of getting the previous syllable correct to nearly vanish for subsequent syllables. Only by the final syllable did the prior-correct benefit begin to re-emerge. Thus, while perceptual and attentional buildup strengthen over time, they remain vulnerable to disruption by salient, attention-capturing events.

Conclusions From Experiment CT

Taken together, these findings replicate prior work showing that unattended features bind with an attended object, improving target selection and recall in selective attention tasks (e.g., Best et al., 2008; Fischer et al., 2024). They extend that work by demonstrating that voice differences between concurrent streams mitigate the disruptive effects of a salient interrupter during spatial selective attention. These results demonstrate that spatial selective attention in complex auditory scenes is supported by a dynamic interplay among perceptual, mnemonic, and attentional-control mechanisms—all of which can be reinforced by consistent non-spatial cues such as voice identity. Sequential analyses show that the continuity of distinct voices reinforces perceptual streaming, allowing attentional focus to build up over time. However, this buildup is fragile, easily broken by salient events that disrupt top-down attention. Thus, distinct voice cues can enhance resistance to disruption and allow streaming to build over time but cannot fully overcome the involuntary capture of attention by new events.

Experiment RT (Random Talker)

Results

Experiment RT compared performance when both target and distractor streams were spoken by the same talker with performance when the talker for each syllable within each stream was randomized, independently, within a trial. We expected talker randomization to break the continuity of acoustic features for both streams and make the task harder. Consistent with this prediction, syllable recall performance was generally lower in the Random Talker condition than in the Same Talker condition (Figure 4A). As in many of our previous experiments in this series, performance on the first and last syllables was typically better than that for the middle syllables, reflecting primacy and recency effects in recall.

Figure 4.

(A) Mean syllable recall performance in Experiment RT (N=44), averaged across participants. Blue represents uninterrupted trials and yellow interrupted trials; filled bars show same talker conditions and unfilled bars random talker conditions. (B) Interruption effects for the same and random talker conditions, computed as percent correct in the uninterrupted trials minus that in the interrupted trials, averaged across participants. (C) Talker effects in uninterrupted and interrupted trials, computed as percent correct performance in the same talker condition minus that in the random talker conditions, averaged across participants. The vertical lines between syllables 2 and 3 in each panel indicate the interrupter timing. Error bars in each plot show the across-participant standard error.

Interruption Effect

We computed the interruption effect (Figure 4B) as the average percent correct performance in the uninterrupted trials minus that in the corresponding interrupted trials. The interrupter always occurred before the third target syllable, leading to a large interruption effect for both Same and Random Talker conditions for syllable 3. The interruption effect was similar for Same and Random Talker conditions for both the fourth syllable, where there was a modest effect, and the fifth syllable, where the effect was near zero. In the Random Talker condition, the average interruption effect was greater than zero for both the first and second syllables. Unexpectedly and unlike all previous results in this series of studies, in the Same Talker condition, listeners performed better in interrupted trials than uninterrupted trials for the very first target syllable, producing a negative interruption effect for this syllable; the interruption effect was near zero for the second syllable in the Same Talker condition.
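As a rough illustration of the difference score described above, the following Python sketch computes a per-position interruption effect for one participant and one talker condition. The trial representation (lists of booleans) and the function names are assumptions made for illustration and do not reflect the study's actual analysis code.

```python
def percent_correct(trials, position):
    """Percent of trials with a correct response at a 1-based syllable position."""
    return 100.0 * sum(t[position - 1] for t in trials) / len(trials)


def interruption_effect(uninterrupted, interrupted, n_positions=5):
    """Uninterrupted minus interrupted percent correct at each syllable position.

    A positive value means the interrupter hurt recall; a negative value means
    recall was better on interrupted trials.
    """
    return [
        percent_correct(uninterrupted, p) - percent_correct(interrupted, p)
        for p in range(1, n_positions + 1)
    ]


# Hypothetical two-syllable trials for one participant.
uninterrupted_trials = [[True, True], [False, True]]
interrupted_trials = [[True, False], [True, True]]
effects = interruption_effect(uninterrupted_trials, interrupted_trials, n_positions=2)
```

With these toy values the effect is -50 points at position 1 (better recall on interrupted trials) and +50 points at position 2. The same subtraction, applied across talker conditions instead of interruption conditions, would give a talker effect.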

These observations were supported by statistical analyses. Two-sided t-tests showed that there was a significant positive interruption effect for the Random Talker on target syllable 2 ($t_{43} = 2.81$, $p = 0.052$) and target syllable 3 ($t_{43} = 3.20$, $p = 0.021$); and for the Same Talker on target syllable 3 ($t_{43} = 4.50$, $p < 0.001$). There was also a significant negative interruption effect on target syllable 1 ($t_{43} = -3.76$, $p = 0.005$) with the Same Talker condition. The interruption effect was not significantly different from zero for any other syllable in either Same or Random Talker conditions ($p > 0.3$ for all).

We have never previously observed better performance on any syllable in interrupted trials than in uninterrupted trials. We therefore examined performance on syllable 1 to understand whether this was a statistical fluke. Of the 44 subjects, 25 performed better on syllable 1 for interrupted trials than for uninterrupted trials (12 performed better in uninterrupted trials, and 7 performed equally in the two trial types), confirming that the effect is consistent (see Supplemental Figure S1). This result is considered further in Section Differences Across Experiments, which considers differences between Experiments CT and RT.
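A per-participant tally of this kind can be sketched in a few lines of Python. The sketch assumes two equal-length lists holding each participant's syllable 1 accuracy in the two trial types; the function name, dictionary keys, and example scores are hypothetical.

```python
def tally_direction(uninterrupted_scores, interrupted_scores):
    """Count participants by the direction of their interrupted-vs-uninterrupted difference.

    Both arguments hold one accuracy score per participant, in the same order.
    """
    pairs = list(zip(uninterrupted_scores, interrupted_scores))
    return {
        "interrupted_better": sum(i > u for u, i in pairs),
        "uninterrupted_better": sum(u > i for u, i in pairs),
        "equal": sum(u == i for u, i in pairs),
    }


# Hypothetical accuracy scores (percent correct) for four participants.
tally = tally_direction([80, 70, 90, 60], [85, 65, 90, 75])
```

Counting directions rather than comparing means shows whether a group-level difference is carried by most participants or by a few extreme ones.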

Repeated-measure ANOVA showed a significant main effect of syllable position ($F_{4, 172} = 8.427$, $p < 0.001$) but not talker condition ($F_{1, 43} = 3.471$, $p = 0.069$), and a significant interaction between syllable position and talker condition ($F_{4, 172} = 3.853$, $p = 0.007$). Post hoc pairwise comparisons showed a significantly greater interruption effect on the first target syllable with Random Talker than Same Talker ($t_{43} = 4.144$, $p < 0.001$), and no significant difference between those two talker conditions at other syllable positions (syllable 2: $t_{43} = 2.360$, $p = 0.092$; adjusted $p = 1$ for all other syllable positions). In the Same Talker condition, the interruption effect on the third target syllable was significantly larger than that on the first ($t_{43} = 7.57$, $p < 0.001$), the second ($t_{43} = 4.02$, $p = 0.034$), and the fifth target syllable ($t_{43} = 4.09$, $p = 0.031$). The interruption effect on the fourth target was also significantly greater than that on the first target syllable ($t_{43} = 4.06$, $p = 0.033$). There were no statistically significant differences across other syllable positions in either of the talker conditions (adjusted $p > 0.7$ for all).

Talker Effect

The benefit of keeping the talker the same rather than randomized (computed as performance in the Same Talker condition minus that in the Random Talker condition) differed between uninterrupted and interrupted trials (see Figure 4C). In the uninterrupted trials, the benefit of the talker being continuous increased monotonically with syllable position. In the interrupted trials, the benefit was similar to that in the uninterrupted trials for syllables 3, 4, and 5; however, there was a much greater talker benefit for the two syllables before the interrupter, with subjects performing roughly 8% better in the Same Talker condition than when the talker was randomized.

Single-sided t-tests showed that, in uninterrupted trials, the talker benefit was greater than zero for target syllable 4 ($t_{43} = 2.99$, $p = 0.016$) and target syllable 5 ($t_{43} = 2.98$, $p = 0.016$), but not other syllable positions ($p > 0.2$ for all). In interrupted trials, the talker benefit was greater than zero for target syllable 1 ($t_{43} = 5.79$, $p < 0.001$), target syllable 2 ($t_{43} = 3.94$, $p = 0.001$), and target syllable 5 ($t_{43} = 3.45$, $p = 0.005$), but not for syllables 3 ($t_{43} = 0.59$, $p = 0.562$) and 4 ($t_{43} = 2.12$, $p = 0.099$).

Repeated-measure ANOVA with main factors of interruption condition and syllable position showed no significant main effects (interruption: $F_{1, 43} = 3.471$, $p = 0.069$; syllable position: $F_{4, 172} = 1.405$, $p = 0.234$), but a significant interaction effect ($F_{4, 172} = 3.853$, $p = 0.005$). Post hoc pairwise comparisons showed that interrupted trials had a significantly greater talker benefit than uninterrupted trials on target syllable 1 ($t_{43} = 4.14$, $p < 0.001$). The talker benefit was not significantly different across syllable positions within each interruption condition (syllable 2: $t_{43} = 2.36$, $p = 0.092$; adjusted $p = 1$ for all other syllable positions).

Within-Trial Effect of Performance on Previous Syllable

Similar to Experiment CT, participants were more likely to correctly recall a target when the previous syllable was correctly recalled than when it was incorrectly recalled (Figure 5 top panels). For both uninterrupted and interrupted trials, performance was consistently better for the Same Talker condition than Random Talker condition when the previous syllable was correctly recalled. However, there was no consistent effect when the previous syllable was incorrect.

Figure 5.

Top panels: Performance with previous syllable correctly vs. incorrectly recalled for Experiment RT (N=44). Bottom panels: Benefit of having the previous syllable correctly recalled, computed as performance when the previous syllable was correctly recalled minus performance when it was incorrectly recalled. The vertical bar in the right panel indicates the interrupter timing. Error bars show standard error of the mean across participants.

Syllable position had a greater effect on performance when the previous syllable was incorrectly reported than when it was correctly reported. For both uninterrupted and interrupted trials, performance when the previous syllable was incorrect increased from about 20% correct for syllable 2 to about 50% correct for syllable 5. When the previous syllable was correctly reported, performance varied little with syllable position in uninterrupted trials, while for interrupted trials, performance was lowest for syllable 3 (after the interruption) and highest for syllable 5.

The benefit of correctly reporting the previous syllable, computed as the percent correct syllable recall on a syllable when the previous syllable was correctly recalled minus that when the previous syllable was incorrectly recalled, is plotted in the bottom panels of Figure 5. For both uninterrupted and interrupted conditions, the benefit of getting the previous syllable correct was greatest for the second syllable position. For uninterrupted trials, this benefit generally decreased with syllable position. For interrupted trials, the benefit was relatively small for syllables 3 and 4 and larger for syllables 2 and 5, especially in the Same Talker trials.

T-tests confirmed that the benefit of getting the previous syllable correct was positive for all syllable positions in all conditions ($p < 0.001$ for all). Linear mixed effect models were fit separately for uninterrupted and interrupted conditions to determine if the previous correct benefit varied across talker conditions and syllable positions.

For uninterrupted trials (bottom left panel of Figure 5), there was a significant effect of syllable position ($F_{3, 287.48} = 9.566$, $p < 0.001$), but no significant effect of talker ($F_{1, 287.26} = 2.334$, $p = 0.128$) or interaction between syllable position and talker condition ($F_{3, 286.04} = 0.625$, $p = 0.599$). Post hoc pairwise comparisons showed that the benefit on target syllable 2 was significantly greater than that on target syllable 3 ($p = 0.012$), target syllable 4 ($p < 0.001$), and target syllable 5 ($p < 0.001$); there were no significant differences between target syllables 3, 4, and 5 ($p > 0.3$ for all).

For interrupted trials (bottom right panel of Figure 5), there was a significant main effect of syllable position ($F_{3, 295.07} = 13.423$, $p < 0.001$) and a small but significant main effect of talker condition ($F_{1, 294.33} = 3.953$, $p = 0.048$), but no significant interaction between them ($F_{3, 294.31} = 1.721$, $p = 0.163$). Post hoc pairwise comparisons showed the benefit on target syllable 2 was significantly greater than that on target syllable 3 ($p < 0.001$), target syllable 4 ($p < 0.001$), and target syllable 5 ($p = 0.002$); the benefit on target syllable 5 was significantly greater than that on target syllable 4 ($p = 0.048$), with no other significant differences between target syllable positions ($p > 0.2$ for all).

Within-Trial Effect of Talker Continuity

In Random Talker trials, the talker was selected independently for each syllable. If the effects of talker continuity are automatic, this should impact performance. To explore this possibility, we analyzed performance based on whether a target syllable was spoken by the same talker as the previous target syllable (Figure 6), both for Uninterrupted trials and for Interrupted trials. Within each panel, we compare trials from the Same Talker condition, where the talker was the same for all syllables (both target and distractor), to trials from the Random Talker condition in which the previous target syllable was the same talker as the current syllable (henceforth, Random Talker - Same condition; open bars) and Random Talker trials where the previous syllable and current syllable were spoken by different talkers (Random Talker - Different condition; striped bars).
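The grouping of Random Talker syllables by local talker continuity can be sketched as follows. This is only an illustrative Python sketch: it assumes each trial is a pair of lists (target talker labels and correctness flags, one entry per target syllable), and the function name, labels, and example data are hypothetical.

```python
from collections import defaultdict


def accuracy_by_talker_continuity(trials):
    """Percent correct per (syllable position, continuity label).

    trials: list of (talkers, correct) pairs, where `talkers` lists the target
    talker for each syllable and `correct` lists booleans for recall accuracy.
    The first syllable has no predecessor, so positions start at 2.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [number correct, number of syllables]
    for talkers, correct in trials:
        for idx in range(1, len(talkers)):
            # "same" if the target talker is unchanged from the previous target syllable.
            label = "same" if talkers[idx] == talkers[idx - 1] else "different"
            key = (idx + 1, label)  # store the 1-based syllable position
            counts[key][0] += int(correct[idx])
            counts[key][1] += 1
    return {key: 100.0 * n_correct / n_total
            for key, (n_correct, n_total) in counts.items()}


# Hypothetical three-syllable Random Talker trials.
random_talker_trials = [
    (["A", "A", "B"], [True, True, False]),
    (["A", "B", "B"], [True, False, True]),
]
by_continuity = accuracy_by_talker_continuity(random_talker_trials)
```

Because the talker is drawn independently for each syllable, the number of syllables falling into the "same" and "different" groups will generally differ across positions and participants, which is worth keeping in mind when comparing the resulting accuracies.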

Figure 6.

Syllable recall performance grouped by talker of the previous syllable (Experiment RT, N=44), with Same Talker vs. Random Talker - Same (talker unchanged from previous target) vs. Random Talker - Different (talker changed from previous target). The vertical line in the right panel shows the interrupter timing. Error bars show standard error of the mean across participants.

In uninterrupted trials, performance was consistently best in the Same Talker condition, intermediate in the Random Talker - Same condition, and worst in the Random Talker - Different condition. In interrupted trials, performance in the Same Talker condition was generally best for all syllable positions (with performance roughly equal to Random Talker - Same performance on syllable 3). For syllables 3-5, after the interruption, performance was better for Random Talker - Same than for Random Talker - Different; however, for syllable 2, performance was the same whether the talker was the same or differed from the talker on the first syllable.

Linear mixed effect models determined whether or not there were significant differences in performance depending on talker continuity. Models were fitted separately for uninterrupted and interrupted conditions.

For uninterrupted trials, there were significant main effects of syllable position ($F_{3, 473} = 3.122$, $p = 0.026$) and talker condition ($F_{2, 473} = 11.495$, $p < 0.001$), but no significant interaction between them ($F_{6, 473} = 0.940$, $p = 0.466$). Post hoc pairwise comparisons showed that performance on target syllable 3 was lower than on target syllable 5 ($p = 0.037$), with no significant differences between other syllable positions (syllable 2 vs. 3: $p = 0.064$; $p > 0.1$ for all other comparisons). Performance in the Random Talker - Different condition was significantly worse than in either the Same Talker ($p < 0.001$) or the Random Talker - Same condition ($p = 0.004$), but there was no significant difference between the Same Talker and Random Talker - Same conditions ($p = 0.294$).

For interrupted trials, our results showed significant main effects of syllable position ($F_{3, 473} = 27.337$, $p < 0.001$) and talker condition ($F_{2, 473} = 12.628$, $p < 0.001$), with no significant interaction between them ($F_{6, 473} = 1.524$, $p = 0.168$). Post hoc pairwise comparisons showed that performance for syllable 3 was significantly worse than for syllables 2, 4, and 5 ($p < 0.001$ for all). Performance for syllable 5 was also significantly better than for syllables 2 ($p = 0.010$) and 4 ($p = 0.003$). Performance for Random Talker - Different was significantly worse than performance for both Same Talker ($p < 0.001$) and Random Talker - Same ($p = 0.003$) conditions, but there was no significant difference between Same Talker and Random Talker - Same conditions ($p = 0.225$).

Discussion

Experiment RT investigated how discontinuity in a task-irrelevant feature (talker) impacts syllable recall in an interrupted spatial selective attention task and how internally generated discontinuities (these random talker changes) interact with an external, salient interrupter. We specifically asked whether discontinuity interferes with streaming and the buildup of spatial attention and thereby increases the impact of interruption.

Talker Discontinuities Disrupt Attention and Streaming Similarly to External Interrupters

In uninterrupted trials, maintaining a continuous talker produced a significant performance benefit. This continuous talker benefit reflects the contribution of automatic spectrotemporal grouping working on top of the attended spatial cue. Breaking down the Random Talker condition based on syllable-level talker continuity allows us to gain more insights: performance is worse when the talker changes from syllable to syllable than when the talker remains the same across adjacent syllable positions. This suggests that even brief stretches of continuity promote streaming and help listeners sustain selective attention; conversely, when the talker switches, each talker change acts as its own miniature disruption, resetting streaming, thereby interfering with spatial selective attention.

In uninterrupted trials, the continuous talker benefit increases from syllable to syllable, suggesting there is a buildup of streaming when the talker does not change. This buildup parallels previous reports of a buildup of spatial attention when a target stream comes from the same location from syllable to syllable (Best et al., 2008). However, any kind of interruption, whether from a new event or from a talker switch, interferes with this buildup, leading to a severe cost in recall performance.

Many past studies show that there are perceptual costs when processing a speech sequence containing talker switches (Mullennix et al., 1989; H. C. Nusbaum & Morin, 1989). While traditional accounts of this cost argue there is a need for “talker normalization” to correctly interpret acoustically different sounds as the same utterance (e.g., H. Nusbaum & Magnuson, 1997), more recent accounts recognize that abrupt voice changes or discontinuities can also disrupt attention and streaming (e.g., Lim et al., 2021; see Luthra, 2024 for review). Here, for post-interrupter syllables, we observed no difference in the interruption effect across Same and Random Talker conditions (Figure 4C). This finding suggests that when streaming is fragmented by interruption, the impacts of talker discontinuity are not additive with effects of an external interrupter on sequential streaming; instead, either type of disruption breaks streaming and interferes with attention.

Sequential analysis hints at a finer-grained interaction: the separation between Random Talker - Same and Random Talker - Different appears larger for syllables 3 and 4 in interrupted than uninterrupted trials (Figure 6). This suggests that local talker continuity helps listeners resist the effects of an external interrupter at the syllable level. When the talker actually switches for a syllable, the talker switch appears to introduce a slightly bigger cost for a syllable closely following the interruption than when the talker switch occurs in other positions. However, confounds in the current design limit this type of sequential analysis. Specifically, trial counts in these subdivisions are unbalanced and we cannot cleanly conclude that the external interruption effect sizes differ across these sub-divided conditions. Exploring how syllable-level talker continuity interacts with the effect of external interrupter would be a worthwhile pursuit for future work.

Together, these results suggest that talker discontinuities reorient attention much like a salient external interrupter: they break the perceptual continuity of the target stream, reset the buildup of selective attention, and weaken the influence of prior attentional state on subsequent attentional performance. Even task-irrelevant acoustic features, when unstable, impose a substantial cost on selective attention.

Differences Across Experiments

In Experiment RT, participants showed better performance in interrupted than uninterrupted trials on the first target syllable in the Same Talker condition. As noted in the Results section of Experiment RT (Interruption effect) and shown in the supplemental materials, we confirmed that the effect on the first syllable was consistent, occurring for the majority of our subjects. This was a surprise. Along with the current Experiment CT, eleven already published experiments tested different versions of the Same Talker condition (Liang et al., 2022, 2025). In none of these experiments was performance better for any syllable in the interrupted condition than in the corresponding uninterrupted condition.

Same Talker Performance Differs Across Experiments CT and RT

To understand why interruption caused performance to be better for the first syllable in Same Talker trials of Experiment RT, we explored whether performance was worse than we typically find on uninterrupted trials, or better on interrupted trials. To this end, we compared performance in the Same Talker conditions in Experiments CT and RT (see Figure 7). Though we cannot completely rule out other factors that may have differed across the experiments, this comparison of Same Talker results still provides valuable information:

1) In uninterrupted trials, performance was worse in Experiment RT than CT; however, the difference in syllables 1-3 was significantly bigger than that for syllables 4 and 5; and

2) In interrupted trials, performance was similar across the two experiments.

Figure 7.

Comparison of syllable recall performance in Experiments CT and RT with the same talker (data replotted from Figures 2 and 4; N=45 for Exp. CT, N=44 for Exp. RT). Error bars show standard error of the mean across participants.

Thus, the smaller interruption effect on Targets 1, 2, and 3 in Experiment RT was not because performance was better for interrupted trials; instead, it was due to worse performance on uninterrupted trials.

Repeating Syllable Pairs Leads to Better Recall Performance in Same Talker Conditions

We wondered whether differences in the constraints we applied to the neighboring syllable transitions in the two experiments help explain differences in the Same Talker condition for Experiments CT and RT. In Experiment CT, syllables 1-4 could match one or both neighboring syllables; however, syllable 5 was guaranteed to differ from syllable 4. In Experiment RT, syllables 1-3 never matched any neighboring syllable, but syllables 4 and 5 could match each other.
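The two sets of transition constraints can be made concrete with a short sketch. The code below is illustrative only and is not the stimulus-generation code used in the experiments: the three-item syllable set, the function names, and the uniform random draws are all assumptions introduced here. It encodes only the rules stated above (in Experiment CT, syllable 5 must differ from syllable 4; in Experiment RT, the transitions 1-2, 2-3, and 3-4 must differ, while 4-5 is unconstrained).

```python
import random

# Hypothetical syllable set, for illustration only; the actual stimulus set
# is not specified in this section.
SYLLABLES = ["ba", "da", "ga"]


def make_sequence(experiment, rng, length=5):
    """Draw one syllable sequence under the neighbor constraints described above."""
    if experiment not in ("CT", "RT"):
        raise ValueError("experiment must be 'CT' or 'RT'")
    seq = [rng.choice(SYLLABLES)]
    for pos in range(1, length):  # pos: 0-based index of the syllable being drawn
        prev = seq[-1]
        if experiment == "CT":
            # Only the transition into the final syllable is forced to differ.
            must_differ = pos == length - 1
        else:
            # RT: transitions 1-2, 2-3, and 3-4 must differ; 4-5 is free.
            must_differ = pos < length - 1
        options = [s for s in SYLLABLES if s != prev] if must_differ else SYLLABLES
        seq.append(rng.choice(options))
    return seq


def matches_neighbor(seq):
    """For each position, report whether the syllable equals either neighbor."""
    last = len(seq) - 1
    return [
        bool((i > 0 and seq[i] == seq[i - 1]) or (i < last and seq[i] == seq[i + 1]))
        for i in range(len(seq))
    ]


def repeat_rate_by_position(experiment, n_trials=10000, seed=0):
    """Estimate how often each position matches a neighbor under these rules."""
    rng = random.Random(seed)
    totals = [0] * 5
    for _ in range(n_trials):
        for i, flag in enumerate(matches_neighbor(make_sequence(experiment, rng))):
            totals[i] += flag
    return [t / n_trials for t in totals]
```

Under these assumed rules, the estimated repetition rate is zero for syllables 1-3 in the RT sequences and zero for syllable 5 in the CT sequences; the nonzero rates depend on the assumed set size and sampling scheme and should not be read as the rates in the actual experiments.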

We explored this by breaking down performance for each syllable, depending upon whether or not a syllable matched one of its neighbors (note: because we are considering only the Same Talker condition, when a syllable matched its neighbor, two matching speech syllables were actually the identical speech token). Figure 8 shows that this has a significant impact on recall: performance was always better when a syllable matched a neighboring syllable than when it differed from its neighbors.

Figure 8.

Performance with the same vs. different neighboring syllables for Experiment CT (top, N=45) and Experiment RT (bottom, N=44), filtered for Same Talker condition. Error bars show standard error of the mean across participants.
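The breakdown by neighbor match can be sketched as follows. This is a minimal illustration rather than the analysis code behind Figure 8: the data layout (one pair of target and response syllable lists per trial) and the function name are assumptions, and the sketch pools trials instead of averaging within participants first.

```python
from collections import defaultdict


def accuracy_by_repetition(trials):
    """Proportion correct per (position, matches_neighbor) cell.

    trials: iterable of (target, response) pairs, each a list of syllables
    of equal length. Positions in the returned keys are 1-based.
    """
    hits = defaultdict(int)
    counts = defaultdict(int)
    for target, response in trials:
        last = len(target) - 1
        for i, syllable in enumerate(target):
            # A syllable "repeats" if it equals the preceding or following target syllable.
            repeats = bool(
                (i > 0 and syllable == target[i - 1])
                or (i < last and syllable == target[i + 1])
            )
            key = (i + 1, repeats)
            counts[key] += 1
            hits[key] += int(response[i] == syllable)
    return {key: hits[key] / counts[key] for key in counts}
```

Cells that never occur under a given set of stimulus constraints are simply absent from the result. To obtain across-participant means and standard errors like those plotted in the figure, the function would be applied to each participant's trials separately before averaging.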

This analysis demonstrates that recalling two identical syllables appearing in a row is easier than recalling two different syllables. We propose that repeating syllables may be heard as a single unit (a repeating duplet), thereby reducing memory load and improving recall compared to having to store and recall two different syllables. With this in mind, the memory load demands in Experiment RT exceed the load in all of the similar, previously published experiments, where repetition of syllables could occur for every syllable position (Liang et al., 2022, 2025). Higher working memory load can increase task difficulty overall, and may help explain the decreased uninterrupted performance in Experiment RT.

In Experiment CT, the difference in percent correct performance when a syllable repeats a neighbor versus when it differs from its neighbors is consistently less than 10%. In Experiment RT, this difference is larger, generally greater than 10%. This likely reflects differences in the positions that can repeat in the two experiments. In Experiment RT, repeats can occur only for the last two syllables, while repeats in Experiment CT can occur across the first four syllables. Our results show that the rare, repeating syllables in the final two positions produce especially large benefits of repetition both for uninterrupted and interrupted trials.

We propose two possible, related explanations for this extra-large effect of syllable repetition in Experiment RT: 1) because only the last two syllables ever repeat, this duplet is especially salient, improving recall, and 2) when repeated syllables occur in the last two positions of a sequence, the recency effect in sequential recall is especially effective, boosting performance for the final duplet, not just for the last syllable.

For non-repeating syllables (filled bars in Figure 8), performance for uninterrupted trials tended to be better in Experiment CT than in Experiment RT (blue bars are higher in the top left panel than the bottom left panel of Figure 8). This is consistent with load being overall higher in Experiment RT; fewer repeating syllables makes the task harder overall. However, we cannot rule out the possibility that differences between participant groups contributed to the observed patterns. Future work deploying within-subject designs would allow for more definitive conclusions.

Combining These Observations

1) Four of the five syllables in Experiment CT can repeat while only two of the syllables in Experiment RT can match their neighbor, which can help explain why recall is worse overall in Experiment RT than CT—even on non-repeating syllables.

2) The first few syllables can repeat in Experiment CT, but not in Experiment RT. This may further enhance performance on the first few syllables in Experiment CT compared to Experiment RT.

3) The large recall benefit for repetition of the final two syllables improves average performance on these syllables in Experiment RT compared to in Experiment CT, reducing the overall difference in performance for the final two syllables for Same Talker uninterrupted trials.

Together, these effects can account for why Same Talker uninterrupted performance is better overall in Experiment CT, but especially for the first three syllables.
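One way to make this combination explicit is to write the mean accuracy at each syllable position as a weighted average of accuracy on repeating and non-repeating syllables. The notation below is introduced here purely for illustration and is not part of the original analysis: E denotes the experiment, k the syllable position, and r the proportion of trials in which the syllable at that position matches a neighbor.

```latex
% Illustrative notation only; these symbols do not appear in the original analysis.
% E \in \{\mathrm{CT}, \mathrm{RT}\}: experiment;  k = 1, \dots, 5: syllable position.
% r_k^{E}: proportion of trials in which syllable k matches a neighboring syllable.
\[
  \bar{P}_k^{E}
  \;=\;
  r_k^{E}\, P_{k,\mathrm{rep}}^{E}
  \;+\;
  \bigl(1 - r_k^{E}\bigr)\, P_{k,\mathrm{non}}^{E}
\]
% The stimulus constraints fix some repetition rates at zero:
\[
  r_1^{\mathrm{RT}} = r_2^{\mathrm{RT}} = r_3^{\mathrm{RT}} = 0,
  \qquad
  r_5^{\mathrm{CT}} = 0 .
\]
```

In this sketch, observation 1 corresponds to lower accuracy on non-repeating syllables in Experiment RT than in Experiment CT. Observation 2 follows because mean accuracy on the early syllables in Experiment RT reduces to the non-repeating term alone, whereas in Experiment CT it is pulled toward the higher accuracy on repeating syllables. Observation 3 follows because the large repetition benefit in Experiment RT raises the mean for syllables 4 and 5.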

Interruptions Alter Recall Task Demands

Differences in syllable repetition constraints account for why performance for Same Talker uninterrupted trials is better in Experiment CT than in RT. However, performance in interrupted trials is quite similar across the two experiments (red and yellow dashed lines are similar in Figure 7). One might then assume that repetition has different effects in interrupted trials compared to uninterrupted trials. However, that is not the case. Repetition helps performance by nearly the same amount in both uninterrupted and interrupted trials of each experiment (the corresponding left and right panels of Figure 8 show similar benefits of repetition, that is, differences between dashed and solid bars, at each syllable position). Why, then, does the low number of repeating syllables in Experiment RT not lead to worse overall performance on interrupted trials than in Experiment CT?

We propose that interruption itself can influence task demands, and thus performance, in two distinct ways. First, an interrupter disrupts top-down attention. As a result of disrupted attention, on average fewer target syllables get stored in working memory, reducing memory load overall. Second, when an interrupter breaks perceptual streaming, listeners store the first two syllables as a separate stream, distinct from the final three syllables, rather than as part of a single sequence of five syllables. Such “chunking” of early syllables, separate from the final three, should allow participants to recall these syllables more reliably (Miller, 1956; Ryan, 1969). The idea that interruptions cause syllables 1-2 to be stored and recalled separately from syllables 3-5 gains credence from another piece of evidence: in Experiment RT, the benefit of talker continuity is much larger for the first two syllables in interrupted trials than in uninterrupted trials, but is similar for subsequent syllables (Figure 4C). It is as if the first two syllables are stored and recalled as a distinct pair, as long as they do not contain talker discontinuities.

These ideas suggest that interruption may reduce memory load even as it disrupts attention. While there is room for these effects to improve performance in Experiment RT, where memory load is high, they may have limited impact in Experiment CT, where load is already relatively low and patterns of performance more directly reflect interactions between streaming and top-down attentional focus, disruption, and reorientation, not memory capacity.

Together, these proposed effects can explain why performance in Same Talker interrupted trials is similar across our two experiments, even though it differs for uninterrupted trials. On uninterrupted trials, differences in syllable repetition constraints dominate, causing performance to be lower overall in Experiment RT than CT, but especially on the first few syllables. Interruptions reduce load overall, since fewer syllables get into working memory and the first two syllables are stored and recalled separately from syllables after the interruption. The latter effect is particularly important in counteracting the fact that the first two syllables never repeat in Experiment RT. The effects of interruption on serial recall have a large impact when load is otherwise high, as in Experiment RT, compared to conditions where memory demands are relatively low and not the dominant factor determining performance, as in Experiment CT. The proposed explanations are consistent with prior literature and supported by indirect evidence in the current data; however, it should be noted that the experiments were not designed to test these post hoc accounts—thus, these explanations should be tested explicitly in future experiments.

Interim Conclusions, Caveats, and Future Work

A direct comparison of Same Talker trials common to both experiments revealed a large influence of syllable token repetition. Recall is better when a syllable token repeats across neighboring syllables compared to when two different tokens are presented. Because syllables in different positions could repeat (or not) in the two experiments, we found different patterns of performance across experiments. We propose that the relatively low occurrence of repetitions in Experiment RT increased memory load overall, explaining the reduced performance.

We also propose that even though interruption interferes with top-down attention and hurts performance in general, it also reduces memory load and breaks streaming, allowing syllables before the interruption to be stored and recalled as a separate stream. These effects of interruption offset the costs of increased memory load in Experiment RT, so that interrupted performance was more similar to that in Experiment CT.

Many other factors undoubtedly can also contribute to the observed differences across Experiments CT and RT. The subject groups in the two experiments differed, which could contribute to differences in overall performance. In addition, these Same Talker trials we focus on in this section are randomly intermingled with different conditions in the two experiments. In Experiment CT, these other trials (Different Talker condition) were even easier than Same Talker trials; in Experiment RT, these other trials (Random Talker condition) were harder. Thus, the overall task difficulty is greater in Experiment RT than in Experiment CT, which could contribute to fatigue and differences in performance. Yet, these two factors should impact all syllable positions and trial types similarly, and thus do not account for the differences in patterns of performance we observed.

While we propose a number of post hoc explanations for the unexpected differences we found across experiments, future studies should be conducted to test these ideas directly. Specifically, we show that syllable repetition and interruption affect how serial items are chunked and recalled from memory. These may interact with primacy and recency in serial recall, local continuity of talker or other stream features, and sequential dependencies in performance, all of which we observed. Such interactions go beyond what the current results can address but are worthy questions to pursue.

The combination of the constraints applied to neighboring syllables and the relatively small syllable set is a limitation of the current study; this combination introduced a systematically varying repetition rate across syllable positions and across experiments. While this did not compromise the major within-experiment comparisons, it introduced unintended confounds that led to more complex results. Moreover, the specific patterns we observed will be difficult to reproduce without closely matching the stimulus constraints we imposed. Future studies should consider both using larger stimulus sets and more carefully balancing repetition constraints to avoid such confounds.

General Conclusions

The two experiments presented here suggest that the task-irrelevant feature of talker identity impacts recall of a target sequence. When two competing streams differ not only in direction but also in talker, the difference not only helps participants correctly “tag” which syllables are from the target stream when recalling syllables from memory, but also allows them to refocus attention rapidly, mitigating the effects of interruption. When the talker of syllables from a particular direction changes randomly, the discontinuity disrupts streaming, breaking spatial attention buildup and interfering with recall. These results support the idea that attention operates on objects: associated spectrotemporal features of syllables, even when not critical to the task, have an obligatory impact on performance.

Provocatively, spatial features, which can be easily deployed to focus top-down attention, may not be automatically bound to items in auditory memory. Specifically, when attention is interrupted and two concurrent streams are spoken by the same talker, performance on the syllable occurring right after the interrupter is typically low, as if the disruption of top-down spatial attention “wipes out” spatial information in short-term memory. The largest benefit of talker differences occurs for this syllable, demonstrating that talker differences reliably identify which syllable comes from the target.

Sequential dependencies in recall performance suggest that performance at any given moment reflects not just the current stimulus but effects of perceptual streaming that accumulate over time. In all cases, listeners recall a particular syllable more reliably when they correctly reported the previous syllable. These dependencies are modulated by both talker identity and perceptual continuity, stronger when streams are perceptually distinct and continuous and weaker when either an external interrupter or internal talker switching disrupts that continuity. This highlights that spatial selective attention is inherently dynamic: it can build up gradually over the course of a trial or reset in response to new events.
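The sequential dependency described above amounts to comparing accuracy on a syllable after a correct report of the preceding syllable with accuracy after an incorrect report. A minimal sketch of that comparison is given below; the data layout (one list of per-position correctness values per trial) and the function name are assumptions made for illustration, not the analysis code used in the study.

```python
def conditional_accuracy(correct_by_trial):
    """Accuracy on syllable n split by whether syllable n-1 was reported correctly.

    correct_by_trial: list of trials, each a list of booleans giving
    per-position correctness. Returns (p_after_correct, p_after_error),
    pooled over positions 2 onward; a value is NaN if its cell is empty.
    """
    after_correct = []
    after_error = []
    for correct in correct_by_trial:
        # Pair each syllable's outcome with the outcome of the one before it.
        for prev, cur in zip(correct, correct[1:]):
            (after_correct if prev else after_error).append(bool(cur))

    def mean(values):
        return sum(values) / len(values) if values else float("nan")

    return mean(after_correct), mean(after_error)
```

A sequential dependency of the kind reported here would appear as the first returned value exceeding the second. Computing the two values separately for each condition (for example, interrupted versus uninterrupted trials) would then show how strongly the dependency is modulated.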

An unexpected but consistent finding across experiments was the influence of repeating syllable tokens on sequence recall. Repetition of neighboring syllable tokens improved recall locally and reduced task difficulty globally, a result with direct methodological implications for paradigm design in future studies involving stimulus sequences.

This study builds on previous studies of the effects of interruption on top-down spatial selective attention (Liang et al., 2022, 2025) by exploring how the task-irrelevant feature of talker identity influences performance during disruption. Results demonstrate that even features that participants are instructed to ignore fundamentally influence the perceptual organization of the auditory scene, impacting how listeners stream and store information and how they recover from unexpected interruptions.

Supplemental Material

Supplemental Material for Effects of Task-Irrelevant Talker Identity and Continuity on Spatial Selective Attention Under Interruption by Wusheng Liang, Abigail L. Noyce, Christopher A. Brown and Barbara G. Shinn-Cunningham in Trends in Hearing.

Footnotes

Ethical Considerations

This study was reviewed and approved by the Carnegie Mellon University Institutional Review Board (STUDY2019_00000217).

Consent to Participate

All participants provided written informed consent for their participation.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Office of Naval Research grant N00014-23-1-2065 and the National Institute on Deafness and Other Communication Disorders (NIDCD) grant R01 DC019126.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets generated during and/or analyzed during the current study are available in the CMU KiltHub repository, 10.1184/R1/30870710.

References

Bee, M. A., & Micheyl (2008). The cocktail party problem: What is it? How can it be solved? And why should animal behaviorists study it? Journal of Comparative Psychology, 122(3), 235–251. https://doi.org/10.1037/0735-7036.122.3.235

Best, Ozmeral, E. J., Kopco, & Shinn-Cunningham, B. G. (2008). Object continuity enhances selective auditory attention. Proceedings of the National Academy of Sciences of the United States of America, 105(35), 13174–13178. https://doi.org/10.1073/pnas.0803718105

Bizley, J. K., & Cohen, Y. E. (2013). The what, where and how of auditory-object perception. Nature Reviews. Neuroscience, 14(10), 693–707. https://doi.org/10.1038/nrn3565

Bonacci, L. M., Bressler, & Shinn-Cunningham, B. G. (2020). Nonspatial Features Reduce the Reliance on Sustained Spatial Auditory Attention. Ear and Hearing, 41(6), 1635–1647. https://doi.org/10.1097/AUD.0000000000000879

Bressler, Masud, Bharadwaj, & Shinn-Cunningham (2014). Bottom-up influences of voice continuity in focusing selective auditory attention. Psychological Research, 78(3), 349–360. https://doi.org/10.1007/s00426-014-0555-7

Bronkhorst, A. W. (2015). The cocktail-party problem revisited: early processing and selection of multi-talker speech. Attention, Perception & Psychophysics, 77(5), 1465–1487. https://doi.org/10.3758/s13414-015-0882-9

Cherry, E. C. (1953). Some Experiments on the Recognition of Speech, with One and with Two Ears. The Journal of the Acoustical Society of America, 25(5), 975–979. https://doi.org/10.1121/1.1907229

Fischer, Nolting, Schneider, Bledowski, & Kaiser (2024). Auditory objects in working memory include task-irrelevant features. Scientific Reports, 14(1), 21216. https://doi.org/10.1038/s41598-024-72177-6

Gardner, & Martin (n.d.). HRTF Measurements of a KEMAR Dummy-Head Microphone. Retrieved December 5, 2023, from https://www.linux.bucknell.edu/~kozick/elec32007/hrtfdoc.pdf

Huang, & Elhilali (2020). Push-pull competition between bottom-up and top-down auditory attention to natural soundscapes. eLife, 9, e52984. https://doi.org/10.7554/eLife.52984

Lee, C.-C., & Middlebrooks, J. C. (2011). Auditory cortex spatial sensitivity sharpens during task performance. Nature Neuroscience, 14(1), 108–114. https://doi.org/10.1038/nn.2713

Liang, Brown, C. A., Noyce, A. L., & Shinn-Cunningham, B. G. (2025). Cat-astrophic update: What makes an interrupter more disruptive? The Journal of the Acoustical Society of America, 158(5), 4048–4058. https://doi.org/10.1121/10.0039946

Liang, Brown, C. A., & Shinn-Cunningham, B. G. (2022). Cat-astrophic effects of sudden interruptions on spatial auditory attention. The Journal of the Acoustical Society of America, 151(5), 3219. https://doi.org/10.1121/10.0010453

Lim, S.-J., Carter, Y. D., Michelle Njoroge, Shinn-Cunningham, B. G., & Perrachione, T. K. (2021). Talker discontinuity disrupts attention to speech: Evidence from EEG and pupillometry. Brain and Language, 221, 104996. https://doi.org/10.1016/j.bandl.2021.104996

Luthra (2024). Why are listeners hindered by talker variability? Psychonomic Bulletin & Review, 31(1), 104–121. https://doi.org/10.3758/s13423-023-02355-6

Michalka, S. W., Kong, Rosen, M. L., Shinn-Cunningham, B. G., & Somers, D. C. (2015). Short-Term Memory for Space and Time Flexibly Recruit Complementary Sensory-Biased Frontal Lobe Attention Networks. Neuron, 87(4), 882–892. https://doi.org/10.1016/j.neuron.2015.07.028

Middlebrooks, J. C., & Waters, M. F. (2020). Spatial Mechanisms for Segregation of Competing Sounds, and a Breakdown in Spatial Hearing. Frontiers in Neuroscience, 14, 571095. https://doi.org/10.3389/fnins.2020.571095

Miller, G. A. (1956). The magical number seven plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2), 81–97.

Milne, A. E., Bianco, Poole, K. C., Zhao, Oxenham, A. J., Billig, A. J., & Chait (2021). An online headphone screening test based on dichotic pitch. Behavior Research Methods, 53(4), 1551–1562. https://doi.org/10.3758/s13428-020-01514-0

Mullennix, J. W., Pisoni, D. B., & Martin, C. S. (1989). Some effects of talker variability on spoken word recognition. The Journal of the Acoustical Society of America, 85(1), 365–378. https://doi.org/10.1121/1.397688

Nusbaum, & Magnuson, J. S. (1997). Talker normalization: Phonetic constancy as a cognitive process. ResearchGate. https://dx.doi.org/.

Nusbaum, H. C., & Morin, T. M. (1989). Perceptual normalization of talker differences. The Journal of the Acoustical Society of America, 85(S1), S125. https://doi.org/10.1121/1.2026708

Pressnitzer, Sayles, Micheyl, & Winter, I. M. (2008). Perceptual organization of sound begins in the auditory periphery. Current Biology: CB, 18(15), 1124–1128. https://doi.org/10.1016/j.cub.2008.06.053

Qian, Y.-M., Weng, Chang, X.-K., & Wang (2018). Past review, current progress, and challenges ahead on the cocktail party problem. Frontiers of Information Technology & Electronic Engineering, 19(1), 40–63. https://doi.org/10.1631/fitee.1700814

Ryan (1969). Grouping and short-term memory: different means and patterns of grouping. The Quarterly Journal of Experimental Psychology, 21(2), 137–147. https://doi.org/10.1080/14640746908400206

Zhao, Yum, N. W., Benjamin, Benhamou, Yoneya, Furukawa, Dick, Slaney, & Chait (2019). Rapid ocular responses are modulated by bottom-up-driven auditory salience. The Journal of Neuroscience: The Official Journal of the Society for Neuroscience, 39(39), 7703–7714. https://doi.org/10.1523/JNEUROSCI.0776-19.2019
